Emulator Architecture Part 2
(My personal notes written while reading the source. -- @panicsteve)
Most of Einstein is platform-agnostic, except for the platform-specific code that displays the initial setup UI and kicks things off. Here's how it works on OS X.
First, the app displays the platform setup window, which allows you to choose a display resolution, ROM file, memory size, etc.
When the Start button is clicked, control transfers to TCocoaAppController's startEmulator method. This method loads the ROM and the ROM extension, then sets up all the internal data structures and objects required to run the emulator.
By the time the emulator is actually started, background threads are already running for TNetworkManager and TInterruptManager.
The next-to-last thing the startEmulator method does is spawn a thread for TCocoaAppController's runEmulator method, which simply calls Run on the TMacMonitor object and sleeps until that returns. This spins up the emulator in a halted state at the first ROM instruction.
The last thing startEmulator does before returning is schedule an executeCommand:@"run" to be sent to the TCocoaMonitorController on the next pass through the event loop, which is equivalent to a user typing "run" in the monitor GUI. In other words, it resumes the emulator from its halted state.
By the time the run command comes in, the following threads exist and are in a waiting state:
- TNetworkManager::Run()
- TInterruptManager::Run()
- TMonitor::Run() (actually TMacMonitor subclass on OS X)
TNetworkManager::Run(): This thread sits in an infinite loop, waiting to be signaled. When the thread is signaled, it calls select() to see if anything is available to read on the network socket. If there is, it raises an interrupt on the emulated PCMCIA controller.
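The select() check made by the network thread can be sketched roughly as follows. This is only an illustration, not Einstein's code: HasPendingData, PollNetworkOnce, and FakePCMCIAController are hypothetical names, and the zero timeout (poll without blocking) is an assumption.

```cpp
#include <cassert>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

// Returns true if `fd` has data ready to read right now.
// Assumption: a zero timeout, so select() polls instead of blocking.
bool HasPendingData(int fd) {
    assert(fd >= 0 && fd < FD_SETSIZE);
    fd_set readSet;
    FD_ZERO(&readSet);
    FD_SET(fd, &readSet);

    timeval timeout{};  // zero seconds, zero microseconds
    int ready = select(fd + 1, &readSet, nullptr, nullptr, &timeout);
    return ready > 0 && FD_ISSET(fd, &readSet);
}

// Hypothetical stand-in for the emulated PCMCIA controller.
struct FakePCMCIAController {
    int interruptsRaised = 0;
    void RaiseInterrupt() { ++interruptsRaised; }
};

// One pass of the thread body after it has been signaled:
// raise an interrupt only if the socket has something to read.
void PollNetworkOnce(int socketFd, FakePCMCIAController& controller) {
    if (HasPendingData(socketFd)) {
        controller.RaiseInterrupt();
    }
}
```

Calling PollNetworkOnce on an idle descriptor leaves the interrupt count unchanged; once a byte is waiting, the same call raises one interrupt.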
TInterruptManager::Run(): This thread sits in an infinite loop, waiting to be signaled. When signaled by a timer, it updates timers on the emulated hardware based on the time delta of the host machine. If an interrupt has been raised, it is passed to the TARMProcessor for handling. It then goes back to waiting for either another CPU interrupt, or another timer update. According to a comment in the source, timer interrupts happen at a rate of 3.6864 MHz.
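To make the "time delta" idea in the interrupt thread concrete, here is a small sketch that converts elapsed host time into emulated timer ticks at 3.6864 MHz. TicksForElapsedNs and FakeTimer are hypothetical names, and the match-register behaviour is an assumption for illustration rather than a description of TInterruptManager.

```cpp
#include <cassert>
#include <cstdint>

// 3.6864 MHz, the rate quoted from the source comment.
constexpr uint64_t kTimerHz = 3686400;

// 3686400 / 1e9 reduces to 36864 / 1e7; check the reduction at compile time.
static_assert(kTimerHz * 10000000ULL == 36864ULL * 1000000000ULL,
              "tick ratio must match the timer rate");

// Converts a host time delta in nanoseconds into whole emulated timer ticks.
// Note: the multiplication overflows for deltas of several days; a real
// implementation would only ever see short deltas.
uint64_t TicksForElapsedNs(uint64_t elapsedNs) {
    return elapsedNs * 36864ULL / 10000000ULL;
}

// Hypothetical emulated timer with a single match value.
struct FakeTimer {
    uint64_t counter = 0;
    uint64_t match = 0;

    // Advances the counter and reports whether the match value
    // was reached during this update.
    bool Advance(uint64_t ticks) {
        bool wasBelow = counter < match;
        counter += ticks;
        return wasBelow && counter >= match;
    }
};
```

One second of host time yields 3,686,400 ticks, and one millisecond yields 3,686 (the fractional tick is dropped).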
TMonitor::Run(): This thread sits and waits for monitor commands to come in. On OS X, these come from the monitor window (handled by TCocoaMonitorController), with the exception of the implicit "run" command that comes in at startup. Commands include: run, stop, quit, save state, and load state.
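The monitor's reaction to commands can be pictured as a tiny state machine that starts halted and is resumed by the implicit "run". The sketch below is an assumption-laden illustration: MonitorState and ApplyCommand are hypothetical names, and save/load state are left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical states for the monitor thread.
enum class MonitorState { Halted, Running, Quit };

// Returns the state the monitor would be in after receiving `command`.
// Unknown commands are ignored, and nothing leaves the Quit state.
MonitorState ApplyCommand(MonitorState current, const std::string& command) {
    if (current == MonitorState::Quit) {
        return current;
    }
    if (command == "run") {
        return MonitorState::Running;
    }
    if (command == "stop") {
        return MonitorState::Halted;
    }
    if (command == "quit") {
        return MonitorState::Quit;
    }
    return current;
}
```

Starting from MonitorState::Halted, applying "run" models the command that startEmulator schedules on the event loop.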
Because the monitor receives a run command during startup, control quickly moves to TMonitor::RunEmulator(). There it enters an infinite loop: it reads the next instruction and checks whether a breakpoint is set on it. If there is no breakpoint, it executes the instruction; if there is one, it breaks out of the loop.
If the monitor is running and has not hit a breakpoint, control transfers again to TEmulator::Run(), which also contains an infinite loop of instruction decoding that runs until an interrupt comes in.
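The breakpoint-checking loop can be sketched like this. It is a toy model, not the real TMonitor::RunEmulator(): FakeCPU, StopReason, and RunUntilBreakpoint are hypothetical names, "executing" an instruction just advances the PC by 4, and a step limit stands in for the interrupts that end the real loop.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical CPU state: just a program counter and an instruction count.
struct FakeCPU {
    uint32_t pc = 0;
    uint64_t executed = 0;
};

enum class StopReason { Breakpoint, StepLimit };

// Checks for a breakpoint before each instruction; executes it if none is set.
StopReason RunUntilBreakpoint(FakeCPU& cpu,
                              const std::set<uint32_t>& breakpoints,
                              uint64_t maxSteps) {
    assert((cpu.pc & 3u) == 0);  // ARM instructions are word-aligned
    for (uint64_t step = 0; step < maxSteps; ++step) {
        if (breakpoints.count(cpu.pc) != 0) {
            return StopReason::Breakpoint;  // break out without executing
        }
        cpu.pc += 4;  // stand-in for executing one 32-bit instruction
        ++cpu.executed;
    }
    return StopReason::StepLimit;
}
```

With a breakpoint at 0x10 and the PC starting at 0, the loop executes four instructions and stops with the PC sitting on the breakpoint address.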
Execution of individual instructions is performed by calling: mMemory.GetJITObject()->Run( &mProcessor, &mSignal );
The class of the JIT object is determined at compile time, because there were experiments with different JIT implementations along the course of development. Currently it is TJITGeneric, which is not specific to any particular host platform, and the JIT page class is TJITGenericPage.
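One common way to pick a class at compile time is a preprocessor-guarded type alias. The sketch below shows that general technique only; the macro SKETCH_USE_LLVM_JIT and the Sketch* class names are made up for illustration and are not Einstein's actual build switches.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-ins for two interchangeable JIT implementations.
struct SketchJITGeneric {
    static const char* Name() { return "generic"; }
};

struct SketchJITLLVM {
    static const char* Name() { return "llvm"; }
};

// The alias is resolved by the compiler, so there is no runtime dispatch.
#if defined(SKETCH_USE_LLVM_JIT)
using SketchJIT = SketchJITLLVM;
#else
using SketchJIT = SketchJITGeneric;
#endif

// Reports whether this build selected the generic implementation.
bool UsingGenericJIT() {
    return std::strcmp(SketchJIT::Name(), "generic") == 0;
}
```

Code elsewhere refers only to SketchJIT, so swapping implementations means rebuilding with a different define.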
When the JIT's Run() method is called, the first stop is GetJITUnitForPC() with the current PC (which is actually 4 bytes ahead of the instruction to be executed -- I believe this is an ARM quirk).
GetJITUnitForPC() consults TJITCache.GetPage() with the current virtual address to see if a translation for this page has already been cached. (Cache pages are 1 KB in size.) The cache has been preloaded with the first part of ROM (see below), so the first ROM instruction at 0x00000000 is a cache hit. If the cache does not have the page, a PrefetchAbort is triggered on the virtual CPU.
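The address arithmetic implied by 1 KB pages and 32-bit instructions can be written out as a short sketch. The helper names are hypothetical, and InstructionAddressForPC simply encodes the "PC is 4 bytes ahead" observation from these notes.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kPageSize = 1024;        // cache pages are 1 KB
constexpr uint32_t kInstructionSize = 4;    // ARM instructions are 32-bit
constexpr uint32_t kUnitsPerPage = kPageSize / kInstructionSize;

// Address of the instruction to execute, given a PC that runs 4 bytes ahead.
constexpr uint32_t InstructionAddressForPC(uint32_t pc) {
    return pc - 4;
}

// Start address of the 1 KB page containing `address`.
constexpr uint32_t PageBase(uint32_t address) {
    return address & ~(kPageSize - 1);
}

// Index of the instruction slot within its page (0..255).
constexpr uint32_t IndexInPage(uint32_t address) {
    return (address & (kPageSize - 1)) / kInstructionSize;
}

static_assert(kUnitsPerPage == 256, "a 1 KB page holds 256 instructions");
```

For example, the instruction at 0x404 lives in the page starting at 0x400, in slot 1.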
At each address of each page there exists a JITUnit (or possibly a chain of several JIT units). JITUnit is a C union which can be read as a pointer, a value, or a JITFuncPtr (a pointer to a native code implementation of the ARM code at that address).
This JITUnit comes back to the Run() method and its JITFuncPtr is invoked.
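A rough sketch of a JITUnit-style union and of invoking its function pointer is shown below. The member names, the function signature (take the current unit and CPU, return the next unit to run), and the helper functions are assumptions for illustration; the real definitions live in the Einstein source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical CPU with a single register.
struct SketchCPU {
    uint32_t r0 = 0;
};

union SketchUnit;
// Assumed signature: returns the next unit to execute, or nullptr to stop.
using SketchFuncPtr = SketchUnit* (*)(SketchUnit* unit, SketchCPU* cpu);

// One slot can be read as a pointer, a value, or a native function pointer.
union SketchUnit {
    void* fPtr;
    uint32_t fValue;
    SketchFuncPtr fFuncPtr;
};

SketchUnit* IncrementR0(SketchUnit* unit, SketchCPU* cpu) {
    cpu->r0 += 1;
    return unit + 1;  // fall through to the next unit
}

SketchUnit* Stop(SketchUnit*, SketchCPU*) {
    return nullptr;   // end of the chain
}

SketchUnit MakeFuncUnit(SketchFuncPtr fn) {
    SketchUnit unit;
    unit.fFuncPtr = fn;
    return unit;
}

// Keeps invoking function pointers until a unit returns nullptr.
void RunChain(SketchUnit* unit, SketchCPU* cpu) {
    assert(cpu != nullptr);
    while (unit != nullptr) {
        unit = unit->fFuncPtr(unit, cpu);
    }
}
```

Running a chain of IncrementR0, IncrementR0, Stop leaves r0 at 2.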
TJITCache contains a PMap (physical memory map) and VMap (virtual memory map).
When TJITCache is constructed, PMap is alloc'ed with enough space to store a pointer to an SEntry for each page of the combined total of ROM and RAM space.
The VMap is of type THashMapCache. THashMapCache functions similarly to a traditional hash / associative array but is subject to some limitations which are laid out in comments at the top of THashMapCache.h.
SEntry is a C struct which contains: a TJITGenericPage object, its physical address, its virtual address (the "key" field), doubly-linked-list style pointers to the next and previous SEntry, and a singly-linked-list pointer to the "next PA Entry" (Physical Address?).
During TJITCache's construction, the VMap is initialized by iterating through memory pages until either the end of ROM is reached, or the hash map is full (currently defined as 128 entries). Effectively, I think this pre-warms the cache with the first 128 KB of ROM. This explains why when the emulator runs its first instruction, it already has a JITUnit in the cache for it.
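The pre-warming loop described above might look something like the following sketch. The struct layout mirrors the SEntry fields listed in these notes, but all names are hypothetical, the linked-list pointers are left unset, and it assumes ROM starts at address 0 with virtual and physical addresses equal.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kCachePageSize = 1024;   // 1 KB pages
constexpr uint32_t kHashMapCapacity = 128;  // "full" threshold from the notes

// Hypothetical stand-in for a translated page.
struct SketchPage {
    uint32_t baseAddress = 0;
};

// Rough shape of a cache entry, following the field list in the notes.
struct SketchEntry {
    SketchPage page;
    uint32_t physicalAddress = 0;
    uint32_t key = 0;                    // virtual address
    SketchEntry* prev = nullptr;         // doubly-linked list
    SketchEntry* next = nullptr;
    SketchEntry* nextPAEntry = nullptr;  // singly-linked list
};

// Walks ROM page by page until ROM ends or the map is full.
std::vector<SketchEntry> PrewarmFromROM(uint32_t romSizeBytes) {
    std::vector<SketchEntry> entries;
    for (uint32_t address = 0;
         address < romSizeBytes && entries.size() < kHashMapCapacity;
         address += kCachePageSize) {
        SketchEntry entry;
        entry.page.baseAddress = address;
        entry.physicalAddress = address;
        entry.key = address;
        entries.push_back(entry);
    }
    return entries;
}
```

With an 8 MB ROM the loop stops at 128 entries, which covers the first 128 KB; a 4 KB ROM would yield only 4 entries.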
JIT page translation occurs in TJITGenericPage::Init(). This function begins at the starting address of the page, and translates all the instructions it contains into JITUnit chains. Assuming a 1 KB page full of instructions, there will be 256 instructions translated into 256 JITUnit chains, as all ARM instructions are 32-bit.
Translation of each individual instruction occurs in TJITGenericPage::Translate(). The translation process evaluates the ARM instructions and creates a chain of one or more JITUnits that can each contain a function pointer to host-native code, or a 32-bit value.
For example, the first instruction in the ROM is an unconditional branch. This is decoded into a chain of two JITUnits that look like this:
- Function pointer for native Branch() function
- New value for the PC (program counter)
Later, when the native Branch() function is called, Branch() is able to retrieve the new PC from the next JITUnit in the chain, and updates the TARMProcessor's PC register (r15) with that value.
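The two-unit branch chain can be sketched end to end as below. The Toy* types and helper names are hypothetical, and the instruction word in the usage note is a made-up example rather than the actual first word of the ROM. The target calculation follows the standard ARM B encoding (signed 24-bit word offset, relative to the instruction address plus 8); Einstein may store the PC with a different bias, as noted earlier.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct ToyCPU {
    uint32_t pc = 0;  // stands in for r15
};

union ToyUnit;
using ToyFuncPtr = ToyUnit* (*)(ToyUnit* unit, ToyCPU* cpu);

union ToyUnit {
    uint32_t fValue;
    ToyFuncPtr fFuncPtr;
};

// Native handler: reads the new PC from the unit that follows it.
ToyUnit* Branch(ToyUnit* unit, ToyCPU* cpu) {
    cpu->pc = unit[1].fValue;
    return nullptr;  // control leaves this chain
}

// True for ARM B (not BL): bits 27..24 are 0b1010.
bool IsBranch(uint32_t instruction) {
    return (instruction & 0x0F000000u) == 0x0A000000u;
}

// Target = instruction address + 8 + sign-extended 24-bit offset * 4.
uint32_t BranchTarget(uint32_t instruction, uint32_t address) {
    int32_t offset = static_cast<int32_t>(instruction & 0x00FFFFFFu);
    if (offset & 0x00800000) {
        offset -= 0x01000000;  // sign-extend from 24 bits
    }
    return address + 8u + static_cast<uint32_t>(offset) * 4u;
}

// Builds the chain: [function pointer for Branch, new PC value].
std::vector<ToyUnit> TranslateBranch(uint32_t instruction, uint32_t address) {
    assert(IsBranch(instruction));
    ToyUnit func;
    func.fFuncPtr = &Branch;
    ToyUnit value;
    value.fValue = BranchTarget(instruction, address);
    return {func, value};
}
```

Translating the example word 0xEA000002 at address 0 and then invoking the first unit's function pointer sets the toy PC to 0x10.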
In addition to Paul's original platform-agnostic JIT (the "generic JIT"), which translates ARM into an intermediate bytecode called JITUnits, there are two other implementations which can be swapped in.
The LLVM JIT (on the llvm-dev branch of the GitHub repo), created by Paul, uses LLVM to translate pages of ARM instructions into native code for the host platform on the fly. This is nice because it is completely automatic, but the initial translation of the code is slow (i.e., first boot can take several minutes or even longer). Execution is slow at this point too, but once the translated instructions are cached, execution is slightly faster than the generic JIT (about 11K loops/sec in NewtTest vs. 8K on my MBPR).
Getting this set up to build is a little tricky. You will want to download and build LLVM 3.5.0 with assertions disabled, something like this:
$ tar xvfJ llvm-3.5.0.src.tar.xz
$ cd llvm-3.5.0.src
$ ./configure --prefix=/opt/llvm-osx --enable-optimized --disable-assertions --enable-libcpp=yes
$ make -j4
Then, in the _Build_/Xcode folder, locate and run the BuildEinsteinLLVMConfig.sh script. This will locate your LLVM installation and generate a file named EinsteinLLVM.xcconfig, which must be copied into _Build_/Xcode before Einstein can be built. The xcconfig file may get written to the root directory of the filesystem -- I'm not sure why.