Matt Ownby's Cool Projects: Emulation vs FPGA

I was thinking about this on my drive to work this morning and decided to write my thoughts down while they are still fresh in my head.

I've been hearing people talk about "one to one" (i.e. 1:1) or "not emulation" when talking about FPGAs. What do they mean when they say this?

I've talked about this in previous posts, but it's been a while and now may be a good time for a refresher.

The CPUs in classic '80s arcade games all require a clock which alternates between a high and low state, usually at a frequency in the low megahertz range (e.g. 1 MHz). They also usually have at least 16 address lines and at least 8 data lines. A few CPUs had more than this, but I will focus on the CPUs that have 16 address lines, 8 data lines, and a single clock input.

This is your typical 8-bit CPU that can address 64k of memory. How did I conclude that it can access 64k of memory? Because it has 16 address lines, and 2 to the 16th power is 65536 which is 64k (65536 / 1024 is 64, and 1k is 1024 bytes). How did I conclude that it's an 8-bit CPU? Because it has 8 data lines which is usually what people refer to when they talk about an "8-bit", "16-bit", "32-bit", or "64-bit" CPU.
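For anyone who wants to check the arithmetic, here is a tiny Python sketch. The function names are just ones I made up for illustration.

```python
def addressable_bytes(address_lines: int) -> int:
    """Each address line doubles the number of distinct addresses."""
    return 2 ** address_lines


def to_k(num_bytes: int) -> int:
    """Convert a byte count to 'k', where 1k is 1024 bytes."""
    return num_bytes // 1024


total = addressable_bytes(16)
print(total, "bytes =", to_k(total), "k")
```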

The CPU's 16 address lines are always outputs, meaning that the CPU is the only device in the computer that is allowed to change the address on those lines. The CPU's 8 data lines can be either inputs or outputs. The CPU has other lines on it which indicate whether it is trying to read from its data lines or write to its data lines.

The CPU has a very small number of internal memory buffers called registers, but almost all of the memory that the CPU needs comes from either external ROM or external RAM. When the CPU wants to execute an instruction (where a complete program consists of many instructions), it has to fetch the instruction from an external dependency (usually ROM or RAM). It does this by setting its address lines to the address that it wants to fetch the memory from, and then setting its read/write lines to let the rest of the computer know that it wants the byte stored at the address on its address lines. On power-up, the CPU will set its initial address to a fixed, known location and either start executing instructions from that location or read an address from that location, jump to that address, and start executing instructions.
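To make the fetch and the two power-up styles a little more concrete, here is a rough Python sketch of a 64k bus. Everything in it is an assumption made for illustration: the Bus class, the function names, and the addresses 0x0000 and 0xFFFC are placeholders, not a description of any particular CPU or game.

```python
class Bus:
    """Toy 64k bus: a 16-bit address selects one 8-bit value."""

    def __init__(self):
        self.memory = bytearray(0x10000)  # 65536 bytes

    def read(self, address: int) -> int:
        # The CPU puts the address on its address lines and signals a read.
        return self.memory[address & 0xFFFF]

    def write(self, address: int, value: int) -> None:
        # The CPU puts the address on its address lines and signals a write.
        self.memory[address & 0xFFFF] = value & 0xFF


def reset_direct(start: int = 0x0000) -> int:
    """Style 1: start executing at a fixed, known address (placeholder value)."""
    return start


def reset_via_vector(bus: Bus, vector: int = 0xFFFC) -> int:
    """Style 2: read a 16-bit address (low byte first, an assumption) from a
    fixed location and start executing there."""
    low = bus.read(vector)
    high = bus.read(vector + 1)
    return (high << 8) | low
```

For example, after writing 0x00 to 0xFFFC and 0x80 to 0xFFFD, reset_via_vector returns 0x8000, which is where the toy CPU would begin fetching instructions.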

Each CPU instruction takes a specific number of clock cycles to complete.

So what does all of this have to do with emulation vs FPGA?

If I were trying to implement a CPU inside of an FPGA, I could (if I do it correctly) have the FPGA's input/output lines behave exactly like the original CPU. And if the FPGA's I/O behaves exactly like the original CPU's I/O, then the FPGA could be considered to be 100% compatible with the original CPU and one could use it as a drop-in replacement for the CPU without sacrificing any sort of accuracy. This is what people mean by "one to one." I should note that from what I have seen of efforts to implement CPUs inside of FPGAs, I am not convinced that the people who put forth the effort were terribly concerned about accuracy, and they may have been more concerned about making the FPGA version of the CPU perform faster than the original. So just because an FPGA could in theory be a 100% accurate replacement for the CPU does not mean that this is what is happening.

Now, what might people mean when they say "not emulation"?

CPU emulators, such as what one may find in MAME, take a few shortcuts in order to achieve decent performance. They do not emulate the clock of the CPU, but instead are designed to execute a variable number of cycles in one shot as quickly as the host machine (e.g. a modern x64 computer) can execute them. The code that is driving this execution is then responsible for regulating the overall speed of the system so that it does not run too quickly.

For example, let's say we are emulating a 1 MHz Z80 CPU. The CPU management code may tell a Z80 emulator to execute 1000 cycles, which would take 1 ms on original hardware. The Z80 emulator would then go execute these 1000 cycles as fast as possible and report back how many cycles were actually emulated (it might be more than 1000 because instructions take more than 1 cycle each and the last one has to run to completion). The CPU management code would then have to stall until the 1 ms period has completed before executing the next chunk of cycles.
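Here is a rough Python sketch of that execute-then-stall loop. It is not how MAME or Daphne are actually written; ToyCpu, run_slices, and the flat 3 cycles per instruction are all made up so that the overshoot is easy to see.

```python
import time


class ToyCpu:
    """Stand-in for a CPU core. Every instruction costs 3 cycles (an assumption)."""

    CYCLES_PER_INSTRUCTION = 3

    def execute(self, requested_cycles: int) -> int:
        """Run whole instructions until at least requested_cycles have been used.

        Returns the cycles actually used, which can overshoot the request
        because an instruction cannot be stopped partway through.
        """
        used = 0
        while used < requested_cycles:
            used += self.CYCLES_PER_INSTRUCTION
        return used


def run_slices(cpu, clock_hz, slice_cycles, num_slices,
               sleep=time.sleep, now=time.monotonic):
    """Execute slices in bursts, stalling after each so real time can catch up."""
    total_cycles = 0
    start = now()
    for _ in range(num_slices):
        total_cycles += cpu.execute(slice_cycles)   # burst: as fast as the host allows
        target = start + total_cycles / clock_hz    # when the hardware would get here
        remaining = target - now()
        if remaining > 0:
            sleep(remaining)                        # stall until the slice's time is up
    return total_cycles
```

Calling run_slices(ToyCpu(), 1_000_000, 1000, 10) asks for ten 1 ms slices of a 1 MHz CPU. With 3-cycle instructions each slice actually uses 1002 cycles, so the loop paces itself on the cycles reported back rather than the cycles requested.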

This creates an inauthentic experience because it means that instead of instructions being executed at a steady, slower cadence, they are executed in quick bursts with delays in between. This is usually not noticeable by a human playing the emulated game because it's just 1 ms, but several problems can arise depending on the rest of the original hardware's architecture.

On a game like Dragon's Lair where there is just one CPU and a steady clock that never varies, the above method of emulation is "good enough." A human is not really going to notice any meaningful difference in accuracy.

But what if the game has multiple CPUs, such as a dedicated sound CPU? Now the emulator has to execute a smaller slice of cycles on the first CPU, then switch to the second CPU and execute another smaller slice of cycles. If there are interactions between these two CPUs (and there usually will be), the slices need to be small enough that there is no unnatural lag in the interactions, and switching between CPUs that often hurts the performance of the overall system. And even if the CPUs take turns executing just 1 cycle each, the potential for an inaccurate interaction between the two emulated CPUs still exists, since the emulator does not take the clock into account.
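The trade-off can be put into rough numbers with a small Python sketch. This is a simplified model that I am assuming for illustration (two CPUs at the same clock speed, taking turns in strict round-robin order), not a description of any real emulator's scheduler, and the function names are made up.

```python
def observed_skew(slice_cycles: int, write_cycle: int) -> int:
    """Timing error, in cycles, when the second CPU sees a write from the first.

    Assumed order: the first CPU runs a whole slice, then the second CPU runs
    the same slice of emulated time. A write the first CPU made at write_cycle
    is already visible when the second CPU starts that slice, so the second
    CPU can see it up to one slice too early.
    """
    slice_start = (write_cycle // slice_cycles) * slice_cycles
    return write_cycle - slice_start


def switches_per_second(clock_hz: int, slice_cycles: int) -> int:
    """How often the emulator must switch CPUs per emulated second (two CPUs)."""
    return 2 * clock_hz // slice_cycles


for size in (1000, 10, 1):
    print(size, observed_skew(size, 1757), switches_per_second(1_000_000, size))
```

In this model, shrinking the slice from 1000 cycles to 10 cuts the worst-case error from 999 cycles to 9, but the number of switches per emulated second goes from 2,000 to 200,000. The model also reports zero error at a slice size of 1, which is exactly the part it cannot capture: what happens within a single clock cycle.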

Now, what if the original hardware fiddles with the CPU's clock, or the clock is not constant for some reason? The Williams games are notorious for doing this. On a lot of the Williams games, like Joust, their custom DMA chip will actually halt the CPU while the DMA operation is running. On Star Rider, the CPU's clock gets halted every field for unknown reasons (figuring out why is on my TODO list). Last time I checked, this behavior was not emulated very well in MAME (it may have improved since I last checked) and certainly Daphne is not equipped to handle this type of scenario. However, an FPGA would be able to handle it just fine.
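To show what taking the clock into account could look like in software, here is a toy Python sketch that steps one clock tick at a time and simply skips the CPU on ticks where something else is holding it halted. The function and the halt windows are made up for illustration; this is not how the Williams hardware, MAME, or Daphne actually work.

```python
def run_clocked(total_ticks: int, halt_windows: list) -> int:
    """Step a toy CPU one clock tick at a time, skipping ticks while halted.

    halt_windows is a list of (first_tick, last_tick) pairs, inclusive, during
    which another device (for example a DMA chip) holds the CPU halted.
    Returns the number of ticks on which the CPU actually advanced.
    """
    cpu_ticks = 0
    for tick in range(total_ticks):
        halted = any(first <= tick <= last for first, last in halt_windows)
        if not halted:
            cpu_ticks += 1  # the CPU only makes progress on ticks it receives
    return cpu_ticks
```

With a hypothetical halt window of ticks 100 through 199, run_clocked(1000, [(100, 199)]) reports that the CPU only advanced on 900 of the 1000 ticks. Stepping tick by tick like this is much slower than executing a burst of cycles in one shot, which is the performance cost of this kind of accuracy.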

Now, does this mean that emulators like MAME and Daphne can't be improved to take into account a variable clock? Not at all. As modern computers get faster, it will become more feasible for emulators to become more accurate without hurting performance. I believe that aside from the problems associated with running on a modern operating system (with many processes and threads all competing for the CPU's time), there is no reason why software-based emulators cannot achieve 100% accuracy as their architectures are improved. However, I do not believe that that day is here... yet.

I hope that gives people a better idea of what I consider the difference to be between FPGA solutions and emulation solutions.

1 comment:

Steve Chamberlin, March 9, 2016 at 2:57 PM
This is a question I've thought about a lot. I call it emulation (CPU-based emulators like MAME) vs hardware replication (FPGA or discrete hardware systems like the Minimig or Replica-1). It used to be the case that emulation on desktop PC systems wasn't fast or powerful enough to do the job well, but that's no longer true. Heck, there are even great emulators of classic machines that run as JavaScript inside a web browser. So even though my personal interest is in hardware replication, I've reluctantly come to the view that software-based emulation is the easier and better solution in 99% of cases. The only exception would be when you need to interface with period peripherals, and even that can often be solved with a software-based emulator and some kind of adapter board.

Wednesday, March 9, 2016
