1 |
The 8-bit 1802 is the heart of what are widely regarded as just about the slowest home computers ever sold. (8 clocks per machine cycle, 2 or 3 machine cycles per instruction. The clock rate was fixed at 1.7 MHz in order to drive the NTSC video of the typical system, and the video output was serviced by per-scanline interrupts and per-8-pixel DMA cycles, which stole much time from your application program. So, roughly 0.1 MIPS even without the distractions, and its 'I's were nothing to write home about, either. Comparable in performance, perhaps, to the much older 4-bit Intel 4004.)
Superficially the 8-bit 1802, with its 16 16-bit registers, appeared unexcelled until the much later 16-bit Z8000 and 68000 devices, except that the slow instruction execution, combined with some truly unfortunate missing instructions[2], served to hobble it significantly. In practice, except for its use in small trainer computers like the Elf and some very early programmable home video games, it was really only deployed in places where its extreme tolerance of environmental stresses like unstable power supplies and clocks, heat, cold, and radiation, or its ultra-low full-CMOS power requirements were crucial: space vehicles, boreholes, traffic signs, engine computers, battery-operated handheld devices, etc. (Though considered relatively unpopular, it nonetheless sold millions of units in these applications, and is still in limited production as of this writing, 48 years after its introduction.)

Generally it seemed to be most satisfactorily programmed in various virtual-machine environments like Chip-8, or threaded Forth[3]. These, of course, exacted a significant performance penalty on top of an already slow processor, but paid for themselves with substantially reduced memory requirements. (Said memory also needed to be as tolerant of the environment as the CPU, and thus could be quite bulky and/or expensive compared to more mainstream memory devices.)
2 |
The 16 16-bit registers can be used as program counters, pointer registers, and counters, which is excellent, and can be treated as up to 30 8-bit data-holding registers, also good. However:
3 |
Threaded Forth is particularly noted for being memory-thrifty, but most find Forth to be a difficult language to work in. It occurs to me that one could combine a P-machine stack-oriented execution architecture (for a more traditional C or Pascal compiler, say) with Forth-style threading in a 2-stack virtual machine to get the best of both worlds. Classic P-machine runtime interpreters were stack-oriented, which means that the basic blocks that a compiler (C, or Pascal) generated could be characterized, just like Forth words, by their stack effects. The compiler could generate these blocks as threadable Forth-like words (with separate return and data stacks, allowing for zero-prologue subroutines and simple machine refactoring) instead of the usual inline code.

If the compiler, as each basic block was identified, fingerprinted the block by its data-stack effects and its internal operations, it could consult the cache of already emitted code (the application itself, as built so far to that point), and before caching a new block 'word' it could see if a match was already there; if it were, it could thread to that instead of emitting fresh code. The 2-stack machine (versus native code) constraints would encourage a greater block uniformity that could well lend itself to increased sharing. Post-processing could even discover larger clusters of common sequences (of threaded words) and re-factor them for increased sharing, and even better code density. (Manipulations of the return stack would stop the refactoring, of course, because the refactoring itself affects the return stack.)
In this manner the P-machine compiler would be doing, partially anyway, what the experienced Forth programmer does automatically: extensive factoring of an application with an eye towards sharing code that already exists. (This being the secret by which Forth was so memory-efficient.) If the compiler did a good enough job of factoring and sharing, it might make a C or Pascal environment for the 1802 practical for non-toybox applications. Programs would be slow(-ish) due to the triple penalty[5], but maybe they'd fit in memory.
One experience I had in school, with HP Pascal/64000 on an 8085 embedded system, was an abject failure in this respect, and a useful object lesson. The hardware had been designed with 4 KB of EPROM (2× 8755 devices, IIRC) for the program, which was typical for embedded systems of the era and felt by all to be sufficient for the intended door-lock application. Unfortunately the project specification required Pascal as the implementation language, not assembly, and the necessary program space turned out to be about 3–4× what was available in the hardware. The Pascal compiler churned out perfectly reasonable native code, but it was just so bulky (especially on an 8-bit processor) compared to what an assembly (or Forth) programmer could do that the project was doomed from the start. Adding source code generally resulted in a linear increase in object code; there was no factoring economy exhibited beyond what was expressed directly as subroutines in the source. The memory 'gas gauge' dropped precipitously with each new line of code added, burying itself far below empty when the application was barely even begun.

The application was eventually completed anyway, and functioned correctly, but only when using substantial development-system memory-emulation resources. In fact it was never realized on the un-aided target hardware, and the project was eventually abandoned, after all the door-lock hardware had been installed throughout the building. This was a very useful (and painful) object lesson on the importance of project specification and estimation, on having a viable Plan B, and on how ill-suited 8-bit microprocessors were to handling languages conceived for the much more capable mini- and mainframe computers of the day.
4 |
Consider a not-uncommon small but significant task: copying a small (<256) number of bytes from known (constant) places (src and dst) in memory—your basic structure copy. In C, perhaps something like:
    copy(char *src, char *dst, unsigned char count)
    {
        do {
            *dst++ = *src++;
        } while (--count);
    }

(I have hand-assembled the machine-code fragments here for comparison purposes; there may be minor errors as a result.)

In the setup part of the code, pre-loop, we are initializing two 16-bit pointer registers and an 8-bit counter. The 3-instruction, 8-byte 8080 fragment is about as optimal as it could get. It takes the 1802 almost twice as much code, 10 instructions in 15 bytes, to do the exact same part of the task. This is due to the 8-bit accumulator bottleneck, through which all data must flow.

In the loop proper, the 1802 actually has a 1-instruction, 1-byte advantage (3 vs 4) copying the data because it has a combined load-and-increment-pointer (LDA) instruction. For loop termination the 8080 takes 2 instructions, 4 bytes, to decrement the counter and loop around if it's not zero. The 1802 takes 3 instructions, 4 bytes, to do the same thing, leaving the 1802 with only a 1-instruction (0-byte) penalty for this phase. This penalty is due to the lack of a condition-code register: we have to pull the counter back into the accumulator to see if it's expired. (The high half of the counter register is undamaged, because we never decrement the low half below zero.)

So, for the loop itself the 8080 needed 6 instructions in 8 bytes to copy each data byte; the 1802 also needed 6 instructions, but in only 7 bytes, giving it a slight size edge. Which does not even begin to pay for the huge penalty it incurred setting up the loop. Note that every 1802 instruction we used takes 16 clock cycles to execute, whereas the 8080 instructions we used take 5, 7, or 10 cycles. No matter how you rate it, the 1802 comes in last on this task. (And, indeed, most tasks.)

As a routine matter we preserved any processor resources (registers) that we didn't need for the copy. For the 1802, registers 1–C were untouched, as was the high half of our counter R15.
In the case of the 8080, our benchmark, only the 8-bit B register wasn't touched. Using the slightly newer, more-popular Z80 and its additional instructions we can beat the 8080 a little, several different ways: These only make the 1802 look even worse, of course. Especially that last one. On the other hand, consider another wildly popular peer, and two that were not so popular:
Just as with the Z-80, the 6809 is rich enough that there are some options:
To handle blocks of any size, most of these fragments would need an additional outer loop, or other equivalent modifications to handle larger counters and/or offsets. (The Z-80's LDIR can already handle any size.) I have ignored this as I don't think it's relevant for the point(s) being made here.

The little-used Hitachi 6309, a successor to the 6809, added block-move instructions similar to the Z-80's: essentially the same four instructions as the Z-80, but three bytes larger due to the extension-byte orientation of the 6809 family. (Yet still two bytes smaller than the basic 6809 routine. To be fair, the Z-80 also uses extension bytes, but only needs one for this task rather than four.) This takes 17+3×N clocks, making it the absolute winner, bar none, in the speed category, and it can handle blocks of any size. (Too bad this processor was not more popular, but by the time it was available the 16-bit processors, with their vastly increased memory-addressing abilities, were coming on strong.)

So, here is our collected summary of the basic structure-copy subroutines, showing size and time requirements for each processor: Discounting the never-seen 6309, the Z-80 is the clear winner on all benchmark fronts here, once clock rates are made comparable. (See below.)

However, benchmark fragments aren't everything. The 6502, for example, has two secret weapons: 1) due to the way its index registers work it's very good at handling data structures and arrays that are less than 256 bytes in size, and 2) it can keep up to 128 pointers (using all of zero page) on hand at once. So, that grotesque setup code probably need only be done once. Most significant programs are quite repetitive in nature; avoiding the setup on all but the first pass can really add up. For example, the second time that structure copy needs to be done, alone of these CPUs the 6502 can probably re-use the source and destination pointers where they sit in RAM.
The setup for the (second and any subsequent) copy then drops to two bytes, and two clock cycles.

The 6502 has been called by some the first common RISC microprocessor, because it has so few internal resources. (An arguable definition, but what is true is that the programmer generally only works with three 8-bit registers, which is Not Much no matter how you measure it. [TI's novel 16-bit 9900 takes this register-less approach even further, though in a different way.]) Instead, the 6502 is designed to work very efficiently with the first 256 bytes of RAM, using them as pointers and arrays, and arrays of pointers. Substantial applications, once written and optimized for the 6502, were fast—often very fast. (None[6] of the 8-bit processors, with the exception of the later 6809/6309, natively handles high-level languages like C and Pascal well at all. Significant 8-bit applications that need maximum performance must be written in assembly language, and are thus well-poised to take advantage of architectural peculiarities.)

The various '80' processors don't have enough internal registers to keep anything interesting long-term within the CPU; everything has to live out in RAM and shuffle in and out of the CPU in use. The extremely introverted 1802, focused on its generous 16 internal registers, still doesn't have enough registers to keep very much control data within the CPU, especially considering its instruction-set peculiarities—most data still has to live out in RAM and endure the shuffle. The extremely extroverted 6502, with practically no registers to speak of and focused on its first 256 bytes of RAM, can keep the bulk of interesting control information out there, where (using 8-bit zero-page addressing) it is fairly fast to access and can be used where it sits, no shuffling required. (Like the 1802, you can't build even the most basic 6502 system without RAM.)
While the 6502 instructions that access data directly in RAM are inherently slower than purely register-based operations, avoiding the shuffle more than makes up for it. That efficiency adds up over an entire application. (The 6809/6309 can do many of the same in-memory tricks, and even some additional ones, but its instructions to do so are larger and slower, largely negating that 'advantage' so far as performance is concerned.)

In the heyday of the 8-bit era the 'fast' 4 MHz Z-80 systems were generally (though erroneously) thought to be faster than the 'fast' 2 MHz 6502 systems. (These were approximately comparable because they interfaced to the same-speed commodity memory, and in those pre-cache days that was all that really mattered. 8-family processors were approximately half as clock-efficient as 6-family processors, so scaling clock numbers by 2 was a reasonable equivalency approximation.) That conclusion was certainly true of small benchmarks: on a fragment-by-fragment basis the Z-80 does very well. But running real-world applications the 2 MHz 6502 systems were usually comfortably in the lead, followed by the 4 MHz Z-80 systems, followed by everything else, with the 1802 systems solidly bringing up the rear. (The 1 MHz 6502 systems, exemplified by the Apple II, naturally found it a lot harder to compete on performance than the 2 MHz systems did, but they usually compensated by having inexpensive integrated graphic displays that shared main memory with no performance hit or visual artifacts.) Larger applications that take advantage of the 6502 instruction architecture simply win.

In fact, if the 1802 were implemented with circuitry as cycle-efficient as the 6502, a not-unreasonable ask, it would actually fare well in these contests, as it's not the worst in code size, and having multiple pointer registers and counters available can be a real advantage in avoiding the pesky shuffle.
Revisited: Instead of 160+96×N clocks, our fictional enhancement would take 20+12×N clocks, making it the fastest of these processors at this task! (Equivalently, clock the real 1802 design at 16–32 MHz—in a non-caching design it's the memory speed that matters, not the clock rate.) Of course, if we're going to invent fictional derivatives of existing processors, a small change to the Z80 would make it unbeatable in this particular challenge. As it happens, the Z80 implements LDIR by decrementing the program counter, rather than incrementing it, for the second byte of the opcode, if the decremented BC counter pair is not zero. This results in re-fetching the LDIR instruction itself for each data byte transferred, except for the last one, which is why it takes a leisurely 21 cycles per byte transferred. (This is elegant, if inefficient, and also ties into preserving its ability to refresh DRAM reliably while still remaining responsive to interrupts during this lengthy instruction, all while minimizing implementation circuitry.) If it had been able to fetch the opcode pair only once and thereafter just move data, under control of the internal registers, it could have taken 13 cycles per byte transferred, or even a theoretical minimum of 6, far superior to any other 8-bit CPU. (The only non-caching processors I'm aware of that could actually keep the memory bus saturated with data while copying bulk data were the 16-bit 68010, and the 8-bit Hitachi 6309.)
5 |
Penalties: 1: P-machine interpreter. 2: Threading overhead. 3: 1802!
6 | In fact, Intel's 8085 had additional instructions that were designed for the convenience of high-level languages like C and Pascal, so it wasn't just the 6809/6309 that could do this. Intel, however, chose to leave them undocumented, presumably so as not to further fragment the burgeoning CP/M market of the time. (No doubt this deliberate omission was heartbreaking to the designers of the 8085.) To my knowledge, no high-level language ever exploited the 8085's additional abilities.