a2DMAMagic : Apple II II+ IIe IIc IIc+
The secret, hidden, high-throughput, transparent 6502 DMA channel
Jorge Chamorro Bieling, 4 March 2007


The idea:

It was while reading the "6502 Instruction Details" chapter of Jim Sather's book "Understanding the Apple II" that I noticed that many 6502 instructions perform useless, 'dead' memory access cycles (see the threads "RFC:DMA for free" and "6502 trashing memory cycles" in comp.sys.apple2).

A classic 1MHz 6502 system like the Apple II has a memory bus bandwidth of 1 MByte/s (at least from the CPU's point of view).

Most of the bandwidth is used by read cycles that fetch opcodes and their operands, and a smaller percentage is used for moving (reading/writing) useful data bytes around. Yet a quite meaningful percentage is wasted on useless, 'dead' memory cycles, interleaved within the execution of the opcodes.

A way to illustrate this is the NOP (No-OPeration) opcode: a 1-byte, 2-cycle instruction that "does nothing". The 6502 fetches the opcode from memory during a 1st memory cycle. Then, instead of going straight on to fetch and execute the next opcode on the following memory cycle, it performs an additional 2nd read cycle to get data that it simply ignores. This 2nd cycle is a "dead", useless memory cycle: 50% of the cycles of a NOP are useless, so 50% of the memory bandwidth is wasted whenever it's executed.

CLC, CLD, CLI, CLV, DEX, DEY, INX, INY, NOP, SEC, SED, SEI, TAX, TAY, TSX, TXA, TXS and TYA all waste 50% of their cycles (1 out of 2).

Another example: a program jumps to a subroutine via a JSR, and later on returns with an RTS.
In this case, the 6502 would ideally need:

- 3 memory cycles: one to fetch the JSR opcode + 2 to fetch the 2 bytes of the target address operand.
- 2 memory cycles to push the 2 bytes of the return address onto the stack.
- 1 memory cycle to fetch the RTS opcode.
- 2 memory cycles to pop the 2 bytes of the return address off the stack.

Therefore 3+2+1+2=8 cycles. Instead, it uses 12 cycles : among them 4 interleaved useless, 'dead' cycles.

RTS takes 6 cycles to execute, but wastes 3 out of the 6: 50% of the bandwidth.
JSR takes 6 cycles to execute, and wastes 1 out of the 6: 16.7% of the bandwidth.
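
The JSR/RTS accounting above can be double-checked with a few lines of Python. This is only a sketch: the names CYCLES and dead_fraction are made up for illustration, and the cycle counts are simply the ones quoted in the text.

```python
# Cycle counts quoted in the text: total and dead cycles per instruction.
# CYCLES and dead_fraction are illustrative names, not part of any tool.
CYCLES = {
    "JSR": {"total": 6, "dead": 1},
    "RTS": {"total": 6, "dead": 3},
}


def dead_fraction(total, dead):
    """Fraction of an instruction's memory cycles that are dead."""
    return dead / total


pair_total = sum(c["total"] for c in CYCLES.values())
pair_dead = sum(c["dead"] for c in CYCLES.values())
pair_useful = pair_total - pair_dead

for name, c in CYCLES.items():
    print(f"{name}: {c['dead']}/{c['total']} dead = {dead_fraction(**c):.1%}")
print(f"JSR+RTS: {pair_total} cycles, {pair_useful} useful, {pair_dead} dead")
```

The pair comes out at 12 cycles, 8 of them useful and 4 dead, matching the figures above.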

The table on page 4-22 of Jim Sather's book shows that 22 out of 29 groups of instructions waste between 1 and 3 memory cycles.
It remains to be seen what percentage of the bandwidth is really being wasted, but every mere 1% (*) accounts for 10 KBytes/s/MHz of ("magic", free) bandwidth!


The circuit:

A circuit could be designed to flag the dead cycles, in order to make them available as a DMA channel.
Even though this circuit has been designed with an Apple II in mind, it's applicable to any other 6502-based system.
Most importantly, it can be added on top of almost any existing 6502 system, as it doesn't require any "architectural" modification.

These are the block and timing diagrams:


The 6502 flags every opcode fetch memory cycle with the SYNC signal (available at the CPU and at any Apple II slot). This signal changes state at the negative-going edge of the 6502 phase 2 clock, roughly the same as the Apple II phase 0 clock (6502's phase 2 lags the Apple II's phase 0 by a few nanoseconds).

SYNC is used to clock a transparent latch in order to capture the opcode that is currently being fetched. (This latch is, in fact, optional, but conceptually it fits there very well.)

The (transparently) latched opcode feeds the input of a combinational logic block (or a PROM) that outputs 6 bits, each bit flagging one of the six forthcoming memory cycles as dead (available) or not. For example, the 6 output bits for a NOP should be 1-0-0-0-0-0, meaning that the first cycle after the opcode fetch is a dead cycle. For an RTS, the output bits should be 1-1-0-0-1-0, meaning that the 1st, 2nd and 5th cycles after the opcode fetch are dead cycles.
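
As a rough sketch of what the PROM contents could look like, here is a lookup table in Python. Only the patterns stated in this text are filled in (the single-byte 2-cycle instructions and RTS). Every other opcode defaults to "no dead cycles", which is a placeholder assumption and not the real table. The names NO_DEAD, build_prom and dead_cycles are invented for this example.

```python
# Sketch of the opcode -> dead-cycle-mask lookup (the "PROM" contents).
# Bit order follows the text: the leftmost bit is the 1st cycle after the
# opcode fetch. Opcodes not listed here default to NO_DEAD, which is a
# placeholder assumption rather than the real pattern for those opcodes.
NO_DEAD = "000000"

# Single-byte, 2-cycle instructions: the cycle after the opcode fetch is dead.
IMPLIED_2_CYCLE = {
    0x18: "CLC", 0xD8: "CLD", 0x58: "CLI", 0xB8: "CLV",
    0xCA: "DEX", 0x88: "DEY", 0xE8: "INX", 0xC8: "INY",
    0xEA: "NOP", 0x38: "SEC", 0xF8: "SED", 0x78: "SEI",
    0xAA: "TAX", 0xA8: "TAY", 0xBA: "TSX", 0x8A: "TXA",
    0x9A: "TXS", 0x98: "TYA",
}
RTS = 0x60


def build_prom():
    """Return a 256-entry table of 6-bit masks, indexed by opcode."""
    prom = [NO_DEAD] * 256
    for opcode in IMPLIED_2_CYCLE:
        prom[opcode] = "100000"
    prom[RTS] = "110010"
    return prom


PROM = build_prom()


def dead_cycles(opcode):
    """1-based positions (after the opcode fetch) that are flagged dead."""
    return [i + 1 for i, bit in enumerate(PROM[opcode]) if bit == "1"]


print("NOP:", dead_cycles(0xEA))
print("RTS:", dead_cycles(RTS))
```

A complete table would need one entry per opcode, taken from the per-instruction cycle descriptions in the references.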

These six bits feed the inputs of a parallel-in, serial-out shift register that is parallel-loaded by SYNC and clocked by phase 0.

The output of this shift register is the FREE/BUSY signal, which changes state either at the negative-going edge of phase 0 or at the negative-going edge of SYNC, i.e. at the start of phase 1, which leaves plenty of time to properly "steal" the cycle.

The FREE/BUSY signal lags phase 0 by the DELAY time (see the timing drawing). This delay is usually the shift register's shift-clock-to-output delay, except for the cycles that immediately follow a SYNC (the cycles that follow an opcode fetch), when the delay is the few nanoseconds that SYNC lags phase 0 plus the shift register's parallel-load-to-output delay.

FREE/BUSY==1 flags the cycles that can be "stolen" from the CPU and used for the DMA channel.
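
The behaviour of the shift register can be sketched in Python, one step per memory cycle. This is a cycle-level model only: it ignores the nanosecond delays discussed above, and it assumes that FREE/BUSY reads 0 during the opcode fetch cycle itself. The names MASKS and free_busy_stream are hypothetical, and only the NOP and RTS masks given in the text are included.

```python
# Cycle-level model of the parallel-in, serial-out shift register.
# Assumption: during an opcode fetch (SYNC high) the register is loaded
# and FREE/BUSY is 0; on every other cycle the next mask bit is shifted out.
MASKS = {
    0xEA: (1, 0, 0, 0, 0, 0),  # NOP, pattern from the text
    0x60: (1, 1, 0, 0, 1, 0),  # RTS, pattern from the text
}


def free_busy_stream(cycles, masks=MASKS):
    """cycles: list of (sync, data_bus) tuples, one per memory cycle.
    Returns one FREE/BUSY value per cycle (1 = cycle can be stolen)."""
    shift_reg = [0] * 6
    out = []
    for sync, data in cycles:
        if sync:
            # Opcode fetch: parallel-load the mask for this opcode.
            shift_reg = list(masks.get(data, (0,) * 6))
            out.append(0)
        else:
            # Any other cycle: shift the next flag out.
            out.append(shift_reg[0])
            shift_reg = shift_reg[1:] + [0]
    return out


# A NOP (2 cycles) followed by an RTS (6 cycles); non-fetch bus data is
# irrelevant to the model, so 0x00 is used as a dummy value.
trace = [(1, 0xEA), (0, 0x00),
         (1, 0x60), (0, 0x00), (0, 0x00), (0, 0x00), (0, 0x00), (0, 0x00)]
stream = free_busy_stream(trace)
print(stream)
print(f"{sum(stream)} of {len(stream)} cycles free")
```

In this model 4 of the 8 cycles are flagged free: the single dead cycle of the NOP and the three dead cycles of the RTS.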

In order to isolate the 6502 from the Apple II bus during the cycles marked FREE/BUSY==1, the DMA pin on the Apple II peripheral slot could be used, were it not that it also stops the 6502 by freezing its clock input. The logic driving the 6502's clock input would have to be slightly modified so that the clock is not stopped by DMA while FREE/BUSY==1. This can be done without adding additional delays to worry about (I have yet to check that). (The WDC 65C02 CPU has a BE (Bus Enable) pin that tri-states the R/W signal and the address and data buses all at once.)

This circuit could be designed to be plugged into the 6502 socket, or as a mixed Apple II slot card + 6502 socket daughter card.


Keep in mind:

It's worth noting that the available bandwidth depends upon the particular sequence of instructions being executed at any given time: unless the code and its flow are well known, the available bandwidth is not very predictable. OTOH, a 6502 looping through a known code flow will provide a *configurable* (up to a point) and fully predictable bandwidth. For unknown code flows it remains to be seen what percentage of the bandwidth it is reasonable to expect, and how "typical" that figure really is.

It's also worth noting that some 6502 wasted cycles that are known to be "dead" are difficult to trap. For example: conditional branches take between 2 and 4 cycles to execute, depending on the state of the condition flag and whether or not a page boundary is crossed. A 1st cycle is wasted if the branch is not taken (the condition is not met), but it can't be easily flagged because that isn't known beforehand. A 2nd cycle is wasted if the branch is taken, and this time a circuit can tell from the outside that the branch is being taken because SYNC does not go high during this cycle. A 3rd cycle is wasted if the branch is taken across a page boundary, and again a circuit can tell by watching the 6502's SYNC signal.
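
The SYNC-watching rule for branches can be written down as a small Python sketch. It assumes the circuit samples SYNC on each cycle that follows the branch's operand fetch and treats every cycle with SYNC low as a dead cycle, until SYNC goes high for the next opcode fetch. The function name branch_dead_cycles is made up for this illustration.

```python
def branch_dead_cycles(sync_after_operand):
    """Count the trappable dead cycles of a conditional branch.

    sync_after_operand: SYNC levels (0 or 1) sampled on the cycles that
    follow the branch operand fetch, in order. Counting stops at the
    first cycle with SYNC high, which is the next opcode fetch.
    """
    dead = 0
    for sync in sync_after_operand:
        if sync:
            break
        dead += 1
    return dead


# The three cases described in the text (opcode + operand = 2 cycles,
# plus however many dead cycles were detected).
cases = {
    "not taken": [1],
    "taken, same page": [0, 1],
    "taken, page crossed": [0, 0, 1],
}
for label, sync_trace in cases.items():
    dead = branch_dead_cycles(sync_trace)
    print(f"{label}: {2 + dead} cycles, {dead} trappable dead")
```

The wasted operand fetch of a branch that is not taken does not show up here, since SYNC gives no advance warning of it.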

This DMA channel is 100% transparent, both to the software and to the 6502 processor.

WHILE the 6502 keeps running at 100% FULL SPEED, transparently, simultaneously, concurrently, TENS of KiloBytes/s/MHz (*) of data can be moved in/out of the Apple II (or any other 6502 system) through this DMA channel... !


Magic, secret, hidden, transparent. What's special about this DMA ?

In other forms of DMA the CPU must release control of the memory bus in order to hand it over to the device requesting the DMA.
A memory-bound, cache-less 6502 CPU without memory access must be halted during a DMA transfer.
Not so with this design.
In this design the CPU is neither halted nor slowed down at all while the DMA transfer takes place.
In this design the DMA transfer takes place transparently, concurrently, without contention.
Any existing 6502 computer implementation can benefit from this design.
It can be implemented on top of an existing system, there's no need to modify the system architecture, nor redesign it.
The memory cycles that this design uses were already there, but were being wasted otherwise.


Bibliography:

(Jim Sather, "Understanding the Apple II", p4-20)
(MOS Technology, MCS6500 Microcomputer Family Hardware Manual, August 1975, p124):

"Note that the (650x) processor often puts out an address and fetches data which it ignores."
"This is an inherent feature of the processor which uses a "look ahead" approach to pipelining."
"Examination of the SYNC signal will allow the designer to keep track of exactly when the data fetched from memory is utilized within the processor and when it is ignored"



(*) Sample throughput analysis:

The guinea pig is the Apple II Monitor's keyboard poll routine (KEYIN) at $FD1B.
That's the loop the Apple II gets into while waiting for a keypress at the CLI prompt, or during a BASIC "input" command.
The code is:

FD1B  INC $4E    (5 CYCLES, 1 AVAILABLE)
FD1D  BNE $FD21  (2 CYCLES NOT TAKEN, 0 AVAILABLE; 3 CYCLES TAKEN, 1 AVAILABLE)
FD1F  INC $4F    (5 CYCLES, 1 AVAILABLE)
FD21  BIT $C000  (4 CYCLES, 0 AVAILABLE)
FD24  BPL $FD1B  (TAKEN, 3 CYCLES, 1 AVAILABLE)

The loop takes
(5+3+4+3)=15 cycles (1+1+0+1=3 available) 255 times, then
(5+2+5+4+3)=19 cycles (1+0+1+0+1=3 available) the 256th time.

One full pass of 256 iterations takes 255*15+19=3844 cycles to execute, and contains 255*3+3=768 dead cycles.
Therefore, 768/3844 = 19.98% of the cycles can be stolen...
and the throughput figure (@ 1MHz) becomes
19.98% (1e6)= ****    199.8 KILOBYTES/s/MHz    ***** !!!
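
The arithmetic of this sample analysis can be reproduced with a short Python sketch. The per-instruction (cycles, available) pairs are the ones from the listing above, and the variable names are only illustrative. One stolen cycle is assumed to move one byte.

```python
# (cycles, available dead cycles) per instruction, as listed above.
INC_ZP = (5, 1)
BNE_TAKEN = (3, 1)
BNE_NOT_TAKEN = (2, 0)
BIT_ABS = (4, 0)
BPL_TAKEN = (3, 1)


def totals(path):
    """Sum the cycles and the available cycles along one pass of the loop."""
    return sum(c for c, _ in path), sum(a for _, a in path)


# 255 passes out of 256: $4E does not wrap, BNE is taken, INC $4F is skipped.
common = [INC_ZP, BNE_TAKEN, BIT_ABS, BPL_TAKEN]
# The 256th pass: $4E wraps to zero, BNE falls through to INC $4F.
rollover = [INC_ZP, BNE_NOT_TAKEN, INC_ZP, BIT_ABS, BPL_TAKEN]

common_cycles, common_free = totals(common)
rollover_cycles, rollover_free = totals(rollover)

total_cycles = 255 * common_cycles + rollover_cycles
dead_cycles = 255 * common_free + rollover_free
fraction = dead_cycles / total_cycles

CLOCK_HZ = 1_000_000  # 1 MHz, one memory cycle per clock
bytes_per_second = fraction * CLOCK_HZ

print(f"{total_cycles} cycles, {dead_cycles} dead, {fraction:.2%} stealable")
print(f"about {bytes_per_second / 1000:.1f} KBytes/s at 1 MHz")
```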

Comments? Email me, or post on Usenet (comp.sys.apple2) or on the 6502.org forum.


Jorge Chamorro, 24 March 2007.