a2DMAMagic : Apple II II+ IIe IIc IIc+
The secret, hidden, high-throughput, transparent 6502 DMA channel
Jorge Chamorro Bieling, March 4, 2007
The idea:
It was while reading the "6502 Instruction Details" chapter of Jim Sather's book "Understanding the Apple II" that I noticed that many 6502 instructions perform useless, 'dead' memory access cycles (see the threads "RFC:DMA for free" and "6502 trashing memory cycles" in comp.sys.apple2).
A classic 1MHz 6502 system like the Apple II has a memory bus bandwidth of 1 MByte/s (at least from the CPU's point of view).
Most of the bandwidth is used by read cycles that fetch opcodes and their operands, and a smaller percentage is used for moving (reading/writing) useful data bytes around. Yet a quite meaningful percentage gets wasted in useless, 'dead' memory cycles interleaved with the useful ones while an instruction executes.
A way to illustrate this is the NOP (No-OPeration) opcode: a 1-byte, 2-cycle instruction that "does nothing". The 6502 fetches the opcode from memory during the 1st memory cycle. Then, instead of going straight on to fetch and execute the next opcode in the following memory cycle, it performs an additional 2nd read cycle whose data it simply ignores. This 2nd cycle is a 'dead', useless memory cycle: 50% of a NOP's cycles are useless, so 50% of the memory bandwidth is wasted whenever it is executed.
CLC, CLD, CLI, CLV, DEX, DEY, INX, INY, NOP, SEC, SED, SEI, TAX, TAY,
TSX, TXA, TXS and TYA all waste 50% of their cycles (1 out of 2).
Another example: a program jumps to a subroutine via a JSR, and later on returns with an RTS.
In this case, the 6502 should need only:
- 3 memory cycles: one to fetch the JSR opcode plus 2 to fetch the 2 bytes of its operand, the address to jump to.
- 2 memory cycles to push the 2 bytes of the return address onto the stack.
- 1 memory cycle to fetch the RTS opcode.
- 2 memory cycles to pop the 2 bytes of the return address off the stack.
That makes 3+2+1+2=8 cycles. Instead, it uses 12 cycles: 4 of them are interleaved useless, 'dead' cycles.
RTS takes 6 cycles to execute but wastes 3 of them, 50% of the bandwidth.
JSR takes 6 cycles to execute and wastes 1 of them, 16.7% of the bandwidth.
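The accounting above can be sketched in a few lines of Python. This is only an illustration: the table holds just the (total, dead) cycle counts quoted in the text, and the names `CYCLES` and `wasted_fraction` are made up for this example.

```python
# (total cycles, dead cycles) per instruction, as quoted in the text above.
CYCLES = {
    "NOP": (2, 1),
    "JSR": (6, 1),
    "RTS": (6, 3),
}

def wasted_fraction(mnemonics):
    """Fraction of memory cycles that are dead for a sequence of instructions."""
    total = sum(CYCLES[m][0] for m in mnemonics)
    dead = sum(CYCLES[m][1] for m in mnemonics)
    return dead / total

for seq in (["NOP"], ["JSR"], ["RTS"], ["JSR", "RTS"]):
    print("+".join(seq), f"{100 * wasted_fraction(seq):.1f}% wasted")
```

A JSR/RTS pair comes out at 4 dead cycles out of 12, one third of the bandwidth.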
The table on page 4-22 of Jim Sather's book shows that 22 out of 29 groups of instructions waste between 1 and 3 memory cycles.
It remains to be seen what percentage of the bandwidth is really being wasted, but every single 1% (*) is worth 10 KBytes/s/MHz of ("magic", free) bandwidth !
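As a quick check of that figure: one memory cycle moves one byte, so the free bandwidth is simply the dead fraction times the clock rate. The function name below is hypothetical, and the sample fractions are arbitrary illustrations rather than measurements.

```python
def free_bandwidth_bytes_per_s(dead_fraction, clock_hz=1_000_000):
    """One byte per memory cycle, so bytes/s equals dead cycles per second."""
    return dead_fraction * clock_hz

# Arbitrary sample fractions, only to show the scaling at 1 MHz.
for pct in (1, 5, 20):
    kbytes = free_bandwidth_bytes_per_s(pct / 100) / 1000
    print(f"{pct}% dead cycles -> {kbytes:.0f} KBytes/s")
```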
The circuit:
A circuit could be designed to flag the dead cycles, in order to
make them available as a DMA channel.
Even though this circuit has been designed with an Apple II in mind,
it's applicable to any other 6502-based system.
Most importantly, it can be added on top of almost any existing
6502 system, as it doesn't require any "architectural" modification.
These are the block and timing diagrams:
The 6502 flags every opcode fetch memory cycle with the SYNC signal
(available at the CPU and at any Apple II slot). This signal changes
state at the negative-going edge of the 6502 phase 2 clock, roughly the
same as the Apple II phase 0 clock (the 6502's phase 2 lags the Apple II's
phase 0 by a few nanoseconds).
SYNC is used to clock a transparent latch in order to capture the opcode
that is currently being fetched. (This latch is, in fact, optional, but
conceptually it fits there very well.)
The (transparently) latched opcode feeds the input of a combinational
logic block (or a PROM) that outputs 6 bits, each bit flagging one of the
six forthcoming memory cycles as dead (available) or not.
For example, the 6 output bits for a NOP should be 1-0-0-0-0-0, meaning
that the first cycle after the opcode fetch is a dead cycle. For an RTS,
the output bits should be 1-1-0-0-1-0, meaning that the 1st, 2nd and 5th
cycles after the opcode fetch are dead cycles.
These six bits feed the input of a parallel-in, serial-out shift
register that is parallel-loaded by SYNC and clocked by phase 0.
The output of this shift register is the FREE/BUSY signal, which changes
state either at the negative-going edge of phase 0 or at the
negative-going edge of SYNC, i.e., at the start of phase 1, therefore
giving plenty of time to properly "steal" the cycle.
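The latch, PROM and shift register chain described above can be modelled cycle by cycle in software. The sketch below is a behavioural model only, with no timing: the table contains just the two patterns given in the text (keyed by the standard opcode values $EA for NOP and $60 for RTS), every other opcode is treated as having no dead cycles, and the names `PROM` and `free_busy` are invented for this example.

```python
# Opcode -> 6-bit pattern; the first bit is the first cycle after the fetch.
# Only the two patterns from the text are filled in. A real PROM would need
# an entry for every opcode; unknown opcodes are treated as all-busy here.
PROM = {
    0xEA: (1, 0, 0, 0, 0, 0),  # NOP
    0x60: (1, 1, 0, 0, 1, 0),  # RTS
}
NO_DEAD = (0, 0, 0, 0, 0, 0)

def free_busy(bus_trace):
    """bus_trace: (sync, data_bus) pairs, one per memory cycle.

    Returns one FREE/BUSY value per cycle (1 = the cycle can be stolen).
    """
    shift_reg = list(NO_DEAD)
    out = []
    for sync, data in bus_trace:
        if sync:
            # Opcode fetch: this cycle is busy, and the pattern for the
            # following cycles is parallel-loaded into the shift register.
            out.append(0)
            shift_reg = list(PROM.get(data, NO_DEAD))
        else:
            # Any other cycle: shift the next flag out.
            out.append(shift_reg.pop(0))
            shift_reg.append(0)
    return out

# A NOP followed by an RTS; data values on non-fetch cycles are don't-cares.
trace = [(1, 0xEA), (0, 0x00),
         (1, 0x60), (0, 0x00), (0, 0x00), (0, 0x00), (0, 0x00), (0, 0x00)]
print(free_busy(trace))
```

For this trace the model flags 4 of the 8 cycles as free: one after the NOP fetch, and the 1st, 2nd and 5th after the RTS fetch.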
The FREE/BUSY signal lags phase 0 by a DELAY time (see the timing
drawing). This delay is usually the shift register's
shift-clock-to-output delay, except for the cycles that immediately
follow a SYNC (the cycles that follow an opcode fetch), when the delay is
the few nanoseconds that SYNC lags phase 0 plus the shift register's
parallel-load-to-output delay.
FREE/BUSY==1 flags the cycles that can be "stolen" from the CPU and
used for the DMA channel.
To isolate the 6502 from the Apple II bus during the cycles marked
FREE/BUSY==1, the DMA pin on the Apple II peripheral slot could be used,
were it not that it stops the 6502 by freezing its clock input. The
6502's clock input driving logic would have to be slightly modified so
that DMA does not stop the clock when FREE/BUSY==1. This should be doable
without adding additional delays to worry about (I have yet to check
that). (The WDC 65C02 CPU has a BE (Bus Enable) pin that tri-states the
R/W signal and the address and data buses all at once.)
This circuit could be designed to be plugged into the 6502 socket, or as
a combined Apple II slot card + 6502 socket daughter card.
Keep in mind:
It's worth noting that the available bandwidth depends on the particular
sequence of instructions being executed at any given time: unless the
code and its flow are well known, the available bandwidth is not very
predictable.
OTOH, a 6502 looping through a known code flow will provide a
*configurable* (up to a certain point) and fully predictable bandwidth.
For unknown code flows, it remains to be seen what percentage of
bandwidth can reasonably be expected, and how "typical" that figure
really is.
It's also worth noting that some 6502 wasted cycles that are known to be
"dead" are difficult to trap. For example: conditional branches take
between 2 and 4 cycles to execute, depending on the state of the
condition flag and on whether or not a page crossing takes place. A 1st
cycle is wasted if the branch is not taken (the condition is not met),
but it can't easily be flagged because it isn't known beforehand. A 2nd
cycle is wasted if the branch is taken, and this time a circuit can tell
from the outside that the branch is being taken, because SYNC doesn't
come true during this cycle. A 3rd cycle is wasted if the branch is taken
across a page boundary, and again a circuit can tell by watching the
6502's SYNC signal.
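The SYNC-watching rule for the 2nd and 3rd branch cycles can be written down as a tiny decision function. This is a sketch of the rule as stated above, not of any real circuit; the function name and the list-of-booleans input format are assumptions made for the example.

```python
def branch_dead_cycles(sync_after_operand):
    """Count the trappable dead cycles of a conditional branch.

    sync_after_operand: SYNC levels (True = opcode fetch) for the cycles
    that follow the branch's operand fetch, in order.
    Returns 0 (not taken), 1 (taken) or 2 (taken across a page boundary).
    """
    dead = 0
    for sync in sync_after_operand[:2]:
        if sync:
            break  # the next opcode fetch has started: the branch is over
        dead += 1  # SYNC stayed low: this cycle is dead and can be flagged
    return dead

print(branch_dead_cycles([True]))                # branch not taken
print(branch_dead_cycles([False, True]))         # branch taken
print(branch_dead_cycles([False, False, True]))  # taken, page crossed
```

The 1st wasted cycle (the operand fetch of a branch that is not taken) does not appear here, since nothing visible from the outside identifies it in advance.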
This DMA channel is 100% transparent, both to the software and to the
6502 processor.
WHILE the 6502 keeps running at 100% FULL SPEED, transparently and concurrently,
TENS of KiloBytes/s/MHz (*) of data can be moved in/out of the Apple II
(or any other 6502 system) through this DMA channel... !
Magic, secret, hidden, transparent. What's special about this DMA ?
In other forms of DMA the CPU must release control of the memory bus in order to hand it over to the device requesting the DMA.
A memory-bound, cache-less 6502 CPU can do nothing without memory access, so it must be halted during a DMA transfer.
Not so with this design.
In this design the CPU is neither halted nor slowed down at all while the DMA transfer takes place.
In this design the DMA transfer takes place transparently, concurrently, without contention.
Any already existing 6502 computer implementation can benefit from this design.
It can be implemented on top of an existing system: there's no need to modify the system architecture or redesign it.
The memory cycles that this design uses were already there, but were otherwise being wasted.
Bibliography:
(Jim Sather, "Understanding the Apple II", p4-20)
(MOS Technology MCS6500 Microcomputer Family Hardware Manual, August, 1975, p124) :
"Note that the (650x) processor often puts out an address and fetches
data which it ignores."
"This is an inherent feature of the processor which uses a "look ahead"
approach to pipelining."
"Examination of the SYNC signal will allow the designer to keep track
of exactly when the data fetched from memory is utilized within the
processor and when it is ignored"
(*) Sample throughput analysis :
The guinea pig is the Apple II Monitor's keyboard poll routine (KEYIN)
at $FD1B.
That's the loop the Apple II gets into while waiting for a keypress at
the CLI prompt, or during a BASIC "input" command.
The code is:
The loop takes
(5+3+4+3)=15 cycles (1+1+0+1=3 of them available) 255 times, then
(5+2+5+4+3)=19 cycles (1+0+1+0+1=3 of them available) the 256th time.
The routine therefore takes 255*15+19=3844 cycles to execute, of which
255*3+3=768 are dead cycles. Therefore, 768/3844 = 19.98% of the
cycles can be stolen...
and the throughput figure (@ 1MHz) becomes 19.98% * 1e6 = 199.8 KILOBYTES/s/MHz !!!
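The arithmetic above is easy to reproduce. The sketch below uses only the per-instruction (cycles, dead cycles) pairs that appear in the sums in the text; the variable and function names are made up for the example, and the instruction listing itself is not reproduced.

```python
# (cycles, dead cycles) per instruction, copied from the sums in the text.
USUAL_PASS = [(5, 1), (3, 1), (4, 0), (3, 1)]           # 255 times
LAST_PASS = [(5, 1), (2, 0), (5, 1), (4, 0), (3, 1)]    # the 256th time

def totals(instructions):
    """Return (total cycles, dead cycles) for one pass through the loop."""
    cycles = sum(c for c, _ in instructions)
    dead = sum(d for _, d in instructions)
    return cycles, dead

def keyin_throughput(clock_hz=1_000_000):
    """Return (cycles, dead cycles, dead fraction, free bytes/s)."""
    c1, d1 = totals(USUAL_PASS)
    c2, d2 = totals(LAST_PASS)
    cycles = 255 * c1 + c2
    dead = 255 * d1 + d2
    fraction = dead / cycles
    return cycles, dead, fraction, fraction * clock_hz

cycles, dead, fraction, rate = keyin_throughput()
print(f"{dead}/{cycles} dead cycles = {100 * fraction:.2f}%,"
      f" {rate / 1000:.1f} KBytes/s at 1 MHz")
```

Changing the two lists to describe another loop gives the predictable bandwidth of that loop in the same way.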
Comments ? email me, or post here, or here (usenet, csa2), or here (6502.org forum).