Clockslide: How to waste an exact number of clock cycles on the 6502

February 6th, 2012

by Sven Oliver ‘SvOlli’ Moll; the original German language version has been simultaneously posted on his blog.

This is an article about the 6502 processor about the topic: how to “waste” a number of clock cycles stated in a register, in this case the X register. The principle is simple: you have a number of operations that do close to nothing. The more the code is jumped to at the “front”, the more clock cycles are needed to get to the actual code. If the code is jumped to more at the “end”, the CPU gets to the code in question more quickly.

This nice theory won’t work directly on the 6502, because every instruction takes at least two clock cycles to execute. If you want to get it down to the precision of one cycle, this is getting more difficult. The first half of this trick I found in code of Eckhard Stollberg, who is one of the guys that pionieered homebrew on the Atari 2600 VCS. There, I found some strange bytes:

C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C5 EA

The disassembly looks like this:

; CODE1
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP $EA  ; 3

To run through the code, you’ll need 15 clock cycles, and nothing changes except for some state registers. If the code is called with an offset of one byte, this code will be processed:

; CODE2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C5 ; 2
NOP      ; 2

This makes 14 clock cycles, and only the status register will be changed. If the code is called with an offset of two bytes, it is started at the CODE1 segment at the second instruction. Add another one, you’ll get to the second instruction of the CODE2 segment, and so on. This way it is possible to specify the exact number of clock cycles to be “wasted”. With on exception: to be more specific there are 2 + X clock cycles that are wasted. There is no way to waste exactly one clock cycle.

Now we need a way to specify the “entry” of our “slide”. On a C=64 this would be done using self-modifying code. The operand of a JMP $XXXX instruction will be replaced with the calculated address. This is not possible on systems like the Atari 2600, since the code is run in ROM. One option for example would be to use JMP ($0080) after writing the entry point to $0080 and $0081.

My approach differs a bit from the usual way. RAM is scarce on the Atari, and I don’t want to “waste” up two of the 128 bytes available, when there is another way. When the CPU executes a JSR $XXXX (jump to subroutine) command, it writes the current address to the stack. To be more specific, it is the address of the JSR command + 2 which is the return address – 1. And this is what I do: I write my entry point – 1 to the stack and use the command RTS (return from subroutine) to jump into the clock slide. So, I’m still using two bytes of RAM, but only for a short time, without the need to evaluate which two bytes are available at this point.

; the X register specifies how many of the
; 15 clock cycles possible should be skipped
LDA #>clockslide
PHA
TXA
CLC
ADC #<clockslide
PHA
STA WSYNC ; <= this syncs to start of next scanline
clockslide:
RTS
CMP #$C9
CMP #$C9
CMP #$C9
CMP #$C9
CMP #$C9
CMP #$C9
CMP $EA
realcode:
; and here the real code continues

This approach still has one problem: between “clockslide” and “realcode”, no page crossing may occur. If this were the case, I’d have to increase the high byte on the stack by one. But since the position of the code segments is under my control, I left this out as an exercise for the reader. ;-)

spacer

Posted in 6502, tricks | 6 Comments »

Assembly Evolution Part 1: Accessing Memory and the strange case of the Intel 4004

October 13th, 2011

by Julien Oster, reprinted with permission.

While it has become far less relevant for non-system developers to write assembly than it was a few decades ago, by now CPUs have nevertheless made it much more comfortable to do so. Today we are used to a lot of things: fancy indirect addressing modes with scale, a galore of general purpose registers instead of an accumulator and maybe one or two crippled index registers, condition codes for nearly every instruction (on ARM)…

But also the basics themselves have evolved. Let’s take a look at what past programmers had to put up with in entirely simple, everyday things. We’ll start with the most trivial: writing to memory.

Our goal is to write a single immediate value of 3 into the memory location 5. In light of paging, segmenting and bank switching, we’ll use whatever is convenient as a definition for “memory location”. Also, we’ll let the CPU decide what the word size should be. Since you only need 2 bits to represent a 3, it should fit with every CPUs word size (except for 1 bit CPUs, which actually existed, but that’s a story for another posting). If we have the choice, we’ll just take the smallest.

We’ll work backwards, from the present to the past, to explore the wonders of direct addressing in Intel CPUs. (One precautionary warning though: I only really tested the 4004 code in an emulator, and my habits are highly tainted by current Intel CPUs. So if I made some mistake somewhere, kindly point it out and I’ll fix it!)

x86

On a modern x86 CPU, it is of course fairly easy to write the value 3 to memory cell 5. You just do it:

mov byte [5], 3

A single instruction, simple and obvious. I cheated a bit by not using a segment prefix, nor did I set up any segment registers/selectors beforehand. But assuming a nowadays common OS environment in protected mode, you probably don’t want to fiddle with those selectors anyway.

8085

The Intel 8085 is somewhat of a direct predecessor to the 8086, the first in the line of the excessively successful x86 processors. While the 8086 has a 16 bit data bus, the 8085 only has 8 bit. The address bus is already full 16 bit, but its 16 bit capabilities are limited. Specifically, there is no immediate 16 bit addressing (except for branches), leaving us no way to specify our memory location in the instruction that actually performs the move.

Memory is instead addressed with a pseudo register called M. This pseudo register is in reality just backed by the registers H and L paired together, each 8 bit wide, and accessing it accesses the memory location they point at (you may take a guess which register receives the High byte, and which the Low byte of the address).

Luckily, there are a few simple 16bit instructions for moving immediate values, so all in all we can write our byte with:

LXI H, 0005h ; unlucky syntax, as this actually means HL instead of just H

MOV M, 3

By the way, bonus points if you are somehow able to find out just when the address in HL is available on the address bus. The same applies to the 8080 and 8008. Does the CPU copy the register pair’s content to the address bus pins only when actual memory operations take place, or are the address bus pins somehow directly connected to H and L itself? Is that even feasible? I’d really like to find out…

8008

We continue going further back, skipping the 8080 because it was identical in that regard, and arrive at its direct predecessor instead, the Intel 8008. The 8080 and 8085 were source compatible to the 8008 (which, mind you, is not the same as binary compatible… also it may or may not have required some light automated translation), but in the downward direction we have something vital taking from us: While already using 16bit addresses (with only a 14bit address bus, resulting in 16k memory, though), the only instructions that were allowed to contain 16bit immediate values at all are jumps and branches. Consequently, we are left with no way to completely specify our destination address in one instruction!

Instead, we have to access H and L, together forming pseudo register M’s address, one at a time:

LHI 00h

LLI 05h

LMI 3

4004

It’s hardly possible to go back further than the Intel 4004, at least if you are only considering single chip CPUs (at the time of its conception in the early 70s, there were already famous multi-chip CPUs with comfortable orthogonal instruction sets, notably the PDPs). Indeed, it was the first widely available single chip CPU. This little thing was a 4-bit CPU with some strange quirks, which we will explore further. Overall, it bears little to no resemblance to its successor in name, the Intel 8008 (except for the internal stack, which both had–I will cover that in another posting).

But let’s just look at the code for writing a value of 3 into the memory location at 5 first:

FIM P0, 5; load address 05h into pair R0,R1

SRC P0   ; set address bus to contents of R0,R1

LDM 3    ; load 3 into accumulator

WRM      ; write accumulator content to memory

That looks a bit strange.

As a 4 bit CPU, the 4004 has 4 bit wide registers and addresses 4 bit nibbles as words in memory. It has only one accumulator on which the majority of operations is performed, but sixteen index registers (R0-R15).

Those index registers are handy for accessing memory: Besides loading values directly from ROM, an instruction exists to load data indirectly, which sets the address bus to the ROM cell’s content. Another instruction performs an indirect jump instead. Other than that, you can just increment index registers, albeit there is the interesting “ISZ” instruction that not only increments, but also branches if the result is not 0.

Because the 4004 uses 8 bits to address the 4 bit nibbles, every two consecutive index registers form a pair, which is then used for memory references.

Note that I explicitly said ROM above. This is because in the 4004 architecture, ROM and RAM are actually vastly different beasts, at least from the assembly programmer’s perspective. You can not directly access RAM. It always involves index register pairs, manually sending their content to the address bus (with a strangely named instruction “SRC”, which for some reason spells out send register control) and then issuing another instruction which transfers from or to the accumulator.

Interestingly, accessing regular RAM nibbles is not your only choice among the transfer instructions. You can also fetch from and to I/O ports. But the CPU does not have any direct I/O port, instead they are available on both RAM and ROM! You can also read and write “RAM status characters”, which to me look like plain regular RAM cells within another namespace. If someone knows, I’d love to hear what they were used for (and if they maybe did behave differently to normal RAM).

Take a look at the data sheet. Within its only 9 pages, the instruction set is depicted on page 4 and 5. Especially in the light that fairly reasonable orthogonal instruction sets appear to have been available in multi-chip CPUs, this first single-chip CPU is clearly a strange specialization towards the desk calculator it was meant for (the Busicom 141-PF). It has the aforementioned index register-centered RAM access, separate ROM (although there is a transfer instruction which strangely refers to some optional “read/write program memory”), a three level internal stack which is almost useless for general purpose programming and a lone special purpose instruction for “keyboard process” (KBP).

Original 4004 CPUs go from anything from a few to a few thousand dollars on eBay, depending on their packaging and revision. If you’d like to, you can instead play around with a virtual one in this java-script based, fully fledged assembler, disassembler and emulator, or read the rescued source code of the Busicom 141-PF calculator. There’s lots more of schematics, data sheets and other resources on the Intel’s anniversary project page.

That is, if you are brave enough.

spacer

Posted in archeology | 14 Comments »

The story of 15 Second Copy for the C-64

July 18th, 2011

by Mike Pall, published with permission.

[This is a follow-up to Thomas Tempelmann's Story of FCopy for the C-64.]

Ok, I have to make a confession … more than 25 years late:

I’ve reverse-engineered Thomas Tempelmann’s code, added various improvements and spread them around. I guess I’m at least partially responsible for the slew of fast-loaders, fast-copys etc. that circulated in the German C64 scene and beyond. Uh, oh …

I’ve only published AFLG (auto-fast-loader-generator) under my real name in the German “RUN” magazine. It owes quite a bit to TT’s original ideas. I guess I have to apologize to Thomas for not giving proper credit. But back then in the 80′s, intellectual property matters wasn’t exactly something a kid like me was overly concerned with.

Later on, everyone was soldering parallel-transfer cables to the VIA #1 of the 1541 and plugging them into the C64 userport. This provided extra bandwidth compared to the standard serial cable. It allowed much faster loading of programs with a tiny parallel loader (a file named “!”, that was prepended on all disks). Note that the commercial kits with cables, custom EPROMs and silly dongles followed only much later.

So I wrote “15 second copy”, which worked with a plain parallel cable. Yes, it copied a full 35 track disk in 15 seconds! There was only one down-side: this was only the time for reading/writing from and to disk — you had to swap the floppies seven times (!) and that usually took quite a bit more extra time! ;-)

spacer

It worked by transferring the “live” GCR-encoded data from the 1541′s disk head to the C64 and simultaneously doing a fast checksum. Part of the checksumming was done on the 1541, part was done on the C64. There simply weren’t enough cycles left on either side! Most of the transfer happened asynchronously by adjusting for the slightly different CPU frequencies and with only a minimum number of handshakes. This meant meticulous cycle counting and use of some odd tricks.

The raw GCR took up more space (684*324 bytes) in the C64 RAM, so that’s why it required 4 passes. Other copy programs fully decoded the GCR and required only 3 passes. But GCR decoding was rather time-consuming, so they had to skip some sectors and read every track multiple times. OTOH my program was able to read/write at the full 300rpm, i.e. 5 tracks per second plus stepper time, which boils down to 2x ~7.5 seconds for read and write. Yep, you had to swap the floppies every 2 seconds …

Ok, so I spread the program. For free. I even made a 40 track version, which took 17 seconds. Only to see these coming back in various mutations, with the original credits ripped out, decorated with multiple intros, different groups pretending they wrote it or cracked it (it was free, there was nothing to crack!). The only thing they left alone were the copy routines, probably because they were extremely fragile and hard to understand. So it was really easy to recognize my own code. Some of the commercial parallel-cable + ROM kits even bragged with “Backups in 15 seconds!”. These were blatant rip-offs: they basically changed the screen colors and added a check for their dongles. Duh.

Let’s just say this rather frustrating experience taught me a lot and that’s why I’m doing open source today.

So I shelved my plans to write an enhanced version which would try to compress the memory to reduce the number of passes. Ah, yes … I wrote quite a few packers, too … but I’ll save that story for another time.

I still have the disks with the source code somewhere in my basement. But I’m not so sure I’ll be able to read them anymore. They weren’t of high quality to begin with … and I’d have to find my homegrown toolchain, too. ;-)

But I took the time to reverse-engineer my own code from one of the copies that are floating around on the net. For better understanding on the C64/1541 handshake issues, refer to this article. If you’re wondering about the weird bvc * loops: the 6502 CPU of the 1541 has an SO pin, which is triggered by a full shift register for the data from the disk head. This directly sets the overflow flag in the CPU and allows reading the contents from the shift register with very low latency.

Yes, there’s a lot more weird code in there. For the sake of brevity, here are only the inner loops of the I/O routines for the read, write and verify pass for the C64 and the 1541 side. Enjoy!

  ;--- 1541: Read ---
  ldy #$20
f_read:
  bvc *        ; Wait for disk shift register to fill
  clv
  lda $1c01    ; Load data from disk
  sta $1801    ; Send byte to C64 via parallel cable
  inc $1800    ; Toggle serial pin
  eor $80      ; Compute checksum for 1st GCR byte in $80
  sta $80
  bvc *
  clv
  lda $1c01    ; Load data from disk
  sta $1801    ; Send byte to C64 via parallel cable
  dec $1800    ; Toggle serial pin
  eor $81      ; Compute checksum for 2nd GCR byte in $81
  sta $81
  ; ...
  ; Copy and checksum to $82 $83 $84
  ; And another time for $80 $81 $82 $83 $84 with inverted toggles
  ; ...
  dey
  beq f_read_end
  jmp f_read
f_read_end:
  ; Copy the remaining 4 bytes and checksum to $80 $81 $82
  ; Lots of bit-shifting and xoring to indirectly verify
  ; the sector checksum from the 5 byte xor of the raw GCR data

  ;--- C64: Read ---
  ; Setup ($5d) and ($5f) to point to GCR buffer
  ldy #$00
c_read:
  bit $dd00    ; Wait for serial pin to toggle
  bpl *-3
  lda $dd01    ; Read incoming data (from 1541)
  sta ($5d),y  ; Store to buffer
  iny
  bit $dd00    ; Wait for serial pin to toggle
  bmi *-3
  lda $dd01    ; Read incoming data (from 1541)
  sta ($5d),y  ; Store to buffer
  iny
  bne c_read
c_read2:
  bit $dd00    ; Wait for serial pin to toggle
  bpl *-3
  lda $dd01    ; Read incoming data (from 1541)
  sta ($5d),y  ; Store to buffer
  iny
  bit $dd00    ; Wait for serial pin to toggle
  bmi *-3
  lda $dd01    ; Read incoming data (from 1541)
  sta ($5d),y  ; Store to buffer
  iny
  cpy #$44
  bne c_read2

  ;--- C64: Write ---
  ; Setup ($5d) and ($5f) to point to GCR buffer
  ldy #$00
  tya
c_write:
  eor ($5d),y  ; Load from buffer and compute checksum
  bit $dd00    ; Wait for serial pin to toggle
  bpl *-3
  sta $dd01    ; Store xor'ed outgoing data (to 1541)
  iny
  eor ($5d),y  ; Load from buffer and compute checksum
  bit $dd00    ; Wait for serial pin to toggle
  bmi *-3
  sta $dd01    ; Store xor'ed outgoing data (to 1541)
  iny
  bne c_write
c_write2:
  eor ($5f),y  ; Load from buffer and compute checksum
  bit $dd00    ; Wait for serial pin to toggle
  bpl *-3
  sta $dd01    ; Store xor'ed outgoing data (to 1541)
  iny
  eor ($5f),y  ; Load from buffer and compute checksum
  bit $dd00    ; Wait for serial pin to toggle
  bmi *-3
  sta $dd01    ; Store xor'ed outgoing data (to 1541)
  iny
  cpy #$44
  bne c_write2
  ldx $5b
  sta $0200,x  ; Store checksum for verify pass
  inx
  stx $5b

  ;--- 1541: Write ---
  ldy #$a2
  lda #$00
f_write:
  bvc *        ; Wait for disk shift register to clear
  clv
  eor $1801    ; Xor with incoming data (from C64)
  sta $1c01    ; Write data to disk shift register
  dec $1800    ; Toggle serial pin
  lda $1801    ; Reload data to undo xor for next byte
  bvc *        ; Wait for disk shift register to clear
  clv
  eor $1801    ; Xor with incoming data (from C64)
  sta $1c01    ; Write data to disk shift register
  inc $1800    ; Toggle serial pin
  lda $1801    ; Reload data to undo xor for next byte
  dey
  bne f_write

  ;--- 1541: Verify ---
  ; Get checksum computed by c_write on the C64 side
  ldy #$a2
f_verify:
  bvc *        ; Wait for disk shift register to fill
  clv
  eor $1c01    ; Xor with data from disk
  bvc *        ; Wait for disk shift register to fill
  clv
  eor $1c01    ; Xor with data from disk
  dey
  bne f_verify
  ; Verify is ok if checksum is zero
spacer

Posted in 6502, archeology, trivia | 12 Comments »

The story of FCopy for the C-64

July 15th, 2011

by Thomas Tempelmann, reprinted with permission.

Back in the 80s, the Commodore C-64 had an intelligent floppy drive, the 1541, i.e. an external unit that had its own CPU and everything.

The C-64 would send commands to the drive which in turn would then execute them on its own, reading files, and such, then send the data to the C-64, all over a propriatory serial cable.

The manual for the 1541 mentioned, besides the commands for reading and writing files, that one would read and write to its internal memory space. Even more exciting was that one could download 6502 code into the drive’s memory and have it executed there.

This got me hooked and I wanted to play with that – execute code on the drive. Of course, there was no documention on what code could be executed there, and which functions it could use.

A friend of mine had written a disassembler in BASIC, and so I read out all its ROM contents, which was 16KB of 6502 CPU code, and tried to understand what it does. The OS on the drive was quite amazing and advanced IMO – it had a kind of task management, with commands being sent from the communication unit to the disk I/O task handler.

I learned enough to understand how to use the disk I/O commands to read/write sectors of the disk. Actually, having read the Apple ]['s DOS 3.3 book which explained all of the workings of its disk format and algos in much detail, was a big help in understanding it all.

(I later learned that I could have also found reverse-eng'd info on the more 4032/4016 disk drives for the "business" Commodore models which worked quite much the same as the 1541, but that was not available to me as a rather disconnected hobby programmer at that time.)

Most importantly, I also learnt how the serial comms worked. I realized that the serial comms, using 4 lines, two for data, two for handshake, was programmed very inefficiently, all in software (though done properly, using classic serial handshaking).

Thus I managed to write a much faster comms routine, where I made fixed timing assumptions, using both the data and the handshake line for data transmission.

Now I was able to read and write sectors, and also transmit data faster than ever before.

Of course, it would have been great if one could simply load some code into the drive which speeds up the comms, and then use the normal commands to read a file, which in turn would use the faster comms. This was not possible, though, as the OS on the drive did not provide any hooks for that (mind that all of the OS was in ROM, unmodifiable).

Hence I was wondering how I could turn my exciting findings into a useful application.

Having been a programmer for a while already, dealing with data loss all the time (music tapes and floppy disks were not very realiable back then), I thought: Backup!

So I wrote a backup program which could duplicate a floppy disk in never-before seen speed: The first version did copy an entire 170 KB disk in only 8 minutes (yes, minutes), the second version did it even in about 4.5 minutes. Whereas the apps before mine took over 25 minutes. (Mind you, the Apple ][, which had its disk OS running on the Apple directly, with fast parallel data access, did this all in a minute or so).

And so FCopy for the C-64 was born.

It became soon extremely popular. Not as a backup program as I had intended it, but as the primary choice for anyone wanting to copy games and other software for their friends.

Turned out that a simplification in my code, which would simply skip unreadable sectors, writing a sector with a bad CRC to the copy, did circumvent most of the then-used copy protection schemes, making it possible to copy most formerly uncopyable disks.

I had tried to sell my app and sold it actually 70 times. When it got advertised in the magazines, claiming it would copy a disk in less than 5 minutes, customers would call and not believe it, “knowing better” that it can’t be done, yet giving it a try.

Not much later, others started to reverse engineer my app, and optimize it, making the comms even faster, leading to copy apps that did it even in 1.5 minutes. Faster was hardly possible, because, due to the limited amount of memory available on the 1541 and the C-64, you had to swap disks several times in the single disk drive to copy all 170 KB of its contents.

In the end, FCopy and its optimized successors were probably the most-popular software ever on the C-64 in the 80s. And even though it didn’t pay off financially for me, it still made me proud, and I learned a lot about reverse-engineering, futility of copy protection and how stardom feels. (Actually, Jim Butterfield, an editor for a C-64 magazine in Canada, told its readers my story, and soon he had a cheque for about 1000 CA$ for me – collected by the magazine from many grateful users sending 5$-cheques, which was a big bunch of money back then for me.)

spacer

Posted in 6502, archeology, trivia | 30 Comments »

Chaosradio Express #177: Commodore 64

July 5th, 2011

(This article is about a German-language podcast episode on the C64.)

Im Februar hat mich Tim Pritlove auf der Durchreise in Frankfurt abgefangen, wo ich mit einem Koffer voll mit zwölf Commodore 64 Motherboards in einem Hotelzimmer saß, und mit mir eine 2 Stunden und 42 Minuten lange Episode für Chaosradio Express aufgenommen.

Chaosradio Express #177: Commodore 64

spacer

Hier also nochmal der Hinweis auf die Folge, die jetzt schon ein paar Monate zurückliegt, für all diejenigen, die sie nicht schon anderweitig entdeckt haben. :-)

spacer

spacer

Posted in 6502, archeology, hacks, tricks, trivia | 3 Comments »

Inside Commodore DOS [PDF]

June 28th, 2011

spacer
Richard Immers and Gerald G. Neufeld:
Inside Commodore DOS : the complete guide to the 1541 disk operating system.
Northridge, Calif. : Datamost, 1985.
ISBN 0-8359-3091-2
(512 pages, 7.4 MB PDF)

In my quest to preserve retrocomputing documents, here is the invaluable book “Inside Commodore DOS”, which describes most of the internals of the Commodore 1541 disk drive. The scanning was done in 2002 by Kenneth S. Moore, who in 2005 released an OCRed version, which unfortunately replaced the original page images. My version here comes with the original page images and a table of contents, and is nevertheless fully searchable.

Here is a fun quote from the book by the way:

Over the years numerous writers have advised Commodore owners not to use the save and replace command because it contained a bug. Our study of the ROM routines and a lot of testing has convinced us that the bug in the replace command is a myth.

Of course, this is wrong. Don’t use “SAVE@” on a 1541.

spacer

Posted in 6502, archeology, literature | 5 Comments »

« Older Entries