A Typeable DOS Quine Program

Demonstration of the program on MS-DOS 5.0

The following text is a valid DOS program:

X5M@P5~APZ%3@PY%AAP[5~$P^5T%P_)5GGG)5GG)5XPLFXLFLEQX5M@P5~APZ%3@PY%AAP[5~$P^5T%P_)5GGG)5GG)5XPLFXLFLEQ

What happens when you run it?

Try it out for yourself. As long as you’re using a version of DOS that’s 2.0 or later, it’ll work. Okay, DOS systems are getting harder to come by (and you can’t run DOS programs on 64-bit Windows), so you can head to PCjs and run an emulator in your Web browser. Create a text file, save it with a name ending with .COM, and then run it.

The program outputs its own code!

This type of program is called a quine. Creating a quine can be a fun programming exercise. For example, here’s a short Python one-liner:

p = 'p = {!r}; print(p.format(p))'; print(p.format(p))

(This Python quine is part of a collection of one-line quines I’ve been building over the past few years.)
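One quick way to check that a candidate really is a quine is to run it and compare its output with its own source. The sketch below does that for the one-liner above; the variable names `source`, `buffer`, and `output` are only illustrative. Note that `print` appends a newline, so the comparison allows for it.

```python
import contextlib
import io

# The one-liner from above, held as a string so it can be compared with its output.
source = "p = 'p = {!r}; print(p.format(p))'; print(p.format(p))"

# Run the code and capture everything it prints.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    exec(source, {})
output = buffer.getvalue()

# print() adds a trailing newline, so compare against the source plus "\n".
print(output == source + "\n")
```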

But most quines are written in human-readable programming languages that have to be either compiled or (like Python) interpreted by a separate program. The DOS program is written in actual machine code, which the processor executes directly and which is not meant to be human-readable. So how did I do it?

The EICAR Test File

I was inspired by the EICAR test file, which is a fake “virus” used to test antivirus programs. All it does is output some text, but the idea is that it should be treated as a virus for detection purposes. Since it was developed in the 1990s, it ran on DOS.

It was designed to contain only printable ASCII characters so it could be published and recreated easily. In fact, here’s the whole thing for you to copy right now:

X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*

There’s a great analysis of the program that explains how this was accomplished. Spoiler alert: Self-modifying code is involved. More specifically, it relies on the fact that the x86 opcodes for certain forms of push, pop, xor, and, and sub correspond to ASCII printable characters, and it uses these instructions to modify its own code to create other instructions. (This opcode chart is a helpful reference to determine which instructions correspond to ASCII characters.)
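To make the opcode point concrete, here is a small Python sketch that lists a few printable characters together with the 16-bit x86 instruction each one encodes as an opcode byte. The table is hand-picked for illustration and far from complete, and the names `PRINTABLE_OPCODES` and `is_typeable` are made up for this example.

```python
# Printable ASCII characters and the 16-bit x86 instruction each one encodes
# when it appears as an opcode byte. Hand-picked for illustration, not exhaustive.
PRINTABLE_OPCODES = {
    "P": "push %ax",
    "Q": "push %cx",
    "X": "pop %ax",
    "Y": "pop %cx",
    "Z": "pop %dx",
    "[": "pop %bx",
    "^": "pop %si",
    "_": "pop %di",
    "5": "xor $imm16, %ax",
    "%": "and $imm16, %ax",
    ")": "sub %reg16, r/m16 (followed by a ModR/M byte)",
}


def is_typeable(code: bytes) -> bool:
    """Return True if every byte is printable ASCII (0x20 to 0x7E)."""
    return all(0x20 <= b <= 0x7E for b in code)


for char, instruction in PRINTABLE_OPCODES.items():
    print(f"{char!r} = 0x{ord(char):02X}  {instruction}")

eicar = rb"X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*"
print(is_typeable(eicar))         # the whole test file is printable
print(is_typeable(b"\xcd\x21"))   # int $0x21 (0xCD 0x21) cannot be typed directly
```

The last line shows why self-modification is needed: the bytes for int $0x21 fall outside the printable range, so they have to be produced at run time.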

Creating the Typeable Quine

To create the typeable quine, I started with a simple program to illustrate the strategy:

.code16

# Repeat the following twice: First instance is used as code and second instance is used as data
.rept 2

mov $0x40, %ah
mov $1, %bx
mov $0x13, %cx
mov $0x113, %dx
int $0x21
mov $0x40, %ah
int $0x21
int $0x20

.endr

(I’m using GNU Assembler [GAS], so this is AT&T syntax. I’ll discuss how I actually built the program later on.)

The strategy here is to include two duplicate copies of the code. The program executes the first copy and reads the second copy as data. It outputs two copies of that data, and the end result is that the output is identical to the executable program code.

Many DOS system functions are accessed using int $0x21, and AH determines the specific function to invoke. The 0x40 function is for writing data to a file (the standard output can be treated as a file). BX contains the file handle (1 for standard output), CX contains the number of bytes to write, and DX contains the address of the data. In this program, the first few mov instructions simply set the appropriate values of the registers, and then int $0x21 invokes the function. After the first invocation, the program has to set AH again because the call overwrites AX with its return value (the number of bytes written). After that, the program invokes the same function again and then exits with int $0x20.
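The constants 0x13 and 0x113 follow from the size of the code itself. As a rough check, the Python sketch below hand-assembles one copy of the simple program (the byte values are the standard 16-bit encodings of these instructions; the names `HALF`, `LOAD_ADDRESS`, and `output` are illustrative) and models the two write calls without emulating DOS.

```python
# Hand-assembled machine code for one copy of the simple program
# (16-bit real mode; immediates are little-endian).
HALF = bytes([
    0xB4, 0x40,        # mov $0x40, %ah   (DOS function: write to handle)
    0xBB, 0x01, 0x00,  # mov $1, %bx      (handle 1 = standard output)
    0xB9, 0x13, 0x00,  # mov $0x13, %cx   (number of bytes to write)
    0xBA, 0x13, 0x01,  # mov $0x113, %dx  (address of the second copy)
    0xCD, 0x21,        # int $0x21        (first write)
    0xB4, 0x40,        # mov $0x40, %ah   (AX was overwritten by the return value)
    0xCD, 0x21,        # int $0x21        (second write)
    0xCD, 0x20,        # int $0x20        (exit)
])

LOAD_ADDRESS = 0x100   # where DOS loads a COM file
program = HALF * 2     # what .rept 2 produces

# The second copy starts right after the first one.
data_address = LOAD_ADDRESS + len(HALF)
start = data_address - LOAD_ADDRESS
data = program[start:start + len(HALF)]

# Model of the two write calls: the same data is emitted twice.
output = data + data

print(hex(len(HALF)), hex(data_address), output == program)
```

One copy is 19 (0x13) bytes long, so the second copy sits at 0x100 + 0x13 = 0x113, and writing it twice reproduces the whole file.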

Once I had this overall strategy laid out, I used the same techniques as the EICAR test file to modify the code. Here’s the final result:

.code16

.rept 2

# Set AH to 0x40, then save value on stack (lower byte AL is somewhat arbitrary)
# (AX starts at 0x0000 here: the pop takes the zero word DOS leaves on top of the stack)
pop %ax
xor $0x404D, %ax
push %ax

# Set DX to 0x0133 (start address of data)
xor $0x417E, %ax
push %ax
pop %dx

# Set CX to 0x0033 (data length)
and $0x4033, %ax
push %ax
pop %cx

# Set BX to 0x0001 (file handle; 1 = standard output)
and $0x4141, %ax
push %ax
pop %bx

# Set SI to 0x247F (value to subtract; somewhat arbitrary)
xor $0x247E, %ax
push %ax
pop %si

# Set DI to 0x012B (address of code to modify)
xor $0x2554, %ax
push %ax
pop %di

# Modify code
sub %si, (%di)
inc %di
inc %di
inc %di
sub %si, (%di)
inc %di
inc %di
sub %si, (%di)

# Set AH to 0x40 from stack, push again for later use
pop %ax
push %ax

# Code to modify (0x4C 0x46 will become 0xCD 0x21, or int $0x21)
dec %sp
inc %si

# Set AH again because the previous call sets AX to the number of bytes written
pop %ax

# More code to modify (0x4C 0x46 0x4C 0x45 will become 0xCD 0x21 0xCD 0x20, or int $0x21 followed by int $0x20)
dec %sp
inc %si
dec %sp
inc %bp

# Dummy byte to make length odd so BX can be set using AND with length
push %cx

.endr

The program data to be modified happens to represent a series of one-byte instructions, but this isn’t really necessary for the quine to work. Similarly, the dummy byte doesn’t need to represent a one-byte instruction either, but I chose it because it’s “Q” in ASCII, for “quine.” (Also, this byte doesn’t occur anywhere else, so it can serve as a marker for the two halves.) I had to play around with the constants a bit to get these to work, but I think they add a nice touch.
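The chain of XOR and AND constants is easier to follow when traced step by step. The Python sketch below reads the immediates straight out of the typeable text, reproduces the register values named in the comments, and applies the three subtractions. It is only a model of the arithmetic, not an emulator, and it assumes that the first pop %ax yields 0x0000 (DOS normally leaves a zero word on top of the stack when a COM program starts). The names `HALF`, `imm16`, and `patch_start` are illustrative.

```python
HALF = b"X5M@P5~APZ%3@PY%AAP[5~$P^5T%P_)5GGG)5GG)5XPLFXLFLEQ"
LOAD_ADDRESS = 0x100  # COM programs are loaded at offset 0x100


def imm16(offset: int) -> int:
    """Read the little-endian 16-bit immediate at this offset in HALF."""
    return int.from_bytes(HALF[offset:offset + 2], "little")


# Register setup. Assumption: the first pop %ax yields 0x0000.
ax = 0
ax ^= imm16(2)    # xor $0x404D -> AH = 0x40 (pushed for later)
ax ^= imm16(6)    # xor $0x417E
dx = ax           # start address of the second copy
ax &= imm16(11)   # and $0x4033
cx = ax           # length of one copy
ax &= imm16(16)   # and $0x4141
bx = ax           # file handle
ax ^= imm16(21)   # xor $0x247E
si = ax           # value to subtract
ax ^= imm16(26)   # xor $0x2554
di = ax           # address of the first word to patch
patch_start = di

# Self-modification: sub %si, (%di) three times, with DI advanced by 3, then 2.
code = bytearray(HALF * 2)
for advance in (0, 3, 2):
    di += advance
    offset = di - LOAD_ADDRESS
    word = int.from_bytes(code[offset:offset + 2], "little")
    code[offset:offset + 2] = ((word - si) & 0xFFFF).to_bytes(2, "little")

# Each write call emits CX bytes starting at DX, which is the untouched second copy.
data = bytes(code[dx - LOAD_ADDRESS:dx - LOAD_ADDRESS + cx])
output = data * 2

print(f"DX={dx:#06x} CX={cx:#06x} BX={bx:#06x} SI={si:#06x} DI={patch_start:#06x}")
print(code[43:50].hex(" "))
print(output == HALF * 2)
```

The seven patched bytes at the end of the first copy decode as int $0x21, pop %ax, int $0x21, int $0x20. Only the first copy is modified, which is why the second copy can still be written out verbatim.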

Building DOS Programs on Linux

I mainly use Linux so I built these programs using GNU binutils. This includes the assembler (as) and linker (ld), as well as objdump which serves as the disassembler. These are generally installed alongside GCC, so if you have GCC installed you should already have these available.

To produce a COM executable for DOS from assembly code:

  1. Add .code16 to the top of the code. DOS runs in 16-bit real mode, but the assembler produces either 32-bit or 64-bit code by default (depending on your operating system).
  2. Run as to produce an object file: as -o myprog.o myprog.s assembles the code in myprog.s into an object file called myprog.o.
  3. Run ld to produce the executable from the object file: ld -Ttext=0x100 -e 0x100 --oformat binary -o myprog.com myprog.o
    • The -Ttext option sets the start address of the code. Here, the start address is 0x100 since that’s where programs are loaded in DOS. The -e option sets the entry point, although all it does here is override the default behavior of looking for a _start symbol. The --oformat option sets the output format: A COM executable is just a simple dump of machine code with no metadata, which is what the “binary” format is.
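The two build steps can also be scripted. The following is a minimal sketch that wraps the exact as and ld invocations shown above in Python; the helper names `build_commands` and `build` are hypothetical, and it assumes both tools are on the PATH. Run directly, it only prints the commands.

```python
import subprocess


def build_commands(stem: str) -> list[list[str]]:
    """Return the as and ld invocations that turn <stem>.s into <stem>.com."""
    return [
        ["as", "-o", f"{stem}.o", f"{stem}.s"],
        ["ld", "-Ttext=0x100", "-e", "0x100", "--oformat", "binary",
         "-o", f"{stem}.com", f"{stem}.o"],
    ]


def build(stem: str) -> None:
    """Run the assembler, then the linker, stopping at the first failure."""
    for argv in build_commands(stem):
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Print the commands rather than running them, so this is safe to try anywhere.
    for argv in build_commands("myprog"):
        print(" ".join(argv))
```

Calling build("myprog") in a directory that contains myprog.s would run both steps and leave myprog.com next to it.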

To run the program I used DOSBox. There are other DOS emulators that would probably work just as well, but this was the one available in my distribution’s (Debian) package repository.

To disassemble an executable, run objdump -D -m i8086 -b binary myprog.com

Note that objdump uses AT&T syntax by default, which is also what I use throughout this post.

(While writing this, I found out that NASM can assemble code directly to a binary format for a COM executable, and there’s even a section in the documentation about building COM executables.)

A Final Note About Newlines

The usual methods of creating a text file in MS-DOS, including EDIT and ECHO, add a newline (0x0D 0x0A) at the end of the file. You can tell if this has happened by checking the file size: The original file without the newline is 102 bytes, but it becomes 104 with the newline. In this situation, the output of the program won’t match exactly with the program file. (That is, if you run TYPQUINE.COM > OUTPUT.TXT and then COMP TYPQUINE.COM OUTPUT.TXT, the files would be reported as different.) Windows Notepad lets you save the file without the final newline, but if that’s not an option and you want an exact match, here’s a modified version (two identical lines of text, 106 bytes total with the newlines):

X5K@P5~APZ%5@PY%AAP[5~$P^5T%P_)5GGG)5GG)5XPLFXLFLEQ
X5K@P5~APZ%5@PY%AAP[5~$P^5T%P_)5GGG)5GG)5XPLFXLFLEQ
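Only two constants differ between the two versions, because the newline makes each half two bytes longer. The Python sketch below recomputes DX, CX, and BX from the constants in each variant and checks them against the actual length. As before, it assumes AX is 0x0000 after the first pop, and the names `ORIGINAL`, `MODIFIED`, and `layout` are illustrative.

```python
ORIGINAL = b"X5M@P5~APZ%3@PY%AAP[5~$P^5T%P_)5GGG)5GG)5XPLFXLFLEQ"
# Each line of the modified version ends with the DOS newline 0x0D 0x0A.
MODIFIED = b"X5K@P5~APZ%5@PY%AAP[5~$P^5T%P_)5GGG)5GG)5XPLFXLFLEQ\r\n"


def layout(half: bytes) -> tuple[int, int, int]:
    """Return (DX, CX, BX) as computed by the first four constants."""
    ax = int.from_bytes(half[2:4], "little")          # xor into AX = 0 (assumed)
    dx = ax ^ int.from_bytes(half[6:8], "little")     # data address
    cx = dx & int.from_bytes(half[11:13], "little")   # data length
    bx = cx & int.from_bytes(half[16:18], "little")   # file handle
    return dx, cx, bx


for half in (ORIGINAL, MODIFIED):
    dx, cx, bx = layout(half)
    # The data must start right after the first half and span exactly one half.
    assert dx == 0x100 + len(half) and cx == len(half) and bx == 1
    print(len(half) * 2, hex(dx), hex(cx))
```

If you create the file from a script instead of an editor, writing `ORIGINAL * 2` to a file opened in binary mode ("wb") avoids any newline being added.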
Philip Chung
Software Developer