Yet another 8086 disassembler pet project

Disclosure: I’m a so-so programmer. This can’t stop me, of course. In search of a pet project that

  • has the Tunes nature
  • but is modest in scope and amateur-friendly

… I found this stimulating passage:

Another important feature of Tunes is reflexivity. Consider the DOS emulator on Linux (or any such system) to start with: it “reflects” DOS under Linux, through the use of the Intel 386 Virtual mode. Now is Linux capable of running a Linux emulator, in other words of reflecting itself? It isn’t, because the Intel 386 Virtual mode is only capable of virtualizing (reflecting) real mode. It would be possible, however, to use a roundabout way, by writing an Intel 386 emulator, that is, reflecting the Intel chip under Linux, and then running Linux on that virtual Intel chip [...]

[Source: David Madore, http://www.madore.org/~david/computers/tunes.html]

The prospect of writing an i386 emulator excited me pretty good. How hard can that be? In Common Lisp?

I blew a night over failed attempts to organize opcodes and had to reduce my ambition:

  • start with the 8086 (no protected mode, 20-bit addresses)
  • write an opcode decoder first, and make sure that it works before turning to simulation
  • look at how the smart people have done this (i.e. find source code)

The last point, looking at other people’s source code, was the hardest. As Joel Spolsky observes, reading [understanding] source code is harder than writing it.

I could not find an 8086 opcode decoder written in Common Lisp, but it became clear that writing a disassembler was the way to go. Once you go the extra mile and attach a “disassembly backend” to the decoder, you can check that the decoder works by comparing its disassembly output to that of another, trustworthy disassembler.
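The comparison idea can be sketched in a few lines of Python. This is only an illustration, not what I actually ran: the function names `normalize` and `compare_listings` and the two sample listings are made up for the example, and real disassemblers differ in syntax far more than in case and spacing, so a real normalizer would need more work.

```python
import difflib

def normalize(listing: str) -> list[str]:
    """Lower-case every line and collapse whitespace, so that purely
    cosmetic differences between two disassemblers do not count."""
    return [" ".join(line.lower().split())
            for line in listing.splitlines() if line.strip()]

def compare_listings(mine: str, reference: str) -> list[str]:
    """Return unified-diff lines; an empty list means the listings agree."""
    return list(difflib.unified_diff(normalize(reference), normalize(mine),
                                     fromfile="reference", tofile="mine",
                                     lineterm=""))

# Hypothetical sample listings (address, mnemonic, operands).
reference = "0000  NOP\n0001  PUSH AX\n0002  RET\n"
mine = "0000  nop\n0001  push  ax\n0002  retf\n"

for line in compare_listings(mine, reference):
    print(line)
```

Every line the diff prints is a place where either my decoder or my normalizer is wrong, which is exactly the kind of feedback a so-so programmer needs.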

Smart people have written all sorts of disassemblers.

The first was the vintage program DEBUG.EXE (ca. 1980, still included in 32-bit Windows Vista distros). This 16-bit living fossil sports an interactive disassembler to aid debugging. You type in an address or a small address range, and DEBUG.EXE shows you “source code”, i.e. a disassembly of the raw bytes at, let’s say, 18EF:0201 – 18EF:0240. This was Microsoft’s idea of an IDE in 1980, when Funky Town ruled the earth. A distinctive feature of DEBUG.EXE is its primitivity. It disassembles everything, even data, with the relentless stupor of a steam engine on tracks. Invalid instructions cause a hiccup in the form of “???”, but can’t seriously derail DEBUG.EXE.
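The steam-engine approach is easy to sketch. The following Python toy is not DEBUG.EXE's algorithm (which is closed source), just a minimal linear sweep under loud assumptions: the opcode table holds only a handful of genuine 8086 encodings, so most valid opcodes also come out as “???” here, and the names `decode_one` and `linear_sweep` are invented for the example.

```python
# Toy opcode table: a few genuine one-byte 8086 opcodes, nowhere near complete.
ONE_BYTE = {0x90: "NOP", 0xC3: "RET", 0xF4: "HLT", 0xFA: "CLI", 0xFB: "STI"}
REG16 = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]

def decode_one(code: bytes, pos: int) -> tuple[str, int]:
    """Decode the instruction at pos; return (text, length in bytes)."""
    op = code[pos]
    if op in ONE_BYTE:
        return ONE_BYTE[op], 1
    if 0x50 <= op <= 0x57:                      # PUSH r16
        return f"PUSH {REG16[op - 0x50]}", 1
    if 0x58 <= op <= 0x5F:                      # POP r16
        return f"POP {REG16[op - 0x58]}", 1
    if 0xB8 <= op <= 0xBF and pos + 2 < len(code):   # MOV r16,imm16
        imm = code[pos + 1] | (code[pos + 2] << 8)   # little-endian immediate
        return f"MOV {REG16[op - 0xB8]},{imm:04X}", 3
    return "???", 1                             # unknown to this toy: skip one byte

def linear_sweep(code: bytes, origin: int = 0) -> list[str]:
    """Disassemble byte after byte, never asking whether it is code or data."""
    lines = []
    pos = 0
    while pos < len(code):
        text, length = decode_one(code, pos)
        raw = code[pos:pos + length].hex().upper()
        lines.append(f"{origin + pos:04X} {raw:<8}{text}")
        pos += length
    return lines

for line in linear_sweep(bytes([0xB8, 0x34, 0x12, 0x50, 0x0F, 0xC3]), origin=0x0100):
    print(line)
```

The important design point is the last line of `decode_one`: an unknown byte costs exactly one byte and the sweep simply carries on, which is why garbage can make it hiccup but not derail it.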

All this is in contrast to static disassembly, for which the disassembler slurps, disassembles and dumps an entire program at once. This makes possible all sorts of analyses and conclusions about how the binary code is structured and behaves. In its most elaborate form, such a disassembler can identify procedures, identify and name symbolic labels, generate hyperlinked web pages of assembly source code, etc. An example of a good open source static disassembler is ndisasm, a utility in the Netwide Assembler project (http://www.nasm.us). ndisasm can disassemble 32-bit and 64-bit opcodes, but also 16-bit ones.

It does not discover labels or do anything fancy, however. [For an example of what a fancy disassembler can do, look at this Z80-disassembly, a 10,000 line bouncer -- note the clickable labels].

I considered both DEBUG.EXE and ndisasm totally trustworthy.

  • DEBUG.EXE due to its respectable length of service, plus I never heard complaints about it
  • ndisasm, because it is part of a successful open source project

Both disassemblers can serve as good yardsticks for how well other disassemblers work.

In terms of enlightenment for underground disassembler constructors, however, they are not very useful, because they are not easily accessible:

  • DEBUG.EXE is a closed source program
  • ndisasm is big and scary

Working with other people’s source code is hard enough, because other people’s source code is only marginally more comprehensible than raw opcodes. Many more industrial-strength disassemblers exist, but, like DEBUG.EXE and ndisasm, they are either too closed or too big for impatient hobbyists like me who want to steal and code code code.

I found a handful of pet 8086 disassemblers, in various programming languages:

  • disasm.asm is programmed in TASM 3.0 (I have no TASM, so I could not compile it)
  • disasm.c sort of works, but dies with a segfault half-way into a random bytes file
  • disasm.pas is programmed in TurboPascal; it can process a 64k random bytes file without dying
  • dasm3.py is programmed in Python, can process a 64k random bytes file without dying, and comes with links and an explanation of how the program works and how data is organized (jackpot!)
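The random bytes file used as a torture test above is trivial to produce. Here is a minimal Python sketch; the function name `write_random_bytes`, the default seed and the file name in the usage note are assumptions made for illustration, not the tool I actually used. Seeding the generator is a deliberate choice: when a disassembler dies half-way through, you want to be able to regenerate exactly the same file.

```python
import random

def write_random_bytes(path: str, size: int = 64 * 1024, seed: int = 8086) -> bytes:
    """Write `size` pseudo-random bytes to `path` and return them.

    A fixed seed makes the file reproducible, so a crash can be replayed.
    """
    rng = random.Random(seed)
    data = bytes(rng.randrange(256) for _ in range(size))
    with open(path, "wb") as f:
        f.write(data)
    return data
```

Calling `write_random_bytes("random64k.bin")` gives a 64k file to feed to each candidate. Surviving it proves nothing about correctness, only that the decoder does not fall over on garbage.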

Of all those, Michael Heyeck’s dasm3.py clearly is the choice for the inquisitive tinkerer:

  • dasm3.py comes with explanations of how the program works
  • the author wrote it, because he needed it (retro-gaming)
  • Python is fun and easy
  • did I mention that dasm3.py comes with explanations of how the program works?

Marius Gedminas’ disasm.asm is an impressive tour de force, but precisely for this reason not exactly a teaching aid if you want to find out what the domain is all about – the 8086 instruction set and opcodes. Reverse engineering an assembly language program in order to duplicate its function with a Common Lisp program is most certainly the scenic route here, but I preferred a less manly approach.

Marius Gedminas is also the author of disasm.pas, a smart disassembler that can mark unreachable code as data. disasm.pas follows jumps, calls and rets. With its 600 lines of TurboPascal code, it is almost as compact as Michael Heyeck’s 300 lines of Python in dasm3.py.
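To show what “following jumps, calls and rets” buys you, here is a toy Python sketch of the general technique (often called recursive traversal). It is not a port of disasm.pas: the function name `trace_code` is invented, the decoder knows only five genuine 8086 opcodes (NOP, RET, JMP short, JZ, CALL near), and anything else simply ends the path being traced.

```python
import struct

def trace_code(code: bytes, entry: int = 0) -> set[int]:
    """Return the offsets of all bytes reachable as code from `entry`."""
    reachable: set[int] = set()
    todo = [entry]                       # addresses still to be explored
    while todo:
        pos = todo.pop()
        while 0 <= pos < len(code) and pos not in reachable:
            op = code[pos]
            if op == 0x90:                                    # NOP
                length, targets, falls_through = 1, [], True
            elif op == 0xC3:                                  # RET ends the path
                length, targets, falls_through = 1, [], False
            elif op in (0xEB, 0x74) and pos + 1 < len(code):  # JMP short / JZ rel8
                rel = struct.unpack_from("<b", code, pos + 1)[0]
                target = (pos + 2 + rel) & 0xFFFF
                length, targets, falls_through = 2, [target], op == 0x74
            elif op == 0xE8 and pos + 2 < len(code):          # CALL rel16
                rel = struct.unpack_from("<h", code, pos + 1)[0]
                target = (pos + 3 + rel) & 0xFFFF
                length, targets, falls_through = 3, [target], True
            else:
                break                    # unknown to this toy decoder
            reachable.update(range(pos, pos + length))
            todo.extend(targets)
            if not falls_through:
                break
            pos += length
    return reachable

program = bytes([
    0xEB, 0x02,        # 0000 JMP short 0004
    0xDE, 0xAD,        # 0002 two bytes that are never executed
    0xE8, 0x01, 0x00,  # 0004 CALL 0008
    0xC3,              # 0007 RET
    0x90,              # 0008 NOP
    0xC3,              # 0009 RET
])
code_bytes = trace_code(program)
data_bytes = [i for i in range(len(program)) if i not in code_bytes]
print(data_bytes)
```

Whatever the trace never reaches (here the two bytes behind the first jump) can be dumped as data instead of being forced through the decoder, which is the trick a linear sweep cannot do.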

disasm.c consumes 1000 lines of code, but that’s about all I know about it. It seems to sort of work, but a segfault is a bad omen. There is no documentation whatsoever, but superficial browsing revealed that the author Kestutis Rutkauskas knew what he was doing in terms of C (despite an unusual coding convention).

So dasm3.py was an easy choice. For enlightenment I picked the most compact program, written in the softest, most dynamic language, the one that came with detailed instructions on how to understand that program. All this in the name of enlightenment, of course, to make stealing easy for the epic hobbyist. It is also the only disassembler on the pet project market that can process .EXE files.

Note that any disassembler, no matter how simple, has the Tunes nature. Disassembly is a very elementary form of reflection; you can’t go any deeper than that. A disassembler is the most basic (and hardest) form of recovering meaning from code. In Tunes, reflection makes the meaning of code accessible to compilers, programs and the run-time, which facilitates insane optimizations and water-tight security. However, not only compilers and programs can benefit from access to code meaning. Programmers, too, can work insanely fast when they understand the code, i.e. if they have a mental map of their work. Michael Heyeck’s blog posts give you the fast track to his world of Python disassembly:

[If you are interested in 8086 assembly, don't miss 8086 Opcode Redundancies I and II.]

The only fly in the ointment was the lack of a demonstration that dasm3.py works as advertised, so at first I didn’t know if it could serve as a reliable decoder foundation for my “own” homegrown 8086 emulator. Did dasm3.py work for all the edge cases that made my little head spin when I tinkered with my Common Lisp implementation?

As it turned out, dasm3.py worked almost perfectly out of the box. However, getting it into the box was quite an instructive pet project in its own right for a so-so programmer like me. It is a hairy problem, as I will illustrate in the next installments. [Tip of the hat to Michael Heyeck, for patient support.]

Does your pet project have the Tunes nature? Can you demonstrate that it works?

About transistorski

brain in a jar

3 Responses to Yet another 8086 disassembler pet project

  1. Incidentally, there are three i386 emulators that I know of: DosBox, QEMU and Valgrind.
