I recently wrote a small, but cryptic “Hello World” program in Python:

p = list(map(lambda _: ord(_) - 0x21, "!!&!!dzăC|"))
i = "DÒVV×0Ë×YVR "

while p[2] < len(p):
    if (p[p[2]] & 0xC0) >> 6 == 0:
        p[p[p[2]] & 0x07] = (p[p[2]] & 0x38) >> 3
    elif (p[p[2]] & 0xC0) >> 6 == 1:
        if (p[p[2]] & 0x38) >> 3 == 0:
            p[0] = ((((ord(i[0]) - 0x20) & 128) >> 7) | ((ord(i[0]) - 0x20) << 1)) & 0xFF
            i = i[1:]
        p[p[p[2]] & 0x07] = p[(p[p[2]] & 0x38) >> 3]
        if p[p[2]] & 0x07 == 1:
            print(chr(p[1]), end="")
    elif (p[p[2]] & 0xC0) >> 6 == 2:
        p[p[p[2]] & 0x07] = (p[p[2]] & 0x38) >> 3 + p[p[p[2]] & 0x07]
    else:
        p[p[p[2]] & 0x07] = p[(p[p[2]] & 0x38) >> 3] + p[p[p[2]] & 0x07]
    p[2] += 1
    p[4] = 1 if p[3] == 0 else 0

Since /r/shittyprogramming seemed to like it, I thought I’d post a write-up on how it (actually) works.

First, let’s deobfuscate a little:

data = [
    0,  # IO_IN
    0,  # IO_OUT
    5,  # PC
    0,  # ACC
    0,  # Z
    0x43,
    0x59,
    0xe2,
    0x22,
    0x5b
]

i = [ord(c) for c in "Hello World"] + [0]

while data[2] < len(data):
    b = data[data[2]]

    o = (b & 0xC0) >> 6
    x = (b & 0x38) >> 3
    y = b & 0x07

    if o == 0:
        data[y] = x
        if y == 1:
            print(chr(data[1]), end="")
    elif o == 1:
        if x == 0:
            if len(i) > 0:
                data[0] = i[0]
                i = i[1:]
        data[y] = data[x]
        if y == 1:
            print(chr(data[1]), end="")
    elif o == 2:
        data[y] = x + data[y]
    else:
        data[y] = data[x] + data[y]
    data[2] += 1
    data[4] = 1 if data[3] == 0 else 0

A lot easier to read now, though still not very easy to understand.

Essentially, we are ‘emulating’ a tiny custom CPU that has access to a very small amount of memory (the data variable). Even if you’re not that familiar with how CPUs and memory work, the design should still be simple enough to understand.

Memory

Our memory is laid out just like it’s defined in the code:

data = [
    0,  # IO_IN
    0,  # IO_OUT
    5,  # PC
    0,  # ACC
    0,  # Z
    0x43,
    0x59,
    0xe2,
    0x22,
    0x5b
]

You can think of each item in the list as holding a value at a specific address in memory. data[0] retrieves the value stored at memory address 0, for example. Each value in data is exactly one byte in size.

The first two addresses act as memory-mapped IO ports (I’ll explain those later on).

The third address stores what is called the program counter register, or PC for short. The PC holds the address of the next instruction to execute. It is initialized to 5 here, because data[5] holds the address of the first instruction, 0x43. More on instructions in a bit.

Next, we have at data[3] the accumulator register, or ACC for short. This is a general-purpose register, for storing whatever the programmer of this CPU wants. Lots of real CPUs have a register like this for storing the result of arithmetic operations. In our case, it’s just a free slot in memory that we can read and write to, with a useful side effect, covered in the next paragraph.

At data[4] lies the zero flag register, or Z for short. In almost all CPUs, there will be a flag register where certain bits in that register being 1 or 0 will signify something has happened in the CPU. For example, if the result of an addition is 0, then the bit representing the Z flag may be set to 1. This CPU is very minimal, so the Z flag is the only flag. It is automatically set to 1 whenever the value in ACC becomes 0, and is reset when it changes again to something not 0.

Only the first 5 addresses are ‘special’. Values in data[5] and onwards hold instructions for the actual program the CPU will be executing.

Instructions

Instructions are encoded in a special format. Every instruction is exactly one byte, which helps simplify implementing the CPU a bit.

The format of an instruction is as follows:

OOXXXYYY

Each letter represents one bit. The O bits represent the opcode, which determine which instruction the CPU is going to execute. The X and Y bits represent the parameters for the instruction. In this CPU, every instruction has exactly 2 parameters, which we will call simply x and y.

There are only two O bits, which means there can be at most 4 different instructions in this CPU. This is actually (more than) enough for our needs.

Here are the opcodes paired with their corresponding instructions:

  • 00: Store the value x at the address y
    • Mnemonic: mov x, [y]
  • 01: Store the contents of address x at address y
    • Mnemonic: mov [x], [y]
  • 10: Add x to the contents of address y, storing the result at address y
    • Mnemonic: add x, [y]
  • 11: Add the contents of address x to the contents of address y, storing the result at address y
    • Mnemonic: add [x], [y]

The mnemonics are just convenient names for the instructions, and should be familiar if you’ve written assembly before. They follow the format <op_name> <src>, <dst>, meaning that the result of the operation is stored at the address specified by the second argument. In some assembly syntaxes, <dst> and <src> are the other way around, i.e. <op_name> <dst>, <src>.

x (without brackets) means the literal value represented by x is used as the argument, and [x] means that the value stored at the address x is used as the argument. y is always surrounded by brackets since it always refers to an address that we end up storing the result into.

I actually didn’t end up using add x, [y] in the Hello World program, but I kept it anyway.

With that, let’s try understanding the first instruction, which is stored at data[5], 0x43.

Here is 0x43 in binary:

01 000 011

The first two bits signify that this is the mov [x], [y] instruction. In particular, x here is 0 and y is 3 (binary 011). Since they both refer to addresses, this means that we’re taking the value stored at address 0 and storing (i.e. copying) it into address 3. In case you forgot, address 0 is the I/O input port (IO_IN) and address 3 is the accumulator (ACC).

Reading from address 0 is handled differently by the CPU than reading other addresses. When the CPU sees that address 0 is read from, it will first write one byte of ‘input’ (we’ll see how this is implemented later) to address 0, then lets the operation proceed as normal.

So, ultimately this instruction will read one byte of input and store it into ACC.

The decoded program

I’ll save a bit of time and just write the mnemonics for all of the instructions in the program:

mov [0], [3]
mov [3], [1]
add [4], [2]
mov 5, [2]
mov [3], [3]

Replacing special addresses with their name:

mov IO_IN, ACC
mov ACC, IO_OUT
add Z, PC
mov 4, PC
mov ACC, ACC

The full program should be a little clearer now.

  1. Read a byte of input, store it into ACC
  2. Output the value of ACC
  3. Add the value of Z to PC
  4. Set the value of PC to 4
  5. Move the value of ACC into ACC

That 5th instruction may seem a little weird. It’s actually not necessary here due to how the CPU is implemented, but I decided to keep it as an example of what’s called a no-op instruction. It does nothing but waste a little bit of CPU time, and is often implemented as moving the value of a register into itself.

The 3rd instruction is a jump instruction in disguise. Jump instructions are instructions that change the flow of control, sometimes based on particular conditions (like certain flags). A jump changes the value of PC to something specific, so that the next instruction to be executed is not simply the one in the next memory address.

Rather than having a separate instruction for jumping to a specific address if the Z flag is flipped on, I simply add the value of Z register to PC using the already-available add instruction. If the ACC does not contain 0, then 0 is added and nothing changes (i.e. the next instruction is executed as normal). But if it’s 1, then the CPU will skip over the next instruction (since the PC is also automatically incremented by 1 after every instruction), thereby changing the operation.

The 4th instruction is what’s usually called an unconditional jump, i.e. changing the value of PC without checking for any conditions. Here, we store 4 into PC. Taking into account that PC is incremented after every instruction, this means that the value of PC will be 5 after this instruction is executed, and so the CPU will go execute the first instruction again.

If the value of Z was 1, the CPU will execute the 5th instruction instead of the 4th, and so the no-op will be executed. When our CPU reaches the end of memory (i.e. the value in PC is too big), the CPU stops executing. This is why I mentioned that the no-op wasn’t necessary, since PC would have a too-big value in it if 1 was added to it.

Implementing the CPU

That was a lot to go through! But now comes the (slightly) simpler part: implementing the CPU.

Here is the outline of what the CPU actually does:

  1. Read the byte stored at the address specified by PC
  2. Decode the byte into an instruction: the opcode o and the arguments x and y
  3. Handle each of the four possible instructions
  4. Increment the value in PC by 1
  5. Set Z to 1 if ACC is 0, and 0 otherwise

The relevant code:

while data[2] < len(data):
    b = data[data[2]]

    o = (b & 0xC0) >> 6
    x = (b & 0x38) >> 3
    y = b & 0x07

    if o == 0:
        data[y] = x
        if y == 1:
            print(chr(data[1]), end="")
    elif o == 1:
        if x == 0:
            if len(i) > 0:
                data[0] = i[0]
                i = i[1:]
        data[y] = data[x]
        if y == 1:
            print(chr(data[1]), end="")
    elif o == 2:
        data[y] = x + data[y]
    else:
        data[y] = data[x] + data[y]
    data[2] += 1
    data[4] = 1 if data[3] == 0 else 0

The first line just checks the stop condition we described earlier: does PC (found at data[2]) specify an address that is too high? If it’s not too high, then we proceed.


b = data[data[2]]

o = (b & 0xC0) >> 6
x = (b & 0x38) >> 3
y = b & 0x07

b is the byte stored at the address specified by PC. o, x, and y are extracted by masking and shifting bits. I won’t go into this part here, but essentially this is how you extract specific bits of a number.


Next comes a very simple if block, with four branches – one for each possible instruction.


if o == 0:
    data[y] = x
    if y == 1:
        print(chr(data[1]), end="")

Opcode 00 is simple: take x and store it at address y. We can also see special treatment for I/O: if the destination address was 1 (the IO_OUT address), then we additionally print the value of IO_OUT to the console.


elif o == 1:
    if x == 0:
        if len(i) > 0:
            data[0] = i[0]
            i = i[1:]
    data[y] = data[x]
    if y == 1:
        print(chr(data[1]), end="")

Opcode 01 is a little more complex, because we have to deal with input as well as output. If the source argument x is the IO_IN address (0), then we need to first store a byte of ‘input’ at data[0]:

if x == 0:
    if len(i) > 0:
        data[0] = i[0]
        i = i[1:]

The ‘input’ is actually just implemented as a list of the ASCII values of the string “Hello World” followed by a terminating 0, declared at the top of the code:

i = [ord(c) for c in "Hello World"] + [0]

This is also why the program checks that ACC is 0. If ACC is 0, that means we’ve read the whole string, and can terminate the program. As an aside, this is how strings are represented in a lot of languages, and this representation is usually called null-terminated.

After we store the first byte of the input in data[0], we remove the first byte of i by doing i = i[1:]. This could’ve also been implemented by keeping track of the current index into the input and incrementing it.

Also, there is a little check to make sure that there is still input left to write:

if len(i) > 0:

This actually isn’t necessary since I never read from IO_IN after reading all the input, but it might be a nice safety feature for a different application.

The actual storing is very straightforward:

data[y] = data[x]

The IO_OUT address is handled the same way as for opcode 00:

if y == 1:
    print(chr(data[1]), end="")


elif o == 2:
    data[y] = x + data[y]

This is the add x, [y] instruction, which is very simple to implement. As previously mentioned, it is also not actually used and so could’ve have been left unimplemented.


else:
    data[y] = data[x] + data[y]

Finally, this is add [x], [y]. It is also very simple, since we don’t care about I/O for add in our CPU.


Now, we simply increment PC and update the Z flag by checking if ACC is 0:

data[2] += 1
data[4] = 1 if data[3] == 0 else 0

With that, we’ve successfully implemented a tiny CPU.

Extra obfuscation

That’s actually it for the underlying logic, but it looked a bit too clean so I decided to make it a little harder to read.

The first step was really simple: use less variables. Rather than have separate variables for b, o, x, and y, I just copy-pasted their definitions anywhere they were used. I also renamed data to p. The result is somewhat intimidating:

while p[2] < len(p):
    if (p[p[2]] & 0xC0) >> 6 == 0:
        p[p[p[2]] & 0x07] = (p[p[2]] & 0x38) >> 3
    elif (p[p[2]] & 0xC0) >> 6 == 1:
        if (p[p[2]] & 0x38) >> 3 == 0:
            p[0] = ((((ord(i[0]) - 0x20) & 128) >> 7) | ((ord(i[0]) - 0x20) << 1)) & 0xFF
            i = i[1:]
        p[p[p[2]] & 0x07] = p[(p[p[2]] & 0x38) >> 3]
        if p[p[2]] & 0x07 == 1:
            print(chr(p[1]), end="")
    elif (p[p[2]] & 0xC0) >> 6 == 2:
        p[p[p[2]] & 0x07] = (p[p[2]] & 0x38) >> 3 + p[p[p[2]] & 0x07]
    else:
        p[p[p[2]] & 0x07] = p[(p[p[2]] & 0x38) >> 3] + p[p[p[2]] & 0x07]
    p[2] += 1
    p[4] = 1 if p[3] == 0 else 0

The second obfuscation is a little more involved: obfuscate the “Hello World” string.

I first performed a bitwise operation called right-rotating on each character. Right-rotating simply means that each bit gets shifted to the right (in this case by 1), but the right-most bit becomes the left-most bit. This differs from shifting, where the right-most bit would simply disappear and the left-most would just be 0. Rotating preserves all the bits, which means that I can perform the reverse operation, left-rotating, to get back the original value.

I then added 0x20 to each character (since the 0 at the end wasn’t a visually-representable ASCII code, and 0x20 is ASCII for the space character).

This (beautiful) line of code undoes those operations to get back the original character:

p[0] = ((((ord(i[0]) - 0x20) & 128) >> 7) | ((ord(i[0]) - 0x20) << 1)) & 0xFF

Finally, the memory was also obfuscated into a string. The obfuscation is very obvious here, namely adding 0x21 to each value (to make them visible ASCII characters that look a bit different to the obfuscated input).

p = list(map(lambda _: ord(_) - 0x21, "!!&!!dzăC|"))

Conclusions

This post ended up being a lot longer than I expected. Hopefully, it was informative.

Most of the concepts here would probably be covered in an introductory class on microprocessors or other low-level computer hardware. Real CPUs are a lot more complicated (usually) but, as shown here, it doesn’t actually take too much to design a functioning (though severely limited) one. Regardless, this is definitely a fun way to obfuscate simple programs and gain a better understanding of how a CPU might work.

As an aside, the implementation of the CPU is not quite perfect. Python doesn’t use 8-bit integers by default, so the code should really be masking the last 8 bits before storing anything to memory. You could say that the while loop should check if PC has a negative value as well, but with no sub instruction and no way to represent negative numbers, it is a bit difficult to get PC to be negative. But hey, it works well enough for “Hello World”.