8051

The 8051 is an 8-bit microcontroller core that turned 45 in the year this book was written, and it is still the architecture inside a remarkable amount of new silicon. Sub-GHz and 2.4 GHz radio SoCs (the Nordic nRF24LE1, built around a proprietary 2.4 GHz radio, not Bluetooth; Realtek's 8051-core BLE parts; the Telink TLSR series for BLE HID), USB controllers (the Cypress / Infineon FX2 family is built on an enhanced 8051; its successor, the FX3, moved to an ARM9 core), wireless mice and keyboards, legacy industrial controllers, and an unending river of clone microcontrollers from Asian fabs all use 8051 cores.

If you are reverse engineering 2.4 GHz HID dongles, BLE keyboards (on the Telink TLSR family), cheap RF dongles, or older industrial gear, you will meet 8051. This chapter covers the parts of the architecture that matter for r2 work, the loading recipes, and the gotchas that separate "working with an 8051 binary" from "working with anything else".

Architectural overview

The 8051 is genuinely strange by modern standards:

Harvard architecture. Code and data live in separate address spaces with separate instruction families. Code memory (CODE / PMEM) is read-only at runtime; data memory (IDATA / internal RAM, XDATA / external RAM, SFR / special function registers) is read-write.
Banked register set. R0..R7 are 8 registers, but there are 4 banks of them. The current bank is selected by bits in PSW. Switching banks is fast; debugging switches is annoying.
Internal data RAM. Classic 8051 has 128 bytes (0x00–0x7F). The 8052 and most modern derivatives add a second 128 bytes at 0x80–0xFF that is only reachable via indirect addressing (MOV @Ri); direct addressing of 0x80–0xFF reaches the SFR space instead. So MOV A, 0x90 reads SFR P1; MOV R0, #0x90; MOV A, @R0 reads upper-128 IRAM byte 0x90 on an 8052.
Bit-addressable region. A 16-byte block of IDATA (bytes 0x20–0x2F) is bit-addressable: each of the 128 bits has its own address in the bit-address space.
64 KiB code address space in classic 8051, expanded with banking on most modern derivatives. Different vendors implement banking differently (a bank-select SFR, port pins driving extra address lines, an XDATA-mapped page latch).
Big-endian encoding for instruction immediates. LJMP and LCALL encode their 16-bit target high-byte-first, and MOV DPTR, #imm16 encodes DPH before DPL. The CPU itself is byte-addressed and has no inherent endianness for data; multi-byte data layout (int, long in C) is compiler-defined. Keil C51 stores multi-byte values big-endian, the opposite of every common 32-bit embedded platform, while SDCC stores them little-endian, so identify the compiler before interpreting 16- and 32-bit variables.
No stack pointer relative addressing. All addressing is direct, indirect through R0/R1, indirect through DPTR, or PC-relative for jumps. No SP-relative loads.
A single accumulator (A) for arithmetic. Most operations go through it.
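The direct-versus-indirect rule for internal addresses is the one that trips people up most often, so here is a minimal sketch of it. The function name and the has_upper_iram parameter are made up for illustration; only the address ranges come from the text above.

```python
def resolve_internal(addr, indirect, has_upper_iram=True):
    """Return (space, addr) for an 8-bit internal memory access.

    indirect=True models MOV A, @Ri; indirect=False models MOV A, direct.
    has_upper_iram=False models a classic 8051 with only 128 bytes of IRAM.
    """
    if not 0 <= addr <= 0xFF:
        raise ValueError("internal addresses are 8 bits wide")
    if addr < 0x80:
        # Lower 128 bytes: the same IRAM byte whichever way it is addressed.
        return ("IRAM", addr)
    if indirect:
        if not has_upper_iram:
            raise ValueError("no IRAM above 0x7F on this part")
        # Upper 128 bytes of IRAM are reachable only indirectly.
        return ("IRAM", addr)
    # Direct access to 0x80-0xFF always lands in SFR space.
    return ("SFR", addr)


if __name__ == "__main__":
    for indirect in (False, True):
        space, addr = resolve_internal(0x90, indirect)
        mode = "indirect" if indirect else "direct"
        print(f"{mode:8s} 0x{addr:02X} -> {space}")
```

The same address 0x90 resolves to SFR P1 when direct and to an IRAM byte when indirect, which is why a flag placed at 0x90 needs to say which space it means.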

For r2:

The architecture name is 8051.
Bits is 8 for the canonical case, but the cpu setting (asm.cpu) often needs to name a derivative that knows the right SFR map.

Loading

A raw 8051 image:

text

$ r2 -a 8051 -b 8 -m 0x0 firmware.bin

If you have an Intel HEX file (most common 8051 distribution format):

text

$ r2 -a 8051 ihex://firmware.hex     # the ihex:// handler places data at the addresses in the file
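If you want to see what an Intel HEX file actually contains before handing it to any tool, the record format is simple enough to flatten by hand. The sketch below is a minimal reader that assumes only data (00) and end-of-file (01) records, which is all a plain 64 KiB 8051 image needs; the function name and the sample records are invented for illustration.

```python
def ihex_to_image(text, fill=0xFF):
    """Flatten Intel HEX text into a bytearray indexed by code address.

    Handles record types 00 (data) and 01 (end of file) only; anything
    else raises, so extended-address files fail loudly.
    """
    image = bytearray()
    for lineno, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line:
            continue
        if not line.startswith(":"):
            raise ValueError(f"line {lineno}: record does not start with ':'")
        raw = bytes.fromhex(line[1:])
        # All bytes of a record, checksum included, sum to 0 modulo 256.
        if sum(raw) & 0xFF != 0:
            raise ValueError(f"line {lineno}: bad checksum")
        count, addr, rtype = raw[0], (raw[1] << 8) | raw[2], raw[3]
        if len(raw) != count + 5:
            raise ValueError(f"line {lineno}: length field does not match")
        if rtype == 0x01:
            break
        if rtype != 0x00:
            raise ValueError(f"line {lineno}: record type {rtype:#04x} not handled")
        end = addr + count
        if end > len(image):
            image.extend([fill] * (end - len(image)))
        image[addr:end] = raw[4:4 + count]
    return image


if __name__ == "__main__":
    # Two made-up records: LJMP 0x0064 at 0x0000 and LJMP 0x0070 at 0x0003.
    sample = ":0300000002006497\n:0300030002007088\n:00000001FF\n"
    print(ihex_to_image(sample).hex())
```

Gaps between records are filled with 0xFF here because erased flash reads that way; that is a choice of this sketch, not something the format specifies.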

Many 8051 binaries are >64 KiB and use bank switching. R2's 8051 plugin has limited support for banks; when you exceed 64 KiB you typically have to load each bank as its own mapping and treat bank-switch calls as opaque jumps.

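Set the SFR map via asm.cpu. On the command line that is -e asm.cpu=<name>; -c runs an r2 command and does not select a CPU. Variant names differ between r2 builds, so treat the ones below as examples:

text

$ r2 -a 8051 -b 8 -e asm.cpu=8051 -m 0x0 firmware.bin     # generic 8051 SFRs
$ r2 -a 8051 -b 8 -e asm.cpu=8052 firmware.bin            # 8052 SFRs (extra timer 2)
$ r2 -a 8051 -b 8 -e asm.cpu=at89s51 firmware.bin         # Atmel-specific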

For chips with vendor-specific SFRs (Nordic nRF24LE1, Telink TLSR, Realtek RTL8762), check e asm.cpu=? for the variants your build actually provides. If your specific chip is not listed, pick the closest cousin and manually flag the vendor-specific SFRs.

SFRs and addressing

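Special function registers live at addresses 0x80–0xFF in the direct-addressing space, which is separate from the upper 128 bytes of IDATA reached indirectly at the same addresses. The classic 8051 SFR map: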

text

[0x...]> f sfr.P0     = 0x80
[0x...]> f sfr.SP     = 0x81
[0x...]> f sfr.DPL    = 0x82
[0x...]> f sfr.DPH    = 0x83
[0x...]> f sfr.PCON   = 0x87
[0x...]> f sfr.TCON   = 0x88
[0x...]> f sfr.TMOD   = 0x89
[0x...]> f sfr.TL0    = 0x8A
[0x...]> f sfr.TL1    = 0x8B
[0x...]> f sfr.TH0    = 0x8C
[0x...]> f sfr.TH1    = 0x8D
[0x...]> f sfr.P1     = 0x90
[0x...]> f sfr.SCON   = 0x98
[0x...]> f sfr.SBUF   = 0x99
[0x...]> f sfr.P2     = 0xA0
[0x...]> f sfr.IE     = 0xA8
[0x...]> f sfr.P3     = 0xB0
[0x...]> f sfr.IP     = 0xB8
[0x...]> f sfr.PSW    = 0xD0
[0x...]> f sfr.A      = 0xE0     # accumulator (ACC)
[0x...]> f sfr.B      = 0xF0

Vendor-specific extensions add radio control registers, timers, ADCs, USB endpoint registers, and so on in whatever slots the classic map leaves free within 0x80–0xFF. Get the TRM, transcribe the SFR table to flags, and reload.
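Transcribing a TRM table by hand into flag commands is error-prone, so it is worth generating the script. This is a minimal sketch that emits flag lines in the same form as the listing above; the function name, the sfr. prefix default, and the three-entry table are illustrative, and you would fill the dictionary from your chip's TRM.

```python
def sfr_flag_script(sfrs, prefix="sfr."):
    """Turn a {name: address} table into r2 flag commands, one per line."""
    lines = []
    for name, addr in sorted(sfrs.items(), key=lambda item: item[1]):
        if not 0x80 <= addr <= 0xFF:
            raise ValueError(f"{name}: 0x{addr:02X} is outside SFR space")
        lines.append(f"f {prefix}{name} = 0x{addr:02X}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # A few classic entries; extend with the vendor registers from the TRM.
    table = {"PSW": 0xD0, "P0": 0x80, "SP": 0x81}
    print(sfr_flag_script(table), end="")
```

Write the result to a file and keep it next to the firmware, so that reloading the binary does not mean retyping the map.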

Instruction set highlights

The 8051 has about 110 instructions. The ones you see most:

Mnemonic	Meaning
`MOV A, #imm`	move immediate to accumulator
`MOV A, Rn`	move from register
`MOV A, @Ri`	indirect through R0 or R1
`MOVX A, @DPTR`	read XDATA at DPTR (external memory)
`MOVC A, @A+DPTR`	read CODE at DPTR + A (table lookup)
`MOVC A, @A+PC`	read CODE relative to PC
`LCALL addr`	long call (16-bit address)
`ACALL addr`	absolute call (11-bit, within 2 KiB page)
`LJMP addr`	long jump
`AJMP addr`	absolute jump (11-bit)
`SJMP rel`	short jump (8-bit signed offset, -128 to +127 from the next instruction)
`JZ`, `JNZ`	jump if A is zero / non-zero
`CJNE`	compare and jump if not equal
`DJNZ`	decrement and jump if not zero
`RET`, `RETI`	return / return from interrupt
`SETB bit`	set bit
`CLR bit`	clear bit
`MOV C, bit`	move bit to carry
`JB`, `JNB`, `JBC`	jump if bit set / not set / set then clear

MOVC @A+DPTR is the lookup-table primitive; you will see it for every state machine transition table, every UTF-8 decode, every cosine table.
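When a disassembly looks suspicious, it helps to be able to recompute a branch target from the raw bytes. The sketch below covers the three encodings in the table (16-bit absolute, 11-bit in-page, 8-bit relative) for a handful of opcodes; the function names are made up, and the opcode coverage is deliberately incomplete.

```python
def signed8(byte):
    """Interpret a byte as a two's-complement offset."""
    return byte - 0x100 if byte & 0x80 else byte


def branch_target(pc, code):
    """Compute the target of the branch encoded in `code`, located at `pc`."""
    op = code[0]
    if op in (0x02, 0x12):                 # LJMP / LCALL: high byte first
        return (code[1] << 8) | code[2]
    if (op & 0x1F) in (0x01, 0x11):        # AJMP / ACALL: 11 bits in-page
        page = (pc + 2) & 0xF800           # page of the NEXT instruction
        return page | ((op >> 5) << 8) | code[1]
    if op == 0x80:                         # SJMP rel
        return (pc + 2 + signed8(code[1])) & 0xFFFF
    if op in (0x10, 0x20, 0x30):           # JBC / JB / JNB bit, rel
        return (pc + 3 + signed8(code[2])) & 0xFFFF
    raise ValueError(f"opcode 0x{op:02X} not handled by this sketch")


if __name__ == "__main__":
    cases = [
        (0x0000, bytes([0x02, 0x00, 0x64])),   # ljmp 0x0064
        (0x17FE, bytes([0x01, 0x00])),         # ajmp at the end of a 2 KiB page
        (0x0100, bytes([0x80, 0xFE])),         # sjmp to itself
    ]
    for pc, code in cases:
        print(f"0x{pc:04X}: {code.hex()} -> 0x{branch_target(pc, code):04X}")
```

The AJMP case is the one worth staring at: the page comes from the address of the following instruction, so an AJMP in the last two bytes of a 2 KiB page lands in the next page.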

A typical 8051 function

text

; void uart_send(unsigned char c)
0x0400:  C0 E0           push  ACC
0x0402:  90 12 34        mov   DPTR, #0x1234   ; status reg
0x0405:  E0              movx  A, @DPTR
0x0406:  30 E0 FC        jnb   ACC.0, 0x0405   ; wait for TX ready
0x0409:  90 12 35        mov   DPTR, #0x1235   ; data reg
0x040C:  EF              mov   A, R7           ; arg in R7
0x040D:  F0              movx  @DPTR, A
0x040E:  D0 E0           pop   ACC
0x0410:  22              ret

Reading this:

The function takes its argument in R7 (the Keil C51 convention for a first char argument; SDCC would pass it in DPL).
It polls a status register at XDATA 0x1234, looping while bit 0 is clear.
When bit 0 is set, it writes the character to the data register at 0x1235.
The push/pop of ACC preserves the caller's accumulator (compiler's choice).

R2 will produce roughly this output. The decompiler (r2ghidra) handles 8051 reasonably well for simple functions; complex ones tend to produce code that is correct but ugly because of the constant shuttling through A.

Calling conventions

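8051 calling convention is whatever the C compiler decided. For SDCC (the most common open-source toolchain):

First argument in DPL for a 1-byte value, DPL/DPH for 2 bytes, DPL/DPH/B for 3 bytes (generic pointers), and DPL/DPH/B/ACC for 4 bytes, low byte first.
Subsequent arguments in fixed memory locations, or on the stack for reentrant functions.
Return values in the same registers.
Reentrant functions (compiled with __reentrant, or under --stack-auto) use the stack for locals and for every argument after the first.

For Keil C51:

Up to three arguments in registers: the first in R7 (char), R6:R7 (int, high byte in R6), R4..R7 (long), or R1..R3 (generic pointer); the second and third use R5 / R4:R5 and R3 / R2:R3. The rest go in fixed memory.
Return values in R7, R6:R7, R4..R7, or R1..R3 by the same size rules.
Symbol names are decorated to reflect how parameters are passed.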

R2 does not always guess the convention right. Set per-function:

text

[0x...]> afc sdcc8051 @ sym.foo
[0x...]> afc keil8051 @ sym.bar

If the right cc name is not present in afcl, define one in ~/.config/radare2/cc.sdb (the calling-convention database).

XDATA, IDATA, CODE: telling them apart in disassembly

Memory accesses in 8051 are explicit about which space they touch:

MOV A, R0 — internal R-bank
MOV A, @R0 — IDATA at the address in R0
MOVX A, @DPTR — XDATA at DPTR
MOVX A, @R0 — XDATA at R0 (low byte; high byte is P2)
MOVC A, @A+DPTR — CODE at A+DPTR

When you see MOVX, you are looking at external memory (often peripheral registers or external SRAM). Build a table of XDATA peripheral addresses for your chip.
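Because the memory space is fixed by the opcode, a small lookup on the first byte of an instruction is enough to classify an access. The sketch below lists only the accumulator forms shown above; the table and function names are made up, and the function must be fed instruction start bytes, not arbitrary bytes from the image.

```python
# First opcode byte -> (memory space, mnemonic). Only the accumulator
# load/store forms from the list above are included.
SPACE_BY_OPCODE = {
    0xE6: ("IDATA", "mov a, @r0"),
    0xE7: ("IDATA", "mov a, @r1"),
    0xF6: ("IDATA", "mov @r0, a"),
    0xF7: ("IDATA", "mov @r1, a"),
    0xE0: ("XDATA", "movx a, @dptr"),
    0xF0: ("XDATA", "movx @dptr, a"),
    0xE2: ("XDATA", "movx a, @r0"),
    0xE3: ("XDATA", "movx a, @r1"),
    0xF2: ("XDATA", "movx @r0, a"),
    0xF3: ("XDATA", "movx @r1, a"),
    0x93: ("CODE", "movc a, @a+dptr"),
    0x83: ("CODE", "movc a, @a+pc"),
}


def memory_space(opcode):
    """Return the space an instruction touches, or None if not in the table."""
    entry = SPACE_BY_OPCODE.get(opcode)
    return entry[0] if entry else None


if __name__ == "__main__":
    for op in (0xE0, 0x93, 0xE6, 0x22):
        print(f"0x{op:02X} -> {memory_space(op)}")
```

Pairing this with the DPTR value loaded just before each MOVX is a quick way to start the XDATA peripheral table.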

8051 in Bluetooth chips: a worked recipe

Suppose you have firmware for an old Nordic nRF24LE1 keyboard. nRF24LE1 is an 8051 with a 2.4 GHz radio. Reverse engineering steps:

Get the chip's TRM (publicly available from Nordic).
Build an SFR map from the radio control registers (RFCON, RFDAT, RFCSEN) and the standard 8051 SFRs.
Load:

text

$ r2 -a 8051 -b 8 -e asm.cpu=8051 -m 0x0 keyboard.bin
[0x0]> . nrf24le1_sfr.r2    # apply the SFR flags
[0x0]> aaa

Look at the reset vector (address 0):

text

[0x0]> pd 1 @ 0x0; pd 1 @ 0x3; pd 1 @ 0xb
0x0000  LJMP 0x0064          ; reset vector
0x0003  LJMP 0x0070          ; external interrupt 0
0x000B  LJMP 0x0080          ; timer 0 interrupt
...

Define functions at each vector target.
Find the radio TX path:

text

[0x0]> /ad movx @dptr, a     # search for XDATA writes
[0x0]> # find one at the radio data register

Trace backwards from the radio access to the keyboard scan code handler.

This pattern — start at the SFR access of the most distinctive peripheral, trace back — is the standard 8051 RE workflow. The opposite (top-down from main) is harder because the call graph in 8051 firmware is often deep and the function names are gone.
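The vector-table step of the recipe is mechanical enough to script. In the classic layout the reset vector is at 0x0000 and interrupt n sits at 0x0003 + 8*n; the sketch below reads the LJMP at each slot and prints an af command for each target. It assumes every populated vector holds an LJMP (opcode 0x02), which is common but not guaranteed, and the function name and demo image are invented.

```python
def vector_targets(image, n_irqs=5):
    """Map each vector address to its LJMP target, or None if no LJMP there."""
    addrs = [0x0000] + [0x0003 + 8 * n for n in range(n_irqs)]
    targets = {}
    for addr in addrs:
        if addr + 3 <= len(image) and image[addr] == 0x02:
            # LJMP encodes its target high byte first.
            targets[addr] = (image[addr + 1] << 8) | image[addr + 2]
        else:
            targets[addr] = None
    return targets


if __name__ == "__main__":
    # Made-up image matching the listing above: reset, INT0 and timer 0.
    image = bytearray([0xFF] * 0x30)
    image[0x00:0x03] = bytes([0x02, 0x00, 0x64])
    image[0x03:0x06] = bytes([0x02, 0x00, 0x70])
    image[0x0B:0x0E] = bytes([0x02, 0x00, 0x80])
    for addr, target in vector_targets(image).items():
        if target is not None:
            print(f"af @ 0x{target:04x}    # from vector 0x{addr:04x}")
```

Vendor parts usually add more vectors past the classic five; raise n_irqs to whatever the TRM's interrupt table says.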

8051 gotchas

Bank switching for code >64 KiB. Some chips have a "code bank" register that selects which 32 KiB or 64 KiB region of flash is mapped into the upper half of code space. R2's 8051 plugin treats the code space as a flat 64 KiB. To analyse a multi-bank binary, load each bank separately.
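One way to load each bank separately is to build one 64 KiB view per bank and open each view on its own. The sketch below assumes a specific and hypothetical layout: the raw image is a 32 KiB common area followed by N switched 32 KiB banks, each of which appears in the upper half of code space. Check your chip's TRM before trusting that assumption; the function name is made up.

```python
def bank_views(image, window=0x8000, fill=b"\xff"):
    """Return one code-space view per bank: common area + that bank.

    Assumes `image` is the common area (first `window` bytes) followed by
    consecutive banks of `window` bytes each. A short last bank is padded.
    """
    common = bytes(image[:window]).ljust(window, fill)
    views = []
    for off in range(window, len(image), window):
        bank = bytes(image[off:off + window]).ljust(window, fill)
        views.append(common + bank)
    return views


if __name__ == "__main__":
    # Made-up 96 KiB image: common area of 0xAA, bank 0 of 0x01, bank 1 of 0x02.
    image = b"\xaa" * 0x8000 + b"\x01" * 0x8000 + b"\x02" * 0x8000
    for n, view in enumerate(bank_views(image)):
        print(f"bank {n}: {len(view):#x} bytes, upper half starts with "
              f"0x{view[0x8000]:02x}")
```

Each view can be written to its own file and loaded at 0x0, so addresses inside a bank match what the CPU sees while that bank is selected.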

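Bit addresses depend on their range. Bit addresses 0x00–0x7F map onto IRAM bytes 0x20–0x2F (bit address 0x0E is bit 6 of byte 0x21). Bit addresses 0x80–0xFF map onto the SFRs whose byte address ends in 0 or 8, so 0x86 is bit 6 of SFR 0x80 (P0.6). R2 shows the raw bit address; cross-reference with the SFR map.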
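Decoding a bit address into a byte and a bit number is a two-line calculation, sketched below. The function name is made up; the mapping is the standard one, with the lower half of the bit space covering IRAM bytes 0x20–0x2F and the upper half covering the bit-addressable SFRs.

```python
def decode_bit_address(bit_addr):
    """Return (space, byte_address, bit_number) for an 8051 bit address."""
    if not 0 <= bit_addr <= 0xFF:
        raise ValueError("bit addresses are 8 bits wide")
    bit = bit_addr & 0x07
    if bit_addr < 0x80:
        # 0x00-0x7F: eight bits per byte, starting at IRAM byte 0x20.
        return ("IRAM", 0x20 + (bit_addr >> 3), bit)
    # 0x80-0xFF: the byte is the bit address with its low three bits cleared.
    return ("SFR", bit_addr & 0xF8, bit)


if __name__ == "__main__":
    for bit_addr in (0x0E, 0x86, 0xE0):
        space, byte, bit = decode_bit_address(bit_addr)
        print(f"bit 0x{bit_addr:02X} -> {space} 0x{byte:02X}.{bit}")
```

So 0x86 is P0.6 and 0xE0 is ACC.0, which is why JNB with bit operand 0xE0 disassembles as a test of the accumulator's low bit.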

Compiler-generated stubs everywhere. SDCC and Keil emit helper routines for 16-bit arithmetic, soft-multiply, soft-divide, long-to-string, etc. They often live at fixed addresses and have no symbols. Build a per-compiler signature DB and apply it.
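A signature database can start as nothing more than byte patterns with wildcards for the parts that vary (call targets, addresses). The sketch below shows the matching step only. The pattern in the demo is invented and does not correspond to any real SDCC or Keil helper; real signatures have to be extracted from the runtime library of the compiler version you are facing.

```python
def find_signature(image, pattern):
    """Return every offset where `pattern` matches `image`.

    `pattern` is a sequence of ints (exact byte) or None (wildcard).
    """
    hits = []
    length = len(pattern)
    for off in range(len(image) - length + 1):
        if all(p is None or image[off + i] == p
               for i, p in enumerate(pattern)):
            hits.append(off)
    return hits


if __name__ == "__main__":
    # Hypothetical signature: lcall <any 16-bit target> followed by ret.
    pattern = [0x12, None, None, 0x22]
    image = bytes([0x00, 0x12, 0x04, 0x00, 0x22, 0xFF, 0x12, 0x10, 0x20, 0x22])
    for off in find_signature(image, pattern):
        print(f"match at 0x{off:04x}")
```

A linear scan like this is fine for 64 KiB images; anything cleverer is wasted effort at this size.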

No real stack frames. Local variables live in fixed memory locations the compiler allocated at compile time, not on the stack. The decompiler shows them as global accesses; you have to recognise that the global "g_var_4" used only inside one function is actually that function's local variable.

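ISRs often run on a different register bank. The hardware does not switch banks on interrupt entry; the compiler does it in the ISR prologue (Keil's using n, SDCC's __using(n)) by saving PSW and rewriting its bank-select bits. ISRs therefore commonly run with register bank 1 (or 2 or 3) selected while the main code runs with bank 0. A function that reads R0 may be reading bank 1's R0 if it is called from such an ISR. Check the bank-select bits (RS1:RS0, PSW bits 4 and 3) in context.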
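Since the register banks are just the first 32 bytes of IRAM, you can always translate "Rn under this PSW" into a concrete IRAM address. A minimal sketch, with a made-up function name:

```python
def register_address(psw, n):
    """IRAM address of register Rn given the current PSW value."""
    if not 0 <= n <= 7:
        raise ValueError("registers are R0..R7")
    bank = (psw >> 3) & 0x03      # RS1:RS0 are PSW bits 4 and 3
    return bank * 8 + n


if __name__ == "__main__":
    for psw in (0x00, 0x08, 0x10, 0x18):
        print(f"PSW=0x{psw:02X}: R0 is IRAM 0x{register_address(psw, 0):02X}, "
              f"R7 is IRAM 0x{register_address(psw, 7):02X}")
```

This is also why a direct write to IRAM 0x08 in main code can be a deliberate way of passing a value to an ISR that runs on bank 1.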

8051 reverse engineering is slow. There is no shortcut; the architecture is unfriendly, the toolchain conventions are inconsistent across vendors, and the firmware is usually written by people working at the limits of what the chip can do, which means clever, dense, and weakly commented even in source. r2 is a fine tool for it; the bottleneck is the architecture, not the tooling.
