Reverse Engineering

https://ctf101.org/reverse-engineering/overview/

Reverse Engineering in a CTF is typically the process of taking a compiled (machine code, bytecode) program and converting it back into a more human-readable format.

Very often the goal of a reverse engineering challenge is to understand the functionality of a given program such that you can identify deeper issues.

Assembly / Machine Code
The C Programming Language
Disassemblers
Decompilers

Assembly/Machine Code

Machine Code or Assembly is code that has been formatted for direct execution by a CPU. Machine Code is why readable programming languages like C, when compiled, cannot be reversed into source code (well Decompilers can sort of, but more on that later).

From Source to Compilation

Godbolt shows the differences in machine code generated by various compilers.

For example, if we have a simple C++ function:

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    char c;
    int fd = syscall(2, "/etc/passwd", 0);
    while (syscall(0, fd, &c, 1)) {
        putchar(c);
    }
}

We can see the compilation results in some verbose instructions for the CPU:

.LC0:
  .string "/etc/passwd"
main:
  push rbp
  mov rbp, rsp
  sub rsp, 16
  mov edx, 0
  mov esi, OFFSET FLAT:.LC0
  mov edi, 2
  mov eax, 0
  call syscall
  mov DWORD PTR [rbp-4], eax
.L3:
  lea rdx, [rbp-5]
  mov eax, DWORD PTR [rbp-4]
  mov ecx, 1
  mov esi, eax
  mov edi, 0
  mov eax, 0
  call syscall
  test rax, rax
  setne al
  test al, al
  je .L2
  movzx eax, BYTE PTR [rbp-5]
  movsx eax, al
  mov edi, eax
  call putchar
  jmp .L3
.L2:
  mov eax, 0
  leave
  ret

This is a one-way process for compiled languages as there is no way to generate sources from machine code. While the machine code may seem unintelligible, the extremely basic functions can be interpreted with some practice.

x86-64

x86-64 or amd64 or i64 is a 64-bit Complex Instruction Set Computing (CISC) architecture. This basically means that the registers used for this architecture extend an extra 32 bits on Intel's x86 architecture. CISC means that a single instruction can do a bunch of different things at once such as memory accesses, register reads, etc. It is also a variable-length instruction set which means different instructions can be of diferent sizes ranging from 1 to 16 bytes long. And finally, x86-64 allows for multi-sized register access which means that you can access certain parts of a register that are different sizes.

x86-64 Registers

x86-64 registers behave similarly to other architectures. A key component of x86-64 registers is multi-sized access, meaning the register RAX can have its lower 32-bits accessed with EAX. The next lower 16 bits can be accessed with AX and the lowest 8 bits can be accessed with AL, allowing the computer to make optimizations that boost program execution. Multi-access Register

x86-64 has plenty of registers, including rax, rbx, rcx, rdx, rdi, rsi, rsp, rip, r8-r15, and more! But some registers serve special purposes.

The special registers include: - RIP: the instruction pointer - RSP: the stack pointer - RBP: the base pointer

Instructions

An instruction represents a single operation for the CPU to perform.

There are different types of instructions including:

Data movement: mov rax, [rsp - 0x40]
Arithmetic: add rbx, rcx
Control-flow: jne 0x8000400

Because x86-64 is a CISC architecture, instructions can be quite complex for machine code such as repne scasb which repeats up to ECX times over memory at EDI looking for NULL byte (0x00), decrementing ECX each byte (Essentially strlen() in a single instruction!)

It is important to remember that an instruction really is just memory, this idea will become useful with Return Oriented Programming or ROP.

Instructions, numbers, strings, everything! Always represented in hex.

add rax, rbx
mov rax, 0xdeadbeef
mov rax, [0xdeadbeef] == 67 48 8b 05 ef be ad de
"Hello" == 48 65 6c 6c 6f
== 48 01 d8
== 48 c7 c0 ef be ad de

Execution

What should the CPU execute? This is determined by the RIP register where IP means instruction pointer. Execution follows the pattern: fetch the instruction at the address in RIP, decode it, and run it.

Examples

mov rax, 0xdeadbeef

Here the operation mov is moving the "immediate" 0xdeadbeef into the register RAX

mov rax, [0xdeadbeef + rbx * 4]

Here the operation mov is moving the data at the address of [0xdeadbeef + RBX*4] into the register RAX. When brackets are used, you can think of the program as getting the content from that effective address.

Example Execution

-> 0x0804000: mov eax, 0xdeadbeef            Register Values:
   0x0804005: mov ebx, 0x1234                RIP = 0x0804000
   0x080400a: add, rax, rbx                  RAX = 0x0
   0x080400d: inc rbx                        RBX = 0x0
   0x0804010: sub rax, rbx                   RCX = 0x0
   0x0804013: mov rcx, rax                   RDX = 0x0
   0x0804000: mov eax, 0xdeadbeef            Register Values:
-> 0x0804005: mov ebx, 0x1234                RIP = 0x0804005
   0x080400a: add, rax, rbx                  RAX = 0xdeadbeef
   0x080400d: inc rbx                        RBX = 0x0
   0x0804010: sub rax, rbx                   RCX = 0x0
   0x0804013: mov rcx, rax                   RDX = 0x0
   0x0804000: mov eax, 0xdeadbeef            Register Values:
   0x0804005: mov ebx, 0x1234                RIP = 0x080400a
-> 0x080400a: add, rax, rbx                  RAX = 0xdeadbeef
   0x080400d: inc rbx                        RBX = 0x1234
   0x0804010: sub rax, rbx                   RCX = 0x0
   0x0804013: mov rcx, rax                   RDX = 0x0
   0x0804000: mov eax, 0xdeadbeef            Register Values:
   0x0804005: mov ebx, 0x1234                RIP = 0x080400d
   0x080400a: add, rax, rbx                  RAX = 0xdeadd123
-> 0x080400d: inc rbx                        RBX = 0x1234
   0x0804010: sub rax, rbx                   RCX = 0x0
   0x0804013: mov rcx, rax                   RDX = 0x0
   0x0804000: mov eax, 0xdeadbeef            Register Values:
   0x0804005: mov ebx, 0x1234                RIP = 0x0804010
   0x080400a: add, rax, rbx                  RAX = 0xdeadd123
   0x080400d: inc rbx                        RBX = 0x1235
-> 0x0804010: sub rax, rbx                   RCX = 0x0
   0x0804013: mov rcx, rax                   RDX = 0x0
   0x0804000: mov eax, 0xdeadbeef            Register Values:
   0x0804005: mov ebx, 0x1234                RIP = 0x0804013
   0x080400a: add, rax, rbx                  RAX = 0xdeadbeee
   0x080400d: inc rbx                        RBX = 0x1235
   0x0804010: sub rax, rbx                   RCX = 0x0
-> 0x0804013: mov rcx, rax                   RDX = 0x0
   0x0804000: mov eax, 0xdeadbeef            Register Values:
   0x0804005: mov ebx, 0x1234                RIP = 0x0804005
   0x080400a: add, rax, rbx                  RAX = 0xdeadbeee
   0x080400d: inc rbx                        RBX = 0x1235
   0x0804010: sub rax, rbx                   RCX = 0xdeadbeee
   0x0804013: mov rcx, rax                   RDX = 0x0

Control Flow

How can we express conditionals in x86-64? We use conditional jumps such as:

jnz <address>
je <address>
jge <address>
jle <address>
etc.

They jump if their condition is true and just go to the next instruction otherwise. These conditionals are checking EFLAGS which are special registers that store flags on certain instructions such as add rax, rbx which sets the o (overflow) flag if the sum is greater than a 64-bit register can hold, and wraps around. You can jump based on that with a jo instruction. The most important thing to remember is the cmp instruction:

cmp rax, rbx
jle error

This assembly jumps if RAX <= RBX

Addresses

Memory acts similarly to a big array where the indices of this "array" are memory addresses. Remember from earlier:

mov rax, [0xdeadbeef]

The square brackets mean "get the data at this address". This is analogous to the C/C++ syntax: rax = *0xdeadbeef;

Disassemblers

A disassembler is a tool that breaks down a compiled program into machine code.

List of Disassemblers

IDA
Binary Ninja
GNU Debugger (GDB)
radare2
Hopper

IDA

The Interactive Disassembler (IDA) is the industry standard for binary disassembly. IDA is capable of disassembling "virtually any popular file format". This makes it very useful to security researchers and CTF players who often need to analyze obscure files without knowing what they are or where they came from. IDA also features the industry-leading Hex-Rays decompiler which can convert assembly code back into a pseudo code-like format.

IDA

IDA also has a plugin interface which has been used to create some successful plugins that can make reverse engineering easier:

https://github.com/google/binnavi
https://github.com/yegord/snowman
https://github.com/gaasedelen/lighthouse
https://github.com/joxeankoret/diaphora
https://github.com/REhints/HexRaysCodeXplorer
https://github.com/osirislab/Fentanyl

Binary Ninja

Binary Ninja is an up-and-coming disassembler that attempts to bring a new, more programmatic approach to reverse engineering. Binary Ninja brings an improved plugin API and modern features to reverse engineering. While it's less popular or as old as IDA, Binary Ninja (often called binja) is quickly gaining ground and has a small community of dedicated users and followers.

Binja

Binja also has some community-contributed plugins which are collected here: https://github.com/Vector35/community-plugins

gdb

The GNU Debugger is a free and open-source debugger that also disassembles programs. It's capable as a disassembler, but most notably it is used by CTF players for its debugging and dynamic analysis capabilities.

gdb is often used in tandem with enhancement scripts like peda, pwndbg, and GEF

GDB

The GNU Debugger (GDB)

The GNU Debugger or GDB is a powerful debugger that allows for the step-by-step execution of a program. It can be used to trace program execution and is an important part of any reverse engineering toolkit.

Vanilla GDB

GDB without any modifications is unintuitive and obscures a lot of useful information. The plug-in pwndb solves a lot of these problems and makes for a much more pleasant experience. But if you are constrained and have to use vanilla gdb, here are several things to make your life easier.

Starting GDB

To execute GBD and attach it to a program simply run gdb [program]

Disassembly

(gdb) disassemble [address/symbol] will display the disassembly for that function/frame

GDB will autocomplete functions, so saying (gdb) disas main suffices if you'd like to see the disassembly of the main

View Disassembly During Execution

Another handy thing to see while stepping through a program is the disassembly of nearby instructions:

(gdb) display/[# of instructions]i $pc [± offset]

display shows data with each step
/[#]i shows how much data in the format i for instruction
$pc means the pc, program counter, register
[± offset] allows you to specify how you would like the data offset from the current instruction

Example Usage

(gdb) display/10i $pc - 0x5

This command will show 10 instructions on screen with an offset from the next instruction of 5, giving us this display:

   0x8048535 <main+6>:  lock pushl -0x4(%ecx)
   0x8048539 <main+10>: push   %ebp
=> 0x804853a <main+11>: mov    %esp,%ebp
   0x804853c <main+13>: push   %ecx
   0x804853d <main+14>: sub    $0x14,%esp
   0x8048540 <main+17>: sub    $0xc,%esp
   0x8048543 <main+20>: push   $0x400
   0x8048548 <main+25>: call   0x80483a0 <malloc@plt>
   0x804854d <main+30>: add    $0x10,%esp
   0x8048550 <main+33>: sub    $0xc,%esp

Deleting Views

If for whatever reason, a view no long suits your needs simply call (gdb) info display which will give you a list of active displays:

Auto-display expressions now in effect:
Num Enb Expression
1:   y  /10bi $pc-0x5

Then simply execute (gdb) delete display 1 and your execution will resume without the display.

Registers

In order to view the state of registers with vanilla gdb, you need to run the command info registers which will display the state of all the registers:

eax            0xf77a6ddc   -142971428
ecx            0xffe06b10   -2069744
edx            0xffe06b34   -2069708
ebx            0x0  0
esp            0xffe06af8   0xffe06af8
ebp            0x0  0x0
esi            0xf77a5000   -142979072
edi            0xf77a5000   -142979072
eip            0x804853a    0x804853a <main+11>
eflags         0x286    [ PF SF IF ]
cs             0x23 35
ss             0x2b 43
ds             0x2b 43
es             0x2b 43
fs             0x0  0
gs             0x63 99

If you simply would like to see the contents of a single register, the notation x/x $[register] where:

x/x means to display the address in hex notation
$[register] is the register code such as eax, rax, etc.

Pwndbg

These commands work with vanilla gdb as well.

Setting Breakpoints

Setting breakpoints in GDB uses the format b*[Address/Symbol]

Example Usage

(gdb) b*main: Break at the start
(gdb) b*0x804854d: Break at 0x804854d
(gdb) b*0x804854d-0x100: Break at 0x804844d

Deleting Breakpoints

As before, in order to delete a view, you can list the available breakpoints using (gdb) info breakpoints (don't forget about GDB's autocomplete, you don't always need to type out every command!) which will display all breakpoints:

Num     Type           Disp Enb Address    What
1       breakpoint     keep y   0x0804852f <main>
3       breakpoint     keep y   0x0804864d <__libc_csu_init+61>

Then simply execute (gdb) delete 1

Note

GDB creates breakpoints chronologically and does NOT reuse numbers.

Stepping

What good is a debugger if you can't control where you are going? In order to begin the execution of a program, use the command r [arguments] similar to how if you ran it with dot-slash notation you would execute it ./program [arguments]. In this case, the program will run normally and if no breakpoints are set, you will execute normally. If you have breakpoints set, you will stop at that instruction.

(gdb) continue [# of breakpoints]: Resumes the execution of the program until it finishes or until another breakpoint is hit (shorthand c)
(gdb) step[# of instructions]: Steps into an instruction the specified number of times, default is 1 (shorthand s)
(gdb) next instruction [# of instructions]: Steps over an instruction meaning it will not delve into called functions (shorthand ni)
(gdb) finish: Finishes a function and breaks after it gets returned (shorthand fin)

Examining

Examining data in GDB is also very useful for seeing how the program is affecting data. The notation may seem complex at first, but it is flexible and provides powerful functionality.

(gdb) x/[#][size][format] [Address/Symbol/Register][± offset]

x/ means examine
[#] means how much
[size] means what size the data should be such as a word w (2 bytes), double word d (4 bytes), or giant word g (8 bytes)
[format] means how the data should be interpreted such as an instruction i, a string s, hex bytes x
[Address/Symbol][± offset] means where to start interpreting the data

Example Usage

(gdb) x/x $rax: Displays the content of the register RAX as hex bytes
(gdb) x/i 0xdeadbeef: Displays the instruction at address 0xdeadbeef
(gdb) x/10s 0x893e10: Displays 10 strings at the address
(gdb) x/10gx 0x7fe10: Displays 10 giant words as hex at the address

Forking

If the program happens to be an accept-and-fork server, gdb will have issues following the child or parent processes. In order to specify how you want gdb to function you can use the command set follow-fork-mode [on/off]

Setting Data

If you would like to set data at any point, it is possible using the command set [Address/Register]=[Hex Data]

Example Usage

set $rax=0x0: Sets the register rax to 0
set 0x1e4a70=0x123: Sets the data at 0x1e4a70 to 0x123

Process Mapping

A handy way to find the process's mapped address spaces is to use info proc map:

Mapped address spaces:

    Start Addr   End Addr       Size     Offset objfile
     0x8048000  0x8049000     0x1000        0x0 /directory/program
     0x8049000  0x804a000     0x1000        0x0 /directory/program
     0x804a000  0x804b000     0x1000     0x1000 /directory/program
    0xf75cb000 0xf75cc000     0x1000        0x0
    0xf75cc000 0xf7779000   0x1ad000        0x0 /lib32/libc-2.23.so
    0xf7779000 0xf777b000     0x2000   0x1ac000 /lib32/libc-2.23.so
    0xf777b000 0xf777c000     0x1000   0x1ae000 /lib32/libc-2.23.so
    0xf777c000 0xf7780000     0x4000        0x0
    0xf778b000 0xf778d000     0x2000        0x0 [vvar]
    0xf778d000 0xf778f000     0x2000        0x0 [vdso]
    0xf778f000 0xf77b1000    0x22000        0x0 /lib32/ld-2.23.so
    0xf77b1000 0xf77b2000     0x1000        0x0
    0xf77b2000 0xf77b3000     0x1000    0x22000 /lib32/ld-2.23.so
    0xf77b3000 0xf77b4000     0x1000    0x23000 /lib32/ld-2.23.so
    0xffc59000 0xffc7a000    0x21000        0x0 [stack]

This will show you where the stack, heap (if there is one), and libc are located.

Attaching Processes

Another useful feature of GDB is to attach to processes that are already running. Simply launch gdb using gdb, then find the process id of the program you would like to attach to an execute attach [pid].