ARM-32 Course 1
by Kevin Thomas
Part 1 – The Meaning Of Life
“So if I go to college and learn Python or Java will I make a million dollars and have nice things?”
I felt it necessary to start out this tutorial series with such a statement. This is NOT an attack on Python or Java: in a prior life I worked with Java, primarily in Android development, and I currently use Python in my professional environment. In today’s Agile environment, rapid development is the reality. With the increased challenges in both the commercial market and the government sector, software development will continue to focus on more robust libraries that do more with less.
As a Senior Software Engineer in Test, I try to help as many people as possible bridge their skill set to either an entry point into the job market or career advancement. One thing that is critical to understand is that there is, and will continue to be, a dramatic shortage of engineers and developers of all shapes and sizes.
Like it or not, hardware is getting smaller and smaller and the trend is moving from CISC to RISC. A CISC (complex instruction set computer) machine is your typical x86 computer with a complex series of instructions. CISC computers will always exist; however, with the trend going toward cloud computing and the fact that RISC (reduced instruction set computer) machines are so enormously powerful today, RISC machines are the obvious choice for consumer devices.
How many cell phones do you think exist on earth today? Most of them are RISC machines. How many of you have a Smart TV or Amazon Echo or any number of devices considered part of the IoT or Internet Of Things? Each of these devices has one thing in common – they are RISC machines and they are primarily ARM based.
ARM is an advanced RISC machine. Compared to the very complex architecture of a CISC machine, most ARM systems today are what is referred to as an SoC, or system on chip: an integrated circuit that has all of the components of a computer and electronic system on a single chip. This includes RF functionality as well. These low-power embedded devices can run versions of Windows, Linux and many other advanced operating systems.
“Well who cares about ARM, you can call it anything you want, I know Python or Java and that’s all I need to know cause when I program it works everywhere so I don’t have to worry about anything under the hood.”
I again just want you to reflect on the above statement for a brief moment. As every day continues to pass, more and more systems are becoming vulnerable to attack and compromise. Taking the time to understand what is going on under the hood can only help to curb this unfortunate reality.
This series will focus on ARM Assembly. We will work with a Raspberry Pi 3, which contains the Broadcom BCM2837 SoC with a quad-core ARM Cortex-A53 CPU at 1.2GHz and 1 GB of LPDDR2 RAM. We will work with Raspbian Jessie, a Linux-based operating system. If you don’t own a Raspberry Pi 3, they are usually available for $35 on Amazon or any number of retailers. If you would like to learn more visit https://www.raspberrypi.org.
We will work solely in the terminal, so no pretty pictures and graphics, as we are keeping it hardcore and bare-bones, utilizing the GNU toolkit to assemble, link and debug our code base.
UNDER NO CONDITIONS ARE YOU TO EVER USE THIS EDUCATION TO CAUSE HARM TO ANY SYSTEM OF ANY KIND AS I AM NOT RESPONSIBLE! THIS IS FOR LEARNING PURPOSES ONLY!
Part 2 – Number Systems
At the core of the microprocessor is a series of binary digits, each held as a voltage level that is either high (on or 1, for example +5V) or low (off or 0, 0V). Each 0 or 1 represents a bit of information within the microprocessor. A combination of 8 bits results in a single byte.
Before we dive into binary, let’s examine the familiar decimal system. If we take the number 2017, we understand this to be two thousand and seventeen.
Value 1000s 100s 10s 1s
Representation 10^3 10^2 10^1 10^0
Digit 2 0 1 7
Let’s take a look at the binary system and the basics of how it operates.
Bit Number b7 b6 b5 b4 b3 b2 b1 b0
Representation 2^7 2^6 2^5 2^4 2^3 2^2 2^1 2^0
Decimal Weight 128 64 32 16 8 4 2 1
To convert a binary number into decimal, we simply add up the weights of the bits that are set to 1. Let’s take the binary number 0101 1101; as you can see, it is 93 decimal.
Bit Weight Value
0 128 0
1 64 64
0 32 0
1 16 16
1 8 8
1 4 4
0 2 0
1 1 1
Adding the values in the value column gives us 0 + 64 + 0 + 16 + 8 + 4 + 0 + 1 = 93 decimal.
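The weight table above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the ARM toolchain; the function name binary_to_decimal is made up for this example.

```python
# Decimal weight of each bit position in a byte, b7 down to b0.
WEIGHTS = [128, 64, 32, 16, 8, 4, 2, 1]

def binary_to_decimal(bits: str) -> int:
    """Add up the weights of every position that holds a 1."""
    digits = bits.replace(" ", "")
    if len(digits) != 8 or set(digits) - {"0", "1"}:
        raise ValueError("expected exactly 8 binary digits")
    return sum(weight for bit, weight in zip(digits, WEIGHTS) if bit == "1")

print(binary_to_decimal("0101 1101"))  # 93
```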
To convert a decimal number into binary, we start at the highest-order bit and check whether its weight can be subtracted from the number. If so, a 1 is placed in that bit’s column and the remainder is carried into the next column; if not, a 0 is placed there. Let’s consider the example of the decimal value 120, which is 0111 1000 binary.
128 64 32 16 8 4 2 1
0 1 1 1 1 0 0 0
1) Can 128 fit inside of 120: No, therefore 0.
2) Can 64 fit inside of 120: Yes, therefore 1, then 120 – 64 = 56.
3) Can 32 fit inside of 56: Yes, therefore 1, then 56 – 32 = 24.
4) Can 16 fit inside of 24: Yes, therefore 1, then 24 – 16 = 8.
5) Can 8 fit inside of 8: Yes, therefore 1, then 8 – 8 = 0.
6) Can 4 fit inside of 0: No, therefore 0.
7) Can 2 fit inside of 0: No, therefore 0.
8) Can 1 fit inside of 0: No, therefore 0.
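The eight steps above follow the same pattern each time, so they can be written as a loop. Here is a minimal Python sketch of that subtraction method, assuming a value that fits in one byte; the name decimal_to_binary is hypothetical.

```python
def decimal_to_binary(value: int) -> str:
    """Convert 0..255 to 8 binary digits by subtracting bit weights."""
    if not 0 <= value <= 255:
        raise ValueError("value must fit in one byte")
    bits = []
    remainder = value
    for weight in (128, 64, 32, 16, 8, 4, 2, 1):
        if weight <= remainder:      # the weight "fits", so this bit is 1
            bits.append("1")
            remainder -= weight
        else:                        # the weight does not fit, so this bit is 0
            bits.append("0")
    digits = "".join(bits)
    return digits[:4] + " " + digits[4:]   # show the byte as two nibbles

print(decimal_to_binary(120))  # 0111 1000
```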
When we want to convert binary to hex we simply work with the following table.
Decimal Hex Binary
0 0 0000
1 1 0001
2 2 0010
3 3 0011
4 4 0100
5 5 0101
6 6 0110
7 7 0111
8 8 1000
9 9 1001
10 A 1010
11 B 1011
12 C 1100
13 D 1101
14 E 1110
15 F 1111
Let’s convert a binary number such as 0101 1111 to hex. To do this we simply look at the table and look up each nibble, which is a combination of 4 bits. Keep in mind, 8 bits are equal to a byte and 2 nibbles are equal to a byte.
0101 = 5
1111 = F
Therefore 0101 1111 binary = 0x5f hex. The 0x notation denotes hex.
Going from hex to binary is just as simple, as you only have to do the opposite:
0x3a = 0011 1010
3 = 0011
A = 1010
It is important to understand that each hex digit is a nibble in length therefore two hex digits are a byte in length.
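Because each hex digit maps to exactly one nibble, both directions are a table lookup. The following Python sketch shows the idea; the helper names binary_to_hex and hex_to_binary are invented for this example.

```python
HEX_DIGITS = "0123456789abcdef"   # position in this string = value of the digit

def binary_to_hex(bits: str) -> str:
    """Translate each 4-bit nibble into one hex digit."""
    digits = bits.replace(" ", "")
    if len(digits) % 4 != 0:
        raise ValueError("expected whole nibbles")
    nibbles = [digits[i:i + 4] for i in range(0, len(digits), 4)]
    return "0x" + "".join(HEX_DIGITS[int(nibble, 2)] for nibble in nibbles)

def hex_to_binary(text: str) -> str:
    """Translate each hex digit into its 4-bit nibble."""
    body = text.lower()
    if body.startswith("0x"):
        body = body[2:]
    return " ".join(format(HEX_DIGITS.index(ch), "04b") for ch in body)

print(binary_to_hex("0101 1111"))  # 0x5f
print(hex_to_binary("0x3a"))       # 0011 1010
```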
To convert from hex to decimal we do the following:
0x5f = 95
5 = 5 x 16^1 = 5 x 16 = 80
F = 15 x 16^0 = 15 x 1 = 15
Therefore we can see that 80 + 15 = 95 which is 0x5f hex.
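The same positional calculation works for any number of hex digits: each digit is multiplied by 16 raised to its position, counting from the right. A small Python sketch, with hex_to_decimal as a made-up name:

```python
def hex_to_decimal(text: str) -> int:
    """Multiply each hex digit by 16 raised to its position and add them up."""
    body = text.lower()
    if body.startswith("0x"):
        body = body[2:]
    total = 0
    for position, ch in enumerate(reversed(body)):   # position 0 is the rightmost digit
        total += "0123456789abcdef".index(ch) * 16 ** position
    return total

print(hex_to_decimal("0x5f"))  # 95
```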
Finally, let’s convert from decimal to hex. Take the number 850 decimal, which is 352 hex. We divide by 16 repeatedly; the fractional part of each result multiplied by 16 gives one hex digit.
Division    Whole Result    Fractional Part    Fractional Part x 16
850 / 16    53              0.125              0.125 x 16 = 2
53 / 16     3               0.3125             0.3125 x 16 = 5
3 / 16      0               0.1875             0.1875 x 16 = 3
We put the numbers together from bottom to the top and we get 352 hex.
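Here is a hedged Python sketch of the repeated division. Note one swap: instead of multiplying the fractional part by 16, it takes the integer remainder directly with divmod, which gives the same digit without any floating point. The name decimal_to_hex is hypothetical.

```python
def decimal_to_hex(value: int) -> str:
    """Divide by 16 until nothing is left, collecting the remainders."""
    if value < 0:
        raise ValueError("expected a non-negative number")
    if value == 0:
        return "0"
    digits = []
    while value > 0:
        value, remainder = divmod(value, 16)     # 850 -> (53, 2), 53 -> (3, 5), 3 -> (0, 3)
        digits.append("0123456789abcdef"[remainder])
    return "".join(reversed(digits))             # read the remainders bottom to top

print(decimal_to_hex(850))  # 352
```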
“Why the hell would I waste my time learning all this crap when the computer does all this for me!”
If you happen to know any reverse engineers please if you would take a moment and ask them the above question.
The reality is, if you do NOT have a very firm understanding of how all of the above works, you will NEVER get a grasp on how the ARM processor registers hold and manipulate data. You will NEVER get a grasp on how the ARM processor deals with a binary overflow and its effect on how carry operations work, nor will you understand how compare operations work or even the most basic operations of the simplest assembly code.
I am not suggesting you memorize the above, nor am I suggesting that you do a thousand examples of each. All I ask is that you take the time to really understand that literally everything and I mean everything goes down to binary bits in the processor.
We as humans operate on the base 10 decimal system. The processor works in base 2 (binary), and we use base 16 (hex) as a compact way of writing those binary values. The registers we are dealing with in conjunction with Linux are addressed in 32-bit sizes. When we begin discussion of the processor registers, we will learn that each is 32 bits wide (technically the BCM2837 registers are 64 bits wide; however, the version of Linux that we are working with is 32-bit, therefore we only address 32 bits of each register).
Part 3 – Binary Addition
Binary addition can occur in one of four different fashions:
0 + 0 = 0
1 + 0 = 1
0 + 1 = 1
1 + 1 = 0 (1) [One Plus One Equals Zero, Carry One]
Keep in mind the (1) means a carry bit. It very simply means the sum does not fit in that column, so it overflows into the next higher bit.
Let’s take the following 4-bit nibble example:
We see an obvious carry in the 3rd bit. If the 8th bit had a carry then this would generate a carry flag within the CPU.
Let’s examine an 8-bit number:
If we had:
Here we see a carry bit which would trigger the carry flag within the CPU to be 1 or true. We will discuss the carry flag in later tutorials. Please just keep in mind this example to reference as it is very important to understand.
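To experiment with carries yourself, here is a minimal Python sketch that adds two binary strings column by column using the four rules from the start of this part. The function name add_bits and the sample operands are my own choices for illustration, not the values from the examples above.

```python
def add_bits(a: str, b: str):
    """Add two equal-width binary strings; return (sum_bits, carry_out)."""
    a = a.replace(" ", "")
    b = b.replace(" ", "")
    if len(a) != len(b):
        raise ValueError("operands must have the same width")
    carry = 0
    result = []
    # Work from the lowest bit (rightmost column) to the highest.
    for bit_a, bit_b in zip(reversed(a), reversed(b)):
        total = int(bit_a) + int(bit_b) + carry
        result.append(str(total % 2))   # the bit that stays in this column
        carry = total // 2              # the bit carried into the next column
    return "".join(reversed(result)), carry

print(add_bits("0111", "0001"))          # ('1000', 0)
print(add_bits("11111111", "00000001"))  # ('00000000', 1)
```

In the second call the carry leaves the highest bit, which is the situation that would set the carry flag for an 8-bit value.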
Part 4 – Binary Subtraction
Binary subtraction is nothing more than adding the negative value of the number to be subtracted. For example, with 8 + (-4) the starting point would be zero, from which we move 8 points in the positive direction and then 4 points in the negative direction, yielding a value of 4.
We represent the sign with a sign bit: in an 8-bit number, bit 7 indicates the sign of the number, where 0 is positive and 1 is negative.
Sign Bit 7 Bits 0 – 6
The above would represent -2.
We utilize the concept of two’s complement, which inverts each bit and then adds 1.
Let’s examine binary 2.
00000010
Invert the bits.
11111101
Add 1.
11111110
Let’s examine a subtraction operation:
00000100 4 decimal
+ 11111110 -2 decimal
(1)00000010 2 decimal
So what is the (1), you may ask? That is the carry out of bit 7: the result no longer fits in 8 bits, so the extra bit is dropped, leaving 2. In future tutorials we will examine what we refer to as the overflow flag and carry flag.
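The subtraction above can be reproduced with a short Python sketch. It assumes 8-bit values by default, and the names twos_complement and subtract are hypothetical helpers, not instructions.

```python
def twos_complement(value: int, width: int = 8) -> str:
    """Bits of -value: invert every bit of value, then add 1."""
    mask = (1 << width) - 1
    inverted = ~value & mask           # 00000010 -> 11111101
    negated = (inverted + 1) & mask    # 11111101 -> 11111110
    return format(negated, f"0{width}b")

def subtract(a: int, b: int, width: int = 8):
    """Compute a - b as a + (-b); return (result_bits, carry_out)."""
    mask = (1 << width) - 1
    total = a + int(twos_complement(b, width), 2)
    return format(total & mask, f"0{width}b"), total >> width

print(twos_complement(2))   # 11111110
print(subtract(4, 2))       # ('00000010', 1)
```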
Part 5 – Word Lengths
The system on chip we are working with has a 32-bit ARM CPU. 32-bits is actually 4 bytes of information which make up a word.
If you remember my prior tutorial on x86 Assembly, a word was 16-bits. Every different architecture defines a word differently.
The most significant bit of a word for our ARM CPU is located at bit 31 therefore a carry is generated if an overflow occurs there.
The lowest address in our architecture starts at 0x00000000 and goes to 0xFFFFFFFF. The processor sees memory in word blocks therefore every 4 bytes. A memory address associated with the start of a word is referred to as a word boundary and is divisible by 4. For example here is our first word:
So why is this important? The processor works in a fetch and execute cycle: it fetches each instruction from memory as a whole word, so instructions must sit on word boundaries for proper execution.
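Checking whether an address sits on a word boundary is a one-line test. Here is a small Python sketch; the function names and sample addresses are made up for illustration.

```python
WORD_BYTES = 4   # one 32-bit word is 4 bytes

def is_word_aligned(address: int) -> bool:
    """A word boundary is any address divisible by 4."""
    return address % WORD_BYTES == 0

def word_boundary(address: int) -> int:
    """Round an address down to the start of the word that contains it."""
    return address & ~(WORD_BYTES - 1)   # clear the lowest two bits

print(is_word_aligned(0x00000004))   # True
print(is_word_aligned(0x00000006))   # False
print(hex(word_boundary(0x8003)))    # 0x8000
```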
Before we dive into coding assembly it is critical that you understand some basics of how the CPU operates. There will be a number of further lectures going over the framework, so I appreciate everyone hanging in there!
Part 6 – Registers
Our ARM microprocessor has internal storage, the registers, which makes any operation much faster as there is no external memory access needed. The processor can execute two instruction sets, ARM and Thumb, and runs in one of several processor modes. We will be focusing on ARM instructions in User Mode as we are ultimately focused on developing for a system on chip within a Linux OS rather than bare-metal programming, which would be better suited to a microcontroller device.
In User Mode we have 16 registers and a CPSR register, each of which is one word in length: 32 bits, or 4 bytes, each.
Registers R0 to R12 are multi-purpose registers, whereas R13 – R15, as well as the CPSR, each have a unique purpose. Let’s take a look at a simple table to illustrate.
R0 GPR (General-Purpose Register)
R1 GPR (General-Purpose Register)
R2 GPR (General-Purpose Register)
R3 GPR (General-Purpose Register)
R4 GPR (General-Purpose Register)
R5 GPR (General-Purpose Register)
R6 GPR (General-Purpose Register)
R7 GPR (General-Purpose Register)
R8 GPR (General-Purpose Register)
R9 GPR (General-Purpose Register)
R10 GPR (General-Purpose Register)
R11 GPR (General-Purpose Register)
R12 GPR (General-Purpose Register)
R13 Stack Pointer
R14 Link Register
R15 Program Counter
CPSR Current Program Status Register
It is critical that we understand registers in a very detailed way. At this point we understand that R0 – R12 are general purpose and will be used to manipulate data as we build our programs. Additionally, when you are hacking apart or reverse engineering binaries from a hex dump on a cell phone or other ARM device, no matter what high-level language the program was written in, it must ultimately come down to assembly, so you need to understand registers and how they work to grasp any such operation.
The chip we are working with is known as a load and store machine. This means we load a register with the contents of a memory location, and we store the contents of a register out to a memory location. For example:
ldr r4, [r10] @ load r4 with the word stored at the memory address held in r10;
              @ if that location held the decimal value 22, 22 would go to r4
str r9, [r4]  @ store the contents of r9 at the memory address held in r4;
              @ if r9 had 0x02 hex, 0x02 would be stored at that location
The @ simply indicates to the assembler that what follows it on a given line is a comment and is to be ignored.
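If it helps to picture what the two instructions do, here is a toy Python model, assuming registers and memory are simple dictionaries. The starting values, including the address 0x1000 in r10, are made up for illustration and have nothing to do with real memory on the Pi.

```python
# Hypothetical starting state: r10 points at address 0x1000, which holds 22.
registers = {"r4": 0, "r9": 0x02, "r10": 0x1000}
memory = {0x1000: 22}

def ldr(dest: str, base: str) -> None:
    """ldr dest, [base]: read memory at the address held in base into dest."""
    registers[dest] = memory[registers[base]]

def str_(src: str, base: str) -> None:
    """str src, [base]: write src to memory at the address held in base."""
    memory[registers[base]] = registers[src]

ldr("r4", "r10")    # r4 now holds 22, the value found at address 0x1000
str_("r9", "r4")    # 0x02 is written to the address held in r4 (22)
print(registers["r4"], memory[22])   # 22 2
```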
The next few weeks we will take our time and look at each of the special purpose registers so you have a great understanding of what they do.
Part 7 - Program Counter
We will dive into the registers over the coming weeks to make sure you obtain a firm understanding of their role and what they can do.
We begin with the PC or program counter. The program counter is responsible for directing the CPU to what instruction will be executed next. The PC literally holds the address of the instruction to be fetched next.
When coding you can refer to the PC as PC or R15 as register 15 is the program counter. You MUST treat it with care as you can set it wrong and crash the executable quite easily.
You can control the PC directly in code:
mov r15, #0x00000000 @ the # marks an immediate (literal) value
I would not suggest trying that, as it will cause a fault: you would be jumping to an OS area rather than the designated program area.
Regarding our ARM processor, we follow the standard calling convention, meaning parameters are passed by placing their values into registers R0 – R3 before calling the subroutine, and the subroutine returns a value by putting it in R0 before returning.
This is important to understand when we think about how execution flows when dealing with a stack operation and the link register which we will discuss in future tutorials.
When you are hacking or reversing a binary, controlling the PC is essential when you want to test for subroutine execution and learning about how the program flows in order to break it down and understand exactly what it is doing.
Part 8 - CPSR
The CPSR register stores information about the program and the results of a particular operation. Individual bits in the register have pre-assigned conditions that are tested for; these bits are the flags.
The register is 32 bits in total. We are most concerned with the highest 4, which are:
Bit 31 – N = Negative Flag
Bit 30 – Z = Zero Flag
Bit 29 – C = Carry Flag (UNSIGNED OPERATIONS)
Bit 28 – V = Overflow flag (SIGNED OPERATIONS)
When the instruction completes the CPSR can get updated if it falls into one of the aforementioned scenarios. If one of the conditions occurs, a 1 goes into the respective bits.
Two of the instructions that directly affect the CPSR flags are CMP and CMN. CMP is compare, such as:
CMP R1, R0 @ notional subtraction R1 – R0; if the result is 0, bit 30 (Z) is set to 1
The most logical command that usually follows is BEQ = branch if equal, meaning the zero flag was set and branches to another label within the code.
Regarding CMP, if two operands are equal then the result is zero. CMN makes the same comparison but with the second operand negated for example:
CMN R1, R0 @ R1 - (-R0) or R1 + R0
When dealing with the SUB command, the result does NOT update the CPSR. You would have to use the SUBS command for the flags to be updated.
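To see how the four flags come out of a compare, here is a hedged Python model of CMP and CMN on 32-bit values. It only imitates the flag arithmetic (including the ARM rule that C is set after a subtraction when no borrow occurred); the function names cmp_flags and cmn_flags are hypothetical.

```python
MASK = 0xFFFFFFFF   # keep every value to 32 bits
SIGN = 0x80000000   # bit 31

def cmp_flags(a: int, b: int) -> dict:
    """Flags after CMP a, b, which performs a - b and discards the result."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    return {
        "N": int(bool(result & SIGN)),                   # result looks negative
        "Z": int(result == 0),                           # operands were equal
        "C": int(a >= b),                                # no borrow (unsigned)
        "V": int(bool((a ^ b) & (a ^ result) & SIGN)),   # signed overflow
    }

def cmn_flags(a: int, b: int) -> dict:
    """Flags after CMN a, b, which performs a + b and discards the result."""
    a &= MASK
    b &= MASK
    total = a + b
    result = total & MASK
    return {
        "N": int(bool(result & SIGN)),
        "Z": int(result == 0),
        "C": int(total > MASK),                          # carry out of bit 31
        "V": int(bool(~(a ^ b) & (a ^ result) & SIGN)),
    }

print(cmp_flags(5, 5))   # {'N': 0, 'Z': 1, 'C': 1, 'V': 0}
print(cmp_flags(3, 5))   # {'N': 1, 'Z': 0, 'C': 0, 'V': 0}
```

With Z set after the first compare, a following BEQ would take the branch.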
Part 9 – Link Register
The Link Register, R14, is used to hold the return address of a function call.
When a BL (branch with link) instruction performs a subroutine call, the link register is set to the subroutine return address. BL jumps to another location in the code and when complete allows a return to the point right after the BL code section. When the subroutine returns, the link register returns the address back to the program counter.
The link register does not require the writes and reads of the memory containing the stack which can save a considerable percentage of execution time with repeated calls of small subroutines.
When BL has executed, the return address which is the address of the next instruction to be executed, is loaded into the LR or R14. When the subroutine has finished, the LR is copied directly to the PC (Program Counter) or R15 and code execution continues where it was prior in the sequential code source.
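The hand-off between the LR and the PC can be pictured with a toy Python model. This is not how the CPU is built, just a sketch of the bookkeeping; the addresses and the tiny four-instruction program are invented for illustration.

```python
# Hypothetical program: the addresses and layout are made up for illustration.
PROGRAM = {
    0x8000: ("bl", 0x8010),    # call the subroutine at 0x8010
    0x8004: ("exit", None),    # execution resumes here after the return
    0x8010: ("nop", None),     # body of the subroutine
    0x8014: ("bx lr", None),   # return: copy LR into PC
}

def run(program: dict, entry: int) -> list:
    """Return the (pc, lr) pair seen at each executed instruction."""
    pc, lr, trace = entry, 0, []
    while True:
        op, target = program[pc]
        trace.append((pc, lr))
        if op == "bl":
            lr = pc + 4        # address of the instruction right after BL
            pc = target
        elif op == "bx lr":
            pc = lr            # LR is copied back to the PC
        elif op == "exit":
            return trace
        else:
            pc += 4            # ordinary instruction: move to the next word

for pc, lr in run(PROGRAM, 0x8000):
    print(f"pc={pc:#06x} lr={lr:#06x}")
```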
CODE TIME! Don’t be discouraged if you don’t understand everything in the code example here. It will become clear over the next few lessons.
as -o lr_demo.o lr_demo.s
ld -o lr_demo lr_demo.o
The simple example I created here is pretty self-explanatory. We start and proceed to the no_return subroutine and proceed to the my_function subroutine then to the wrap_up subroutine and finally exit.
It is necessary that we jump into GDB which is our debugger to see exactly what happens with each step:
As you can see with every step inside the debugger it shows you exactly the progression from no_return to my_function skipping wrap_up until the program counter gets the address from the link register.
Here we see the progression from wrap_up to exit.
This is a fundamental operation to keep in mind when we see next week how the stack operates, as the LR is an essential part of that process.
Part 10 – Stack Pointer
The Stack is an abstract data type that is LIFO (Last In First Out). When we push a value onto the stack, it is written to the memory location that the Stack Pointer points to, and when we pop, the value is taken off of the stack and placed into a register of your choosing.
CODE TIME! Again, don’t be discouraged if you don’t understand everything in the code example here. It will become clear over the next few lessons.
as -o sp_demo.o sp_demo.s
ld -o sp_demo sp_demo.o
Once again let’s load the binary into GDB to see what is happening.
Let’s step into one time.
We see hex 30 or 48 decimal moved into r7. Let’s step into again.
We see the value of the sp change from 0x7efff3a0 to 0x7efff39c. That is a movement backward of 4 bytes. Why the heck is the stack pointer going backward, you may ask!
The answer revolves around the fact that the stack grows DOWNWARD. When we say the top of the stack, you can imagine a series of plates each being placed BENEATH the previous one.
Originally the sp was at 0x7efff3a0.
When we pushed r7 onto the stack, the new value of the Stack Pointer is now 0x7efff39c so we can see the Stack truly grows DOWNWARD in memory.
Now let’s step into again.
We can see the value of hex 10 or decimal 16 moved into r7. Notice the sp did not change.
Before we step into again, let’s look at the value inside the sp.
Let’s step into again.
We see the value in the stack was popped off the stack and put back into r7, therefore the value of hex 30 is back in r7 and the sp is back at 0x7efff3a0.
Please take the time to type out the code, compile and link it and then step through the binary in GDB. Stack operations are critical to understanding Reverse Engineering and Malware Analysis as well as any debugging of any kind.
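If you want to replay the GDB session on paper first, here is a toy Python model of the same push and pop, using the sp and r7 values from the walkthrough above. The Machine class is a made-up illustration and assumes each push moves the sp down by one 4-byte word.

```python
class Machine:
    """Tiny model of r7, sp and a stack that grows downward."""

    def __init__(self, sp: int):
        self.regs = {"r7": 0, "sp": sp}
        self.memory = {}

    def push(self, reg: str) -> None:
        self.regs["sp"] -= 4                            # stack grows DOWNWARD
        self.memory[self.regs["sp"]] = self.regs[reg]   # write at the new sp

    def pop(self, reg: str) -> None:
        self.regs[reg] = self.memory[self.regs["sp"]]   # read the top of the stack
        self.regs["sp"] += 4                            # sp moves back up

pi = Machine(sp=0x7EFFF3A0)
pi.regs["r7"] = 0x30       # hex 30 moved into r7
pi.push("r7")              # sp is now 0x7efff39c
pi.regs["r7"] = 0x10       # r7 overwritten, sp unchanged
pi.pop("r7")               # r7 is 0x30 again, sp back at 0x7efff3a0
print(hex(pi.regs["r7"]), hex(pi.regs["sp"]))   # 0x30 0x7efff3a0
```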
Part 11 - ARM Firmware Boot Procedures
Let’s take a moment to talk about what happens when we first power on our Raspberry Pi device.
As soon as the Pi receives power, the graphics processor (GPU) is the first thing to run: the CPU is held in a reset state while the GPU starts executing code. The boot ROM reads bootcode.bin from the SD card and loads it into memory in the L2 cache; bootcode.bin turns on the rest of the RAM, and start.elf then loads.
The start.elf is an OS for the graphics processor and reads config.txt, which you can modify. The kernel.img, which is the Linux kernel, then gets loaded at 0x8000 in memory.
Once kernel.img is loaded, the CPU is turned on and starts running at 0x8000 in memory.
If we wanted, we could create our own kernel.img by hard coding machine code into a file, replacing the original image and then rebooting. Keep in mind the ARM word size is 32 bits long, going from bit 0 to bit 31.
As stated, when kernel.img is loaded the first byte, which is 8-bits, is loaded into address 0x8000.
Lets open up a hex editor and write the following:
FE FF FF EA
Save the file as kernel.img and reboot.
“Ok nothing happens, this sucks!”
Actually something did happen, you created your first bare-metal firmware! Time to break out the champagne!
When the Pi boots and reaches kernel.img, it loads the following bytes:
FE FF FF EA
@ address 0x8000, 0xfe gets loaded.
@ address 0x8001, 0xff gets loaded.
@ address 0x8002, 0xff gets loaded.
@ address 0x8003, 0xea gets loaded.
“So what the hell is really going on?”
These four bytes form a single instruction that simply executes an infinite loop: a branch back to itself.
Review the datasheet:
The above code has 3 parts to it:
1)Conditional – Set To Always
2)Op Code – Branch
3)Offset – How Far To Move Within The Current Location
Condition – bits 31-28: 0xe or 1110
Op Code – bits 27-24: 0xa or 1010
Offset – bits 23-0: 0xfffffe or -2
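The three fields can be pulled out of the four bytes with a short Python sketch. It assumes the bytes are stored little-endian (lowest byte at the lowest address, as in the listing above) and uses the ARM rule that a branch target is the instruction’s address plus 8, plus the offset counted in 4-byte words. The name decode_branch is hypothetical.

```python
def decode_branch(raw: bytes, address: int) -> dict:
    """Split a 32-bit ARM branch instruction into its fields."""
    word = int.from_bytes(raw, "little")   # FE FF FF EA -> 0xEAFFFFFE
    offset = word & 0x00FFFFFF             # bits 23-0
    if offset & 0x00800000:                # top bit of the field set: negative
        offset -= 1 << 24                  # 0xFFFFFE -> -2
    return {
        "condition": (word >> 28) & 0xF,   # bits 31-28, 0xe = always
        "opcode": (word >> 24) & 0xF,      # bits 27-24, 0xa = branch
        "offset": offset,
        "target": address + 8 + offset * 4,
    }

fields = decode_branch(bytes.fromhex("FEFFFFEA"), 0x8000)
print(fields)
```

The target works out to 0x8000 + 8 - 8 = 0x8000, the address of the instruction itself, which is why the Pi sits in an infinite loop.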
I know this may be a lot to wrap your mind around however it is critical that you take the time and read the datasheet linked above. Do not cut corners if you truly have the passion to understand the above. READ THE DATASHEET!
I will go through painstaking efforts to break everything down step-by-step however there are exercises like the above that I am asking you to review the datasheet above so you learn how to better understand where to look when you are stuck on a particular routine or set of machine code. This is one of those times I ask you to please read and research the datasheet above!
“I’m bored! Why the hell does this crap matter?”
Glad you asked! The single most dangerous malware on planet earth today is that of the root-kit variety. If you do not have a basic understanding of the above, you will never begin to even understand what a root-kit is as you progress in your understanding.
Anyone can simply replace the kernel.img file with their own hacked version and you can have total control over the entire process from boot.
Part 12 - Von Neumann Architecture
ARM is a load and store machine: the Arithmetic Logic Unit only operates on the registers themselves, and when data needs to be read from or stored out to RAM, the control unit moves it between memory and the registers over a shared data bus.
Program memory and data memory share the same data bus. This is what we call the Von Neumann Architecture.
The CPU chip of this architecture holds a control unit and the arithmetic logic unit (along with some local memory) and the main memory is in the form of RAM sticks located on the motherboard.
A stored-program digital computer is one that keeps its program instructions, as well as its data, in read-write, random-access memory or RAM.