Programming a “hello world!” in assembly from the first line to the end (x86)
Hello everyone, I’m Pablo Corbalán and this is my first Medium post. I’ll be using Medium as a little blog because I’m too lazy to program a blog myself.
In this post I’m going to explain how you can code a “Hello world!” program using assembly, more specifically x86 Linux assembly. But first of all, what’s assembly?
Before starting, all the code is here: https://gist.github.com/pablocorbalann/f9d39a80e30b8d8230a9760048d0e575
This article is divided into three sections:
- What is assembly and how do computers understand code
- Programming a Hello World in assembly
- Running the program on your computer
What is “the assembly language”?
A bit of history
Computers can’t understand human language. They work with electric current, and the patterns of that current can be represented as a set of instructions for the computer to carry out. We call this “machine language”, the only language that computers can understand directly. In the end, computers only understand binary instructions that are evaluated using logic gates, but that is hardware stuff, and we are here to build software!
It is due to this that throughout history programmers have tried to make the computer understand languages more and more similar to real human language. Nowadays we have programming languages that are really similar to human language, for example Python. You can understand basic Python code with a quick look at it. For example the expression
if value is not None
makes sense in a Python program and in a Cambridge exam.
However, to achieve this, programming languages have had to evolve little by little. We haven’t created a language like Python in a week…
So, how do computers understand other languages?
Programmers have written special programs to “translate” other languages into a language that computers can actually understand. For example, imagine that you only know how to speak English. You’ll have trouble if a Korean speaker tries to hold a conversation with you, but if you have a translator who can translate from Korean to English, you won’t have any problems. The same applies to computers: a computer can’t directly understand Python or Golang, but Python or Golang can be translated to binary so that the computer can understand the program.
These programs that “translate” the languages are called compilers or interpreters (or semi-compilers) depending on the process they use for “translating” the language. There are interpreted languages (such as Python), compiled programming languages (such as Golang) and a special type called “semi-compiled” (for example, Java is a semi-compiled programming language). You can read about the differences between compiled and interpreted languages in this article.
We can call languages such as Python or Golang “high-level programming languages”, because they are close to human language and far from what a computer can understand. There are also “low-level programming languages”. C is often placed here, as it sits at a point between a high-level language and a language that a computer can understand. Low-level programming languages are designed to work closely with the hardware, as they deal directly with the memory and specifics of the computer, while high-level programming languages are designed more for building application software.
But languages are not always compiled to binary directly. The process of compiling a program can be pretty complex. If we talk about compiled languages, a high-level programming language is compiled to assembly code or machine code. Assembly is a very low-level programming language; it is the closest thing to machine code that we can comfortably write by hand.
There are numerous different versions of assembly. It’s not one language; it’s actually a collection of similar languages. In most cases, assembly is just “shorthand” for machine language. It usually has some symbols that are close to words (like JMP for “jump”) that are easier for humans to digest. The real machine code is just a bunch of numbers. While understandable for small programs, it’s unmanageable for large projects.
For example, C/C++/Rust compile directly to machine code, while Java and C# compile to an intermediate language that is then run on a system that interprets (or compiles just in time) that output when the program runs. This used to be fairly slow, but it’s not anymore given modern hardware and operating systems. Python is kind of halfway in this camp: it’s interpreted at runtime, but what gets interpreted is compiled bytecode. JavaScript is traditionally described as interpreted at runtime, although modern engines also compile it just in time.
It’s easy to convert assembly code to machine code and vice versa.
Now let’s get into code!
To code in assembly, we can use any plain text editor, for example Visual Studio Code or Sublime Text 3. Programming in assembly is like working with any other programming language, except that you have to manage the registers and memory directly yourself.
I’ll use Vim. The first step is creating an assembly file. Assembly code is usually written in files with the .asm extension. I’ll call my file hello.asm.
Basic assembly x86 syntax
Today we are going to use x86 assembly. As I said before, assembly is not a single language, but x86 is one of the most common variants. We are going to look at a few assembly keywords and symbols first, and then see how the hello world works.
Comments in assembly start with a semicolon. For example:
; This is a comment in assembly, this part doesn’t run
As assembly instructions are pretty different from human language, it’s a good practice to comment every line of code and explain what it does. I can’t explain all of assembly to you here, but if you are curious and want to learn more x86, you can read this guide. Basically you have to keep two important things in mind. The first is registers. From Tutorialspoint:
Processor operations mostly involve processing data. This data can be stored in memory and accessed from thereon. However, reading data from and storing data into memory slows down the processor, as it involves complicated processes of sending the data request across the control bus and into the memory storage unit and getting the data through the same channel. To speed up the processor operations, the processor includes some internal memory storage locations, called registers. The registers store data elements for processing without having to access the memory. A limited number of registers are built into the processor chip.
More about registers here. The second thing we are going to use today is the mov instruction. As an abbreviation of “move”, mov copies data from a source to a destination: from one register to another, between a register and memory, or from a constant value into either. For example:
mov ax, 1234h ; Copies the value 1234hex (4660d) into register AX
mov bx, ax ; Copies the value of AX into register BX
The Wikipedia page has a clear explanation of how it works. Now that we know this, we can start coding in our file. The first step is opening the file in your text editor and creating two sections. We will call these sections .text and .data. In assembly, a section is the smallest unit of an object file that can be relocated in ELF files (ELF is a file format for executable programs, libraries and more). We can use sections for executable code, read-only data, read-write data, read-write uninitialized data…
Sections are created with the section keyword, so in our program we will have to write:
section .text
section .data
As I said previously, it’s a good practice to comment the lines (you don’t have to comment all of them, but we will do it today as this is a “tutorial”):
section .text ; Create the text section
section .data ; Create the data section
It’s also a good practice to align the code in columns, so that it’s not a mess. The next line we are going to type inside the .text section is:
global _start
global is a directive (an instruction for the assembler rather than for the CPU) for NASM (Netwide Assembler), which is the assembler we will use for the x86 CPU. NASM can be used for creating 16, 32 or 64-bit programs. The directive makes the _start symbol visible to the linker, which we’ll use later. And now we have to write the entry point itself:
section .text ; Create the text section
global _start ; Has to be declared for the linker
_start: ; The start section begins
section .data ; Create the data section
Now we are going to declare the data that we are going to be using in the program. All this data goes into the data section. For creating a “Hello world!” program we’ll have to use two things that can be expressed as data.
- The message we want to show in the console (in this case “hello world”)
- The size of that message
Maybe you think “do we really need to know the size of the message for showing it in the console?” Well my friend, yes we do. Maybe you don’t need the size of the message for creating a Hello world in Python, but in assembly you have control over everything. This means that you can control exactly the number of bits, bytes or whatever you want to use, to move, to show, to write and so on. This is what we refer to when we say “working directly with the memory of the computer”.
So let’s start creating the message. For this we are going to use db, which is an abbreviation for “define byte”. It stores each value we list in one byte (8 bits) of memory, so every character of our message takes one byte. You can use other sizes too:
- db: Define byte (8 bits of memory per value)
- dw: Define Word. Generally 2 bytes on a typical x86 32-bit system
- dd: Define double word. Generally 4 bytes on a typical x86 32-bit system
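As a rough sketch of how these directives look in practice, here is a data section on its own. The labels letter, text, year and big are hypothetical names chosen only for this illustration; they are not part of the hello world program.

```nasm
section .data
letter  db 'A'              ; 1 byte: the character code 0x41
text    db "Hi", 0xa        ; 3 bytes: 'H', 'i' and a newline
year    dw 2021             ; 1 word = 2 bytes
big     dd 100000           ; 1 double word = 4 bytes
```

Note that db accepts a whole list of values, which is why a string works: each character becomes one byte, one after another in memory.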
We will give the message the label msg, so now the code looks like:
section .text
global _start
_start:
section .data
msg db "Hello world!", 0xa ; declare the message
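0xa is the hexadecimal code of the newline (line feed) character.
And now we are going to assign the length of the message to another symbol called len. We can do this using “$”, which stands for the current address in the program, and the equ (from “equals”) directive. Since “$” comes right after the message, $ - msg is the number of bytes the message occupies.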
len equ $ - msg ; assign the length of the message to len
So now the code will look like:
section .text
global _start
_start:
section .data
msg db "Hello world!", 0xa
len equ $ -msg
Note: I’m not commenting the code because it is too wide for a device screen. At the end of the article you have the code completely commented
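To make the role of “$” more concrete, here is a small sketch of the data section alone. The labels other and wrong are hypothetical names added only for illustration; the point is that the length must be computed immediately after the message.

```nasm
section .data
msg     db "Hello world!", 0xa  ; 12 characters + 1 newline = 13 bytes
len     equ $ - msg             ; $ is the address right after the newline,
                                ; so len = 13
other   db "unrelated data"     ; 14 more bytes, defined after len
wrong   equ $ - msg             ; 13 + 14 = 27: $ has already moved on
```

Because equ only defines a constant for the assembler, len and wrong take up no memory in the final program.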
Now we can return to the _start piece of code. For creating a Hello world program in assembly, we have to do four things:
- “Invoke” the data into the .text section.
- Set the file descriptor for the program.
- Print the text
- Exit the program
So, let’s start with the first point. We first have to “invoke” the data from the .data section, and we do this using the mov instruction to load it into registers:
mov edx, len ; move the length of the message to EDX
mov ecx, msg ; move the address of the message to ECX
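It is worth noting that in NASM a bare label such as msg stands for an address, while square brackets read or write the memory at that address. A minimal sketch of the difference follows; value is a hypothetical label used only here, and the last two instructions are the exit call explained later in this post.

```nasm
section .data
value   dd 10               ; hypothetical 4-byte variable holding 10

section .text
global _start
_start:
    mov ecx, value          ; no brackets: ECX = the address of value
    mov edx, [value]        ; brackets: EDX = the contents of value (10)
    mov dword [value], 20   ; store 20 at that address (size must be stated)
    mov ebx, [ecx]          ; read through the address in ECX: EBX = 20
    mov eax, 1              ; sys_exit, with EBX as the exit status
    int 0x80                ; call the kernel
```

If you build this with the same nasm and ld commands shown at the end of this post and run it, it prints nothing, but typing echo $? afterwards should show 20, the value that was read back through the address.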
The second step is also very simple. For setting the file descriptor of the program we just have to use the following line of code:
mov ebx, 1 ; set the descriptor
If you don’t know what a file descriptor (fd) is in Linux, you can read about it here. So now our code should look like this (but with some comments):
section .text
global _start
_start:
mov edx, len
mov ecx, msg
mov ebx, 1
section .data
msg db "Hello world!", 0xa
len equ $ -msg
The next step is to ask the system to print to the console. x86 Linux works with system calls, and every system call has an ID that goes into the EAX register. The system call for writing (sys_write), which we use for printing, has ID 4, so we set eax to 4. We’ll have to add to the _start block:
mov eax, 4
And now we have to add another line that actually triggers the call. In Linux assembly this is what we call “calling the kernel”:
int 0x80
int is the abbreviation for “interrupt”. An interrupt transfers the program flow to whoever is handling that interrupt, which is interrupt 0x80 in this case. In Linux, the 0x80 interrupt handler is the kernel, and programs use it to make system calls to the kernel. So the code will now look like:
section .text
global _start
_start:
mov edx, len
mov ecx, msg
mov ebx, 1
mov eax, 4
int 0x80
section .data
msg db "Hello world!", 0xa
len equ $ -msg
The last step is to exit the program, again by passing control to the Linux kernel. The exit system call (sys_exit) has ID 1, so we set eax to 1, and the code will look like:
section .text
global _start
_start:
mov edx, len
mov ecx, msg
mov ebx, 1
mov eax, 4
int 0x80
mov eax, 1
int 0x80
section .data
msg db "Hello world!", 0xa
len equ $ -msg
And that’s it, this is a complete “Hello world” program in assembly x86. We can comment the code like:
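A sketch of what a fully commented version could look like is below; the gist linked at the top of the article remains the reference. One line is an assumption added here and is not in the code above: mov ebx, 0 before the exit call. Without it, EBX still holds 1 from the write call, so the program would report exit status 1 to the shell even though it ran fine.

```nasm
section .text               ; code goes in the text section
global _start               ; export the entry point for the linker

_start:                     ; execution begins here
    mov edx, len            ; third argument: number of bytes to write
    mov ecx, msg            ; second argument: address of the message
    mov ebx, 1              ; first argument: file descriptor 1 (stdout)
    mov eax, 4              ; system call number 4 (sys_write)
    int 0x80                ; call the kernel

    mov ebx, 0              ; assumption: exit status 0 (not in the code above)
    mov eax, 1              ; system call number 1 (sys_exit)
    int 0x80                ; call the kernel

section .data               ; initialized data goes in the data section
msg     db "Hello world!", 0xa  ; the message followed by a newline
len     equ $ - msg             ; length of the message in bytes
```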
Again, you don’t really have to comment all the lines as I have done. This is a tutorial, so I have done it so that you can understand what I’m doing, but lines such as global _start, the sections or int 0x80 don’t really need to be commented, as everyone knows what they are.
I have uploaded all the code (commented) to a GitHub gist, linked at the beginning of this article.
Running the program without problems
For running the program we are going to use two things:
1. nasm: NASM is the assembler we are going to use. An assembler is a program that converts assembly language into machine code. It takes the basic commands and operations from assembly code and converts them into binary code that can be recognized by a specific type of processor. Assemblers are similar to compilers in that they produce executable code.
Installing nasm is really simple in most Linux distributions. For example, in Debian (and similar ones) you just have to open a terminal and type:
sudo apt-get update -y
sudo apt-get install -y nasm
If you are not using Debian, type in Google “How to install nasm in <name>”, and instead of <name> type your distribution. For example “how to install nasm in arch linux”.
Verify if you have installed nasm using:
nasm -v
2. ld: ld is the linker we are going to use. A linker is a program that “joins” lots of pieces of code into the same executable file. If you don’t know whether you have ld installed on your computer, type
ld -v
in the terminal. If it does not produce any errors, you have ld installed. If you don’t have ld installed you can download it from the binutils project (GNU):
https://www.gnu.org/software/binutils/
Generating the executable
Now that you have nasm and ld installed, we are going to generate the executable. The first thing we are going to run is nasm. In a terminal type:
nasm -f elf32 -o hello.o hello.asm
This should not raise any error. Now you have to link the object file hello.o into an executable using ld, so type:
ld -m elf_i386 -o hello hello.o
You can also type both of the commands in the same line using &&:
nasm -f elf32 -o hello.o hello.asm && ld -m elf_i386 -o hello hello.o
And that’s it! Now you can run the Hello world as an executable using ./hello
That’s the end of this tutorial. Remember to follow me on Twitter and GitHub @pablocorbalann!