You may have heard the phrases “My code is compiling” or “once it has compiled” before and not had a clue what the person was on about. After reading this article you should have a much better idea about what a compiler is and what a compiler does.
The first thing to say is that a compiler is just a program. It is quite a complex program, but it is essentially just code stored in files on a computer, like any other program. Nothing too scary.
What is a compiler
A compiler takes code as input and produces different code as output. The language of the code it accepts is known as the source language, and the language of the code it outputs is known as the target language.
Most programming languages are written for people to understand and be able to write (like Ruby or PHP).
Unfortunately computers don’t understand human languages very well, so languages that are easy for us to read and write are impossible for computers to run directly.
Computers understand machine code: zeros and ones. So we need a way of taking languages that we like to write and converting them into a language that the computer knows how to read. This is the job of the compiler.
Types of compiler
Before we talk about how a compiler works I want to briefly mention some of the common types of compiler.
- Cross-Compiler – The compiled program will be run on a system that is different from the one the compiler runs on. Imagine making a game for an Android phone on your Apple Mac.
- Decompiler – Normally compilers go from a high-level language (what humans like) to a lower-level language (what computers like). Decompilers go from a low-level language to a higher-level one.
- Transpiler – Turns a high-level language into a different high-level language.
- Language Rewriter – Keeps the language of the source code the same but changes how code is written.
Regardless of the specific name or job being done, the compiler’s job is still to take some code as input and produce some different code as output.
How a compiler works
A compiler divides its work into three main areas. These are called the front end, middle end, and back end.
The front end of the compiler is concerned with verifying the syntax and the semantics of the source language.
The syntax of a language is the set of rules for how statements are structured. To have valid syntax means that you have written something that makes sense based on the rules of grammar.
This isn’t unique to programming: written and spoken language has a syntax. For example in English we can say “That is a big house”. It would not make sense to say “That is a house big”. We are using the exact same words, but because they aren’t in the correct order we don’t have valid syntax and therefore we can’t understand the sentence.
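The same idea can be tried out on code. Here is a minimal sketch that uses Python’s built-in ast module as a stand-in front end; the helper name has_valid_syntax is made up for this illustration, and the grammar being checked is Python’s rather than that of any compiler discussed here.

```python
import ast

def has_valid_syntax(source):
    """Return True if `source` follows Python's grammar, False otherwise."""
    try:
        ast.parse(source)  # the parser raises SyntaxError on bad structure
        return True
    except SyntaxError:
        return False

# Same "words" both times; only the order differs.
print(has_valid_syntax("house = big"))
print(has_valid_syntax("= house big"))
```

The first line parses and the second does not, even though both use exactly the same pieces. That is the code equivalent of “That is a house big”.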
Semantics is concerned with the meaning of everything. You may have written something syntactically correct that still doesn’t make sense. Imagine if a recipe suddenly called for “3 eggs and 12:00 PM”. “12:00 PM” makes perfect sense as a thing, but not in the context of a recipe.
If the code we have written contains a syntax error or doesn’t make sense, the compiler will let us know.
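One common semantic check is making sure every name that is used has actually been given a value first. The sketch below is a deliberately tiny version of that check, again leaning on Python’s ast module. The function name undefined_names is hypothetical, and it only understands simple top-level assignments, so treat it as an illustration rather than how any real compiler does it.

```python
import ast
import builtins

def undefined_names(source):
    """List names that are read before any earlier line has assigned them."""
    tree = ast.parse(source)        # the syntax must already be valid
    defined = set(dir(builtins))    # names like print are always available
    problems = []
    for statement in tree.body:
        # First look at every name this statement reads...
        for node in ast.walk(statement):
            if (isinstance(node, ast.Name)
                    and isinstance(node.ctx, ast.Load)
                    and node.id not in defined):
                problems.append(node.id)
        # ...then record the names it assigns, ready for later statements.
        for node in ast.walk(statement):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                defined.add(node.id)
    return problems

recipe = "eggs = 3\ntotal = eggs + noon"
print(undefined_names(recipe))
```

The recipe code is perfectly good syntax, but noon has never been defined, so the check reports it. It is the “3 eggs and 12:00 PM” problem in code form.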
Code comes out of the front end in what we call an intermediate representation. This could take the form of an Abstract Syntax Tree. It is a version of the code written without all the syntax of the source language.
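To get a feel for what an Abstract Syntax Tree contains, here is a small sketch using Python’s built-in ast module. Python is only assumed here as a convenient front end to poke at, and the helper name node_types is made up for illustration.

```python
import ast

def node_types(source):
    """Return the kind of every node in the syntax tree built from `source`."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]

kinds = node_types("total = 5 + 5")
print(kinds)
```

The characters = and + no longer appear as text. They have become nodes such as Assign and Add, and the two 5s have become Constant nodes, which is what makes the tree easier for the later stages to work with.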
The middle end (which, for the record, I think is a really silly name, how can the middle be an end?!?) concerns itself with performing optimisations on the code.
An optimisation means changing the code to make it run faster. A really simple example would be code that adds 5 + 5 every time it runs. Instead of forcing the computer to work out 5 + 5 each time, why not have the compiler work it out once and put 10 in the output instead?
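This kind of optimisation is usually called constant folding. Below is a minimal sketch of it, assuming Python 3.9 or later (for ast.unparse) and using Python’s own syntax tree as the intermediate representation. The names ConstantFolder and fold_constants are invented for this example, and only +, - and * on numbers are handled.

```python
import ast
import operator

# Operators this sketch knows how to work out ahead of time.
FOLDABLE = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

class ConstantFolder(ast.NodeTransformer):
    """Replace arithmetic on two literal numbers with its result."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the inner expressions first
        work_out = FOLDABLE.get(type(node.op))
        left, right = node.left, node.right
        if (work_out
                and isinstance(left, ast.Constant)
                and isinstance(right, ast.Constant)
                and isinstance(left.value, (int, float))
                and isinstance(right.value, (int, float))):
            folded = ast.Constant(value=work_out(left.value, right.value))
            return ast.copy_location(folded, node)
        return node  # something is not known yet, so leave it alone

def fold_constants(source):
    tree = ConstantFolder().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))

print(fold_constants("total = 5 + 5"))
print(fold_constants("total = count + 5"))
```

The first line comes back as total = 10, while the second is left untouched because count is not known until the program runs.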
There are many other optimisations it could do. For example, it could remove dead code. If there is code that is never going to get run, why bother generating output for it?
Imagine an if statement whose condition is always true, with do_that called only in its else branch. In this example do_that would never get run. If this is the only place in the code that do_that ever gets called, there is no point in wasting the memory or effort in creating the function.
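Here is a rough sketch of removing a dead branch, assuming Python 3.9 or later and using Python’s ast module as the intermediate representation. The names DeadBranchRemover and remove_dead_branches are hypothetical, as is do_this, which stands in for whatever the live branch does.

```python
import ast

class DeadBranchRemover(ast.NodeTransformer):
    """Replace `if <constant>:` with whichever branch will actually run."""

    def visit_If(self, node):
        self.generic_visit(node)  # tidy up nested if statements first
        if not isinstance(node.test, ast.Constant):
            return node  # the condition is only known at run time, keep it
        live = node.body if node.test.value else node.orelse
        # An empty branch would leave a hole in the code, so fall back to pass.
        return live or [ast.Pass()]

def remove_dead_branches(source):
    tree = DeadBranchRemover().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))

source = """
if True:
    do_this()
else:
    do_that()
"""
print(remove_dead_branches(source))
```

Only the call to do_this survives. A fuller optimiser could then notice that nothing calls do_that any more and drop the function itself, but that step is not shown here.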
The back end takes the intermediate representation of the code which has now been optimised and turns it into the target language.
It is incredibly hard to debug a program when you have written something correctly in your source code but it comes out wrong after being compiled. Because of this, compiler writers take huge amounts of care to ensure that the compiler does the correct thing.
Here is an example. Let’s assume we are writing something in a language that doesn’t really care about symbols like ; and we’re compiling into a language that does care about them.
This source code goes into the front end and an intermediate representation comes out.
Now the intermediate representation of the code goes into the middle end of the compiler.
Optimised Intermediate Representation:
Finally, we can compile this optimised intermediate representation into our target language:
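As a rough illustration of the same journey, here is a toy three-stage compiler. Everything in it is an assumption made for this sketch: the tiny source language (one assignment per line, whole numbers joined by +, no semicolons), the list-of-tuples intermediate representation, and all of the function names.

```python
def front_end(source):
    """Check the syntax and turn each line into an intermediate representation."""
    ir = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines carry no meaning
        name, equals, expression = line.partition("=")
        name = name.strip()
        terms = [term.strip() for term in expression.split("+")]
        if not equals or not name.isidentifier() or not all(t.isdecimal() for t in terms):
            raise SyntaxError(f"line {line_no}: cannot understand {line!r}")
        ir.append(("assign", name, [int(t) for t in terms]))
    return ir

def middle_end(ir):
    """Optimise: add the numbers up now so the final program does not have to."""
    return [(kind, name, [sum(terms)]) for kind, name, terms in ir]

def back_end(ir):
    """Write the target language, which insists on a ; after every statement."""
    lines = []
    for _kind, name, terms in ir:
        lines.append(f"{name} = {' + '.join(str(t) for t in terms)};")
    return "\n".join(lines)

def compile_source(source):
    return back_end(middle_end(front_end(source)))

print(front_end("total = 5 + 5"))              # intermediate representation
print(middle_end(front_end("total = 5 + 5")))  # optimised version
print(compile_source("total = 5 + 5"))         # target language
```

Printing each stage shows the source line becoming a tuple, the two 5s collapsing into 10, and finally the semicolon appearing in the output.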
One of the nice benefits of having a compiler work in these three distinct areas is that you could swap out the front end of a compiler but keep the same middle and back end. This would allow us to compile more than one source language into a particular target language.
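A small sketch of that idea: two front ends for two made-up source languages, both producing the same intermediate representation, so a single back end can serve them both. The languages, the (name, numbers) representation and the function names are all assumptions for illustration, and error checking is left out to keep it short.

```python
def infix_front_end(source):
    """Understands lines such as: total = 5 + 5"""
    name, _equals, expression = source.partition("=")
    return name.strip(), [int(term) for term in expression.split("+")]

def wordy_front_end(source):
    """Understands lines such as: set total to 5 plus 5"""
    words = source.split()
    # words looks like ["set", name, "to", number, "plus", number, ...]
    return words[1], [int(word) for word in words[3::2]]

def shared_back_end(ir):
    """One back end, used no matter which front end produced the input."""
    name, numbers = ir
    return f"{name} = {' + '.join(str(n) for n in numbers)};"

print(shared_back_end(infix_front_end("total = 5 + 5")))
print(shared_back_end(wordy_front_end("set total to 5 plus 5")))
```

Both lines print the same target code, even though the two source languages look nothing alike.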
This is part of our Simple CS series, where we explain Computer Science and Web Development terms in really simple language. Excellent for beginners or if you just need a quick refresher.