Compilers - Let's understand compilers a bit better

Mar 2018
1163 words
Simple CS

You may have heard the phrases “My code is compiling” or “once it has compiled” before and not had a clue what the person was on about. After reading this article you should have a much better idea about what a compiler is and what a compiler does.

The first thing to say is that a compiler is a program. It is quite a complex program, but underneath it is simply some code stored in files on a computer. Nothing too scary.

What is a compiler

A compiler is a program that converts code written in one programming language into another language. It might take code written in Ruby and turn it into C, or code written in JavaScript and turn it into machine code.

The code it accepts as input is known as the source language. The code that it outputs is known as the target language.

Most programming languages are written for people to understand and be able to write (like Ruby or PHP).

Unfortunately computers don’t understand human languages very well, so the languages that are easy for us to read and write cannot be run directly by a computer.

Computers understand machine code: zeros and ones. So we need a way of taking the languages that we like to write and converting them into a language that the computer knows how to read. This is the job of the compiler.
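To make the idea of a source language and a target language concrete, here is a small sketch in Python (a language chosen here purely for illustration, it is not one of the examples above). The standard CPython interpreter has a compiler built in: its compile function takes Python source text and produces bytecode, a low-level instruction format for Python’s virtual machine. Bytecode is not true machine code, but it plays the same role of a target language that is easy for a machine to read and hard for a person to read.

```python
import dis

# Source language: ordinary Python text.
source = "x = 5 + 5"

# compile() runs Python's built-in compiler. The result is a code object
# holding bytecode, which is the target language in this sketch.
code = compile(source, "<example>", "exec")

# The compiled bytecode is just a sequence of bytes.
print(type(code.co_code))

# dis prints a human-readable listing of the bytecode instructions.
dis.dis(code)

# Running the compiled code gives the result we would expect.
namespace = {}
exec(code, namespace)
print(namespace["x"])
```

The exact instructions printed by dis differ between Python versions, so treat the listing as an illustration rather than something to rely on.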

Types of compiler

Before we talk about how a compiler works, I want to briefly mention some of the common types of compiler.

Cross-Compiler – The compiled program will be run on a system that is different from the one the compiler runs on. Imagine making a game for an Android phone on your Apple Mac.
Bootstrap Compiler – The compiler is written in the language that it compiles (this is also called a self-hosting compiler). For example, if a JavaScript compiler were written in JavaScript, it could be called a Bootstrap Compiler.
Decompiler – Normally compilers go from a high-level language (what humans like) to a lower-level language (what computers like). Decompilers go the other way, from a low-level language to a higher-level one.
Transpiler – Turns a high-level language into a different high-level language.
Language Rewriter – Keeps the language of the source code the same but changes how the code is written.

Regardless of the specific name or job being done, the compiler’s job is still to understand some code as input and produce some different code as output.
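As a rough illustration of the language rewriter idea from the list above, here is a minimal sketch that uses Python’s standard ast module (Python 3.9 or newer is assumed, because ast.unparse was added in that version). The function name rewrite is made up for this example. It reads Python code and writes Python code back out, so the language stays the same and only the way the code is written changes.

```python
import ast

def rewrite(source: str) -> str:
    """Parse Python source and write it back out in a standard layout."""
    tree = ast.parse(source)   # understand the code as input
    return ast.unparse(tree)   # output the same language, tidied up

# Odd spacing and unnecessary brackets in the input...
messy = "x=(  5+5 )"

# ...come out in a consistent style.
print(rewrite(messy))
```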

How a compiler works

A compiler has three main areas into which it divides its work. These are called the front end, middle end, and back end.

Front End

This part of the compiler is concerned with verifying the syntax and the semantics of the source language.

The syntax of something is the structure of statements. To have valid syntax means that you have written something that makes sense based on the rules of grammar.

This isn’t unique to programming: written and spoken languages have syntax too. For example, in English we can say “That is a big house”. It would not make sense to say “That is a house big”. We are using the exact same words, but because they aren’t in the correct order we don’t have valid syntax and therefore we can’t understand the sentence.

Semantics is concerned with the meaning of what you have written. You may have written something that is syntactically correct but still doesn’t make sense. Imagine if a recipe suddenly called for “3 eggs and 12:00 PM”. “12:00 PM” makes perfect sense as a thing, but not in the context of a recipe.

If the code we have written contains a syntax error or doesn’t make sense, the compiler will let us know, usually with an error message.
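The difference between the two kinds of check can be sketched in a few lines of Python. This is only an illustration under some assumptions: it borrows Python’s own parser (ast.parse) for the syntax check, and the semantic rule it enforces (“every name that is read must already have been given a value”) is a deliberately simple one invented for this example. The function name check is also made up. Real compilers perform many more checks than this.

```python
import ast
import builtins

def check(source: str) -> list:
    """Return a list of problems found in a piece of Python source."""
    # Syntax check: does the text follow the grammar of the language?
    try:
        tree = ast.parse(source)
    except SyntaxError as error:
        return [f"syntax error: {error.msg}"]

    # Semantic check (deliberately simple): every name that is read must
    # have been assigned by an earlier top-level statement, or be a built-in.
    defined = set(dir(builtins))
    problems = []
    for statement in tree.body:
        names = [n for n in ast.walk(statement) if isinstance(n, ast.Name)]
        for name in names:
            if isinstance(name.ctx, ast.Load) and name.id not in defined:
                problems.append(f"'{name.id}' is used but never defined")
        for name in names:
            if isinstance(name.ctx, ast.Store):
                defined.add(name.id)
    return problems

# Words in the wrong order: fails the syntax check.
print(check("total = = 3"))

# Valid grammar, but 'noon' means nothing here: fails the semantic check.
print(check("eggs = 3\ntotal = eggs + noon"))

# Valid grammar and every name has a meaning: no problems.
print(check("eggs = 3\ntotal = eggs + 2"))
```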

Code comes out of the front end in what we call an intermediate representation. This could take the form of an Abstract Syntax Tree, which is a tree-shaped version of the code that keeps its structure and meaning but leaves out the surface syntax of the source language (brackets, semicolons, spacing and so on).
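To get a feel for what an Abstract Syntax Tree looks like, here is a short sketch that asks Python’s standard ast module to act as a front end. It assumes Python 3.9 or newer (for the indent argument to ast.dump), and the exact text that is printed varies a little between versions.

```python
import ast

# The front end turns source text into a tree of nodes.
tree = ast.parse("x = 5 + 5")
print(ast.dump(tree, indent=2))

# The first statement is an assignment whose value is a "binary operation"
# node with a left side, an operator and a right side.
assignment = tree.body[0]
print(type(assignment).__name__)        # the kind of statement
print(type(assignment.value).__name__)  # the kind of expression

# Spacing and unnecessary brackets are surface syntax, so two differently
# written versions of the same code produce the same tree.
same_tree = ast.dump(ast.parse("x=(5+5)")) == ast.dump(tree)
print(same_tree)
```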

Middle End

The middle end (which, for the record, I think is a really silly name, how can the middle be an end?!?) concerns itself with performing optimisations on the code.

An optimisation means changing the code so that it runs faster or uses less memory, without changing what it does. A really simple example would be if you wrote:

  x = 5 + 5

Rather than forcing the computer to work out 5 + 5 every time the program runs, the compiler can work it out once (an optimisation known as constant folding) and produce this instead:

  x = 10
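Here is a minimal sketch of how that optimisation could be written. It is not how any particular compiler does it. It assumes Python 3.9 or newer and uses Python’s standard ast module as the intermediate representation. The names ConstantFolder, FOLDABLE and fold are invented for this example, and only addition, subtraction and multiplication of numbers are handled.

```python
import ast
import operator

# The operators this sketch knows how to work out ahead of time.
FOLDABLE = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

class ConstantFolder(ast.NodeTransformer):
    """Replace arithmetic on two known numbers with its result."""

    def visit_BinOp(self, node):
        # Fold the inner expressions first, so (2 * 3) + 4 becomes 6 + 4.
        self.generic_visit(node)
        operation = FOLDABLE.get(type(node.op))
        both_numbers = all(
            isinstance(side, ast.Constant) and isinstance(side.value, (int, float))
            for side in (node.left, node.right)
        )
        if operation is not None and both_numbers:
            result = operation(node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value=result), node)
        return node  # anything else is left untouched

def fold(source: str) -> str:
    """Parse the source, fold constants, and write the code back out."""
    tree = ConstantFolder().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)

print(fold("x = 5 + 5"))
print(fold("y = 2 * 3 + z"))  # only the part made of known numbers is folded
```

Division is left out on purpose: folding something like 1 / 0 ahead of time would make the compiler itself fail, which is the kind of detail a real optimiser has to handle carefully.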

There are many other optimisations it could do. For example, it could remove dead code. If there is code that is never going to get run, why bother generating output for it?

  if (true)
    do_this
  else
    do_that

In this example do_that would never get run because the condition of the if statement is always true. If this is the only place in the code that do_that ever gets called, there is no point in wasting memory or effort on generating output for that function.
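The dead code example above can be sketched in the same style. Again this is an illustration rather than a real compiler’s implementation: it assumes Python 3.9 or newer, it works on Python’s own ast module, and the names DeadBranchRemover and remove_dead_branches are made up. It only handles the simplest case, where the condition is a literal value such as True or False.

```python
import ast

class DeadBranchRemover(ast.NodeTransformer):
    """Replace 'if <known value>' with the branch that will actually run."""

    def visit_If(self, node):
        # Clean up any if statements nested inside this one first.
        self.generic_visit(node)
        if isinstance(node.test, ast.Constant):
            if node.test.value:
                return node.body              # condition always true
            return node.orelse or ast.Pass()  # condition always false
        return node  # condition not known in advance, keep the if

def remove_dead_branches(source: str) -> str:
    """Parse the source, drop branches that can never run, write it back."""
    tree = DeadBranchRemover().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)

example = """
if True:
    do_this()
else:
    do_that()
"""
print(remove_dead_branches(example))
```

Deciding that the do_that function itself is no longer needed anywhere is a separate and harder step, because the compiler has to be sure nothing else calls it.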

Back End

The back end takes the intermediate representation of the code which has now been optimised and turns it into the target language.

The generated code should now be able to be understood by anything that can understand the target language. So if we were compiling into JavaScript, the back end has done its job if something that can read JavaScript can successfully read your program.

It is incredibly hard to debug a program when you have written something correctly in your source code but it comes out wrong after being compiled. Because of this, compiler writers take huge amounts of care to ensure that the compiler does the correct thing.

Example

Here is an example. Let’s assume we are writing something in a language that doesn’t really care about symbols like (, ), or ; and we’re compiling into a language that does care about them.

Source Code:

  if true
    do_this
  else
    do_that

This source code goes into the front end and an intermediate representation comes out.

Intermediate Representation:

  if_statement:
    argument: true
    when_true: do_this
    when_false: do_that

Now the intermediate representation of the code goes into the middle end of the compiler.

Optimised Intermediate Representation:

  do_this

Finally, we can compile this optimised intermediate representation into our target language:

  do_this();

One of the nice benefits of having a compiler work in these three distinct areas is that you can swap out the front end of a compiler but keep the same middle and back end parts. This would allow us to convert more than one source language into a particular target language. In the same way, swapping out the back end would let one source language be compiled into more than one target language.
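The whole walkthrough above can be sketched as a toy compiler with one function per area. Everything here is an assumption made for illustration: the made-up source language only understands if, else and bare names such as do_this, it expects two spaces of indentation, and the names parse, optimise, generate and compile_source are invented. It is written in Python and is nowhere near a real compiler, but the three stages are the same.

```python
from dataclasses import dataclass, field

@dataclass
class Call:                      # IR node for a bare name such as do_this
    name: str

@dataclass
class IfStatement:               # IR node matching the if_statement above
    argument: str
    when_true: list = field(default_factory=list)
    when_false: list = field(default_factory=list)

def indent_of(line):
    return len(line) - len(line.lstrip())

def parse_block(lines, position, indent):
    """Front end helper: read statements that share one indentation level."""
    nodes = []
    while position < len(lines) and indent_of(lines[position]) == indent:
        text = lines[position].strip()
        if text.startswith("if "):
            when_true, position = parse_block(lines, position + 1, indent + 2)
            when_false = []
            has_else = (position < len(lines)
                        and indent_of(lines[position]) == indent
                        and lines[position].strip() == "else")
            if has_else:
                when_false, position = parse_block(lines, position + 1, indent + 2)
            nodes.append(IfStatement(text[3:].strip(), when_true, when_false))
        elif text != "else" and text.isidentifier():
            nodes.append(Call(text))
            position += 1
        else:
            raise SyntaxError(f"cannot understand: {text!r}")
    return nodes, position

def parse(source):
    """Front end: source text in, intermediate representation out."""
    lines = [line for line in source.splitlines() if line.strip()]
    nodes, position = parse_block(lines, 0, 0)
    if position != len(lines):
        raise SyntaxError(f"unexpected indentation: {lines[position]!r}")
    return nodes

def optimise(nodes):
    """Middle end: drop branches whose condition is already known."""
    result = []
    for node in nodes:
        if isinstance(node, IfStatement):
            when_true = optimise(node.when_true)
            when_false = optimise(node.when_false)
            if node.argument == "true":
                result.extend(when_true)
            elif node.argument == "false":
                result.extend(when_false)
            else:
                result.append(IfStatement(node.argument, when_true, when_false))
        else:
            result.append(node)
    return result

def generate(nodes, depth=0):
    """Back end: intermediate representation in, target language lines out."""
    pad = "  " * depth
    lines = []
    for node in nodes:
        if isinstance(node, Call):
            lines.append(f"{pad}{node.name}();")
        else:
            lines.append(f"{pad}if ({node.argument}) {{")
            lines.extend(generate(node.when_true, depth + 1))
            if node.when_false:
                lines.append(f"{pad}}} else {{")
                lines.extend(generate(node.when_false, depth + 1))
            lines.append(f"{pad}}}")
    return lines

def compile_source(source):
    return "\n".join(generate(optimise(parse(source))))

print(compile_source("if true\n  do_this\nelse\n  do_that"))
```

Because each stage only talks to the next one through the intermediate representation, a different parse function for another source language could be dropped in without touching optimise or generate.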

This is part of our Simple CS series, where we explain Computer Science and Web Development terms in really simple language. Excellent for beginners or if you need a quick refresher.
