Abstract Syntax Trees

Jan 2018
1277 words
Simple CS

Abstract syntax trees (ASTs for short) are a way of representing a computer program in an abstract way. In this article I am going to unpack what this means, and why we do this.

By the end of the article you should have a good understanding of ASTs, certainly more than enough understanding than most developers should normally require.

Let’s start by talking about each term that makes up the name “Abstract Syntax Tree”.

Abstract

In this context all we mean by Abstract is that we’re not going to express the exact code you’ve typed, instead we are going to make a high level representation of it.

For example, if our code was a mathematical function:

  5 + (10 * 2)

Our AST wouldn’t need to literally represent any whitespace or the brackets, we can represent them in a more abstract way.

We will see some examples later which will show this in better detail.

Syntax

It is a Syntax tree because we are representing the Syntax of a language. The syntax of something is the structure of statements. To have valid syntax means that you have written something that makes sense based on the rules of grammar.

This isn’t unique to programming: written and spoken language has a syntax. For example in English we can say “That is a big house”. It would not make sense to say “That is a house big”. We are using the exact same words, but because they aren’t in the correct order we don’t have valid syntax and therefore we can’t understand the sentence.

What this means for us is that in order for an Abstract Syntax Tree to be made out of the code we’ve written we have to have syntactically valid code.

Tree

At some point I will write an entire article just on the term “Tree” but for now all we need to know is that this is a special data structure (a particular way of storing information) that works the same way as branches on a tree.

You have one node called the root, which is the trunk of the tree. The root can have several nodes splitting off, each of which can have one or more other nodes and so on.

A lot of tree structures you may experience will have a maximum of two nodes coming off of one node. Abstract Syntax Trees are an example of a tree structure that can have more than two.

It is easier to understand with a diagram:

Random numbers displayed as a tree structure

In this diagram we see a collection of random numbers, each one lives in a node.

Number 1 is the root node and has two child nodes.
Number 2 is a child of the root node and has one child.
Number 3 is a child of the number 2 node and has two children.
Numbers 4, 5, and 7 are what we call a leaf nodes, because they have no children and are at the edge of our tree structure.

Just like the branches of a tree you can imagine how over time this structure can start to grow very wide at the bottom (we need plenty of space for our leaves!).

When ASTs are used

ASTs are often generated by compilers when they’re running on code.

Compiling code is something else that will have an entire page dedicated to at some point but there is a type of computer program called a compiler, its job is to turn code written into one language into another (normally less human friendly) language.

Let’s say you have some code written in Ruby and you wanted it to automatically turn into JavaScript.

It would be very hard to write a program that read some Ruby code, completely understood it and immediately changed the code to JavaScript. When Ruby or JavaScript syntax changed we would need to rewrite our entire program.

What most compilers do is generate an AST first from the language we are compiling from.

We might have some code that only cares about taking our Ruby code and turning it into an AST. Now if the Ruby syntax changes, we only need to make some minor tweaks to continue to generate the AST.

Another part of our code can then worry about reading the AST and generating the JavaScript for it. This way we don’t actually care what the original code was, because it is now being represented in an Abstract way.

Outside of compilers I haven’t seen too many people creating ASTs of their code, most programming languages we write are easy enough to understand that going abstract with them wouldn’t serve a huge purpose.

Examples

Lets turn some code into abstract syntax trees.

When we were talking about the word Abstract I shared this bit of code:

  5 + (10 * 2)

This might end up looking like this as an Abstract Syntax Tree:

Note how we don’t care about the brackets, because what we’re saying here (reading from the root node) is:

The first thing we’re doing is adding something together.
The left side of our addition is the number 5.
The right side of our addition is the result of multiplying something.
The left side of our multiplication is 10.
The right side of our multiplication is 2.

Let’s say we had the following bit of code:

  if number > 5
    return 'Bigger than 5'
  else
    return 'Not bigger than 5'

What we are saying here is if a number is bigger than five it will return the string “Bigger than 5” and if it isn’t it will return “Not bigger than 5”. Not a particularly exciting bit of code but let’s look at what that could look like as an AST:

Here you can see an example of a node with three children, we have one representing what the if statement looks for and then two representing if that statement was true or false.

It is worth noting that you could ask 5 people to come up with an AST for some code and you could get 5 different representations. As long as the code that interprets the AST knows what to look for there can be personal preference at play.

Other Terms

When you’ve been reading related articles or hearing different things around the topic of ASTs you may well have bumped into these terms:

Syntax tree – Normally when you hear the term “Syntax Tree” you can assume people are talking about an “abstract syntax tree”.
Concrete syntax tree – This is a more formal version of our abstract syntax tree and would include representations of literally everything written in the source file (parentheses, semicolons, the lot). These get abbreviated to CSTs.
Parse tree – A “Parse tree” is just another name for “Concrete syntax tree”.

Questions?

I hope you feel more confident about what abstract syntax trees are now. If you still have questions, please leave a comment below and I will get right on it.

This is part of our Simple CS series, where we explain Computer Science and Web Development terms in really simple language. Excellent for beginners or if you need a quick refresher.

Access to the series is completely free, if you have found it useful we would really appreciate it if you could let people know about the project.

If there is a term you would like me to cover please drop us an email.