
How to Approach Writing an Interpreter From Scratch

A back-end expert, Sakib is the creator of the static site generator Hepek. Always learning, he writes tutorials in English and Bosnian.

Some say that “everything boils down to ones and zeros”—but do we really understand how our programs get translated into those bits?

Compilers and interpreters both take a raw string representing a program, parse it, and make sense of it. Though interpreters are the simpler of the two, writing even a very simple interpreter (that only does addition and multiplication) will be instructive. We’ll focus on what compilers and interpreters have in common: lexing and parsing the input.

The Dos and Don’ts of Writing Your Own Interpreter

Readers may wonder, “What’s wrong with a regex?” Regular expressions are powerful, but source code grammars aren’t simple enough to be parsed by them. Neither are domain-specific languages (DSLs), and a client might need a custom DSL for authorization expressions, for example. But even without applying this skill directly, writing an interpreter makes it much easier to appreciate the effort behind many programming languages, file formats, and DSLs.

Writing parsers correctly by hand can be challenging because of all the edge cases involved. That’s why there are popular tools, like ANTLR, that can generate parsers for many popular programming languages. There are also libraries called parser combinators that enable developers to write parsers directly in their preferred programming languages. Examples include FastParse for Scala and Parsec for Haskell.

We recommend that, in a professional context, readers use such tools and libraries to avoid reinventing the wheel. Still, understanding the challenges and possibilities of writing an interpreter from scratch will help developers leverage such solutions more effectively.

Overview of Interpreter Components

An interpreter is a complex program, so there are multiple stages to it:

  1. A lexer is the part of an interpreter that turns a sequence of characters (plain text) into a sequence of tokens.
  2. A parser, in turn, takes a sequence of tokens and produces an abstract syntax tree (AST) of a language. The rules by which a parser operates are usually specified by a formal grammar.
  3. An interpreter is a program that interprets the AST of the source of a program on the fly (without compiling it first).

We won’t build a specific, integrated interpreter here. Instead, we’ll explore each of these parts and their common issues with separate examples. In the end, the user code will look like this:

val input = "2 * 7 + 5"
val tokens = Lexer(input).lex()
val ast = Parser(tokens).parse()
val res = Interpreter(ast).interpret()
println(s"Result is: $res")

Following the three stages, we’ll expect this code to calculate a final value and print Result is: 19. This tutorial happens to use Scala because it’s:

  • Very concise, fitting a lot of code in one screen.
  • Expression oriented, with no need for uninitialized/null variables.
  • Type safe, with a powerful collections library, enums, and case classes.

Specifically, the code here is written in Scala 3 optional-braces syntax (a Python-like, indentation-based syntax). But none of the approaches are Scala-specific, and Scala is similar to many other languages: Readers will find it straightforward to convert these code samples into other languages. Barring that, the examples can be run online using Scastie.

Lastly, the Lexer, Parser, and Interpreter sections have different example grammars. As the corresponding GitHub repo shows, the dependencies in later examples change slightly to implement these grammars, but the overall concepts stay the same.

Interpreter Component 1: Writing a Lexer

Let’s say we want to lex this string: "123 + 45 true * false1". It contains different types of tokens:

  • Integer literals
  • A + operator
  • A * operator
  • A true literal
  • An identifier, false1

Whitespace between tokens will be skipped in this example.

At this stage, expressions don’t have to make sense; the lexer simply converts the input string into a list of tokens. (The job of “making sense of tokens” is left to the parser.)

We’ll use this code to represent a token:

case class Token(
  tpe: Token.Type,
  text: String,
  startPos: Int
)

object Token:
  enum Type:
    case Num
    case Plus
    case Times
    case Identifier
    case True
    case False
    case EOF

Every token has a type, textual representation, and position in the original input. The position can help end users of the lexer with debugging.

The EOF token is a special token that marks the end of input. It doesn’t exist in the source text; we only use it to simplify the parser stage.

This will be the output of our lexer:

Lexing input:
123 + 45 true * false1

Tokens:
List(
  Token(tpe = Num, text = "123", startPos = 0),
  Token(tpe = Plus, text = "+", startPos = 4),
  Token(tpe = Num, text = "45", startPos = 6),
  Token(tpe = True, text = "true", startPos = 9),
  Token(tpe = Times, text = "*", startPos = 14),
  Token(tpe = Identifier, text = "false1", startPos = 16),
  Token(tpe = EOF, text = "<EOF>", startPos = 22)
)

Let’s examine the implementation:

import scala.collection.mutable

class Lexer(input: String):
  import Token.Type

  def lex(): List[Token] =
    val tokens = mutable.ArrayBuffer.empty[Token]
    var currentPos = 0
    while currentPos < input.length do
      val tokenStartPos = currentPos
      val lookahead = input(currentPos)
      if lookahead.isWhitespace then
        currentPos += 1 // ignore whitespace
      else if lookahead == '+' then
        currentPos += 1
        tokens += Token(Type.Plus, lookahead.toString, tokenStartPos)
      else if lookahead == '*' then
        currentPos += 1
        tokens += Token(Type.Times, lookahead.toString, tokenStartPos)
      else if lookahead.isDigit then
        var text = ""
        while currentPos < input.length && input(currentPos).isDigit do
          text += input(currentPos)
          currentPos += 1
        tokens += Token(Type.Num, text, tokenStartPos)
      else if lookahead.isLetter then // first must be letter
        var text = ""
        while currentPos < input.length && input(currentPos).isLetterOrDigit do
          text += input(currentPos)
          currentPos += 1
        val tpe = text match
          case "true"  => Type.True // special casing literals
          case "false" => Type.False
          case _       => Type.Identifier
        tokens += Token(tpe, text, tokenStartPos)
      else
        error(s"Unknown character '$lookahead' at position $currentPos")

    tokens += Token(Type.EOF, "<EOF>", currentPos) // special end marker
    tokens.toList

We start with an empty list of tokens, then we go through the string and add tokens as they come.

We use the lookahead character to decide the type of the next token. Note that the lookahead character is not always the furthest character ahead being examined. Based on the lookahead, we know what the token looks like and use currentPos to scan all of the expected characters in the current token, then add the token to the list:

If the lookahead is whitespace, we skip it. Single-character tokens like + and * are trivial: We add them and increment the index. For integer literals, we keep consuming digits (incrementing currentPos) until we hit a non-digit, then add the accumulated text as one Num token.

Now we get to something a bit complicated: identifiers versus literals. The rule is that we take the longest possible match and check if it’s a literal; if it’s not, it’s an identifier.

Take care when handling operators like < and <=. There you have to look ahead one more character and see if it’s = before concluding that it’s a <= operator. Otherwise, it’s just a <.
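As a minimal standalone sketch (not part of this article’s lexer), that extra character of lookahead could look like the following. The helper function and its name are hypothetical; it returns the token text and the position just past it:

```scala
// Deciding between '<' and '<=' requires peeking one character past
// the lookahead. Precondition (assumed): input(pos) == '<'.
def lexRelOp(input: String, pos: Int): (String, Int) =
  if pos + 1 < input.length && input(pos + 1) == '=' then
    ("<=", pos + 2) // consume both characters
  else
    ("<", pos + 1)  // a lone '<'
```

Applied to `"a <= b"` at position 2, this consumes two characters; applied to `"a < b"`, only one.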

With that, our lexer has produced a list of tokens.

Interpreter Component 2: Writing a Parser

We have to give some structure to our tokens—we can’t do much with a list alone. For example, we need to know:

  • Which expressions are nested?
  • Which operators are applied, and in what order?
  • Which scoping rules apply, if any?

A tree structure supports nesting and ordering. But first, we have to define some rules for constructing trees. We would like our parser to be unambiguous—to always return the same structure for a given input.

Note that the following parser doesn’t use the previous lexer example. This one is for adding numbers, so its grammar has only two tokens, '+' and NUM:

expr -> expr '+' expr
expr -> NUM

An equivalent, using a pipe character (|) as an “or” symbol like in regular expressions, is:

expr -> expr '+' expr | NUM

Either way, we have two rules: one that says that we can sum two exprs and another that says that expr can be a NUM token, which here will mean a nonnegative integer.

Rules are usually specified with a formal grammar. A formal grammar consists of:

  • The rules themselves, as shown above
  • A starting rule (the first rule specified, per convention)
  • Two types of symbols to define the rules with:
      ◦ Terminals: the “letters” (and other characters) of our language, the irreducible symbols that make up tokens
      ◦ Nonterminals: intermediate constructs used for parsing (i.e., symbols that can be replaced)

Only a nonterminal can be on the left side of a rule; the right side can have both terminals and nonterminals. In the example above, the terminals are '+' and NUM, and the only nonterminal is expr. For a wider example, in the Java language, we have terminals like 'true', '+', Identifier, and '[', and nonterminals like BlockStatements, ClassBody, and MethodOrFieldDecl.

There are many ways we could implement this parser. Here, we’ll use a “recursive descent” parsing technique. It’s the most common type because it’s the simplest to understand and implement.

A recursive descent parser uses one function for each nonterminal in the grammar. It starts from the starting rule and goes down from there (hence “descent”), figuring out which rule to apply in each function. The “recursive” part is vital because we can nest nonterminals recursively! Regexes can’t do that: They can’t even handle balanced parentheses. So we need a more powerful tool.

A parser for the first rule would look something like this (full code):

def expr() = 
  expr()
  eat('+')
  expr()

The eat() function checks if the lookahead matches the expected token and then moves the lookahead index. Unfortunately, this won’t work yet because we need to fix some problems with our grammar.

Grammar Ambiguity

The first issue is the ambiguity of our grammar, which may not be apparent at first glance:

expr -> expr '+' expr | NUM

Given the input 1 + 2 + 3, our parser could choose to calculate either the left expr or the right expr first in the resulting AST:

[Figure: Left-handed and right-handed ASTs. Both trees derive the same input, but one nests the larger subtree under its left expr, and the other under its right expr.]

That’s why we need to introduce some asymmetry:

expr -> expr '+' NUM | NUM

The set of expressions we can represent with this grammar hasn’t changed since its first version. Only now it is unambiguous: The parser always goes left. Just what we needed!

This makes our + operation left-associative, as will become apparent when we get to the Interpreter section.

Left-recursive Rules

Unfortunately, the above fix doesn’t solve our other problem, left recursion:

def expr() =
  expr()
  eat('+')
  eat(NUM)

We have infinite recursion here. If we were to step into this function, we’d eventually get a stack-overflow error. But parsing theory can help!

Suppose we have a grammar like this, where alpha could be any sequence of terminals and nonterminals:

A -> A alpha | B

We can rewrite this grammar as:

A   -> B A'
A'  -> alpha A' | epsilon

There, epsilon is an empty string—nothing, no token.

Let’s take the current revision of our grammar:

expr -> expr '+' NUM | NUM

Following the method of rewriting parsing rules detailed above, with alpha being our '+' NUM tokens, our grammar becomes:

expr    -> NUM exprOpt
exprOpt -> '+' NUM exprOpt | epsilon

Now the grammar is OK, and we can parse it with a recursive descent parser. Let’s see how such a parser would look for this latest iteration of our grammar:

class Parser(allTokens: List[Token]):
  import Token.Type
  
  private var tokens = allTokens
  private var lookahead = tokens.head
  
  def parse(): Unit = 
    expr()
    if lookahead.tpe != Type.EOF then
      error(s"Unknown token '${lookahead.text}' at position ${lookahead.startPos}")

  private def expr(): Unit =
    eat(Type.Num)
    exprOpt()
  
  private def exprOpt(): Unit =
    if lookahead.tpe == Type.Plus then
      eat(Type.Plus)
      eat(Type.Num)
      exprOpt()
    // else: end recursion, epsilon
  
  private def eat(tpe: Type): Unit =
    if lookahead.tpe != tpe then
      error(s"Expected: $tpe, got: ${lookahead.tpe} at position ${lookahead.startPos}")
    tokens = tokens.tail
    lookahead = tokens.head

Here we use the EOF token to simplify our parser. We are always sure that there is at least one token in our list, so we don’t need to handle a special case of an empty list.

Also, if we switch to a streaming lexer, we wouldn’t have an in-memory list but an iterator, so we need a marker to know when we come to the end of the input. When we come to the end, the EOF token should be the last token remaining.
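A minimal sketch of that streaming setup, under the assumption that the iterator always ends with an EOF token (the StreamingEater name is hypothetical; Token mirrors the article’s definition, with just the types this sketch needs):

```scala
// Token mirrors the article's definition, trimmed to what this sketch uses.
case class Token(tpe: Token.Type, text: String, startPos: Int)
object Token:
  enum Type:
    case Num, Plus, EOF

// A parser front end over a streaming lexer: it keeps only the current
// lookahead and pulls the next token lazily. Because the stream is assumed
// to end with EOF, next() is safe to call until EOF has been consumed.
class StreamingEater(tokens: Iterator[Token]):
  private var lookahead: Token = tokens.next()

  def peek: Token = lookahead

  def eat(tpe: Token.Type): Token =
    if lookahead.tpe != tpe then
      sys.error(s"Expected: $tpe, got: ${lookahead.tpe}")
    val consumed = lookahead
    if consumed.tpe != Token.Type.EOF then
      lookahead = tokens.next() // pull the next token only when needed
    consumed
```

The in-memory List[Token] version in this article is just the special case where the whole stream is materialized up front.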

Walking through the code, we can see that an expression can be just a number. If there’s nothing left, the next token wouldn’t be a Plus, so we would stop parsing. The last token would be EOF, and we would be done.

If the input string has more tokens, then these would have to look like + 123. That’s where recursion on exprOpt() kicks in!

Generating an AST

Now that we’ve successfully parsed our expression, it’s still hard to do anything with it as is. We could put some callbacks in our parser, but that would be very cumbersome and unreadable. Instead, we will return an AST, a tree representing the input expression:

case class Expr(num: Int, exprOpt: ExprOpt)

enum ExprOpt:
  case Opt(num: Int, exprOpt: ExprOpt)
  case Epsilon

This resembles our rules, using simple data classes.

Our parser now returns a useful data structure:

class Parser(allTokens: List[Token]):
  import Token.Type
  
  private var tokens = allTokens
  private var lookahead = tokens.head
  
  def parse(): Expr = 
    val res = expr()
    if lookahead.tpe != Type.EOF then
      error(s"Unknown token '${lookahead.text}' at position ${lookahead.startPos}")
    else
      res

  private def expr(): Expr =
    val num = eat(Type.Num)
    Expr(num.text.toInt, exprOpt())
  
  private def exprOpt(): ExprOpt =
    if lookahead.tpe == Type.Plus then
      eat(Type.Plus)
      val num = eat(Type.Num)
      ExprOpt.Opt(num.text.toInt, exprOpt())
    else
      ExprOpt.Epsilon

For eat(), error(), and other implementation details, please see the accompanying GitHub repo.
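To make the returned structure concrete, here is the AST this parser produces for the input 1 + 2 + 3, hand-constructed for illustration using the same case classes as above:

```scala
case class Expr(num: Int, exprOpt: ExprOpt)
enum ExprOpt:
  case Opt(num: Int, exprOpt: ExprOpt)
  case Epsilon

// "1 + 2 + 3": a leading NUM, then a chain of Opt nodes
// (one per "+ NUM"), terminated by Epsilon.
val ast = Expr(1, ExprOpt.Opt(2, ExprOpt.Opt(3, ExprOpt.Epsilon)))
```

The right-leaning chain of Opt nodes mirrors the recursion in exprOpt().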

Simplifying Rules

Our ExprOpt nonterminal can still be improved:

'+' NUM exprOpt | epsilon

It’s hard to recognize the pattern it represents in our grammar just by looking at it. It turns out that we can replace this recursion with a simpler construct:

('+' NUM)*

This construct simply means '+' NUM occurs zero or more times.

Now our full grammar looks like this:

expr  -> NUM exprOpt*
exprOpt -> '+' NUM

And our AST looks nicer:

case class Expr(num: Int, exprOpts: Seq[ExprOpt])
case class ExprOpt(num: Int)

The resulting parser is the same length but simpler to understand and use. We’ve eliminated Epsilon, which is now implied by starting with an empty structure.

We didn’t even need the ExprOpt class here. We could have just put case class Expr(num: Int, exprOpts: Seq[Int]), or in grammar format, NUM ('+' NUM)*. So why didn’t we?

Consider that if we had multiple possible operators, like - or *, then we’d have a grammar like this:

expr  -> NUM exprOpt*
exprOpt -> [+-*] NUM

In this case, the AST then needs ExprOpt to accommodate the operator type:

case class Expr(num: Int, exprOpts: Seq[ExprOpt])
case class ExprOpt(op: String, num: Int)

Note that the [+-*] syntax in the grammar means the same thing as in regular expressions: “one of those three characters.” We’ll see this in action soon.
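For instance, the input 7 - 2 * 3 under this flat [+-*] grammar would produce the following AST (hand-constructed here; note that there is no precedence yet, so the pairs are simply collected left to right):

```scala
case class Expr(num: Int, exprOpts: Seq[ExprOpt])
case class ExprOpt(op: String, num: Int)

// "7 - 2 * 3": one leading NUM, then each (operator, NUM)
// pair in order of appearance.
val ast = Expr(7, Seq(ExprOpt("-", 2), ExprOpt("*", 3)))
```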

Interpreter Component 3: Writing an Interpreter

Our interpreter will make use of our lexer and parser to get the AST of our input expression and then evaluate that AST whichever way we want. In this case, we’re dealing with numbers, and we want to evaluate their sum.

In the implementation of our interpreter example, we will use this simple grammar:

expr  -> NUM exprOpt*
exprOpt -> [+-] NUM

And this AST:

case class Expr(num: Int, exprOpts: Seq[ExprOpt])
case class ExprOpt(op: Token.Type, num: Int)

(We’ve covered how to implement a lexer and a parser for similar grammars, but any readers who get stuck can peruse the lexer and parser implementations for this exact grammar in the repo.)

Now we’ll see how to write an interpreter for the above grammar:

class Interpreter(ast: Expr):

  def interpret(): Int = eval(ast)

  private def eval(expr: Expr): Int =
    var tmp = expr.num
    expr.exprOpts.foreach { exprOpt =>
      if exprOpt.op == Token.Type.Plus
      then tmp += exprOpt.num
      else tmp -= exprOpt.num
    }
    tmp

If we parsed our input into an AST without encountering an error, we’re sure that we’ll always have at least one NUM. Then we take the optional numbers and add them to (or subtract them from) our result.

The note from the beginning about the left associativity of + is now clear: We start from the leftmost number and add others, from left to right. This may seem unimportant for addition, but consider subtraction: The expression 5 - 2 - 1 is evaluated as (5 - 2) - 1 = 3 - 1 = 2 and not as 5 - (2 - 1) = 5 - 1 = 4!
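Here is that subtraction example as a self-contained sketch of the same fold, with plain strings standing in for Token.Type to keep it short:

```scala
case class Expr(num: Int, exprOpts: Seq[ExprOpt])
case class ExprOpt(op: String, num: Int) // "+" or "-"

// The same left-to-right fold as the interpreter above.
def eval(expr: Expr): Int =
  var tmp = expr.num
  expr.exprOpts.foreach { opt =>
    if opt.op == "+" then tmp += opt.num
    else tmp -= opt.num
  }
  tmp

// "5 - 2 - 1" evaluates as (5 - 2) - 1 = 2, not 5 - (2 - 1) = 4.
val result = eval(Expr(5, Seq(ExprOpt("-", 2), ExprOpt("-", 1))))
```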

But if we want to go beyond interpreting plus and minus operators, there’s another rule to define.

Precedence

We know how to parse a simple expression like 1 + 2 + 3, but when it comes to 2 + 3 * 4 + 5, we have a bit of a problem.

Most people agree on the convention that multiplication has higher precedence than addition. But the parser doesn’t know that. We can’t just evaluate it as ((2 + 3) * 4) + 5. Instead we want (2 + (3 * 4)) + 5.

This means that we need to evaluate multiplication first. Multiplication needs to be further from the root of the AST to force it to be evaluated before addition. For this, we need to introduce yet another layer of indirection.

Fixing a Naive Grammar From Start to Finish

This is our original, left-recursive grammar, which has no precedence rules:

expr -> expr '+' expr | expr '*' expr | NUM

First, we give it rules of precedence and remove its ambiguity:

expr    -> expr '+' term | term
term    -> term '*' NUM | NUM

Then it gets non-left-recursive rules:

expr      -> term exprOpt*
exprOpt   -> '+' term
term      -> NUM termOpt*
termOpt   -> '*' NUM

The result is a beautifully expressive AST:

case class Expr(term: Term, exprOpts: Seq[ExprOpt])
case class ExprOpt(term: Term)

case class Term(num: Int, termOpts: Seq[TermOpt])
case class TermOpt(num: Int)

This leaves us with a concise interpreter implementation:

class Interpreter(ast: Expr):

  def interpret(): Int = eval(ast)

  private def eval(expr: Expr): Int =
    var tmp = eval(expr.term)
    expr.exprOpts.foreach { exprOpt =>
      tmp += eval(exprOpt.term)
    }
    tmp

  private def eval(term: Term): Int =
    var tmp = term.num
    term.termOpts.foreach { termOpt =>
      tmp *= termOpt.num
    }
    tmp

As before, the ideas in the requisite lexer and grammar were covered earlier, but readers can find them in the repo if needed.
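To see the precedence machinery end to end, here is a self-contained run of this interpreter on the AST that 2 + 3 * 4 + 5 parses into. The AST is hand-constructed here, standing in for the lexer and parser from the repo:

```scala
case class Expr(term: Term, exprOpts: Seq[ExprOpt])
case class ExprOpt(term: Term)
case class Term(num: Int, termOpts: Seq[TermOpt])
case class TermOpt(num: Int)

class Interpreter(ast: Expr):
  def interpret(): Int = eval(ast)

  private def eval(expr: Expr): Int =
    var tmp = eval(expr.term)
    expr.exprOpts.foreach { exprOpt => tmp += eval(exprOpt.term) }
    tmp

  private def eval(term: Term): Int =
    var tmp = term.num
    term.termOpts.foreach { termOpt => tmp *= termOpt.num }
    tmp

// "2 + 3 * 4 + 5": the multiplication sits inside a Term, one level
// deeper than the additions, so it is evaluated first.
val ast = Expr(
  Term(2, Seq()),
  Seq(
    ExprOpt(Term(3, Seq(TermOpt(4)))), // the 3 * 4 subtree
    ExprOpt(Term(5, Seq()))
  )
)
val result = Interpreter(ast).interpret() // (2 + (3 * 4)) + 5 = 19
```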

Next Steps in Writing Interpreters

We didn’t cover this, but error handling and reporting are crucial features of any parser. As developers, we know how frustrating it can be when a compiler produces confusing or misleading errors. It’s an area that has many interesting problems to solve, like giving correct and precise error messages, not deterring the user with more messages than necessary, and recovering gracefully from errors. It’s up to the developers writing an interpreter or compiler to ensure their future users have a better experience.

In walking through our example lexers, parsers, and interpreters, we only scratched the surface of the theories behind compilers and interpreters, which cover topics like:

  • Scopes and symbol tables
  • Static types
  • Compile-time optimization
  • Static program analyzers and linters
  • Code formatting and pretty-printing
  • Domain-specific languages


Understanding the basics

To create an interpreter, you first need a lexer to get the tokens of your input program. Next, you create a parser that takes those tokens and, by following the rules of a formal grammar, returns an AST of your input program. Finally, the interpreter takes that AST and interprets it in some way.

A compiler takes a higher-level language program and converts it into a lower-level language program. An interpreter takes a program and runs it on the fly. It doesn’t produce any files.

Interpreters can be written in any programming language. Functional languages are a popular choice because they have great abstractions for transforming data.

Interpreters mostly execute one statement at a time. They take an AST returned by a parser and execute it. In that process they typically use some helper constructs like symbol tables, generators, and optimizers.

A lexer takes a string of characters and returns a list of tokens, which are basically grouped characters. Tokens are usually defined with regular expressions.

A programming parser is defined by a formal grammar describing the rules of the language it parses. The most common kind is a recursive descent parser, and it resembles the given grammar, having one function for every nonterminal. It takes a sequence of tokens as input and returns an AST as output.

An abstract syntax tree (AST) is a representation of the structure of the source code of a program. It contains only data that is relevant for the interpreter or compiler. It does not contain whitespace, braces, semicolons, and similar parts of the input program.

An abstract syntax tree is used as an intermediate representation of an input program. The interpreter/compiler can then do whatever it needs to do with it: optimize, simplify, execute, or something else.