Just so that I don’t link to this randomly in the middle of the post, the project I will be talking about can be viewed on GitHub, and you can download the thesis (it’s in Serbian, sorry) here.

Back in university (which, scarily enough, was roughly 10 years ago), a few other students and myself were introduced to the glorious programming language under the name of Brainfuck. In case you are not familiar with it, Brainfuck is an esoteric language (which Google further describes as “intended for or likely to be understood by only a small number of people with a specialized knowledge or interest“) consisting of only 8 very simple command, yet one that is Turing-complete. To give you a brief example, this is what a “Hello, World!” program looks like in Brainfuck.

++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.

As you can see, it’s pretty messed up and near-impossible to understand. But, before I go off on a tangent, let me continue with my story.

Even though Brainfuck didn’t have any real use, one of my friends and I kept toying with the idea of applying it in some situation, just for the challenge and for the lols. For example, we were both involved in organizing programming contests for high school students, i.e. coming up with problems, and injecting some Brainfuck in there (granted, while omitting the probably-frowned-upon-by-all-professors name) kept being brought up, but nothing came out of it. Eventually, we graduated and got jobs; history became legend, and legend became myth…

That is, up until a year ago, when yours truly decided that it would be a phenomenal idea (Narrator: “It was not.“) to revisit all of this and incorporate Brainfuck into his Master’s thesis – thus, I present to you: “Transpiling C into Brainfuck.”

Brainfuck

Before I get into the whys and the hows, let me give you a brief description of Brainfuck – and given its simplicity, it will be really brief.

Brainfuck consists of 8 commands, and each of them is represented by 1 character, so any other character is ignored. These characters allow the program to operate on an infinite memory “tape” which consists of 1-byte cells, and these operations are:

  • moving the cell pointer left or right (initially, the pointer is on the 1st cell, which doesn’t matter much given the infinite space),
  • incrementing/decrementing the value in the current cell by 1 (initially, they are all 0),
  • reading a character into the current cell or writing it out (by working with ASCII values), and
  • running a while loop that repeats as long as the value of the current cell is non-zero.

In addition to this… Err, well, nothing – that’s the entire language. Onto the thesis!

Idea

Like I said – and, more importantly, like Wikipedia said – Brainfuck is Turing-complete, which in theory means that you can write any program by using only these 8 commands. Impossible, you say? No, just quite difficult. I didn’t want to solve this entire problem, as I would never graduate, but I did want to demonstrate that this was true (i.e. do so empirically) by creating a transpiler for a subset of C functionality into Brainfuck.

You can think of this in terms of assembly languages, because they are very simple, yet are used at the low levels to represent significantly more complex programs written in high-level languages such as Java, C++, etc. However, even these languages have certain functionalities that we take for granted, yet do not have in Brainfuck – for example, the latter doesn’t allows us to do arithmetic operations, move values around the memory, or jump to certain parts of code, all of which you would agree are extremely necessary for any sensible code.

Given that state of the world, the approach I decided to take was to pick a subset of C functionality that one would consider core, break those into smaller building blocks as far as I could and get to Brainfuck-supported operations, and then build a transpiler by going in the opposite direction.

The Logic

Let’s dig deeper into what breaking functionality into smaller blocks entails, as well as what functionality we want to break down. C, while not particularly widely used anymore, is a very powerful language, and as such has a lot of complex functionality, with people’s favorite one to hate usually being pointers. Since I didn’t want to support the entire language, I decided to tackle the following subset:

  • char variables and assigning a value,
  • arithmetical operations,
  • boolean logic,
  • simple versions of printf() and scanf(),
  • if/else, and
  • for, while, and do…while loops.

(I did not implement support for functions, but I will touch on them in the “Going the Extra Mile” section.)

These operations have roughly increasing complexity, and we’ll oftentimes use ones which we’ve “solved” to solve the next ones.

Variables

Tackling variables is pretty simple, as we only need to assign every variable that we encounter a location in the memory. As long as we have full control and information over what is in the memory – and we should have – we can assign the next unused cell to a variable, and then move the pointer to it when its value needs to be used. Assigning a value is also pretty simple, as we just need to increment the cell’s value N times (i.e. use the “+” Brainfuck operation) to get it to have the value N.

However, assigning it another variable’s value is not as straight-forward, because we only have one memory pointer. To go around this, we’ll need the following two concepts:

  1. Moving values. By utilizing a loop, we can move a value from one cell to another – until the value is 0, decrement it by one, move the pointer X cells to the right (or left), increment the value by one, and then move the pointer back X cells to the left. Such a loop will always start and end on the original cell, which allows us to check the condition, and will in the end leave us with the aforementioned value moved X cells to the right.
  2. Copying values. Now that we know how to move values, it’s clear that we can also make multiple copies while moving it – simply use the same loop, but instead of incrementing just one new cell, move to another cell and increment that one as well. The result is the original cell having a value of 0, and two new cells have two copies of the original value. At this point, we just need to move the value from one of them into the original cell (using the same process), and we’ve accomplished copying a value.

Now that we can copy values, how do we assign the value of one variable to another? Well, simply by copying the value from the former’s cell to the latter’s.

If that is somewhat hard to visualize, another convenient way to represent it is by using operations which map directly to Brainfuck operations, i.e. incrementing, decrementing, and a simple while loop:

// b = a;
while (a) {
  a--; b++; c++;
}
while (c) {
  a++; c--;
}

Arithmetic operations

In the interest of not going too deep into how Brainfuck would behave, as well as making use of the “building blocks” approach, I’ll explain the rest just by using pseudocode which we’ve already explained. For example, this is how addition, subtraction, and multiplication would work:

// c = a + b;
c = a; bTemp = b;
while (b) {
  c++;
  b--;
}
b = bTemp;

// c = a - b;
c = a; bTemp = b;
while (b) {
  c--;
  b--;
};
b = bTemp;

// c = a * b;
aTemp = a;
while (a) {
  c += b;
  a--;
}
a = aTemp;

Integer division, however, is a bit more problematic, because we need to do some comparison:

// c = a / b;
while (a >= b) {
  c++;
  a -= b;
}

This is not aligned with our rule of only using things we had explained before, but we will get there a bit later, and do so without using division (as that would create a circular dependency).

Branching

In order to perform if/else branching, we can make use of the while loop, as long as we have the condition in a single variable – the trick is to set the condition to zero as soon as we enter the first branch, as well as change the flag telling us not to go into the second branch.

// if (a) { IF_BLOCK } else { ELSE_BLOCK }

ifCondition = a;
elseCondition = 1;

while (ifCondition) {
  ifCondition = 0;
  elseCondition = 0;
  IF_BLOCK
}

while (elseCondition) {
  elseCondition = 0;
  ELSE_BLOCK
}

Boolean logic

It’s getting fun, isn’t it? So let’s talk about equality – two numbers are equal if subtracting one from the other yields zero. But first, how do we check if a value is zero?

// result = (a == 0);
if (a) {
  result = 1;
} else {
  result = 0
};

Now that we have that, we can check for equality:

// result = (a == b);
c = a - b;
result = (c == 0);

What about negation?

// result = !a;
if (a) { result = 0; } else { result = 1; }

We can perform AND and OR by using two if checks:

// result = a && b;
result = 0;
if (a) {
  if (b) {
    result = 1;
  }
}

// result = a || b;
result = 0;
if (a) {
  result = 1;
}
if (b) {
  result = 1;
}

With all of this, we just need a single comparison operation (e.g. less than), and we’ll be able to build all the other ones. This operation can be implemented in the following way (a is smaller than b if it reaches zero first), and that brings us to having full boolean and comparison logic without using any complex arithmetics.

// result = (a < b);
while (a != 0 && b != 0) {
  a--;
  b--;
}
result = a == 0 && b != 0;

Loops

Given all the things that we already have, loops are pretty easy, since both a for loop and a do…while loop can be replaced with a regular while loop and some additional conditions.

Input and Output

In my paper, I didn’t go too deep into supporting printf() and scanf(), instead choosing to focus on working only with characters and numbers which can fit into a single memory cell. The former is very easy, as it is supported in Brainfuck, whereas for the latter we have to do some looping – we divide and mod the value by 10 until we reach the first digit, which we print out, and then repeat the process with the remained of the number until we run out of digits. For input, characters are once again easy, and for numbers we just need to read digits and multiply them by 10 as long there are more digits to read.

The Implementation

All of this looks good on paper, but there are some additional complexities when it comes to actually implementing it. Namely, for this transpiler to work properly, it needs to be able to read C code and represent it in some data structure over which we could iterate.

Luckily, this is not something I had to implement myself, as there are various existing solutions. Back in university, we used LEX and YACC (Lexical Analyzer Generator and Yet Another Compiler Compiler), but I had a hard time setting them up (and, to be honest, understanding them again), so instead I went for a more modern solution under the name of ANTLR (Another Tool for Language Recognition).

ANTLR is an open-source project that does exactly this, can do it in various languages (the masochist in me decided to write the implementation in C++), and has predefined grammars for a bunch of other languages, including C. All I had to do was set it up, get the grammar, and then reduce it drastically in order to support only the subset of commands which I was interested in.

Once all of that was functional, running the program backed by ANTLR would produce an abstract syntax tree (i.e. AST), which is a tree representing the different blocks of the program. For example, the root node would be the entire program, it’s children would be global variable declarations and the main function, latter of which would then further be split into the function name, parameters, and body, the body would be split into commands, etc. The leaves of this tree would be individual tokens, which are either symbols with a certain meaning (e.g. “=”, “+”, “;”, etc.) or reserved words (e.g. “char”, “if”, “while”, etc.). We don’t need to break these down as getting separate characters wouldn’t give us any additional value – this is exactly the job of the lexical analyzer, as it observes a stream of characters and finds those tokens (with syntax analysis later determining if those tokens are grouped together in a correct way).

In order to make use of this tree, running ANTLR – not the transpiler itself, but rather the code which analyzes the grammar – creates various classes and placeholder code which represent the actions that need to be performed for each node type that exists in the tree. For me, that meant implementing the Brainfuck code that would need to be outputted for each line of the C program, which goes back to the process described earlier in this post, i.e. implementing the simplest ones by actually printing Brainfuck commands, and then use those to build out the rest. If you end up looking at the source code on GitHub, all of this logic lives in BrainfuckCompiler.cpp.

On top of this, I also created a simple testing framework that would, for specific input and C code, run the transpiler and then run the Brainfuck code in order to compare the output.

Going the Extra Mile

At this point, I had already spent quite a bit of time, and the date of my presentation was fast approaching, so I chose not to implement anything else. However, there were two notable things that were missing from this implementation and that would be needed for any serious program.

The first one of those were variable types which would require more than 1 byte of storage, i.e. anything beyond char. To support this, we need to allocate multiple memory cells to each such variable, and then rely on primary school knowledge for arithmetic operations, except we wouldn’t be operating in base-10, but rather base-256, because we can consider each memory cell a single digit and observe overflow when performing operations with it.

The second, far more complex one, were functions. As I mentioned at the beginning, Brainfuck doesn’t have support for jumping to different parts of the code (in the form of a GOTO command), which makes functions very difficult to implement – not impossible, however! One way in which we could implement support for jumping throughout the code is through grouping blocks of code inside if branches and assigning each block of code a unique ID which we would use to target it. For example:

blockID = 1;

while (blockID) {
  if (blockID = 1) {
    MAIN_FUNCTION
    if (someConditionToCallTheHelperFunction) {
      blockID = 2;
    } else {
      blockID = 0;
    }
  }
  if (blockID = 2) {
    HELPER_FUNCTION;
    blockID = 0;
  }
}

Here, we loop the entire program until some part of the code sets the block ID to 0 (equivalent to returning 0 from the main function), and otherwise set the block ID to a specific value when we want to execute that function. We’d also have to keep an eye on where execution should resume once a function returns, but that can be solved by using a similar approach (e.g. with a returnBlockID).

Next Steps

It took me a while to find time to actually write this post, but there are some other things I’d love to do before considering the project done.

First of all, I’d want to make this easier to use, and what better way to do so than to rewrite/reorganize the code in Javascript and make it a web app (luckily, ANTLR has a Javascript runtime, too).

Secondly, because I didn’t care much about performance and instead focused on proving the concept, there are certain inefficiencies which could be addressed, such as memory management or optimizing arithmetic operations, for which I have a couple of ideas.

Finally, the professor who was my mentor for this thesis proposed publishing this work as an actual paper – the work might be a bit too playful and clearly not particularly useful, but at least it should be an interesting read.

So, wish me luck, and stay tuned for those!