Correctly designing a concurrent system is hard.

Edit: case in point: it has been brought to my attention that the algorithm I present below as a “fix” actually leads to an infinite loop. This just goes to show how hard it is, and that it’s important to be aware of what exactly you’re validating (I forgot to check that condition for that program).

Homework for the reader: uncover the bug and fix it.

At work, we recently had an annoying race condition in one of our systems, causing us to sometimes lose data. This brought the reliability of the entire system into question, and it had to be solved.

(Our problem turned out to be in cache invalidation, so I guess it’s true what they say)

## The wetware approach: rack our brains

Luckily, one of our engineers had a hunch.

After theorizing about the possible problem and solutions for about a good hour with three people, and nearly missing an edge case just before closing the meeting, I got scared. Yeah, we just solved a bug, but how could we be sure this was the only one? Maybe there were more race conditions lurking about, just waiting to be triggered.

Race conditions are hard problems, and our human minds just aren’t well-equipped to reason through all the different interleavings of instructions. How could we be sure our solution was finally correct?

I wanted proof.

## Formal Methods to the rescue

So I turned to formal tools. Formal methods sort of have a reputation for being hard and requiring a lot of effort to use. I haven’t met any developer using them, and they seem to be more of an academic curiosity than anything else.

Nonetheless, I was going to spend some effort trying to reproduce our problem and our solution in a formal tool, and seeing how it would go. The worst that could happen, I might learn something.

I picked the model checking tool TLA+. This seemed especially apt since we were having a concurrency issue, and TLA+ is designed by the godfather of distributed systems: Leslie Lamport. In model checking, we write down a (simplified) specification of our program, and the computer will simluate execution of the program, enumerating all possible states the program can be in, all the while checking correctness and liveness.

What I was expecting was to wrestle with this for a good couple of days, but actually this tool chain is so easy to get started with, I got our algorith modeled, the problem detected and the solution verified, all within half a day!

Let me say that again:

TLA+ is fantastically easy to use!

This experience motivated me to write a short tutorial about it, to get more regular programmers educated. This tool is so easy to use and so effective, you can’t afford to not use it in your development cycle somewhere.

## Getting set up

The only thing you need to do is download the TLA+ toolbox, and for good measure, the PlusCal user’s manual. You only need to read through this last document once to be an effective TLA+ user.

TLA+ is actually a logic language, whose possible state space can be searched by the model checker tool, TLC. However, writing your program in TLA+ itself is quite cumbersome.

Instead, we’ll write our program in a simple imperative language called PlusCal, which will then get compiled into TLA+, and that model will be checked. PlusCal is the unique selling point that makes TLA+ so darn comfortable to work with: you’re just typing your program in a normal programming language, albeit slightly stylized… and bam! Guaranteed correctness!

In this tutorial, I’ll skip explaining TLA+ and only do PlusCal, since that’s what us working schmoes are probably going to be most comfortable with.

## Usage of the toolbox

The way of using this translator is a little… let’s say… idiosyncratic. You type your PlusCal program in comments, like so:

(* --algorithm MyAlgo {

...PlusCal program here...

} *)


And then hit Ctrl-T in the editor to translate to TLA+. The output will appear below between \* BEGIN TRANSLATION and \* END TRANSLATION comment lines.

To model check a specification, right-click the specification and select New Model. These are the parameters for the checker. Be sure to only do this after you’ve written and compiled your PlusCal program, so the user interface knows to fill out sensible defaults for some of the values.

## An example program

Let’s look at a simple program that (obviously) has problems, transcribe it into PlusCal, and see if TLC can find the errors.

The program deals with an increment register, backed by a database and cached in an in-memory cache for fast access. A Python version of the program is:

cache = {}

def Inc(id):
x = cache.get(id, None)
if x is None:
cache[id] = x

x = x + 1
DB_Write(id, x)
cache[id] = x


(Can you see a problem with this routine? What about if it’s run in parallel? It might be worthwhile to think about it for a minute.)

Our first order of duty is to transcribe this algorithm into PlusCal. That might look like this:

---------------------- MODULE cacheexample ----------------------
EXTENDS Naturals, TLC, Sequences

CONSTANT N, nil

(* --algorithm CacheExample {

variables
cache = nil;
database = 0;

variables x;
{
t1: if (cache /= nil) {
x := cache;
} else {
t2:     x := database;
t3:     cache := x;
};

t4: x := x + 1;
t5: database := x;
t6: cache := x;
}

}*)
===================================================================


Notice a bunch of things:

• We’ve included some standard modules, Naturals, TLC, Sequences. Just include those everywhere, they contain useful stuff we want.
• We define some constants, N and nil. We’ll need to give the constants a value in the Model before we can run it. N should just be a number, and it will be the number of concurrent processes running our algorithm (for example, 2). nil will be a special type of value, called a model value. Think of a model value like a symbol, it’s a unique value that is equal only to itself. In this case, we use it to signify a special value that means “empty”.
• We define some variables that define the state of our system. Global variables will be shared between all processes. Inside a process, we may also declare variables, which will be private to each process. In fact, PlusCal does a simple translation: local variables just get transformed into a global variable that is an array, indexed with self, which is the process number of the “current” process. Our database and cache are shared between all processes (there will be a single database, and the cache might be in the shared memory of a number of processes).
• We define N processes, which will be running through the algorithm concurrently, in every possible interleaving.
• Notice that all the statements are labeled, t0 through t6. The labels define the atomicity of statements. Multiple statements on a single line will be executed atomically (in this example, the test for cache nilness and reading from the cache). After the translation to TLA+, there is going to be a pc array, for Program Counter, that contains the label that every process is currently at in its execution.

## Abstracting the model

Picking the right level of abstraction, the right set of state variables to represent the model is the interesting part of this exercise. For example, in this case, though my original routine has a parameter id to increment any number of counters, I’m abstracting that away in my model, and only dealing with a single increment register. This is because in this case I’m confident that if it works for one register, it’s going to work for more registers. If I’m trying to check consistenty between multiple registers, then I might need to include that in my model and make the state a bit more complex.

What about the data that’s in a register? Right now, I’ve specified that the data is a number that can only increment by 1, and I’m modeling that in the same way. However, what about when my register can contain string data, but I might not care about the exact contents? In that case, I might still choose to model my data as a number, and have a “version” of the data stand in for the actual data?

Abstracting uninteresting parts of the state away makes the model checker run much quicker, simply because there’s less state to search through.

## Specifying a condition to check

Now we have a specification, but we needs something to check our model against. We need a condition that must be valid at all points in the program: an invariant.

In this case, my condition is going to be that if all N instances programs are done running, the value of the database register is also N (because there have been N increments). We’re going to express this in a TLA+ expression, somewhere below the translation block:

DatabaseOk == (\A i \in 1..N: pc[i] = "Done") => database = N


The condition is that IF all programs are at the label "Done" (a special label that follows every program) THEN the value of the database register equals N (i.e., all updates have been applied).

If we now create a model for this specification and have the model check the invariant, what will happen?

TLC finds a counterexample that violates our invariant, that’s what!

Ant that’s right, if we look at the last state we can see that both processes are in the state "Done", yet the value of the database is only 1 instead of 2, so we lost an increment somewhere.

The trace tells us the values of all variables at each step of the evaluation. It may be a little hard to follow, but the variable pc usually gives enough information: it tells us what steps in the program each process took and how they were interleaved to produce the error. By looking at the steps and the values of the variables that change, we can probably deduce the cause of the error.

It may be hard to read the traces in the minimal UI that the Toolbox gives you. To help a bit, I made a web application where you can paste your program and the trace, and it allows you to step through your program like a normal debugger would. Click here

Note: be sure to clear the TLC trace window before copying the trace out (the window doesn’t clear by itself so otherwise you’’ be copying an old trace)

## The bug

In this case, you may have spotted the bug already from the very first listing.

We obviously have a problem if the two processes have the following access pattern:

• Update
• Update
• Write
• Write

The second write clobbers the first one.

## The solution

What’s supposed to happen in this case, of course, is that the second operation should not go through while the first one is in progress. There are two solutions to this: pessimistic concurrency (locking) and optimistic concurrency (compare-and-swap).

Orchestrating exclusive access to a database seems cumbersome. Instead, we’ll use a facility that some key-value stores have, called a Conditional Write. This allows us to achieve Compare-and-Swap semantics.

Our (hopefully) improved code looks like this:

def Inc(id):
x = cache.get(id, None)
if not x:
cache[id] = x

y = x + 1
if DB_ConditionalWrite(id, x, y):
cache[id] = y
else:
Inc(id)


Note that I’ve cheekily used recursion here to retry the operation if the conditional write fails. This puts you at risk of a StackOverflow, so you wouldn’t do this in real code, but I just enjoyed writing it this way.

The corresponding PlusCal code like this (which does use iteration):

variables
cache = nil;
database = 0;

variables x; y; version;
{
t0: if (cache /= nil) {
x := cache;
} else {
t1:     x := database;
t2:     cache := x;
};

t3: y := x + 1;
t4: if (database = x) {
database := y;
} else { goto t0; }; (* Retry *)
t5: cache := y;
}


If we run this code again, we’ll see that hurray: the checker says it has finished examining the model, and there have been no errors. That’s great, we’ve found and fixed a tricky race condition bug in our code!

Do you think this code is correct now?

## Another wrinkle: interprocess concurrency

From the PlusCal code, you can tell we’re modeling threads right now. Why? Because the cache is modeled as a single global variable, shared between all processes. What if we wanted to model separate processes?

In that case, all we’d have to do is move the cache variable into the process as a local variable, ensuring that every process has its own copy.

If you look really carefully, you’ll see that I’ve also changed the keyword process into fair process. This is a slight cheat, as I know that there’s another kind of property I’m interested in: Termination. Check that checkbox on the TLA What to check? tab:

This is going to check for termination, whose definition is:

Termination == <>(\A i \in 1..N: pc[i] = "Done")


In other words: at some point in time it should be TRUE that all processes are in the ‘Done’ state.

In other other words: all processes should terminate.

If we check our program with the per-process cache, we’ll see that this property is violated. Can you figure out why?

.

.

.

It’s because in the program with the shared cache, the losing process was relying on the winning process to eventually update the cache, so that the loser’s write could proceed. Since we don’t have a shared cache anymore, this will never happen, and the loser will loop forever.

The fix is, of course, to have the lower blow its cache when it detects a stale value. I’ll let you think of the proper fix, and test it out in TLA+!

## Final Thoughts

The TLA+ toolkit is just ridiculously easy to get started with. PlusCal makes it effortless to write down program specifications, and TLA+ can be applied to any kind of program. Of course, it really shines when checking concurrency issues.

I would like to address the most common objection that I hear about applying this: *“What if my model doesn’t match my code?”.

Well, what about that is different than if you had a model in your head, but you accidentally typed something else? At least with formal methods, you know for a fact that you model works, and you have something tangible to compare your code against.

Having a proven model in software is strictly better than having a tentative model in wetware.

Go use TLA+ today! You won’t be sorry.