Executive summary: asynchronous beats synchronous.

I’ve recently had to design and implement a message bus in a componentized system. The goal of the message bus was to completely abstract away components from each other: the only interaction happens by sending messages to the message bus (and by subscribing to messages on the message bus). Because of complete isolation, components can be replaced transparently; not to mention the benefits in the areas of tracing, debugging and simulation you get by being able to record and replay all communication between components.

Communication Types

The message bus needed to support two types of messages:

  • Publish/Subscribe: just fire away messages on the message bus. Other components will subscribe to messages that interest them, and handle them when they occur.
  • Request/Response: sometimes, the sender of a message may want to wait with its next operation until it knows the message has been handled (and whether some operation that got triggered by that message succeeded or failed). This corresponds to request/response semantics, and can easily be implemented on top of publish/subscribe semantics, just by doing some extra work on the sender’s part by matching outgoing and incoming messages.

Note that in this type of request/response, the sender still doesn’t know which component handled the message, so it is a form of “untargeted RPC”.

Ultimately, components are a collection of “message handling routines” grouped with a set of (private) state variables, invoked by the messaging runtime at the appropriate time. This document describes the considerations for the threading model of the messaging runtime (whether the system should be single or multi threaded) and whether we should allow a synchronous API for request/response style exchanges (publish/subscribe exchanges are trivially asynchronous so they do not come into play).

Principle: Reentrancy

The most important consideration I’ll take into account while evaluation the alternatives is that of reentrancy. This can be explained as follows:

The system is reentrant if a component can be asked to handle a message B while it is still handling a message A. To make a reentrant system work, all of its components have to be reentrant, i.e., be able to deal with this occurrence.

Making a component reentrant is harder than if the system guarantees that a component will not be asked to handle two messages at the same time. Absence of reentrancy has the following benefits:

  • Handlers can be considered “atomic”; state only needs to be consistent at the start and end of message handlers, even if messages are sent out in between, irrespective of what happens in response to these messages. This means that the correctness of a handler can be guaranteed locally.
  • In a multithreaded context, locking is not required, because at most one thread will be touching the component’s variables at any time.

When can reentrancy occur?

Given that the system is reentrant (i.e., takes no effort to avoid it), in what situations could it actually occur that a component is activated twice?

  • In a single-threaded system, reentrancy can only happen when a component gives away its flow of control (for example, when it sends an outgoing message using the messaging runtime).
  • In a multi-threaded system, a component could potentially be activated at any time – at any time, some other component could be sending it a message, potentially activating it.

A reentrant threading model in a multithreaded system would be the hardest to guarantee (and test) the correctness of, since any message handler could be interrupted between any two instructions, not just at fixed points. Hence, non-reentrancy is a useful property to have.

Threading and synchronicity

The messaging runtime model can be classified along two axes: threading and synchronicity of calls. Both of these influence reentrancy of the system.

Axis 1: Threading

We can either run all components on a single thread, or use multiple threads. The advantage of the first model is simplicity and throughput (because there is no need for context switching). However, potentially long operations on one component block ALL OTHER components from doing work (for example, messages may be blocked while a component is doing some big I/O operation such as writing and fsyncing a file), and concurrency on multiprocessor machines is limited. Using multiple threads solves these issues.

We can implement multi-threading in a number of ways:

  1. Use a thread per component, handle messages on the receiver thread
  2. Use a thread per component, handle messages on the sender thread
  3. Use a thread pool to handle messages, making sure no component is activated more than once simultaneously.

When a component’s handlers can be run on multiple threads, some sort of locking is required (because that would make the handlers reentrant). Hence, option 1 or 3 would be best, where option 1 has the advantage of being simple.

Axis 2: Synchronous Calls

Synchronous calls are convenient from the perspective of the programmer, because it allows him to write in a nice top-to-bottom fashion. However, synchronous calls require that a thread is blocked while it is waiting for an answer.

This is linked to reentrancy and threading: what happens when new messages arrive while the component is waiting for a synchronous response?

If they should be handled (the system is reentrant) then the handling should happen on a different thread than the thread that is currently blocked, and hence locking is required to access the state of the component. This could potentially lead to deadlock.

If the model is not reentrant, messages are queued up for later consumption after the wait has completed. However, this may lead to false assumptions about the state of the world when messages that were queued before a synchronous call was handled, are seemingly received after the synchronous call was handled, which may appear odd if the sychronous call should have influenced the contents of these messages (see the “stale messages problem” section below).

Asynchronous calls solve this issue by making reentrancy on a single thread possible and explicit, while still keeping individual message handlers (or callbacks). However, they make state management on the part of the requestor more cumbersome. This can be appropriately hidden only in a language with good closure/anonymous function support (Java is acceptable; C++ is better with C++11 support, but older versions are not so good).

Promises/futures are another way to manage operations that may take a long time, but ultimately will need to be resolved synchronously as well.

Stale messages problem

If we have synchronous calls in combination with asynchronous messaging ( which is a prerequisite if we want to make components non-reentrant with no need to worry about locking), a problem may crop up where seemingly impossible things happen.

We illustratie this with an example of a client and a server. The server periodically has the following features:

  • It periodically sends updates to the client.
  • The updates can be disabled via a request/response command. The server will confirm the disable command in the response.

The problem occurs when the client disables the server after a certain while, and uses a synchronous call to do so. By definition, because the call is synchronous, the client must be blocked until the response comes in. This means that any updates that have been sent in the interim must needs be queued until after the synchronous calls complete.

However, after the synchronous call has completed, the client is under the assumption that the server will not send any updates anymore (after all, it has been disabled!). At that point, the client is free to handle the next message from the queue which is – surprise, surprise – an “impossible” update message.

The issue is illustrated in the diagram below (we use slanted lines for message delivery because there are no guarantees on timing since messages are queued):

Illustration of the "stale messages problem"

There are a couple of solutions to this problem:

  1. Add code to every client to take these kinds of “ghost updates” into account. This is error prone because it is up to the developer to remember this[^asd]. The solution can be formalized a little more by using some sort of token in messages that represents local knowledge of the remote server state, and only handling messages if the local state matches the state token in the messages.
  2. To avoid having to do this in every client: add code to the messaging system to allow it to purge messages from queues when certain messages are encountered. This seems to be special-case behaviour for a few specific instances that shouldn’t belong in a general-purpose messaging framework.
  3. Don’t block the client while it is sending the request. I.e., don’t make calls synchronous; instead, the client can handle the trailing update after it wanted to disable the server but before it actually did so, in any way that is appropriate in that scenario. Unlike in the synchronous communication scenario, once it receives the disable confirmation, it knows for sure that no more updates will be received.

Design Space

If we look at the possible combinations of the options along both axes, we end up with the following concrete design options:

  Synchronous calls Asynchronous calls
Single threaded (1) (3)
Multi threaded (2) (4)

(1) Synchronous, single-threaded

This model corresponds exactly to regular function calls.

In this model, reentrancy is required, otherwise upon the initiation of mutually synchronous calls no component could make progress anymore.

Additionally, out-of-process communication is not possible without severely stalling the application.

(2) Synchronous, multi-threaded

A thread waits until a synchronous call has been handled by another component.

  • Without reentrancy (thread per component): the “stale messages problem” may crop up. Mutual synchronous calls lead to deadlock (unless this case is detected and in-thread reentrancy is used, similar to regular function calls).

  • With reentrancy (any thread can activate a component at any time): requires locking.

(This is typically the model advised to build request/response on top of enterprise message queues such as JMS/ActiveMQ)

(3) Asynchronous, single-threaded

There is a single message pump that dispatches all messages. Corresponds to eventing systems that are currently becoming popular (like Node.js). The system does not wait, and hence cannot run into a deadlock or stale messages problem.

The downside of this is the of “work to do when a response is received” has to be done in CPS1, and can lead to a “callback pyramid” when a lot of request-response has to be done in sequence.

Because of a single thread, performance is higher because no time is lost to context switching, but message handlers that take a long time block progress of other components.

(4) Asynchronous, multi-threaded

Similar to (3), except with a thread per component so the message handling latency of one component does not influence the progress of another. Locking is still not needed.

Same language-support downsides as in (3).

N.B: synchronous I/O (such as reading from a serial port) will NEED to be done on separate threads anyway.


Ultimately, systems that use synchronous calls will either need component reentrancy or run into the stale messages problem. Both entail extra work on the part of component writers, which we want to avoid.

Asynchronous message passing has the least amount of surprise, but requires some familiarity with callback-style progrmaming on the part of the component developer, and language support that doesn’t make it horrible to look at (such as C# 5’s async/await CPS transformation).

On the threading front, as soon as you have non-reentrant asynchronous calling style, you can pick either single-threaded or multi-threaded designs. You will not be able to tell the difference between them (except in terms of performance).

part_ of a computation is explicitly passed into a function. In practice, functions take a callback that will be invoked when the work has been done. (Wikipedia). [^asd]: Tools like Verum ASD exist to typically check for issues like this.

  1. Continuation Passing Style; a style of function calls where the _next