Jepsen: A framework for distributed systems verification, with fault injection
January 26, 2025
Overview
Jepsen is a Clojure testing framework for distributed systems. A Jepsen test case defines operations on a system, the expected responses to those operations, and a schedule for running the operations and injecting faults. The test case executes the operations, introduces faults as scheduled, and validates whether the system's responses meet expectations.
Jepsen appears to be well regarded in the field of distributed systems. Jepsen test results for major distributed systems are published on its official website. Additionally, a Jepsen test result for etcd, a key-value store used within Kubernetes, is available on its official website.
Despite this recognition, the Jepsen tutorial was not informative enough for me to learn basic usage. For instance, the tutorial uses etcd as its example, but the etcd client library it relies on is outdated, so the tutorial cannot be executed as written. Instead, by reading Jepsen's source code, I implemented an application with two-phase commit and a test case for it. The application and test case are available in one of my GitHub repositories.
Despite the insufficient reference material, using Jepsen to implement test cases is still more efficient than building them from scratch. For future reference, this post explores more advanced usage of Jepsen than the tutorial covers.
Test Cases
A Jepsen test case is a map accepted by jepsen.core/run!.
The map is described in the function's documentation.
As the name implies, the jepsen.core/run! function executes a test case.
Unlike the jepsen.cli/run! function used in the tutorial, jepsen.core/run! does not terminate the test case process after invocation.
The following text is cited from the description of jepsen.core/run!:
:nodes A sequence of string node names involved in the test
:concurrency (optional) How many processes to run concurrently
:ssh SSH credential information: a map containing...
  :username The username to connect with (root)
  :password The password to use
  :sudo-password The password to use for sudo, if needed
  :port SSH listening port (22)
  :private-key-path A path to an SSH identity file (~/.ssh/id_rsa)
  :strict-host-key-checking Whether or not to verify host keys
:logging Logging options; see jepsen.store/start-logging!
:os The operating system; given by the OS protocol
:db The database to configure: given by the DB protocol
:remote The remote to use for control actions. Try, for example,
(jepsen.control.sshj/remote).
:client A client for the database
:nemesis A client for failures
:generator A generator of operations to apply to the DB
:checker Verifies that the history is valid
:log-files A list of paths to logfiles/dirs which should be captured at
the end of the test.
:nonserializable-keys A collection of top-level keys in the test which
shouldn't be serialized to disk.
:leave-db-running? Whether to leave the DB running at the end of the test.
Jepsen automatically adds some additional keys during the run
:start-time When the test began
:history The operations the clients and nemesis performed
:results The results from the checker, once the test is completed
In addition, tests have some fields added by Jepsen which are present during
their execution, but not persisted.
:barrier A CyclicBarrier, mainly used for synchronizing DB setup
:store State used for reading and writing data to and from disk
:sessions Connected sessions used by jepsen.control to talk to nodes
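To make the description above more concrete, here is a minimal sketch of a test map. It assumes Jepsen is on the classpath and starts from jepsen.tests/noop-test, which supplies do-nothing defaults. The test name, node names, and credentials are placeholders I made up for illustration, and run-example is a hypothetical helper, not part of Jepsen.

```clojure
(ns example.minimal
  (:require [jepsen.core :as jepsen]
            [jepsen.tests :as tests]))

;; noop-test provides do-nothing values for :os, :db, :client, and :nemesis,
;; plus an always-valid checker, so only a few keys are overridden here.
(def example-test
  (merge tests/noop-test
         {:name  "example"                  ; hypothetical test name
          :nodes ["n1" "n2" "n3"]           ; hypothetical node names
          :ssh   {:username "root"          ; hypothetical credentials
                  :password "root"
                  :strict-host-key-checking false}}))

(defn run-example
  "Runs the test and returns the checker's verdict.
  Needs nodes that are actually reachable over SSH."
  []
  (let [completed (jepsen/run! example-test)]
    (get-in completed [:results :valid?])))
```

Because jepsen.core/run! returns the completed test map instead of exiting, the caller can inspect :results or :history afterwards.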
Communication Between a Test Case and a System Under Test
The value of :remote indicates the method used to connect to the nodes defined by :nodes.
It is an implementation of the Remote protocol.
Protocols in Clojure are named sets of method signatures, similar to interfaces in Java and Go.
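For readers new to Clojure, a protocol and one implementation can be sketched as follows. CommandRunner and EchoRunner are hypothetical names for illustration only; this is not Jepsen's Remote protocol, just the same language mechanism in plain Clojure.

```clojure
;; Hypothetical protocol: anything that can "run" a command on a node.
(defprotocol CommandRunner
  (run-command [this node cmd]
    "Runs cmd on node and returns a string describing the result."))

;; One implementation. It only formats a string instead of contacting a node.
(defrecord EchoRunner []
  CommandRunner
  (run-command [_ node cmd]
    (str node "$ " cmd)))

;; Callers depend on the protocol function, not on the concrete record type.
(defn hostname-of [runner node]
  (run-command runner node "hostname"))
```

Swapping EchoRunner for another record that satisfies CommandRunner leaves hostname-of unchanged, which is the same way a test case can switch between Remote implementations.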
By default, a test case connects to the nodes using SSH as specified by :ssh.
If :strict-host-key-checking is enabled, the connection attempt fails if the host key presented by the node at a given IP address has changed since the last execution.
For systems running as Docker containers, :strict-host-key-checking should be disabled, since recreated containers often present new host keys.
In addition to the SSH implementation clj_ssh, Jepsen provides other implementations like DockerRemote.
Note that when the distributed system runs on Docker and the test case runs on the host, DockerRemote may not be usable, and a custom Remote implementation may be required.
When using DockerRemote, nodes need to be specified in <ipv4>:<port> format.
In such cases, the location of a node is passed to iptables in <ipv4>:<port> format, which iptables cannot resolve as a host, so the command fails with a resolution error.
Initialization of Nodes
The value of :os should be an implementation of the OS protocol, which installs the resources that test cases need on the nodes.
For instance, the Debian implementation provided by Jepsen installs programs like iptables on the nodes.
While :os installs resources for the test case, the value of :db is responsible for initializing the system under test.
Like the preceding keys, the value of :db is an implementation of a protocol, in this case DB.
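As a rough sketch of how these two keys fit together, the following pairs Jepsen's Debian OS implementation with a small DB implementation. It assumes Jepsen is on the classpath. The install-dir path and the example-db function are hypothetical, and a real DB implementation would install and start the system under test rather than only create a directory.

```clojure
(ns example.db
  (:require [jepsen.control :as c]
            [jepsen.db :as db]
            [jepsen.os.debian :as debian]))

(def install-dir "/opt/example") ; hypothetical install location

(defn example-db
  "A DB whose setup creates a directory on the node and whose teardown
  removes it. Both commands run as root on the node."
  []
  (reify db/DB
    (setup! [_ test node]
      (c/su (c/exec :mkdir :-p install-dir)))
    (teardown! [_ test node]
      (c/su (c/exec :rm :-rf install-dir)))))

;; Keys to merge into a test map.
(def os-and-db
  {:os debian/os
   :db (example-db)})
```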
Operations on Systems
The value of :generator is responsible for scheduling operations and faults such as outages and network partitions.
The value of :client defines operations, while the value of :nemesis defines failures.
They are implementations of the Generator, Client, and Nemesis protocols, respectively.
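The three keys can be sketched together as follows, assuming a recent Jepsen release on the classpath. RegisterClient is a hypothetical client backed by an in-memory atom instead of a real system, so a partition would not actually affect it; it is only there to show the shape of the Client protocol. The timings are arbitrary.

```clojure
(ns example.ops
  (:require [jepsen.client :as client]
            [jepsen.generator :as gen]
            [jepsen.nemesis :as nemesis]))

;; Hypothetical client: reads and writes a single in-memory register.
(defrecord RegisterClient [state]
  client/Client
  (open!     [this test node] this)
  (setup!    [this test])
  (invoke!   [this test op]
    (case (:f op)
      :read  (assoc op :type :ok, :value @state)
      :write (do (reset! state (:value op))
                 (assoc op :type :ok))))
  (teardown! [this test])
  (close!    [this test]))

;; Operation generators: each call yields one invocation.
(defn r [_ _] {:type :invoke, :f :read,  :value nil})
(defn w [_ _] {:type :invoke, :f :write, :value (rand-int 5)})

;; Keys to merge into a test map.
(def operations
  {:client    (->RegisterClient (atom nil))
   :nemesis   (nemesis/partition-random-halves)
   :generator (->> (gen/mix [r w])          ; random mix of reads and writes
                   (gen/stagger 1)          ; roughly one op per second
                   (gen/nemesis             ; nemesis schedule, run alongside
                     (cycle [(gen/sleep 5)
                             {:type :info, :f :start}
                             (gen/sleep 5)
                             {:type :info, :f :stop}]))
                   (gen/time-limit 30))})   ; stop after 30 seconds
```

The generator only decides when each event happens; the client and the nemesis decide what a given :f actually does.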
Verification of Execution
Once the scheduled events conclude, the value of :checker, which implements Checker, validates the history of the events.
Besides the protocols, Jepsen also provides ready-made implementations of them.
In particular, composable functions are available for Nemesis, Generator, and Checker.
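As one example of composition, a custom checker can be combined with a built-in one through jepsen.checker/compose. This is a sketch assuming a recent Jepsen release on the classpath. no-failed-writes is a hypothetical checker I wrote for illustration, and its rule (no write may fail) is deliberately simplistic.

```clojure
(ns example.check
  (:require [jepsen.checker :as checker]))

(defn no-failed-writes
  "Hypothetical checker: the history is valid only if no write failed."
  []
  (reify checker/Checker
    (check [_ test history opts]
      (let [failed (filter #(and (= :fail (:type %))
                                 (= :write (:f %)))
                           history)]
        {:valid?        (empty? failed)
         :failed-writes (count failed)}))))

;; compose runs every checker in the map and merges their verdicts
;; under the given keys.
(def example-checker
  (checker/compose
    {:stats  (checker/stats)
     :writes (no-failed-writes)}))
```

The composed value goes under :checker in the test map, and each sub-checker's result appears under its own key in :results.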