In the vein of https://research.cs.wisc.edu/adsl/Publications/cuttlefs-tos21.pdf, Lee Brotherston writes that NATS sometimes neglects to check whether calls to `fsync` actually succeeded: https://github.com/nats-io/nats-server/issues/7629
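For anyone wondering what "checking fsync" looks like in practice, here is a minimal sketch in Go. It is not taken from the NATS codebase: `durableWrite`, `syncFile`, `fakeFile`, and `errSimulated` are hypothetical names made up for illustration. The only point is that the error returned by `Sync()` is propagated to the caller instead of being dropped, so the caller never acknowledges a write that may not be on disk.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// syncFile is the subset of *os.File this sketch needs (hypothetical name).
type syncFile interface {
	Write(p []byte) (int, error)
	Sync() error
	Close() error
}

// durableWrite writes data and reports success only if the write, the
// fsync, and the close all succeeded. A failed Sync means the data may
// not be durable, so the error is returned rather than ignored.
func durableWrite(f syncFile, data []byte) error {
	if _, err := f.Write(data); err != nil {
		f.Close()
		return fmt.Errorf("write: %w", err)
	}
	if err := f.Sync(); err != nil {
		f.Close()
		return fmt.Errorf("fsync: %w", err)
	}
	if err := f.Close(); err != nil {
		return fmt.Errorf("close: %w", err)
	}
	return nil
}

// errSimulated stands in for an I/O error reported by fsync.
var errSimulated = errors.New("simulated fsync failure")

// fakeFile is an in-memory stand-in whose Sync can be made to fail.
type fakeFile struct {
	syncErr error
	written []byte
}

func (f *fakeFile) Write(p []byte) (int, error) {
	f.written = append(f.written, p...)
	return len(p), nil
}
func (f *fakeFile) Sync() error  { return f.syncErr }
func (f *fakeFile) Close() error { return nil }

func main() {
	// Real file: *os.File satisfies syncFile.
	f, err := os.CreateTemp("", "durable-*.log")
	if err != nil {
		fmt.Println("create:", err)
		return
	}
	defer os.Remove(f.Name())
	fmt.Println("real file:", durableWrite(f, []byte("hello\n")))

	// Fake file whose fsync fails: the error reaches the caller.
	bad := &fakeFile{syncErr: errSimulated}
	fmt.Println("failing fsync:", durableWrite(bad, []byte("hello\n")))
}
```

The sketch deliberately does not retry after a failed `Sync()`. Whether a retry is safe depends on the OS and file system, which is the subject of the paper linked above.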
A new #Jepsen report: we demonstrate data loss and persistent split-brain in the NATS streaming system, in response to simulated power failures/OS crashes.
https://jepsen.io/analyses/nats-2.12.1
Jepsen and Antithesis worked together to write a glossary for anyone building, testing, and operating distributed systems. It covers the basics of concurrency, consistency models and phenomena, faults, and some testing approaches:
https://antithesis.com/resources/reliability_glossary/ […]
A new #Jepsen release, 0.3.10, brings improved support for controllable random value generation, and running tests inside Antithesis. Jepsen's composable generator system has also been extracted to a minimal library, making it easier to re-use in other systems […]
The latest Jepsen talk, from Systems Distributed in June, goes live in 15 minutes. We'll be doing a live chat during the premiere, if you want to chat about databases and testing. :-)
https://www.youtube.com/watch?v=dpTxWePmW5Y
A new #Jepsen report! We tested early builds of Capela, an unreleased distributed programming environment, and found twenty-two issues, including four language problems, fourteen crashes or non-fatal panics, performance degradation, and three safety issues including lost update […]
An interview with Kaivalya Apte, on The GeekNarrator podcast. We talk about mapping properties to tests, type I and II errors, performance, LLMs, and more.
https://www.youtube.com/watch?v=IvE1VbOol88
The video of my BugBash talk, "Jepsen 17: ACID Jazz" is out now! https://www.youtube.com/watch?v=v8cG2hh10SQ
Antithesis and Jepsen are releasing a glossary of terms useful in distributed systems testing: https://antithesis.com/resources/reliability_glossary/
A parody of John Waters' "Serial Mom", except it's "Serializable Mom". I'm holding scissors (to partition the network) and trying to channel my best homage to Kathleen Turner.
Systems Distributed. June 19-20, Amsterdam.
https://systemsdistributed.com/
Two bugs sitting on cozy chairs, sewing. The Tiger Beetle bug is gesturing "No! Stop!" to the other bug, who is sewing a shirt all wrong. I think the other bug is Jepsen. :D
The companion blog post from TigerBeetle is great too--dives into detail on the bug we found in the index-intersection query code:
https://tigerbeetle.com/blog/2025-06-06-fuzzer-blind-spots-meet-jepsen/
A new #Jepsen report! We worked with TigerBeetle to find seven crashes, elevated latencies during single-node failures, and requests which were retried forever in version 0.16.11. We found only two safety issues: missing results for queries with multiple predicates, and incorrect timestamps in a […]
Jepsen 0.3.9 is now available, including a module for restarting flaky databases, a few improvements to downloading logs, more capable generators, and friendlier error messages. https://github.com/jepsen-io/jepsen/releases/tag/v0.3.9
Justin Jaffray has written up a lovely companion to this piece, diving into Snapshot Isolation and how to understand transaction dependency structures. https://buttondown.com/jaffray/archive/how-to-understand-that-jepsen-report/
Thanks to AWS's Sergey Melnik, as well as HN commenters matashii and Ants Aasma, we now know that Long Fork in PostgreSQL clusters is caused by a disagreement between primaries and secondaries on the order in which transactions become visible […]
A small issue in Amazon RDS for PostgreSQL: at the "Repeatable Read" isolation level, which in PostgreSQL normally means Snapshot Isolation, Amazon RDS for PostgreSQL clusters appear to exhibit Long Fork. We observed this behavior in healthy clusters, in versions ranging from 13.15 to 17.4 […]
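For readers new to the term: Long Fork is when two readers disagree about the order of two independent writes. One reader sees the write to x but not the write to y, while another sees y but not x. Here is a toy sketch in Go of what that looks like in a set of observations. It is not Jepsen's checker, which works on transaction dependency graphs; `observation` and `longFork` are hypothetical names, and it assumes exactly two writes, one to key x and one to key y.

```go
package main

import "fmt"

// observation records which of two independent writes a read-only
// transaction saw (hypothetical type, for illustration only).
type observation struct {
	name string
	sawX bool // read reflected the write to key x
	sawY bool // read reflected the write to key y
}

// longFork looks for one read that saw x but not y and another that saw
// y but not x. Such a pair cannot be explained by any single order in
// which the two writes became visible.
func longFork(obs []observation) (a, b observation, found bool) {
	for _, r1 := range obs {
		if !r1.sawX || r1.sawY {
			continue
		}
		for _, r2 := range obs {
			if r2.sawY && !r2.sawX {
				return r1, r2, true
			}
		}
	}
	return observation{}, observation{}, false
}

func main() {
	obs := []observation{
		{name: "read on primary", sawX: true, sawY: false},
		{name: "read on secondary", sawX: false, sawY: true},
	}
	if a, b, found := longFork(obs); found {
		fmt.Printf("long fork: %q and %q disagree on write order\n", a.name, b.name)
	} else {
		fmt.Println("no long fork in these observations")
	}
}
```

A pair of reads where one sees only x and the other sees both x and y is fine: both are consistent with x becoming visible first.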
Added four new phenomena to Jepsen's docs: P4 (Lost Update), A5A (Read Skew), A5B (Write Skew) and Process.
https://jepsen.io/consistency/phenomena#sql
For the daytime crew: Jepsen's distributed systems class starts next week. The accompanying workshop, where we practice writing and debugging our own distributed systems, follows the week after […]
Jepsen 0.3.8 is now available. It includes a new nemesis for file corruption, and improvements to clock-skew tests. https://github.com/jepsen-io/jepsen/releases/tag/v0.3.8
What IS Strong Serializability, really? Ever want to try writing your own gossip service? Two open sessions of Jepsen's training classes are coming up: the Distributed Systems Fundamentals class, and (for the first time!) its accompanying workshop […]
I'll be speaking on Jepsen at Bug Bash (DC, April 3-4), and Systems Distributed (Amsterdam, June 19-20). Come join!
https://bugbash.antithesis.com/
https://systemsdistributed.com/
Woke up to a bunch of excellent emails--y'all rock. Will try to write back to everyone in the next hour or so. ❤️
So, uh, the last time I had to call malloc() was a quarter century ago, and I am struggling to do basic tasks without corrupting the heap. I would love to hire one of you stripey-socked C witches for a very small contract to help finish […]
Added descriptions of the SQL isolation level anomalies P0, P1, P2, and P3 to the phenomena page: https://jepsen.io/consistency/phenomena#sql
Released version 0.2.4 of Maelstrom, Jepsen's workbench for writing toy distributed systems: https://github.com/jepsen-io/maelstrom
There's a few tickets left for the distributed systems class coming up in just over a week. If you'd like to join, now's the time. :-)
https://www.eventbrite.com/e/distributed-systems-fundamentals-registration-1060426286569?aff=mastodon
Bin Wang put together a Jepsen test for Patroni, a PostgreSQL replication system. All sorts of good stuff in here, including that the cluster can't handle a series of single-node failures: https://www.binwang.me/2024-12-02-PostgreSQL-High-Availability-Solutions-Part-1.html
Antithesis, Buf, and Jepsen are running a joint webinar on December 5th. We'll discuss a Kafka protocol safety issue, talk about the challenges of distributed systems testing, and show how Jepsen and Antithesis helped identify critical safety errors in Bufstream. Come watch Antithesis pause […]
Thanks to everyone who wrote in objecting to the report's description of data loss due to auto-commit. Some experiments this morning suggest that we got it wrong (at least for the official Java client). We've published an update to the report: https://jepsen.io/analyses/bufstream-0.1.0#updates