Some of my favorite technical papers.
I’ve long been a fan of hosting paper reading groups, where a group of folks sit down and talk about interesting technical papers. One of the first steps is identifying papers worth chatting about, and here is a list of papers I’ve seen lead to excellent discussions!
Dynamo: Amazon’s Highly Available Key-value Store
Reading only the abstract, you’d be forgiven for not being overly excited about the Dynamo paper: This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an always-on experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
That said, this is in some senses “the” classic modern systems paper. It has happened more than once that an engineer I’ve met has only read a single systems paper in their career, and that paper was the Dynamo paper. This paper is a phenomenal introduction to eventual consistency, coordinating state across distributed storage, reconciling data as it diverges across replicas and much more.
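To make the object-versioning and conflict-resolution ideas concrete, here is a minimal vector clock sketch in Python. The class and names are my own illustration rather than anything from Dynamo’s implementation, which pairs versioning like this with application-assisted reconciliation.

```python
class VectorClock:
    """Minimal vector clock: maps replica id -> logical counter."""

    def __init__(self, counters=None):
        self.counters = dict(counters or {})

    def increment(self, replica_id):
        """Record a local write accepted by this replica."""
        self.counters[replica_id] = self.counters.get(replica_id, 0) + 1

    def merge(self, other):
        """Reconcile two versions by taking the element-wise maximum."""
        merged = dict(self.counters)
        for replica, count in other.counters.items():
            merged[replica] = max(merged.get(replica, 0), count)
        return VectorClock(merged)

    def descends_from(self, other):
        """True if this version has already seen every write the other has."""
        return all(self.counters.get(r, 0) >= c for r, c in other.counters.items())


def in_conflict(a, b):
    """Neither version descends from the other, so the application must resolve."""
    return not a.descends_from(b) and not b.descends_from(a)


# Two replicas accept writes for the same key independently during a partition,
# producing divergent versions that the reader is asked to reconcile.
v1, v2 = VectorClock(), VectorClock()
v1.increment("replica-a")
v2.increment("replica-b")
print(in_conflict(v1, v2))  # True: concurrent writes, resolution is left to the app
```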
Hints for Computer System Design
Butler Lampson is an ACM Turing Award winner (among other honors) and worked at Xerox PARC. This paper concisely summarizes many of his ideas about system design, and is a great read.
In his words:
Studying the design and implementation of a number of computers has led to some general hints for system design. They are described here and illustrated by many examples, ranging from hardware such as the Alto and the Dorado to application programs such as Bravo and Star.
This paper itself acknowledges that it doesn’t aim to break any new ground, but it’s a phenomenal overview.
Big Ball of Mud
A reaction against exuberant papers about grandiose design patterns, this paper labels the most frequent architectural pattern as the Big Ball of Mud, and explores why elegant initial designs rarely remain intact as a system goes from concept to solution.
From the abstract:
While much attention has been focused on high-level software architectural patterns, what is, in effect, the de-facto standard software architecture is seldom discussed. This paper examines this most frequently deployed of software architectures: the BIG BALL OF MUD. A BIG BALL OF MUD is a casually, even haphazardly, structured system. Its organization, if one can call it that, is dictated more by expediency than design. Yet, its enduring popularity cannot merely be indicative of a general disregard for architecture.
Although humor certainly infuses this paper, it’s also true that software design practice remains remarkably poor: very few systems have a real design phase, few of those end up resembling their initial design, and documentation is rarely updated to reflect later decisions, all of which makes this an important topic to consider.
The Google File System
From the abstract:
The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.
In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

Google has done something fairly remarkable in defining the technical themes of Silicon Valley, and at least debatably of the wider technology industry, for more than the past decade (only recently joined, to a lesser extent, by Facebook and Twitter as they reached significant scale), and it has done so largely through its remarkable technical papers. The Google File System (GFS) paper is one of the early entries in that strategy, and is also remarkable as the paper that largely inspired the Hadoop Distributed File System (HDFS).
On Designing and Deploying Internet-Scale Services
We don’t always remember to consider Microsoft as one of the largest internet technology players, although increasingly Azure is making that comparison obvious and immediate, and it certainly wasn’t a name that necessarily came to mind in 2007. This excellent paper from James Hamilton, exploring tips on building operable systems at extremely large scale, makes it clear that not considering Microsoft as a large internet player was a lapse in our collective judgement.
From the abstract:
The system-to-administrator ratio is commonly used as a rough metric to understand administrative costs in high-scale services. With smaller, less automated services this ratio can be as low as 2:1, whereas on industry leading, highly automated services, we’ve seen ratios as high as 2,500:1. Within Microsoft services, Autopilot [1] is often cited as the magic behind the success of the Windows Live Search team in achieving high system-to-administrator ratios. While auto-administration is important, the most important factor is actually the service itself. Is the service efficient to automate? Is it what we refer to more generally as operations-friendly? Services that are operations friendly require little human intervention, and both detect and recover from all but the most obscure failures without administrative intervention. This paper summarizes the best practices accumulated over many years in scaling some of the largest services at MSN and Windows Live.
This is a true checklist for how to design and evaluate large-scale systems (almost what The Twelve-Factor App aspires to be as a checklist for operable applications).
CAP Twelve Years Later: How the Rules Have Changed
Eric Brewer posited the CAP theorem in the early 2000s, and twelve years later he wrote this excellent overview and review of CAP (which argues that distributed systems have to pick between availability and consistency during partitions), in part because:
In the decade since its introduction, designers and researchers have used (and sometimes abused) the CAP theorem as a reason to explore a wide variety of novel distributed systems. The NoSQL movement also has applied it as an argument against traditional databases.

CAP is interesting because there is not a “seminal CAP paper”, but this article serves well in such a paper’s stead. These ideas are expanded on in the Harvest and Yield paper.
Harvest, Yield, and Scalable Tolerant Systems
This paper builds on the concepts from CAP Twelve Years Later, introducing the concepts of harvest and yield to add more nuance to the AP vs CP discussion:
The cost of reconciling consistency and state management with high availability is highly magnified by the unprecedented scale and robustness requirements of today’s Internet applications. We propose two strategies for improving overall availability using simple mechanisms that scale over large applications whose output behavior tolerates graceful degradation. We characterize this degradation in terms of harvest and yield, and map it directly onto engineering mechanisms that enhance availability by improving fault isolation, and in some cases also simplify programming. By collecting examples of related techniques in the literature and illustrating the surprising range of applications that can benefit from these approaches, we hope to motivate a broader research program in this area.
The harvest and yield concepts are particularly interesting because they are both self-evident and very rarely used explicitly; instead, distributed systems continue to fail in mostly undefined ways. Hopefully, as we keep rereading this paper, we’ll also start to incorporate its design concepts into the systems we subsequently build!
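As a rough illustration (my own sketch, not code from the paper): yield is the fraction of requests answered successfully, and harvest is the fraction of the complete data reflected in a response. The function names and shard numbers below are purely hypothetical.

```python
def yield_fraction(requests_completed, requests_attempted):
    """Yield: the probability of completing any given request."""
    return requests_completed / requests_attempted


def harvest_fraction(shards_responding, shards_total):
    """Harvest: how much of the full data a completed response reflects."""
    return shards_responding / shards_total


# With one of ten shards down, a search tier can keep yield at 1.0 by answering
# every query from the nine healthy shards (harvest = 0.9), or keep harvest at
# 1.0 by failing the queries that needed the missing shard (yield ~= 0.9).
# The paper's point is that this is an explicit, tunable choice.
print(harvest_fraction(9, 10), yield_fraction(90, 100))
```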
MapReduce: Simplified Data Processing on Large Clusters
The MapReduce paper is an excellent example of an idea which has been so successful that it now seems self-evident. The idea of applying the concepts of functional programming at scale became a clarion call, provoking a shift from data warehousing to a new paradigm for data analysis:
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
Much like the Google File System paper was an inspiration for the Hadoop Distributed File System (HDFS), this paper was itself a major inspiration for Hadoop.
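The programming model itself is small enough to sketch in a few lines. Below is a toy, single-process word count in Python; the function names and the in-memory shuffle are my own stand-ins for the distributed runtime the paper actually describes.

```python
from collections import defaultdict
from itertools import chain


def map_fn(_key, document):
    """Emit (word, 1) for every word in a document."""
    return [(word, 1) for word in document.split()]


def reduce_fn(word, counts):
    """Sum all intermediate values associated with the same word."""
    return word, sum(counts)


def map_reduce(inputs, mapper, reducer):
    # Map phase: apply the mapper to every input record.
    intermediate = chain.from_iterable(mapper(k, v) for k, v in inputs)
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: fold each group down to a final value.
    return dict(reducer(k, vs) for k, vs in groups.items())


docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
print(map_reduce(docs, map_fn, reduce_fn))  # {'the': 2, 'quick': 1, ...}
```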
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
The Dapper paper introduces a performant approach to tracing requests across many services, which has become increasingly relevant as more companies refactor core monolithic applications into dozens or hundreds of micro-services.
From the abstract:
Here we introduce the design of Dapper, Google’s production distributed systems tracing infrastructure, and describe how our design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met. Dapper shares conceptual similarities with other tracing systems, particularly Magpie and X-Trace, but certain design choices were made that have been key to its success in our environment, such as the use of sampling and restricting the instrumentation to a rather small number of common libraries.
The ideas from Dapper have since made their way into open source, especially in Zipkin and OpenTracing.
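The core mechanics are easy to sketch: every request carries a trace id, each span records its parent, and the sampling decision is made once at the root and then propagated. The names and structure below are hypothetical, not Dapper’s (or OpenTracing’s) actual API.

```python
import random
import uuid
from dataclasses import dataclass
from typing import Optional

SAMPLE_RATE = 1 / 1024  # aggressive sampling is what keeps tracing overhead low


@dataclass
class SpanContext:
    trace_id: str                    # shared by every span in one request
    span_id: str
    parent_id: Optional[str] = None  # links the span into the trace tree
    sampled: bool = False            # decided once, at the trace root


def start_trace() -> SpanContext:
    """Root span: decide once whether this request is sampled."""
    return SpanContext(
        trace_id=uuid.uuid4().hex,
        span_id=uuid.uuid4().hex,
        sampled=random.random() < SAMPLE_RATE,
    )


def child_span(parent: SpanContext) -> SpanContext:
    """Propagate the trace id and sampling decision across an RPC boundary."""
    return SpanContext(
        trace_id=parent.trace_id,
        span_id=uuid.uuid4().hex,
        parent_id=parent.span_id,
        sampled=parent.sampled,
    )


root = start_trace()
rpc = child_span(root)
assert rpc.trace_id == root.trace_id and rpc.parent_id == root.span_id
```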
Kafka: a Distributed Messaging System for Log Processing
Apache Kafka has become a core piece of infrastructure for many internet companies. Its versatility lends it to many roles: serving as the ingress point to “data land” for some, a durable queue for others, and that’s just scratching the surface.
Not only a useful addition to your toolkit, Kafka is also a beautifully designed system:
Log processing has become a critical component of the data pipeline for consumer internet companies. We introduce Kafka, a distributed messaging system that we developed for collecting and delivering high volumes of log data with low latency. Our system incorporates ideas from existing log aggregators and messaging systems, and is suitable for both offline and online message consumption. We made quite a few unconventional yet practical design choices in Kafka to make our system efficient and scalable. Our experimental results show that Kafka has superior performance when compared to two popular messaging systems. We have been using Kafka in production for some time and it is processing hundreds of gigabytes of new data each day.
In particular, Kafka’s partitions do a phenomenal job of forcing application designers to make an explicit tradeoff between performance and predictable message ordering.
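A sketch of why that tradeoff exists: ordering is only guaranteed within a partition, so the partitioning key decides which ordering you preserve and how evenly load spreads. The hash below is a stand-in (Kafka’s default partitioner uses murmur2), and the key and event names are made up.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Pick a partition with a stable hash of the key; any stable hash shows the idea."""
    return zlib.crc32(key) % num_partitions


# All events for one user land on one partition, so they stay ordered relative
# to each other, at the cost of hot keys producing hot partitions.
events = [(b"user-17", "login"), (b"user-42", "click"), (b"user-17", "logout")]
for key, event in events:
    print(key, event, "-> partition", partition_for(key, num_partitions=12))
```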
Wormhole: Reliable Pub-Sub to Support Geo-replicated Internet Services
In many ways similar to Kafka, Facebook’s Wormhole is another highly scalable approach to messaging:
Wormhole is a publish-subscribe (pub-sub) system developed for use within Facebook’s geographically replicated datacenters. It is used to reliably replicate changes among several Facebook services including TAO, Graph Search and Memcache. This paper describes the design and implementation of Wormhole as well as the operational challenges of scaling the system to support the multiple data storage systems deployed at Facebook. Our production deployment of Wormhole transfers over 35 GBytes/sec in steady state (50 million messages/sec or 5 trillion messages/day) across all deployments with bursts up to 200 GBytes/sec during failure recovery. We demonstrate that Wormhole publishes updates with low latency to subscribers that can fail or consume updates at varying rates, without compromising efficiency.
In particular, note the approach to supporting lagging consumers without sacrificing overall system throughput.
Borg, Omega, and Kubernetes
While the individual papers for each of Google’s orchestration systems (Borg, Omega and Kubernetes) are worth reading in their own right, this article is an excellent overview of the three:
Though widespread interest in software containers is a relatively recent phenomenon, at Google we have been managing Linux containers at scale for more than ten years and built three different container-management systems in that time. Each system was heavily influenced by its predecessors, even though they were developed for different reasons. This article describes the lessons we’ve learned from developing and operating them.
Fortunately, not all orchestration happens under Google’s aegis, and Mesos’ alternative two-level scheduling architecture is a fascinating read as well.
Large-scale cluster management at Google with Borg
Borg has been orchestrating much of Google’s infrastructure for quite some time (significantly predating Omega, although fascinatingly the Omega paper predates the Borg paper by two years):
Google’s Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines.
This paper takes a look at Borg’s centralized scheduling model, which was both effective and efficient, although it became increasingly challenging to modify and scale over time, inspiring both Omega and Kubernetes within Google (the former to optimistically replace it, and the latter seemingly to commercialize the learnings, or at least to prevent Mesos from capturing too much mindshare).
Omega: flexible, scalable schedulers for large compute clusters
Omega is, among many other things, an excellent example of the second-system effect, where an attempt to replace a complex existing system with something far more elegant ends up being more challenging than anticipated.
In particular, Omega is a reaction against the realities of extending the aging Borg system:
Increasing scale and the need for rapid response to changing requirements are hard to meet with current monolithic cluster scheduler architectures. This restricts the rate at which new features can be deployed, decreases efficiency and utilization, and will eventually limit cluster growth. We present a novel approach to address these needs using parallelism, shared state, and lock-free optimistic concurrency control.
Perhaps it is also an example of worse is better once again carrying the day.
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
This paper describes the design of Apache Mesos, in particular its distinctive two-level scheduler:
We present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing frameworks, such as Hadoop and MPI. Sharing improves cluster utilization and avoids per-framework data replication. Mesos shares resources in a fine-grained manner, allowing frameworks to achieve data locality by taking turns reading data stored on each machine. To support the sophisticated schedulers of today’s frameworks, Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. Our results show that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.
Used heavily by Twitter and Apple, Mesos was for some time the only open-source general scheduler with significant adoption, and is now in a fascinating competition for mindshare with Kubernetes.
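A toy model of the two-level mechanism (entirely my own names, not the Mesos API): the first-level allocator hands resource offers to frameworks, and each framework’s second-level scheduler decides which offers to accept and what to launch on them.

```python
# Cluster-level allocator decides who gets offered what; framework-level
# schedulers decide which offers to accept. Declined offers return to the pool.
offers = [
    {"agent": "agent-1", "cpus": 8, "mem_gb": 32},
    {"agent": "agent-2", "cpus": 2, "mem_gb": 4},
]


class BatchFramework:
    """Second-level scheduler: accepts only offers big enough for its tasks."""

    def __init__(self, cpus_per_task, mem_per_task):
        self.cpus_per_task = cpus_per_task
        self.mem_per_task = mem_per_task

    def on_offers(self, offers):
        launched = []
        for offer in offers:
            if offer["cpus"] >= self.cpus_per_task and offer["mem_gb"] >= self.mem_per_task:
                launched.append({"agent": offer["agent"], "task": "batch-job"})
            # Offers that are not accepted go back to the allocator for other frameworks.
        return launched


framework = BatchFramework(cpus_per_task=4, mem_per_task=16)
print(framework.on_offers(offers))  # only agent-1's offer is accepted
```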
Design patterns for container-based distributed systems
The move to container-based deployment and orchestration has introduced a whole new vocabulary, like sidecars and adapters, and this paper provides a survey of the patterns that have evolved over the past decade as microservices and containers have become increasingly prominent infrastructure components:
In the late 1980s and early 1990s, object-oriented programming revolutionized software development, popularizing the approach of building of applications as collections of modular components. Today we are seeing a similar revolution in distributed system development, with the increasing popularity of microservice architectures built from containerized software components. Containers are particularly well-suited as the fundamental object in distributed systems by virtue of the walls they erect at the container boundary. As this architectural style matures, we are seeing the emergence of design patterns, much as we did for object-oriented programs, and for the same reason thinking in terms of objects (or containers) abstracts away the low-level details of code, eventually revealing higher-level patterns that are common to a variety of applications and algorithms.
The “sidecar” term in particular likely originated in this blog post from Netflix, which is a worthy read in its own right.
Raft: In Search of an Understandable Consensus Algorithm
While the second-system effect usually describes a second system becoming bloated and complex relative to a simpler initial system, the roles are reversed in the case of Paxos and Raft. Whereas Paxos is often considered beyond human comprehension, Raft is a fairly easy read:
Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety.
Raft is used by etcd and InfluxDB, among many others.
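As a taste of the paper’s emphasis on understandability, here is a simplified sketch of the vote-granting rule a follower applies during leader election: reject stale terms, step forward into newer terms, and only vote for candidates whose log is at least as up-to-date. This is a compressed illustration, not a complete or faithful Raft implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FollowerState:
    current_term: int
    voted_for: Optional[str]  # candidate granted a vote in current_term, if any
    last_log_index: int
    last_log_term: int


def handle_request_vote(state, term, candidate_id, cand_last_log_index, cand_last_log_term):
    """Return (vote_granted, state) following the paper's vote-granting rules."""
    if term < state.current_term:
        return False, state  # stale candidate from an old term
    if term > state.current_term:
        state.current_term, state.voted_for = term, None  # step into the new term
    # Only vote for candidates whose log is at least as up-to-date as ours:
    # higher last term wins; equal terms compare last index.
    log_ok = (cand_last_log_term, cand_last_log_index) >= (state.last_log_term, state.last_log_index)
    if log_ok and state.voted_for in (None, candidate_id):
        state.voted_for = candidate_id
        return True, state
    return False, state


follower = FollowerState(current_term=3, voted_for=None, last_log_index=10, last_log_term=3)
granted, _ = handle_request_vote(follower, term=4, candidate_id="node-b",
                                 cand_last_log_index=10, cand_last_log_term=3)
print(granted)  # True: newer term and an equally up-to-date log
```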
Paxos Made Simple
One of Leslie Lamport’s numerous influential papers, Paxos Made Simple is a gem both for explaining the notoriously complex Paxos algorithm and because, even at its simplest, Paxos isn’t really that simple:
The Paxos algorithm for implementing a fault-tolerant distributed system has been regarded as difficult to understand, perhaps because the original presentation was Greek to many readers. In fact, it is among the simplest and most obvious of distributed algorithms. At its heart is a consensus algorithm, the synod algorithm. The next section shows that this consensus algorithm follows almost unavoidably from the properties we want it to satisfy. The last section explains the complete Paxos algorithm, which is obtained by the straightforward application of consensus to the state machine approach for building a distributed system, an approach that should be well-known, since it is the subject of what is probably the most often-cited article on the theory of distributed systems.
Paxos itself remains a deeply influential concept: it is the algorithm behind Google’s Chubby and the inspiration for Apache Zookeeper’s Zab protocol, among many others.
SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol
The majority of consensus algorithms focus on remaining consistent during partitions; SWIM goes the other direction and focuses on availability:
Several distributed peer-to-peer applications require weakly-consistent knowledge of process group membership information at all participating processes. SWIM is a generic software module that offers this service for large scale process groups. The SWIM effort is motivated by the unscalability of traditional heart-beating protocols, which either impose network loads that grow quadratically with group size, or compromise response times or false positive frequency w.r.t. detecting process crashes. This paper reports on the design, implementation and performance of the SWIM sub-system on a large cluster of commodity PCs.
SWIM is used in HashiCorp’s software, as well as Uber’s Ringpop.
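The failure-detection half of the protocol fits in a short sketch: ping the target directly, then ask k other members to ping it on your behalf before suspecting it. The callback-based transport below is a hypothetical stand-in for real network RPCs.

```python
import random

K_INDIRECT = 3  # how many peers are asked to probe on our behalf


def probe(target, members, direct_ping, indirect_ping):
    """One SWIM-style protocol period for a single target.

    direct_ping(target) and indirect_ping(via, target) are transport callbacks
    returning True on an ack; they stand in for real RPCs.
    """
    if direct_ping(target):
        return "alive"
    # The direct probe failed; ask a few other members to ping the target for
    # us, which guards against a lossy or slow link between us and the target.
    candidates = [m for m in members if m != target]
    helpers = random.sample(candidates, min(K_INDIRECT, len(candidates)))
    if any(indirect_ping(via, target) for via in helpers):
        return "alive"
    # Still no ack: mark the target as suspected and let the dissemination
    # component spread the suspicion before the process is declared dead.
    return "suspect"
```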
The Byzantine Generals Problem
Another classic Leslie Lamport paper on consensus, the Byzantine Generals Problem explores how to deal with distributed actors which intentionally or accidentally submit incorrect messages:
Reliable computer systems must handle malfunctioning components that give conflicting information to different parts of the system. This situation can be expressed abstractly in terms of a group of generals of the Byzantine army camped with their troops around an enemy city. Communicating only by messenger, the generals must agree upon a common battle plan. However, one or more of them may be traitors who will try to confuse the others. The problem is to find an algorithm to ensure that the loyal generals will reach agreement. It is shown that, using only oral messages, this problem is solvable if and only if more than two-thirds of the generals are loyal; so a single traitor can confound two loyal generals. With unforgeable written messages, the problem is solvable for any number of generals and possible traitors. Applications of the solutions to reliable computer systems are then discussed.
The paper is mostly focused on the formal proof (a bit of a theme from Lamport, who developed TLA+ to make formal proofs easier), but it is also a useful reminder that we still tend to assume our components will behave reliably and honestly, and perhaps we shouldn’t!
Out of the Tar Pit
Out of the Tar Pit bemoans unnecessary complexity in software, and proposes that functional programming and better data modeling can help us reduce accidental complexity (arguing that most unnecessary complexity comes from state).
From the abstract:
Complexity is the single major difficulty in the successful development of large-scale software systems. Following Brooks we distinguish accidental from essential difficulty, but disagree with his premise that most complexity remaining in contemporary systems is essential. We identify common causes of complexity and discuss general approaches which can be taken to eliminate them where they are accidental in nature. To make things more concrete we then give an outline for a potential complexity-minimizing approach based on functional programming and Codd’s relational model of data.
Certainly a good read, although reading it a decade later it’s fascinating to see that neither of those approaches has particularly taken off. Instead, the closest thing to a “universal” approach to reducing complexity seems to be the move to numerous, mostly stateless services, which is perhaps more a reduction of local complexity at the expense of greater systemic complexity, whose maintenance is then delegated to more specialized systems engineers.
(This is yet another paper that makes me wish TLA+ felt natural enough to be a commonly adopted tool.)
The Chubby lock service for loosely-coupled distributed systems
Distributed systems are hard enough without having to frequently reimplement Paxos or Raft, and the model proposed by Chubby is to implement consensus once in a shared service, which will allow systems built upon it to share in the resilience of distribution by following greatly simplified patterns.
From the abstract:
We describe our experiences with the Chubby lock service, which is intended to provide coarse-grained locking as well as reliable (though low-volume) storage for a loosely-coupled distributed system. Chubby provides an interface much like a distributed file system with advisory locks, but the design emphasis is on availability and reliability, as opposed to high performance. Many instances of the service have been used for over a year, with several of them each handling a few tens of thousands of clients concurrently. The paper describes the initial design and expected use, compares it with actual use, and explains how the design had to be modified to accommodate the differences.
In the open source world, Zookeeper plays the same role for projects like Kafka and Mesos that Chubby plays within Google.
Bigtable: A Distributed Storage System for Structured Data
One of Google’s preeminent papers and technologies is Bigtable, which was an early (early in the internet era, anyway) NoSQL datastore, operating at extremely high scale and built on top of Chubby.
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
From the SSTable design to the bloom filters, Cassandra inherits significantly from the Bigtable paper, and is probably rightfully considered a merging of the Dynamo and Bigtable papers.
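One of those inherited details is easy to show directly: a Bloom filter lets a read skip SSTables that definitely do not contain a row. The sketch below is a toy implementation with made-up sizing, not Bigtable’s or Cassandra’s actual code.

```python
import hashlib


class BloomFilter:
    """Answers 'definitely not present' or 'maybe present', which lets a read
    skip SSTables that cannot contain the requested row key."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key: bytes):
        # Derive several bit positions from independent-ish hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(2, "big") + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


sstable_index = BloomFilter()
sstable_index.add(b"row:com.example/page")
print(sstable_index.might_contain(b"row:com.example/page"))  # True
print(sstable_index.might_contain(b"row:com.missing/page"))  # False (with high probability)
```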
Spanner: Google’s Globally-Distributed Database
Where many early NoSQL storage systems traded eventual consistency for increased resiliency, building on top of eventually consistent systems can be harrowing. Spanner represents an approach from Google to offering both strong consistency and distributed reliability, relying in part on a novel approach to managing time.
Spanner is Google’s scalable, multi-version, globally distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: nonblocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.
We haven’t seen any open-source Spanner equivalents yet, but I imagine we’ll start seeing them in 2018.
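The novel time API is worth sketching because it is so simple to state: TT.now() returns an interval guaranteed to contain true time, and a commit is only acknowledged after “commit wait”, once its timestamp is certainly in the past. The uncertainty constant and function names below are illustrative assumptions, not Spanner’s implementation.

```python
import time

CLOCK_UNCERTAINTY_MS = 7.0  # stand-in epsilon; Spanner derives real bounds from GPS and atomic clocks


def tt_now():
    """TrueTime-style API: an interval [earliest, latest] guaranteed to contain real time."""
    now_ms = time.time() * 1000
    return now_ms - CLOCK_UNCERTAINTY_MS, now_ms + CLOCK_UNCERTAINTY_MS


def commit_with_wait():
    """Pick a commit timestamp, then wait out the uncertainty before making it visible."""
    _, latest = tt_now()
    commit_ts = latest
    # Commit wait: do not acknowledge the commit until the timestamp is
    # guaranteed to be in the past on every participant's clock, which is what
    # lets timestamps order transactions with external consistency.
    while tt_now()[0] <= commit_ts:
        time.sleep(0.001)
    return commit_ts


print(commit_with_wait())
```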
Security Keys: Practical Cryptographic Second Factors for the Modern Web
Security keys like the YubiKey have emerged as the most secure second authentication factor, and this paper out of Google explains the motivations that led to their creation, and the design that makes them work.
From the abstract:
Security Keys are second-factor devices that protect users against phishing and man-in-the-middle attacks. Users carry a single device and can self-register it with any online service that supports the protocol. The devices are simple to implement and deploy, simple to use, privacy preserving, and secure against strong attackers. We have shipped support for Security Keys in the Chrome web browser and in Google’s online services. We show that Security Keys lead to both an increased level of security and user satisfaction by analyzing a two year deployment which began within Google and has extended to our consumer-facing web applications. The Security Key design has been standardized by the FIDO Alliance, an organization with more than 250 member companies spanning the industry. Currently, Security Keys have been deployed by Google, Dropbox, and GitHub.
They’re also remarkably cheap! Order a few and start securing your life in a day or two.
BeyondCorp: Design to Deployment at Google
Building on the original BeyondCorp paper in 2014, this paper is slightly more detailed and benefits from two more years of migration-fueled wisdom. That said, the big ideas have remained fairly consistent and there is not much new relative to the BeyondCorp paper itself (although that was a fantastic paper, and if you haven’t read it, this is an equally good starting point):
The goal of Google’s BeyondCorp initiative is to improve our security with regard to how employees and devices access internal applications. Unlike the conventional perimeter security model, BeyondCorp doesn’t gate access to services and tools based on a user’s physical location or the originating network; instead, access policies are based on information about a device, its state, and its associated user. BeyondCorp considers both internal networks and external networks to be completely untrusted, and gates access to applications by dynamically asserting and enforcing levels, or tiers, of access.
As is often the case reading Google papers, my biggest takeaway here is wondering when we’ll start to see reusable, pluggable open source versions of the techniques described within.
Availability in Globally Distributed Storage Systems
This paper explores how to think about availability in replicated distributed systems, and is a useful starting point for those of us trying to determine the correct way to measure uptime for a storage layer or any other sufficiently complex system.
From the abstract:
We characterize the availability properties of cloud storage systems based on an extensive one year study of Google’s main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet.
Particularly interesting is the focus on correlated failures, building on the premise that users of distributed systems only experience a failure when multiple components have overlapping failures. Another expected but reassuring observation is that at Google’s scale (and with resources distributed across racks and regions), most unavailability comes from tuning and system design, not from the underlying hardware.
I was also surprised by how simple their definition of availability was in this case:
A storage node becomes unavailable when it fails to respond positively to periodic health checking pings sent by our monitoring system. The node remains unavailable until it regains responsiveness or the storage system reconstructs the data from other surviving nodes.
Often discussions of availability become arbitrarily complex (“really it should be response rates are over X, but with correct results and within our latency SLO!"), and it’s reassuring to see the simplest definitions are still usable.
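In that spirit, measuring data (rather than node) availability mostly reduces to asking when the downtime windows of all replicas overlap. The brute-force sketch below uses made-up intervals to show why correlated failures, like a rack power event, dominate.

```python
def unavailable_seconds(replica_downtimes, window_seconds):
    """Seconds during which every replica of a chunk was down at the same time.

    replica_downtimes: one list of (start, end) downtime intervals per replica,
    in seconds since the start of the observation window.
    """
    unavailable = 0
    for second in range(window_seconds):
        if all(any(start <= second < end for start, end in downs)
               for downs in replica_downtimes):
            unavailable += 1
    return unavailable


# Independent node failures rarely overlap; a correlated event (a rack losing
# power, a bad config push) takes multiple replicas down at once.
independent = [[(100, 200)], [(500, 600)]]
correlated = [[(100, 200)], [(150, 250)]]
print(unavailable_seconds(independent, 1000))  # 0
print(unavailable_seconds(correlated, 1000))   # 50
```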
Still All on One Server: Perforce at Scale
As a company grows, code hosting performance becomes one of the critical factors in overall developer productivity (along with build and test performance), but it’s a topic that isn’t discussed frequently. This paper from Google discusses their experience scaling Perforce:
Google runs the busiest single Perforce server on the planet, and one of the largest repositories in any source control system. From this high-water mark this paper looks at server performance and other issues of scale, with digressions into where we are, how we got here, and how we continue to stay one step ahead of our users.
This paper is particularly impressive when you consider the difficulties that companies run into scaling Git monorepos (talk to an ex-Twitter employee near you for war stories).
Large-Scale Automated Refactoring Using ClangMR
Large code bases tend to age poorly, especially in the case of monorepos where hundreds or thousands of different teams collaborate on different projects. This paper covers one of Google’s attempts to reduce the burden of maintaining their large monorepo through tooling that makes it easy to rewrite abstract syntax trees (ASTs) across the entire codebase.
From the abstract:
In this paper, we present a real-world implementation of a system to refactor large C++ codebases efficiently. A combination of the Clang compiler framework and the MapReduce parallel processor, ClangMR enables code maintainers to easily and correctly transform large collections of code. We describe the motivation behind such a tool, its implementation and then present our experiences using it in a recent API update with Google’s C++ codebase.
Similar work is being done with Pivot.
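ClangMR itself drives the Clang frontend over C++ ASTs and fans the work out with MapReduce; as a rough analogue only, here is how a comparable AST-level rewrite of a hypothetical deprecated call might look using Python’s standard ast module.

```python
import ast


class RenameDeprecatedCall(ast.NodeTransformer):
    """Rewrite calls to a hypothetical deprecated function old_api(...) into
    new_api(...); ClangMR performs the equivalent transformation over C++ ASTs."""

    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == "old_api":
            node.func = ast.copy_location(ast.Name(id="new_api", ctx=ast.Load()), node.func)
        return node


source = "result = old_api(payload, retries=3)\n"
tree = RenameDeprecatedCall().visit(ast.parse(source))
ast.fix_missing_locations(tree)
print(ast.unparse(tree))  # result = new_api(payload, retries=3)
```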
Source Code Rejuvenation is not Refactoring
This paper introduces the concept of “Code Rejuvenation”, a unidirectional process of moving towards cleaner abstractions as new language features and libraries become available, which is particularly applicable to sprawling, older code bases.
From the abstract:
In this paper, we present the notion of source code rejuvenation, the automated migration of legacy code and very briefly mention the tools we use to achieve that. While refactoring improves structurally inadequate source code, source code rejuvenation leverages enhanced program language and library facilities by finding and replacing coding patterns that can be expressed through higher-level software abstractions. Raising the level of abstraction benefits software maintainability, security, and performance.
There are some strong echoes of this work in Google’s ClangMR paper.
Searching for Build Debt: Experiences Managing Technical Debt at Google
This paper is an interesting look at how to perform large-scale migrations in living codebases. Using broken builds as the running example, they break their strategy down into three pillars: automation, making it easy to do the right thing, and making it hard to do the wrong thing.
From the abstract:
With a large and rapidly changing codebase, Google software engineers are constantly paying interest on various forms of technical debt. Google engineers also make efforts to pay down that debt, whether through special Fixit days, or via dedicated teams, variously known as janitors, cultivators, or demolition experts. We describe several related efforts to measure and pay down technical debt found in Google’s BUILD files and associated dead code. We address debt found in dependency specifications, unbuildable targets, and unnecessary command line flags. These efforts often expose other forms of technical debt that must first be managed.
No Silver Bullet - Essence and Accident in Software Engineering
A seminal paper from the author of The Mythical Man-Month, “No Silver Bullet” expands on the discussion of accidental versus essential complexity, and argues that accidental complexity no longer makes up enough of software engineering for any single reduction of it to deliver an order-of-magnitude increase in productivity.
From the abstract:
Most of the big past gains in software productivity have come from removing artificial barriers that have made the accidental tasks inordinately hard, such as severe hardware constraints, awkward programming languages, lack of machine time. How much of what software engineers now do is still devoted to the accidental, as opposed to the essential? Unless it is more than 9/10 of all effort, shrinking all the accidental activities to zero time will not give an order of magnitude improvement.
Interestingly, I think we do see accidental complexity in large codebases grow large enough that order-of-magnitude improvements are possible (motivating, for example, Google’s investments into ClangMR and the like), so perhaps we’re not quite as far along in the shift toward essential complexity as we’d like to believe.
The UNIX Time-Sharing System
This paper describes the fundamentals of UNIX as of 1974, and what is truly remarkable is how many of the design decisions are still in use today. From the permission model we’ve all adjusted with chmod to the system calls used to manipulate files, it’s amazing how much remains intact.
From the abstract:
UNIX is a general-purpose, multi-user, interactive operating system for the Digital Equipment Corporation PDP-11/40 and 11/45 computers. It offers a number of features seldom found even in larger operating systems, including: (1) a hierarchical file system incorporating demountable volumes; (2) compatible file, device, and inter-process I/O; (3) the ability to initiate asynchronous processes; (4) system command language selectable on a per-user basis; and (5) over 100 subsystems including a dozen languages. This paper discusses the nature and implementation of the file system and of the user command interface.
Also fascinating is their observation that UNIX succeeded in part because it was designed by its authors to solve a general problem they felt themselves (working with the PDP-7 was frustrating), rather than toward a more narrowly specified goal.
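As a small nod to how durable that design is, the same owner/group/other permission bits described in the 1974 paper are what a chmod call sets today; the file name below is just an example.

```python
import os
import stat

# The owner/group/other permission bits from the paper are the ones chmod
# still manipulates today.
path = "notes.txt"
open(path, "w").close()

os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IROTH)  # 0o644
mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))  # 0o644: owner read/write, group and other read-only
```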
After sharing this post, folks recommended a ton of additional amazing papers:
- Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
- Canopy: An End-to-End Performance Tracing And Analysis System
- Unreliable Failure Detectors for Reliable Distributed Systems
- CRDTs: Consistency without concurrency control
- Logic and Lattices for Distributed Programming
- Zab: High-performance broadcast for primary-backup systems
- Paxos Made Live - An Engineering Perspective
- Towards a Solution to the Red Wedding Problem
- Profiling a warehouse-scale computer
- Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers
- A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World
- Rules of Machine Learning: Best Practices for ML Engineering
- A comprehensive study of Convergent and Commutative Replicated Data Types
- Online, Asynchronous Schema Change in F1
- Time, Clocks and the Ordering of Events in a Distributed System
- C-Store: A Column-oriented DBMS
- Life beyond Distributed Transactions: an Apostate’s Opinion
- Immutability Changes Everything
- Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
- Basic Local Alignment Search Tool
Send me yours! If you’re looking for more reading materials, consider taking a look at my favorite books as well!