Data Expedition, Inc.


From Silicon To Memes: Part 3

by Seth Noble |  Blog Sep 10, 2019

This is part three of a four-part series that explores how automated arithmetic machines have come to dominate human communication, and the consequences for our culture and economy.

The Cost Of Communicating With A Protocol From 1974

In the first and second parts of this series, we explored the rapid evolution of electronic computing over the past 80 years and how a single software technology has become responsible for nearly all human communication. The Transmission Control Protocol (TCP) performs the essential computing involved in moving data from one place to another. But its virtual-circuit data model, inspired by analog telephone lines, has remained unchanged for more than forty years. This has profound costs.

Consider a 10-petabyte video vault that needs to be moved into the cloud for the launch of a streaming service. (Data sets of that size are a lot more common than most people realize, and so is the need to move them around.) A 10-gigabit-per-second data pipe to the vault should move the data in about three-and-a-half months. But with TCP often achieving less than a quarter utilization across long-distance paths, the project could find itself delayed by nearly a year. (This example is grossly simplified, but the scaling issues hold.) Imagine the cost of any enterprise project being multiplied by a factor of four. Now imagine the cost of every enterprise project being delayed that way.
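
The arithmetic behind those figures can be sketched in a few lines of Python. This is a rough check under stated assumptions, not a model of a real migration: the vault is taken as 10 binary petabytes (10 × 2^50 bytes), a month as 30.44 days, and the function and variable names are illustrative only.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30.44  # assumption: average calendar month


def transfer_days(size_bytes, link_bits_per_sec, utilization=1.0):
    """Days needed to move size_bytes over a link at the given utilization."""
    effective_rate = link_bits_per_sec * utilization
    return size_bytes * 8 / effective_rate / SECONDS_PER_DAY


vault = 10 * 2**50   # assumption: 10 petabytes in binary units
link = 10 * 10**9    # 10 gigabits per second

ideal = transfer_days(vault, link)
degraded = transfer_days(vault, link, utilization=0.25)

print(f"full utilization:    {ideal / DAYS_PER_MONTH:.1f} months")
print(f"quarter utilization: {degraded / DAYS_PER_MONTH:.1f} months")
print(f"added delay:         {(degraded - ideal) / DAYS_PER_MONTH:.1f} months")
```

Under these assumptions the transfer takes roughly 3.4 months at full utilization and about 13.7 months at a quarter, a gap of a little over ten months.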

To say that data transport inefficiency is costing businesses hundreds of billions of dollars per year is very likely understating the problem. But the effects of inefficient network communication reach far beyond a few big companies or niche industries. With our entire economy and culture dependent on real-time communication, even small inefficiencies have huge effects.

Former Google VP Marissa Mayer famously observed in 2006 that just a half-second delay in results dropped Google search traffic by 20%. Google's websites generated nearly 100 billion dollars last year, so a 20% drop from half a second of delay would translate to nearly 20 billion dollars. Amazon (per data from former Amazon engineer Greg Linden), Walmart, and others have identified similar costs in the past, though most now view this as an area of proprietary optimization and no longer share revenue impact. Multiply those effects across every user of Google, Facebook, YouTube, Twitter, Instagram, Snapchat, and all the other communication platforms in use today, and the cost of TCP grows to impact our entire global economy.

The key to solving this problem is understanding where TCP's evolution stalled. Inspired by analog telephone lines, its full-duplex, virtual-circuit model assumes data must be delivered in order, data must be able to flow in both directions at the same time, and that network congestion will be rare. However, the furious innovations of modern applications and the astounding scale of modern networks have left those assumptions behind.

Ensuring that every application receives all data in order requires that TCP store that data in a buffer. The size of the buffer limits how much data can be in flight on the network, which in turn limits the speed at which data can be moved. (Search for "bandwidth delay product" for more about the relationship between buffer size, latency, and speed.) But for bulk data and interactive user data, the kinds most sensitive to delays, such ordering is rarely necessary. Consider the transfer of web page image files. While some formats can be displayed progressively, none require it and vanishingly few benefit from it. All that matters to the user is when the image is ready to view in its entirety. By forcing ordered delivery on all data transfers, TCP places an unnecessary burden on finite resources and unnecessary limits on speed and latency tolerance.
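
The relationship between buffer size, latency, and speed can be illustrated with a short calculation. This is a minimal sketch: the 64 KiB buffer and 100 ms round-trip time are assumed values chosen for illustration, and the function names are hypothetical.

```python
def max_throughput_bps(buffer_bytes, rtt_seconds):
    """Upper bound on throughput when at most buffer_bytes can be in flight."""
    return buffer_bytes * 8 / rtt_seconds


def required_buffer_bytes(link_bps, rtt_seconds):
    """Bandwidth delay product: bytes in flight needed to keep the link full."""
    return link_bps * rtt_seconds / 8


buffer_bytes = 64 * 1024   # assumption: a 64 KiB buffer
rtt = 0.100                # assumption: 100 ms round trip on a long path
link_bps = 10 * 10**9      # 10 gigabits per second

ceiling = max_throughput_bps(buffer_bytes, rtt)
needed = required_buffer_bytes(link_bps, rtt)

print(f"throughput ceiling: {ceiling / 1e6:.1f} Mbit/s")
print(f"buffer to fill link: {needed / 1e6:.0f} MB")
```

With these assumed values the buffer caps throughput at about 5.2 megabits per second, while keeping a 10-gigabit link full over the same path would need roughly 125 megabytes in flight.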

A full-duplex (bi-directional) byte pipe makes for a very general communication model, but the great majority of network communications consist of one-way bursts. Loading a web page or transferring a file involves sending a few small requests, sometimes to many different servers, after which all of the data flow is in one direction. When the size of each data flow is small, the time it takes to create and shut down each TCP session can be many times what it takes to transfer the actual data. When the data is large, maintaining an unused backchannel and the ability to start and stop data flow at the source forces compromises in congestion control and error recovery that further limit scaling.
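
To get a feel for the small-flow case, here is a rough comparison of session overhead against payload time. It is a simplified sketch: it assumes one round trip to set up a session and one to tear it down, ignores slow start and any encryption handshake, and uses an assumed 10 KB object, 50 ms round trip, and 100 megabit link.

```python
def session_overhead_seconds(rtt_seconds, setup_rtts=1.0, teardown_rtts=1.0):
    """Time spent opening and closing a session (assumed round-trip counts)."""
    return rtt_seconds * (setup_rtts + teardown_rtts)


def payload_seconds(size_bytes, link_bps):
    """Time to push the payload itself through the link at full rate."""
    return size_bytes * 8 / link_bps


rtt = 0.050            # assumption: 50 ms round trip
size = 10_000          # assumption: a 10 KB object
link_bps = 100 * 10**6 # assumption: 100 megabits per second

overhead = session_overhead_seconds(rtt)
payload = payload_seconds(size, link_bps)
ratio = overhead / payload

print(f"overhead: {overhead * 1000:.1f} ms, payload: {payload * 1000:.1f} ms")
print(f"overhead is about {ratio:.0f}x the payload time")
```

Even in this generous model, the session bookkeeping takes on the order of a hundred times longer than moving the data itself.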

Today, everyone is familiar with network congestion and the slowdowns it causes. But TCP was designed around the idea that it would be the receiving computer that would have trouble keeping up with the network. As a result, network congestion causes much worse problems than merely slowing TCP down. Because of the ordering requirement and the finite buffer, TCP must stop for holes in the data, completely halting the flow until the missing packets can be recovered.
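
The stall described above can be seen in a toy simulation of an in-order receiver with a finite buffer. This is only an illustrative sketch, not how any real TCP stack is written: it numbers whole packets rather than bytes, and it discards packets when the buffer is full, whereas real TCP would instead stop the sender through its advertised window. The function name and parameters are hypothetical.

```python
def deliver_in_order(arrivals, buffer_slots):
    """Simulate in-order delivery with a finite buffer for out-of-order packets.

    arrivals: packet sequence numbers in the order they reach the receiver.
    Returns a list of (arrived_seq, released_to_application) pairs.
    """
    next_seq = 0    # next sequence number the application may receive
    held = set()    # out-of-order packets waiting behind a hole
    timeline = []
    for seq in arrivals:
        released = []
        if seq == next_seq:
            released.append(seq)
            next_seq += 1
            # Filling the hole releases everything contiguous behind it.
            while next_seq in held:
                held.remove(next_seq)
                released.append(next_seq)
                next_seq += 1
        elif seq > next_seq and seq not in held and len(held) < buffer_slots:
            held.add(seq)
        # Otherwise the packet is a duplicate or the buffer is full: discard it.
        timeline.append((seq, released))
    return timeline


# Packet 1 is lost on its first trip and only arrives after a retransmission.
arrivals = [0, 2, 3, 4, 5, 1]
timeline = deliver_in_order(arrivals, buffer_slots=3)
for seq, released in timeline:
    print(f"arrived {seq}: delivered {released}")
```

Nothing reaches the application while packet 1 is missing, and once the three buffer slots fill, packet 5 has nowhere to go. When the retransmitted packet 1 finally arrives, packets 1 through 4 are released together.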

Over the years, there have been many enhancements to TCP's congestion control. From "selective acknowledgments" to "bottleneck bandwidth and round-trip propagation time" (BBR), these have tried to make TCP better at finding the right sending speed to avoid drops and at recovering faster when drops occur. The result is more than a dozen variants of TCP in use today, none of which is efficient across the full scale of the internet.

For the most part, it has been left to applications to work around these problems. For example, some have adopted multiplexed TCP sessions (many operations on a single TCP session), while others use parallel TCP sessions (the opposite: a single operation split across many TCP sessions). But ad-hoc workarounds have not been able to solve the fundamental efficiency problems that are costing billions of dollars per year.
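
As one illustration of the parallel-session workaround, an application has to decide how to carve a single operation into pieces, one per session. The sketch below shows only that partitioning step, with a hypothetical helper name; it opens no connections and does not represent any particular product's approach.

```python
def split_ranges(total_bytes, sessions):
    """Split [0, total_bytes) into contiguous half-open ranges, one per session."""
    base, extra = divmod(total_bytes, sessions)
    ranges = []
    start = 0
    for i in range(sessions):
        # The first `extra` sessions each carry one additional byte.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


ranges = split_ranges(1000, 3)
for start, end in ranges:
    print(f"session handles bytes {start} to {end - 1}")
```

Each range would then be fetched over its own TCP session and reassembled by the application, which is exactly the kind of extra machinery the workaround pushes out of the transport and into every program that needs speed.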

In the next and final part of this series, we look at how we can use TCP's outdated assumptions to inform better data transport algorithms and what it will take to deploy them at scale.