Tech Note 0021
Loss, Latency, and Speed
How network performance metrics are observed and analyzed
This article discusses packet loss, round-trip-time, and throughput as metrics of network path performance, with particular attention paid to methods of gathering and interpreting these statistics.
At all times it is vital to remember that these numbers are statistics, and as such are highly sensitive to the context and methods of observation.
Each of these statistics may be observed at individual network nodes, in a round-trip along a path, or in the experience of a particular transport layer. Observations may be compiled into statistics using a variety of mathematical methods. Different combinations of observation and calculation methods will produce radically different results.
Before gathering loss, latency, and speed information about a network path, consider how this information will be used. Possibilities include the evaluation of Service Level Agreements, prediction of how specific applications will perform, diagnosis of existing problems, comparison of different network paths, and the development of emulation models.
As a general rule, you will need to take measurements using the method closest to the question you are trying to answer. For example, if you are evaluating SLA targets, you will want to use ping statistics if that is how your service provider is measuring their own performance. But if you are trying to predict how an application will perform, you will need to consider transport experience. If you are trying to diagnose a particular problem, a combination of methods may be most helpful.
Packet loss refers to the failure of a network datagram to reach its destination.
In terrestrial networks, packet loss almost always occurs due to congestion. When a datagram is arriving at a network device, it must be stored in memory prior to being moved along toward its destination. If the device has run out of buffer memory, the datagram must be discarded.
Some routers implement an algorithm called Random Early Drop (RED), also known as Random Early Detection. This causes the router to begin dropping a small number of datagrams "at random" when the buffers have become partially filled. RED allows a router to signal data transport protocols that they should slow down so as to avoid a complete buffer overflow, and thus avoid the loss of many packets. The selection of which datagrams to drop, and how many, is not actually "random". It is determined by a set of configurable rules which may take into account transport type, port and address numbers, quality of service metrics, datagram format, and more. Some datagrams are less likely to be dropped than others due to these and other factors.
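The basic idea of dropping more often as a buffer fills can be sketched in a few lines. This is the simplified textbook form of RED, not the behavior of any particular router: the thresholds, the maximum probability, and the function names are illustrative assumptions, and the per-flow rules described above are left out entirely.

```python
import random

def red_drop_probability(avg_queue, min_threshold, max_threshold, max_probability):
    """Drop probability for a given average queue length (textbook RED, simplified).

    All four parameters are hypothetical configuration values.
    """
    if avg_queue < min_threshold:
        return 0.0                      # plenty of room: never drop
    if avg_queue >= max_threshold:
        return 1.0                      # effectively full: always drop
    # Between the thresholds the probability rises linearly up to max_probability.
    fill = (avg_queue - min_threshold) / (max_threshold - min_threshold)
    return max_probability * fill

def should_drop(avg_queue, min_threshold, max_threshold, max_probability, rng=random.random):
    """Decide the fate of one arriving datagram; rng is injectable for testing."""
    return rng() < red_drop_probability(avg_queue, min_threshold, max_threshold, max_probability)
```

Real implementations typically smooth the queue length over time and apply different parameters to different classes of traffic, which is one reason different flows see different loss rates.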
There are many other router algorithms which use packet loss to regulate data flows. Traffic shaping, quality of service, and security rules are the most common examples. Such selective congestion control makes measuring packet loss difficult because different data flows will experience different loss rates.
Packet loss is a signal that some device believes your data was going too fast.
A much less common cause of packet loss is bit error. This occurs when an error in the physical signal causes data in a packet to become corrupted. Such corruption is most often detected at the link layer and causes the entire datagram to be discarded. Because of this, a seemingly small bit error rate (BER) can result in a substantial packet loss rate. For example, a bit error rate of one in a million (1 in 1e6), or 0.0001%, will result in a packet drop rate of about 1.2%. That is because a typical link layer frame of 1500 bytes contains 12000 bits, and the corruption of just one of those bits forces the entire frame to be dropped. If datagrams are fragmented across multiple frames, the drop rate will be even higher.
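The arithmetic behind this example can be checked with a short sketch. It assumes that bit errors are independent of one another and that no error correction is applied; the function names are illustrative only.

```python
import math

def frame_loss_rate(bit_error_rate, frame_bytes=1500):
    """Probability that at least one bit in a frame is corrupted.

    Assumes independent bit errors and no forward error correction.
    """
    bits = frame_bytes * 8
    # 1 - (1 - BER)^bits, written with log1p/expm1 so very small BERs keep precision.
    return -math.expm1(bits * math.log1p(-bit_error_rate))

def datagram_loss_rate(bit_error_rate, frame_bytes=1500, fragments=1):
    """Loss rate for a datagram split across several frames, all of which must survive."""
    total_bits = frame_bytes * 8 * fragments
    return -math.expm1(total_bits * math.log1p(-bit_error_rate))
```

Under these assumptions, frame_loss_rate(1e-6) comes to roughly 1.2%, and frame_loss_rate(1e-12) to roughly 0.000001%. Passing fragments=2 to datagram_loss_rate nearly doubles the loss.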
In wired networks, bit error rates are extremely low, typically 1 in 1e12 or less, roughly corresponding to a packet loss rate of 0.000001%. BER can be much higher in wireless networks, such as WiFi, cellular, or satellite. But many wireless layers use forward error correction to automatically correct for all but the most extreme BER.
Latency refers to the time it takes a datagram to travel between two points.
As a network datagram moves along a path, it passes through several stages for each link and network device. The bits of the datagram must be physically transmitted across the link. This occurs at the speed of the link, less overhead for link layer services such as checksums, addressing, forward error correction, and flow control. The bigger the datagram, the longer it will take to traverse each link.
As a datagram is read off of a link and into a network device, it is placed into a memory buffer called a queue. When datagrams arrive at a device more quickly than they can be moved out, they build up in the queue. A datagram may need to wait in the queue until the datagrams in front of it have moved along. Datagrams are not necessarily processed in the order they are received. Routers may prioritize some datagrams over others based on a wide variety of characteristics, many of which may change over time.
Latency changes as queues along the path fill and empty. This variation is sometimes referred to as "jitter". Jitter may occur on a time scale of milliseconds or hours, but it is never random. It is an important signal of path utilization.
Rising latency is a signal that buffers are filling up and that increasing loss may be imminent.
The delay of transmitting a datagram across a link plus waiting in a queue is repeated for every node in the path. That is why, all else being equal, paths with more nodes (or "hops") have higher latency than paths with fewer.
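A rough sketch of how these per-hop delays accumulate is shown here. It models only the two components described above, transmission time and queue wait; propagation delay and link layer overhead are deliberately ignored, and the link rates and queue waits are hypothetical inputs rather than measured values.

```python
def serialization_delay_s(datagram_bytes, link_bps):
    """Seconds needed to clock a datagram onto a link of the given bit rate."""
    return datagram_bytes * 8 / link_bps

def path_latency_s(datagram_bytes, hops):
    """One-way latency over a path.

    hops: list of (link_bps, queue_wait_s) pairs, one per hop (illustrative values).
    """
    return sum(
        serialization_delay_s(datagram_bytes, link_bps) + queue_wait_s
        for link_bps, queue_wait_s in hops
    )
```

For example, a 1500 byte datagram takes 1.2 milliseconds to cross a 10 megabit per second link before any queuing is counted, and every additional hop adds its own transmission time and queue wait to the total.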
Even though "speed" or "throughput" is the most commonly talked about network metric, it does not actually exist as a well-defined characteristic. In any packet-switched network, datagrams arrive at discrete points in time. Only the physical and link layers deal with the signaling of individual bits. At the network layer and above, the "speed" at any single moment in time is either zero or infinity.
To calculate a meaningful speed statistic, discrete datagram arrival events must be averaged over time. For example, if two 1400 byte datagrams arrive 10 milliseconds apart, you could say that data is arriving at a rate of 2800 bytes per 10 milliseconds, or 2.24 megabits per second. But if a third 1400 byte datagram arrives 10 milliseconds after that, then this same calculation (4200 bytes in 20 milliseconds) yields a speed of only 1.68 megabits per second. Even though data is arriving at the same steady pace, the same calculation produces a radically different result.
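The example can be reproduced with a short sketch. The first function is the calculation described in the text. The second is one possible alternative, introduced here only for comparison, which counts just the datagrams that arrived after the first timestamp. Timestamps are in whole milliseconds and the function names are illustrative.

```python
def rate_all_bytes(arrivals):
    """Bits per second, counting every datagram over the first-to-last time span.

    arrivals: time-ordered list of (timestamp_ms, size_bytes) with at least
    two entries at distinct timestamps.
    """
    total_bits = sum(size for _, size in arrivals) * 8
    span_ms = arrivals[-1][0] - arrivals[0][0]
    return total_bits * 1000 / span_ms

def rate_after_first(arrivals):
    """Bits per second, counting only datagrams that arrived after the first one."""
    later_bits = sum(size for _, size in arrivals[1:]) * 8
    span_ms = arrivals[-1][0] - arrivals[0][0]
    return later_bits * 1000 / span_ms

two = [(0, 1400), (10, 1400)]
three = [(0, 1400), (10, 1400), (20, 1400)]
```

With these inputs, rate_all_bytes gives 2.24 megabits per second for two datagrams and 1.68 for three, while rate_after_first gives 1.12 in both cases. Neither is the "true" speed; each simply answers a slightly different question about the same arrivals.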
The speed of data movement heavily depends on how, when, and where you measure it.
The interaction of device buffers, signaling rates, and competing traffic flows means that speed will vary not just over time, but also depending on how and where you measure it. For example, data may appear to be moving very quickly as it leaves the source, but will become slowed down as it crosses the network and experiences latency and loss. The same idea even applies at the receiving end. Data may appear to be arriving quickly at the receiving network, but may become delayed in I/O processing such as writing to disk.
Speed calculations are further complicated by the distinction between the application-level data that is being transported, versus all the overhead consisting of packet headers, link frames, error correction, lost datagrams, and duplicate datagrams. Thus the bit-rate of a link is only an upper bound on speed. The actual rate of transport of useable data, sometimes called "goodput", will be significantly lower.
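As a rough illustration, the gap between link bit-rate and goodput might be estimated as follows. This is a simplified model with hypothetical parameters: it treats per-datagram overhead as a fixed number of bytes and treats lost and duplicate datagrams as a single wasted fraction of everything transmitted. Real paths are rarely this tidy.

```python
def goodput_bps(link_bps, payload_bytes, overhead_bytes, wasted_fraction=0.0):
    """Rough upper estimate of usable application data rate on a saturated link.

    overhead_bytes: headers, framing, and error correction per datagram (assumed fixed).
    wasted_fraction: share of transmitted datagrams that are lost or duplicated.
    """
    efficiency = payload_bytes / (payload_bytes + overhead_bytes)
    return link_bps * efficiency * (1.0 - wasted_fraction)
```

For instance, 1400 bytes of payload with 100 bytes of overhead on a 100 megabit per second link leaves at most about 93 megabits per second of goodput before any loss is considered.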
Many network devices record information about how many datagrams they have dropped, how much data is passing through over time, and how long it is taking datagrams to be processed. Enterprise class devices allow this information to be accessed by network administrators and may even report it in real-time. Information about queue sizes, traffic types, checksum errors, and other statistics may also be available.
Statistics viewed at any node provide only a tiny snapshot of the overall data flow.
In theory, if you were to add up the loss and latency for each node in a path, in both directions, your calculation should match the values observed for the entire round-trip. In practice, this is only a lower bound because there are very likely devices in the network which are contributing to loss and latency but which you are not able to observe. Likewise, the speed observed is only an upper bound because buffering, loss, and latency may be occurring at other points.
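A minimal sketch of this combination is given here, assuming that losses at different nodes are independent of one another. Latencies simply add. Loss rates are compounded rather than added, which gives nearly the same answer when the rates are small. The function name and input layout are illustrative.

```python
import math

def path_lower_bound(nodes):
    """Combine per-node observations into a lower bound for the whole round trip.

    nodes: iterable of (loss_rate, latency_ms), one entry per observed node
    per direction. Unobserved devices can only make the real values worse.
    """
    nodes = list(nodes)
    total_latency_ms = sum(latency_ms for _, latency_ms in nodes)
    # A datagram survives the path only if it survives every node.
    survival = math.prod(1.0 - loss_rate for loss_rate, _ in nodes)
    return 1.0 - survival, total_latency_ms
```

Two nodes with 1% and 2% loss combine to 2.98% rather than 3%, a difference that only becomes noticeable when individual loss rates are large.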
Round Trip Measurement
The most common method of observing loss and latency is a "ping". This involves transmitting a datagram from one node to another, followed by a second datagram being transmitted back to the first node. Thus two datagrams must traverse the data path, one in each direction, and each node must transmit and receive a datagram.
Latency for a ping refers to the round-trip-time: the time between transmission of the first datagram and receipt of its reply. A loss is counted when no reply is received within some short period of time. The loss rate is usually expressed as the ratio between the number of losses and the number of datagrams transmitted over a fixed time period.
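These definitions can be turned into a small summary routine. The sketch assumes that results have already been collected as one entry per ping sent, with None marking a reply that did not arrive before the timeout. The function name and the result keys are illustrative.

```python
from statistics import mean

def summarize_pings(rtts_ms):
    """Summarize ping results.

    rtts_ms: one entry per ping sent; a round-trip-time in milliseconds,
    or None when no reply arrived within the timeout.
    """
    sent = len(rtts_ms)
    replies = [rtt for rtt in rtts_ms if rtt is not None]
    lost = sent - len(replies)
    return {
        "sent": sent,
        "lost": lost,
        "loss_rate": lost / sent if sent else 0.0,
        "rtt_min_ms": min(replies) if replies else None,
        "rtt_avg_ms": mean(replies) if replies else None,
        "rtt_max_ms": max(replies) if replies else None,
    }
```

Note that the choice of timeout changes both numbers: a reply that arrives just after the deadline is counted as a loss and is excluded from the latency figures.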
The most common method for performing pings is to transmit small datagrams using the Internet Control Message Protocol (ICMP) format containing an "Echo Request". Most operating systems will automatically respond to such a request by sending a matching ICMP "Echo Reply" message back to the sender. Typically one ICMP packet is sent every second for a fixed period of time. Pings can also be performed using UDP datagrams. The "mtping" utility measures network performance in this way.
Pings only reflect a narrow view of the network path. A tremendous amount of variation can occur between pings.
Measuring path speed in the absence of a transport protocol is possible, but the results can be highly misleading. Ping utilities can sometimes be set to "flood" mode in which they send large datagrams or large numbers of datagrams at a fixed rate. However, most modern network routers and operating systems will refuse to process more than a set rate of ping requests. This is done as a security measure to prevent denial of service attacks. Even a seemingly low rate of pings may be enough to trigger a black-out, particularly if many different nodes are sending pings at the same time.
A reliable data transport protocol must be able to detect when one of its datagrams fails to arrive so that the data can be resent. This involves an exchange of messages between nodes, which allows the transport protocol to measure the datagram loss and round-trip-time that it is actually experiencing.
The datagrams carrying data payloads are much larger than those commonly used for pings. Such datagrams are more susceptible to loss and will take longer to traverse each link. Larger datagrams may become fragmented into multiple IP datagram fragments, further increasing the potential for loss and delay. Processing time at each node may also be much longer.
There are other ways in which transport datagrams may be handled differently than ICMP ping datagrams. Routers may prioritize some traffic higher than others. Such prioritization may be based on port numbers, protocol, source or destination addresses, packet size, or any other characteristic of the datagram. Many modern routers observe traffic patterns and may change how they handle one datagram based on experiences with previous, possibly unrelated, datagrams.
The rate of arrival of useable data ("goodput") can only be measured at the transport layer or higher, because it is the transport layer which ultimately decides what data is good and what data is not. When transporting discrete data sets such as a file, the speed can be calculated by the number of bytes and the time from start to finish. But determining the speed on a moment to moment basis is extremely subjective due to the discrete nature of datagram arrival described above.
Because a transport layer involves an exchange of messages, it is possible for the loss of a single packet in one direction to cause an apparent loss of many packets in the other direction. For example, if a message signaling that it is okay to send more data is dropped, then that data may not be sent and may appear to have been lost. In this way, small losses in one direction can appear to cause large losses in the other direction. This "indirect loss" can cause the loss rate observed at the transport layer to be much higher than that observed at the network nodes.
Speed observations at the transport or application level can sometimes be misleading due to buffering and flow control. For example, an application may write hundreds of kilobytes of data into a TCP socket in an instant, but the data will not actually be moved across the network for some period of time. Likewise, data may have finished arriving from the network, but may not yet have been read out of the TCP buffers.
Transport protocols should vary the speed as network conditions change, so observations made over longer intervals may be very different from observations made over shorter intervals.
Averaging the observations of many brief data flows is not the same as observing a long data flow: many factors may change in reaction to how long a particular data flow has been going.
The easiest way to observe loss and latency on a network path is to perform a ping. Unfortunately, this is also the least accurate method. Each ping only samples a brief instant of the network. For example, a ping sent once per second on a network with 100 ms latency is only active 10% of the time. Many flow oscillations and loss events can occur at intervals shorter than the round-trip latency.
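The 10% figure is simply the round-trip-time divided by the interval between pings, as this one-line sketch shows. The function name is illustrative, and the result is capped at 100% for the case where pings overlap.

```python
def ping_coverage(rtt_s, interval_s):
    """Fraction of time during which at least one ping is in flight."""
    return min(1.0, rtt_s / interval_s)
```

By the same arithmetic, covering the path continuously at 100 ms latency would require a ping at least every 100 ms, and even then each ping still samples the path with only a single small datagram.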
Measuring ICMP pings at more frequent intervals can help, but is not sufficient to overcome ping limitations because ICMP ping datagrams are not processed in the same way as data transport datagrams. Pings are less susceptible to loss, may be handled with a different queuing priority, and may be subject to different security filters.
Ping numbers are best compared to each other. For example, a rising rate of ping loss or latency may indicate increasing congestion, while decreasing values may indicate improving performance. At best, ping numbers may form a lower bound: actual transport loss and latency are almost always higher than the values observed by a standard ping.
Configuring your ping utility to use large datagrams, over 1400 bytes, will cause the ping results to be somewhat closer to those experienced by actual data transport. But differences in routing policies, fragmentation, and processing will still make the results merely suggestive of what actual data transport is experiencing.
Node observations can often be obtained from enterprise class network devices. Provided that you have administrative access to such a device, you may be able to obtain statistics as seen by that one device. As described above, this individual node experience is very different from the experience of an entire round-trip path. But comparing statistics from a number of devices along a path may help isolate problems in particular devices or links.
Observing transport experience varies depending on the protocol and operating system. Many Unix-based systems maintain overall statistics regarding TCP performance, but few do so on an individual session basis. Statistics may be available via sysctl or similar kernel queries.
MTP records detailed transport observations about loss, latency, and speed for each transaction. Much of this information is made available to the parent application.
For terrestrial networks, packet loss means that congestion has occurred. Even though the overall speed may appear to be well below the capacity of the data path, it only takes one device experiencing one moment of overflow to drop a datagram. In this way it is possible, and actually very common, to experience significant packet loss even while the network path appears to be underutilized. Full network utilization requires careful, real-time monitoring of packet loss patterns.
Data blasters which simply ignore packet loss can cause extreme network disruption and loss of connectivity.
Increasing latency is also an indication of congestion, as it may be the result of longer router queues. However, increasing latency can also be the result of changes in the types of traffic triggering changes in router queuing priority, or the use of larger datagrams along the path. At the transport and application layers, processing delays such as disk or CPU contention can also lead to an increase in round-trip-time.
Speed is ultimately a shared resource. Speed may increase as other competing traffic flows are reduced, and it may decrease as other competing traffic flows rise. Ideally, the sum total of all the speeds of all the traffic flows through any given network node should equal the capacity of the node. However, transport inefficiencies may cause the total to be much lower than the capacity.
TCP data flows are typically very sensitive to both latency and packet loss. They often slow down much more than is necessary to maintain a steady flow of data.
Most UDP based transfer mechanisms focus only on a fixed speed, ignoring latency entirely and often ignoring packet loss as well. Since both are an indication that router buffers are overflowing, this can result in exponentially increasing packet loss. This high loss results in an overall reduction in network capacity and slower overall speed. It can even result in a complete loss of connectivity.
MTP continuously assesses its observations of loss, latency, and speed to provide a careful balance of flow-control and performance. It pays particular attention to packet loss, looking for patterns which distinguish occasional spikes in node queues from an actual over-utilization of the path. By default, MTP strives to ensure that the network as a whole is used to its fullest capacity: no more, no less. It does not normally attempt to do this to the exclusion of other traffic. Tech Note 0005 describes some of the trade-offs MTP makes in evaluating performance statistics and how they can be changed.
Packet loss, latency, and speed are useful in assessing the health and potential of a network data path. They can be used to compare different scenarios, plan modifications, and predict behaviors. But like all statistics, the numbers are only meaningful in context. It is vital to understand exactly how any given set of numbers was calculated, and to be rigorously consistent when making observations. Comparisons, plans, and predictions can only be valid when the methods of observation and analysis are consistent and their interactions are well understood.
Tech Note History