This article examines the systems integration challenges which must be addressed when moving data at speeds of two gigabits per second or faster. See Tech Note 0023 for a more general discussion of system requirements.
MTP/IP can deliver data disk-to-disk at full 10 gigabit per second line speed across high latency WANs with real-time AES encryption and data integrity checks, all on off-the-shelf hardware. This article explains the network, storage, and CPU configuration necessary to support that.
Data cannot be moved along a path faster than the slowest component of that path. To move data at very high speeds, all the components of the path must be functioning at very high speeds. This includes storage, CPU, operating system, drivers, switches, routers, links, and software.
It is not enough that each and every component be capable of high speed operation. Each component must be coordinated with every other component to ensure that they work together in practice.
The most common performance limitation is storage. Most data paths start and end in files. Most storage devices rate their throughput according to the speed of their interface, but their real-world transfer speed is much, much slower. Storage speed may vary greatly depending on how it is accessed. Storage is usually fastest when accessed sequentially, in a single file, using large blocks. Performance may be reduced dramatically when storage is accessed out-of-order, in multiple files at a time, or using small I/O operations.
Most consumer grade hard drives are not capable of sustaining multigigabit throughput. Achieving high throughput, particularly for write operations, requires specialized storage hardware. See the Storage section of Tech Note 0023 for general storage guidance.
- SSD - Solid State Drive
High performance SSD storage will provide the best throughput. Because SSDs have no moving parts, performance does not degrade when accessed out-of-order or across multiple parallel files. However, they may still slow down if data is accessed in small or irregular chunks. Not all SSDs are built for high performance, so care must be taken to ensure that the read/write capabilities are well above the desired path speed.
- RAID / Storage Array
Arrays of hard drives can be configured to provide high reliability, high throughput, or both. They can be optimized for single or parallel data streams, sequential or random access, small or large I/O operations, or other usage patterns. All of these involve trade-offs. No storage array is simply "fast": if it is to support multigigabit throughput, it must be tuned to the specific patterns of use. See Tech Note 0018 for more tips about RAID configuration.
- Attachment - Direct, SAN, NAS
Direct attached storage such as SATA, SCSI, or point-to-point Fibre Channel is the fastest method of connecting storage to a network host. Storage Area Network (SAN) connections such as iSCSI or switched Fibre Channel are the next best. SANs can be very fast when the storage device is used exclusively by a single network host, but may slow down greatly when shared. Network Attached Storage (NAS) protocols such as CIFS, AFP, and NFS are not recommended for multigigabit performance. Tech Note 0029 has some recommendations for tuning NAS storage, but NAS will be much slower than Direct or SAN storage.
Fast storage can be defeated by buggy or incorrectly configured drivers. High performance storage may come with its own drivers which may override, ignore, or conflict with operating system settings. Driver configurations must be carefully examined to make sure they match the requirements of the storage hardware, the host operating system, and expected patterns of access.
A network path consists of many devices, ranging from the Network Interface Cards of the hosts, to switches, routers, firewalls, and wide area links. A very common mistake in multigigabit networking is to focus on only one of these components. For example, a wide area link might be 10 gigabits per second, but the NIC of a host only 1 gigabit per second. Such a path would be limited to at most 1 gigabit per second, even though the "network" is 10 gigabits per second.
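The NIC example above can be sketched as a simple minimum over the components of a path. The component names and speeds below are hypothetical values chosen for illustration, not measurements.

```python
# Hypothetical component speeds in gigabits per second (illustrative only).
path_gbps = {
    "sender storage": 12.0,
    "sender NIC": 1.0,
    "wide area link": 10.0,
    "receiver NIC": 10.0,
    "receiver storage": 8.0,
}


def bottleneck(components):
    """Return the (name, speed) of the slowest component in the path."""
    name = min(components, key=components.get)
    return name, components[name]


slowest, limit = bottleneck(path_gbps)
print(f"Path limited to {limit} gigabits per second by: {slowest}")
```

Listing every component in this way, including storage at both ends, makes it harder to overlook a single slow device.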
- Link Speed
A network path is only as fast as its slowest link. When setting up a network for multigigabit speeds, check that all components are actually rated for the desired speed. This is an extremely common mistake!
- Bonding / Multiplexing / Teaming
Combining ten 1 gigabit links does not make a 10 gigabit link. At best, most such arrangements would allow ten 1 gigabit per second sessions to flow in parallel but restrict each individual session to just 1 gigabit. The capabilities of a bonded link depend entirely on the type of bonding, its correct configuration, and the distribution of traffic amongst multiple hosts. A good rule of thumb is: do not expect bonded performance to be faster than the speed of a single link.
- MTU - Maximum Transmission Unit
Larger datagrams greatly reduce overhead at all points in a network path. In some environments, jumbo frames (9000 MTU) may be required for multigigabit throughput. If a network path has a smaller MTU (e.g. 1500), then it is likely that some component of the path is not multigigabit (see above). For speeds approaching or exceeding 10 gigabits per second, Super Jumbo frames (up to 65536 MTU) may be required.
- QoS / Throttling
Devices which seek to selectively limit throughput typically use rules to match certain network packets and drop some of them when the flow is determined to be too high. MTP/IP will obey such throttling and adjust its speed to match. This is a good way to segment data flows on a large link, but care must be taken to ensure that the matching rules operate as intended. If given a choice of throttling algorithms, RED is more efficient than tail-end drop.
- Emulation - Don't Do It
As discussed throughout this article, achieving multigigabit speeds depends on precise integration of many components. It is impossible to accurately model such complexity in a single device. Even if one assumes an idealized network, it is extremely difficult to correctly configure an emulator to provide realistic predictions (see Tech Note 0022 for details). These difficulties are greatly magnified at multigigabit speeds where the properties and limitations of the emulator itself will dominate any results.
In addition to the multigigabit specific considerations above, all the usual network constraints discussed in Tech Note 0009 also apply. Remember that a flaw in one device may cause unexpected behavior in a seemingly unrelated device. For example, high latency in a storage device at one end of the path can be triggered by a misconfigured MTU at the other end of the path.
Multigigabit throughput requires that each host system process hundreds of thousands of network datagrams and store or retrieve billions of bytes per second. Most operating systems require special configuration for this level of performance.
- Operating System
Recent versions of Unix-like operating systems such as Linux, Mac OS X, and FreeBSD are best able to handle high throughput networking. Windows should be avoided if possible. If Windows must be used, it must be Windows Server 2012 or later with no non-essential software installed.
The default settings of some operating systems, notably Linux, govern CPU clock rates for power savings rather than performance. This can cause inconsistent or poor throughput, even on highly capable machines. To ensure maximum performance under Linux, adjust the scaling_governor to "performance" for every core.
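One way to verify the governor on a Linux host is to read the cpufreq entries under sysfs. The sketch below assumes the standard /sys/devices/system/cpu layout. The function name is hypothetical, and it only reports the current settings; changing them requires root privileges and is left to the system's own tools.

```python
from pathlib import Path


def non_performance_cores(cpu_root="/sys/devices/system/cpu"):
    """Return {core: governor} for cores not set to 'performance'."""
    result = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(cpu_root).glob(pattern)):
        governor = gov_file.read_text().strip()
        if governor != "performance":
            # gov_file is <root>/cpuN/cpufreq/scaling_governor
            result[gov_file.parent.parent.name] = governor
    return result


# Prints nothing if every core is already set to "performance",
# or if the host does not expose cpufreq through sysfs.
for core, governor in non_performance_cores().items():
    print(f"{core}: {governor}")
```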
A single modern CPU core can support between 1 and 2 gigabits per second. Additional CPU cores scale about the same. For example, 5 to 10 cores and Jumbo MTU are needed to support 10 gigabits per second. Enabling encryption increases CPU usage to the upper end of the range, but does not slow down throughput so long as there are sufficient cores available.
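The core count arithmetic above can be sketched as follows, using only the 1 to 2 gigabits per core range quoted in the text. The function name is hypothetical, and real requirements should be confirmed by testing on the actual hardware.

```python
import math

# Per-core throughput range quoted in the text, in gigabits per second.
GBPS_PER_CORE_LOW = 1.0
GBPS_PER_CORE_HIGH = 2.0


def cores_needed(target_gbps):
    """Return (fewest, most) cores suggested for the target speed."""
    fewest = math.ceil(target_gbps / GBPS_PER_CORE_HIGH)
    most = math.ceil(target_gbps / GBPS_PER_CORE_LOW)
    return fewest, most


print(cores_needed(10))  # (5, 10)
```

With encryption enabled, plan for the larger of the two numbers.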
- UDP Buffers
Some operating systems allow very little memory to be used for buffering network data. Some Linux distributions limit UDP buffers to as little as 128 kilobytes, which is not even enough to hold two full-sized IP datagrams. For multigigabit speeds, at least two megabytes of kernel UDP buffer space is recommended for both send and receive. Tech Note 0024 provides instructions on verifying and adjusting UDP buffer sizes where needed.
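As a rough illustration on Linux, the kernel's socket buffer ceilings can be read from /proc/sys/net/core/rmem_max and wmem_max. The sketch below assumes those ceilings are the relevant limits and interprets "two megabytes" as 2 x 1024 x 1024 bytes. The function names are hypothetical, and Tech Note 0024 remains the reference for the actual procedure.

```python
from pathlib import Path

# Assumption: "two megabytes" is taken to mean 2 * 1024 * 1024 bytes.
RECOMMENDED_BYTES = 2 * 1024 * 1024


def udp_buffer_limits(sysctl_dir="/proc/sys/net/core"):
    """Read the maximum receive and send socket buffer sizes, in bytes."""
    limits = {}
    for name in ("rmem_max", "wmem_max"):
        path = Path(sysctl_dir) / name
        if path.is_file():  # skipped on hosts without these entries
            limits[name] = int(path.read_text().strip())
    return limits


def too_small(limits, minimum=RECOMMENDED_BYTES):
    """Return only the limits that fall below the recommended minimum."""
    return {name: size for name, size in limits.items() if size < minimum}


for name, size in too_small(udp_buffer_limits()).items():
    print(f"net.core.{name} is {size} bytes, below {RECOMMENDED_BYTES}")
```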
- Storage Buffers
All modern operating systems attempt to accelerate storage operations by keeping recently read or written data in RAM. This works well for files which are smaller than the RAM buffers, but can cause catastrophic performance problems when writing files which are larger than the system's buffers.
Linux in particular may freeze all storage access for the time it takes to flush its storage buffers. Ironically, having more RAM can make this worse by allowing larger amounts of unsaved data to accumulate and lengthening the time it takes for storage to catch up. Large delays at the end of writing a large file are a common symptom of buffering problems, as are sudden drops in speed after a specific amount of data has been transferred. MTP/IP may log an MTP0 diagnostic message "WARNING: Filesystem took XX ms to write YY bytes" when this occurs.
By default, Linux delays writing data to storage until 10% of RAM is filled and will freeze all storage access for flushing when 20% of RAM is filled. When writing files larger than these sizes to storage which is slower than the network, it may be necessary to reduce the amount of write buffering so as to avoid crippling cache flushes. This can be done by reducing the sysctl variables vm.dirty_background_bytes and vm.dirty_bytes. For sustained writing of very large files, these may need to be set as low as two and four times the bandwidth delay product of the network, respectively. For example, a 10 gigabit network with 50ms of latency has a bandwidth delay product of 62.5 million bytes, so it might set "vm.dirty_background_bytes=125000000" and "vm.dirty_bytes=250000000". You should experiment to find the best match between the operating system and storage.
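The arithmetic behind those starting values can be sketched as follows. The function names are hypothetical, and the two-times and four-times multipliers are the lower bounds suggested above, not tuned results.

```python
def bandwidth_delay_product_bytes(bits_per_second, latency_ms):
    """Bytes in flight on a path of the given speed and latency."""
    # bits/s * ms / 1000 gives bits; dividing by 8 gives bytes.
    return bits_per_second * latency_ms // 8000


def dirty_limits(bits_per_second, latency_ms):
    """Starting sysctl values: two and four times the BDP."""
    bdp = bandwidth_delay_product_bytes(bits_per_second, latency_ms)
    return {
        "vm.dirty_background_bytes": 2 * bdp,
        "vm.dirty_bytes": 4 * bdp,
    }


# The example from the text: 10 gigabits per second at 50 ms.
for name, value in dirty_limits(10_000_000_000, 50).items():
    print(f"{name}={value}")
```

For the 10 gigabit, 50ms case this prints the same two values given in the text. Treat them as a starting point for experimentation.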
If you are unable to adjust the operating system's buffering (or acquire faster storage), you may need to limit MTP/IP's total network speeds to match that of storage. In servedat, for example, you could set "MaxRateTotalIn=732422" to limit incoming throughput to 6 gigabits per second.
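The value in that example is consistent with a rate expressed in units of 1024 bytes per second. That unit is inferred from the arithmetic, not stated in the text, so confirm it against the servedat documentation before relying on the sketch below. The function name is hypothetical.

```python
def max_rate_value(bits_per_second):
    """Convert a bit rate to units of 1024 bytes per second.

    Assumption: MaxRateTotalIn is expressed in 1024-byte units per
    second, as inferred from the 6 gigabit example in the text.
    """
    return round(bits_per_second / 8 / 1024)


print(max_rate_value(6_000_000_000))  # 732422
```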
- Physical Memory (RAM)
When storage is faster than the network, MTP/IP will typically use less than 512 kilobytes per transaction even at 10 gigabit speeds. When writing to storage is consistently slower than the network, MTP/IP may use several megabytes for buffering and will automatically throttle the network throughput. Slow or malfunctioning storage may occasionally block I/O for a long time, in which case MTP/IP may buffer tens or even hundreds of megabytes of data (depending on available RAM) and will pause network throughput. These storage problems can be seen by setting diagnostic level 1 during a file transfer.
- UDP Checksum Offload
Network datagram checksums can consume substantial CPU resources and in some operating systems may be restricted to a single CPU core. Many network interface cards have the ability to perform this function in dedicated hardware. If checksum offloading is available, performance should be tested with it both on and off to determine which is faster. Testing should be performed with varying MinDatagram sizes (see below) as some offload engines may be more or less efficient with larger datagrams.
MTP/IP will attempt to adapt its behavior to match the requirements of the network, hosts, and storage. But for optimal performance at multigigabit speeds, it may be necessary to provide it with some guidance.
- MinDatagram (-U)
If the speed of an individual session exceeds 1.1 gigabits per second, MTP/IP will automatically switch to using Jumbo Datagrams with 8192 bytes of content payload. But if many sessions are sharing a multigigabit path, some sessions may not reach this threshold. Greater efficiency may be achieved by advising MTP/IP that it can safely use larger payloads. For networks which support Jumbo frames (9000 MTU) use a MinDatagram of 8192 bytes. For Super Jumbo support, round down to the next multiple of 4096 bytes below the smallest MTU in the path. Common values for Super Jumbo networks are 32768 bytes and 61440 bytes.
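The rounding rule above can be sketched as a small helper. The function name is hypothetical, and "below" is read here as strictly below the MTU, which leaves room for packet headers.

```python
def min_datagram_for_mtu(smallest_mtu, step=4096):
    """Largest multiple of `step` strictly below the smallest path MTU."""
    if smallest_mtu <= step:
        raise ValueError("MTU is too small for a jumbo payload")
    return (smallest_mtu - 1) // step * step


print(min_datagram_for_mtu(9000))   # 8192
print(min_datagram_for_mtu(65536))  # 61440
```

The result is only safe if the MTU supplied really is the smallest one along the entire path.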
- MaxDatagram (-T)
If the speed of an individual session exceeds 1.1 gigabits per second, MTP/IP will automatically switch to using Jumbo frames with 8192 bytes of content payload. This should be well within the capabilities of any multigigabit link. But if bonded links with smaller MTUs are being used, it may be necessary to limit MTP/IP's payload sizes by setting its MaxDatagram to the sub-gigabit default of 1408 bytes.
- MaxRate (-k) / MaxRateTotal
When network segmentation or other throughput limiting is desired, MTP/IP can be told to limit the speed of either individual sessions or all sessions being handled by a given server. It is also possible to configure a QoS device to perform such throttling (see above). When a variety of hosts will be sharing a line, use of a QoS device may be the most efficient way to precisely segment the line.
- StreamSize (-b)
Some MTP/IP applications may stream data to other applications instead of delivering it directly to storage. Such streaming requires that data be buffered in memory. The default buffer size for most MTP/IP applications capable of streaming is 16 megabytes. This is adequate for speeds up to one gigabit at 75ms latency, but may need to be increased proportionally for multigigabit speeds. This only affects applications which are streaming data to other applications. It does not apply to direct file transfer.
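A minimal sketch of the proportional scaling, using the 16 megabyte per gigabit baseline stated above. The function name is hypothetical. The sketch scales by speed only: the 75ms latency of the baseline is assumed unchanged, and paths with higher latency may need more.

```python
# Baseline from the text: 16 megabytes is adequate for 1 gigabit
# per second at 75 ms latency.
BASELINE_MEGABYTES = 16
BASELINE_GBPS = 1


def stream_buffer_megabytes(target_gbps):
    """Scale the default streaming buffer in proportion to speed."""
    scaled = BASELINE_MEGABYTES * target_gbps / BASELINE_GBPS
    return max(BASELINE_MEGABYTES, scaled)


print(stream_buffer_megabytes(10))  # 160.0
```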
- Inline Software Compression
At multigigabit speeds, the CPU constraints of performing inline compression are greater than gains from reducing the amount of data to be transferred. Inline compression should be disabled. Data which may benefit from compression should be compressed offline, prior to transfer.
Other software sharing an MTP/IP host may require adjustment so as to minimize its impact on network and storage performance. In general, only software which is essential to the network task should be installed.
Windows is especially vulnerable to having extra software installed or running. For example, an RDP session in which a user simply clicks the desktop once per second can cause a 25% drop in network throughput. All users should be logged out and all RDP sessions disconnected to ensure maximum performance.
Multigigabit performance is difficult because it pushes current hardware to or beyond its limits. As hardware capabilities improve, and components capable of operating at these speeds become more common, many of the considerations above will become less significant and the challenge of high performance will move on to higher orders of speed.