Linux TCP Tuning

The aim of this post is to point out kernel tunables that might improve network performance in certain scenarios. As with any other post on the subject, make sure you test before and after you make an adjustment so you have a measurable, quantitative result. For the most part, the kernel is smart enough to detect and adjust certain TCP options after boot, or even dynamically, e.g. the sliding window size.

With that in mind, here's a quick overview of the steps taken during data transmission and reception:

1. The application first writes the data to a socket, which in turn places it in the socket's transmit buffer.
2. The kernel encapsulates the data into a PDU - protocol data unit.
3. The PDU is then moved onto the per-device transmit queue.
4. The NIC driver then pops the PDU from the transmit queue and copies it to the NIC.
5. The NIC sends the data and raises a hardware interrupt.
6. On the other end of the communication channel the NIC receives the frame, copies it into its receive buffer and raises a hardware interrupt.
7. The kernel in turn handles the interrupt and raises a soft interrupt to process the packet.
8. Finally the kernel handles the soft interrupt and moves the packet up the TCP/IP stack for decapsulation and puts it in a receive buffer for a process to read from.
 
To make persistent changes to the kernel settings described below, add the entries to the /etc/sysctl.conf file and then run "sysctl -p" to apply.
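For example (the specific key shown here is only an illustration of the workflow):

    # /etc/sysctl.conf
    net.ipv4.tcp_window_scaling = 1

    # apply the changes without rebooting
    sysctl -p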

As with most operating systems, the default maximum TCP buffer sizes in Linux are way too small. I suggest changing them to the following settings:

To increase the TCP max buffer size settable using setsockopt():
A good starting point is the BDP (bandwidth-delay product) based on a measured delay, e.g. multiply the bandwidth of the link by the average round trip time to some host.
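For example, a 1 Gbit/s path (125 MB/s) with a 50 ms RTT has a BDP of roughly 125 MB/s x 0.05 s ≈ 6.25 MB, so a 16 MB ceiling leaves comfortable headroom. The values below are a common starting point, not a universal recommendation:

    # maximum buffer sizes an application may request with setsockopt()
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216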

To increase the Linux autotuning TCP buffer limits (the min, default, and max number of bytes to use), set the max to 16MB for 1GE, and 32MB or 64MB for 10GE:
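A sketch for a 1GE host; the three numbers are min, default, and max in bytes, and the 16 MB max is an assumption you should size to your own BDP:

    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216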

Use netstat -s | grep -i listen to monitor for "xxx times the listen queue of a socket overflowed" events.

You should also verify that the following are all set to the default value of 1:
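These are typically window scaling, timestamps, and selective acknowledgements (SACK), all of which a stock kernel enables by default; a quick check:

    sysctl net.ipv4.tcp_window_scaling
    sysctl net.ipv4.tcp_timestamps
    sysctl net.ipv4.tcp_sack
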
Note: you should leave tcp_mem alone. The defaults are fine.

Another thing you can do to help increase TCP throughput with gigabit NICs is to increase the size of the interface transmit queue. For paths with more than 50 ms RTT, a value of 5000-10000 is recommended. To increase txqueuelen, do the following:
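A sketch, assuming the interface is called eth0 (substitute your own device name):

    # legacy tool
    ifconfig eth0 txqueuelen 5000

    # or with iproute2
    ip link set dev eth0 txqueuelen 5000
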
You can achieve increases in bandwidth of up to 10x by doing this on some long, fast paths. This is only a good idea for Gigabit Ethernet connected hosts, and may have other side effects such as uneven sharing between multiple streams.

Other kernel settings that help with the overall server performance when it comes to network traffic are the following:

TCP_FIN_TIMEOUT - This setting determines the time that must elapse before TCP/IP can release a closed connection and reuse its resources. During this TIME_WAIT state, reopening the connection to the client costs less than establishing a new connection. By reducing the value of this entry, TCP/IP can release closed connections faster, making more resources available for new connections. Adjust this in the presence of many connections sitting in the TIME_WAIT state:
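The value below is illustrative, not a recommendation for every workload; the kernel default is 60 seconds:

    net.ipv4.tcp_fin_timeout = 15
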
TCP_KEEPALIVE_INTERVAL - This determines the wait time between keepalive probes. To set:
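The value below is illustrative; the kernel default is 75 seconds:

    net.ipv4.tcp_keepalive_intvl = 30
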
TCP_KEEPALIVE_PROBES - This determines the number of probes before timing out. To set:
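Again an illustrative value; the kernel default is 9 probes:

    net.ipv4.tcp_keepalive_probes = 5
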
TCP_TW_RECYCLE - This enables fast recycling of TIME_WAIT sockets. The default value is 0 (disabled). Should be used with caution with loadbalancers.
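To enable it (keeping in mind the caveat above, and that clients behind NAT can be affected):

    net.ipv4.tcp_tw_recycle = 1
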
TCP_TW_REUSE - This allows reusing sockets in the TIME_WAIT state for new connections when it is safe from a protocol viewpoint. The default value is 0 (disabled). It is generally a safer alternative to tcp_tw_recycle.
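To enable it:

    net.ipv4.tcp_tw_reuse = 1
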
Note: The tcp_tw_reuse setting is particularly useful in environments where numerous short connections are open and left in TIME_WAIT state, such as web servers and loadbalancers. Reusing the sockets can be very effective in reducing server load.
Starting in Linux 2.6.7 (and back-ported to 2.4.27), Linux includes alternative congestion control algorithms besides the traditional 'reno' algorithm. These are designed to recover quickly from packet loss on high-speed WANs.

There are a couple additional sysctl settings for kernels 2.6 and newer:
To avoid caching ssthresh from previous connections:
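For example:

    net.ipv4.tcp_no_metrics_save = 1
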
To increase the length of the network device input queue (the backlog of received packets waiting to be processed), which is recommended for 10G NICs:
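For example (30000 is an illustrative value):

    net.core.netdev_max_backlog = 30000
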
Starting with version 2.6.13, Linux supports pluggable congestion control algorithms. The congestion control algorithm used is set using the sysctl variable net.ipv4.tcp_congestion_control, which is set to bic/cubic or reno by default, depending on which version of the 2.6 kernel you are using.
To get a list of congestion control algorithms that are available in your kernel (if you are running 2.6.20 or higher), run:
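For example:

    sysctl net.ipv4.tcp_available_congestion_control
    # example output (will vary): net.ipv4.tcp_available_congestion_control = cubic reno
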
The choice of congestion control options is selected when you build the kernel. The following are some of the options available in the 2.6.23 kernel:
* reno: Traditional TCP used by almost all other OSes. (default)
* cubic: CUBIC-TCP (NOTE: There is a cubic bug in the Linux 2.6.18 kernel used by Redhat Enterprise Linux 5.3 and Scientific Linux 5.3. Use 2.6.18.2 or higher!)
* bic: BIC-TCP
* htcp: Hamilton TCP
* vegas: TCP Vegas
* westwood: optimized for lossy networks

If cubic and/or htcp are not listed when you do 'sysctl net.ipv4.tcp_available_congestion_control', try the following, as most distributions include them as loadable kernel modules:
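The module names below follow the usual tcp_<algorithm> naming:

    /sbin/modprobe tcp_htcp
    /sbin/modprobe tcp_cubic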

For long fast paths, I highly recommend using cubic or htcp. Cubic is the default for a number of Linux distributions, but if it is not the default on your system, you can do the following:
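For example, to select cubic at runtime (add the same key to /etc/sysctl.conf to make it persistent):

    sysctl -w net.ipv4.tcp_congestion_control=cubic
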
On systems supporting RPMs, you can also try the ktune RPM, which sets many of these as well.

If you have a loaded server with many connections in the TIME_WAIT state, decrease the TIME_WAIT interval that determines the time that must elapse before TCP/IP can release a closed connection and reuse its resources. This interval between closure and release is known as the TIME_WAIT state, or twice the maximum segment lifetime (2MSL) state. During this time, reopening the connection to the client and server costs less than establishing a new connection. By reducing the value of this entry, TCP/IP can release closed connections faster, providing more resources for new connections. Adjust this parameter if the running application requires rapid release or creation of new connections, or if throughput suffers because of many connections sitting in the TIME_WAIT state:
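On Linux this is commonly adjusted via the same net.ipv4.tcp_fin_timeout knob described above; 15 seconds is again an illustrative value:

    net.ipv4.tcp_fin_timeout = 15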

If you are often dealing with SYN floods, the following tuning can be helpful:
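The three values below are illustrative starting points rather than universal recommendations (the line numbers are referenced in the explanations that follow):

    net.ipv4.tcp_max_syn_backlog = 4096
    net.ipv4.tcp_synack_retries = 3
    net.ipv4.tcp_max_orphans = 65536
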
The parameter on line 1 is the maximum number of remembered connection requests that have not yet received an acknowledgment from the connecting client.
The parameter on line 2 determines the number of SYN+ACK packets sent before the kernel gives up on the connection. To open the other side of the connection, the kernel sends a SYN with a piggybacked ACK on it, to acknowledge the earlier received SYN. This is part 2 of the three-way handshake.
And lastly, the parameter on line 3 is the maximum number of TCP sockets not attached to any user file handle that the system will hold. If this number is exceeded, orphaned connections are reset immediately and a warning is printed. This limit exists only to prevent simple DoS attacks; you _must_ not rely on it or lower the limit artificially. Rather, increase it (probably after increasing installed memory) if network conditions require more than the default value, and tune your network services to linger and kill such states more aggressively.

More information on tuning parameters and defaults for Linux 2.6 is available in the file ip-sysctl.txt, which is part of the 2.6 source distribution.

Warning on Large MTUs: If you have configured your Linux host to use 9K MTUs, but the connection is using 1500 byte packets, then you actually need 9/1.5 = 6 times more buffer space in order to fill the pipe. In fact some device drivers only allocate memory in power of two sizes, so you may even need 16/1.5 = 11 times more buffer space!

And finally a warning for both 2.4 and 2.6: for very large BDP paths where the TCP window is > 20 MB, you are likely to hit the Linux SACK implementation problem. If Linux has too many packets in flight when it gets a SACK event, it takes too long to locate the SACKed packet, and you get a TCP timeout and CWND goes back to 1 packet. Restricting the TCP buffer size to about 12 MB seems to avoid this problem, but clearly limits your total throughput. Another solution is to disable SACK.
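Disabling SACK is a single sysctl; this is only worth doing if you actually hit the problem, since SACK is normally beneficial:

    net.ipv4.tcp_sack = 0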

Starting with Linux 2.4, Linux implemented a sender-side autotuning mechanism, so that setting the optimal buffer size on the sender is not needed. This assumes you have set large buffers on the receive side, as the sending buffer will not grow beyond the size of the receive buffer.

However, Linux 2.4 has some other strange behavior that one needs to be aware of. For example, the value of ssthresh for a given path is cached in the routing table. This means that if a connection has a retransmission and reduces its window, then all connections to that host for the next 10 minutes will use a reduced window size and not even try to increase it. The only way to disable this behavior is to do the following before all new connections (you must be root):
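On those kernels you can flush the cached route metrics like this:

    sysctl -w net.ipv4.route.flush=1
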
I would also like to point out how important it is to have a sufficient number of available file descriptors, since pretty much everything on Linux is a file.
To check your current max and availability run the following:
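A sketch of the check; the numbers shown are the example values discussed below and will differ on your system:

    cat /proc/sys/fs/file-nr
    # 197600    0    3624009
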
The first value (197600) is the number of allocated file handles. The second value (0) is the number of allocated but unused file handles, and the third value (3624009) is the system-wide maximum number of file handles. The maximum can be increased by tuning the fs.file-max kernel parameter. To see how many file descriptors are being used by a process, you can use one of the following, where 28290 is the process id:
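A sketch; 2097152 is an arbitrary illustrative limit and 28290 is just an example PID:

    # raise the system-wide limit on file handles
    sysctl -w fs.file-max=2097152

    # count the descriptors a single process has open
    ls /proc/28290/fd | wc -l
    # or (lsof also lists memory-mapped files, so the count is approximate)
    lsof -p 28290 | wc -l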

And finally, if you are using stateful iptables rules, the nf_conntrack kernel module might run out of memory for connection tracking and an error will be logged: nf_conntrack: table full, dropping packet

In order to raise that limit, and therefore allocate more memory for connection tracking, you need to calculate how much RAM each connection uses. You can get that information from /proc/slabinfo.
The nf_conntrack entry shows the number of active entries, how big each object is, and how many fit in a slab (each slab fits in one or more kernel pages, usually 4K if not using hugepages). Accounting for the overhead of the kernel page size, you can see from the slabinfo output that each nf_conntrack object takes about 316 bytes (this will differ between systems). So to track 1M connections you'll need to allocate roughly 316 MB of memory.
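A sketch of the check and of raising the limit; the 1M value is illustrative and should be sized to the memory you are willing to dedicate:

    # object size and slab usage for conntrack entries
    grep conntrack /proc/slabinfo

    # raise the tracking limit
    sysctl -w net.netfilter.nf_conntrack_max=1000000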