The parameters we’ll adjust are as follows:
Increase max open files to 100,000 from the default (typically 1024). In Linux, every open network socket requires a file descriptor. Increasing this limit will ensure that lingering TIME_WAIT sockets and other consumers of file descriptors don’t impact our ability to handle lots of concurrent requests.
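Decrease the time that sockets linger after closing by lowering tcp_fin_timeout from its default of 60 seconds to 10. This setting is commonly described as the TIME_WAIT timeout, but the kernel documentation defines it as the timeout for orphaned connections in the FIN_WAIT_2 state; the TIME_WAIT interval itself is not changed by it. You can lower this even further, but too low, and you can run into socket close errors in networks with lots of jitter. We will also set tcp_tw_reuse to tell the kernel it can reuse sockets in the TIME_WAIT state.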
Increase the port range for ephemeral (outgoing) ports, by lowering the minimum port to 10000 (normally 32768), and raising the maximum port to 65000 (normally 61000). Important: This means you can’t have server software that attempts to bind to a port above 9999! If you need to bind to a higher port, say 10075, just modify this port range appropriately.
Increase the read/write TCP buffers (tcp_rmem and tcp_wmem) to allow for larger window sizes. This lets more data be in flight before an ACK is required, increasing throughput. We won’t tune the total TCP memory (tcp_mem), since this is automatically tuned based on available memory by Linux.
Decrease the VM swappiness parameter, which discourages the kernel from swapping memory to disk. By default, Linux attempts to swap out idle processes fairly aggressively, which is counterproductive for long-running server processes that desire low latency.
Increase the TCP congestion window, and disable reverting to TCP slow start after the connection is idle. By default, TCP starts with a single small segment, gradually increasing it by one each time. This results in unnecessary slowness that impacts the start of every request – which is especially bad for HTTP.
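To put the ephemeral port change above in numbers, here is a small sketch in shell arithmetic. The helper name port_count is made up for illustration; the ranges are the ones quoted in the list.

```shell
# port_count MIN MAX: number of ports in an inclusive range.
# (port_count is an illustrative helper, not a standard command.)
port_count() {
    echo $(( $2 - $1 + 1 ))
}

port_count 32768 61000   # default range quoted above -> 28233
port_count 10000 65000   # tuned range                -> 55001
```

So the tuned range gives nearly twice as many local ports for outgoing connections. To see what a box is currently using, you can read /proc/sys/net/ipv4/ip_local_port_range.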
Ok, enough chat, more code.
To start, edit /etc/sysctl.conf and add these lines:
# /etc/sysctl.conf

# Increase system file descriptor limit
fs.file-max = 100000

# Discourage Linux from swapping idle processes to disk (default = 60)
vm.swappiness = 10

# Increase ephemeral IP ports
net.ipv4.ip_local_port_range = 10000 65000

# Increase Linux autotuning TCP buffer limits
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
# Don't set tcp_mem itself! Let the kernel scale it based on RAM.
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# Make room for more TIME_WAIT sockets due to more clients,
# and allow them to be reused if we run out of sockets
# Also increase the max packet backlog
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 10

# Disable TCP slow start on idle connections
net.ipv4.tcp_slow_start_after_idle = 0

# If your servers talk UDP, also up these limits
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192

# Disable source routing and redirects
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0

# Log packets with impossible addresses for security
net.ipv4.conf.all.log_martians = 1
Since some of these settings can be cached by networking services, it’s best to reboot to apply them properly (sysctl -p does not work reliably).
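After the reboot, it is worth checking that the running kernel actually has the values from the file. The following is a minimal sketch, assuming each setting is a plain "key = value" line with no trailing comment and that dots in the key map to slashes under /proc/sys. The names check_sysctl and squash are made up for this example.

```shell
# squash: collapse runs of whitespace to single spaces and trim both ends.
# The kernel separates multi-value settings with tabs, the config with spaces.
squash() {
    tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

# check_sysctl CONF [ROOT]: compare each "key = value" line in CONF with the
# live value under ROOT (default /proc/sys). Prints one line per mismatch and
# returns non-zero if any setting differs or is missing.
# (check_sysctl is an illustrative helper, not a standard tool.)
check_sysctl() {
    conf=$1
    root=${2:-/proc/sys}
    status=0
    while IFS='=' read -r key want; do
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        [ -z "$key" ] && continue
        want=$(printf '%s\n' "$want" | squash)
        # net.ipv4.tcp_rmem -> $root/net/ipv4/tcp_rmem
        path="$root/$(printf '%s' "$key" | tr '.' '/')"
        if [ -r "$path" ]; then
            have=$(squash < "$path")
        else
            have="(missing)"
        fi
        if [ "$have" != "$want" ]; then
            echo "MISMATCH $key: want '$want', have '$have'"
            status=1
        fi
    done <<EOF
$(grep -v '^[[:space:]]*#' "$conf" | grep '=')
EOF
    return $status
}
```

Running `check_sysctl /etc/sysctl.conf` should print nothing and return 0 when every setting has been applied. The optional second argument exists only so the function can be pointed at a test directory instead of the real /proc/sys.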
Open File Descriptors
In addition to the Linux fs.file-max kernel setting above, we need to edit a few more files to increase the file descriptor limits. The reason is that the setting above only defines a system-wide maximum; we still need to tell the shell what our per-user session limits are.
So, first edit /etc/security/limits.conf to increase our session limits:
# /etc/security/limits.conf
# allow all users to open 100000 files
# alternatively, replace * with an explicit username
* soft nofile 100000
* hard nofile 100000
/etc/ssh/sshd_config needs to make sure to use PAM:
# /etc/ssh/sshd_config
# ensure we consult pam
UsePAM yes
/etc/pam.d/sshd needs to load the pam_limits.so module so that the modified limits are applied:
# /etc/pam.d/sshd
# ensure pam includes our limits
session required pam_limits.so
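Since these two settings live in different files, a quick check can save a confusing debugging session. This is only a sketch: check_pam_limits is a made-up helper, and it assumes the pam_limits.so line sits directly in the file you pass in. Some distributions pull it in through an included file instead, in which case this check will report a false negative.

```shell
# check_pam_limits SSHD_CONFIG PAM_FILE: report whether the two settings
# shown above are present. (Illustrative helper; the paths are arguments so
# it can also be pointed at copies of the real files.)
check_pam_limits() {
    rc=0
    # sshd_config keywords are case-insensitive, hence -i.
    if ! grep -Eiq '^[[:space:]]*UsePAM[[:space:]]+yes' "$1"; then
        echo "UsePAM yes not found in $1"
        rc=1
    fi
    if ! grep -Eq '^[[:space:]]*session[[:space:]]+required[[:space:]]+pam_limits\.so' "$2"; then
        echo "pam_limits.so not loaded in $2"
        rc=1
    fi
    return $rc
}
```

On the server this would be called as `check_pam_limits /etc/ssh/sshd_config /etc/pam.d/sshd`; it prints nothing when both lines are found.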
You can confirm these settings have taken effect by opening a new ssh connection to the box and checking ulimit:
$ ulimit -n
100000
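Note that ulimit only reports the limit of the shell you run it in. A server process started at boot does not go through an ssh login, so it is worth checking what that process actually received. A minimal sketch, assuming a Linux /proc filesystem; nofile_limits is a made-up helper name.

```shell
# nofile_limits FILE: print the soft and hard "Max open files" limits from a
# /proc/<pid>/limits-style file. (Illustrative helper name.)
nofile_limits() {
    awk '/^Max open files/ { print $4, $5 }' "$1"
}

# Limits of the current shell; /proc/$$/limits describes this process.
if [ -r "/proc/$$/limits" ]; then
    nofile_limits "/proc/$$/limits"
fi
```

Replace `$$` with the PID of your server process to see its soft and hard limits.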
Why Linux has evolved to require 4 different settings in 4 different files is beyond me, but that’s a topic for a different post. :)
TCP Congestion Window
Finally, let’s increase the TCP congestion window from 1 to 10 segments. This is configured per route, which makes it a more manual process than our sysctl settings. First, use ip route to find the default route (the line beginning with "default" below):
$ ip route
default via 10.248.77.193 dev eth0 proto kernel
10.248.77.192/26 dev eth0 proto kernel scope link src 10.248.77.212
Copy that line, and paste it back to the ip route change command, adding initcwnd 10 to the end to increase the congestion window:
$ sudo ip route change default via 10.248.77.193 dev eth0 proto kernel initcwnd 10
To make this persistent across reboots, you’ll need to add a few lines of bash like the following to a startup script somewhere. Often the easiest candidate is just pasting these lines into
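# Grab the first default route, then re-apply it with a larger initial window.
defrt=$(ip route | grep "^default" | head -1)
# $defrt is deliberately unquoted so it splits into separate arguments.
ip route change $defrt initcwnd 10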
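If the startup script can run more than once, or on a box with no default route, a slightly more defensive variant may be useful. This is a sketch under those assumptions, not a tested drop-in; with_initcwnd is a made-up helper that only builds the route string, so the part that needs root is left as a comment.

```shell
# with_initcwnd ROUTE N: print ROUTE with "initcwnd N" at the end, replacing
# any initcwnd value the route already carries. Fails if ROUTE is empty.
# (with_initcwnd is an illustrative helper, not part of iproute2.)
with_initcwnd() {
    [ -n "$1" ] || return 1
    # Drop an existing "initcwnd <number>" and any trailing whitespace,
    # so the option is never duplicated.
    base=$(printf '%s\n' "$1" |
        sed -E 's/[[:space:]]+initcwnd[[:space:]]+[0-9]+//; s/[[:space:]]+$//')
    echo "$base initcwnd $2"
}

# Intended use at boot (requires root, so left commented out here):
#   defrt=$(ip route | grep '^default' | head -1)
#   newrt=$(with_initcwnd "$defrt" 10) && ip route change $newrt
```

Because the helper returns non-zero on an empty route, the `&&` keeps ip route change from running when no default route was found.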
Once you’re done with all these changes, you’ll need to either bundle a new machine image, or integrate these changes into a system management package such as Chef or Puppet.
The above settings were pulled together from a variety of other resources out there, and then validated through testing on EC2. You may need to tweak the exact limits depending on your application’s profile. Below are a few additional posts that make good reading:
- US Dept of Energy Guide to Linux TCP Tuning
- Linux tuning parameters used by Last.fm
- Definitions of Linux TCP kernel variables
- Understanding ephemeral ports
- In-depth post by CDN Planet on TCP slow start (with tests!)
- Google Research Paper Proposing a Default Congestion Window of 10 Segments
- Determining a safe value for tcp_tw_reuse (ServerFault)
- Dropping of connections with tcp_tw_recycle (StackOverflow)