Troubleshooting the "Out of socket memory" error

If the "Out of socket memory" error message occasionally gets written to the /var/log/messages file, it usually means one of two things:
  1. The server is running out of TCP memory
  2. There are too many orphaned sockets on the system

To see how much memory the kernel is configured to dedicate to TCP, read the /proc/sys/net/ipv4/tcp_mem file (for example with cat).

tcp_mem is a vector of 3 integers: min, pressure and max.

  • min: below this number of pages, TCP does not try to limit its memory consumption.
  • pressure: when the amount of memory allocated to TCP by the kernel exceeds this threshold, the kernel starts to moderate the memory consumption. This mode is exited when memory consumption falls under min.
  • max: the maximum number of pages allowed for queuing by all TCP sockets. When the system goes above this threshold, the kernel starts logging the "Out of socket memory" error.
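The three thresholds above can be read programmatically. What follows is a minimal sketch, not a definitive tool: the helper names parse_tcp_mem and pages_to_mib are hypothetical, the sample values are made up for illustration, and the 4096-byte page size is an assumption that should be checked on the target host (for example with os.sysconf("SC_PAGE_SIZE")).

```python
def parse_tcp_mem(text):
    """Return the tcp_mem thresholds (in pages) as a dict."""
    values = [int(v) for v in text.split()]
    if len(values) != 3:
        raise ValueError("expected three integers: min, pressure, max")
    return dict(zip(("min", "pressure", "max"), values))


def pages_to_mib(pages, page_size=4096):
    """Convert a page count to MiB; the 4096-byte page size is an assumption."""
    return pages * page_size / (1024 * 1024)


# Made-up sample values for illustration only. On a real host, use
# open("/proc/sys/net/ipv4/tcp_mem").read() instead.
sample = "185688 247584 371376"
for name, pages in parse_tcp_mem(sample).items():
    print(f"{name:>8}: {pages} pages (~{pages_to_mib(pages):.0f} MiB)")
```

Converting pages to MiB makes it easier to judge whether the configured limits are reasonable for the amount of RAM installed.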

Now let's compare the 'max' number with how much of that memory TCP actually uses, as reported in the /proc/net/sockstat file.

The last value on the TCP line (mem 102910 in this example) is the number of pages currently allocated to TCP. Here the value is well below the maximum number of pages the kernel is willing to give to TCP, the 'max' value of tcp_mem described above, so we can dismiss this as a cause of the error.

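The same comparison can be scripted. This is a rough sketch under stated assumptions: the function names are hypothetical, only the mem 102910 value comes from the example above, the other sockstat counters are zero placeholders, and the tcp_mem values are made up.

```python
def parse_sockstat_tcp(text):
    """Return the counters on the 'TCP:' line of sockstat as a dict."""
    for line in text.splitlines():
        if line.startswith("TCP:"):
            fields = line.split()[1:]  # drop the "TCP:" label
            # The remaining fields alternate between name and value.
            return {k: int(v) for k, v in zip(fields[0::2], fields[1::2])}
    raise ValueError("no TCP line found")


def tcp_mem_usage(sockstat_text, tcp_mem_text):
    """Return (pages in use, max pages, fraction of max in use)."""
    used = parse_sockstat_tcp(sockstat_text)["mem"]
    limit = int(tcp_mem_text.split()[2])  # third tcp_mem value is 'max'
    return used, limit, used / limit


# Placeholder input: only "mem 102910" is taken from the article.
# On a real host, read /proc/net/sockstat and /proc/sys/net/ipv4/tcp_mem.
sockstat = "sockets: used 0\nTCP: inuse 0 orphan 0 tw 0 alloc 0 mem 102910\n"
tcp_mem = "185688 247584 371376"  # made-up thresholds
used, limit, ratio = tcp_mem_usage(sockstat, tcp_mem)
print(f"TCP uses {used} of {limit} pages ({ratio:.0%} of max)")
```

A ratio that stays well below 1 during peak load suggests TCP memory is not the cause of the error.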
To check whether the server has too many orphan sockets, first read the limit from the /proc/sys/net/ipv4/tcp_max_orphans file.

An orphan socket is a socket that is not associated with a file descriptor. This usually happens after the close() call: no file descriptor references the socket any longer, but it still exists in memory until TCP is done with it.

The tcp_max_orphans file holds the maximum number of TCP sockets not attached to any user file handle that the system will hold. If this number is exceeded, orphaned connections are reset immediately and a warning is printed. This limit exists only to prevent simple DoS attacks. Each orphan socket eats up to 64K of unswappable memory.

Now that we know the limit on orphaned sockets, let's look at the current number of orphaned sockets, which is reported on the TCP line of /proc/net/sockstat.

In this case the 'orphan 126800' field on the TCP line is the one we are interested in. If this number is bigger than the one from tcp_max_orphans, this can be a reason for the "Out of socket memory" error.

Fixing this is a matter of increasing the limit by writing a larger value to /proc/sys/net/ipv4/tcp_max_orphans (or by setting net.ipv4.tcp_max_orphans with sysctl).
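The check of the orphan counter against the limit can be sketched as follows. The function names are hypothetical, the orphan value 126800 is the one from the example above, the remaining sockstat counters are zero placeholders, and the limit of 65536 is a made-up value, not a claim about any system's default.

```python
import re


def orphan_count(sockstat_text):
    """Extract the 'orphan' counter from the TCP line of sockstat."""
    match = re.search(r"^TCP:.*\borphan\s+(\d+)", sockstat_text, re.MULTILINE)
    if match is None:
        raise ValueError("no orphan counter found on the TCP line")
    return int(match.group(1))


def orphans_over_limit(sockstat_text, max_orphans_text):
    """True when the current orphan count exceeds tcp_max_orphans."""
    return orphan_count(sockstat_text) > int(max_orphans_text)


# Placeholder input: only "orphan 126800" is taken from the article.
# On a real host, read /proc/net/sockstat and
# /proc/sys/net/ipv4/tcp_max_orphans instead.
sockstat = "sockets: used 0\nTCP: inuse 0 orphan 126800 tw 0 alloc 0 mem 0\n"
max_orphans = "65536\n"  # made-up limit for illustration
if orphans_over_limit(sockstat, max_orphans):
    print("orphan count exceeds tcp_max_orphans")
else:
    print("orphan count is within the limit")
```

Because the counter fluctuates, a single reading says little. Sampling it repeatedly during peak load gives a more useful picture.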

One thing worth mentioning is that in certain cases the kernel penalizes some sockets more heavily, multiplying the number of orphans by 2 or 4 to artificially increase the "score" of the "bad socket".

To account for that, get the number of orphaned sockets during peak server utilization and multiply it by 4 to be safe. That should be the value you set in tcp_max_orphans.
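As a worked example of this sizing rule, here is a small sketch. The function name suggested_max_orphans is hypothetical, and of the sample readings only 126800 comes from the article; the others are made up.

```python
def suggested_max_orphans(peak_samples, safety_factor=4):
    """Peak observed orphan count times the safety factor (4 by default)."""
    if not peak_samples:
        raise ValueError("need at least one orphan-count sample")
    return max(peak_samples) * safety_factor


# Orphan counts sampled during peak utilization (illustrative values).
samples = [98000, 126800, 110250]
print("suggested tcp_max_orphans:", suggested_max_orphans(samples))
```

Since each orphan can hold up to 64K of unswappable memory, it is worth checking that the resulting limit is something the server's RAM can actually back.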

In some cases, if there are many short-lived TCP connections on the system, the number of orphaned sockets, along with sockets in TIME_WAIT, can be quite large. To fix this situation you might need to decrease the TIME_WAIT timeout (MSL) and experiment with the tcp_tw_reuse / tcp_tw_recycle kernel tunables as described in my other article on TCP Tuning (note that tcp_tw_recycle was removed in Linux 4.12).


