Linux Administration: Diagnosing High CPU utilization and memory leaks

I have a cluster of systems running a particular kind of networking software that under heavy traffic and performing deep packet inspection remains relatively underutilized in terms of CPU load and memory consumption.

One day though the CPU spiked from 0.5 to 18 on a 12 core system on every node on that cluster, including the ones that were just passive fail-overs and that did not do any processing, just being a member of the same multicast domain. Nothing seemed out of the ordinary though, the traffic pattern was the same, no upgrades, updates or any system changes whatsoever. OSSEC did not report any file checksum changes, so I knew the systems were not compromised.

This is a perfect case to demonstrate a way to troubleshoot sudden and persistent CPU utilization spikes and how to look for memory leaks. Of course this will not apply to every case, but demonstrates the use of certain tool-set and the methodology behind approaching similar problems.

First I checked the load averages on the server:
Then I got the list of processes that consume the most CPU:

File: gistfile1.sh ------------------ root@server1:~# ps -eo pcpu,pid,user,args | sort -b -k1,1n ... 10.6 30146 root /opt/nd/networkingd 92.0 18352 nobody /opt/nd/networkingd 92.0 18360 nobody /opt/nd/networkingd 92.0 18377 nobody /opt/nd/networkingd 92.1 18344 nobody /opt/nd/networkingd 92.1 18349 nobody /opt/nd/networkingd 92.1 18357 nobody /opt/nd/networkingd 92.1 18361 nobody /opt/nd/networkingd 92.1 18370 nobody /opt/nd/networkingd 92.2 18345 nobody /opt/nd/networkingd 92.2 18346 nobody /opt/nd/networkingd 92.2 18353 nobody /opt/nd/networkingd 92.2 18354 nobody /opt/nd/networkingd ...

No surprises here, the main networking software is the one causing the load, as no other services are running on that server. I'll focus on the biggest consumer - PID 18354.

Next I performed a strace on that pid for about 10 seconds:

This returned lots of the following lines:
The only thing that stood out from this trace, was this particular failure to read file descriptor ID 11.

Then I checked to see what this file descriptor is:
So at least I learned that it’s a socket.

Got some more information about that particular socket:
Then I checked the memory consumption and any indicators for a memory leak for that PID:
Pretty high memory usage, each of the networking child processes havs close to that much mapped memory.

File: gistfile1.sh ------------------ root@server1:~# pmap -x 18354 18354: /opt/nd/networkingd Address Kbytes RSS Anon Locked Mode Mapping 0000000000400000 9448 - - - r-x-- networkingd 0000000000e3a000 76 - - - rwx-- networkingd 0000000000e4d000 1344 - - - rwx-- [ anon ] 0000000002bbf000 442288 - - - rwx-- [ anon ] 00007f9860000000 132 - - - rwx-- [ anon ] 00007f9860021000 65404 - - - ----- [ anon ] 00007f98677fc000 88 - - - r-x-- libresolv-2.11.1.so 00007f9867812000 2044 - - - ----- libresolv-2.11.1.so 00007f9867a11000 4 - - - r-x-- libresolv-2.11.1.so 00007f9867a12000 4 - - - rwx-- libresolv-2.11.1.so 00007f9867a13000 8 - - - rwx-- [ anon ] 00007f9867a15000 20 - - - r-x-- libnss_dns-2.11.1.so 00007f9867a1a000 2044 - - - ----- libnss_dns-2.11.1.so 00007f9867c19000 4 - - - r-x-- libnss_dns-2.11.1.so 00007f9867c1a000 4 - - - rwx-- libnss_dns-2.11.1.so 00007f9867c1f000 4 - - - rwxs- shared 00007f9867c20000 4 - - - rwxs- zero (deleted) 00007f9867c21000 4 - - - rwxs- status 00007f986ac2a000 17536 - - - r-xs- ip-to-location.bin 00007f986bd4a000 15648 - - - r-xs- base_locations.txt 00007f986cc92000 1028 - - - rwx-- [ anon ] 00007f986ce65000 4 - - - ----- [ anon ] 00007f986ce66000 3972 - - - rwx-- [ anon ] 00007f986d247000 98308 - - - rwxs- zero (deleted) 00007f9873248000 61444 - - - rwxs- zero (deleted) 00007f9876e49000 5242880 - - - rwxs- zero (deleted) 00007f99b6e49000 104692 - - - rwxs- zero (deleted) 00007f99bd486000 195316 - - - rwxs- zero (deleted) 00007f99c9343000 88 - - - r-x-- libgcc_s.so.1 00007f99c9359000 2044 - - - ----- libgcc_s.so.1 00007f99c9558000 4 - - - r-x-- libgcc_s.so.1 00007f99c9559000 4 - - - rwx-- libgcc_s.so.1 00007f99c955a000 48 - - - r-x-- libnss_files-2.11.1.so 00007f99c9566000 2044 - - - ----- libnss_files-2.11.1.so 00007f99c9765000 4 - - - r-x-- libnss_files-2.11.1.so 00007f99c9766000 4 - - - rwx-- libnss_files-2.11.1.so 00007f99c9767000 40 - - - r-x-- libnss_nis-2.11.1.so 00007f99c9771000 2044 - - - ----- libnss_nis-2.11.1.so 00007f99c9970000 4 - - - r-x-- libnss_nis-2.11.1.so 00007f99c9971000 4 - - - rwx-- libnss_nis-2.11.1.so 00007f99c9972000 92 - - - r-x-- libnsl-2.11.1.so 00007f99c9989000 2044 - - - ----- libnsl-2.11.1.so 00007f99c9b88000 4 - - - r-x-- libnsl-2.11.1.so 00007f99c9b89000 4 - - - rwx-- libnsl-2.11.1.so 00007f99c9b8a000 8 - - - rwx-- [ anon ] 00007f99c9b8c000 32 - - - r-x-- libnss_compat-2.11.1.so 00007f99c9b94000 2044 - - - ----- libnss_compat-2.11.1.so 00007f99c9d93000 4 - - - r-x-- libnss_compat-2.11.1.so 00007f99c9d94000 4 - - - rwx-- libnss_compat-2.11.1.so 00007f99c9d99000 5120 - - - rwxs- zero (deleted) 00007f99ca299000 1856512 - - - rwxs- zero (deleted) 00007f9a3b799000 10240 - - - rwxs- zero (deleted) 00007f9a3c199000 4 - - - rwxs- zero (deleted) 00007f9a3c19a000 4 - - - rwxs- conninfo 00007f9a3c19b000 40 - - - rwxs- tipgroups 00007f9a3c1a5000 15628 - - - rwxs- zero (deleted) 00007f9a3d0e8000 208 - - - rwxs- zero (deleted) 00007f9a3d11c000 772 - - - rwxs- locations 00007f9a3d1dd000 772 - - - rwxs- pools 00007f9a3d29e000 1796 - - - rwxs- monitors 00007f9a3d45f000 580 - - - rwxs- zero (deleted) 00007f9a3d4f0000 6152 - - - rwxs- zero (deleted) 00007f9a3daf2000 1588 - - - rwxs- zero (deleted) 00007f9a3dc7f000 2276 - - - rwxs- zero (deleted) 00007f9a3deb8000 180 - - - rwxs- zero (deleted) 00007f9a3dee5000 180 - - - rwxs- zero (deleted) 00007f9a3df12000 180 - - - rwxs- zero (deleted) 00007f9a3df3f000 5080 - - - rwxs- zero (deleted) 00007f9a3e435000 1524 - - - r-x-- libc-2.11.1.so 00007f9a3e5b2000 2044 - - - ----- libc-2.11.1.so 00007f9a3e7b1000 16 - - - r-x-- libc-2.11.1.so 00007f9a3e7b5000 4 - - - rwx-- libc-2.11.1.so 00007f9a3e7b6000 20 - - - rwx-- [ anon ] 00007f9a3e7bb000 520 - - - r-x-- libm-2.11.1.so 00007f9a3e83d000 2044 - - - ----- libm-2.11.1.so 00007f9a3ea3c000 4 - - - r-x-- libm-2.11.1.so 00007f9a3ea3d000 4 - - - rwx-- libm-2.11.1.so 00007f9a3ea3e000 8 - - - r-x-- libdl-2.11.1.so 00007f9a3ea40000 2048 - - - ----- libdl-2.11.1.so 00007f9a3ec40000 4 - - - r-x-- libdl-2.11.1.so 00007f9a3ec41000 4 - - - rwx-- libdl-2.11.1.so 00007f9a3ec42000 96 - - - r-x-- libpthread-2.11.1.so 00007f9a3ec5a000 2044 - - - ----- libpthread-2.11.1.so 00007f9a3ee59000 4 - - - r-x-- libpthread-2.11.1.so 00007f9a3ee5a000 4 - - - rwx-- libpthread-2.11.1.so 00007f9a3ee5b000 16 - - - rwx-- [ anon ] 00007f9a3ee5f000 28 - - - r-x-- librt-2.11.1.so 00007f9a3ee66000 2044 - - - ----- librt-2.11.1.so 00007f9a3f065000 4 - - - r-x-- librt-2.11.1.so 00007f9a3f066000 4 - - - rwx-- librt-2.11.1.so 00007f9a3f067000 128 - - - r-x-- ld-2.11.1.so 00007f9a3f092000 4 - - - ----- [ anon ] 00007f9a3f093000 28 - - - rwx-- [ anon ] 00007f9a3f09a000 324 - - - rwxs- zero (deleted) 00007f9a3f0eb000 1620 - - - rwx-- [ anon ] 00007f9a3f282000 4 - - - rwx-- [ anon ] 00007f9a3f283000 4 - - - rwxs- zero (deleted) 00007f9a3f284000 8 - - - rwx-- [ anon ] 00007f9a3f286000 4 - - - r-x-- ld-2.11.1.so 00007f9a3f287000 4 - - - rwx-- ld-2.11.1.so 00007f9a3f288000 4 - - - rwx-- [ anon ] 00007fffb7f5c000 132 - - - rwx-- [ stack ] 00007fffb7fff000 4 - - - r-x-- [ anon ] ffffffffff600000 4 - - - r-x-- [ anon ] ---------------- ------ ------ ------ ------ total kB 8196392 - - -

Short of few regions that are shared and uninitialized (rwxs- zero (deleted)) nothing else really stood out.

After this I checked to see if we have enough file descriptors for connection allocation:
We have plenty available.

Next I checked to see how much memory the kernel is configured to dedicate to TCP:
And compared the 'max' number with how much of that memory TCP actually uses:

File: gistfile1.sh ------------------ root@server1:~# cat /proc/net/sockstat sockets: used 8012 TCP: inuse 7751 orphan 529 tw 14095 alloc 8209 mem 9413 UDP: inuse 39 mem 11 UDPLITE: inuse 0 RAW: inuse 1 FRAG: inuse 0 memory 0

The max number of pages currently allocated to TCP(9413) is less than what is reserved by the kernel (6961536).

Then I checked to see if there are too many orphaned sockets:
As you can see the current orphaned sockets (529) is less than the maximal number of TCP sockets not attached to any user file handle, displayed above (262144). So we are good on that front.

At this point I wanted to see the different states of the INET sockets so I wrote this quick script that shows the 4 important TCP states - ESTABLISHED, SYN_RECEIVED, TIME_WAIT and LAST_ACK and show me the top 5 users for them:

File: gistfile1.sh ------------------ usage() { cat << EOF usage: $0 options OPTIONS: -c top ips EOF } EXCLUDEIP=`echo $(cat /etc/network/interfaces | grep address | awk '{print $2}') | sed 's/ /|/g'` count=5 seq=$(seq 0 2 10) while getopts Ã¢â‚¬Å“hc:Ã¢â‚¬Â OPTION do case $OPTION in h) usage exit ;; c) count=$OPTARG seq=$(seq 0 2 $(expr $count \* 2)) ;; esac done TIME_WAIT=$(ss -n state time-wait | awk '{print $3}' | awk -F":" '{print $1}' | sort | uniq -c | sort -rn | egrep -vwi "$EXCLUDEIP" | head -n $count) ESTABLISHED=$(ss -n state established | awk '{print $3}' | awk -F":" '{print $1}' | sort | uniq -c | sort -rn | egrep -vwi "$EXCLUDEIP" | head -n $count) SYN_RECEIVED=$(ss -n state syn-recv | awk '{print $3}' | awk -F":" '{print $1}' | sort | uniq -c | sort -rn | egrep -vwi "$EXCLUDEIP" | head -n $count) LAST_ACK=$(ss -n state last-ack | awk '{print $3}' | awk -F":" '{print $1}' | sort | uniq -c | sort -rn | egrep -vwi "$EXCLUDEIP" | head -n $count) let i=0 for num in $TIME_WAIT do array_timewait[i]="$num" let i=i+1 done let i=0 for num in $ESTABLISHED do array_established[i]="$num" let i=i+1 done let i=0 for num in $SYN_RECEIVED do array_synreceived[i]="$num" let i=i+1 done let i=0 for num in $LAST_ACK do array_lastack[i]="$num" let i=i+1 done echo -e "ESTABLISHED SYN_RECEIVED TIME_WAIT LAST_ACK" echo -e "=======================================================================================" for i in $seq do echo -n "${array_established[$i]} " echo -n -e "${array_established[$i+1]}\t" echo -n "${array_synreceived[$i]} " echo -n -e "${array_synreceived[$i+1]}\t" echo -n "${array_timewait[$i]} " echo -n -e "${array_timewait[$i+1]}\t" echo -n "${array_lastack[$i]} " echo -e "${array_lastack[$i+1]}" done | column -t

Which after running returned the following result (I masked the first octets of the IP addresses):

File: gistfile1.sh ------------------ ESTABLISHED SYN_RECEIVED TIME_WAIT LAST_ACK ======================================================================================= 567 XXX.56.48.163 35 XXX.56.52.28 2169 XXX.56.48.163 33 XXX.57.214.223 402 XXX.57.214.223 33 XXX.57.214.223 1350 XXX.56.4.156 30 XXX.57.204.205 314 XXX.56.4.156 14 XXX.56.49.195 841 XXX.57.204.205 13 XXX.56.4.46 268 XXX.57.204.205 13 XXX.57.204.205 783 XXX.56.52.28 12 XXX.56.48.163 213 XXX.57.214.133 11 XXX.56.48.163 611 XXX.56.52.53 9 XXX.56.4.54

So nothing unusual here. TCP dump and kernel messages did not reveal anything unusual. Rebooting the software or the server did not help either.

At this point I ran out of ideas as to what to look for. I knew it's the networking software that was causing the high load and memory consumption as shown by the process list (or top), but since the software is closed source I could not go and look at the actual code.

The last thing I tried though was to start gathering memory consumption for all running processes and their children every minute for 5 days, by running the following through cron and storing it in a file along with the PIDs:
After 5 days I graphed the results with Graphite with all the processes memory usage superimposed on each other. There was a single process (the same one with the highest CPU usage) which was consuming an ever increasing memory over time indicative of a memory leak.

I sent the results to the product vendor and sure enough they found 2 bugs in their code, one that was not cleaning sockets and a memory leak.

Hope this was helpful.

Pages

Diagnosing High CPU utilization and memory leaks