Description of problem:
SRE was notified by a Zabbix alert of a performance issue on a starter cluster compute node: starter-us-east-1-node-compute-f1526. Watching htop, an iptables process was consuming 100% of a core for more than 20 minutes (same PID).
Version-Release number of selected component (if applicable):
kernel: Linux ip-172-31-53-238.ec2.internal 3.10.0-693.21.1.el7.x86_64 #1 SMP Fri Feb 23 18:54:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
This seems to be a recurring problem in the starter clusters.
Steps to Reproduce:
- After about 20 minutes, the iptables process appeared to exit and htop returned to normal.
- For unknown reasons, ssh to this box took about 10 times as long (~30 seconds) as to other nodes in the cluster.
- No excessive CPU/disk/memory pressure was otherwise noted (iptables was taking a full core, but plenty of other cores were free; load average: 33.85 31.63 30.86).
If this happens again, can somebody run 'debuginfo-install iptables', then 'gdb attach <pid>' and 'backtrace', so we can see what's actually going on?
Can we add a cron job that checks whether iptables has been running for more than a minute, then attaches to the PID, generates a backtrace, and emails it out?
I'd be glad to install it if someone can write it to dump to a filesystem location. I'll monitor for dumps and send an email if one is generated.
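A minimal watchdog along those lines might look like the sketch below. The one-minute threshold is from the comment above; the dump directory, the use of pgrep, and gdb batch mode are assumptions, and the emailing step is left to whoever monitors the dump directory.

```shell
#!/bin/bash
# Sketch of a watchdog: dump a gdb backtrace of any iptables process that
# has been running longer than THRESHOLD seconds. Paths are assumptions.

THRESHOLD=${THRESHOLD:-60}                      # seconds before we capture
DUMP_DIR=${DUMP_DIR:-/var/tmp/iptables-dumps}   # where backtraces land

# Print the elapsed runtime (in seconds) of a pid, or nothing if it is gone.
elapsed_secs() {
    ps -o etimes= -p "$1" 2>/dev/null | tr -d ' '
}

# Attach in batch mode, dump all thread backtraces, and detach immediately.
capture_backtrace() {
    local pid=$1 out="$DUMP_DIR/iptables-$pid-$(date +%s).bt"
    mkdir -p "$DUMP_DIR"
    gdb -batch -p "$pid" \
        -ex 'set pagination off' \
        -ex 'thread apply all backtrace' >"$out" 2>&1
}

main() {
    local pid secs
    for pid in $(pgrep -x iptables); do
        secs=$(elapsed_secs "$pid")
        if [ -n "$secs" ] && [ "$secs" -ge "$THRESHOLD" ]; then
            capture_backtrace "$pid"
        fi
    done
}

main
```

Run from cron (e.g. every minute); gdb's batch mode exits after the commands finish, so the watched process is only paused briefly.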
I believe you can do:
- debuginfo-install iptables
- gdb attach `pidof iptables`
then, at the gdb prompt:
- set logging file /tmp/gdb.txt
- set logging on
- set pagination off
- backtrace
and that should dump the required info to a filesystem location (the logging settings must be issued before the backtrace so its output lands in /tmp/gdb.txt).
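For unattended capture, the same gdb steps can be written to a command file and run in batch mode. A sketch, where the command-file path is an assumption and /tmp/gdb.txt is the log path from above:

```shell
# Write the gdb steps from above to a command file. Logging is enabled
# before the backtrace so its output goes to /tmp/gdb.txt.
cat > /tmp/iptables-bt.gdb <<'EOF'
set pagination off
set logging file /tmp/gdb.txt
set logging on
backtrace
EOF
```

It can then be run non-interactively with `gdb -batch -x /tmp/iptables-bt.gdb -p "$(pidof iptables)"`; -batch makes gdb detach and exit once the commands finish.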
Ping on this issue?
I haven't seen this in a while, so I'm fine with closing this and following the recommended capture steps if it reoccurs.
Gotcha; I've closed this - feel free to reopen / open a new one if it reoccurs.