Description of problem:
SRE was notified by a Zabbix alert of a performance issue on a starter cluster compute node: starter-us-east-1-node-compute-f1526. Watching htop, an iptables process was consuming 100% of a core for more than 20 minutes (same PID).

Version-Release number of selected component (if applicable):
v3.9.14
kernel: Linux ip-172-31-53-238.ec2.internal 3.10.0-693.21.1.el7.x86_64 #1 SMP Fri Feb 23 18:54:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Appears to be a recurring problem in the starter clusters.

Steps to Reproduce:
1. Unknown

Additional info:
- After 20 minutes, the iptables process appeared to exit and htop returned to normal.
- For unknown reasons, ssh to this box took about 10 times as long (~30 seconds) as to other nodes in the cluster.
- No excessive CPU/disk/memory pressure was otherwise noted (iptables was consuming a full core, but plenty of other cores were free; load average: 33.85 31.63 30.86).
If this happens again, can somebody run 'debuginfo-install iptables', then 'gdb attach <pid>' and 'backtrace', so we can see what's actually going on?
Can we add a cron job that checks whether iptables has been running for more than a minute, and if so attaches to the PID, generates a backtrace, and emails it out?
I'd be glad to install it if someone can write it to dump to a file system location. I'll monitor for dumps and email if one is generated.
I believe you can do:
- debuginfo-install iptables
- gdb attach `pidof iptables`
- set logging file /tmp/gdb.txt
- set logging on
- set pagination off
- backtrace
- quit

and that should dump the required info to a filesystem location.
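A minimal sketch of what that cron job could look like, combining the one-minute check with the gdb steps above. The threshold, dump directory, and file naming here are assumptions, not anything agreed on in this thread; it writes backtraces to a filesystem location for someone to monitor, rather than emailing directly.

```shell
#!/bin/bash
# Hypothetical cron-driven watcher: if any iptables process has been
# running longer than THRESHOLD seconds, dump a gdb backtrace of it to
# DUMP_DIR for later review. Assumes gdb and the iptables debuginfo
# package are already installed on the node.
THRESHOLD=60                      # seconds; "more than a minute"
DUMP_DIR=/var/tmp/iptables-bt     # assumed dump location

mkdir -p "$DUMP_DIR"
for pid in $(pgrep -x iptables); do
    # etimes = elapsed time since the process started, in seconds
    elapsed=$(ps -o etimes= -p "$pid" | tr -d ' ')
    [ -z "$elapsed" ] && continue
    if [ "$elapsed" -gt "$THRESHOLD" ]; then
        out="$DUMP_DIR/iptables-$pid-$(date +%Y%m%d%H%M%S).txt"
        # -batch makes gdb run the -ex commands and exit (the process
        # resumes once gdb detaches)
        gdb -batch -p "$pid" \
            -ex 'set pagination off' \
            -ex 'backtrace' > "$out" 2>&1
    fi
done
```

Run from cron every minute (e.g. `* * * * * /usr/local/bin/iptables-bt-watch.sh`, path assumed); the monitoring side then just has to watch DUMP_DIR for new files and mail them out.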
Ping on this issue?
I haven't seen this in a while, so I'm fine closing this and following the recommended capture steps if it reoccurs.
Gotcha; I've closed this - feel free to reopen / open a new one if it reoccurs.