Bug 1459589 - CPU soft lockup caused by iptables
Status: NEW
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Assigned To: Derek Carr
QA Contact: DeShuai Ma
Whiteboard: OpsBlocker
Reported: 2017-06-07 10:18 EDT by ihorvath
Modified: 2017-08-31 11:55 EDT

Type: Bug

Description ihorvath 2017-06-07 10:18:04 EDT
Description of problem:
Quite often we see kernel messages such as these:
Message from syslogd@ip-172-31-50-120 at Jun  5 13:51:58 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [iptables:47079]

Message from syslogd@ip-172-31-50-120 at Jun  6 10:50:14 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [iptables:38723]

Message from syslogd@ip-172-31-62-249 at Jun  6 10:50:12 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [iptables:41786]

This happens on a cluster that's fairly busy, but the containers are not changing frequently: the cluster runs our monitoring, so it is not rare for a pod to be around for more than a month. There is a massive amount of network traffic, however; millions of metrics pass through these pods every day. At this point we have no idea of the cause. Is it something Kubernetes does? Is AWS having problems? What puzzles me is that we are not changing any routes or pods or anything, so is iptables still churning through rules non-stop?
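
One way to narrow that last question down is to sample the ruleset on an affected node and see how much it actually changes between samples. A minimal sketch, assuming Python 3 and root on the node (the 10-second interval is an arbitrary choice):

#!/usr/bin/env python3
# Rough check for iptables churn: sample the ruleset periodically and
# report how many rules were added/removed between samples.
# Illustrative sketch only; run as root on the affected node.
import subprocess
import time

def snapshot():
    # iptables-save prints the full ruleset; rule lines start with "-A".
    out = subprocess.run(["iptables-save"], capture_output=True,
                         text=True, check=True).stdout
    return set(l for l in out.splitlines() if l.startswith("-A"))

prev = snapshot()
while True:
    time.sleep(10)
    cur = snapshot()
    print("rules=%d (+%d/-%d in last 10s)"
          % (len(cur), len(cur - prev), len(prev - cur)))
    prev = cur

If the ruleset stays essentially identical while the lockups continue, the load is more likely kube-proxy's periodic full resync (it rewrites the table on a timer, --iptables-sync-period, 30s by default) than genuine rule changes.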


Version-Release number of selected component (if applicable):
oc v3.4.1.18
kubernetes v1.4.0+776c994

How reproducible:
It happens multiple times a day, but we have not found a way to trigger it on demand.

Steps to Reproduce:
1. There really aren't any steps that we know of at this time to reproduce this.

Actual results:
We usually get one or two of these soft lockups a day, across all nodes in the cluster.

Expected results:
Expect no soft lockups during normal operation.

Additional info:
If we can turn on more verbose logging in syslog, or if you need us to capture something specific, please reach out and I'll set everything up.
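
In the meantime, a small watcher can snapshot iptables state the moment a lockup message appears, which may help correlate the stalls with rule-sync activity. A hypothetical helper along these lines (assumes root and journald on the node; the /tmp output path is illustrative):

#!/usr/bin/env python3
# Watch the kernel log for soft-lockup messages and snapshot iptables
# state when one appears. Hypothetical diagnostic sketch, not a
# supported tool; run as root on an affected node.
import datetime
import subprocess

def snapshot(tag):
    save = subprocess.run(["iptables-save"], capture_output=True,
                          text=True).stdout
    rules = [l for l in save.splitlines() if l.startswith("-A")]
    procs = subprocess.run(["pgrep", "-a", "iptables"],
                           capture_output=True, text=True).stdout
    with open("/tmp/lockup-%d.txt" % tag, "w") as f:
        f.write(datetime.datetime.now().isoformat() + "\n")
        f.write("rule count: %d\n" % len(rules))
        f.write("iptables processes:\n" + procs)

# Follow kernel messages via journald (-k kernel only, -f follow).
journal = subprocess.Popen(["journalctl", "-kf", "-o", "cat"],
                           stdout=subprocess.PIPE, text=True)
n = 0
for line in journal.stdout:
    if "soft lockup" in line:
        n += 1
        snapshot(n)
        print("captured snapshot %d: %s" % (n, line.strip()))

Raising kube-proxy's log verbosity (e.g. --v=4) should also show how often it resyncs the rules, and setting kernel.softlockup_panic=1 with kdump configured would turn the next lockup into a vmcore if the kernel team wants a full stack trace.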
