Bug 1459589 - CPU soft lockup caused by iptables
Summary: CPU soft lockup caused by iptables
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Derek Carr
QA Contact: Xiaoli Tian
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-06-07 14:18 UTC by ihorvath
Modified: 2018-12-03 15:52 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-03 15:52:38 UTC
Target Upstream Version:
Embargoed:



Description ihorvath 2017-06-07 14:18:04 UTC
Description of problem:
Quite often we see messages from the kernel such as these:
Message from syslogd@ip-172-31-50-120 at Jun  5 13:51:58 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [iptables:47079]

Message from syslogd@ip-172-31-50-120 at Jun  6 10:50:14 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [iptables:38723]

Message from syslogd@ip-172-31-62-249 at Jun  6 10:50:12 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [iptables:41786]

This happens on a fairly busy cluster, but the containers are not changing frequently: we run our monitoring on it, so it is not rare for a pod to be around for more than a month. There is, however, a massive amount of network traffic, with millions of metrics passing through these pods every day. At this point we have no idea of the cause. Is it something Kubernetes does? Is AWS having problems? What is puzzling to me is that we are not changing any routes or pods or anything, so is iptables still churning through rules non-stop?
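
One way to check whether kube-proxy really is rewriting the iptables rules even though the pods are stable would be to checksum the NAT table periodically on a node, along these lines (a rough sketch, not verified on this cluster; the grep drops the timestamp comment lines that iptables-save emits, so the checksum only changes when the rules themselves change):

 # log a checksum of the NAT table once a minute; a changing checksum
 # means kube-proxy (or something else) is rewriting rules
 while true; do
     date
     iptables-save -t nat | grep -v '^#' | md5sum
     sleep 60
 done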


Version-Release number of selected component (if applicable):
oc v3.4.1.18
kubernetes v1.4.0+776c994

How reproducible:
It happens multiple times a day, but we have not found a way to trigger it on demand.

Steps to Reproduce:
1. There really aren't any steps that we know of at this time to reproduce this.

Actual results:
Usually once or twice a day we get these soft lockups, and they occur on all nodes in the cluster.

Expected results:
Expect no soft lockups during normal operation.

Additional info:
If we should turn on more verbose logging in syslog or capture something specific, please reach out and I'll set everything up.
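
For example, the full backtraces for these events should already be in the kernel ring buffer (the syslogd broadcast above is truncated), and we could additionally enable per-CPU backtraces on the next lockup. A sketch of what we could run; the sysctl is an assumption about what this kernel exposes:

 # the syslogd broadcast is truncated; pull the full trace from dmesg
 dmesg -T | grep -B 2 -A 40 'soft lockup'
 # if supported by this kernel, dump every CPU's stack when the watchdog trips
 sysctl -w kernel.softlockup_all_cpu_backtrace=1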

