Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1442676

Summary: Flapping high load on the OpenShift masters
Product: OpenShift Container Platform
Reporter: Jaspreet Kaur <jkaur>
Component: Networking
Assignee: Dan Williams <dcbw>
Status: CLOSED DUPLICATE
QA Contact: Meng Bo <bmeng>
Severity: high
Priority: high
Version: 3.4.0
CC: aos-bugs, bbennett, dcbw, jeder, ptalbert
Target Milestone: ---
Flags: mleitner: needinfo-
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-04-24 15:18:09 UTC
Type: Bug

Description Jaspreet Kaur 2017-04-17 05:30:15 UTC
Description of problem: High load detected on the OpenShift masters. The load is so high that even 'yum install sos' hangs. Something is causing iptables to hit 100% CPU usage, which in turn drives the load so high that the other NRPE-based Nagios checks fail with socket timeouts.
 
The problem appears to be a stale file lock used by the iptables command. OpenShift invokes iptables with the -w option. When -w is used, iptables waits indefinitely for the "xtables" lock unless a timeout is given (spoiler alert: no timeout is given).

       -w, --wait [seconds]
              Wait for the xtables lock.  To prevent multiple instances of the
              program from running concurrently, an attempt will  be  made  to
              obtain  an  exclusive  lock  at launch.  By default, the program
              will exit if the lock cannot be obtained.  This option will make
              the  program  wait  (indefinitely or for optional seconds) until
              the exclusive lock can be obtained.
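
As the excerpt notes, -w accepts an optional seconds argument. A minimal sketch of the bounded-wait invocation, assuming the installed iptables build is new enough to accept the optional argument (the rule is one of the stuck commands shown below):

# Give up after 5 seconds instead of blocking forever on the xtables
# lock; iptables exits nonzero if the lock cannot be obtained in time.
iptables -w 5 -C POSTROUTING -t nat -s 10.1.0.0/16 -j MASQUERADE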


Unfortunately, whatever process last held the lock no longer appears to be running, so the outstanding iptables commands will truly be waiting forever...

openmaster-67-136-2017-Apr-4-09:33:22$ grep ipt ps
root      98366  0.0  0.0  18248   728 ?        S    09:40   0:00 iptables -w -C POSTROUTING -t nat -s 10.1.0.0/16 -j MASQUERADE
root      98412  0.0  0.0  16056   500 ?        S    09:40   0:00 iptables -w -N KUBE-MARK-DROP -t nat
root      98477  0.0  0.1  37584 13748 ?        R    09:40   0:00 iptables -w -C KUBE-PORTALS-CONTAINER -t nat -m comment --comment app-dev-on-cloud-suite/rhcs-brms-install-demo:9990-tcp -p tcp -m tcp --dport 9990 -d 172.30.33.252/32 -j REDIRECT --to-ports 46761


The xtables lock is a plain file lock (flock) on /run/xtables.lock.
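
For illustration, the same lock can be probed from a shell with flock(1) from util-linux; a minimal sketch, not something OpenShift itself runs:

# Non-blocking exclusive probe: reports "held" while any iptables
# instance owns /run/xtables.lock
flock --exclusive --nonblock /run/xtables.lock true \
    && echo "xtables lock free" || echo "xtables lock held"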


We can confirm the two sleeping processes are indeed waiting for the lock by checking the lsof output captured in the sosreport:

openmaster-67-136-2017-Apr-4-09:33:22$ grep /run/xtables.lock sos_commands/process/lsof_-b_M_-n_-l
iptables   98366               0    3r      REG               0,19         0      28474 /run/xtables.lock
iptables   98412               0    3r      REG               0,19         0      28474 /run/xtables.lock
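
On a live system the same check can be made directly; fd 3 below matches the FD column in the lsof output above:

# Each waiter keeps fd 3 open on the lock file
ls -l /proc/98366/fd/3
lsof /run/xtables.lock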




Comment 3 Ben Bennett 2017-04-24 15:18:09 UTC

*** This bug has been marked as a duplicate of bug 1387149 ***