Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1442676

Summary: Flapping high load on the OpenShift masters
Product: OpenShift Container Platform
Reporter: Jaspreet Kaur <jkaur>
Component: Networking
Assignee: Dan Williams <dcbw>
Status: CLOSED DUPLICATE
QA Contact: Meng Bo <bmeng>
Severity: high
Priority: high
Version: 3.4.0
CC: aos-bugs, bbennett, dcbw, jeder, ptalbert
Target Milestone: ---
Flags: mleitner: needinfo-
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-04-24 15:18:09 UTC
Type: Bug

Description Jaspreet Kaur 2017-04-17 05:30:15 UTC
Description of problem: High load detected on the OpenShift masters. The load is so high that even 'yum install sos' hangs. Something is causing iptables to hit 100% CPU usage, which in turn drives the load so high that the other NRPE-based Nagios checks fail with socket timeouts.
 
The problem appears to be a stale file lock used by the iptables command. OpenShift invokes iptables with the -w option. When -w is used, iptables waits indefinitely for the "xtables" lock unless a timeout is given (spoiler alert: no timeout is given).

       -w, --wait [seconds]
              Wait for the xtables lock.  To prevent multiple instances of the
              program from running concurrently, an attempt will  be  made  to
              obtain  an  exclusive  lock  at launch.  By default, the program
              will exit if the lock cannot be obtained.  This option will make
              the  program  wait  (indefinitely or for optional seconds) until
              the exclusive lock can be obtained.
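
As the excerpt notes, -w accepts an optional seconds argument. A minimal sketch of the bounded-wait invocation, assuming the installed iptables build is new enough to accept the optional argument (the rule is one of the stuck commands shown below):

# Give up after 5 seconds instead of blocking forever on the xtables
# lock; iptables exits nonzero if the lock cannot be obtained in time.
iptables -w 5 -C POSTROUTING -t nat -s 10.1.0.0/16 -j MASQUERADE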


Unfortunately, whatever process last held the lock no longer appears to be running, so the outstanding iptables commands will truly be waiting forever...

openmaster-67-136-2017-Apr-4-09:33:22$ grep ipt ps
root      98366  0.0  0.0  18248   728 ?        S    09:40   0:00 iptables -w -C POSTROUTING -t nat -s 10.1.0.0/16 -j MASQUERADE
root      98412  0.0  0.0  16056   500 ?        S    09:40   0:00 iptables -w -N KUBE-MARK-DROP -t nat
root      98477  0.0  0.1  37584 13748 ?        R    09:40   0:00 iptables -w -C KUBE-PORTALS-CONTAINER -t nat -m comment --comment app-dev-on-cloud-suite/rhcs-brms-install-demo:9990-tcp -p tcp -m tcp --dport 9990 -d 172.30.33.252/32 -j REDIRECT --to-ports 46761


The xtables lock is a plain file lock (flock) on /run/xtables.lock.
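
For illustration, the same lock can be probed from a shell with flock(1) from util-linux; a minimal sketch, not something OpenShift itself runs:

# Non-blocking exclusive probe: reports "held" while any iptables
# instance owns /run/xtables.lock
flock --exclusive --nonblock /run/xtables.lock true \
    && echo "xtables lock free" || echo "xtables lock held"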


We can confirm the two sleeping processes are indeed waiting for the lock by checking the lsof output captured in the sosreport:

openmaster-67-136-2017-Apr-4-09:33:22$ grep /run/xtables.lock sos_commands/process/lsof_-b_M_-n_-l
iptables   98366               0    3r      REG               0,19         0      28474 /run/xtables.lock
iptables   98412               0    3r      REG               0,19         0      28474 /run/xtables.lock
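
On a live system the same check can be made directly; fd 3 below matches the FD column in the lsof output above:

# Each waiter keeps fd 3 open on the lock file
ls -l /proc/98366/fd/3
lsof /run/xtables.lock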




Comment 3 Ben Bennett 2017-04-24 15:18:09 UTC

*** This bug has been marked as a duplicate of bug 1387149 ***