Bug 1481782
| Summary: | 3.5 check for and use new iptables-restore 'wait' argument | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Steven Walter <stwalter> |
| Component: | Networking | Assignee: | Ben Bennett <bbennett> |
| Status: | CLOSED ERRATA | QA Contact: | Meng Bo <bmeng> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.5.1 | CC: | aos-bugs, danw, dcbw, eparis, erich, pdwyer, yadu |
| Target Milestone: | --- | | |
| Target Release: | 3.5.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: the iptables proxy was not properly locking its use of iptables.<br>Consequence: the iptables proxy could conflict with docker and the openshift-node process and cause a failure to start containers.<br>Fix: the iptables proxy now locks its use of iptables.<br>Result: pod creation failures due to improper locking of iptables should no longer occur. | Story Points: | --- |
| Clone Of: | | | |
| : | 1484133 (view as bug list) | Environment: | |
| Last Closed: | 2017-09-07 19:13:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
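The summary and the Doc Text above hinge on iptables-restore's wait argument: when the binary supports it, iptables-restore takes the xtables lock itself, and concurrent callers (docker, openshift-node, the CNI plugin) queue on the lock instead of failing with exit status 4; when it does not, the caller must fall back to its own locking. As a minimal sketch of how one might probe for the flag from a shell; this is not the detection logic OpenShift itself uses, just an illustration of the idea:

```sh
#!/bin/sh
# Illustration only: newer iptables-restore builds mention a wait option
# in their usage text, so grep the help output for it.
if iptables-restore --help 2>&1 | grep -q 'wait'; then
    echo "iptables-restore advertises a wait option; it can take the xtables lock itself"
else
    echo "no wait option; callers must coordinate around /run/xtables.lock themselves"
fi
```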
Description
Steven Walter
2017-08-15 17:17:25 UTC
For bugzilla searching purposes, the messages associated with this bug are:

```
Error syncing pod, skipping: failed to "SetupNetwork" for "package-event-source-139-deploy_vibe-develop" with SetupNetworkError: "Failed to setup network for pod \"example\" using network plugins \"cni\": CNI request failed with status 400: 'Failed to ensure that nat chain POSTROUTING jumps to MASQUERADE: error checking rule: exit status 4: Another app is currently holding the xtables lock; waiting (7s) for it to exit...\nAnother app is currently holding the xtables lock; waiting (9s) for it to exit...\n
```

Note that the first message also appears in https://bugzilla.redhat.com/1417234 -- in that event it is logspam.

(In reply to Steven Walter from comment #0)
> Considering these 4 scenarios:
>
> 3.5 on rhel7.3 -- vulnerable to xtables lock
> 3.5 on rhel7.4 -- currently vulnerable to xtables lock, bug would resolve this
> 3.6 on rhel7.3 -- vulnerable to xtables lock, we advise upgrade to 7.4

Correction: 3.6 on rhel7.3 should also be safe. The code has fallback locking if iptables-restore doesn't support the wait argument.

For the purpose of recovery if a cluster hits this, what is the best way of releasing the lock if it gets stuck? We should be able to determine what is holding the lock with:

```
# lsof /run/xtables.lock
```

Or:

```
# find /proc -regex '\/proc\/[0-9]+\/fd\/.*' -type l -lname "*xtables.lock*" -printf "%p -> %l\n" 2> /dev/null
```

But I am not sure how to release the lock (restarting some services?). The customer ended up scaling everything down, rebooting all the hosts, and scaling back up, but there is surely an easier way. (One possible approach is sketched at the end of this report.)

Tested on OCP 3.5 + RHEL 7.3 and OCP 3.5 + RHEL 7.4:

```
oc v3.5.5.31.24
kubernetes v1.5.2+43a9be4
iptables v1.4.21
```

Used an infinite loop to keep creating services that have the same endpoint from the master side (an illustrative version of such a loop appears at the end of this report). On the node side, checked which process has the /run/xtables.lock file open:

```
[root@ip-172-18-11-127 ~]# while true ; do lsof +c0 /run/xtables.lock ; done
COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
iptables-restor 18773 root    3r   REG   0,18        0 24967 /run/xtables.lock
COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
iptables-restor 18886 root    3r   REG   0,18        0 24967 /run/xtables.lock
COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
iptables-restor 19039 root    3r   REG   0,18        0 24967 /run/xtables.lock
COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
iptables-restor 19199 root    3r   REG   0,18        0 24967 /run/xtables.lock
COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
iptables-restor 19279 root    3r   REG   0,18        0 24967 /run/xtables.lock
COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
iptables-restor 19612 root    3rW  REG   0,18        0 24967 /run/xtables.lock
```

And there is no "Resource temporarily unavailable (exit status 4)" message in the node log.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2670
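Regarding the recovery question above: the xtables lock is an advisory flock() on /run/xtables.lock, so it is released automatically the moment the process holding it exits; nothing needs to delete or unlock the file itself. The following is a hedged sketch of a gentler recovery than rebooting every host, assuming lsof and systemd are available; the unit lookup is illustrative, not a documented procedure:

```sh
#!/bin/sh
# Sketch: find the PID holding /run/xtables.lock and the systemd unit it
# belongs to, so that one stuck service can be restarted instead of the host.
pid=$(lsof -t /run/xtables.lock | head -n 1)
if [ -z "$pid" ]; then
    echo "nothing is holding /run/xtables.lock"
    exit 0
fi
# systemctl status <PID> reports the unit that owns the process.
unit=$(systemctl status "$pid" 2>/dev/null | head -n 1 | awk '{print $2}')
echo "lock held by PID $pid (unit: ${unit:-unknown})"
# If the holder is genuinely wedged, restarting its unit releases the lock:
#   systemctl restart "$unit"
```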
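For the verification step above, the exact commands QA ran are not recorded in this bug; the following is only a plausible reconstruction of "an infinite loop that keeps creating services", with made-up object names:

```sh
#!/bin/sh
# Illustrative load generator: repeatedly create and delete a service backed
# by the same pod, so kube-proxy rewrites iptables rules continuously.
# "mypod", "churn-svc", and port 8080 are hypothetical names for this sketch.
while true; do
    oc expose pod mypod --name=churn-svc --port=8080
    sleep 1
    oc delete svc churn-svc
done
```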