Description of problem:
Builds fail because iptables-restore cannot run while another process holds the xtables lock.
This has been a known issue in the past (see "Additional info"); however, even after applying the errata, the issue still occurs. Although it is expected for containers to be blocked for some time, the builds should eventually finish: the iptables wait flag should allow the call to wait until the lock becomes available, rather than exit with an error.
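For illustration, this is roughly how the wait flag behaves on the iptables command line. This is a sketch only; the rule below is arbitrary, and whether iptables-restore honors a wait option at all depends on the iptables build, which is the crux of this bug:
# Without -w: if another process holds the xtables lock, the command
# exits immediately with status 4 instead of waiting.
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT
# With -w: the command blocks until the xtables lock is released,
# then applies the rule.
iptables -w -A INPUT -p tcp --dport 8080 -j ACCEPT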
Version-Release number of selected component (if applicable):
iptables-1.4.21-18.el7.x86_64
atomic-openshift-node-3.4.1.44.26-1.git.0.a62e88b.el7.x86_64
kernel-3.10.0-693.2.2.el7.x86_64
Red Hat Enterprise Linux Server release 7.4 (Maipo)
How reproducible:
Unconfirmed
Steps to Reproduce:
1. Kick off many builds in parallel (e.g., as sketched below)
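A minimal way to trigger many concurrent builds, assuming an existing BuildConfig named myapp (the name is hypothetical):
# Start 20 builds of a hypothetical BuildConfig in quick succession
for i in $(seq 1 20); do oc start-build myapp; done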
Actual results:
Nov 9 11:36:30 njrarltapp001c7 atomic-openshift-node: E1109 11:36:30.685452 25053 cni.go:273] Error deleting network: CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Nov 9 11:36:30 njrarltapp001c7 atomic-openshift-node: E1109 11:36:30.685515 25053 docker_manager.go:1434] Failed to teardown network for pod "84aeeea2-c565-11e7-8f20-005056a97aae" using network plugins "cni": CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Nov 9 11:36:32 njrarltapp001c7 atomic-openshift-node: E1109 11:36:32.850516 25053 kubelet.go:2092] Failed killing the pod "pcis-integration-1-hz0zp": failed to "TeardownNetwork" for "pcis-integration-1-hz0zp_cipe-c2811c-1" with TeardownNetworkError: "Failed to teardown network for pod \"84aeeea2-c565-11e7-8f20-005056a97aae\" using network plugins \"cni\": CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?\n)\n'"
Nov 9 11:36:35 njrarltapp001c7 atomic-openshift-node: E1109 11:36:34.955050 25053 cni.go:273] Error deleting network: CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Expected results:
Builds succeed
Additional info:
The customer implemented the errata that fixed: https://bugzilla.redhat.com/show_bug.cgi?id=1484133
Which is related to: https://bugzilla.redhat.com/show_bug.cgi?id=1438597
Hello,
A customer is having the same issue. Due to this error, the iptables rules are not updated with the correct endpoint IP.
Therefore the service is not reachable - production is down.
The issue is happening on hawkular-cassandra.
Hi Dan,
I think collecting goroutine dumps from the OpenShift node can help us understand the issue, right?
For example:
1. Set the OpenShift node's log level to debug and enable profiling by adding or editing these lines in /etc/sysconfig/atomic-openshift-node:
OPTIONS='--loglevel=8'
OPENSHIFT_PROFILE=web
2. Restart the node service: systemctl restart atomic-openshift-node
3. Let it run for a while, long enough that you would expect the issue to have reproduced, and then run:
curl http://localhost:6060/debug/pprof/goroutine?debug=2
and attach the goroutine dump here.
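For example, to capture the dump to a file that can be attached (assuming the profiling endpoint is listening on its default port, as above):
curl -s http://localhost:6060/debug/pprof/goroutine?debug=2 > goroutines.txt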
Thanks and regards,
Miheer
This appears to be resolved by the updated kernel, the changed iptables wait times, and the reduction in the frequency with which we call iptables on pod creation.
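As a quick sanity check that a node is running the updated components, one could compare the installed versions against the fixed ones (the exact fixed versions are not stated in this report):
rpm -q iptables kernel atomic-openshift-node
uname -r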