Description of problem:
Builds fail because iptables-restore cannot run while another process holds the xtables lock.
This has been a known issue in the past (see "Additional info"); however, even after applying the errata, the issue still occurs. Although it is expected for containers to be blocked for some time, the builds should eventually finish: the iptables wait flag should allow the call to wait until the lock becomes available, rather than exit with an error.
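For illustration, this is roughly how the wait flag behaves on the iptables command line. This is a sketch only; the rule below is arbitrary, and whether iptables-restore honors a wait option at all depends on the iptables build, which is the crux of this bug:
# Without -w: if another process holds the xtables lock, the command
# exits immediately with status 4 instead of waiting.
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT
# With -w: the command blocks until the xtables lock is released,
# then applies the rule.
iptables -w -A INPUT -p tcp --dport 8080 -j ACCEPT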
Version-Release number of selected component (if applicable):
iptables-1.4.21-18.el7.x86_64
atomic-openshift-node-3.4.1.44.26-1.git.0.a62e88b.el7.x86_64
kernel-3.10.0-693.2.2.el7.x86_64
Red Hat Enterprise Linux Server release 7.4 (Maipo)
How reproducible:
Unconfirmed
Steps to Reproduce:
1. Kick off many builds in parallel (e.g., as sketched below)
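A minimal way to trigger many concurrent builds, assuming an existing BuildConfig named myapp (the name is hypothetical):
# Start 20 builds of a hypothetical BuildConfig in quick succession
for i in $(seq 1 20); do oc start-build myapp; done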
Actual results:
Nov 9 11:36:30 njrarltapp001c7 atomic-openshift-node: E1109 11:36:30.685452 25053 cni.go:273] Error deleting network: CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Nov 9 11:36:30 njrarltapp001c7 atomic-openshift-node: E1109 11:36:30.685515 25053 docker_manager.go:1434] Failed to teardown network for pod "84aeeea2-c565-11e7-8f20-005056a97aae" using network plugins "cni": CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Nov 9 11:36:32 njrarltapp001c7 atomic-openshift-node: E1109 11:36:32.850516 25053 kubelet.go:2092] Failed killing the pod "pcis-integration-1-hz0zp": failed to "TeardownNetwork" for "pcis-integration-1-hz0zp_cipe-c2811c-1" with TeardownNetworkError: "Failed to teardown network for pod \"84aeeea2-c565-11e7-8f20-005056a97aae\" using network plugins \"cni\": CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?\n)\n'"
Nov 9 11:36:35 njrarltapp001c7 atomic-openshift-node: E1109 11:36:34.955050 25053 cni.go:273] Error deleting network: CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Expected results:
Builds succeed
Additional info:
The customer implemented the errata that fixed: https://bugzilla.redhat.com/show_bug.cgi?id=1484133
Which is related to: https://bugzilla.redhat.com/show_bug.cgi?id=1438597
Hello,
A customer is having the same issue. Due to this error, the iptables rules are not updated with the correct endpoint IP.
Therefore the service is not reachable - production is down.
The issue is happening on hawkular-cassandra.
Hi Dan,
I think collecting goroutine dumps from the OpenShift node can help us understand the issue, right?
For example:
1. Set the OpenShift node's log level to debug and enable profiling by adding or editing these lines in /etc/sysconfig/atomic-openshift-node:
OPTIONS='--loglevel=8'
OPENSHIFT_PROFILE=web
2. Restart the node service: systemctl restart atomic-openshift-node
3. Let it run for a while, long enough that you would expect the issue to have reproduced, and then run:
curl http://localhost:6060/debug/pprof/goroutine?debug=2
and attach the goroutine dump here.
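For example, to capture the dump to a file that can be attached (assuming the profiling endpoint is listening on its default port, as above):
curl -s http://localhost:6060/debug/pprof/goroutine?debug=2 > goroutines.txt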
Thanks and regards,
Miheer
This appears to be resolved by the updated kernel, the changed iptables wait times, and the reduction in the frequency with which we call iptables on pod creation.
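As a quick sanity check that a node is running the updated components, one could compare the installed versions against the fixed ones (the exact fixed versions are not stated in this report):
rpm -q iptables kernel atomic-openshift-node
uname -r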