Bug 1843646

Summary: openshift-sdn reports failure with iptables-restore: CHAIN_USER_DEL failed (Device or resource busy)
Product: Red Hat Enterprise Linux 8
Component: iptables
Version: 8.2
Target Release: 8.4
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Reporter: Jacob Tanenbaum <jtanenba>
Assignee: Phil Sutter <psutter>
QA Contact: Tomas Dolezal <todoleza>
CC: bleanhar, chlunde, dahernan, danw, dcbw, dornelas, egarver, iptables-maint-list, jolee, psutter, surya, todoleza
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: Reopened, Triaged, ZStream
Flags: pm-rhel: mirror+
Fixed In Version: iptables-1.8.4-16.el8
Doc Type: No Doc Update
Cloned To: 1891880, 1891881 (view as bug list)
Bug Blocks: 1186913, 1880680, 1891880, 1891881
Last Closed: 2021-05-18 14:58:38 UTC
Type: Bug

Attachments:
fake iptables-restore script

Description Jacob Tanenbaum 2020-06-03 17:33:41 UTC
Description of problem:

Numerous CI runs fail the networking granular tests, with these being the only log lines of interest:



E0603 14:12:14.698046    1891 proxier.go:1507] Failed to execute iptables-restore: exit status 4 (iptables-restore v1.8.4 (nf_tables): 
line 1323: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-EDTNOF5JKJLINSGB
line 1324: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-BIKTPJ5SHXIDOCRQ
line 1325: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-63WUVGDVBYC5V37I
line 1326: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-FAFLSMCR3YKP4QSV
)
I0603 14:12:14.698165    1891 proxier.go:1510] Closing local ports after iptables-restore failure

Here is the CI run and the specific log that the error came from:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-network-stress-4.4/535

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-network-stress-4.4/535/artifacts/e2e-aws-sdn-network-stress/pods/openshift-sdn_sdn-75qx6_sdn.log



How reproducible:
happens in all the 4.4-sdn network stress tests 
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-network-stress-4.4
Even though the tests are green, the Granular tests are failing, and the iptables-restore error in the sdn pods can be reached through the artifacts.

Comment 1 Phil Sutter 2020-06-04 15:58:20 UTC
Hi,

How is iptables-restore called? Are you using '--noflush' in that situation? Can you please paste the input passed to iptables-restore? Also, an iptables dump at the time iptables-restore was called would help (assuming that '--noflush' is used).

Thanks, Phil

Comment 2 Jacob Tanenbaum 2020-06-05 10:54:24 UTC
I am pretty sure that these are the args to iptables-restore:

22:37:16.742520    2413 iptables.go:402] running iptables-restore [-w 5 -W 100000 --noflush --counters]

I had the kube proxier run `iptables -S` right after the failed iptables-restore:

-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N KUBE-FIREWALL
-N KUBE-KUBELET-CANARY
-N OPENSHIFT-BLOCK-OUTPUT
-N OPENSHIFT-FIREWALL-FORWARD
-N OPENSHIFT-ADMIN-OUTPUT-RULES
-N OPENSHIFT-FIREWALL-ALLOW
-N KUBE-PROXY-CANARY
-N KUBE-NODEPORT-NON-LOCAL
-N KUBE-EXTERNAL-SERVICES
-N KUBE-SERVICES
-N KUBE-FORWARD
-N OPENSHIFT-SDN-CANARY
-A INPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A INPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes externally-visible service portals" -j KUBE-EXTERNAL-SERVICES
-A INPUT -m comment --comment "Ensure that non-local NodePort traffic can flow" -j KUBE-NODEPORT-NON-LOCAL
-A INPUT -m comment --comment "firewall overrides" -j OPENSHIFT-FIREWALL-ALLOW
-A INPUT -j KUBE-FIREWALL
-A FORWARD -m comment --comment "kubernetes forwarding rules" -j KUBE-FORWARD
-A FORWARD -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A FORWARD -i tun0 ! -o tun0 -m comment --comment "administrator overrides" -j OPENSHIFT-ADMIN-OUTPUT-RULES
-A FORWARD -m comment --comment "firewall overrides" -j OPENSHIFT-FIREWALL-FORWARD
-A FORWARD -m comment --comment "firewall overrides" -j OPENSHIFT-BLOCK-OUTPUT
-A OUTPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "firewall overrides" -j OPENSHIFT-BLOCK-OUTPUT
-A OUTPUT -j KUBE-FIREWALL
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
-A OPENSHIFT-BLOCK-OUTPUT -p tcp -m tcp --dport 22623 -j REJECT --reject-with icmp-port-unreachable
-A OPENSHIFT-BLOCK-OUTPUT -p tcp -m tcp --dport 22624 -j REJECT --reject-with icmp-port-unreachable
-A OPENSHIFT-FIREWALL-FORWARD -s 10.128.0.0/14 -m comment --comment "attempted resend after connection close" -m conntrack --ctstate INVALID -j DROP
-A OPENSHIFT-FIREWALL-FORWARD -d 10.128.0.0/14 -m comment --comment "forward traffic from SDN" -j ACCEPT
-A OPENSHIFT-FIREWALL-FORWARD -s 10.128.0.0/14 -m comment --comment "forward traffic to SDN" -j ACCEPT
-A OPENSHIFT-FIREWALL-ALLOW -p udp -m udp --dport 4789 -m comment --comment "VXLAN incoming" -j ACCEPT
-A OPENSHIFT-FIREWALL-ALLOW -i tun0 -m comment --comment "from SDN to localhost" -j ACCEPT
-A OPENSHIFT-FIREWALL-ALLOW -i docker0 -m comment --comment "from docker to localhost" -j ACCEPT
-A KUBE-SERVICES -d 172.30.183.93/32 -p tcp -m comment --comment "e2e-services-568/externalsvc: has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x1/0x1 -j ACCEPT
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

These were the failures this time:
E0604 22:37:16.847774    2413 proxier.go:1546] Failed to execute iptables-restore: exit status 4 (iptables-restore v1.8.4 (nf_tables): 
line 1261: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-IGPUEKXS4B56KE6S
line 1262: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-7AUGLVM4TUWHHJ67
line 1263: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-4TAJFQBTKPJOYBVI
line 1264: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-IV6FB7DM6MAWJS5C
line 1265: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-Z5Z2AU2EP7UVGYWE
line 1266: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-HJIQ6Q76ZX4DSQOG
line 1267: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-MPOOKCWDPH4HFZYH
line 1268: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-MVI3LUV66ZPY3YSO
line 1269: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-QFZ3AHRIX6T2VX4I
line 1270: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-34GXHOVCZYDB5Q2B
line 1271: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-QFNOUYUCJZJZNCXG
line 1276:
)
I0604 22:37:16.847857    2413 proxier.go:1549] Closing local ports after iptables-restore failure

Comment 3 Phil Sutter 2020-06-05 15:45:08 UTC
Hi,

Given that said chains don't exist anymore, the errors are spurious. Sadly
I still can't reproduce the bug from the info at hand. Can you possibly find
out what the ruleset looked like before the problematic iptables-restore call
and what input was fed into iptables-restore?

Thanks, Phil

Comment 4 Jacob Tanenbaum 2020-06-08 16:48:46 UTC
iptables dump from before the failed iptables-restore:

0608 15:43:44.263792    2371 iptables.go:404] KEYWORD---DUMP OF IPTABLES--
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N KUBE-FIREWALL
-N KUBE-KUBELET-CANARY
-N OPENSHIFT-BLOCK-OUTPUT
-N OPENSHIFT-FIREWALL-FORWARD
-N OPENSHIFT-ADMIN-OUTPUT-RULES
-N OPENSHIFT-FIREWALL-ALLOW
-N KUBE-PROXY-CANARY
-N KUBE-NODEPORT-NON-LOCAL
-N KUBE-EXTERNAL-SERVICES
-N KUBE-SERVICES
-N KUBE-FORWARD
-N OPENSHIFT-SDN-CANARY
-A INPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A INPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes externally-visible service portals" -j KUBE-EXTERNAL-SERVICES
-A INPUT -m comment --comment "Ensure that non-local NodePort traffic can flow" -j KUBE-NODEPORT-NON-LOCAL
-A INPUT -m comment --comment "firewall overrides" -j OPENSHIFT-FIREWALL-ALLOW
-A INPUT -j KUBE-FIREWALL
-A FORWARD -m comment --comment "kubernetes forwarding rules" -j KUBE-FORWARD
-A FORWARD -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A FORWARD -i tun0 ! -o tun0 -m comment --comment "administrator overrides" -j OPENSHIFT-ADMIN-OUTPUT-RULES
-A FORWARD -m comment --comment "firewall overrides" -j OPENSHIFT-FIREWALL-FORWARD
-A FORWARD -m comment --comment "firewall overrides" -j OPENSHIFT-BLOCK-OUTPUT
-A OUTPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "firewall overrides" -j OPENSHIFT-BLOCK-OUTPUT
-A OUTPUT -j KUBE-FIREWALL
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
-A OPENSHIFT-BLOCK-OUTPUT -p tcp -m tcp --dport 22623 -j REJECT --reject-with icmp-port-unreachable
-A OPENSHIFT-BLOCK-OUTPUT -p tcp -m tcp --dport 22624 -j REJECT --reject-with icmp-port-unreachable
-A OPENSHIFT-FIREWALL-FORWARD -s 10.128.0.0/14 -m comment --comment "attempted resend after connection close" -m conntrack --ctstate INVALID -j DROP
-A OPENSHIFT-FIREWALL-FORWARD -d 10.128.0.0/14 -m comment --comment "forward traffic from SDN" -j ACCEPT
-A OPENSHIFT-FIREWALL-FORWARD -s 10.128.0.0/14 -m comment --comment "forward traffic to SDN" -j ACCEPT
-A OPENSHIFT-FIREWALL-ALLOW -p udp -m udp --dport 4789 -m comment --comment "VXLAN incoming" -j ACCEPT
-A OPENSHIFT-FIREWALL-ALLOW -i tun0 -m comment --comment "from SDN to localhost" -j ACCEPT
-A OPENSHIFT-FIREWALL-ALLOW -i docker0 -m comment --comment "from docker to localhost" -j ACCEPT
-A KUBE-EXTERNAL-SERVICES -p tcp -m comment --comment "e2e-services-4420/affinity-nodeport-timeout: has no endpoints" -m addrtype --dst-type LOCAL -m tcp --dport 32658 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.30.22.40/32 -p tcp -m comment --comment "e2e-services-9316/multi-endpoint-test:portname1 has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.30.180.158/32 -p tcp -m comment --comment "e2e-test-weighted-router-sbxnc/weightedendpoints1: has no endpoints" -m tcp --dport 8080 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.30.208.92/32 -p tcp -m comment --comment "e2e-test-weighted-router-sbxnc/weightedendpoints2: has no endpoints" -m tcp --dport 8080 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.30.186.116/32 -p tcp -m comment --comment "e2e-services-3134/affinity-clusterip: has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.30.34.187/32 -p tcp -m comment --comment "e2e-services-4420/affinity-nodeport-timeout: has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.30.6.223/32 -p tcp -m comment --comment "e2e-test-router-scoped-4swcz/endpoints: has no endpoints" -m tcp --dport 8080 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x1/0x1 -j ACCEPT
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

---END OF DUMP---

iptables-restore gets run like this:
iptables-restore -w 5 -W 100000 --noflush --counters


iptables-restore error:
E0608 15:43:44.342011    2371 proxier.go:1546] Failed to execute iptables-restore: exit status 4 (iptables-restore v1.8.4 (nf_tables): 
line 1350: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-NJU5FEEO2FMUXJLP
line 1351: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-GHLSORH53SGHROY5


iptables dump after the failed iptables-restore:

E0608 15:43:44.341926    2371 iptables.go:420] KEYWORD---DUMP OF IPTABLES--
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N KUBE-FIREWALL
-N KUBE-KUBELET-CANARY
-N OPENSHIFT-BLOCK-OUTPUT
-N OPENSHIFT-FIREWALL-FORWARD
-N OPENSHIFT-ADMIN-OUTPUT-RULES
-N OPENSHIFT-FIREWALL-ALLOW
-N KUBE-PROXY-CANARY
-N KUBE-NODEPORT-NON-LOCAL
-N KUBE-EXTERNAL-SERVICES
-N KUBE-SERVICES
-N KUBE-FORWARD
-N OPENSHIFT-SDN-CANARY
-A INPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A INPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes externally-visible service portals" -j KUBE-EXTERNAL-SERVICES
-A INPUT -m comment --comment "Ensure that non-local NodePort traffic can flow" -j KUBE-NODEPORT-NON-LOCAL
-A INPUT -m comment --comment "firewall overrides" -j OPENSHIFT-FIREWALL-ALLOW
-A INPUT -j KUBE-FIREWALL
-A FORWARD -m comment --comment "kubernetes forwarding rules" -j KUBE-FORWARD
-A FORWARD -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A FORWARD -i tun0 ! -o tun0 -m comment --comment "administrator overrides" -j OPENSHIFT-ADMIN-OUTPUT-RULES
-A FORWARD -m comment --comment "firewall overrides" -j OPENSHIFT-FIREWALL-FORWARD
-A FORWARD -m comment --comment "firewall overrides" -j OPENSHIFT-BLOCK-OUTPUT
-A OUTPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "firewall overrides" -j OPENSHIFT-BLOCK-OUTPUT
-A OUTPUT -j KUBE-FIREWALL
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
-A OPENSHIFT-BLOCK-OUTPUT -p tcp -m tcp --dport 22623 -j REJECT --reject-with icmp-port-unreachable
-A OPENSHIFT-BLOCK-OUTPUT -p tcp -m tcp --dport 22624 -j REJECT --reject-with icmp-port-unreachable
-A OPENSHIFT-FIREWALL-FORWARD -s 10.128.0.0/14 -m comment --comment "attempted resend after connection close" -m conntrack --ctstate INVALID -j DROP
-A OPENSHIFT-FIREWALL-FORWARD -d 10.128.0.0/14 -m comment --comment "forward traffic from SDN" -j ACCEPT
-A OPENSHIFT-FIREWALL-FORWARD -s 10.128.0.0/14 -m comment --comment "forward traffic to SDN" -j ACCEPT
-A OPENSHIFT-FIREWALL-ALLOW -p udp -m udp --dport 4789 -m comment --comment "VXLAN incoming" -j ACCEPT
-A OPENSHIFT-FIREWALL-ALLOW -i tun0 -m comment --comment "from SDN to localhost" -j ACCEPT
-A OPENSHIFT-FIREWALL-ALLOW -i docker0 -m comment --comment "from docker to localhost" -j ACCEPT
-A KUBE-EXTERNAL-SERVICES -p tcp -m comment --comment "e2e-services-4420/affinity-nodeport-timeout: has no endpoints" -m addrtype --dst-type LOCAL -m tcp --dport 32658 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.30.34.187/32 -p tcp -m comment --comment "e2e-services-4420/affinity-nodeport-timeout: has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.30.6.223/32 -p tcp -m comment --comment "e2e-test-router-scoped-4swcz/endpoints: has no endpoints" -m tcp --dport 8080 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.30.180.158/32 -p tcp -m comment --comment "e2e-test-weighted-router-sbxnc/weightedendpoints1: has no endpoints" -m tcp --dport 8080 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.30.208.92/32 -p tcp -m comment --comment "e2e-test-weighted-router-sbxnc/weightedendpoints2: has no endpoints" -m tcp --dport 8080 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.30.186.116/32 -p tcp -m comment --comment "e2e-services-3134/affinity-clusterip: has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x1/0x1 -j ACCEPT
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

---END OF DUMP---

Comment 5 Dan Williams 2020-06-09 03:48:29 UTC
Jacob, can you do a new run with a changed kube-proxy image that dumps the iptables-save to the openshift-sdn container logs? 

--->	klog.V(5).Infof("Restoring iptables rules: %s", proxier.iptablesData.Bytes())
	err = proxier.iptables.RestoreAll(proxier.iptablesData.Bytes(), utiliptables.NoFlushTables, utiliptables.RestoreCounters)

Change that line in vendor/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go to V(2) and it should be good to go. Then you can get the rules that actually fail.
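
Once the image is rebuilt with that change, the full restore input shows up in the sdn pod log right after the matched line and can be located with something like this (pod and container names below are just examples):

# oc logs -n openshift-sdn sdn-75qx6 -c sdn | grep -n 'Restoring iptables rules'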

Comment 6 Jacob Tanenbaum 2020-06-11 12:56:51 UTC
Here is a pastebin of the output from the above change: http://pastebin.test.redhat.com/874472

and the chains that failed the iptables-restore:

E0610 20:49:31.316998    2179 proxier.go:1549] Failed to execute iptables-restore: exit status 4 (iptables-restore v1.8.4 (nf_tables): 
line 1395: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-7XCE5SQCTUKSHXHO
line 1396: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-BCAZVUHD46RKELSA
line 1397: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-IBPF3YD75N2LEIAL
line 1398: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-EKTNXQ3XLFLX3WLO
line 1399: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-SIXTP2TJ37F3XPQR
line 1400: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-3ZFJSMU7GIPZHXO7
line 1401: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-QWVNMSYZUNHPV67Z
line 1402: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-IE67K7SA4NTF24MV
line 1403: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-6MBR56ZNZ45SJSWU
line 1404: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-NYUQ7JQT6NVEEAGA

Comment 7 Phil Sutter 2020-06-15 12:45:44 UTC
Hi,

The info in comment 4 is incomplete: it doesn't contain any tables other than filter. As the info in comment 6 shows, the affected chains reside in the nat table, so that one is relevant.

Searching for the problematic chains in given paste, I see that all of them are
"introduced" atop by a chain line (:foo - [0:0]) and later removed by a -X
command. Yet this appears to work fine with both upstream and current RHEL8
iptables (iptables-1.8.4-13.el8.x86_64). What RPM version are you seeing this
with?

One way to reproduce the error message is by having a jump to the removed chain:

| # iptables -N asdf
| # iptables -A FORWARD -j asdf
| # iptables-restore --noflush <<EOF
| *filter
| :asdf - [0:0]
| -X asdf
| COMMIT
| EOF
| 
| iptables-restore v1.8.4 (nf_tables): 
| line 3: CHAIN_USER_DEL failed (Device or resource busy): chain asdf

The chain remains in place and the commit is not executed. This behaviour is
consistent with upstream and legacy iptables; I don't think it was ever
different. Is this new code that's just broken, or did it work before? What
change introduced the problem you're experiencing?

Cheers, Phil

Comment 8 Jacob Tanenbaum 2020-06-16 19:20:25 UTC
I am working on getting the iptables version information, but do you think it would be fixed in the RHEL8.2 version of iptables?

When you say "the commit is not executed", do you mean just the affected chain or the entire table?

Comment 9 Phil Sutter 2020-06-17 10:26:42 UTC
Hi Jacob,

(In reply to jtanenba from comment #8)
> I am working on getting the iptables version information but do you think it
> would be fixed in the RHEL8.2 version of iptables?

Maybe. I'm still struggling to reconstruct what is actually happening. Given
the lack of version information in this ticket, I simply use current RHEL8
iptables (there's not much difference between RHEL8.2 and the yet-to-become
RHEL8.3). Still, I try to avoid unknowns.

More important than iptables RPM version is actually a complete dump of the
ruleset before calling the problematic command. Comment 4 used 'iptables -S'
which omits the relevant 'nat' table. Can you please redo that using
'iptables-save' instead?

> when you say the "commit is not executed" what do you mean just the affected
> chain or the entire table?

The entire table. A 'COMMIT' line in the input triggers a push to the kernel; if
that fails, the whole transaction is aborted, including everything since the
last COMMIT.
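
To illustrate (the chain and rule below are made up): each COMMIT is pushed as its own transaction, so a failure in the second batch leaves the first one applied:

| # iptables-restore --noflush <<EOF
| *filter
| -A FORWARD -m comment --comment "example rule" -j ACCEPT
| COMMIT
| *nat
| -X KUBE-EXAMPLE-CHAIN
| COMMIT
| EOF

If deleting KUBE-EXAMPLE-CHAIN fails, only the nat batch is rolled back; the filter rule was already committed by the first COMMIT.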

The relevant bit here is that my shot-in-the-dark reproducer triggers for any
iptables version out there, which would mean there must be a change on your
side which caused the problem. If there wasn't, my reproducer is wrong.

Cheers, Phil

Comment 10 Jacob Tanenbaum 2020-06-17 11:44:37 UTC
It looks like OpenShift 4.4 uses iptables v1.8.4


[jtanenba@dell-pe-fm120-1b release]$ oc version
Client Version: 4.5.0-0.ci-2020-05-29-203954
Server Version: 4.4.0-0.nightly-2020-06-15-203705
Kubernetes Version: v1.17.1+912792b
[jtanenba@dell-pe-fm120-1b release]$ oc get pods -n openshift-sdn
NAME                   READY   STATUS    RESTARTS   AGE
ovs-4p5fb              1/1     Running   0          76m
ovs-7nvl2              1/1     Running   0          76m
ovs-8mph6              1/1     Running   0          76m
ovs-ghjff              1/1     Running   0          66m
ovs-jl9kk              1/1     Running   0          66m
ovs-k54f5              1/1     Running   0          66m
sdn-9g5kl              1/1     Running   0          76m
sdn-c4zs9              1/1     Running   0          76m
sdn-controller-gcsl4   1/1     Running   1          76m
sdn-controller-gjhsr   1/1     Running   0          76m
sdn-controller-hv5zk   1/1     Running   0          76m
sdn-fhd69              1/1     Running   0          76m
sdn-l62bw              1/1     Running   0          66m
sdn-nt9np              1/1     Running   0          66m
sdn-tswqt              1/1     Running   0          66m
[jtanenba@dell-pe-fm120-1b release]$ oc rsh -n openshift-sdn sdn-fhd69
sh-4.2# iptables -V
iptables v1.8.4 (nf_tables)
sh-4.2# 


getting the dump later today

Comment 11 Phil Sutter 2020-06-17 12:33:36 UTC
Hi Jacob,

(In reply to jtanenba from comment #10)
> It looks like Openshift 4.4 uses iptables v1.8.4

Well, if it is based on RHEL8.2 it certainly does. Actually, that info can
already be found in the ticket's description, but the release is relevant, which
is why I asked for the RPM version. Can't you call 'rpm -q iptables' in that
shell?

Cheers, Phil

Comment 12 Jacob Tanenbaum 2020-06-17 21:45:56 UTC
iptables-save from before the restore http://pastebin.test.redhat.com/876280
iptables-save from after the restore http://pastebin.test.redhat.com/876282

E0617 13:32:15.264831    2224 proxier.go:1549] Failed to execute iptables-restore: exit status 4 (iptables-restore v1.8.4 (nf_tables): 
line 1270: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-OWT6JNF2E7C4ERER
line 1271: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-ZBDAJQSTA6Q3MY5T
)

[jtanenba@dell-pe-fm120-1b 4.6]$ oc rsh -n openshift-sdn sdn-frczr
sh-4.2# rpm -q iptables
iptables-1.4.21-34.el7.x86_64
sh-4.2#

Comment 13 Phil Sutter 2020-06-18 10:12:51 UTC
Hi Jacob,

(In reply to jtanenba from comment #12)
> iptables-save from before the restore http://pastebin.test.redhat.com/876280
> iptables-save from after the restore http://pastebin.test.redhat.com/876282

Thanks for the new dumps, I'll try to reproduce the problem based on them. Meanwhile:

> E0617 13:32:15.264831    2224 proxier.go:1549] Failed to execute
> iptables-restore: exit status 4 (iptables-restore v1.8.4 (nf_tables): 
> line 1270: CHAIN_USER_DEL failed (Device or resource busy): chain
> KUBE-SVC-OWT6JNF2E7C4ERER
> line 1271: CHAIN_USER_DEL failed (Device or resource busy): chain
> KUBE-SEP-ZBDAJQSTA6Q3MY5T
> )
> 
> [jtanenba@dell-pe-fm120-1b 4.6]$ oc rsh -n openshift-sdn sdn-frczr
> sh-4.2# rpm -q iptables
> iptables-1.4.21-34.el7.x86_64
> sh-4.2#

This is not possible, sorry: The error message states 'iptables-restore
v1.8.4', so this can't be iptables-1.4.21. Also, you filed the ticket for RHEL8
while this is a RHEL7 RPM. Wrong system?

Comment 14 Phil Sutter 2020-06-18 11:29:04 UTC
(In reply to jtanenba from comment #12)
> iptables-save from before the restore http://pastebin.test.redhat.com/876280
> iptables-save from after the restore http://pastebin.test.redhat.com/876282
> 
> E0617 13:32:15.264831    2224 proxier.go:1549] Failed to execute
> iptables-restore: exit status 4 (iptables-restore v1.8.4 (nf_tables): 
> line 1270: CHAIN_USER_DEL failed (Device or resource busy): chain
> KUBE-SVC-OWT6JNF2E7C4ERER
> line 1271: CHAIN_USER_DEL failed (Device or resource busy): chain
> KUBE-SEP-ZBDAJQSTA6Q3MY5T
> )

Sadly, the dumps don't match the iptables-restore input from comment 6 (neither
of the two complained-about chains is contained in there). I tried to
reconstruct the input to iptables-restore based on the paste in comment 6, but
failed to reproduce this.

Can you please redo this and get me a matching set of ruleset before, ruleset
after, failing command input and error messages?

Thanks, Phil

Comment 15 Jacob Tanenbaum 2020-06-19 14:12:16 UTC
dump before restore: http://pastebin.test.redhat.com/877037

dump after restore: http://pastebin.test.redhat.com/877039

E0619 11:56:06.065389    2239 proxier.go:1549] Failed to execute iptables-restore: exit status 4 (iptables-restore v1.8.4 (nf_tables): 
line 1204: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-CUFR2OBBF3SLAKBR
line 1205: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-A2BQUJWQDWCKFTER
line 1206: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-K3IKJHQUW5PKMDAT
line 1207: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-YGRLCK7WQMPNDVO7
line 1208: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-DCWT2LHPTM4JKCNT
line 1209: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-DX3H5MAC42XR4HMC
line 1210: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-2OG3L5F66KSOLZHW
line 1211: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-I4QQM3GTUUSE2DGH
line 1212: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-5AGKG72YGQHLTOKP
line 1213: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-FMWA5JJNM4VCPMOU
)

E0619 11:56:05.878401    2239 iptables.go:409] running iptables-restore [-w 5 -W 100000 --noflush --counters]

Comment 16 Phil Sutter 2020-06-19 14:56:38 UTC
(In reply to jtanenba from comment #15)
> E0619 11:56:05.878401    2239 iptables.go:409] running iptables-restore [-w
> 5 -W 100000 --noflush --counters]

Failing command input is missing. The above are just the parameters; they didn't change and by themselves they do nothing. What's fed on stdin is what makes the difference.

Comment 17 Dan Williams 2020-06-19 18:45:27 UTC
@jacob OpenShift doesn't use the container's iptables (which would be RHEL7) but the host's iptables (which would be RHCOS 8). So you need to get the host's iptables RPM version.

Comment 18 Jacob Tanenbaum 2020-06-23 13:22:02 UTC
dump of iptables before restore: http://pastebin.test.redhat.com/877856
iptables inputs: http://pastebin.test.redhat.com/877858


E0621 22:57:59.236322    2457 proxier.go:1551] Failed to execute iptables-restore: exit status 4 (iptables-restore v1.8.4 (nf_tables): 
line 1295: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-TIC2QGAMU7NUSSEX
line 1296: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-YVONFYXSTECR7U55
line 1297: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-QO6KDN3UJJIOHA47
line 1298: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-75VO4UUHJCVASJNQ
line 1299: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-FYFKVH4XYD6JWLB5
line 1300: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-N2LFPK6FMLN4SXIY
line 1301: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-YWRZC2TRJDVFSUWZ
line 1302: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-M3QJUVJHMWJDEETQ

[jtanenba@dell-pe-fm120-1b go-controller]$ oc rsh -n openshift-sdn sdn-4vlcg
sh-4.2# chroot /host
sh-4.4# rpm -qa | grep iptables
iptables-libs-1.8.4-10.el8.x86_64
iptables-1.8.4-10.el8.x86_64
sh-4.4# 

I hope that is the iptables information you need; let me know if I got it wrong again.

Comment 19 Phil Sutter 2020-06-24 14:02:04 UTC
Hi Jacob,

(In reply to jtanenba from comment #18)
> dump of iptables before restore: http://pastebin.test.redhat.com/877856
> iptables inputs: http://pastebin.test.redhat.com/877858
[...]
> sh-4.4# rpm -qa | grep iptables
> iptables-libs-1.8.4-10.el8.x86_64
> iptables-1.8.4-10.el8.x86_64

For some reason I can't reproduce this. I applied the dump in 877856 first, then
the one in 877858 using options --noflush and --counters. Both commands
returned zero exit status.
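
Concretely, that was roughly the following, with the two pastes saved to local files (the file names here are mine):

# iptables-restore < dump-877856.txt
# iptables-restore --noflush --counters < input-877858.txt
# echo $?
0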

What kernel version are you seeing the problem with?

Cheers, Phil

Comment 20 Jacob Tanenbaum 2020-06-25 15:29:30 UTC
sh-4.2# uname -a 
Linux ip-10-0-150-200 4.18.0-193.9.1.el8_2.x86_64 #1 SMP Sun Jun 14 15:03:05 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Comment 21 Phil Sutter 2020-06-27 10:15:17 UTC
Hi,

OK, I'm officially out of ideas here.

I made sure to have a matching kernel and iptables version installed but still can't reproduce it with the dumps in comment 18.

How easily can you reproduce it? Maybe we can have a debugging session together next week?

Cheers, Phil

Comment 22 Carl Henrik Lunde 2020-06-30 09:08:16 UTC
Would it help to have private test clusters with this issue? I see this issue happening on random nodes, 15 times per hour in total on our clusters. If you have something to run to debug this which is not very obtrusive, I can try.

Comment 23 Phil Sutter 2020-06-30 11:50:03 UTC
Created attachment 1699289 [details]
fake iptables-restore script

Comment 24 Phil Sutter 2020-06-30 12:03:43 UTC
Hi Carl,

(In reply to Carl Henrik Lunde from comment #22)
> Would it help to have private test clusters with this issue? I see this
> issue happening on random nodes, 15 times per hour in total on our clusters.
> If you have something to run to debug this which is not very obtrusive I can
> try.

Thanks for your offer. I have attached a shell script to this ticket which you
could install onto the host running the containers causing the problems.

It is designed to replace the iptables-restore symlink (which is what the container
software hopefully calls). Install like so:

# mv /usr/sbin/iptables-restore /usr/sbin/iptables-restore.bak
# cp iptables-fake-restore.sh /usr/sbin/iptables-restore

The script swallows iptables-restore error messages, but retains its return
code. After the error has happened, you should see log files in /tmp named
'failure_<PID>.log' (with <PID> being the script's process ID). Collect a few
of those, then restore the original iptables-restore like so:

# mv /usr/sbin/iptables-restore.bak /usr/sbin/iptables-restore
mv: overwrite '/usr/sbin/iptables-restore'? y

I hope that's not too intrusive.
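
In rough outline, the wrapper does something like this (a sketch only, the attached script may differ in details):

#!/bin/sh
# keep a copy of stdin and stderr, call the real binary, preserve its exit status
input=/tmp/input_$$.txt
errors=/tmp/errors_$$.txt
cat - > "$input"
/usr/sbin/iptables-restore.bak "$@" < "$input" 2> "$errors"
rc=$?
if [ $rc -ne 0 ]; then
	# only keep the evidence when iptables-restore actually failed
	{
		echo "=== args: $*"
		echo "=== input:"
		cat "$input"
		echo "=== errors:"
		cat "$errors"
	} > /tmp/failure_$$.log
fi
rm -f "$input" "$errors"
exit $rc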

Thanks, Phil

Comment 25 Carl Henrik Lunde 2020-07-02 07:20:13 UTC
Hi, I had to package this as an RPM to install on RHCOS, because /usr/sbin is RO, and also change the temp directory to /var/tmp, because it runs in a chroot where /tmp is RO.

I looked at the failures from 27 SVC chains. 25 of them seem to be related to terminating CI namespaces. The other 2 were a service with a single pod in a crash loop, which was also cycling in and out of the service.

If I restore one of the input files in a busy loop, it fails with the same error after a while, but this will "fight" with openshift-sdn, so I guess this might be expected?

One strange thing: From strace -e flock (or -e file/open) I cannot see any traces of xtables.lock. I see it on RHEL7 but not on FCOS. Am I missing something here?

Comment 26 Phil Sutter 2020-07-02 09:18:03 UTC
Hi Carl,

(In reply to Carl Henrik Lunde from comment #25)
> Hi, I had to package this as an RPM to install on RHCOS, because /usr/sbin
> is RO, and also change the temp directory to /var/tmp, because it runs in a
> chroot where /tmp is RO.

Oh, I didn't know. Thanks for fixing it!

> I looked at the failures from 27 SVC chains. 25 seems to be related to
> terminating CI namespaces. 2 of them was a service with a single pod in
> crash loops which also (in/out of the service).

Maybe a stupid question, but what is a CI namespace?

> If I restore one of the input files in a busy loop, it fails with the same
> error after a while, but this will "fight" with openshift-sdn so I guess
> this might be expected?

Possibly, yes.

> One strange thing: From strace -e flock (or -e file/open) I cannot see any
> traces of xtables.lock. I see it on RHEL7 but not on FCOS. Am I missing
> something here?

That's iptables-nft magic. :)

RHEL8 uses nft-variants for ip(6)-, eb- and arptables. Since they use nfnetlink
internally, ruleset changes are transactions in kernel space and therefore the
problematic locking in user space is not required anymore.
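
A quick way to see the difference on a box that has both variants installed (iptables-legacy may not be shipped on RHCOS, so treat this as an illustration):

# strace -e trace=%file iptables-legacy -L -n 2>&1 >/dev/null | grep xtables.lock
# strace -e trace=%file iptables-nft -L -n 2>&1 >/dev/null | grep xtables.lock

The first one should show the open of /run/xtables.lock, the second one should show no hit at all.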

Cheers, Phil

Comment 27 Carl Henrik Lunde 2020-07-02 10:30:32 UTC
By "CI namespace" I meant a temporary namespace in kubernetes where someone runs
continuous integration for their application, and then deletes it which recursively
will delete all services in it. So the point is that it is more "dynamic" and cause
multiple services to be deleted, somewhat like a stress test :)

Thanks for the info regarding locking; I see with strace that there's now one
sendmsg() syscall per table.

Comment 28 Phil Sutter 2020-07-02 11:26:57 UTC
Hi Carl,

(In reply to Carl Henrik Lunde from comment #27)
> By "CI namespace" I meant a temporary namespace in kubernetes where someone
> runs
> continuous integration for their application, and then deletes it which
> recursively
> will delete all services in it. So the point is that it is more "dynamic"
> and cause
> multiple services to be deleted, somewhat like a stress test :)

I see. Thanks for clarifying.

> Thanks for the info regarding locking, I see with strace that there's now one
> syscall/sendmsg per table.

With legacy iptables that should have been one ioctl() call per table.

Thanks a lot for doing the debugging and providing the logs!

Sadly they don't provide any further insights: on a system running the reported
kernel and iptables versions, iptables-restore with the given input succeeds
fine after priming the ruleset accordingly.

Meanwhile I am almost certain this is not a problem of iptables itself, but of
something in OpenShift. My bet would be that multiple iptables(-restore)
processes are called in parallel and the calling software relies upon the xtables
lock to serialize them. With iptables-nft this is not as predictable anymore;
likely 'iptables-restore --noflush' (especially with large input) takes longer
to start up than a plain call to 'iptables' takes to complete.

Jacob, you may want to reassign this ticket to whatever component you see fit.
If not, I'll close as NOTABUG.

Thanks, Phil

Comment 29 Phil Sutter 2020-09-08 11:53:51 UTC
Closing as per comment 28.

Comment 30 Dan Winship 2020-10-02 12:08:43 UTC
OK, based on https://github.com/projectcalico/felix/blob/master/iptables/table.go#L1209-L1212 I reproduced this; the problem occurs if you have two parallel iptables-restore calls and one deletes a chain and the other doesn't. It seems like the iptables-restore logic assumes that what's true at the start of the operation will stay true throughout, but with the improved locking in iptables-nft relative to iptables-legacy, that's no longer true.

[root@p50 danw]# cat /tmp/r1.txt
# Generated by iptables-nft-save v1.8.4 on Fri Oct  2 07:55:31 2020
*filter
:INPUT ACCEPT [12024:3123388]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [12840:2144421]
:FOO - [0:0]
:BAR0 - [0:0]
:BAR1 - [0:0]
:BAR2 - [0:0]
:BAR3 - [0:0]
:BAR4 - [0:0]
:BAR5 - [0:0]
:BAR6 - [0:0]
:BAR7 - [0:0]
:BAR8 - [0:0]
:BAR9 - [0:0]
-X BAR3
-X BAR6
-X BAR9
-A FOO -s 9.9.0.1/32 -j BAR1
-A FOO -s 9.9.0.2/32 -j BAR2
-A FOO -s 9.9.0.4/32 -j BAR4
-A FOO -s 9.9.0.5/32 -j BAR5
-A FOO -s 9.9.0.7/32 -j BAR7
-A FOO -s 9.9.0.8/32 -j BAR8
COMMIT
# Completed on Fri Oct  2 07:55:31 2020

[root@p50 danw]# cat /tmp/r2.txt 
# Generated by iptables-nft-save v1.8.4 on Fri Oct  2 07:55:31 2020
*filter
:INPUT ACCEPT [12024:3123388]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [12840:2144421]
:FOO - [0:0]
:BAR0 - [0:0]
:BAR1 - [0:0]
:BAR2 - [0:0]
:BAR3 - [0:0]
:BAR4 - [0:0]
:BAR5 - [0:0]
:BAR6 - [0:0]
:BAR7 - [0:0]
:BAR8 - [0:0]
:BAR9 - [0:0]
-X BAR2
-X BAR5
-X BAR7
-A FOO -s 9.9.0.1/32 -j BAR1
-A FOO -s 9.9.0.3/32 -j BAR3
-A FOO -s 9.9.0.4/32 -j BAR4
-A FOO -s 9.9.0.6/32 -j BAR6
-A FOO -s 9.9.0.8/32 -j BAR8
-A FOO -s 9.9.0.9/32 -j BAR9
COMMIT
# Completed on Fri Oct  2 07:55:31 2020

[root@p50 danw]# for n in $(seq 1 10); do iptables-nft-restore --noflush < /tmp/r1.txt & iptables-nft-restore --noflush < /tmp/r2.txt & done
[1] 1528996
[2] 1528997
[3] 1528998
[4] 1528999
[5] 1529000
[6] 1529001
[7] 1529002
[8] 1529003
[9] 1529004
[10] 1529005
[11] 1529006
[12] 1529007
[13] 1529008
[14] 1529009
[15] 1529010
[16] 1529011
[17] 1529012
[18] 1529013
[19] 1529014
[20] 1529015
[1]   Done                    iptables-nft-restore --noflush < /tmp/r1.txt
[root@p50 danw]# iptables-nft-restore v1.8.4 (nf_tables): 
line 17: CHAIN_USER_DEL failed (Device or resource busy): chain BAR2
line 18: CHAIN_USER_DEL failed (Device or resource busy): chain BAR5
line 19: CHAIN_USER_DEL failed (Device or resource busy): chain BAR7
iptables-nft-restore v1.8.4 (nf_tables): 
line 10: CHAIN_USER_ADD failed (Device or resource busy): chain BAR3
line 13: CHAIN_USER_ADD failed (Device or resource busy): chain BAR6
line 16: CHAIN_USER_ADD failed (Device or resource busy): chain BAR9
line 21: RULE_APPEND failed (No such file or directory): rule in chain FOO
line 23: RULE_APPEND failed (No such file or directory): rule in chain FOO
line 24: RULE_APPEND failed (No such file or directory): rule in chain FOO
iptables-nft-restore v1.8.4 (nf_tables): 
line 10: CHAIN_USER_ADD failed (No such file or directory): chain BAR3
line 13: CHAIN_USER_ADD failed (No such file or directory): chain BAR6
iptables-nft-restore v1.8.4 (nf_tables): 
line 10: CHAIN_USER_ADD failed (No such file or directory): chain BAR3
line 13: CHAIN_USER_ADD failed (No such file or directory): chain BAR6
iptables-nft-restore: line 26 failed
iptables-nft-restore v1.8.4 (nf_tables): 
line 10: CHAIN_USER_ADD failed (No such file or directory): chain BAR3
line 13: CHAIN_USER_ADD failed (No such file or directory): chain BAR6
iptables-nft-restore v1.8.4 (nf_tables): 
line 10: CHAIN_USER_ADD failed (No such file or directory): chain BAR3
line 13: CHAIN_USER_ADD failed (No such file or directory): chain BAR6
iptables-nft-restore: line 26 failed
iptables-nft-restore: line 26 failed
iptables-nft-restore v1.8.4 (nf_tables): 
line 10: CHAIN_USER_ADD failed (No such file or directory): chain BAR3
line 13: CHAIN_USER_ADD failed (No such file or directory): chain BAR6
iptables-nft-restore: line 26 failed
iptables-nft-restore v1.8.4 (nf_tables): 
line 10: CHAIN_USER_ADD failed (No such file or directory): chain BAR3
line 13: CHAIN_USER_ADD failed (No such file or directory): chain BAR6
iptables-nft-restore: line 26 failed
iptables-nft-restore v1.8.4 (nf_tables): 
line 10: CHAIN_USER_ADD failed (No such file or directory): chain BAR3
line 13: CHAIN_USER_ADD failed (No such file or directory): chain BAR6
iptables-nft-restore: line 26 failed
iptables-nft-restore: line 26 failed
iptables-nft-restore v1.8.4 (nf_tables): 
line 10: CHAIN_USER_ADD failed (No such file or directory): chain BAR3
line 13: CHAIN_USER_ADD failed (No such file or directory): chain BAR6
iptables-nft-restore: line 26 failed

[2]   Exit 4                  iptables-nft-restore --noflush < /tmp/r2.txt
[3]   Exit 4                  iptables-nft-restore --noflush < /tmp/r1.txt
[4]   Done                    iptables-nft-restore --noflush < /tmp/r2.txt
[5]   Exit 4                  iptables-nft-restore --noflush < /tmp/r1.txt
[6]   Exit 1                  iptables-nft-restore --noflush < /tmp/r2.txt
[7]   Exit 4                  iptables-nft-restore --noflush < /tmp/r1.txt
[8]   Exit 1                  iptables-nft-restore --noflush < /tmp/r2.txt
[9]   Exit 4                  iptables-nft-restore --noflush < /tmp/r1.txt
[10]   Exit 1                  iptables-nft-restore --noflush < /tmp/r2.txt
[11]   Exit 4                  iptables-nft-restore --noflush < /tmp/r1.txt
[12]   Exit 1                  iptables-nft-restore --noflush < /tmp/r2.txt
[13]   Exit 4                  iptables-nft-restore --noflush < /tmp/r1.txt
[14]   Exit 1                  iptables-nft-restore --noflush < /tmp/r2.txt
[15]   Exit 4                  iptables-nft-restore --noflush < /tmp/r1.txt
[16]   Exit 1                  iptables-nft-restore --noflush < /tmp/r2.txt
[17]   Exit 4                  iptables-nft-restore --noflush < /tmp/r1.txt
[18]   Exit 1                  iptables-nft-restore --noflush < /tmp/r2.txt
[19]-  Exit 4                  iptables-nft-restore --noflush < /tmp/r1.txt
[20]+  Exit 1                  iptables-nft-restore --noflush < /tmp/r2.txt

[root@p50 danw]# iptables-nft -V
iptables v1.8.4 (nf_tables)
[root@p50 danw]# iptables-nft ! -V
Not 1.8.4 ;-)

(Sorry, the "! -V" thing cracks me up every time.)

The reproducer can probably be simplified further; I started with 20,000 rules just in case it only happened at scale... :)

Comment 32 Eric Garver 2020-10-02 13:42:12 UTC
Simplified reproducer:

[root@vmhost-rhel8-test1 ~]# cat /tmp/1.txt 
*filter
:INPUT ACCEPT [12024:3123388]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [12840:2144421]
:FOO - [0:0]
:BAR1 - [0:0]
:BAR2 - [0:0]
-X BAR1
-A FOO -s 9.9.0.1/32 -j BAR2
COMMIT

[root@vmhost-rhel8-test1 ~]# cat /tmp/2.txt 
*filter
:INPUT ACCEPT [12024:3123388]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [12840:2144421]
:FOO - [0:0]
:BAR1 - [0:0]
:BAR2 - [0:0]
-X BAR2
-A FOO -s 9.9.0.1/32 -j BAR1
COMMIT

[root@vmhost-rhel8-test1 ~]# iptables-restore -n /tmp/1.txt & iptables-restore -n /tmp/2.txt &
[1] 2186003
[2] 2186004
[root@vmhost-rhel8-test1 ~]# iptables-restore v1.8.4 (nf_tables): 
line 8: CHAIN_USER_DEL failed (Device or resource busy): chain BAR1

[1]-  Exit 4                  iptables-restore -n /tmp/1.txt
[2]+  Done                    iptables-restore -n /tmp/2.txt


Doesn't occur if done serially:

[root@vmhost-rhel8-test1 ~]# iptables-restore -n /tmp/1.txt
[root@vmhost-rhel8-test1 ~]# iptables-restore -n /tmp/2.txt
[root@vmhost-rhel8-test1 ~]# iptables-restore -n /tmp/1.txt
[root@vmhost-rhel8-test1 ~]# iptables-restore -n /tmp/2.txt

Comment 33 Phil Sutter 2020-10-05 14:33:33 UTC
Fix sent upstream: https://lore.kernel.org/netfilter-devel/20201005144858.11578-1-phil@nwl.cc/

Thanks for the good reproducer!

Comment 34 Dan Williams 2020-10-05 20:58:46 UTC
Phil, the scratch build you did with these patches resolves the issue in the customer cluster. No more CHAIN_USER_DEL messages are seen, and the iptables-restore completes successfully. Looks good.

Comment 35 Phil Sutter 2020-10-05 22:00:04 UTC
(In reply to Dan Williams from comment #34)
> Phil, the scratch build you did with these patches resolves the issue in the
> customer cluster. No more CHAIN_USER_DEL messages are seen, and the
> iptables-restore completes successfully. Looks good.

Great, thanks for verifying!

Comment 36 Phil Sutter 2020-10-10 09:10:14 UTC
The first two patches have been applied; the third one (which contains the actual fix) is still under review:

0d77e64e8d9b8 ("nft: Make batch_add_chain() return the added batch object")
e98b825a03780 ("nft: Fix error reporting for refreshed transactions")

https://lore.kernel.org/netfilter-devel/20201005144858.11578-4-phil@nwl.cc/

Comment 42 Phil Sutter 2020-11-09 15:46:39 UTC
A failing RHEL8.2.z CI test exposed an issue related to this one, but with a flushing restore. Basically we're missing:

commit 40ad7793d1884f28767cf58c96e9d76ae0a18db1
Author: Phil Sutter <phil>
Date:   Sat Feb 29 02:08:26 2020 +0100

    nft: cache: Make nft_rebuild_cache() respect fake cache
    
    If transaction needed a refresh in nft_action(), restore with flush
    would fetch a full cache instead of merely refreshing table list
    contained in "fake" cache.
    
    To fix this, nft_rebuild_cache() must distinguish between fake cache and
    full rule cache. Therefore introduce NFT_CL_FAKE to be distinguished
    from NFT_CL_RULES.
    
    Signed-off-by: Phil Sutter <phil>

Plus one extra fixup:

--- a/iptables/nft-cache.c
+++ b/iptables/nft-cache.c
@@ -630,7 +630,7 @@ void nft_rebuild_cache(struct nft_handle *h)
 	if (h->cache_level)
 		__nft_flush_cache(h);
 
-	if (h->cache_level == NFT_CL_FAKE) {
+	if (level == NFT_CL_FAKE) {
 		nft_fake_cache(h);
 	} else {
 		h->nft_genid = 0;

This is RHEL-only: the upstream cache logic has changed so much that the above doesn't apply there anymore.

Comment 49 errata-xmlrpc 2021-05-18 14:58:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (iptables bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1642