Bug 1812261 - iptables-restore is segfaulting multiple times during an e2e run on multiple nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: Aniket Bhat
QA Contact: zhaozhanqi
URL:
Whiteboard:
Duplicates: 1811342 1813214 1814334
Depends On:
Blocks:
 
Reported: 2020-03-10 21:04 UTC by Clayton Coleman
Modified: 2020-05-04 11:46 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:45:45 UTC
Target Upstream Version:
Embargoed:


Attachments
coredump from ci run (340.00 KB, application/x-tar)
2020-03-11 19:34 UTC, Clayton Coleman


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:46:06 UTC

Description Clayton Coleman 2020-03-10 21:04:19 UTC
A GCP 4.5 cluster reported iptables segfaults repeatedly during runs (maybe 10 times in total).

Mar 10 04:23:49.492382 ci-op-s694g-m-0.c.openshift-gce-devel-ci.internal systemd-coredump[104947]: Process 104943 (iptables-restor) of user 0 dumped core.

    Stack trace of thread 104943:
    #0  0x00007f6b4588e49b nftnl_expr_build_payload (libnftnl.so.11)
    #1  0x00007f6b45888c3b nftnl_rule_nlmsg_build_payload (libnftnl.so.11)
    #2  0x0000561273ca0c1c nft_action (/usr/sbin/xtables-nft-multi)
    #3  0x0000561273c9a156 xtables_restore_parse_line (/usr/sbin/xtables-nft-multi)
    #4  0x0000561273c9a62a xtables_restore_parse (/usr/sbin/xtables-nft-multi)
    #5  0x0000561273c9aa51 xtables_restore_main (/usr/sbin/xtables-nft-multi)
    #6  0x00007f6b446d4873 __libc_start_main (libc.so.6)
    #7  0x0000561273c97e3e _start (/usr/sbin/xtables-nft-multi)
Mar 10 04:23:49.502251 ci-op-s694g-m-0.c.openshift-gce-devel-ci.internal systemd[1]: systemd-coredump: Consumed 752ms CPU time

This is a likely cause of several flakes: a crashing iptables-restore would prevent us from updating service rules.

We need to check how far back this is happening.

Comment 2 Clayton Coleman 2020-03-10 21:12:51 UTC
On aws

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/217/artifacts/e2e-aws/nodes/

grep "segfault" ~/Downloads/workers-journal-10
Mar 10 19:45:50.161776 ip-10-0-138-153 kernel: iptables-save[16510]: segfault at 80 ip 00007f24fd2e42c4 sp 00007ffc11488818 error 4 in libnftnl.so.11.2.0[7f24fd2d5000+2c000]
Mar 10 19:54:03.107491 ip-10-0-138-153 kernel: iptables-save[84918]: segfault at 80 ip 00007fd47025f2c4 sp 00007ffccb82f928 error 4 in libnftnl.so.11.2.0[7fd470250000+2c000]
Mar 10 19:51:17.970909 ip-10-0-138-186 kernel: iptables[66915]: segfault at 80 ip 00007f9c48fbf2c4 sp 00007ffc101e7478 error 4 in libnftnl.so.11.2.0[7f9c48fb0000+2c000]
Mar 10 19:51:18.817755 ip-10-0-138-186 kernel: iptables-save[67093]: segfault at 80 ip 00007f6f33b502c4 sp 00007ffc39808508 error 4 in libnftnl.so.11.2.0[7f6f33b41000+2c000]
Mar 10 19:51:19.626851 ip-10-0-138-186 kernel: iptables-restor[67290]: segfault at 18 ip 00007f43c4e3b49b sp 00007fff78708e10 error 4 in libnftnl.so.11.2.0[7f43c4e25000+2c000]
Mar 10 19:51:22.600391 ip-10-0-138-186 kernel: iptables-save[67918]: segfault at 80 ip 00007f1382f512c4 sp 00007ffd18d08ce8 error 4 in libnftnl.so.11.2.0[7f1382f42000+2c000]
Mar 10 19:51:23.237457 ip-10-0-138-186 kernel: iptables-restor[67953]: segfault at ffffffe0 ip 00007faf5f469d07 sp 00007fff05b4cad8 error 4 in libc-2.28.so[7faf5f30c000+1b9000]
Mar 10 20:01:44.407400 ip-10-0-138-186 kernel: iptables-save[201004]: segfault at 80 ip 00007f1dbcd7a2c4 sp 00007ffe106d85c8 error 4 in libnftnl.so.11.2.0[7f1dbcd6b000+2c000]
Mar 10 20:03:23.744734 ip-10-0-138-186 kernel: iptables-restor[224900]: segfault at 99 ip 00007fbb8aebe49b sp 00007ffd1ed46210 error 4 in libnftnl.so.11.2.0[7fbb8aea8000+2c000]
Mar 10 19:55:42.073071 ip-10-0-150-43 kernel: iptables-save[127375]: segfault at 80 ip 00007f9b899e42c4 sp 00007ffdb5cb8ab8 error 4 in libnftnl.so.11.2.0[7f9b899d5000+2c000]
Mar 10 19:57:00.115999 ip-10-0-150-43 kernel: iptables-restor[137883]: segfault at 50492060 ip 00007fab88704d07 sp 00007ffda600c138 error 4 in libc-2.28.so[7fab885a7000+1b9000]
Mar 10 19:58:32.519735 ip-10-0-150-43 kernel: iptables-save[156848]: segfault at 80 ip 00007f0e312892c4 sp 00007ffc5fbec928 error 4 in libnftnl.so.11.2.0[7f0e3127a000+2c000]
Mar 10 19:58:32.552517 ip-10-0-150-43 kernel: iptables-restor[156843]: segfault at 0 ip 00007f76a7dbeebd sp 00007ffcd653b710 error 4 in libnftnl.so.11.2.0[7f76a7db2000+2c000]
Mar 10 20:06:31.309840 ip-10-0-150-43 kernel: iptables-save[253307]: segfault at 80 ip 00007fc83b4d32c4 sp 00007ffc57aa1978 error 4 in libnftnl.so.11.2.0[7fc83b4c4000+2c000]
Mar 10 20:06:40.633635 ip-10-0-150-43 kernel: iptables-save[255938]: segfault at 80 ip 00007ff0611e62c4 sp 00007ffcc9fe8938 error 4 in libnftnl.so.11.2.0[7ff0611d7000+2c000]
Mar 10 20:06:45.897857 ip-10-0-150-43 kernel: iptables-save[258365]: segfault at 80 ip 00007f70822a82c4 sp 00007ffdb5fb40b8 error 4 in libnftnl.so.11.2.0[7f7082299000+2c000]
Mar 10 20:06:53.613746 ip-10-0-150-43 kernel: iptables-save[260152]: segfault at 80 ip 00007fcd78eef2c4 sp 00007ffc61839f18 error 4 in libnftnl.so.11.2.0[7fcd78ee0000+2c000]
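
One way to check whether these are the same crash repeating is to subtract the mapping base (the value in brackets) from the faulting instruction pointer and group the results. The following is a minimal sketch, not part of the original report; the regex, the helper names `crash_site` and `tally`, and the `SAMPLE` list are illustrative assumptions, and the offset is relative to the mapping the kernel reports, which may not equal the offset in the library file.

```python
import re
from collections import Counter

# Kernel message shape:
#   <proc>[<pid>]: segfault at ADDR ip IP sp SP error N in LIB[BASE+SIZE]
SEGFAULT_RE = re.compile(
    r"kernel: (?P<proc>[^\[\s]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def crash_site(line):
    """Return (process, library, offset of ip within the mapping), or None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    offset = int(m.group("ip"), 16) - int(m.group("base"), 16)
    return m.group("proc"), m.group("lib"), offset


def tally(lines):
    """Count crashes per distinct (process, library, offset) site."""
    sites = (crash_site(line) for line in lines)
    return Counter(site for site in sites if site is not None)


# Three lines copied from the journal excerpt above.
SAMPLE = [
    "Mar 10 19:45:50.161776 ip-10-0-138-153 kernel: iptables-save[16510]: segfault at 80 ip 00007f24fd2e42c4 sp 00007ffc11488818 error 4 in libnftnl.so.11.2.0[7f24fd2d5000+2c000]",
    "Mar 10 19:54:03.107491 ip-10-0-138-153 kernel: iptables-save[84918]: segfault at 80 ip 00007fd47025f2c4 sp 00007ffccb82f928 error 4 in libnftnl.so.11.2.0[7fd470250000+2c000]",
    "Mar 10 19:51:19.626851 ip-10-0-138-186 kernel: iptables-restor[67290]: segfault at 18 ip 00007f43c4e3b49b sp 00007fff78708e10 error 4 in libnftnl.so.11.2.0[7f43c4e25000+2c000]",
]

for (proc, lib, offset), count in tally(SAMPLE).most_common():
    print(f"{count:3d}  {proc:16s} {lib}+{offset:#x}")
```

Feeding the whole journal through `tally` instead of `SAMPLE` should show how many distinct crash sites there are, independent of where each process happened to load the library.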

Comment 3 Clayton Coleman 2020-03-10 21:14:13 UTC
Happening in 4.4, release blocker:

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.4/1908/artifacts/e2e-aws/nodes/

grep "segfault" ~/Downloads/workers-journal-12
Mar 10 20:51:43.912181 ip-10-0-134-209 kernel: iptables-save[294911]: segfault at 80 ip 00007f227d2d72c4 sp 00007ffc111c62c8 error 4 in libnftnl.so.11.2.0[7f227d2c8000+2c000]
Mar 10 20:40:00.021182 ip-10-0-134-31 kernel: iptables-restor[144305]: segfault at 7f2066ffddb0 ip 00007f2032141394 sp 00007ffdcbb6d568 error 4 in libc-2.28.so[7f2031fe3000+1b9000]
Mar 10 20:41:00.000713 ip-10-0-134-31 kernel: iptables[155669]: segfault at 80 ip 00007f2262e922c4 sp 00007ffff81591f8 error 4 in libnftnl.so.11.2.0[7f2262e83000+2c000]
Mar 10 20:33:44.692800 ip-10-0-156-27 kernel: iptables-save[77075]: segfault at 80 ip 00007fea9494b2c4 sp 00007ffe473c4848 error 4 in libnftnl.so.11.2.0[7fea9493c000+2c000]
Mar 10 20:38:34.544858 ip-10-0-156-27 kernel: iptables-save[133984]: segfault at 80 ip 00007f0ba2dd72c4 sp 00007fffa1ed21d8 error 4 in libnftnl.so.11.2.0[7f0ba2dc8000+2c000]
Mar 10 20:43:44.706331 ip-10-0-156-27 kernel: iptables-restor[190893]: segfault at 0 ip 00007fbc49429cd5 sp 00007ffe34c19898 error 4 in libc-2.28.so[7fbc492cc000+1b9000]
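
For reading the `error 4` and `segfault at ...` fields, here is a small sketch that decodes the low bits of the page-fault error code. It assumes the nodes are x86-64 (the bug lists Hardware as Unspecified), where bit 0 means the page was present, bit 1 means the access was a write, and bit 2 means the fault came from user mode. The helper name `decode_fault` and the 0x1000 "near NULL" threshold are illustrative choices, not something taken from the report.

```python
def decode_fault(error_code, fault_addr):
    """Decode the low bits of an x86 page-fault error code.

    Assumes x86/x86-64 semantics for the error code. The near_null flag
    uses a hypothetical threshold of one 4 KiB page.
    """
    return {
        "page_present": bool(error_code & 0x1),  # 0 means no page was mapped
        "write_access": bool(error_code & 0x2),  # 0 means the access was a read
        "user_mode": bool(error_code & 0x4),     # 1 means fault raised in user space
        "near_null": fault_addr < 0x1000,        # small address: likely NULL + field offset
    }


# "segfault at 80 ... error 4", the most common pattern in the logs above.
print(decode_fault(4, 0x80))
# "segfault at 7f2066ffddb0 ... error 4", one of the libc crashes.
print(decode_fault(4, 0x7F2066FFDDB0))
```

Under those assumptions, `error 4` with a fault address of 80, 18, or 0 reads as a user-mode read through a NULL or near-NULL pointer, while the libc crashes at large addresses look more like reads through a stale or corrupted pointer.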

Comment 5 Ben Bennett 2020-03-11 13:56:15 UTC
*** Bug 1811342 has been marked as a duplicate of this bug. ***

Comment 6 Dan Williams 2020-03-11 16:58:37 UTC
Clayton, are we able to get either:

1) coredumps
2) 'iptables-save' output on the node?

I know we don't run the networking bits of must-gather by default, but this would be a great time to have that info :(

Comment 7 Dan Williams 2020-03-11 16:59:30 UTC
Also, what specific RPM version of iptables is installed on whatever version of RHCOS is running on the node?

Comment 8 Dan Williams 2020-03-11 17:10:40 UTC
Never mind, Phil is all over it in bug 1807811.

*** This bug has been marked as a duplicate of bug 1807811 ***

Comment 9 Clayton Coleman 2020-03-11 19:34:06 UTC
Created attachment 1669423 [details]
coredump from ci run

Taken from a 4.5 master run.

Comment 13 Aniket Bhat 2020-03-16 14:46:24 UTC
*** Bug 1813214 has been marked as a duplicate of this bug. ***

Comment 18 Micah Abbott 2020-03-18 13:25:54 UTC
`iptables-1.8.4-10.el8` with the fix landed in the following RHCOS versions: 

45.81.202003180406-0
44.81.202003172130-0
43.81.202003172053.0

These should have been picked up by the respective CI/nightly release payloads, as well.

Let me know if the issue persists.

Comment 19 W. Trevor King 2020-03-19 00:01:03 UTC
Checking on the promotion, this is the current, promoted machine-os-content for 4.4:

$ oc image info --output json registry.svc.ci.openshift.org/ocp/4.4:machine-os-content | jq -r .config.config.Labels.version
44.81.202003180730-0

so we should be good to go there.  Not sure which nightly that went out with, but it's at least in:

$ oc image info --output json $(oc adm release info --image-for=machine-os-content registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-03-18-102708) | jq -r .config.config.Labels.version
44.81.202003180730-0
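
The comparison against the fixed builds listed in Comment 18 can be scripted. This is a rough sketch, assuming RHCOS version strings have the shape `<stream>.<rhel>.<timestamp>-<rev>` and that a later timestamp in the same stream contains everything earlier builds did. The helper names `parse_rhcos_version` and `includes_fix` are hypothetical; the version strings are the ones quoted in this bug.

```python
import re

# Accepts both "44.81.202003172130-0" and the "43.81.202003172053.0" spelling.
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)[-.](\d+)$")


def parse_rhcos_version(version):
    """Split a version string into (stream, rhel, timestamp, revision) integers."""
    m = VERSION_RE.match(version.strip())
    if m is None:
        raise ValueError(f"unrecognized RHCOS version: {version!r}")
    return tuple(int(part) for part in m.groups())


def includes_fix(candidate, first_fixed):
    """True if candidate is in the same stream and not older than first_fixed.

    Assumption: builds within one stream are ordered by (timestamp, revision).
    """
    cand = parse_rhcos_version(candidate)
    fixed = parse_rhcos_version(first_fixed)
    return cand[:2] == fixed[:2] and cand[2:] >= fixed[2:]


# Promoted 4.4 machine-os-content versus the first fixed 4.4 build.
print(includes_fix("44.81.202003180730-0", "44.81.202003172130-0"))
```

The stream check matters because a 4.5 build number is always larger than a 4.4 one, so comparing across streams would say nothing about whether the fix is present.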

Comment 23 zhaozhanqi 2020-03-23 10:54:22 UTC
This issue did not reproduce on 4.4.0-0.nightly-2020-03-22-214549 or 4.4.0-0.nightly-2020-03-23-010639
with RHCOS image 44.81.202003192230-0.
Marking this bug as verified.

Comment 24 Jacob Tanenbaum 2020-03-24 20:59:04 UTC
*** Bug 1814334 has been marked as a duplicate of this bug. ***

Comment 26 errata-xmlrpc 2020-05-04 11:45:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

