A GCP 4.5 cluster was reporting iptables segfaults repeatedly during runs (maybe 10 times total):

Mar 10 04:23:49.492382 ci-op-s694g-m-0.c.openshift-gce-devel-ci.internal systemd-coredump[104947]: Process 104943 (iptables-restor) of user 0 dumped core.

Stack trace of thread 104943:
#0  0x00007f6b4588e49b nftnl_expr_build_payload (libnftnl.so.11)
#1  0x00007f6b45888c3b nftnl_rule_nlmsg_build_payload (libnftnl.so.11)
#2  0x0000561273ca0c1c nft_action (/usr/sbin/xtables-nft-multi)
#3  0x0000561273c9a156 xtables_restore_parse_line (/usr/sbin/xtables-nft-multi)
#4  0x0000561273c9a62a xtables_restore_parse (/usr/sbin/xtables-nft-multi)
#5  0x0000561273c9aa51 xtables_restore_main (/usr/sbin/xtables-nft-multi)
#6  0x00007f6b446d4873 __libc_start_main (libc.so.6)
#7  0x0000561273c97e3e _start (/usr/sbin/xtables-nft-multi)
Mar 10 04:23:49.502251 ci-op-s694g-m-0.c.openshift-gce-devel-ci.internal systemd[1]: systemd-coredump: Consumed 752ms CPU time

Likely cause of several flakes: a crashing restore would prevent us from updating service rules. We need to check how far back this is happening.
Workers journal in https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/24654/pull-ci-openshift-origin-master-e2e-gcp/6227/artifacts/e2e-gcp/nodes/ should have it (looking at aws runs now)
On aws https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/217/artifacts/e2e-aws/nodes/

$ grep "segfault" ~/Downloads/workers-journal-10
Mar 10 19:45:50.161776 ip-10-0-138-153 kernel: iptables-save[16510]: segfault at 80 ip 00007f24fd2e42c4 sp 00007ffc11488818 error 4 in libnftnl.so.11.2.0[7f24fd2d5000+2c000]
Mar 10 19:54:03.107491 ip-10-0-138-153 kernel: iptables-save[84918]: segfault at 80 ip 00007fd47025f2c4 sp 00007ffccb82f928 error 4 in libnftnl.so.11.2.0[7fd470250000+2c000]
Mar 10 19:51:17.970909 ip-10-0-138-186 kernel: iptables[66915]: segfault at 80 ip 00007f9c48fbf2c4 sp 00007ffc101e7478 error 4 in libnftnl.so.11.2.0[7f9c48fb0000+2c000]
Mar 10 19:51:18.817755 ip-10-0-138-186 kernel: iptables-save[67093]: segfault at 80 ip 00007f6f33b502c4 sp 00007ffc39808508 error 4 in libnftnl.so.11.2.0[7f6f33b41000+2c000]
Mar 10 19:51:19.626851 ip-10-0-138-186 kernel: iptables-restor[67290]: segfault at 18 ip 00007f43c4e3b49b sp 00007fff78708e10 error 4 in libnftnl.so.11.2.0[7f43c4e25000+2c000]
Mar 10 19:51:22.600391 ip-10-0-138-186 kernel: iptables-save[67918]: segfault at 80 ip 00007f1382f512c4 sp 00007ffd18d08ce8 error 4 in libnftnl.so.11.2.0[7f1382f42000+2c000]
Mar 10 19:51:23.237457 ip-10-0-138-186 kernel: iptables-restor[67953]: segfault at ffffffe0 ip 00007faf5f469d07 sp 00007fff05b4cad8 error 4 in libc-2.28.so[7faf5f30c000+1b9000]
Mar 10 20:01:44.407400 ip-10-0-138-186 kernel: iptables-save[201004]: segfault at 80 ip 00007f1dbcd7a2c4 sp 00007ffe106d85c8 error 4 in libnftnl.so.11.2.0[7f1dbcd6b000+2c000]
Mar 10 20:03:23.744734 ip-10-0-138-186 kernel: iptables-restor[224900]: segfault at 99 ip 00007fbb8aebe49b sp 00007ffd1ed46210 error 4 in libnftnl.so.11.2.0[7fbb8aea8000+2c000]
Mar 10 19:55:42.073071 ip-10-0-150-43 kernel: iptables-save[127375]: segfault at 80 ip 00007f9b899e42c4 sp 00007ffdb5cb8ab8 error 4 in libnftnl.so.11.2.0[7f9b899d5000+2c000]
Mar 10 19:57:00.115999 ip-10-0-150-43 kernel: iptables-restor[137883]: segfault at 50492060 ip 00007fab88704d07 sp 00007ffda600c138 error 4 in libc-2.28.so[7fab885a7000+1b9000]
Mar 10 19:58:32.519735 ip-10-0-150-43 kernel: iptables-save[156848]: segfault at 80 ip 00007f0e312892c4 sp 00007ffc5fbec928 error 4 in libnftnl.so.11.2.0[7f0e3127a000+2c000]
Mar 10 19:58:32.552517 ip-10-0-150-43 kernel: iptables-restor[156843]: segfault at 0 ip 00007f76a7dbeebd sp 00007ffcd653b710 error 4 in libnftnl.so.11.2.0[7f76a7db2000+2c000]
Mar 10 20:06:31.309840 ip-10-0-150-43 kernel: iptables-save[253307]: segfault at 80 ip 00007fc83b4d32c4 sp 00007ffc57aa1978 error 4 in libnftnl.so.11.2.0[7fc83b4c4000+2c000]
Mar 10 20:06:40.633635 ip-10-0-150-43 kernel: iptables-save[255938]: segfault at 80 ip 00007ff0611e62c4 sp 00007ffcc9fe8938 error 4 in libnftnl.so.11.2.0[7ff0611d7000+2c000]
Mar 10 20:06:45.897857 ip-10-0-150-43 kernel: iptables-save[258365]: segfault at 80 ip 00007f70822a82c4 sp 00007ffdb5fb40b8 error 4 in libnftnl.so.11.2.0[7f7082299000+2c000]
Mar 10 20:06:53.613746 ip-10-0-150-43 kernel: iptables-save[260152]: segfault at 80 ip 00007fcd78eef2c4 sp 00007ffc61839f18 error 4 in libnftnl.so.11.2.0[7fcd78ee0000+2c000]
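The kernel lines in this comment can be grouped by subtracting each library's load address from the faulting instruction pointer, which gives an offset that is comparable across processes even though the load addresses differ. The following is a minimal Python sketch of that arithmetic and is not part of the original report: the names `parse_segfault` and `summarize` are hypothetical, and the regular expression assumes the kernel message layout quoted here.

```python
import re
from collections import Counter

# Assumed layout, taken from the journal lines quoted in this bug:
#   kernel: <comm>[<pid>]: segfault at <addr> ip <ip> sp <sp> error <n> in <lib>[<base>+<size>]
SEGFAULT_RE = re.compile(
    r"kernel: (?P<comm>[^\[\s]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return (command, library, offset into library) or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    # Offset of the faulting instruction relative to the library's load address.
    offset = int(m["ip"], 16) - int(m["base"], 16)
    return m["comm"], m["lib"], offset


def summarize(lines):
    """Count crashes per (command, library, offset) triple."""
    counts = Counter()
    for line in lines:
        parsed = parse_segfault(line)
        if parsed is not None:
            counts[parsed] += 1
    return counts


# Two lines copied from the AWS worker journal quoted in this comment.
SAMPLE = [
    "Mar 10 19:45:50.161776 ip-10-0-138-153 kernel: iptables-save[16510]: "
    "segfault at 80 ip 00007f24fd2e42c4 sp 00007ffc11488818 error 4 in "
    "libnftnl.so.11.2.0[7f24fd2d5000+2c000]",
    "Mar 10 19:51:19.626851 ip-10-0-138-186 kernel: iptables-restor[67290]: "
    "segfault at 18 ip 00007f43c4e3b49b sp 00007fff78708e10 error 4 in "
    "libnftnl.so.11.2.0[7f43c4e25000+2c000]",
]

if __name__ == "__main__":
    for (comm, lib, offset), n in summarize(SAMPLE).most_common():
        print(f"{n:4d}  {comm:<16} {lib} +{offset:#x}")
```

To summarize a whole journal, `summarize` can be passed an open file object instead of `SAMPLE`, since it only iterates over lines.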
Happening in 4.4, release blocker: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.4/1908/artifacts/e2e-aws/nodes/

$ grep "segfault" ~/Downloads/workers-journal-12
Mar 10 20:51:43.912181 ip-10-0-134-209 kernel: iptables-save[294911]: segfault at 80 ip 00007f227d2d72c4 sp 00007ffc111c62c8 error 4 in libnftnl.so.11.2.0[7f227d2c8000+2c000]
Mar 10 20:40:00.021182 ip-10-0-134-31 kernel: iptables-restor[144305]: segfault at 7f2066ffddb0 ip 00007f2032141394 sp 00007ffdcbb6d568 error 4 in libc-2.28.so[7f2031fe3000+1b9000]
Mar 10 20:41:00.000713 ip-10-0-134-31 kernel: iptables[155669]: segfault at 80 ip 00007f2262e922c4 sp 00007ffff81591f8 error 4 in libnftnl.so.11.2.0[7f2262e83000+2c000]
Mar 10 20:33:44.692800 ip-10-0-156-27 kernel: iptables-save[77075]: segfault at 80 ip 00007fea9494b2c4 sp 00007ffe473c4848 error 4 in libnftnl.so.11.2.0[7fea9493c000+2c000]
Mar 10 20:38:34.544858 ip-10-0-156-27 kernel: iptables-save[133984]: segfault at 80 ip 00007f0ba2dd72c4 sp 00007fffa1ed21d8 error 4 in libnftnl.so.11.2.0[7f0ba2dc8000+2c000]
Mar 10 20:43:44.706331 ip-10-0-156-27 kernel: iptables-restor[190893]: segfault at 0 ip 00007fbc49429cd5 sp 00007ffe34c19898 error 4 in libc-2.28.so[7fbc492cc000+1b9000]
*** Bug 1811342 has been marked as a duplicate of this bug. ***
Clayton, are we able to get either: 1) coredumps 2) 'iptables-save' output on the node? I know we don't run the networking bits of must-gather by default, but this would be a great time to have that info :(
Also, what specific RPM version of iptables is installed on whatever version of RHCOS is running on the node?
Never mind, Phil is all over it in bug 1807811.

*** This bug has been marked as a duplicate of bug 1807811 ***
Created attachment 1669423
coredump from ci run

Taken from a 4.5 master run.
*** Bug 1813214 has been marked as a duplicate of this bug. ***
`iptables-1.8.4-10.el8` with the fix landed in the following RHCOS versions:

45.81.202003180406-0
44.81.202003172130-0
43.81.202003172053.0

These should have been picked up by the respective CI/nightly release payloads as well. Let me know if the issue persists.
Checking on the promotion, this is the current, promoted machine-os-content for 4.4:

$ oc image info --output json registry.svc.ci.openshift.org/ocp/4.4:machine-os-content | jq -r .config.config.Labels.version
44.81.202003180730-0

so we should be good to go there. Not sure which nightly that went out with, but it's at least in:

$ oc image info --output json $(oc adm release info --image-for=machine-os-content registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-03-18-102708) | jq -r .config.config.Labels.version
44.81.202003180730-0
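The check in this comment amounts to comparing a build's version label against the first RHCOS build reported to carry the fix. A minimal Python sketch of that comparison follows. It assumes the version strings are numeric fields separated by `.` or `-`, with the leading field identifying the stream and the third field being a build timestamp, as in the values quoted in this bug. The helper names `parse_rhcos_version` and `has_fix` are hypothetical.

```python
import re

# First RHCOS builds reported in this bug to contain iptables-1.8.4-10.el8,
# keyed by the leading field of the version string (assumed to be the stream).
FIXED_IN = {
    45: "45.81.202003180406-0",
    44: "44.81.202003172130-0",
    43: "43.81.202003172053.0",
}


def parse_rhcos_version(version):
    """Split a version such as '44.81.202003180730-0' into a tuple of ints."""
    return tuple(int(field) for field in re.split(r"[.-]", version))


def has_fix(version):
    """Return True if `version` is at or after the first fixed build of its stream."""
    parsed = parse_rhcos_version(version)
    fixed = FIXED_IN.get(parsed[0])
    if fixed is None:
        raise ValueError(f"no known fixed build for stream {parsed[0]}")
    # Tuples compare field by field, so the timestamp decides within a stream.
    return parsed >= parse_rhcos_version(fixed)


if __name__ == "__main__":
    # Version label reported for the promoted 4.4 machine-os-content.
    print(has_fix("44.81.202003180730-0"))  # True
```

Comparing parsed tuples rather than raw strings avoids depending on whether the last separator is `.` or `-`, both of which appear in the versions listed in this bug.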
Did not find this issue on 4.4.0-0.nightly-2020-03-22-214549 and 4.4.0-0.nightly-2020-03-23-010639 with RHCOS image 44.81.202003192230-0.

Verified this bug.
*** Bug 1814334 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581