Needs bug: Node process segfaulted

nodes/masters-journal.gz:Sep 03 13:50:47.723196 ip-10-0-187-118 kernel: handler3[4667]: segfault at 0 ip 000055fef547e147 sp 00007f1c78b578b0 error 4 in ovs-vswitchd[55fef5045000+629000]
nodes/masters-journal.gz:Sep 03 13:50:47.765476 ip-10-0-187-118 kernel: handler1[4668]: segfault at 0 ip 000055fef547e147 sp 00007f1c783568b0 error 4 in ovs-vswitchd[55fef5045000+629000]

From https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/941/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-upgrade/1301495297582567424
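In case it helps triage before a core shows up, the faulting offset inside the ovs-vswitchd mapping can be computed from the kernel line itself and resolved against the binary. A rough sketch, assuming the matching openvswitch debuginfo is available on a box with the same build (the path and the addr2line step are assumptions, not verified here):

# offset = faulting ip minus the mapping base from "in ovs-vswitchd[55fef5045000+629000]"
printf 'offset: %#x\n' $((0x55fef547e147 - 0x55fef5045000))   # -> 0x439147

# with debuginfo for this exact openvswitch2.13 build installed, the offset should
# resolve to a function/source line (assumes a PIE binary, so mapping offset == file vaddr)
addr2line -f -e /usr/sbin/ovs-vswitchd 0x439147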
We need a core dump; otherwise we'll have to close this as CANTFIX.
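If the node captured one, a rough way to pull it off is below. This assumes systemd-coredump is what handles cores on RHCOS; if kernel.core_pattern points somewhere else, adjust accordingly.

# on the affected node, e.g. via `oc debug node/<node>` and `chroot /host`
coredumpctl list /usr/sbin/ovs-vswitchd                            # was a core captured?
coredumpctl dump /usr/sbin/ovs-vswitchd -o /tmp/ovs-vswitchd.core  # write it out to attach here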
Also, what OVS version is in that RHCOS build?
This is still happening. See https://search.ci.openshift.org/?search=segfault&maxAge=168h&context=1&type=bug%2Bjunit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job. Do you have a PR up for gathering this data in CI so we can diagnose it?
Clicking through the failing jobs: when this happens we get cascading failures.
In order to fix this we need a core dump. Here is the PR we are trying to land to collect core dumps from CI jobs: https://github.com/openshift/release/pull/11368
Sorry, it turns out this also needs to be merged: https://github.com/openshift/cluster-network-operator/pull/785
Quick update on this BZ:
- The PR blocking us from getting core dumps is https://github.com/openshift/cluster-network-operator/pull/785
- There are CI failures that are holding up this PR
- We still see this in CI: https://search.ci.openshift.org/?search=ovs.*segfault&maxAge=168h&context=1&type=bug%2Bjunit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job
- When the segfault occurs, it looks like one of the sdn pods goes into crash loop backoff (see the quick check sketched below)
- QE has seen it on at least one cluster, as posted above; there is no way to recover, restarting the affected pod just spawns a new pod where the segfault is seen
- Only seen in sdn CI jobs, but the QE cluster was OVN
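For anyone reproducing this, a quick way to find the affected node and confirm the correlation (standard oc commands; the node name is a placeholder):

# find the crash-looping sdn pod and the node it is on
oc -n openshift-sdn get pods -o wide | grep -Ei 'crashloop|error'

# confirm the ovs-vswitchd segfault in that node's kernel log
oc debug node/<node-name> -- chroot /host journalctl -k | grep -i 'segfault.*ovs-vswitchd'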
Following up on my comment above: looking over this BZ, I don't see why it was filed under the ovn-kubernetes subcomponent. I don't see this getting hit on OVN in any CI jobs, and the QE cluster was openshift-sdn. Since this has never been observed on ovn-kubernetes, I'm changing the subcomponent to openshift-sdn.
I am also witnessing this segfault on vSphere.

[root@jcallen-wpjw6-worker-8jxgf ~]# journalctl --no-pager | grep -i segfault
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: handler2[6195]: segfault at 0 ip 000055f4d3800147 sp 00007feb848a68b0 error 4 in ovs-vswitchd[55f4d33c7000+629000]

I have a cluster up if someone needs/wants access to it. The cause looks like the rapid adding and removal of veth pairs from the bridge (a way to watch for this is sketched after the log):

Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth20bd0e90 entered promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth6eb181cc left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9b43efd6 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth1f9f0fc0 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device vethe7104a78 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9c233132 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth20bd0e90 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth6eb181cc entered promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9b43efd6 entered promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth6eb181cc left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9b43efd6 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth1f9f0fc0 entered promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device vethe7104a78 entered promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9c233132 entered promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth1f9f0fc0 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device vethe7104a78 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9c233132 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth20bd0e90 entered promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth20bd0e90 left promiscuous mode
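If the veth churn theory is right, the port count on the sdn bridge should be bouncing right before the crash. A rough way to watch for it on the node (this assumes the openshift-sdn bridge name br0):

# watch the OVS bridge port count once a second; rapid changes line up with
# the promiscuous-mode enter/leave storm above
while sleep 1; do echo "$(date +%T) $(ovs-vsctl list-ports br0 | wc -l) ports"; done

# and follow the kernel side at the same time
journalctl -kf | grep -E 'promiscuous|segfault'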
Hit this on the OpenStack platform today.

[  227.246436] device veth065c2c61 entered promiscuous mode
[  227.247710] device vethcdb764bd entered promiscuous mode
[  227.249316] device veth182c1d70 entered promiscuous mode
[  227.253191] device veth6e810701 left promiscuous mode
[  227.255076] device veth2ef2dda9 left promiscuous mode
[  227.256406] device veth99bfeec2 left promiscuous mode
[  227.257824] device veth220ba02c left promiscuous mode
[  227.259287] device veth065c2c61 left promiscuous mode
[  227.260801] device vethcdb764bd left promiscuous mode
[  227.262594] device veth182c1d70 left promiscuous mode
[  227.263649] handler1[6860]: segfault at 0 ip 000055999cf86147 sp 00007f9f9de598b0 error 4 in ovs-vswitchd[55999cb4d000+629000]
[  227.263657] handler4[6861]: segfault at 0 ip 000055999cf86147 sp 00007f9f9d6588b0 error 4
[  227.265837] in ovs-vswitchd[55999cb4d000+629000]
[  227.267235] Code: 00 48 89 44 24 40 48 c7 44 24 48 40 00 00 00 eb 16 66 90 0f 84 72 01 00 00 83 3b ff 0f 85 b1 01 00 00 83 f8 04 75 25 44 89 2b <41> 8b 3c 24 89 ea 4c 89 f6 e8 bb 28 c0 ff 49 89 c7 48 85 c0 79 d3
[  227.271556] Code: 00 48 89 44 24 40 48 c7 44 24 48 40 00 00 00 eb 16 66 90 0f 84 72 01 00 00 83 3b ff 0f 85 b1 01 00 00 83 f8 04 75 25 44 89 2b <41> 8b 3c 24 89 ea 4c 89 f6 e8 bb 28 c0 ff 49 89 c7 48 85 c0 79 d3
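For what it's worth, those Code: bytes can be disassembled without the binary. A minimal sketch (the byte string is copied from the log with the <> marker around the faulting instruction dropped; assumes xxd and objdump are available):

echo '00 48 89 44 24 40 48 c7 44 24 48 40 00 00 00 eb 16 66 90 0f 84 72 01 00 00 83 3b ff 0f 85 b1 01 00 00 83 f8 04 75 25 44 89 2b 41 8b 3c 24 89 ea 4c 89 f6 e8 bb 28 c0 ff 49 89 c7 48 85 c0 79 d3' \
  | tr -d ' ' | xxd -r -p > /tmp/code.bin
objdump -D -b binary -m i386:x86-64 /tmp/code.bin

If I'm reading the decode right, the marked instruction (41 8b 3c 24) is a mov (%r12),%edi, which together with "segfault at 0" looks like a load through a NULL pointer in %r12; please treat that as a guess until it's confirmed against a core.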
Created attachment 1717594 [details] coredump1
Created attachment 1717595 [details] coredump2
Created attachment 1717596 [details] Core-Dump-1

Core dump 1 from a cluster where OVS is segfaulting: https://coreos.slack.com/archives/CDCP2LA9L/p1601310424077900
ovs-vswitchd on RHCOS:

sh-4.4# ovs-vswitchd --version
ovs-vswitchd (Open vSwitch) 2.13.2
DPDK 19.11.3
sh-4.4# rpm -qf `which ovs-vswitchd`
openvswitch2.13-2.13.0-57.el8fdp.x86_64

ovs-vswitchd in the container:

$ oc rsh ovs-wq8kw
sh-4.4# rpm -qf `which ovs-vswitchd`
openvswitch2.11-2.11.3-66.el8fdp.x86_64
sh-4.4# exit

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-28-110510   True        False         48m     Cluster version is 4.6.0-0.nightly-2020-09-28-110510
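To check whether other nodes show the same host-vs-container version mismatch, a rough sweep (the app=ovs label selector for the ovs daemonset pods is an assumption; adjust to whatever the pods are labeled in your cluster):

# host-side OVS package on every node
for node in $(oc get nodes -o name); do
  echo "== $node"
  oc debug "$node" -- chroot /host rpm -qa 2>/dev/null | grep '^openvswitch'
done

# container-side OVS package in each ovs pod
for pod in $(oc -n openshift-sdn get pods -l app=ovs -o name); do
  echo "== $pod"
  oc -n openshift-sdn rsh "$pod" rpm -qa | grep '^openvswitch'
done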
The segfault appears to happen frequently on platforms like vsphere-ipi, which then appear to suffer an SDN outage that sometimes prevents the kube-apiserver from reaching aggregated apiservers. This manifests as failures to access users and to get oauth tokens.
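A quick way to see that symptom from the outside, for anyone trying to match a cluster against this bug (standard checks, nothing specific to the fix):

# aggregated APIServices the kube-apiserver can no longer reach show Available=False
oc get apiservices | grep -v ' True '

# the user/oauth failures usually surface on the authentication operator as well
oc get clusteroperator authentication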
This should be fixed in 4.6.0-0.nightly-2020-09-30-091659 with the inclusion of https://github.com/openshift/cluster-network-operator/pull/785
It's been 27 hours since the last hit of the ovs-vswitchd segfault, so marking this as a dupe of bug 1874696.

*** This bug has been marked as a duplicate of bug 1874696 ***
Fixed by https://github.com/openshift/cluster-network-operator/pull/785. No segfaults have been seen since it was merged and included in the nightly build 4.6.0-0.nightly-2020-09-30-091659. The segfault was caused by OVS processes running in both the pod and the host conflicting with each other; it was the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1874696.
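To confirm the fix on a given node, a simple sanity check is that only the host-managed ovs-vswitchd is left running (unit name may vary by RHCOS build, so treat this as a sketch):

# exactly one ovs-vswitchd should show up, and it should belong to the host service, not a container
oc debug node/<node-name> -- chroot /host pgrep -a ovs-vswitchd
oc debug node/<node-name> -- chroot /host systemctl is-active ovs-vswitchd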