Bug 1875534
| Field | Value |
|---|---|
| Summary | ovs-vswitchd process segfaulted |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | openshift-sdn |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | high |
| Version | 4.6 |
| Target Release | 4.6.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | TechnicalReleaseBlocker |
| Reporter | David Eads <deads> |
| Assignee | Jacob Tanenbaum <jtanenba> |
| QA Contact | zhaozhanqi <zzhao> |
| CC | aarapov, aravindh, astoycos, dcbw, jcallen, mfojtik, sdodson, wjiang |
| Type | Bug |
| Last Closed | 2020-10-01 13:21:44 UTC |
Description (David Eads, 2020-09-03 17:57:18 UTC)
We need a coredump, otherwise we'll have to close CANTFIX. Also, what OVS version is in that RHCOS build?

This is still happening. See https://search.ci.openshift.org/?search=segfault&maxAge=168h&context=1&type=bug%2Bjunit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job . Do you have a PR for gathering this data in CI to diagnose?

Clicking through failing jobs, when this happens we have cascading failures.

In order to fix this we need a core dump. Here is the PR we are trying to get in to collect core dumps from CI jobs: https://github.com/openshift/release/pull/11368

I am sorry, it turns out https://github.com/openshift/cluster-network-operator/pull/785 also needs to be merged.

Quick update on this BZ:
- The PR blocking us from getting core dumps is https://github.com/openshift/cluster-network-operator/pull/785; there are CI failures that are holding up this PR.
- We still see this in CI: https://search.ci.openshift.org/?search=ovs.*segfault&maxAge=168h&context=1&type=bug%2Bjunit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job
- When the segfault occurs, it looks like one of the sdn pods goes into CrashLoopBackOff. QE has seen it on at least one cluster as posted above. There is no way to recover: restarting the affected pod just spawns a new pod where the segfault is seen.
- Only seen in sdn CI jobs, but the QE cluster was ovn.

As to my above comment: looking over this BZ I don't see why it was on subcomponent ovn-kubernetes. I don't see this getting hit on ovn in any CI jobs, and the QE cluster was openshift-sdn. Since this has never been observed on ovn-kubernetes, changing subcomponent to openshift-sdn.

I am also witnessing this segfault in vSphere:

```
[root@jcallen-wpjw6-worker-8jxgf ~]# journalctl --no-pager | grep -i segfault
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: handler2[6195]: segfault at 0 ip 000055f4d3800147 sp 00007feb848a68b0 error 4 in ovs-vswitchd[55f4d33c7000+629000]
```

I have a cluster up if someone needs/wants access to it.
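Even without a core dump, the kernel segfault line above localizes the crash: the bracketed value after the binary name is the load base of the `ovs-vswitchd` mapping, so subtracting it from the reported `ip` gives an offset into the binary that is independent of address randomization. A minimal sketch; the log line is copied verbatim from this bug, but the parsing itself is an illustration, not a tool used here:

```shell
# Compute the ASLR-independent crash offset from a kernel segfault line.
# The line below is copied verbatim from the journalctl output in this bug.
line='handler2[6195]: segfault at 0 ip 000055f4d3800147 sp 00007feb848a68b0 error 4 in ovs-vswitchd[55f4d33c7000+629000]'

# "ip" is the absolute faulting address; the bracketed value is the load base
# of the ovs-vswitchd mapping, so (ip - base) survives address randomization.
ip=$(printf '%s\n' "$line" | sed -n 's/.* ip \([0-9a-f]*\) .*/\1/p')
base=$(printf '%s\n' "$line" | sed -n 's/.*\[\([0-9a-f]*\)+[0-9a-f]*\].*/\1/p')
printf 'offset=0x%x\n' $(( 0x$ip - 0x$base ))    # prints: offset=0x439147
```

Applying the same arithmetic to the OpenStack report later in this bug (`ip 000055999cf86147`, base `55999cb4d000`) yields the same offset, `0x439147`, consistent with a single crashing code path in both environments.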
The cause looks like the rapid adding and removal of the veth pair from the bridge:

```
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth20bd0e90 entered promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth6eb181cc left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9b43efd6 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth1f9f0fc0 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device vethe7104a78 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9c233132 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth20bd0e90 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth6eb181cc entered promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9b43efd6 entered promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth6eb181cc left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9b43efd6 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth1f9f0fc0 entered promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device vethe7104a78 entered promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9c233132 entered promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth1f9f0fc0 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device vethe7104a78 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9c233132 left promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth20bd0e90 entered promiscuous mode
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth20bd0e90 left promiscuous mode
```

Met this with OpenStack platform today.
```
[  227.246436] device veth065c2c61 entered promiscuous mode
[  227.247710] device vethcdb764bd entered promiscuous mode
[  227.249316] device veth182c1d70 entered promiscuous mode
[  227.253191] device veth6e810701 left promiscuous mode
[  227.255076] device veth2ef2dda9 left promiscuous mode
[  227.256406] device veth99bfeec2 left promiscuous mode
[  227.257824] device veth220ba02c left promiscuous mode
[  227.259287] device veth065c2c61 left promiscuous mode
[  227.260801] device vethcdb764bd left promiscuous mode
[  227.262594] device veth182c1d70 left promiscuous mode
[  227.263649] handler1[6860]: segfault at 0 ip 000055999cf86147 sp 00007f9f9de598b0 error 4 in ovs-vswitchd[55999cb4d000+629000]
[  227.263657] handler4[6861]: segfault at 0 ip 000055999cf86147 sp 00007f9f9d6588b0 error 4
[  227.265837]  in ovs-vswitchd[55999cb4d000+629000]
[  227.267235] Code: 00 48 89 44 24 40 48 c7 44 24 48 40 00 00 00 eb 16 66 90 0f 84 72 01 00 00 83 3b ff 0f 85 b1 01 00 00 83 f8 04 75 25 44 89 2b <41> 8b 3c 24 89 ea 4c 89 f6 e8 bb 28 c0 ff 49 89 c7 48 85 c0 79 d3
[  227.271556] Code: 00 48 89 44 24 40 48 c7 44 24 48 40 00 00 00 eb 16 66 90 0f 84 72 01 00 00 83 3b ff 0f 85 b1 01 00 00 83 f8 04 75 25 44 89 2b <41> 8b 3c 24 89 ea 4c 89 f6 e8 bb 28 c0 ff 49 89 c7 48 85 c0 79 d3
```

Created attachment 1717594 [details]
coredump1
Created attachment 1717595 [details]
coredump2
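The journal excerpts in the comments above show the same pattern of veth ports rapidly entering and leaving promiscuous mode around the crash. One way to quantify that churn is to count transitions per device; the abridged log below and the awk one-liner are illustrative, not tooling from this bug:

```shell
# Count promiscuous-mode enter/leave events per veth device.
# Lines abridged from the vSphere journal excerpt above
# (timestamps and hostname dropped).
log='device veth20bd0e90 entered promiscuous mode
device veth20bd0e90 left promiscuous mode
device veth20bd0e90 entered promiscuous mode
device veth20bd0e90 left promiscuous mode
device veth6eb181cc left promiscuous mode
device veth6eb181cc entered promiscuous mode
device veth6eb181cc left promiscuous mode'

# Field 2 is the device name; tally events per device.
printf '%s\n' "$log" | awk '{n[$2]++} END {for (d in n) print d, n[d]}' | sort
```

For the excerpt above this prints `veth20bd0e90 4` and `veth6eb181cc 3`: several add/remove cycles for the same ports within the same one-second window.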
Created attachment 1717596 [details]
Core-Dump-1

Core dump 1 from a cluster where OVS is segfaulting: https://coreos.slack.com/archives/CDCP2LA9L/p1601310424077900

ovs-vswitchd on RHCOS:

```
sh-4.4# ovs-vswitchd --version
ovs-vswitchd (Open vSwitch) 2.13.2
DPDK 19.11.3
sh-4.4# rpm -qf `which ovs-vswitchd`
openvswitch2.13-2.13.0-57.el8fdp.x86_64
```

ovs-vswitchd in the container:

```
_ install2 oc rsh ovs-wq8kw
sh-4.4# rpm -qf `which ovs-vswitchd`
openvswitch2.11-2.11.3-66.el8fdp.x86_64
sh-4.4# exit
_ install2 oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-28-110510   True        False         48m     Cluster version is 4.6.0-0.nightly-2020-09-28-110510
_ install2
```

The segfault failure appears to happen frequently on platforms like vsphere-ipi, which then appear to suffer an SDN outage that sometimes prevents access from the kube-apiserver to aggregated apiservers. This manifests as failures to access users and to get oauth tokens.

Should be fixed in 4.6.0-0.nightly-2020-09-30-091659 with the inclusion of https://github.com/openshift/cluster-network-operator/pull/785

It's been 27 hours since the last hit for ovs-vswitchd segfaulting; marking this as a dupe of 1874696.

*** This bug has been marked as a duplicate of bug 1874696 ***

Fixed by https://github.com/openshift/cluster-network-operator/pull/785; no segfaults have been seen since it was merged and included in the nightly build 4.6.0-0.nightly-2020-09-30-091659. The segfault was caused by OVS processes running in both the pod and the host conflicting; it was the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1874696
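The closing comments attribute the crash to two different ovs-vswitchd builds, one on the host and one in the ovs pod, running at the same time. A minimal sketch of a mismatch check using the two `rpm -qf` strings reported above; the variable names and the stream comparison are illustrative assumptions, not part of the actual fix (which made the cluster run a single OVS instance):

```shell
# Compare the OVS package stream on the RHCOS host with the one inside the
# ovs pod; a mismatch means two different ovs-vswitchd builds may be running
# against the same node. Values hard-coded from this bug's rpm -qf output.
host_pkg='openvswitch2.13-2.13.0-57.el8fdp.x86_64'   # rpm -qf on the host
pod_pkg='openvswitch2.11-2.11.3-66.el8fdp.x86_64'    # rpm -qf inside the pod

# Strip everything after the first '-' to keep only the package stream name.
host_stream=${host_pkg%%-*}   # openvswitch2.13
pod_stream=${pod_pkg%%-*}     # openvswitch2.11

if [ "$host_stream" != "$pod_stream" ]; then
    echo "MISMATCH: host runs $host_stream, pod runs $pod_stream"
fi
```

On a live cluster the two strings would come from the host (e.g. via a debug shell) and from `oc rsh` into the ovs pod; here they are fixed so the check is self-contained.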