Bug 1875534 - ovs-vswitchd process segfaulted
Summary: ovs-vswitchd process segfaulted
Keywords:
Status: CLOSED DUPLICATE of bug 1874696
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Jacob Tanenbaum
QA Contact: zhaozhanqi
URL:
Whiteboard: TechnicalReleaseBlocker
Depends On:
Blocks:
 
Reported: 2020-09-03 17:57 UTC by David Eads
Modified: 2020-10-01 13:25 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-01 13:21:44 UTC
Target Upstream Version:
Embargoed:


Attachments
coredump1 (922.29 KB, application/x-lz4)
2020-09-29 17:43 UTC, Joseph Callen
coredump2 (928.42 KB, application/x-lz4)
2020-09-29 17:44 UTC, Joseph Callen
Core-Dump-1 (928.42 KB, application/x-lz4)
2020-09-29 17:44 UTC, Andrew Stoycos

Description David Eads 2020-09-03 17:57:18 UTC
Needs bug: Node process segfaulted (0s)

nodes/masters-journal.gz:Sep 03 13:50:47.723196 ip-10-0-187-118 kernel: handler3[4667]: segfault at 0 ip 000055fef547e147 sp 00007f1c78b578b0 error 4 in ovs-vswitchd[55fef5045000+629000]
nodes/masters-journal.gz:Sep 03 13:50:47.765476 ip-10-0-187-118 kernel: handler1[4668]: segfault at 0 ip 000055fef547e147 sp 00007f1c783568b0 error 4 in ovs-vswitchd[55fef5045000+629000]
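Even without a core, the kernel lines above give enough to locate the crash site: the offset into the binary is the faulting ip minus the mapping base printed in the brackets. A rough sketch (not from the bug itself), assuming a box with the same openvswitch build and its debuginfo installed; addresses taken from the first line above:

# offset = faulting ip - mapping base from "ovs-vswitchd[base+size]"
printf '0x%x\n' $((0x55fef547e147 - 0x55fef5045000))     # -> 0x439147
# resolve that offset to a function/source line (needs matching debuginfo)
eu-addr2line -f -e "$(which ovs-vswitchd)" 0x439147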

from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/941/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-upgrade/1301495297582567424

Comment 1 Dan Williams 2020-09-03 18:02:26 UTC
We need a coredump; otherwise we'll have to close this as CANTFIX.
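A rough sketch of pulling a core off an RHCOS node once the crash reproduces, assuming systemd-coredump is catching cores on the host (if the crashing thread's comm, e.g. handler3, was recorded instead of the process name, a plain "coredumpctl list" will still show the entry):

coredumpctl list ovs-vswitchd
coredumpctl --output=/var/tmp/ovs-vswitchd.core dump ovs-vswitchd
coredumpctl gdb ovs-vswitchd     # backtrace in place; needs matching openvswitch debuginfo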

Comment 2 Dan Williams 2020-09-03 18:03:20 UTC
Also, what OVS version is in that RHCOS build?

Comment 3 David Eads 2020-09-08 16:26:14 UTC
This is still happening.  See https://search.ci.openshift.org/?search=segfault&maxAge=168h&context=1&type=bug%2Bjunit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job .

Do you have a PR for gathering this data in CI to diagnose?

Comment 4 David Eads 2020-09-08 16:33:47 UTC
Clicking through failing jobs, when this happens we see cascading failures.

Comment 5 Jacob Tanenbaum 2020-09-09 13:41:47 UTC
In order to fix this we need a core dump. Here is the PR we are trying to get merged to collect core dumps from CI jobs -> https://github.com/openshift/release/pull/11368

Comment 7 Jacob Tanenbaum 2020-09-23 22:29:19 UTC
Sorry, it turns out this also needs to be merged: https://github.com/openshift/cluster-network-operator/pull/785

Comment 8 Jacob Tanenbaum 2020-09-25 14:48:00 UTC
quick update on this BZ:

   - the PR blocking us from getting core dumps is https://github.com/openshift/cluster-network-operator/pull/785
      - there are CI failures that are holding up this PR

   - we still see this in CI: https://search.ci.openshift.org/?search=ovs.*segfault&maxAge=168h&context=1&type=bug%2Bjunit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job
      - when the segfault occurs it looks like one of the sdn pods goes into CrashLoopBackOff

   - QE has seen it on at least one cluster; as posted above there is no way to recover, restarting the affected pod just spawns a new pod where the segfault is seen

   - only seen in sdn CI jobs, but the QE cluster was ovn

Comment 9 Jacob Tanenbaum 2020-09-25 15:05:19 UTC
As to my above comment: looking over this BZ, I don't see why it was on subcomponent ovn-kubernetes. I don't see this getting hit on ovn in any CI jobs, and the QE cluster was openshift-sdn. Since this has never been observed on ovn-kubernetes, changing the subcomponent to openshift-sdn.

Comment 10 Joseph Callen 2020-09-26 16:41:06 UTC
I am also witnessing this segfault on vSphere.

[root@jcallen-wpjw6-worker-8jxgf ~]# journalctl --no-pager  | grep -i segfault
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: handler2[6195]: segfault at 0 ip 000055f4d3800147 sp 00007feb848a68b0 error 4 in ovs-vswitchd[55f4d33c7000+629000]

I have a cluster up if someone needs/wants access to it.
The cause looks like the rapid addition and removal of veth pairs from the bridge.

Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth20bd0e90 entered promiscuous mode                                                                     
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth6eb181cc left promiscuous mode                                                                        
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9b43efd6 left promiscuous mode                                                                        
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth1f9f0fc0 left promiscuous mode                                                                        
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device vethe7104a78 left promiscuous mode                                                                        
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9c233132 left promiscuous mode                                                                        
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth20bd0e90 left promiscuous mode                                                                        
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth6eb181cc entered promiscuous mode                                                                     
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9b43efd6 entered promiscuous mode                                                                     
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth6eb181cc left promiscuous mode                                                                        
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9b43efd6 left promiscuous mode                                                                        
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth1f9f0fc0 entered promiscuous mode                                                                     
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device vethe7104a78 entered promiscuous mode                                                                     
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9c233132 entered promiscuous mode                                                                     
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth1f9f0fc0 left promiscuous mode                                                                        
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device vethe7104a78 left promiscuous mode                                                                        
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth9c233132 left promiscuous mode                                                                        
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth20bd0e90 entered promiscuous mode                                                                     
Sep 26 16:07:33 jcallen-wpjw6-worker-8jxgf kernel: device veth20bd0e90 left promiscuous mode
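Not from the cluster above, but the add/remove churn in that log can be approximated on a throwaway bridge, assuming ovs-vsctl and iproute2 on a scratch machine:

# stress sketch: rapidly attach/detach veth ports on a test bridge
ovs-vsctl add-br br-test
for i in $(seq 1 50); do
  ip link add vtest-a$i type veth peer name vtest-b$i
  ovs-vsctl add-port br-test vtest-a$i
  ovs-vsctl del-port br-test vtest-a$i
  ip link del vtest-a$i
done
ovs-vsctl del-br br-test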

Comment 11 weiwei jiang 2020-09-28 05:54:21 UTC
Hit this on the OpenStack platform today.

[  227.246436] device veth065c2c61 entered promiscuous mode
[  227.247710] device vethcdb764bd entered promiscuous mode
[  227.249316] device veth182c1d70 entered promiscuous mode
[  227.253191] device veth6e810701 left promiscuous mode
[  227.255076] device veth2ef2dda9 left promiscuous mode
[  227.256406] device veth99bfeec2 left promiscuous mode
[  227.257824] device veth220ba02c left promiscuous mode
[  227.259287] device veth065c2c61 left promiscuous mode
[  227.260801] device vethcdb764bd left promiscuous mode
[  227.262594] device veth182c1d70 left promiscuous mode
[  227.263649] handler1[6860]: segfault at 0 ip 000055999cf86147 sp 00007f9f9de598b0 error 4 in ovs-vswitchd[55999cb4d000+629000]
[  227.263657] handler4[6861]: segfault at 0 ip 000055999cf86147 sp 00007f9f9d6588b0 error 4
[  227.265837]  in ovs-vswitchd[55999cb4d000+629000]
[  227.267235] Code: 00 48 89 44 24 40 48 c7 44 24 48 40 00 00 00 eb 16 66 90 0f 84 72 01 00 00 83 3b ff 0f 85 b1 01 00 00 83 f8 04 75 25 44 89 2b <41> 8b 3c 24 89 ea 4c 89 f6 e8 bb 28 c0 ff 49 89 c7 48 85 c0 79 d3
[  227.271556] Code: 00 48 89 44 24 40 48 c7 44 24 48 40 00 00 00 eb 16 66 90 0f 84 72 01 00 00 83 3b ff 0f 85 b1 01 00 00 83 f8 04 75 25 44 89 2b <41> 8b 3c 24 89 ea 4c 89 f6 e8 bb 28 c0 ff 49 89 c7 48 85 c0 79 d3
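The Code: bytes can be disassembled directly, along the lines of the kernel's decodecode helper. A sketch (strip the <> markers, which flag the trapping instruction):

echo '00 48 89 44 24 40 48 c7 44 24 48 40 00 00 00 eb 16 66 90 0f 84 72 01 00 00 83 3b ff 0f 85 b1 01 00 00 83 f8 04 75 25 44 89 2b 41 8b 3c 24 89 ea 4c 89 f6 e8 bb 28 c0 ff 49 89 c7 48 85 c0 79 d3' | xxd -r -p > /tmp/code.bin
objdump -D -b binary -m i386:x86-64 /tmp/code.bin

The trapping bytes 41 8b 3c 24 appear to decode to mov (%r12),%edi, which lines up with "segfault at 0": a read through a NULL %r12.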

Comment 12 Joseph Callen 2020-09-29 17:43:29 UTC
Created attachment 1717594 [details]
coredump1

Comment 13 Joseph Callen 2020-09-29 17:44:16 UTC
Created attachment 1717595 [details]
coredump2

Comment 14 Andrew Stoycos 2020-09-29 17:44:35 UTC
Created attachment 1717596 [details]
Core-Dump-1

Core dump 1 from a cluster where OVS is segfaulting ->  https://coreos.slack.com/archives/CDCP2LA9L/p1601310424077900
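A sketch of opening one of the attached cores locally (they are lz4-compressed), assuming the same openvswitch build and its debuginfo are installed:

lz4 -d coredump1.lz4 coredump1      # assumes the attachment was saved as coredump1.lz4
gdb "$(which ovs-vswitchd)" coredump1
# then inside gdb: thread apply all bt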

Comment 15 Joseph Callen 2020-09-29 17:46:06 UTC
ovs-vswitchd on RHCOS:
sh-4.4# ovs-vswitchd --version
ovs-vswitchd (Open vSwitch) 2.13.2
DPDK 19.11.3

sh-4.4# rpm -qf `which ovs-vswitchd`
openvswitch2.13-2.13.0-57.el8fdp.x86_64

ovs-vswitchd in container:
_  install2 oc rsh ovs-wq8kw
sh-4.4# rpm -qf `which ovs-vswitchd`
openvswitch2.11-2.11.3-66.el8fdp.x86_64


sh-4.4# exit
_  install2 oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS                                          
version   4.6.0-0.nightly-2020-09-28-110510   True        False         48m     Cluster version is 4.6.0-0.nightly-2020-09-28-110510
_  install2
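Given the 2.13 rpm on the host versus the 2.11 rpm inside the ovs pod, it may be worth confirming which binary a given core actually came from. A sketch using elfutils and file (assumed available):

eu-unstrip -n --core=coredump1      # build-ids of the modules mapped into the core
file coredump1                      # prints the command line the core was taken from

The build-id reported for ovs-vswitchd can then be compared against the host and container binaries.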

Comment 17 David Eads 2020-09-30 14:53:03 UTC
The segfault failure appears to happen frequently on platforms like vsphere-ipi, which then appear to suffer an SDN outage that sometimes prevents access from the kube-apiserver to aggregated apiservers. This manifests as failures to access users and to get oauth tokens.

Comment 18 Jacob Tanenbaum 2020-09-30 18:37:27 UTC
should be fixed in 4.6.0-0.nightly-2020-09-30-091659 with the inclusion of https://github.com/openshift/cluster-network-operator/pull/785

Comment 19 Scott Dodson 2020-10-01 13:21:44 UTC
It's been 27 hours since the last hit of the ovs-vswitchd segfault, so marking this as a dupe of bug 1874696.

*** This bug has been marked as a duplicate of bug 1874696 ***

Comment 20 Jacob Tanenbaum 2020-10-01 13:25:51 UTC
Fixed by https://github.com/openshift/cluster-network-operator/pull/785; no segfaults have been seen since it was merged and included in the nightly build 4.6.0-0.nightly-2020-09-30-091659.

The segfault was caused by OVS processes running both in the pod and on the host, conflicting with each other.

It was the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1874696
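For anyone hitting something similar, a rough check (not part of this bug's debugging) for the dual-OVS condition described above, run as root on the node (e.g. via oc debug node/<name> and chroot /host):

ps -C ovs-vswitchd -o pid,ppid,args
# two ovs-vswitchd processes in different mount namespaces would suggest the
# host openvswitch service and a pod are both running OVS on the same node
for p in $(pgrep -x ovs-vswitchd); do
  echo "pid $p mntns $(readlink /proc/$p/ns/mnt)"
done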

