Bug 1840078
| Summary: | ovs-vswitchd crashes [more info tbd] | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Eduardo Olivares <eolivare> |
| Component: | openvswitch | Assignee: | Open vSwitch development team <ovs-team> |
| Status: | CLOSED DUPLICATE | QA Contact: | Eran Kuris <ekuris> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 16.1 (Train) | CC: | aconole, apevec, chrisw, dalvarez, dceara, ekuris, jlibosva, jschluet, lhh, lmiccini, rhos-maint |
| Target Milestone: | beta | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-06-04 12:37:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | ovs-vswitchd coredumps (attachment 1692830) | | |
Description (Eduardo Olivares, 2020-05-26 10:31:26 UTC)
this looks like a network issue between the controllers:

May 25 00:16:02 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:16:03 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:16:06 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:16:07 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:16:11 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:16:12 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:16:14 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:16:15 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:16:20 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:16:21 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:16:46 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:16:50 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:16:52 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:16:53 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:16:55 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:16:58 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:17:32 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:17:37 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:18:17 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:18:18 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:18:20 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:18:21 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:19:47 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:19:47 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:19:49 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:19:53 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:19:54 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:19:59 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:22:42 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:22:42 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:22:45 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:22:47 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:22:49 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:22:51 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:22:52 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:22:53 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:23:26 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:23:26 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:23:27 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:23:28 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:23:32 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:23:34 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:23:35 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:23:37 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:24:56 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:24:56 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:24:59 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:25:01 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:26:17 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:26:17 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:26:22 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:26:24 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:28:19 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:28:20 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:28:21 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:28:24 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:28:27 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:28:28 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:28:31 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:28:31 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:30:41 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:30:41 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:30:43 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:30:44 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:30:48 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:30:48 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:30:52 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:30:52 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:32:03 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:32:03 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:32:07 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:32:08 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:32:12 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:32:12 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:32:16 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:32:17 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:34:16 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:34:17 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:34:24 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:34:27 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:38:16 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:38:16 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:38:20 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:38:20 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:40:02 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:40:02 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:40:06 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:40:06 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:40:11 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:40:12 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:40:16 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:40:17 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:40:46 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:40:46 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:40:50 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:40:51 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:42:35 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:42:35 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:42:39 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:42:40 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:45:56 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:45:57 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:46:03 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:46:03 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:54:05 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:54:06 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:54:08 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:54:14 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:58:18 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:58:19 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:58:23 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:58:23 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up

ovs sigsegv?
May 25 00:14:40 controller-1 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
May 25 00:14:40 controller-1 systemd[1]: ovs-vswitchd.service: Failed with result 'signal'.
May 25 00:14:40 controller-1 systemd[1]: Starting nova_scheduler healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting nova_conductor healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting memcached healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting nova_vnc_proxy healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting clustercheck healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting swift_container_server healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting swift_object_server healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting swift_rsync healthcheck...
May 25 00:14:41 controller-1 systemd[1]: ovs-vswitchd.service: Service RestartSec=100ms expired, scheduling restart.
May 25 00:14:41 controller-1 systemd[1]: ovs-vswitchd.service: Scheduled restart job, restart counter is at 17.
May 25 00:14:41 controller-1 systemd[1]: Stopping Open vSwitch...
May 25 00:14:41 controller-1 systemd[1]: Stopped Open vSwitch.
May 25 00:14:41 controller-1 systemd[1]: Stopped Open vSwitch Forwarding Unit.
May 25 00:14:41 controller-1 systemd[1]: Stopping Open vSwitch Database Unit...
May 25 00:14:41 controller-1 ovs-ctl[384851]: Exiting ovsdb-server (348892) [ OK ]
May 25 00:14:41 controller-1 systemd[1]: Stopped Open vSwitch Database Unit.
May 25 00:14:41 controller-1 systemd[1]: Starting Open vSwitch Database Unit...

grep ovs-vswitchd.service controller-1/var/log/messages | grep SEGV | wc -l
38

no wonder the cluster breaks.
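The crash count above comes from a simple grep pipeline against the collected logs. A self-contained sketch of the same check (the sample log lines below are fabricated for illustration; on a real node you would point it at `controller-1/var/log/messages` from the sosreport):

```shell
#!/bin/sh
# Count how many times systemd reported ovs-vswitchd dying with SIGSEGV.
# Write a tiny illustrative log so the pipeline is runnable as-is.
cat > /tmp/sample-messages <<'EOF'
May 25 00:14:40 controller-1 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
May 25 00:14:41 controller-1 systemd[1]: ovs-vswitchd.service: Scheduled restart job, restart counter is at 17.
May 25 00:20:03 controller-1 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
EOF

# grep -c combines the SEGV filter and the count in one step.
grep ovs-vswitchd.service /tmp/sample-messages | grep -c SEGV   # prints 2
```

Note that the restart-counter line also mentions ovs-vswitchd.service but is excluded by the second filter, so only genuine SEGV exits are counted.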
we had quite a few of these, see:
https://bugzilla.redhat.com/show_bug.cgi?id=1824847
https://bugzilla.redhat.com/show_bug.cgi?id=1823178
and especially:
https://bugzilla.redhat.com/show_bug.cgi?id=1821185
https://bugzilla.redhat.com/show_bug.cgi?id=1821185#c19
this looks like a duplicate of the above bz.

Can you please provide a core dump of the crashed ovs-vswitchd?

Created attachment 1692830 [details]
ovs-vswitchd coredumps
(In reply to eolivare from comment #3)
> Created attachment 1692830 [details]
> ovs-vswitchd coredumps

I forgot to mention that this is the OVS version: openvswitch2.13-2.13.0-25.el8fdp.1.x86_64

Raising the severity to urgent as this takes the whole OpenStack deployment down.

....
#9056 0x0000559e0c50aee8 in xlate_normal (ctx=0x7f256f8bb6e0) at ../ofproto/ofproto-dpif-xlate.c:3166
#9057 xlate_output_action (ctx=ctx@entry=0x7f256f8bb6e0, port=<optimized out>, controller_len=<optimized out>, may_packet_in=may_packet_in@entry=true, is_last_action=<optimized out>, truncate=truncate@entry=false, group_bucket_action=false) at ../ofproto/ofproto-dpif-xlate.c:5190
#9058 0x0000559e0c50b820 in do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=<optimized out>, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ../include/openvswitch/ofp-actions.h:1302
#9059 0x0000559e0c511753 in xlate_actions (xin=xin@entry=0x7f256f8bc570, xout=xout@entry=0x7f256f8f78d8) at ../ofproto/ofproto-dpif-xlate.c:7699
#9060 0x0000559e0c500586 in upcall_xlate (wc=0x7f256f8f7930, odp_actions=0x7f256f8f78f0, upcall=0x7f256f8f7870, udpif=0x559e0ef12b80) at ../ofproto/ofproto-dpif-upcall.c:1204
#9061 process_upcall (udpif=udpif@entry=0x559e0ef12b80, upcall=upcall@entry=0x7f256f8f7870, odp_actions=odp_actions@entry=0x7f256f8f78f0, wc=wc@entry=0x7f256f8f7930) at ../ofproto/ofproto-dpif-upcall.c:1420
#9062 0x0000559e0c501183 in recv_upcalls (handler=<optimized out>, handler=<optimized out>) at ../ofproto/ofproto-dpif-upcall.c:842
#9063 0x0000559e0c50164c in udpif_upcall_handler (arg=0x559e0ef75b10) at ../ofproto/ofproto-dpif-upcall.c:759
#9064 0x0000559e0c5c8e03 in ovsthread_wrapper (aux_=<optimized out>) at ../lib/ovs-thread.c:383
#9065 0x00007f2572a212de in start_thread () from /lib64/libpthread.so.0
#9066 0x00007f2571e93e83 in timerfd_create () from /lib64/libc.so.6
#9067 0x0000000000000000 in ?? ()
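The striking thing about this trace is the frame numbering: the crash site is at frame #9056, meaning thousands of nested xlate calls, consistent with translation recursion (a packet loop, or a topology deep enough to exhaust the handler thread's stack) rather than an ordinary segfault. A rough, hypothetical screen for that pattern over a saved gdb backtrace (the sample input below is a fabricated three-frame excerpt):

```shell
#!/bin/sh
# Flag a gdb backtrace whose deepest frame number is suspiciously large,
# suggesting runaway recursion rather than a plain crash. Sample data only.
cat > /tmp/bt.txt <<'EOF'
#9056 0x0000559e0c50aee8 in xlate_normal () at ../ofproto/ofproto-dpif-xlate.c:3166
#9057 xlate_output_action () at ../ofproto/ofproto-dpif-xlate.c:5190
#9058 0x0000559e0c50b820 in do_xlate_actions () at ../include/openvswitch/ofp-actions.h:1302
EOF

# Extract the frame numbers and take the largest (deepest) one.
deepest=$(sed -n 's/^#\([0-9]*\) .*/\1/p' /tmp/bt.txt | sort -n | tail -1)
echo "deepest frame: $deepest"
if [ "$deepest" -gt 1000 ]; then
  echo "likely recursion/stack exhaustion"
fi
```

A healthy crash backtrace typically has a few dozen frames at most, so any threshold in the hundreds or above is a strong signal.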
Looks like it might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1821185 ? Is this using OVN?

Yes, it is using OVN.

Bug 1821185 was caused by bug 1825334 in OVN, which has been fixed in ovn2.13-2.13.0-18.el8fdp.x86_64. We are running with this version, so we should no longer see bug 1821185.

It would be great to have the OVN DBs to determine whether this is a packet loop again or just a topology that generates enough resubmits to hit the stack limit in OVS.

This is a dup of bug 1825334. It reproduced because of a package mismatch in the OVN DBs image:
ovn2.13-central-2.13.0-11.el8fdp.x86_64
ovn2.13-2.13.0-18.el8fdp.x86_64
ovn2.13 is at -18 while ovn2.13-central, the package that contains the fixed code, is at -11, and that version does not contain the fix. So the main question now is how the images were built and why there is an OVN version mismatch.

Marking as duplicate. The problem was caused by using bad repos.

*** This bug has been marked as a duplicate of bug 1825334 ***

(In reply to Jakub Libosvar from comment #9)
> This is a dup of bug 1825334. It reproduced because of a package mismatch in
> the OVN DBs image:
> ovn2.13-central-2.13.0-11.el8fdp.x86_64
> ovn2.13-2.13.0-18.el8fdp.x86_64
>
> Ovn is -18 while ovn-central, which is the package that contains the fixed
> code, is on -11 version and this version doesn't contain the fix.
>
> So now the main question is how were the images built and why do we have an
> OVN version mismatch there.

Sorry, I cannot answer how the images are built or why there is a mismatch. Let's see if Lon can clarify this point.

I can confirm that tests started passing with later puddles, starting with RHOS-16.1-RHEL-8-20200604.n.1.

I checked an even later puddle and the OVN versions are aligned:
RHOS-16.1-RHEL-8-20200604.n.1
ovn2.13-central-2.13.0-30.el8fdp.x86_64
ovn2.13-2.13.0-30.el8fdp.x86_64
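Since the root cause was an ovn2.13 / ovn2.13-central release mismatch inside the DB image, the fix can be verified with a simple version-alignment check. A hedged sketch (the package list is hard-coded to mirror the mismatch from comment 9; on a real node or inside the container image it would come from `rpm -qa | grep ovn2.13`):

```shell
#!/bin/sh
# Compare the version-release of ovn2.13 vs ovn2.13-central; a mismatch
# means the image was built from inconsistent repos, as in this bug.
pkgs='ovn2.13-central-2.13.0-11.el8fdp.x86_64
ovn2.13-2.13.0-18.el8fdp.x86_64'

# ovn2.13-central-<ver>.x86_64 -> <ver>
central=$(printf '%s\n' "$pkgs" | sed -n 's/^ovn2\.13-central-\(.*\)\.x86_64$/\1/p')
# ovn2.13-<ver>.x86_64 (ver starts with a digit, so -central lines don't match)
base=$(printf '%s\n' "$pkgs" | sed -n 's/^ovn2\.13-\([0-9].*\)\.x86_64$/\1/p')

if [ "$central" = "$base" ]; then
  echo "OVN versions aligned: $base"
else
  echo "MISMATCH: ovn2.13=$base ovn2.13-central=$central"
fi
```

With the sample data this reports a mismatch; substituting the -30 packages from the fixed puddle would make both sides equal and print the aligned message.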