Created attachment 1481579 [details]
ovsdb-server.log-20180906

Description of problem:

when deploying osp14 + opendaylight, the ovs-vswitchd process dies with:

2018-09-06T17:12:06.442Z|00001|util(handler28)|EMER|./include/openvswitch/list.h:261: assertion !ovs_list_is_empty(list) failed in ovs_list_back()

ovs is configured with Controller and Manager pointing at ODL:

    Manager "tcp:172.17.1.29:6640"
    Manager "tcp:172.17.1.10:6640"
    Manager "tcp:172.17.1.21:6640"
    Manager "ptcp:6639:127.0.0.1"
    Bridge br-isolated
    Bridge br-ex
    Bridge br-int
        Controller "tcp:172.17.1.21:6653"
        Controller "tcp:172.17.1.10:6653"
        Controller "tcp:172.17.1.29:6653"

ovs can't connect to 172.17.1.10 and 172.17.1.21 on port 6653; it connects only after restarting the ovs-vswitchd process on 1.10 or 1.21.

The same deployment was reinstalled with ovs 2.9 and this dying behavior was not observed.

More info in the attached logs.

Version-Release number of selected component (if applicable):
osp14 (puddle 2018-08-23.3) + opendaylight-8.3.0-3 + ovs 2.10

How reproducible:
100%

Steps to Reproduce:
1. deploy osp14 + odl
2. observe loss of connectivity on certain subnets + systemctl status ovs-vswitchd showing:

[root@controller-0 ~]# systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead) since Thu 2018-09-06 18:10:30 BST; 1min 25s ago
  Process: 560635 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop (code=exited, status=0/SUCCESS)
  Process: 560096 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monitor --system-id=random ${OVSUSER} start $OPTIONS (code=exited, status=0/SUCCESS)
  Process: 560093 ExecStartPre=/usr/bin/chmod 0775 /dev/hugepages (code=exited, status=0/SUCCESS)
  Process: 560091 ExecStartPre=/bin/sh -c /usr/bin/chown :$${OVS_USER_ID##*:} /dev/hugepages (code=exited, status=0/SUCCESS)
 Main PID: 559742 (code=killed, signal=ABRT)

Sep 06 18:10:28 controller-0 systemd[1]: Starting Open vSwitch Forwarding Unit...
Sep 06 18:10:29 controller-0 ovs-ctl[560096]: Starting ovs-vswitchd [  OK  ]
Sep 06 18:10:29 controller-0 ovs-vsctl[560185]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . external-ids:hostname=controller-0.localdomain
Sep 06 18:10:29 controller-0 ovs-ctl[560096]: Enabling remote OVSDB managers [  OK  ]
Sep 06 18:10:29 controller-0 systemd[1]: Started Open vSwitch Forwarding Unit.
Sep 06 18:10:30 controller-0 ovs-vswitchd[560132]: ovs|00001|util(revalidator34)|EMER|./include/openvswitch/list.h:261: assertion !ovs_list_is_empty(list) failed in ovs_list_back()
Sep 06 18:10:30 controller-0 ovs-ctl[560635]: 2018-09-06T17:10:30Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.560132.ctl
Sep 06 18:10:30 controller-0 ovs-appctl[560650]: ovs|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.560132.ctl
Sep 06 18:10:30 controller-0 ovs-ctl[560635]: ovs-appctl: cannot connect to "/var/run/openvswitch/ovs-vswitchd.560132.ctl" (Connection refused)
3.

Actual results:

Expected results:

Additional info:
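For reference, a minimal sketch (simplified, not the exact OVS sources) of what that EMER log line means: ovs_list_back() asserts that the list it is handed is non-empty before returning its last element, so an empty list makes ovs-vswitchd abort, which matches the "Main PID: ... (code=killed, signal=ABRT)" above.

#include <assert.h>

/* Simplified stand-in for struct ovs_list from include/openvswitch/list.h. */
struct ovs_list {
    struct ovs_list *prev, *next;
};

static int ovs_list_is_empty(const struct ovs_list *list)
{
    return list->next == list;
}

/* Returns the last element; aborts the process if the list is empty.
 * This is the "assertion !ovs_list_is_empty(list) failed in ovs_list_back()"
 * reported in ovs-vswitchd.log. */
static struct ovs_list *ovs_list_back(struct ovs_list *list)
{
    assert(!ovs_list_is_empty(list));
    return list->prev;
}

int main(void)
{
    struct ovs_list empty = { &empty, &empty };  /* an empty list points to itself */
    ovs_list_back(&empty);                       /* aborts -> SIGABRT */
    return 0;
}

Running something like this with assertions enabled aborts the same way, so the interesting question is which list inside ovs-vswitchd ends up empty.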
Created attachment 1481580 [details] ovs2.10_logs_dying
Since it's happening in 100% of deployments, I can easily provide a machine with these symptoms to troubleshoot on.
I've enabled DBG on _all_ ovs components and restarted ovs-vswitchd via systemctl. Attaching ovs-vswitchd_dbg.log with the error and all DBG logs preceding the EMERGENCY error (note ovs-vswitchd was restarted by systemd automatically a couple of times and the logs will show this as well).
Created attachment 1482369 [details] ovs-vswitchd_dbg.log
Any deployment of ovs + opendaylight (OVN is suspected to hit the same issue) is _not_ possible because of this bug; changing severity to urgent.
Can you come up with a simple reproducer so I can replicate this in my lab? By simple I mean just OVS running, opendaylight (which seems to be needed), and the commands to trigger this. In the meantime, when you hit this, can you provide a core dump and an SOS report (taken before the core dump so I have all the configs, flows, etc.)?
I install ovs + odl as part of a TripleO OSP14 deployment in containers. I would prefer to use the machine I have this problem on to troubleshoot/debug/develop rather than reproducing the issue in a small/standalone ovs + opendaylight environment (which I've never tried and which may take some time to get right). By using the machine I have this problem on we can get to the resolution quicker, and the coming fix will be tested against the environment the original issue was reported in (TripleO), so it's less likely we'll hit some edge cases. I'm happy to walk through this bug with you since it has a few more layers due to the fact it's a TripleO deployment. WDYT?
Created attachment 1482614 [details] sosreport-controller-0-20180912110418
Created attachment 1482616 [details] sosreport-controller-0-20180912111522
Created attachment 1482617 [details] core.416227.gz
If you can get me access to your setup where I can find the core dump, I can take a look at that first. In addition, please let me know step by step how I can replicate the issue on your setup, so I can work on it when time permits.
Installed a newer version of ovs2.10 on all three controllers with:

wget http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch2.10/2.10.0/2.el7fdp/x86_64/openvswitch2.10-2.10.0-2.el7fdp.x86_64.rpm \
     http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch2.10/2.10.0/2.el7fdp/x86_64/python-openvswitch2.10-2.10.0-2.el7fdp.x86_64.rpm \
     http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch2.10/2.10.0/2.el7fdp/x86_64/openvswitch2.10-debuginfo-2.10.0-2.el7fdp.x86_64.rpm \
     http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch2.10/2.10.0/2.el7fdp/x86_64/openvswitch2.10-devel-2.10.0-2.el7fdp.x86_64.rpm

yum remove -vy openvswitch2.10-2.10.0-0.20180810git58a7ce6.el7fdp.x86_64 rhosp-openvswitch-2.10-0.1.el7ost.noarch
rpm -vi /home/heat-admin/ovs2.10.0-2/*.rpm
systemctl restart ovs-vswitchd
systemctl status ovs-vswitchd

[root@controller-0 ~]# rpm -qa | grep -i openvswitch
openvswitch2.10-2.10.0-2.el7fdp.x86_64
python-openvswitch2.10-2.10.0-2.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-5.el7fdp.noarch
openvswitch2.10-debuginfo-2.10.0-2.el7fdp.x86_64
openvswitch2.10-devel-2.10.0-2.el7fdp.x86_64

and ovs died on all 3 controllers after a couple of minutes (1-5).

Restarting and getting the core dumps.
Created attachment 1482633 [details] core.980899_ovs2.10.0-2_controller-0
I just took a peek at this and the coredump didn't show much to me:

#0  0x00007f55e6f01e9d in poll () from /lib64/libc.so.6
#1  0x00007f55e88485d4 in poll (__timeout=<optimized out>, __nfds=3, __fds=0x7f55cc0008c0) at /usr/include/bits/poll2.h:46
#2  time_poll (pollfds=pollfds@entry=0x7f55cc0008c0, n_pollfds=3, handles=handles@entry=0x0, timeout_when=9223372036854775807, elapsed=elapsed@entry=0x7f55e1abe944) at lib/timeval.c:326
#3  0x00007f55e883015c in poll_block () at lib/poll-loop.c:364
#4  0x00007f55e88177de in ovsrcu_postpone_thread (arg=<optimized out>) at lib/ovs-rcu.c:360
#5  0x00007f55e8819a1f in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:354
#6  0x00007f55e7bf9dd5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f55e6f0cb3d in clone () from /lib64/libc.so.6

Looks like that thread is just blocked in poll, so I can't really pull much out of it (at least that I know of).

I also took a look at the changes around the code that is reported to cause the issue and found this patch, which is under review:
http://patchwork.ozlabs.org/patch/965120/

Can somebody from the OVS team take a look and/or get a test package with it?
(In reply to Daniel Alvarez Sanchez from comment #19)
> I just took a peek at this and the coredump didn't show much to me:
[...]
> Looks like that thread is just blocked in poll, so I can't really pull much
> out of it (at least that I know of).

Looks like that coredump was taken during normal operation.

> I also took a look at the changes around the code that is reported to cause
> the issue and found this patch, which is under review:
> http://patchwork.ozlabs.org/patch/965120/

Yeah, Eelco was looking at another core dump and saw this:

#0  0x00000000007551c0 in ovs_list_back (list_=0x119cdc8) at ./include/openvswitch/list.h:263
#1  0x0000000000761c1d in xlate_group_action__ (ctx=0x7f7136f18330, group=0x119cd80, is_last_action=false) at ofproto/ofproto-dpif-xlate.c:4463
#2  0x0000000000761e40 in xlate_group_action (ctx=0x7f7136f18330, group_id=210001, is_last_action=false) at ofproto/ofproto-dpif-xlate.c:4510

which seemed related to:

    Author: Ben Pfaff <blp>
    Date:   Thu May 10 15:23:43 2018 -0700

        ofproto-dpif-xlate: Simplify translation for groups.

        Translation of groups had a lot of redundant code.  This commit
        eliminates most of it.  It should also make it harder to accidentally
        reintroduce the reference leak fixed in a previous commit.

So that upstream patch seems to be the right fix.

> Can somebody from the OVS team take a look and/or get a test package with it?

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=18298766

Could you please test that and report back the results?

Thanks,
fbl
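For illustration only, a rough, hypothetical sketch of that failure mode (simplified names, not the actual ofproto-dpif-xlate.c code): group translation asks for the last entry of the group's bucket list, so a group whose bucket list has become empty trips exactly this assertion.

#include <assert.h>
#include <stdio.h>

/* Simplified stand-in for struct ovs_list. */
struct ovs_list {
    struct ovs_list *prev, *next;
};

static void ovs_list_init(struct ovs_list *list)
{
    list->prev = list->next = list;
}

static int ovs_list_is_empty(const struct ovs_list *list)
{
    return list->next == list;
}

static struct ovs_list *ovs_list_back(struct ovs_list *list)
{
    assert(!ovs_list_is_empty(list));  /* the failing assertion */
    return list->prev;
}

/* Hypothetical, pared-down group: just a list of buckets. */
struct group {
    struct ovs_list buckets;
};

/* Roughly the step in the backtrace where translation needs to know which
 * bucket is the last one: if the bucket list is empty (e.g. because a
 * reference/copy bug dropped the buckets), ovs_list_back() aborts. */
static void xlate_group(struct group *group)
{
    struct ovs_list *last = ovs_list_back(&group->buckets);
    printf("last bucket at %p\n", (void *) last);
}

int main(void)
{
    struct group group;
    ovs_list_init(&group.buckets);  /* no buckets -> abort in ovs_list_back() */
    xlate_group(&group);
    return 0;
}

That fits the backtrace above, where ovs_list_back() is called from xlate_group_action__() on the group's bucket list.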
(In reply to Flavio Leitner from comment #20)
> (In reply to Daniel Alvarez Sanchez from comment #19)
> > I just took a peek at this and the coredump didn't show much to me:
> [...]
> > Looks like that thread is just blocked in poll, so I can't really pull
> > much out of it (at least that I know of).
>
> Looks like that coredump was taken during normal operation.
>
> > I also took a look at the changes around the code that is reported to
> > cause the issue and found this patch, which is under review:
> > http://patchwork.ozlabs.org/patch/965120/
>
> Yeah, Eelco was looking at another core dump and saw this:
>
> #0  0x00000000007551c0 in ovs_list_back (list_=0x119cdc8) at
> ./include/openvswitch/list.h:263
> #1  0x0000000000761c1d in xlate_group_action__ (ctx=0x7f7136f18330,
> group=0x119cd80, is_last_action=false) at ofproto/ofproto-dpif-xlate.c:4463
> #2  0x0000000000761e40 in xlate_group_action (ctx=0x7f7136f18330,
> group_id=210001, is_last_action=false) at ofproto/ofproto-dpif-xlate.c:4510
>
> which seemed related to:
>     Author: Ben Pfaff <blp>
>     Date:   Thu May 10 15:23:43 2018 -0700
>
>         ofproto-dpif-xlate: Simplify translation for groups.
>
>         Translation of groups had a lot of redundant code.  This commit
>         eliminates most of it.  It should also make it harder to accidentally
>         reintroduce the reference leak fixed in a previous commit.
>
> So that upstream patch seems to be the right fix.
>
> > Can somebody from the OVS team take a look and/or get a test package with it?
>
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=18298766

That was fast! Thanks Flavio for the swift response :)
Fingers crossed for that to be the fix (it looks like it will be).

> Could you please test that and report back the results?
>
> Thanks,
> fbl
Hi Waldemar,

I tried to access your setup to quickly test it, as I have (had) all the build stuff set up; however, when I access it now I get to a different machine :(

Can you please verify Flavio's build and let us know the results?
I've run two deployments (on different machines) with
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=18298766
and it looks like ovs-vswitchd is _not_ dying anymore.

I saw very slow network operations on one of the deployments (currently verifying the same on the second), which, even if it is a real issue, is most likely a candidate for another bz.

How can we get this scratch brew ovs into the osp14 puddle? Once we have this u/s fix in the d/s puddle we can close this bz.

Thanks for getting to this bug so quickly!
This needs to be included in the current FDP 18.09. Moving to the right product and component for patch inclusion. fbl
Thanks all for the great support despite these busy times!
*** Bug 1625995 has been marked as a duplicate of this bug. ***
Hey Christian,

The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1628949 was verified, although I can't see a version of openvswitch2.10 that has the fix for this issue in the recent OSP14 puddle (2018-09-27.3).

Do you need a hand in verifying this fix?
Hello Waldemar,

Yes, if you can assist in verifying this, that would be great. I can ask Haidong to do the same test he did for bug 1628849 if needed as well. I do not have a way to verify it myself easily.
Hi Christian,

I gave an OSP14 machine (with the broken ovs 2.10-0, directly from the available puddle) to Haidong a few days back to verify bug 1628949. It may cover verification of the fix for this bug (1626488) as well.

I'll get a machine with the latest passed_phase1 puddle in d/s so you can install the RPM and verify it there yourself if need be. I'd like to get this OVS in asap. Somehow this bug slipped through the cracks since I was looking at bug 1628949.

Cheers,
Waldek
Hi,

Reproduced and verified on the set up provided by Waldemar.

Details-
rhosw12.oslab.openstack.engineering.redhat.com (rhos-ci/redhat)
ssh heat-admin.24.6

LOG-

#REPRODUCED ON openvswitch2.10-2.10.0-2.el7fdp
NOTE- OVS failed after 10-15 mins.

[heat-admin@compute-1 ovs2.2]$ sudo rpm -ivh *.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:openvswitch2.10-2.10.0-2.el7fdp  ################################# [ 25%]
   2:openvswitch2.10-devel-2.10.0-2.el################################# [ 50%]
   3:python-openvswitch2.10-2.10.0-2.e################################# [ 75%]
   4:openvswitch2.10-debuginfo-2.10.0-################################# [100%]

[heat-admin@compute-1 ~]$ sudo systemctl start openvswitch
[heat-admin@compute-1 ~]$ sudo systemctl status openvswitch
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
   Active: active (exited) since Tue 2018-10-02 19:16:35 BST; 7s ago
  Process: 118826 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
 Main PID: 118826 (code=exited, status=0/SUCCESS)

Oct 02 19:16:35 compute-1 systemd[1]: Starting Open vSwitch...
Oct 02 19:16:35 compute-1 systemd[1]: Started Open vSwitch.

[heat-admin@compute-1 ~]$ rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-5.el7fdp.noarch
openvswitch2.10-devel-2.10.0-2.el7fdp.x86_64
python-openvswitch2.10-2.10.0-2.el7fdp.x86_64
openvswitch2.10-debuginfo-2.10.0-2.el7fdp.x86_64
openvswitch2.10-2.10.0-2.el7fdp.x86_64

[heat-admin@compute-1 ~]$ systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead) since Tue 2018-10-02 20:10:38 BST; 4min 45s ago
  Process: 134155 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop (code=exited, status=0/SUCCESS)
  Process: 131909 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monitor --system-id=random ${OVSUSER} start $OPTIONS (code=exited, status=0/SUCCESS)
  Process: 131907 ExecStartPre=/usr/bin/chmod 0775 /dev/hugepages (code=exited, status=0/SUCCESS)
  Process: 131905 ExecStartPre=/bin/sh -c /usr/bin/chown :$${OVS_USER_ID##*:} /dev/hugepages (code=exited, status=0/SUCCESS)
 Main PID: 127582 (code=killed, signal=ABRT)

Oct 02 20:02:09 compute-1 systemd[1]: Starting Open vSwitch Forwarding Unit...
Oct 02 20:02:09 compute-1 ovs-ctl[131909]: Starting ovs-vswitchd [  OK  ]
Oct 02 20:02:09 compute-1 ovs-vsctl[131985]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . external-ids:hostname=compute-1.localdomain
Oct 02 20:02:09 compute-1 ovs-ctl[131909]: Enabling remote OVSDB managers [  OK  ]
Oct 02 20:02:09 compute-1 systemd[1]: Started Open vSwitch Forwarding Unit.
Oct 02 20:10:37 compute-1 ovs-vswitchd[131946]: ovs|00005|util(handler9)|EMER|./include/openvswitch/list.h:261: assertion !ovs_list_is_empty(list) failed in ovs_list_back()
Oct 02 20:10:38 compute-1 ovs-ctl[134155]: 2018-10-02T19:10:38Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.131946.ctl
Oct 02 20:10:38 compute-1 ovs-appctl[134170]: ovs|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.131946.ctl
Oct 02 20:10:38 compute-1 ovs-ctl[134155]: ovs-appctl: cannot connect to "/var/run/openvswitch/ovs-vswitchd.131946.ctl" (Connection refused)

#VERIFIED ON openvswitch2.10-2.10.0-4.el7fdp

[heat-admin@compute-1 ovs2.10.4]$ sudo rpm -ivh *.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:openvswitch2.10-2.10.0-4.el7fdp  ################################# [ 25%]
   2:openvswitch2.10-devel-2.10.0-4.el################################# [ 50%]
   3:python-openvswitch2.10-2.10.0-4.e################################# [ 75%]
   4:openvswitch2.10-debuginfo-2.10.0-################################# [100%]
[heat-admin@compute-1 ovs2.10.4]$
[heat-admin@compute-1 ovs2.10.4]$ sudo systemctl start ovs-vswitchd
[heat-admin@compute-1 ovs2.10.4]$ sudo systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: active (running) since Tue 2018-10-02 20:29:13 BST; 4s ago
  Process: 139129 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monitor --system-id=random ${OVSUSER} start $OPTIONS (code=exited, status=0/SUCCESS)
  Process: 139127 ExecStartPre=/usr/bin/chmod 0775 /dev/hugepages (code=exited, status=0/SUCCESS)
  Process: 139125 ExecStartPre=/bin/sh -c /usr/bin/chown :$${OVS_USER_ID##*:} /dev/hugepages (code=exited, status=0/SUCCESS)
    Tasks: 6
   Memory: 38.8M
   CGroup: /system.slice/ovs-vswitchd.service
           └─139165 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswit...

Oct 02 20:29:13 compute-1 systemd[1]: Starting Open vSwitch Forwarding Unit...
Oct 02 20:29:13 compute-1 ovs-ctl[139129]: Starting ovs-vswitchd [  OK  ]
Oct 02 20:29:13 compute-1 ovs-vsctl[139204]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . external-ids:hostname=compute-1.localdomain
Oct 02 20:29:13 compute-1 ovs-ctl[139129]: Enabling remote OVSDB managers [  OK  ]
Oct 02 20:29:13 compute-1 systemd[1]: Started Open vSwitch Forwarding Unit.

[heat-admin@compute-1 ovs2.10.4]$ rpm -qa | grep openvswitch
openvswitch2.10-2.10.0-4.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-5.el7fdp.noarch
python-openvswitch2.10-2.10.0-4.el7fdp.x86_64
openvswitch2.10-debuginfo-2.10.0-4.el7fdp.x86_64
openvswitch2.10-devel-2.10.0-4.el7fdp.x86_64
[heat-admin@compute-1 ovs2.10.4]$
Waiting for the fixed_in_version to land in an OSP14 puddle.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045