+++ This bug was initially created as a clone of Bug #1468299 +++

Description of problem:
neutron-openvswitch-agent stops after receiving SIGTERM. When openvswitch crashes with SIGSEGV, systemd, following the dependencies defined between the units, sends SIGTERM to neutron-openvswitch-agent. Both the data plane and the control plane are disrupted.

Version-Release number of selected component (if applicable):
RDO Newton
openstack-neutron.noarch 1:9.3.2-0.20170421152919.6b11276.el7.centos
python-openvswitch.noarch 1:2.6.1-4.1.git20161206.el7
openvswitch.x86_64 1:2.6.1-4.1.git20161206.el7
openstack-neutron-ml2.noarch 1:9.3.2-0.20170421152919.6b11276.el7.centos

How reproducible:
Always

Steps to Reproduce:
1. Kill the ovs-vswitchd process

Actual results:
Both openvswitch and neutron-openvswitch-agent are inactive.

Expected results:
systemd handles the failure and restarts the services.

Additional info:
This is already fixed in openvswitch 2.7 [1]

[1] https://github.com/openvswitch/ovs/commit/c19bf36d848cbdf755c6760fad1726c95e4377f1
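For illustration only: the "Expected results" above amount to a restart-on-failure policy in systemd. Below is a minimal sketch of that idea, assuming a unit whose main process is ovs-vswitchd; this is not the content of the upstream commit in [1], the drop-in path and unit name are hypothetical, and the wrapper openvswitch.service used by this packaging may need the reworked 2.7 units rather than a plain Restart= setting.

# /etc/systemd/system/ovs-vswitchd.service.d/restart.conf  (hypothetical drop-in)
# Restart the daemon automatically if it exits on a signal such as SIGSEGV,
# instead of leaving it (and the units that depend on it) inactive.
[Service]
Restart=on-failure
RestartSec=2

After adding a drop-in like this, a systemctl daemon-reload is needed for it to take effect.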
Would it be possible to backport [1] to OVS 2.6? We plan on using OVS 2.6 in the OSP 10 branch for a long time.

[1] https://github.com/openvswitch/ovs/commit/c19bf36d848cbdf755c6760fad1726c95e4377f1
It'd be nice to have it in RDO as well. @Flavio, if you need some help with this, please feel free to ping me anytime.
We have just enabled core dumps on the compute nodes, so we can share a core dump for debugging the next time it happens.
I have tried on an OSP 10 environment and I could reproduce the issue easily by just killing the ovs-vswitchd process.

# rpm -qi openvswitch
Name        : openvswitch
Version     : 2.6.1
Release     : 10.git20161206.el7fdp
Architecture: x86_64
Install Date: vie 07 jul 2017 12:56:02 CEST
Group       : System Environment/Daemons
Summary     : Open vSwitch daemon/database/utilities
Size        : 20229211
License     : ASL 2.0 and LGPLv2+ and SISSL
Signature   : RSA/SHA256, mar 14 mar 2017 13:08:09 CET, Key ID 199e2f91fd431d51
Source RPM  : openvswitch-2.6.1-10.git20161206.el7fdp.src.rpm
Build Date  : vie 24 feb 2017 18:15:39 CET
Build Host  : x86-030.build.eng.bos.redhat.com

# rpm -qi openstack-neutron-openvswitch
Name        : openstack-neutron-openvswitch
Epoch       : 1
Version     : 9.3.1
Release     : 2.el7ost
Architecture: noarch
Install Date: vie 07 jul 2017 11:37:38 CEST
Group       : Unspecified
Size        : 22918
License     : ASL 2.0
Signature   : RSA/SHA256, mar 23 may 2017 22:04:25 CEST, Key ID 199e2f91fd431d51
Source RPM  : openstack-neutron-9.3.1-2.el7ost.src.rpm
Build Date  : mar 23 may 2017 21:47:23 CEST
Build Host  : x86-038.build.eng.bos.redhat.com

After killing ovs-vswitchd, this is the status of the OVS agent and openvswitch:

# systemctl status neutron-openvswitch-agent
● neutron-openvswitch-agent.service - OpenStack Neutron Open vSwitch Agent
   Loaded: loaded (/usr/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since vie 2017-07-07 13:05:19 CEST; 1min 38s ago

# systemctl status openvswitch
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since vie 2017-07-07 13:05:19 CEST; 2min 12s ago

Of course, killing ovs-vswitchd by hand is not the root cause of the issue (the daemon was actually getting a segmentation fault), but with the backport in place we could at least recover from the error.
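For the record, reproduction boils down to something along these lines (a sketch, run as root on a compute node; kill -SEGV just mimics the original segmentation fault):

# kill -SEGV $(pidof ovs-vswitchd)
# sleep 5
# systemctl is-active openvswitch neutron-openvswitch-agent

On the affected build both units should report inactive, matching the status output above, and nothing restarts them.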
Timothy, please include the backport from comment #1 in the next OSP 10 batch.
Assaf, could you get the necessary flags for OSP granted?

Thanks,
fbl
OpenStack uses a bot that grants devel/qa/pm flags automatically when the bug state is flipped to ASSIGNED/POST/ON_DEV and the Triaged keyword is set, unless the RHBZ is marked as an RFE.
Hi David, thanks for the info. I've loaded the core file and done some debugging:

(gdb) bt
#0  0x00007fd840e2d34c in ?? () from /lib64/libc.so.6
#1  0x00007fd842ae63b6 in netdev_get_addrs (dev=0x7fd844e1e750 "vlan121", paddr=paddr@entry=0x7ffe833244a0, pmask=pmask@entry=0x7ffe83324498, n_in=n_in@entry=0x7ffe83324494) at lib/netdev.c:1890
#2  0x00007fd842b70365 in netdev_linux_get_addr_list (netdev_=0x7fd844e8ec40, addr=0x7ffe833244a0, mask=0x7ffe83324498, n_cnt=0x7ffe83324494) at lib/netdev-linux.c:2517
#3  0x00007fd842ae576f in netdev_get_addr_list (netdev=<optimized out>, addr=addr@entry=0x7ffe833244a0, mask=mask@entry=0x7ffe83324498, n_addr=n_addr@entry=0x7ffe83324494) at lib/netdev.c:1133
#4  0x00007fd842b30191 in get_src_addr (ip6_dst=ip6_dst@entry=0x7ffe8332522c, output_bridge=output_bridge@entry=0x7ffe8332524c "vlan121", psrc=psrc@entry=0x7fd844e6f0a0) at lib/ovs-router.c:146
#5  0x00007fd842b30655 in ovs_router_insert__ (priority=<optimized out>, ip6_dst=ip6_dst@entry=0x7ffe8332522c, plen=<optimized out>, output_bridge=output_bridge@entry=0x7ffe8332524c "vlan121", gw=gw@entry=0x7ffe8332523c) at lib/ovs-router.c:200
#6  0x00007fd842b30e37 in ovs_router_insert (ip_dst=ip_dst@entry=0x7ffe8332522c, plen=<optimized out>, output_bridge=output_bridge@entry=0x7ffe8332524c "vlan121", gw=gw@entry=0x7ffe8332523c) at lib/ovs-router.c:228
#7  0x00007fd842b79d24 in route_table_handle_msg (change=0x7ffe83325220) at lib/route-table.c:295
#8  route_table_reset () at lib/route-table.c:174
#9  0x00007fd842b79ef5 in route_table_run () at lib/route-table.c:127
#10 0x00007fd842ae3701 in netdev_vport_run (netdev_class=<optimized out>) at lib/netdev-vport.c:319
#11 0x00007fd842ae438e in netdev_run () at lib/netdev.c:163
#12 0x00007fd8428f329c in main (argc=10, argv=0x7ffe833265a8) at vswitchd/ovs-vswitchd.c:114
(gdb)

While trying to find 'vlan121' among all the interfaces present in the system, it crashes because of an interface with no name:

#1  0x00007fd842ae63b6 in netdev_get_addrs (dev=0x7fd844e1e750 "vlan121", paddr=paddr@entry=0x7ffe833244a0, pmask=pmask@entry=0x7ffe83324498, n_in=n_in@entry=0x7ffe83324494) at lib/netdev.c:1890
1890            if (!strncmp(ifa->ifa_name, dev, IFNAMSIZ)) {
(gdb) p *ifa
$94 = {ifa_next = 0x7fd8451c2a78, ifa_name = 0x0, ifa_flags = 0, ifa_addr = 0x7fd8451c29f8, ifa_netmask = 0x7fd8451c2a1c, ifa_ifu = {ifu_broadaddr = 0x0, ifu_dstaddr = 0x0}, ifa_data = 0x0}
(gdb) p ifa->ifa_addr

Note that ifa->ifa_name is NULL, and that is what makes ovs-vswitchd crash when executing line 1890. I took a look at the interface list (previous and next nodes) and it looks good:

(gdb) p ifa
$101 = (const struct ifaddrs *) 0x7fd8451c29c0
(gdb) p *(struct ifaddrs *)((const unsigned char *)ifa - 184)
$102 = {ifa_next = 0x7fd8451c29c0, ifa_name = 0x7fd8451b5fb4 "qvb46ca1656-f6", ifa_flags = 69955, ifa_addr = 0x7fd8451c2940, ifa_netmask = 0x7fd8451c2964, ifa_ifu = {ifu_broadaddr = 0x0, ifu_dstaddr = 0x0}, ifa_data = 0x0}
(gdb) p *ifa
$103 = {ifa_next = 0x7fd8451c2a78, ifa_name = 0x0, ifa_flags = 0, ifa_addr = 0x7fd8451c29f8, ifa_netmask = 0x7fd8451c2a1c, ifa_ifu = {ifu_broadaddr = 0x0, ifu_dstaddr = 0x0}, ifa_data = 0x0}
(gdb) p *ifa->ifa_next
$104 = {ifa_next = 0x7fd8451c2b30, ifa_name = 0x7fd8451b606c "tap46ca1656-f6", ifa_flags = 69699, ifa_addr = 0x7fd8451c2ab0, ifa_netmask = 0x7fd8451c2ad4, ifa_ifu = {ifu_broadaddr = 0x0, ifu_dstaddr = 0x0}, ifa_data = 0x0}

Having a look at the openvswitch code in the master branch [0], it looks like it would have crashed there as well, since it does not check that the name is not NULL.
However, in the master branch there is a check further down in the code [1] for that, which IMO should be done above as well. @David, could you please check whether, at the time of the crash, a new interface was being added to the system, or whether there is an unnamed interface? OVS retrieves the list through getifaddrs() [3] and access to it is protected by a lock [4]. My conclusion is that the OVS code should check for ifa_name != NULL, since it is already doing so in [1]; I'll submit a patch for that (a minimal sketch of the idea follows below). Also, I don't know why you have an interface with no name, but it might be something you want to investigate (maybe it is being set up at that very moment? We would need to correlate with some other logs).

[0] https://github.com/openvswitch/ovs/blob/master/lib/netdev.c#L1954
[1] https://github.com/openvswitch/ovs/blob/master/lib/netdev.c#L1970
[3] http://man7.org/linux/man-pages/man3/getifaddrs.3.html
[4] https://github.com/openvswitch/ovs/blob/v2.6.1/lib/netdev.c#L1873
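To make the proposed check concrete, here is a minimal, self-contained C sketch of walking a getifaddrs() list while skipping entries whose ifa_name is NULL. It is not the actual OVS patch; count_matches() is a made-up helper for the example, and the real fix would live in lib/netdev.c.

/* Sketch only: look up a device by name in the getifaddrs() list,
 * guarding against entries with a NULL ifa_name so that strncmp()
 * is never called with a NULL pointer (the crash seen in the core). */
#include <ifaddrs.h>
#include <net/if.h>     /* IFNAMSIZ */
#include <stdio.h>
#include <string.h>

static int count_matches(const struct ifaddrs *list, const char *dev)
{
    int matches = 0;
    for (const struct ifaddrs *ifa = list; ifa; ifa = ifa->ifa_next) {
        if (!ifa->ifa_name) {
            /* Unnamed entry: skip it instead of dereferencing NULL. */
            continue;
        }
        if (!strncmp(ifa->ifa_name, dev, IFNAMSIZ)) {
            matches++;
        }
    }
    return matches;
}

int main(void)
{
    struct ifaddrs *ifaddr;

    if (getifaddrs(&ifaddr) == -1) {
        perror("getifaddrs");
        return 1;
    }
    printf("entries matching lo: %d\n", count_matches(ifaddr, "lo"));
    freeifaddrs(ifaddr);
    return 0;
}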
Patch submitted to OVS: https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335769.html
I've iterated over the list of interfaces in the coredump: there are 504 interfaces and two of them have their names set to NULL. The script I used in gdb was:

define plist
  set var $n = $arg0
  set var $count = 0
  set var $total = 0
  while $n
    if $n->ifa_name == 0x00
      set var $count = $count + 1
    end
    set var $n = $n->ifa_next
    set var $total = $total + 1
  end
  printf "Total interfaces: %d\nNULL names: %d\n", $total, $count
end

(gdb) plist if_addr_list
Total interfaces: 504
NULL names: 2

We shouldn't have unnamed interfaces, so this apparently shouldn't happen, but iproute2, for example, checks for this case and reports it as a bug, so it is certainly possible:

https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/tree/ip/ipaddress.c#n664

BTW, we might want to open a new BZ for this, since it is a different bug from the one reported here.
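For completeness, the same count can be taken on a live system (rather than from the core dump) to try to catch the transient unnamed interface as it appears. The standalone program below is just a diagnostic sketch, not part of OVS or iproute2:

/* Diagnostic sketch: periodically poll getifaddrs() and report any
 * entry whose ifa_name is NULL, to help correlate unnamed interfaces
 * with other events (e.g. an interface being set up at that moment). */
#include <ifaddrs.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        struct ifaddrs *ifaddr;
        int total = 0, unnamed = 0;

        if (getifaddrs(&ifaddr) == -1) {
            perror("getifaddrs");
            return 1;
        }
        for (struct ifaddrs *ifa = ifaddr; ifa; ifa = ifa->ifa_next) {
            total++;
            if (!ifa->ifa_name) {
                unnamed++;
            }
        }
        if (unnamed) {
            printf("%ld: %d interfaces, %d with NULL name\n",
                   (long) time(NULL), total, unnamed);
            fflush(stdout);
        }
        freeifaddrs(ifaddr);
        sleep(1);
    }
}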
I've opened a new BZ to follow up on the glibc(/kernel?) bug. This BZ here should be resolved with the backport of: https://github.com/openvswitch/ovs/commit/c19bf36d848cbdf755c6760fad1726c95e4377f1
Just to recap, to avoid the confusion I created with my last analysis:

1. We need [1] to be backported. This will make openvswitch restart when a daemon crashes.

2. The patch I submitted to OVS [2] will keep ovs-vswitchd from crashing under the circumstances observed in the coredump submitted to this BZ. We still need [1] because ovs-vswitchd might crash for other reasons, and we still want the services to stay up.

[1] https://github.com/openvswitch/ovs/commit/c19bf36d848cbdf755c6760fad1726c95e4377f1
[2] https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335859.html
We have had this issue again today. I am adding the core dump and the logs from the SIGTERM, so we can confirm whether it is the same root cause or find anything different.
Fixed in the attached build; we need to bump OVS 2.6 on OSP 10 and 11.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:2648