Bug 1468334 - neutron-openvswitch-agent crashes after SIGTERM is received and openvswitch/agent are not restarted
Summary: neutron-openvswitch-agent crashes after SIGTERM is received and openvswitch/a...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: z4
Target Release: 10.0 (Newton)
Assignee: Timothy Redaelli
QA Contact: Alexander Stafeyev
URL:
Whiteboard:
Depends On: 1468299
Blocks:
Reported: 2017-07-06 18:04 UTC by Assaf Muller
Modified: 2022-07-09 11:16 UTC (History)

Fixed In Version: openvswitch-2.6.1-11.git20161206.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1468299
Environment:
Last Closed: 2017-09-06 16:59:49 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1472832 | 1 | None | None | None | 2021-12-10 15:09:36 UTC
Red Hat Issue Tracker | OSP-17122 | 0 | None | None | None | 2022-07-09 11:16:10 UTC
Red Hat Product Errata | RHSA-2017:2648 | 0 | normal | SHIPPED_LIVE | Moderate: openvswitch security and bug fix update | 2017-09-06 20:53:24 UTC

Internal Links: 1473735

Description Assaf Muller 2017-07-06 18:04:09 UTC
+++ This bug was initially created as a clone of Bug #1468299 +++

Description of problem:
neutron-openvswitch-agent stops after receiving SIGTERM.
When openvswitch crashes with a SIGSEGV, systemd, following the dependencies defined between the units, sends SIGTERM to neutron-openvswitch-agent.
The dataplane and controlplane get disrupted.

Version-Release number of selected component (if applicable):
RDO newton
openstack-neutron.noarch        1:9.3.2-0.20170421152919.6b11276.el7.centos
python-openvswitch.noarch       1:2.6.1-4.1.git20161206.el7
openvswitch.x86_64              1:2.6.1-4.1.git20161206.el7
openstack-neutron-ml2.noarch    1:9.3.2-0.20170421152919.6b11276.el7.centos


How reproducible:
Always

Steps to Reproduce:
1. Kill the ovs-vswitchd process

Actual results:
Both openvswitch and neutron-openvswitch-agent are in inactive status

Expected results:
systemd to handle the failure and restart the services

Additional info:
This is already fixed in openvswitch 2.7 [1]

[1] https://github.com/openvswitch/ovs/commit/c19bf36d848cbdf755c6760fad1726c95e4377f1
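
The expected result above amounts to a restart-on-failure policy: a supervisor restarts a daemon that was killed by a signal or exited with a non-zero status, and leaves it alone after a clean exit. A minimal, generic sketch of that decision follows. It is not systemd's implementation and not the content of the linked commit; `daemon_clean_exit`, `daemon_segfault` and `needs_restart_after` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins for a daemon's main loop. */
static void
daemon_clean_exit(void)
{
    _exit(EXIT_SUCCESS);
}

static void
daemon_segfault(void)
{
    raise(SIGSEGV);          /* Default disposition terminates the process. */
    _exit(EXIT_FAILURE);     /* Only reached if SIGSEGV was blocked or ignored. */
}

/* Runs 'daemon_fn' in a child process, waits for it, and reports whether a
 * supervisor applying a restart-on-failure policy would restart it. */
static bool
needs_restart_after(void (*daemon_fn)(void))
{
    assert(daemon_fn != NULL);

    pid_t pid = fork();
    if (pid < 0) {
        return false;        /* The child never started; nothing to restart. */
    }
    if (pid == 0) {
        daemon_fn();
        _exit(EXIT_SUCCESS);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return false;
    }
    if (WIFSIGNALED(status)) {
        return true;         /* Killed by a signal, e.g. SIGSEGV. */
    }
    return WIFEXITED(status) && WEXITSTATUS(status) != 0;
}
```

Under these assumptions, `needs_restart_after(daemon_segfault)` reports that a restart is needed, while `needs_restart_after(daemon_clean_exit)` does not.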

Comment 1 Assaf Muller 2017-07-06 18:05:06 UTC
Would it be possible to backport [1] to OVS 2.6? We plan on using OVS 2.6 in the OSP 10 branch for a long time.

[1] https://github.com/openvswitch/ovs/commit/c19bf36d848cbdf755c6760fad1726c95e4377f1

Comment 2 Daniel Alvarez Sanchez 2017-07-07 09:41:21 UTC
It'd be nice to have it in RDO as well.
@Flavio, if you need some help with this please feel free to ping me anytime.

Comment 3 David Manchado 2017-07-07 10:30:46 UTC
We have just enabled dumps in the compute nodes so we can share a core dump for debugging the next time it happens.

Comment 4 Daniel Alvarez Sanchez 2017-07-07 11:09:47 UTC
I tried this on an OSP10 environment and could easily reproduce the issue just by killing the ovs-vswitchd process.

# rpm -qi openvswitch
Name        : openvswitch
Version     : 2.6.1
Release     : 10.git20161206.el7fdp
Architecture: x86_64
Install Date: Fri 07 Jul 2017 12:56:02 CEST
Group       : System Environment/Daemons daemon/database/utilities
Size        : 20229211
License     : ASL 2.0 and LGPLv2+ and SISSL
Signature   : RSA/SHA256, Tue 14 Mar 2017 13:08:09 CET, Key ID 199e2f91fd431d51
Source RPM  : openvswitch-2.6.1-10.git20161206.el7fdp.src.rpm
Build Date  : Fri 24 Feb 2017 18:15:39 CET
Build Host  : x86-030.build.eng.bos.redhat.com


# rpm -qi openstack-neutron-openvswitch
Name        : openstack-neutron-openvswitch
Epoch       : 1
Version     : 9.3.1
Release     : 2.el7ost
Architecture: noarch
Install Date: Fri 07 Jul 2017 11:37:38 CEST
Group       : Unspecified
Size        : 22918
License     : ASL 2.0
Signature   : RSA/SHA256, Tue 23 May 2017 22:04:25 CEST, Key ID 199e2f91fd431d51
Source RPM  : openstack-neutron-9.3.1-2.el7ost.src.rpm
Build Date  : Tue 23 May 2017 21:47:23 CEST
Build Host  : x86-038.build.eng.bos.redhat.com


After killing ovs-vswitchd, this is the status of the ovs agent and openvswitch:

# systemctl status neutron-openvswitch-agent
● neutron-openvswitch-agent.service - OpenStack Neutron Open vSwitch Agent
   Loaded: loaded (/usr/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Fri 2017-07-07 13:05:19 CEST; 1min 38s ago


# systemctl status openvswitch
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Fri 2017-07-07 13:05:19 CEST; 2min 12s ago


Of course, this is not the root cause of the issue, since ovs-vswitchd was hitting a segmentation fault, but with the backport in place we could at least recover from the error.

Comment 5 Flavio Leitner 2017-07-13 15:09:14 UTC
Timothy,

Please include the backport of comment#1 in the next OSP10 batch.

Comment 6 Flavio Leitner 2017-07-13 15:10:48 UTC
Assaf,

Could you get the necessary flags for OSP granted?
Thanks,
fbl

Comment 7 Assaf Muller 2017-07-13 21:18:39 UTC
OpenStack uses a bot that grants devel/qa/pm flags automatically when the bug state is flipped to ASSIGNED/POST/ON_DEV and the Triaged keyword is set, unless the RHBZ is marked as an RFE.

Comment 9 Daniel Alvarez Sanchez 2017-07-17 11:47:13 UTC
Hi David, thanks for the info.

I've loaded the core file and done some debugging:

(gdb) bt
#0  0x00007fd840e2d34c in ?? () from /lib64/libc.so.6
#1  0x00007fd842ae63b6 in netdev_get_addrs (dev=0x7fd844e1e750 "vlan121", paddr=paddr@entry=0x7ffe833244a0, pmask=pmask@entry=0x7ffe83324498, n_in=n_in@entry=0x7ffe83324494)
    at lib/netdev.c:1890
#2  0x00007fd842b70365 in netdev_linux_get_addr_list (netdev_=0x7fd844e8ec40, addr=0x7ffe833244a0, mask=0x7ffe83324498, n_cnt=0x7ffe83324494) at lib/netdev-linux.c:2517
#3  0x00007fd842ae576f in netdev_get_addr_list (netdev=<optimized out>, addr=addr@entry=0x7ffe833244a0, mask=mask@entry=0x7ffe83324498, n_addr=n_addr@entry=0x7ffe83324494)
    at lib/netdev.c:1133
#4  0x00007fd842b30191 in get_src_addr (ip6_dst=ip6_dst@entry=0x7ffe8332522c, output_bridge=output_bridge@entry=0x7ffe8332524c "vlan121", psrc=psrc@entry=0x7fd844e6f0a0)
    at lib/ovs-router.c:146
#5  0x00007fd842b30655 in ovs_router_insert__ (priority=<optimized out>, ip6_dst=ip6_dst@entry=0x7ffe8332522c, plen=<optimized out>, 
    output_bridge=output_bridge@entry=0x7ffe8332524c "vlan121", gw=gw@entry=0x7ffe8332523c) at lib/ovs-router.c:200
#6  0x00007fd842b30e37 in ovs_router_insert (ip_dst=ip_dst@entry=0x7ffe8332522c, plen=<optimized out>, output_bridge=output_bridge@entry=0x7ffe8332524c "vlan121", 
    gw=gw@entry=0x7ffe8332523c) at lib/ovs-router.c:228
#7  0x00007fd842b79d24 in route_table_handle_msg (change=0x7ffe83325220) at lib/route-table.c:295
#8  route_table_reset () at lib/route-table.c:174
#9  0x00007fd842b79ef5 in route_table_run () at lib/route-table.c:127
#10 0x00007fd842ae3701 in netdev_vport_run (netdev_class=<optimized out>) at lib/netdev-vport.c:319
#11 0x00007fd842ae438e in netdev_run () at lib/netdev.c:163
#12 0x00007fd8428f329c in main (argc=10, argv=0x7ffe833265a8) at vswitchd/ovs-vswitchd.c:114
(gdb) 


While trying to find 'vlan121' among all the interfaces present in the system, it crashes due to an interface with no name:

#1  0x00007fd842ae63b6 in netdev_get_addrs (dev=0x7fd844e1e750 "vlan121", paddr=paddr@entry=0x7ffe833244a0, pmask=pmask@entry=0x7ffe83324498, n_in=n_in@entry=0x7ffe83324494)
    at lib/netdev.c:1890
1890	                if (!strncmp(ifa->ifa_name, dev, IFNAMSIZ)) {


(gdb) p *ifa
$94 = {ifa_next = 0x7fd8451c2a78, ifa_name = 0x0, ifa_flags = 0, ifa_addr = 0x7fd8451c29f8, ifa_netmask = 0x7fd8451c2a1c, ifa_ifu = {ifu_broadaddr = 0x0, ifu_dstaddr = 0x0}, ifa_data = 0x0}
(gdb) p ifa->ifa_addr


Note that ifa->ifa_name is NULL, and that is what makes ovs-vswitchd crash when executing line 1890 of lib/netdev.c. I took a look at the interface list (previous and next nodes) and it looks good:


(gdb) p ifa
$101 = (const struct ifaddrs *) 0x7fd8451c29c0
(gdb) p *(struct ifaddrs *)((const unsigned char *)ifa - 184)
$102 = {ifa_next = 0x7fd8451c29c0, ifa_name = 0x7fd8451b5fb4 "qvb46ca1656-f6", ifa_flags = 69955, ifa_addr = 0x7fd8451c2940, ifa_netmask = 0x7fd8451c2964, ifa_ifu = {ifu_broadaddr = 0x0, 
    ifu_dstaddr = 0x0}, ifa_data = 0x0}
(gdb) p *ifa
$103 = {ifa_next = 0x7fd8451c2a78, ifa_name = 0x0, ifa_flags = 0, ifa_addr = 0x7fd8451c29f8, ifa_netmask = 0x7fd8451c2a1c, ifa_ifu = {ifu_broadaddr = 0x0, ifu_dstaddr = 0x0}, 
  ifa_data = 0x0}
(gdb) p *ifa->ifa_next
$104 = {ifa_next = 0x7fd8451c2b30, ifa_name = 0x7fd8451b606c "tap46ca1656-f6", ifa_flags = 69699, ifa_addr = 0x7fd8451c2ab0, ifa_netmask = 0x7fd8451c2ad4, ifa_ifu = {ifu_broadaddr = 0x0, 
    ifu_dstaddr = 0x0}, ifa_data = 0x0}


Looking at the openvswitch code in the master branch [0], it would have crashed there as well, since it does not check that the name is non-NULL. However, the master branch does have such a check further down in the code [1], which IMO should be applied above too.

@David, could you please check whether, at the time of the crash, a new interface was being added to the system or an unnamed interface existed? OVS retrieves the list through getifaddrs() [3], and access to it is protected by a lock [4].

My conclusion is that the OVS code should check for ifa_name != NULL (I'll submit a patch for that), since it is already doing so in [1]. Also, I don't know why you have an interface with no name, but it might be something you want to investigate (maybe it was being set up at that very moment? We would need to correlate with some other logs).

[0] https://github.com/openvswitch/ovs/blob/master/lib/netdev.c#L1954
[1] https://github.com/openvswitch/ovs/blob/master/lib/netdev.c#L1970
[3] http://man7.org/linux/man-pages/man3/getifaddrs.3.html
[4] https://github.com/openvswitch/ovs/blob/v2.6.1/lib/netdev.c#L1873

Comment 10 Daniel Alvarez Sanchez 2017-07-17 21:06:02 UTC
Patch submitted to OVS:

https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335769.html

Comment 11 Daniel Alvarez Sanchez 2017-07-18 14:02:26 UTC
I've iterated over the list of interfaces in the core dump: there are 504 entries, and two of them have their names set to NULL.

The script I used in gdb was:

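# Walk the ifaddrs list starting at $arg0 and count entries with a NULL ifa_name.
define plist
  set var $n = $arg0
  set var $count = 0
  set var $total = 0
  while $n
    # An unnamed entry is the case that crashes strncmp() in netdev_get_addrs()
    if $n->ifa_name == 0x00
      set var $count = $count + 1
    end
    set var $n = $n->ifa_next
    set var $total = $total + 1
  end
  printf "Total interfaces: %d\nNULL names: %d\n", $total, $count
end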

(gdb) plist if_addr_list
Total interfaces: 504
NULL names: 2
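
The same walk can be run against a live system rather than a core dump. The sketch below is a hedged C equivalent of the gdb macro, built on getifaddrs()/freeifaddrs(); `struct ifa_name_stats`, `count_null_names` and `report_live_null_names` are hypothetical names introduced for illustration. Note that getifaddrs() returns one entry per address, so the total can exceed the number of interfaces.

```c
#include <assert.h>
#include <ifaddrs.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical result type for the walk. */
struct ifa_name_stats {
    size_t total;        /* Entries visited. */
    size_t null_names;   /* Entries whose ifa_name is NULL. */
};

/* Walks the list starting at 'head' and counts unnamed entries. */
static struct ifa_name_stats
count_null_names(const struct ifaddrs *head)
{
    struct ifa_name_stats stats = { 0, 0 };

    for (const struct ifaddrs *ifa = head; ifa; ifa = ifa->ifa_next) {
        if (!ifa->ifa_name) {
            stats.null_names++;
        }
        stats.total++;
    }
    assert(stats.null_names <= stats.total);
    return stats;
}

/* Prints the counts for the running system.
 * Returns 0 on success, -1 if getifaddrs() fails. */
static int
report_live_null_names(void)
{
    struct ifaddrs *head = NULL;

    if (getifaddrs(&head) != 0) {
        perror("getifaddrs");
        return -1;
    }

    struct ifa_name_stats stats = count_null_names(head);
    printf("Total entries: %zu\nNULL names: %zu\n",
           stats.total, stats.null_names);
    freeifaddrs(head);
    return 0;
}
```

Calling `report_live_null_names()` periodically on an affected compute node could help correlate the appearance of unnamed entries with interface creation, assuming the condition is transient.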

Unnamed interfaces are not supposed to exist, but iproute2, for example,
checks for this case and reports it as a bug, so it can certainly
happen.

https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/tree/ip/ipaddress.c#n664

BTW, we might open a new BZ for this since this is indeed a different bug from the one reported.

Comment 12 Daniel Alvarez Sanchez 2017-07-19 13:39:59 UTC
I've opened a new BZ to follow up on the glibc(/kernel?) bug.
This BZ here should be resolved with the backport of:
https://github.com/openvswitch/ovs/commit/c19bf36d848cbdf755c6760fad1726c95e4377f1

Comment 13 Daniel Alvarez Sanchez 2017-07-21 12:51:26 UTC
Just to recap to avoid the confusion I created with my last analysis:

1. We need [1] to be backported. This will cause openvswitch to be restarted when a daemon crashes.

2. The patch I submitted to OVS [2] will prevent ovs-vswitchd from crashing under the circumstances observed in the core dump submitted to this BZ. We still need [1] because it might crash for other reasons, and we still want the services to come back up.


[1] https://github.com/openvswitch/ovs/commit/c19bf36d848cbdf755c6760fad1726c95e4377f1
[2] https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335859.html

Comment 14 David Manchado 2017-07-21 14:19:40 UTC
We have had this issue again today. I am adding the core dump and the logs from the SIGTERM so we can confirm whether it is the same root cause or find anything different.

Comment 18 Assaf Muller 2017-07-26 13:40:17 UTC
Fixed in the attached build; we need to bump OVS 2.6 on OSP 10 and 11.

Comment 29 errata-xmlrpc 2017-09-06 16:59:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2648

