Bug 1307025 - Open vSwitch service resilience test
Status: CLOSED UPSTREAM
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: openvswitch
Version: 7.3
Hardware: x86_64 Linux
Priority: high, Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Eelco Chaudron
QA Contact: ovs-qe@redhat.com
Depends On: 1335865
Blocks:
Reported: 2016-02-12 08:29 EST by Flavio Leitner
Modified: 2017-03-20 08:31 EDT

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-28 03:33:59 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Description Flavio Leitner 2016-02-12 08:29:49 EST
Description of problem:

When ovs-vswitchd segfaults, the monitor thread is responsible for restarting it to bring the service back online. However, when the bridge includes a DPDK port, the restart does not work: the restarted process cannot reopen the DPDK devices ("Cannot allocate memory"), as the log below shows.

2016-02-12T13:19:20.142Z|00003|daemon_unix(monitor)|ERR|1 crashes: pid 78009 died, killed (Segmentation fault), core dumped, restarting
2016-02-12T13:19:20.172Z|00004|ovs_numa|INFO|Discovered 24 CPU cores on NUMA node 0
2016-02-12T13:19:20.172Z|00005|ovs_numa|INFO|Discovered 1 NUMA nodes and 24 CPU cores
2016-02-12T13:19:20.172Z|00006|memory|INFO|108952 kB peak resident set size after 47.5 seconds
2016-02-12T13:19:20.172Z|00007|reconnect|INFO|unix:/usr/local/var/run/openvswitch/db.sock: connecting...
2016-02-12T13:19:20.172Z|00008|reconnect|INFO|unix:/usr/local/var/run/openvswitch/db.sock: connected
2016-02-12T13:19:20.183Z|00009|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports recirculation
2016-02-12T13:19:20.183Z|00010|ofproto_dpif|INFO|netdev@ovs-netdev: MPLS label stack length probed as 3
2016-02-12T13:19:20.183Z|00011|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports unique flow ids
2016-02-12T13:19:20.183Z|00012|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath does not support ct_state
2016-02-12T13:19:20.183Z|00013|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath does not support ct_zone
2016-02-12T13:19:20.183Z|00014|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath does not support ct_mark
2016-02-12T13:19:20.183Z|00015|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath does not support ct_label
2016-02-12T13:19:20.189Z|00016|bridge|WARN|could not open network device dpdk1 (Cannot allocate memory)
2016-02-12T13:19:20.189Z|00017|bridge|WARN|could not open network device dpdk0 (Cannot allocate memory)
2016-02-12T13:19:20.200Z|00018|bridge|INFO|bridge ovsbr0: added interface ovsbr0 on port 65534
2016-02-12T13:19:20.201Z|00019|bridge|INFO|bridge ovsbr0: using datapath ID 000006b9d7c27d4f
2016-02-12T13:19:20.201Z|00020|connmgr|INFO|ovsbr0: added service controller "punix:/usr/local/var/run/openvswitch/ovsbr0.mgmt"
2016-02-12T13:19:20.227Z|00021|bridge|WARN|could not open network device dpdk1 (Cannot allocate memory)
2016-02-12T13:19:20.228Z|00022|bridge|WARN|could not open network device dpdk0 (Cannot allocate memory)
2016-02-12T13:19:20.229Z|00023|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.5.0



Version-Release number of selected component (if applicable):
2.5.0


How reproducible:
Always


Steps to Reproduce:
1. Do something that causes an OVS thread to segfault.
2. Watch the monitor thread fail to restart OVS.

Expected results:
The monitor thread should be able to restart the service.
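
The restart step expected here can be illustrated with a minimal Python sketch. It is not how the ovs-vswitchd monitor is implemented; it only shows a parent noticing that a child was killed by SIGSEGV and moving on to the next attempt. The child commands and the run_monitored helper are hypothetical stand-ins, and nothing DPDK-related is modelled.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a crashing daemon: a child that sends itself SIGSEGV.
CRASHING_CHILD = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
# Hypothetical stand-in for a daemon that starts and exits cleanly.
HEALTHY_CHILD = "raise SystemExit(0)"


def run_monitored(commands):
    """Run each command in turn, moving to the next one after a fatal signal.

    Returns a list of (returncode, action) tuples, one per attempt.
    """
    history = []
    for code in commands:
        proc = subprocess.run([sys.executable, "-c", code])
        if proc.returncode < 0:
            # On POSIX, a negative return code means the child died from a signal.
            name = signal.Signals(-proc.returncode).name
            history.append((proc.returncode, "restart after " + name))
            continue
        history.append((proc.returncode, "exited normally"))
        break
    return history


if __name__ == "__main__":
    for attempt in run_monitored([CRASHING_CHILD, HEALTHY_CHILD]):
        print(attempt)
```

In this bug the detection step already works (the log shows "restarting"); what fails is the re-initialisation of the DPDK ports in the restarted process.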
Comment 1 Panu Matilainen 2016-06-23 07:19:15 EDT
For physical ports this has been fixed in the upstream development branches, probably partly in DPDK and partly in OVS. I haven't dug out the exact commits yet, and the stable branch situation needs testing too.

What does not work in upstream OVS is restarting vhostuser ports; they fail with:
VHOST_CONFIG: socket created, fd:50
VHOST_CONFIG: fail to bind fd:50: remove file:/var/run/openvswitch/<path> and try again.

The vhostuser sockets are registered for cleanup on fatal signals, but lib/fatal-signal.c treats only { SIGTERM, SIGINT, SIGHUP, SIGALRM } as fatal. The file cleanup therefore never runs on an actual crash, and that is why the vhostuser ports fail on restart.
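
The bind failure itself is ordinary Unix-socket behaviour and can be reproduced outside OVS. The following is a small Python sketch, not OVS or DPDK code: the socket file name and both helper functions are hypothetical, and the "crash" is simulated by closing the socket without unlinking its file.

```python
import errno
import os
import socket
import tempfile


def bind_unix(path):
    """Bind a listening Unix socket at path; the caller owns the cleanup."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError:
        sock.close()
        raise
    sock.listen(1)
    return sock


def simulate_crash_and_restart(cleanup_on_exit):
    """Return None if the 'restarted' process can rebind, else the errno."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "vhost-user-demo.sock")  # hypothetical name
        first = bind_unix(path)
        first.close()  # the process is gone, but the socket file remains
        if cleanup_on_exit:
            os.unlink(path)  # what a cleanup hook run on exit would do
        try:
            second = bind_unix(path)
        except OSError as exc:
            return exc.errno
        second.close()
        return None


if __name__ == "__main__":
    for cleanup in (False, True):
        err = simulate_crash_and_restart(cleanup)
        print("cleanup:", cleanup, "->", errno.errorcode.get(err, "ok"))
```

Without the unlink step the second bind fails with EADDRINUSE on Linux, which matches the "remove file ... and try again" message above.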
Comment 11 Eelco Chaudron 2017-02-07 07:47:41 EST
Sent a patch upstream to restart ovsdb-server or ovs-vswitchd on failure.

https://mail.openvswitch.org/pipermail/ovs-dev/2017-February/328546.html
Comment 12 Eelco Chaudron 2017-02-28 03:32:30 EST
The changes have been accepted:

https://github.com/openvswitch/ovs/commit/c19bf36d848cbdf755c6760fad1726c95e4377f1
https://github.com/openvswitch/ovs/commit/090cc60c08a513047cf0fcc8c7c63ffb42e8fef9

They will be available in the next 2.7 release, probably 2.7.1.
Comment 14 Nilesh 2017-03-20 06:53:43 EDT
I modified the file as per comment #10 and hit the error below.


[root@compute-1 log]# systemctl daemon-reload
[root@compute-1 log]# systemctl restart openvswitch
Failed to restart openvswitch.service: Unit is not loaded properly: Invalid argument.
See system logs and 'systemctl status openvswitch.service' for details.
[root@compute-1 log]# systemctl restart openvswitch
Failed to restart openvswitch.service: Unit is not loaded properly: Invalid argument.
See system logs and 'systemctl status openvswitch.service' for details.
[root@compute-1 log]# systemctl status openvswitch.service -l
● openvswitch.service - Open vSwitch
   Loaded: error (Reason: Invalid argument)
   Active: active (exited) since Mon 2017-03-20 21:35:12 +03; 3min 22s ago
 Main PID: 986151 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/openvswitch.service

Mar 20 21:35:12 compute-1.localdomain systemd[1]: Starting Open vSwitch...
Mar 20 21:35:12 compute-1.localdomain systemd[1]: Started Open vSwitch.
Mar 20 21:35:16 compute-1.localdomain systemd[1]: openvswitch.service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.
Mar 20 21:37:57 compute-1.localdomain systemd[1]: openvswitch.service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.
[root@compute-1 log]#
Comment 16 Nilesh 2017-03-20 08:15:50 EDT
I modified the file below:

vi /etc/systemd/system/multi-user.target.wants/openvswitch.service

~~~
[Unit]
Description=Open vSwitch
After=syslog.target network.target openvswitch-nonetwork.service
Requires=openvswitch-nonetwork.service

Requires=ovsdb-server.service  <<<<< Added 
Requires=ovs-vswitchd.service  <<<<< Added

[Service]
Type=oneshot
ExecStart=/bin/true
ExecStop=/bin/true
RemainAfterExit=yes
Restart=on-failure    <<<<< Added


[Install]
WantedBy=multi-user.target

~~~     


ovs_version: "2.5.0"

rpm -qa |grep systemd
systemd-219-30.el7_3.6.x86_64
systemd-libs-219-30.el7_3.6.x86_64
systemd-sysv-219-30.el7_3.6.x86_64


RHOSP-10



Comment 17 Eelco Chaudron 2017-03-20 08:31:29 EDT
Hi Nilesh,

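You do not need to add Restart=on-failure to the openvswitch.service file. That unit is Type=oneshot, and systemd refuses any Restart= setting other than no for oneshot services, which is the error in your log. Add it only to the ovs-vswitchd.service and ovsdb-server.service files.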

See upstream patch: https://mail.openvswitch.org/pipermail/ovs-dev/2017-February/328546.html
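
The rule behind the error can be checked before reloading systemd. Below is a minimal Python sketch that mirrors the message printed by systemd 219 in the log above; other systemd versions may apply a different rule. The parser is deliberately simplistic, restart_allowed is a hypothetical helper, the default of Type=simple for a missing Type= is an assumption of this sketch, and the DAEMON_UNIT text is an illustrative fragment rather than the shipped ovs-vswitchd.service.

```python
def parse_service_section(unit_text):
    """Collect key/value pairs from the [Service] section of a unit file."""
    settings = {}
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Service" and "=" in line:
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings


def restart_allowed(unit_text):
    """Return False for Type=oneshot combined with Restart= other than no."""
    service = parse_service_section(unit_text)
    service_type = service.get("Type", "simple")  # assumed default
    restart = service.get("Restart", "no")
    return not (service_type == "oneshot" and restart != "no")


# The [Service] section from comment 16, without the "<<<<< Added" markers.
OPENVSWITCH_UNIT = """\
[Service]
Type=oneshot
ExecStart=/bin/true
ExecStop=/bin/true
RemainAfterExit=yes
Restart=on-failure
"""

# Hypothetical daemon unit fragment, for illustration only.
DAEMON_UNIT = """\
[Service]
Type=forking
Restart=on-failure
"""

if __name__ == "__main__":
    print("openvswitch.service ok:", restart_allowed(OPENVSWITCH_UNIT))
    print("daemon unit ok:", restart_allowed(DAEMON_UNIT))
```

The first unit is rejected for the same reason systemd reports; the second, a long-running daemon type, passes the check.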
