Bug 1307025

Summary: Open vSwitch service resilience test
Product: Red Hat Enterprise Linux 7
Reporter: Flavio Leitner <fleitner>
Component: openvswitch
Assignee: Eelco Chaudron <echaudro>
Status: CLOSED UPSTREAM
QA Contact: ovs-qe
Severity: high
Docs Contact:
Priority: high
Version: 7.3
CC: aconole, atragler, echaudro, fbaudin, fleitner, nchandek
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-28 08:33:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1335865
Bug Blocks:

Description Flavio Leitner 2016-02-12 13:29:49 UTC
Description of problem:

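When ovs-vswitchd segfaults for some reason, the monitor thread is responsible for restarting it to bring the service back online. However, when the bridge includes a DPDK port, the restart doesn't work: the restarted daemon fails to reopen the DPDK ports with "Cannot allocate memory", as the log below shows.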

2016-02-12T13:19:20.142Z|00003|daemon_unix(monitor)|ERR|1 crashes: pid 78009 died, killed (Segmentation fault), core dumped, restarting
2016-02-12T13:19:20.172Z|00004|ovs_numa|INFO|Discovered 24 CPU cores on NUMA node 0
2016-02-12T13:19:20.172Z|00005|ovs_numa|INFO|Discovered 1 NUMA nodes and 24 CPU cores
2016-02-12T13:19:20.172Z|00006|memory|INFO|108952 kB peak resident set size after 47.5 seconds
2016-02-12T13:19:20.172Z|00007|reconnect|INFO|unix:/usr/local/var/run/openvswitch/db.sock: connecting...
2016-02-12T13:19:20.172Z|00008|reconnect|INFO|unix:/usr/local/var/run/openvswitch/db.sock: connected
2016-02-12T13:19:20.183Z|00009|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports recirculation
2016-02-12T13:19:20.183Z|00010|ofproto_dpif|INFO|netdev@ovs-netdev: MPLS label stack length probed as 3
2016-02-12T13:19:20.183Z|00011|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports unique flow ids
2016-02-12T13:19:20.183Z|00012|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath does not support ct_state
2016-02-12T13:19:20.183Z|00013|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath does not support ct_zone
2016-02-12T13:19:20.183Z|00014|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath does not support ct_mark
2016-02-12T13:19:20.183Z|00015|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath does not support ct_label
2016-02-12T13:19:20.189Z|00016|bridge|WARN|could not open network device dpdk1 (Cannot allocate memory)
2016-02-12T13:19:20.189Z|00017|bridge|WARN|could not open network device dpdk0 (Cannot allocate memory)
2016-02-12T13:19:20.200Z|00018|bridge|INFO|bridge ovsbr0: added interface ovsbr0 on port 65534
2016-02-12T13:19:20.201Z|00019|bridge|INFO|bridge ovsbr0: using datapath ID 000006b9d7c27d4f
2016-02-12T13:19:20.201Z|00020|connmgr|INFO|ovsbr0: added service controller "punix:/usr/local/var/run/openvswitch/ovsbr0.mgmt"
2016-02-12T13:19:20.227Z|00021|bridge|WARN|could not open network device dpdk1 (Cannot allocate memory)
2016-02-12T13:19:20.228Z|00022|bridge|WARN|could not open network device dpdk0 (Cannot allocate memory)
2016-02-12T13:19:20.229Z|00023|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.5.0



Version-Release number of selected component (if applicable):
2.5.0


How reproducible:
Always


Steps to Reproduce:
1. Do something that causes an OVS thread to segfault.
2. Watch the monitor thread fail to restart OVS.

Expected results:
The monitor thread should be able to restart the service.

Comment 1 Panu Matilainen 2016-06-23 11:19:15 UTC
For physical ports this has been fixed in the upstream development branches, probably partly in DPDK and partly in OVS. I haven't dug out the exact commits (yet), and the stable branch situation needs testing too.

What does not work in upstream OVS is restarting vhostuser ports; they fail with:
VHOST_CONFIG: socket created, fd:50
VHOST_CONFIG: fail to bind fd:50: remove file:/var/run/openvswitch/<path> and try again.

The vhostuser sockets are registered for cleanup on fatal signals, but the problem is that lib/fatal-signal.c treats only { SIGTERM, SIGINT, SIGHUP, SIGALRM } as fatal signals. So the file cleanup never occurs on actual crashes such as a segmentation fault, the stale socket file is left behind, and that's why the vhostuser ports fail to bind on restart.
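
To illustrate the failure mode, here is a minimal Python sketch. It is not OVS or DPDK code, and it is not the fix that went upstream. The helper name bind_unix_socket and the socket path used with it are hypothetical. It shows one common defensive pattern, assuming the caller owns the path: remove a leftover socket file before binding, so that a restart after a crash does not depend on the signal-time cleanup having run.

```python
import os
import socket
import stat


def bind_unix_socket(path):
    """Bind a listening AF_UNIX socket at `path` (hypothetical helper).

    If a socket file already exists at `path`, it is assumed to be a
    leftover from a previous process that died without cleaning up, and it
    is removed first. Anything that is not a socket is left untouched, so
    bind() will still fail rather than delete an unrelated file.
    """
    try:
        if stat.S_ISSOCK(os.lstat(path).st_mode):
            os.unlink(path)  # stale socket from an earlier run
    except FileNotFoundError:
        pass  # nothing to clean up

    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(1)
    return server
```

Note that this assumes no other live process is still using the path. Unlinking the socket of a running server would silently cut it off from new clients, so the pattern only fits when a single service owns the socket directory.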

Comment 11 Eelco Chaudron 2017-02-07 12:47:41 UTC
Sent a patch upstream to restart ovsdb-server or ovs-vswitchd on failure:

https://mail.openvswitch.org/pipermail/ovs-dev/2017-February/328546.html

Comment 12 Eelco Chaudron 2017-02-28 08:32:30 UTC
The changes have been accepted:

https://github.com/openvswitch/ovs/commit/c19bf36d848cbdf755c6760fad1726c95e4377f1
https://github.com/openvswitch/ovs/commit/090cc60c08a513047cf0fcc8c7c63ffb42e8fef9

They will be available in the next 2.7 release, probably 2.7.1.

Comment 14 Nilesh 2017-03-20 10:53:43 UTC
I modified the file as per comment #10 and hit the error below.


[root@compute-1 log]# systemctl daemon-reload
[root@compute-1 log]# systemctl restart openvswitch
Failed to restart openvswitch.service: Unit is not loaded properly: Invalid argument.
See system logs and 'systemctl status openvswitch.service' for details.
[root@compute-1 log]# systemctl restart openvswitch
Failed to restart openvswitch.service: Unit is not loaded properly: Invalid argument.
See system logs and 'systemctl status openvswitch.service' for details.
[root@compute-1 log]# systemctl status openvswitch.service -l
● openvswitch.service - Open vSwitch
   Loaded: error (Reason: Invalid argument)
   Active: active (exited) since Mon 2017-03-20 21:35:12 +03; 3min 22s ago
 Main PID: 986151 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/openvswitch.service

Mar 20 21:35:12 compute-1.localdomain systemd[1]: Starting Open vSwitch...
Mar 20 21:35:12 compute-1.localdomain systemd[1]: Started Open vSwitch.
Mar 20 21:35:16 compute-1.localdomain systemd[1]: openvswitch.service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.
Mar 20 21:37:57 compute-1.localdomain systemd[1]: openvswitch.service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.
[root@compute-1 log]#

Comment 16 Nilesh 2017-03-20 12:15:50 UTC
I modified the file below:

vi /etc/systemd/system/multi-user.target.wants/openvswitch.service

~~~
[Unit]
Description=Open vSwitch
After=syslog.target network.target openvswitch-nonetwork.service
Requires=openvswitch-nonetwork.service

Requires=ovsdb-server.service  <<<<< Added 
Requires=ovs-vswitchd.service  <<<<< Added

[Service]
Type=oneshot
ExecStart=/bin/true
ExecStop=/bin/true
RemainAfterExit=yes
Restart=on-failure    <<<<< Added


[Install]
WantedBy=multi-user.target

~~~     


ovs_version: "2.5.0"

rpm -qa |grep systemd
systemd-219-30.el7_3.6.x86_64
systemd-libs-219-30.el7_3.6.x86_64
systemd-sysv-219-30.el7_3.6.x86_64


RHOSP-10



[root@compute-1 log]# systemctl daemon-reload
[root@compute-1 log]# systemctl restart openvswitch
Failed to restart openvswitch.service: Unit is not loaded properly: Invalid argument.
See system logs and 'systemctl status openvswitch.service' for details.
[root@compute-1 log]# systemctl restart openvswitch
Failed to restart openvswitch.service: Unit is not loaded properly: Invalid argument.
See system logs and 'systemctl status openvswitch.service' for details.
[root@compute-1 log]# systemctl status openvswitch.service -l
● openvswitch.service - Open vSwitch
   Loaded: error (Reason: Invalid argument)
   Active: active (exited) since Mon 2017-03-20 21:35:12 +03; 3min 22s ago
 Main PID: 986151 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/openvswitch.service

Mar 20 21:35:12 compute-1.localdomain systemd[1]: Starting Open vSwitch...
Mar 20 21:35:12 compute-1.localdomain systemd[1]: Started Open vSwitch.
Mar 20 21:35:16 compute-1.localdomain systemd[1]: openvswitch.service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.
Mar 20 21:37:57 compute-1.localdomain systemd[1]: openvswitch.service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.
[root@compute-1 log]#

Comment 17 Eelco Chaudron 2017-03-20 12:31:29 UTC
Hi Nilesh,

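You should not add Restart=on-failure to the openvswitch.service file, only to the ovs-vswitchd.service and ovsdb-server.service files. openvswitch.service is Type=oneshot, and systemd refuses any Restart= setting other than "no" for oneshot services, which is the "Refusing" error in your log.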

See upstream patch: https://mail.openvswitch.org/pipermail/ovs-dev/2017-February/328546.html
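
A minimal sketch of that change, written as systemd drop-ins instead of edits to the packaged unit files. It assumes the installed package ships ovs-vswitchd.service and ovsdb-server.service as separate units that are not Type=oneshot. The drop-in file name restart.conf is arbitrary and is not taken from the upstream patch.

~~~
# /etc/systemd/system/ovs-vswitchd.service.d/restart.conf
# Assumption: ovs-vswitchd.service exists and is not Type=oneshot.
[Service]
Restart=on-failure

# /etc/systemd/system/ovsdb-server.service.d/restart.conf
# Assumption: ovsdb-server.service exists and is not Type=oneshot.
[Service]
Restart=on-failure
~~~

After creating the two files, run systemctl daemon-reload so that systemd picks them up. Remove the Restart=on-failure line from openvswitch.service first, otherwise that unit will still fail to load.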