Description of problem:
When ovs-vswitchd segfaults for some reason, the monitor thread is responsible for starting it back to get the service online. However, when the bridge includes a DPDK port, the restart doesn't work.

2016-02-12T13:19:20.142Z|00003|daemon_unix(monitor)|ERR|1 crashes: pid 78009 died, killed (Segmentation fault), core dumped, restarting
2016-02-12T13:19:20.172Z|00004|ovs_numa|INFO|Discovered 24 CPU cores on NUMA node 0
2016-02-12T13:19:20.172Z|00005|ovs_numa|INFO|Discovered 1 NUMA nodes and 24 CPU cores
2016-02-12T13:19:20.172Z|00006|memory|INFO|108952 kB peak resident set size after 47.5 seconds
2016-02-12T13:19:20.172Z|00007|reconnect|INFO|unix:/usr/local/var/run/openvswitch/db.sock: connecting...
2016-02-12T13:19:20.172Z|00008|reconnect|INFO|unix:/usr/local/var/run/openvswitch/db.sock: connected
2016-02-12T13:19:20.183Z|00009|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports recirculation
2016-02-12T13:19:20.183Z|00010|ofproto_dpif|INFO|netdev@ovs-netdev: MPLS label stack length probed as 3
2016-02-12T13:19:20.183Z|00011|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports unique flow ids
2016-02-12T13:19:20.183Z|00012|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath does not support ct_state
2016-02-12T13:19:20.183Z|00013|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath does not support ct_zone
2016-02-12T13:19:20.183Z|00014|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath does not support ct_mark
2016-02-12T13:19:20.183Z|00015|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath does not support ct_label
2016-02-12T13:19:20.189Z|00016|bridge|WARN|could not open network device dpdk1 (Cannot allocate memory)
2016-02-12T13:19:20.189Z|00017|bridge|WARN|could not open network device dpdk0 (Cannot allocate memory)
2016-02-12T13:19:20.200Z|00018|bridge|INFO|bridge ovsbr0: added interface ovsbr0 on port 65534
2016-02-12T13:19:20.201Z|00019|bridge|INFO|bridge ovsbr0: using datapath ID 000006b9d7c27d4f
2016-02-12T13:19:20.201Z|00020|connmgr|INFO|ovsbr0: added service controller "punix:/usr/local/var/run/openvswitch/ovsbr0.mgmt"
2016-02-12T13:19:20.227Z|00021|bridge|WARN|could not open network device dpdk1 (Cannot allocate memory)
2016-02-12T13:19:20.228Z|00022|bridge|WARN|could not open network device dpdk0 (Cannot allocate memory)
2016-02-12T13:19:20.229Z|00023|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.5.0

Version-Release number of selected component (if applicable):
2.5.0

How reproducible:
Always

Steps to Reproduce:
1. do something that causes an OVS thread to segfault
2. watch the monitor thread failing to restart OVS

Expected results:
The monitor thread should be able to restart the service.
For physical ports, this has been fixed in the upstream development branches, probably partly in DPDK and partly in OVS. I haven't dug out the exact commits yet, and the stable branch situation needs testing too.

What does not work in upstream OVS is restarting vhostuser ports. They fail with:

VHOST_CONFIG: socket created, fd:50
VHOST_CONFIG: fail to bind fd:50: remove file:/var/run/openvswitch/<path> and try again.

The vhostuser sockets are registered for cleanup on fatal signals, but the problem is that lib/fatal-signal.c only treats SIGTERM, SIGINT, SIGHUP, and SIGALRM as fatal. So the file cleanup never occurs on actual crashes such as a segmentation fault, the socket files are left behind, and that's why the vhostuser ports fail to bind on restart.
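The bind failure itself is easy to reproduce outside OVS. The following is a minimal sketch in Python, not OVS or DPDK code: it assumes a Linux host, and the helper names and the socket file name are made up for illustration. It shows that a Unix socket file left behind by a process that never ran its cleanup makes the next bind() on the same path fail until the file is removed.

```python
import os
import socket
import tempfile


def bind_unix(path):
    """Create a Unix stream socket and bind it to path (hypothetical helper)."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError:
        sock.close()
        raise
    return sock


def demo():
    tmpdir = tempfile.mkdtemp()
    # Hypothetical socket name; it stands in for a vhostuser socket file.
    path = os.path.join(tmpdir, "v.sock")

    # First "run": bind, then go away without unlinking the file,
    # as happens when a process crashes and no cleanup hook runs.
    first = bind_unix(path)
    first.close()
    stale = os.path.exists(path)

    # "Restart": binding the same path fails while the stale file exists.
    try:
        bind_unix(path).close()
        rebind_failed = False
    except OSError:
        rebind_failed = True

    # Removing the stale file is what lets the bind succeed again.
    os.unlink(path)
    second = bind_unix(path)
    second.close()
    os.unlink(path)
    os.rmdir(tmpdir)
    return stale, rebind_failed


if __name__ == "__main__":
    print(demo())
```

This is the same situation the VHOST_CONFIG message describes: the restart can only bind once the leftover file has been removed, either by a cleanup hook that also runs on crashes or by hand.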
Sent a patch upstream to restart ovsdb-server or ovs-vswitchd on failure:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-February/328546.html
The changes have been accepted:
https://github.com/openvswitch/ovs/commit/c19bf36d848cbdf755c6760fad1726c95e4377f1
https://github.com/openvswitch/ovs/commit/090cc60c08a513047cf0fcc8c7c63ffb42e8fef9

They will be available in the next 2.7 release, probably 2.7.1.
I modified the file as per comment #10 and hit the error below.

[root@compute-1 log]# systemctl daemon-reload
[root@compute-1 log]# systemctl restart openvswitch
Failed to restart openvswitch.service: Unit is not loaded properly: Invalid argument.
See system logs and 'systemctl status openvswitch.service' for details.
[root@compute-1 log]# systemctl restart openvswitch
Failed to restart openvswitch.service: Unit is not loaded properly: Invalid argument.
See system logs and 'systemctl status openvswitch.service' for details.
[root@compute-1 log]# systemctl status openvswitch.service -l
● openvswitch.service - Open vSwitch
   Loaded: error (Reason: Invalid argument)
   Active: active (exited) since Mon 2017-03-20 21:35:12 +03; 3min 22s ago
 Main PID: 986151 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/openvswitch.service

Mar 20 21:35:12 compute-1.localdomain systemd[1]: Starting Open vSwitch...
Mar 20 21:35:12 compute-1.localdomain systemd[1]: Started Open vSwitch.
Mar 20 21:35:16 compute-1.localdomain systemd[1]: openvswitch.service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.
Mar 20 21:37:57 compute-1.localdomain systemd[1]: openvswitch.service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.
[root@compute-1 log]#
I modified the file below:

vi /etc/systemd/system/multi-user.target.wants/openvswitch.service

~~~
[Unit]
Description=Open vSwitch
After=syslog.target network.target openvswitch-nonetwork.service
Requires=openvswitch-nonetwork.service
Requires=ovsdb-server.service <<<<< Added
Requires=ovs-vswitchd.service <<<<< Added

[Service]
Type=oneshot
ExecStart=/bin/true
ExecStop=/bin/true
RemainAfterExit=yes
Restart=on-failure <<<<< Added

[Install]
WantedBy=multi-user.target
~~~

ovs_version: "2.5.0"

rpm -qa |grep systemd
systemd-219-30.el7_3.6.x86_64
systemd-libs-219-30.el7_3.6.x86_64
systemd-sysv-219-30.el7_3.6.x86_64

RHOSP-10

[root@compute-1 log]# systemctl daemon-reload
[root@compute-1 log]# systemctl restart openvswitch
Failed to restart openvswitch.service: Unit is not loaded properly: Invalid argument.
See system logs and 'systemctl status openvswitch.service' for details.
[root@compute-1 log]# systemctl restart openvswitch
Failed to restart openvswitch.service: Unit is not loaded properly: Invalid argument.
See system logs and 'systemctl status openvswitch.service' for details.
[root@compute-1 log]# systemctl status openvswitch.service -l
● openvswitch.service - Open vSwitch
   Loaded: error (Reason: Invalid argument)
   Active: active (exited) since Mon 2017-03-20 21:35:12 +03; 3min 22s ago
 Main PID: 986151 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/openvswitch.service

Mar 20 21:35:12 compute-1.localdomain systemd[1]: Starting Open vSwitch...
Mar 20 21:35:12 compute-1.localdomain systemd[1]: Started Open vSwitch.
Mar 20 21:35:16 compute-1.localdomain systemd[1]: openvswitch.service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.
Mar 20 21:37:57 compute-1.localdomain systemd[1]: openvswitch.service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.
[root@compute-1 log]#
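Hi Nilesh,

You should not add Restart=on-failure to the openvswitch.service file. That unit is Type=oneshot, and systemd refuses any Restart= setting other than "no" for oneshot services, which is the error in your log. Add Restart=on-failure only to the ovs-vswitchd.service and ovsdb-server.service files.

See the upstream patch: https://mail.openvswitch.org/pipermail/ovs-dev/2017-February/328546.html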
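As a rough sketch of where the setting would go, assuming the installed package ships ovs-vswitchd.service and ovsdb-server.service as separate units: a systemd drop-in avoids editing the packaged unit files directly. The drop-in directory and file name below are assumptions for illustration. The only setting taken from this thread is Restart=on-failure.

```ini
# Hypothetical drop-in: /etc/systemd/system/ovs-vswitchd.service.d/restart.conf
# An equivalent drop-in would be needed for ovsdb-server.service.
[Service]
Restart=on-failure
```

After adding the drop-ins, run systemctl daemon-reload. openvswitch.service itself stays Type=oneshot with no Restart= line.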