Description of problem:
The Open vSwitch process (ovs-vswitchd) is not restarted after being killed about 20% of the time.

Version-Release number of selected component (if applicable):
OSP 14, 3 controllers + 3 computes + DVR

How reproducible:
Roughly 20% of kills.

Steps to Reproduce:
1. Install a bare-metal host with RHEL (The Foreman).
2. Install OSP 14 with Jenkins: 3 controllers + 3 computes + DVR.
3. Create several VMs, each with a FIP.
4. Kill the openvswitch process on a compute node (see the sketch after the logs below).

Actual results:
The process is not restarted about 20% of the time.

Expected results:
The process is restarted every time.

Additional info:
Logs from a run where the Open vSwitch restart failed:

[root@compute-2 heat-admin]# tail -f /var/log/containers/neutron/metadata-agent.log
2018-11-26 08:51:10.883 8224 INFO eventlet.wsgi.server [-] 10.2.0.18,<local> "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" status: 200 len: 139 time: 0.1934390
2018-11-26 08:51:10.968 8223 INFO eventlet.wsgi.server [-] 10.3.0.25,<local> "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" status: 200 len: 139 time: 0.1648369
2018-11-26 09:11:20.883 8224 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: IOError: Socket closed
2018-11-26 09:11:34.307 7137 ERROR oslo.messaging._drivers.impl_rabbit [-] [357cec88-105c-4f2b-b26c-70ae3864ec0f] AMQP server controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:11:35.323 7137 INFO oslo.messaging._drivers.impl_rabbit [-] [357cec88-105c-4f2b-b26c-70ae3864ec0f] Reconnected to AMQP server on controller-2.internalapi.localdomain:5672 via [amqp] client with port 50096.
2018-11-26 09:11:43.938 8223 ERROR oslo.messaging._drivers.impl_rabbit [-] [24405812-662e-4c40-b834-37a07d80366f] AMQP server controller-1.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:11:44.955 8223 INFO oslo.messaging._drivers.impl_rabbit [-] [24405812-662e-4c40-b834-37a07d80366f] Reconnected to AMQP server on controller-1.internalapi.localdomain:5672 via [amqp] client with port 37994.
2018-11-26 09:11:45.218 8224 ERROR oslo.messaging._drivers.impl_rabbit [-] [7bf56d8a-afcf-4b87-b2dd-c865c2faf08f] AMQP server controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:11:46.235 8224 INFO oslo.messaging._drivers.impl_rabbit [-] [7bf56d8a-afcf-4b87-b2dd-c865c2faf08f] Reconnected to AMQP server on controller-2.internalapi.localdomain:5672 via [amqp] client with port 50116.
2018-11-26 09:11:59.117 8223 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: IOError: Socket closed

vi /var/log/containers/neutron/openvswitch-agent.log
2018-11-26 09:12:04.417 7137 ERROR oslo.messaging._drivers.impl_rabbit [-] [8f9c38d5-3ad8-4cab-8446-a2f07e6d370f] AMQP server controller-1.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:12:05.444 7137 INFO oslo.messaging._drivers.impl_rabbit [-] [8f9c38d5-3ad8-4cab-8446-a2f07e6d370f] Reconnected to AMQP server on controller-1.internalapi.localdomain:5672 via [amqp] client with port 38042.
2018-11-26 09:12:15.386 8224 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer
2018-11-26 14:23:44.338 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     raise RuntimeError(m)
2018-11-26 14:23:44.338 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int RuntimeError: Switch connection timeout
2018-11-26 14:23:44.338 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int
2018-11-26 14:23:44.339 28480 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.
2018-11-26 14:23:44.340 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:623 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:30.020 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1875
2018-11-26 14:23:44.340 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Loop iteration exceeded interval (2 vs. 30.0197079182)! loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1882
2018-11-26 14:23:44.340 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:624 started rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2086
2018-11-26 14:24:14.353 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Switch connection timeout
2018-11-26 14:24:14.354 28480 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): DbGetCommand(column=datapath_id, table=Bridge, record=br-int) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-11-26 14:24:14.355 28480 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:121
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Failed to communicate with the switch: RuntimeError: Switch connection timeout
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int Traceback (most recent call last):
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/br_int.py", line 52, in check_canary_table
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     flows = self.dump_flows(constants.CANARY_TABLE)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py", line 156, in dump_flows
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     (dp, ofp, ofpp) = self._get_dp()
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_bridge.py", line 69, in _get_dp
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     self._cached_dpid = new_dpid
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     self.force_reraise()
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     six.reraise(self.type_, self.value, self.tb)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_bridge.py", line 52, in _get_dp
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     dp = self._get_dp_by_dpid(self._cached_dpid)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py", line 79, in _get_dp_by_dpid
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     raise RuntimeError(m)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int RuntimeError: Switch connection timeout
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int
2018-11-26 14:24:14.356 28480 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.
2018-11-26 14:24:14.356 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:624 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:30.016 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1875
2018-11-26 14:24:14.357 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Loop iteration exceeded interval (2 vs. 30.0161988735)! loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1882
2018-11-26 14:24:14.357 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:625 started rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2086

(overcloud) [stack@undercloud-0 ~]$ openstack versions show
+-------------+----------------+---------+------------+-----------------------------------+------------------+------------------+
| Region Name | Service Type   | Version | Status     | Endpoint                          | Min Microversion | Max Microversion |
+-------------+----------------+---------+------------+-----------------------------------+------------------+------------------+
| regionOne   | block-storage  | 2.0     | DEPRECATED | http://10.0.0.101:8776/v2/        | None             | None             |
| regionOne   | block-storage  | 3.0     | CURRENT    | http://10.0.0.101:8776/v3/        | 3.0              | 3.55             |
| regionOne   | placement      | None    | CURRENT    | http://10.0.0.101:8778/placement/ | None             | None             |
| regionOne   | network        | 2.0     | CURRENT    | http://10.0.0.101:9696/v2.0/      | None             | None             |
| regionOne   | alarm          | 2.0     | CURRENT    | http://10.0.0.101:8042/v2         | None             | None             |
| regionOne   | cloudformation | 1.0     | CURRENT    | http://10.0.0.101:8000/v1/        | None             | None             |
| regionOne   | event          | 2.0     | CURRENT    | http://10.0.0.101:8977/v2         | None             | None             |
| regionOne   | orchestration  | 1.0     | CURRENT    | http://10.0.0.101:8004/v1/        | None             | None             |
| regionOne   | object-store   | 1.0     | CURRENT    | http://10.0.0.101:8080/v1/        | None             | None             |
| regionOne   | compute        | 2.0     | SUPPORTED  | http://10.0.0.101:8774/v2/        | None             | None             |
| regionOne   | compute        | 2.1     | CURRENT    | http://10.0.0.101:8774/v2.1/      | 2.1              | 2.65             |
| regionOne   | image          | 2.0     | SUPPORTED  | http://10.0.0.101:9292/v2/        | None             | None             |
| regionOne   | image          | 2.1     | SUPPORTED  | http://10.0.0.101:9292/v2/        | None             | None             |
| regionOne   | image          | 2.2     | SUPPORTED  | http://10.0.0.101:9292/v2/        | None             | None             |
| regionOne   | image          | 2.3     | SUPPORTED  | http://10.0.0.101:9292/v2/        | None             | None             |
| regionOne   | image          | 2.4     | SUPPORTED  | http://10.0.0.101:9292/v2/        | None             | None             |
| regionOne   | image          | 2.5     | SUPPORTED  | http://10.0.0.101:9292/v2/        | None             | None             |
| regionOne   | image          | 2.6     | SUPPORTED  | http://10.0.0.101:9292/v2/        | None             | None             |
| regionOne   | image          | 2.7     | CURRENT    | http://10.0.0.101:9292/v2/        | None             | None             |
| regionOne   | metric         | 1.0     | CURRENT    | http://10.0.0.101:8041/v1/        | None             | None             |
| regionOne   | identity       | 3.10    | CURRENT    | http://10.0.0.101:5000/v3/        | None             | None             |
+-------------+----------------+---------+------------+-----------------------------------+------------------+------------------+
(overcloud) [stack@undercloud-0 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)
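For reference, the fault injection in step 4 amounts to something like the following (a minimal sketch; the exact kill command used in later testing appears in a comment below, and the 10-second settle time here is an assumption, not from the original report):

# Simulate a crash of the forwarding daemon on the compute node,
# then check whether systemd brought it back.
pkill -9 -x ovs-vswitchd
sleep 10
pgrep -x ovs-vswitchd >/dev/null && echo "restarted" || echo "NOT restarted"
systemctl status ovs-vswitchd    # systemd's view of the unit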
Can you attach the agent logs here? Thanks.
This was reproduced on OSP 13 as well, so clearing the blocker flag based on that. Will continue to investigate and/or reassign to the systemd component for further triage.
(In reply to Brian Haley from comment #8)

Check the journal to see whether systemd is complaining about the service, or is not even trying to run it again. The ovs-vswitchd.log might also shed some light. We need to find out whether ovs-vswitchd is failing to start again, or whether systemd is unhappy with the number or rate of restarts.

Thanks,
fbl
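A minimal sketch of those checks (standard journalctl/systemctl invocations; the 30-minute window is arbitrary):

# What did systemd see and decide for the units?
journalctl -u ovs-vswitchd -u ovsdb-server --since "30 min ago"
# Exit code/status of the last main process, and the unit's last result.
systemctl show ovs-vswitchd -p ExecMainCode -p ExecMainStatus -p Result
# The daemon's own log, as suggested above.
tail -n 100 /var/log/openvswitch/ovs-vswitchd.log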
Resetting to component 'openvswitch' per request from Candido.
To recap the problem: I have restarted the investigation of this bug and found a fix, or at least a configuration change that avoids the issue, by changing the systemd options for OVS:

vi /usr/lib/systemd/system/ovs-vswitchd.service

[Unit]
Description=Open vSwitch Forwarding Unit
After=ovsdb-server.service network-pre.target systemd-udev-settle.service
Before=network.target network.service
Requires=ovsdb-server.service
ReloadPropagatedFrom=ovsdb-server.service
AssertPathIsReadWrite=/var/run/openvswitch/db.sock
PartOf=openvswitch.service

[Service]
Type=forking
Restart=on-failure
Environment=HOME=/var/run/openvswitch
EnvironmentFile=/etc/openvswitch/default.conf
EnvironmentFile=-/etc/sysconfig/openvswitch
EnvironmentFile=-/run/openvswitch/useropts
ExecStartPre=-/bin/sh -c '/usr/bin/chown :$${OVS_USER_ID##*:} /dev/hugepages'
ExecStartPre=-/usr/bin/chmod 0775 /dev/hugepages
ExecStart=/usr/share/openvswitch/scripts/ovs-ctl \
          --no-ovsdb-server --no-monitor --system-id=random \
          ${OVSUSER} \
          start $OPTIONS
ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop
ExecReload=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server \
           --no-monitor --system-id=random \
           ${OVSUSER} \
           restart $OPTIONS
TimeoutSec=300

The change would be:

...
Restart=always
...

When OVS is not restarted, systemd doesn't detect that the daemon went down.

journalctl logs when OVS is not restarted:

...
Jan 09 15:38:44 compute-2 ntpd[20626]: new interface(s) found: waking up resolver
Jan 09 15:40:49 compute-2 ovs-appctl[174056]: ovs|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.173341.ctl
Jan 09 15:40:49 compute-2 ovs-ctl[174041]: 2019-01-09T15:40:49Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.173341.ctl
Jan 09 15:40:49 compute-2 ovs-ctl[174041]: ovs-appctl: cannot connect to "/var/run/openvswitch/ovs-vswitchd.173341.ctl" (Connection refused)
Jan 09 15:40:57 compute-2 dockerd-current[17598]: time="2019-01-09T10:40:57.261191005-05:00" level=warning msg="containerd: unable to save 5b342b15750ff4386ea85ffb94324b60c93129e224dc0539590af18a95d8d476:13d0201ec590555b629627b295601926cb29caf217a7371dab4c875c0b74e461 starttime: read /proc/174173/stat: no such process"
...

and when it is restarted:

...
Jan 09 16:09:43 compute-2 systemd[1]: ovs-vswitchd.service: main process exited, code=killed, status=11/SEGV
Jan 09 16:09:43 compute-2 ovs-appctl[183633]: ovs|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.182730.ctl
Jan 09 16:09:43 compute-2 ovs-ctl[183618]: 2019-01-09T16:09:43Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.182730.ctl
Jan 09 16:09:43 compute-2 ovs-ctl[183618]: ovs-appctl: cannot connect to "/var/run/openvswitch/ovs-vswitchd.182730.ctl" (Connection refused)
Jan 09 16:09:43 compute-2 systemd[1]: Unit ovs-vswitchd.service entered failed state.
Jan 09 16:09:43 compute-2 systemd[1]: ovs-vswitchd.service failed.
Jan 09 16:09:43 compute-2 systemd[1]: ovs-vswitchd.service holdoff time over, scheduling restart.
Jan 09 16:09:43 compute-2 systemd[1]: Stopped Open vSwitch Forwarding Unit.
-- Subject: Unit ovs-vswitchd.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ovs-vswitchd.service has finished shutting down.
Jan 09 16:09:43 compute-2 systemd[1]: Starting Open vSwitch Forwarding Unit...
-- Subject: Unit ovs-vswitchd.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ovs-vswitchd.service has begun starting up.
Jan 09 16:09:43 compute-2 kernel: device vxlan_sys_4789 left promiscuous mode
Jan 09 16:09:43 compute-2 NetworkManager[3075]: <info>  [1547050183.9576] device (vxlan_sys_4789): state change: disconnected -> unmanaged (reason 'unmanaged', sys-iface-state: 'removed')
Jan 09 16:09:43 compute-2 kernel: device vxlan_sys_4789 entered promiscuous mode
Jan 09 16:09:43 compute-2 NetworkManager[3075]: <info>  [1547050183.9803] manager: (vxlan_sys_4789): new Vxlan device (/org/freedesktop/NetworkManager/Devices/52)
Jan 09 16:09:43 compute-2 NetworkManager[3075]: <info>  [1547050183.9805] device (vxlan_sys_4789): enslaved to non-master-type device ovs-system; ignoring
Jan 09 16:09:43 compute-2 NetworkManager[3075]: <info>  [1547050183.9948] device (vxlan_sys_4789): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
Jan 09 16:09:43 compute-2 NetworkManager[3075]: <info>  [1547050183.9962] device (vxlan_sys_4789): enslaved to non-master-type device ovs-system; ignoring
Jan 09 16:09:43 compute-2 NetworkManager[3075]: <info>  [1547050183.9962] device (vxlan_sys_4789): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'external')
Jan 09 16:09:44 compute-2 ovs-ctl[183642]: Starting ovs-vswitchd [  OK  ]
Jan 09 16:09:44 compute-2 ovs-vsctl[183710]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . external-ids:hostname=compute-2.localdomain
Jan 09 16:09:44 compute-2 systemd[1]: Started Open vSwitch Forwarding Unit.
-- Subject: Unit ovs-vswitchd.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ovs-vswitchd.service has finished starting up.
--
-- The start-up result is done.
Jan 09 16:09:44 compute-2 ovs-ctl[183642]: Enabling remote OVSDB managers [  OK  ]
...
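As an aside, if anyone wants to experiment with the Restart= change without editing the packaged unit file (which an RPM update would overwrite), the usual systemd mechanism is a drop-in override; a sketch (the drop-in file name is arbitrary, and the trade-offs of Restart=always are discussed in the next comment):

mkdir -p /etc/systemd/system/ovs-vswitchd.service.d
cat > /etc/systemd/system/ovs-vswitchd.service.d/restart.conf <<'EOF'
[Service]
Restart=always
EOF
systemctl daemon-reload    # make systemd pick up the override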
Hi,

To stop the service, the systemd unit uses the ovs-appctl tool to signal the daemon to exit gracefully. Therefore, if you set Restart=always, scripting around the daemon or managing it with ovs-appctl becomes impossible, because whenever you signal the daemon to exit, systemd will restart it. That's why the unit is restricted to on-failure only.

To add more information: userspace tools talk to the daemon over the unix sockets at /var/run/openvswitch/*.ctl. When you see "connection refused" errors, it is most probably because the daemon is not running.

Things to keep in mind:
* systemd has restart-loop protection that might be preventing the service from being restarted after a few attempts.
* systemd might not be getting the right exit code, or ovs-vswitchd might not be exiting with an error, in which case systemd does not restart it.
* systemd might not even know that the daemon got killed (a lost notification somehow).

My suggestion is to increase the verbosity level in systemd and check what it is receiving when the issue reproduces. Also, you could replace OVS with another simple program to rule OVS out of the equation: just write a script that sleeps forever and exits with an error when it gets killed.

fbl
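A minimal sketch of that last experiment; all names here (dummy-ovs, the script path) are made up for illustration:

cat > /usr/local/bin/dummy-ovs.sh <<'EOF'
#!/bin/bash
# Sleep forever; exit non-zero when signalled so systemd sees a failure.
trap 'exit 1' TERM INT
while :; do sleep 3600 & wait $!; done
EOF
chmod +x /usr/local/bin/dummy-ovs.sh

cat > /etc/systemd/system/dummy-ovs.service <<'EOF'
[Service]
ExecStart=/usr/local/bin/dummy-ovs.sh
Restart=on-failure
EOF
systemctl daemon-reload
systemctl start dummy-ovs

# Kill it repeatedly (SIGKILL cannot be trapped, so the unit fails with
# status=9/KILL, just like ovs-vswitchd in this bug) and watch whether
# systemd restarts it every time. If it always comes back, the problem
# is on the OVS side rather than in systemd.
pkill -9 -x dummy-ovs.sh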
ovs-vswitchd still sometimes fails to restart with openvswitch-2.9.0-122.el7fdp.x86_64:

[root@dell-per730-05 openvswitch]# rpm -q openvswitch
openvswitch-2.9.0-122.el7fdp.x86_64
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12221) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12864) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12938) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12943) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12948) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12953) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12958) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12963) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12968) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd
root     12973  5135  0 23:28 pts/2    00:00:00 grep --color=auto ovs-vswitchd
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd
root     12985  5135  0 23:28 pts/2    00:00:00 grep --color=auto ovs-vswitchd
[root@dell-per730-05 openvswitch]#
[root@dell-per730-05 openvswitch]# journalctl -r
-- Logs begin at Wed 2019-10-09 18:29:40 EDT, end at Wed 2019-10-09 23:32:22 EDT. --
Oct 09 23:32:22 dell-per730-05.rhts.eng.pek2.redhat.com dnsmasq-dhcp[3517]: DHCPACK(virbr0) 192.168.122.222 52:54:00:8d:72:5d
Oct 09 23:32:22 dell-per730-05.rhts.eng.pek2.redhat.com dnsmasq-dhcp[3517]: DHCPREQUEST(virbr0) 192.168.122.222 52:54:00:8d:72:5d
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: ovsdb-server.service failed.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Unit ovsdb-server.service entered failed state.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Job ovs-vswitchd.service/start failed with result 'dependency'.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Dependency failed for Open vSwitch Forwarding Unit.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Job openvswitch.service/start failed with result 'dependency'.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Dependency failed for Open vSwitch.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Failed to start Open vSwitch Database Unit.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: start request repeated too quickly for ovsdb-server.service
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Stopped Open vSwitch Database Unit.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com ovs-ctl[12914]: Exiting ovsdb-server (12818) [  OK  ]
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Stopping Open vSwitch Database Unit...
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Stopped Open vSwitch Forwarding Unit.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Stopped Open vSwitch.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Stopping Open vSwitch...
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: ovs-vswitchd.service holdoff time over, scheduling restart.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: ovs-vswitchd.service failed.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Unit ovs-vswitchd.service entered failed state.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com ovs-ctl[12892]: ovs-appctl: cannot connect to "/var/run/openvswitch/ovs-vswitchd.12873.ctl" (Connection refused)
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com ovs-ctl[12892]: 2019-10-10T03:28:04Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.12873.ctl
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com ovs-appctl[12907]: ovs|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.12873.ctl
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: ovs-vswitchd.service: main process exited, code=killed, status=9/KILL
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Started Open vSwitch.
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Starting Open vSwitch...
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Started Open vSwitch Forwarding Unit.
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-ctl[12835]: Enabling remote OVSDB managers [  OK  ]
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-vsctl[12880]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . external-ids:hostname=dell-per730-05.rhts.eng.pek2.redhat.com
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-ctl[12835]: Starting ovs-vswitchd [  OK  ]
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-vsctl[12829]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . external-ids:hostname=dell-per730-05.rhts.eng.pek2.redhat.com
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Starting Open vSwitch Forwarding Unit...
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Started Open vSwitch Database Unit.
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-ctl[12781]: Configuring Open vSwitch system IDs [  OK  ]
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-vsctl[12824]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . ovs-version=2.9.0 "external-ids:system-id=\"aeabac21-a598-4152-a9a9-04dad5736cae\"" "external-
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-vsctl[12819]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait -- init -- set Open_vSwitch . db-version=7.15.1
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-ctl[12781]: Starting ovsdb-server [  OK  ]
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com runuser[12815]: pam_unix(runuser:session): session closed for user openvswitch
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com runuser[12815]: pam_unix(runuser:session): session opened for user openvswitch by (uid=0)
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com runuser[12812]: pam_unix(runuser:session): session closed for user openvswitch
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com runuser[12812]: pam_unix(runuser:session): session opened for user openvswitch by (uid=0)
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Starting Open vSwitch Database Unit...
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Stopped Open vSwitch Database Unit.
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-ctl[12749]: Exiting ovsdb-server (12658) [  OK  ]
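The same reproduction can be scripted rather than typed by hand; a sketch (pkill avoids the grep-matching-itself noise above, and the iteration count and 5-second settle time are assumptions):

for i in $(seq 1 20); do
    pkill -9 -x ovs-vswitchd
    sleep 5
    pgrep -x ovs-vswitchd >/dev/null || { echo "iteration $i: ovs-vswitchd NOT restarted"; break; }
done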
Tested without and with the fix in openvswitch 2.9 packages on RHEL-7.7. As can be seen in the results below, the problem was readily observed without the fix in relatively few iterations, but was not seen at all across 500 iterations with the fix in place. Marking as Verified in openvswitch-2.9.0-122.el7fdp.

##############################################################

The following script was used to test for this issue:

[root@impairment1 ~]# cat test.sh
#!/bin/bash

iterations=${iterations:-"50"}
result_file="/home/result_file.txt"
rm -f $result_file

check_final_result() {
    if [[ $(grep FAIL $result_file) ]]; then
        echo "$count crash iterations run of $iterations total"
        echo "Overall Test Result: FAIL"
    else
        let count--
        echo "$count crash iterations run of $iterations total"
        echo "Overall Test Result: PASS"
    fi
}

test_systemctl_status_running() {
    service=$1
    systemctl_status=$(systemctl status $service | grep Active | awk '{print $2$3}')
    if [[ $service == "openvswitch" ]]; then
        if [[ "$systemctl_status" == "active(exited)" ]]; then
            echo "PASS" >> $result_file
        else
            echo "FAIL" >> $result_file
            check_final_result
            exit 1
        fi
    elif [[ $service == "ovs-vswitchd" ]] || [[ $service == "ovsdb-server" ]] && [[ "$systemctl_status" == "active(running)" ]]; then
        echo "PASS" >> $result_file
    else
        echo "FAIL" >> $result_file
        check_final_result
        exit 1
    fi
}

test_systemctl_status_stopped() {
    service=$1
    systemctl_status=$(systemctl status $service | grep Active | awk '{print $2$3}')
    if [[ $service == "openvswitch" ]] && [[ "$systemctl_status" == "inactive(dead)" ]]; then
        echo "PASS" >> $result_file
    else
        echo "FAIL" >> $result_file
        check_final_result
        exit 1
    fi
}

test_process_crashed_restarted() {
    process=$1
    vswitchd_pid1=$(pgrep vswitchd)
    ovsdb_pid1=$(pgrep ovsdb)
    if [[ $process == "vswitchd" ]]; then
        kill -SIGSEGV $vswitchd_pid1
    elif [[ $process == "ovsdb" ]]; then
        kill -SIGSEGV $ovsdb_pid1
    fi
    sleep 3
    vswitchd_pid2=$(pgrep vswitchd)
    ovsdb_pid2=$(pgrep ovsdb)
    # both daemons must be back, each with a new PID
    if [[ -z $vswitchd_pid2 ]] || [[ $vswitchd_pid2 == $vswitchd_pid1 ]]; then
        echo "FAIL" >> $result_file
        check_final_result
        exit 1
    else
        echo "PASS" >> $result_file
    fi
    if [[ -z $ovsdb_pid2 ]] || [[ $ovsdb_pid2 == $ovsdb_pid1 ]]; then
        echo "FAIL" >> $result_file
        check_final_result
        exit 1
    else
        echo "PASS" >> $result_file
    fi
}

test_process_started() {
    vswitchd_pid=$(pgrep vswitchd)
    ovsdb_pid=$(pgrep ovsdb)
    if [[ $vswitchd_pid ]]; then
        echo "PASS" >> $result_file
    else
        echo "FAIL" >> $result_file
        check_final_result
        exit 1
    fi
    if [[ $ovsdb_pid ]]; then
        echo "PASS" >> $result_file
    else
        echo "FAIL" >> $result_file
        check_final_result
        exit 1
    fi
}

test_process_stopped() {
    vswitchd_pid=$(pgrep vswitchd)
    ovsdb_pid=$(pgrep ovsdb)
    if [[ -z $vswitchd_pid ]]; then
        echo "PASS" >> $result_file
    else
        echo "FAIL" >> $result_file
        check_final_result
        exit 1
    fi
    if [[ -z $ovsdb_pid ]]; then
        echo "PASS" >> $result_file
    else
        echo "FAIL" >> $result_file
        check_final_result
        exit 1
    fi
}

get_process_pid() {
    # unused helper; assigns e.g. vswitchd_pid from pgrep
    process=$1
    printf -v "${process}_pid" '%s' "$(pgrep $process)"
}

systemctl stop openvswitch
sleep 3

### Tests

# start openvswitch
systemctl start openvswitch
sleep 3
test_systemctl_status_running openvswitch
test_systemctl_status_running ovs-vswitchd
test_systemctl_status_running ovsdb-server
test_process_started

# crash vswitchd
count=1
while [ $count -le $iterations ]; do
    echo "Running vswitchd crash iteration $count of $iterations"
    test_process_crashed_restarted vswitchd
    let count++
done

# crash ovsdb
count=1
while [ $count -le $iterations ]; do
    echo "Running ovsdb crash iteration $count of $iterations"
    test_process_crashed_restarted ovsdb
    let count++
done

check_final_result

##############################################################

[root@impairment1 ~]# uname -r
3.10.0-1062.el7.x86_64
[root@impairment1 ~]# rpm -q openvswitch
openvswitch-2.9.0-83.el7fdp.1.x86_64
[root@impairment1 ~]# ./test.sh
Running vswitchd crash iteration 1 of 50
Running vswitchd crash iteration 2 of 50
Running vswitchd crash iteration 3 of 50
Running vswitchd crash iteration 4 of 50
Running vswitchd crash iteration 5 of 50
Running vswitchd crash iteration 6 of 50
Running vswitchd crash iteration 7 of 50
Running vswitchd crash iteration 8 of 50
Running vswitchd crash iteration 9 of 50
Running vswitchd crash iteration 10 of 50
Running vswitchd crash iteration 11 of 50
Running vswitchd crash iteration 12 of 50
Running vswitchd crash iteration 13 of 50
Running vswitchd crash iteration 14 of 50
Running vswitchd crash iteration 15 of 50
Running vswitchd crash iteration 16 of 50
Running vswitchd crash iteration 17 of 50
Running vswitchd crash iteration 18 of 50
Running vswitchd crash iteration 19 of 50
Running vswitchd crash iteration 20 of 50
Running vswitchd crash iteration 21 of 50
Running vswitchd crash iteration 22 of 50
Running vswitchd crash iteration 23 of 50
Running vswitchd crash iteration 24 of 50
Running vswitchd crash iteration 25 of 50
Running vswitchd crash iteration 26 of 50
Running vswitchd crash iteration 27 of 50
Running vswitchd crash iteration 28 of 50
Running vswitchd crash iteration 29 of 50
Running vswitchd crash iteration 30 of 50
Running vswitchd crash iteration 31 of 50
Running vswitchd crash iteration 32 of 50
Running vswitchd crash iteration 33 of 50
Running vswitchd crash iteration 34 of 50
Running vswitchd crash iteration 35 of 50
Running vswitchd crash iteration 36 of 50
Running vswitchd crash iteration 37 of 50
Running vswitchd crash iteration 38 of 50
Running vswitchd crash iteration 39 of 50
Running vswitchd crash iteration 40 of 50
Running vswitchd crash iteration 41 of 50
Running vswitchd crash iteration 42 of 50
Running vswitchd crash iteration 43 of 50
Running vswitchd crash iteration 44 of 50
Running vswitchd crash iteration 45 of 50
Running vswitchd crash iteration 46 of 50
Running vswitchd crash iteration 47 of 50
Running vswitchd crash iteration 48 of 50
Running vswitchd crash iteration 49 of 50
Running vswitchd crash iteration 50 of 50
Running ovsdb crash iteration 1 of 50
1 crash iterations run of 50 total
Overall Test Result: FAIL

[root@impairment1 ~]# ./test.sh
Running vswitchd crash iteration 1 of 50
Running vswitchd crash iteration 2 of 50
Running vswitchd crash iteration 3 of 50
Running vswitchd crash iteration 4 of 50
Running vswitchd crash iteration 5 of 50
Running vswitchd crash iteration 6 of 50
Running vswitchd crash iteration 7 of 50
Running vswitchd crash iteration 8 of 50
Running vswitchd crash iteration 9 of 50
Running vswitchd crash iteration 10 of 50
Running vswitchd crash iteration 11 of 50
Running vswitchd crash iteration 12 of 50
Running vswitchd crash iteration 13 of 50
Running vswitchd crash iteration 14 of 50
Running vswitchd crash iteration 15 of 50
Running vswitchd crash iteration 16 of 50
Running vswitchd crash iteration 17 of 50
Running vswitchd crash iteration 18 of 50
Running vswitchd crash iteration 19 of 50
Running vswitchd crash iteration 20 of 50
Running vswitchd crash iteration 21 of 50
Running vswitchd crash iteration 22 of 50
Running vswitchd crash iteration 23 of 50
Running vswitchd crash iteration 24 of 50
Running vswitchd crash iteration 25 of 50
Running vswitchd crash iteration 26 of 50
Running vswitchd crash iteration 27 of 50
Running vswitchd crash iteration 28 of 50
Running vswitchd crash iteration 29 of 50
Running vswitchd crash iteration 30 of 50
Running vswitchd crash iteration 31 of 50
Running vswitchd crash iteration 32 of 50
Running vswitchd crash iteration 33 of 50
Running vswitchd crash iteration 34 of 50
Running vswitchd crash iteration 35 of 50
35 crash iterations run of 50 total
Overall Test Result: FAIL

Moved to openvswitch-2.9.0-122.el7fdp

No failures over 500 iterations each of the vswitchd and ovsdb crash tests:

[root@impairment1 ~]# rpm -q openvswitch
openvswitch-2.9.0-122.el7fdp.x86_64
[root@impairment1 ~]# ./test.sh
Running vswitchd crash iteration 1 of 500
Running vswitchd crash iteration 2 of 500
Running vswitchd crash iteration 3 of 500
Running vswitchd crash iteration 4 of 500
Running vswitchd crash iteration 5 of 500
...
...
Running vswitchd crash iteration 45 of 500
Running vswitchd crash iteration 46 of 500
Running vswitchd crash iteration 47 of 500
Running vswitchd crash iteration 48 of 500
Running vswitchd crash iteration 49 of 500
Running vswitchd crash iteration 500 of 500
Running ovsdb crash iteration 1 of 500
Running ovsdb crash iteration 2 of 500
Running ovsdb crash iteration 3 of 500
Running ovsdb crash iteration 4 of 500
Running ovsdb crash iteration 5 of 500
...
...
Running ovsdb crash iteration 45 of 500
Running ovsdb crash iteration 46 of 500
Running ovsdb crash iteration 47 of 500
Running ovsdb crash iteration 48 of 500
Running ovsdb crash iteration 49 of 500
Running ovsdb crash iteration 500 of 500
500 crash iterations run of 500 total
Overall Test Result: PASS
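For reference, the iteration count in test.sh above is read from the environment (iterations=${iterations:-"50"}), so the 500-iteration run was presumably invoked along these lines:

iterations=500 ./test.sh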
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3717