Bug 1653717 - Open vSwitch process is not restarted after a kill command in ~20% of attempts
Summary: Open vSwitch process is not restarted after a kill command in ~20% of attempts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch
Version: FDP 19.F
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
: ---
Assignee: Timothy Redaelli
QA Contact: Rick Alongi
URL:
Whiteboard:
Depends On: 1684483
Blocks: 1759242
 
Reported: 2018-11-27 13:41 UTC by Candido Campos
Modified: 2020-01-14 21:05 UTC (History)
18 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1684477 1759242 (view as bug list)
Environment:
Last Closed: 2019-11-06 04:18:21 UTC
Target Upstream Version:
fleitner: needinfo-


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:3717 None None None 2019-11-06 04:18:40 UTC

Description Candido Campos 2018-11-27 13:41:49 UTC
Description of problem:

 The Open vSwitch process is not restarted after a kill in about 20% of attempts.


Version-Release number of selected component (if applicable):

 OSP 14

 3 controllers + 3 computes + dvr

How reproducible:

 1. Install a bare-metal host with RHEL (via The Foreman)
 2. Install with Jenkins: OSP 14 -- 3 controllers + 3 computes + DVR
 3. Create several VMs with a FIP
 4. Kill the openvswitch process on a compute node



Actual results:

 The process is not restarted in about 20% of attempts.

Expected results:

 The process is always restarted.

Additional info:

Logs from when the Open vSwitch restart fails:

[root@compute-2 heat-admin]# tail -f  /var/log/containers/neutron/metadata-agent.log
2018-11-26 08:51:10.883 8224 INFO eventlet.wsgi.server [-] 10.2.0.18,<local> "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" status: 200  len: 139 time: 0.1934390
2018-11-26 08:51:10.968 8223 INFO eventlet.wsgi.server [-] 10.3.0.25,<local> "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" status: 200  len: 139 time: 0.1648369
2018-11-26 09:11:20.883 8224 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: IOError: Socket closed
2018-11-26 09:11:34.307 7137 ERROR oslo.messaging._drivers.impl_rabbit [-] [357cec88-105c-4f2b-b26c-70ae3864ec0f] AMQP server controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:11:35.323 7137 INFO oslo.messaging._drivers.impl_rabbit [-] [357cec88-105c-4f2b-b26c-70ae3864ec0f] Reconnected to AMQP server on controller-2.internalapi.localdomain:5672 via [amqp] client with port 50096.
2018-11-26 09:11:43.938 8223 ERROR oslo.messaging._drivers.impl_rabbit [-] [24405812-662e-4c40-b834-37a07d80366f] AMQP server controller-1.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:11:44.955 8223 INFO oslo.messaging._drivers.impl_rabbit [-] [24405812-662e-4c40-b834-37a07d80366f] Reconnected to AMQP server on controller-1.internalapi.localdomain:5672 via [amqp] client with port 37994.
2018-11-26 09:11:45.218 8224 ERROR oslo.messaging._drivers.impl_rabbit [-] [7bf56d8a-afcf-4b87-b2dd-c865c2faf08f] AMQP server controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:11:46.235 8224 INFO oslo.messaging._drivers.impl_rabbit [-] [7bf56d8a-afcf-4b87-b2dd-c865c2faf08f] Reconnected to AMQP server on controller-2.internalapi.localdomain:5672 via [amqp] client with port 50116.
2018-11-26 09:11:59.117 8223 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: IOError: Socket closed





vi /var/log/containers/neutron/openvswitch-agent.log


2018-11-26 09:12:04.417 7137 ERROR oslo.messaging._drivers.impl_rabbit [-] [8f9c38d5-3ad8-4cab-8446-a2f07e6d370f] AMQP server controller-1.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:12:05.444 7137 INFO oslo.messaging._drivers.impl_rabbit [-] [8f9c38d5-3ad8-4cab-8446-a2f07e6d370f] Reconnected to AMQP server on controller-1.internalapi.localdomain:5672 via [amqp] client with port 38042.
2018-11-26 09:12:15.386 8224 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer

2018-11-26 14:23:44.338 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     raise RuntimeError(m)
2018-11-26 14:23:44.338 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int RuntimeError: Switch connection timeout
2018-11-26 14:23:44.338 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int
2018-11-26 14:23:44.339 28480 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.
2018-11-26 14:23:44.340 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:623 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:30.020 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1875
2018-11-26 14:23:44.340 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Loop iteration exceeded interval (2 vs. 30.0197079182)! loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1882
2018-11-26 14:23:44.340 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:624 started rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2086
2018-11-26 14:24:14.353 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Switch connection timeout
2018-11-26 14:24:14.354 28480 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): DbGetCommand(column=datapath_id, table=Bridge, record=br-int) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-11-26 14:24:14.355 28480 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:121
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Failed to communicate with the switch: RuntimeError: Switch connection timeout
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int Traceback (most recent call last):
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/br_int.py", line 52, in check_canary_table
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     flows = self.dump_flows(constants.CANARY_TABLE)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py", line 156, in dump_flows
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     (dp, ofp, ofpp) = self._get_dp()
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_bridge.py", line 69, in _get_dp
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     self._cached_dpid = new_dpid
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     self.force_reraise()
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     six.reraise(self.type_, self.value, self.tb)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_bridge.py", line 52, in _get_dp
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     dp = self._get_dp_by_dpid(self._cached_dpid)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py", line 79, in _get_dp_by_dpid
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     raise RuntimeError(m)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int RuntimeError: Switch connection timeout
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int
2018-11-26 14:24:14.356 28480 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.
2018-11-26 14:24:14.356 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:624 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:30.016 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1875
2018-11-26 14:24:14.357 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Loop iteration exceeded interval (2 vs. 30.0161988735)! loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1882
2018-11-26 14:24:14.357 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:625 started rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2086

(overcloud) [stack@undercloud-0 ~]$ openstack versions show
+-------------+----------------+---------+------------+-----------------------------------+------------------+------------------+
| Region Name | Service Type | Version | Status | Endpoint | Min Microversion | Max Microversion |
+-------------+----------------+---------+------------+-----------------------------------+------------------+------------------+
| regionOne | block-storage | 2.0 | DEPRECATED | http://10.0.0.101:8776/v2/ | None | None |
| regionOne | block-storage | 3.0 | CURRENT | http://10.0.0.101:8776/v3/ | 3.0 | 3.55 |
| regionOne | placement | None | CURRENT | http://10.0.0.101:8778/placement/ | None | None |
| regionOne | network | 2.0 | CURRENT | http://10.0.0.101:9696/v2.0/ | None | None |
| regionOne | alarm | 2.0 | CURRENT | http://10.0.0.101:8042/v2 | None | None |
| regionOne | cloudformation | 1.0 | CURRENT | http://10.0.0.101:8000/v1/ | None | None |
| regionOne | event | 2.0 | CURRENT | http://10.0.0.101:8977/v2 | None | None |
| regionOne | orchestration | 1.0 | CURRENT | http://10.0.0.101:8004/v1/ | None | None |
| regionOne | object-store | 1.0 | CURRENT | http://10.0.0.101:8080/v1/ | None | None |
| regionOne | compute | 2.0 | SUPPORTED | http://10.0.0.101:8774/v2/ | None | None |
| regionOne | compute | 2.1 | CURRENT | http://10.0.0.101:8774/v2.1/ | 2.1 | 2.65 |
| regionOne | image | 2.0 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.1 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.2 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.3 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.4 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.5 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.6 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.7 | CURRENT | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | metric | 1.0 | CURRENT | http://10.0.0.101:8041/v1/ | None | None |
| regionOne | identity | 3.10 | CURRENT | http://10.0.0.101:5000/v3/ | None | None |
+-------------+----------------+---------+------------+-----------------------------------+------------------+------------------+
(overcloud) [stack@undercloud-0 ~]$ cat /etc/re
redhat-lsb/ redhat-release request-key.conf request-key.d/ resolv.conf
(overcloud) [stack@undercloud-0 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)

Comment 1 Brian Haley 2018-11-27 20:23:57 UTC
Can you attach the agent logs here?  Thanks.

Comment 8 Brian Haley 2018-11-29 16:07:26 UTC
This was reproduced on OSP 13 as well, so clearing blocker flag based on that.  Will continue to investigate and/or re-assign to systemd component for further triage.

Comment 9 Flavio Leitner 2018-12-03 20:20:27 UTC
(In reply to Brian Haley from comment #8)

Check the journal log to see whether systemd is complaining about the service, or is not even trying to run it again. The ovs-vswitchd.log might also shed some light.
We need to find out whether ovs-vswitchd is failing to run again, or whether systemd is unhappy with the number or rate of restarts.

Thanks,
fbl

Comment 15 Nate Johnston 2019-02-11 15:21:26 UTC
Resetting to component 'openvswitch' per request from Candido.

Comment 16 Candido Campos 2019-02-11 16:55:30 UTC
To recap, the problem is:

 I have restarted the investigation of this bug and found a fix, or at least a configuration change that avoids the issue: changing the systemd options for OVS:


vi /usr/lib/systemd/system/ovs-vswitchd.service
[Unit]
Description=Open vSwitch Forwarding Unit
After=ovsdb-server.service network-pre.target systemd-udev-settle.service
Before=network.target network.service
Requires=ovsdb-server.service
ReloadPropagatedFrom=ovsdb-server.service
AssertPathIsReadWrite=/var/run/openvswitch/db.sock
PartOf=openvswitch.service

[Service]
Type=forking
Restart=on-failure
Environment=HOME=/var/run/openvswitch
EnvironmentFile=/etc/openvswitch/default.conf
EnvironmentFile=-/etc/sysconfig/openvswitch
EnvironmentFile=-/run/openvswitch/useropts
ExecStartPre=-/bin/sh -c '/usr/bin/chown :$${OVS_USER_ID##*:} /dev/hugepages'
ExecStartPre=-/usr/bin/chmod 0775 /dev/hugepages
ExecStart=/usr/share/openvswitch/scripts/ovs-ctl \
          --no-ovsdb-server --no-monitor --system-id=random \
          ${OVSUSER} \
          start $OPTIONS
ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop
ExecReload=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server \
          --no-monitor --system-id=random \
          ${OVSUSER} \
          restart $OPTIONS
TimeoutSec=300

the change would be:

..
Restart=always
...
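A note on applying the change: the unit under /usr/lib/systemd/system is owned by the openvswitch package, so a direct edit would be overwritten on the next update. The same change can be carried as a drop-in override instead; a sketch, using standard systemd conventions:

```ini
# Created with `systemctl edit ovs-vswitchd`, which writes
# /etc/systemd/system/ovs-vswitchd.service.d/override.conf
[Service]
Restart=always
```

After saving the override, run systemctl daemon-reload for it to take effect.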

When OVS is not restarted, systemd does not detect that the daemon went down. journalctl logs:


when ovs is not restarted:

...
Jan 09 15:38:44 compute-2 ntpd[20626]: new interface(s) found: waking up resolver
Jan 09 15:40:49 compute-2 ovs-appctl[174056]: ovs|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.173341.ctl
Jan 09 15:40:49 compute-2 ovs-ctl[174041]: 2019-01-09T15:40:49Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.173341.ctl
Jan 09 15:40:49 compute-2 ovs-ctl[174041]: ovs-appctl: cannot connect to "/var/run/openvswitch/ovs-vswitchd.173341.ctl" (Connection refused)
Jan 09 15:40:57 compute-2 dockerd-current[17598]: time="2019-01-09T10:40:57.261191005-05:00" level=warning msg="containerd: unable to save 5b342b15750ff4386ea85ffb94324b60c93129e224dc0539590af18a95d8d476:13d0201ec590555b629627b295601926cb29caf217a7371dab4c875c0b74e461 starttime: read /proc/174173/stat: no such process"
..

when it is restarted:

...

Jan 09 16:09:43 compute-2 systemd[1]: ovs-vswitchd.service: main process exited, code=killed, status=11/SEGV
Jan 09 16:09:43 compute-2 ovs-appctl[183633]: ovs|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.182730.ctl
Jan 09 16:09:43 compute-2 ovs-ctl[183618]: 2019-01-09T16:09:43Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.182730.ctl
Jan 09 16:09:43 compute-2 ovs-ctl[183618]: ovs-appctl: cannot connect to "/var/run/openvswitch/ovs-vswitchd.182730.ctl" (Connection refused)
Jan 09 16:09:43 compute-2 systemd[1]: Unit ovs-vswitchd.service entered failed state.
Jan 09 16:09:43 compute-2 systemd[1]: ovs-vswitchd.service failed.
Jan 09 16:09:43 compute-2 systemd[1]: ovs-vswitchd.service holdoff time over, scheduling restart.
Jan 09 16:09:43 compute-2 systemd[1]: Stopped Open vSwitch Forwarding Unit.
-- Subject: Unit ovs-vswitchd.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ovs-vswitchd.service has finished shutting down.
Jan 09 16:09:43 compute-2 systemd[1]: Starting Open vSwitch Forwarding Unit...
-- Subject: Unit ovs-vswitchd.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ovs-vswitchd.service has begun starting up.
Jan 09 16:09:43 compute-2 kernel: device vxlan_sys_4789 left promiscuous mode
Jan 09 16:09:43 compute-2 NetworkManager[3075]: <info>  [1547050183.9576] device (vxlan_sys_4789): state change: disconnected -> unmanaged (reason 'unmanaged', sys-iface-state: 'removed')
Jan 09 16:09:43 compute-2 kernel: device vxlan_sys_4789 entered promiscuous mode
Jan 09 16:09:43 compute-2 NetworkManager[3075]: <info>  [1547050183.9803] manager: (vxlan_sys_4789): new Vxlan device (/org/freedesktop/NetworkManager/Devices/52)
Jan 09 16:09:43 compute-2 NetworkManager[3075]: <info>  [1547050183.9805] device (vxlan_sys_4789): enslaved to non-master-type device ovs-system; ignoring
Jan 09 16:09:43 compute-2 NetworkManager[3075]: <info>  [1547050183.9948] device (vxlan_sys_4789): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
Jan 09 16:09:43 compute-2 NetworkManager[3075]: <info>  [1547050183.9962] device (vxlan_sys_4789): enslaved to non-master-type device ovs-system; ignoring
Jan 09 16:09:43 compute-2 NetworkManager[3075]: <info>  [1547050183.9962] device (vxlan_sys_4789): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'external')
Jan 09 16:09:44 compute-2 ovs-ctl[183642]: Starting ovs-vswitchd [  OK  ]
Jan 09 16:09:44 compute-2 ovs-vsctl[183710]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . external-ids:hostname=compute-2.localdomain
Jan 09 16:09:44 compute-2 systemd[1]: Started Open vSwitch Forwarding Unit.
-- Subject: Unit ovs-vswitchd.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ovs-vswitchd.service has finished starting up.
--
-- The start-up result is done.
Jan 09 16:09:44 compute-2 ovs-ctl[183642]: Enabling remote OVSDB managers [  OK  ]

...

Comment 17 Flavio Leitner 2019-02-12 13:49:13 UTC
Hi,

To stop the service, the systemd unit uses the ovs-appctl tool to signal the daemon to exit gracefully. Therefore, if you set Restart=always, stopping or managing the daemon with ovs-appctl becomes impossible, because whenever you signal the daemon to exit, systemd restarts it. That is why the unit is restricted to Restart=on-failure.

To add more information, userspace tools talk to the daemon over Unix sockets at /var/run/openvswitch/*.ctl. When you see connection-refused errors, it is most probably because the daemon is not running.

Things to keep in mind:
* systemd has restart-loop protection that might prevent the service from being restarted after a few attempts.
* systemd might not be getting the right exit code, or ovs-vswitchd might not be exiting with an error, in which case systemd does not restart it.
* systemd might not even know that the daemon was killed (a lost notification, somehow).
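On the first point: with the systemd version shipped in RHEL 7 (v219), the restart-loop protection is controlled by StartLimitInterval and StartLimitBurst in the [Service] section, and tripping it produces the "start request repeated too quickly" message. A hypothetical drop-in to loosen the limit while debugging (the file name is illustrative):

```ini
# /etc/systemd/system/ovs-vswitchd.service.d/startlimit.conf (illustrative)
[Service]
# Permit up to 10 restarts within any 60-second window before systemd
# gives up with "start request repeated too quickly".
StartLimitInterval=60
StartLimitBurst=10
```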

My suggestion is to increase the verbosity level in systemd and check what it receives when the issue reproduces.

Also, you could replace OVS with another simple program to rule OVS out of the equation. Just write a script that sleeps forever and exits with an error when it is killed.
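Such a stand-in can be as small as the script below; a sketch (the path and name are hypothetical). It exits non-zero when terminated, so a unit with Restart=on-failure pointed at it would be expected to restart it:

```shell
# Create a hypothetical stand-in daemon: it sleeps until signaled and exits
# with an error status on TERM/INT, mimicking a daemon that dies abnormally.
cat > /tmp/fake-vswitchd.sh <<'EOF'
#!/bin/sh
trap 'exit 1' TERM INT
while :; do
    sleep 3600 &
    wait $!
done
EOF
chmod +x /tmp/fake-vswitchd.sh
```

Substituting this script for ExecStart in a copy of the ovs-vswitchd unit separates systemd's restart behavior from anything OVS-specific.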

fbl

Comment 37 qding 2019-10-10 03:36:05 UTC
ovs-vswitchd still sometimes fails to restart with openvswitch-2.9.0-122.el7fdp.x86_64

[root@dell-per730-05 openvswitch]# rpm -q openvswitch
openvswitch-2.9.0-122.el7fdp.x86_64


[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12221) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12864) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12938) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12943) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12948) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12953) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12958) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12963) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd | head -n1 | awk '{system("kill -9 "$2)}';
sh: line 0: kill: (12968) - No such process
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd
root      12973   5135  0 23:28 pts/2    00:00:00 grep --color=auto ovs-vswitchd
[root@dell-per730-05 openvswitch]# ps -ef | grep ovs-vswitchd
root      12985   5135  0 23:28 pts/2    00:00:00 grep --color=auto ovs-vswitchd
[root@dell-per730-05 openvswitch]# 
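As an aside, the intermittent 'kill: (PID) - No such process' errors above come from the reproducer itself: 'ps -ef | grep ovs-vswitchd' also matches the grep process, and when 'head -n1' happens to select it (in particular once the daemon is no longer running), that short-lived grep has already exited by the time kill fires. A pgrep-based helper avoids the race; a sketch (the function name is ours):

```shell
# kill_exact NAME: send SIGKILL to the oldest process whose command name is
# exactly NAME. pgrep -x matches the exact comm name and never matches the
# pgrep process itself, avoiding the grep-matches-grep race.
kill_exact() {
    pid=$(pgrep -o -x "$1") || { echo "no $1 process"; return 1; }
    kill -9 "$pid" && echo "killed $1 (pid $pid)"
}
```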
[root@dell-per730-05 openvswitch]# journalctl -r
-- Logs begin at Wed 2019-10-09 18:29:40 EDT, end at Wed 2019-10-09 23:32:22 EDT. --
Oct 09 23:32:22 dell-per730-05.rhts.eng.pek2.redhat.com dnsmasq-dhcp[3517]: DHCPACK(virbr0) 192.168.122.222 52:54:00:8d:72:5d
Oct 09 23:32:22 dell-per730-05.rhts.eng.pek2.redhat.com dnsmasq-dhcp[3517]: DHCPREQUEST(virbr0) 192.168.122.222 52:54:00:8d:72:5d
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: ovsdb-server.service failed.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Unit ovsdb-server.service entered failed state.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Job ovs-vswitchd.service/start failed with result 'dependency'.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Dependency failed for Open vSwitch Forwarding Unit.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Job openvswitch.service/start failed with result 'dependency'.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Dependency failed for Open vSwitch.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Failed to start Open vSwitch Database Unit.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: start request repeated too quickly for ovsdb-server.service
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Stopped Open vSwitch Database Unit.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com ovs-ctl[12914]: Exiting ovsdb-server (12818) [  OK  ]
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Stopping Open vSwitch Database Unit...
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Stopped Open vSwitch Forwarding Unit.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Stopped Open vSwitch.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Stopping Open vSwitch...
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: ovs-vswitchd.service holdoff time over, scheduling restart.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: ovs-vswitchd.service failed.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Unit ovs-vswitchd.service entered failed state.
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com ovs-ctl[12892]: ovs-appctl: cannot connect to "/var/run/openvswitch/ovs-vswitchd.12873.ctl" (Connection refused)
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com ovs-ctl[12892]: 2019-10-10T03:28:04Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.12873.ctl
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com ovs-appctl[12907]: ovs|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.12873.ctl
Oct 09 23:28:04 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: ovs-vswitchd.service: main process exited, code=killed, status=9/KILL
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Started Open vSwitch.
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Starting Open vSwitch...
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Started Open vSwitch Forwarding Unit.
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-ctl[12835]: Enabling remote OVSDB managers [  OK  ]
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-vsctl[12880]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . external-ids:hostname=dell-per730-05.rhts.eng.pek2.redhat.com
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-ctl[12835]: Starting ovs-vswitchd [  OK  ]
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-vsctl[12829]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . external-ids:hostname=dell-per730-05.rhts.eng.pek2.redhat.com
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Starting Open vSwitch Forwarding Unit...
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Started Open vSwitch Database Unit.
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-ctl[12781]: Configuring Open vSwitch system IDs [  OK  ]
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-vsctl[12824]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . ovs-version=2.9.0 "external-ids:system-id=\"aeabac21-a598-4152-a9a9-04dad5736cae\"" "external-
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-vsctl[12819]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait -- init -- set Open_vSwitch . db-version=7.15.1
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-ctl[12781]: Starting ovsdb-server [  OK  ]
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com runuser[12815]: pam_unix(runuser:session): session closed for user openvswitch
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com runuser[12815]: pam_unix(runuser:session): session opened for user openvswitch by (uid=0)
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com runuser[12812]: pam_unix(runuser:session): session closed for user openvswitch
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com runuser[12812]: pam_unix(runuser:session): session opened for user openvswitch by (uid=0)
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Starting Open vSwitch Database Unit...
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com systemd[1]: Stopped Open vSwitch Database Unit.
Oct 09 23:28:03 dell-per730-05.rhts.eng.pek2.redhat.com ovs-ctl[12749]: Exiting ovsdb-server (12658) [  OK  ]

Comment 39 Rick Alongi 2019-10-16 14:19:00 UTC
Tested without and with the fix in openvswitch 2.9 packages on RHEL-7.7.

As can be seen in the results below, the problem was readily observed without the fix in relatively few iterations but was not seen at all across 500 iterations with the fix in place.

Marking as Verified in openvswitch-2.9.0-122.el7fdp

##############################################################

The following script was used to test for this issue:

[root@impairment1 ~]# cat test.sh 
#!/bin/bash

iterations=${iterations:-"50"}
result_file="/home/result_file.txt"
rm -f $result_file

check_final_result()
{
	if grep -q FAIL "$result_file"; then
		echo "$count crash iterations run of $iterations total"
		echo "Overall Test Result: FAIL"		
	else
		let count--
		echo "$count crash iterations run of $iterations total"
		echo "Overall Test Result: PASS"
	fi
}

test_systemctl_status_running()
{
	service=$1
	systemctl_status=$(systemctl status $service | grep Active | awk '{print $2$3}')
	if [[ $service == "openvswitch" ]]; then
		if [[ "$systemctl_status" == "active(exited)" ]]; then
			echo "PASS" >> $result_file
		else
			echo "FAIL" >> $result_file
			check_final_result
			exit 1
		fi
	elif [[ $service == "ovs-vswitchd" ]] || [[ $service == "ovsdb-server" ]] && [[ "$systemctl_status" == "active(running)" ]]; then
		echo "PASS" >> $result_file
	else
		echo "FAIL" >> $result_file
		check_final_result
		exit 1
	fi
}

test_systemctl_status_stopped()
{
	service=$1
	systemctl_status=$(systemctl status $service | grep Active | awk '{print $2$3}')
	if [[ $service == "openvswitch" ]] && [[ "$systemctl_status" == "inactive(dead)" ]]; then
		echo "PASS" >> $result_file
	else
		echo "FAIL" >> $result_file
		check_final_result
		exit 1
	fi
}

test_process_crashed_restarted()
{
	process=$1
	vswitchd_pid1=$(pgrep vswitchd)
	ovsdb_pid1=$(pgrep ovsdb)
	if [[ $process == "vswitchd" ]]; then
		kill -SIGSEGV $vswitchd_pid1
	elif [[ $process == "ovsdb" ]]; then
		kill -SIGSEGV $ovsdb_pid1
	fi
	sleep 3
	vswitchd_pid2=$(pgrep vswitchd)
	ovsdb_pid2=$(pgrep ovsdb)
	if [[ -z $vswitchd_pid2 ]] || [[ $vswitchd_pid2 == $vswitchd_pid1 ]]; then
		echo "FAIL" >> $result_file
		check_final_result
		exit 1
	else
		echo "PASS" >> $result_file
	fi
	
	if [[ -z $ovsdb_pid2 ]] || [[ $ovsdb_pid2 == $ovsdb_pid1 ]]; then
		echo "FAIL" >> $result_file
		check_final_result
		exit 1
	else
		echo "PASS" >> $result_file
	fi
}

test_process_started()
{
	vswitchd_pid=$(pgrep vswitchd)
	ovsdb_pid=$(pgrep ovsdb-server)

	if [[ $vswitchd_pid ]]; then
		echo "PASS" >> $result_file
	else
		echo "FAIL" >> $result_file
		check_final_result
		exit 1
	fi
	
	if [[ $ovsdb_pid ]]; then
		echo "PASS" >> $result_file
	else
		echo "FAIL" >> $result_file
		check_final_result
		exit 1
	fi
}

test_process_stopped()
{
	vswitchd_pid=$(pgrep vswitchd)
	ovsdb_pid=$(pgrep ovsdb-server)

	if [[ -z $vswitchd_pid ]]; then
		echo "PASS" >> $result_file
	else
		echo "FAIL" >> $result_file
		check_final_result
		exit 1
	fi
	
	if [[ -z $ovsdb_pid ]]; then
		echo "PASS" >> $result_file
	else
		echo "FAIL" >> $result_file
		check_final_result
		exit 1
	fi
}

get_process_pid()
{
	process=$1
	# `"$process"_pid=...` is not a valid bash assignment; printf -v
	# writes its output into the dynamically named variable instead.
	printf -v "${process}_pid" '%s' "$(pgrep "$process")"
}

systemctl stop openvswitch
sleep 3

### Tests
# start openvswitch
systemctl start openvswitch
sleep 3
test_systemctl_status_running openvswitch
test_systemctl_status_running ovs-vswitchd
test_systemctl_status_running ovsdb-server
test_process_started

# crash vswitchd
count=1
while [ $count -le $iterations ]; do
	echo "Running vswitchd crash iteration $count of $iterations"
	test_process_crashed_restarted vswitchd
	let count++
done

# crash ovsdb
count=1
while [ $count -le $iterations ]; do
	echo "Running ovsdb crash iteration $count of $iterations"
	test_process_crashed_restarted ovsdb
	let count++
done

check_final_result

##############################################################
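The status checks above hinge on the `grep Active | awk '{print $2$3}'` pipeline, which collapses the unit state and its parenthesized sub-state into one token. A minimal standalone sketch of that parsing, against an illustrative sample line (not captured from a real unit):

```shell
#!/bin/bash
# Sample "Active:" line in the shape printed by `systemctl status` (illustrative).
line="   Active: active (running) since Tue 2019-10-01 10:00:00 UTC; 2h ago"
# awk splits on whitespace: $2 is the state ("active"), $3 the sub-state
# ("(running)"); concatenating them yields the token the script compares.
status=$(echo "$line" | grep Active | awk '{print $2$3}')
echo "$status"
```

This is why the expected values in the script are written without a space, e.g. "active(running)" and "inactive(dead)".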

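The core of test_process_crashed_restarted is the before/after PID comparison. The same logic can be exercised standalone with a background `sleep` standing in for the daemon; no systemd is involved here, so the "respawn" is done by hand purely to demonstrate the check:

```shell
#!/bin/bash
# Stand-in daemon: a background sleep instead of ovs-vswitchd.
sleep 60 &
pid1=$!
# Crash it the same way the test script does.
kill -SIGSEGV "$pid1"
wait "$pid1" 2>/dev/null   # reap the crashed process; ignore its exit status
# Manually "respawn" it, playing the role systemd has in the real test.
sleep 60 &
pid2=$!
# The check from the script: a missing or unchanged PID is a failure.
if [[ -z "$pid2" || "$pid2" == "$pid1" ]]; then
	result="FAIL"
else
	result="PASS"
fi
kill "$pid2" 2>/dev/null   # clean up the stand-in daemon
echo "$result"
```

In the real test the `sleep 3` between the kill and the second `pgrep` gives systemd time to restart the service before the PIDs are compared.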
[root@impairment1 ~]# uname -r
3.10.0-1062.el7.x86_64

[root@impairment1 ~]# rpm -q openvswitch
openvswitch-2.9.0-83.el7fdp.1.x86_64

[root@impairment1 ~]# ./test.sh 
Running vswitchd crash iteration 1 of 50
Running vswitchd crash iteration 2 of 50
Running vswitchd crash iteration 3 of 50
Running vswitchd crash iteration 4 of 50
Running vswitchd crash iteration 5 of 50
Running vswitchd crash iteration 6 of 50
Running vswitchd crash iteration 7 of 50
Running vswitchd crash iteration 8 of 50
Running vswitchd crash iteration 9 of 50
Running vswitchd crash iteration 10 of 50
Running vswitchd crash iteration 11 of 50
Running vswitchd crash iteration 12 of 50
Running vswitchd crash iteration 13 of 50
Running vswitchd crash iteration 14 of 50
Running vswitchd crash iteration 15 of 50
Running vswitchd crash iteration 16 of 50
Running vswitchd crash iteration 17 of 50
Running vswitchd crash iteration 18 of 50
Running vswitchd crash iteration 19 of 50
Running vswitchd crash iteration 20 of 50
Running vswitchd crash iteration 21 of 50
Running vswitchd crash iteration 22 of 50
Running vswitchd crash iteration 23 of 50
Running vswitchd crash iteration 24 of 50
Running vswitchd crash iteration 25 of 50
Running vswitchd crash iteration 26 of 50
Running vswitchd crash iteration 27 of 50
Running vswitchd crash iteration 28 of 50
Running vswitchd crash iteration 29 of 50
Running vswitchd crash iteration 30 of 50
Running vswitchd crash iteration 31 of 50
Running vswitchd crash iteration 32 of 50
Running vswitchd crash iteration 33 of 50
Running vswitchd crash iteration 34 of 50
Running vswitchd crash iteration 35 of 50
Running vswitchd crash iteration 36 of 50
Running vswitchd crash iteration 37 of 50
Running vswitchd crash iteration 38 of 50
Running vswitchd crash iteration 39 of 50
Running vswitchd crash iteration 40 of 50
Running vswitchd crash iteration 41 of 50
Running vswitchd crash iteration 42 of 50
Running vswitchd crash iteration 43 of 50
Running vswitchd crash iteration 44 of 50
Running vswitchd crash iteration 45 of 50
Running vswitchd crash iteration 46 of 50
Running vswitchd crash iteration 47 of 50
Running vswitchd crash iteration 48 of 50
Running vswitchd crash iteration 49 of 50
Running vswitchd crash iteration 50 of 50
Running ovsdb crash iteration 1 of 50
1 crash iterations run of 50 total
Overall Test Result: FAIL

[root@impairment1 ~]# ./test.sh 
Running vswitchd crash iteration 1 of 50
Running vswitchd crash iteration 2 of 50
Running vswitchd crash iteration 3 of 50
Running vswitchd crash iteration 4 of 50
Running vswitchd crash iteration 5 of 50
Running vswitchd crash iteration 6 of 50
Running vswitchd crash iteration 7 of 50
Running vswitchd crash iteration 8 of 50
Running vswitchd crash iteration 9 of 50
Running vswitchd crash iteration 10 of 50
Running vswitchd crash iteration 11 of 50
Running vswitchd crash iteration 12 of 50
Running vswitchd crash iteration 13 of 50
Running vswitchd crash iteration 14 of 50
Running vswitchd crash iteration 15 of 50
Running vswitchd crash iteration 16 of 50
Running vswitchd crash iteration 17 of 50
Running vswitchd crash iteration 18 of 50
Running vswitchd crash iteration 19 of 50
Running vswitchd crash iteration 20 of 50
Running vswitchd crash iteration 21 of 50
Running vswitchd crash iteration 22 of 50
Running vswitchd crash iteration 23 of 50
Running vswitchd crash iteration 24 of 50
Running vswitchd crash iteration 25 of 50
Running vswitchd crash iteration 26 of 50
Running vswitchd crash iteration 27 of 50
Running vswitchd crash iteration 28 of 50
Running vswitchd crash iteration 29 of 50
Running vswitchd crash iteration 30 of 50
Running vswitchd crash iteration 31 of 50
Running vswitchd crash iteration 32 of 50
Running vswitchd crash iteration 33 of 50
Running vswitchd crash iteration 34 of 50
Running vswitchd crash iteration 35 of 50
35 crash iterations run of 50 total
Overall Test Result: FAIL

Moved to openvswitch-2.9.0-122.el7fdp
No failures over 500 iterations each of vswitchd and ovsdb crash tests:

[root@impairment1 ~]# rpm -q openvswitch
openvswitch-2.9.0-122.el7fdp.x86_64

[root@impairment1 ~]# ./test.sh 
Running vswitchd crash iteration 1 of 500
Running vswitchd crash iteration 2 of 500
Running vswitchd crash iteration 3 of 500
Running vswitchd crash iteration 4 of 500
Running vswitchd crash iteration 5 of 500
...
...
Running vswitchd crash iteration 45 of 500
Running vswitchd crash iteration 46 of 500
Running vswitchd crash iteration 47 of 500
Running vswitchd crash iteration 48 of 500
Running vswitchd crash iteration 49 of 500
...
Running vswitchd crash iteration 500 of 500
Running ovsdb crash iteration 1 of 500
Running ovsdb crash iteration 2 of 500
Running ovsdb crash iteration 3 of 500
Running ovsdb crash iteration 4 of 500
Running ovsdb crash iteration 5 of 500
...
...
Running ovsdb crash iteration 45 of 500
Running ovsdb crash iteration 46 of 500
Running ovsdb crash iteration 47 of 500
Running ovsdb crash iteration 48 of 500
Running ovsdb crash iteration 49 of 500
...
Running ovsdb crash iteration 500 of 500
500 crash iterations run of 500 total
Overall Test Result: PASS
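The 500-iteration runs above were driven through the `iterations=${iterations:-"50"}` line at the top of the script, which falls back to 50 only when no environment value is set. A minimal sketch of that expansion:

```shell
#!/bin/bash
# ${var:-default}: use the variable's value if set and non-empty,
# otherwise the default.
unset iterations
iterations=${iterations:-50}
first=$iterations            # no override -> default 50
iterations=500
iterations=${iterations:-50} # override set -> kept as 500
echo "$first $iterations"
```

So the longer verification runs could be launched as `iterations=500 ./test.sh` without editing the script.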

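The get_process_pid helper in the script relies on assigning to a dynamically named variable; `"$process"_pid=...` is not valid bash syntax, but `printf -v` does this cleanly. A minimal sketch (the value 12345 is a placeholder, not a real PID):

```shell
#!/bin/bash
process="vswitchd"
# printf -v writes its formatted output into the named variable,
# so the target name can be constructed at runtime.
printf -v "${process}_pid" '%s' "12345"
echo "$vswitchd_pid"
```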
Comment 41 errata-xmlrpc 2019-11-06 04:18:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3717
