+++ This bug was initially created as a clone of Bug #1684477 +++
+++ This bug was initially created as a clone of Bug #1653717 +++

Description of problem:

The openvswitch process is not restarted after a kill about 20% of the time.

Version-Release number of selected component (if applicable):

OSP 14, 3 controllers + 3 computes + DVR

How reproducible:

1. Install a bare-metal node with RHEL (The Foreman)
2. Install with Jenkins: OSP 14 -- 3 controllers + 3 computes + DVR
3. Create several VMs with a FIP
4. Kill the openvswitch process on a compute node (a kill-and-check sketch is included at the end of this report)

Actual results:

The process is not restarted about 20% of the time.

Expected results:

The process is restarted.

Additional info:

Logs when the openvswitch restart is failing:

[root@compute-2 heat-admin]# tail -f /var/log/containers/neutron/metadata-agent.log
2018-11-26 08:51:10.883 8224 INFO eventlet.wsgi.server [-] 10.2.0.18,<local> "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" status: 200 len: 139 time: 0.1934390
2018-11-26 08:51:10.968 8223 INFO eventlet.wsgi.server [-] 10.3.0.25,<local> "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" status: 200 len: 139 time: 0.1648369
2018-11-26 09:11:20.883 8224 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: IOError: Socket closed
2018-11-26 09:11:34.307 7137 ERROR oslo.messaging._drivers.impl_rabbit [-] [357cec88-105c-4f2b-b26c-70ae3864ec0f] AMQP server controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:11:35.323 7137 INFO oslo.messaging._drivers.impl_rabbit [-] [357cec88-105c-4f2b-b26c-70ae3864ec0f] Reconnected to AMQP server on controller-2.internalapi.localdomain:5672 via [amqp] client with port 50096.
2018-11-26 09:11:43.938 8223 ERROR oslo.messaging._drivers.impl_rabbit [-] [24405812-662e-4c40-b834-37a07d80366f] AMQP server controller-1.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:11:44.955 8223 INFO oslo.messaging._drivers.impl_rabbit [-] [24405812-662e-4c40-b834-37a07d80366f] Reconnected to AMQP server on controller-1.internalapi.localdomain:5672 via [amqp] client with port 37994.
2018-11-26 09:11:45.218 8224 ERROR oslo.messaging._drivers.impl_rabbit [-] [7bf56d8a-afcf-4b87-b2dd-c865c2faf08f] AMQP server controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:11:46.235 8224 INFO oslo.messaging._drivers.impl_rabbit [-] [7bf56d8a-afcf-4b87-b2dd-c865c2faf08f] Reconnected to AMQP server on controller-2.internalapi.localdomain:5672 via [amqp] client with port 50116.
2018-11-26 09:11:59.117 8223 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: IOError: Socket closed

vi /var/log/containers/neutron/openvswitch-agent.log
2018-11-26 09:12:04.417 7137 ERROR oslo.messaging._drivers.impl_rabbit [-] [8f9c38d5-3ad8-4cab-8446-a2f07e6d370f] AMQP server controller-1.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:12:05.444 7137 INFO oslo.messaging._drivers.impl_rabbit [-] [8f9c38d5-3ad8-4cab-8446-a2f07e6d370f] Reconnected to AMQP server on controller-1.internalapi.localdomain:5672 via [amqp] client with port 38042.
2018-11-26 09:12:15.386 8224 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer
2018-11-26 14:23:44.338 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     raise RuntimeError(m)
2018-11-26 14:23:44.338 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int RuntimeError: Switch connection timeout
2018-11-26 14:23:44.338 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int
2018-11-26 14:23:44.339 28480 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.
2018-11-26 14:23:44.340 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:623 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:30.020 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1875
2018-11-26 14:23:44.340 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Loop iteration exceeded interval (2 vs. 30.0197079182)! loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1882
2018-11-26 14:23:44.340 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:624 started rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2086
2018-11-26 14:24:14.353 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Switch connection timeout
2018-11-26 14:24:14.354 28480 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): DbGetCommand(column=datapath_id, table=Bridge, record=br-int) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-11-26 14:24:14.355 28480 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:121
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Failed to communicate with the switch: RuntimeError: Switch connection timeout
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int Traceback (most recent call last):
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/br_int.py", line 52, in check_canary_table
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     flows = self.dump_flows(constants.CANARY_TABLE)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py", line 156, in dump_flows
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     (dp, ofp, ofpp) = self._get_dp()
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_bridge.py", line 69, in _get_dp
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     self._cached_dpid = new_dpid
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     self.force_reraise()
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     six.reraise(self.type_, self.value, self.tb)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_bridge.py", line 52, in _get_dp
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     dp = self._get_dp_by_dpid(self._cached_dpid)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py", line 79, in _get_dp_by_dpid
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     raise RuntimeError(m)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int RuntimeError: Switch connection timeout
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int
2018-11-26 14:24:14.356 28480 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.
2018-11-26 14:24:14.356 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:624 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:30.016 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1875
2018-11-26 14:24:14.357 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Loop iteration exceeded interval (2 vs. 30.0161988735)! loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1882
2018-11-26 14:24:14.357 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:625 started rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2086

(overcloud) [stack@undercloud-0 ~]$ openstack versions show
+-------------+----------------+---------+------------+-----------------------------------+------------------+------------------+
| Region Name | Service Type   | Version | Status     | Endpoint                          | Min Microversion | Max Microversion |
+-------------+----------------+---------+------------+-----------------------------------+------------------+------------------+
| regionOne   | block-storage  | 2.0     | DEPRECATED | http://10.0.0.101:8776/v2/        | None             | None             |
| regionOne   | block-storage  | 3.0     | CURRENT    | http://10.0.0.101:8776/v3/        | 3.0              | 3.55             |
| regionOne   | placement      | None    | CURRENT    | http://10.0.0.101:8778/placement/ | None             | None             |
| regionOne   | network        | 2.0     | CURRENT    | http://10.0.0.101:9696/v2.0/      | None             | None             |
| regionOne   | alarm          | 2.0     | CURRENT    | http://10.0.0.101:8042/v2         | None             | None             |
| regionOne   | cloudformation | 1.0     | CURRENT    | http://10.0.0.101:8000/v1/        | None             | None             |
| regionOne   | event          | 2.0     | CURRENT    | http://10.0.0.101:8977/v2         | None             | None             |
| regionOne   | orchestration  | 1.0     | CURRENT    | http://10.0.0.101:8004/v1/        | None             | None             |
| regionOne   | object-store   | 1.0     | CURRENT    | http://10.0.0.101:8080/v1/        | None             | None             |
| regionOne   | compute        | 2.0     | SUPPORTED  | http://10.0.0.101:8774/v2/        | None             | None             |
| regionOne   | compute        | 2.1     | CURRENT    | http://10.0.0.101:8774/v2.1/      | 2.1              | 2.65             |
| regionOne   | image          | 2.0     | SUPPORTED  | http://10.0.0.101:9292/v2/        | None             | None             |
| regionOne   | image          | 2.1     | SUPPORTED  | http://10.0.0.101:9292/v2/        | None             | None             |
| regionOne   | image          | 2.2     | SUPPORTED  | http://10.0.0.101:9292/v2/        | None             | None             |
| regionOne   | image          | 2.3     | SUPPORTED  | http://10.0.0.101:9292/v2/        | None             | None             |
| regionOne   | image          | 2.4     | SUPPORTED  | http://10.0.0.101:9292/v2/        | None             | None             |
| regionOne   | image          | 2.5     | SUPPORTED  | http://10.0.0.101:9292/v2/        | None             | None             |
| regionOne   | image          | 2.6     | SUPPORTED  | http://10.0.0.101:9292/v2/        | None             | None             |
| regionOne   | image          | 2.7     | CURRENT    | http://10.0.0.101:9292/v2/        | None             | None             |
| regionOne   | metric         | 1.0     | CURRENT    | http://10.0.0.101:8041/v1/        | None             | None             |
| regionOne   | identity       | 3.10    | CURRENT    | http://10.0.0.101:5000/v3/        | None             | None             |
+-------------+----------------+---------+------------+-----------------------------------+------------------+------------------+

(overcloud) [stack@undercloud-0 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)

--- Additional comment from Candido Campos on 2019-02-28 14:32:57 CET ---

The change that avoids the issue is:

/usr/lib/systemd/system/ovsdb-server.service
...
[Service]
Type=forking
+PIDFile=/var/run/openvswitch/ovsdb-server.pid
Restart=on-failure
...

/usr/lib/systemd/system/ovs-vswitchd.service
....
[Service]
Type=forking
+PIDFile=/var/run/openvswitch/ovs-vswitchd.pid
Restart=on-failure
....
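The same two lines could also be tried on an already deployed node as systemd drop-in overrides instead of editing the packaged unit files. This is only a sketch of that approach; the drop-in file name is arbitrary, and only the PIDFile paths are taken from the change above:

# Hypothetical drop-in overrides (file name "pidfile.conf" is an arbitrary choice);
# the PIDFile paths come from the change described in the comment above.
mkdir -p /etc/systemd/system/ovsdb-server.service.d /etc/systemd/system/ovs-vswitchd.service.d

cat > /etc/systemd/system/ovsdb-server.service.d/pidfile.conf <<'EOF'
[Service]
PIDFile=/var/run/openvswitch/ovsdb-server.pid
EOF

cat > /etc/systemd/system/ovs-vswitchd.service.d/pidfile.conf <<'EOF'
[Service]
PIDFile=/var/run/openvswitch/ovs-vswitchd.pid
EOF

# Reload unit definitions so systemd picks up the overrides.
systemctl daemon-reload

The corresponding change to the packaged unit file templates looks like this: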
diff --git a/rhel/usr_lib_systemd_system_ovs-vswitchd.service.in b/rhel/usr_lib_systemd_system_ovs-vswitchd.service.in
index 525deae0b..82925133d 100644
--- a/rhel/usr_lib_systemd_system_ovs-vswitchd.service.in
+++ b/rhel/usr_lib_systemd_system_ovs-vswitchd.service.in
@@ -9,6 +9,7 @@ PartOf=openvswitch.service
 [Service]
 Type=forking
+PIDFile=/var/run/openvswitch/ovs-vswitchd.pid
 Restart=on-failure
 Environment=XDG_RUNTIME_DIR=/var/run/openvswitch
 EnvironmentFile=/etc/openvswitch/default.conf
diff --git a/rhel/usr_lib_systemd_system_ovsdb-server.service b/rhel/usr_lib_systemd_system_ovsdb-server.service
index 70da1ec95..a7a1e03cb 100644
--- a/rhel/usr_lib_systemd_system_ovsdb-server.service
+++ b/rhel/usr_lib_systemd_system_ovsdb-server.service
@@ -8,6 +8,7 @@ PartOf=openvswitch.service
 [Service]
 Type=forking
 Restart=on-failure
+PIDFile=/var/run/openvswitch/ovsdb-server.pid
 EnvironmentFile=/etc/openvswitch/default.conf
 EnvironmentFile=-/etc/sysconfig/openvswitch
 ExecStartPre=/usr/bin/chown ${OVS_USER_ID} /var/run/openvswitch /var/log/openvswitch
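A rough way to exercise step 4 of "How reproducible" and estimate how often the daemon fails to come back is a kill-and-check loop on the compute node. This is only an illustrative sketch; the signal, iteration count, and sleep intervals are arbitrary assumptions, not values from the original report:

#!/bin/bash
# Illustrative check of the restart behaviour described in this report.
# SIGKILL, 20 iterations and the sleep intervals are arbitrary choices.
failures=0
for i in $(seq 1 20); do
    pkill -9 -x ovs-vswitchd            # simulate the openvswitch process dying
    sleep 15                            # give systemd time to act on Restart=on-failure
    if ! pidof ovs-vswitchd > /dev/null; then
        failures=$((failures + 1))
        echo "iteration $i: ovs-vswitchd was not restarted"
        systemctl start ovs-vswitchd    # bring it back before the next iteration
        sleep 5
    fi
done
echo "ovs-vswitchd failed to restart in $failures of 20 iterations"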