Bug 1684483 - Open vSwitch process is not restarted after a kill command in 20% of attempts
Summary: Open vSwitch process is not restarted after a kill command in 20% of attempts
Keywords:
Status: MODIFIED
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch2.10
Version: FDP 19.03
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Timothy Redaelli
QA Contact: Rick Alongi
URL:
Whiteboard:
Depends On: 1684477
Blocks: 1653717 1759242
 
Reported: 2019-03-01 11:16 UTC by Timothy Redaelli
Modified: 2023-07-13 07:25 UTC (History)
CC List: 5 users

Fixed In Version: openvswitch2.10-2.10.0-48.el7fdn
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1684477
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FD-81 0 None None None 2022-03-24 13:35:48 UTC

Description Timothy Redaelli 2019-03-01 11:16:50 UTC
+++ This bug was initially created as a clone of Bug #1684477 +++

+++ This bug was initially created as a clone of Bug #1653717 +++

Description of problem:

 The Open vSwitch process is not restarted after a kill in 20% of attempts.


Version-Release number of selected component (if applicable):

 OSP 14

 3 controllers + 3 computes + DVR

How reproducible:

 1. Install a bare-metal node with RHEL (The Foreman)
 2. Install OSP 14 with Jenkins: 3 controllers + 3 computes + DVR
 3. Create several VMs with a FIP
 4. Kill the openvswitch process on a compute node (see the sketch below)
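
A rough way to exercise step 4 repeatedly and measure the restart rate (unit and process names are assumptions based on the default OVS packaging, not the exact commands from the original report):

# Run on the affected compute node; assumes ovs-vswitchd is the killed process
# and that systemd is expected to restart it via Restart=on-failure.
for i in $(seq 1 10); do
    pkill -9 ovs-vswitchd                        # simulate an unexpected crash
    sleep 10                                     # give systemd time to react
    if systemctl is-active --quiet ovs-vswitchd; then
        echo "run $i: restarted"
    else
        echo "run $i: NOT restarted"
    fi
done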



Actual results:

 The process is not restarted in 20% of attempts.

Expected results:

 The process is restarted after every kill.

Additional info:

Logs when the Open vSwitch restart fails:

[root@compute-2 heat-admin]# tail -f  /var/log/containers/neutron/metadata-agent.log
2018-11-26 08:51:10.883 8224 INFO eventlet.wsgi.server [-] 10.2.0.18,<local> "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" status: 200  len: 139 time: 0.1934390
2018-11-26 08:51:10.968 8223 INFO eventlet.wsgi.server [-] 10.3.0.25,<local> "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" status: 200  len: 139 time: 0.1648369
2018-11-26 09:11:20.883 8224 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: IOError: Socket closed
2018-11-26 09:11:34.307 7137 ERROR oslo.messaging._drivers.impl_rabbit [-] [357cec88-105c-4f2b-b26c-70ae3864ec0f] AMQP server controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:11:35.323 7137 INFO oslo.messaging._drivers.impl_rabbit [-] [357cec88-105c-4f2b-b26c-70ae3864ec0f] Reconnected to AMQP server on controller-2.internalapi.localdomain:5672 via [amqp] client with port 50096.
2018-11-26 09:11:43.938 8223 ERROR oslo.messaging._drivers.impl_rabbit [-] [24405812-662e-4c40-b834-37a07d80366f] AMQP server controller-1.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:11:44.955 8223 INFO oslo.messaging._drivers.impl_rabbit [-] [24405812-662e-4c40-b834-37a07d80366f] Reconnected to AMQP server on controller-1.internalapi.localdomain:5672 via [amqp] client with port 37994.
2018-11-26 09:11:45.218 8224 ERROR oslo.messaging._drivers.impl_rabbit [-] [7bf56d8a-afcf-4b87-b2dd-c865c2faf08f] AMQP server controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:11:46.235 8224 INFO oslo.messaging._drivers.impl_rabbit [-] [7bf56d8a-afcf-4b87-b2dd-c865c2faf08f] Reconnected to AMQP server on controller-2.internalapi.localdomain:5672 via [amqp] client with port 50116.
2018-11-26 09:11:59.117 8223 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: IOError: Socket closed





vi /var/log/containers/neutron/openvswitch-agent.log


2018-11-26 09:12:04.417 7137 ERROR oslo.messaging._drivers.impl_rabbit [-] [8f9c38d5-3ad8-4cab-8446-a2f07e6d370f] AMQP server controller-1.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:12:05.444 7137 INFO oslo.messaging._drivers.impl_rabbit [-] [8f9c38d5-3ad8-4cab-8446-a2f07e6d370f] Reconnected to AMQP server on controller-1.internalapi.localdomain:5672 via [amqp] client with port 38042.
2018-11-26 09:12:15.386 8224 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer

2018-11-26 14:23:44.338 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     raise RuntimeError(m)
2018-11-26 14:23:44.338 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int RuntimeError: Switch connection timeout
2018-11-26 14:23:44.338 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int
2018-11-26 14:23:44.339 28480 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.
2018-11-26 14:23:44.340 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:623 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:30.020 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1875
2018-11-26 14:23:44.340 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Loop iteration exceeded interval (2 vs. 30.0197079182)! loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1882
2018-11-26 14:23:44.340 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:624 started rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2086
2018-11-26 14:24:14.353 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Switch connection timeout
2018-11-26 14:24:14.354 28480 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): DbGetCommand(column=datapath_id, table=Bridge, record=br-int) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-11-26 14:24:14.355 28480 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:121
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Failed to communicate with the switch: RuntimeError: Switch connection timeout
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int Traceback (most recent call last):
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/br_int.py", line 52, in check_canary_table
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     flows = self.dump_flows(constants.CANARY_TABLE)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py", line 156, in dump_flows
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     (dp, ofp, ofpp) = self._get_dp()
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_bridge.py", line 69, in _get_dp
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     self._cached_dpid = new_dpid
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     self.force_reraise()
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     six.reraise(self.type_, self.value, self.tb)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_bridge.py", line 52, in _get_dp
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     dp = self._get_dp_by_dpid(self._cached_dpid)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py", line 79, in _get_dp_by_dpid
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     raise RuntimeError(m)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int RuntimeError: Switch connection timeout
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int
2018-11-26 14:24:14.356 28480 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.
2018-11-26 14:24:14.356 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:624 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:30.016 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1875
2018-11-26 14:24:14.357 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Loop iteration exceeded interval (2 vs. 30.0161988735)! loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1882
2018-11-26 14:24:14.357 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:625 started rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2086

(overcloud) [stack@undercloud-0 ~]$ openstack versions show
+-------------+----------------+---------+------------+-----------------------------------+------------------+------------------+
| Region Name | Service Type | Version | Status | Endpoint | Min Microversion | Max Microversion |
+-------------+----------------+---------+------------+-----------------------------------+------------------+------------------+
| regionOne | block-storage | 2.0 | DEPRECATED | http://10.0.0.101:8776/v2/ | None | None |
| regionOne | block-storage | 3.0 | CURRENT | http://10.0.0.101:8776/v3/ | 3.0 | 3.55 |
| regionOne | placement | None | CURRENT | http://10.0.0.101:8778/placement/ | None | None |
| regionOne | network | 2.0 | CURRENT | http://10.0.0.101:9696/v2.0/ | None | None |
| regionOne | alarm | 2.0 | CURRENT | http://10.0.0.101:8042/v2 | None | None |
| regionOne | cloudformation | 1.0 | CURRENT | http://10.0.0.101:8000/v1/ | None | None |
| regionOne | event | 2.0 | CURRENT | http://10.0.0.101:8977/v2 | None | None |
| regionOne | orchestration | 1.0 | CURRENT | http://10.0.0.101:8004/v1/ | None | None |
| regionOne | object-store | 1.0 | CURRENT | http://10.0.0.101:8080/v1/ | None | None |
| regionOne | compute | 2.0 | SUPPORTED | http://10.0.0.101:8774/v2/ | None | None |
| regionOne | compute | 2.1 | CURRENT | http://10.0.0.101:8774/v2.1/ | 2.1 | 2.65 |
| regionOne | image | 2.0 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.1 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.2 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.3 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.4 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.5 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.6 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.7 | CURRENT | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | metric | 1.0 | CURRENT | http://10.0.0.101:8041/v1/ | None | None |
| regionOne | identity | 3.10 | CURRENT | http://10.0.0.101:5000/v3/ | None | None |
+-------------+----------------+---------+------------+-----------------------------------+------------------+------------------+
(overcloud) [stack@undercloud-0 ~]$ cat /etc/re
redhat-lsb/ redhat-release request-key.conf request-key.d/ resolv.conf
(overcloud) [stack@undercloud-0 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)

--- Additional comment from Candido Campos on 2019-02-28 14:32:57 CET ---

The change that avoids the issue is adding a PIDFile= directive to the OVS systemd units. With Type=forking and no PIDFile=, systemd has to guess the daemon's main PID; when it ends up tracking the wrong process, a kill of the real daemon is not recognized as a unit failure and Restart=on-failure does not trigger, which would explain the intermittent (roughly 20%) failures:

/usr/lib/systemd/system/ovsdb-server.service

...
[Service]
Type=forking
+PIDFile=/var/run/openvswitch/ovsdb-server.pid
Restart=on-failure

...

/usr/lib/systemd/system/ovs-vswitchd.service

....
[Service]
Type=forking
+PIDFile=/var/run/openvswitch/ovs-vswitchd.pid
Restart=on-failure
....
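
For hosts still waiting on the fixed package, the same directive could presumably also be applied as a systemd drop-in instead of editing the shipped unit files; a minimal sketch (the drop-in path is hypothetical, and the delivered fix patches the units directly, as in the diff below):

# Hypothetical drop-in for ovs-vswitchd; repeat analogously for ovsdb-server.
mkdir -p /etc/systemd/system/ovs-vswitchd.service.d
cat > /etc/systemd/system/ovs-vswitchd.service.d/pidfile.conf <<'EOF'
[Service]
PIDFile=/var/run/openvswitch/ovs-vswitchd.pid
EOF
systemctl daemon-reload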


diff --git a/rhel/usr_lib_systemd_system_ovs-vswitchd.service.in b/rhel/usr_lib_systemd_system_ovs-vswitchd.service.in
index 525deae0b..82925133d 100644
--- a/rhel/usr_lib_systemd_system_ovs-vswitchd.service.in
+++ b/rhel/usr_lib_systemd_system_ovs-vswitchd.service.in
@@ -9,6 +9,7 @@ PartOf=openvswitch.service
 
 [Service]
 Type=forking
+PIDFile=/var/run/openvswitch/ovs-vswitchd.pid
 Restart=on-failure
 Environment=XDG_RUNTIME_DIR=/var/run/openvswitch
 EnvironmentFile=/etc/openvswitch/default.conf
diff --git a/rhel/usr_lib_systemd_system_ovsdb-server.service b/rhel/usr_lib_systemd_system_ovsdb-server.service
index 70da1ec95..a7a1e03cb 100644
--- a/rhel/usr_lib_systemd_system_ovsdb-server.service
+++ b/rhel/usr_lib_systemd_system_ovsdb-server.service
@@ -8,6 +8,7 @@ PartOf=openvswitch.service
 [Service]
 Type=forking
 Restart=on-failure
+PIDFile=/var/run/openvswitch/ovsdb-server.pid
 EnvironmentFile=/etc/openvswitch/default.conf
 EnvironmentFile=-/etc/sysconfig/openvswitch
 ExecStartPre=/usr/bin/chown ${OVS_USER_ID} /var/run/openvswitch /var/log/openvswitch
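
One way to sanity-check the change afterwards is to confirm that systemd now tracks the daemon's real PID and restarts it when killed; a sketch assuming the default unit names:

# MainPID reported by systemd should match the PID recorded in the pidfile
systemctl show -p MainPID ovs-vswitchd
cat /var/run/openvswitch/ovs-vswitchd.pid

# Kill the daemon and confirm systemd restarts it
pkill -9 ovs-vswitchd
sleep 5
systemctl is-active ovs-vswitchd   # expected: active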

