Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
The FDP team is no longer accepting new bugs in Bugzilla. Please report your issues under the FDP project in Jira. Thanks.

Bug 1684483

Summary: Open vSwitch process is not restarted after a kill command, about 20% of the time
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Timothy Redaelli <tredaelli>
Component: openvswitch2.10
Assignee: Timothy Redaelli <tredaelli>
Status: CLOSED EOL
QA Contact: Rick Alongi <ralongi>
Severity: medium
Docs Contact:
Priority: unspecified
Version: FDP 19.03
CC: ctrautma, jhsiao, jlibosva, qding, ralongi
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: openvswitch2.10-2.10.0-48.el7fdn
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1684477
Environment:
Last Closed: 2024-10-08 17:49:14 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1684477
Bug Blocks: 1653717, 1759242

Description Timothy Redaelli 2019-03-01 11:16:50 UTC
+++ This bug was initially created as a clone of Bug #1684477 +++

+++ This bug was initially created as a clone of Bug #1653717 +++

Description of problem:

 The Open vSwitch process is not restarted after a kill about 20% of the time.


Version-Release number of selected component (if applicable):

 OSP 14

 3 controllers + 3 computes + DVR

How reproducible:

 About 20% of the time.

Steps to Reproduce:

 1. Install a bare-metal host with RHEL (via The Foreman)
 2. Install with Jenkins: OSP 14 -- 3 controllers + 3 computes + DVR
 3. Create several VMs with a FIP
 4. Kill the openvswitch process on a compute node (see the sketch below)
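
A minimal reproduction sketch for step 4 (the unit name and the 10-second settle time are assumptions based on the RHEL openvswitch packaging, not taken from the report):

 # On a compute node, kill ovs-vswitchd repeatedly and check whether
 # systemd restarts it each time.
 for i in $(seq 1 10); do
     pkill -KILL ovs-vswitchd
     sleep 10
     systemctl is-active --quiet ovs-vswitchd || echo "iteration $i: not restarted"
 done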



Actual results:

 The process is not restarted about 20% of the time.

Expected results:

 The process is restarted.

Additional info:

Logs from when the openvswitch restart fails:

[root@compute-2 heat-admin]# tail -f  /var/log/containers/neutron/metadata-agent.log
2018-11-26 08:51:10.883 8224 INFO eventlet.wsgi.server [-] 10.2.0.18,<local> "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" status: 200  len: 139 time: 0.1934390
2018-11-26 08:51:10.968 8223 INFO eventlet.wsgi.server [-] 10.3.0.25,<local> "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" status: 200  len: 139 time: 0.1648369
2018-11-26 09:11:20.883 8224 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: IOError: Socket closed
2018-11-26 09:11:34.307 7137 ERROR oslo.messaging._drivers.impl_rabbit [-] [357cec88-105c-4f2b-b26c-70ae3864ec0f] AMQP server controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:11:35.323 7137 INFO oslo.messaging._drivers.impl_rabbit [-] [357cec88-105c-4f2b-b26c-70ae3864ec0f] Reconnected to AMQP server on controller-2.internalapi.localdomain:5672 via [amqp] client with port 50096.
2018-11-26 09:11:43.938 8223 ERROR oslo.messaging._drivers.impl_rabbit [-] [24405812-662e-4c40-b834-37a07d80366f] AMQP server controller-1.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:11:44.955 8223 INFO oslo.messaging._drivers.impl_rabbit [-] [24405812-662e-4c40-b834-37a07d80366f] Reconnected to AMQP server on controller-1.internalapi.localdomain:5672 via [amqp] client with port 37994.
2018-11-26 09:11:45.218 8224 ERROR oslo.messaging._drivers.impl_rabbit [-] [7bf56d8a-afcf-4b87-b2dd-c865c2faf08f] AMQP server controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:11:46.235 8224 INFO oslo.messaging._drivers.impl_rabbit [-] [7bf56d8a-afcf-4b87-b2dd-c865c2faf08f] Reconnected to AMQP server on controller-2.internalapi.localdomain:5672 via [amqp] client with port 50116.
2018-11-26 09:11:59.117 8223 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: IOError: Socket closed





vi /var/log/containers/neutron/openvswitch-agent.log


2018-11-26 09:12:04.417 7137 ERROR oslo.messaging._drivers.impl_rabbit [-] [8f9c38d5-3ad8-4cab-8446-a2f07e6d370f] AMQP server controller-1.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed
2018-11-26 09:12:05.444 7137 INFO oslo.messaging._drivers.impl_rabbit [-] [8f9c38d5-3ad8-4cab-8446-a2f07e6d370f] Reconnected to AMQP server on controller-1.internalapi.localdomain:5672 via [amqp] client with port 38042.
2018-11-26 09:12:15.386 8224 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer

2018-11-26 14:23:44.338 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     raise RuntimeError(m)
2018-11-26 14:23:44.338 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int RuntimeError: Switch connection timeout
2018-11-26 14:23:44.338 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int
2018-11-26 14:23:44.339 28480 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.
2018-11-26 14:23:44.340 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:623 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:30.020 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1875
2018-11-26 14:23:44.340 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Loop iteration exceeded interval (2 vs. 30.0197079182)! loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1882
2018-11-26 14:23:44.340 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:624 started rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2086
2018-11-26 14:24:14.353 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Switch connection timeout
2018-11-26 14:24:14.354 28480 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): DbGetCommand(column=datapath_id, table=Bridge, record=br-int) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-11-26 14:24:14.355 28480 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:121
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Failed to communicate with the switch: RuntimeError: Switch connection timeout
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int Traceback (most recent call last):
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/br_int.py", line 52, in check_canary_table
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     flows = self.dump_flows(constants.CANARY_TABLE)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py", line 156, in dump_flows
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     (dp, ofp, ofpp) = self._get_dp()
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_bridge.py", line 69, in _get_dp
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     self._cached_dpid = new_dpid
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     self.force_reraise()
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     six.reraise(self.type_, self.value, self.tb)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_bridge.py", line 52, in _get_dp
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     dp = self._get_dp_by_dpid(self._cached_dpid)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py", line 79, in _get_dp_by_dpid
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     raise RuntimeError(m)
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int RuntimeError: Switch connection timeout
2018-11-26 14:24:14.355 28480 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int
2018-11-26 14:24:14.356 28480 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.
2018-11-26 14:24:14.356 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:624 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:30.016 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1875
2018-11-26 14:24:14.357 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Loop iteration exceeded interval (2 vs. 30.0161988735)! loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1882
2018-11-26 14:24:14.357 28480 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-cb6bf194-1ceb-415b-aecc-3ba950053b37 - - - - -] Agent rpc_loop - iteration:625 started rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2086

(overcloud) [stack@undercloud-0 ~]$ openstack versions show
+-------------+----------------+---------+------------+-----------------------------------+------------------+------------------+
| Region Name | Service Type | Version | Status | Endpoint | Min Microversion | Max Microversion |
+-------------+----------------+---------+------------+-----------------------------------+------------------+------------------+
| regionOne | block-storage | 2.0 | DEPRECATED | http://10.0.0.101:8776/v2/ | None | None |
| regionOne | block-storage | 3.0 | CURRENT | http://10.0.0.101:8776/v3/ | 3.0 | 3.55 |
| regionOne | placement | None | CURRENT | http://10.0.0.101:8778/placement/ | None | None |
| regionOne | network | 2.0 | CURRENT | http://10.0.0.101:9696/v2.0/ | None | None |
| regionOne | alarm | 2.0 | CURRENT | http://10.0.0.101:8042/v2 | None | None |
| regionOne | cloudformation | 1.0 | CURRENT | http://10.0.0.101:8000/v1/ | None | None |
| regionOne | event | 2.0 | CURRENT | http://10.0.0.101:8977/v2 | None | None |
| regionOne | orchestration | 1.0 | CURRENT | http://10.0.0.101:8004/v1/ | None | None |
| regionOne | object-store | 1.0 | CURRENT | http://10.0.0.101:8080/v1/ | None | None |
| regionOne | compute | 2.0 | SUPPORTED | http://10.0.0.101:8774/v2/ | None | None |
| regionOne | compute | 2.1 | CURRENT | http://10.0.0.101:8774/v2.1/ | 2.1 | 2.65 |
| regionOne | image | 2.0 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.1 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.2 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.3 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.4 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.5 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.6 | SUPPORTED | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | image | 2.7 | CURRENT | http://10.0.0.101:9292/v2/ | None | None |
| regionOne | metric | 1.0 | CURRENT | http://10.0.0.101:8041/v1/ | None | None |
| regionOne | identity | 3.10 | CURRENT | http://10.0.0.101:5000/v3/ | None | None |
+-------------+----------------+---------+------------+-----------------------------------+------------------+------------------+
(overcloud) [stack@undercloud-0 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)

--- Additional comment from Candido Campos on 2019-02-28 14:32:57 CET ---

The change that avoids the issue is to add a PIDFile= setting to both unit files. Both services use Type=forking, and without PIDFile= systemd has to guess the daemon's main PID; when it guesses wrong, a killed daemon is not recognized as a main-process failure and Restart=on-failure never fires, which matches the intermittent ~20% failure rate:

/usr/lib/systemd/system/ovsdb-server.service

...
[Service]
Type=forking
+PIDFile=/var/run/openvswitch/ovsdb-server.pid
Restart=on-failure

...

/usr/lib/systemd/system/ovs-vswitchd.service

....
[Service]
Type=forking
+PIDFile=/var/run/openvswitch/ovs-vswitchd.pid
Restart=on-failure
....
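
The same setting can also be applied without editing the shipped unit files, via a standard systemd drop-in (a sketch; the drop-in file name is arbitrary, and an analogous file would be needed for ovsdb-server):

 # /etc/systemd/system/ovs-vswitchd.service.d/pidfile.conf
 [Service]
 PIDFile=/var/run/openvswitch/ovs-vswitchd.pid

 # then reload unit definitions:
 systemctl daemon-reload

The corresponding upstream diff: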


diff --git a/rhel/usr_lib_systemd_system_ovs-vswitchd.service.in b/rhel/usr_lib_systemd_system_ovs-vswitchd.service.in
index 525deae0b..82925133d 100644
--- a/rhel/usr_lib_systemd_system_ovs-vswitchd.service.in
+++ b/rhel/usr_lib_systemd_system_ovs-vswitchd.service.in
@@ -9,6 +9,7 @@ PartOf=openvswitch.service
 
 [Service]
 Type=forking
+PIDFile=/var/run/openvswitch/ovs-vswitchd.pid
 Restart=on-failure
 Environment=XDG_RUNTIME_DIR=/var/run/openvswitch
 EnvironmentFile=/etc/openvswitch/default.conf
diff --git a/rhel/usr_lib_systemd_system_ovsdb-server.service b/rhel/usr_lib_systemd_system_ovsdb-server.service
index 70da1ec95..a7a1e03cb 100644
--- a/rhel/usr_lib_systemd_system_ovsdb-server.service
+++ b/rhel/usr_lib_systemd_system_ovsdb-server.service
@@ -8,6 +8,7 @@ PartOf=openvswitch.service
 [Service]
 Type=forking
 Restart=on-failure
+PIDFile=/var/run/openvswitch/ovsdb-server.pid
 EnvironmentFile=/etc/openvswitch/default.conf
 EnvironmentFile=-/etc/sysconfig/openvswitch
 ExecStartPre=/usr/bin/chown ${OVS_USER_ID} /var/run/openvswitch /var/log/openvswitch
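
To verify the fix, one can check that the main PID systemd tracks matches the daemons' own PID files (a hedged check, not part of the original report; the paths are the ones set by the patch above):

 systemctl show -p MainPID ovs-vswitchd ovsdb-server
 cat /var/run/openvswitch/ovs-vswitchd.pid /var/run/openvswitch/ovsdb-server.pid
 # The MainPID values should match the PID files; killing those PIDs should
 # now trigger Restart=on-failure and the services should come back.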

Comment 2 ovs-bot 2024-10-08 17:49:14 UTC
This bug did not meet the criteria for automatic migration and is being closed.
If the issue remains, please open a new ticket in https://issues.redhat.com/browse/FDP