Bug 2052987

Summary: 504 gateway timeout after rebooting 2 (out of 3) controller nodes
Product: Red Hat OpenStack
Reporter: Eduardo Olivares <eolivare>
Component: python-networking-ovn
Assignee: Terry Wilson <twilson>
Status: CLOSED ERRATA
QA Contact: Bharath M V <bmv>
Severity: high
Priority: high
Version: 16.1 (Train)
CC: apevec, ccamposr, egarciar, jlibosva, jschluet, lhh, majopela, scohen, twilson
Target Milestone: z9
Keywords: AutomationBlocker, Triaged
Target Release: 16.1 (Train on RHEL 8.2)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: python-networking-ovn-7.3.1-1.20220224203614.4e24f4c.el8ost
Doc Type: No Doc Update
Last Closed: 2022-12-07 20:25:48 UTC
Type: Bug

Description Eduardo Olivares 2022-02-10 12:01:07 UTC
Description of problem:
The issue was reproduced by a tobiko job, more specifically by the test test_controllers_shutdown.

Link to tobiko results: http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/467/infrared/.workspaces/workspace_2022-02-09_20-08-17/tobiko_gate/tobiko_gate_05_faults_faults.html
Link to openstack logs: http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/467/



Test steps:
1) Power off and power on two controllers
2022-02-09 22:18:31.142 235662 INFO tobiko.tests.faults.ha.cloud_disruptions - Power off 2 random controller nodes: ['controller-0', 'controller-2']
2022-02-09 22:18:31.184 235662 INFO tobiko.openstack.ironic._node - Power off baremetal node '3f165d79-de5e-4c54-8f0e-9d4e02590a36' (power state = 'power on').
2022-02-09 22:18:46.381 235662 INFO tobiko.openstack.ironic._node - Power off baremetal node 'f6fcba0c-08da-4dfd-b678-7822e43c7340' (power state = 'power on').
2022-02-09 22:19:09.641 235662 INFO tobiko.tests.faults.ha.cloud_disruptions - Power on controller nodes: ['controller-0', 'controller-2']
2022-02-09 22:19:09.684 235662 INFO tobiko.openstack.ironic._node - Power on baremetal node '3f165d79-de5e-4c54-8f0e-9d4e02590a36' (power_state='power off').
2022-02-09 22:19:24.875 235662 INFO tobiko.openstack.ironic._node - Power on baremetal node 'f6fcba0c-08da-4dfd-b678-7822e43c7340' (power_state='power off').


2) Wait until all the pacemaker resources are healthy
2022-02-09 22:23:13.672 235662 INFO tobiko.tripleo.pacemaker - pcs status checks: all resources are in healthy state

3) Create a heat stack, including a port, a VM instance, etc.
2022-02-09 22:31:33.398 235662 WARNING tobiko.openstack.heat._stack - Stack 'tobiko.openstack.tests._nova.TestServerCreationStack-235662-0' reached unexpected status: 'CREATE_FAILED'
2022-02-09 22:31:33.399 235662 INFO tobiko.openstack.heat._stack - Stack 'tobiko.openstack.tests._nova.TestServerCreationStack-235662-0' status is 'CREATE_FAILED'. Reason:
	Resource CREATE failed: NeutronClientException: resources.port: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

This failure affects the rest of the job and makes many tests fail: for the remainder of the run, neutron requests keep being rejected with error code 504.

In total, 62 '504' errors occurred during the execution of the job:
$ zgrep ' 504 ' controller-0/var/log/containers/haproxy/haproxy.log.1.gz | grep -c neutron
62

First 504 error at 22:31:30:
$ zgrep ' 504 ' controller-0/var/log/containers/haproxy/haproxy.log.1.gz | grep neutron | head -1
Feb  9 22:31:30 controller-0 haproxy[14]: 10.0.0.122:58088 [09/Feb/2022:22:29:30.051] neutron neutron/controller-1.internalapi.redhat.local 0/0/0/-1/120001 504 194 - - sH-- 305/1/0/0/0 0/0 "POST /v2.0/ports HTTP/1.1"

Last 504 error at 23:38:03:
$ zgrep ' 504 ' controller-0/var/log/containers/haproxy/haproxy.log.1.gz | grep neutron | tail -1
Feb  9 23:38:03 controller-0 haproxy[14]: 172.17.1.44:54154 [09/Feb/2022:23:36:03.867] neutron neutron/controller-0.internalapi.redhat.local 0/0/0/-1/120008 504 194 - - sH-- 544/6/0/0/0 0/0 "GET /v2.0/ports?network_id=425e4e84-1c49-4e94-8dfd-f966abb7d27e&device_owner=network%3Adhcp HTTP/1.1"
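
In these haproxy entries the server response timer is -1 (no response was received from the backend) while the total time is roughly 120000 ms, and the termination state is 'sH--' (server-side timeout while waiting for the response headers): haproxy returns 504 because neutron-server never answers within the 2-minute server timeout. The snippet below is a small, hypothetical helper (not part of the tobiko job) that pulls those fields out of such lines; it assumes the standard haproxy HTTP log format used above, with timers Tq/Tw/Tc/Tr/Tt followed by the status code, byte count, the two '-' cookie captures and the termination state.

import re

# Timers Tq/Tw/Tc/Tr/Tt, then status, bytes, the two cookie captures ('-' here)
# and the 4-character termination state.
HAPROXY_RE = re.compile(
    r'(?P<timers>-?\d+/-?\d+/-?\d+/-?\d+/-?\d+) '
    r'(?P<status>\d{3}) \d+ \S+ \S+ (?P<term>\S{4})')

def summarize(line):
    m = HAPROXY_RE.search(line)
    if not m:
        return None
    tq, tw, tc, tr, tt = m.group('timers').split('/')
    return {'status': m.group('status'),
            'server_response_ms': tr,        # -1: no response from the backend
            'total_ms': tt,
            'termination': m.group('term')}  # 'sH--': server timeout on headers

line = ('Feb  9 22:31:30 controller-0 haproxy[14]: 10.0.0.122:58088 '
        '[09/Feb/2022:22:29:30.051] neutron '
        'neutron/controller-1.internalapi.redhat.local '
        '0/0/0/-1/120001 504 194 - - sH-- 305/1/0/0/0 0/0 '
        '"POST /v2.0/ports HTTP/1.1"')
print(summarize(line))
# {'status': '504', 'server_response_ms': '-1', 'total_ms': '120001', 'termination': 'sH--'}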



Some information from neutron server logs (see comment #1 for the link to these logs):
1) Many pymysql exceptions on controller-1 between 22:18:56 and 22:21:29, while controller-0 and controller-2 are rebooting.
2) Many log entries like the following on controller-1 between 22:23:19 and 22:45:46 (the OVSDB transactions keep returning TRY_AGAIN, i.e. they cannot be committed and ovsdbapp retries them):
2022-02-09 22:23:19.606 30 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): CheckLivenessCommand() do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2022-02-09 22:23:19.606 30 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] OVSDB transaction returned TRY_AGAIN, retrying do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:93
...
2022-02-09 22:45:46.802 29 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=244067 command(idx=0): CheckLivenessCommand() do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2022-02-09 22:45:46.802 29 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] OVSDB transaction returned TRY_AGAIN, retrying do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:93
3) The neutron server process restarts on controller-1 at 22:48:23.
4) Exceptions like the following are raised on controller-1 at 22:49:19 (a guarded sketch of the failing call follows the traceback):
2022-02-09 22:49:19.532 25 ERROR ovsdbapp.event Traceback (most recent call last):
2022-02-09 22:49:19.532 25 ERROR ovsdbapp.event   File "/usr/lib/python3.6/site-packages/ovsdbapp/event.py", line 143, in notify_loop
2022-02-09 22:49:19.532 25 ERROR ovsdbapp.event     match.run(event, row, updates)
2022-02-09 22:49:19.532 25 ERROR ovsdbapp.event   File "/usr/lib/python3.6/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py", line 448, in run
2022-02-09 22:49:19.532 25 ERROR ovsdbapp.event     self.driver.delete_mac_binding_entries(row.external_ip)
2022-02-09 22:49:19.532 25 ERROR ovsdbapp.event   File "/usr/lib/python3.6/site-packages/networking_ovn/ml2/mech_driver.py", line 954, in delete_mac_binding_entries
2022-02-09 22:49:19.532 25 ERROR ovsdbapp.event     mac_binds = self._sb_ovn.db_find_rows(
2022-02-09 22:49:19.532 25 ERROR ovsdbapp.event AttributeError: 'NoneType' object has no attribute 'db_find_rows'
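
The traceback shows the MAC_Binding cleanup handler in ovsdb_monitor.py calling the driver's delete_mac_binding_entries() while the driver's southbound OVSDB connection (self._sb_ovn) is still None, i.e. the event fires before that connection has been (re)initialized in the restarted workers. The following is only an illustrative sketch of a defensive guard for the call shown in the traceback, under those assumptions; it is not the fix shipped in the errata, and the find condition and deletion loop are assumptions based on the method name.

from oslo_log import log as logging

LOG = logging.getLogger(__name__)

# Sketch of the method body from networking_ovn/ml2/mech_driver.py referenced in
# the traceback; only the db_find_rows() call is taken from the traceback, the
# rest (guard, condition, deletion loop) is illustrative.
def delete_mac_binding_entries(self, external_ip):
    if self._sb_ovn is None:
        # The southbound IDL connection is not ready yet (e.g. right after the
        # neutron-server workers restarted); skip the cleanup instead of
        # crashing the ovsdbapp notify loop with an AttributeError.
        LOG.debug("Skipping MAC_Binding cleanup for %s: southbound OVSDB "
                  "connection not initialized yet", external_ip)
        return
    mac_binds = self._sb_ovn.db_find_rows(
        'MAC_Binding', ('ip', '=', external_ip)).execute(check_error=True)
    for entry in mac_binds:
        # db_destroy deletes a row identified by table name and record uuid.
        self._sb_ovn.db_destroy('MAC_Binding', entry.uuid).execute(
            check_error=True)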

Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20220203.n.1
python3-networking-ovn-7.3.1-1.20220113183502.el8ost.noarch.rpm

How reproducible:
Not often: I have only seen it once, and this job is executed daily.

Steps to Reproduce:
1. Reboot two controller nodes (in an environment with 3 controllers)
2. Wait until the pacemaker status is healthy
3. Try to send any networking request (a minimal check is sketched below)
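
A minimal, hypothetical way to perform step 3 with openstacksdk (the 'overcloud' cloud name and the clouds.yaml credentials are assumptions), which makes the haproxy timeout visible as a client-side error after roughly two minutes:

import time

import openstack  # openstacksdk; assumes credentials are available via clouds.yaml

conn = openstack.connect(cloud='overcloud')  # assumed cloud entry name

start = time.monotonic()
try:
    ports = list(conn.network.ports())
    print(f"Neutron answered in {time.monotonic() - start:.1f}s with {len(ports)} ports")
except Exception as exc:
    # A 504 from haproxy surfaces here as an HTTP error after ~120 s.
    print(f"Neutron request failed after {time.monotonic() - start:.1f}s: {exc}")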

Comment 22 errata-xmlrpc 2022-12-07 20:25:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8795