Bug 1914911
| Summary: | OVN DB not started after controllers reboot | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | pkomarov |
| Component: | python-networking-ovn | Assignee: | Jakub Libosvar <jlibosva> |
| Status: | CLOSED DUPLICATE | QA Contact: | Eran Kuris <ekuris> |
| Severity: | low | Docs Contact: | |
| Priority: | high | | |
| Version: | 16.1 (Train) | CC: | apevec, ekuris, eolivare, jlibosva, jmarcian, lhh, majopela, scohen, tfreger, twilson |
| Target Milestone: | z14 | Keywords: | AutomationBlocker, Reopened, Triaged, ZStream |
| Target Release: | 16.1 (Train on RHEL 8.2) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1892578 | Environment: | |
| Last Closed: | 2022-02-22 21:45:38 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1892578, 1914912, 1919276, 1920698, 1986273 | | |
| Bug Blocks: | | | |
Comment 11
Jakub Libosvar
2022-02-21 17:08:32 UTC
Hello Jakub.

(In reply to Jakub Libosvar from comment #11)
> (In reply to Julia Marciano from comment #9)
> Can you please paste which resource-agents version was used in the test?

(undercloud) [stack@undercloud-0 ~]$ cat core_puddle_version
RHOS-16.1-RHEL-8-20220210.n.1
(undercloud) [heat-admin@controller-1 ~]$ rpm -qa|grep resource-agents
resource-agents-4.1.1-44.el8_2.18.x86_64

> I'm also a bit confused about the tobiko output, I'd like to see which test
> broke the environment but all the tests have roughly the same timestamp. Can
> you please provide the first failed test?

The first test that failed on this issue should be:
tobiko/tests/faults/ha/test_cloud_recovery.py::DisruptTripleoNodesTest::test_controllers_shutdown
It seems it started at 2022-02-18 22:25:18.856 244295.

BR, Julia.

I looked at the pacemaker logs and the tobiko logs to correlate what was going on with the cluster. Here is the timeline of events:
The test started at 22:25:19 by powering off controller-1 and controller-2 and then powering them on:
2022-02-18 22:25:19.468 244295 INFO tobiko.tests.faults.ha.cloud_disruptions - Power off 2 random controller nodes: ['controller-1', 'controller-2']
2022-02-18 22:25:19.518 244295 INFO tobiko.openstack.ironic._node - Power off baremetal node 'd1e86343-7f0b-4802-96f7-0970859673e5' (power state = 'power on').
2022-02-18 22:25:34.718 244295 INFO tobiko.openstack.ironic._node - Power off baremetal node '5fa5e3f7-05e6-4d2b-93d1-409432ca0f9e' (power state = 'power on').
2022-02-18 22:25:57.995 244295 INFO tobiko.tests.faults.ha.cloud_disruptions - Power on controller nodes: ['controller-1', 'controller-2']
2022-02-18 22:25:58.037 244295 INFO tobiko.openstack.ironic._node - Power on baremetal node 'd1e86343-7f0b-4802-96f7-0970859673e5' (power_state='power off').
2022-02-18 22:26:13.227 244295 INFO tobiko.openstack.ironic._node - Power on baremetal node '5fa5e3f7-05e6-4d2b-93d1-409432ca0f9e' (power_state='power off').
The master node was controller-0, and pacemaker noticed the slaves were down:
Feb 18 22:26:01 controller-0 pacemaker-controld [24665] (peer_update_callback) info: Remote node ovn-dbs-bundle-1 is now in unknown state
Feb 18 22:26:01 controller-0 pacemaker-controld [24665] (peer_update_callback) info: Remote node ovn-dbs-bundle-2 is now in unknown state
Feb 18 22:26:01 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
Feb 18 22:26:01 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): FAILED controller-1 (UNCLEAN)
Feb 18 22:26:01 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): FAILED controller-2 (UNCLEAN)
With 2 out of 3 nodes down, the cluster lost quorum:
Feb 18 22:26:01 controller-0 pacemaker-schedulerd[24664] (LogAction) notice: * Stop ovndb_servers:0 ( Master ovn-dbs-bundle-0 ) due to no quorum
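The quorum decision here is plain majority voting: a 3-node cluster needs at least 2 live members, so with only controller-0 left the cluster cannot act. A minimal illustrative sketch of the rule (not pacemaker's actual code):

```python
def has_quorum(total_nodes: int, nodes_online: int) -> bool:
    # Strict majority: more than half of the expected members must be online.
    return nodes_online > total_nodes // 2

# 3-node controller cluster, 2 controllers powered off:
print(has_quorum(3, 1))  # False: only controller-0 is left
print(has_quorum(3, 2))  # True: quorum returns once one peer comes back
```

This is why the scheduler stops the OVN master resource above, and why it can promote again only after controller-1 rejoins.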
The nodes were fenced:
Feb 18 22:27:02 controller-0 pacemaker-schedulerd[24664] (pe_fence_node) warning: Guest node ovn-dbs-bundle-2 will be fenced (by recovering its guest resource ovn-dbs-bundle-podman-2): ovndb_servers:2 is thought to be active there
Feb 18 22:27:02 controller-0 pacemaker-schedulerd[24664] (pe_fence_node) warning: Guest node ovn-dbs-bundle-1 will be fenced (by recovering its guest resource ovn-dbs-bundle-podman-1): ovndb_servers:1 is thought to be active there
OVN is started on controller-1 while controller-0 is in slave mode:
Feb 18 22:27:10 controller-0 pacemaker-controld [24665] (te_rsc_command) notice: Initiating start operation ovn-dbs-bundle-podman-2_start_0 on controller-1 | action 212
Feb 18 22:27:37 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Slave controller-0
Feb 18 22:27:37 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Stopped
Feb 18 22:27:37 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Stopped controller-1
Once controller-1 is up, the cluster regains quorum and promotes the master on controller-0:
Feb 18 22:27:52 controller-0 pacemaker-schedulerd[24664] (LogAction) notice: * Promote ovndb_servers:0 ( Slave -> Master ovn-dbs-bundle-0 )
Tobiko notices the service is down:
2022-02-18 22:28:04.088 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource ovn is in not in healthy state
Feb 18 22:28:24 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
Feb 18 22:28:24 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Stopped
Feb 18 22:28:24 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-1
The OVN DB bundle is started on controller-2:
Feb 18 22:28:24 controller-0 pacemaker-schedulerd[24664] (LogAction) notice: * Start ovn-dbs-bundle-podman-1 ( controller-2 )
Feb 18 22:28:24 controller-0 pacemaker-schedulerd[24664] (LogAction) notice: * Start ovn-dbs-bundle-1 ( controller-2 )
Feb 18 22:28:31 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
Feb 18 22:28:31 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Stopped controller-2
Feb 18 22:28:31 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-1
Tobiko still sees the cluster as down:
2022-02-18 22:28:33.649 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource ovn is in not in healthy state
Tobiko then sees the cluster as healthy:
2022-02-18 22:28:35.409 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource ovn is in healthy state
The OVN cluster is fully back online:
Feb 18 22:29:05 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
Feb 18 22:29:05 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-2
Feb 18 22:29:05 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-1
However, other services remain in trouble for another 40 seconds:
2022-02-18 22:29:44.168 244295 INFO tobiko.tripleo.pacemaker - Retrying pacemaker resource checks attempt 43 of 360
2022-02-18 22:29:45.867 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource rabbitmq is in healthy state
2022-02-18 22:29:45.869 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource galera is in healthy state
2022-02-18 22:29:45.872 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource redis is in healthy state
2022-02-18 22:29:45.875 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resources vips are in healthy state
2022-02-18 22:29:45.879 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resources ha_proxy and cinder are in healthy state
2022-02-18 22:29:45.886 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource ovn is in healthy state
2022-02-18 22:29:45.886 244295 INFO tobiko.tripleo.pacemaker - pcs status checks: all resources are in healthy state
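The "attempt 43 of 360" lines above come from tobiko polling the cluster health repeatedly until it stabilizes. The shape of that polling loop can be sketched roughly like this (a hypothetical simplification, not tobiko's actual implementation; the 360-attempt budget is taken from the log):

```python
import time

def wait_until_healthy(check, attempts=360, interval=1.0):
    # Retry a health check until it passes or the attempt budget runs out.
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        print(f"Retrying pacemaker resource checks attempt {attempt} of {attempts}")
        time.sleep(interval)
    raise TimeoutError("resources did not become healthy")

# Toy check that succeeds on the third poll:
state = {"polls": 0}
def check():
    state["polls"] += 1
    return state["polls"] >= 3

print(wait_until_healthy(check, attempts=5, interval=0.0))  # -> 3
```

With a budget of 360 attempts, a transient 40-second recovery like the one above passes; a resource that never recovers (as in the later failure) exhausts the budget instead.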
Tobiko fails later when performing Heat operations:
2022-02-18 22:29:54.167 244295 ERROR tobiko.openstack.heat._stack - Key 'use_extra_dhcp_opts' not found in template for stack 'tobiko.openstack.tests._nova.TestServerCreationStack-244295-0'
2022-02-18 22:29:54.168 244295 ERROR tobiko.openstack.heat._stack - Key 'vlan_id' not found in template for stack 'tobiko.openstack.tests._nova.TestServerCreationStack-244295-0'
2022-02-18 22:29:54.168 244295 ERROR tobiko.openstack.heat._stack - Key 'vlan_network' not found in template for stack 'tobiko.openstack.tests._nova.TestServerCreationStack-244295-0'
2022-02-18 22:32:03.788 244295 WARNING tobiko.openstack.heat._stack - Stack 'tobiko.openstack.tests._nova.TestServerCreationStack-244295-0' reached unexpected status: 'CREATE_FAILED'
2022-02-18 22:32:03.789 244295 INFO tobiko.openstack.heat._stack - Stack 'tobiko.openstack.tests._nova.TestServerCreationStack-244295-0' status is 'CREATE_FAILED'. Reason:
Resource CREATE failed: NeutronClientException: resources.port: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
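The timeline above was built by correlating pacemaker syslog lines with tobiko log lines. Such a merge can be sketched roughly as follows (a hypothetical helper, not part of tobiko; the year is assumed because syslog timestamps omit it, and the timestamp formats are taken from the excerpts above):

```python
from datetime import datetime

def parse_tobiko(line):
    # e.g. "2022-02-18 22:25:19.468 244295 INFO ..." -> first 23 chars
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")

def parse_syslog(line, year=2022):
    # e.g. "Feb 18 22:26:01 controller-0 ..." -> first 15 chars, year assumed
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def merge_timeline(tobiko_lines, pacemaker_lines):
    # Tag each line with its source, then sort everything by timestamp.
    events = [(parse_tobiko(l), "tobiko", l) for l in tobiko_lines]
    events += [(parse_syslog(l), "pacemaker", l) for l in pacemaker_lines]
    return sorted(events)

tobiko = ["2022-02-18 22:25:19.468 244295 INFO ... Power off 2 random controller nodes"]
pcmk = ["Feb 18 22:26:01 controller-0 pacemaker-controld ... ovn-dbs-bundle-1 unknown"]
for ts, src, line in merge_timeline(tobiko, pcmk):
    print(ts, src)
```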
Based on the analysis above, I don't think this is the same issue as in the description, where the OVN cluster doesn't come up fully and one of the slaves remains in the Stopped state.
Julia, I think we should open a new BZ as it seems to be a different problem.
Actually, we don't need a new BZ. I looked at the Neutron server logs too, and it's a known issue from bug 2052987. I'm closing this BZ as the original duplicate.

*** This bug has been marked as a duplicate of bug 2011934 ***

Hi Jakub. Thank you so much for your help and for the amazing analysis. We will continue with the other bug. BR, Julia.

Hi Jakub. The problem is that the ovn-dbs resource becomes unhealthy again and never recovers afterwards; please see comment #9. Time stamp:

Feb 19 05:29:08 controller-0 pacemaker-schedulerd[24664]: warning: Forcing ovndb_servers:1 away from ovn-dbs-bundle-1 after 1000000 failures (max=100000

And it seems that it happens after another shutdown operation in this test:

2022-02-18 23:01:34.087 244295 INFO tobiko.tests.faults.ha.cloud_disruptions - Power off 2 random controller nodes: ['controller-2', 'controller-1']
2022-02-18 23:01:34.134 244295 INFO tobiko.openstack.ironic._node - Power off baremetal node '5fa5e3f7-05e6-4d2b-93d1-409432ca0f9e' (power state = 'power on').
2022-02-18 23:01:54.343 244295 INFO tobiko.openstack.ironic._node - Power off baremetal node 'd1e86343-7f0b-4802-96f7-0970859673e5' (power state = 'power on').

It looks similar to the original issue. It happens in other tobiko tests/jobs as well. Would you advise opening a new BZ for this issue? Thanks, Julia.
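The "Forcing ovndb_servers:1 away ... after 1000000 failures" message reflects pacemaker's fail-count handling: 1000000 is pacemaker's internal representation of INFINITY, which a fatal failure sets directly, and once the fail-count reaches the migration threshold the resource is banned from that node until an operator clears the count (e.g. with `pcs resource cleanup`). An illustrative sketch of that decision, not pacemaker's actual code:

```python
INFINITY = 1000000  # pacemaker's internal value for INFINITY

def banned_from_node(fail_count: int, migration_threshold: int) -> bool:
    # A fatal failure bumps fail-count straight to INFINITY, which exceeds
    # any threshold, so the resource is forced away from the node until
    # the fail-count is cleared.
    return fail_count >= migration_threshold

print(banned_from_node(INFINITY, 100000))  # True: the log line above
print(banned_from_node(3, 100000))         # False: a few failures are tolerated
```

That would explain why ovn-dbs never recovers on its own after the second shutdown: the ban persists until the fail-count is reset, regardless of whether the node is otherwise healthy.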