(In reply to Julia Marciano from comment #9)

Can you please paste which resource-agents version was used in the test?

I'm also a bit confused about the tobiko output: I'd like to see which test broke the environment, but all the tests have roughly the same timestamp. Can you please provide the first failed test?
Hello Jakub.

(In reply to Jakub Libosvar from comment #11)
> (In reply to Julia Marciano from comment #9)
> Can you please paste which resource-agents version was used in the test?

(undercloud) [stack@undercloud-0 ~]$ cat core_puddle_version
RHOS-16.1-RHEL-8-20220210.n.1

[heat-admin@controller-1 ~]$ rpm -qa|grep resource-agents
resource-agents-4.1.1-44.el8_2.18.x86_64

> I'm also a bit confused about the tobiko output, I'd like to see which test
> broke the environment but all the tests have roughly the same timestamp. Can
> you please provide the first failed test?

The first test that failed on this issue should be:

tobiko/tests/faults/ha/test_cloud_recovery.py::DisruptTripleoNodesTest::test_controllers_shutdown

It seems it started at 2022-02-18 22:25:18.856 (process 244295).

BR, Julia.
I looked at the pacemaker logs and the tobiko logs to correlate what was going on with the cluster. Here is the timeline of events:

The test started at 22:25:19 by powering off controller-1 and controller-2 and then powering them back on:

2022-02-18 22:25:19.468 244295 INFO tobiko.tests.faults.ha.cloud_disruptions - Power off 2 random controller nodes: ['controller-1', 'controller-2']
2022-02-18 22:25:19.518 244295 INFO tobiko.openstack.ironic._node - Power off baremetal node 'd1e86343-7f0b-4802-96f7-0970859673e5' (power state = 'power on').
2022-02-18 22:25:34.718 244295 INFO tobiko.openstack.ironic._node - Power off baremetal node '5fa5e3f7-05e6-4d2b-93d1-409432ca0f9e' (power state = 'power on').
2022-02-18 22:25:57.995 244295 INFO tobiko.tests.faults.ha.cloud_disruptions - Power on controller nodes: ['controller-1', 'controller-2']
2022-02-18 22:25:58.037 244295 INFO tobiko.openstack.ironic._node - Power on baremetal node 'd1e86343-7f0b-4802-96f7-0970859673e5' (power_state='power off').
2022-02-18 22:26:13.227 244295 INFO tobiko.openstack.ironic._node - Power on baremetal node '5fa5e3f7-05e6-4d2b-93d1-409432ca0f9e' (power_state='power off').

The master node was controller-0, and pacemaker noticed the slaves were down:

Feb 18 22:26:01 controller-0 pacemaker-controld [24665] (peer_update_callback) info: Remote node ovn-dbs-bundle-1 is now in unknown state
Feb 18 22:26:01 controller-0 pacemaker-controld [24665] (peer_update_callback) info: Remote node ovn-dbs-bundle-2 is now in unknown state
Feb 18 22:26:01 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
Feb 18 22:26:01 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): FAILED controller-1 (UNCLEAN)
Feb 18 22:26:01 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): FAILED controller-2 (UNCLEAN)

2 out of 3 nodes were down, therefore the cluster lost quorum:

Feb 18 22:26:01 controller-0 pacemaker-schedulerd[24664] (LogAction) notice: * Stop ovndb_servers:0 ( Master ovn-dbs-bundle-0 ) due to no quorum

The nodes were fenced:

Feb 18 22:27:02 controller-0 pacemaker-schedulerd[24664] (pe_fence_node) warning: Guest node ovn-dbs-bundle-2 will be fenced (by recovering its guest resource ovn-dbs-bundle-podman-2): ovndb_servers:2 is thought to be active there
Feb 18 22:27:02 controller-0 pacemaker-schedulerd[24664] (pe_fence_node) warning: Guest node ovn-dbs-bundle-1 will be fenced (by recovering its guest resource ovn-dbs-bundle-podman-1): ovndb_servers:1 is thought to be active there

OVN is started on controller-1 while controller-0 is in slave mode:

Feb 18 22:27:10 controller-0 pacemaker-controld [24665] (te_rsc_command) notice: Initiating start operation ovn-dbs-bundle-podman-2_start_0 on controller-1 | action 212
Feb 18 22:27:37 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Slave controller-0
Feb 18 22:27:37 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Stopped
Feb 18 22:27:37 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Stopped controller-1

Once controller-1 is up, the cluster gets quorum back and promotes the master on controller-0:

Feb 18 22:27:52 controller-0 pacemaker-schedulerd[24664] (LogAction) notice: * Promote ovndb_servers:0 ( Slave -> Master ovn-dbs-bundle-0 )

Tobiko notices the service is down:

2022-02-18 22:28:04.088 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource ovn is in not in healthy state

Feb 18 22:28:24 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
Feb 18 22:28:24 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Stopped
Feb 18 22:28:24 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-1

ovndb is started on controller-2:

Feb 18 22:28:24 controller-0 pacemaker-schedulerd[24664] (LogAction) notice: * Start ovn-dbs-bundle-podman-1 ( controller-2 )
Feb 18 22:28:24 controller-0 pacemaker-schedulerd[24664] (LogAction) notice: * Start ovn-dbs-bundle-1 ( controller-2 )
Feb 18 22:28:31 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
Feb 18 22:28:31 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Stopped controller-2
Feb 18 22:28:31 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-1

Tobiko still sees the cluster as down:

2022-02-18 22:28:33.649 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource ovn is in not in healthy state

Tobiko sees the cluster as healthy:

2022-02-18 22:28:35.409 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource ovn is in healthy state

The OVN cluster is back fully online:

Feb 18 22:29:05 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
Feb 18 22:29:05 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-2
Feb 18 22:29:05 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-1

However, other services are still in trouble for another 40 seconds:

2022-02-18 22:29:44.168 244295 INFO tobiko.tripleo.pacemaker - Retrying pacemaker resource checks attempt 43 of 360
2022-02-18 22:29:45.867 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource rabbitmq is in healthy state
2022-02-18 22:29:45.869 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource galera is in healthy state
2022-02-18 22:29:45.872 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource redis is in healthy state
2022-02-18 22:29:45.875 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resources vips are in healthy state
2022-02-18 22:29:45.879 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resources ha_proxy and cinder are in healthy state
2022-02-18 22:29:45.886 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource ovn is in healthy state
2022-02-18 22:29:45.886 244295 INFO tobiko.tripleo.pacemaker - pcs status checks: all resources are in healthy state

Tobiko fails later when performing heat operations:

2022-02-18 22:29:54.167 244295 ERROR tobiko.openstack.heat._stack - Key 'use_extra_dhcp_opts' not found in template for stack 'tobiko.openstack.tests._nova.TestServerCreationStack-244295-0'
2022-02-18 22:29:54.168 244295 ERROR tobiko.openstack.heat._stack - Key 'vlan_id' not found in template for stack 'tobiko.openstack.tests._nova.TestServerCreationStack-244295-0'
2022-02-18 22:29:54.168 244295 ERROR tobiko.openstack.heat._stack - Key 'vlan_network' not found in template for stack 'tobiko.openstack.tests._nova.TestServerCreationStack-244295-0'
2022-02-18 22:32:03.788 244295 WARNING tobiko.openstack.heat._stack - Stack 'tobiko.openstack.tests._nova.TestServerCreationStack-244295-0' reached unexpected status: 'CREATE_FAILED'
2022-02-18 22:32:03.789 244295 INFO tobiko.openstack.heat._stack - Stack 'tobiko.openstack.tests._nova.TestServerCreationStack-244295-0' status is 'CREATE_FAILED'. Reason:
Resource CREATE failed: NeutronClientException: resources.port: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

Based on the analysis above, I don't think this is the same issue as in the description, where the OVN cluster doesn't come up fully and one of the slaves remains in the Stopped state. Julia, I think we should open a new BZ as it seems to be a different problem.
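For readers following the timeline: the "due to no quorum" stop and the later re-promotion both follow from simple majority arithmetic: with 2 of 3 controllers powered off, the surviving node cannot form a majority, and quorum returns as soon as one peer comes back. A minimal sketch of that rule (function name is illustrative, not pacemaker code):

```python
def has_quorum(active_nodes: int, total_nodes: int) -> bool:
    """Majority quorum: strictly more than half the nodes must be up."""
    return active_nodes > total_nodes // 2

# 3-controller cluster, as in this environment:
assert has_quorum(3, 3)        # all controllers up
assert has_quorum(2, 3)        # one controller down: still quorate
assert not has_quorum(1, 3)    # controller-1 and controller-2 off: no quorum
```

This matches the log: quorum is lost at 22:26:01 (two nodes UNCLEAN) and the Promote action appears at 22:27:52, once controller-1 rejoins and 2 of 3 nodes are active again.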
Actually, we don't need a new BZ. I looked at the Neutron server logs too, and it's a known issue from bug 2052987. I'm closing this BZ as a duplicate of the original.

*** This bug has been marked as a duplicate of bug 2011934 ***
Hi Jakub. Thank you so much for your help and for the amazing analysis. We will continue with the other bug. BR, Julia.
Hi Jakub. The problem is that the ovn-dbs resource becomes unhealthy again and never recovers afterwards; please see the comment #9 timestamp:

Feb 19 05:29:08 controller-0 pacemaker-schedulerd[24664]: warning: Forcing ovndb_servers:1 away from ovn-dbs-bundle-1 after 1000000 failures (max=100000

And it seems that it happens after another shutdown operation in this test:

2022-02-18 23:01:34.087 244295 INFO tobiko.tests.faults.ha.cloud_disruptions - Power off 2 random controller nodes: ['controller-2', 'controller-1']
2022-02-18 23:01:34.134 244295 INFO tobiko.openstack.ironic._node - Power off baremetal node '5fa5e3f7-05e6-4d2b-93d1-409432ca0f9e' (power state = 'power on').
2022-02-18 23:01:54.343 244295 INFO tobiko.openstack.ironic._node - Power off baremetal node 'd1e86343-7f0b-4802-96f7-0970859673e5' (power state = 'power on').

It looks similar to the original issue, and it happens in other tobiko tests/jobs as well. Would you advise opening a new BZ for this issue? Thanks, Julia.
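For context on the "Forcing ovndb_servers:1 away" warning: pacemaker bans a resource from a node once the resource's fail-count on that node reaches its migration-threshold, and a fail-count of 1000000 (pacemaker's representation of INFINITY) keeps it banned until the fail-count is cleared. A rough sketch of that eligibility rule (hypothetical names, not pacemaker internals):

```python
INFINITY = 1000000  # pacemaker uses 1000000 to represent "infinite" scores

def allowed_on_node(fail_count: int, migration_threshold: int) -> bool:
    """A node stays eligible for a resource only while the resource's
    fail-count on that node is below the migration-threshold."""
    return fail_count < migration_threshold

# ovndb_servers:1 on ovn-dbs-bundle-1, per the log above: the fail-count
# hit INFINITY, so the node is ruled out until the fail-count is cleared.
assert not allowed_on_node(INFINITY, INFINITY)
assert allowed_on_node(0, INFINITY)  # eligible again after cleanup
```

This would explain "never recovers afterwards": until something clears the fail-count, the scheduler will not place ovndb_servers back on that bundle regardless of node health.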