Bug 1914911
| Summary: | OVN DB not started after controllers reboot | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | pkomarov |
| Component: | python-networking-ovn | Assignee: | Jakub Libosvar <jlibosva> |
| Status: | CLOSED DUPLICATE | QA Contact: | Eran Kuris <ekuris> |
| Severity: | low | Docs Contact: | |
| Priority: | high | | |
| Version: | 16.1 (Train) | CC: | apevec, ekuris, eolivare, jlibosva, jmarcian, lhh, majopela, scohen, tfreger, twilson |
| Target Milestone: | z14 | Keywords: | AutomationBlocker, Reopened, Triaged, ZStream |
| Target Release: | 16.1 (Train on RHEL 8.2) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1892578 | Environment: | |
| Last Closed: | 2022-02-22 21:45:38 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1892578, 1914912, 1919276, 1920698, 1986273 | | |
| Bug Blocks: | | | |
Comment 11
Jakub Libosvar
2022-02-21 17:08:32 UTC
Hello Jakub.

(In reply to Jakub Libosvar from comment #11)
> (In reply to Julia Marciano from comment #9)
> Can you please paste which resource-agents version was used in the test?

(undercloud) [stack@undercloud-0 ~]$ cat core_puddle_version
RHOS-16.1-RHEL-8-20220210.n.1
(undercloud) [heat-admin@controller-1 ~]$ rpm -qa|grep resource-agents
resource-agents-4.1.1-44.el8_2.18.x86_64

> I'm also a bit confused about the tobiko output, I'd like to see which test
> broke the environment but all the tests have roughly the same timestamp. Can
> you please provide the first failed test?

The first test that failed on this issue should be:
tobiko/tests/faults/ha/test_cloud_recovery.py::DisruptTripleoNodesTest::test_controllers_shutdown
It seems it started at 2022-02-18 22:25:18.856 244295.

BR, Julia.

I looked at the pacemaker logs and the tobiko logs to correlate what was going on with the cluster. Here is the timeline of events:
The test started at 22:25:19 by powering off controller-1 and controller-2 and then powering them on:
2022-02-18 22:25:19.468 244295 INFO tobiko.tests.faults.ha.cloud_disruptions - Power off 2 random controller nodes: ['controller-1', 'controller-2']
2022-02-18 22:25:19.518 244295 INFO tobiko.openstack.ironic._node - Power off baremetal node 'd1e86343-7f0b-4802-96f7-0970859673e5' (power state = 'power on').
2022-02-18 22:25:34.718 244295 INFO tobiko.openstack.ironic._node - Power off baremetal node '5fa5e3f7-05e6-4d2b-93d1-409432ca0f9e' (power state = 'power on').
2022-02-18 22:25:57.995 244295 INFO tobiko.tests.faults.ha.cloud_disruptions - Power on controller nodes: ['controller-1', 'controller-2']
2022-02-18 22:25:58.037 244295 INFO tobiko.openstack.ironic._node - Power on baremetal node 'd1e86343-7f0b-4802-96f7-0970859673e5' (power_state='power off').
2022-02-18 22:26:13.227 244295 INFO tobiko.openstack.ironic._node - Power on baremetal node '5fa5e3f7-05e6-4d2b-93d1-409432ca0f9e' (power_state='power off').
The master node was controller-0, and pacemaker noticed the slaves were down:
Feb 18 22:26:01 controller-0 pacemaker-controld [24665] (peer_update_callback) info: Remote node ovn-dbs-bundle-1 is now in unknown state
Feb 18 22:26:01 controller-0 pacemaker-controld [24665] (peer_update_callback) info: Remote node ovn-dbs-bundle-2 is now in unknown state
Feb 18 22:26:01 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
Feb 18 22:26:01 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): FAILED controller-1 (UNCLEAN)
Feb 18 22:26:01 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): FAILED controller-2 (UNCLEAN)
With 2 out of 3 nodes down, the cluster lost quorum:
Feb 18 22:26:01 controller-0 pacemaker-schedulerd[24664] (LogAction) notice: * Stop ovndb_servers:0 ( Master ovn-dbs-bundle-0 ) due to no quorum
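The quorum decision here is plain majority voting: a 3-node cluster needs at least 2 live members, so with only controller-0 left the cluster cannot act. A minimal illustrative sketch of the rule (not pacemaker's actual code):

```python
def has_quorum(total_nodes: int, nodes_online: int) -> bool:
    # Strict majority: more than half of the expected members must be online.
    return nodes_online > total_nodes // 2

# 3-node controller cluster, 2 controllers powered off:
print(has_quorum(3, 1))  # False: only controller-0 is left
print(has_quorum(3, 2))  # True: quorum returns once one peer comes back
```

This is why the scheduler stops the OVN master resource above, and why it can promote again only after controller-1 rejoins.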
The nodes were fenced:
Feb 18 22:27:02 controller-0 pacemaker-schedulerd[24664] (pe_fence_node) warning: Guest node ovn-dbs-bundle-2 will be fenced (by recovering its guest resource ovn-dbs-bundle-podman-2): ovndb_servers:2 is thought to be active there
Feb 18 22:27:02 controller-0 pacemaker-schedulerd[24664] (pe_fence_node) warning: Guest node ovn-dbs-bundle-1 will be fenced (by recovering its guest resource ovn-dbs-bundle-podman-1): ovndb_servers:1 is thought to be active there
OVN is started on controller-1 while controller-0 is in slave mode:
Feb 18 22:27:10 controller-0 pacemaker-controld [24665] (te_rsc_command) notice: Initiating start operation ovn-dbs-bundle-podman-2_start_0 on controller-1 | action 212
Feb 18 22:27:37 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Slave controller-0
Feb 18 22:27:37 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Stopped
Feb 18 22:27:37 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Stopped controller-1
Once controller-1 is up, the cluster regains quorum and promotes the master on controller-0:
Feb 18 22:27:52 controller-0 pacemaker-schedulerd[24664] (LogAction) notice: * Promote ovndb_servers:0 ( Slave -> Master ovn-dbs-bundle-0 )
Tobiko notices the service is down:
2022-02-18 22:28:04.088 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource ovn is in not in healthy state
Feb 18 22:28:24 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
Feb 18 22:28:24 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Stopped
Feb 18 22:28:24 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-1
The OVN DB bundle is started on controller-2:
Feb 18 22:28:24 controller-0 pacemaker-schedulerd[24664] (LogAction) notice: * Start ovn-dbs-bundle-podman-1 ( controller-2 )
Feb 18 22:28:24 controller-0 pacemaker-schedulerd[24664] (LogAction) notice: * Start ovn-dbs-bundle-1 ( controller-2 )
Feb 18 22:28:31 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
Feb 18 22:28:31 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Stopped controller-2
Feb 18 22:28:31 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-1
Tobiko still sees the cluster as down:
2022-02-18 22:28:33.649 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource ovn is in not in healthy state
Tobiko then sees the cluster as healthy:
2022-02-18 22:28:35.409 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource ovn is in healthy state
The OVN cluster is fully back online:
Feb 18 22:29:05 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
Feb 18 22:29:05 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-2
Feb 18 22:29:05 controller-0 pacemaker-schedulerd[24664] (common_print) info: ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-1
However, other services remain in trouble for another 40 seconds:
2022-02-18 22:29:44.168 244295 INFO tobiko.tripleo.pacemaker - Retrying pacemaker resource checks attempt 43 of 360
2022-02-18 22:29:45.867 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource rabbitmq is in healthy state
2022-02-18 22:29:45.869 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource galera is in healthy state
2022-02-18 22:29:45.872 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource redis is in healthy state
2022-02-18 22:29:45.875 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resources vips are in healthy state
2022-02-18 22:29:45.879 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resources ha_proxy and cinder are in healthy state
2022-02-18 22:29:45.886 244295 INFO tobiko.tripleo.pacemaker - pcs status check: resource ovn is in healthy state
2022-02-18 22:29:45.886 244295 INFO tobiko.tripleo.pacemaker - pcs status checks: all resources are in healthy state
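The "attempt 43 of 360" lines above come from tobiko polling the cluster health repeatedly until it stabilizes. The shape of that polling loop can be sketched roughly like this (a hypothetical simplification, not tobiko's actual implementation; the 360-attempt budget is taken from the log):

```python
import time

def wait_until_healthy(check, attempts=360, interval=1.0):
    # Retry a health check until it passes or the attempt budget runs out.
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        print(f"Retrying pacemaker resource checks attempt {attempt} of {attempts}")
        time.sleep(interval)
    raise TimeoutError("resources did not become healthy")

# Toy check that succeeds on the third poll:
state = {"polls": 0}
def check():
    state["polls"] += 1
    return state["polls"] >= 3

print(wait_until_healthy(check, attempts=5, interval=0.0))  # -> 3
```

With a budget of 360 attempts, a transient 40-second recovery like the one above passes; a resource that never recovers (as in the later failure) exhausts the budget instead.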
Tobiko fails later when performing Heat operations:
2022-02-18 22:29:54.167 244295 ERROR tobiko.openstack.heat._stack - Key 'use_extra_dhcp_opts' not found in template for stack 'tobiko.openstack.tests._nova.TestServerCreationStack-244295-0'
2022-02-18 22:29:54.168 244295 ERROR tobiko.openstack.heat._stack - Key 'vlan_id' not found in template for stack 'tobiko.openstack.tests._nova.TestServerCreationStack-244295-0'
2022-02-18 22:29:54.168 244295 ERROR tobiko.openstack.heat._stack - Key 'vlan_network' not found in template for stack 'tobiko.openstack.tests._nova.TestServerCreationStack-244295-0'
2022-02-18 22:32:03.788 244295 WARNING tobiko.openstack.heat._stack - Stack 'tobiko.openstack.tests._nova.TestServerCreationStack-244295-0' reached unexpected status: 'CREATE_FAILED'
2022-02-18 22:32:03.789 244295 INFO tobiko.openstack.heat._stack - Stack 'tobiko.openstack.tests._nova.TestServerCreationStack-244295-0' status is 'CREATE_FAILED'. Reason:
Resource CREATE failed: NeutronClientException: resources.port: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
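The timeline above was built by correlating pacemaker syslog lines with tobiko log lines. Such a merge can be sketched roughly as follows (a hypothetical helper, not part of tobiko; the year is assumed because syslog timestamps omit it, and the timestamp formats are taken from the excerpts above):

```python
from datetime import datetime

def parse_tobiko(line):
    # e.g. "2022-02-18 22:25:19.468 244295 INFO ..." -> first 23 chars
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")

def parse_syslog(line, year=2022):
    # e.g. "Feb 18 22:26:01 controller-0 ..." -> first 15 chars, year assumed
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def merge_timeline(tobiko_lines, pacemaker_lines):
    # Tag each line with its source, then sort everything by timestamp.
    events = [(parse_tobiko(l), "tobiko", l) for l in tobiko_lines]
    events += [(parse_syslog(l), "pacemaker", l) for l in pacemaker_lines]
    return sorted(events)

tobiko = ["2022-02-18 22:25:19.468 244295 INFO ... Power off 2 random controller nodes"]
pcmk = ["Feb 18 22:26:01 controller-0 pacemaker-controld ... ovn-dbs-bundle-1 unknown"]
for ts, src, line in merge_timeline(tobiko, pcmk):
    print(ts, src)
```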
Based on the analysis above, I don't think this is the same issue as in the description, where the OVN cluster doesn't come up fully and one of the slaves remains in the Stopped state.
Julia, I think we should open a new BZ as it seems to be a different problem.
Actually, we don't need a new BZ. I looked at the Neutron server logs too, and it's a known issue from bug 2052987. I'm closing this BZ as the original duplicate.

*** This bug has been marked as a duplicate of bug 2011934 ***

Hi Jakub. Thank you so much for your help and for the amazing analysis. We will continue with the other bug. BR, Julia.

Hi Jakub. The problem is that the ovn-dbs resource becomes unhealthy again and never recovers afterwards; please see comment #9. Time stamp:

Feb 19 05:29:08 controller-0 pacemaker-schedulerd[24664]: warning: Forcing ovndb_servers:1 away from ovn-dbs-bundle-1 after 1000000 failures (max=100000

And it seems that it happens after another shutdown operation in this test:

2022-02-18 23:01:34.087 244295 INFO tobiko.tests.faults.ha.cloud_disruptions - Power off 2 random controller nodes: ['controller-2', 'controller-1']
2022-02-18 23:01:34.134 244295 INFO tobiko.openstack.ironic._node - Power off baremetal node '5fa5e3f7-05e6-4d2b-93d1-409432ca0f9e' (power state = 'power on').
2022-02-18 23:01:54.343 244295 INFO tobiko.openstack.ironic._node - Power off baremetal node 'd1e86343-7f0b-4802-96f7-0970859673e5' (power state = 'power on').

It looks similar to the original issue. It happens in other tobiko tests/jobs as well. Would you advise opening a new BZ for this issue? Thanks, Julia.
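The "Forcing ovndb_servers:1 away ... after 1000000 failures" message reflects pacemaker's fail-count handling: 1000000 is pacemaker's internal representation of INFINITY, which a fatal failure sets directly, and once the fail-count reaches the migration threshold the resource is banned from that node until an operator clears the count (e.g. with `pcs resource cleanup`). An illustrative sketch of that decision, not pacemaker's actual code:

```python
INFINITY = 1000000  # pacemaker's internal value for INFINITY

def banned_from_node(fail_count: int, migration_threshold: int) -> bool:
    # A fatal failure bumps fail-count straight to INFINITY, which exceeds
    # any threshold, so the resource is forced away from the node until
    # the fail-count is cleared.
    return fail_count >= migration_threshold

print(banned_from_node(INFINITY, 100000))  # True: the log line above
print(banned_from_node(3, 100000))         # False: a few failures are tolerated
```

That would explain why ovn-dbs never recovers on its own after the second shutdown: the ban persists until the fail-count is reset, regardless of whether the node is otherwise healthy.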