Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1892578

Summary: OVN DB not started after controllers reboot
Product: Red Hat OpenStack
Reporter: Eduardo Olivares <eolivare>
Component: python-networking-ovn
Assignee: Jakub Libosvar <jlibosva>
Status: CLOSED NEXTRELEASE
QA Contact: Eran Kuris <ekuris>
Severity: medium
Priority: medium
Version: 13.0 (Queens)
CC: apevec, jlibosva, lhh, majopela, pkomarov, scohen, tfreger
Target Milestone: z14
Keywords: AutomationBlocker, Triaged, ZStream
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-07-27 13:19:16 UTC
Type: Bug
Bug Blocks: 1914911, 1914912, 1919276

Description Eduardo Olivares 2020-10-29 08:53:17 UTC
Description of problem:
Issue reproduced with tobiko test test_hard_reboot_controllers_recovery

This bug looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1860347, but I decided to open a new one because it was reproduced with a different test and on a different OSP release.

I set the priority to medium because OSP should work fine with only two OVN DB resources, because I think the affected resource could be started manually (I cannot verify this, since I have no access to the environment), and because the issue is not always reproduced.


Test logs: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-13_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/8/testReport/tobiko.tests.faults.ha.test_cloud_recovery/DisruptTripleoNodesTest/Tobiko___test_hard_reboot_controllers_recovery/

1) pcs status is healthy before controllers are rebooted

2) all the controller nodes are hard rebooted at 23:07
2020-10-28 23:07:10,938 INFO tobiko.tests.faults.ha.cloud_disruptions | disrupt exec: sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger on server: controller-0
2020-10-28 23:07:11,007 INFO tobiko.tests.faults.ha.cloud_disruptions | disrupt exec: sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger on server: controller-2
2020-10-28 23:07:11,085 INFO tobiko.tests.faults.ha.cloud_disruptions | disrupt exec: sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger on server: controller-1
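For reference on the disrupt command above: writing 'b' to /proc/sysrq-trigger makes the kernel reboot immediately, without syncing or unmounting filesystems (a true hard reboot), and the preceding chmod is needed because the shell redirection in `sudo echo b > /proc/sysrq-trigger` is performed by the unprivileged calling user, not by sudo. A minimal sketch that builds the same command string (the helper name is mine, not tobiko's):

```python
def hard_reboot_command():
    # 'b' written to /proc/sysrq-trigger triggers an immediate reboot
    # without syncing disks, simulating a power failure.
    # The chmod is required because in 'sudo echo b > /proc/sysrq-trigger'
    # the redirection is done by the calling (unprivileged) user, so the
    # trigger file must first be made world-writable.
    return "sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger"
```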

3) the status of controller-2's ovndb resource is set to Starting at 23:10:14
2020-10-28 23:10:14,643 DEBUG tobiko.shell.sh._execute | Command executed:
command: 'sudo pcs status resources |grep ocf'
...
       ovn-dbs-bundle-0	(ocf::ovn:ovndb-servers):	Slave controller-0
       ovn-dbs-bundle-1	(ocf::ovn:ovndb-servers):	Slave controller-1
       ovn-dbs-bundle-2	(ocf::ovn:ovndb-servers):	Starting controller-2


4) the status of that resource changes to Stopped at 23:13:38
2020-10-28 23:13:38,043 DEBUG tobiko.shell.sh._execute | Command executed:
command: 'sudo pcs status resources |grep ocf'
...
       ovn-dbs-bundle-0	(ocf::ovn:ovndb-servers):	Promoting controller-0
       ovn-dbs-bundle-1	(ocf::ovn:ovndb-servers):	Slave controller-1
       ovn-dbs-bundle-2	(ocf::ovn:ovndb-servers):	Stopped controller-2

5) all pcs resources are healthy except controller-2's ovn db at 23:14:16
2020-10-28 23:14:16,081 DEBUG tobiko.shell.sh._execute | Command executed:
command: 'sudo pcs status resources |grep ocf'
...
       rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Started controller-0
       rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Started controller-1
       rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Started controller-2
       galera-bundle-0	(ocf::heartbeat:galera):	Master controller-0
       galera-bundle-1	(ocf::heartbeat:galera):	Master controller-1
       galera-bundle-2	(ocf::heartbeat:galera):	Master controller-2
       redis-bundle-0	(ocf::heartbeat:redis):	Master controller-0
       redis-bundle-1	(ocf::heartbeat:redis):	Slave controller-1
       redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-2
     ip-192.168.24.38	(ocf::heartbeat:IPaddr2):	Started controller-0
     ip-10.0.0.112	(ocf::heartbeat:IPaddr2):	Started controller-1
     ip-172.17.1.116	(ocf::heartbeat:IPaddr2):	Started controller-2
     ip-172.17.1.30	(ocf::heartbeat:IPaddr2):	Started controller-0
     ip-172.17.3.146	(ocf::heartbeat:IPaddr2):	Started controller-0
     ip-172.17.4.57	(ocf::heartbeat:IPaddr2):	Started controller-1
       haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Started controller-0
       haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Started controller-1
       haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Started controller-2
       ovn-dbs-bundle-0	(ocf::ovn:ovndb-servers):	Master controller-0
       ovn-dbs-bundle-1	(ocf::ovn:ovndb-servers):	Slave controller-1
       ovn-dbs-bundle-2	(ocf::ovn:ovndb-servers):	Stopped controller-2
       openstack-cinder-volume-docker-0	(ocf::heartbeat:docker):	Started controller-2

6) The next tests fail for the same reason: controller-2's OVN DB is still stopped at 23:57:17 (I guess it never recovers by itself).
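The checks above boil down to grepping the ocf lines of `pcs status resources` and inspecting each resource's state. A minimal sketch of such a parser (the function name and the set of "steady" states are my assumptions, not tobiko's actual code):

```python
import re

# Matches lines like:
#   ovn-dbs-bundle-2  (ocf::ovn:ovndb-servers):  Stopped controller-2
LINE_RE = re.compile(
    r"^\s*(?P<name>\S+)\s+\((?P<agent>ocf::\S+)\):\s+"
    r"(?P<status>\S+)(?:\s+(?P<node>\S+))?\s*$"
)

# Steady states; transitional or failed ones (Starting, Promoting, Stopped)
# are flagged so a caller can keep polling. This set is an assumption.
HEALTHY = {"Started", "Master", "Slave"}

def unhealthy_resources(pcs_output):
    """Return (name, status, node) for every resource not in a steady state."""
    bad = []
    for line in pcs_output.splitlines():
        m = LINE_RE.match(line)
        if m and m.group("status") not in HEALTHY:
            bad.append((m.group("name"), m.group("status"), m.group("node")))
    return bad
```

Run against the step-5 output above, this would flag only ovn-dbs-bundle-2.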



Some pacemaker and OVN DB logs can be found here: http://pastebin.test.redhat.com/913915
According to them, the OVN southbound DB (SBDB) was never started after the controller-2 reboot.

More logs can be downloaded from the job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-13_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/8/artifact/

Version-Release number of selected component (if applicable):
2020-10-06.2
ovn2.11-2.11.1-53.el7fdp.x86_64.rpm   
python-networking-ovn-4.0.4-8.el7ost.noarch.rpm   
python2-ovsdbapp-0.10.5-2.el7ost.noarch.rpm   

How reproducible:
not often



Steps to Reproduce (either of):
1. run the tobiko test test_hard_reboot_controllers_recovery
2. hard reboot the controller nodes and check pacemaker status afterwards
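Manual reproduction (option 2) amounts to rebooting all controllers and then polling `sudo pcs status resources |grep ocf` until every ovn-dbs-bundle replica settles as Master or Slave. A minimal polling sketch (function names, timeout, and interval are illustrative, not taken from tobiko):

```python
import time

def ovn_dbs_recovered(pcs_lines):
    """True when every ovn-dbs-bundle replica is in a steady state
    (Master or Slave), i.e. none is Starting or Stopped."""
    states = [line.split(":")[-1].split()[0]
              for line in pcs_lines if "ovn-dbs-bundle" in line]
    return bool(states) and all(s in ("Master", "Slave") for s in states)

def wait_for_recovery(get_pcs_lines, timeout=600, interval=10, sleep=time.sleep):
    """Poll until the OVN DB resources recover or the timeout expires.
    `get_pcs_lines` should return the ocf lines of `pcs status resources`."""
    deadline = time.monotonic() + timeout
    while True:
        if ovn_dbs_recovered(get_pcs_lines()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

In this bug the Stopped state persisted for over 40 minutes, so such a poll would simply time out.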