Eduardo Olivares 2020-10-29 08:53:17 UTC

Description of problem:
The issue was reproduced with the tobiko test test_hard_reboot_controllers_recovery.
This bug looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1860347, but I decided to open a new one because it was reproduced with a different test and on a different OSP release.
I set the priority to medium because OSP should work fine with only two OVN DB resources, because I think the affected resource could be started manually (I cannot check this, since I have no access to the environment), and because the issue is not always reproduced.
Test logs: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-13_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/8/testReport/tobiko.tests.faults.ha.test_cloud_recovery/DisruptTripleoNodesTest/Tobiko___test_hard_reboot_controllers_recovery/
1) pcs status is healthy before controllers are rebooted
2) all the controller nodes are hard rebooted at 23:07
2020-10-28 23:07:10,938 INFO tobiko.tests.faults.ha.cloud_disruptions | disrupt exec: sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger on server: controller-0
2020-10-28 23:07:11,007 INFO tobiko.tests.faults.ha.cloud_disruptions | disrupt exec: sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger on server: controller-2
2020-10-28 23:07:11,085 INFO tobiko.tests.faults.ha.cloud_disruptions | disrupt exec: sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger on server: controller-1
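For reference, the disruption above can be sketched as a dry run (commands are only printed here, never executed; node names are taken from the log lines above). The `chmod o+w` is needed because the `>` redirection into /proc/sysrq-trigger is performed by the calling shell as the unprivileged user, while only `echo` itself runs under sudo; writing `b` to sysrq-trigger reboots the node immediately, without syncing or unmounting filesystems (hence a "hard" reboot).

```shell
# Dry-run sketch of the hard-reboot disruption (nothing is executed here).
# Writing 'b' to /proc/sysrq-trigger reboots the node immediately, skipping
# filesystem sync/unmount. The chmod is required because the '>' redirection
# runs as the unprivileged user, not under sudo.
for node in controller-0 controller-1 controller-2; do
    echo "on $node: sudo chmod o+w /proc/sysrq-trigger; sudo echo b > /proc/sysrq-trigger"
done
```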
3) the status of controller-2's ovndb resource is set to Starting at 23:10:14
2020-10-28 23:10:14,643 DEBUG tobiko.shell.sh._execute | Command executed:
command: 'sudo pcs status resources |grep ocf'
...
ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Slave controller-0
ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-1
ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Starting controller-2
4) the status of that resource changes to Stopped at 23:13:38
2020-10-28 23:13:38,043 DEBUG tobiko.shell.sh._execute | Command executed:
command: 'sudo pcs status resources |grep ocf'
...
ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Promoting controller-0
ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-1
ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Stopped controller-2
5) all pcs resources are healthy except controller-2's ovn db at 23:14:16
2020-10-28 23:14:16,081 DEBUG tobiko.shell.sh._execute | Command executed:
command: 'sudo pcs status resources |grep ocf'
...
rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-0
rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-1
rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started controller-2
galera-bundle-0 (ocf::heartbeat:galera): Master controller-0
galera-bundle-1 (ocf::heartbeat:galera): Master controller-1
galera-bundle-2 (ocf::heartbeat:galera): Master controller-2
redis-bundle-0 (ocf::heartbeat:redis): Master controller-0
redis-bundle-1 (ocf::heartbeat:redis): Slave controller-1
redis-bundle-2 (ocf::heartbeat:redis): Slave controller-2
ip-192.168.24.38 (ocf::heartbeat:IPaddr2): Started controller-0
ip-10.0.0.112 (ocf::heartbeat:IPaddr2): Started controller-1
ip-172.17.1.116 (ocf::heartbeat:IPaddr2): Started controller-2
ip-172.17.1.30 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.3.146 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.4.57 (ocf::heartbeat:IPaddr2): Started controller-1
haproxy-bundle-docker-0 (ocf::heartbeat:docker): Started controller-0
haproxy-bundle-docker-1 (ocf::heartbeat:docker): Started controller-1
haproxy-bundle-docker-2 (ocf::heartbeat:docker): Started controller-2
ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-1
ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Stopped controller-2
openstack-cinder-volume-docker-0 (ocf::heartbeat:docker): Started controller-2
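A quick way to spot the unhealthy replica in output like the above is to flag any ovndb-servers line whose state is neither Master nor Slave. A minimal sketch, using a sample line from the output above in place of a live `pcs status` call:

```shell
# Flag ovn-dbs replicas that are neither Master nor Slave. On a controller
# this would be fed by 'sudo pcs status resources | grep ocf'; here a sample
# line from the report stands in for the live output.
pcs_sample='ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Stopped controller-2'
echo "$pcs_sample" | awk '/ovndb-servers/ && !/Master|Slave/ {print $1, "is", $3, "on", $4}'
```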
6) The next tests fail for the same reason: controller-2's ovn db is still stopped at 23:57:17 (I guess it never recovers by itself).
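The recovery check that keeps failing can be sketched as a poll with a timeout: keep re-reading the ovn-dbs states until every replica is Master or Slave, or give up. `get_ovn_status` below is a hypothetical stub that returns the stuck state from step 5; on a real controller it would be replaced by `sudo pcs status resources | grep ovndb-servers`.

```shell
# Poll until every ovn-dbs replica is Master or Slave, or until the deadline.
# get_ovn_status is a stub reproducing the stuck state still seen at 23:57:17;
# the timeout is shortened to a few seconds for illustration.
get_ovn_status() {
    printf '%s\n' \
        'ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0' \
        'ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-1' \
        'ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Stopped controller-2'
}

recovered=no
deadline=$(( $(date +%s) + 3 ))
while [ "$(date +%s)" -lt "$deadline" ]; do
    # grep -v keeps lines whose state is NOT Master/Slave;
    # -q makes grep succeed silently if any such line remains
    if ! get_ovn_status | grep -qvE 'Master|Slave'; then
        recovered=yes
        break
    fi
    sleep 1
done
echo "ovn-dbs recovered: $recovered"
```

Because the stub never changes state, the loop times out, mirroring what the later tests observe.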
Some pacemaker and OVN DB logs can be found here: http://pastebin.test.redhat.com/913915
According to them, the OVN SBDB was never started after the controller-2 reboot.
More logs can be downloaded from the job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-13_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/8/artifact/
Version-Release number of selected component (if applicable):
2020-10-06.2
ovn2.11-2.11.1-53.el7fdp.x86_64.rpm
python-networking-ovn-4.0.4-8.el7ost.noarch.rpm
python2-ovsdbapp-0.10.5-2.el7ost.noarch.rpm
How reproducible:
not often
Steps to Reproduce (either of these alternatives):
1. run the tobiko test test_hard_reboot_controllers_recovery
2. hard reboot the controller nodes and check the pacemaker status afterwards