Description of problem:
Simulated disk outage on a controller node:

[root@seal33 ~]# dd if=/dev/zero of=/var/lib/libvirt/images/controller-1-disk1.qcow2 bs=600M count=5

| 36657447-9ca3-482d-9134-e62322e055ba | controller-1 | ACTIVE | - | Running | ctlplane=192.168.24.10 |

Version-Release number of selected component (if applicable):
OSP13 puddle - 2018-07-13.1
openstack-neutron-common-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-heat-common-10.0.1-0.20180411125640.el7ost.noarch
openstack-ironic-api-10.1.2-4.el7ost.noarch
openstack-tripleo-common-containers-8.6.1-23.el7ost.noarch
openstack-ironic-inspector-7.2.1-0.20180409163360.el7ost.noarch
python2-openstacksdk-0.11.3-1.el7ost.noarch
openstack-tripleo-ui-8.3.1-3.el7ost.noarch
openstack-zaqar-6.0.1-1.el7ost.noarch
openstack-nova-placement-api-17.0.3-0.20180420001141.el7ost.noarch
openstack-swift-container-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
puppet-openstacklib-12.4.0-0.20180329042555.4b30e6f.el7ost.noarch
openstack-mistral-api-6.0.2-1.el7ost.noarch
openstack-tripleo-image-elements-8.0.1-1.el7ost.noarch
openstack-heat-api-cfn-10.0.1-0.20180411125640.el7ost.noarch
openstack-selinux-0.8.14-12.el7ost.noarch
openstack-nova-scheduler-17.0.3-0.20180420001141.el7ost.noarch
puppet-openstack_extras-12.4.1-0.20180413042250.2634296.el7ost.noarch
python-openstackclient-lang-3.14.1-1.el7ost.noarch
openstack-tripleo-puppet-elements-8.0.0-2.el7ost.noarch
openstack-tripleo-common-8.6.1-23.el7ost.noarch
openstack-nova-compute-17.0.3-0.20180420001141.el7ost.noarch
openstack-keystone-13.0.1-0.20180420194847.7bd6454.el7ost.noarch
openstack-neutron-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-heat-engine-10.0.1-0.20180411125640.el7ost.noarch
openstack-ironic-common-10.1.2-4.el7ost.noarch
openstack-ironic-staging-drivers-0.9.0-4.el7ost.noarch
openstack-mistral-executor-6.0.2-1.el7ost.noarch
openstack-swift-object-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-swift-proxy-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-tempest-18.0.0-2.el7ost.noarch
openstack-mistral-common-6.0.2-1.el7ost.noarch
python2-openstackclient-3.14.1-1.el7ost.noarch
openstack-glance-16.0.1-2.el7ost.noarch
openstack-nova-common-17.0.3-0.20180420001141.el7ost.noarch
openstack-neutron-openvswitch-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-heat-api-10.0.1-0.20180411125640.el7ost.noarch
openstack-ironic-conductor-10.1.2-4.el7ost.noarch
openstack-tripleo-validations-8.4.1-5.el7ost.noarch
openstack-nova-api-17.0.3-0.20180420001141.el7ost.noarch
openstack-nova-conductor-17.0.3-0.20180420001141.el7ost.noarch
openstack-swift-account-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-neutron-ml2-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-mistral-engine-6.0.2-1.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-43.el7ost.noarch

How reproducible:

Steps to Reproduce:
1. Deploy OSP13 (3 controllers + 3 computes + LVM) with the latest passed_phase2 puddle; launch an instance ("after_deploy")
2. Go to the hypervisor, find the qcow2 disk of a controller, and corrupt it
3. Set the failed node to the "off" power state using Ironic, clean up the pcs rabbit resource, and launch an instance ("after_corrupt") to check that the overcloud is still operable
4. Try to replace the controller using the official docs: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes
5. Check container status on the overcloud nodes
6. Launch an instance ("after_replace")

Actual results:
Instance stuck in spawning state; reason - VirtualInterfaceCreateException: Virtual Interface creation failed

Expected results:
Instance in ACTIVE state

Additional info:
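For clarity, steps 2 and 3 above amount to roughly the following (a hedged sketch; the Ironic node name and undercloud prompt are assumptions for this environment, not verified against it):

# On the hypervisor: overwrite the controller's backing disk (step 2)
[root@seal33 ~]# dd if=/dev/zero of=/var/lib/libvirt/images/controller-1-disk1.qcow2 bs=600M count=5

# On the undercloud: power off the failed node via Ironic (step 3);
# "controller-1" is an assumed node name
[stack@undercloud ~]$ source stackrc
[stack@undercloud ~]$ openstack baremetal node power off controller-1

# On a surviving controller: clear the failed rabbit resource state;
# "rabbitmq-bundle" is the resource ID referenced later in this report
[root@controller-0 ~]# pcs resource cleanup rabbitmq-bundle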
The reports should be available here: http://rhos-release.virt.bos.redhat.com/log/bz1601561
Somehow controller-2 has gotten into a cluster by itself:

()[root@controller-2 /]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@controller-2'
[{nodes,[{disc,['rabbit@controller-2']}]},
 {running_nodes,['rabbit@controller-2']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]},
 {alarms,[{'rabbit@controller-2',[]}]}]
()[root@controller-2 /]# echo $?
0

And the resource agent just checks the cluster_status return code for the monitor action, so since it returns 0 it thinks everything is OK. Meanwhile, controller-0 and controller-3 are clustered fine:

()[root@controller-0 /]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@controller-0'
[{nodes,[{disc,['rabbit@controller-0','rabbit@controller-3']}]},
 {running_nodes,['rabbit@controller-3','rabbit@controller-0']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]},
 {alarms,[{'rabbit@controller-3',[]},{'rabbit@controller-0',[]}]}]

Unsurprising that instance launch fails in this state.
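In other words, the monitor action reduces to roughly this (a simplified sketch of the behavior described above, not the actual resource-agent source):

# Simplified sketch: any zero exit status from cluster_status passes the
# monitor, even when the node is alone in a single-node cluster of itself.
if rabbitmqctl cluster_status >/dev/null 2>&1; then
    exit 0   # OCF_SUCCESS - node looks "healthy" despite being partitioned off
else
    exit 7   # OCF_NOT_RUNNING
fi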
I just did `pkill -9 beam.smp` on controller-2 to force pacemaker to fail it and restart it, and it joined the cluster correctly after that:

()[root@controller-2 /]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@controller-2'
[{nodes,[{disc,['rabbit@controller-0','rabbit@controller-2',
                'rabbit@controller-3']}]},
 {running_nodes,['rabbit@controller-0','rabbit@controller-3',
                 'rabbit@controller-2']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]},
 {alarms,[{'rabbit@controller-0',[]},
          {'rabbit@controller-3',[]},
          {'rabbit@controller-2',[]}]}]

So, two things here:

1. It never should have gotten into this state in the first place, so we should try to figure out how exactly that happened. That might not be easy due to timing issues.

2. We have enough information in the resource agent to know how many nodes are started, so we can modify the health check to make sure each node is clustered *and* that it is clustered with the correct number of running nodes. To be safe, the check should probably have to fail something like 3 times with a 5s sleep between attempts, so we don't flag false mismatches while the cluster is in the process of transitioning; it should settle within 15s.
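A hedged sketch of what that improved check could look like (illustrative shell only, not the shipped agent code; the expected-node count and the output parsing are assumptions):

# expected_nodes would come from pacemaker's count of started rabbit
# instances; hardcoded here only for illustration
expected_nodes=3

healthy=1
for attempt in 1 2 3; do
    # Count entries in the running_nodes tuple of the cluster_status output
    running=$(rabbitmqctl cluster_status 2>/dev/null | tr -d ' \n' \
              | sed -n 's/.*{running_nodes,\[\([^]]*\)\].*/\1/p' \
              | tr ',' '\n' | grep -c 'rabbit@')
    if [ "$running" -eq "$expected_nodes" ]; then
        healthy=0
        break
    fi
    sleep 5   # give a transitioning cluster up to ~15s to settle
done
exit $healthy   # non-zero fails the monitor so pacemaker restarts the node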
Hello John,

Is 'pkill -9 beam.smp' on the failed node a workaround for this issue at the moment?

Best Regards,
Keigo Noha
(In reply to Keigo Noha from comment #6)
> Hello John,
>
> Is 'pkill -9 beam.smp' on the failed node a workaround for this issue at
> the moment?
>
> Best Regards,
> Keigo Noha

For a workaround, it is better to just restart rabbitmq-bundle with pcs. I only killed the bad node to demonstrate that a failed monitor would let pacemaker restart the service, after which it correctly rejoins the cluster.
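Concretely, the workaround would be something like this on any controller (a sketch based on the description above; the prompts are illustrative):

# Restart the whole bundle (this restarts rabbit on all controllers):
[root@controller-0 ~]# pcs resource restart rabbitmq-bundle

# Then verify every controller shows all three nodes in running_nodes:
()[root@controller-0 /]# rabbitmqctl cluster_status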
(In reply to John Eckersberg from comment #8)
> (In reply to Keigo Noha from comment #6)

Artem, can you please validate the suggested workaround and see whether it allows the controller replacement procedure to finish and brings the system back to a fully stable and working state?
Didn't reproduce on the first attempt; will try another one.
Hello All! Could someone please try to reproduce it with the latest resource-agents? Artem?
We assume this has been fixed by the latest resource-agents. Please reopen if required.