Bug 1601561 - [OSP13][z-stream]Cannot launch instance after controller replacement - "VirtualInterfaceCreateException: Virtual Interface creation failed"
Summary: [OSP13][z-stream]Cannot launch instance after controller replacement - "VirtualInterfaceCreateException: Virtual Interface creation failed"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Severity: high
Priority: high
Target Milestone: z2
Target Release: 13.0 (Queens)
Assignee: John Eckersberg
QA Contact: pkomarov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-16 16:54 UTC by Artem Hrechanychenko
Modified: 2022-03-13 15:14 UTC
CC List: 19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-10 15:30:27 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Artem Hrechanychenko 2018-07-16 16:54:57 UTC
Description of problem:

Simulated a disk outage on a controller node by overwriting its backing qcow2 image on the virt host:

[root@seal33 ~ ]# dd if=/dev/zero of=/var/lib/libvirt/images/controller-1-disk1.qcow2  bs=600M count=5

The affected node as listed on the undercloud:

| 36657447-9ca3-482d-9134-e62322e055ba | controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |

Version-Release number of selected component (if applicable):
OSP13 puddle - 2018-07-13.1
openstack-neutron-common-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-heat-common-10.0.1-0.20180411125640.el7ost.noarch
openstack-ironic-api-10.1.2-4.el7ost.noarch
openstack-tripleo-common-containers-8.6.1-23.el7ost.noarch
openstack-ironic-inspector-7.2.1-0.20180409163360.el7ost.noarch
python2-openstacksdk-0.11.3-1.el7ost.noarch
openstack-tripleo-ui-8.3.1-3.el7ost.noarch
openstack-zaqar-6.0.1-1.el7ost.noarch
openstack-nova-placement-api-17.0.3-0.20180420001141.el7ost.noarch
openstack-swift-container-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
puppet-openstacklib-12.4.0-0.20180329042555.4b30e6f.el7ost.noarch
openstack-mistral-api-6.0.2-1.el7ost.noarch
openstack-tripleo-image-elements-8.0.1-1.el7ost.noarch
openstack-heat-api-cfn-10.0.1-0.20180411125640.el7ost.noarch
openstack-selinux-0.8.14-12.el7ost.noarch
openstack-nova-scheduler-17.0.3-0.20180420001141.el7ost.noarch
puppet-openstack_extras-12.4.1-0.20180413042250.2634296.el7ost.noarch
python-openstackclient-lang-3.14.1-1.el7ost.noarch
openstack-tripleo-puppet-elements-8.0.0-2.el7ost.noarch
openstack-tripleo-common-8.6.1-23.el7ost.noarch
openstack-nova-compute-17.0.3-0.20180420001141.el7ost.noarch
openstack-keystone-13.0.1-0.20180420194847.7bd6454.el7ost.noarch
openstack-neutron-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-heat-engine-10.0.1-0.20180411125640.el7ost.noarch
openstack-ironic-common-10.1.2-4.el7ost.noarch
openstack-ironic-staging-drivers-0.9.0-4.el7ost.noarch
openstack-mistral-executor-6.0.2-1.el7ost.noarch
openstack-swift-object-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-swift-proxy-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-tempest-18.0.0-2.el7ost.noarch
openstack-mistral-common-6.0.2-1.el7ost.noarch
python2-openstackclient-3.14.1-1.el7ost.noarch
openstack-glance-16.0.1-2.el7ost.noarch
openstack-nova-common-17.0.3-0.20180420001141.el7ost.noarch
openstack-neutron-openvswitch-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-heat-api-10.0.1-0.20180411125640.el7ost.noarch
openstack-ironic-conductor-10.1.2-4.el7ost.noarch
openstack-tripleo-validations-8.4.1-5.el7ost.noarch
openstack-nova-api-17.0.3-0.20180420001141.el7ost.noarch
openstack-nova-conductor-17.0.3-0.20180420001141.el7ost.noarch
openstack-swift-account-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-neutron-ml2-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-mistral-engine-6.0.2-1.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-43.el7ost.noarch


How reproducible:


Steps to Reproduce:
1. Deploy OSP13 (3 controllers + 3 computes + LVM) with the latest passed_phase2 puddle, launch instance "after_deploy"
2. Go to the hypervisor, find the qcow2 disk of one controller and corrupt it
3. Set the failed node to the power-off state using Ironic, clean up the pcs rabbitmq resource and launch instance "after_corrupt" to check that the overcloud is still operable
4. Replace the controller using the official docs:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes
5. Check container status on the overcloud nodes
6. Launch instance "after_replace" (see the sketch after this list)
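
A minimal sketch of the instance-launch check used in steps 1, 3 and 6, assuming overcloud credentials in ~/overcloudrc and an existing CirrOS image, m1.small flavor and "private" network (these names are illustrative, not taken from this environment):

source ~/overcloudrc
openstack server create --flavor m1.small --image cirros \
    --network private --wait after_replace
openstack server show after_replace -c status -c fault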

Actual results:
Instance stuck in spawning state, reason: VirtualInterfaceCreateException: Virtual Interface creation failed

Expected results:
instance in ACTIVE state

Additional info:

Comment 2 Artem Hrechanychenko 2018-07-16 17:26:13 UTC
The reports should be available here: http://rhos-release.virt.bos.redhat.com/log/bz1601561

Comment 4 John Eckersberg 2018-07-17 16:02:04 UTC
Somehow controller-2 has gotten into a cluster by itself:

()[root@controller-2 /]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@controller-2'
[{nodes,[{disc,['rabbit@controller-2']}]},
 {running_nodes,['rabbit@controller-2']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]},
 {alarms,[{'rabbit@controller-2',[]}]}]
()[root@controller-2 /]# echo $?
0

And the resource agent only checks the return code of cluster_status for the monitor action, so since that returns 0 it thinks everything is ok.
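
Roughly, the monitor path behaves like this simplified sketch (illustrative only, not the actual rabbitmq-cluster agent source):

# simplified sketch of the monitor behaviour described above
rmq_monitor() {
    if rabbitmqctl cluster_status > /dev/null 2>&1; then
        return $OCF_SUCCESS      # rc 0 even when the node is a cluster of one
    fi
    return $OCF_NOT_RUNNING
}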

Meanwhile, 0 and 3 are clustered fine:

()[root@controller-0 /]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@controller-0'
[{nodes,[{disc,['rabbit@controller-0','rabbit@controller-3']}]},
 {running_nodes,['rabbit@controller-3','rabbit@controller-0']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]},
 {alarms,[{'rabbit@controller-3',[]},{'rabbit@controller-0',[]}]}]

Unsurprising that instance launch fails in this state - any RPC traffic that lands on the isolated broker never reaches the services connected to the other two nodes, so notifications like network-vif-plugged can be lost and the VIF plug times out.
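
For reference, nova turns a missed network-vif-plugged event into VirtualInterfaceCreateException based on two nova.conf options; the values below are the upstream Queens defaults, shown only as an illustration and not taken from this environment:

[DEFAULT]
vif_plugging_is_fatal = True
vif_plugging_timeout = 300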

Comment 5 John Eckersberg 2018-07-17 16:12:30 UTC
I just did `pkill -9 beam.smp` on controller-2 to force pacemaker to see a failure and restart it, and it joined the cluster correctly after that:

()[root@controller-2 /]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@controller-2'
[{nodes,[{disc,['rabbit@controller-0','rabbit@controller-2',
                'rabbit@controller-3']}]},
 {running_nodes,['rabbit@controller-0','rabbit@controller-3',
                 'rabbit@controller-2']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]},
 {alarms,[{'rabbit@controller-0',[]},
          {'rabbit@controller-3',[]},
          {'rabbit@controller-2',[]}]}]

So two things here:

1.  It never should have gotten into this state in the first place, so we should try to figure out how exactly it happened.  Might not be so easy due to timing issues.

2.  We have enough information in the resource agent to know how many nodes are started, so we can modify the health check to verify that each node is clustered, *and* that it sees the correct number of running nodes.  To be safe, the check should only fail after something like 3 attempts with a 5s sleep in between, so we don't flag false mismatches while the cluster is still transitioning.  It should settle within 15s.
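
A rough sketch of the stricter check in point 2, assuming the expected number of started rabbit nodes is handed to the agent as $expected (how the real agent would derive that count is not shown here):

# sketch only: compare the broker's running_nodes count against the
# number of nodes that should be running ($expected), retrying 3x with a 5s sleep
check_cluster_size() {
    local expected=$1 attempt running
    for attempt in 1 2 3; do
        running=$(rabbitmqctl cluster_status 2>/dev/null | tr -d '\n' \
                  | sed -n 's/.*{running_nodes,\[\([^]]*\)].*/\1/p' \
                  | awk -F, '{print NF}')
        [ "$running" = "$expected" ] && return 0
        sleep 5    # let a transitioning cluster settle (<=15s total)
    done
    return 1
}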

Comment 6 Keigo Noha 2018-07-18 00:20:02 UTC
Hello John,

Is 'pkill -9 beam.smp' on the failed node a workaround for this issue at the moment?

Best Regards,
Keigo Noha

Comment 8 John Eckersberg 2018-07-18 13:19:09 UTC
(In reply to Keigo Noha from comment #6)
> Hello John,
> 
> Is 'pkill -9 beam.smp' on the failed node a workaround for this issue at the
> moment?
> 
> Best Regards,
> Keigo Noha

As a workaround, it is better to just restart rabbitmq-bundle with pcs.
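
For example (a sketch of that workaround; run from any controller, assuming the bundle resource is named rabbitmq-bundle as shown in pcs status):

pcs resource restart rabbitmq-bundle
# then re-check "rabbitmqctl cluster_status" inside the bundle on each node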

I only killed the bad one to demonstrate that a failed monitor would allow pacemaker to restart the service and it would correctly rejoin the cluster.

Comment 9 Omri Hochman 2018-07-19 19:33:35 UTC
(In reply to John Eckersberg from comment #8)
> (In reply to Keigo Noha from comment #6)

Artem, can you please validate the suggested W/A and see whether it allows the controller replacement procedure to finish and gets the system back to a fully stable and working state?

Comment 10 Artem Hrechanychenko 2018-07-24 15:41:21 UTC
Didn't reproduce on the first attempt.
Will try another one.

Comment 22 Peter Lemenkov 2019-06-11 13:08:17 UTC
Hello All!
Could someone please try to reproduce it with the latest resource-agents? Artem?

Comment 23 Luca Miccini 2020-03-10 15:30:27 UTC
We assume this has been fixed by the latest resource-agents. Please reopen if required.

