Bug 1309339 - RabbitMQ resource fails to stop during scale out run on IPv6 and SSL environment
Summary: RabbitMQ resource fails to stop during scale out run on IPv6 and SSL environment
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: y3
Target Release: 7.0 (Kilo)
Assignee: Angus Thomas
QA Contact: yeylon@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-02-17 14:15 UTC by Marius Cornea
Modified: 2016-04-18 07:12 UTC
CC: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-17 20:56:00 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Marius Cornea 2016-02-17 14:15:51 UTC
Description of problem:
RabbitMQ resource fails to stop during scale out with an additional compute in an IPv6 and SSL environment

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-121.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud:
export THT=/home/stack/templates/my-overcloud 
openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation-v6-storagev4.yaml \
-e $THT/environments/net-single-nic-with-vlans-v6.yaml \
-e /home/stack/templates/network-environment-v6.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e ~/templates/ceph.yaml \
-e ~/templates/firstboot-environment.yaml \
--control-scale 3 \
--compute-scale 1 \
--ceph-storage-scale 3 \
--neutron-disable-tunneling \
--neutron-network-type vlan \
--neutron-network-vlan-ranges datacentre:1000:1100 \
--libvirt-type qemu \
--ntp-server clock.redhat.com \
--timeout 180

2. Rerun the deployment command with --compute-scale 2
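In other words, the scale-out step is the identical command from step 1 with only the compute count bumped; a sketch (the elided lines stand for the same -e files and flags as above, which must be passed again unchanged or the stack update will revert them):

```shell
openstack overcloud deploy --templates $THT \
  # ...same -e environment files and other flags as in step 1... \
  --compute-scale 2
```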


Actual results:
overcloud  | UPDATE_FAILED

pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired

Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped
Waiting for 1 resources to stop:
 * rabbitmq-clone
 * rabbitmq-clone
Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role name=target-role


Expected results:
The cluster restarts successfully and the scale-out completes.

Additional info:
Attaching the sosreports.

Comment 1 Marius Cornea 2016-02-17 14:18:13 UTC
Note that this leaves the overcloud in a non-functional state.
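For anyone hitting the same state, a possible manual recovery sketch on one of the controllers (assumes the pcs CLI; resource name taken from the error output above, commands are a suggestion, not a verified fix):

```shell
pcs resource cleanup rabbitmq-clone   # clear the failed stop actions from the CIB
pcs resource restart rabbitmq-clone   # retry the restart that Pacemaker timed out on
pcs status                            # confirm the clone is running on all controllers
```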

Comment 5 Fabio Massimo Di Nitto 2016-02-17 15:45:42 UTC
It seems very odd to me that IPv6 and SSL would be the cause of the problem. We have seen stop timeout errors like this before when the VMs were running on overcommitted hosts.

We will verify this, but in the meantime can you please make sure the problem is not overcommit on the host?

Comment 6 Marius Cornea 2016-02-17 15:51:28 UTC
This could indeed be a potential cause - there are 8 x overcloud VMs with 4 vCPUs and 8GB RAM each on a physical host with 16 cores and 64GB of RAM.
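A back-of-the-envelope check of those numbers (the figures are the ones quoted in this comment):

```shell
# 8 overcloud VMs, 4 vCPUs / 8 GB RAM each, on a 16-core / 64 GB host
vms=8; vcpus_per_vm=4; ram_per_vm_gb=8
host_cores=16; host_ram_gb=64
echo "vCPUs allocated: $(( vms * vcpus_per_vm )) on $host_cores cores"   # 32 on 16 -> 2:1 CPU overcommit
echo "RAM allocated: $(( vms * ram_per_vm_gb )) GB of $host_ram_gb GB"   # 64 of 64 -> no host headroom
```

So the CPUs are overcommitted 2:1 and the RAM allocation leaves the host itself no headroom, which is consistent with the overcommit theory.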

Comment 7 Fabio Massimo Di Nitto 2016-02-17 15:54:53 UTC
(In reply to Marius Cornea from comment #6)
> This could indeed be a potential cause - there are 8 x overcloud VMs with 4
> vCPUs and 8GB RAM each on a physical host with 16 cores and 64GB of RAM.

Ok, this sounds familiar already. We just recently closed a similar bug due to VMs being overcommitted.

Can we please have at least a test run on bare metal, or on specs closer to customer requirements, before filing urgent bugs?

If nothing else, it would at least rule out VM environment issues versus a real bug.

Thanks
Fabio

Comment 8 James Slagle 2016-02-17 16:58:28 UTC
It seems there's a repeated occurrence where RabbitMQ has failed to stop via Pacemaker (I've seen it 2 or 3 times, admittedly a small sample size).

Now that we have bumped all the systemd resource timeouts to 200s, one theory is that we've simply pushed the stop/start timeout problem onto the rabbitmq resource, which has a 90s stop timeout by default:

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster#L69
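If that theory holds, one possible mitigation sketch would be raising the stop timeout on the rabbitmq primitive as well (pcs syntax as on RHEL 7 controllers; the 200s value mirrors the systemd resources and is illustrative, not a tested recommendation):

```shell
pcs resource show rabbitmq                        # inspect the current op timeouts
pcs resource update rabbitmq op stop timeout=200s # raise stop timeout from the 90s default
```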

Comment 9 Marius Cornea 2016-02-17 20:51:38 UTC
(In reply to Fabio Massimo Di Nitto from comment #7)
> (In reply to Marius Cornea from comment #6)
> > This could indeed be a potential cause - there are 8 x overcloud VMs with 4
> > vCPUs and 8GB RAM each on a physical host with 16 cores and 64GB of RAM.
> 
> Ok, this sounds familiar already. We just recently closed a similar bug due
> to VMs being overcommitted.
> 
> Can we please have at least a test run on baremetal or specs that are closer
> to customer requirements before filing urgent bugs?
> 
> if nothing at least to exclude VMs vs bug.
> 
> Thanks
> Fabio

OK, I retried the same scenario on beefier hardware and the scale-out process completed fine. I guess we can close this one as not a bug.

Comment 10 Fabio Massimo Di Nitto 2016-02-17 20:56:00 UTC
(In reply to Marius Cornea from comment #9)
> (In reply to Fabio Massimo Di Nitto from comment #7)
> > (In reply to Marius Cornea from comment #6)
> > > This could indeed be a potential cause - there are 8 x overcloud VMs with 4
> > > vCPUs and 8GB RAM each on a physical host with 16 cores and 64GB of RAM.
> > 
> > Ok, this sounds familiar already. We just recently closed a similar bug due
> > to VMs being overcommitted.
> > 
> > Can we please have at least a test run on baremetal or specs that are closer
> > to customer requirements before filing urgent bugs?
> > 
> > if nothing at least to exclude VMs vs bug.
> > 
> > Thanks
> > Fabio
> 
> OK, I retried the same scenario on a beefier hardware and the scale out
> process completed fine. I guess we can close this one as not a bug.

OK but please re-open the bug if you experience the same issue again.

I understand the need for VM testing and all, but we at least need to make sure the VMs are not overcommitted; otherwise it becomes rather time-consuming to chase these issues.

