Bug 1309339

Summary: RabbitMQ resource fails to stop during scale out run on IPv6 and SSL environment
Product: Red Hat OpenStack
Component: rhosp-director
Version: 7.0 (Kilo)
Target Milestone: y3
Target Release: 7.0 (Kilo)
Status: CLOSED NOTABUG
Severity: urgent
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Marius Cornea <mcornea>
Assignee: Angus Thomas <athomas>
QA Contact: yeylon <yeylon>
CC: dbecker, fdinitto, jslagle, kbasil, mburns, morazi, rhel-osp-director-maint, srevivo, yeylon
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-02-17 20:56:00 UTC

Description Marius Cornea 2016-02-17 14:15:51 UTC
Description of problem:
The RabbitMQ resource fails to stop during scale-out with an additional compute node in an IPv6 and SSL environment.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-121.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud:
export THT=/home/stack/templates/my-overcloud 
openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation-v6-storagev4.yaml \
-e $THT/environments/net-single-nic-with-vlans-v6.yaml \
-e /home/stack/templates/network-environment-v6.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e ~/templates/ceph.yaml \
-e ~/templates/firstboot-environment.yaml \
--control-scale 3 \
--compute-scale 1 \
--ceph-storage-scale 3 \
--neutron-disable-tunneling \
--neutron-network-type vlan \
--neutron-network-vlan-ranges datacentre:1000:1100 \
--libvirt-type qemu \
--ntp-server clock.redhat.com \
--timeout 180

2. Rerun the deployment command with --compute-scale 2


Actual results:
overcloud  | UPDATE_FAILED

 pcs resource restart rabbitmq-clone
 Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
 Error performing operation: Timer expired

 Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped
 Waiting for 1 resources to stop:
  * rabbitmq-clone
  * rabbitmq-clone
 Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role name=target-role
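When the update fails at this point, the state of the stuck clone can be inspected on one of the controllers; a diagnostic sketch (resource names taken from the error above, commands as on a RHEL 7 pacemaker node):

```shell
pcs status                            # look for rabbitmq-clone members stuck in "Stopping"
pcs resource failcount show rabbitmq  # per-node failure counts for the resource
rabbitmqctl cluster_status            # what rabbit itself thinks of its cluster
journalctl -u pacemaker -n 200        # recent pacemaker log, stop-operation timeouts show up here
```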


Expected results:
The cluster restarts cleanly and the scale-out completes successfully.

Additional info:
Attaching the sosreports.

Comment 1 Marius Cornea 2016-02-17 14:18:13 UTC
Note that this leaves the overcloud in a non-functional state.

Comment 5 Fabio Massimo Di Nitto 2016-02-17 15:45:42 UTC
It sounds very strange to me that IPv6 and SSL would be the cause of the problem. We have seen similar stop-timeout errors before when the VMs were running on overcommitted hosts.

We will verify this, but in the meantime can you please make sure the problem is not overcommit on the host?

Comment 6 Marius Cornea 2016-02-17 15:51:28 UTC
This could indeed be a potential cause - there are 8 x overcloud VMs with 4 vCPUs and 8GB RAM each on a physical host with 16 cores and 64GB of RAM.
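The overcommit arithmetic for the setup described above can be sketched quickly (host and VM sizes as stated in the comment):

```shell
# Overcommit check: 8 overcloud VMs, 4 vCPUs and 8 GB RAM each,
# on a physical host with 16 cores and 64 GB RAM.
vms=8; vcpus_per_vm=4; ram_gb_per_vm=8
host_cores=16; host_ram_gb=64

echo "vCPUs committed: $(( vms * vcpus_per_vm )) on $host_cores cores (2:1 CPU overcommit)"
echo "RAM committed:   $(( vms * ram_gb_per_vm )) GB of $host_ram_gb GB (no headroom for the host)"
```

So the CPUs are overcommitted 2:1 and the guests alone account for every byte of host RAM, which is consistent with the overcommit theory.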

Comment 7 Fabio Massimo Di Nitto 2016-02-17 15:54:53 UTC
(In reply to Marius Cornea from comment #6)
> This could indeed be a potential cause - there are 8 x overcloud VMs with 4
> vCPUs and 8GB RAM each on a physical host with 16 cores and 64GB of RAM.

Ok, this sounds familiar already. We just recently closed a similar bug due to VMs being overcommitted.

Can we please get at least one test run on bare metal, or on specs closer to customer requirements, before filing urgent bugs?

If nothing else, it would rule out the VM environment as the cause.

Thanks
Fabio

Comment 8 James Slagle 2016-02-17 16:58:28 UTC
It seems there's a repeated occurrence of rabbitmq failing to stop via pacemaker (I've seen it 2 or 3 times, so definitely a small sample size).

Now that we have bumped all the systemd resource timeouts to 200s, one theory is that we've simply pushed the stop/start timeout problem onto the rabbitmq resource, which has a 90s timeout by default.

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster#L69
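If that 90s default is indeed the bottleneck, the stop timeout can be raised on the resource itself; a sketch with pcs (the resource name `rabbitmq` is assumed here; verify the actual name with `pcs resource` on a controller):

```shell
# Inspect the current operation timeouts on the rabbitmq resource
pcs resource show rabbitmq

# Raise the stop timeout to match the 200s used for the systemd resources
pcs resource update rabbitmq op stop interval=0s timeout=200s
```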

Comment 9 Marius Cornea 2016-02-17 20:51:38 UTC
(In reply to Fabio Massimo Di Nitto from comment #7)
> (In reply to Marius Cornea from comment #6)
> > This could indeed be a potential cause - there are 8 x overcloud VMs with 4
> > vCPUs and 8GB RAM each on a physical host with 16 cores and 64GB of RAM.
> 
> Ok, this sounds familiar already. We just recently closed a similar bug due
> to VMs being overcommitted.
> 
> Can we please have at least a test run on baremetal or specs that are closer
> to customer requirements before filing urgent bugs?
> 
> if nothing at least to exclude VMs vs bug.
> 
> Thanks
> Fabio

OK, I retried the same scenario on beefier hardware and the scale-out process completed fine. I guess we can close this one as not a bug.

Comment 10 Fabio Massimo Di Nitto 2016-02-17 20:56:00 UTC
(In reply to Marius Cornea from comment #9)
> OK, I retried the same scenario on a beefier hardware and the scale out
> process completed fine. I guess we can close this one as not a bug.

OK but please re-open the bug if you experience the same issue again.

I understand the need for VM testing and all, but at the very least we need to make sure the VMs are not overcommitted; otherwise it becomes rather time-consuming to chase these issues.