Bug 1309339 - RabbitMQ resource fails to stop during scale out run on IPv6 and SSL environment
Status: CLOSED NOTABUG
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: y3
Target Release: 7.0 (Kilo)
Assigned To: Angus Thomas
QA Contact: yeylon@redhat.com
Depends On:
Blocks:
Reported: 2016-02-17 09:15 EST by Marius Cornea
Modified: 2016-04-18 03:12 EDT
CC List: 9 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-17 15:56:00 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Marius Cornea 2016-02-17 09:15:51 EST
Description of problem:
The RabbitMQ resource fails to stop during scale out with an additional compute node in an IPv6 and SSL environment.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-121.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud:
export THT=/home/stack/templates/my-overcloud 
openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation-v6-storagev4.yaml \
-e $THT/environments/net-single-nic-with-vlans-v6.yaml \
-e /home/stack/templates/network-environment-v6.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e ~/templates/ceph.yaml \
-e ~/templates/firstboot-environment.yaml \
--control-scale 3 \
--compute-scale 1 \
--ceph-storage-scale 3 \
--neutron-disable-tunneling \
--neutron-network-type vlan \
--neutron-network-vlan-ranges datacentre:1000:1100 \
--libvirt-type qemu \
--ntp-server clock.redhat.com \
--timeout 180

2. Rerun the deployment command with --compute-scale 2
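A minimal sketch of what that rerun looks like, assuming nothing else changes from step 1 except the compute count:

openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation-v6-storagev4.yaml \
-e $THT/environments/net-single-nic-with-vlans-v6.yaml \
-e /home/stack/templates/network-environment-v6.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e ~/templates/ceph.yaml \
-e ~/templates/firstboot-environment.yaml \
--control-scale 3 \
--compute-scale 2 \
--ceph-storage-scale 3 \
--neutron-disable-tunneling \
--neutron-network-type vlan \
--neutron-network-vlan-ranges datacentre:1000:1100 \
--libvirt-type qemu \
--ntp-server clock.redhat.com \
--timeout 180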


Actual results:
overcloud  | UPDATE_FAILED

pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired

Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped
Waiting for 1 resources to stop:
 * rabbitmq-clone
 * rabbitmq-clone
Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role name=target-role
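Not from the report, but for triage a few standard pcs commands can show where the clone is stuck (run on one of the controllers; the resource name is taken from the output above):

# overall cluster state, including failed actions on rabbitmq-clone
pcs status
# current configuration and operation timeouts of the clone
pcs resource show rabbitmq-clone
# clear the failure history once the underlying cause has been addressed
pcs resource cleanup rabbitmq-clone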


Expected results:
The cluster restarts successfully and the scale out completes without errors.

Additional info:
Attaching the sosreports.
Comment 1 Marius Cornea 2016-02-17 09:18:13 EST
Note that this leaves the overcloud in a non-functional state.
Comment 5 Fabio Massimo Di Nitto 2016-02-17 10:45:42 EST
It sounds very strange to me that IPv6 and SSL would be the cause of this problem. We have seen stop timeout errors before when the VMs were running on overcommitted hosts.

We will verify this, but in the meantime can you please make sure the problem is not overcommitment on the host?
Comment 6 Marius Cornea 2016-02-17 10:51:28 EST
This could indeed be a potential cause - there are 8 x overcloud VMs with 4 vCPUs and 8GB RAM each on a physical host with 16 cores and 64GB of RAM.
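For context, a rough back-of-the-envelope on those numbers (rough arithmetic, not taken from the report):

8 VMs x 4 vCPUs = 32 vCPUs on 16 physical cores -> 2:1 CPU overcommit
8 VMs x 8 GB RAM = 64 GB on a 64 GB host -> no memory headroom left for the hypervisor

so slow stop operations under load would not be surprising.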
Comment 7 Fabio Massimo Di Nitto 2016-02-17 10:54:53 EST
(In reply to Marius Cornea from comment #6)
> This could indeed be a potential cause - there are 8 x overcloud VMs with 4
> vCPUs and 8GB RAM each on a physical host with 16 cores and 64GB of RAM.

Ok, this sounds familiar already. We just recently closed a similar bug due to VMs being overcommitted.

Can we please have at least a test run on baremetal or specs that are closer to customer requirements before filing urgent bugs?

If nothing else, at least to rule out the VM setup versus an actual bug.

Thanks
Fabio
Comment 8 James Slagle 2016-02-17 11:58:28 EST
It seems there's a recurring issue where rabbitmq has failed to stop via pacemaker (I've seen it 2 or 3 times, admittedly a small sample size).

Now that we have bumped all the systemd resource timeouts to 200s, it could be theorized that we've just pushed the stop/start timeout problem onto the rabbitmq resource, which has a 90s timeout by default.

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster#L69
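A sketch of how that timeout could be checked and raised via pcs, assuming the primitive inside rabbitmq-clone is simply named rabbitmq (the exact resource name is not confirmed in this report):

# show the operations/timeouts currently configured on the rabbitmq primitive
pcs resource show rabbitmq
# raise the start and stop timeouts to match the 200s used for the systemd resources
pcs resource update rabbitmq op start timeout=200s
pcs resource update rabbitmq op stop timeout=200s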
Comment 9 Marius Cornea 2016-02-17 15:51:38 EST
(In reply to Fabio Massimo Di Nitto from comment #7)
> (In reply to Marius Cornea from comment #6)
> > This could indeed be a potential cause - there are 8 x overcloud VMs with 4
> > vCPUs and 8GB RAM each on a physical host with 16 cores and 64GB of RAM.
> 
> Ok, this sounds familiar already. We just recently closed a similar bug due
> to VMs being overcommitted.
> 
> Can we please have at least a test run on baremetal or specs that are closer
> to customer requirements before filing urgent bugs?
> 
> If nothing else, at least to rule out the VM setup versus an actual bug.
> 
> Thanks
> Fabio

OK, I retried the same scenario on beefier hardware and the scale out process completed fine. I guess we can close this one as not a bug.
Comment 10 Fabio Massimo Di Nitto 2016-02-17 15:56:00 EST
(In reply to Marius Cornea from comment #9)
> (In reply to Fabio Massimo Di Nitto from comment #7)
> > (In reply to Marius Cornea from comment #6)
> > > This could indeed be a potential cause - there are 8 x overcloud VMs with 4
> > > vCPUs and 8GB RAM each on a physical host with 16 cores and 64GB of RAM.
> > 
> > Ok, this sounds familiar already. We just recently closed a similar bug due
> > to VMs being overcommitted.
> > 
> > Can we please have at least a test run on baremetal or specs that are closer
> > to customer requirements before filing urgent bugs?
> > 
> > If nothing else, at least to rule out the VM setup versus an actual bug.
> > 
> > Thanks
> > Fabio
> 
> OK, I retried the same scenario on beefier hardware and the scale out
> process completed fine. I guess we can close this one as not a bug.

OK but please re-open the bug if you experience the same issue again.

I understand the need for VM testing and all, but we at least need to make sure the VMs are not overcommitted; otherwise it becomes rather time-consuming to chase these issues.
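Not part of the bug itself, but a quick way to sanity-check overcommit on a libvirt-based virt host before filing (standard virsh commands, assuming the overcloud VMs run under libvirt on a single host):

# physical CPUs and memory available on the host
virsh nodeinfo
# vCPUs and memory allocated to each running VM
for vm in $(virsh list --name); do virsh dominfo "$vm" | grep -E 'Name|CPU\(s\)|Max memory'; done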
