Description of problem:

RabbitMQ resource fails to stop during scale out with an additional compute in an IPv6 and SSL environment.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-121.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy the overcloud:

export THT=/home/stack/templates/my-overcloud
openstack overcloud deploy --templates $THT \
  -e $THT/environments/network-isolation-v6-storagev4.yaml \
  -e $THT/environments/net-single-nic-with-vlans-v6.yaml \
  -e /home/stack/templates/network-environment-v6.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  -e ~/templates/ceph.yaml \
  -e ~/templates/firstboot-environment.yaml \
  --control-scale 3 \
  --compute-scale 1 \
  --ceph-storage-scale 3 \
  --neutron-disable-tunneling \
  --neutron-network-type vlan \
  --neutron-network-vlan-ranges datacentre:1000:1100 \
  --libvirt-type qemu \
  --ntp-server clock.redhat.com \
  --timeout 180

2. Rerun the same deployment command with --compute-scale 2

Actual results:

overcloud | UPDATE_FAILED

pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired

Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped
Waiting for 1 resources to stop:
 * rabbitmq-clone
 * rabbitmq-clone
Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role name=target-role

Expected results:

The cluster restarts cleanly and the scale out completes.

Additional info:

Attaching the sosreports.
Note that this leaves the overcloud in a non-functional state.
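For anyone who hits the same failed state, a possible way to inspect and recover the stuck resource by hand — a sketch only, assuming a RHEL 7 controller with pcs, and using the `rabbitmq-clone` resource name from the error output above:

```shell
# On one controller, check which node's rabbitmq copy failed to stop
pcs status

# Clear the failed stop action so pacemaker retries managing the resource
pcs resource cleanup rabbitmq-clone

# If the clone was left disabled by the aborted restart, re-enable it
pcs resource enable rabbitmq-clone
```

These steps are not verified against this exact failure; they are the generic pcs recovery path for a resource stuck after a failed stop.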
It sounds very strange to me that IPv6 and SSL would be the cause of the problem. We have seen stop timeout errors like this before when the VMs were running on overcommitted hosts. We will verify this, but in the meantime can you please make sure the problem is not overcommit on the host?
This could indeed be a potential cause - there are 8 x overcloud VMs with 4 vCPUs and 8GB RAM each on a physical host with 16 cores and 64GB of RAM.
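The arithmetic behind the overcommit suspicion, using only the figures quoted above (a worked check, nothing more):

```shell
# Figures from the comment above: 8 overcloud VMs, 4 vCPUs / 8 GB RAM each,
# on a physical host with 16 cores and 64 GB RAM.
vms=8; vcpus_per_vm=4; ram_per_vm_gb=8
host_cores=16; host_ram_gb=64

total_vcpus=$((vms * vcpus_per_vm))    # 32 vCPUs scheduled on 16 cores: 2:1 CPU overcommit
total_ram_gb=$((vms * ram_per_vm_gb))  # 64 GB of guest RAM on a 64 GB host: zero headroom

echo "vCPU commit: ${total_vcpus} vCPUs on ${host_cores} cores"
echo "RAM commit:  ${total_ram_gb} GB of ${host_ram_gb} GB host RAM"
```

The 2:1 vCPU ratio plus zero RAM headroom for the hypervisor itself is what makes overcommit a plausible explanation for the stop timeouts.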
(In reply to Marius Cornea from comment #6)
> This could indeed be a potential cause - there are 8 x overcloud VMs with 4
> vCPUs and 8GB RAM each on a physical host with 16 cores and 64GB of RAM.

Ok, this sounds familiar already. We just recently closed a similar bug due to VMs being overcommitted.

Can we please have at least a test run on baremetal or specs that are closer to customer requirements before filing urgent bugs?

if nothing at least to exclude VMs vs bug.

Thanks
Fabio
It seems there is a recurring pattern of rabbitmq failing to stop via pacemaker (I've seen it 2 or 3 times, admittedly a small sample size). Now that we have bumped all the systemd resource timeouts to 200s, it could be theorized that we've just pushed the stop/start timeout problem onto the rabbitmq resource, which still has a 90s timeout by default:

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster#L69
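If that theory holds, one mitigation would be to raise the rabbitmq stop timeout the same way the systemd resource timeouts were bumped. A sketch only, not verified on this setup, assuming the pcs 0.9 syntax of the RHEL 7 era and `rabbitmq` as the primitive resource name inside `rabbitmq-clone`:

```shell
# On one controller: show the configured operations for the rabbitmq resource
# and look for the current "op stop ... timeout=" value
pcs resource show rabbitmq

# Raise the stop timeout from the 90s agent default to match the 200s
# already used for the systemd resources
pcs resource update rabbitmq op stop timeout=200s
```

This only widens the window for a slow shutdown; it does not address why rabbitmq takes that long to stop in the first place.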
(In reply to Fabio Massimo Di Nitto from comment #7)
> (In reply to Marius Cornea from comment #6)
> > This could indeed be a potential cause - there are 8 x overcloud VMs with 4
> > vCPUs and 8GB RAM each on a physical host with 16 cores and 64GB of RAM.
>
> Ok, this sounds familiar already. We just recently closed a similar bug due
> to VMs being overcommitted.
>
> Can we please have at least a test run on baremetal or specs that are closer
> to customer requirements before filing urgent bugs?
>
> if nothing at least to exclude VMs vs bug.
>
> Thanks
> Fabio

OK, I retried the same scenario on beefier hardware and the scale out process completed fine. I guess we can close this one as not a bug.
(In reply to Marius Cornea from comment #9)
> (In reply to Fabio Massimo Di Nitto from comment #7)
> > (In reply to Marius Cornea from comment #6)
> > > This could indeed be a potential cause - there are 8 x overcloud VMs with 4
> > > vCPUs and 8GB RAM each on a physical host with 16 cores and 64GB of RAM.
> >
> > Ok, this sounds familiar already. We just recently closed a similar bug due
> > to VMs being overcommitted.
> >
> > Can we please have at least a test run on baremetal or specs that are closer
> > to customer requirements before filing urgent bugs?
> >
> > if nothing at least to exclude VMs vs bug.
> >
> > Thanks
> > Fabio
>
> OK, I retried the same scenario on a beefier hardware and the scale out
> process completed fine. I guess we can close this one as not a bug.

OK, but please re-open the bug if you experience the same issue again. I understand the need for VM testing and all, but at the very least we need to make sure the VMs are not overcommitted, otherwise it becomes rather time-consuming to chase these issues.