Bug 1316089 - rabbitmq restart failure cause failed overcloud deployment
Summary: rabbitmq restart failure cause failed overcloud deployment
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 7.0 (Kilo)
Assignee: Peter Lemenkov
QA Contact: yeylon@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-03-09 11:48 UTC by Robin Cernin
Modified: 2019-11-14 07:34 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-06 08:59:16 UTC
Target Upstream Version:
Embargoed:



Description Robin Cernin 2016-03-09 11:48:08 UTC
During overcloud deployment, the logs show the following commands being executed:

++ pcs status --full
++ grep openstack-keystone
++ grep -v Clone
+ node_states='     openstack-keystone (systemd:openstack-keystone):   (target-role:Stopped) Started controller-2
     openstack-keystone (systemd:openstack-keystone):   (target-role:Stopped) Started controller-0
     openstack-keystone (systemd:openstack-keystone):   (target-role:Stopped) Started controller-1'
+ echo '     openstack-keystone (systemd:openstack-keystone):   (target-role:Stopped) Started controller-2
     openstack-keystone (systemd:openstack-keystone):   (target-role:Stopped) Started controller-0
     openstack-keystone (systemd:openstack-keystone):   (target-role:Stopped) Started controller-1'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3

^^ The block above occurs 37 times, then:

+ echo 'openstack-keystone has stopped'
+ return
+ pcs status
+ grep haproxy-clone
+ pcs resource restart haproxy-clone
+ pcs resource restart redis-master
+ pcs resource restart mongod-clone
+ pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired

Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped
Waiting for 1 resources to stop:
 * rabbitmq-clone
 * rabbitmq-clone
Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role name=target-role

These commands are part of ControllerPostPuppetRestartDeployment, which is the
step where the deployment failed:

  ControllerPostPuppetRestartConfig:
    type: OS::Heat::SoftwareConfig
    properties:
      group: script
      config: {get_file: pacemaker_resource_restart.sh}

 -- pacemaker_resource_restart.sh --

if [ "$pacemaker_status" = "active" -a \
     "$(hiera bootstrap_nodeid)" = "$(facter hostname)" ]; then

    #ensure neutron constraints like
    #https://review.openstack.org/#/c/245093/
    if  pcs constraint order show  | grep "start neutron-server-clone then start neutron-ovs-cleanup-clone"; then
        pcs constraint remove order-neutron-server-clone-neutron-ovs-cleanup-clone-mandatory
    fi

    pcs resource disable httpd
    check_resource httpd stopped 300
    pcs resource disable openstack-keystone
    check_resource openstack-keystone stopped 1200

    if pcs status | grep haproxy-clone; then
        pcs resource restart haproxy-clone
    fi
    pcs resource restart redis-master
    pcs resource restart mongod-clone
    pcs resource restart rabbitmq-clone

^^ The deployment failed at this step and did not continue. As a result,
all resources remain stopped: they all depend on openstack-keystone, which
was disabled earlier in the script and never re-enabled.

    pcs resource restart memcached-clone
    pcs resource restart galera-master

    pcs resource enable openstack-keystone
    check_resource openstack-keystone started 300
    pcs resource enable httpd
    check_resource httpd started 800

^^ We were able to restart rabbitmq-clone by unmanaging the rabbitmq
resource in Pacemaker and starting it manually; this worked without problems.
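
The check_resource helper called in the listing is not shown in this report. Judging only from the trace at the top of the description (filter `pcs status --full`, look for "Started", sleep 3 seconds, repeat), it probably behaves roughly like the sketch below. This is a reconstruction under that assumption, not the shipped script; the local variable names and the final error message are made up for illustration. The 37 iterations seen in the trace correspond to about 111 seconds of sleeping before openstack-keystone was reported as stopped.

```shell
#!/usr/bin/env bash
# Sketch of a polling helper consistent with the trace in this report.
# NOT the shipped pacemaker_resource_restart.sh; names are illustrative.
# Usage: check_resource <service> <started|stopped> <timeout-seconds>
check_resource() {
    local service=$1 state=$2 timeout=$3
    local incomplete node_states deadline

    # While waiting for "stopped", a node that still reports "Started" means
    # the transition is incomplete, and the other way round for "started".
    if [ "$state" = "stopped" ]; then
        incomplete='Started'
    else
        incomplete='Stopped'
    fi

    deadline=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        # Same filter as in the trace: per-node lines only, no Clone header.
        node_states=$(pcs status --full | grep "$service" | grep -v Clone)
        if echo "$node_states" | grep -q "$incomplete"; then
            echo "$service not yet $state, sleeping 3 seconds."
            sleep 3
        else
            echo "$service has $state"
            return 0
        fi
    done

    # Assumed behaviour on timeout: report the failure and return non-zero.
    echo "ERROR: $service not $state after $timeout seconds." >&2
    return 1
}
```

Note that a helper like this only guards the disable/enable steps. The `pcs resource restart` calls in the listing are not wrapped by it, so a restart that times out aborts the script before openstack-keystone is enabled again.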

Most importantly, the deployment ends with:

Error: Could not prefetch keystone_tenant provider 'openstack': undefined method `collect' for nil:NilClass
Error: Could not prefetch keystone_role provider 'openstack': undefined method `collect' for nil:NilClass
Error: Could not prefetch keystone_user provider 'openstack': undefined method `collect' for nil:NilClass
Error: /Stage[main]/Keystone::Roles::Admin/Keystone_user_role[admin@admin]: Could not evaluate: undefined method `empty?' for nil:NilClass
Warning: /Stage[main]/Heat::Keystone::Domain/Exec[heat_domain_create]: Skipping because of failed dependencies

Comment 6 Mark McLoughlin 2016-03-18 07:18:25 UTC
We have put in place a new OSP director validation to catch this scenario:

https://github.com/rthallisey/clapper/commit/c64a2b2d59b2e2519332e3ff5b80c4e0a14b6556

Documentation on using the validations is available here:

https://mojo.redhat.com/docs/DOC-1040866

Comment 8 Chen 2016-07-06 08:59:16 UTC
Hi team,

I increased the timeout to 300 and now rabbitmq-clone restarts successfully. Sorry for the noise; closing the bug.

Best Regards,
Chen

Comment 9 Andreas Karis 2016-12-27 21:36:41 UTC
Adding this to have a complete solution / workaround in this bugzilla:

Increase the stop timeout to 360:
~~~
[root@overcloud-controller-0 ~]# pcs resource update rabbitmq op stop timeout=360
~~~

Verify again:
~~~
[root@overcloud-controller-0 ~]# pcs resource show rabbitmq
 Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
  Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all"}"
  Meta Attrs: notify=true 
  Operations: start interval=0s timeout=100 (rabbitmq-start-interval-0s)
              stop interval=0s timeout=360 (rabbitmq-stop-interval-0s)
              monitor interval=10 timeout=40 (rabbitmq-monitor-interval-10)
~~~
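
The verification step can also be scripted. The following is a minimal sketch, assuming the `pcs resource show` output keeps the layout shown above (an operation line of the form `stop interval=... timeout=N`); both function names are hypothetical helpers, not part of pcs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `pcs resource show <id>` output on stdin and print
# the timeout of the stop operation (prints nothing if no stop op is listed).
stop_timeout_from_show() {
    sed -n 's/^.*stop interval=[^ ]* timeout=\([0-9][0-9]*\).*$/\1/p' | head -n 1
}

# Hypothetical helper: succeed only if the stop timeout read from stdin is at
# least the number of seconds given as the first argument.
stop_timeout_at_least() {
    local wanted=$1 actual
    actual=$(stop_timeout_from_show)
    [ -n "$actual" ] && [ "$actual" -ge "$wanted" ]
}
```

On a controller this could be used as `pcs resource show rabbitmq | stop_timeout_at_least 360 || pcs resource update rabbitmq op stop timeout=360`, which applies the update from this comment only when the current value is lower or missing.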

