During overcloud deployment, the following commands can be seen executing in the logs:

~~~
++ pcs status --full
++ grep openstack-keystone
++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started controller-2
 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started controller-0
 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started controller-1'
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started controller-2
 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started controller-0
 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started controller-1'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
~~~

^^ the block above occurs 37 times, then:

~~~
+ echo 'openstack-keystone has stopped'
+ return
+ pcs status
+ grep haproxy-clone
+ pcs resource restart haproxy-clone
+ pcs resource restart redis-master
+ pcs resource restart mongod-clone
+ pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired
Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped
Waiting for 1 resources to stop:
* rabbitmq-clone
* rabbitmq-clone
Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role name=target-role
~~~

This is part of the ControllerPostPuppetRestartDeployment, where the deployment failed:

~~~
  ControllerPostPuppetRestartConfig:
    type: OS::Heat::SoftwareConfig
    properties:
      group: script
      config: {get_file: pacemaker_resource_restart.sh}
~~~

From pacemaker_resource_restart.sh:

~~~
if [ "$pacemaker_status" = "active" -a \
     "$(hiera bootstrap_nodeid)" = "$(facter hostname)" ]; then

    # ensure neutron constraints like
    # https://review.openstack.org/#/c/245093/
    if pcs constraint order show | grep "start neutron-server-clone then start neutron-ovs-cleanup-clone"; then
        pcs constraint remove order-neutron-server-clone-neutron-ovs-cleanup-clone-mandatory
    fi

    pcs resource disable httpd
    check_resource httpd stopped 300
    pcs resource disable openstack-keystone
    check_resource openstack-keystone stopped 1200

    if pcs status | grep haproxy-clone; then
        pcs resource restart haproxy-clone
    fi
    pcs resource restart redis-master
    pcs resource restart mongod-clone
    pcs resource restart rabbitmq-clone
    pcs resource restart memcached-clone
    pcs resource restart galera-master

    pcs resource enable openstack-keystone
    check_resource openstack-keystone started 300
    pcs resource enable httpd
    check_resource httpd started 800
~~~

^^ The deployment failed at the `pcs resource restart rabbitmq-clone` step and didn't continue. As a result, the remaining resources were all left stopped, since they all depend on keystone, which was never re-enabled.

^^ We were able to restart rabbitmq-clone by unmanaging it in Pacemaker and starting it manually; this worked perfectly.

^^ And most importantly, the deployment ends with:

~~~
Error: Could not prefetch keystone_tenant provider 'openstack': undefined method `collect' for nil:NilClass
Error: Could not prefetch keystone_role provider 'openstack': undefined method `collect' for nil:NilClass
Error: Could not prefetch keystone_user provider 'openstack': undefined method `collect' for nil:NilClass
Error: /Stage[main]/Keystone::Roles::Admin/Keystone_user_role[admin@admin]: Could not evaluate: undefined method `empty?' for nil:NilClass
Warning: /Stage[main]/Heat::Keystone::Domain/Exec[heat_domain_create]: Skipping because of failed dependencies
~~~
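For reference, the "sleeping 3 seconds" lines repeated 37 times in the trace come from a poll-until-timeout pattern like the one `check_resource` implements. A minimal sketch of that pattern is below; `wait_until` is a hypothetical name and interface, not the actual tripleo helper:

~~~shell
# Minimal sketch of the check_resource polling pattern: run a test
# command every 3 seconds until it succeeds or the timeout (seconds)
# expires. Hypothetical helper, for illustration only.
wait_until() {
  local timeout=$1
  shift
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep 3
    waited=$((waited + 3))
  done
}
~~~

Under this sketch, `check_resource openstack-keystone stopped 1200` behaves roughly like `wait_until 1200 sh -c '! pcs status --full | grep openstack-keystone | grep -v Clone | grep -q Started'`, and the rabbitmq failure above is the analogous wait expiring before the broker finished shutting down.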
We have put in place a new OSP director validation to catch this scenario: https://github.com/rthallisey/clapper/commit/c64a2b2d59b2e2519332e3ff5b80c4e0a14b6556

Documentation on using the validations is available here: https://mojo.redhat.com/docs/DOC-1040866
Hi team,

I increased the timeout to 300 and now the rabbitmq-clone can restart successfully. Sorry for the noise; closing the bug.

Best Regards,
Chen
Adding this to have a complete solution / workaround in this bugzilla:

Increase the stop timeout to 360:

~~~
[root@overcloud-controller-0 ~]# pcs resource update rabbitmq op stop timeout=360
~~~

Verify again:

~~~
[root@overcloud-controller-0 ~]# pcs resource show rabbitmq
 Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
  Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all"}"
  Meta Attrs: notify=true
  Operations: start interval=0s timeout=100 (rabbitmq-start-interval-0s)
              stop interval=0s timeout=360 (rabbitmq-stop-interval-0s)
              monitor interval=10 timeout=40 (rabbitmq-monitor-interval-10)
~~~
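If you want to script that verification step (for example in a post-change check), a small helper can parse the `pcs resource show` output for the configured stop timeout. `check_stop_timeout` is an illustrative name I made up, not a pcs or tripleo command:

~~~shell
# Hypothetical helper: read `pcs resource show <name>` output on stdin
# and succeed only if the configured stop timeout matches the expected
# value (e.g. 360 after applying the workaround above).
check_stop_timeout() {
  local expected=$1
  local actual
  actual=$(grep -o 'stop interval=[^ ]* timeout=[0-9]*' \
             | grep -o 'timeout=[0-9]*' | cut -d= -f2)
  [ "$actual" = "$expected" ]
}
# Intended usage on a controller (assumed):
#   pcs resource show rabbitmq | check_stop_timeout 360
~~~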