| Summary: | rabbitmq restart failure cause failed overcloud deployment | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Robin Cernin <rcernin> |
| Component: | rabbitmq-server | Assignee: | Peter Lemenkov <plemenko> |
| Status: | CLOSED NOTABUG | QA Contact: | yeylon <yeylon> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 7.0 (Kilo) | CC: | akaris, apevec, cchen, fdinitto, jeckersb, jschluet, lhh, markmc, plemenko, srevivo, yeylon |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 7.0 (Kilo) | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-07-06 08:59:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
We have put in place a new OSP director validation to catch this scenario: https://github.com/rthallisey/clapper/commit/c64a2b2d59b2e2519332e3ff5b80c4e0a14b6556 Documentation on using the validations are available here: https://mojo.redhat.com/docs/DOC-1040866 Hi team, I increased the timeout to 300 and now the rabbitmq-clone can restart successfully. Sorry for the noise and closing the bug. Best Regards, Chen Adding this to have a complete solution / workaround in this bugzilla:
Increase the stop timeout to 360:
~~~
[root@overcloud-controller-0 ~]# pcs resource update rabbitmq op stop timeout=360
~~~
Verify again:
~~~
[root@overcloud-controller-0 ~]# pcs resource show rabbitmq
Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all"}"
Meta Attrs: notify=true
Operations: start interval=0s timeout=100 (rabbitmq-start-interval-0s)
stop interval=0s timeout=360 (rabbitmq-stop-interval-0s)
monitor interval=10 timeout=40 (rabbitmq-monitor-interval-10)
~~~
|
During overcloud deployment we can find following commands are getting executed in the logs: ++ pcs status --full ++ grep openstack-keystone ++ grep -v Clone + node_states=' openstack-keystone (systemd:openstack- keystone): (target-role:Stopped) Started controller-2 openstack-keystone (systemd:openstack-keystone): (target- role:Stopped) Started controller-0 openstack-keystone (systemd:openstack-keystone): (target- role:Stopped) Started controller-1' + echo ' openstack-keystone (systemd:openstack- keystone): (target-role:Stopped) Started controller-2 openstack-keystone (systemd:openstack-keystone): (target- role:Stopped) Started controller-0 openstack-keystone (systemd:openstack-keystone): (target- role:Stopped) Started controller-1' + grep -q Started + echo 'openstack-keystone not yet stopped, sleeping 3 seconds.' + sleep 3 ^^ the item above has occurs 37 times then: + echo 'openstack-keystone has stopped' + return + pcs status + grep haproxy-clone + pcs resource restart haproxy-clone + pcs resource restart redis-master + pcs resource restart mongod-clone + pcs resource restart rabbitmq-clone Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining Error performing operation: Timer expired Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target- role set=rabbitmq-clone-meta_attributes name=target-role=stopped Waiting for 1 resources to stop: * rabbitmq-clone * rabbitmq-clone Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes- target-role name=target-role this is part of the ControllerPostPuppetRestartDeployment where it failed deployment: ControllerPostPuppetRestartConfig: type: OS::Heat::SoftwareConfig properties: group: script config: {get_file: pacemaker_resource_restart.sh} -- pacemaker_resource_restart.sh -- 38 if [ "$pacemaker_status" = "active" -a \ 39 "$(hiera bootstrap_nodeid)" = "$(facter hostname)" ]; then 40 41 #ensure neutron constraints like 42 #https://review.openstack.org/#/c/245093/ 43 if pcs constraint order show | grep "start neutron-server- clone then start neutron-ovs-cleanup-clone"; then 44 pcs constraint remove order-neutron-server-clone-neutron- ovs-cleanup-clone-mandatory 45 fi 46 47 pcs resource disable httpd 48 check_resource httpd stopped 300 49 pcs resource disable openstack-keystone 50 check_resource openstack-keystone stopped 1200 51 52 if pcs status | grep haproxy-clone; then 53 pcs resource restart haproxy-clone 54 fi 55 pcs resource restart redis-master 56 pcs resource restart mongod-clone 57 pcs resource restart rabbitmq-clone ^^ Deployment failed in this step and didn't continue. Thus the resources are all stopped, as they all depend on the keystone, which wasn't enabled. 58 pcs resource restart memcached-clone 59 pcs resource restart galera-master 60 61 pcs resource enable openstack-keystone 62 check_resource openstack-keystone started 300 63 pcs resource enable httpd 64 check_resource httpd started 800 65 ^^ We were able to re-start the rabbitmq-clone by unmanaging it rabbit from Pacemaker and trying to start it manually, this worked perfectly ^^ And the most important the deployment ends with: Error: Could not prefetch keystone_tenant provider 'openstack': undefined method collect' for nil:NilClass Error: Could not prefetch keystone_role provider 'openstack': undefined method collect' for nil:NilClass Error: Could not prefetch keystone_user provider 'openstack': undefined method collect' for nil:NilClass Error: /Stage[main]/Keystone::Roles::Admin/Keystone_user_role[admin@adm in]: Could not evaluate: undefined method empty?' for nil:NilClass Warning: /Stage[main]/Heat::Keystone::Domain/Exec[heat_domain_create]: Skipping because of failed dependencies