Bug 1384068
| Summary: | Wrong service restart prevents overcloud deployment | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Raoul Scarazzini <rscarazz> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Jiri Stransky <jstransk> |
| Status: | CLOSED DUPLICATE | QA Contact: | Arik Chernetsky <achernet> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 7.0 (Kilo) | CC: | jstransk, mburns, michele, rhel-osp-director-maint, rscarazz |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-11-15 09:00:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Looking at the code, it seems a restart is indeed triggered on create as well. Upstream, this code landed with the "update_identifier" condition (we probably backported it downstream before it merged upstream back then).
if [ "$pacemaker_status" = "active" -a \
"$(hiera bootstrap_nodeid)" = "$(facter hostname)" -a \
"$(hiera update_identifier)" != "nil" ]; then
https://github.com/openstack/tripleo-heat-templates/commit/ea1294fe9b11029edab719e8bf558733226b3fd4
However, I don't think we should just pull that in, as it introduces:
https://bugs.launchpad.net/tripleo/+bug/1567384
which needs to be fixed together with:
https://bugs.launchpad.net/tripleo/+bug/1567385
Restarting the services at the end of the deployment makes the deployment take slightly longer, but AFAIK it has never caused severe issues during the last year on OSP 7 and it shouldn't cause such issues now either. The real problem is probably that the restart doesn't succeed. At the moment, with the current info, I don't think backporting all the fixes mentioned above is worth it -- there is not much use in removing the restart from stack-create if the restart can still fail later on stack-update. Hence I think we should focus on why the restart fails rather than on whether it's triggered on stack-create (triggering it on stack-create is a bug too, but with much lower severity).
Raoul, can you please post the restart script's output (IIRC it can be collected via `heat deployment-output-show`), and what state the pacemaker cluster is in after the restart? (Probably pcs status and the cluster log?)
Hi Jiri,
unfortunately the last deploy went well and so I don't have any log for this. What I can say for sure is that the cluster status was fine, there were no failed actions.
But the fact that the issue did not happen this time might mean that this is simply a timeout issue (stop or start), and the timeout just needs to be increased.
The current settings for rabbit are:
Clone: rabbitmq-clone
Meta Attrs: ordered=true interleave=true
Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all"}"
Meta Attrs: notify=true
Operations: start interval=0s timeout=100 (rabbitmq-start-interval-0s)
stop interval=0s timeout=90 (rabbitmq-stop-interval-0s)
monitor interval=10 timeout=40 (rabbitmq-monitor-interval-10)
These are lower than the overall default for the systemd resources (200s), so increasing both of those values may be enough.
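The comparison above can be checked mechanically. The following is a minimal sketch, assuming the operation lines have the same text layout as the ones quoted in this comment; the function name `ops_below_threshold` and the choice to pass the 200s default as an argument are illustrative, not part of any existing tool.

```shell
#!/bin/sh
# Hypothetical helper: report start/stop operations whose timeout (seconds)
# is below a threshold. Reads operation lines in the quoted format on stdin.
ops_below_threshold() {
    threshold=${1:-200}
    # Extract "<op> <timeout>" pairs from lines such as
    #   "start interval=0s timeout=100 (rabbitmq-start-interval-0s)"
    sed -n -E 's/.*(start|stop) interval=[^ ]+ timeout=([0-9]+).*/\1 \2/p' |
    while read -r op timeout; do
        if [ "$timeout" -lt "$threshold" ]; then
            echo "$op timeout=${timeout}s is below ${threshold}s"
        fi
    done
}

# Example with the values quoted in this comment.
ops_below_threshold 200 <<'EOF'
Operations: start interval=0s timeout=100 (rabbitmq-start-interval-0s)
stop interval=0s timeout=90 (rabbitmq-stop-interval-0s)
monitor interval=10 timeout=40 (rabbitmq-monitor-interval-10)
EOF
```

Only start and stop are examined, since those are the two values discussed here; the monitor timeout is ignored on purpose.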
Hi again Jiri,
reproduced; you can find all the sosreports here [1]. In addition, to make the investigation easier and save you some time, here's the output of the action that failed:
{ "deploy_stdout": "httpd not yet stopped, sleeping 3 seconds.
httpd has stopped
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone has stopped
Clone Set: haproxy-clone [haproxy]
haproxy-clone successfully restarted
redis-master successfully restarted
mongod-clone successfully restarted
", "deploy_stderr": "++ systemctl is-active pacemaker
+ pacemaker_status=active
+ check_interval=3
++ hiera bootstrap_nodeid
++ facter hostname
+ '[' active = active -a overcloud-controller-0 = overcloud-controller-0 ']'
+ pcs constraint order show
+ grep 'start neutron-server-clone then start neutron-ovs-cleanup-clone'
+ pcs resource disable httpd
+ check_resource httpd stopped 300
+ service=httpd
+ state=stopped
+ timeout=300
++ date +%s
+ tstart=1476697518
+ tend=1476697818
+ '[' stopped = stopped ']'
+ match_for_incomplete=Started
++ date +%s
+ (( 1476697518 < 1476697818 ))
++ pcs status --full
++ grep httpd
++ grep -v Clone
+ node_states=' httpd (systemd:httpd): (target-role:Stopped) Started overcloud-controller-1
httpd (systemd:httpd): (target-role:Stopped) Started overcloud-controller-0
httpd (systemd:httpd): (target-role:Stopped) Started overcloud-controller-2'
+ echo ' httpd (systemd:httpd): (target-role:Stopped) Started overcloud-controller-1
httpd (systemd:httpd): (target-role:Stopped) Started overcloud-controller-0
httpd (systemd:httpd): (target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'httpd not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697522 < 1476697818 ))
++ grep httpd
++ pcs status --full
++ grep -v Clone
+ node_states=' httpd (systemd:httpd): (target-role:Stopped) Stopped
httpd (systemd:httpd): (target-role:Stopped) Stopped
httpd (systemd:httpd): (target-role:Stopped) Stopped'
+ echo ' httpd (systemd:httpd): (target-role:Stopped) Stopped
httpd (systemd:httpd): (target-role:Stopped) Stopped
httpd (systemd:httpd): (target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'httpd has stopped'
+ return
+ pcs resource disable openstack-keystone
+ check_resource openstack-keystone stopped 1200
+ service=openstack-keystone
+ state=stopped
+ timeout=1200
++ date +%s
+ tstart=1476697523
+ tend=1476698723
+ '[' stopped = stopped ']'
+ match_for_incomplete=Started
++ date +%s
+ (( 1476697523 < 1476698723 ))
++ pcs status --full
++ grep openstack-keystone
++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697527 < 1476698723 ))
++ grep openstack-keystone
++ pcs status --full
++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697531 < 1476698723 ))
++ pcs status --full
++ grep openstack-keystone
++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
+ grep -q Started
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697535 < 1476698723 ))
++ pcs status --full
++ grep openstack-keystone
++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697539 < 1476698723 ))
++ grep openstack-keystone
++ pcs status --full
++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697543 < 1476698723 ))
++ pcs status --full
++ grep openstack-keystone
++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697547 < 1476698723 ))
++ pcs status --full
++ grep openstack-keystone
++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697551 < 1476698723 ))
++ pcs status --full
++ grep openstack-keystone
++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697555 < 1476698723 ))
++ pcs status --full
++ grep openstack-keystone
++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697558 < 1476698723 ))
++ grep openstack-keystone
++ pcs status --full
++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697562 < 1476698723 ))
++ grep openstack-keystone
++ pcs status --full
++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697566 < 1476698723 ))
++ grep openstack-keystone
++ pcs status --full
++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697570 < 1476698723 ))
++ grep openstack-keystone
++ pcs status --full
++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped'
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'openstack-keystone has stopped'
+ return
+ pcs status
+ grep haproxy-clone
+ pcs resource restart haproxy-clone
+ pcs resource restart redis-master
+ pcs resource restart mongod-clone
+ pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired
Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped
Waiting for 1 resources to stop:
* rabbitmq-clone
* rabbitmq-clone
Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role name=target-role
", "deploy_status_code": 1 }
[1] http://file.rdu.redhat.com/~rscarazz/BZ1384068/
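For readers following the trace, the polling loop it shows can be reconstructed roughly as below. This is a sketch inferred from the `set -x` output only, not the shipped pacemaker_resource_restart.sh: the variable names come from the trace, while the non-"stopped" branch and the timeout message are assumptions, since the trace never reaches them.

```shell
#!/bin/sh
# Sketch of the wait loop as inferred from the trace; details may differ
# from the real script.
check_interval=${check_interval:-3}

check_resource() {
    local service="$1" state="$2" timeout="$3"
    local tstart tend match_for_incomplete node_states

    tstart=$(date +%s)
    tend=$(( tstart + timeout ))

    if [ "$state" = "stopped" ]; then
        # Any instance still reporting Started means the stop is incomplete.
        match_for_incomplete='Started'
    else
        # Assumption: this branch is not visible in the trace.
        match_for_incomplete='Stopped'
    fi

    # The trace shows a bash (( a < b )) comparison; [ -lt ] is equivalent here.
    while [ "$(date +%s)" -lt "$tend" ]; do
        node_states=$(pcs status --full | grep "$service" | grep -v Clone)
        if echo "$node_states" | grep -q "$match_for_incomplete"; then
            echo "$service not yet $state, sleeping $check_interval seconds."
            sleep "$check_interval"
        else
            echo "$service has $state"
            return 0
        fi
    done

    # Assumption: the wording of the failure message is made up for this sketch.
    echo "$service never $state after $timeout seconds" >&2
    return 1
}
```

The calls in the trace correspond to `check_resource httpd stopped 300` and `check_resource openstack-keystone stopped 1200`. Both complete well inside their limits, which is consistent with the failure coming from the later `pcs resource restart rabbitmq-clone` step rather than from this loop.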
Yes, we discussed this with Michele too; indeed it seems that the problem is that the rabbitmq restart is taking too long:
+ pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired
Michele suggested we may solve this by bumping the timeout a bit. What you wrote in comment #3, that there were no failed actions in pacemaker status afterwards, also sounds like things were just taking too long when restarting. Essentially, I think this bug is very likely a duplicate of #1364241.
*** This bug has been marked as a duplicate of bug 1364241 ***
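One way "bumping the timeout" could look is sketched below. It assumes that the installed pcs supports the `--wait=n` option on `pcs resource restart` (documented in the pcs man page, but worth confirming for the version in use). The wrapper name `restart_with_wait` and the 600 second default are illustrative assumptions; this is not the fix tracked in bug 1364241.

```shell
#!/bin/sh
# Hypothetical wrapper: restart a resource with an explicit wait limit and,
# on failure, dump the cluster state while it is still fresh.
restart_with_wait() {
    local resource="$1" wait_seconds="${2:-600}"   # 600 is an assumed default
    if pcs resource restart "$resource" --wait="$wait_seconds"; then
        return 0
    fi
    echo "ERROR: $resource did not restart within ${wait_seconds}s" >&2
    pcs status >&2
    return 1
}
```

Calling `restart_with_wait rabbitmq-clone 600` would replace the bare `pcs resource restart rabbitmq-clone` seen in the trace. Printing `pcs status` on failure also captures the cluster state that was requested earlier in this thread.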
Description of problem:
It is impossible to get a successful overcloud deployment, because a restart operation is run on the controller node without being conditioned on the stack operation being an update. The failure happens on the overcloud-controller-0 machine. The file in question is /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/pacemaker_resource_restart.sh, which has this condition:
...
if [ "$pacemaker_status" = "active" -a \
"$(hiera bootstrap_nodeid)" = "$(facter hostname)" ]; then
...
It is missing a check that the action is an UPDATE, because these restarts are only relevant during update operations.
Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-128.el7ost.noarch
How reproducible:
Always
Actual results:
01:59:03.995 ERROR: openstack Heat Stack create failed.
01:59:03.996 Deploying templates in the directory /usr/share/openstack-tripleo-heat-templates
01:59:03.996 Stack failed with status: Resource CREATE failed: Error: resources.ControllerNodesPostDeployment.resources.ControllerPostPuppet.resources.ControllerPostPuppetRestartDeployment.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
Expected results:
Deploy complete.
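The difference between the downstream condition in this report and the upstream one quoted in the first comment is the extra "update_identifier" test. A minimal sketch of that guard as a standalone predicate follows; the function name `should_restart` and the idea of passing the values as arguments are assumptions made so the logic can be exercised without hiera, facter or pacemaker.

```shell
#!/bin/sh
# Hypothetical predicate mirroring the upstream condition: restart only when
# pacemaker is active, this node is the bootstrap node, and an update
# identifier is set (hiera prints "nil" for an unset key).
should_restart() {
    local pacemaker_status="$1" bootstrap_nodeid="$2" hostname="$3" update_identifier="$4"
    [ "$pacemaker_status" = "active" ] &&
        [ "$bootstrap_nodeid" = "$hostname" ] &&
        [ "$update_identifier" != "nil" ]
}

# In the restart script the inputs would come from the same commands the
# trace shows, for example:
#   should_restart "$(systemctl is-active pacemaker)" \
#                  "$(hiera bootstrap_nodeid)" \
#                  "$(facter hostname)" \
#                  "$(hiera update_identifier)"
```

With the fourth argument equal to `nil`, as on a fresh stack-create, the predicate is false and the restart block would be skipped. As noted in the first comment, applying this change on its own is tied to the two Launchpad bugs linked there.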