Bug 1384068
Summary: | Wrong service restart prevents overcloud deployment | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Raoul Scarazzini <rscarazz> |
Component: | openstack-tripleo-heat-templates | Assignee: | Jiri Stransky <jstransk> |
Status: | CLOSED DUPLICATE | QA Contact: | Arik Chernetsky <achernet> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 7.0 (Kilo) | CC: | jstransk, mburns, michele, rhel-osp-director-maint, rscarazz |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2016-11-15 09:00:38 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Raoul Scarazzini
2016-10-12 13:12:42 UTC
Looking at code, it seems like there is indeed restart triggered on create as well. In upstream, this code landed with the "update_identifier" condition (we probably backported downstream before merging upstream back then). if [ "$pacemaker_status" = "active" -a \ "$(hiera bootstrap_nodeid)" = "$(facter hostname)" -a \ "$(hiera update_identifier)" != "nil" ]; then https://github.com/openstack/tripleo-heat-templates/commit/ea1294fe9b11029edab719e8bf558733226b3fd4 However, i don't think we should just pull that in, as it introduces: https://bugs.launchpad.net/tripleo/+bug/1567384 which needs to be fixed together with: https://bugs.launchpad.net/tripleo/+bug/1567385 Restarting the services at the end of the deployment makes the deployment take slightly longer time, but AFAIK it has never caused severe issues during the last year on OSP 7 and it shouldn't cause such issues now either. The real problem is probably that the restart doesn't succeed. At the moment, with current info, i don't think backporting all the fixes mentioned above is worth it -- it's not much use removing the restart from stack-create, if the restart can still fail later on stack-update. Hence i think we should focus on why the restart fails rather than whether it's triggered on stack-create (triggering it on stack-create is a bug too, but with much lower severity). Raoul, can you please post the restart script's output (IIRC it can be collected via `heat deployment-output-show`), and what state the pacemaker cluster is in after the restart? (Probably pcs status and the cluster log?) Hi Jiri, unfortunately the last deploy went well and so I don't have any log for this. What I can say for sure is that the cluster status was fine, there were no failed actions. But the fact that the issue did not happened might mean that this is simply a timeout issue (stop or start), which needs just to be increased. The actual settings around rabbit are: Clone: rabbitmq-clone Meta Attrs: ordered=true interleave=true Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster) Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all"}" Meta Attrs: notify=true Operations: start interval=0s timeout=100 (rabbitmq-start-interval-0s) stop interval=0s timeout=90 (rabbitmq-stop-interval-0s) monitor interval=10 timeout=40 (rabbitmq-monitor-interval-10) Which are lower then the overall defaults of the systemd resources (200s). So maybe increasing both those values should be enough. Hi again Jiri, reproduced, you can find here [1] all the sosreports. In addition, to make investigation easier and save some of your time, here's the output of the action that failed: { "deploy_stdout": "httpd not yet stopped, sleeping 3 seconds. httpd has stopped openstack-keystone not yet stopped, sleeping 3 seconds. openstack-keystone not yet stopped, sleeping 3 seconds. openstack-keystone not yet stopped, sleeping 3 seconds. openstack-keystone not yet stopped, sleeping 3 seconds. openstack-keystone not yet stopped, sleeping 3 seconds. openstack-keystone not yet stopped, sleeping 3 seconds. openstack-keystone not yet stopped, sleeping 3 seconds. openstack-keystone not yet stopped, sleeping 3 seconds. openstack-keystone not yet stopped, sleeping 3 seconds. openstack-keystone not yet stopped, sleeping 3 seconds. openstack-keystone not yet stopped, sleeping 3 seconds. openstack-keystone not yet stopped, sleeping 3 seconds. openstack-keystone has stopped Clone Set: haproxy-clone [haproxy] haproxy-clone successfully restarted redis-master successfully restarted mongod-clone successfully restarted ", "deploy_stderr": "++ systemctl is-active pacemaker + pacemaker_status=active + check_interval=3 ++ hiera bootstrap_nodeid ++ facter hostname + '[' active = active -a overcloud-controller-0 = overcloud-controller-0 ']' + pcs constraint order show + grep 'start neutron-server-clone then start neutron-ovs-cleanup-clone' + pcs resource disable httpd + check_resource httpd stopped 300 + service=httpd + state=stopped + timeout=300 ++ date +%s + tstart=1476697518 + tend=1476697818 + '[' stopped = stopped ']' + match_for_incomplete=Started ++ date +%s + (( 1476697518 < 1476697818 )) ++ pcs status --full ++ grep httpd ++ grep -v Clone + node_states=' httpd (systemd:httpd): (target-role:Stopped) Started overcloud-controller-1 httpd (systemd:httpd): (target-role:Stopped) Started overcloud-controller-0 httpd (systemd:httpd): (target-role:Stopped) Started overcloud-controller-2' + echo ' httpd (systemd:httpd): (target-role:Stopped) Started overcloud-controller-1 httpd (systemd:httpd): (target-role:Stopped) Started overcloud-controller-0 httpd (systemd:httpd): (target-role:Stopped) Started overcloud-controller-2' + grep -q Started + echo 'httpd not yet stopped, sleeping 3 seconds.' + sleep 3 ++ date +%s + (( 1476697522 < 1476697818 )) ++ grep httpd ++ pcs status --full ++ grep -v Clone + node_states=' httpd (systemd:httpd): (target-role:Stopped) Stopped httpd (systemd:httpd): (target-role:Stopped) Stopped httpd (systemd:httpd): (target-role:Stopped) Stopped' + echo ' httpd (systemd:httpd): (target-role:Stopped) Stopped httpd (systemd:httpd): (target-role:Stopped) Stopped httpd (systemd:httpd): (target-role:Stopped) Stopped' + grep -q Started + echo 'httpd has stopped' + return + pcs resource disable openstack-keystone + check_resource openstack-keystone stopped 1200 + service=openstack-keystone + state=stopped + timeout=1200 ++ date +%s + tstart=1476697523 + tend=1476698723 + '[' stopped = stopped ']' + match_for_incomplete=Started ++ date +%s + (( 1476697523 < 1476698723 )) ++ pcs status --full ++ grep openstack-keystone ++ grep -v Clone + node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + grep -q Started + echo 'openstack-keystone not yet stopped, sleeping 3 seconds.' + sleep 3 ++ date +%s + (( 1476697527 < 1476698723 )) ++ grep openstack-keystone ++ pcs status --full ++ grep -v Clone + node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + grep -q Started + echo 'openstack-keystone not yet stopped, sleeping 3 seconds.' + sleep 3 ++ date +%s + (( 1476697531 < 1476698723 )) ++ pcs status --full ++ grep openstack-keystone ++ grep -v Clone + node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 + grep -q Started openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + echo 'openstack-keystone not yet stopped, sleeping 3 seconds.' + sleep 3 ++ date +%s + (( 1476697535 < 1476698723 )) ++ pcs status --full ++ grep openstack-keystone ++ grep -v Clone + node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + grep -q Started + echo 'openstack-keystone not yet stopped, sleeping 3 seconds.' + sleep 3 ++ date +%s + (( 1476697539 < 1476698723 )) ++ grep openstack-keystone ++ pcs status --full ++ grep -v Clone + node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + grep -q Started + echo 'openstack-keystone not yet stopped, sleeping 3 seconds.' + sleep 3 ++ date +%s + (( 1476697543 < 1476698723 )) ++ pcs status --full ++ grep openstack-keystone ++ grep -v Clone + node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + grep -q Started + echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + echo 'openstack-keystone not yet stopped, sleeping 3 seconds.' + sleep 3 ++ date +%s + (( 1476697547 < 1476698723 )) ++ pcs status --full ++ grep openstack-keystone ++ grep -v Clone + node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + grep -q Started + echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + echo 'openstack-keystone not yet stopped, sleeping 3 seconds.' + sleep 3 ++ date +%s + (( 1476697551 < 1476698723 )) ++ pcs status --full ++ grep openstack-keystone ++ grep -v Clone + node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + grep -q Started + echo 'openstack-keystone not yet stopped, sleeping 3 seconds.' + sleep 3 ++ date +%s + (( 1476697555 < 1476698723 )) ++ pcs status --full ++ grep openstack-keystone ++ grep -v Clone + node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + grep -q Started + echo 'openstack-keystone not yet stopped, sleeping 3 seconds.' + sleep 3 ++ date +%s + (( 1476697558 < 1476698723 )) ++ grep openstack-keystone ++ pcs status --full ++ grep -v Clone + node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + grep -q Started + echo 'openstack-keystone not yet stopped, sleeping 3 seconds.' + sleep 3 ++ date +%s + (( 1476697562 < 1476698723 )) ++ grep openstack-keystone ++ pcs status --full ++ grep -v Clone + node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + grep -q Started + echo 'openstack-keystone not yet stopped, sleeping 3 seconds.' + sleep 3 ++ date +%s + (( 1476697566 < 1476698723 )) ++ grep openstack-keystone ++ pcs status --full ++ grep -v Clone + node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-1 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-0 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Started overcloud-controller-2' + grep -q Started + echo 'openstack-keystone not yet stopped, sleeping 3 seconds.' + sleep 3 ++ date +%s + (( 1476697570 < 1476698723 )) ++ grep openstack-keystone ++ pcs status --full ++ grep -v Clone + node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped' + echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped' + grep -q Started + echo 'openstack-keystone has stopped' + return + pcs status + grep haproxy-clone + pcs resource restart haproxy-clone + pcs resource restart redis-master + pcs resource restart mongod-clone + pcs resource restart rabbitmq-clone Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining Error performing operation: Timer expired Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped Waiting for 1 resources to stop: * rabbitmq-clone * rabbitmq-clone Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role name=target-role ", "deploy_status_code": 1 } [1] http://file.rdu.redhat.com/~rscarazz/BZ1384068/ Yes we discussed this with Michele too, indeed it seems that the problem is that rabbitmq restart is taking too long: + pcs resource restart rabbitmq-clone Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining Error performing operation: Timer expired Michele suggested we may solve this by bumping the timeout a bit. What you wrote in comment #3 that there were no failed actions in pacemaker status afterwards, also sounds like things were just taking too long when restarting. Essentially, i think this bug is very likely a duplicate of #1364241. *** This bug has been marked as a duplicate of bug 1364241 *** |