Bug 1384068 - Wrong service restart prevents overcloud deployment
Summary: Wrong service restart prevents overcloud deployment
Keywords:
Status: CLOSED DUPLICATE of bug 1364241
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 7.0 (Kilo)
Hardware: x86_64
OS: Linux
high
high
Target Milestone: ---
: ---
Assignee: Jiri Stransky
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-10-12 13:12 UTC by Raoul Scarazzini
Modified: 2016-11-15 09:00 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-15 09:00:38 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1378391 0 medium CLOSED [RFE] Increase 'stop timeout' values of pacemaker resources on overcloud 2021-02-22 00:41:40 UTC

Internal Links: 1378391

Description Raoul Scarazzini 2016-10-12 13:12:42 UTC
Description of problem:

It is impossible to have a successful overcloud deployment, since a restart operation is done on the controller node, not conditioned by the update operation.

The failures happens on the overcloud-controller-0 machine. The interested file is /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/pacemaker_resource_restart.sh which has this condition:

...
if [ "$pacemaker_status" = "active" -a \
     "$(hiera bootstrap_nodeid)" = "$(facter hostname)" ]; then
...

that misses a check for the action being UPDATES, because these restarts are only relevant during update operations.

Version-Release number of selected component (if applicable):

openstack-tripleo-heat-templates-0.8.6-128.el7ost.noarch

How reproducible:

Always

Actual results:

01:59:03.995 ERROR: openstack Heat Stack create failed.
01:59:03.996 Deploying templates in the directory /usr/share/openstack-tripleo-heat-templates
01:59:03.996 Stack failed with status: Resource CREATE failed: Error: resources.ControllerNodesPostDeployment.resources.ControllerPostPuppet.resources.ControllerPostPuppetRestartDeployment.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1

Expected results:

Deploy complete.

Comment 2 Jiri Stransky 2016-10-13 09:24:34 UTC
Looking at code, it seems like there is indeed restart triggered on create as well. In upstream, this code landed with the "update_identifier" condition (we probably backported downstream before merging upstream back then).

if [ "$pacemaker_status" = "active" -a \
     "$(hiera bootstrap_nodeid)" = "$(facter hostname)" -a \
     "$(hiera update_identifier)" != "nil" ]; then

https://github.com/openstack/tripleo-heat-templates/commit/ea1294fe9b11029edab719e8bf558733226b3fd4

However, i don't think we should just pull that in, as it introduces:

https://bugs.launchpad.net/tripleo/+bug/1567384

which needs to be fixed together with:

https://bugs.launchpad.net/tripleo/+bug/1567385


Restarting the services at the end of the deployment makes the deployment take slightly longer time, but AFAIK it has never caused severe issues during the last year on OSP 7 and it shouldn't cause such issues now either. The real problem is probably that the restart doesn't succeed. At the moment, with current info, i don't think backporting all the fixes mentioned above is worth it -- it's not much use removing the restart from stack-create, if the restart can still fail later on stack-update. Hence i think we should focus on why the restart fails rather than whether it's triggered on stack-create (triggering it on stack-create is a bug too, but with much lower severity).

Raoul, can you please post the restart script's output (IIRC it can be collected via `heat deployment-output-show`), and what state the pacemaker cluster is in after the restart? (Probably pcs status and the cluster log?)

Comment 3 Raoul Scarazzini 2016-10-14 13:22:58 UTC
Hi Jiri,
unfortunately the last deploy went well and so I don't have any log for this. What I can say for sure is that the cluster status was fine, there were no failed actions.
But the fact that the issue did not happened might mean that this is simply a timeout issue (stop or start), which needs just to be increased.
The actual settings around rabbit are:

 Clone: rabbitmq-clone
  Meta Attrs: ordered=true interleave=true 
  Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
   Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all"}" 
   Meta Attrs: notify=true 
   Operations: start interval=0s timeout=100 (rabbitmq-start-interval-0s)
               stop interval=0s timeout=90 (rabbitmq-stop-interval-0s)
               monitor interval=10 timeout=40 (rabbitmq-monitor-interval-10)

Which are lower then the overall defaults of the systemd resources (200s). So maybe increasing both those values should be enough.

Comment 4 Raoul Scarazzini 2016-10-17 11:45:33 UTC
Hi again Jiri,
reproduced, you can find here [1] all the sosreports. In addition, to make investigation easier and save some of your time, here's the output of the action that failed:

{ "deploy_stdout": "httpd not yet stopped, sleeping 3 seconds.
httpd has stopped
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone not yet stopped, sleeping 3 seconds.
openstack-keystone has stopped
 Clone Set: haproxy-clone [haproxy]
haproxy-clone successfully restarted
redis-master successfully restarted
mongod-clone successfully restarted
", "deploy_stderr": "++ systemctl is-active pacemaker
+ pacemaker_status=active
+ check_interval=3
++ hiera bootstrap_nodeid
++ facter hostname
+ '[' active = active -a overcloud-controller-0 = overcloud-controller-0 ']'
+ pcs constraint order show
+ grep 'start neutron-server-clone then start neutron-ovs-cleanup-clone'
+ pcs resource disable httpd
+ check_resource httpd stopped 300
+ service=httpd
+ state=stopped
+ timeout=300
++ date +%s
+ tstart=1476697518
+ tend=1476697818
+ '[' stopped = stopped ']'
+ match_for_incomplete=Started
++ date +%s
+ (( 1476697518 < 1476697818 ))
++ pcs status --full
++ grep httpd
++ grep -v Clone
+ node_states=' httpd	(systemd:httpd):	(target-role:Stopped) Started overcloud-controller-1
 httpd	(systemd:httpd):	(target-role:Stopped) Started overcloud-controller-0
 httpd	(systemd:httpd):	(target-role:Stopped) Started overcloud-controller-2'
+ echo ' httpd	(systemd:httpd):	(target-role:Stopped) Started overcloud-controller-1
 httpd	(systemd:httpd):	(target-role:Stopped) Started overcloud-controller-0
 httpd	(systemd:httpd):	(target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'httpd not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697522 < 1476697818 ))
++ grep httpd
++ pcs status --full
++ grep -v Clone
+ node_states=' httpd	(systemd:httpd):	(target-role:Stopped) Stopped
 httpd	(systemd:httpd):	(target-role:Stopped) Stopped
 httpd	(systemd:httpd):	(target-role:Stopped) Stopped'
+ echo ' httpd	(systemd:httpd):	(target-role:Stopped) Stopped
 httpd	(systemd:httpd):	(target-role:Stopped) Stopped
 httpd	(systemd:httpd):	(target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'httpd has stopped'
+ return
+ pcs resource disable openstack-keystone
+ check_resource openstack-keystone stopped 1200
+ service=openstack-keystone
+ state=stopped
+ timeout=1200
++ date +%s
+ tstart=1476697523
+ tend=1476698723
+ '[' stopped = stopped ']'
+ match_for_incomplete=Started
++ date +%s
+ (( 1476697523 < 1476698723 ))
++ pcs status --full
++ grep openstack-keystone
++ grep -v Clone
+ node_states=' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697527 < 1476698723 ))
++ grep openstack-keystone
++ pcs status --full
++ grep -v Clone
+ node_states=' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697531 < 1476698723 ))
++ pcs status --full
++ grep openstack-keystone
++ grep -v Clone
+ node_states=' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
+ grep -q Started
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697535 < 1476698723 ))
++ pcs status --full
++ grep openstack-keystone
++ grep -v Clone
+ node_states=' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697539 < 1476698723 ))
++ grep openstack-keystone
++ pcs status --full
++ grep -v Clone
+ node_states=' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697543 < 1476698723 ))
++ pcs status --full
++ grep openstack-keystone
++ grep -v Clone
+ node_states=' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo ' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697547 < 1476698723 ))
++ pcs status --full
++ grep openstack-keystone
++ grep -v Clone
+ node_states=' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo ' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697551 < 1476698723 ))
++ pcs status --full
++ grep openstack-keystone
++ grep -v Clone
+ node_states=' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697555 < 1476698723 ))
++ pcs status --full
++ grep openstack-keystone
++ grep -v Clone
+ node_states=' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697558 < 1476698723 ))
++ grep openstack-keystone
++ pcs status --full
++ grep -v Clone
+ node_states=' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697562 < 1476698723 ))
++ grep openstack-keystone
++ pcs status --full
++ grep -v Clone
+ node_states=' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697566 < 1476698723 ))
++ grep openstack-keystone
++ pcs status --full
++ grep -v Clone
+ node_states=' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ echo ' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2'
+ grep -q Started
+ echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ (( 1476697570 < 1476698723 ))
++ grep openstack-keystone
++ pcs status --full
++ grep -v Clone
+ node_states=' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Stopped
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Stopped
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Stopped'
+ echo ' openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Stopped
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Stopped
 openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'openstack-keystone has stopped'
+ return
+ pcs status
+ grep haproxy-clone
+ pcs resource restart haproxy-clone
+ pcs resource restart redis-master
+ pcs resource restart mongod-clone
+ pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired

Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped
Waiting for 1 resources to stop:
 * rabbitmq-clone
 * rabbitmq-clone
Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role name=target-role

", "deploy_status_code": 1 }

[1] http://file.rdu.redhat.com/~rscarazz/BZ1384068/

Comment 5 Jiri Stransky 2016-10-18 13:47:38 UTC
Yes we discussed this with Michele too, indeed it seems that the problem is that rabbitmq restart is taking too long:

+ pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired

Michele suggested we may solve this by bumping the timeout a bit. What you wrote in comment #3 that there were no failed actions in pacemaker status afterwards, also sounds like things were just taking too long when restarting.

Essentially, i think this bug is very likely a duplicate of #1364241.

Comment 6 Michele Baldessari 2016-11-15 09:00:38 UTC

*** This bug has been marked as a duplicate of bug 1364241 ***


Note You need to log in before you can comment on or make changes to this bug.