Bug 1381356

Summary: OSP8: 'overcloud update stack' used to work fine, now fails due to timeout restarting PCS resources.
Product: Red Hat OpenStack
Reporter: Vincent S. Cojot <vcojot>
Component: openstack-tripleo-heat-templates
Assignee: Michele Baldessari <michele>
Status: CLOSED CURRENTRELEASE
QA Contact: Udi Shkalim <ushkalim>
Severity: low
Docs Contact:
Priority: low
Version: 8.0 (Liberty)
CC: chjones, fdinitto, jjoyce, jschluet, mburns, michele, rhel-osp-director-maint, slinaber, tvignaud, vcojot
Target Milestone: ---
Keywords: Triaged
Target Release: 8.0 (Liberty)
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1395141
Environment:
Last Closed: 2018-07-20 08:19:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1395141

Description Vincent S. Cojot 2016-10-03 19:54:16 UTC
Description of problem:

Since I do some OSP torture for our customers, I tend to deploy/delete/re-deploy OSP in my lab on a regular basis.

I've been doing the following with OSP8 ever since it was released:
1) deploy OSP8 with nodes registered on CDN.
2) 'overcloud update stack' to get the latest packages.
3) do some stuff with it..

'overcloud update stack' has been failing consistently for the last few days, and I suspect it is caused by the update of some pcs/corosync/resource-agents RPM on the overcloud controllers.
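
A generic way to narrow down which package update triggers this (illustrative only, not something run as part of this report) is to snapshot the full package list on each controller before the stack update and diff it afterwards:

# On each controller, before 'openstack overcloud update stack':
rpm -qa | sort > /tmp/rpms-before.txt
# ...run the stack update...
rpm -qa | sort > /tmp/rpms-after.txt
diff -u /tmp/rpms-before.txt /tmp/rpms-after.txt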

Version-Release number of selected component (if applicable):
1) before update

[heat-admin@krynn-ctrl-0 yum]$  rpm -q kernel corosync pacemaker resource-agents openstack-puppet-modules     
kernel-3.10.0-327.18.2.el7.x86_64
corosync-2.3.4-7.el7_2.1.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
resource-agents-3.9.5-54.el7_2.10.x86_64
openstack-puppet-modules-7.0.19-1.el7ost.noarch


2) after update:
[heat-admin@krynn-ctrl-0 yum]$  rpm -q kernel corosync pacemaker resource-agents openstack-puppet-modules     
kernel-3.10.0-327.36.1.el7.x86_64
corosync-2.3.4-7.el7_2.1.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
resource-agents-3.9.5-54.el7_2.10.x86_64
openstack-puppet-modules-7.1.3-1.el7ost.noarch


How reproducible:

every time

Steps to Reproduce:
1. openstack overcloud deploy
2. run 'yum update -y --downloadonly' on all nodes to pre-download all packages
3. openstack overcloud update stack
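
(Optionally, between steps 2 and 3, the pending update set can be listed on each node from the packages already downloaded in step 2 — an illustrative check, not part of the original reproducer:)

sudo yum -C check-update     # -C works entirely from the cache populated in step 2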

Actual results:

UPDATE_FAILED due to a timeout while restarting a PCS resource.

Expected results:

The stack should reach UPDATE_COMPLETE without issue.

Additional info:
1) Deployed with:
stack@ospdirector$ cat osp8/deploy15.sh
#!/bin/bash
TOP_DIR="${HOME}/osp8"
set -x
time openstack overcloud deploy \
--templates ${TOP_DIR}/templates \
--control-scale 3 \
--compute-scale 2 \
--ceph-storage-scale 3 \
--swift-storage-scale 0 \
--control-flavor control \
--compute-flavor compute \
--ceph-storage-flavor ceph-storage \
--swift-storage-flavor swift-storage \
--ntp-server '10.0.128.246", "10.0.128.244' \
--validation-errors-fatal \
-e ${TOP_DIR}/templates/overcloud-resource-registry-puppet.yaml \
-e ${TOP_DIR}/templates/environments/network-isolation.yaml \
-e ${TOP_DIR}/templates/environments/storage-environment.yaml \
-e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \
-e ${TOP_DIR}/templates/rhel-registration/environment-rhel-registration.yaml \
-e ${TOP_DIR}/templates/rhel-registration/rhel-registration-resource-registry.yaml \
-e ${TOP_DIR}/custom_ovsbond.yaml

2) Pre-downloaded all packages into the yum cache (/var/cache/yum):
stack@ospdirector$ ansible -f 1 -i hosts -m command -a 'sudo yum update -y --downloadonly' \*

3) Updated stack:
stack@ospdirector$ cat osp8/deploy15_update.sh
#!/bin/bash
TOP_DIR="${HOME}/osp8"
set -x
yes "" | openstack overcloud update stack \
-i overcloud \
--templates ${TOP_DIR}/templates \
-e ${TOP_DIR}/templates/overcloud-resource-registry-puppet.yaml \
-e ${TOP_DIR}/templates/environments/network-isolation.yaml \
-e ${TOP_DIR}/templates/environments/storage-environment.yaml \
-e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \
-e ${TOP_DIR}/templates/rhel-registration/environment-rhel-registration.yaml \
-e ${TOP_DIR}/templates/rhel-registration/rhel-registration-resource-registry.yaml \
-e ${TOP_DIR}/custom_ovsbond.yaml

Comment 1 Vincent S. Cojot 2016-10-04 19:19:47 UTC
Here is more information:

After running steps 1) and 3) from above, I always get something like this:
WAITING
completed: [u'krynn-ceph-0', u'krynn-ctrl-2', u'krynn-ceph-1', u'krynn-cmpt-1', u'krynn-ctrl-0', u'krynn-ceph-2', u'krynn-ctrl-1']
on_breakpoint: [u'krynn-cmpt-0']
removing breakpoint on krynn-cmpt-0
Breakpoint reached, continue? Regexp or Enter=proceed, no=cancel update, C-c=quit interactive mode: IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
[...REPEATED.....]
IN_PROGRESS
IN_PROGRESS

IN_PROGRESS
FAILED
update finished with status FAILED

Initial investigation always shows a trace similar to this:

[stack@instack ~]$ heat resource-list -n 3 overcloud|grep -v _COMPLETE
+-----------------------------------------------+-----------------------------------------------+----------------------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| resource_name                                 | physical_resource_id                          | resource_type                                                  | resource_status | updated_time        | stack_name                                                                                                                                        |
+-----------------------------------------------+-----------------------------------------------+----------------------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ControllerNodesPostDeployment                 | b9daf7b6-bf8c-4527-8d16-8e1c8ed4ab86          | OS::TripleO::ControllerPostDeployment                          | UPDATE_FAILED   | 2016-10-04T18:37:06 | overcloud                                                                                                                                         |
| ControllerPostPuppet                          | 40b16ec5-014e-4e32-bfe9-17ce2645b9b1          | OS::TripleO::Tasks::ControllerPostPuppet                       | UPDATE_FAILED   | 2016-10-04T18:58:25 | overcloud-ControllerNodesPostDeployment-43trttftu6p4                                                                                              |
| ControllerPostPuppetRestartDeployment         | 69c70e74-b929-4737-b264-134562ae4422          | OS::Heat::SoftwareDeployments                                  | UPDATE_FAILED   | 2016-10-04T19:00:00 | overcloud-ControllerNodesPostDeployment-43trttftu6p4-ControllerPostPuppet-dmsfb7tyizaj                                                            |
| 0                                             | 74d53461-ac4e-4adf-ae90-0004c18b203f          | OS::Heat::SoftwareDeployment                                   | UPDATE_FAILED   | 2016-10-04T19:00:03 | overcloud-ControllerNodesPostDeployment-43trttftu6p4-ControllerPostPuppet-dmsfb7tyizaj-ControllerPostPuppetRestartDeployment-zlfu6dwvqphe         |
+-----------------------------------------------+-----------------------------------------------+----------------------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+

Comment 2 Vincent S. Cojot 2016-10-04 19:21:58 UTC
Looking further into the failed resource (ControllerPostPuppetRestartDeployment), I always notice that it failed to restart rabbitmq:

[stack@instack ~]$ heat deployment-output-show 74d53461-ac4e-4adf-ae90-0004c18b203f deploy_stderr
[......]
+ node_states='     httpd       (systemd:httpd):        (target-role:Stopped) Stopped
     httpd      (systemd:httpd):        (target-role:Stopped) Stopped
     httpd      (systemd:httpd):        (target-role:Stopped) Stopped'
+ echo '     httpd      (systemd:httpd):        (target-role:Stopped) Stopped
     httpd      (systemd:httpd):        (target-role:Stopped) Stopped
     httpd      (systemd:httpd):        (target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'httpd has stopped'
+ pcs resource disable openstack-keystone
+ check_resource openstack-keystone stopped 1800
+ '[' 3 -ne 3 ']'
+ service=openstack-keystone
+ state=stopped
+ timeout=1800
+ '[' stopped = stopped ']'
+ match_for_incomplete=Started
+ timeout -k 10 1800 crm_resource --wait
++ grep openstack-keystone
++ pcs status --full
++ grep -v Clone
+ node_states='     openstack-keystone  (systemd:openstack-keystone):   (target-role:Stopped) Stopped
     openstack-keystone (systemd:openstack-keystone):   (target-role:Stopped) Stopped
     openstack-keystone (systemd:openstack-keystone):   (target-role:Stopped) Stopped'
+ echo '     openstack-keystone (systemd:openstack-keystone):   (target-role:Stopped) Stopped
     openstack-keystone (systemd:openstack-keystone):   (target-role:Stopped) Stopped
     openstack-keystone (systemd:openstack-keystone):   (target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'openstack-keystone has stopped'
+ pcs status
+ grep haproxy-clone
+ pcs resource restart haproxy-clone
+ pcs resource restart redis-master
+ pcs resource restart mongod-clone
+ pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired

Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped
Waiting for 1 resources to stop:
 * rabbitmq-clone
 * rabbitmq-clone
Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role name=target-role
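
For reference, the set -x output above corresponds to a check_resource helper roughly like the sketch below. This is a reconstruction from the trace (variable names and messages inferred), not the exact script shipped in openstack-tripleo-heat-templates:

check_resource() {
    # Expect exactly three arguments: resource name, desired state, timeout in seconds.
    if [ "$#" -ne 3 ]; then
        echo "Usage: check_resource <resource> <started|stopped> <timeout>"
        exit 1
    fi

    service=$1
    state=$2
    timeout=$3

    # While waiting for a resource to stop, any remaining 'Started' line means the
    # operation is incomplete (and vice versa when waiting for a start).
    if [ "$state" = "stopped" ]; then
        match_for_incomplete='Started'
    else
        match_for_incomplete='Stopped'
    fi

    # Give the cluster up to $timeout seconds to settle.
    timeout -k 10 "$timeout" crm_resource --wait

    # Collect the per-node state lines for this resource.
    node_states=$(pcs status --full | grep "$service" | grep -v Clone)

    if echo "$node_states" | grep -q "$match_for_incomplete"; then
        echo "ERROR: $service did not reach state '$state' within ${timeout}s" >&2
        exit 1
    else
        echo "$service has $state"
    fi
}

# Usage, as seen in the trace:
#   pcs resource disable openstack-keystone
#   check_resource openstack-keystone stopped 1800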

Comment 3 Vincent S. Cojot 2016-10-04 19:27:54 UTC
Logging into one of the controllers, I see this:

[heat-admin@krynn-ctrl-0 ~]$ sudo pcs status|grep -A1 rabbitmq
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ krynn-ctrl-0 krynn-ctrl-1 krynn-ctrl-2 ]

So it seems that rabbitmq did manage to get restarted.
Further restarts work fine and complete within a reasonable time:

[heat-admin@krynn-ctrl-0 ~]$ time sudo pcs resource restart rabbitmq-clone
rabbitmq-clone successfully restarted

real    0m24.117s
user    0m0.944s
sys     0m0.279s

Comment 4 Fabio Massimo Di Nitto 2016-10-20 08:17:58 UTC
Can you please provide sosreports?

Andrew, can you look at it when you have time?

By the look of it, the environment is simply not powerful enough and it was running at the edge before. Something else might have changed during the update that's causing some services to take more resources or more time to shut down, causing the cascade effect of rabbitmq timing out on stop, even though the resource is up at a later stage.
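
(For reference, the stop timeout configured on the rabbitmq resource, and how long its stop operations actually take, could be compared with something like the following — illustrative commands run on a controller, not taken from an sosreport:)

sudo pcs resource show rabbitmq                            # shows the configured op start/stop timeouts
sudo grep -i 'rabbitmq.*stop' /var/log/messages | tail     # recent stop operations as logged by pacemaker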

Comment 5 Michele Baldessari 2016-11-02 11:00:57 UTC
This is likely a duplicate of: https://bugzilla.redhat.com/show_bug.cgi?id=1364241 
We fixed this in Newton. I will work on the backports.

Comment 6 Vincent S. Cojot 2016-11-02 14:02:40 UTC
Hi Fabio,
Sorry about the delay. I've been swamped with a recent engagement.
I am not sure I can provide sosreports at this time since the deployment was scrapped and I went with the package update on the overcloud image (I update the packages inside undercloud-full.qcow2) before deployment.
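
(For reference, the image-side package update mentioned above can be done with something like the following — illustrative; it assumes the image has working repositories configured inside it:)

virt-customize -a undercloud-full.qcow2 --update     # update all packages inside the image before deploying
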
I'll see if I can revisit this issue in the coming weeks.
Regards,
Vincent