Bug 1395141 - OSP9: 'overcloud update stack' used to work fine, now fails due to timeout restarting PCS resources.
Summary: OSP9: 'overcloud update stack' used to work fine, now fails due to timeout restarting PCS resources.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 9.0 (Mitaka)
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Target Release: 9.0 (Mitaka)
Assignee: Michele Baldessari
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On: 1381356
Blocks:
 
Reported: 2016-11-15 09:21 UTC by Michele Baldessari
Modified: 2018-07-20 08:19 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1381356
Environment:
Last Closed: 2018-07-20 08:19:38 UTC
Target Upstream Version:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 397604 0 None None None 2016-11-15 09:27:03 UTC
OpenStack gerrit 397608 0 None None None 2016-11-15 09:26:41 UTC

Description Michele Baldessari 2016-11-15 09:21:16 UTC
+++ This bug was initially created as a clone of Bug #1381356 +++

Description of problem:

Since I do some OSP torture for our customers, I tend to deploy/delete/re-deploy OSP in my lab on a regular basis.

I've been doing the following with OSP8 ever since it was released:
1) deploy OSP8 with nodes registered on CDN.
2) 'overcloud update stack' to get the latest packages.
3) do some stuff with it..

'overcloud update stack' has been failing consistently for the last few days and I suspect it is caused by the update of some pcs/corosync/resource_agents rpm on the overcloud controllers.

Version-Release number of selected component (if applicable):
1) before update

[heat-admin@krynn-ctrl-0 yum]$  rpm -q kernel corosync pacemaker resource-agents openstack-puppet-modules     
kernel-3.10.0-327.18.2.el7.x86_64
corosync-2.3.4-7.el7_2.1.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
resource-agents-3.9.5-54.el7_2.10.x86_64
openstack-puppet-modules-7.0.19-1.el7ost.noarch


2) after update:
[heat-admin@krynn-ctrl-0 yum]$  rpm -q kernel corosync pacemaker resource-agents openstack-puppet-modules     
kernel-3.10.0-327.36.1.el7.x86_64
corosync-2.3.4-7.el7_2.1.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
resource-agents-3.9.5-54.el7_2.10.x86_64
openstack-puppet-modules-7.1.3-1.el7ost.noarch


How reproducible:

Every time.

Steps to Reproduce:
1. openstack overcloud deploy
2. run 'yum update -y --downloadonly' on all nodes to pre-download all packages
3. openstack overcloud update stack

Actual results:

UPDATE_FAILED due to timeout restarting PCS resource.

Expected results:

Should UPDATE_COMPLETE without issue.

Additional info:
1) Deployed with:
stack@ospdirector$ cat osp8/deploy15.sh
#!/bin/bash
TOP_DIR="${HOME}/osp8"
set -x
time openstack overcloud deploy \
--templates ${TOP_DIR}/templates \
--control-scale 3 \
--compute-scale 2 \
--ceph-storage-scale 3 \
--swift-storage-scale 0 \
--control-flavor control \
--compute-flavor compute \
--ceph-storage-flavor ceph-storage \
--swift-storage-flavor swift-storage \
--ntp-server '10.0.128.246", "10.0.128.244' \
--validation-errors-fatal \
-e ${TOP_DIR}/templates/overcloud-resource-registry-puppet.yaml \
-e ${TOP_DIR}/templates/environments/network-isolation.yaml \
-e ${TOP_DIR}/templates/environments/storage-environment.yaml \
-e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \
-e ${TOP_DIR}/templates/rhel-registration/environment-rhel-registration.yaml \
-e ${TOP_DIR}/templates/rhel-registration/rhel-registration-resource-registry.yaml \
-e ${TOP_DIR}/custom_ovsbond.yaml

2) pre-downloaded all packages to /var/yum/cache
stack@ospdirector$ ansible -f 1 -i hosts -m command -a 'sudo yum update -y --downloadonly' \*
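
(A quick sanity check that the downloads actually landed before kicking off the stack update; this assumes yum's default cache path /var/cache/yum and uses the shell module because of the pipe:)
stack@ospdirector$ ansible -f 1 -i hosts -m shell -a 'sudo find /var/cache/yum -type f -name "*.rpm" | wc -l' \*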

3) Updated stack:
stack@ospdirector$ cat osp8/deploy15_update.sh
#!/bin/bash
TOP_DIR="${HOME}/osp8"
set -x
yes "" | openstack overcloud update stack \
-i overcloud \
--templates ${TOP_DIR}/templates \
-e ${TOP_DIR}/templates/overcloud-resource-registry-puppet.yaml \
-e ${TOP_DIR}/templates/environments/network-isolation.yaml \
-e ${TOP_DIR}/templates/environments/storage-environment.yaml \
-e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \
-e ${TOP_DIR}/templates/rhel-registration/environment-rhel-registration.yaml \
-e ${TOP_DIR}/templates/rhel-registration/rhel-registration-resource-registry.yaml \
-e ${TOP_DIR}/custom_ovsbond.yaml

--- Additional comment from Vincent S. Cojot on 2016-10-04 15:19:47 EDT ---

Here is more information:

After running steps 1) and 3) from above, I always get something like this:
WAITING
completed: [u'krynn-ceph-0', u'krynn-ctrl-2', u'krynn-ceph-1', u'krynn-cmpt-1', u'krynn-ctrl-0', u'krynn-ceph-2', u'krynn-ctrl-1']
on_breakpoint: [u'krynn-cmpt-0']
removing breakpoint on krynn-cmpt-0
Breakpoint reached, continue? Regexp or Enter=proceed, no=cancel update, C-c=quit interactive mode: IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
[...REPEATED.....]
IN_PROGRESS
IN_PROGRESS

IN_PROGRESS
FAILED
update finished with status FAILED

Initial investigation always shows a trace similar to this:

[stack@instack ~]$ heat resource-list -n 3 overcloud|grep -v _COMPLETE
+-----------------------------------------------+-----------------------------------------------+----------------------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| resource_name                                 | physical_resource_id                          | resource_type                                                  | resource_status | updated_time        | stack_name                                                                                                                                        |
+-----------------------------------------------+-----------------------------------------------+----------------------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ControllerNodesPostDeployment                 | b9daf7b6-bf8c-4527-8d16-8e1c8ed4ab86          | OS::TripleO::ControllerPostDeployment                          | UPDATE_FAILED   | 2016-10-04T18:37:06 | overcloud                                                                                                                                         |
| ControllerPostPuppet                          | 40b16ec5-014e-4e32-bfe9-17ce2645b9b1          | OS::TripleO::Tasks::ControllerPostPuppet                       | UPDATE_FAILED   | 2016-10-04T18:58:25 | overcloud-ControllerNodesPostDeployment-43trttftu6p4                                                                                              |
| ControllerPostPuppetRestartDeployment         | 69c70e74-b929-4737-b264-134562ae4422          | OS::Heat::SoftwareDeployments                                  | UPDATE_FAILED   | 2016-10-04T19:00:00 | overcloud-ControllerNodesPostDeployment-43trttftu6p4-ControllerPostPuppet-dmsfb7tyizaj                                                            |
| 0                                             | 74d53461-ac4e-4adf-ae90-0004c18b203f          | OS::Heat::SoftwareDeployment                                   | UPDATE_FAILED   | 2016-10-04T19:00:03 | overcloud-ControllerNodesPostDeployment-43trttftu6p4-ControllerPostPuppet-dmsfb7tyizaj-ControllerPostPuppetRestartDeployment-zlfu6dwvqphe         |
+-----------------------------------------------+-----------------------------------------------+----------------------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+

--- Additional comment from Vincent S. Cojot on 2016-10-04 15:21:58 EDT ---

Looking into the failed resource (ControllerPostPuppetRestartDeployment) further, I always notice that it failed to restart rabbitmq:



[stack@instack ~]$ heat deployment-output-show 74d53461-ac4e-4adf-ae90-0004c18b203f deploy_stderr
[......]
+ node_states='     httpd       (systemd:httpd):        (target-role:Stopped) Stopped
     httpd      (systemd:httpd):        (target-role:Stopped) Stopped
     httpd      (systemd:httpd):        (target-role:Stopped) Stopped'
+ echo '     httpd      (systemd:httpd):        (target-role:Stopped) Stopped
     httpd      (systemd:httpd):        (target-role:Stopped) Stopped
     httpd      (systemd:httpd):        (target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'httpd has stopped'
+ pcs resource disable openstack-keystone
+ check_resource openstack-keystone stopped 1800
+ '[' 3 -ne 3 ']'
+ service=openstack-keystone
+ state=stopped
+ timeout=1800
+ '[' stopped = stopped ']'
+ match_for_incomplete=Started
+ timeout -k 10 1800 crm_resource --wait
++ grep openstack-keystone
++ pcs status --full
++ grep -v Clone
+ node_states='     openstack-keystone  (systemd:openstack-keystone):   (target-role:Stopped) Stopped
     openstack-keystone (systemd:openstack-keystone):   (target-role:Stopped) Stopped
     openstack-keystone (systemd:openstack-keystone):   (target-role:Stopped) Stopped'
+ echo '     openstack-keystone (systemd:openstack-keystone):   (target-role:Stopped) Stopped
     openstack-keystone (systemd:openstack-keystone):   (target-role:Stopped) Stopped
     openstack-keystone (systemd:openstack-keystone):   (target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'openstack-keystone has stopped'
+ pcs status
+ grep haproxy-clone
+ pcs resource restart haproxy-clone
+ pcs resource restart redis-master
+ pcs resource restart mongod-clone
+ pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired

Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped
Waiting for 1 resources to stop:
 * rabbitmq-clone
 * rabbitmq-clone
Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role name=target-role
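
For context, the trace above comes from the pacemaker restart step run after puppet on the controllers; the following is a minimal sketch, reconstructed from the set -x output above, of the check_resource polling pattern it uses (the resource names and the 1800-second timeout are taken from the trace; this is not the exact script shipped in tripleo-heat-templates):

check_resource() {
    # The trace shows the caller verifying exactly three arguments
    if [ $# -ne 3 ]; then
        echo "usage: check_resource <resource> <started|stopped> <timeout-seconds>" >&2
        exit 1
    fi
    local service=$1 state=$2 timeout=$3
    local match_for_incomplete
    if [ "$state" = "stopped" ]; then
        match_for_incomplete='Started'
    else
        match_for_incomplete='Stopped'
    fi
    # Let pacemaker settle, then look at the per-node state of this resource
    timeout -k 10 "$timeout" crm_resource --wait
    node_states=$(pcs status --full | grep "$service" | grep -v Clone)
    if echo "$node_states" | grep -q "$match_for_incomplete"; then
        echo "ERROR: $service has not reached state '$state'" >&2
        exit 1
    fi
    echo "$service has $state"
}

# As in the trace: disable keystone, then verify it stopped within 30 minutes
pcs resource disable openstack-keystone
check_resource openstack-keystone stopped 1800

The actual failure, though, happens in 'pcs resource restart rabbitmq-clone', which relies on pcs's own shutdown wait rather than this helper, and it is that wait that expires.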

--- Additional comment from Vincent S. Cojot on 2016-10-04 15:27:54 EDT ---

Logging into one of the controllers, I see this:

[heat-admin@krynn-ctrl-0 ~]$ sudo pcs status|grep -A1 rabbitmq
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ krynn-ctrl-0 krynn-ctrl-1 krynn-ctrl-2 ]

So it seems that rabbitmq managed to get restarted.
Further restarts work fine and complete within a reasonable time:

[heat-admin@krynn-ctrl-0 ~]$ time sudo pcs resource restart rabbitmq-clone
rabbitmq-clone successfully restarted

real    0m24.117s
user    0m0.944s
sys     0m0.279s
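
Since a manual restart completes in about 24 seconds, the slow path appears to be the stop phase during the scripted restart. Two generic checks that could help narrow this down on the next occurrence (standard pcs/rabbitmqctl commands, not taken from this report):

[heat-admin@krynn-ctrl-0 ~]$ sudo pcs resource show rabbitmq        # configured start/stop operation timeouts and meta attributes
[heat-admin@krynn-ctrl-0 ~]$ sudo rabbitmqctl cluster_status        # confirm all three controllers have rejoined the broker cluster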

--- Additional comment from Fabio Massimo Di Nitto on 2016-10-20 04:17:58 EDT ---

Can you please provide sosreports?

Andrew, can you look at it when you have time?

By the look of it, the environment is simply not powerful enough and it was running at the edge before. Something else might have changed during the update that's causing some services to take more resources or more time to shut down, causing the cascading effect of rabbitmq timing out on stop, even though the resource is up at a later stage.

--- Additional comment from Michele Baldessari on 2016-11-02 07:00:57 EDT ---

This is likely a duplicate of: https://bugzilla.redhat.com/show_bug.cgi?id=1364241 
We fixed this in Newton; I will work on the backports.

--- Additional comment from Vincent S. Cojot on 2016-11-02 10:02:40 EDT ---

Hi Fabio,
Sorry about the delay. I've been swamped with a recent engagement.
I am not sure I can provide sosreports at this time, since the deployment was scrapped and I went with the package update on the overcloud image (I update the packages inside undercloud-full.qcow2) before deployment.
I'll see if I can revisit this issue in the coming weeks.
Regards,
Vincent

