| Summary: | OSP8: 'overcloud update stack' used to work fine, now fails due to timeout restarting PCS resources. | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Vincent S. Cojot <vcojot> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Michele Baldessari <michele> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | low | Priority: | low |
| Version: | 8.0 (Liberty) | Target Release: | 8.0 (Liberty) |
| Hardware: | x86_64 | OS: | Linux |
| Target Milestone: | --- | Keywords: | Triaged |
| CC: | chjones, fdinitto, jjoyce, jschluet, mburns, michele, rhel-osp-director-maint, slinaber, tvignaud, vcojot | | |
| Last Closed: | 2018-07-20 08:19:26 UTC | Type: | Bug |
| : | 1395141 (view as bug list) | Bug Blocks: | 1395141 |
Here is more information. After running steps 1) and 3) from the description, I always get something like this:

WAITING
completed: [u'krynn-ceph-0', u'krynn-ctrl-2', u'krynn-ceph-1', u'krynn-cmpt-1', u'krynn-ctrl-0', u'krynn-ceph-2', u'krynn-ctrl-1']
on_breakpoint: [u'krynn-cmpt-0']
removing breakpoint on krynn-cmpt-0
Breakpoint reached, continue? Regexp or Enter=proceed, no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
[...REPEATED.....]
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
FAILED
update finished with status FAILED

Initial investigation always shows a trace similar to this:

[stack@instack ~]$ heat resource-list -n 3 overcloud | grep -v _COMPLETE
| resource_name | physical_resource_id | resource_type | resource_status | updated_time | stack_name |
| ControllerNodesPostDeployment | b9daf7b6-bf8c-4527-8d16-8e1c8ed4ab86 | OS::TripleO::ControllerPostDeployment | UPDATE_FAILED | 2016-10-04T18:37:06 | overcloud |
| ControllerPostPuppet | 40b16ec5-014e-4e32-bfe9-17ce2645b9b1 | OS::TripleO::Tasks::ControllerPostPuppet | UPDATE_FAILED | 2016-10-04T18:58:25 | overcloud-ControllerNodesPostDeployment-43trttftu6p4 |
| ControllerPostPuppetRestartDeployment | 69c70e74-b929-4737-b264-134562ae4422 | OS::Heat::SoftwareDeployments | UPDATE_FAILED | 2016-10-04T19:00:00 | overcloud-ControllerNodesPostDeployment-43trttftu6p4-ControllerPostPuppet-dmsfb7tyizaj |
| 0 | 74d53461-ac4e-4adf-ae90-0004c18b203f | OS::Heat::SoftwareDeployment | UPDATE_FAILED | 2016-10-04T19:00:03 | overcloud-ControllerNodesPostDeployment-43trttftu6p4-ControllerPostPuppet-dmsfb7tyizaj-ControllerPostPuppetRestartDeployment-zlfu6dwvqphe |

Looking further into the failed resource (ControllerPostPuppetRestartDeployment), I always find that it failed to restart rabbitmq:
[stack@instack ~]$ heat deployment-output-show 74d53461-ac4e-4adf-ae90-0004c18b203f deploy_stderr
[......]
+ node_states=' httpd (systemd:httpd): (target-role:Stopped) Stopped
httpd (systemd:httpd): (target-role:Stopped) Stopped
httpd (systemd:httpd): (target-role:Stopped) Stopped'
+ echo ' httpd (systemd:httpd): (target-role:Stopped) Stopped
httpd (systemd:httpd): (target-role:Stopped) Stopped
httpd (systemd:httpd): (target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'httpd has stopped'
+ pcs resource disable openstack-keystone
+ check_resource openstack-keystone stopped 1800
+ '[' 3 -ne 3 ']'
+ service=openstack-keystone
+ state=stopped
+ timeout=1800
+ '[' stopped = stopped ']'
+ match_for_incomplete=Started
+ timeout -k 10 1800 crm_resource --wait
++ grep openstack-keystone
++ pcs status --full
++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped'
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'openstack-keystone has stopped'
+ pcs status
+ grep haproxy-clone
+ pcs resource restart haproxy-clone
+ pcs resource restart redis-master
+ pcs resource restart mongod-clone
+ pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired
Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped
Waiting for 1 resources to stop:
* rabbitmq-clone
* rabbitmq-clone
Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role name=target-role
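For context, the set -x trace above corresponds to a helper along these lines in the overcloud restart script. This is a sketch reconstructed purely from the trace (variable names and echo wording follow the trace; the exact error-handling paths of the real tripleo-heat-templates function are an assumption):

function check_resource {
  # Wait up to $timeout seconds for a pacemaker resource to reach
  # the requested state ("started" or "stopped").
  if [ "$#" -ne 3 ]; then
    echo "ERROR: check_resource expects 3 parameters, $# given" >&2
    exit 1
  fi

  service=$1
  state=$2
  timeout=$3

  # If we asked for "stopped", any node still reporting "Started"
  # means the transition is incomplete (and vice versa).
  if [ "$state" = "stopped" ]; then
    match_for_incomplete='Started'
  else
    match_for_incomplete='Stopped'
  fi

  # Block until the cluster has no pending transitions, or time out.
  timeout -k 10 "$timeout" crm_resource --wait

  # Check the per-node state of the resource, ignoring the Clone summary line.
  node_states=$(pcs status --full | grep "$service" | grep -v Clone)
  if echo "$node_states" | grep -q "$match_for_incomplete"; then
    echo "ERROR: $service not in state $state after $timeout seconds" >&2
    exit 1
  else
    echo "$service has $state"
  fi
}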
Logging into one of the controllers, I see this:
[heat-admin@krynn-ctrl-0 ~]$ sudo pcs status|grep -A1 rabbitmq
Clone Set: rabbitmq-clone [rabbitmq]
Started: [ krynn-ctrl-0 krynn-ctrl-1 krynn-ctrl-2 ]
So it seems that rabbitmq did eventually get restarted, just not within the script's timeout.
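Not in the original report, but a quick way to confirm the rabbit cluster actually reassembled (rather than pacemaker merely reporting it Started) is to ask rabbit itself:

[heat-admin@krynn-ctrl-0 ~]$ sudo rabbitmqctl cluster_status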
Further restarts work fine and complete within a reasonable time:
[heat-admin@krynn-ctrl-0 ~]$ time sudo pcs resource restart rabbitmq-clone
rabbitmq-clone successfully restarted
real 0m24.117s
user 0m0.944s
sys 0m0.279s
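Not part of the original report, but if the slow stop turns out to be load-related, one possible mitigation is to give the rabbitmq resource a longer stop timeout in pacemaker before running the update. A sketch, assuming the RHEL 7 pcs syntax and an illustrative 200-second value:

# Inspect the operation timeouts currently configured on the rabbitmq primitive:
sudo pcs resource show rabbitmq

# Raise the stop timeout (rabbitmq-clone inherits it from the primitive);
# 200s is an illustrative value, not a recommendation from this report:
sudo pcs resource update rabbitmq op stop timeout=200s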
Can you please provide sosreports?

Andrew, can you look at this when you have time? By the look of it, the environment is simply not powerful enough and was already running at the edge before. Something else might have changed during the update that causes some services to take more resources or more time to shut down, with the cascade effect of rabbit timing out on stop, even though the resource comes up at a later stage.

This is likely a duplicate of: https://bugzilla.redhat.com/show_bug.cgi?id=1364241
We fixed this in Newton. I will work on the backports.

Hi Fabio,
Sorry about the delay, I've been swamped with a recent engagement. I am not sure I can provide sosreports at this time, since the deployment was scrapped and I went with the package update on the overcloud image (I update the packages inside undercloud-full.qcow2) before deployment. I'll see if I can revisit this issue in the coming weeks.
Regards,
Vincent
Description of problem:
Since I do OSP torture-testing for our customers, I tend to deploy/delete/re-deploy OSP in my lab on a regular basis. I've been doing the following with OSP8 ever since it was released:
1) deploy OSP8 with nodes registered on CDN.
2) 'overcloud update stack' to get the latest packages.
3) do some stuff with it.
'overcloud update stack' has been failing consistently for the last few days and I suspect it is caused by the update of some pcs/corosync/resource-agents RPM on the overcloud controllers.

Version-Release number of selected component (if applicable):
1) before update:
[heat-admin@krynn-ctrl-0 yum]$ rpm -q kernel corosync pacemaker resource-agents openstack-puppet-modules
kernel-3.10.0-327.18.2.el7.x86_64
corosync-2.3.4-7.el7_2.1.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
resource-agents-3.9.5-54.el7_2.10.x86_64
openstack-puppet-modules-7.0.19-1.el7ost.noarch

2) after update:
[heat-admin@krynn-ctrl-0 yum]$ rpm -q kernel corosync pacemaker resource-agents openstack-puppet-modules
kernel-3.10.0-327.36.1.el7.x86_64
corosync-2.3.4-7.el7_2.1.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
resource-agents-3.9.5-54.el7_2.10.x86_64
openstack-puppet-modules-7.1.3-1.el7ost.noarch

How reproducible:
Every time.

Steps to Reproduce:
1. openstack overcloud deploy
2. run 'yum update -y --downloadonly' on all nodes to pre-download all packages
3. openstack overcloud update stack

Actual results:
UPDATE_FAILED due to a timeout restarting a PCS resource.

Expected results:
Should reach UPDATE_COMPLETE without issue.

Additional info:
1) Deployed with:

stack@ospdirector$ cat osp8/deploy15.sh
#!/bin/bash
TOP_DIR="${HOME}/osp8"
set -x
time openstack overcloud deploy \
  --templates ${TOP_DIR}/templates \
  --control-scale 3 \
  --compute-scale 2 \
  --ceph-storage-scale 3 \
  --swift-storage-scale 0 \
  --control-flavor control \
  --compute-flavor compute \
  --ceph-storage-flavor ceph-storage \
  --swift-storage-flavor swift-storage \
  --ntp-server '10.0.128.246", "10.0.128.244' \
  --validation-errors-fatal \
  -e ${TOP_DIR}/templates/overcloud-resource-registry-puppet.yaml \
  -e ${TOP_DIR}/templates/environments/network-isolation.yaml \
  -e ${TOP_DIR}/templates/environments/storage-environment.yaml \
  -e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \
  -e ${TOP_DIR}/templates/rhel-registration/environment-rhel-registration.yaml \
  -e ${TOP_DIR}/templates/rhel-registration/rhel-registration-resource-registry.yaml \
  -e ${TOP_DIR}/custom_ovsbond.yaml

2) Pre-downloaded all packages to /var/yum/cache:

stack@ospdirector$ ansible -f 1 -i hosts -m command -a 'sudo yum update -y --downloadonly' \*

3) Updated the stack:

stack@ospdirector$ cat osp8/deploy15_update.sh
#!/bin/bash
TOP_DIR="${HOME}/osp8"
set -x
yes "" | openstack overcloud update stack \
  -i overcloud \
  --templates ${TOP_DIR}/templates \
  -e ${TOP_DIR}/templates/overcloud-resource-registry-puppet.yaml \
  -e ${TOP_DIR}/templates/environments/network-isolation.yaml \
  -e ${TOP_DIR}/templates/environments/storage-environment.yaml \
  -e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \
  -e ${TOP_DIR}/templates/rhel-registration/environment-rhel-registration.yaml \
  -e ${TOP_DIR}/templates/rhel-registration/rhel-registration-resource-registry.yaml \
  -e ${TOP_DIR}/custom_ovsbond.yaml
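As an aside (not part of the original report), the same ansible inventory used in step 2 makes it easy to diff the suspect package versions across every node before and after the update; `hosts` here is the reporter's own inventory file:

stack@ospdirector$ ansible -i hosts -m command -a 'rpm -q kernel corosync pacemaker resource-agents openstack-puppet-modules' \*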