Description of problem:

Since I do some OSP torture testing for our customers, I tend to deploy/delete/re-deploy OSP in my lab on a regular basis. I've been doing the following with OSP8 ever since it was released:

1) deploy OSP8 with nodes registered on CDN
2) 'overcloud update stack' to get the latest packages
3) do some stuff with it

'overcloud update stack' has been failing consistently for the last few days, and I suspect it is caused by the update of some pcs/corosync/resource-agents rpm on the overcloud controllers.

Version-Release number of selected component (if applicable):

1) before update:

[heat-admin@krynn-ctrl-0 yum]$ rpm -q kernel corosync pacemaker resource-agents openstack-puppet-modules
kernel-3.10.0-327.18.2.el7.x86_64
corosync-2.3.4-7.el7_2.1.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
resource-agents-3.9.5-54.el7_2.10.x86_64
openstack-puppet-modules-7.0.19-1.el7ost.noarch

2) after update:

[heat-admin@krynn-ctrl-0 yum]$ rpm -q kernel corosync pacemaker resource-agents openstack-puppet-modules
kernel-3.10.0-327.36.1.el7.x86_64
corosync-2.3.4-7.el7_2.1.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
resource-agents-3.9.5-54.el7_2.10.x86_64
openstack-puppet-modules-7.1.3-1.el7ost.noarch

How reproducible:
every time

Steps to Reproduce:
1. openstack overcloud deploy
2. run 'yum update -y --downloadonly' on all nodes to pre-download all packages
3. openstack overcloud update stack

Actual results:
UPDATE_FAILED due to a timeout while restarting a PCS resource.

Expected results:
Should reach UPDATE_COMPLETE without issue.

Additional info:

1) Deployed with:

stack@ospdirector$ cat osp8/deploy15.sh
#!/bin/bash

TOP_DIR="${HOME}/osp8"

set -x
time openstack overcloud deploy \
    --templates ${TOP_DIR}/templates \
    --control-scale 3 \
    --compute-scale 2 \
    --ceph-storage-scale 3 \
    --swift-storage-scale 0 \
    --control-flavor control \
    --compute-flavor compute \
    --ceph-storage-flavor ceph-storage \
    --swift-storage-flavor swift-storage \
    --ntp-server '10.0.128.246", "10.0.128.244' \
    --validation-errors-fatal \
    -e ${TOP_DIR}/templates/overcloud-resource-registry-puppet.yaml \
    -e ${TOP_DIR}/templates/environments/network-isolation.yaml \
    -e ${TOP_DIR}/templates/environments/storage-environment.yaml \
    -e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \
    -e ${TOP_DIR}/templates/rhel-registration/environment-rhel-registration.yaml \
    -e ${TOP_DIR}/templates/rhel-registration/rhel-registration-resource-registry.yaml \
    -e ${TOP_DIR}/custom_ovsbond.yaml

2) pre-downloaded all packages to /var/yum/cache:

stack@ospdirector$ ansible -f 1 -i hosts -m command -a 'sudo yum update -y --downloadonly' \*

3) Updated the stack:

stack@ospdirector$ cat osp8/deploy15_update.sh
#!/bin/bash

TOP_DIR="${HOME}/osp8"

set -x
yes "" | openstack overcloud update stack \
    -i overcloud \
    --templates ${TOP_DIR}/templates \
    -e ${TOP_DIR}/templates/overcloud-resource-registry-puppet.yaml \
    -e ${TOP_DIR}/templates/environments/network-isolation.yaml \
    -e ${TOP_DIR}/templates/environments/storage-environment.yaml \
    -e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \
    -e ${TOP_DIR}/templates/rhel-registration/environment-rhel-registration.yaml \
    -e ${TOP_DIR}/templates/rhel-registration/rhel-registration-resource-registry.yaml \
    -e ${TOP_DIR}/custom_ovsbond.yaml
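For reference, the before/after package versions shown above can be captured on all nodes in one shot, using the same ansible inventory as step 2 of the additional info (the tee log name is just an illustration):

stack@ospdirector$ ansible -f 1 -i hosts -m command \
    -a 'rpm -q kernel corosync pacemaker resource-agents openstack-puppet-modules' \
    \* | tee rpm-versions-$(date +%F).log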
Here is more information:

After running steps 1) and 3) from above, I always get something like this:

WAITING
completed: [u'krynn-ceph-0', u'krynn-ctrl-2', u'krynn-ceph-1', u'krynn-cmpt-1', u'krynn-ctrl-0', u'krynn-ceph-2', u'krynn-ctrl-1']
on_breakpoint: [u'krynn-cmpt-0']
removing breakpoint on krynn-cmpt-0
Breakpoint reached, continue? Regexp or Enter=proceed, no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
[...REPEATED.....]
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
FAILED
update finished with status FAILED

Initial investigation always shows a trace similar to this:

[stack@instack ~]$ heat resource-list -n 3 overcloud | grep -v _COMPLETE
| resource_name                         | physical_resource_id                 | resource_type                            | resource_status | updated_time        | stack_name |
| ControllerNodesPostDeployment         | b9daf7b6-bf8c-4527-8d16-8e1c8ed4ab86 | OS::TripleO::ControllerPostDeployment    | UPDATE_FAILED   | 2016-10-04T18:37:06 | overcloud |
| ControllerPostPuppet                  | 40b16ec5-014e-4e32-bfe9-17ce2645b9b1 | OS::TripleO::Tasks::ControllerPostPuppet | UPDATE_FAILED   | 2016-10-04T18:58:25 | overcloud-ControllerNodesPostDeployment-43trttftu6p4 |
| ControllerPostPuppetRestartDeployment | 69c70e74-b929-4737-b264-134562ae4422 | OS::Heat::SoftwareDeployments            | UPDATE_FAILED   | 2016-10-04T19:00:00 | overcloud-ControllerNodesPostDeployment-43trttftu6p4-ControllerPostPuppet-dmsfb7tyizaj |
| 0                                     | 74d53461-ac4e-4adf-ae90-0004c18b203f | OS::Heat::SoftwareDeployment             | UPDATE_FAILED   | 2016-10-04T19:00:03 | overcloud-ControllerNodesPostDeployment-43trttftu6p4-ControllerPostPuppet-dmsfb7tyizaj-ControllerPostPuppetRestartDeployment-zlfu6dwvqphe |
Looking further into the failed resource (ControllerPostPuppetRestartDeployment), I always notice that it failed to restart rabbitmq:

[stack@instack ~]$ heat deployment-output-show 74d53461-ac4e-4adf-ae90-0004c18b203f deploy_stderr
[......]
+ node_states=' httpd (systemd:httpd): (target-role:Stopped) Stopped
 httpd (systemd:httpd): (target-role:Stopped) Stopped
 httpd (systemd:httpd): (target-role:Stopped) Stopped'
+ echo ' httpd (systemd:httpd): (target-role:Stopped) Stopped
 httpd (systemd:httpd): (target-role:Stopped) Stopped
 httpd (systemd:httpd): (target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'httpd has stopped'
+ pcs resource disable openstack-keystone
+ check_resource openstack-keystone stopped 1800
+ '[' 3 -ne 3 ']'
+ service=openstack-keystone
+ state=stopped
+ timeout=1800
+ '[' stopped = stopped ']'
+ match_for_incomplete=Started
+ timeout -k 10 1800 crm_resource --wait
++ grep openstack-keystone
++ pcs status --full
++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped'
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'openstack-keystone has stopped'
+ pcs status
+ grep haproxy-clone
+ pcs resource restart haproxy-clone
+ pcs resource restart redis-master
+ pcs resource restart mongod-clone
+ pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired
Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped
Waiting for 1 resources to stop:
 * rabbitmq-clone
 * rabbitmq-clone
Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role name=target-role
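For context, the wait logic the trace is stepping through looks roughly like this, reconstructed from the set -x output above (the actual helper shipped in the templates may differ in detail):

check_resource() {
    # Usage: check_resource <service> <started|stopped> <timeout-seconds>
    local service=$1 state=$2 timeout=$3
    local match_for_incomplete node_states
    if [ "$state" = "stopped" ]; then
        match_for_incomplete='Started'
    else
        match_for_incomplete='Stopped'
    fi
    # Let pacemaker settle all pending actions; hard-kill 10s after expiry.
    timeout -k 10 "$timeout" crm_resource --wait
    # Collect the per-node state lines for this resource.
    node_states=$(pcs status --full | grep "$service" | grep -v Clone)
    # If any node still reports the unwanted state, the transition failed.
    if echo "$node_states" | grep -q "$match_for_incomplete"; then
        echo "$service failed to reach state '$state' within ${timeout}s" >&2
        return 1
    fi
    echo "$service has $state"
}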
Logging into one of the controllers, I see this:

[heat-admin@krynn-ctrl-0 ~]$ sudo pcs status | grep -A1 rabbitmq
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ krynn-ctrl-0 krynn-ctrl-1 krynn-ctrl-2 ]

So it seems that rabbitmq managed to get restarted after all. Further restarts work fine and within a reasonable time:

[heat-admin@krynn-ctrl-0 ~]$ time sudo pcs resource restart rabbitmq-clone
rabbitmq-clone successfully restarted

real    0m24.117s
user    0m0.944s
sys     0m0.279s
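While a restart is in flight, the stop phase can be watched from a second terminal to see which node is holding things up; a minimal sketch, assuming watch(1) is available on the controllers:

[heat-admin@krynn-ctrl-0 ~]$ watch -n 5 'sudo pcs status --full | grep -A1 rabbitmq'

The operation timeouts configured on the resource itself can also be inspected with 'sudo pcs resource show rabbitmq' (pcs 0.9 syntax, as shipped with these packages) to rule out a too-short stop timeout.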
Can you please provide sosreports?

Andrew, can you look at it when you have time? By the look of it, the environment is simply not powerful enough and was running at the edge before. Something else might have changed during the update that's causing some services to take more resources or more time to shut down, producing the cascade effect of rabbitmq timing out on stop even though the resource comes up at a later stage.
This is likely a duplicate of: https://bugzilla.redhat.com/show_bug.cgi?id=1364241

We fixed this in Newton. I will work on the backports.
Hi Fabio,

Sorry about the delay. I've been swamped with a recent engagement. I am not sure I can provide sosreports at this time, since that deployment was scrapped and I went with the package update on the overcloud image (I update the packages inside undercloud-full.qcow2) before deployment. I'll see if I can revisit this issue in the coming weeks.

Regards,
Vincent
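For reference, one way to do the image-side package update described above, as a sketch (assuming libguestfs-tools is installed on the undercloud and the image has working yum repos configured inside; image name as given in the comment above):

# refresh all packages inside the image before deployment
virt-customize -a undercloud-full.qcow2 --update --selinux-relabel

# re-upload the images to glance so new deployments pick them up
# (existing images may need to be deleted first, or use --update-existing
#  where the client supports it)
openstack overcloud image upload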