Bug 1303164

Summary: rhel-osp-director: overcloud update 7.0->7.3 fails with ControllerPostPuppetRestartDeployment.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
Product: Red Hat OpenStack
Component: rhosp-director
Version: 7.0 (Kilo)
Target Release: 7.0 (Kilo)
Target Milestone: y3
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: high
Status: CLOSED DUPLICATE
Keywords: Reopened, Triaged
Type: Bug
Doc Type: Bug Fix
Reporter: Alexander Chuzhoy <sasha>
Assignee: James Slagle <jslagle>
QA Contact: yeylon <yeylon>
CC: athomas, fdinitto, ipilcher, jcoufal, jeckersb, jslagle, kgaillot, mburns, mcornea, michele, morazi, ohochman, plemenko, rhel-osp-director-maint, sasha, srevivo
Last Closed: 2016-02-09 22:33:46 UTC
Attachments:
  crm_report
  stuck rabbit pids

Description Alexander Chuzhoy 2016-01-29 18:15:22 UTC
rhel-osp-director: overcloud update 7.0->7.3 fails with ControllerPostPuppetRestartDeployment.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1


Environment:
rabbitmq-server-3.3.5-15.el7ost.noarch
instack-undercloud-2.1.2-37.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-112.el7ost.noarch


Steps to reproduce:
1. Deploy an overcloud 7.0 HA with 1 ceph node + vxlan and populate it with some objects.
2. Attempt to update the overcloud to 7.3 (a sketch of the update invocation follows below).
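
(A hedged sketch of the minor-update invocation from the undercloud, assuming the standard OSP 7 update workflow; the -e environment files from the original deployment are omitted here and must be passed again:)

[stack@instack ~]$ source stackrc
[stack@instack ~]$ openstack overcloud update stack overcloud -i --templates
# -i runs the update interactively so the per-node package-update breakpoints
# can be cleared one node at a time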

Result:
After finishing the yum update...
IN_PROGRESS
IN_PROGRESS
FAILED
update finished with status FAILED

[stack@instack ~]$ heat resource-list -n5 overcloud|grep -v COMPLE
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
| resource_name                                 | physical_resource_id                          | resource_type                                     | resource_status | updated_time         | parent_resource                               |
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
| ControllerNodesPostDeployment                 | 930785d7-7ea8-4098-8a26-8a4f60ec13f3          | OS::TripleO::ControllerPostDeployment             | UPDATE_FAILED   | 2016-01-29T16:28:31Z |                                               |
| ControllerPostPuppet                          | a9fe3174-65cd-4833-bab7-5f23959fb98d          | OS::TripleO::Tasks::ControllerPostPuppet          | CREATE_FAILED   | 2016-01-29T16:48:34Z | ControllerNodesPostDeployment                 |
| ControllerPostPuppetRestartDeployment         | b3915f58-22ce-406a-bf96-59addac8e8fa          | OS::Heat::SoftwareDeployments                     | CREATE_FAILED   | 2016-01-29T16:48:36Z | ControllerPostPuppet                          |
| 0                                             | bf808f8b-b4d6-4620-bd79-3868a45bf641          | OS::Heat::SoftwareDeployment                      | CREATE_FAILED   | 2016-01-29T16:50:13Z | ControllerPostPuppetRestartDeployment         |
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+



resource_status_reason | resources.ControllerNodesPostDeployment: Error: resources.ControllerPostPuppet.resources.ControllerPostPuppetRestartDeployment.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1 
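
(To drill down from the failed stack resource to the actual script output, roughly the following can be used; the UUID is the physical_resource_id of the failed OS::Heat::SoftwareDeployment "0" from the table above:)

[stack@instack ~]$ heat resource-list -n5 overcloud | grep -v COMPLE
[stack@instack ~]$ heat deployment-show bf808f8b-b4d6-4620-bd79-3868a45bf641
# deployment-show prints the full deploy_stdout/deploy_stderr of the failing restart script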


Looking for errors on the controllers:
not yet stopped, sleeping 3 seconds.'
+ sleep 3
++ date +%s
+ ((  1454086689 < 1454087510  ))
++ pcs status --full
++ grep -v Clone
++ grep openstack-keystone
+ node_states='     openstack-keystone    (systemd:openstack-keystone):    Stopped
     openstack-keystone    (systemd:openstack-keystone):    Stopped
     openstack-keystone    (systemd:openstack-keystone):    Stopped '
+ echo '     openstack-keystone    (systemd:openstack-keystone):    Stopped
     openstack-keystone    (systemd:openstack-keystone):    Stopped
     openstack-keystone    (systemd:openstack-keystone):    Stopped '
+ grep -q Started
+ echo 'openstack-keystone has stopped'
+ return
+ pcs status
+ grep haproxy-clone
+ pcs resource restart haproxy-clone
+ pcs resource restart redis-master
+ pcs resource restart mongod-clone
+ pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired
Transition failed: terminated
An invalid transition was produced
Waiting for 1 resources to stop:
 * rabbitmq-clone
 * rabbitmq-clone
Deleted rabbitmq-clone option: id=rabbitmq-clone-meta_attributes-target-role name=target-role
", "deploy_status_code": 1}
Jan 29 12:00:17 overcloud-controller-0.localdomain os-collect-config[23206]: Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Jan 29 12:00:17 overcloud-controller-0.localdomain os-collect-config[23206]: Error performing operation: Timer expired
Jan 29 12:00:17 overcloud-controller-0.localdomain os-collect-config[23206]: [2016-01-29 12:00:17,193] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/fc2a2fcd-127e-4edd-b490-6a5394917bf9. [1]
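
For context, the failing heat-config script is the post-puppet pacemaker restart step. A minimal sketch of the logic visible in the trace above (reconstructed from the log for illustration, not the verbatim tripleo-heat-templates script; the 600-second deadline is an assumption):

# wait for openstack-keystone to stop on all nodes before restarting services
deadline=$(( $(date +%s) + 600 ))
while (( $(date +%s) < deadline )); do
    node_states=$(pcs status --full | grep -v Clone | grep openstack-keystone)
    if ! echo "$node_states" | grep -q Started; then
        echo 'openstack-keystone has stopped'
        break
    fi
    echo 'openstack-keystone not yet stopped, sleeping 3 seconds.'
    sleep 3
done

# restart the core clustered services; each "pcs resource restart" waits for the
# stop/start transition and fails if it cannot complete within the op timeouts,
# which is what happens for rabbitmq-clone here
pcs resource restart haproxy-clone
pcs resource restart redis-master
pcs resource restart mongod-clone
pcs resource restart rabbitmq-clone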

Comment 1 Alexander Chuzhoy 2016-01-29 18:16:05 UTC
Created attachment 1119493 [details]
crm_report

Comment 4 James Slagle 2016-02-03 10:25:52 UTC
Michele, can you take a look at the attached crm_report and see if it shows the reason why pacemaker was unable to stop rabbit?

Comment 5 Fabio Massimo Di Nitto 2016-02-03 15:42:16 UTC
send the request to Ken, he has deeper expertise on crm_reports.

Comment 6 Michele Baldessari 2016-02-03 16:01:45 UTC
So stopping rabbitmq failed like this:

- overcloud-controller-0 - Timeout
overcloud-controller-0/messages:Jan 29 12:00:13 overcloud-controller-0 lrmd[15856]: warning: operation_finished: rabbitmq_stop_0:7655 - timed out after 90000ms
overcloud-controller-0/messages:Jan 29 12:00:13 overcloud-controller-0 crmd[15859]:   error: process_lrm_event: Operation rabbitmq_stop_0: Timed Out (node=overcloud-controller-0, call=431, timeout=90000ms)


- overcloud-controller-1 - Timeout
overcloud-controller-1/messages:Jan 29 12:01:43 overcloud-controller-1 lrmd[436]: warning: operation_finished: rabbitmq_stop_0:13245 - timed out after 90000ms
overcloud-controller-1/messages:Jan 29 12:01:43 overcloud-controller-1 crmd[439]:   error: process_lrm_event: Operation rabbitmq_stop_0: Timed Out (node=overcloud-controller-1, call=481, timeout=90000ms)


- overcloud-controller-2 - OK
overcloud-controller-2/messages:Jan 29 11:58:42 overcloud-controller-2 crmd[500]:  notice: process_lrm_event: Operation rabbitmq_stop_0: ok (node=overcloud-controller-2, call=747, rc=0, cib-update=254, confirmed=true)

So on controllers 0 and 1 it failed because the stop did not complete within 90s.

Can we get sosreports of all three controllers (please make sure all rabbitmq logs from /var/log/rabbit* are included)?

My first suspicion is that this was a moderately loaded VM and that simply 90s
was not enough time for rabbit to stop. I'd like to see some logs though.
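
(If a too-short window were indeed the problem: the 90s is presumably the configured stop operation timeout on the rabbitmq resource, and could be inspected and raised with pcs roughly as follows; the 200s value is an arbitrary example:)

pcs resource show rabbitmq                                      # shows the configured op stop timeout
pcs resource update rabbitmq op stop interval=0s timeout=200s   # raise the stop timeout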

Comment 7 Ken Gaillot 2016-02-03 16:48:43 UTC
I agree with Michele's analysis. A couple of points worth adding: on overcloud-controller-2, the successful stop of rabbitmq took only 3 seconds; and if stonith were enabled (as it should be), Pacemaker would fence any nodes on which the stop failed.
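
(Whether fencing is configured can be checked with something like the following; exact pcs syntax varies slightly between versions:)

pcs property show stonith-enabled   # cluster-wide stonith switch
pcs stonith show                    # configured fencing devices, if any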

The rabbit logs will probably be the best thing to focus on next.

Comment 8 James Slagle 2016-02-03 17:30:02 UTC
sasha, can you reproduce this and provide the requested logs?
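
(For reference, a rough way to collect what was requested on each controller; sosreport's rabbitmq plugin should pick up the logs, but grabbing /var/log/rabbit* explicitly as asked above is the safe option:)

sosreport --batch
tar czf /tmp/rabbit-logs-$(hostname -s).tar.gz /var/log/rabbit*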

Comment 9 Angus Thomas 2016-02-07 16:10:02 UTC
Given that this is believed to be caused by running in an overloaded VM, and it has not yet been possible to reproduce the issue and provide more logs for analysis, I'm going to remove this from the 7.3 release blockers on February 9th unless we get sufficient information for further analysis.

Comment 10 Angus Thomas 2016-02-09 09:02:01 UTC
As previously noted, I'm closing this bug because we haven't been able to reproduce it. Please re-open if we get a consistent way to reproduce the bug.

Comment 11 Alexander Chuzhoy 2016-02-09 14:41:13 UTC
Reproduced this one:
resource_status_reason | resources.ControllerNodesPostDeployment: Error: resources.ControllerPostPuppet.resources.ControllerPostPuppetRestartDeployment.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1 



more details:
truncated output from "heat deployment-show a7f436dc-333b-462f-b886-1b93b7625088":


+ pcs status
+ grep haproxy-clone
+ pcs resource restart haproxy-clone
+ pcs resource restart redis-master
+ pcs resource restart mongod-clone
+ pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired
Transition failed: terminated
An invalid transition was produced
Waiting for 1 resources to stop:
 * rabbitmq-clone
 * rabbitmq-clone
Deleted rabbitmq-clone option: id=rabbitmq-clone-meta_attributes-target-role name=target-role

", "deploy_status_code": 1 }, "creation_time": "2016-02-09T03:23:34Z", "updated_time": "2016-02-09T03:32:54Z", "input_values": {}, "action": "CREATE", "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 1", "id": "a7f436dc-333b-462f-b886-1b93b7625088" }

Comment 13 James Slagle 2016-02-09 16:25:57 UTC
Ken, can you have a look at the sosreports to see why rabbitmq couldn't stop?

Comment 14 Fabio Massimo Di Nitto 2016-02-09 17:04:35 UTC
also adding the rabbitmq guys

Comment 15 John Eckersberg 2016-02-09 17:34:22 UTC
Any chance the system is still around and sitting in this state?  I can see in the logs that rabbitmq begins to shut down, but does not do so completely.  If it's still running in that state we may be able to get useful information out of it by introspecting the erlang VM.
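
(If it is still up in that state, it can be poked with rabbitmqctl, e.g. via rabbitmqctl status and the stuck-process diagnostic that ends up being used in comment 18 below:)

rabbitmqctl status
rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'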

Comment 16 Ken Gaillot 2016-02-09 19:18:50 UTC
Observations on the latest logs:

* When rabbitmq stops successfully, it only takes a few seconds, so I doubt it is load-related or just needs a longer timeout. Rabbitmq appears to be hung when the stop times out.

* When the stop times out, the rabbitmq agent had printed only one line of output, like "Stopping and halting node 'rabbit@overcloud-controller-0' ...". The rabbitmq logs got as far as "Halting Erlang VM". I'm not familiar enough with rabbitmq or erlang to say what might cause a hang after that point.

* The first stop timeout in this case happened on a node that had been rebooted earlier for a yum update that included a rabbitmq package update. Also, at that reboot, stopping openstack-nova-novncproxy via systemd (not cluster) timed out and was killed by systemd. Pacemaker wouldn't be affected by that, but I'm not sure about rabbitmq.

* These are RHEL 7.2 servers, but they are running the pacemaker version from 7.1. Is there a reason it hasn't been updated?

Comment 18 John Eckersberg 2016-02-09 22:23:16 UTC
Created attachment 1122536 [details]
stuck rabbit pids

output from rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

Comment 19 John Eckersberg 2016-02-09 22:33:46 UTC
Lots of queue procs stuck here:

{current_stacktrace,
     [{rabbit_mirror_queue_master,'-stop_all_slaves/2-lc$^1/1-1-',1,
          [{file,"src/rabbit_mirror_queue_master.erl"},{line,188}]},
      {rabbit_mirror_queue_master,stop_all_slaves,2,
          [{file,"src/rabbit_mirror_queue_master.erl"},{line,188}]},
      {rabbit_mirror_queue_master,delete_and_terminate,2,
          [{file,"src/rabbit_mirror_queue_master.erl"},{line,181}]},
      {rabbit_amqqueue_process,'-terminate_delete/3-fun-1-',6,
          [{file,"src/rabbit_amqqueue_process.erl"},{line,159}]},
      {rabbit_amqqueue_process,terminate_shutdown,2,
          [{file,"src/rabbit_amqqueue_process.erl"},{line,320}]},
      {gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1119}]},
      {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1014}]},
      {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]}

This is the same issue as bug 1303746, which I just fixed and marked as MODIFIED, so I'm going to mark this one as a duplicate of that one; the fix should make its way into the next puddle.

*** This bug has been marked as a duplicate of bug 1303746 ***

Comment 20 Ian Pilcher 2016-04-15 14:03:05 UTC
I just saw this error yesterday while trying to update my overcloud.  The overcloud is 3x controllers + 1x compute node (using external NFS for Cinder & Glance).

All nodes are VMs.  Host is a 4-core/8-thread Core i7 with 32GB; no swap.  Overcloud node virtual disks are logical volumes on a Samsung 850 EVO SSD.

When I got the error, all controllers were 1 VCPU with 6GB.  I subsequently changed all controllers to 2 VCPUs, and I was able to redeploy and successfully update.  (During the successful update, I made sure to wait until all nodes were relatively idle before clearing a breakpoint.)