Bug 1303164
Summary: rhel-osp-director: overcloud update 7.0->7.3 fails with ControllerPostPuppetRestartDeployment.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1

| Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha> |
|---|---|---|---|
| Component: | rhosp-director | Assignee: | James Slagle <jslagle> |
| Status: | CLOSED DUPLICATE | QA Contact: | yeylon <yeylon> |
| Severity: | urgent | Priority: | high |
| Version: | 7.0 (Kilo) | CC: | athomas, fdinitto, ipilcher, jcoufal, jeckersb, jslagle, kgaillot, mburns, mcornea, michele, morazi, ohochman, plemenko, rhel-osp-director-maint, sasha, srevivo |
| Target Milestone: | y3 | Keywords: | Reopened, Triaged |
| Target Release: | 7.0 (Kilo) | Doc Type: | Bug Fix |
| Hardware: | Unspecified | OS: | Unspecified |
| Last Closed: | 2016-02-09 22:33:46 UTC | Type: | Bug |
Description
Alexander Chuzhoy
2016-01-29 18:15:22 UTC
Created attachment 1119493 [details]
crm_report
Michele, can you take a look at the attached crm_report and see if it shows the reason why pacemaker was unable to stop rabbit?

Send the request to Ken; he has deeper expertise on crm_reports.

So stopping rabbitmq failed like this:

- overcloud-controller-0 - Timeout
overcloud-controller-0/messages:Jan 29 12:00:13 overcloud-controller-0 lrmd[15856]: warning: operation_finished: rabbitmq_stop_0:7655 - timed out after 90000ms
overcloud-controller-0/messages:Jan 29 12:00:13 overcloud-controller-0 crmd[15859]: error: process_lrm_event: Operation rabbitmq_stop_0: Timed Out (node=overcloud-controller-0, call=431, timeout=90000ms)

- overcloud-controller-1 - Timeout
overcloud-controller-1/messages:Jan 29 12:01:43 overcloud-controller-1 lrmd[436]: warning: operation_finished: rabbitmq_stop_0:13245 - timed out after 90000ms
overcloud-controller-1/messages:Jan 29 12:01:43 overcloud-controller-1 crmd[439]: error: process_lrm_event: Operation rabbitmq_stop_0: Timed Out (node=overcloud-controller-1, call=481, timeout=90000ms)

- overcloud-controller-2 - OK
overcloud-controller-2/messages:Jan 29 11:58:42 overcloud-controller-2 crmd[500]: notice: process_lrm_event: Operation rabbitmq_stop_0: ok (node=overcloud-controller-2, call=747, rc=0, cib-update=254, confirmed=true)

So on controllers 0 and 1 it failed because stopping did not complete within 90s. Can we get sosreports of all three controllers (please make sure all rabbitmq logs are included, /var/log/rabbit*)? My first suspicion is that this was a moderately loaded VM and that 90s was simply not enough time for rabbit to stop. I'd like to see some logs, though.

I agree with Michele's analysis. A couple of points worth adding: on overcloud-controller-2, the successful stop of rabbitmq took only 3 seconds; and if stonith were enabled (as it should be), Pacemaker would fence any node on which the stop failed. The rabbit logs will probably be the best thing to focus on next.
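For anyone repeating this triage, the per-node grep can be mechanized. A minimal sketch, assuming the crm_report is unpacked with the usual <node>/messages layout; the `summarize_stops` helper name is made up for illustration, and the demo feeds in the two log lines quoted in this bug rather than real files:

```shell
# Illustrative helper (not part of the update tooling): reduce crm_report
# "messages" files to one rabbitmq_stop_0 outcome per node.
# Real use would look like: summarize_stops overcloud-controller-*/messages
summarize_stops() {
  sed -nE 's/.*Operation rabbitmq_stop_0: (Timed Out|ok) \(node=([^,]*).*/\2: \1/p' "$@" | sort -u
}

# Demo on two log lines quoted in the bug:
printf '%s\n' \
  'Jan 29 12:00:13 overcloud-controller-0 crmd[15859]: error: process_lrm_event: Operation rabbitmq_stop_0: Timed Out (node=overcloud-controller-0, call=431, timeout=90000ms)' \
  'Jan 29 11:58:42 overcloud-controller-2 crmd[500]: notice: process_lrm_event: Operation rabbitmq_stop_0: ok (node=overcloud-controller-2, call=747, rc=0, cib-update=254, confirmed=true)' \
  | summarize_stops
```

This prints one `node: outcome` line per controller, which makes it quick to see which nodes hit the 90s stop timeout.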
sasha, can you reproduce this and provide the requested logs?

Given that this is believed to be caused by running in an over-loaded VM, and it has not yet been possible to reproduce the issue and provide more logs for analysis, I'm going to remove this from the 7.3 release blockers on February 9th unless we get sufficient information for further analysis.

As previously noted, I'm closing this bug because we haven't been able to reproduce it. Please re-open if we get a consistent way to reproduce the bug.

Reproduced this one:

resource_status_reason | resources.ControllerNodesPostDeployment: Error: resources.ControllerPostPuppet.resources.ControllerPostPuppetRestartDeployment.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1

More details: truncated output from "heat deployment-show a7f436dc-333b-462f-b886-1b93b7625088":

+ pcs status
+ grep haproxy-clone
+ pcs resource restart haproxy-clone
+ pcs resource restart redis-master
+ pcs resource restart mongod-clone
+ pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired
Transition failed: terminated
An invalid transition was produced
Waiting for 1 resources to stop:
rabbitmq-clone
Deleted rabbitmq-clone option: id=rabbitmq-clone-meta_attributes-target-role name=target-role
", "deploy_status_code": 1 }, "creation_time": "2016-02-09T03:23:34Z", "updated_time": "2016-02-09T03:32:54Z", "input_values": {}, "action": "CREATE", "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 1", "id": "a7f436dc-333b-462f-b886-1b93b7625088" }

Ken, can you have a look at the sosreports to see why rabbitmq couldn't stop? Also adding the rabbitmq guys.

Any chance the system is still around and sitting in this state? I can see in the logs that rabbitmq begins to shut down, but does not do so completely. If it's still running in that state, we may be able to get useful information out of it by introspecting the Erlang VM.

Observations on the latest logs:

* When rabbitmq stops successfully, it only takes a few seconds, so I doubt it is load-related or just needs a longer timeout. Rabbitmq appears to be hung when the stop times out.

* When the stop times out, the rabbitmq agent had printed only one line of output, like "Stopping and halting node 'rabbit@overcloud-controller-0' ...". The rabbitmq logs got as far as "Halting Erlang VM". I'm not familiar enough with rabbitmq or Erlang to say what might cause a hang after that point.

* The first stop timeout in this case happened on a node that had been rebooted earlier for a yum update that included a rabbitmq package update. Also, at that reboot, stopping openstack-nova-novncproxy via systemd (not the cluster) timed out and was killed by systemd. Pacemaker wouldn't be affected by that, but I'm not sure about rabbitmq.

* These are RHEL 7.2 servers, but they are running the pacemaker version from 7.1. Is there a reason it hasn't been updated?

Created attachment 1122536 [details]
stuck rabbit pids
output from rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
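To skim a maybe_stuck() dump for the frames that recur, something along these lines can help. This is a sketch: the `hot_frames` helper name is invented, and the demo input is a few frames copied from the attached trace rather than the full file:

```shell
# Illustrative helper: tally {file,...},{line,...} frames from a
# rabbit_diagnostics:maybe_stuck() dump so the common stuck point stands out.
# Real use would look like: hot_frames maybe_stuck.txt
hot_frames() {
  grep -oE '\{file,"[^"]*"\},\{line,[0-9]+\}' "$@" \
    | sed -E 's/\{file,"([^"]*)"\},\{line,([0-9]+)\}/\1:\2/' \
    | sort | uniq -c | sort -rn
}

# Demo on three frames copied from the attached dump:
printf '%s\n' \
  '[{file,"src/rabbit_mirror_queue_master.erl"},{line,188}]},' \
  '[{file,"src/rabbit_mirror_queue_master.erl"},{line,188}]},' \
  '[{file,"src/rabbit_mirror_queue_master.erl"},{line,181}]},' \
  | hot_frames
```

The count column makes it obvious when many queue processes are parked at the same file:line, as they are in the trace below.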
Lots of queue procs stuck here:

{current_stacktrace,
 [{rabbit_mirror_queue_master,'-stop_all_slaves/2-lc$^1/1-1-',1,
   [{file,"src/rabbit_mirror_queue_master.erl"},{line,188}]},
  {rabbit_mirror_queue_master,stop_all_slaves,2,
   [{file,"src/rabbit_mirror_queue_master.erl"},{line,188}]},
  {rabbit_mirror_queue_master,delete_and_terminate,2,
   [{file,"src/rabbit_mirror_queue_master.erl"},{line,181}]},
  {rabbit_amqqueue_process,'-terminate_delete/3-fun-1-',6,
   [{file,"src/rabbit_amqqueue_process.erl"},{line,159}]},
  {rabbit_amqqueue_process,terminate_shutdown,2,
   [{file,"src/rabbit_amqqueue_process.erl"},{line,320}]},
  {gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1119}]},
  {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1014}]},
  {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]}

This is the same thing as bug 1303746, which I just fixed up and marked as MODIFIED, so I'm going to mark this as a duplicate of that one and it should make its way into the next puddle.

*** This bug has been marked as a duplicate of bug 1303746 ***

I just saw this error yesterday while trying to update my overcloud. The overcloud is 3x controllers + 1x compute node (using external NFS for Cinder & Glance). All nodes are VMs. The host is a 4-core/8-thread Core i7 with 32GB RAM and no swap. Overcloud node virtual disks are logical volumes on a Samsung 850 EVO SSD.

When I got the error, all controllers had 1 VCPU and 6GB RAM. I subsequently changed all controllers to 2 VCPUs, and I was able to redeploy and successfully update. (During the successful update, I made sure to wait until all nodes were relatively idle before clearing a breakpoint.)