Bug 1364241
| Summary: | rhel-osp-director: Update 7.2->7.3Async fails: Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining Error performing operation: Timer expired | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha> | ||||
| Component: | rhosp-director | Assignee: | Michele Baldessari <michele> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Omri Hochman <ohochman> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 7.0 (Kilo) | CC: | abeekhof, aschultz, dbecker, jslagle, mburns, michele, morazi, rhel-osp-director-maint, rscarazz, sasha, ushkalim | ||||
| Target Milestone: | async | Keywords: | Regression, Triaged | ||||
| Target Release: | 7.0 (Kilo) | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2018-07-20 08:18:36 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1344794 | ||||||
| Attachments: |
Rabbit shutdown on controller-1 is taking ~44 seconds. Can this be host load related?

There are a few stop/start cycles, but I think you're talking about this one, right?

Aug 04 14:08:54 [31393] overcloud-controller-1.localdomain crmd: info: do_lrm_rsc_op: Performing key=124:48:0:31a2b3c6-bcce-4a1c-b9e3-5eefa5a12b25 op=rabbitmq_stop_0
Aug 04 14:09:56 [31393] overcloud-controller-1.localdomain crmd: notice: process_lrm_event: Operation rabbitmq_stop_0: ok (node=overcloud-controller-1, call=476, rc=0, cib-update=271, confirmed=true)

That is closer to 62s. Regardless, as far as the cluster is concerned, everything was fine. Which timeout was not satisfied, and what is it set to? It's not clear from the description. It seems to me it should really be the start timeout + stop timeout + a second or so of buffer.

Also, I think the clock on overcloud-controller-0.localdomain is off by a couple of hours, which makes log analysis rather challenging.

Created attachment 1187792 [details]
journalctl errors
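For this kind of log spelunking, the elapsed time between two crmd entries can be computed directly; a minimal sketch using GNU date, with the two timestamps copied from the log lines quoted above (the date is filled in since both entries are from the same day):

```shell
# Elapsed time between the rabbitmq_stop_0 dispatch and its confirmation,
# using the two crmd timestamps quoted above.
start=$(date -u -d '2016-08-04 14:08:54' +%s)
stop=$(date -u -d '2016-08-04 14:09:56' +%s)
echo "rabbitmq_stop_0 took $((stop - start))s"   # prints: rabbitmq_stop_0 took 62s
```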
Just trying to understand what's going on here. From these errors I see a 'failed to push cib' on controller-2, as well as "Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining" on controller-0 and "Error: {not_a_cluster_node,"The node selected is not in the cluster." on controller-1.
As Andrew says in comment #4, the clock skew makes it hard to understand what is going on here. I poked some more; one note of possible interest is from controller-0:

Aug 4 14:09:24 overcloud-controller-0 os-collect-config: [2016-08-04 18:09:24,766] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/4ad7e76b-bd9a-4cde-afa8-321f0c87d04d. [1]

I confirmed that this file ^^^ is extraconfig/tasks/pacemaker_resource_restart.sh, but that is as far as I got. It could be a legitimate timeout bringing a service down and up on oversubscribed hardware (rabbit in this case), or it could be a regression. Is this reproducible on multiple setups @sasha, or just once so far?

This one is a bit concerning:

34156:Aug 04 17:40:30 overcloud-controller-2.localdomain os-collect-config[3603]: Error: unable to push cib

but I don't know what os-collect-config could possibly be trying to push into the cib.

(In reply to Andrew Beekhof from comment #7)
> This one is a bit concerning:
>
> 34156:Aug 04 17:40:30 overcloud-controller-2.localdomain
> os-collect-config[3603]: Error: unable to push cib
>
> but i don't know what os-collect-config could possibly be trying to push
> into the cib

o/ Andrew, fyi: on current master this is happening in pacemaker_migrations.sh:
https://github.com/openstack/tripleo-heat-templates/blob/292fdf87e0fdcbd66664afc4c463f2f0e9a354fa/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh#L92

However, the environment this bug is for is rhos7, so looking at the rhos-7.0-patches branch to confirm: this was during yum_update.sh, which is what is being executed here (for the update of 7.2 -> 7.3) since it is rhos 7. Downstream link to the relevant code:
https://code.engineering.redhat.com/gerrit/gitweb?p=openstack-tripleo-heat-templates.git;a=blob;f=extraconfig/tasks/yum_update.sh;h=98ad25c764fc9948567253444d882d45a6b4b31d;hb=refs/heads/rhos-7.0-patches#l136

@sasha did you manage to duplicate this on other hardware?
I get the impression from comment #3 and some brief mentions on IRC that this might be a timeout because of insufficient resources, and that you were going to try and see if it happens again, especially elsewhere?

thanks, marios

(In reply to marios from comment #8)
> (In reply to Andrew Beekhof from comment #7)
> > This one is a bit concerning:
> >
> > 34156:Aug 04 17:40:30 overcloud-controller-2.localdomain
> > os-collect-config[3603]: Error: unable to push cib
> >
> > but i don't know what os-collect-config could possibly be trying to push
> > into the cib
>
> o/ Andrew fyi: on current master this is happening in the
> pacemaker_migrations.sh
> https://github.com/openstack/tripleo-heat-templates/blob/292fdf87e0fdcbd66664afc4c463f2f0e9a354fa/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh#L92

Oh, I didn't realize those actions would be performed under the os-collect-config tag.

> However, the environment this bug is for is rhos7 so looking at the
> rhos-7.0-patches branch to confirm, this was during the yum_update.sh which
> is what is being executed here (for update of 7.2 -> 7.3) since it is rhos 7
> downstream link to the relevant code
> https://code.engineering.redhat.com/gerrit/gitweb?p=openstack-tripleo-heat-templates.git;a=blob;f=extraconfig/tasks/yum_update.sh;h=98ad25c764fc9948567253444d882d45a6b4b31d;hb=refs/heads/rhos-7.0-patches#l136

That is a big problem. It means none of the cluster updates went in.

I just reproduced this bug on an attempt to update 8 to latest.

Environment:
instack-undercloud-2.2.7-7.el7ost.noarch
openstack-puppet-modules-7.1.3-1.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-18.el7ost.noarch
openstack-tripleo-heat-templates-0.8.14-18.el7ost.noarch

No clock skew:
[stack@director ~]$ for i in `nova list|awk -F'|' '/Running/ {print $(NF-1)}'|awk -F"=" '{print $NF}'`; do ssh heat-admin@$i "hostname; date"; done
overcloud-cephstorage-0.localdomain
Wed Oct 12 17:57:47 UTC 2016
overcloud-cephstorage-1.localdomain
Wed Oct 12 17:57:47 UTC 2016
overcloud-cephstorage-2.localdomain
Wed Oct 12 17:57:47 UTC 2016
overcloud-compute-0.localdomain
Wed Oct 12 17:57:47 UTC 2016
overcloud-compute-1.localdomain
Wed Oct 12 17:57:48 UTC 2016
overcloud-compute-2.localdomain
Wed Oct 12 17:57:48 UTC 2016
overcloud-controller-0.localdomain
Wed Oct 12 17:57:48 UTC 2016
overcloud-controller-1.localdomain
Wed Oct 12 17:57:49 UTC 2016
overcloud-controller-2.localdomain
Wed Oct 12 17:57:49 UTC 2016
The system was idle after deploy (5 days ago), didn't run anything on it.

(In reply to Alexander Chuzhoy from comment #10)
> I just reproduced this bug on attempt to update 8 to latest.
>
> Environment:
> instack-undercloud-2.2.7-7.el7ost.noarch
> openstack-puppet-modules-7.1.3-1.el7ost.noarch
> openstack-tripleo-heat-templates-kilo-0.8.14-18.el7ost.noarch
> openstack-tripleo-heat-templates-0.8.14-18.el7ost.noarch

Sasha, can I get access to the system or sosreports from the controllers? I wonder if this is simply a matter of increasing the timeouts of rabbitmq (aka https://bugzilla.redhat.com/show_bug.cgi?id=1378391). If you can reproduce it at will, can we sync up online and try a quick patch?

Sasha,
can you try the following patch on the undercloud before launching the update command? (The patch needs to be applied in /usr/share/openstack-tripleo-heat-templates):
diff -up tripleo-heat-templates-0.8.14/extraconfig/tasks/yum_update.sh.orig tripleo-heat-templates-0.8.14/extraconfig/tasks/yum_update.sh
--- tripleo-heat-templates-0.8.14/extraconfig/tasks/yum_update.sh.orig 2016-10-13 20:09:48.607711524 +0200
+++ tripleo-heat-templates-0.8.14/extraconfig/tasks/yum_update.sh 2016-10-13 20:10:54.237854605 +0200
@@ -130,6 +130,7 @@ openstack-nova-scheduler"
echo "Making sure rabbitmq has the notify=true meta parameter"
pcs -f $pacemaker_dumpfile resource update rabbitmq meta notify=true
+ pcs -f $pacemaker_dumpfile resource update rabbitmq op start timeout=200s stop timeout=200s
echo "Applying new Pacemaker config"
if ! pcs cluster cib-push $pacemaker_dumpfile; then
My analysis so far of the reports I got about this came to the conclusion that rabbitmq simply needs more time. The reliably broken deployment I got my hands on was fixed with this. Let me know if you need help applying/testing.
cheers,
Michele
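If the patch file itself is not handy, the same one-line change can be applied with sed. This is only a sketch: the anchor line and the added pcs command are taken verbatim from the diff above, and a small sample file stands in for the real target, /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/yum_update.sh.

```shell
# Sample standing in for the relevant hunk of yum_update.sh
# (lines copied from the diff context above).
cat > /tmp/yum_update.sh <<'EOF'
    echo "Making sure rabbitmq has the notify=true meta parameter"
    pcs -f $pacemaker_dumpfile resource update rabbitmq meta notify=true
    echo "Applying new Pacemaker config"
EOF

# Append the op-timeout line right after the notify=true update
# (GNU sed one-line 'a' form; re-indent the inserted line to taste).
sed -i '/resource update rabbitmq meta notify=true/a pcs -f $pacemaker_dumpfile resource update rabbitmq op start timeout=200s stop timeout=200s' /tmp/yum_update.sh

grep -n 'timeout=200s' /tmp/yum_update.sh
```

Against the real file, run the same sed with the full template path instead of /tmp/yum_update.sh before launching the update.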
Michele,
Can I apply the patch on the setup with the failed update, or should I start from scratch at this point?

Hi Sasha,
it would be best to launch it from scratch at this point. I haven't audited the code to see what happens if you launch it against a failed deployment. The other option would be to bring up the cluster by hand and clean up any failed actions, and after that relaunch the update; that *should* work as well, although I have not tested it.

*** Bug 1384068 has been marked as a duplicate of this bug. ***
rhel-osp-director: Update 7.2->7.3Async fails: Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining Error performing operation: Timer expired

Environment:
openstack-tripleo-heat-templates-0.8.6-128.el7ost.noarch
openstack-puppet-modules-2015.1.8-51.el7ost.noarch
instack-undercloud-2.1.2-39.el7ost.noarch

Steps to reproduce:
1. Deploy 7.2GA with:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server clock.redhat.com --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml --ceph-storage-scale 1
2. Populate the setup.
3. Try to update to 7.3Async.

Result:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
FAILED
update finished with status FAILED

[stack@instack ~]$ heat resource-list -n5 overcloud|grep -v COMPLE
You must provide a username via either --os-username or env[OS_USERNAME] or a token via --os-auth-token or env[OS_AUTH_TOKEN]
[stack@instack ~]$ . stackrc
[stack@instack ~]$ heat resource-list -n5 overcloud|grep -v COMPLE
+---------------------------------------+--------------------------------------+------------------------------------------+-----------------+----------------------+---------------------------------------+
| resource_name                         | physical_resource_id                 | resource_type                            | resource_status | updated_time         | parent_resource                       |
+---------------------------------------+--------------------------------------+------------------------------------------+-----------------+----------------------+---------------------------------------+
| ControllerNodesPostDeployment         | 7a26168f-6a06-4f0f-bee1-170e1efb4967 | OS::TripleO::ControllerPostDeployment    | UPDATE_FAILED   | 2016-08-04T17:43:14Z |                                       |
| ControllerPostPuppet                  | 7db75a4c-4d71-428f-a014-2e5ba3403839 | OS::TripleO::Tasks::ControllerPostPuppet | UPDATE_FAILED   | 2016-08-04T18:03:01Z | ControllerNodesPostDeployment         |
| ControllerPostPuppetRestartDeployment | 61534b6e-f54f-425c-b307-ddcede267abf | OS::Heat::SoftwareDeployments            | UPDATE_FAILED   | 2016-08-04T18:04:23Z | ControllerPostPuppet                  |
| 0                                     | bc9d8d72-9c7a-4c2d-93e9-a74502be60c0 | OS::Heat::SoftwareDeployment             | UPDATE_FAILED   | 2016-08-04T18:04:27Z | ControllerPostPuppetRestartDeployment |
+---------------------------------------+--------------------------------------+------------------------------------------+-----------------+----------------------+---------------------------------------+

++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped'
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'openstack-keystone has stopped'
+ return
+ pcs status
+ grep haproxy-clone
+ pcs resource restart haproxy-clone
+ pcs resource restart redis-master
+ pcs resource restart mongod-clone
+ pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired
Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped
Waiting for 1 resources to stop: cirros-0.3.4-x86_64-disk.img deploy_command deploy-ramdisk-ironic.initramfs deploy-ramdisk-ironic.kernel deploy-ramdisk-ironic.tar discovery-ramdisk.initramfs discovery-ramdisk.kernel discovery-ramdisk.tar instackenv.json keystonerc_master netiso-virt-7.x.tgz network-environment.yaml nic-configs oskey.priv overcloud-env.json overcloud-full.initrd overcloud-full.qcow2 overcloud-full.tar overcloud-full.vmlinuz overcloudrc rhel-guest-image-7.2-20160302.0.x86_64.qcow2 rhos-qe-core-installer stackrc tempest-deployer-input.conf tripleo-overcloud-passwords undercloud.conf undercloud-passwords.conf rabbitmq-clone
cirros-0.3.4-x86_64-disk.img deploy_command deploy-ramdisk-ironic.initramfs deploy-ramdisk-ironic.kernel deploy-ramdisk-ironic.tar discovery-ramdisk.initramfs discovery-ramdisk.kernel discovery-ramdisk.tar instackenv.json keystonerc_master netiso-virt-7.x.tgz network-environment.yaml nic-configs oskey.priv overcloud-env.json overcloud-full.initrd overcloud-full.qcow2 overcloud-full.tar overcloud-full.vmlinuz overcloudrc rhel-guest-image-7.2-20160302.0.x86_64.qcow2 rhos-qe-core-installer stackrc tempest-deployer-input.conf tripleo-overcloud-passwords undercloud.conf undercloud-passwords.conf rabbitmq-clone
Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role name=target-role
", "deploy_status_code": 1 }, "creation_time": "2016-08-04T16:20:38Z", "updated_time": "2016-08-04T18:09:24Z", "input_values": {}, "action": "UPDATE", "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 1", "id": "bc9d8d72-9c7a-4c2d-93e9-a74502be60c0" }