Bug 1364241
| Summary: | rhel-osp-director: Update 7.2->7.3Async fails: Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining Error performing operation: Timer expired | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha> | ||||
| Component: | rhosp-director | Assignee: | Michele Baldessari <michele> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Omri Hochman <ohochman> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 7.0 (Kilo) | CC: | abeekhof, aschultz, dbecker, jslagle, mburns, michele, morazi, rhel-osp-director-maint, rscarazz, sasha, ushkalim | ||||
| Target Milestone: | async | Keywords: | Regression, Triaged | ||||
| Target Release: | 7.0 (Kilo) | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2018-07-20 08:18:36 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1344794 | ||||||
| Attachments: |
Rabbit shutdown on controller-1 is taking ~44 seconds. Can this be host load related?

There are a few stop/start cycles, but I think you're talking about this one, right?

Aug 04 14:08:54 [31393] overcloud-controller-1.localdomain crmd: info: do_lrm_rsc_op: Performing key=124:48:0:31a2b3c6-bcce-4a1c-b9e3-5eefa5a12b25 op=rabbitmq_stop_0
Aug 04 14:09:56 [31393] overcloud-controller-1.localdomain crmd: notice: process_lrm_event: Operation rabbitmq_stop_0: ok (node=overcloud-controller-1, call=476, rc=0, cib-update=271, confirmed=true)

That is closer to 62s. Regardless, as far as the cluster is concerned, everything was fine. Which timeout was not satisfied, and what is it set to? It's not clear from the description. It seems to me it should really be the start timeout + stop timeout + a second or so of buffer.

Also, I think the clock on overcloud-controller-0.localdomain is off by a couple of hours, which makes log analysis rather challenging.

Created attachment 1187792 [details]
journalctl errors
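For this kind of log spelunking, the elapsed time between two crmd entries can be computed directly; a minimal sketch using GNU date, with the two timestamps copied from the log lines quoted above (the date is filled in since both entries are from the same day):

```shell
# Elapsed time between the rabbitmq_stop_0 dispatch and its confirmation,
# using the two crmd timestamps quoted above.
start=$(date -u -d '2016-08-04 14:08:54' +%s)
stop=$(date -u -d '2016-08-04 14:09:56' +%s)
echo "rabbitmq_stop_0 took $((stop - start))s"   # prints: rabbitmq_stop_0 took 62s
```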
Just trying to understand what's going on here. From these errors I see a 'failed to push cib' on controller-2, as well as "Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining" on controller-0 and "Error: {not_a_cluster_node,"The node selected is not in the cluster." on controller-1.
As Andrew says in comment #4, the clock skew makes it hard to understand what is going on here. I poked some more; one note of possible interest is from controller-0:

Aug 4 14:09:24 overcloud-controller-0 os-collect-config: [2016-08-04 18:09:24,766] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/4ad7e76b-bd9a-4cde-afa8-321f0c87d04d. [1]

I confirmed that this file ^^^ is extraconfig/tasks/pacemaker_resource_restart.sh, but that is as far as I got. It could be a legitimate timeout bringing a service down and up on oversubscribed hardware (rabbit in this case), or it could be a regression. Is this reproducible on multiple setups @sasha, or just once so far?

This one is a bit concerning:

34156:Aug 04 17:40:30 overcloud-controller-2.localdomain os-collect-config[3603]: Error: unable to push cib

but I don't know what os-collect-config could possibly be trying to push into the cib.

(In reply to Andrew Beekhof from comment #7)
> This one is a bit concerning:
>
> 34156:Aug 04 17:40:30 overcloud-controller-2.localdomain
> os-collect-config[3603]: Error: unable to push cib
>
> but i don't know what os-collect-config could possibly be trying to push
> into the cib

o/ Andrew, fyi: on current master this is happening in pacemaker_migrations.sh:
https://github.com/openstack/tripleo-heat-templates/blob/292fdf87e0fdcbd66664afc4c463f2f0e9a354fa/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh#L92

However, the environment this bug is for is rhos7, so looking at the rhos-7.0-patches branch to confirm: this was during yum_update.sh, which is what is being executed here (for the update of 7.2 -> 7.3) since it is rhos 7. Downstream link to the relevant code:
https://code.engineering.redhat.com/gerrit/gitweb?p=openstack-tripleo-heat-templates.git;a=blob;f=extraconfig/tasks/yum_update.sh;h=98ad25c764fc9948567253444d882d45a6b4b31d;hb=refs/heads/rhos-7.0-patches#l136

@sasha did you manage to duplicate this on other hardware?
I get the impression from comment #3 and some brief mentions on IRC that this might be a timeout because of insufficient resources, and that you were going to try and see if it happens again, especially elsewhere?

thanks, marios

(In reply to marios from comment #8)
> (In reply to Andrew Beekhof from comment #7)
> > This one is a bit concerning:
> >
> > 34156:Aug 04 17:40:30 overcloud-controller-2.localdomain
> > os-collect-config[3603]: Error: unable to push cib
> >
> > but i don't know what os-collect-config could possibly be trying to push
> > into the cib
>
> o/ Andrew fyi: on current master this is happening in the
> pacemaker_migrations.sh
> https://github.com/openstack/tripleo-heat-templates/blob/292fdf87e0fdcbd66664afc4c463f2f0e9a354fa/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh#L92

Oh, I didn't realize those actions would be performed under the os-collect-config tag.

> However, the environment this bug is for is rhos7 so looking at the
> rhos-7.0-patches branch to confirm, this was during the yum_update.sh which
> is what is being executed here (for update of 7.2 -> 7.3) since it is rhos 7
> downstream link to the relevant code
> https://code.engineering.redhat.com/gerrit/gitweb?p=openstack-tripleo-heat-templates.git;a=blob;f=extraconfig/tasks/yum_update.sh;h=98ad25c764fc9948567253444d882d45a6b4b31d;hb=refs/heads/rhos-7.0-patches#l136

That is a big problem. It means none of the cluster updates went in.

I just reproduced this bug on an attempt to update 8 to latest.

Environment:
instack-undercloud-2.2.7-7.el7ost.noarch
openstack-puppet-modules-7.1.3-1.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-18.el7ost.noarch
openstack-tripleo-heat-templates-0.8.14-18.el7ost.noarch

No clock skew:
[stack@director ~]$ for i in `nova list|awk -F'|' '/Running/ {print $(NF-1)}'|awk -F"=" '{print $NF}'`; do ssh heat-admin@$i "hostname; date"; done
overcloud-cephstorage-0.localdomain
Wed Oct 12 17:57:47 UTC 2016
overcloud-cephstorage-1.localdomain
Wed Oct 12 17:57:47 UTC 2016
overcloud-cephstorage-2.localdomain
Wed Oct 12 17:57:47 UTC 2016
overcloud-compute-0.localdomain
Wed Oct 12 17:57:47 UTC 2016
overcloud-compute-1.localdomain
Wed Oct 12 17:57:48 UTC 2016
overcloud-compute-2.localdomain
Wed Oct 12 17:57:48 UTC 2016
overcloud-controller-0.localdomain
Wed Oct 12 17:57:48 UTC 2016
overcloud-controller-1.localdomain
Wed Oct 12 17:57:49 UTC 2016
overcloud-controller-2.localdomain
Wed Oct 12 17:57:49 UTC 2016
The system was idle after deploy (5 days ago), didn't run anything on it.

(In reply to Alexander Chuzhoy from comment #10)
> I just reproduced this bug on attempt to update 8 to latest.
>
> Environment:
> instack-undercloud-2.2.7-7.el7ost.noarch
> openstack-puppet-modules-7.1.3-1.el7ost.noarch
> openstack-tripleo-heat-templates-kilo-0.8.14-18.el7ost.noarch
> openstack-tripleo-heat-templates-0.8.14-18.el7ost.noarch

Sasha, can I get access to the system or sosreports from the controllers? I wonder if this is simply a matter of increasing the timeouts of rabbitmq (aka https://bugzilla.redhat.com/show_bug.cgi?id=1378391). If you can reproduce it at will, can we sync up online and try a quick patch?

Sasha,
can you try the following patch on the undercloud before launching the update command? (The patch needs to be applied in /usr/share/openstack-tripleo-heat-templates):
diff -up tripleo-heat-templates-0.8.14/extraconfig/tasks/yum_update.sh.orig tripleo-heat-templates-0.8.14/extraconfig/tasks/yum_update.sh
--- tripleo-heat-templates-0.8.14/extraconfig/tasks/yum_update.sh.orig 2016-10-13 20:09:48.607711524 +0200
+++ tripleo-heat-templates-0.8.14/extraconfig/tasks/yum_update.sh 2016-10-13 20:10:54.237854605 +0200
@@ -130,6 +130,7 @@ openstack-nova-scheduler"
echo "Making sure rabbitmq has the notify=true meta parameter"
pcs -f $pacemaker_dumpfile resource update rabbitmq meta notify=true
+ pcs -f $pacemaker_dumpfile resource update rabbitmq op start timeout=200s stop timeout=200s
echo "Applying new Pacemaker config"
if ! pcs cluster cib-push $pacemaker_dumpfile; then
My analysis so far of the reports I got about this came to the conclusion that rabbitmq simply needs more time. The reliably broken deployment I got my hands on was fixed with this. Let me know if you need help applying/testing.
cheers,
Michele
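If the patch file itself is not handy, the same one-line change can be applied with sed. This is only a sketch: the anchor line and the added pcs command are taken verbatim from the diff above, and a small sample file stands in for the real target, /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/yum_update.sh.

```shell
# Sample standing in for the relevant hunk of yum_update.sh
# (lines copied from the diff context above).
cat > /tmp/yum_update.sh <<'EOF'
    echo "Making sure rabbitmq has the notify=true meta parameter"
    pcs -f $pacemaker_dumpfile resource update rabbitmq meta notify=true
    echo "Applying new Pacemaker config"
EOF

# Append the op-timeout line right after the notify=true update
# (GNU sed one-line 'a' form; re-indent the inserted line to taste).
sed -i '/resource update rabbitmq meta notify=true/a pcs -f $pacemaker_dumpfile resource update rabbitmq op start timeout=200s stop timeout=200s' /tmp/yum_update.sh

grep -n 'timeout=200s' /tmp/yum_update.sh
```

Against the real file, run the same sed with the full template path instead of /tmp/yum_update.sh before launching the update.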
Michele,
Can I apply the patch on the setup with the failed update, or should I start from scratch at this point?

Hi Sasha,
it would be best to launch it from scratch at this point. I haven't audited the code to see what happens if you launch it against a failed deployment. The other option would be to bring up the cluster by hand and clean up any failed actions, and after that relaunch the update; that *should* work as well, although I have not tested it.

*** Bug 1384068 has been marked as a duplicate of this bug. ***
rhel-osp-director: Update 7.2->7.3Async fails: Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining Error performing operation: Timer expired

Environment:
openstack-tripleo-heat-templates-0.8.6-128.el7ost.noarch
openstack-puppet-modules-2015.1.8-51.el7ost.noarch
instack-undercloud-2.1.2-39.el7ost.noarch

Steps to reproduce:
1. Deploy 7.2GA with:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server clock.redhat.com --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml --ceph-storage-scale 1
2. Populate the setup.
3. Try to update to 7.3Async.

Result:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
FAILED
update finished with status FAILED

[stack@instack ~]$ heat resource-list -n5 overcloud|grep -v COMPLE
You must provide a username via either --os-username or env[OS_USERNAME] or a token via --os-auth-token or env[OS_AUTH_TOKEN]
[stack@instack ~]$ . stackrc
[stack@instack ~]$ heat resource-list -n5 overcloud|grep -v COMPLE
+---------------------------------------+--------------------------------------+------------------------------------------+-----------------+----------------------+---------------------------------------+
| resource_name                         | physical_resource_id                 | resource_type                            | resource_status | updated_time         | parent_resource                       |
+---------------------------------------+--------------------------------------+------------------------------------------+-----------------+----------------------+---------------------------------------+
| ControllerNodesPostDeployment         | 7a26168f-6a06-4f0f-bee1-170e1efb4967 | OS::TripleO::ControllerPostDeployment    | UPDATE_FAILED   | 2016-08-04T17:43:14Z |                                       |
| ControllerPostPuppet                  | 7db75a4c-4d71-428f-a014-2e5ba3403839 | OS::TripleO::Tasks::ControllerPostPuppet | UPDATE_FAILED   | 2016-08-04T18:03:01Z | ControllerNodesPostDeployment         |
| ControllerPostPuppetRestartDeployment | 61534b6e-f54f-425c-b307-ddcede267abf | OS::Heat::SoftwareDeployments            | UPDATE_FAILED   | 2016-08-04T18:04:23Z | ControllerPostPuppet                  |
| 0                                     | bc9d8d72-9c7a-4c2d-93e9-a74502be60c0 | OS::Heat::SoftwareDeployment             | UPDATE_FAILED   | 2016-08-04T18:04:27Z | ControllerPostPuppetRestartDeployment |
+---------------------------------------+--------------------------------------+------------------------------------------+-----------------+----------------------+---------------------------------------+

++ grep -v Clone
+ node_states=' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped'
+ echo ' openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped
 openstack-keystone (systemd:openstack-keystone): (target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'openstack-keystone has stopped'
+ return
+ pcs status
+ grep haproxy-clone
+ pcs resource restart haproxy-clone
+ pcs resource restart redis-master
+ pcs resource restart mongod-clone
+ pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired
Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped
Waiting for 1 resources to stop: cirros-0.3.4-x86_64-disk.img deploy_command deploy-ramdisk-ironic.initramfs deploy-ramdisk-ironic.kernel deploy-ramdisk-ironic.tar discovery-ramdisk.initramfs discovery-ramdisk.kernel discovery-ramdisk.tar instackenv.json keystonerc_master netiso-virt-7.x.tgz network-environment.yaml nic-configs oskey.priv overcloud-env.json overcloud-full.initrd overcloud-full.qcow2 overcloud-full.tar overcloud-full.vmlinuz overcloudrc rhel-guest-image-7.2-20160302.0.x86_64.qcow2 rhos-qe-core-installer stackrc tempest-deployer-input.conf tripleo-overcloud-passwords undercloud.conf undercloud-passwords.conf rabbitmq-clone
cirros-0.3.4-x86_64-disk.img deploy_command deploy-ramdisk-ironic.initramfs deploy-ramdisk-ironic.kernel deploy-ramdisk-ironic.tar discovery-ramdisk.initramfs discovery-ramdisk.kernel discovery-ramdisk.tar instackenv.json keystonerc_master netiso-virt-7.x.tgz network-environment.yaml nic-configs oskey.priv overcloud-env.json overcloud-full.initrd overcloud-full.qcow2 overcloud-full.tar overcloud-full.vmlinuz overcloudrc rhel-guest-image-7.2-20160302.0.x86_64.qcow2 rhos-qe-core-installer stackrc tempest-deployer-input.conf tripleo-overcloud-passwords undercloud.conf undercloud-passwords.conf rabbitmq-clone
Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role name=target-role
", "deploy_status_code": 1 }, "creation_time": "2016-08-04T16:20:38Z", "updated_time": "2016-08-04T18:09:24Z", "input_values": {}, "action": "UPDATE", "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 1", "id": "bc9d8d72-9c7a-4c2d-93e9-a74502be60c0" }