Created attachment 1098926 [details]
control0 sosreport

Description of problem:

I see an issue replacing the pacemaker CIB during an update from a running 7.1 overcloud to 7.2/latest. Control1 is updated successfully, then compute0 is also updated, and then control0 seems to hang. Poking at the logs I see this:

Nov 25 10:15:30 overcloud-controller-0.localdomain os-collect-config[1929]: Finished yum_update.sh on server 6817db4a-ba0b-4809-9071-cc6414a5adac at Wed Nov 25 10:15:30 EST 2015
Nov 25 10:15:30 overcloud-controller-0.localdomain os-collect-config[1929]: [2015-11-25 10:15:30,353] (heat-config) [DEBUG] Error: unable to push cib
Nov 25 10:15:30 overcloud-controller-0.localdomain os-collect-config[1929]: Call cib_replace failed (-205): Update was older than existing configuration

The changes we expect yum_update.sh to have made to the constraints are not there at this point, i.e. the neutron-server to neutron-ovs-cleanup ordering constraint is still present (yum_update.sh should have removed it). The constraints *were* correctly removed after control1 was updated (the first controller in the update sequence), and then, as expected, re-added during the control0 puppet/pacemaker run. Then this error occurs, so they are ultimately never removed:

[root@overcloud-controller-2 heat-admin]# pcs constraint order show | grep neutron
...
start neutron-server-clone then start neutron-openvswitch-agent-clone (kind:Mandatory)
start neutron-server-clone then start neutron-ovs-cleanup-clone (kind:Mandatory)

Attached sosreports from control0, control1 and compute0. At this point I mostly want to hear whether anyone else has seen/tested this on updates. More details and context below.
Starting with a running 7.1 overcloud, without network isolation, originally deployed like:

openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --libvirt-type qemu --ntp-server "0.fedora.pool.ntp.org" -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml

Launched a tenant VM on the overcloud and started pinging it, all running fine. Enabled repos on the undercloud and ran a yum update, like:

sudo rhos-release 7-director -d ; sudo rhos-release 7 -d
sudo yum -y update

This has updated various things over the last few days; in the heat templates specifically we've added a lot to the yum_update.sh file. Wrt versions, my undercloud currently has (heat templates and heat have landed the related fixes, afair):

[stack@instack ~]$ rpm -qa | grep heat
openstack-heat-api-cloudwatch-2015.1.2-2.el7ost.noarch
openstack-heat-api-2015.1.2-2.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-83.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
openstack-heat-engine-2015.1.2-2.el7ost.noarch
openstack-heat-templates-0-0.7.20150605git.el7ost.noarch
openstack-heat-common-2015.1.2-2.el7ost.noarch
openstack-heat-api-cfn-2015.1.2-2.el7ost.noarch
heat-cfntools-1.2.8-2.el7.noarch

After updating the undercloud, I installed rhos-release and set up the repos on all overcloud nodes, and copied the 55-heat-config script, like:

for i in $(nova list | grep ctlplane | awk -F' ' '{ print $12 }' | awk -F'=' '{ print $2 }'); do ssh heat-admin@$i sudo yum localinstall -y http://rhos-release.virt.bos.redhat.com/repos/rhos-release/rhos-release-latest.noarch.rpm ; done

for i in $(nova list | grep ctlplane | awk -F' ' '{ print $12 }' | awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "sudo rhos-release 7-director -d ; sudo rhos-release 7;" ; done

for i in $(nova list | grep ctlplane | awk -F' ' '{ print $12 }' | awk -F'=' '{ print $2 }'); do scp /usr/share/openstack-heat-templates/software-config/elements/heat-config/os-refresh-config/configure.d/55-heat-config heat-admin@$i: ; ssh heat-admin@$i 'sudo /bin/bash -c "cp /home/heat-admin/55-heat-config /usr/libexec/os-refresh-config/configure.d/55-heat-config"' ; done

Started an update, like:

openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e update.yaml

Control1 starts and completes its update. Note that the tenant router was on control1, and while it was being updated the router moved to control0. The failover was within 10 seconds, like:

2015-11-25T14:43:35+0000 OK
2015-11-25T14:43:36+0000 UNREACHABLE
2015-11-25T14:43:40+0000 UNREACHABLE
2015-11-25T14:43:44+0000 OK    <-- migration to control0

Control1 is updated successfully, then compute0 is also updated. Control0 starts to get updated, and the router fails back over to control1... but then control0 seems to hang. Poking at the logs I see this (full logs attached):

Nov 25 10:15:30 overcloud-controller-0.localdomain os-collect-config[1929]: Finished yum_update.sh on server 6817db4a-ba0b-4809-9071-cc6414a5adac at Wed Nov 25 10:15:30 EST 2015
Nov 25 10:15:30 overcloud-controller-0.localdomain os-collect-config[1929]: [2015-11-25 10:15:30,353] (heat-config) [DEBUG] Error: unable to push cib
Nov 25 10:15:30 overcloud-controller-0.localdomain os-collect-config[1929]: Call cib_replace failed (-205): Update was older than existing configuration

thanks, marios

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Created attachment 1098928 [details] control1 sosreport
Created attachment 1098929 [details] compute0 sosreport
In my environment, I see that once UpdateDeployment has been signaled complete back to Heat on overcloud-controller-0, the update client moves on to the next breakpoint, which you can then clear. I then cleared the breakpoint on overcloud-controller-1, so it started updating. But back on overcloud-controller-0, I see that the original os-refresh-config process is still running and is reapplying all the old puppet deployments (due to bug 1278181).

Therefore, there exists a possible race condition across the cluster: yum_update.sh on overcloud-controller-1 could create a cib file and modify it, and before it gets the chance to load it back into pacemaker, overcloud-controller-0 could make a change to the cluster constraints by applying the old puppet manifests. When overcloud-controller-1 then tries to load the modified cib file, you'd get the error shown in the bugzilla.

I see 2 possible fixes to this situation (there could be others):

(a) make yum_update.sh smart enough to account for the race condition. If we fail to load the modified cib due to it being an older version, we sleep/retry/backoff for a few attempts. After some set number of attempts, we'd have to give up and fail for real.

(b) more thoroughly fix bug 1278181. Somehow make 55-heat-config not retrigger the deployments if /var/run is empty. Or, populate /var/run/heat-config with some empty deployed json files based on the deployments already downloaded to /var/lib/os-collect-config. There were some ideas about this in the upstream bug, https://bugs.launchpad.net/heat-templates/+bug/1513220
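For option (a), a minimal sketch of what the retry/backoff wrapper could look like. This is not code from yum_update.sh; the function name retry_with_backoff and the delay values are illustrative, and the real cib-push command is only shown in the usage comment:

```shell
# Hypothetical sketch of option (a): retry a failing command with
# exponential backoff, giving up for real after max_attempts tries.
retry_with_backoff() {
    local max_attempts=$1; shift
    local delay=1
    local attempt=1
    while true; do
        if "$@"; then
            return 0            # command succeeded, we're done
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1            # exhausted retries, fail for real
        fi
        sleep "$delay"
        delay=$((delay * 2))    # back off: 1s, 2s, 4s, ...
        attempt=$((attempt + 1))
    done
}

# In yum_update.sh this could wrap the push, e.g.:
#   retry_with_backoff 5 pcs cluster cib-push "$pacemaker_dumpfile"
```

Note that a retry alone is not quite enough here: if the push failed with "Update was older than existing configuration", the dumpfile would also need to be re-dumped and re-modified before each retry, otherwise the stale copy would keep being rejected.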
Other than the observed behavior, my update from 7.0 (started with no /var/run/heat-config present on any node) actually completed fine. So I suspect this is a transient race condition, or something specific to updating from 7.1.
It has been reported by others that they updated from 7.1 to 7.2 without issue, FYI... so this may be an environment issue. The context is here if we find it is indeed a race.
*** Bug 1287804 has been marked as a duplicate of this bug. ***
Hi Sasha, for verification: on a good setup, you should be able to see the pacemaker config (cib) being applied/pushed correctly, in particular the response from "pcs cluster cib-push $pacemaker_dumpfile", like:

Nov 25 09:54:09 overcloud-controller-1 os-collect-config: Applying new Pacemaker config
Nov 25 09:54:09 overcloud-controller-1 os-collect-config: CIB updated

On a bad setup, you won't see the above but rather "Error: unable to push cib", like:

Nov 25 10:15:30 overcloud-controller-0 os-collect-config: Finished yum_update.sh on server 6817db4a-ba0b-4809-9071-cc6414a5adac at Wed Nov 25 10:15:30 EST 2015
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,353] (heat-config) [DEBUG] Error: unable to push cib
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: Call cib_replace failed (-205): Update was older than existing configuration

hope that helps.
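The verification above boils down to grepping the journal for the two marker strings. A small sketch of that check (the function name check_cib_push is mine, not part of any tooling; it reads a captured log file so the pattern is visible, and the journalctl invocation in the comment is only one way to produce such a file):

```shell
# Hypothetical helper: classify a captured os-collect-config log as
# good (CIB pushed), bad (push failed), or inconclusive.
check_cib_push() {
    local log="$1"
    if grep -q "Error: unable to push cib" "$log"; then
        echo "BAD: cib push failed"
        return 1
    elif grep -q "CIB updated" "$log"; then
        echo "GOOD: cib pushed"
        return 0
    fi
    echo "UNKNOWN: no cib-push messages found"
    return 2
}

# On a controller, something like:
#   journalctl > /tmp/occ.log && check_cib_push /tmp/occ.log
```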
FailedQA.

Environment:
openstack-tripleo-heat-templates-0.8.6-87.el7ost.noarch

Updated the 7.1 setup to 7.2. Logged into a controller:

[root@overcloud-controller-0 ~]# pacemaker_dumpfile=`mktemp`
[root@overcloud-controller-0 ~]# echo $pacemaker_dumpfile
/tmp/tmp.9lwOrQ0i9K
[root@overcloud-controller-0 ~]# pcs cluster cib-push $pacemaker_dumpfile
Error: unable to parse new cib: no element found: line 1, column 0
Hi Sasha, I don't think it should fail QA for that, though... for one, you are pushing an empty pacemaker_dumpfile here, so the push fails for that reason. The test/fix wasn't that a cib update *per se* works, but rather that it is done correctly by us during an update. The correct way to do it is to write the cib to a file, modify the file, and update the cib all at once, which is why in yum_update.sh we run the pcs commands with -f:

https://github.com/openstack/tripleo-heat-templates/blob/2674efae84f6ba808fbaa5f0150825e42a86ba59/extraconfig/tasks/yum_update.sh#L83

The fix that Giulio pushed at https://review.openstack.org/#/c/249636/1/extraconfig/tasks/yum_update.sh makes sure this also happens for mongo; we suspect it may be the cause of the original report here.

So, to check whether this bug is fixed/occurring: on a good/fixed setup, you should be able to see the pacemaker config (cib) being applied/pushed correctly, in particular the response from "pcs cluster cib-push $pacemaker_dumpfile", like:

Nov 25 09:54:09 overcloud-controller-1 os-collect-config: Applying new Pacemaker config
Nov 25 09:54:09 overcloud-controller-1 os-collect-config: CIB updated

On a bad setup, you won't see the above but rather "Error: unable to push cib", like:

Nov 25 10:15:30 overcloud-controller-0 os-collect-config: Finished yum_update.sh on server 6817db4a-ba0b-4809-9071-cc6414a5adac at Wed Nov 25 10:15:30 EST 2015
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,353] (heat-config) [DEBUG] Error: unable to push cib
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: Call cib_replace failed (-205): Update was older than existing configuration
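For reference, a minimal sketch of the dump/modify/push pattern yum_update.sh follows (the real code is at the github link above; the constraint id used here is illustrative, not necessarily what exists on a given cluster). The point is that all edits happen offline against the dumpfile with "pcs -f", and the cluster sees exactly one cib_replace at the end, so a concurrent change surfaces as a single, detectable push error rather than partially applied edits:

```shell
# Sketch of the "edit offline, push once" pattern; needs a live
# pacemaker cluster, so this is not runnable standalone.
pacemaker_dumpfile=$(mktemp)

# 1. Dump the current CIB to a file.
pcs cluster cib "$pacemaker_dumpfile"

# 2. Make all edits against the file only (-f), not the live cluster.
#    (constraint id below is an example, check "pcs constraint --full")
pcs -f "$pacemaker_dumpfile" constraint remove \
    order-neutron-server-clone-neutron-ovs-cleanup-clone-mandatory

# 3. Push the whole modified CIB back in one cib_replace.
pcs cluster cib-push "$pacemaker_dumpfile"
```

This is also why the manual test in the previous comment fails: step 3 was run against a freshly mktemp'd empty file, skipping steps 1 and 2, so there was no cib to parse.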
Verified.

Environment:
openstack-tripleo-heat-templates-0.8.6-87.el7ost.noarch

Verifying based on the fact that no errors like "Error: unable to push cib" are shown in journalctl.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2650