Bug 1285485
| Summary: | issue replacing the pacemaker cib during an update from a running 7.1 overcloud to 7.2/latest | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Marios Andreou <mandreou> | ||||||||
| Component: | openstack-tripleo-heat-templates | Assignee: | Giulio Fidente <gfidente> | ||||||||
| Status: | CLOSED ERRATA | QA Contact: | Alexander Chuzhoy <sasha> | ||||||||
| Severity: | unspecified | Docs Contact: | |||||||||
| Priority: | unspecified | ||||||||||
| Version: | 7.0 (Kilo) | CC: | dmacpher, dnavale, ebarrera, gfidente, jcoufal, jslagle, jstransk, mbultel, mburns, mcornea, rhel-osp-director-maint, sasha | ||||||||
| Target Milestone: | y2 | ||||||||||
| Target Release: | 7.0 (Kilo) | ||||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | openstack-tripleo-heat-templates-0.8.6-86.el7ost | Doc Type: | Bug Fix | ||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | Environment: | ||||||||||
| Last Closed: | 2015-12-21 16:53:03 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Attachments: |
Description
Marios Andreou
2015-11-25 17:35:51 UTC
Created attachment 1098928 [details]
control1 sosreport
Created attachment 1098929 [details]
compute0 sosreport
In my environment, I see that once UpdateDeployment has been signaled complete back to Heat on overcloud-controller-0, the update client moves on to the next breakpoint, which you can then clear. I then cleared the breakpoint on overcloud-controller-1, so it started updating. But back on overcloud-controller-0, I see that the original os-refresh-config process is still running and is reapplying all the old puppet deployments (due to bug 1278181).

Therefore, a race condition is possible across the cluster: yum_update.sh on overcloud-controller-1 could create a cib file and modify it, and before it gets the chance to load it back into pacemaker, overcloud-controller-0 could change the cluster constraints by applying the old puppet manifests. When overcloud-controller-1 then tries to load the modified cib file, you'd get the error shown in the bugzilla.

I see 2 possible fixes to this situation; there could be others:

(a) Make yum_update.sh smart enough to account for the race condition. If we fail to load the modified cib because it is an older version, we sleep/retry/back off for a few attempts. After some set number of attempts, we'd have to give up and fail for real.

(b) More thoroughly fix bug 1278181. Somehow make 55-heat-config not retrigger the deployments if /var/run is empty. Or populate /var/run/heat-config with some empty deployed json files based on the deployments already downloaded to /var/lib/os-collect-config. There were some ideas about this in the upstream bug, https://bugs.launchpad.net/heat-templates/+bug/1513220

Other than the observed behavior, my update from 7.0 (started with no /var/run/heat-config present on any node) actually completed fine. So I suspect this is a transient race condition or something specific to updating from 7.1.

It has been reported by others that they updated from 7.1-->7.2 without issue, FYI... so this may be an environment issue...
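Option (a) above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the fix that was merged: `retry_with_backoff` and `update_cib_once` are hypothetical names, and the attempt count and delays are arbitrary. One point the sketch tries to make explicit is that each attempt dumps a fresh cib before editing and pushing it, since pushing the same stale file again would presumably be rejected again as older than the live configuration.

```shell
#!/bin/bash
# Hypothetical helper (not part of yum_update.sh): run a command until it
# succeeds, doubling the sleep between attempts, and give up after
# max_attempts tries.
retry_with_backoff() {
    local max_attempts=$1
    local delay=$2
    shift 2
    local attempt=1
    while true; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}

# Hypothetical wrapper: one full dump/edit/push cycle against a fresh cib.
update_cib_once() {
    local dumpfile rc=0
    dumpfile=$(mktemp)
    # Dump the live cib; abort this attempt if the dump itself fails.
    pcs cluster cib "$dumpfile" || { rm -f "$dumpfile"; return 1; }
    # Offline edits would go here, e.g. pcs -f "$dumpfile" <changes>.
    pcs cluster cib-push "$dumpfile" || rc=$?
    rm -f "$dumpfile"
    return "$rc"
}

# Example call (assumed values: 5 attempts, 2 second initial delay):
# retry_with_backoff 5 2 update_cib_once
```

Retrying only narrows the window; it does not stop the other controller from changing the cluster constraints between the dump and the push.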
The context is here in case we find it is indeed a race.

*** Bug 1287804 has been marked as a duplicate of this bug. ***

Hi Sasha, for verification:

On a good setup, you should be able to see the pacemaker config (cib) being applied/pushed correctly, in particular the response from the "pcs cluster cib-push $pacemaker_dumpfile", like:

Nov 25 09:54:09 overcloud-controller-1 os-collect-config: Applying new Pacemaker config
Nov 25 09:54:09 overcloud-controller-1 os-collect-config: CIB updated

On a bad setup, you won't see this ^^^ but rather "Error: unable to push cib", like:

Nov 25 10:15:30 overcloud-controller-0 os-collect-config: Finished yum_update.sh on server 6817db4a-ba0b-4809-9071-cc6414a5adac at Wed Nov 25 10:15:30 EST 2015
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,353] (heat-config) [DEBUG] Error: unable to push cib
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: Call cib_replace failed (-205): Update was older than existing configuration

Hope that helps.

FailedQA.

Environment: openstack-tripleo-heat-templates-0.8.6-87.el7ost.noarch

Updated the 7.1 setup to 7.2. Logged into a controller:

[root@overcloud-controller-0 ~]# pacemaker_dumpfile=`mktemp`
[root@overcloud-controller-0 ~]# echo $pacemaker_dumpfile
/tmp/tmp.9lwOrQ0i9K
[root@overcloud-controller-0 ~]# pcs cluster cib-push $pacemaker_dumpfile
Error: unable to parse new cib: no element found: line 1, column 0

Hi Sasha, I don't think it should fail QA for that though. For one, you are using an empty pacemaker_dumpfile here, so the push fails for that reason alone. The test/fix wasn't that a cib update *per se* works, but rather that it was done correctly by us during an update. The correct way to do it is to write to a file and update the cib all at once...
which is why in yum_update.sh we do the pcs commands with -f: https://github.com/openstack/tripleo-heat-templates/blob/2674efae84f6ba808fbaa5f0150825e42a86ba59/extraconfig/tasks/yum_update.sh#L83

The fix that Giulio pushed at https://review.openstack.org/#/c/249636/1/extraconfig/tasks/yum_update.sh makes sure this also happens for mongo... we suspect it may be the cause of the original report here.

So, to check whether this bug is fixed or still occurring:

On a good/fixed setup, you should be able to see the pacemaker config (cib) being applied/pushed correctly, in particular the response from the "pcs cluster cib-push $pacemaker_dumpfile", like:

Nov 25 09:54:09 overcloud-controller-1 os-collect-config: Applying new Pacemaker config
Nov 25 09:54:09 overcloud-controller-1 os-collect-config: CIB updated

On a bad setup, you won't see this ^^^ but rather "Error: unable to push cib", like:

Nov 25 10:15:30 overcloud-controller-0 os-collect-config: Finished yum_update.sh on server 6817db4a-ba0b-4809-9071-cc6414a5adac at Wed Nov 25 10:15:30 EST 2015
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,353] (heat-config) [DEBUG] Error: unable to push cib
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: Call cib_replace failed (-205): Update was older than existing configuration

Verified.

Environment: openstack-tripleo-heat-templates-0.8.6-87.el7ost.noarch

Verifying based on the fact that no errors like "Error: unable to push cib" are shown in journalctl.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2650
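The journal check used for verification above could be scripted along these lines. This is a minimal sketch: `check_cib_push` is a hypothetical helper, the two strings it matches are taken from the log excerpts in this report, and the `journalctl -u os-collect-config` invocation in the comment assumes the messages are logged under that unit name.

```shell
#!/bin/bash
# Hypothetical helper: read log text on stdin and report whether the cib
# push succeeded. A rejected push takes precedence over a successful one,
# so a log containing both is reported as a failure.
check_cib_push() {
    local log
    log=$(cat)
    if grep -q "Error: unable to push cib" <<<"$log"; then
        echo "FAIL: cib push was rejected"
        return 1
    elif grep -q "CIB updated" <<<"$log"; then
        echo "OK: cib push succeeded"
        return 0
    fi
    echo "UNKNOWN: no cib push found in log"
    return 2
}

# Example call on a controller (the unit name is an assumption):
# journalctl -u os-collect-config | check_cib_push
```

The third return code separates "no push was logged at all" from a clean push, which matters when the update has not yet reached that node.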