1285485 – issue replacing the pacemaker cib during an update from a running 7.1 overcloud to 7.2/latest

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-heat-templates
Sub Component:
Version:	7.0 (Kilo)
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	y2
Target Release:	7.0 (Kilo)
Assignee:	Giulio Fidente
QA Contact:	Alexander Chuzhoy
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1287804
Depends On:
Blocks:

Reported:	2015-11-25 17:35 UTC by Marios Andreou
Modified:	2020-08-24 05:38 UTC
CC List:	12 users
Fixed In Version:	openstack-tripleo-heat-templates-0.8.6-86.el7ost
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-12-21 16:53:03 UTC
Target Upstream Version:
Embargoed:

Attachments
control0 sosreport (11.87 MB, application/x-xz) 2015-11-25 17:35 UTC, Marios Andreou
control1 sosreport (11.86 MB, application/x-xz) 2015-11-25 17:38 UTC, Marios Andreou
compute0 sosreport (7.63 MB, application/x-xz) 2015-11-25 17:42 UTC, Marios Andreou

Links
System	ID	Private	Priority	Status	Summary	Last Updated
OpenStack gerrit	249636	0	None	MERGED	Apply mongod timeout via cib-push	2021-01-25 17:59:30 UTC
Red Hat Product Errata	RHSA-2015:2650	0	normal	SHIPPED_LIVE	Moderate: Red Hat Enterprise Linux OpenStack Platform 7 director update	2015-12-21 21:44:54 UTC

Description Marios Andreou 2015-11-25 17:35:51 UTC

Created attachment 1098926 [details]
control0 sosreport

Description of problem:


I see an issue replacing the pacemaker cib during an update from a running 7.1 overcloud to 7.2/latest. Control1 is updated successfully, then compute0 is also updated, and then control0 seems to hang. Poking at the logs I see this:

Nov 25 10:15:30 overcloud-controller-0.localdomain os-collect-config[1929]: Finished yum_update.sh on server 6817db4a-ba0b-4809-9071-cc6414a5adac at Wed Nov 25 10:15:30 EST 2015
Nov 25 10:15:30 overcloud-controller-0.localdomain os-collect-config[1929]: [2015-11-25 10:15:30,353] (heat-config) [DEBUG] Error: unable to push cib
Nov 25 10:15:30 overcloud-controller-0.localdomain os-collect-config[1929]: Call cib_replace failed (-205): Update was older than existing configuration

The changes we expect yum_update.sh to have made to the constraints are not present at this point, i.e. the neutron-server to ovs-cleanup constraint is still there (yum_update.sh should have removed it). They *were* correctly removed after control1 (the first controller in the sequence) was updated, and then, as we expected, were re-added during the control0 puppet/pacemaker run. Because of this error they are ultimately never removed.

[root@overcloud-controller-2 heat-admin]# pcs constraint order show | grep neutron
...
  start neutron-server-clone then start neutron-openvswitch-agent-clone (kind:Mandatory)
  start neutron-server-clone then start neutron-ovs-cleanup-clone (kind:Mandatory)

Attached are sosreports from control0, control1 and compute0. At this point I mostly want to hear whether anyone else has seen/tested this on updates. More details and context below.

Starting with a running 7.1 overcloud, without network isolation, originally deployed like:

openstack overcloud deploy --templates --control-scale 3 --compute-scale 1  --libvirt-type qemu --ntp-server "0.fedora.pool.ntp.org" -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml

Launched a tenant VM on the overcloud and started pinging it; all running fine.

Enabled repos on undercloud and ran a yum update, like:

sudo rhos-release 7-director -d  ;
sudo rhos-release 7 -d
sudo yum -y update

This has pulled in various updates over the last few days; in the heat templates specifically, we've added a lot to the yum_update.sh file. With regard to versions, my undercloud currently has the following (both heat templates and heat have landed related fixes, afaicr):

[stack@instack ~]$ rpm -qa | grep heat
openstack-heat-api-cloudwatch-2015.1.2-2.el7ost.noarch
openstack-heat-api-2015.1.2-2.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-83.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
openstack-heat-engine-2015.1.2-2.el7ost.noarch
openstack-heat-templates-0-0.7.20150605git.el7ost.noarch
openstack-heat-common-2015.1.2-2.el7ost.noarch
openstack-heat-api-cfn-2015.1.2-2.el7ost.noarch
heat-cfntools-1.2.8-2.el7.noarch

After updating the undercloud, I installed rhos-release and set up the repos on all overcloud nodes, and copied the 55-heat-config script, like:

    for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i sudo yum localinstall -y http://rhos-release.virt.bos.redhat.com/repos/rhos-release/rhos-release-latest.noarch.rpm ; done

for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "sudo rhos-release 7-director -d ; sudo rhos-release 7;" ; done

for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do scp /usr/share/openstack-heat-templates/software-config/elements/heat-config/os-refresh-config/configure.d/55-heat-config heat-admin@$i: ; ssh heat-admin@$i 'sudo /bin/bash -c "cp /home/heat-admin/55-heat-config /usr/libexec/os-refresh-config/configure.d/55-heat-config"'; done
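The three loops above repeat the same `nova list` pipeline to find each node's control-plane address. As a rough sketch only (the helper name `ctlplane_ips` is hypothetical and not part of the original report), the extraction could be factored out once, assuming the Networks column contains entries of the form `ctlplane=<IPv4 address>`:

```shell
# Sketch: read `nova list` table output on stdin and print one
# control-plane IP per line. Assumes entries look like ctlplane=<IPv4>.
ctlplane_ips() {
    grep -o 'ctlplane=[0-9.]*' | cut -d= -f2
}

# Hypothetical usage on the undercloud:
#   for ip in $(nova list | ctlplane_ips); do
#       ssh "heat-admin@$ip" 'sudo rhos-release 7-director -d ; sudo rhos-release 7'
#   done
```

Matching on the `ctlplane=` token rather than on a fixed awk field number should keep working if a node name or status column changes the field count.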

Started an update, like:

openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e update.yaml

Control1 starts and completes its update. Note that the tenant router was on control1, and while control1 was being updated the router moved to control0. The failover took less than 10 seconds, like:

2015-11-25T14:43:35+0000 OK
2015-11-25T14:43:36+0000 UNREACHABLE
2015-11-25T14:43:40+0000 UNREACHABLE
2015-11-25T14:43:44+0000 OK
migration to control0 ^^^
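The script that produced the reachability log above is not included in this report. A minimal sketch that emits lines in the same shape might look like the following; the function name `log_status` and the `VM_IP` placeholder are assumptions for illustration:

```shell
# Sketch: print a UTC timestamp followed by OK or UNREACHABLE,
# depending on whether the probe command passed as arguments succeeds.
log_status() {
    if "$@" >/dev/null 2>&1; then
        echo "$(date -u +%Y-%m-%dT%H:%M:%S%z) OK"
    else
        echo "$(date -u +%Y-%m-%dT%H:%M:%S%z) UNREACHABLE"
    fi
}

# Hypothetical usage; VM_IP stands in for the tenant VM's address:
#   while true; do log_status ping -c 1 -W 1 "$VM_IP"; sleep 1; done
```

Taking the probe as arguments keeps the logging separate from the check itself, so the same loop could wrap a different probe if ping is not the right test.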

Control1 is updated successfully, then compute0 is also updated. Control0 starts to get updated, and the router fails over back to control1, but then control0 seems to hang. Poking at the logs I see this (full logs attached):

Nov 25 10:15:30 overcloud-controller-0.localdomain os-collect-config[1929]: Finished yum_update.sh on server 6817db4a-ba0b-4809-9071-cc6414a5adac at Wed Nov 25 10:15:30 EST 2015
Nov 25 10:15:30 overcloud-controller-0.localdomain os-collect-config[1929]: [2015-11-25 10:15:30,353] (heat-config) [DEBUG] Error: unable to push cib
Nov 25 10:15:30 overcloud-controller-0.localdomain os-collect-config[1929]: Call cib_replace failed (-205): Update was older than existing configuration

thanks, marios

Comment 1 Marios Andreou 2015-11-25 17:38:59 UTC

Created attachment 1098928 [details]
control1 sosreport

Comment 2 Marios Andreou 2015-11-25 17:42:22 UTC

Created attachment 1098929 [details]
compute0 sosreport

Comment 3 James Slagle 2015-11-25 19:23:28 UTC

In my environment, I see that once UpdateDeployment has been signaled complete back to Heat on overcloud-controller-0, the update client will move on to the next breakpoint, which you can then clear.

I then cleared the breakpoint on overcloud-controller-1, so it started updating.

But, back on overcloud-controller-0 I see that the original os-refresh-config process is still running and is reapplying all the old puppet deployments (due to bug 1278181).

Therefore, there exists a possibility of a race condition across the cluster where yum_update.sh on overcloud-controller-1 could create a cib file, modify it, and before it gets the chance to load it back into pacemaker, overcloud-controller-0 could make a change to the cluster constraints due to applying the old puppet manifests.

When overcloud-controller-1 then tries to load the modified cib file, you'd get the error shown in the bugzilla.

I see 2 possible fixes to this situation, there could be others:

(a) Make yum_update.sh smart enough to account for the race condition. If we fail to load the modified cib due to it being an older version, we sleep/retry/backoff for a few attempts. After some set number of attempts, we'd have to give up and fail for real.

(b) More thoroughly fix bug 1278181. Somehow make 55-heat-config not retrigger the deployments if /var/run is empty. Or, populate /var/run/heat-config with some empty deployed json files based on the deployments already downloaded to /var/lib/os-collect-config. There were some ideas about this in the upstream bug, https://bugs.launchpad.net/heat-templates/+bug/1513220
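To make option (a) concrete, here is a rough sketch of a retry wrapper with exponential backoff. It is not what yum_update.sh does; the function name `retry_with_backoff` and its arguments are hypothetical. Presumably the retried command would need to be the whole dump/modify/push cycle rather than the push alone, since a cib file that was stale on the first attempt would still be stale on the next one.

```shell
# Sketch: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds, doubling the delay after each failure,
# and gives up with a non-zero status after MAX_ATTEMPTS tries.
retry_with_backoff() {
    local max_attempts=$1 delay=$2 attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}

# Hypothetical usage, where update_cib is a function that re-dumps the
# cib, reapplies the edits to the file, and pushes it:
#   retry_with_backoff 5 2 update_cib
```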

Comment 4 James Slagle 2015-11-25 21:37:11 UTC

Other than the observed behavior, my update from 7.0 (started with no /var/run/heat-config present on any node) actually completed fine. So, I suspect this is a transient race condition or something specific to updating from 7.1.

Comment 5 Marios Andreou 2015-11-27 06:43:03 UTC

It has been reported by others that they updated from 7.1 to 7.2 without issue, FYI, so this may be an environment issue. The context is here in case we find it is indeed a race.

Comment 7 Giulio Fidente 2015-12-02 17:45:32 UTC

*** Bug 1287804 has been marked as a duplicate of this bug. ***

Comment 10 Marios Andreou 2015-12-07 17:04:57 UTC

Hi Sasha, for verification:

On a good setup, you should be able to see the pacemaker config (cib) being applied/pushed correctly, in particular the response from "pcs cluster cib-push $pacemaker_dumpfile", like:

Nov 25 09:54:09 overcloud-controller-1 os-collect-config: Applying new Pacemaker config
Nov 25 09:54:09 overcloud-controller-1 os-collect-config: CIB updated

On a bad setup, you won't see this ^^^ but rather "Error: unable to push cib", like:

Nov 25 10:15:30 overcloud-controller-0 os-collect-config: Finished yum_update.sh on server 6817db4a-ba0b-4809-9071-cc6414a5adac at Wed Nov 25 10:15:30 EST 2015
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,353] (heat-config) [DEBUG] Error: unable to push cib
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: Call cib_replace failed (-205): Update was older than existing configuration


hope that helps.

Comment 11 Alexander Chuzhoy 2015-12-07 22:11:27 UTC

FailedQA.


Environment:
openstack-tripleo-heat-templates-0.8.6-87.el7ost.noarch

Updated the 7.1 setup to 7.2.
Logged into a controller:

[root@overcloud-controller-0 ~]# pacemaker_dumpfile=`mktemp`
[root@overcloud-controller-0 ~]# echo $pacemaker_dumpfile
/tmp/tmp.9lwOrQ0i9K
[root@overcloud-controller-0 ~]# pcs cluster cib-push $pacemaker_dumpfile
Error: unable to parse new cib: no element found: line 1, column 0

Comment 12 Marios Andreou 2015-12-08 12:38:25 UTC

Hi Sasha, I don't think it should fail QA for that though. For one, you are using an empty pacemaker_dumpfile here (mktemp only creates an empty file), which is why the push fails to parse.

The test/fix wasn't that a cib update *per se* works, but rather that it was done correctly by us during an update. The correct way to do it is to write the changes to a file and update the cib all at once, which is why in yum_update.sh we run the pcs commands with -f: https://github.com/openstack/tripleo-heat-templates/blob/2674efae84f6ba808fbaa5f0150825e42a86ba59/extraconfig/tasks/yum_update.sh#L83
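As an illustration of that pattern (a sketch only, not a copy of yum_update.sh), the flow is: dump the live cib to a file, apply every change to the file with `pcs -f`, then push the file once. The function name and the idea of passing constraint IDs as arguments are assumptions made for the example; the real IDs would come from `pcs constraint` output on the cluster.

```shell
# Sketch: remove_constraints_via_cib_file CONSTRAINT_ID [CONSTRAINT_ID...]
# Applies the removals to an offline copy of the cib and pushes it once.
remove_constraints_via_cib_file() {
    local dumpfile rc
    dumpfile=$(mktemp) || return 1
    # Snapshot the live cib into a scratch file.
    pcs cluster cib "$dumpfile" || { rm -f "$dumpfile"; return 1; }
    # Edit the file only (-f); the running cluster is untouched so far.
    pcs -f "$dumpfile" constraint remove "$@" || { rm -f "$dumpfile"; return 1; }
    echo "Applying new Pacemaker config"
    # Load all accumulated changes into the cluster in a single step.
    pcs cluster cib-push "$dumpfile"
    rc=$?
    rm -f "$dumpfile"
    return "$rc"
}
```

Note that the dump step is what distinguishes this from the FailedQA attempt in comment 11: pushing a file that was only created with mktemp, and never populated via `pcs cluster cib`, gives pcs nothing to parse.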

The fix that Giulio pushed at https://review.openstack.org/#/c/249636/1/extraconfig/tasks/yum_update.sh makes sure this also happens for mongo; we suspect that may be the cause of the original report here. So, to check whether this bug is fixed or still occurring:

On a good/fixed setup, you should be able to see the pacemaker config (cib) being applied/pushed correctly, in particular the response from "pcs cluster cib-push $pacemaker_dumpfile", like:

Nov 25 09:54:09 overcloud-controller-1 os-collect-config: Applying new Pacemaker config
Nov 25 09:54:09 overcloud-controller-1 os-collect-config: CIB updated

On a bad setup, you won't see this ^^^ but rather "Error: unable to push cib", like:

Nov 25 10:15:30 overcloud-controller-0 os-collect-config: Finished yum_update.sh on server 6817db4a-ba0b-4809-9071-cc6414a5adac at Wed Nov 25 10:15:30 EST 2015
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,353] (heat-config) [DEBUG] Error: unable to push cib
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: Call cib_replace failed (-205): Update was older than existing configuration
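The check described above can be scripted. A minimal sketch, assuming the two error strings quoted in this report are the ones to look for (the function name `cib_push_failed` is hypothetical):

```shell
# Sketch: read journal text on stdin; exit 0 if it contains either of
# the failed-push messages quoted in this bug, non-zero otherwise.
cib_push_failed() {
    grep -q -e 'Error: unable to push cib' -e 'Call cib_replace failed'
}

# Hypothetical usage on a controller after the update:
#   if journalctl --no-pager | cib_push_failed; then
#       echo "cib push failed during update"
#   fi
```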

Comment 13 Alexander Chuzhoy 2015-12-08 15:08:56 UTC

Verified.

Environment:
openstack-tripleo-heat-templates-0.8.6-87.el7ost.noarch

Verifying based on the fact that no errors like "Error: unable to push cib" are shown in journalctl.

Comment 21 errata-xmlrpc 2015-12-21 16:53:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2650
