Bug 1174955

Summary: rubygem-staypuft: Deployment - puppet error related to /Stage[main]/Quickstack::Pacemaker::Galera/Quickstack::Pacemaker::Resource::Galera[galera]/Exec[create galera resource]
Product: Red Hat OpenStack
Reporter: Alexander Chuzhoy <sasha>
Component: openstack-foreman-installer
Assignee: Crag Wolfe <cwolfe>
Status: CLOSED ERRATA
QA Contact: Alexander Chuzhoy <sasha>
Severity: urgent
Priority: urgent
Version: unspecified
CC: cwolfe, dvossel, jguiditt, mburns, mlopes, morazi, rhos-maint, yeylon
Target Milestone: ga
Target Release: Installer
Hardware: x86_64
OS: Linux
Fixed In Version: openstack-foreman-installer-3.0.8-1.el7ost
Doc Type: Bug Fix
Doc Text:
This bug fix addresses a rare concurrency issue with Pacemaker that caused Galera resource creation to fail. The fix adds retries, with a sleep between attempts, to the resource creation command. This is expected to avoid the concurrency issue and result in successful resource creation.
Story Points: ---
Last Closed: 2015-02-09 15:18:20 UTC
Type: Bug
Bug Blocks: 1177026    
Attachments: messages and pacemaker logs from controllers

Description Alexander Chuzhoy 2014-12-16 20:33:00 UTC
rubygem-staypuft: Deployment - puppet error related to /Stage[main]/Quickstack::Pacemaker::Galera/Quickstack::Pacemaker::Resource::Galera[galera]/Exec[create galera resource]

Environment:
openstack-foreman-installer-3.0.6-1.el7ost.noarch
ruby193-rubygem-staypuft-0.5.6-1.el7ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el7ost.noarch
rhel-osp-installer-client-0.5.3-1.el7ost.noarch
openstack-puppet-modules-2014.2.7-2.el7ost.noarch
rhel-osp-installer-0.5.3-1.el7ost.noarch


Steps to reproduce:
1. Install rhel-osp-installer
2. Create and run a Neutron deployment with 3 controllers and 2 compute nodes


Result:
Puppet reports the following error:
/usr/sbin/pcs cluster cib /tmp/galera-ra && /usr/sbin/pcs -f /tmp/galera-ra resource create galera galera enable_creation=true wsrep_cluster_address="gcomm://lb-backend-maca25400702876,lb-backend-maca25400702877,lb-backend-maca25400702875" op promote timeout=300s on-fail=block --master meta master-max=3 ordered=true && /usr/sbin/pcs cluster cib-push /tmp/galera-ra returned 1 instead of one of [0]

Expected result:
No such puppet error in reports.

Comment 1 Alexander Chuzhoy 2014-12-16 20:41:06 UTC
Created attachment 969748 [details]
messages and pacemaker logs from controllers

Comment 3 Jason Guiditta 2014-12-16 22:35:38 UTC
Crag, can you take a look and see if any fix is needed on our side?

Comment 4 Crag Wolfe 2014-12-17 01:23:29 UTC
One thing from pacemaker.log1-reported_issue that might be a clue to the real problem, although it occurs one second after the failed attempt to add the galera resource:

Dec 16 15:05:54 [13567] maca25400702876.example.com        cib:  warning: cib_process_diff:     Bad global update   <diff crm_feature_set="3.0.7" digest="eadc64bb435e1aea13a01288e1499fb8">
Dec 16 15:05:54 [13567] maca25400702876.example.com        cib:  warning: cib_process_diff:     Bad global update     <diff-removed admin_epoch="0" epoch="36" num_updates="1">
Dec 16 15:05:54 [13567] maca25400702876.example.com        cib:  warning: cib_process_diff:     Bad global update       <cib num_updates="1"/>
Dec 16 15:05:54 [13567] maca25400702876.example.com        cib:  warning: cib_process_diff:     Bad global update     </diff-removed>
Dec 16 15:05:54 [13567] maca25400702876.example.com        cib:  warning: cib_process_diff:     Bad global update     <diff-added>
Dec 16 15:05:54 [13567] maca25400702876.example.com        cib:  warning: cib_process_diff:     Bad global update       <cib epoch="36" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Dec 16 15:05:53 2014" update-origin="lb-backend-maca25400702876" update-client="cibadmin" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="3"/>
Dec 16 15:05:54 [13567] maca25400702876.example.com        cib:  warning: cib_process_diff:     Bad global update     </diff-added>
Dec 16 15:05:54 [13567] maca25400702876.example.com        cib:  warning: cib_process_diff:     Bad global update   </diff>

For now, a workaround can be to add retry capability around creating the galera resource agent in puppet.
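
A minimal sketch of what that retry could look like, assuming the resource is created by a Puppet Exec wrapping the pcs commands from the description (the tries/try_sleep values and the "unless" guard here are illustrative, not the actual change):

exec { 'create galera resource':
  # Same pcs sequence as in the report: dump the CIB, inject the resource, push it back.
  command   => '/usr/sbin/pcs cluster cib /tmp/galera-ra && /usr/sbin/pcs -f /tmp/galera-ra resource create galera galera enable_creation=true wsrep_cluster_address="gcomm://lb-backend-maca25400702876,lb-backend-maca25400702877,lb-backend-maca25400702875" op promote timeout=300s on-fail=block --master meta master-max=3 ordered=true && /usr/sbin/pcs cluster cib-push /tmp/galera-ra',
  # Skip the command entirely if the resource already exists.
  unless    => '/usr/sbin/pcs resource show galera',
  # Re-run a few times, sleeping between attempts, so a transient
  # cib-push rejection does not fail the whole puppet run.
  tries     => 3,
  try_sleep => 10,
}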

Comment 5 Crag Wolfe 2014-12-17 01:24:57 UTC
David, any ideas on this one?

Comment 6 Crag Wolfe 2014-12-17 02:23:21 UTC
The retry option:
https://github.com/redhat-openstack/astapor/pull/435

Comment 7 David Vossel 2014-12-17 16:03:21 UTC
(In reply to Crag Wolfe from comment #6)
> The retry option:
> https://github.com/redhat-openstack/astapor/pull/435

If this actually fixes something, we have bigger problems.

I'll investigate.

Comment 8 David Vossel 2014-12-17 18:24:51 UTC
(In reply to David Vossel from comment #7)
> (In reply to Crag Wolfe from comment #6)
> > The retry option:
> > https://github.com/redhat-openstack/astapor/pull/435
> 
> If this actually fixes something, we have bigger problems.
> 
> I'll investigate.

wow, you guys hit a good one. I'm actually not entirely sure what to do about this yet.

It appears the galera resource creation occurred during a DC election. It looks like somewhere between the time the local cib copy is written to a file, the galera instance is injected into that copy, and the copy is pushed back into pacemaker... a DC election is going on.

This resulted in the cib copy you were trying to push back into pacemaker being rejected. The update looked out of date because it didn't have the new DC changes.
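
To make the window concrete, here is a rough breakdown of the failing command from the description, with the race window marked:

# 1. Dump the live CIB to a local file.
/usr/sbin/pcs cluster cib /tmp/galera-ra

# 2. Add the galera master/slave resource to the local copy only.
/usr/sbin/pcs -f /tmp/galera-ra resource create galera galera ... --master meta master-max=3 ordered=true

# If a DC election completes in this window, the live CIB advances past the copy.

# 3. Push the modified copy back; pacemaker rejects it as out of date
#    (cf. the "Bad global update" warnings in comment 4).
/usr/sbin/pcs cluster cib-push /tmp/galera-ra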

I hate to say it, but the quick fix of re-attempting the resource addition is probably our best option right now. I'm going to open a pacemaker bug so we can try to come up with a better solution on our end.

This should be an incredibly rare occurrence. If you all encounter this often, then we need to investigate this even further to understand why.

-- David

Comment 9 Jason Guiditta 2014-12-17 19:38:27 UTC
Merged

Comment 13 Alexander Chuzhoy 2015-01-16 20:26:37 UTC
Verified:
Environment:
ruby193-rubygem-staypuft-0.5.12-1.el7ost.noarch
openstack-puppet-modules-2014.2.8-1.el7ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el7ost.noarch
openstack-foreman-installer-3.0.10-2.el7ost.noarch
rhel-osp-installer-0.5.5-1.el7ost.noarch
rhel-osp-installer-client-0.5.5-1.el7ost.noarch



The reported issue no longer reproduces.

Comment 15 errata-xmlrpc 2015-02-09 15:18:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0156.html