Bug 1174955 - rubygem-staypuft: Deployment - puppet error related to /Stage[main]/Quickstack::Pacemaker::Galera/Quickstack::Pacemaker::Resource::Galera[galera]/Exec[create galera resource]
Summary: rubygem-staypuft: Deployment - puppet error related to /Stage[main]/Quickstac...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-foreman-installer
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ga
: Installer
Assignee: Crag Wolfe
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Depends On:
Blocks: 1177026
 
Reported: 2014-12-16 20:33 UTC by Alexander Chuzhoy
Modified: 2015-02-09 15:18 UTC (History)

Fixed In Version: openstack-foreman-installer-3.0.8-1.el7ost
Doc Type: Bug Fix
Doc Text:
This bug fix addresses a rare concurrency issue with Pacemaker that causes Galera resource creation to fail. The fix retries the resource creation command, sleeping between attempts. This is expected to avoid the concurrency issue and result in successful resource creation.
Clone Of:
Environment:
Last Closed: 2015-02-09 15:18:20 UTC
Target Upstream Version:
Embargoed:


Attachments
messages and pacemaker logs from controllers (315.43 KB, application/x-gzip)
2014-12-16 20:41 UTC, Alexander Chuzhoy


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2015:0156 | 0 | normal | SHIPPED_LIVE | Red Hat Enterprise Linux OpenStack Platform Installer Bug Fix Advisory | 2015-02-09 20:13:39 UTC

Description Alexander Chuzhoy 2014-12-16 20:33:00 UTC
rubygem-staypuft: Deployment - puppet error related to /Stage[main]/Quickstack::Pacemaker::Galera/Quickstack::Pacemaker::Resource::Galera[galera]/Exec[create galera resource]

Environment:
openstack-foreman-installer-3.0.6-1.el7ost.noarch
ruby193-rubygem-staypuft-0.5.6-1.el7ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el7ost.noarch
rhel-osp-installer-client-0.5.3-1.el7ost.noarch
openstack-puppet-modules-2014.2.7-2.el7ost.noarch
rhel-osp-installer-0.5.3-1.el7ost.noarch


Steps to reproduce:
1. Install rhel-osp-installer
2. Create/run Neutron deployment with 3 controllers and 2 computes


Result:
Puppet reports error:
/usr/sbin/pcs cluster cib /tmp/galera-ra && /usr/sbin/pcs -f /tmp/galera-ra resource create galera galera enable_creation=true wsrep_cluster_address="gcomm://lb-backend-maca25400702876,lb-backend-maca25400702877,lb-backend-maca25400702875" op promote timeout=300s on-fail=block --master meta master-max=3 ordered=true && /usr/sbin/pcs cluster cib-push /tmp/galera-ra returned 1 instead of one of [0]

Expected result:
No such puppet error in reports.

Comment 1 Alexander Chuzhoy 2014-12-16 20:41:06 UTC
Created attachment 969748
messages and pacemaker logs from controllers

Comment 3 Jason Guiditta 2014-12-16 22:35:38 UTC
Crag, can you take a look and see if there is any fix needed on our side?

Comment 4 Crag Wolfe 2014-12-17 01:23:29 UTC
One thing that might be a clue as to the real problem from pacemaker.log1-reported_issue, although it occurs one second after the failed attempt to add the galera resource:

Dec 16 15:05:54 [13567] maca25400702876.example.com        cib:  warning: cib_process_diff:     Bad global update   <diff crm_feature_set="3.0.7" digest="eadc64bb435e1aea13a01288e1499fb8">
Dec 16 15:05:54 [13567] maca25400702876.example.com        cib:  warning: cib_process_diff:     Bad global update     <diff-removed admin_epoch="0" epoch="36" num_updates="1">
Dec 16 15:05:54 [13567] maca25400702876.example.com        cib:  warning: cib_process_diff:     Bad global update       <cib num_updates="1"/>
Dec 16 15:05:54 [13567] maca25400702876.example.com        cib:  warning: cib_process_diff:     Bad global update     </diff-removed>
Dec 16 15:05:54 [13567] maca25400702876.example.com        cib:  warning: cib_process_diff:     Bad global update     <diff-added>
Dec 16 15:05:54 [13567] maca25400702876.example.com        cib:  warning: cib_process_diff:     Bad global update       <cib epoch="36" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Dec 16 15:05:53 2014" update-origin="lb-backend-maca25400702876" update-client="cibadmin" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="3"/>
Dec 16 15:05:54 [13567] maca25400702876.example.com        cib:  warning: cib_process_diff:     Bad global update     </diff-added>
Dec 16 15:05:54 [13567] maca25400702876.example.com        cib:  warning: cib_process_diff:     Bad global update   </diff>

For now, a workaround can be to add retry capability around creating the galera resource agent in puppet.
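
The retry workaround can be sketched roughly as follows. This is a minimal Python illustration, not the change in the pull request: `run_with_retries` and its parameters `tries`, `try_sleep`, `runner`, and `sleep` are hypothetical names chosen for the example. The idea it shows is that the whole chain (dump the CIB to a file, add the resource to the copy, push the copy back) is re-run on failure, so each attempt starts from a fresh copy of the CIB.

```python
import subprocess
import time


def run_with_retries(command, tries=3, try_sleep=5.0,
                     runner=subprocess.run, sleep=time.sleep):
    """Run `command` (an argv list) until it exits 0 or `tries` attempts are used.

    Returns the number of attempts that were needed.
    Raises RuntimeError if every attempt exits non-zero.
    `runner` and `sleep` are injectable so the logic can be exercised
    without a real cluster (an assumption made for this sketch).
    """
    for attempt in range(1, tries + 1):
        result = runner(command)
        if result.returncode == 0:
            return attempt
        # Pause before the next attempt, but not after the last one.
        if attempt < tries:
            sleep(try_sleep)
    raise RuntimeError(f"command failed after {tries} attempts: {command}")
```

To use it with the command from the report, the `&&` chain would be passed as the argument of `/bin/sh -c` so that all three `pcs` steps are repeated together. In Puppet itself, the `exec` resource type has `tries` and `try_sleep` attributes that give the same behavior.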

Comment 5 Crag Wolfe 2014-12-17 01:24:57 UTC
David, any ideas on this one?

Comment 6 Crag Wolfe 2014-12-17 02:23:21 UTC
The retry option:
https://github.com/redhat-openstack/astapor/pull/435

Comment 7 David Vossel 2014-12-17 16:03:21 UTC
(In reply to Crag Wolfe from comment #6)
> The retry option:
> https://github.com/redhat-openstack/astapor/pull/435

If this actually fixes something, we have bigger problems.

I'll investigate.

Comment 8 David Vossel 2014-12-17 18:24:51 UTC
(In reply to David Vossel from comment #7)
> (In reply to Crag Wolfe from comment #6)
> > The retry option:
> > https://github.com/redhat-openstack/astapor/pull/435
> 
> If this actually fixes something, we have bigger problems.
> 
> I'll investigate.

wow, you guys hit a good one. I'm actually not entirely sure what to do about this yet.

It appears the galera resource creation occurred during a DC election. Somehow, it looks like between the time a local cib copy is written to the file, the galera instance is injected into the copy, and the local cib copy is pushed back into pacemaker... there's a DC election going on.

This resulted in the cib copy you were trying to push back into pacemaker being rejected. The update looked out of date because it didn't have the new DC changes.

I hate to say it, but the quick fix of re-attempting the resource addition might be our best option right now. I'm going to open a pacemaker bug so we can try and come up with a better solution on our end.

This should be an incredibly rare occurrence. If you all encounter this often, then we need to investigate this even further to understand why.

-- David
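
A toy model of why the pushed copy looked out of date, written in Python. It assumes the CIB version is compared as the tuple (admin_epoch, epoch, num_updates), the attributes visible in the log above, and that a pushed copy older than the live CIB is rejected. `CibVersion`, `push_accepted`, and the version numbers are hypothetical and only illustrate the ordering. This is not Pacemaker's actual code.

```python
from typing import NamedTuple


class CibVersion(NamedTuple):
    """Version triple of a CIB, compared field by field in this order."""
    admin_epoch: int
    epoch: int
    num_updates: int


def push_accepted(pushed: CibVersion, live: CibVersion) -> bool:
    # Assumption for this sketch: a push is accepted only if the pushed
    # copy is at least as new as the live CIB.
    return pushed >= live


# Hypothetical numbers: the copy is dumped to a file, then the DC election
# changes the live CIB before the edited copy is pushed back.
snapshot = CibVersion(admin_epoch=0, epoch=35, num_updates=0)
live_after_election = CibVersion(admin_epoch=0, epoch=36, num_updates=1)

print(push_accepted(snapshot, live_after_election))             # False
print(push_accepted(live_after_election, live_after_election))  # True
```

Under these assumptions, a retry succeeds because it dumps the CIB again after the election, so the copy being pushed already contains the new DC changes.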

Comment 9 Jason Guiditta 2014-12-17 19:38:27 UTC
Merged

Comment 13 Alexander Chuzhoy 2015-01-16 20:26:37 UTC
Verified:
Environment:
ruby193-rubygem-staypuft-0.5.12-1.el7ost.noarch
openstack-puppet-modules-2014.2.8-1.el7ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el7ost.noarch
openstack-foreman-installer-3.0.10-2.el7ost.noarch
rhel-osp-installer-0.5.5-1.el7ost.noarch
rhel-osp-installer-client-0.5.5-1.el7ost.noarch



The reported issue doesn't reproduce.

Comment 15 errata-xmlrpc 2015-02-09 15:18:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0156.html

