Bug 1094408

Summary:

Pacemaker error on constraint creation

Product:

Red Hat Enterprise Linux 6

Reporter:

Steve Reichard <sreichar>

Component:

pacemaker

Assignee:

Andrew Beekhof <abeekhof>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Cluster QE <mspqa-list>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

6.7

CC:

cluster-maint, dvossel

Target Milestone:

Target Release:

6.5

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2014-10-23 03:53:37 UTC

Type:

Bug

Bug Depends On:

Bug Blocks:

1040649

Attachments:

Description	Flags
cluster report	none
Comment	none

Description Steve Reichard 2014-05-05 15:33:42 UTC

Created attachment 915897 [details]
Comment

(This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).

Comment 2 David Vossel 2014-05-05 16:27:07 UTC

I'll explain what's going on.

Puppet is executing identical CIB writes (through the pcs CLI tool) on multiple cluster nodes at the same time.

From the logs, you can see that creation of this order constraint is executed on two nodes at the same time.

cat 10.16.139.31/cluster-log.txt | grep "09:13.28.*rsc_order.*varlibmysql"
May  5 09:13:28 ospha1 cibadmin[42093]:   notice: crm_log_args: Invoked: /usr/sbin/cibadmin -o constraints -R --xml-text <constraints>#012  <rsc_order first="fs-varlibmysql" first-action="start" id="order-fs-varlibmysql-mysql-ostk-mysql-mandatory" then="mysql-ostk-mysql" then-action="start"/>#012<rsc_order first="lsb-openstack-nova-consoleauth-clone" first-action="start" id="order-lsb-openstack-nova-consoleauth-clone-lsb-openstack-nova-novncproxy-clone-mandatory" then="lsb-openstack-nova-novncproxy-clone" then-action="start"/></constraints> 

[root@dvossel-laptop2 bz]# cat 10.16.139.32/cluster-log.txt | grep "09:13.28.*rsc_order.*varlibmysql"
May  5 09:13:28 ospha2 cibadmin[41188]:   notice: crm_log_args: Invoked: /usr/sbin/cibadmin -o constraints -R --xml-text <constraints>#012  <rsc_order first="fs-varlibmysql" first-action="start" id="order-fs-varlibmysql-mysql-ostk-mysql-mandatory" then="mysql-ostk-mysql" then-action="start"/>#012  <rsc_order first="lsb-openstack-nova-consoleauth-clone" first-action="start" id="order-lsb-openstack-nova-consoleauth-clone-lsb-openstack-nova-novncproxy-clone-mandatory" then="lsb-openstack-nova-novncproxy-clone" then-action="start"/>#012<rsc_order first="lsb-


The RHEL 6.5 version of pacemaker does not handle this situation well. The end result is that, after some pacemaker component failures, the CIB is corrupted. Two identical CIB entries for the same rsc_order constraint make their way into the CIB, which appears to prevent the nodes from recovering.
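
To check a CIB dump for the symptom described above, something like the following sketch may help. It is not part of pacemaker or pcs; the function name `duplicate_constraint_ids` and the sample XML are illustrative assumptions, with the constraint id copied from the log lines above.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_constraint_ids(constraints_xml):
    """Return the sorted ids that appear more than once under <constraints>."""
    root = ET.fromstring(constraints_xml.strip())
    # Count the id attribute of every direct child (rsc_order, rsc_location, ...).
    counts = Counter(
        child.get("id") for child in root if child.get("id") is not None
    )
    return sorted(cid for cid, n in counts.items() if n > 1)


# Hypothetical sample: the same rsc_order entry present twice, as in this bug.
SAMPLE = """
<constraints>
  <rsc_order first="fs-varlibmysql" first-action="start" id="order-fs-varlibmysql-mysql-ostk-mysql-mandatory" then="mysql-ostk-mysql" then-action="start"/>
  <rsc_order first="fs-varlibmysql" first-action="start" id="order-fs-varlibmysql-mysql-ostk-mysql-mandatory" then="mysql-ostk-mysql" then-action="start"/>
</constraints>
"""

print(duplicate_constraint_ids(SAMPLE))
```

The input is assumed to be just the constraints section as an XML string; how that string is obtained from a running cluster is left out here.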

This issue has already been addressed upstream. CIB writes are now executed in CPG order across cluster nodes, which prevents nodes from stomping on each other like this.
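
As a rough illustration of why a single agreed order helps, here is a toy model. It is not pacemaker's implementation; the `Replica` class and `deliver_in_order` function are hypothetical names, and the model assumes a replace is keyed by constraint id. Every replica applies the same sequence of writes, so an identical second write overwrites the first instead of adding a second entry.

```python
ORDER_ID = "order-fs-varlibmysql-mysql-ostk-mysql-mandatory"


class Replica:
    """Toy stand-in for one node's copy of the constraints section."""

    def __init__(self):
        self.constraints = {}

    def apply(self, constraint_id, attrs):
        # Assumption: a replace keyed by id overwrites any existing entry.
        self.constraints[constraint_id] = attrs


def deliver_in_order(replicas, ordered_writes):
    """Apply one agreed sequence of writes to every replica, in the same order."""
    for constraint_id, attrs in ordered_writes:
        for replica in replicas:
            replica.apply(constraint_id, attrs)


attrs = {"first": "fs-varlibmysql", "then": "mysql-ostk-mysql"}
# Both nodes submitted the same write; the ordering layer delivers both to everyone.
writes = [(ORDER_ID, attrs), (ORDER_ID, attrs)]

node1, node2 = Replica(), Replica()
deliver_in_order([node1, node2], writes)

print(len(node1.constraints), node1.constraints == node2.constraints)
```

In this model both replicas end up with exactly one entry and identical contents, regardless of which node submitted which write.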

The workaround for this issue is to avoid running pcs commands that involve CIB writes (like resource creation and constraint creation) on multiple nodes at the same time.
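
One way to apply the workaround is to let only a single designated node issue each CIB write. The sketch below assumes that approach; `run_cib_write_once`, its parameters, and the choice of ospha1 as the designated node are hypothetical, and the pcs command shown is only a guess at what the automation ran.

```python
import socket
import subprocess


def run_cib_write_once(command, designated_node, local_node=None,
                       runner=subprocess.check_call):
    """Run `command` only on the designated node; return True if it was run."""
    local_node = local_node or socket.gethostname()
    if local_node != designated_node:
        # Every other node skips the write instead of racing the designated one.
        return False
    runner(command)
    return True


# Assumed pcs invocation for the order constraint seen in the logs.
cmd = ["pcs", "constraint", "order",
       "start", "fs-varlibmysql", "then", "start", "mysql-ostk-mysql"]

# Demonstration with a recording runner so nothing is actually executed.
calls = []
ran_on_1 = run_cib_write_once(cmd, "ospha1", local_node="ospha1", runner=calls.append)
ran_on_2 = run_cib_write_once(cmd, "ospha1", local_node="ospha2", runner=calls.append)
print(ran_on_1, ran_on_2, len(calls))
```

With the default `runner`, the command would be executed through `subprocess.check_call` on the designated node only.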

-- Vossel

Comment 3 Andrew Beekhof 2014-05-06 00:05:51 UTC

This is the patch I'll be testing:

diff --git a/lib/cib/cib_utils.c b/lib/cib/cib_utils.c
index 8791eab..024dfc3 100644
--- a/lib/cib/cib_utils.c
+++ b/lib/cib/cib_utils.c
@@ -506,8 +506,12 @@ cib_perform_op(const char *op, int call_options, cib_op_t * fn, gboolean is_quer
         if (dtd_throttle++ % 20) {
             /* Throttle the amount of costly validation we perform due to slave updates.
              * The master already validated it...
+             *
+             * But since people are trying to run the same commands
+             * concurrently on multiple hosts, it is only safe to do
+             * this for status updates
              */
-            check_dtd = FALSE;
+            check_dtd = *config_changed;
         }
 
     } else if (is_set(call_options, cib_inhibit_bcast) && safe_str_eq(section, XML_CIB_TAG_STATUS)) {
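
The effect of the patched line can be sketched outside of C as follows. This is a simplified model of the hunk above only, assuming that `check_dtd` starts out TRUE before this branch; `should_validate` and its parameters are hypothetical names.

```python
def should_validate(update_index, config_changed, patched):
    """Model of the throttle: is validation performed for this update?

    `update_index` plays the role of dtd_throttle before the post-increment.
    """
    if update_index % 20:
        # Throttled update: the old code always skipped validation here,
        # the patched code skips it only when the configuration is unchanged.
        return config_changed if patched else False
    # Every 20th update is validated either way (assumes check_dtd starts TRUE).
    return True


# Count validations over 40 consecutive updates in each scenario.
old_config = sum(should_validate(i, True, patched=False) for i in range(40))
new_config = sum(should_validate(i, True, patched=True) for i in range(40))
new_status = sum(should_validate(i, False, patched=True) for i in range(40))
print(old_config, new_config, new_status)
```

Under these assumptions, configuration changes are always validated after the patch, while status-only updates keep the one-in-twenty throttle.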

Comment 4 Andrew Beekhof 2014-05-06 05:18:58 UTC

Scratch build if anyone else would like to test too:
   http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7419459

Comment 5 Andrew Beekhof 2014-10-23 03:53:37 UTC

RHEL 6.6 includes changes that prevent this problem from occurring. Closing.