Bug 1094408

Summary: Pacemaker error on constraint creation
Product: Red Hat Enterprise Linux 6
Reporter: Steve Reichard <sreichar>
Component: pacemaker
Assignee: Andrew Beekhof <abeekhof>
Status: CLOSED CURRENTRELEASE
QA Contact: Cluster QE <mspqa-list>
Severity: unspecified
Priority: unspecified
Version: 6.7
CC: cluster-maint, dvossel
Target Milestone: rc
Target Release: 6.5
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2014-10-23 03:53:37 UTC
Type: Bug
Bug Blocks: 1040649
Attachments:
  cluster report (Flags: none)
  Comment (Flags: none)

Description Steve Reichard 2014-05-05 15:33:42 UTC
Created attachment 915897
Comment

(This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).

Comment 2 David Vossel 2014-05-05 16:27:07 UTC
I'll explain what's going on.

Puppet is executing identical CIB writes (through the pcs CLI tool) on multiple cluster nodes at the same time.

From the logs, you can see that the creation of this order constraint is executed on two nodes at the same time.

cat 10.16.139.31/cluster-log.txt | grep "09:13.28.*rsc_order.*varlibmysql"
May  5 09:13:28 ospha1 cibadmin[42093]:   notice: crm_log_args: Invoked: /usr/sbin/cibadmin -o constraints -R --xml-text <constraints>#012  <rsc_order first="fs-varlibmysql" first-action="start" id="order-fs-varlibmysql-mysql-ostk-mysql-mandatory" then="mysql-ostk-mysql" then-action="start"/>#012<rsc_order first="lsb-openstack-nova-consoleauth-clone" first-action="start" id="order-lsb-openstack-nova-consoleauth-clone-lsb-openstack-nova-novncproxy-clone-mandatory" then="lsb-openstack-nova-novncproxy-clone" then-action="start"/></constraints> 

[root@dvossel-laptop2 bz]# cat 10.16.139.32/cluster-log.txt | grep "09:13.28.*rsc_order.*varlibmysql"
May  5 09:13:28 ospha2 cibadmin[41188]:   notice: crm_log_args: Invoked: /usr/sbin/cibadmin -o constraints -R --xml-text <constraints>#012  <rsc_order first="fs-varlibmysql" first-action="start" id="order-fs-varlibmysql-mysql-ostk-mysql-mandatory" then="mysql-ostk-mysql" then-action="start"/>#012  <rsc_order first="lsb-openstack-nova-consoleauth-clone" first-action="start" id="order-lsb-openstack-nova-consoleauth-clone-lsb-openstack-nova-novncproxy-clone-mandatory" then="lsb-openstack-nova-novncproxy-clone" then-action="start"/>#012<rsc_order first="lsb-


The RHEL 6.5 version of pacemaker does not handle this situation well. The end result is that, after some pacemaker component failures, the CIB is corrupted: two identical entries for the same rsc_order constraint make their way into the CIB, which appears to prevent the nodes from recovering.
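
To check a constraints section for this kind of duplication, a small script along these lines might help. This is only a sketch: the function name and the shortened sample XML are made up for illustration (the sample is modelled on the log excerpt above), and it assumes the section has been dumped as XML, for example with `cibadmin -Q -o constraints`.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_constraint_ids(constraints_xml):
    """Return, sorted, every id used by more than one child of <constraints>."""
    root = ET.fromstring(constraints_xml.strip())
    counts = Counter(child.get("id") for child in root if child.get("id"))
    return sorted(cid for cid, n in counts.items() if n > 1)


# Hypothetical, shortened sample: the first rsc_order appears twice,
# which is the kind of duplication described in this comment.
SAMPLE = """
<constraints>
  <rsc_order first="fs-varlibmysql" first-action="start" id="order-fs-varlibmysql-mysql-ostk-mysql-mandatory" then="mysql-ostk-mysql" then-action="start"/>
  <rsc_order first="fs-varlibmysql" first-action="start" id="order-fs-varlibmysql-mysql-ostk-mysql-mandatory" then="mysql-ostk-mysql" then-action="start"/>
  <rsc_order first="lsb-openstack-nova-consoleauth-clone" first-action="start" id="order-lsb-openstack-nova-consoleauth-clone-lsb-openstack-nova-novncproxy-clone-mandatory" then="lsb-openstack-nova-novncproxy-clone" then-action="start"/>
</constraints>
"""

if __name__ == "__main__":
    for cid in duplicate_constraint_ids(SAMPLE):
        print("duplicate constraint id:", cid)
```

An empty result means every constraint id in the section is unique.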

This issue has already been addressed upstream. CIB writes are now executed in CPG order across cluster nodes, which prevents nodes from stomping on each other like this.

The workaround for this issue is to avoid running pcs commands that write to the CIB (such as resource creation and constraint creation) on multiple nodes at the same time.
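
One way to apply the workaround is to let a single, deterministically chosen node issue the CIB-writing commands while every other node skips them. The sketch below is an illustration only, not part of pcs or Puppet: the helper names are hypothetical, and "the node whose name sorts first" is an arbitrary selection rule chosen for simplicity.

```python
import subprocess


def is_designated_writer(local_name, cluster_nodes):
    """True only on the node whose name sorts first (an assumed selection rule)."""
    return bool(cluster_nodes) and local_name == sorted(cluster_nodes)[0]


def run_cib_writes(commands, local_name, cluster_nodes, runner=subprocess.check_call):
    """Run each command (an argv list) only on the designated node.

    Returns the list of commands that were actually executed, which is
    empty on every node other than the designated one.
    """
    if not is_designated_writer(local_name, cluster_nodes):
        return []
    executed = []
    for argv in commands:
        runner(argv)  # raises if the command fails, stopping further writes
        executed.append(argv)
    return executed
```

With the two hosts from the logs, `run_cib_writes([["pcs", "constraint", "order", "fs-varlibmysql", "then", "mysql-ostk-mysql"]], "ospha1", ["ospha1", "ospha2"])` would run the command on ospha1, while the same call with `"ospha2"` as the local name would do nothing.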

-- Vossel

Comment 3 Andrew Beekhof 2014-05-06 00:05:51 UTC
This is the patch I'll be testing:

diff --git a/lib/cib/cib_utils.c b/lib/cib/cib_utils.c
index 8791eab..024dfc3 100644
--- a/lib/cib/cib_utils.c
+++ b/lib/cib/cib_utils.c
@@ -506,8 +506,12 @@ cib_perform_op(const char *op, int call_options, cib_op_t * fn, gboolean is_quer
         if (dtd_throttle++ % 20) {
             /* Throttle the amount of costly validation we perform due to slave updates.
              * The master already validated it...
+             *
+             * But since people are trying to run the same commands
+             * concurrently on multiple hosts, it is only safe to do
+             * this for status updates
              */
-            check_dtd = FALSE;
+            check_dtd = *config_changed;
         }
 
     } else if (is_set(call_options, cib_inhibit_bcast) && safe_str_eq(section, XML_CIB_TAG_STATUS)) {
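
The effect of the changed line can be modelled roughly as follows. This is a simplified Python sketch of the throttle branch only, not the actual pacemaker code; it assumes that check_dtd is TRUE unless this branch clears it, and the function name is invented for illustration.

```python
def check_dtd_for_update(dtd_throttle, config_changed, patched):
    """Model whether validation runs for the update with counter value dtd_throttle."""
    if dtd_throttle % 20:
        # 19 of every 20 updates take this branch.
        # Before the patch: validation is always skipped here.
        # After the patch: it is skipped only if the configuration did not change.
        return config_changed if patched else False
    # Every 20th update is validated (assumption: check_dtd defaults to TRUE).
    return True


if __name__ == "__main__":
    for patched in (False, True):
        validated = [check_dtd_for_update(i, True, patched) for i in range(20)]
        print("patched" if patched else "original", sum(validated), "of 20 validated")
```

Under this model, configuration changes such as new constraints are always validated after the patch, while status-only updates keep the 1-in-20 throttle.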

Comment 4 Andrew Beekhof 2014-05-06 05:18:58 UTC
Scratch build if anyone else would like to test too:
   http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7419459

Comment 5 Andrew Beekhof 2014-10-23 03:53:37 UTC
RHEL 6.6 includes changes that prevent this problem from occurring. Closing.