Bug 1094408
Field | Value
---|---
Summary | Pacemaker error on constraint creation
Product | Red Hat Enterprise Linux 6
Component | pacemaker
Status | CLOSED CURRENTRELEASE
Severity | unspecified
Priority | unspecified
Version | 6.7
Target Milestone | rc
Target Release | 6.5
Hardware | Unspecified
OS | Unspecified
Doc Type | Bug Fix
Type | Bug
Reporter | Steve Reichard <sreichar>
Assignee | Andrew Beekhof <abeekhof>
QA Contact | Cluster QE <mspqa-list>
CC | cluster-maint, dvossel
Last Closed | 2014-10-23 03:53:37 UTC
Bug Blocks | 1040649

I'll explain what's going on.

Puppet is executing identical CIB writes (through the pcs CLI tool) across multiple cluster nodes at the same time. From the logs, you can see that creation of this order constraint gets executed on two nodes at the same time.

    cat 10.16.139.31/cluster-log.txt | grep "09:13.28.*rsc_order.*varlibmysql"
    May 5 09:13:28 ospha1 cibadmin[42093]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin -o constraints -R --xml-text <constraints>#012 <rsc_order first="fs-varlibmysql" first-action="start" id="order-fs-varlibmysql-mysql-ostk-mysql-mandatory" then="mysql-ostk-mysql" then-action="start"/>#012<rsc_order first="lsb-openstack-nova-consoleauth-clone" first-action="start" id="order-lsb-openstack-nova-consoleauth-clone-lsb-openstack-nova-novncproxy-clone-mandatory" then="lsb-openstack-nova-novncproxy-clone" then-action="start"/></constraints>

    [root@dvossel-laptop2 bz]# cat 10.16.139.32/cluster-log.txt | grep "09:13.28.*rsc_order.*varlibmysql"
    May 5 09:13:28 ospha2 cibadmin[41188]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin -o constraints -R --xml-text <constraints>#012 <rsc_order first="fs-varlibmysql" first-action="start" id="order-fs-varlibmysql-mysql-ostk-mysql-mandatory" then="mysql-ostk-mysql" then-action="start"/>#012 <rsc_order first="lsb-openstack-nova-consoleauth-clone" first-action="start" id="order-lsb-openstack-nova-consoleauth-clone-lsb-openstack-nova-novncproxy-clone-mandatory" then="lsb-openstack-nova-novncproxy-clone" then-action="start"/>#012<rsc_order first="lsb-

The RHEL 6.5 version of pacemaker does not handle this situation well. The end result is that, after some pacemaker component failures, the CIB is corrupted. Two identical entries for the same rsc_order constraint make their way into the CIB, which appears to prevent the nodes from recovering.

This issue has already been addressed upstream. CIB writes are now executed in CPG order across cluster nodes, which prevents nodes from stomping on each other like this.

The work-around for this issue is to avoid running pcs commands that involve CIB writes (like resource creation and constraint creation) on multiple nodes at the same time.

-- Vossel

This is the patch I'll be testing:

    diff --git a/lib/cib/cib_utils.c b/lib/cib/cib_utils.c
    index 8791eab..024dfc3 100644
    --- a/lib/cib/cib_utils.c
    +++ b/lib/cib/cib_utils.c
    @@ -506,8 +506,12 @@ cib_perform_op(const char *op, int call_options, cib_op_t * fn, gboolean is_quer
             if (dtd_throttle++ % 20) {
                 /* Throttle the amount of costly validation we perform due to slave updates.
                  * The master already validated it...
    +             *
    +             * But since people are trying to run the same commands
    +             * concurrently on multiple hosts, it is only safe to do
    +             * this for status updates
                  */
    -            check_dtd = FALSE;
    +            check_dtd = *config_changed;
             }
         } else if (is_set(call_options, cib_inhibit_bcast) && safe_str_eq(section, XML_CIB_TAG_STATUS)) {

Scratch build if anyone else would like to test too: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7419459

6.6 includes changes that prevent this problem from occurring anymore. Closing.
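
For anyone reading the patch without the pacemaker source at hand, the following is a minimal, standalone C sketch of what the one-line change does to the validation throttle. It is not pacemaker code: `should_validate` and `count_validated` are hypothetical helpers, the `patched` flag exists only so both behaviours can be compared side by side, and the sketch assumes the counter and the `% 20` test behave exactly as in the quoted hunk, ignoring the surrounding conditions in `cib_perform_op`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not a pacemaker function): decide whether one
 * replicated update gets validated, mirroring the quoted hunk.
 *   throttle       - running counter, plays the role of dtd_throttle
 *   config_changed - true if the update touches the configuration
 *   patched        - false: old behaviour, true: behaviour after the patch
 */
bool should_validate(unsigned int *throttle, bool config_changed, bool patched)
{
    bool check_dtd = true;

    assert(throttle != NULL);

    /* Non-zero remainder: 19 of every 20 updates take the throttled path. */
    if ((*throttle)++ % 20) {
        /* Old code always skipped validation here; the patched code
         * skips it only when the configuration did not change. */
        check_dtd = patched ? config_changed : false;
    }
    return check_dtd;
}

/* Hypothetical helper: count how many of n_updates identical updates
 * end up being validated. */
unsigned int count_validated(unsigned int n_updates, bool config_changed, bool patched)
{
    unsigned int throttle = 0;
    unsigned int validated = 0;

    for (unsigned int i = 0; i < n_updates; i++) {
        if (should_validate(&throttle, config_changed, patched)) {
            validated++;
        }
    }
    return validated;
}
```

Under these assumptions, 40 replicated configuration updates are validated only twice with the old logic (the 1st and the 21st) but all 40 times with the patched logic, while status-only updates remain throttled to 2 validations in either case. That matches the intent stated in the patch comment: skipping validation is only safe for status updates.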
Created attachment 915897: Comment (this comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).