Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1092106

Summary: pacemaker: CIB can be corrupted via pcs commands
Product: Red Hat Enterprise Linux 7
Reporter: Robert Peterson <rpeterso>
Component: pacemaker
Assignee: Andrew Beekhof <abeekhof>
Status: CLOSED NOTABUG
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 7.0
CC: abeekhof, cluster-maint, dvossel, fdinitto, rpeterso, tis
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-07-08 02:00:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1111381
Attachments: crm report (flags: none)

Description Robert Peterson 2014-04-28 18:24:44 UTC
Description of problem:
I was trying to follow the steps described in this document:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/7-Beta/html/Global_File_System_2/ch-clustsetup-GFS2.html
My goal was to set up a set of ten GFS2 file systems that
would be mounted by pacemaker. I managed to corrupt the CIB,
making it unusable, using only pcs commands. Not knowing any
better, I was using cluster-ssh to run the pcs commands
simultaneously from all nodes.

Version-Release number of selected component (if applicable):
RHEL 7.0

How reproducible:
Seems consistent

Steps to Reproduce:
1. Build a four node cluster
2. Start a cssh session in order to simultaneously type these
   commands to all four nodes at the same time:
   systemctl disable pacemaker.service
   rm /var/lib/pacemaker/cib/* ; sync ; sync ; sync
   /sbin/reboot -fin
3. After they all reboot, and come back up, start another cssh
   session in order to simultaneously type these commands to all
   four nodes at the same time:
systemctl start  pacemaker.service
sleep 10
pcs property set no-quorum-policy=freeze
pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s on-fail=fence clone interleave=true ordered=true
pcs resource create clvmd ocf:heartbeat:clvm op monitor interval=30s on-fail=fence clone interleave=true ordered=true
pcs constraint order start dlm-clone then clvmd-clone
pcs constraint colocation add clvmd-clone with dlm-clone
pcs resource create clusterfs Filesystem device="/dev/mpathc/i8ca" directory="/mnt/gfs2a" fstype="gfs2" "options=noatime" op monitor interval=10s on-fail=fence clone interleave=true
pcs resource create clusterfs Filesystem device="/dev/mpathc/i8cb" directory="/mnt/gfs2b" fstype="gfs2" "options=noatime" op monitor interval=10s on-fail=fence clone interleave=true
pcs resource create clusterfs Filesystem device="/dev/mpathc/i8cc" directory="/mnt/gfs2c" fstype="gfs2" "options=noatime" op monitor interval=10s on-fail=fence clone interleave=true
pcs resource create clusterfs Filesystem device="/dev/mpathc/i8cd" directory="/mnt/gfs2d" fstype="gfs2" "options=noatime" op monitor interval=10s on-fail=fence clone interleave=true
pcs resource create clusterfs Filesystem device="/dev/mpathc/i8ce" directory="/mnt/gfs2e" fstype="gfs2" "options=noatime" op monitor interval=10s on-fail=fence clone interleave=true
pcs resource create clusterfs Filesystem device="/dev/mpathc/i8cf" directory="/mnt/gfs2f" fstype="gfs2" "options=noatime" op monitor interval=10s on-fail=fence clone interleave=true
pcs resource create clusterfs Filesystem device="/dev/mpathc/i8cg" directory="/mnt/gfs2g" fstype="gfs2" "options=noatime" op monitor interval=10s on-fail=fence clone interleave=true
pcs resource create clusterfs Filesystem device="/dev/mpathc/i8ch" directory="/mnt/gfs2h" fstype="gfs2" "options=noatime" op monitor interval=10s on-fail=fence clone interleave=true
pcs resource create clusterfs Filesystem device="/dev/mpathc/i8ci" directory="/mnt/gfs2i" fstype="gfs2" "options=noatime" op monitor interval=10s on-fail=fence clone interleave=true
pcs resource create clusterfs Filesystem device="/dev/mpathc/i8cj" directory="/mnt/gfs2j" fstype="gfs2" "options=noatime" op monitor interval=10s on-fail=fence clone interleave=true
pcs constraint order start clvmd-clone then clusterfs-clone
pcs constraint order start dlm-clone then clvmd-clone
pcs constraint colocation add clvmd-clone with dlm-clone
pcs constraint order start clvmd-clone then clusterfs-clone
sync ; sync ; sync
/sbin/reboot
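
One way to avoid several nodes racing to rewrite the CIB at once is to stage all changes in a local file on a single node and push them in one operation. A hedged sketch of that workflow using the standard `pcs cluster cib` / `pcs cluster cib-push` commands (the file name is arbitrary, and only a few of the resource commands above are repeated here for illustration):

```sh
# Dump the current CIB to a local file (run on ONE node only)
pcs cluster cib gfs2_cfg

# Apply changes to the local file instead of the live CIB
pcs -f gfs2_cfg property set no-quorum-policy=freeze
pcs -f gfs2_cfg resource create dlm ocf:pacemaker:controld \
    op monitor interval=30s on-fail=fence clone interleave=true ordered=true
pcs -f gfs2_cfg constraint order start dlm-clone then clvmd-clone

# Push the whole batch back to the cluster as a single update
pcs cluster cib-push gfs2_cfg
```

Because only one node writes, the concurrent-replace pattern that cssh produces never occurs.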

Actual results:
pacemaker doesn't start: Errors found during check: config not valid

Expected results:
pacemaker should not allow me to corrupt the cib

Additional info:

Comment 2 David Vossel 2014-04-28 19:14:54 UTC
This is not a valid way to manage the cib, but I could see where users coming from rgmanager might actually hit something like this. It would be nice if we handled it in a sane way rather than hosing up the entire cluster.

I believe the new cib v2 updates will fix this since all updates will be applied in order via cpg rather than being applied locally and then sent out via cpg.  rhel 7.1 is already scheduled to receive that cib update for the performance optimizations it provides.

-- Vossel

Comment 3 Fabio Massimo Di Nitto 2014-04-29 04:11:10 UTC
(In reply to David Vossel from comment #2)
> This is not a valid way to manage the cib, but I could see where users
> coming from rgmanager might actually hit something like this. It would be
> nice if we handled it in a sane way rather than hosing up the entire cluster.
> 
> I believe the new cib v2 updates will fix this since all updates will be
> applied in order via cpg rather than being applied locally and then sent out
> via cpg.  rhel 7.1 is already scheduled to receive that cib update for the
> performance optimizations it provides.
> 
> -- Vossel

Valid or not, I don't think CIB corruption is acceptable in any way IMHO.

Can we have an interim fix in 7.0.z that doesn't require a full rebase?

Comment 4 Andrew Beekhof 2014-04-29 06:19:20 UTC
I'd need more details about "Errors found during check: config not valid"
A crm_report would do it.

Even running all those commands on all nodes shouldn't be able to produce an invalid config.

I do notice that stonith isn't enabled... that might do it.
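
With shared storage, Pacemaker refuses to start resources until fencing is configured, which is what produces the "config not valid" check failure. A hedged sketch of how this is typically addressed with pcs (the fence agent and its parameters here are purely illustrative, not taken from this cluster):

```sh
# Check the current setting
pcs property show stonith-enabled

# Either configure a fence device (agent and parameters illustrative) ...
pcs stonith create myfence fence_ipmilan ipaddr=10.0.0.1 login=admin passwd=secret

# ... or, for throwaway test clusters ONLY, disable STONITH outright
pcs property set stonith-enabled=false
```

Disabling STONITH is unsafe on any cluster with shared data; it is shown only to explain why the verification error clears.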

Comment 5 Robert Peterson 2014-04-29 12:15:26 UTC
Created attachment 890759 [details]
crm report

I gave the crm report to dvossel, but here's a copy.

Comment 6 Andrew Beekhof 2014-05-02 00:28:08 UTC
Right, so as I proposed in comment #4, it's considered an invalid configuration for reasons entirely unrelated to the cssh session:

[10:20 AM] beekhof@f19 ~/Development/sources/pacemaker/rhel-branches ☹ # tools/crm_verify -x ~/Downloads/pe-input-15.bz2 -V
   error: unpack_resources: 	Resource start-up disabled since no STONITH resources have been defined
   error: unpack_resources: 	Either configure some or disable STONITH with the stonith-enabled option
   error: unpack_resources: 	NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid

I do see this in the logs:

Apr 28 12:34:49 gfs-i8c-01 cib[3595]: error: xml_log: Invalid attribute id for element rsc_order
Apr 28 12:34:49 gfs-i8c-01 cib[3595]: error: xml_log: Element constraints has extra content: rsc_order
Apr 28 12:34:49 gfs-i8c-01 cib[3595]: error: xml_log: Invalid sequence in interleave
Apr 28 12:34:49 gfs-i8c-01 cib[3595]: error: xml_log: Element configuration failed to validate content
Apr 28 12:34:49 gfs-i8c-01 cib[3595]: error: xml_log: Element cib failed to validate content

but that's a precursor to:
 
Apr 28 12:34:49 gfs-i8c-01 cib[3595]: warning: cib_perform_op: Updated CIB does not validate against pacemaker-1.2 schema/dtd
Apr 28 12:34:49 gfs-i8c-01 cib[3595]: warning: cib_diff_notify: Local-only Change (client:cibadmin, call: 2): 0.18.1 (Update does not conform to the configured schema)
Apr 28 12:34:49 gfs-i8c-01 cib[3595]: warning: cib_process_request: Completed cib_replace operation for section constraints: Update does not conform to the configured schema (rc=-203, origin=local/cibadmin/2, version=0.17.1)

which is pacemaker rejecting an update because it would have made the configuration inconsistent/invalid.
I.e., it shows us actively preventing $subject.

Can I close?

Comment 7 Fabio Massimo Di Nitto 2014-07-05 05:56:40 UTC
Clearing Zstream request till we get to the bottom of it.