Bug 1412309

Summary: crm_diff --no_version doesn't properly ignore version info
Product: Red Hat Enterprise Linux 7 Reporter: Ken Gaillot <kgaillot>
Component: pacemakerAssignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.3CC: abeekhof, cfeist, cluster-maint, cluster-qe, fdinitto, idevat, kgaillot, michele, mnovacek, omular, rsteiger, snagar, tojeline
Target Milestone: rc   
Target Release: 7.4   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: pacemaker-1.1.16-2.el7 Doc Type: No Doc Update
Doc Text:
This minor issue was not reported by a customer.
Story Points: ---
Clone Of: 1404233 Environment:
Last Closed: 2017-08-01 17:54:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1404233, 1441673, 1482623    

Description Ken Gaillot 2017-01-11 17:35:07 UTC
+++ This bug was initially created as a clone of Bug #1404233 +++

Description of problem:
(Bug opened after an email discussion with Ken, Tomas and Andrew)

Currently pcs does a full CIB replacement when adding a resource or a constraint.
So if I start with a cluster that has one resource:
[root@centos tests]# pcs status
...
Online: [ centos ]

Full list of resources:

 Clone Set: delay1-clone [delay1]
     Started: [ centos ]
...

I then make two copies of the CIB and add two separate resources to each CIB
[root@centos tests]# pcs cluster cib 1.xml
[root@centos tests]# pcs cluster cib 2.xml
[root@centos tests]# /usr/sbin/pcs -f 1.xml resource create delayfoo ocf:heartbeat:Delay op start timeout=200s stop timeout=200s monitor timeout=200s --clone interleave=true
[root@centos tests]# /usr/sbin/pcs -f 2.xml resource create delaybar ocf:heartbeat:Delay op start timeout=200s stop timeout=200s monitor timeout=200s --clone interleave=true

[root@centos tests]# pcs cluster cib-push --config 1.xml 
CIB updated
[root@centos tests]# pcs status
...
Online: [ centos ]

Full list of resources:

 Clone Set: delay1-clone [delay1]
     Started: [ centos ]
 Clone Set: delayfoo-clone [delayfoo]
     Stopped: [ centos ]
...

Now when I push the second CIB with a separate resource I see that it actually
replaces the CIB:
[root@centos tests]# pcs cluster cib-push --config 2.xml 
CIB updated
[root@centos tests]# pcs status
...
Online: [ centos ]

Full list of resources:

 Clone Set: delay1-clone [delay1]
     Started: [ centos ]
 Clone Set: delaybar-clone [delaybar]
     Stopped: [ centos ]
...

The expected result would be one of the following:
A) An error when trying to push 2.xml
B) Both delayfoo and delaybar created


Version-Release number of selected component (if applicable):
pcs-0.9.152-10.el7.centos.x86_64
pacemaker-1.1.15-11.el7_3.2.x86_64
corosync-2.4.0-4.el7.x86_64

Additional Info:
The reason for needing either an error or a "differential CIB update" is so that we can implement composable HA race-free in TripleO.

--- Additional comment from Chris Feist on 2016-12-13 17:37:43 EST ---

I talked with Andrew and I think we want to do two things here (at least to start).

Add a command to push only a diff between 2 cib versions, so the workflow would look like this:

1.  pcs cib file1
2.  cp file1 file1.orig
3.  pcs -f file1 <do stuff here>
4.  pcs cib-push file1 --diff-only=file1.orig

So we would only be sending the difference between the original and the new files instead of updating the entire CIB.


The other thing we can do is when we're not working with files we should only push diffs instead of the full CIB.  That way it should reduce race issues when two programs are updating different parts of the CIB.

There's a pacemaker tool that will give you the diff between two CIBs that is suitable (crm_diff).

--- Additional comment from Ken Gaillot on 2017-01-11 12:30:39 EST ---

Per discussion with Andrew Beekhof, there is a bug in pacemaker's crm_diff that will need to be fixed for this to work properly without needing to loop, so I will clone this for pacemaker.

Comment 1 Ken Gaillot 2017-01-11 17:36:27 UTC
From email by Andrew:

This is a bug.
The intention of --no-version is that, well, no version information
would be included in the patch.

Specifically this part of the output from 'crm_diff --original aaa.xml
--new bbb.xml  --no-version' should have been dropped:

  <change operation="modify" path="/cib">
    <change-list>
      <change-attr name="epoch" operation="set" value="9"/>
      <change-attr name="num_updates" operation="set" value="0"/>
    </change-list>
    <change-result>
      <cib crm_feature_set="3.0.10" validate-with="pacemaker-2.3"
epoch="9" num_updates="0" admin_epoch="0" cib-last-written="Sun Dec 11
21:20:36 2016" update-origin="site09-1" update-client="cibadmin"
update-user="root" have-quorum="1" dc-uuid="3"/>
    </change-result>
  </change>


With:

diff --git a/tools/xml_diff.c b/tools/xml_diff.c
index cd3cb29..44aab5d 100644
--- a/tools/xml_diff.c
+++ b/tools/xml_diff.c
@@ -199,6 +199,20 @@ main(int argc, char **argv)
             return rc;
         }
     } else {
+        int lpc = 0;
+        const char *vfields[] = {
+            XML_ATTR_GENERATION_ADMIN,
+            XML_ATTR_GENERATION,
+            XML_ATTR_NUMUPDATES,
+        };
+
+        if (no_version) {
+            for (lpc = 0; lpc < DIMOF(vfields); lpc++) {
+                const char *value = crm_element_value(object_1, vfields[lpc]);
+                crm_xml_add(object_2, vfields[lpc], value);
+            }
+        }
+
         xml_track_changes(object_2, NULL, object_2, FALSE);
         xml_calculate_changes(object_1, object_2);
         crm_log_xml_debug(object_2, xml_file_2?xml_file_2:"target");
@@ -247,19 +261,11 @@ main(int argc, char **argv)
                     XML_TAG_DIFF_ADDED,
                 };

-                const char *vfields[] = {
-                    XML_ATTR_GENERATION_ADMIN,
-                    XML_ATTR_GENERATION,
-                    XML_ATTR_NUMUPDATES,
-                };
-
                 for (i = 0; i < DIMOF(tags); i++) {
                     xmlNode *tmp = NULL;

                     tmp = find_xml_node(output, tags[i], FALSE);
                     if (tmp) {
-                        int lpc = 0;
-
                         for (lpc = 0; lpc < DIMOF(vfields); lpc++) {
                             xml_remove_prop(tmp, vfields[lpc]);
                         }

we get:

[11:09 AM] beekhof@fedora ~/Development/sources/pacemaker/devel ☺ #
tools/crm_diff -o aaa.xml -n bbb.xml --no-version
<diff format="2">
  <change operation="create" path="/cib/configuration/resources" position="1">
    <primitive class="ocf" id="bbb" provider="pacemaker" type="Stateful">
      <instance_attributes id="bbb-instance_attributes"/>
      <operations>
        <op id="bbb-start-interval-0s" interval="0s" name="start" timeout="20"/>
        <op id="bbb-stop-interval-0s" interval="0s" name="stop" timeout="20"/>
        <op id="bbb-monitor-interval-10" interval="10" name="monitor"
role="Master" timeout="20"/>
        <op id="bbb-monitor-interval-11" interval="11" name="monitor"
role="Slave" timeout="20"/>
      </operations>
    </primitive>
  </change>
</diff>

which wont conflict with ccc.patch and no looping required

Comment 3 Ken Gaillot 2017-01-12 18:17:41 UTC
A reproducer that more specifically addresses the pacemaker issue here is:

1. Create and start a pacemaker cluster.
2. Run these commands:

pcs resource create aaa ocf:pacemaker:Dummy
pcs cluster cib aaa.xml
cp aaa.xml bbb.xml
cp aaa.xml ccc.xml
pcs -f bbb.xml resource create bbb ocf:pacemaker:Dummy
pcs -f ccc.xml resource create ccc ocf:pacemaker:Dummy
crm_diff --original aaa.xml --new bbb.xml --no-version > bbb.patch
crm_diff --original aaa.xml --new ccc.xml --no-version > ccc.patch
cibadmin --xml-pipe --patch < bbb.patch
cibadmin --xml-pipe --patch < ccc.patch

3. After the fix, the final command will succeed. Before the fix, the final command will fail with:

Call cib_apply_diff failed (-205): Update was older than existing configuration

Comment 4 Ken Gaillot 2017-01-13 22:50:08 UTC
Fixed upstream by commit 20a74b9

Comment 6 michal novacek 2017-05-16 12:55:05 UTC
I have verified that crm_diff --no_version works correctly with
pacemaker-1.1.16-9.el7.x86_64

----


unpatched version pacemaker-1.1.16-1.el7.x86_64
===============================================

[root@tardis-01 ~]# /root/bz1412309.sh
+ pcs resource create aaa ocf:pacemaker:Dummy
+ pcs cluster cib aaa.xml
+ cp aaa.xml bbb.xml
+ cp aaa.xml ccc.xml
+ pcs -f bbb.xml resource create bbb ocf:pacemaker:Dummy
+ pcs -f ccc.xml resource create ccc ocf:pacemaker:Dummy
+ crm_diff --original aaa.xml --new bbb.xml --no-version
+ crm_diff --original aaa.xml --new ccc.xml --no-version
+ cibadmin --xml-pipe --patch
+ cibadmin --xml-pipe --patch
Call cib_apply_diff failed (-205): Update was older than existing configurationjk

patch version pacemaker-1.1.16-9.el7.x86_64
===========================================
[root@tardis-01 ~]# /root/bz1412309.sh 
+ pcs resource create aaa ocf:pacemaker:Dummy
+ pcs cluster cib aaa.xml
+ cp aaa.xml bbb.xml
+ cp aaa.xml ccc.xml
+ pcs -f bbb.xml resource create bbb ocf:pacemaker:Dummy
+ pcs -f ccc.xml resource create ccc ocf:pacemaker:Dummy
+ crm_diff --original aaa.xml --new bbb.xml --no-version
+ crm_diff --original aaa.xml --new ccc.xml --no-version
+ cibadmin --xml-pipe --patch
+ cibadmin --xml-pipe --patch

[root@tardis-01 ~]# pcs status
Cluster name: STSRHTS27159
Stack: corosync
Current DC: tardis-01.ipv4 (version 1.1.16-9.el7-94ff4df) - partition with quorum
Last updated: Tue May 16 14:53:07 2017
Last change: Tue May 16 14:53:04 2017 by root via cibadmin on tardis-01.ipv4

2 nodes configured
9 resources configured

Online: [ tardis-01.ipv4 tardis-03.ipv4 ]

Full list of resources:

 fence-tardis-03        (stonith:fence_ipmilan):        Started tardis-01.ipv4
 fence-tardis-01        (stonith:fence_ipmilan):        Started tardis-03.ipv4
 Clone Set: dlm-clone [dlm]
     Started: [ tardis-01.ipv4 tardis-03.ipv4 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ tardis-01.ipv4 tardis-03.ipv4 ]
 aaa    (ocf::pacemaker:Dummy): Started tardis-01.ipv4
 ccc    (ocf::pacemaker:Dummy): Started tardis-03.ipv4
 bbb    (ocf::pacemaker:Dummy): Started tardis-01.ipv4

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled


-----

> (1) pcs status
[root@tardis-01 ~]# pcs status
Cluster name: STSRHTS27159
Stack: corosync
Current DC: tardis-01.ipv4 (version 1.1.16-9.el7-94ff4df) - partition with quorum
Last updated: Tue May 16 14:39:46 2017
Last change: Tue May 16 14:37:09 2017 by root via cibadmin on tardis-03.ipv4

2 nodes configured
6 resources configured

Online: [ tardis-01.ipv4 tardis-03.ipv4 ]

Full list of resources:

 fence-tardis-03        (stonith:fence_ipmilan):        Started tardis-01.ipv4
 fence-tardis-01        (stonith:fence_ipmilan):        Started tardis-03.ipv4
 Clone Set: dlm-clone [dlm]
     Started: [ tardis-01.ipv4 tardis-03.ipv4 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ tardis-01.ipv4 tardis-03.ipv4 ]

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Comment 7 errata-xmlrpc 2017-08-01 17:54:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1862