Bug 1611631

Summary: 'cibadmin --upgrade' times out on non-DC nodes if schema is already the latest available
Product: Fedora
Component: pacemaker
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Status: CLOSED RAWHIDE
Severity: medium
Priority: low
Reporter: Tomas Jelinek <tojeline>
Assignee: Jan Pokorný [poki] <jpokorny>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: andrew, anprice, jpokorny, lhh, tojeline
Fixed In Version: pacemaker-2.0.0-2.fc29
Last Closed: 2018-08-09 14:54:03 UTC
Type: Bug

Description Tomas Jelinek 2018-08-02 13:42:43 UTC
Description of problem:
When 'cibadmin --upgrade --force' is run on a node which is not currently the DC and the CIB schema is already the latest available, the command exits with a timeout error. When the same command is run on the DC, or when the schema is not the latest, everything works as expected.


Version-Release number of selected component (if applicable):
pacemaker-2.0.0-1.fc29.1.x86_64


How reproducible:
always, easily


Steps to Reproduce:
[root@fed28-node1:~]# crm_mon -1 | grep DC
Current DC: fed28-node2 (version 2.0.0-1.fc29.1-8cf3fe749e) - partition with quorum
[root@fed28-node1:~]# cibadmin --query | head -n 1
<cib crm_feature_set="3.1.0" validate-with="pacemaker-3.1" epoch="12" num_updates="0" admin_epoch="2" cib-last-written="Thu Aug  2 12:12:50 2018" update-origin="fed28-node1" update-client="cibadmin" update-user="root" have-quorum="1" dc-uuid="2">
[root@fed28-node1:~]# cibadmin --upgrade --force
Call cib_upgrade failed (-62): Timer expired
[root@fed28-node1:~]# echo $?
124
[root@fed28-node1:~]# cibadmin --query | head -n 1
<cib crm_feature_set="3.1.0" validate-with="pacemaker-3.1" epoch="12" num_updates="0" admin_epoch="2" cib-last-written="Thu Aug  2 12:12:50 2018" update-origin="fed28-node1" update-client="cibadmin" update-user="root" have-quorum="1" dc-uuid="2">


Actual results:
[root@fed28-node1:~]# cibadmin --upgrade --force
Call cib_upgrade failed (-62): Timer expired
[root@fed28-node1:~]# echo $?
124


Expected results:
It should not matter on which node the command is run; the result should be the same.
Results from DC with pacemaker-2.0.0-1.fc29.1.x86_64:
[root@fed28-node2:~]# cibadmin --upgrade --force
Call cib_upgrade failed (-211): Schema is already the latest available
[root@fed28-node2:~]# echo $?
1
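Until the fix is available, a workaround is to run the upgrade on the DC itself, where the timeout does not occur. The DC name can be parsed from `crm_mon -1` output; the snippet below is an illustrative sketch (not part of the Pacemaker tooling) run against the sample output from the reproducer above — on a live cluster you would pipe `crm_mon -1` itself instead of the sample string.

```shell
# Extract the DC node name from (sample) `crm_mon -1` output, as seen in
# the reproducer above; on a live cluster, pipe crm_mon itself instead.
sample='Current DC: fed28-node2 (version 2.0.0-1.fc29.1-8cf3fe749e) - partition with quorum'
dc=$(printf '%s\n' "$sample" | sed -n 's/^Current DC: \([^ ]*\).*/\1/p')
echo "$dc"   # prints fed28-node2
```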


Additional info:
pacemaker.log on node1:
fed28-node1 pacemaker-based     [626] (cib_process_request)     info: Forwarding cib_upgrade operation for section 'all' to all (origin=local/cibadmin/2)

pacemaker.log on node2:
fed28-node2 pacemaker-based     [598] (cib_process_request)     warning: Completed cib_upgrade operation for section 'all': Schema is already the latest available (rc=-211, origin=fed28-node1/cibadmin/2, version=2.12.0)

Comment 1 Ken Gaillot 2018-08-02 17:25:26 UTC
The exit status of 1 on the DC is also a bug. It should be 0, and the message should be "Upgrade unnecessary: Schema is already the latest available". We can include that issue in this bz, too.

Comment 2 Ken Gaillot 2018-08-08 15:52:28 UTC
Upon investigation, I found the timeout issue has existed since at least upstream 1.1.11 (I reproduced as far back as 1.1.16) and possibly always.

When a non-DC node gets an upgrade request, it forwards it to the DC. When the DC gets it, if an upgrade is required, it resends the request to all nodes asking for an upgrade to a specific version, and all the nodes perform that upgrade locally, notifying any clients (such as cibadmin) of the result.

The problem is that if an upgrade is not required, the DC does not do anything further, so the non-DC nodes never do anything either, and the client doesn't get any notification.

We will need to change it such that the DC always sends a result to at least the requesting node, even if an upgrade is not required. This will only work once all cluster nodes in a cluster are upgraded to a pacemaker version with the fix (that is, in a rolling upgrade, the fix will not take effect until all nodes are upgraded).
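The silent-DC behavior described above can be illustrated with a toy shell model (purely illustrative, not Pacemaker code): the handler emits a reply only when an upgrade is actually performed, so a client waiting on the already-latest path receives nothing and eventually times out.

```shell
# Toy model of the pre-fix DC handler: it replies only when an upgrade
# happens, and stays silent when the schema is already the latest.
dc_handle_upgrade() {
    if [ "$1" = "needs-upgrade" ]; then
        echo "reply: upgraded"      # nodes upgrade, client is notified
    fi
    # pre-fix bug: no reply at all on the already-latest path
}

reply=$(dc_handle_upgrade already-latest)
if [ -z "$reply" ]; then
    echo "client: no reply from DC -> Timer expired (-62)"
fi
```

The fix amounts to making the handler send a reply on both paths, so the requesting node always has something to relay back to the client.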

Comment 3 Ken Gaillot 2018-08-09 14:01:45 UTC
The timeout issue is fixed upstream by commit 1f05f5e2 and the exit status issue by commit f5e936fb.

Re-assigning to Jan Pokorný for release