Bug 1388602

Summary: Document rolling upgrades to 7.3 for remote nodes
Product: Red Hat Enterprise Linux 7 Reporter: Steven J. Levine <slevine>
Component: doc-Cluster_General Assignee: Steven J. Levine <slevine>
Status: CLOSED NOTABUG QA Contact: ecs-bugs
Severity: unspecified Docs Contact: Steven J. Levine <slevine>
Priority: unspecified    
Version: 7.3 CC: cfeist, fdinitto, jruemker, kgaillot, slevine
Target Milestone: rc Keywords: Documentation
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Known Issue
Doc Text:
Upgrading Pacemaker clusters with remote nodes requires a special procedure
The version of Pacemaker in Red Hat Enterprise Linux 7.3 incorporates an increase in the version number of the remote node protocol. Because of this, cluster nodes running Pacemaker in Red Hat Enterprise Linux 7.3 and remote nodes running earlier versions of Red Hat Enterprise Linux are not able to communicate with each other. As a result, a rolling upgrade of a Pacemaker cluster that includes remote nodes requires a special procedure, as documented in https://access.redhat.com/node/2726571.
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-10-26 17:46:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Steven J. Levine 2016-10-25 18:15:00 UTC
This needs to be documented. It's too late to get this into the 7.3 Pacemaker reference, so we can put out a portal article and then move the info to the product doc if we decide it needs to be there in the long term.

From Ken Gaillot:

Hi Steven,

We recently realized a potential issue with upgrades to 7.3.

The new version of pacemaker in 7.3 involves an increase in the version
number of the remote node protocol. This is the first time that's
happened, so we didn't realize the potential consequences on customer
upgrades.

Cluster nodes running 7.3 pacemaker will not be able to talk to any
remote nodes running an older version, and vice versa. This means that
rolling upgrades of a pacemaker cluster with remote nodes are tricky.

We haven't verified the issue or the fix yet, but this is what I expect
the upgrade process will need to look like in such a case:

1. Upgrade half the cluster's corosync nodes like this:
* Add -INFINITY location constraints for each remote node or guest node
vs the corosync node being upgraded. This ensures that no remote nodes
connect to the node after it is upgraded.
* Stop and upgrade the corosync node, and return it to the cluster.

2. Upgrade all the remote/guest nodes like this:
* Stop the remote node or guest node. (Guest nodes are stopped by
disabling their resource.)
* Remove all the constraints added in step 1 for the remote node or
guest node, and add -INFINITY constraints for the remote node or guest
node vs all the *other* corosync nodes. This ensures that the upgraded
remote node will not connect to an older corosync node.
* Upgrade the node, and return it to the cluster. For guest nodes, this
involves bringing the virtual machine up outside cluster control,
upgrading it, stopping the virtual machine, then re-enabling the guest
node resource.

3. Upgrade the remaining corosync nodes like this:
* Upgrade the host as normal.
* Remove all constraints added for the node in step 2.

If the above process is not followed, the cluster will see failures of
the remote node connection resources and guest node VM resources,
potentially involving very lengthy recovery times, and depending on the
order of upgrades, eventually the remote nodes and guest nodes may be
unable to run at all until more nodes have been upgraded.

This only affects clusters with remote nodes or guest nodes. Other
clusters may perform rolling upgrades in the usual manner.
-- 
Ken Gaillot <kgaillot>
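
The three steps in the message above can be sketched as a generated command list. The following is a minimal, unverified sketch in Python that only builds strings and runs nothing. The node names, resource names, and "upgrade-ban-" constraint IDs are hypothetical. It assumes the pcs subcommands "constraint location add", "constraint remove", "cluster stop", "cluster start", "resource disable", and "resource enable" as shipped with Red Hat Enterprise Linux 7. It also assumes that remote nodes are stopped by disabling their connection resource, which the message states only for guest nodes.

```python
# Hypothetical sketch of the rolling-upgrade order described above.
# It only builds command strings; nothing is executed.


def ban_id(resource, node):
    """Constraint ID (made-up naming scheme) so a ban can be removed later."""
    return f"upgrade-ban-{resource}-on-{node}"


def ban(resource, node):
    """-INFINITY location constraint: keep `resource` off corosync `node`."""
    return (f"pcs constraint location add {ban_id(resource, node)} "
            f"{resource} {node} -INFINITY")


def unban(resource, node):
    """Remove a constraint previously created by ban()."""
    return f"pcs constraint remove {ban_id(resource, node)}"


def plan(first_half, second_half, remote_resources):
    """Return the ordered command list for steps 1 to 3."""
    cmds = []
    # Step 1: upgrade the first half of the corosync nodes.
    for node in first_half:
        cmds += [ban(r, node) for r in remote_resources]
        cmds.append(f"pcs cluster stop {node}")
        cmds.append(f"# upgrade packages on {node}")
        cmds.append(f"pcs cluster start {node}")
    # Step 2: upgrade every remote node or guest node.
    for r in remote_resources:
        cmds.append(f"pcs resource disable {r}")  # assumption for remote nodes
        cmds += [unban(r, n) for n in first_half]
        cmds += [ban(r, n) for n in second_half]
        cmds.append(f"# upgrade the node behind {r} (guest: VM up outside the cluster)")
        cmds.append(f"pcs resource enable {r}")
    # Step 3: upgrade the remaining corosync nodes, then drop the step 2 bans.
    for node in second_half:
        cmds.append(f"pcs cluster stop {node}")
        cmds.append(f"# upgrade packages on {node}")
        cmds.append(f"pcs cluster start {node}")
        cmds += [unban(r, node) for r in remote_resources]
    return cmds


if __name__ == "__main__":
    for cmd in plan(["node1"], ["node2"], ["remote1"]):
        print(cmd)
```

Lines that start with "#" in the output mark manual work, such as the package upgrade itself, which the sketch does not attempt to model. Because the procedure had not been verified when the message was written, the output is best treated as a checklist to review rather than a script to run.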

Comment 1 Steven J. Levine 2016-10-25 18:57:01 UTC
Portal article will be here (currently in very basic draft state):

https://access.redhat.com/articles/2726571

Comment 2 Steven J. Levine 2016-10-25 19:15:44 UTC
Draft is now formatted and ready for review and testing:

https://access.redhat.com/articles/2726571