Bug 1212435
| Summary: | [RFE] Add semi-automatic rolling-update helper to pcs | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jan Pokorný [poki] <jpokorny> |
| Component: | pcs | Assignee: | Tomas Jelinek <tojeline> |
| Status: | CLOSED WONTFIX | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.2 | CC: | cfeist, cluster-maint, kgaillot, plambri, tojeline |
| Target Milestone: | rc | Keywords: | FutureFeature |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1388827 (view as bug list) | Environment: | |
| Last Closed: | 2020-12-15 07:34:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1388827 | | |
Considering the not-so-straightforward procedures in some cases (e.g. pacemaker_remoted 1.1.15 won't talk to pacemaker 1.1.14; the source of the issue is apparent in the logs on the pacemaker_remoted side), it would be really useful to have this feature implemented, with rather flexible ways to express the logic for particular scenarios. It would be good to have constructs like these to express the steps:

- get the set of cluster nodes/remote nodes
- perform an action on all, or all but one, of the nodes in that set
- make resources avoid particular nodes, and cancel that again (maintenance mode or ban/unban?)
- open/close an upgrade window in which the user is supposed to perform the upgrade

One twist is that rolling upgrades require attention to certain internal protocol versioning, which currently includes the crm feature set and the lrmd protocol version. The crm feature set is easily obtainable from <cib crm_feature_set="..."> in the CIB, but there is currently no way for an external program to check the lrmd protocol version. Probably some pacemaker CLI tool should be able to provide all versions (pacemaker, crm feature set, lrmd protocol) on request. Note that each of these is per-node, not per-cluster. The pacemaker version itself is of no concern for rolling upgrades.

The crm feature set applies to full cluster nodes (as opposed to remote/guest nodes). It is currently a triplet (e.g. "3.0.11"), although in the distant past (through pacemaker 1.0.1 in 2008) it was just a major-minor pair (e.g. "2.1"). If the major version (the first number) changes, a rolling upgrade is not possible. If the minor version (the second number) changes, a rolling upgrade is possible, but any node that leaves the cluster may be unable to return unless it is upgraded, so it is important that the upgrade be completed in a reasonable window. If the minor-minor version changes, it is currently treated the same as the minor version, but there are plans to change that so that it becomes irrelevant to rolling upgrades (used only to provide information to resource agents). If the crm feature set does not change, rolling upgrades are possible with no limitation.

The lrmd protocol version applies to the connection between remote/guest nodes and cluster nodes. It is currently a major-minor pair (e.g. "1.1"). There are no explicit semantics for major-version changes, but presumably they should be interpreted as making rolling upgrades impossible. If the minor version differs between a remote node and the cluster node hosting its connection, pacemaker through 1.1.14 (lrmd protocol version 1.0) will not allow the connection to proceed; pacemaker 1.1.15 and later (lrmd protocol version >= 1.1) will allow the connection to proceed only if the cluster node's version is the newer one. If the lrmd protocol version does not change, rolling upgrades are possible with no limitation.

It's complicated, but predictable. Let me know if anything is unclear.

The upgrading process is described in the Pacemaker Explained documentation: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/_upgrading.html

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
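As a footnote to the versioning discussion above: the crm-feature-set rules are mechanical enough that a helper could script them today. A minimal bash sketch follows; reading crm_feature_set off the root <cib> element via cibadmin is my assumption of one workable approach, not an official interface, and (as noted above) there is no comparable way to obtain the lrmd protocol version at all:

```bash
#!/bin/bash
# Sketch only: decide rolling-upgrade feasibility from two crm feature
# sets, following the semantics described above.
# Usage example: ./check-feature-set.sh 3.0.11

current_feature_set() {
    # crm_feature_set is an attribute of the root <cib> element; it also
    # appears on recorded operations deeper in the CIB, hence head -n1.
    cibadmin --query | head -n1 |
        sed -n 's/.*crm_feature_set="\([^"]*\)".*/\1/p'
}

# rolling_upgrade_ok OLD NEW -> prints a verdict
rolling_upgrade_ok() {
    local old=$1 new=$2
    local old_major=${old%%.*} new_major=${new%%.*}
    local old_minor new_minor
    old_minor=$(echo "$old" | cut -d. -f2)
    new_minor=$(echo "$new" | cut -d. -f2)

    if [ "$old" = "$new" ]; then
        echo "feature set unchanged: rolling upgrade possible, no limitation"
    elif [ "$old_major" != "$new_major" ]; then
        echo "major version changed: rolling upgrade NOT possible"
    elif [ "$old_minor" != "$new_minor" ]; then
        echo "minor version changed: rolling upgrade possible," \
             "but complete it in a reasonable window"
    else
        # minor-minor change: currently treated the same as a minor change
        echo "minor-minor version changed: treat as a minor change (for now)"
    fi
}

rolling_upgrade_ok "$(current_feature_set)" "${1:?usage: $0 NEW_FEATURE_SET}"
```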
It would be nice if pcs could assist with the process of a node-by-node rolling update of (cluster or whole environment) software.

One possible analogy to the proposed feature is "git rebase --interactive" where you use the "edit" keyword for each commit. Git then:

1. does some preparation work
2. processes each queued commit in turn:
   2a. picks the current queued commit and applies it
   2b. USER HAS FREE HANDS TO MODIFY THE COMMIT NOW (in terms of "git commit --amend", for instance)
   2c. waits for "git rebase --continue" so as to move on to the next commit in the queue and apply it in 2a. (finishes if there is no commit left in the queue)
   2d. or waits for "git rebase --abort", which rolls the whole rebase operation back
   2e. or waits for "git rebase --skip", which, IIUIC, undoes 2a. and continues with the next commit at 2a.
3. possibly does some "transaction finished" finalization

In a similar vein, pcs could assist with an iteration over all the nodes in the cluster, mainly for the purpose of a per-node rolling update (a rough emulation with commands that exist today is sketched below):

1. some preparation work
   - figure out the nodes and whether the cluster is eligible for the operation at all (all nodes online and healthy)
   - internally note the operation-in-progress
2. process each node in turn:
   2a. pick the node, put it in standby mode
   2b. USER HAS FREE HANDS TO DO WHATEVER IS NEEDED ON THAT NODE (software updates, other maintenance)
   2c. wait for something like "pcs maint --continue": contact the current node, bring it back from standby mode, and move on to the next node in the queue, processing it in 2a. (finish if there is no node left in the queue)
   2d. or wait for something like "pcs maint --abort", which just tries to contact the current node and bring it back from standby mode
   2e. or wait for something like "pcs maint --skip", which performs 2d. but continues with the next node
3. "transaction finished" finalization
   - remove the internal tracking of the operation-in-progress
   - possibly check that the cluster is fully healthy (plus that the software versions match, etc.)

This could make such a complicated administrative step as a rolling update across the cluster a breeze. Thanks for considering.

This feature should perhaps not be exposed in the GUI, as it expects some level of expertise; it is nothing for a novice user.

References:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_rolling_node_by_node.html
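For illustration, the iteration proposed above can be roughly emulated with existing commands. A sketch, assuming the pcs 0.9 "pcs cluster standby"/"pcs cluster unstandby" syntax and taking the node list as arguments (the "pcs maint" subcommands are the proposal here, not existing commands):

```bash
#!/bin/bash
# Rough emulation of the proposed per-node iteration; node names are
# passed as arguments to keep the sketch free of output parsing.
# Usage example: ./rolling-update.sh node1 node2 node3

for node in "$@"; do
    # 2a. pick the node, put it in standby mode (resources migrate off
    # asynchronously; in practice, watch "pcs status" before updating)
    pcs cluster standby "$node"

    # 2b. the user has free hands on $node now (updates, maintenance)
    read -r -p "Done with $node? [continue/skip/abort] " answer

    # 2c.-2e. in all three cases the current node is brought back first
    # (in this emulation, skip necessarily behaves like continue)
    pcs cluster unstandby "$node"
    if [ "$answer" = abort ]; then
        echo "aborted; remaining nodes left untouched" >&2
        exit 1
    fi
done

# 3. finalization: check that the cluster is fully healthy again
pcs status
```

The real helper would add what the sketch cannot: persistent tracking of the operation-in-progress, the up-front eligibility check, and the version checks discussed in the comments above.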