Bug 1475416 - [RFE]Ensure that the controller node is rolled back into the cluster if minor update fails
Status: NEW
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assigned To: Carlos Camacho
QA Contact: Amit Ugol
Keywords: FutureFeature, Triaged
Depends On:
Reported: 2017-07-26 11:41 EDT by cshastri
Modified: 2017-09-11 21:59 EDT
CC: 6 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments: None
Description cshastri 2017-07-26 11:41:19 EDT
Description of problem:
During the minor update, the controller nodes are taken out of the cluster one at a time by stopping the cluster on each of them with 'pcs cluster stop'.

If the minor update fails on a controller node while it is out of the cluster, the cluster is not restarted automatically on that node: it is shown as 'OFFLINE' in 'pcs status'. The cluster then has to be started manually on that controller node with 'pcs cluster start'.

This is okay if, on the next attempt, the update procedure picks the same controller node again.

But if it picks another controller that is still 'ONLINE', the update fails with: "Error: stopping the node will cause a loss of quorum, use --force to override". I hit this issue once.

It would be good if, when the update fails on a controller node that has been taken out of the cluster, a rollback procedure ran 'pcs cluster start' on that controller to bring it back into the cluster.
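A rollback step like the one requested could be sketched as a small shell helper that scans 'pcs status' output for nodes reported OFFLINE and starts the cluster on each of them. This is only a sketch: the function names are made up here, and the exact "OFFLINE: [ node ... ]" line format is an assumption based on typical pacemaker status output.

```shell
# Sketch of an automated rollback check (hypothetical helper names).
# Assumes pacemaker lists offline members on a line such as:
#   OFFLINE: [ overcloud-controller-1 ]

# Print every node name listed after "OFFLINE:"; reads status text on stdin.
offline_nodes() {
    awk '/OFFLINE:/ {
        for (i = 2; i <= NF; i++)
            if ($i != "[" && $i != "]") print $i
    }'
}

# Rollback: restart the cluster on each node the failed update left offline.
rollback_offline() {
    pcs status | offline_nodes | while read -r node; do
        pcs cluster start "$node"
    done
}
```

Running such a check after a failed update step would bring the controller back 'ONLINE' before the next node is chosen, avoiding the loss-of-quorum error above.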

Version-Release number of selected component (if applicable):

How reproducible:

Additional info:

To work around this issue, bring the controller node back into the cluster by running 'pcs cluster start' on it, then re-run the deployment.
