Bug 1475416 - [RFE] Ensure that the controller node is rolled back into the cluster if minor update fails
Summary: [RFE] Ensure that the controller node is rolled back into the cluster if minor...
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: RFEs
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: OSP Team
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-07-26 15:41 UTC by Chaitanya Shastri
Modified: 2022-08-02 17:24 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-02 15:14:06 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Issue Tracker OSP-2652 (last updated 2022-08-02 17:24:22 UTC)

Description Chaitanya Shastri 2017-07-26 15:41:19 UTC
Description of problem:
While performing the minor update, the controller nodes are taken out of the cluster one at a time by stopping the cluster on each of them with 'pcs cluster stop'.

If the minor update fails on a controller node while that node is out of the cluster, the cluster is not started again automatically on it; the node shows as 'OFFLINE' in 'pcs status'. We then have to start the cluster on that controller node manually using 'pcs cluster start'.

This is acceptable if the next update run happens to pick the same controller node.

If the update procedure instead picks another controller that is 'ONLINE', the update fails with this error: "Error: stopping the node will cause a loss of quorum, use --force to override". This is expected: in a typical three-controller deployment, quorum requires at least two nodes online, so with one node already 'OFFLINE', stopping a second would drop the cluster below quorum. I faced this issue once.
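
One way to inspect the quorum state at this point, using the standard corosync tooling shipped with the RHEL base of RHOSP 10:

  # print vote counts and whether this partition is quorate
  corosync-quorumtool -s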

It would be good if, when the update fails on a controller node that has been taken out of the cluster, a rollback procedure ran 'pcs cluster start' on that controller to bring it back into the cluster.
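
A minimal sketch of what such a rollback hook could look like; 'run_minor_update_on_node' is a hypothetical placeholder for the real per-node update step, not part of the actual update tooling:

  # take the node out of the cluster for the update
  pcs cluster stop
  if ! run_minor_update_on_node; then  # hypothetical update step
      # on failure, rejoin the cluster so the node does not stay OFFLINE
      pcs cluster start
      exit 1
  fi
  # on success, rejoin the cluster as the update procedure already does
  pcs cluster start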

Version-Release number of selected component (if applicable):
RHOSP 10

How reproducible:
Always

Additional info:

To recover from this issue, bring the controller node back into the cluster by running 'pcs cluster start' on it, then re-run the deployment.
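
For example, on the affected controller (a sketch of the manual recovery; the exact command to re-run the deployment depends on your environment):

  # rejoin the cluster on the local node
  pcs cluster start
  # confirm the node shows as Online again before resuming the update
  pcs status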

Comment 5 Luca Miccini 2021-08-02 15:14:06 UTC
(In reply to Chaitanya from comment #0)
> It would be good if, when the update fails on a controller node that has
> been taken out of the cluster, a rollback procedure ran 'pcs cluster start'
> on that controller to bring it back into the cluster.

Hello,

A failed update is something that should be investigated carefully. While there are scenarios, such as a transient failure of some sort, where a 'pcs cluster start' would be enough to bring things back online and allow a subsequent update to succeed, there are also cases where you want to make sure the node on which the cluster was stopped is in a proper state before you restart Pacemaker.
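
As an illustration only (which checks make sense depends on why the update failed), an operator might run something like this on the affected node before rejoining it:

  # check for stale pacemaker/corosync processes left over from the failure
  ps aux | grep -E '[p]acemakerd|[c]orosync'
  # review recent cluster logs around the time of the failure
  journalctl -u pacemaker --since "1 hour ago"
  # only once the node looks healthy, rejoin the cluster
  pcs cluster start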

The fact that the minor update fails when trying to stop a second cluster node is also intentional and should be treated as a safety net to prevent unnecessary API downtime.

All in all we feel it is much better (= safer) to have operators investigate the reasons why the update failed and perform a manual recovery before attempting to resume the update itself.

To be completely safe, the rollback procedure mentioned in this RFE would have to be a complete restore from a backup, and that isn't something we can provide at the moment.

Regards.
Luca

