
Bug 2221205

Summary:	Controller upgrade shows completed even with pacemaker resources in failed state.
Product:	Red Hat OpenStack	Reporter:	Paul Jany <pgodwin>
Component:	openstack-tripleo	Assignee:	Sofer Athlan-Guyot <sathlang>
Status:	CLOSED WONTFIX	QA Contact:	Joe H. Rahme <jhakimra>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	16.2 (Train)	CC:	bshephar, jbadiapa, jpretori, mburns, ravsingh
Target Milestone:	---	Keywords:	Triaged
Target Release:	---
Hardware:	All
OS:	All
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2023-07-18 10:43:35 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Paul Jany 2023-07-07 13:05:06 UTC

Description of problem:
As part of a minor upgrade, controller-1 was updated and reported "Update complete", but pacemaker showed the haproxy and redis resources in the stopped state.


Version-Release number of selected component (if applicable):
16.2.5

How reproducible:
Minor upgrade from 16.2.2 to 16.2.5
$ openstack overcloud update run --limit controller-1

Activities:
- Controller-1 was the last controller in the list to upgrade.
- The node upgrade reported success, but the haproxy and redis resources were shown in the stopped state. Refresh and restart did not help.
- A rerun of the update completed, but the resources were still in the stopped state.
- We identified that both of these containers were missing; they had to be pulled manually and the resources had to be cleaned up.


Expected results:
- The update should fail if the containers are not pulled and started on controller-1.

Comment 3 Sofer Athlan-Guyot 2023-07-17 10:19:32 UTC

Hi,

Thanks for the feedback. Unfortunately, this is a known shortcoming of the update framework, especially around pacemaker resources. It is mainly due to how configurable the overcloud can be.

In the simple case where we have only one role named "Controller" with all the pacemaker services in it, such a check could be added as a final checkpoint (around the time of the container cleanup). The problem comes from composable roles, where we could have mysql under a "Database" role, rabbitmq under a "Messaging" role, and so on.

In that case, when the "Controller" role ends, we could still have the "Database" role being updated. As those are separate processes, one cannot know what the other is doing [1]. During the database update we could have some resources in the stopped state without that being an error. Drawing conclusions then becomes hard, because there is no way to know whether all roles and all nodes have been updated.
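The reasoning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `classify_stopped`, the `RESOURCE_ROLE` mapping and the resource names are hypothetical, and the update framework does not provide the "roles still being updated" view that the function takes as input, which is exactly the missing piece.

```python
# Hypothetical mapping of pacemaker resources to composable roles.
RESOURCE_ROLE = {
    "haproxy-bundle": "Controller",
    "galera-bundle": "Database",
    "rabbitmq-bundle": "Messaging",
}


def classify_stopped(stopped, resource_role, roles_in_progress):
    """Split stopped resources into (expected, suspect).

    A stopped resource is 'expected' while its role is still being
    updated, and 'suspect' otherwise. Resources with an unknown role
    are treated as suspect.
    """
    expected, suspect = [], []
    for resource in stopped:
        role = resource_role.get(resource)
        if role in roles_in_progress:
            expected.append(resource)
        else:
            suspect.append(resource)
    return expected, suspect


stopped_now = ["galera-bundle", "haproxy-bundle"]

# While the Database role is still updating, only haproxy looks wrong.
during_update = classify_stopped(stopped_now, RESOURCE_ROLE, {"Database"})

# Once every role is done, any stopped resource is a problem.
after_update = classify_stopped(stopped_now, RESOURCE_ROLE, set())
```

A single update process only knows about its own role, so it cannot build the `roles_in_progress` argument, and the same stopped resource can be either harmless or an error depending on information it does not have.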

Furthermore, in some specific cases we expect to have some resources in error during the update, and we clear them during the process [3].

Validation is the place to have those checks, because when it runs we know which state the update is in.

The relevant validation in OSP 16.2 is "pacemaker-status", in the "post-deployment" group [2] (https://github.com/openstack/tripleo-validations/blob/stable/train/playbooks/pacemaker-status.yaml).
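As a rough illustration of the kind of check that belongs after the update, the sketch below scans status text for resources reported as Stopped or FAILED. It is not the code of the "pacemaker-status" validation. The function name `unhealthy_resources` is hypothetical, and `SAMPLE_STATUS` is hand-written text that only approximates the layout of `pcs status` output.

```python
import re

# Matches resource lines such as:
#   * haproxy-bundle-podman-1 (ocf::heartbeat:podman): Stopped
RESOURCE_LINE = re.compile(
    r"^\s*\*?\s*(?P<name>\S+)\s+\((?P<agent>[^)]+)\):\s+(?P<state>.+?)\s*$"
)

# States treated as unhealthy in this sketch (an assumption).
BAD_STATES = ("Stopped", "FAILED")


def unhealthy_resources(status_text):
    """Return (name, state) pairs for resources that are Stopped or FAILED."""
    problems = []
    for line in status_text.splitlines():
        match = RESOURCE_LINE.match(line)
        if match and match.group("state").startswith(BAD_STATES):
            problems.append((match.group("name"), match.group("state")))
    return problems


# Hand-written sample, not captured from a real cluster.
SAMPLE_STATUS = """\
  * Container bundle set: haproxy-bundle:
    * haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-0
    * haproxy-bundle-podman-1 (ocf::heartbeat:podman): Stopped
  * Container bundle set: redis-bundle:
    * redis-bundle-0 (ocf::heartbeat:redis): Master controller-0
    * redis-bundle-1 (ocf::heartbeat:redis): Stopped
"""

problems = unhealthy_resources(SAMPLE_STATUS)
for name, state in problems:
    print(f"{name}: {state}")
```

Such a check is only meaningful once the operator knows that every role and node has finished updating, which is why it fits a validation step rather than the end of a single role's update.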

To conclude, as the update of pacemaker resources can be run by different, unrelated processes, there is no point in the code where we can easily check for its completion. Validation is the path the director framework took to solve this problem, and it is where those checks happen.

Hope this clarifies the situation,

Regards,

[1] For instance, by running these commands in parallel: openstack overcloud update run --limit Controller > controller.log &
openstack overcloud update run --limit Database > database.log
[2] Starting with OSP17.1 we have this validation in the "post-update" group.
[3] https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/train/deployment/ovn/ovn-dbs-pacemaker-puppet.yaml#L312-L316

Comment 4 Sofer Athlan-Guyot 2023-07-18 10:43:35 UTC

Hi,

For completeness, here is a pointer to the 16.2 documentation on the validation framework: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/director_installation_and_usage/index#assembly_using-the-validation-framework

Feel free to re-open this if further inquiry is needed.

Regards,