Bug 2221205 - Controller upgrade shows completed even with pacemaker resources in failed state.
Summary: Controller upgrade shows completed even with pacemaker resources in failed state.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 16.2 (Train)
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: ---
: ---
Assignee: Sofer Athlan-Guyot
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-07-07 13:05 UTC by Paul Jany
Modified: 2023-07-18 10:43 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-18 10:43:35 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-26434 0 None None None 2023-07-07 13:05:47 UTC

Description Paul Jany 2023-07-07 13:05:06 UTC
Description of problem:
As part of a minor upgrade, controller-1 was updated and reported "Update complete", but pacemaker showed the haproxy and redis resources in the stopped state.


Version-Release number of selected component (if applicable):
16.2.5

How reproducible:
Minor upgrade from 16.2.2 to 16.2.5
$ openstack overcloud update run --limit controller-1

Activities:
- Controller-1 was the last controller in the list to be upgraded.
- The node upgrade reported success, but pacemaker showed errors, with the haproxy and redis resources in the stopped state. Refresh and restart did not help.
- A rerun of the update completed, but the resources were still in the stopped state.
- We identified that both of these containers were missing; we had to pull them manually and clean up the resources.


Expected results:
- The update should fail if the containers are not pulled and started on controller-1.

Comment 3 Sofer Athlan-Guyot 2023-07-17 10:19:32 UTC
Hi,

Thanks for the feedback. Unfortunately, this is a known shortcoming of the update framework, especially around pacemaker resources. It is mainly due to how configurable the overcloud can be.

In the simple case where there is only one role named "Controller" with all the pacemaker services in it, such a check could be added as a final checkpoint (around the time of the container cleanup). The problem comes from composable roles, where we could have mysql under a "Database" role, rabbitmq under a "Messaging" role, and so on.

In that case, when the "Controller" role update ends, the "Database" role could still be updating. As those are separate processes, one cannot know what the other is doing [1]. During the database update, some resources could be in the stopped state without that being an error. Drawing a conclusion then becomes hard, because there is no way to know whether all roles and all nodes have been updated.

Furthermore, in some specific cases we expect some resources to be in error during the update, and we clear them during the process [3].

Validation is the place for those checks, because at that point we know which state the update is in.

The relevant validation in OSP16.2 is "pacemaker-status" (https://github.com/openstack/tripleo-validations/blob/stable/train/playbooks/pacemaker-status.yaml), in the "post-deployment" group [2].
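
As a rough illustration of the kind of post-update check discussed here, the sketch below scans `pcs status`-style text for resources reported as Stopped or FAILED. It is not the logic of the "pacemaker-status" validation; the function name `check_pcs_resources` is hypothetical, and the sketch assumes that the state keywords appear as whole words in the output.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of TripleO or tripleo-validations.
# Reads `pcs status`-style text on stdin, prints every line that mentions
# a Stopped or FAILED resource, and returns 1 if any such line was found.
check_pcs_resources() {
    local bad
    # -w matches whole words only; `|| true` keeps a "no match" result
    # from being treated as an error under `set -e`.
    bad=$(grep -Ew 'Stopped|FAILED' || true)
    if [ -n "$bad" ]; then
        printf '%s\n' "$bad"
        return 1
    fi
    return 0
}
```

On a controller this could be fed with `sudo pcs status | check_pcs_resources` once all roles have finished updating. As explained above, a stopped resource seen while another role is still updating is not necessarily an error.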

To conclude, as the update of pacemaker can be run by different, unrelated processes, there is no single point in the code where we can easily check for its completion. Validation is the path the director framework took to solve this problem, and it is where those checks happen.

Hope this clarifies the situation,

Regards,

[1] For instance, by running these commands in parallel:
      openstack overcloud update run --limit Controller > controller.log &
      openstack overcloud update run --limit Database > database.log
[2] Starting with OSP17.1 we have this validation in the "post-update" group.
[3] https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/train/deployment/ovn/ovn-dbs-pacemaker-puppet.yaml#L312-L316

Comment 4 Sofer Athlan-Guyot 2023-07-18 10:43:35 UTC
Hi,

For completeness, here is a pointer to the 16.2 documentation on the validation framework: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/director_installation_and_usage/index#assembly_using-the-validation-framework

Feel free to re-open this if further inquiry is needed.

Regards,

