Bug 2221205
| Summary: | Controller upgrade shows completed even with pacemaker resources in failed state. | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Paul Jany <pgodwin> |
| Component: | openstack-tripleo | Assignee: | Sofer Athlan-Guyot <sathlang> |
| Status: | CLOSED WONTFIX | QA Contact: | Joe H. Rahme <jhakimra> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 16.2 (Train) | CC: | bshephar, jbadiapa, jpretori, mburns, ravsingh |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | All | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-07-18 10:43:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Paul Jany
2023-07-07 13:05:06 UTC
Hi,

Thanks for the feedback. Unfortunately this is a known shortcoming of the update framework, especially around pacemaker resources, and it is mainly due to how configurable the overcloud can be.

In the simple case where there is only one role named "Controller" carrying all the pacemaker services, such a check could be added as a final checkpoint (around the time of the container cleanup). The problem comes from composable roles, where mysql may live under a "Database" role, rabbitmq under a "Messaging" role, and so on. In that case, when the "Controller" role finishes, the "Database" role may still be updating. As those updates run as separate processes, one cannot know what the other is doing [1]. During the database update, some resources may legitimately be in a stopped state without that being an error. Drawing a conclusion then becomes hard, because there is no way to know whether all roles and all nodes have been updated. Furthermore, in some specific cases we expect some resources to be in error during the update, and we clear them as part of the process [3].

Validation is the right place for those checks, because at that point we know what state the update is in. The relevant validation in OSP 16.2 is "pacemaker-status" in the "post-deployment" group [2] (https://github.com/openstack/tripleo-validations/blob/stable/train/playbooks/pacemaker-status.yaml).

To conclude: because the pacemaker update can be spread across unrelated processes, there is no single point in the code where its completion can easily be checked. Validation is the path the director framework took to solve this problem, and that is where those checks happen.

Hope this clarifies the situation,

Regards,

[1] For instance by running these commands in parallel:
    openstack overcloud update --limit Controller > controller.log &
    openstack overcloud update --limit Database > database.log
[2] Starting with OSP 17.1 this validation is in the "post-update" group.
[3] https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/train/deployment/ovn/ovn-dbs-pacemaker-puppet.yaml#L312-L316

Hi,

For completeness, here is a pointer to the 16.2 documentation on the validation framework: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/director_installation_and_usage/index#assembly_using-the-validation-framework

Feel free to re-open this bug if further inquiry is needed.

Regards,
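For anyone landing on this bug, below is a minimal sketch of how the checks described above could be run from the undercloud after the last role finishes updating. The exact CLI flags, the stackrc path, and the `<controller>` host are assumptions to verify against the 16.2 validation documentation linked in the comment; this is illustrative, not an official procedure.

```
# Illustrative only (assumed OSP 16.2 validator CLI syntax; verify locally).
source ~/stackrc

# Run the whole post-deployment validation group, which includes
# pacemaker-status in 16.2 ...
openstack tripleo validator run --group post-deployment

# ... or run just the pacemaker-status validation referenced above.
openstack tripleo validator run --validation pacemaker-status

# Manual spot check from any controller node; <controller> is a placeholder.
# Look for Stopped or FAILED resources in the output.
ssh heat-admin@<controller> 'sudo pcs status --full'
```

Running the validation (rather than trusting the per-role update exit status) is what catches the failed pacemaker resources this bug describes, since by that point all roles have finished and the cluster is expected to be healthy.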