Description of problem:
Customer's OVN cluster was affected by a bug and engineering shared a hotfix. We used the standard approach described in [1] to update the ovn-northd image. As a result, the container image was successfully updated and preparations were executed on the controller nodes, BUT ovn-dbs-bundle wasn't restarted and the old container images were still in use after a successful deployment. ovn-dbs-bundle switched to the new images only after it was manually restarted with the "pcs resource restart ovn-dbs-bundle" command.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/director_installation_and_usage/index#proc_installing-additional-rpm-files-to-container-images_preparing-for-director-installation

  - push_destination: true
    includes:
    - ovn-northd
    modify_role: tripleo-modify-image
    modify_append_tag: "-hotfix_2211240"
    modify_vars:
      tasks_from: rpm_install.yml
      rpms_path: /home/stack/ovn-2021-21.12.0-134/ovn-al

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.2.4 (Train)

How reproducible:
- Tune ContainerImagePrepare to update RPMs inside the ovn-northd image
- Run the overcloud deployment command
- Check whether the OVN DB containers are using the hotfixed images

Actual results:
New container images are not used until ovn-dbs-bundle is restarted manually.

Expected results:
New container images are used without extra steps.

Additional info:
It looks like the tripleo-ansible task named "Run pacemaker restart if the config file for the service changed" doesn't restart ovn-dbs-bundle if the configuration wasn't changed. This may also affect other pacemaker resources.
  - name: Run pacemaker restart if the config file for the service changed
    tripleo_diff_exec:
      command: >-
        {{ tripleo_ha_wrapper_pcmk_restart_script }}
        {{ tripleo_ha_wrapper_service_name }}
        {{ tripleo_ha_wrapper_resource_name }}
        {{ tripleo_ha_wrapper_bundle_name }}
        {{ tripleo_ha_wrapper_resource_state }}
      state_file: "{{ tripleo_ha_wrapper_config_basedir }}/{{ tripleo_ha_wrapper_puppet_config_volume }}.md5sum"
      state_file_suffix: "{{ tripleo_ha_wrapper_config_suffix }}"
      environment:
        TRIPLEO_MINOR_UPDATE: "{{ tripleo_ha_wrapper_minor_update|default('') | string }}"
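To illustrate why an image-only change would not trigger the restart, here is a minimal Python sketch of a "run only if the state file changed" guard. It is a simplified model, not the tripleo_diff_exec implementation: it assumes the guard compares the state file with a previously saved copy and also runs on the first deployment, and the names run_if_changed, simulate_deployments, the ".previous" suffix and the checksum/image strings are all hypothetical.

```python
import os
import shutil
import tempfile


def _read(path):
    """Return the raw bytes of a file."""
    with open(path, "rb") as handle:
        return handle.read()


def run_if_changed(state_file, suffix, action):
    """Run action() only when state_file differs from its saved copy.

    Assumption: a missing saved copy counts as "changed".
    """
    previous = state_file + suffix
    changed = (not os.path.exists(previous)) or _read(state_file) != _read(previous)
    if changed:
        action()
        # Remember the state that was just acted upon.
        shutil.copyfile(state_file, previous)
    return changed


def simulate_deployments():
    """Simulate three deployments and record which ones restart the resource."""
    restarts = []
    outcomes = []
    # (config checksum, container image) per deployment; values are made up.
    deployments = [
        ("checksum-a", "image-original"),
        ("checksum-a", "image-hotfix"),   # image changed, config did not
        ("checksum-b", "image-hotfix"),   # config changed
    ]
    with tempfile.TemporaryDirectory() as tmp:
        state_file = os.path.join(tmp, "ovn_dbs.md5sum")
        for checksum, image in deployments:
            with open(state_file, "w") as handle:
                handle.write(checksum)
            outcomes.append(
                run_if_changed(state_file, ".previous", lambda: restarts.append(image))
            )
    return outcomes, restarts


if __name__ == "__main__":
    print(simulate_deployments())
```

In this model the second deployment swaps the image but leaves the checksum untouched, so the guard skips the restart, which matches the behaviour reported above.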
Thanks Alex for raising this.

Unfortunately we can't fix this properly during a regular deployment, since we rely on the update and upgrade tasks to perform rolling container updates and global restarts of clustered resources. The reasoning behind this is to avoid service disruptions during a regular deployment, so we delegate restarts to special tasks that run at the very end of the update/upgrade (see https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ovn/ovn-dbs-pacemaker-puppet.yaml#L559 for example).

If the customer's overcloud is in such a bad state that they do not care about the global OVN restart, I guess the best option would be to perform a manual restart, as you already figured out.

Please reopen if you feel that we should look into this further.

Thanks
Luca
I think that the current message in our documentation [1] is ambiguous: we claim that it is possible to use the deployment command for yum_update.yml, but that is not the case for containers managed by pacemaker. While I understand the motivation behind making an exception, I think that this bug should be converted into a documentation bug. Please let me know what you think.

Kind Regards, Alex.

[1]
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/director_installation_and_usage/index#proc_updating-existing-packages-on-container-images_preparing-for-director-installation
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/transitioning_to_containerized_services/index#proc_installing-additional-rpm-files-to-container-images_obtaining-and-modifying-container-images
(In reply to Alex Stupnikov from comment #7)
> I think that current message in our documentation [1] is ambiguous: we claim
> that it is possible to use deployment command for yum_update.yml, but it is
> not the case for containers managed by pacemaker. While I understand
> motivation behind making an exception, I think that this bug should be
> translated to documentation bug. Please let me know what do you think.
>
> Kind Regards, Alex.
>
> [1]
> https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/director_installation_and_usage/index#proc_updating-existing-packages-on-container-images_preparing-for-director-installation
> https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/transitioning_to_containerized_services/index#proc_installing-additional-rpm-files-to-container-images_obtaining-and-modifying-container-images

That example is part of the "preparing for director installation" section, so I would read it as applying to tasks that must be performed *before* the actual overcloud is deployed. I can't see anything suggesting that it could be applied to any subsequent overcloud deploy with the expectation of rolling out updates in a rolling fashion, but I agree that having a warning note would not hurt :)

cheers
Luca
Thanks Luca. From my perspective, the message in our official documentation about the ContainerImagePrepare parameter is that it is enforced during each deployment whenever it is defined. Basically, we have the same text in different documents explaining how to tune container images, and this document isn't necessarily related to director in the first place. I may be wrong here, but it would be great if the DF Doc team would discuss this and provide their feedback.
DF does not own how individual services are deployed. I personally can understand the concern, but as DF we don't have much to add to what Luca mentioned earlier. Probably we can create a KCS, and that can work as a solution for now. (This would also affect a user who simply switches to a different tag without making any container image modifications, so adding a note to the modification section alone is not sufficient.)
Hi @tkajinam. What I was looking for from DFG:DF is information about THT philosophy and idempotence. I am wondering whether DF expects container images to be set according to the ContainerImagePrepare definitions after the deployment command is executed for a RHOSP deployment, or whether it is up to the services to decide. IMO in both cases we need to tune the documentation to send a clear message to users. Please let me know what you think.
In general we expect that the deployment should honor what is given by users without manual operations, which means container images should be updated according to the latest templates. This would be the "philosophy" you are asking about, though it is not documented (AFAIK).

However, if there are any requirements specific to services, then those should supersede the philosophy. DF does not maintain how individual services are managed, and defers to the individual teams on how each component should be managed. In this specific case, resources managed by pacemaker are maintained by PIDONE, and from the component's point of view there is a specific reason not to restart services automatically (which Luca explained earlier), so the exception applies.

I can understand that adding this to the documentation would be ideal, but I'm not quite clear about the best way to do it, mainly because we don't have a specific section explaining the scenario of updating container images in an existing deployment with the deploy command. Because the behavior only appears in a specific use case, I still believe having a KCS would work.
(In reply to Takashi Kajinami from comment #12)
> I can understand adding this to a documentation would be ideal, but I'm not quite clear about
> the best way to add this to our documentation mainly because we don't have a specific section
> explaining the scenario to update container images in a existing deployment by the deploy
> command. Because the behavior only appears in a specific use case, I still believe having a KCS
> would work.

Thank you for the clarifications, it makes sense. IMO part of the problem here is the lack of a proper message in the documentation about ContainerImagePrepare and its use cases. I don't see a problem with creating a KCS, but that will make this niche knowledge, while IMO it should be clearly described in the documentation and available to anyone willing to use ContainerImagePrepare.
The general expectation for deployment tooling is that it should apply the given parameters to the deployment. This is not specific to CIP but applies to all tripleo parameters, and I'm not sure why describing this expectation specifically for the CIP parameter is required here.

Also, the procedures documented in 3.13-15 are not specific to hotfixes. They are meant to describe the generic ways to modify container images, and they merely use a hotfix package as the example. So adding a note about a hotfix-specific problem to that section does not look best to me.

If we still believe adding something to our documentation is required, then probably we can add a note to 3.12? (Note that this is not complete, because it does not cover the case where a user reads only 3.7 and makes a modification in CIP to use a different tag, for example.)

If we add a note, we probably also need some details describing the steps to check the container images used by pacemaker resources and the steps to restart the pacemaker resources, but having these steps documented in 3.12 would make the document a bit scattered (because the aim of these sections is to describe CIP usage, and the problem is really specific to some use cases). I think it's better to create a separate doc (KCS) to describe the details and put a short note with a link to the KCS for further reference.

Again, DF does not own the deployment of individual services. How updating images works is ultimately determined by each DFG owning that service component. Please keep in mind that DF provides only the general guidance and does not maintain how it is actually implemented in individual services. That's another reason I hesitate to add service-specific limitations to the documentation about the generic deployment framework.
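As a rough idea of what the "check the container images used by pacemaker resources" step could boil down to, here is a small Python sketch. It assumes the operator has already collected a mapping of running container names to image IDs and knows the image ID of the hotfixed image; how those values are gathered is left out. The function name, the "ovn-dbs-bundle-" prefix and the sample IDs are hypothetical and used only for illustration.

```python
def find_stale_containers(running, name_prefix, expected_image_id):
    """Return sorted names of matching containers not running the expected image.

    running: dict mapping container name -> image ID currently in use.
    name_prefix: only containers whose name starts with this are checked.
    expected_image_id: image ID of the freshly prepared (hotfixed) image.
    """
    return sorted(
        name
        for name, image_id in running.items()
        if name.startswith(name_prefix) and image_id != expected_image_id
    )


if __name__ == "__main__":
    # Made-up sample data: one bundle replica still runs the old image.
    sample = {
        "ovn-dbs-bundle-0": "old111",
        "ovn-dbs-bundle-1": "new222",
        "unrelated-container": "zzz999",
    }
    print(find_stale_containers(sample, "ovn-dbs-bundle-", "new222"))
```

A non-empty result would indicate that the bundle still needs the manual "pcs resource restart ovn-dbs-bundle" mentioned in the description.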