Bug 2215283 - Hotfix is not applied properly for ovn-northd because ovn-dbs-bundle is not restarted if image is changed
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 16.2 (Train)
Hardware: All
OS: All
Priority: low
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: OSP Team
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-06-15 10:19 UTC by Alex Stupnikov
Modified: 2023-06-28 08:38 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-16 07:33:27 UTC
Target Upstream Version:
Embargoed:


Links
System                             ID         Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker              OSP-25807  0        None      None    None     2023-06-15 10:21:01 UTC
Red Hat Knowledge Base (Solution)  7019224    0        None      None    None     2023-06-15 10:29:09 UTC

Description Alex Stupnikov 2023-06-15 10:19:40 UTC
Description of problem:
The customer's OVN cluster was affected by a bug, and engineering shared a hotfix. We used the standard approach described in [1] to update the ovn-northd image. As a result, the container image was successfully updated and the preparation steps were executed on the controller nodes, but ovn-dbs-bundle was not restarted and the old container images were still in use after the successful deployment.

ovn-dbs-bundle switched to the new images only after it was manually restarted with the "pcs resource restart ovn-dbs-bundle" command.

[1]
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/director_installation_and_usage/index#proc_installing-additional-rpm-files-to-container-images_preparing-for-director-installation

  - push_destination: true
    includes:
    - ovn-northd
    modify_role: tripleo-modify-image
    modify_append_tag: "-hotfix_2211240"
    modify_vars:
      tasks_from: rpm_install.yml
      rpms_path: /home/stack/ovn-2021-21.12.0-134/ovn-al


Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.2.4 (Train)


How reproducible:
- Tune ContainerImagePrepare to update RPMs inside ovn-northd image
- Run overcloud deployment command
- Check if OVN DB containers are using hotfixed images
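
The last step can be checked by hand or scripted. As a minimal sketch, assume a listing of "container name, image" pairs for the OVN DB containers has already been collected on a controller. The sample text, the registry name, the container names, and the helper `containers_missing_tag` are hypothetical; only the tag suffix comes from the ContainerImagePrepare entry in this report. The following Python flags containers whose image reference lacks the hotfix suffix:

```python
HOTFIX_SUFFIX = "-hotfix_2211240"  # modify_append_tag from this report


def containers_missing_tag(listing, suffix=HOTFIX_SUFFIX):
    """Return names of containers whose image reference does not end with suffix.

    `listing` is text with one "<container name> <image>" pair per line.
    """
    missing = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, image = parts
        if not image.endswith(suffix):
            missing.append(name)
    return missing


# Hypothetical sample: the first container still runs the old image.
sample = """\
ovn-dbs-bundle-podman-0 registry.example/ovn-northd:16.2
ovn-dbs-bundle-podman-1 registry.example/ovn-northd:16.2-hotfix_2211240
"""

print(containers_missing_tag(sample))
```

An empty result would mean every listed container already uses a hotfixed image; any name returned points at a container that still needs the bundle restart.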


Actual results:
New container images are not used until ovn-dbs-bundle is restarted manually

Expected results:
New container images are used without extra steps


Additional info:
It looks like the tripleo-ansible task named "Run pacemaker restart if the config file for the service changed" does not restart ovn-dbs-bundle when the configuration has not changed, even if the container image has. This may also affect other pacemaker resources.

- name: Run pacemaker restart if the config file for the service changed
  tripleo_diff_exec:
    command: >-
      {{ tripleo_ha_wrapper_pcmk_restart_script }} {{ tripleo_ha_wrapper_service_name }}
      {{ tripleo_ha_wrapper_resource_name }} {{ tripleo_ha_wrapper_bundle_name }}
      {{ tripleo_ha_wrapper_resource_state }}
    state_file: "{{ tripleo_ha_wrapper_config_basedir }}/{{ tripleo_ha_wrapper_puppet_config_volume }}.md5sum"
    state_file_suffix: "{{ tripleo_ha_wrapper_config_suffix }}"
    environment:
      TRIPLEO_MINOR_UPDATE: "{{ tripleo_ha_wrapper_minor_update|default('') | string }}"

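
The effect of gating the restart on a configuration checksum can be illustrated with a minimal Python sketch. It is only a model of the suspected behaviour, not the actual tripleo_diff_exec implementation: the helpers `config_digest` and `restart_needed`, the sample configuration, and the image names are hypothetical, and SHA-256 stands in for whatever digest the real state file holds. It shows that when the decision depends solely on the configuration digest, an image change alone never triggers a restart.

```python
import hashlib


def config_digest(config_files):
    """Return a hex digest over a mapping of file name -> file content."""
    digest = hashlib.sha256()
    for name in sorted(config_files):
        digest.update(name.encode())
        digest.update(config_files[name].encode())
    return digest.hexdigest()


def restart_needed(saved_digest, config_files):
    """Model of a checksum-gated restart: only the config digest is compared."""
    return saved_digest != config_digest(config_files)


# Hypothetical state recorded by the previous deployment.
config = {"ovn-dbs.conf": "listen=0.0.0.0\n"}
saved = config_digest(config)

# Only the image tag changes; the configuration is untouched.
old_image = "ovn-northd:16.2"
new_image = old_image + "-hotfix_2211240"

image_changed = old_image != new_image
config_changed = restart_needed(saved, config)
print(image_changed, config_changed)
```

In this model the image comparison and the restart decision are independent, which matches the symptom reported here: the deployment succeeds, yet the bundle keeps running the old image.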
Comment 6 Luca Miccini 2023-06-16 07:37:48 UTC
Thanks Alex for raising this.
Unfortunately, we can't fix this properly during a regular deployment, since we rely on the update and upgrade tasks to perform rolling container updates and global restarts of clustered resources.
The reasoning behind this is to avoid service disruptions during a regular deployment, so we delegate restarts to special tasks that run at the very end of the update/upgrade (see https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ovn/ovn-dbs-pacemaker-puppet.yaml#L559 for an example).
If the customer's overcloud is in such a bad state that they do not care about the global OVN restart, I guess the best option is to perform a manual restart, as you already figured out.

Please reopen if you feel that we should look into this further.

Thanks
Luca

Comment 7 Alex Stupnikov 2023-06-16 08:20:52 UTC
I think that the current message in our documentation [1] is ambiguous: we claim that it is possible to use the deployment command for yum_update.yml, but that is not the case for containers managed by pacemaker. While I understand the motivation behind making an exception, I think that this bug should be converted into a documentation bug. Please let me know what you think.

Kind Regards, Alex.

[1]
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/director_installation_and_usage/index#proc_updating-existing-packages-on-container-images_preparing-for-director-installation
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/transitioning_to_containerized_services/index#proc_installing-additional-rpm-files-to-container-images_obtaining-and-modifying-container-images

Comment 8 Luca Miccini 2023-06-16 09:55:56 UTC
(In reply to Alex Stupnikov from comment #7)
> I think that the current message in our documentation [1] is ambiguous: we
> claim that it is possible to use the deployment command for yum_update.yml,
> but that is not the case for containers managed by pacemaker. While I
> understand the motivation behind making an exception, I think that this bug
> should be converted into a documentation bug. Please let me know what you think.
> 
> Kind Regards, Alex.
> 
> [1]
> https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/director_installation_and_usage/index#proc_updating-existing-packages-on-container-images_preparing-for-director-installation
> https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/transitioning_to_containerized_services/index#proc_installing-additional-rpm-files-to-container-images_obtaining-and-modifying-container-images

That example is part of the "preparing for director installation" section, so I would read it as applying to tasks that must be performed *before* the actual overcloud is deployed.
I can't see anything suggesting that it could be applied to any subsequent overcloud deploy with the expectation of rolling out updates in a rolling fashion, but I agree that a warning note would not hurt :)

cheers
Luca

Comment 9 Alex Stupnikov 2023-06-16 11:02:42 UTC
Thanks Luca. From my perspective, the message in our official documentation about the ContainerImagePrepare parameter is that it is enforced during each deployment in which it is defined. Basically, we have the same text in different documents explaining how to tune container images, and this document is not necessarily related to director in the first place. I may be wrong here, but it would be great if the DF Doc team would discuss this and provide their feedback.

Comment 10 Takashi Kajinami 2023-06-19 04:26:20 UTC
DF does not own how individual services are deployed. I personally can understand the concern shared, but
as DF we don't have much to add to what Luca mentioned earlier.

Probably we can create a KCS, and that can work as a solution for now. Adding a note to the modification
section is not sufficient, because this would also affect a user who only switches to a different tag
without making any container image modifications.

Comment 11 Alex Stupnikov 2023-06-19 08:26:25 UTC
Hi @tkajinam. What I was looking for from DFG:DF is information about the THT philosophy and idempotence. I am wondering whether DF expects container images to be set according to the ContainerImagePrepare definitions after the deployment command is executed for a RHOSP deployment, or whether it is up to the services to decide. IMO in both cases we need to tune the documentation to send a clear message to users. Please let me know what you think.

Comment 12 Takashi Kajinami 2023-06-19 09:31:05 UTC
In general, we expect that the deployment should honor what is given by users without manual operations,
which means container images should be updated according to the latest templates. This would be
the "philosophy" you are expecting, though this is not documented (AFAIK).

However, if there are any requirements specific to services, then those should supersede the philosophy.
DF does not maintain how individual services should be managed, and defers to individual teams on
how each component should be managed. In this specific case, resources managed by pacemaker
are maintained by PIDONE, and there is a specific reason not to restart services automatically from
the component's point of view (which Luca explained earlier), so the exception applies.

I can understand that adding this to the documentation would be ideal, but I'm not quite clear about
the best way to add it, mainly because we don't have a specific section
explaining the scenario of updating container images in an existing deployment with the deploy
command. Because the behavior only appears in a specific use case, I still believe having a KCS
would work.

Comment 13 Alex Stupnikov 2023-06-19 11:49:52 UTC
(In reply to Takashi Kajinami from comment #12)
> I can understand that adding this to the documentation would be ideal, but I'm not quite clear about
> the best way to add it, mainly because we don't have a specific section
> explaining the scenario of updating container images in an existing deployment with the deploy
> command. Because the behavior only appears in a specific use case, I still believe having a KCS
> would work.

Thank you for the clarifications, it makes sense. IMO part of the problem here is the lack of a proper message in the documentation about ContainerImagePrepare and its use cases. I don't see a problem with creating a KCS, but this will make it niche knowledge, while IMO it should be clearly described in the documentation and available to anyone willing to use ContainerImagePrepare.

Comment 14 Takashi Kajinami 2023-06-19 12:22:12 UTC
The general expectation for deployment tooling is that it should apply the given parameters to the deployment.
This is not specific to CIP but applies to all tripleo parameters, and I'm not sure why describing this expectation
specifically for the CIP parameter is so required here.

Also, the procedures documented in 3.13-15 are not very specific to hotfixes. They are meant to describe
generic ways to modify container images, and they only use a hotfix package as the example. So adding
a note to that section which describes a problem with hotfixing does not look best to me.

If we still believe adding something to our documentation is required, then probably we can add a note to 3.12?
(Note that this is not complete, because it does not cover the case where a user reads only 3.7 and makes a modification
in CIP to use a different tag, for example.)

In case we add a note, we probably also need some details describing the steps to check the container images
used by pacemaker resources and the steps to restart the pacemaker resources, but having these steps documented
in 3.12 would make the document a bit scattered (because the aim of these sections is to describe CIP usage, and
the problem is really specific to some use cases). I think it's better to create a separate doc (KCS) to describe
the details and to put a short note and a link to the KCS for further reference.

Again, DF does not own the deployment of individual services. How updating images works is ultimately determined
by each DFG owning that service component. Please keep in mind that DF provides only general guidance
and does not maintain how these are really implemented in individual services. That's another reason I hesitate
to add limitations specific to services to the documentation about the generic deployment framework.