Bug 2250940 - [FFU][DCN] OSP upgrade of a DCN stack with etcd (Cinder A/A) fail on inability to fetch etcd image
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 17.1 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z3
Target Release: 17.1
Assignee: Lukas Bezdicka
QA Contact: Marian Krcmarik
URL:
Whiteboard:
Depends On:
Blocks: 1997638
Reported: 2023-11-21 21:31 UTC by Marian Krcmarik
Modified: 2024-05-22 20:42 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-14.3.1-17.1.20231103010828.el9ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-05-22 20:42:25 UTC
Target Upstream Version:
Embargoed:




Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker   OSP-30521       0        None      None    None     2023-11-21 21:33:51 UTC
Red Hat Product Errata  RHSA-2024:2736  0        None      None    None     2024-05-22 20:42:31 UTC

Description Marian Krcmarik 2023-11-21 21:31:46 UTC
Description of problem:
The OSP FFU upgrade (16.2 -> 17.1) of a DCN stack (multistack; the central site with controller nodes was upgraded successfully) with etcd deployed (to manage the Cinder A/A service on the DCN site) fails with the following error:
FATAL | Pre-fetch all the containers | dcn1-computehci1-1 | item=site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp17-openstack-etcd:17.1_20231116.2 | error={"ansible_loop_var": "prefetch_image", "attempts": 5, "changed": false, "msg": "Failed to pull image site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp17-openstack-etcd:17.1_20231116.2", "prefetch_image": "site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp17-openstack-etcd:17.1_20231116.2" 

The deployment uses the registry on the undercloud, and the 17.1 etcd image is not available there (only the 16.2 etcd image, which was used for the initial 16.2 deployment, is present).
Note: The central site does not require etcd and is upgraded successfully.
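
One way to confirm which etcd tags the undercloud registry holds is to query its tag list. The sketch below assumes the registry answers the standard Docker Registry HTTP API v2 "tags/list" endpoint over plain HTTP on the port shown in the error message; both function names are hypothetical, and the match is a crude quoted-string check on the JSON body, not a real JSON parse.

```shell
# tag_listed TAG
# Reads a /v2/<name>/tags/list JSON body on stdin and succeeds if TAG
# appears in it as a quoted string (crude match, no JSON parsing).
tag_listed() {
    grep -Fq "\"$1\""
}

# registry_has_tag HOST:PORT REPOSITORY TAG
# Assumption: the registry serves the v2 API over plain HTTP.
registry_has_tag() {
    curl -fsS "http://$1/v2/$2/tags/list" | tag_listed "$3"
}
```

With the names from this report, the check would be called as: registry_has_tag site-undercloud-0.ctlplane.redhat.local:8787 rh-osbs/rhosp17-openstack-etcd 17.1_20231116.2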

The undercloud config has the following parameters set:
container_images_file = /home/stack/containers-prepare-parameter.yaml
container_insecure_registries= brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888,rhos-qe-mirror.lab.eng.tlv2.redhat.com:5002,registry-proxy.engineering.redhat.com

The containers-prepare-parameter.yaml file looks like:
parameter_defaults:
  ContainerImagePrepare:
  - tag_from_label: '{version}-{release}'
    set:
      namespace: rhos-qe-mirror.lab.eng.tlv2.redhat.com:5002/rh-osbs
      name_prefix: rhosp17-openstack-
      name_suffix: ''
      tag: 17.1_20231116.2
      rhel_containers: false
      neutron_driver: ovn
      ceph_namespace: registry-proxy.engineering.redhat.com/rh-osbs
      ceph_image: rhceph
      ceph_tag: '5'
      ceph_prometheus_namespace: rhos-qe-mirror.lab.eng.tlv2.redhat.com:5002/rh-osbs
      ceph_prometheus_image: openshift-ose-prometheus
      ceph_prometheus_tag: v4.10
      ceph_alertmanager_namespace: rhos-qe-mirror.lab.eng.tlv2.redhat.com:5002/rh-osbs
      ceph_alertmanager_image: openshift-ose-prometheus-alertmanager
      ceph_alertmanager_tag: v4.10
      ceph_node_exporter_namespace: rhos-qe-mirror.lab.eng.tlv2.redhat.com:5002/rh-osbs
      ceph_node_exporter_image: openshift-ose-prometheus-node-exporter
      ceph_node_exporter_tag: v4.10
      ceph_grafana_namespace: rhos-qe-mirror.lab.eng.tlv2.redhat.com:5002/rh-osbs
      ceph_grafana_image: grafana
      ceph_grafana_tag: '5'
    push_destination: true
  MultiRhelRoleContainerImagePrepare: &id001
  - tag_from_label: '{version}-{release}'
    set:
      namespace: rhos-qe-mirror.lab.eng.tlv2.redhat.com:5002/rh-osbs
      name_prefix: rhosp17-openstack-
      name_suffix: ''
      tag: 17.1_20231116.2
      rhel_containers: false
      neutron_driver: ovn
      ceph_namespace: registry-proxy.engineering.redhat.com/rh-osbs
      ceph_image: rhceph
      ceph_tag: '5'
      ceph_prometheus_namespace: rhos-qe-mirror.lab.eng.tlv2.redhat.com:5002/rh-osbs
      ceph_prometheus_image: openshift-ose-prometheus
      ceph_prometheus_tag: v4.10
      ceph_alertmanager_namespace: rhos-qe-mirror.lab.eng.tlv2.redhat.com:5002/rh-osbs
      ceph_alertmanager_image: openshift-ose-prometheus-alertmanager
      ceph_alertmanager_tag: v4.10
      ceph_node_exporter_namespace: rhos-qe-mirror.lab.eng.tlv2.redhat.com:5002/rh-osbs
      ceph_node_exporter_image: openshift-ose-prometheus-node-exporter
      ceph_node_exporter_tag: v4.10
      ceph_grafana_namespace: rhos-qe-mirror.lab.eng.tlv2.redhat.com:5002/rh-osbs
      ceph_grafana_image: grafana
      ceph_grafana_tag: '5'
    push_destination: true
    excludes:
    - collectd
    - nova-libvirt
  - tag_from_label: '{version}-{release}'
    set:
      namespace: rhos-qe-mirror.lab.eng.tlv2.redhat.com:5002/rh-osbs
      name_prefix: rhosp17-openstack-
      name_suffix: ''
      tag: 17.1_20231116.1
      rhel_containers: false
      neutron_driver: ovn
      ceph_namespace: registry-proxy.engineering.redhat.com/rh-osbs
      ceph_image: rhceph
      ceph_tag: '5'
      ceph_prometheus_namespace: rhos-qe-mirror.lab.eng.tlv2.redhat.com:5002/rh-osbs
      ceph_prometheus_image: openshift-ose-prometheus
      ceph_prometheus_tag: v4.10
      ceph_alertmanager_namespace: rhos-qe-mirror.lab.eng.tlv2.redhat.com:5002/rh-osbs
      ceph_alertmanager_image: openshift-ose-prometheus-alertmanager
      ceph_alertmanager_tag: v4.10
      ceph_node_exporter_namespace: rhos-qe-mirror.lab.eng.tlv2.redhat.com:5002/rh-osbs
      ceph_node_exporter_image: openshift-ose-prometheus-node-exporter
      ceph_node_exporter_tag: v4.10
      ceph_grafana_namespace: rhos-qe-mirror.lab.eng.tlv2.redhat.com:5002/rh-osbs
      ceph_grafana_image: grafana
      ceph_grafana_tag: '5'
    push_destination: true
    includes:
    - collectd
    - nova-libvirt
  ComputeHCI1ContainerImagePrepare: *id001
  ComputeHCIScaleOut1ContainerImagePrepare: *id001

The executed procedure runs the "openstack overcloud upgrade prepare" command for each stack, followed by "openstack overcloud external-upgrade run ${EXTERNAL_ANSWER} --stack ${STACK} --tags container_image_prepare".

However, this step does not appear to do anything for the DCN stack, whereas it does prepare and upload images for the central stack.
The service responsible for etcd is OS::TripleO::Services::Etcd, and it is included in the roles file passed on the upgrade prepare command line.
The generated playbook external_deploy_steps_tasks_step1.yaml (in overcloud-deploy/dcn1/config-download/dcn1/) does not contain the task "Run tripleo-container-image-prepare role", while the same playbook for the central stack does. This means that no container image prepare is executed for dcn1 when overcloud prepare is executed, whereas it is executed for the central stack.

(undercloud) [stack@site-undercloud-0 overcloud-deploy]$ fgrep "Run tripleo-container-image-prepare" central/config-download/central/external_deploy_steps_tasks_step1.yaml 
  name: Run tripleo-container-image-prepare role
(undercloud) [stack@site-undercloud-0 overcloud-deploy]$ fgrep "Run tripleo-container-image-prepare" dcn1/config-download/dcn1/external_deploy_steps_tasks_step1.yaml 
(undercloud) [stack@site-undercloud-0 overcloud-deploy]$
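
The same check can be repeated across stacks with a small helper. This is a minimal sketch that assumes the directory layout shown in the two fgrep commands above; the function names are hypothetical.

```shell
# has_image_prepare_task BASE_DIR STACK
# Succeeds if the stack's generated step-1 playbook contains the
# container image prepare task. A missing file counts as "not found".
has_image_prepare_task() {
    grep -Fqs "Run tripleo-container-image-prepare" \
        "$1/$2/config-download/$2/external_deploy_steps_tasks_step1.yaml"
}

# check_stacks BASE_DIR STACK...
# Prints one line per stack saying whether the task is present.
check_stacks() {
    base_dir="$1"
    shift
    for stack in "$@"; do
        if has_image_prepare_task "$base_dir" "$stack"; then
            echo "$stack: present"
        else
            echo "$stack: MISSING"
        fi
    done
}
```

For this environment it would be called as: check_stacks /home/stack/overcloud-deploy central dcn1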

Once I upload the image into the registry manually, the FFU works as expected. In my opinion, the image should be uploaded into the undercloud registry during the "openstack overcloud external-upgrade run ${EXTERNAL_ANSWER} --stack ${STACK} --tags container_image_prepare" step.

Version-Release number of selected component (if applicable):
rpm -qa | grep tripleo
ansible-role-tripleo-modify-image-1.5.1-17.1.20230622042720.b6eedb6.el8ost.noarch
openstack-tripleo-common-containers-15.4.1-17.1.20230927003755.el8ost.noarch
openstack-tripleo-heat-templates-14.3.1-17.1.20231103003744.el8ost.noarch
openstack-tripleo-validations-14.3.2-17.1.20231026023743.2b526f8.el8ost.noarch
openstack-tripleo-common-15.4.1-17.1.20230927003755.el8ost.noarch
ansible-tripleo-ipsec-11.0.1-17.1.20230621182214.b5559c8.el8ost.noarch
python3-tripleo-common-15.4.1-17.1.20230927003755.el8ost.noarch
python3-tripleoclient-16.5.1-17.1.20230927003754.f3599d0.el8ost.noarch
openstack-tripleo-puppet-elements-14.1.3-17.1.20230811123850.b4e0cbd.el8ost.noarch
python3-tripleoclient-heat-installer-12.6.1-2.20220725105244.8cc1d6d.el8ost.noarch
openstack-tripleo-image-elements-13.1.3-17.1.20230622064519.a641940.el8ost.noarch
ansible-tripleo-ipa-0.3.1-17.1.20230627183823.8d29d9e.el8ost.noarch
puppet-tripleo-14.2.3-17.1.20231102193745.40278e1.el8ost.noarch
tripleo-ansible-3.3.1-17.1.20231101233745.4d015bf.el8ost.noarch

The overcloud upgrade prepare command line looks like:
openstack overcloud upgrade prepare ${PREPARE_ANSWER} \
    --stack dcn1 \
    --templates /usr/share/openstack-tripleo-heat-templates \
    -n /home/stack/dcn1/network/network_data.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovs.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/barbican-backend-simple-crypto.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/services/barbican-edge.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/dcn-hci.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/net-multiple-nics.yaml \
    -e /home/stack/dcn1/internal.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovs.yaml \
    -e /home/stack/dcn1/network/network-environment.yaml \
    -e /home/stack/dcn1/inject-trust-anchor.yaml \
    -e /home/stack/dcn1/hostnames.yml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
    -e /home/stack/dcn1/glance.yaml \
    -e /home/stack/dcn1/nodes_data.yaml \
    -e /home/stack/dcn1/debug.yaml \
    -e /home/stack/dcn1/use-dns-for-vips.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-dashboard.yaml \
    -e /home/stack/central_ceph_external.yaml \
    -e /home/stack/central-export.yaml \
    -e /home/stack/dcn1/config_heat.yaml \
    -e /home/stack/dcn1/firstboot.yaml \
    -e ~/containers-prepare-parameter.yaml \
    -e /home/stack/dcn1/barbican.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-everywhere-endpoints-dns.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/enable-internal-tls.yaml \
    -e /home/stack/dcn1/cloud-names.yaml \
    -e /home/stack/dcn1/ipaservices-baremetal-ansible.yaml \
    -e /home/stack/cli_opts_params.yaml \
    -e /home/stack/overcloud-params.yaml \
    -e /home/stack/overcloud-deploy/dcn1/dcn1-network-environment.yaml \
    -e /home/stack/tmp/dcn1-baremetal_deployment.yaml \
    -e /home/stack/tmp/central-generated-networks-deployed.yaml \
    -e /home/stack/tmp/central-generated-vip-deployed.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm-rbd-only.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/ceph-dashboard.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/nova-hw-machine-type-upgrade.yaml \
    -e /home/stack/containers-prepare-parameter.yaml \
    --roles-file /home/stack/dcn1/roles/roles_data.yaml 2>&1

How reproducible:
Always

Comment 3 James Slagle 2024-01-11 13:28:25 UTC
I'd need to see the upgrade prepare and upgrade run (which you've now provided), and all the custom templates and files passed to those commands.

Comment 4 James Slagle 2024-01-11 13:30:22 UTC
I'm thinking the issue is due to the OS::TripleO::Services::ContainerImagePrepare service being only on the Controller role, so when the commands are run for the dcn stacks, it does not re-run ContainerImagePrepare. We may need to add that service to dcn roles, or document which container image prepare command to run manually.
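
To see which roles in a roles file lack the service, something like the following could be used. It is a rough sketch that assumes the usual roles_data.yaml layout (each role starts with "- name: <Role>" at column 0 and lists its services as "- OS::TripleO::Services::..." items); the function name is hypothetical and no real YAML parsing is done.

```shell
# roles_missing_image_prepare ROLES_FILE
# Prints the name of every role that does not list
# OS::TripleO::Services::ContainerImagePrepare among its services.
roles_missing_image_prepare() {
    awk '
        /^- name:/ {
            # A new role starts: report the previous one if it lacked the service.
            if (role != "" && !found) print role
            role = $3
            found = 0
        }
        # Only uncommented list items count as the service being enabled.
        /^[ \t]*-[ \t]+OS::TripleO::Services::ContainerImagePrepare[ \t]*$/ { found = 1 }
        END { if (role != "" && !found) print role }
    ' "$1"
}
```

For the DCN stack in this report it would be run as: roles_missing_image_prepare /home/stack/dcn1/roles/roles_data.yaml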

Comment 5 Marian Krcmarik 2024-01-11 13:40:13 UTC
(In reply to James Slagle from comment #4)
> I'm thinking the issue is due to the
> OS::TripleO::Services::ContainerImagePrepare service being only on the
> Controller role, so when the commands are run for the dcn stacks, it does
> not re-run ContainerImagePrepare. We may need to add that service to dcn
> roles, or document which container image prepare command to run manually.

I can test that by manually adding the service to my DCN roles and rerunning the job, if that's helpful.

Comment 6 James Slagle 2024-01-11 15:49:40 UTC
(In reply to Marian Krcmarik from comment #5)
> (In reply to James Slagle from comment #4)
> > I'm thinking the issue is due to the
> > OS::TripleO::Services::ContainerImagePrepare service being only on the
> > Controller role, so when the commands are run for the dcn stacks, it does
> > not re-run ContainerImagePrepare. We may need to add that service to dcn
> I can test that - manually adding the service to my dcn roles and rerun the
> job If That's helpful?
> > roles, or document which container image prepare command to run manually.

Yes, that's worth trying. I'm not sure if it will remove unmanaged images though. Could you check whether the other service images (such as nova-api) are still in the undercloud image-serve? That would be helpful.

Comment 7 Marian Krcmarik 2024-01-12 10:51:24 UTC
(In reply to James Slagle from comment #6)
 
> Yes, that's worth trying. I'm not sure if it will remove unmanaged images
> though. Can you check if the other service images (such as nova-api) are
> still in the undercloud image-serve, that would be helpful.

Adding the OS::TripleO::Services::ContainerImagePrepare service manually to the roles solved the problem: the etcd image was fetched successfully during the upgrade. Other images that are not needed by any service on the DCN site still appear to be present in the undercloud image server; for example, I can see the openstack-nova-api image there and I can pull it.

So I'll let you decide which way we want to fix this.

Comment 29 errata-xmlrpc 2024-05-22 20:42:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: openstack-tripleo-heat-templates and tripleo-ansible update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:2736

