Bug 1858575 - [13->16.1] Compute DVR upgrade failed as its pulling wrong openstack-neutron-l3-agent container image
Summary: [13->16.1] Compute DVR upgrade failed as its pulling wrong openstack-neutron-...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z4
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Brent Eagles
QA Contact: Alex Katz
URL:
Whiteboard:
Duplicates: 1893638
Depends On:
Blocks:
Reported: 2020-07-19 09:35 UTC by Khomesh Thakre
Modified: 2022-08-30 11:45 UTC (History)

Fixed In Version: openstack-tripleo-common-11.4.1-1.20210104173607.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-20 11:03:33 UTC
Target Upstream Version:
Embargoed:


Attachments
Roles data (12.21 KB, text/x-matlab)
2020-07-22 06:04 UTC, Jose Luis Franco


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1898256 0 None None None 2020-10-02 16:38:00 UTC
OpenStack gerrit 773591 0 None MERGED Add ComputeNeutronl3Agent link to neutron-l3-agent container image 2021-02-18 15:26:05 UTC
OpenStack gerrit 773663 0 None MERGED Add missing ComputeNeutronMetadataAgent service into tripleo_containers.yaml.j2. 2021-03-12 13:33:52 UTC
Red Hat Issue Tracker OSP-1929 0 None None None 2022-08-30 11:45:35 UTC
Red Hat Knowledge Base (Solution) 5479741 0 None None None 2020-10-09 21:27:49 UTC

Description Khomesh Thakre 2020-07-19 09:35:36 UTC
Description of problem:
During an FFU of the Compute DVR use case with OVS, the upgrade failed because Ansible was pulling the wrong openstack-neutron-l3-agent container image.

~~~
TASK [Pre-fetch all the containers] ********************************************
Saturday 18 July 2020  13:09:43 -0400 (0:00:00.441)       0:13:08.800 ********* 
failed: [overcloud-novacompute-dvr-0] (item=registry.redhat.io/rhosp-rhel8/openstack-neutron-l3-agent:16.1) => {"ansible_loop_var": "prefetch_image", "changed": false, "msg": "Failed to pull image registry.redhat.io/rhosp-rhel8/openstack-neutron-l3-agent:16.1", "prefetch_image": "registry.redhat.io/rhosp-rhel8/openstack-neutron-l3-agent:16.1"}
~~~

The actual container image is:
~~~
$ openstack tripleo container image list | grep l3
| docker://undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-neutron-l3-agent:16.1-39                |
~~~

Below is the command used to run the upgrade:

~~~
openstack overcloud upgrade run --stack overcloud --limit overcloud-novacompute-dvr-0
~~~

Below is the prepare command we used:

~~~
$ cat osp16/00_overcloud-prepare.sh 
#!/bin/bash
mv nohup.out nohup-data/nohup.out-$(date +"%m-%d-%y"_"%T")
nohup openstack overcloud upgrade prepare --templates \
-r /home/stack/templates/roles_data.yaml \
-e /home/stack/templates/node-info.yaml \
-e /home/stack/templates/rhsm.yaml \
-e /home/stack/templates/upgrades-environment.yaml \
-e /home/stack/containers-prepare-parameter.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/neutron-ovs-dvr.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/templates/neutrondvr.yaml \
-e /home/stack/templates/network-environment.yaml  \
--ntp-server 192.168.24.1 --log-file overcloud_upgrade_prepare_1.log &
~~~

Version-Release number of selected component (if applicable):
Red Hat Openstack Platform 16.1-Beta

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
Ansible is pulling a container image which doesn't exist.

Expected results:
It should use the correct and latest container image available

Additional info:

Comment 2 Carlos Camacho 2020-07-20 13:30:45 UTC
Please add the content of the yaml files and explain how you are creating the scripts.
The container images seem to be ok.

Comment 3 Jose Luis Franco 2020-07-20 13:37:01 UTC
Hello,

Could you please confirm that you had the OS::TripleO::Services::NeutronL3Agent service enabled in your roles_data? In theory, the workaround you added should be automatically included in the containers-prepare-parameter step if you have the right service enabled:

https://github.com/openstack/tripleo-common/blob/master/container-images/overcloud_containers.yaml.j2#L462-L467

Also, do you have a CI job I could have a look at? Or did you run the procedure manually? The logs look confusing to me.

Comment 4 Jose Luis Franco 2020-07-20 13:37:48 UTC
I missed setting the needinfo for comment 3 above.

Comment 5 MD Sufiyan 2020-07-20 13:56:43 UTC
@jose,

This is a ComputeDVR node, so the L3 agent service is part of the role and runs on the compute node.

~~~
openstack overcloud roles generate -o /home/stack/templates/roles_data.yaml Controller ComputeDVR
~~~

>> /home/stack/templates/roles_data.yaml

~~~
###############################################################################
# Role: ComputeDVR                                                            #
###############################################################################
- name: ComputeDVR
  description: |
    DVR enabled Compute Node role
  CountDefault: 1
  tags:
    - external_bridge
  networks:
    InternalApi:
      subnet: internal_api_subnet
    Tenant:
      subnet: tenant_subnet
    Storage:
      subnet: storage_subnet
  HostnameFormatDefault: '%stackname%-novacompute-dvr-%index%'
  RoleParametersDefault:
    TunedProfileName: "virtual-host"
  update_serial: 25
  ServicesDefault:
    - OS::TripleO::Services::Aide
    - OS::TripleO::Services::AuditD
    - OS::TripleO::Services::BootParams
    - OS::TripleO::Services::CACerts
    - OS::TripleO::Services::CephClient
    - OS::TripleO::Services::CephExternal
    - OS::TripleO::Services::CertmongerUser
    - OS::TripleO::Services::Collectd
    - OS::TripleO::Services::ComputeCeilometerAgent
    - OS::TripleO::Services::ComputeNeutronCorePlugin
    - OS::TripleO::Services::ComputeNeutronL3Agent  # << l3 agent service
~~~

~~~
(overcloud) [stack@undercloud ~]$ cat ~/containers-prepare-parameter.yaml
# Generated with the following on 2020-07-09T11:46:41.783514
#
#   openstack tripleo container image prepare default --local-push-destination --output-env-file containers-prepare-parameter.yaml
#

parameter_defaults:
  DockerInsecureRegistryAddress:
  - 192.168.24.1:8787
  - undercloud.ctlplane.localdomain:8787
  ContainerImagePrepare:
  - push_destination: true
    set:
      ceph_alertmanager_image: ose-prometheus-alertmanager
      ceph_alertmanager_namespace: registry.redhat.io/openshift4
      ceph_alertmanager_tag: 4.1
      ceph_grafana_image: rhceph-4-dashboard-rhel8
      ceph_grafana_namespace: registry.redhat.io/rhceph
      ceph_grafana_tag: 4
      ceph_image: rhceph-4-rhel8
      ceph_namespace: registry.redhat.io/rhceph
      ceph_node_exporter_image: ose-prometheus-node-exporter
      ceph_node_exporter_namespace: registry.redhat.io/openshift4
      ceph_node_exporter_tag: v4.1
      ceph_prometheus_image: ose-prometheus
      ceph_prometheus_namespace: registry.redhat.io/openshift4
      ceph_prometheus_tag: 4.1
      ceph_tag: latest
      name_prefix: openstack-
      name_suffix: ''
      namespace: registry.redhat.io/rhosp-beta
      neutron_driver: openvswitch
      rhel_containers: false
      tag: '16.1'
      name_prefix_stein: openstack-
      name_suffix_stein: ''                            
      namespace_stein: registry.redhat.io/rhosp15-rhel8
      tag_stein: 15.0                                  
    tag_from_label: '{version}-{release}'
  ContainerImageRegistryCredentials:
    registry.redhat.io:
      user:pass
~~~

Comment 7 Jose Luis Franco 2020-07-21 20:03:24 UTC
So, I did have a look at the stack's environment parameters and it seems like that's a common pattern in other images:

  ContainerManilaShareImage: registry.redhat.io/rhosp-rhel8/openstack-manila-share:16.1
  ContainerMemcachedConfigImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-memcached:16.1-44
  ContainerMemcachedImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-memcached:16.1-44
  ContainerMetricsQdrConfigImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-qdrouterd:16.1-43
  ContainerMetricsQdrImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-qdrouterd:16.1-43
  ContainerMistralApiImage: registry.redhat.io/rhosp-rhel8/openstack-mistral-api:16.1
  ContainerMistralApiImageStein: ''
  ContainerMistralConfigImage: registry.redhat.io/rhosp-rhel8/openstack-mistral-api:16.1
  ContainerMistralEngineImage: registry.redhat.io/rhosp-rhel8/openstack-mistral-engine:16.1
  ContainerMistralEventEngineImage: registry.redhat.io/rhosp-rhel8/openstack-mistral-event-engine:16.1
  ContainerMistralExecutorImage: registry.redhat.io/rhosp-rhel8/openstack-mistral-executor:16.1
  ContainerMultipathdConfigImage: registry.redhat.io/rhosp-rhel8/openstack-multipathd:16.1
  ContainerMultipathdImage: registry.redhat.io/rhosp-rhel8/openstack-multipathd:16.1
  ContainerMysqlClientConfigImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-mariadb:16.1-43
  ContainerMysqlConfigImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-mariadb:16.1-43
  ContainerMysqlImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-mariadb:16.1-43
  ContainerNeutronApiImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-neutron-server:16.1-40
  ContainerNeutronApiImageStein: ''
  ContainerNeutronConfigImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-neutron-server:16.1-40
  ContainerNeutronDHCPImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-neutron-dhcp-agent:16.1-39
  ContainerNeutronL3AgentImage: registry.redhat.io/rhosp-rhel8/openstack-neutron-l3-agent:16.1
  ContainerNeutronMetadataImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-neutron-metadata-agent:16.1-43
  ContainerNeutronMlnxImage: registry.redhat.io/rhosp-rhel8/openstack-neutron-mlnx-agent:16.1
  ContainerNeutronSriovImage: registry.redhat.io/rhosp-rhel8/openstack-neutron-sriov-agent:16.1
  ContainerNovaApiImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-nova-api:16.1-41
  ContainerNovaApiImageStein: ''
  ContainerNovaComputeImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-nova-compute:16.1-37
  ContainerNovaComputeIronicImage: registry.redhat.io/rhosp-rhel8/openstack-nova-compute-ironic:16.1
  ContainerNovaConductorImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-nova-conductor:16.1-39
  ContainerNovaConductorImageStein: ''
  ContainerNovaConfigImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-nova-api:16.1-41
  ContainerNovaLibvirtConfigImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-nova-compute:16.1-37
  ContainerNovaLibvirtImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-nova-libvirt:16.1-40
  ContainerNovaMetadataConfigImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-nova-api:16.1-41
  ContainerNovaMetadataImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-nova-api:16.1-41
  ContainerNovaSchedulerImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-nova-scheduler:16.1-38
  ContainerNovaSerialproxyConfigImage: registry.redhat.io/rhosp-rhel8/openstack-nova-serialproxy:16.1
  ContainerNovaSerialproxyImage: registry.redhat.io/rhosp-rhel8/openstack-nova-serialproxy:16.1
  ContainerNovaVncProxyImage: undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-nova-novncproxy:16.1-40
  ContainerNovajoinConfigImage: registry.redhat.io/rhosp-rhel8/openstack-novajoin-server:16.1
  ContainerNovajoinNotifierImage: registry.redhat.io/rhosp-rhel8/openstack-novajoin-notifier:16.1
  ContainerNovajoinServerImage: registry.redhat.io/rhosp-rhel8/openstack-novajoin-server:16.1

ContainerNeutronL3AgentImage isn't the only parameter which points at the plain 16.1 tag; you can see others, like ContainerMistralConfigImage or ContainerMistralExecutorImage.
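
To make the pattern in the listing easier to spot, here is a minimal sketch that splits image parameters into those served from the undercloud registry and those still pointing elsewhere. The function name `split_by_registry` and the hard-coded sample are assumptions made for illustration; the real values would come from the stack's environment parameters.

```python
# Hypothetical helper, not part of any TripleO tool: separate image
# parameters that use the local undercloud registry from the rest.
LOCAL_REGISTRY = "undercloud.ctlplane.localdomain:8787"  # taken from the listing above


def split_by_registry(image_params, local_registry=LOCAL_REGISTRY):
    """Return (local, remote) dicts of parameter name -> image reference."""
    local, remote = {}, {}
    for name, image in image_params.items():
        if not image:
            continue  # skip empty values such as the *ImageStein parameters
        target = local if image.startswith(local_registry + "/") else remote
        target[name] = image
    return local, remote


# Three entries copied from the listing above.
sample = {
    "ContainerNeutronDHCPImage": "undercloud.ctlplane.localdomain:8787/rhosp-beta/openstack-neutron-dhcp-agent:16.1-39",
    "ContainerNeutronL3AgentImage": "registry.redhat.io/rhosp-rhel8/openstack-neutron-l3-agent:16.1",
    "ContainerNeutronApiImageStein": "",
}

local, remote = split_by_registry(sample)
for name in sorted(remote):
    print(name, "->", remote[name])
```

Parameters reported as remote were not rewritten by the image prepare step. For services that are not enabled in any role that is harmless, but for a service that is enabled it leads to a failed pull like the one in the description.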

Looking deeper, all the container images using the 16.1 tag are, for some unknown reason, using the rhosp-rhel8 namespace, which isn't right because that namespace doesn't have the 16.1 tag registered yet: https://catalog.redhat.com/software/containers/rhosp-rhel8/openstack-neutron-l3-agent/5de6baf6d70cc51644a57066

The rhosp-beta namespace is the one containing the right tag: https://catalog.redhat.com/software/containers/rhosp-beta/openstack-neutron-l3-agent/5cdc8278d70cc57c44b28750

I believe the reason is the tag_from_label parameter, which was included in containers-prepare-parameter.yaml. I've removed it from the file and relaunched the upgrade prepare step, and I will check the image parameters again. This time they will probably all use the rhosp-beta namespace.

Comment 8 Jose Luis Franco 2020-07-22 06:04:46 UTC
Created attachment 1702031
Roles data

Comment 9 Jose Luis Franco 2020-07-22 06:15:04 UTC
So, I was looking in the wrong place. The problem is that the ComputeNeutronL3Agent service is enabled but NeutronL3Agent is not, so we don't set the image Heat parameter.

This can be seen here:


- imagename: "{{namespace}}/{{name_prefix}}neutron-l3-agent{{name_suffix}}:{{tag}}"
  image_source: kolla
  params:
  - ContainerNeutronL3AgentImage
  services:
  - OS::TripleO::Services::NeutronL3Agent

https://github.com/openstack/tripleo-common/blob/stable/train/container-images/overcloud_containers.yaml.j2#L482-L487

The only service which triggers the image name creation is NeutronL3Agent; the ComputeNeutronL3Agent service does not appear in the file. The solution would be to do something similar to what is done for ComputeNeutronOvsAgent and NeutronOvsAgent (https://github.com/openstack/tripleo-common/blob/stable/train/container-images/overcloud_containers.yaml.j2#L496-L502):

- imagename: "{{namespace}}/{{name_prefix}}neutron-l3-agent{{name_suffix}}:{{tag}}"
  image_source: kolla
  params:
  - ContainerNeutronL3AgentImage
  services:
  - OS::TripleO::Services::NeutronL3Agent
  - OS::TripleO::Services::ComputeNeutronL3Agent

This would ensure we get the right container image for the openstack-neutron-l3-agent container.
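
The effect of the extra service line can be illustrated with a small sketch. This is a simplified model of the matching described above, not the actual tripleo-common code; the function `params_to_set` and the dictionary layout are assumptions that only mirror the fields quoted from overcloud_containers.yaml.j2.

```python
# Simplified model of one mapping entry, using the fields quoted above.
L3_ENTRY_BEFORE = {
    "imagename": "neutron-l3-agent",
    "params": ["ContainerNeutronL3AgentImage"],
    "services": ["OS::TripleO::Services::NeutronL3Agent"],
}

# Same entry with the proposed additional service.
L3_ENTRY_AFTER = dict(
    L3_ENTRY_BEFORE,
    services=[
        "OS::TripleO::Services::NeutronL3Agent",
        "OS::TripleO::Services::ComputeNeutronL3Agent",
    ],
)


def params_to_set(entries, enabled_services):
    """Collect the Heat image parameters of entries matching an enabled service."""
    enabled = set(enabled_services)
    params = []
    for entry in entries:
        # An entry is selected only if one of its services is enabled.
        if enabled.intersection(entry["services"]):
            params.extend(entry["params"])
    return params


# Two of the services listed for the ComputeDVR role in comment 5.
compute_dvr_services = [
    "OS::TripleO::Services::ComputeNeutronCorePlugin",
    "OS::TripleO::Services::ComputeNeutronL3Agent",
]

print(params_to_set([L3_ENTRY_BEFORE], compute_dvr_services))  # []
print(params_to_set([L3_ENTRY_AFTER], compute_dvr_services))   # ['ContainerNeutronL3AgentImage']
```

With the original entry nothing matches, so ContainerNeutronL3AgentImage keeps its default value; with the added service the parameter is selected and can be pointed at the prepared image.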

Moving the BZ to DFG:Networking, as they are more familiar with this type of set up. However, the configuration seems to be fine as the environment deployed properly with DVR in OSP13.

I attached the full roles_data.yaml from the environment in the previous comment.

Comment 10 Bernard Cafarelli 2020-07-22 12:30:41 UTC
Analysis from comment 9 looks good; we should have the ComputeNeutron* variants listed in the overcloud_containers mapping. We can run most of them directly on compute nodes, for example with the ComputeDVR role:
https://github.com/openstack/tripleo-heat-templates/blob/master/roles/ComputeDVR.yaml

DCN deployments can also have dhcp/metadata on compute nodes

Comment 21 Jose Luis Franco 2020-11-04 14:40:47 UTC
It has been observed during the FFU 13 to 16.1 lab testing in PSI that the very same issue occurs with openstack-neutron-metadata-agent and ComputeNeutronMetadataAgent. If this service is enabled in any of the roles, the image won't be downloaded into the Undercloud's registry, because the service isn't present in container-images/tripleo_containers.yaml.j2 (https://github.com/openstack/tripleo-common/blob/master/container-images/tripleo_containers.yaml.j2#L470), and the upgrade fails when trying to upgrade a compute node that includes the service.

@Brent, if it's ok by you I will add such a patch based on https://review.opendev.org/#/c/761418/. 

Also, it would be great to update the Knowledge Base entry to cover the ComputeNeutronMetadataAgent service too until these patches merge.

Comment 22 Jose Luis Franco 2020-11-04 15:42:04 UTC
For the knowledge base entry, this would solve it:
sudo hiera container_image_prepare_node_names
["undercloud.ctlplane.prod.ipa.lab"]

parameter_defaults:
  ContainerNeutronL3AgentImage: undercloud.ctlplane.prod.ipa.lab:8787/rhosp-rhel8/openstack-neutron-l3-agent:16.1-51
  ContainerNeutronMetadataImage: undercloud.ctlplane.prod.ipa.lab:8787/rhosp-rhel8/openstack-neutron-metadata-agent:16.1
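
As a rough sketch of how the override above could be generated for another environment, the snippet below renders the same parameter_defaults block from a registry host name and a map of image references. The function `render_image_overrides` is hypothetical; the host name would be the value reported by the hiera command above, port 8787 is the one used throughout this report, and the tags must match images that really exist in that registry.

```python
def render_image_overrides(registry_host, overrides, port=8787):
    """Render a parameter_defaults block.

    overrides maps a Heat image parameter to 'namespace/image:tag'.
    """
    lines = ["parameter_defaults:"]
    for param, image in sorted(overrides.items()):
        lines.append(f"  {param}: {registry_host}:{port}/{image}")
    return "\n".join(lines) + "\n"


# Values copied from the workaround above.
text = render_image_overrides(
    "undercloud.ctlplane.prod.ipa.lab",
    {
        "ContainerNeutronL3AgentImage": "rhosp-rhel8/openstack-neutron-l3-agent:16.1-51",
        "ContainerNeutronMetadataImage": "rhosp-rhel8/openstack-neutron-metadata-agent:16.1",
    },
)
print(text)
```

The rendered text can be saved as an environment file and passed with -e to the upgrade prepare command, in the same way as the other environment files shown in the description.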

Comment 30 Lukas Bezdicka 2021-05-05 10:51:26 UTC
*** Bug 1893638 has been marked as a duplicate of this bug. ***

