Bug 1756443

Summary: Race condition on releasing new images between "upgrade prepare" and "external-upgrade run --tags container_image_prepare"
Product: Red Hat OpenStack
Component: openshift-heat-templates
Version: 15.0 (Stein)
Hardware: x86_64
OS: Linux
Status: CLOSED EOL
Severity: low
Priority: medium
Type: Bug
Reporter: Paras Babbar <pbabbar>
Assignee: Jiri Stransky <jstransk>
QA Contact: RHOS Maint <rhos-maint>
CC: abishop, amoralej, athomas, dbecker, jpretori, jstransk, lbezdick, mburns, morazi, pbabbar, scollier, sgolovat, stchen
Keywords: Triaged, ZStream
Last Closed: 2020-09-30 19:17:02 UTC
Bug Blocks: 1727807
Attachments: Logs

Description Paras Babbar 2019-09-27 16:12:29 UTC
Created attachment 1620211
Logs

Description of problem:

There was a tag mismatch between what the upgrade run YAML scripts try to pull and what is actually available in the undercloud (UC) registry.

Fatal: [controller-0]: FAILED! => {"changed": true, "cmd": ["podman", "pull", "192.168.24.1:8787/rhosp15/openstack-cinder-volume:15.0-70"]

VS

curl -X GET http://192.168.24.1:8787/v2/rhosp15/openstack-cinder-volume/{"name": "rhosp15/openstack-cinder-volume", "tags": ["15.0-71"]
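
A minimal sketch of how the mismatch above could be checked mechanically, assuming the registry's tags/list response body has the shape shown later in this report. The helper names `split_image_ref` and `tag_is_available` are hypothetical and not part of TripleO; the image reference and listing are the values from this bug.

```python
import json


def split_image_ref(image_ref):
    """Split 'host:port/namespace/name:tag' into (repository, tag)."""
    # The tag separator is the last ':' that comes after the last '/',
    # so the registry port (e.g. ':8787') is not mistaken for a tag.
    last_slash = image_ref.rfind("/")
    colon = image_ref.rfind(":")
    if colon <= last_slash:
        raise ValueError("image reference has no tag: %s" % image_ref)
    return image_ref[:colon], image_ref[colon + 1:]


def tag_is_available(image_ref, tags_list_body):
    """Return True if the tag of image_ref appears in a tags/list body."""
    _, tag = split_image_ref(image_ref)
    tags = json.loads(tags_list_body).get("tags") or []
    return tag in tags


# Values taken from this report: the failing pull and the registry listing.
wanted = "192.168.24.1:8787/rhosp15/openstack-cinder-volume:15.0-70"
listing = '{"name": "rhosp15/openstack-cinder-volume", "tags": ["15.0-71"]}'
print(tag_is_available(wanted, listing))  # False
```

Running a check like this against each image parameter before `upgrade run` would surface the mismatch earlier than the failed `podman pull` does.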


Version-Release number of selected component(if applicable):
(14->15 upgrade)

How reproducible:
It occurs at irregular intervals and has been hit by two or three engineers, to my knowledge. It worked fine after re-running the overcloud upgrade prepare step.

Steps to Reproduce:

Use the below commands:

cat <<EOF > ~/overcloud-params.yaml
parameter_defaults:
  UpgradeLeappDevelSkip: LEAPP_DEVEL_SKIP_RHSM=1 LEAPP_SKIP_CHECK_OS_RELEASE=1
  UpgradeLeappDevelSkipRhsm: true
  SELinuxMode: permissive
  ContainerHealthcheckDisabled: true
EOF

cp overcloud_deploy.sh overcloud_upgrade_prepare.sh
sed -i overcloud_upgrade_prepare.sh -e 's/openstack overcloud deploy/OVERCLOUD_SKIP_UPGRADE_SUPPORT_CHECK=1 openstack overcloud upgrade prepare/g'
sed -i overcloud_upgrade_prepare.sh -e '/\-e ~\/containers-prepare-parameter.yaml/i -e \/usr\/share\/openstack-tripleo-heat-templates\/environments\/services\/neutron-ovs.yaml \\'
sed -i overcloud_upgrade_prepare.sh -e '/\-e ~\/containers-prepare-parameter.yaml/a -e ~\/overcloud-params\.yaml \\'

bash -x overcloud_upgrade_prepare.sh 2>&1 | tee oc-upgrade-prepare.log

# Upload images to local registry on undercloud
openstack overcloud external-upgrade run --tags container_image_prepare 2>&1 | tee oc-container-image-prepare.log


# Check the tag in the stack environment against the registry. The two show different tags (15.0-70 vs 15.0-71):

Fatal: [controller-0]: FAILED! => {"changed": true, "cmd": ["podman", "pull", "192.168.24.1:8787/rhosp15/openstack-cinder-volume:15.0-70"]

VS

curl -X GET http://192.168.24.1:8787/v2/rhosp15/openstack-cinder-volume/{"name": "rhosp15/openstack-cinder-volume", "tags": ["15.0-71"]

 
Actual results:
Pull latest cinder_volume images - failed

Expected results:
Pull latest cinder_volume images - success

Additional info:
On 2019-09-26 22:36, Jose Luis Franco Arza wrote:
> Ok, I re-ran the "overcloud upgrade prepare" command and it now takes the
> right tag in the heat stack:
> 
> (undercloud) [stack@undercloud-0 ~]$ openstack stack environment show
> overcloud | grep "rhosp15/openstack-cinder-volume"
>    ContainerCinderVolumeImage:
> 192.168.24.1:8787/rhosp15/openstack-cinder-volume:15.0-71
>    DockerCinderVolumeImage:
> 192.168.24.1:8787/rhosp15/openstack-cinder-volume:15.0-71
> 
> (undercloud) [stack@undercloud-0 ~]$ curl -X GET
> http://192.168.24.1:8787/v2/rhosp15/openstack-cinder-volume/tags/list
> {"name": "rhosp15/openstack-cinder-volume", "tags": ["15.0-71"]}
> 
> Try to run the overcloud upgrade run again, please.

I've hit a similar thing earlier in my testing; there may be a real bug
here.

Not 100% sure, but I think this may be happening: the `openstack
overcloud upgrade prepare` command looks at the latest images and writes
*ContainerImage Heat parameters, then `openstack overcloud
external-upgrade run --tags container_image_prepare` *also* looks at the
latest images and fetches them. So if the definition of the latest image
changes between running those two commands (e.g. new containers happen
to be released), there can be a mismatch between what's in the params
and what's in the UC registry.

The same thing might happen on `overcloud deploy` too. The chance
is much lower because both pieces are executed within the same CLI
command, but I suspect the race condition might still be there.
The internals of the image parameters/uploads are more of a DFG:DF
expertise, so we will probably ask them for some insight. Ideally the
image uploader would look at the generated parameters and fetch those
images instead of fetching latest, but I'm not sure how easy that fix
would be.

The workaround in our case is to always run both of the aforementioned
commands in immediate succession.
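
The "ideal" behaviour described above (the uploader fetching the images named in the generated parameters rather than resolving latest again) could look roughly like the sketch below. This is an illustration only, not the tripleo-common implementation: `images_from_parameters`, `upload_pinned` and the `fetch` callable are hypothetical names, and the parameter values are the ones shown in the stack environment output quoted above.

```python
def images_from_parameters(parameters):
    """Collect the concrete image references from *Image parameters."""
    return sorted({
        value for key, value in parameters.items()
        if key.endswith("Image") and isinstance(value, str)
    })


def upload_pinned(parameters, fetch):
    """Fetch exactly the images named in the parameters.

    'fetch' is a caller-supplied callable (hypothetical here) that copies
    one image reference into the undercloud registry. Nothing in this
    function resolves "latest", so it cannot drift from the parameters.
    """
    uploaded = []
    for ref in images_from_parameters(parameters):
        fetch(ref)
        uploaded.append(ref)
    return uploaded


parameters = {
    "ContainerCinderVolumeImage":
        "192.168.24.1:8787/rhosp15/openstack-cinder-volume:15.0-71",
    "DockerCinderVolumeImage":
        "192.168.24.1:8787/rhosp15/openstack-cinder-volume:15.0-71",
    "SELinuxMode": "permissive",  # not an image parameter, ignored
}
fetched = []
print(upload_pinned(parameters, fetched.append))
```

Because both parameters point at the same reference, the image is fetched once, and the tag is whatever the parameters say regardless of what has been released since.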

Comment 1 Alan Bishop 2019-09-27 17:03:59 UTC
As noted in the BZ description, this is something for DFG:DF (or maybe DFG:Upgrades ?) to comment on. While cinder's container image may have been involved, the problem does not seem to be with cinder itself.

Comment 4 Jiri Stransky 2019-10-08 11:53:43 UTC
I went through the code paths today, and the initial suspicion seems correct. This is a race condition between TripleO actions and the release of new images. It can occur on both deployment and upgrade, but it is less likely during deployment because the code runs within a single command.

First, the Mistral workflow (deployment [1] or update/upgrade [2]) calls an action to prepare container image parameters [3]. That action tries to dereference the latest images into specific (non-latest) tags and writes them into the "environments/containers-default-parameters.yaml" file [4].

Then the image-prepare task (the part which uploads images to the undercloud) runs. It gets fed the same ContainerImagePrepare parameter [5] as the Mistral action earlier, but it does not get fed the actual dereferenced values from the yaml file created earlier [4]. So it does its own dereferencing of the latest images before uploading them to the undercloud registry.

The overcloud wants to use images which were dereferenced by the Mistral workflow, but the undercloud contains images dereferenced by the image uploader. If some of the images were updated between the parameter generation and the image upload, the overcloud will want to fetch different images than what the undercloud offers, and the deployment/upgrade can break.
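
The sequence above can be reproduced in a toy model. This is only a sketch of the race, assuming a stand-in `FakeRegistry` class and two hypothetical functions in place of the Mistral action and the image uploader; none of these names exist in TripleO.

```python
class FakeRegistry:
    """Hypothetical stand-in for the upstream image source."""

    def __init__(self, latest_tag):
        self.latest_tag = latest_tag

    def dereference_latest(self, image):
        # Resolve "latest" to the concrete tag that is current right now.
        return "%s:%s" % (image, self.latest_tag)


def generate_parameters(source, image):
    """Step 1: pin the image in a Heat-style parameter mapping."""
    return {"ContainerCinderVolumeImage": source.dereference_latest(image)}


def upload_images(source, image):
    """Step 2: dereference again, independently, and 'upload' the result."""
    return {source.dereference_latest(image)}


image = "rhosp15/openstack-cinder-volume"
source = FakeRegistry("15.0-70")

params = generate_parameters(source, image)
source.latest_tag = "15.0-71"  # a new image is released in between
undercloud = upload_images(source, image)

wanted = params["ContainerCinderVolumeImage"]
print(wanted, "available:", wanted in undercloud)
# prints: rhosp15/openstack-cinder-volume:15.0-70 available: False
```

The two steps each dereference "latest" on their own, so any release that lands between them leaves the parameters pointing at a tag the undercloud never received.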

[1] https://github.com/openstack/tripleo-common/blob/58abba685e65441e52a1e577baa653dac6852fcc/workbooks/deployment.yaml#L168
[2] https://github.com/openstack/tripleo-common/blob/58abba685e65441e52a1e577baa653dac6852fcc/workbooks/package_update.yaml#L27
[3] https://github.com/openstack/tripleo-common/blob/58abba685e65441e52a1e577baa653dac6852fcc/tripleo_common/actions/container_images.py#L117
[4] https://github.com/openstack/tripleo-common/blob/58abba685e65441e52a1e577baa653dac6852fcc/tripleo_common/constants.py#L180-L181
[5] https://github.com/openstack/tripleo-heat-templates/blob/668588d73e6e40adb85e53c0000c078d84624837/deployment/container-image-prepare/container-image-prepare-baremetal-ansible.j2.yaml#L117-L128

-----

I'm adding DFG:DF to the whiteboard, as the race condition is also present on deployment and the root cause lies within the image prepare mechanisms. The workaround in both the deployment and upgrade cases is to re-run the failed command(s). In the upgrade case the potential for the race condition is larger. The scope could be reduced (made the same as during deployment) by essentially running `external-upgrade run --tags container_image_prepare` from within `upgrade prepare`, e.g. by adding a `--container-image-prepare` parameter.

Comment 7 stchen 2020-09-30 19:17:02 UTC
Closing EOL, OSP 15 has been retired as of Sept 19