Bug 2058421 - 4.9.23-s390x-machine-os-content manifest invalid when mirroring content for disconnected install
Summary: 4.9.23-s390x-machine-os-content manifest invalid when mirroring content for disconnected install
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.9
Hardware: s390x
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Colin Walters
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 2059761 2060067
 
Reported: 2022-02-24 21:06 UTC by Philip Chan
Modified: 2022-08-30 18:44 UTC (History)
CC List: 21 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2059761 2060067 (view as bug list)
Environment:
Last Closed: 2022-08-10 10:51:23 UTC
Target Upstream Version:
Embargoed:


Attachments
Command and output from invalid manifest error (44.74 KB, text/plain) - 2022-02-24 21:06 UTC, Philip Chan
4.10.0-rc.5 mirror image failure for machine-os-content (42.52 KB, text/plain) - 2022-02-25 14:07 UTC, Philip Chan
4.9.23 disconnected failure on Power (80.64 KB, text/plain) - 2022-03-03 11:47 UTC, Manisha


Links
Red Hat Issue Tracker MULTIARCH-2278 - last updated 2022-02-24 21:15:00 UTC
Red Hat Product Errata RHSA-2022:5069 - last updated 2022-08-10 10:51:56 UTC

Description Philip Chan 2022-02-24 21:06:33 UTC
Created attachment 1863246 [details]
Command and output from invalid manifest error

Description of problem:
For a disconnected installation, we attempt to mirror the OCP 4.9.23-s390x content, but the mirror consistently fails on the 4.9.23-s390x-machine-os-content image with a manifest invalid error.

This is the command we issue to directly push the release images to our local registry:

# oc adm -a ${LOCAL_SECRET_JSON} release mirror \
    --from=quay.io/${PRODUCT_REPO}/${RELEASE_NAME}:${OCP_RELEASE} \
    --to=${LOCAL_REGISTRY}/${LOCAL_REPOSITORY} \
    --to-release-image=${LOCAL_REGISTRY}/${LOCAL_REPOSITORY}:${OCP_RELEASE} \
    --apply-release-image-signature
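For reference, a minimal sketch of the environment variables behind this command. The registry and repository values are inferred from the mirror output below; the pull-secret path is purely illustrative, and PRODUCT_REPO/RELEASE_NAME assume the standard quay.io/openshift-release-dev/ocp-release release image location.

```
# Illustrative values only; adjust for your environment.
export LOCAL_SECRET_JSON=/path/to/pull-secret.json   # hypothetical path to the registry pull secret
export PRODUCT_REPO=openshift-release-dev
export RELEASE_NAME=ocp-release
export OCP_RELEASE=4.9.23-s390x
export LOCAL_REGISTRY=bastion:5000
export LOCAL_REPOSITORY=ocp4/openshift4
```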

The majority of the images are pushed to our local registry, until the following error occurs:

...
sha256:14e66cc7f40e3efba3fd20105ad1f40913d4fd2085ccf04e33cbfeee7127c1b5 bastion:5000/ocp4/openshift4:4.9.23-s390x-kuryr-controller
sha256:14e66cc7f40e3efba3fd20105ad1f40913d4fd2085ccf04e33cbfeee7127c1b5 bastion:5000/ocp4/openshift4:4.9.23-s390x-pod
sha256:14e66cc7f40e3efba3fd20105ad1f40913d4fd2085ccf04e33cbfeee7127c1b5 bastion:5000/ocp4/openshift4:4.9.23-s390x-vsphere-csi-driver
sha256:14e66cc7f40e3efba3fd20105ad1f40913d4fd2085ccf04e33cbfeee7127c1b5 bastion:5000/ocp4/openshift4:4.9.23-s390x-vsphere-csi-driver-operator
sha256:14e66cc7f40e3efba3fd20105ad1f40913d4fd2085ccf04e33cbfeee7127c1b5 bastion:5000/ocp4/openshift4:4.9.23-s390x-vsphere-csi-driver-syncer
sha256:14e66cc7f40e3efba3fd20105ad1f40913d4fd2085ccf04e33cbfeee7127c1b5 bastion:5000/ocp4/openshift4:4.9.23-s390x-vsphere-problem-detector
error: unable to push manifest to bastion:5000/ocp4/openshift4:4.9.23-s390x-machine-os-content: manifest invalid: manifest invalid
info: Mirroring completed in 930ms (0B/s)
error: one or more errors occurred while uploading images

Version-Release number of selected component (if applicable):
OCP 4.9.23

How reproducible:
Consistently reproducible.

Steps to Reproduce:
1. Have a mirror-registry container started and running (a minimal run command is sketched after these steps):
# podman ps
CONTAINER ID  IMAGE                                    COMMAND               CREATED      STATUS          PORTS                   NAMES
f875644decab  docker.io/ibmcom/registry-s390x:2.6.2.5  registry serve /e...  4 hours ago  Up 2 hours ago  0.0.0.0:5000->5000/tcp  mirror-registry
2. Issue the command to push content to this local registry:
# oc adm -a ${LOCAL_SECRET_JSON} release mirror \
    --from=quay.io/${PRODUCT_REPO}/${RELEASE_NAME}:${OCP_RELEASE} \
    --to=${LOCAL_REGISTRY}/${LOCAL_REPOSITORY} \
    --to-release-image=${LOCAL_REGISTRY}/${LOCAL_REPOSITORY}:${OCP_RELEASE} \
    --apply-release-image-signature
3. Fails with
error: unable to push manifest to bastion:5000/ocp4/openshift4:4.9.23-s390x-machine-os-content: manifest invalid: manifest invalid
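For step 1, a minimal sketch of starting that mirror-registry container, assuming the image and port mapping shown in the podman ps output above; TLS, authentication, and storage options are omitted here:

```
# Minimal sketch; a real mirror registry normally mounts certificates, auth, and persistent storage.
podman run -d --name mirror-registry \
  -p 5000:5000 \
  docker.io/ibmcom/registry-s390x:2.6.2.5
```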

Actual results:
A manifest invalid error occurs for the 4.9.23-s390x-machine-os-content image.

Expected results:
All 4.9.23 images should be pushed to our local registry successfully.

Additional info:

Comment 2 Philip Chan 2022-02-25 14:06:45 UTC
We are now seeing the same failure with OCP 4.10.0-rc.5 when performing the mirror.  It is the same image, but for rc.5: s390x-machine-os-content.  I will attach the output.

Comment 3 Philip Chan 2022-02-25 14:07:48 UTC
Created attachment 1863319 [details]
4.10.0-rc.5 mirror image failure for machine-os-content

Comment 4 Philip Chan 2022-02-25 14:34:16 UTC
We've also successfully re-mirrored the content from previous OCP releases such as OCP 4.8.33, 4.9.22, and 4.10.0-rc.4.  They all worked fine on both KVM and zVM platforms.

Comment 5 Dan Li 2022-02-25 15:53:17 UTC
After chatting with the team, we are re-assigning this bug to the ART/Release team to investigate the mirroring problem further and determine whether it is cross-arch. Please feel free to re-assign it back to us if this problem is multi-arch only.

Also note that this issue is observed on both 4.9.23 and 4.10-RC.5

Comment 6 Jeremy Poulin 2022-02-25 16:23:59 UTC
While I didn't take this as far on x86, I was able to confirm that all of the new machine-os-content images are manifested differently from their previous-release counterparts:


This is the output for x86 machine OS content for 4.9.23

> [jpoulin@rock-kvmlp-3 ~]$ podman manifest inspect quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ae92a919cb6da4d1a5d832f8bc486ae92e55bf3814ebab94bf4baa4c4bcde85d
> Error: error parsing manifest blob "{\"schemaVersion\":2,\"config\":{\"mediaType\":\"application/vnd.oci.image.config.v1+json\",\"digest\":\"sha256:88c5613cfae9f21dc2db7fc1d00dcf50f522bd3f85a787a1e71d67d53c445d34\",\"size\":3240},\"layers\":[{\"mediaType\":\"application/vnd.oci.image.layer.v1.tar+gzip\",\"digest\":\"sha256:0672ccd2448e808003cfd58868d18c6fbcba3cd02b6868808fefa6e76b61498f\",\"size\":85670741},{\"mediaType\":\"application/vnd.oci.image.layer.v1.tar+gzip\",\"digest\":\"sha256:0c9ea41036996ec83a5f118963d45a6f1f53e56dc8f0c9b8da744138f590b0d2\",\"size\":1879},{\"mediaType\":\"application/vnd.oci.image.layer.v1.tar+gzip\",\"digest\":\"sha256:7cc8c27e4d3ba855252f073c0712443491e7bb460a2e7a0d9536e899fd200b9b\",\"size\":1104943259}],\"annotations\":{\"org.opencontainers.image.base.digest\":\"sha256:cbc1e8cea8c78cfa1490c4f01b2be59d43ddbbad6987d938def1960f64bcd02c\",\"org.opencontainers.image.base.name\":\"registry.access.redhat.com/ubi8/ubi:latest\"}}" as a "application/vnd.oci.image.manifest.v1+json": Treating single images as manifest lists is not implemented

This is the output for x86 machine OS content for 4.9.22

>[jpoulin@rock-kvmlp-3 ~]$ podman manifest inspect quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:07666daf9bb9249e666e66b117f2b7ad7ed0cd68c1f9265124c244e80e685482
> WARN[0000] Warning! The manifest type application/vnd.docker.distribution.manifest.v2+json is not a manifest list but a single image. 
> {
>    "schemaVersion": 2,
>    "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
>    "config": {
>        "mediaType": "application/vnd.docker.container.image.v1+json",
>        "size": 3236,
>        "digest": "sha256:db548dfe0de420165b67a2b2174c2d94a5542e096beeb5219505759aa847d406"
>    },
>    "layers": [
>        {
>            "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
>            "size": 85670741,
>            "digest": "sha256:0672ccd2448e808003cfd58868d18c6fbcba3cd02b6868808fefa6e76b61498f"
>        },
>        {
>            "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
>            "size": 1879,
>            "digest": "sha256:0c9ea41036996ec83a5f118963d45a6f1f53e56dc8f0c9b8da744138f590b0d2"
>        },
>        {
>            "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
>            "size": 1104753786,
>            "digest": "sha256:6b1b21084b0cf1c22b954983610a7cd598b46798d0d3d14243b291da59e3f10a"
>        }
>    ]
>}

All arches I've tested are like this for the latest 4.9 (23) and 4.10 (rc5). I believe this will break mirroring as documented.
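One quick way to compare the manifest types directly, without podman trying to interpret the digest as a manifest list (a sketch, assuming skopeo and jq are available; the pull-secret path is a placeholder):

```
# Print the config media type of each machine-os-content manifest.
# The OCI manifest may omit a top-level mediaType, so compare .config.mediaType instead.
for digest in \
  sha256:ae92a919cb6da4d1a5d832f8bc486ae92e55bf3814ebab94bf4baa4c4bcde85d \
  sha256:07666daf9bb9249e666e66b117f2b7ad7ed0cd68c1f9265124c244e80e685482; do
  skopeo inspect --raw --authfile ~/pull-secret.json \
    "docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@${digest}" | jq -r '.config.mediaType'
done
# 4.9.23 prints application/vnd.oci.image.config.v1+json (OCI);
# 4.9.22 prints application/vnd.docker.container.image.v1+json (Docker v2s2).
```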

Comment 7 Colin Walters 2022-02-25 21:27:58 UTC
OK, it's possible that something changed in coreos-assembler here - we are using the current Fedora 35 buildah when generating this image.  But there's also a big quay.io update that happened around the same time.

I just used coreos-assembler to upload a build to quay.io/cgwalters/ostest:35.20220225.dev.0 and I see
```
$ oc image info quay.io/cgwalters/ostest:35.20220225.dev.0
Name:       quay.io/cgwalters/ostest:35.20220225.dev.0
Digest:     sha256:2c984693ca74168bd9f6e1405b4302322b6808c21a39d1041e16262dfd687de1
Media Type: application/vnd.oci.image.manifest.v1+json
Created:    1m ago
```

So...actually, I think what happened here is quay.io now supports OCI natively, and we now push images that way.

I think we may need to change coreos-assembler to explicitly use --format=v2s2 for now, until we're ready to switch to OCI by default.
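Not the actual coreos-assembler change (that lands via the PR in the next comment), but the general shape of forcing v2s2 at push time looks roughly like this; the image names are placeholders:

```
# Force the Docker schema 2 manifest type regardless of what the registry negotiates.
buildah push --format v2s2 localhost/machine-os-content:dev \
  docker://quay.io/someorg/machine-os-content:dev
```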

Comment 8 Colin Walters 2022-02-25 21:37:46 UTC
https://github.com/coreos/coreos-assembler/pull/2726

Comment 9 Philip Chan 2022-02-28 16:33:51 UTC
We tried a disconnected installation using the latest 4.10.0-rc.6 build (under both KVM and zVM) that was just released, and the same error occurred during mirroring:

sha256:2a1588ed0f99e238e284d4dff10f60d180c892941c8e4d95dce4684e8452ed43 bastion:5000/ocp4/openshift4:4.10.0-rc.6-s390x-vsphere-csi-driver-syncer
sha256:2a1588ed0f99e238e284d4dff10f60d180c892941c8e4d95dce4684e8452ed43 bastion:5000/ocp4/openshift4:4.10.0-rc.6-s390x-vsphere-problem-detector
error: unable to push manifest to bastion:5000/ocp4/openshift4:4.10.0-rc.6-s390x-machine-os-content: manifest invalid: manifest invalid
info: Mirroring completed in 5m45.72s (23MB/s)
error: one or more errors occurred while uploading images

Comment 10 Philip Chan 2022-03-01 15:20:16 UTC
Raising this bug to urgent, as it is currently blocking a number of our tests across both the 4.9.x and 4.10.x releases.

Comment 11 Muhammad Adeel (IBM) 2022-03-01 16:20:06 UTC
Hi Colin, is there a backport PR? We need it for 4.9 and 4.10.

Comment 12 Jeremy Poulin 2022-03-01 22:14:29 UTC
Overriding the blocker status on this bug as per:
https://coreos.slack.com/archives/CB95J6R4N/p1646172701323879?thread_ts=1646153089.903709&cid=CB95J6R4N

Comment 13 Colin Walters 2022-03-01 22:14:37 UTC
See https://github.com/coreos/coreos-assembler/issues?q=label%3Abranch%2Frhcos+is%3Aclosed for the backport PRs, but we're still in the process of ensuring that build is deployed.

Comment 14 Colin Walters 2022-03-01 22:17:53 UTC
OK, so while it's a bit ugly, we can manually convert these images to v2s2, e.g.:

`skopeo copy --format=v2s2 docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e1dcc7ebecab4598c4c2a6a5a3d8768ed4546cd4b42b6f1822e2babd8cc864f7 docker://quay.io/someothernamespace/someimage:4.11.X`

Then pass `quay.io/someothernamespace/someimage:4.11.X` when generating `machine-os-content` in `oc adm release new`.

Or perhaps even simpler, disable the machine-os-content promotion job today, and manually re-push the existing image over itself and let release-controller use that.

It may also work to change the promotion job to do this conversion.
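A hedged sketch of what that `oc adm release new` override might look like; the release pullspecs below are placeholders, and only the converted machine-os-content reference matters:

```
# Rebuild the release payload with the v2s2-converted image substituted for machine-os-content.
oc adm release new \
  --from-release=quay.io/openshift-release-dev/ocp-release:4.11.0-example-s390x \
  machine-os-content=quay.io/someothernamespace/someimage:4.11.X \
  --to-image=quay.io/someothernamespace/release:4.11.0-example-s390x
```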

Comment 15 Colin Walters 2022-03-01 22:47:06 UTC
SUMMARY:
I'm going to rephrase and build on https://bugzilla.redhat.com/show_bug.cgi?id=2058421#c7

Docker was invented, and it initially used some older container image schemas that aren't relevant anymore.

Until recently, we mostly used what's called "v2s2" for short, or "Docker schema 2": https://docs.docker.com/registry/spec/manifest-v2-2/

But for years now there has been the standardized OCI format, and it is actually used *by default* by tools such as podman/buildah. It's just that when those tools push to a registry, a negotiation happens, and if the registry doesn't support OCI, the image is converted to v2s2.

Everything in the OCP payload (until this bug) was v2s2 - I don't think this was an explicit decision, but really a consequence of the lack of OCI registry support.

What happened recently is: quay.io deployed OCI support.  

And this is great, it's what we want to happen!  OCI support unlocks a bunch of things, including OCI Artifacts, which I am sure we will make use of.
Our tools *should* support OCI.

It seems that something in the `oc image mirror` path doesn't, though - and that should be fixed.

But still, until we have a handle on the blast radius of OCI, we should go back to v2s2.

Comment 17 krmoser 2022-03-02 13:27:56 UTC
Folks,

1. We tested with the RC.7 build and unfortunately we encountered the same manifest invalid issue with the disconnected-install mirror registry:

sha256:2a1588ed0f99e238e284d4dff10f60d180c892941c8e4d95dce4684e8452ed43 bastion:5000/ocp4/openshift4:4.10.0-rc.7-s390x-vsphere-csi-driver-operator
sha256:2a1588ed0f99e238e284d4dff10f60d180c892941c8e4d95dce4684e8452ed43 bastion:5000/ocp4/openshift4:4.10.0-rc.7-s390x-vsphere-csi-driver-syncer
sha256:2a1588ed0f99e238e284d4dff10f60d180c892941c8e4d95dce4684e8452ed43 bastion:5000/ocp4/openshift4:4.10.0-rc.7-s390x-vsphere-problem-detector
error: unable to push manifest to bastion:5000/ocp4/openshift4:4.10.0-rc.7-s390x-machine-os-content: manifest invalid: manifest invalid
info: Mirroring completed in 1m26.84s (91.55MB/s)
error: one or more errors occurred while uploading images


2. FYI, the RHCOS build is the same for RC.6 and RC.7: 410.84.202202251632-0

Thank you,
Kyle

Comment 18 Philip Chan 2022-03-02 18:21:38 UTC
Thank you for the updates -

We have successfully mirrored the RC.8 images for disconnected installation to our local registries.  These are the images residing on the CI servers.  We no longer encounter the manifest invalid error.  We will continue running through our installation and upgrade tests using this build and will also verify 4.10.0-rc.8 when it becomes available on quay.io.

Thank you,
Phil

Comment 19 Manisha 2022-03-03 11:45:28 UTC
While trying to execute the test for upgrading 4.9.23 to 4.10.0-rc.8 in a disconnected/restricted environment on POWER (ppc64le arch), the below issue was encountered.

This happened while installing 4.9.23:

sha256:ff709d98d118eb014a0b6f057bc735ff4d041b1ec104c4b68ac267373dfa5299 -> 4.9.23-ppc64le-cluster-authentication-operator
  stats: shared=0 unique=4 size=1006MiB ratio=1.00

phase 0:
  registry.rdr-mani-dis.ibm.com:5000 ocp4/openshift4 blobs=4 mounts=0 manifests=141 shared=0

info: Planning completed in 25.2s
error: unable to push manifest to registry.rdr-mani-dis.ibm.com:5000/ocp4/openshift4:4.9.23-ppc64le-machine-os-content: manifest invalid: manifest invalid
info: Mirroring completed in 1.68s (0B/s)
error: one or more errors occurred while uploading images

The detailed error log is attached.

Comment 20 Manisha 2022-03-03 11:47:30 UTC
Created attachment 1863967 [details]
4.9.23 disconnected failure on Power

Comment 21 Philip Chan 2022-03-03 20:00:26 UTC
We have now successfully completed all our disconnected install and upgrade tests using OCP 4.10.0-rc.8 and RHCOS 410.84.202202251632-0.  We used RC.8 from both CI and quay.io.  This was covered on both KVM and zVM platforms.  Note that for the disconnected upgrade tests from OCP 4.9, we used OCP 4.9.22, since the latest 4.9.23 currently has the manifest invalid issue.

Comment 22 Micah Abbott 2022-03-04 14:36:55 UTC
The fix to the coreos-assembler tooling was landed in https://github.com/coreos/coreos-assembler/pull/2726

We are waiting for new, successful builds of 4.11 across all arches before we can move this to MODIFIED.

Comment 23 Micah Abbott 2022-03-07 13:50:31 UTC
Rebuilds of RHCOS 4.11 across all arches have completed and are being pushed to Quay using the v2s2 format.
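Once the rebuilt payloads show up in a release image, a quick way to verify the manifest type of machine-os-content (a sketch; the release pullspec below is a placeholder):

```
# Resolve the machine-os-content pullspec from a release image, then check its media type.
MOC=$(oc adm release info quay.io/openshift-release-dev/ocp-release:4.11.0-example-s390x \
      --image-for=machine-os-content)
oc image info "$MOC"
# The "Media Type" line should now report application/vnd.docker.distribution.manifest.v2+json.
```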

Comment 29 errata-xmlrpc 2022-08-10 10:51:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

