Bug 1698253 - CRI-O 1.13.4-3.rhaos4.1.git30006b3.el8 raising "Manifest does not match provided manifest digest"
Summary: CRI-O 1.13.4-3.rhaos4.1.git30006b3.el8 raising "Manifest does not match provi...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Nalin Dahyabhai
QA Contact: weiwei jiang
URL:
Whiteboard:
Duplicates: 1699125
Depends On:
Blocks:
 
Reported: 2019-04-09 22:48 UTC by W. Trevor King
Modified: 2019-05-06 13:00 UTC (History)
CC List: 11 users

Fixed In Version: cri-o-1.13.6-1.dev.rhaos4.1.gitee2e748.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-29 20:35:44 UTC
Target Upstream Version:
Embargoed:


Attachments
Occurrences of this error in CI from 2019-04-08T22:49 to 2019-04-09T22:47 UTC (275.18 KB, image/svg+xml)
2019-04-09 22:50 UTC, W. Trevor King
Better occurrences plot (279.89 KB, image/svg+xml)
2019-04-09 23:35 UTC, W. Trevor King


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | cri-o cri-o pull 2238 | 0 | None | closed | [1.13] Revert containers/storage to older version | 2020-05-27 12:39:04 UTC

Description W. Trevor King 2019-04-09 22:48:27 UTC
Description of problem:

Starting today, we have been seeing OpenShift CI failures which include the error mentioned in the subject:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6581/artifacts/e2e-aws/pods.json | jq -r '.items[] |  select(.status.phase == "Pending") | .status.phase + " " + .metadata.name + " " + ([.status.containerStatuses[].state.waiting.message][0])'
Pending apiserver-9gh5p
Pending apiserver-hndrh
Pending apiserver-rn8n5
Pending controller-manager-2fb7b
Pending controller-manager-f5rbk
Pending controller-manager-g7trd
Pending apiservice-cabundle-injector-7d8658698b-4rxvn Failed to inspect image "registry.svc.ci.openshift.org/ocp/4.0-2019-04-09-183816@sha256:630293bd328a58ec7583fb3404a69cbc9f1efff873c95cf8876c380da92d33f6": rpc error: code = Unknown desc = Manifest does not match provided manifest digest sha256:630293bd328a58ec7583fb3404a69cbc9f1efff873c95cf8876c380da92d33f6
Pending service-serving-cert-signer-86c55f7c65-6qscm Failed to inspect image "registry.svc.ci.openshift.org/ocp/4.0-2019-04-09-183816@sha256:630293bd328a58ec7583fb3404a69cbc9f1efff873c95cf8876c380da92d33f6": rpc error: code = Unknown desc = Manifest does not match provided manifest digest sha256:630293bd328a58ec7583fb3404a69cbc9f1efff873c95cf8876c380da92d33f6
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6581/artifacts/e2e-aws/nodes.json | jq -r '.items[].status.nodeInfo.containerRuntimeVersion' | uniq
cri-o://1.13.4-3.rhaos4.1.git30006b3.el8
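
For anyone who prefers to post-process a downloaded pods.json without jq, here is a minimal Python sketch that roughly mirrors the first filter above. The function name and the sample document are made up for illustration; the sample is hand-written in the shape of pods.json and is not real CI data.

```python
import json


def pending_pod_messages(pods):
    """Return (pod name, first waiting message or "") for each Pending pod."""
    results = []
    for item in pods.get("items", []):
        status = item.get("status", {})
        if status.get("phase") != "Pending":
            continue
        # Collect the waiting message of every container status, if any.
        messages = [
            (cs.get("state", {}).get("waiting") or {}).get("message")
            for cs in status.get("containerStatuses", [])
        ]
        first = messages[0] if messages and messages[0] else ""
        results.append((item["metadata"]["name"], first))
    return results


# Hypothetical sample data, only for demonstrating the function.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "pod-a"}, "status": {"phase": "Running"}},
  {"metadata": {"name": "pod-b"}, "status": {"phase": "Pending"}},
  {"metadata": {"name": "pod-c"}, "status": {"phase": "Pending",
    "containerStatuses": [{"state": {"waiting": {
      "message": "Manifest does not match provided manifest digest"}}}]}}
]}
""")

for name, message in pending_pod_messages(sample):
    print("Pending", name, message)
```

Unlike the jq filter, this sketch tolerates pods that have no containerStatuses at all and reports an empty message for them.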

There are previous references to this error in bug 1546324 and bug 1669096 (this may be a duplicate of 1669096).  Also in this space are [1,2], with [2] going out with CRI-O 1.13.4 today.

[1]: https://github.com/cri-o/cri-o/pull/2066
[2]: https://github.com/cri-o/cri-o/pull/2071
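
For context on what the error is checking: a reference pinned with @sha256:... names an image by the SHA-256 hash of its manifest bytes, so the runtime can recompute the hash and compare. The sketch below shows only that comparison. It is not CRI-O's or containers/image's actual code, the function name is made up, and it does not model whatever in the storage layer caused the mismatch here.

```python
import hashlib


def manifest_matches_digest(manifest_bytes, digest):
    """Check manifest bytes against a digest string such as 'sha256:<hex>'."""
    algorithm, _, expected_hex = digest.partition(":")
    if algorithm != "sha256":
        raise ValueError("only sha256 digests are handled in this sketch")
    return hashlib.sha256(manifest_bytes).hexdigest() == expected_hex


# Placeholder manifest bytes, not a real image manifest.
manifest = b'{"schemaVersion": 2}'
digest = "sha256:" + hashlib.sha256(manifest).hexdigest()

print(manifest_matches_digest(manifest, digest))          # True
print(manifest_matches_digest(manifest + b"\n", digest))  # False
```

The second call shows why the check is strict: a single extra byte in the stored manifest is enough to change the hash and fail the comparison.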

Comment 1 W. Trevor King 2019-04-09 22:50:11 UTC
Created attachment 1553956 [details]
Occurrences of this error in CI from 2019-04-08T22:49 to 2019-04-09T22:47 UTC

This occurred in 7 of our 395 failures (about 2%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours.  Generated with [1]:

  $ deck-build-log-plot 'Manifest does not match provided manifest digest'
  7	Manifest does not match provided manifest digest
  	1	https://github.com/openshift/origin/pull/22518	ci-op-iprtmiil
  	1	https://github.com/openshift/origin/pull/22514	ci-op-7gr7m40k
  	1	https://github.com/openshift/origin/pull/22505	ci-op-7zj10g1r
  	1	https://github.com/openshift/machine-config-operator/pull/616	ci-op-9fi6il1y
  	1	https://github.com/openshift/installer/pull/1572	ci-op-mr0mf0s5
  	1	https://github.com/openshift/cluster-kube-apiserver-operator/pull/368	ci-op-l2v4b87j
  	1	https://github.com/openshift/builder/pull/61	ci-op-lib66inc

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
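
A rough sketch of this kind of tally, assuming the build logs have already been downloaded into a dict keyed by job name. This is not the deck-build-log-plot script; the function, job names, and log text are hypothetical.

```python
import re


def tally(logs, pattern):
    """Return (sorted matching job names, percent of all jobs that match)."""
    regex = re.compile(pattern)
    matching = sorted(job for job, text in logs.items() if regex.search(text))
    percent = 100.0 * len(matching) / len(logs) if logs else 0.0
    return matching, percent


# Hypothetical logs, only for demonstrating the function.
logs = {
    "job-1": "rpc error: Manifest does not match provided manifest digest",
    "job-2": "timed out waiting for the condition",
    "job-3": "unrelated failure",
    "job-4": "another unrelated failure",
}

jobs, percent = tally(logs, "Manifest does not match provided manifest digest")
print(jobs, percent)
```

With the counts quoted in this comment, 100 * 7 / 395 comes to roughly 1.8.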

Comment 2 Mrunal Patel 2019-04-09 23:27:41 UTC
With https://github.com/cri-o/cri-o/pull/2238, which is coming in 1.13.5, we will have the same versions of containers/image and containers/storage as in 1.12.x, which wasn't hitting these issues.

Comment 3 W. Trevor King 2019-04-09 23:35:23 UTC
Created attachment 1553979 [details]
Better occurrences plot

With some local, ugly hacks to pull pods.json and include that in my regexp matching, this issue rises to 54 instances (13% of *-e2e-aws* failures).  The oldest case I've found so far is from a cluster launched 2019-04-09T14:07Z [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_builder/61/pull-ci-openshift-builder-master-e2e-aws-builds/166/artifacts/e2e-aws-builds/pods.json | grep 'Manifest does not match provided manifest digest'
                                "message": "Failed to inspect image \"registry.svc.ci.openshift.org/ci-op-lib66inc/stable@sha256:8eb140d803ec324f5b3b472be7ffc2a6c582923833280c9d78227ccfb65d2154\": rpc error: code = Unknown desc = Manifest does not match provided manifest digest sha256:8eb140d803ec324f5b3b472be7ffc2a6c582923833280c9d78227ccfb65d2154",
                                "message": "Failed to inspect image \"registry.svc.ci.openshift.org/ci-op-lib66inc/stable@sha256:8eb140d803ec324f5b3b472be7ffc2a6c582923833280c9d78227ccfb65d2154\": rpc error: code = Unknown desc = Manifest does not match provided manifest digest sha256:8eb140d803ec324f5b3b472be7ffc2a6c582923833280c9d78227ccfb65d2154",


[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_builder/61/pull-ci-openshift-builder-master-e2e-aws-builds/166

Comment 4 W. Trevor King 2019-04-09 23:38:08 UTC
I'm not sure whether this bug is meant to track both the short-term rollback and the long-term fix, but I'm moving it to MODIFIED now that the rollback has landed.  We can always go back to ASSIGNED if we want to scope it to the long-term fix.

Comment 5 W. Trevor King 2019-04-10 05:09:41 UTC
So now we're waiting for this to get picked up in CI.  The next ART build should come in via an 8-hour periodic job on registry.svc.ci.openshift.org/rhcos/machine-os-content:latest, which promotes the machine-os-content into CI if the job passes.  Currently that's still the old image:

$ oc image info registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
Name:       registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
Digest:     sha256:ca664d88674d930afd6d727d0d6242668fe3e274f077abbe9d7375854b1bf788
Media Type: application/vnd.docker.distribution.manifest.v1+prettyjws
Created:    1d ago
Image Size: 0B
OS:         linux
Arch:       amd64
Entrypoint: /noentry
Labels:     com.coreos.ostree-commit=9df99f7dd9e11ba06ef83b006cfe256b4b9ab1e2acc30f47f5a6a4979f6d5f20
            version=410.8.20190408.1

It's not clear to me whether we need to bump the bootimages too.  Do we use CRI-O pre-pivot?  Or is it just Podman?  Is Podman also affected by the buggy containers/storage?

Comment 6 W. Trevor King 2019-04-10 05:11:15 UTC
Looped in some RHCOS folks, in case this needs shepherding through the RHCOS-release pipeline.

Comment 7 Steve Milner 2019-04-10 13:07:59 UTC
> It's not clear to me whether we need to bump the bootimages too.  Do we use CRI-O pre-pivot?

We do not (unless it's an actively used cluster being upgraded, as OpenShift would be using CRI-O).

> Or is it just Podman?

Just podman

> Is Podman also affected by the buggy containers/storage?

I will have to defer to the runtimes folks on this one.

> Looped in some RHCOS folks, in case this needs shepherding through the RHCOS-release pipeline.

Works for me. ART and RHCOS can help move the package through as quickly as the process allows once the patch is applied and a new cri-o build is available.

Comment 8 W. Trevor King 2019-04-10 13:12:23 UTC
Patched package is out (cri-o-1.13.5-1.rhaos4.1.gita9d8dde.el8, attached by Lokesh just after comment 4).  I haven't checked to see where it is in the pipeline today.

Comment 12 W. Trevor King 2019-04-10 20:10:06 UTC
$ oc image info registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
Name:       registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
Digest:     sha256:4f9b91f9ef4889c30c79373dad241706bfe13858e45d25c2bfc06434aeae8772
Media Type: application/vnd.docker.distribution.manifest.v2+json
Created:    17m ago
Image Size: 674.6MB
OS:         linux
Arch:       amd64
Entrypoint: /noentry
Labels:     com.coreos.ostree-commit=9a02acb0a24296387e64ec3cb61a33f3649679f6a2b9bfb52d125b1b3c11df95
            version=410.8.20190410.0

Bumped, thanks to Luke Meyer.  Now we wait for the periodic promotion job [1].

[1]: https://prow.svc.ci.openshift.org/?type=periodic&job=release-promote-openshift-machine-os-content-e2e-aws-*

Comment 14 W. Trevor King 2019-04-10 22:16:49 UTC
$ oc image info registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
Name:       registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
Digest:     sha256:b5eec9ad5c7ff8a0e346c8d879a3872b78fde649ccdb33a83ca269278e7ee112
Media Type: application/vnd.docker.distribution.manifest.v2+json
Created:    1h ago
Image Size: 674.5MB
OS:         linux
Arch:       amd64
Entrypoint: /noentry
Labels:     com.coreos.ostree-commit=a91489858bc239986831f3501854363924b0047f191bb43f71dc69adec0bd171
            version=410.8.20190410.0

^ new image :).  So I'm going to mark this closed, and we'll re-open if for some reason clusters launched with the new image still hit this issue.  There's also a Podman fix in the pipe [1], but I haven't noticed that biting us in CI (we run many fewer Podman containers), so I'm not going to hold this open on that score.

[1]: https://github.com/containers/libpod/pull/2890

Comment 15 Mrunal Patel 2019-04-11 20:25:05 UTC
We are still seeing the issue in CI with cri-o 1.13.5 so re-opening the bug.

Comment 16 Mrunal Patel 2019-04-11 20:36:48 UTC
*** Bug 1699125 has been marked as a duplicate of this bug. ***

Comment 17 Mrunal Patel 2019-04-11 20:49:59 UTC
https://github.com/cri-o/cri-o/pull/2249 opened.

Comment 19 W. Trevor King 2019-04-12 16:15:28 UTC
We've had a machine-os-content bump:

$ oc image info registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
Name:       registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
Digest:     sha256:1b8e6ecc5ab0c7ba3021b24a1669495e21d693d44392b3b2393c97cc11ef17f4
Media Type: application/vnd.docker.distribution.manifest.v2+json
Created:    2h ago
Image Size: 674.5MB
OS:         linux
Arch:       amd64
Entrypoint: /noentry
Labels:     com.coreos.ostree-commit=2521ff905506b377534b79e98d1c92d8d2302c99b0d20abc3ea1f9ea9d7bfd00
            version=410.8.20190412.1

But a more-recent run still has the old version:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-authentication-operator/92/pull-ci-openshift-cluster-authentication-operator-master-e2e-aws-operator/432/artifacts/e2e-aws-operator/nodes.json | grep cri-o | uniq
                    "containerRuntimeVersion": "cri-o://1.13.5-1.rhaos4.1.gita9d8dde.el8",

So maybe just waiting for promotion now [1]?  Or maybe we can get something promoted manually...

[1]: https://prow.svc.ci.openshift.org/?type=periodic&job=release-promote-openshift-machine-os-content-e2e-aws-*

Comment 20 W. Trevor King 2019-04-12 19:43:34 UTC
There was a bug with ci-operator picking up the new images.  With [1] landed, the new promotion jobs should pull in the new RHCOS.

[1]: https://github.com/openshift/ci-operator/pull/330

Comment 22 W. Trevor King 2019-04-12 21:19:36 UTC
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/22562/pull-ci-openshift-origin-master-e2e-aws-serial/4595/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "machine-os-content").from.name'
registry.svc.ci.openshift.org/ci-op-cg3d9i05/stable@sha256:d4523b097e0e75e0154e91e647b4553aeb4b6fff9d16ebc87ff212def59e5ca7
$ oc image info registry.svc.ci.openshift.org/ci-op-cg3d9i05/stable@sha256:d4523b097e0e75e0154e91e647b4553aeb4b6fff9d16ebc87ff212def59e5ca7
Name:        registry.svc.ci.openshift.org/ci-op-cg3d9i05/stable@sha256:d4523b097e0e75e0154e91e647b4553aeb4b6fff9d16ebc87ff212def59e5ca7
Media Type:  application/vnd.docker.distribution.manifest.v2+json
Created:     23m ago
Image Size:  829.7MB in 5 layers
Layers:      75.82MB sha256:c2340472a0fa0c6c0b7910b6c292f627448da15d5e3c375c61c3141f494a3268
             1.008kB sha256:6e55351c18ffebc0918b4c21c1257d28d7e311bd37009fb722cb59761b7449ed
             471B    sha256:b2d2704dda6c98b6a775e8e5566af7a92f700b542905bd94310dc18b9bc2d0b0
             7.755MB sha256:a2c10be042b935d0f2657bca65ddc5b52f542501ad20d6b29c2558352f723837
             746.1MB sha256:65e12faa4d96a577f46e778c2683d57a124e0317996cc4e2d7f06b6db6b1f4d6
OS:          linux
Arch:        amd64
Command:     /bin/bash
Environment: OPENSHIFT_BUILD_NAME=machine-os-content
             OPENSHIFT_BUILD_NAMESPACE=ci-op-cg3d9i05
             PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
             container=oci
Labels:      architecture=x86_64
             authoritative-source-url=registry.access.redhat.com
             build-date=2019-03-06T02:42:38.249442
             com.redhat.build-host=cpt-0004.osbs.prod.upshift.rdu2.redhat.com
             com.redhat.component=ubi7-container
             com.redhat.license_terms=https://www.redhat.com/licenses/eulas
             description=The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.
             distribution-scope=public
             io.k8s.description=This is the base image from which all OpenShift images inherit.
             io.k8s.display-name=OpenShift Base
             io.openshift.build.commit.author=
             io.openshift.build.commit.date=
             io.openshift.build.commit.id=
             io.openshift.build.commit.message=
             io.openshift.build.commit.ref=
             io.openshift.build.name=
             io.openshift.build.namespace=
             io.openshift.build.source-context-dir=
             io.openshift.build.source-location=
             io.openshift.tags=base rhel7
             name=ubi7
             release=73
             summary=Provides the latest release of the Red Hat Universal Base Image 7.
             url=https://access.redhat.com/containers/#/registry.access.redhat.com/ubi7/images/7.6-73
             vcs-ref=
             vcs-type=
             vcs-url=
             vendor=Red Hat, Inc.
             version=7.6

whoa.  Well, that's new :).  com.coreos.ostree-commit and RHCOS version seem to be gone from the labels though.  I guess we'll see the RHCOS version once we get far enough in for a successful run.
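
The jq lookup at the top of this comment can also be done in Python once image-references has been downloaded. A minimal sketch, with a made-up function name and a hand-written sample whose pullspec is a placeholder rather than a real image:

```python
import json


def tag_source(image_references, tag_name):
    """Return the .from.name pullspec of the first tag named tag_name, or None."""
    for tag in image_references.get("spec", {}).get("tags", []):
        if tag.get("name") == tag_name:
            return tag.get("from", {}).get("name")
    return None


# Hypothetical sample in the shape of image-references.
sample_refs = json.loads("""
{"spec": {"tags": [
  {"name": "installer",
   "from": {"name": "registry.example.com/ns/stable@sha256:aaaa"}},
  {"name": "machine-os-content",
   "from": {"name": "registry.example.com/ns/stable@sha256:bbbb"}}
]}}
""")

print(tag_source(sample_refs, "machine-os-content"))
```

The jq filter would print every matching tag; this sketch stops at the first one, which should be equivalent as long as tag names are unique.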

Comment 23 W. Trevor King 2019-04-12 22:28:03 UTC
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_service-ca-operator/40/pull-ci-openshift-service-ca-operator-master-e2e-aws-operator/3/artifacts/e2e-aws-operator/nodes.json | grep cri-o | uniq
                    "containerRuntimeVersion": "cri-o://1.13.6-1.dev.rhaos4.1.gitee2e748.el8-dev",

:D
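
To check a whole nodes.json for the fixed runtime instead of eyeballing the grep output, something like the sketch below could work. It assumes that 1.13.6 is the first fixed upstream version (taken from the Fixed In Version field of this bug) and compares only the leading x.y.z, ignoring the release suffix. The function names are made up.

```python
import re


def runtime_versions(nodes):
    """Return the sorted set of containerRuntimeVersion strings in nodes.json."""
    return sorted(
        {item["status"]["nodeInfo"]["containerRuntimeVersion"]
         for item in nodes.get("items", [])}
    )


def crio_version_tuple(runtime_version):
    """Parse 'cri-o://X.Y.Z-...' into (X, Y, Z)."""
    match = re.match(r"cri-o://(\d+)\.(\d+)\.(\d+)", runtime_version)
    if not match:
        raise ValueError("not a cri-o runtime version: %r" % runtime_version)
    return tuple(int(part) for part in match.groups())


def has_fix(runtime_version, fixed=(1, 13, 6)):
    """True if the upstream x.y.z is at least the assumed fixed version."""
    return crio_version_tuple(runtime_version) >= fixed


# Version strings copied from the command output in this bug.
for version in (
    "cri-o://1.13.4-3.rhaos4.1.git30006b3.el8",
    "cri-o://1.13.5-1.rhaos4.1.gita9d8dde.el8",
    "cri-o://1.13.6-1.dev.rhaos4.1.gitee2e748.el8-dev",
):
    print(version, has_fix(version))
```

Only the last of the three strings should report True under that assumption.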

