Description of problem:

Starting today, we have been seeing OpenShift CI failures which include the error mentioned in the subject:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6581/artifacts/e2e-aws/pods.json | jq -r '.items[] | select(.status.phase == "Pending") | .status.phase + " " + .metadata.name + " " + ([.status.containerStatuses[].state.waiting.message][0])'
  Pending apiserver-9gh5p
  Pending apiserver-hndrh
  Pending apiserver-rn8n5
  Pending controller-manager-2fb7b
  Pending controller-manager-f5rbk
  Pending controller-manager-g7trd
  Pending apiservice-cabundle-injector-7d8658698b-4rxvn Failed to inspect image "registry.svc.ci.openshift.org/ocp/4.0-2019-04-09-183816@sha256:630293bd328a58ec7583fb3404a69cbc9f1efff873c95cf8876c380da92d33f6": rpc error: code = Unknown desc = Manifest does not match provided manifest digest sha256:630293bd328a58ec7583fb3404a69cbc9f1efff873c95cf8876c380da92d33f6
  Pending service-serving-cert-signer-86c55f7c65-6qscm Failed to inspect image "registry.svc.ci.openshift.org/ocp/4.0-2019-04-09-183816@sha256:630293bd328a58ec7583fb3404a69cbc9f1efff873c95cf8876c380da92d33f6": rpc error: code = Unknown desc = Manifest does not match provided manifest digest sha256:630293bd328a58ec7583fb3404a69cbc9f1efff873c95cf8876c380da92d33f6
  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6581/artifacts/e2e-aws/nodes.json | jq -r '.items[].status.nodeInfo.containerRuntimeVersion' | uniq
  cri-o://1.13.4-3.rhaos4.1.git30006b3.el8

There are previous references to this error in bug 1546324 and bug 1669096 (this may be a dup of 1669096). Also in this space are [1,2], with [2] going out with CRI-O 1.13.4 today.

[1]: https://github.com/cri-o/cri-o/pull/2066
[2]: https://github.com/cri-o/cri-o/pull/2071
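For anyone without jq handy, the pods.json filter above can be approximated in Python. This is only a sketch: the function name and the sample data are hypothetical (including the "ContainerCreating" reason), and unlike the jq filter it tolerates pods with no containerStatuses and trims the trailing space when there is no waiting message.

```python
def pending_pod_summaries(pod_list):
    """Return 'phase name message' for each Pending pod in a PodList dict."""
    summaries = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        # Like the jq filter, only look at the first container status.
        container_statuses = status.get("containerStatuses") or [{}]
        waiting = container_statuses[0].get("state", {}).get("waiting") or {}
        message = waiting.get("message") or ""
        line = f'{status["phase"]} {pod["metadata"]["name"]} {message}'
        summaries.append(line.rstrip())
    return summaries


# Hypothetical, trimmed-down stand-in for pods.json (not real CI data).
sample = {
    "items": [
        {"metadata": {"name": "apiserver-9gh5p"},
         "status": {"phase": "Pending",
                    "containerStatuses": [
                        {"state": {"waiting": {"reason": "ContainerCreating"}}}]}},
        {"metadata": {"name": "service-serving-cert-signer-86c55f7c65-6qscm"},
         "status": {"phase": "Pending",
                    "containerStatuses": [
                        {"state": {"waiting": {
                            "message": "Manifest does not match provided manifest digest"}}}]}},
        {"metadata": {"name": "running-pod"}, "status": {"phase": "Running"}},
    ]
}

summaries = pending_pod_summaries(sample)
for line in summaries:
    print(line)
```

To run it against a real job, load the downloaded pods.json with json.load() and pass the resulting dict in place of the sample.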
Created attachment 1553956
Occurrences of this error in CI from 2019-04-08T22:49 to 2019-04-09T22:47 UTC

This occurred in 7 of our 395 failures (1%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours. Generated with [1]:

  $ deck-build-log-plot 'Manifest does not match provided manifest digest'
  7 Manifest does not match provided manifest digest
    1 https://github.com/openshift/origin/pull/22518 ci-op-iprtmiil
    1 https://github.com/openshift/origin/pull/22514 ci-op-7gr7m40k
    1 https://github.com/openshift/origin/pull/22505 ci-op-7zj10g1r
    1 https://github.com/openshift/machine-config-operator/pull/616 ci-op-9fi6il1y
    1 https://github.com/openshift/installer/pull/1572 ci-op-mr0mf0s5
    1 https://github.com/openshift/cluster-kube-apiserver-operator/pull/368 ci-op-l2v4b87j
    1 https://github.com/openshift/builder/pull/61 ci-op-lib66inc

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
With https://github.com/cri-o/cri-o/pull/2238, which is coming in 1.13.5, we will have the same versions of containers/image and containers/storage as in 1.12.x, which wasn't hitting these issues.
Created attachment 1553979
Better occurrences plot

With some local, ugly hacks to pull pods.json and include that in my regexp matching, this issue rises to 54 instances (13% of *-e2e-aws* failures). The oldest case I've found so far is from a cluster launched 2019-04-09T14:07Z [1]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_builder/61/pull-ci-openshift-builder-master-e2e-aws-builds/166/artifacts/e2e-aws-builds/pods.json | grep 'Manifest does not match provided manifest digest'
  "message": "Failed to inspect image \"registry.svc.ci.openshift.org/ci-op-lib66inc/stable@sha256:8eb140d803ec324f5b3b472be7ffc2a6c582923833280c9d78227ccfb65d2154\": rpc error: code = Unknown desc = Manifest does not match provided manifest digest sha256:8eb140d803ec324f5b3b472be7ffc2a6c582923833280c9d78227ccfb65d2154",
  "message": "Failed to inspect image \"registry.svc.ci.openshift.org/ci-op-lib66inc/stable@sha256:8eb140d803ec324f5b3b472be7ffc2a6c582923833280c9d78227ccfb65d2154\": rpc error: code = Unknown desc = Manifest does not match provided manifest digest sha256:8eb140d803ec324f5b3b472be7ffc2a6c582923833280c9d78227ccfb65d2154",

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_builder/61/pull-ci-openshift-builder-master-e2e-aws-builds/166
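When matching these messages across many pods.json files, it can help to pull out the image and digest rather than grepping for the fixed string. A minimal sketch follows. The regular expression is inferred from the two messages quoted in this bug, not from the CRI-O source, so treat the pattern and the function name as assumptions.

```python
import re

# Pattern inferred from the messages quoted in this bug (an assumption,
# not taken from CRI-O itself).
INSPECT_ERROR = re.compile(
    r'Failed to inspect image "(?P<image>[^"@]+)@(?P<digest>sha256:[0-9a-f]+)": '
    r"rpc error: code = Unknown desc = "
    r"Manifest does not match provided manifest digest "
    r"(?P<reported>sha256:[0-9a-f]+)"
)


def parse_inspect_error(message):
    """Return the image, requested digest and reported digest, or None."""
    match = INSPECT_ERROR.search(message)
    if match is None:
        return None
    return match.groupdict()


# Message as it appears after JSON decoding (quotes no longer escaped).
message = (
    'Failed to inspect image "registry.svc.ci.openshift.org/ci-op-lib66inc/'
    'stable@sha256:8eb140d803ec324f5b3b472be7ffc2a6c582923833280c9d78227ccfb65d2154": '
    "rpc error: code = Unknown desc = Manifest does not match provided manifest "
    "digest sha256:8eb140d803ec324f5b3b472be7ffc2a6c582923833280c9d78227ccfb65d2154"
)
parsed = parse_inspect_error(message)
print(parsed["image"], parsed["digest"] == parsed["reported"])
```

In both messages quoted so far the digest in the pull spec and the digest in the error text are identical, which is what the comparison in the last line checks.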
Not sure whether this bug covers both the short-term rollback and the long-term fix, but I'm moving it to MODIFIED now that the rollback has landed. We can always go back to ASSIGNED if we want to scope it to the "long-term fix".
So now we're waiting for this to get picked up in CI. The next ART build should come in via an 8-hour periodic on registry.svc.ci.openshift.org/rhcos/machine-os-content:latest, which promotes the machine-os-content into CI if the job passes. Currently that's still the old image:

  $ oc image info registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
  Name:        registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
  Digest:      sha256:ca664d88674d930afd6d727d0d6242668fe3e274f077abbe9d7375854b1bf788
  Media Type:  application/vnd.docker.distribution.manifest.v1+prettyjws
  Created:     1d ago
  Image Size:  0B
  OS:          linux
  Arch:        amd64
  Entrypoint:  /noentry
  Labels:      com.coreos.ostree-commit=9df99f7dd9e11ba06ef83b006cfe256b4b9ab1e2acc30f47f5a6a4979f6d5f20
               version=410.8.20190408.1

It's not clear to me whether we need to bump the bootimages too. Do we use CRI-O pre-pivot? Or is it just Podman? Is Podman also affected by the buggy containers/storage?
Looped in some RHCOS folks, in case this needs shepherding through the RHCOS-release pipeline.
> It's not clear to me whether we need to bump the bootimages too. Do we use CRI-O pre-pivot?

We do not (unless it's an actively used cluster being upgraded, as OpenShift would be using CRI-O).

> Or is it just Podman?

Just Podman.

> Is Podman also affected by the buggy containers/storage?

I will have to defer to the runtimes folks on this one.

> Looped in some RHCOS folks, in case this needs shepherding through the RHCOS-release pipeline.

Works for me. ART and RHCOS can help move the package through as quickly as the process allows once the patch is applied and a new cri-o build is available.
Patched package is out (cri-o-1.13.5-1.rhaos4.1.gita9d8dde.el8, attached by Lokesh just after comment 4). I haven't checked to see where it is in the pipeline today.
  $ oc image info registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
  Name:        registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
  Digest:      sha256:4f9b91f9ef4889c30c79373dad241706bfe13858e45d25c2bfc06434aeae8772
  Media Type:  application/vnd.docker.distribution.manifest.v2+json
  Created:     17m ago
  Image Size:  674.6MB
  OS:          linux
  Arch:        amd64
  Entrypoint:  /noentry
  Labels:      com.coreos.ostree-commit=9a02acb0a24296387e64ec3cb61a33f3649679f6a2b9bfb52d125b1b3c11df95
               version=410.8.20190410.0

Bumped, thanks to Luke Meyer. Now we wait for the periodic promotion job [1].

[1]: https://prow.svc.ci.openshift.org/?type=periodic&job=release-promote-openshift-machine-os-content-e2e-aws-*
  $ oc image info registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
  Name:        registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
  Digest:      sha256:b5eec9ad5c7ff8a0e346c8d879a3872b78fde649ccdb33a83ca269278e7ee112
  Media Type:  application/vnd.docker.distribution.manifest.v2+json
  Created:     1h ago
  Image Size:  674.5MB
  OS:          linux
  Arch:        amd64
  Entrypoint:  /noentry
  Labels:      com.coreos.ostree-commit=a91489858bc239986831f3501854363924b0047f191bb43f71dc69adec0bd171
               version=410.8.20190410.0

^ new image :). So I'm going to mark this closed, and we'll re-open if for some reason clusters launched with the new image still hit this issue. There's also a Podman fix in the pipe [1], but I haven't noticed that biting us in CI (we run many fewer Podman containers), so I'm not going to hold this open on that score.

[1]: https://github.com/containers/libpod/pull/2890
We are still seeing the issue in CI with cri-o 1.13.5, so I'm re-opening the bug.
*** Bug 1699125 has been marked as a duplicate of this bug. ***
https://github.com/cri-o/cri-o/pull/2249 opened.
We've had a machine-os-content bump:

  $ oc image info registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
  Name:        registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
  Digest:      sha256:1b8e6ecc5ab0c7ba3021b24a1669495e21d693d44392b3b2393c97cc11ef17f4
  Media Type:  application/vnd.docker.distribution.manifest.v2+json
  Created:     2h ago
  Image Size:  674.5MB
  OS:          linux
  Arch:        amd64
  Entrypoint:  /noentry
  Labels:      com.coreos.ostree-commit=2521ff905506b377534b79e98d1c92d8d2302c99b0d20abc3ea1f9ea9d7bfd00
               version=410.8.20190412.1

But a more-recent run still has the old version:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-authentication-operator/92/pull-ci-openshift-cluster-authentication-operator-master-e2e-aws-operator/432/artifacts/e2e-aws-operator/nodes.json | grep cri-o | uniq
  "containerRuntimeVersion": "cri-o://1.13.5-1.rhaos4.1.gita9d8dde.el8",

So maybe just waiting for promotion now [1]? Or maybe we can get something promoted manually...

[1]: https://prow.svc.ci.openshift.org/?type=periodic&job=release-promote-openshift-machine-os-content-e2e-aws-*
There was a bug with ci-operator picking up the new images. With [1] landed, the new promotion jobs should pull in the new RHCOS.

[1]: https://github.com/openshift/ci-operator/pull/330
Green promotions :)

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.0/116
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.1/8

Waiting to make sure that new jobs are using the new CRI-O...
  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/22562/pull-ci-openshift-origin-master-e2e-aws-serial/4595/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "machine-os-content").from.name'
  registry.svc.ci.openshift.org/ci-op-cg3d9i05/stable@sha256:d4523b097e0e75e0154e91e647b4553aeb4b6fff9d16ebc87ff212def59e5ca7
  $ oc image info registry.svc.ci.openshift.org/ci-op-cg3d9i05/stable@sha256:d4523b097e0e75e0154e91e647b4553aeb4b6fff9d16ebc87ff212def59e5ca7
  Name:        registry.svc.ci.openshift.org/ci-op-cg3d9i05/stable@sha256:d4523b097e0e75e0154e91e647b4553aeb4b6fff9d16ebc87ff212def59e5ca7
  Media Type:  application/vnd.docker.distribution.manifest.v2+json
  Created:     23m ago
  Image Size:  829.7MB in 5 layers
  Layers:      75.82MB sha256:c2340472a0fa0c6c0b7910b6c292f627448da15d5e3c375c61c3141f494a3268
               1.008kB sha256:6e55351c18ffebc0918b4c21c1257d28d7e311bd37009fb722cb59761b7449ed
               471B    sha256:b2d2704dda6c98b6a775e8e5566af7a92f700b542905bd94310dc18b9bc2d0b0
               7.755MB sha256:a2c10be042b935d0f2657bca65ddc5b52f542501ad20d6b29c2558352f723837
               746.1MB sha256:65e12faa4d96a577f46e778c2683d57a124e0317996cc4e2d7f06b6db6b1f4d6
  OS:          linux
  Arch:        amd64
  Command:     /bin/bash
  Environment: OPENSHIFT_BUILD_NAME=machine-os-content
               OPENSHIFT_BUILD_NAMESPACE=ci-op-cg3d9i05
               PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
               container=oci
  Labels:      architecture=x86_64
               authoritative-source-url=registry.access.redhat.com
               build-date=2019-03-06T02:42:38.249442
               com.redhat.build-host=cpt-0004.osbs.prod.upshift.rdu2.redhat.com
               com.redhat.component=ubi7-container
               com.redhat.license_terms=https://www.redhat.com/licenses/eulas
               description=The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.
               distribution-scope=public
               io.k8s.description=This is the base image from which all OpenShift images inherit.
               io.k8s.display-name=OpenShift Base
               io.openshift.build.commit.author=
               io.openshift.build.commit.date=
               io.openshift.build.commit.id=
               io.openshift.build.commit.message=
               io.openshift.build.commit.ref=
               io.openshift.build.name=
               io.openshift.build.namespace=
               io.openshift.build.source-context-dir=
               io.openshift.build.source-location=
               io.openshift.tags=base rhel7
               name=ubi7
               release=73
               summary=Provides the latest release of the Red Hat Universal Base Image 7.
               url=https://access.redhat.com/containers/#/registry.access.redhat.com/ubi7/images/7.6-73
               vcs-ref=
               vcs-type=
               vcs-url=
               vendor=Red Hat, Inc.
               version=7.6

Whoa. Well, that's new :). com.coreos.ostree-commit and the RHCOS version seem to be gone from the labels though. I guess we'll see the RHCOS version once we get far enough in for a successful run.
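The image-references lookup in this comment (the jq select on .spec.tags) can be sketched in Python as below. The function name is hypothetical, and the sample dict is a cut-down stand-in containing only the fields the jq filter reads.

```python
def tag_pull_spec(image_references, tag_name):
    """Return the from.name pull spec for the named tag, or None if absent."""
    for tag in image_references.get("spec", {}).get("tags", []):
        if tag.get("name") == tag_name:
            return tag.get("from", {}).get("name")
    return None


# Cut-down stand-in for the image-references file; only the fields that the
# jq filter touches are included.
image_references = {
    "spec": {
        "tags": [
            {"name": "machine-os-content",
             "from": {"name": "registry.svc.ci.openshift.org/ci-op-cg3d9i05/stable"
                              "@sha256:d4523b097e0e75e0154e91e647b4553aeb4b6fff9d16ebc87ff212def59e5ca7"}},
        ]
    }
}

pull_spec = tag_pull_spec(image_references, "machine-os-content")
print(pull_spec)
```

The returned pull spec is what gets passed to oc image info in the second command.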
  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_service-ca-operator/40/pull-ci-openshift-service-ca-operator-master-e2e-aws-operator/3/artifacts/e2e-aws-operator/nodes.json | grep cri-o | uniq
  "containerRuntimeVersion": "cri-o://1.13.6-1.dev.rhaos4.1.gitee2e748.el8-dev",

:D
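The nodes.json checks used throughout this bug can be turned into a small script that flags nodes still running an older CRI-O. A minimal sketch, with several assumptions: the function names and node names are hypothetical, only the leading major.minor.patch of the version string is compared, and the (1, 13, 6) threshold is taken from the build seen in the last comment rather than from any statement of which release carries the fix.

```python
import re


def crio_version(runtime_version):
    """Parse 'cri-o://1.13.5-1.rhaos4.1...' into (1, 13, 5); None if not CRI-O."""
    match = re.match(r"cri-o://(\d+)\.(\d+)\.(\d+)", runtime_version)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def nodes_older_than(node_list, minimum):
    """Return (node name, raw version) for CRI-O nodes below `minimum`."""
    old = []
    for node in node_list.get("items", []):
        raw = node["status"]["nodeInfo"]["containerRuntimeVersion"]
        version = crio_version(raw)
        if version is not None and version < minimum:
            old.append((node["metadata"]["name"], raw))
    return old


# Hypothetical node names; the version strings are the ones quoted in this bug.
nodes = {
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"nodeInfo": {"containerRuntimeVersion":
                                 "cri-o://1.13.5-1.rhaos4.1.gita9d8dde.el8"}}},
        {"metadata": {"name": "node-b"},
         "status": {"nodeInfo": {"containerRuntimeVersion":
                                 "cri-o://1.13.6-1.dev.rhaos4.1.gitee2e748.el8-dev"}}},
    ]
}

# Assumed threshold: the 1.13.6 build seen in the last comment.
stale = nodes_older_than(nodes, (1, 13, 6))
print(stale)
```

Comparing integer tuples avoids the string-comparison trap where "1.13.10" would sort before "1.13.6".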