Created attachment 1774122 [details]
sosreport-node1

Description of problem:
After a pod's image has been pulled and the pod deleted, new pods using the same image on the same node fail to pull it: CRI-O reports "Error committing the finished image: image with ID ... already exists, but uses a different top layer: that ID is already in use", and the pod stays in ImagePullBackOff.

Version-Release number of selected component (if applicable):
OpenShift Version: 4.8.0-0.nightly-2021-04-20-101404
CRI-O: crictl version 1.21.0

# crio -v
crio version 1.21.0-74.rhaos4.8.gitbc1ef35.el8
Version:    1.21.0-74.rhaos4.8.gitbc1ef35.el8
GoVersion:  go1.16.1
Compiler:   gc
Platform:   linux/amd64
Linkmode:   dynamic

How reproducible:
Almost consistently during my testing.

Steps to Reproduce:
1. Start a pod with some image (in my case I used quay.io/cloud-bulldozer/backpack:latest).
2. Let it run for a bit and then delete it.
3. Try to start a new pod with the same image on the same node the previous pod was running on.

Actual results:
If you describe the pod you see:

Events:
  Type     Reason          Age                 From               Message
  ----     ------          ----                ----               -------
  Normal   Scheduled       16m                 default-scheduler  Successfully assigned my-ripsaw/uperf-server-ip-10-0-128-151-0-e75a9e69-5mn7x to ip-10-0-128-151.us-east-2.compute.internal
  Normal   AddedInterface  16m                 multus             Add eth0 [10.129.82.8/23]
  Normal   Pulling         14m (x4 over 16m)   kubelet            Pulling image "quay.io/cloud-bulldozer/backpack:latest"
  Warning  Failed          14m (x4 over 16m)   kubelet            Failed to pull image "quay.io/cloud-bulldozer/backpack:latest": rpc error: code = Unknown desc = Error committing the finished image: image with ID "570f95b8bf085aa970ed6f7a68e896b6ec8365af9f75e6f8b3570fbc30d6f38e" already exists, but uses a different top layer: that ID is already in use
  Warning  Failed          14m (x4 over 16m)   kubelet            Error: ErrImagePull
  Warning  Failed          14m (x6 over 16m)   kubelet            Error: ImagePullBackOff
  Normal   BackOff         93s (x63 over 16m)  kubelet            Back-off pulling image "quay.io/cloud-bulldozer/backpack:latest"

Expected results:
Pod should not error out.

Additional info:
I was able to reproduce this issue on 2 different clusters running on AWS (installed with the IPI installer). I didn't try to reproduce the bug manually, but it was evident during my automated test execution. Specifically, I was running the test
https://github.com/cloud-bulldozer/e2e-benchmarking/blob/master/workloads/network-perf/run_pod_network_test_fromgit.sh

Workaround:
Use `oc debug node/<node_name>` and remove the image in question with `podman rmi`; within a few seconds, the pod that was stuck in ImagePullBackOff changes state to Running. (A concrete sketch of the reproduction steps and this workaround follows at the end of this report.)

Attachments: sosreports of the 2 nodes of my cluster where this error was evident.
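A minimal command-line sketch of the reproduction steps and the workaround. The pod names are illustrative, the node name is taken from the events above, and pinning the second pod via spec.nodeName is just one way to force it onto the same node:

$ oc run backpack --image=quay.io/cloud-bulldozer/backpack:latest --restart=Never
$ oc get pod backpack -o wide    # note which node it landed on
$ oc delete pod backpack
$ oc run backpack-2 --image=quay.io/cloud-bulldozer/backpack:latest --restart=Never \
    --overrides='{"spec":{"nodeName":"ip-10-0-128-151.us-east-2.compute.internal"}}'
$ oc describe pod backpack-2     # shows the ErrImagePull events above when the bug triggers

Workaround, run against the affected node (CRI-O and podman share the node's containers/storage, which is presumably why removing the image with podman clears the conflicting ID):

$ oc debug node/ip-10-0-128-151.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# podman rmi quay.io/cloud-bulldozer/backpack:latest
sh-4.4# exit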
Created attachment 1774123 [details] sosreport-node2
The image being pulled (quay.io/cloud-bulldozer/backpack:latest) has a schema1 manifest with a couple of "throwaway" entries in its history. If I'm reading https://github.com/containers/image/pull/1205 correctly, the version of github.com/containers/image that CRI-O commit bc1ef35 uses changed how it applied layer blobs when saving images, in a way that affected the IDs it assigned to those layers. I think bumping CRI-O to containers/image v5.11.1 will fix it.
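For anyone who wants to check whether a particular image matches this pattern, the raw manifest can be fetched from the registry and its history inspected. A sketch assuming skopeo and jq are available; a schemaVersion of 1 together with "throwaway": true history entries marks the case described above:

$ skopeo inspect --raw docker://quay.io/cloud-bulldozer/backpack:latest > manifest.json
$ jq '.schemaVersion' manifest.json
$ # each schema1 history entry embeds a v1Compatibility JSON string;
$ # this lists which entries carry the "throwaway" flag (null = not set)
$ jq '[.history[].v1Compatibility | fromjson | .throwaway]' manifest.json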
4.8 PR attached
fix merged!
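For anyone verifying which CRI-O build a node is actually running once the fix lands, the version string can be read from the host the same way the report did above; a sketch, with the node name as a placeholder:

$ oc debug node/<node_name> -- chroot /host crio -v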
Verified on 4.8.0-0.nightly-2021-04-28-090319

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-28-090319   True        False         4h9m    Cluster version is 4.8.0-0.nightly-2021-04-28-090319

$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-142-4.us-east-2.compute.internal     Ready    worker   4h54m   v1.21.0-rc.0+cd6d8b0
ip-10-0-150-69.us-east-2.compute.internal    Ready    master   5h3m    v1.21.0-rc.0+cd6d8b0
ip-10-0-167-40.us-east-2.compute.internal    Ready    master   5h2m    v1.21.0-rc.0+cd6d8b0
ip-10-0-186-203.us-east-2.compute.internal   Ready    worker   4h54m   v1.21.0-rc.0+cd6d8b0
ip-10-0-216-205.us-east-2.compute.internal   Ready    master   5h5m    v1.21.0-rc.0+cd6d8b0
ip-10-0-223-10.us-east-2.compute.internal    Ready    worker   4h54m   v1.21.0-rc.0+cd6d8b0

$ oc get pods -o wide
NAME                               READY   STATUS      RESTARTS   AGE   IP             NODE                                        NOMINATED NODE   READINESS GATES
postgresql-1-deploy                0/1     Completed   0          44m   10.129.2.112   ip-10-0-223-10.us-east-2.compute.internal   <none>           <none>
postgresql-1-nwjnm                 1/1     Running     0          44m   10.129.2.113   ip-10-0-223-10.us-east-2.compute.internal   <none>           <none>
rails-postgresql-example-1-82xxs   1/1     Running     0          42m   10.129.2.117   ip-10-0-223-10.us-east-2.compute.internal   <none>           <none>

$ oc get pods
NAME                               READY   STATUS        RESTARTS   AGE
postgresql-1-nwjnm                 0/1     Terminating   0          47m
rails-postgresql-example-1-82xxs   1/1     Terminating   0          45m
rails-postgresql-example-1-build   0/1     Completed     0          47m

$ oc get pods -o wide
NAME                                        READY   STATUS      RESTARTS   AGE     IP             NODE                                        NOMINATED NODE   READINESS GATES
rails-postgresql-example-1-build            0/1     Completed   0          57m     10.129.2.111   ip-10-0-223-10.us-east-2.compute.internal   <none>           <none>
rails-postgresql-example-55cb7fc669-z9526   1/1     Running     0          5m34s   10.129.2.139   ip-10-0-223-10.us-east-2.compute.internal   <none>           <none>

$ oc new-app --docker-image=quay.io/cloud-bulldozer/backpack:latest
...
$ oc get pods -o wide
NAME                     READY   STATUS      RESTARTS   AGE   IP             NODE                                        NOMINATED NODE   READINESS GATES
backpack-56c78c6-fwb4r   0/1     Completed   2          52s   10.129.2.143   ip-10-0-223-10.us-east-2.compute.internal   <none>           <none>

$ oc describe pod backpack-56c78c6-fwb4r
Name:         backpack-56c78c6-fwb4r
Namespace:    app1
Priority:     0
Node:         ip-10-0-223-10.us-east-2.compute.internal/10.0.223.10
Start Time:   Wed, 28 Apr 2021 20:34:36 +0530
Labels:       deployment=backpack
              pod-template-hash=56c78c6
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.129.2.143"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.129.2.143"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/generated-by: OpenShiftNewApp
              openshift.io/scc: restricted
Status:       Running
IP:           10.129.2.143
IPs:
  IP:  10.129.2.143
Controlled By:  ReplicaSet/backpack-56c78c6
Containers:
  backpack:
    Container ID:   cri-o://b7153f37469f3b1d3b1266a0afbbd8f82de6acb83bcb6a8f341b65e361939f6b
    Image:          quay.io/cloud-bulldozer/backpack@sha256:fe14b444967419b1cef078f5f2f4ff3b4e6238f1e2ab775d1e0d4e9c7a9685f7
    Image ID:       quay.io/cloud-bulldozer/backpack@sha256:fe14b444967419b1cef078f5f2f4ff3b4e6238f1e2ab775d1e0d4e9c7a9685f7
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 28 Apr 2021 20:35:19 +0530
      Finished:     Wed, 28 Apr 2021 20:35:19 +0530
    Ready:          False
    Restart Count:  2
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-bgnrs (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-bgnrs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-bgnrs
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Normal   Scheduled       65s                default-scheduler  Successfully assigned app1/backpack-56c78c6-fwb4r to ip-10-0-223-10.us-east-2.compute.internal
  Normal   AddedInterface  64s                multus             Add eth0 [10.129.2.143/23]
  Normal   Pulling         64s                kubelet            Pulling image "quay.io/cloud-bulldozer/backpack@sha256:fe14b444967419b1cef078f5f2f4ff3b4e6238f1e2ab775d1e0d4e9c7a9685f7"
  Normal   Pulled          38s                kubelet            Successfully pulled image "quay.io/cloud-bulldozer/backpack@sha256:fe14b444967419b1cef078f5f2f4ff3b4e6238f1e2ab775d1e0d4e9c7a9685f7" in 26.356752673s
  Normal   Created         23s (x3 over 37s)  kubelet            Created container backpack
  Normal   Started         23s (x3 over 37s)  kubelet            Started container backpack
  Normal   Pulled          23s (x2 over 37s)  kubelet            Container image "quay.io/cloud-bulldozer/backpack@sha256:fe14b444967419b1cef078f5f2f4ff3b4e6238f1e2ab775d1e0d4e9c7a9685f7" already present on machine
  Warning  BackOff         11s (x4 over 36s)  kubelet            Back-off restarting failed container

$ oc get pods -o wide
NAME                     READY   STATUS      RESTARTS   AGE   IP             NODE                                        NOMINATED NODE   READINESS GATES
backpack-56c78c6-nn6c6   0/1     Completed   3          47s   10.129.2.145   ip-10-0-223-10.us-east-2.compute.internal   <none>           <none>

$ oc describe pod backpack-56c78c6-nn6c6
Name:         backpack-56c78c6-nn6c6
Namespace:    app1
Priority:     0
Node:         ip-10-0-223-10.us-east-2.compute.internal/10.0.223.10
Start Time:   Wed, 28 Apr 2021 20:37:46 +0530
Labels:       deployment=backpack
              pod-template-hash=56c78c6
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.129.2.145"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.129.2.145"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/generated-by: OpenShiftNewApp
              openshift.io/scc: restricted
Status:       Running
IP:           10.129.2.145
IPs:
  IP:  10.129.2.145
Controlled By:  ReplicaSet/backpack-56c78c6
Containers:
  backpack:
    Container ID:   cri-o://a9fbdedb6887aef73213cbbbdfcc8de17e815fb0de62e9bd1d2e7aa0519197ce
    Image:          quay.io/cloud-bulldozer/backpack@sha256:fe14b444967419b1cef078f5f2f4ff3b4e6238f1e2ab775d1e0d4e9c7a9685f7
    Image ID:       quay.io/cloud-bulldozer/backpack@sha256:fe14b444967419b1cef078f5f2f4ff3b4e6238f1e2ab775d1e0d4e9c7a9685f7
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 28 Apr 2021 20:38:32 +0530
      Finished:     Wed, 28 Apr 2021 20:38:32 +0530
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 28 Apr 2021 20:38:03 +0530
      Finished:     Wed, 28 Apr 2021 20:38:03 +0530
    Ready:          False
    Restart Count:  3
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-bgnrs (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-bgnrs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-bgnrs
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age               From               Message
  ----     ------          ----              ----               -------
  Normal   Scheduled       50s               default-scheduler  Successfully assigned app1/backpack-56c78c6-nn6c6 to ip-10-0-223-10.us-east-2.compute.internal
  Normal   AddedInterface  49s               multus             Add eth0 [10.129.2.145/23]
  Normal   Pulled          5s (x4 over 49s)  kubelet            Container image "quay.io/cloud-bulldozer/backpack@sha256:fe14b444967419b1cef078f5f2f4ff3b4e6238f1e2ab775d1e0d4e9c7a9685f7" already present on machine
  Normal   Created         5s (x4 over 49s)  kubelet            Created container backpack
  Normal   Started         5s (x4 over 49s)  kubelet            Started container backpack
  Warning  BackOff         5s (x5 over 47s)  kubelet            Back-off restarting failed container
I'm seeing this issue as well in 4.6/4.7. Can this fix be backported to those versions as well?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438