Bug 1952187
| Summary: | Pods stuck in ImagePullBackOff with errors like rpc error: code = Unknown desc = Error committing the finished image: image with ID "SomeLongID" already exists, but uses a different top layer: that ID | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Kedar Kulkarni <kkulkarn> |
| Component: | Node | Assignee: | Peter Hunt <pehunt> |
| Node sub component: | CRI-O | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | aos-bugs, nagrawal, nalin |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 23:02:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description

Kedar Kulkarni 2021-04-21 17:36:41 UTC

Created attachment 1774123 [details]
sosreport-node2

The image being pulled (quay.io/cloud-bulldozer/backpack:latest) has a schema1 manifest with a couple of "throwaway" entries in its history. If I'm reading https://github.com/containers/image/pull/1205 correctly, the version of github.com/containers/image that CRI-O commit bc1ef35 uses changed how it applied layer blobs when saving images, in a way that affected the IDs it assigned to those layers. I think bumping CRI-O's containers/image to v5.11.1 will fix it.

4.8 PR attached. Fix merged!

Verified on 4.8.0-0.nightly-2021-04-28-090319 (full verification output follows the manifest-check sketch below):
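As an aside on the analysis above, a minimal sketch of how the schema1/"throwaway" claim could be checked, assuming skopeo and jq are available on a workstation with pull access to quay.io (this check is not part of the original report):

$ skopeo inspect --raw docker://quay.io/cloud-bulldozer/backpack:latest > manifest.json   # fetch the manifest as served by the registry
$ jq -r .schemaVersion manifest.json   # 1 indicates a Docker schema1 manifest
# Each schema1 history entry embeds a v1Compatibility JSON string; "throwaway": true marks an empty layer,
# the kind of entry whose assigned layer ID was affected by the containers/image change.
$ jq -r '.history[].v1Compatibility | fromjson | .throwaway // false' manifest.json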
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.8.0-0.nightly-2021-04-28-090319 True False 4h9m Cluster version is 4.8.0-0.nightly-2021-04-28-090319
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-142-4.us-east-2.compute.internal Ready worker 4h54m v1.21.0-rc.0+cd6d8b0
ip-10-0-150-69.us-east-2.compute.internal Ready master 5h3m v1.21.0-rc.0+cd6d8b0
ip-10-0-167-40.us-east-2.compute.internal Ready master 5h2m v1.21.0-rc.0+cd6d8b0
ip-10-0-186-203.us-east-2.compute.internal Ready worker 4h54m v1.21.0-rc.0+cd6d8b0
ip-10-0-216-205.us-east-2.compute.internal Ready master 5h5m v1.21.0-rc.0+cd6d8b0
ip-10-0-223-10.us-east-2.compute.internal Ready worker 4h54m v1.21.0-rc.0+cd6d8b0
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
postgresql-1-deploy 0/1 Completed 0 44m 10.129.2.112 ip-10-0-223-10.us-east-2.compute.internal <none> <none>
postgresql-1-nwjnm 1/1 Running 0 44m 10.129.2.113 ip-10-0-223-10.us-east-2.compute.internal <none> <none>
rails-postgresql-example-1-82xxs 1/1 Running 0 42m 10.129.2.117 ip-10-0-223-10.us-east-2.compute.internal <none> <none>
$ oc get pods
NAME READY STATUS RESTARTS AGE
postgresql-1-nwjnm 0/1 Terminating 0 47m
rails-postgresql-example-1-82xxs 1/1 Terminating 0 45m
rails-postgresql-example-1-build 0/1 Completed 0 47m
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
rails-postgresql-example-1-build 0/1 Completed 0 57m 10.129.2.111 ip-10-0-223-10.us-east-2.compute.internal <none> <none>
rails-postgresql-example-55cb7fc669-z9526 1/1 Running 0 5m34s 10.129.2.139 ip-10-0-223-10.us-east-2.compute.internal <none> <none>
$ oc new-app --docker-image=quay.io/cloud-bulldozer/backpack:latest
...
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
backpack-56c78c6-fwb4r 0/1 Completed 2 52s 10.129.2.143 ip-10-0-223-10.us-east-2.compute.internal <none> <none>
$ oc describe pod backpack-56c78c6-fwb4r
Name: backpack-56c78c6-fwb4r
Namespace: app1
Priority: 0
Node: ip-10-0-223-10.us-east-2.compute.internal/10.0.223.10
Start Time: Wed, 28 Apr 2021 20:34:36 +0530
Labels: deployment=backpack
pod-template-hash=56c78c6
Annotations: k8s.v1.cni.cncf.io/network-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"10.129.2.143"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"10.129.2.143"
],
"default": true,
"dns": {}
}]
openshift.io/generated-by: OpenShiftNewApp
openshift.io/scc: restricted
Status: Running
IP: 10.129.2.143
IPs:
IP: 10.129.2.143
Controlled By: ReplicaSet/backpack-56c78c6
Containers:
backpack:
Container ID: cri-o://b7153f37469f3b1d3b1266a0afbbd8f82de6acb83bcb6a8f341b65e361939f6b
Image: quay.io/cloud-bulldozer/backpack@sha256:fe14b444967419b1cef078f5f2f4ff3b4e6238f1e2ab775d1e0d4e9c7a9685f7
Image ID: quay.io/cloud-bulldozer/backpack@sha256:fe14b444967419b1cef078f5f2f4ff3b4e6238f1e2ab775d1e0d4e9c7a9685f7
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 28 Apr 2021 20:35:19 +0530
Finished: Wed, 28 Apr 2021 20:35:19 +0530
Ready: False
Restart Count: 2
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-bgnrs (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-bgnrs:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-bgnrs
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 65s default-scheduler Successfully assigned app1/backpack-56c78c6-fwb4r to ip-10-0-223-10.us-east-2.compute.internal
Normal AddedInterface 64s multus Add eth0 [10.129.2.143/23]
Normal Pulling 64s kubelet Pulling image "quay.io/cloud-bulldozer/backpack@sha256:fe14b444967419b1cef078f5f2f4ff3b4e6238f1e2ab775d1e0d4e9c7a9685f7"
Normal Pulled 38s kubelet Successfully pulled image "quay.io/cloud-bulldozer/backpack@sha256:fe14b444967419b1cef078f5f2f4ff3b4e6238f1e2ab775d1e0d4e9c7a9685f7" in 26.356752673s
Normal Created 23s (x3 over 37s) kubelet Created container backpack
Normal Started 23s (x3 over 37s) kubelet Started container backpack
Normal Pulled 23s (x2 over 37s) kubelet Container image "quay.io/cloud-bulldozer/backpack@sha256:fe14b444967419b1cef078f5f2f4ff3b4e6238f1e2ab775d1e0d4e9c7a9685f7" already present on machine
Warning BackOff 11s (x4 over 36s) kubelet Back-off restarting failed container
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
backpack-56c78c6-nn6c6 0/1 Completed 3 47s 10.129.2.145 ip-10-0-223-10.us-east-2.compute.internal <none> <none>
$ oc describe pod backpack-56c78c6-nn6c6
Name: backpack-56c78c6-nn6c6
Namespace: app1
Priority: 0
Node: ip-10-0-223-10.us-east-2.compute.internal/10.0.223.10
Start Time: Wed, 28 Apr 2021 20:37:46 +0530
Labels: deployment=backpack
pod-template-hash=56c78c6
Annotations: k8s.v1.cni.cncf.io/network-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"10.129.2.145"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"10.129.2.145"
],
"default": true,
"dns": {}
}]
openshift.io/generated-by: OpenShiftNewApp
openshift.io/scc: restricted
Status: Running
IP: 10.129.2.145
IPs:
IP: 10.129.2.145
Controlled By: ReplicaSet/backpack-56c78c6
Containers:
backpack:
Container ID: cri-o://a9fbdedb6887aef73213cbbbdfcc8de17e815fb0de62e9bd1d2e7aa0519197ce
Image: quay.io/cloud-bulldozer/backpack@sha256:fe14b444967419b1cef078f5f2f4ff3b4e6238f1e2ab775d1e0d4e9c7a9685f7
Image ID: quay.io/cloud-bulldozer/backpack@sha256:fe14b444967419b1cef078f5f2f4ff3b4e6238f1e2ab775d1e0d4e9c7a9685f7
Port: <none>
Host Port: <none>
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 28 Apr 2021 20:38:32 +0530
Finished: Wed, 28 Apr 2021 20:38:32 +0530
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 28 Apr 2021 20:38:03 +0530
Finished: Wed, 28 Apr 2021 20:38:03 +0530
Ready: False
Restart Count: 3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-bgnrs (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-bgnrs:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-bgnrs
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 50s default-scheduler Successfully assigned app1/backpack-56c78c6-nn6c6 to ip-10-0-223-10.us-east-2.compute.internal
Normal AddedInterface 49s multus Add eth0 [10.129.2.145/23]
Normal Pulled 5s (x4 over 49s) kubelet Container image "quay.io/cloud-bulldozer/backpack@sha256:fe14b444967419b1cef078f5f2f4ff3b4e6238f1e2ab775d1e0d4e9c7a9685f7" already present on machine
Normal Created 5s (x4 over 49s) kubelet Created container backpack
Normal Started 5s (x4 over 49s) kubelet Started container backpack
Warning BackOff 5s (x5 over 47s) kubelet Back-off restarting failed container
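As a supplementary check that is not part of the original verification, the CRI-O build running on the affected node could be confirmed from a debug shell (a sketch assuming cluster-admin access and the standard oc debug node workflow):

$ oc debug node/ip-10-0-223-10.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# crio version   # prints the CRI-O version and git commit running on the node
sh-4.4# rpm -q cri-o   # on the fixed nightly this build should include the containers/image v5.11.1 bump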
I'm seeing this issue as well in 4.6/4.7. Can this fix be backported to those versions as well?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438