
Bug 1860984

Summary:	Cluster remains degraded after shutdown and restart of master nodes with "layer not known" errors
Product:	OpenShift Container Platform	Reporter:	pdsilva
Component:	Node	Assignee:	Ryan Phillips <rphillips>
Status:	CLOSED DUPLICATE	QA Contact:	Sunil Choudhary <schoudha>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.5	CC:	aos-bugs, dwalsh, jokerman, sjenning, tsweeney
Target Milestone:	---
Target Release:	---
Hardware:	ppc64le
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-07-28 17:21:33 UTC	Type:	Bug

Description pdsilva 2020-07-27 15:28:59 UTC

Description of problem:
After all master nodes are shut down and restarted together, the nodes come back up and report Ready status. However, many pods remain in the ContainerCreating state. On the master nodes, both the crictl and podman image listings fail with "layer not known" errors.

Version-Release number of selected component (if applicable):
openshift-install version:
# openshift-install version
openshift-install 4.5.0-0.nightly-ppc64le-2020-07-17-173216
built from commit 01f5643a02f154246fab0923f8828aa9ae3b76fb
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:643716db82b13e0629a3596ddf577a0633ba1973629d51496f78a8c58fd5ba71

RHCOS version:
$ cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="45.82.202007151158-0"
VERSION_ID="4.5"
OPENSHIFT_VERSION="4.5"
RHEL_VERSION="8.2"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 45.82.202007151158-0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.5"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.5"
OSTREE_VERSION='45.82.202007151158-0'


How reproducible:
Always

Steps to Reproduce:
1. Shut down all master nodes of a healthy cluster.
2. Start all master nodes together.

Actual results:
Cluster is degraded:

# oc get clusterversion
NAME      VERSION                                     AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-ppc64le-2020-07-17-173216   True        False         9h      Error while reconciling 4.5.0-0.nightly-ppc64le-2020-07-17-173216: the cluster operator network is degraded

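Many pods are stuck in ContainerCreating (others are in CrashLoopBackOff, PodInitializing, or Init states); pod sandbox creation fails with the error "layer not known":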

# oc get pods --all-namespaces | grep -v "Running\|Completed"
NAMESPACE                                          NAME                                                         READY   STATUS              RESTARTS   AGE
nfs-provisioner                                    nfs-client-provisioner-77594c66d9-wsktl                      0/1     CrashLoopBackOff    47         10h
openshift-apiserver                                apiserver-bc47f849f-92frg                                    0/1     PodInitializing     0          10h
openshift-authentication                           oauth-openshift-64d6d54fb5-cplgr                             0/1     ContainerCreating   0          10h
openshift-cloud-credential-operator                cloud-credential-operator-7b87f78c6b-h6mj2                   0/1     ContainerCreating   16         10h
openshift-cluster-node-tuning-operator             tuned-7f5gr                                                  0/1     ContainerCreating   0          10h
openshift-config-operator                          openshift-config-operator-85f4475f8b-g8ngp                   0/1     CrashLoopBackOff    50         10h
openshift-console                                  console-657ddfc5f6-j2q95                                     0/1     ContainerCreating   1          10h
openshift-console                                  downloads-5c8c858489-bl89w                                   0/1     ContainerCreating   0          10h
openshift-controller-manager                       controller-manager-pjgtg                                     0/1     ContainerCreating   1          10h
openshift-dns                                      dns-default-655tl                                            0/3     ContainerCreating   0          10h
openshift-etcd                                     etcd-master-1.pravin-5f23.redhat.com                         0/4     Init:0/2            0          10h
openshift-image-registry                           cluster-image-registry-operator-7475cb85d4-p5zxc             0/2     ContainerCreating   0          10h
openshift-image-registry                           node-ca-bdvqd                                                0/1     ContainerCreating   0          10h
openshift-kube-apiserver                           kube-apiserver-master-1.pravin-5f23.redhat.com               0/4     Init:0/1            0          10h
openshift-kube-controller-manager                  kube-controller-manager-master-1.pravin-5f23.redhat.com      0/4     ContainerCreating   0          10h
openshift-kube-scheduler                           openshift-kube-scheduler-master-1.pravin-5f23.redhat.com     0/2     Init:0/1            0          10h
openshift-machine-api                              cluster-autoscaler-operator-8d4d96755-9z28f                  0/2     ContainerCreating   2          10h
openshift-machine-config-operator                  etcd-quorum-guard-67966885fd-hntwn                           0/1     ContainerCreating   0          10h
openshift-machine-config-operator                  machine-config-daemon-7dq5z                                  0/2     ContainerCreating   0          10h
openshift-machine-config-operator                  machine-config-server-62npn                                  0/1     ContainerCreating   0          10h
openshift-monitoring                               node-exporter-fngzr                                          0/2     PodInitializing     0          10h
openshift-monitoring                               telemeter-client-5fc8db58bd-ckvjl                            3/3     Terminating         0          4m17s
openshift-multus                                   multus-admission-controller-gdz4s                            0/2     ContainerCreating   0          10h
openshift-multus                                   multus-xvqdz                                                 0/1     PodInitializing     0          10h
openshift-operator-lifecycle-manager               packageserver-7b9c5b5dbc-2cfsx                               0/1     ContainerCreating   0          103s
openshift-operator-lifecycle-manager               packageserver-8569cbc957-tjn2j                               0/1     CrashLoopBackOff    6          48m
openshift-sdn                                      ovs-6dcxk                                                    0/1     ContainerCreating   0          10h
openshift-sdn                                      sdn-controller-8pgrj                                         0/1     ContainerCreating   0          10h
openshift-sdn                                      sdn-qphvn                                                    0/1     ContainerCreating   0          10h
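
To get a quick per-status count from a listing like the one above, a small helper along these lines may be useful. This is only a sketch: the function name summarize_pod_status is hypothetical, and it assumes STATUS is the fourth column, as it is in the --all-namespaces output shown here.

```shell
# Read `oc get pods --all-namespaces` output on stdin and print a count per
# STATUS value, skipping the header row and pods that are Running or Completed.
# Assumption: STATUS is column 4 (NAMESPACE NAME READY STATUS RESTARTS AGE).
summarize_pod_status() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { count[$4]++ }
       END { for (s in count) print count[s], s }' | sort -k1,1nr -k2,2
}

# Usage (assumes a logged-in oc session):
#   oc get pods --all-namespaces | summarize_pod_status
```

The output is sorted with the most common status first, which makes it easier to see whether ContainerCreating dominates.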


Events for pod apiserver-bc47f849f-92frg in namespace openshift-apiserver:
  Type     Reason                  Age                       From                                      Message
  ----     ------                  ----                      ----                                      -------
  Warning  FailedScheduling        <unknown>                 default-scheduler                         0/6 nodes are available: 3 node(s) didn't match node selector, 3 node(s) didn't match pod affinity/anti-affinity, 3 node(s) didn't satisfy existing pods anti-affinity rules.
  Warning  FailedScheduling        <unknown>                 default-scheduler                         0/6 nodes are available: 3 node(s) didn't match node selector, 3 node(s) didn't match pod affinity/anti-affinity, 3 node(s) didn't satisfy existing pods anti-affinity rules.
  Normal   Scheduled               <unknown>                 default-scheduler                         Successfully assigned openshift-apiserver/apiserver-bc47f849f-92frg to master-1.pravin-5f23.redhat.com
  Normal   AddedInterface          10h                       multus                                    Add eth0 [10.128.0.41/23]
  Normal   Pulled                  10h                       kubelet, master-1.pravin-5f23.redhat.com  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fec07023c1337eab29892d53f860402dea5a28698ff5e386d6f9a7d1d7cc7507" already present on machine
  Normal   Created                 10h                       kubelet, master-1.pravin-5f23.redhat.com  Created container fix-audit-permissions
  Normal   Started                 10h                       kubelet, master-1.pravin-5f23.redhat.com  Started container fix-audit-permissions
  Normal   Pulled                  10h                       kubelet, master-1.pravin-5f23.redhat.com  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fec07023c1337eab29892d53f860402dea5a28698ff5e386d6f9a7d1d7cc7507" already present on machine
  Normal   Created                 10h                       kubelet, master-1.pravin-5f23.redhat.com  Created container openshift-apiserver
  Normal   Started                 10h                       kubelet, master-1.pravin-5f23.redhat.com  Started container openshift-apiserver
  Warning  Unhealthy               10h                       kubelet, master-1.pravin-5f23.redhat.com  Readiness probe failed: HTTP probe failed with statuscode: 500
  Warning  FailedCreatePodSandBox  5m15s (x1728 over 6h30m)  kubelet, master-1.pravin-5f23.redhat.com  Failed to create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_apiserver-bc47f849f-92frg_openshift-apiserver_923f3d0f-0448-4ba9-b22d-84d451608202_0": layer not known
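
To see which pods are hitting this sandbox error, the event text can be filtered as sketched below. The function name pods_with_layer_errors is hypothetical, and the sandbox name layout k8s_<pod>_<namespace>_<uid>_<attempt> is an assumption inferred from the single FailedCreatePodSandBox event above.

```shell
# Read event text on stdin and print "namespace/pod" for every pod sandbox
# that failed to be created with "layer not known".
# Assumption: sandbox names look like k8s_<pod>_<namespace>_<uid>_<attempt>,
# as in the one event shown in this report.
pods_with_layer_errors() {
  sed -n 's/.*error creating pod sandbox with name "k8s_\([^_"]*\)_\([^_"]*\)_[^"]*": layer not known.*/\2\/\1/p' | sort -u
}

# Usage (assumes a logged-in oc session):
#   oc get events --all-namespaces | pods_with_layer_errors
```

Lines that do not match the pattern are ignored, so the helper can be fed full describe or events output without pre-filtering.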


Output on the master nodes (the image listing is truncated for brevity):

$ sudo crictl images
FATA[0000] listing images failed: rpc error: code = Unknown desc = layer not known

$ sudo podman images
ERRO[0000] error checking if image is a parent "4206ecc1c5f1a6aab2fd5eb4c907d51119d02f7c44ad3c04bea8f2a80c5183bf": layer not known
ERRO[0000] error checking if image is a parent "1a9d0d4819f76437d0612b175b99061b893c7070fdf85c25e418ca8838a39a4d": layer not known
ERRO[0000] error checking if image is a parent "62ce19450eb37c4756ff4b3906f2b1497a7c42113276596dd82724970b100ab0": layer not known
ERRO[0000] error checking if image is a parent "585e536cae5734b80918be433905ae5b4374cd009f6735860bae066b6f7b24b9": layer not known
ERRO[0000] error checking if image is a parent "3380faf56f5815f4cfe12c65a20a78a318140da907c6df591c561586dd1660ec": layer not known
ERRO[0000] error checking if image is a parent "bac4edb5f1695a5c9a9560fc4550a689b69e22cb724ce1d5a164589fa19c6ff9": layer not known
ERRO[0000] error checking if image is a parent "432d785607226705b9b2342e579891e4fe98b9f2845a327244ac50f39891307f": layer not known
ERRO[0000] error checking if image is a parent "b9d443597755d986bdb86db34a10a9a10e96916e827d31a9a1728155cbab8223": layer not known
ERRO[0000] error checking if image is a parent "54cde41669271d8a72ee3d6a533876c7bd6b616a9992df01bad9bcc76780c95e": layer not known
ERRO[0000] error checking if image is a parent "25a488bd48ebf8f7c805392b3e0037ccf51e08402f6c4e5982ac021f39c9c25d": layer not known
ERRO[0000] error checking if image is a parent "9ea5aa3fa90408e7dfb58d92d6249af71f5c8841351a1720f3cbe37b140e8484": layer not known
REPOSITORY                                       TAG      IMAGE ID       CREATED      SIZE
quay.io/openshift-release-dev/ocp-v4.0-art-dev   <none>   252093b95171   5 days ago   unable to determine size
quay.io/openshift-release-dev/ocp-v4.0-art-dev   <none>   b9d443597755   6 days ago   unable to determine size
quay.io/openshift-release-dev/ocp-v4.0-art-dev   <none>   000215b0c758   6 days ago   unable to determine size
quay.io/openshift-release-dev/ocp-v4.0-art-dev   <none>   432d78560722   6 days ago   unable to determine size
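
For triage, the image IDs named in the error lines above can be collected into a unique list. The sketch below uses a hypothetical helper name, broken_image_ids, and assumes the message format matches the podman output in this report exactly.

```shell
# Read podman output on stdin and print the unique image IDs that appear in
# 'error checking if image is a parent "<id>": layer not known' messages.
# Assumption: the error text matches the format shown in this report.
broken_image_ids() {
  sed -n 's/.*error checking if image is a parent "\([0-9a-f]*\)": layer not known.*/\1/p' | sort -u
}

# Usage on an affected node (the error lines are expected on stderr,
# hence the redirect):
#   sudo podman images 2>&1 | broken_image_ids
```

The short IDs in the table rows (for example b9d443597755 and 432d78560722) are prefixes of full IDs reported in the error lines, so the two listings can be cross-checked.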


Expected results:
All pods should be in the Running state and all cluster operators (COs) should be Available.

Additional info:

Node status:
# oc get nodes
NAME                              STATUS   ROLES    AGE   VERSION
master-0.pravin-5f23.redhat.com   Ready    master   10h   v1.18.3+b74c5ed
master-1.pravin-5f23.redhat.com   Ready    master   10h   v1.18.3+b74c5ed
master-2.pravin-5f23.redhat.com   Ready    master   10h   v1.18.3+b74c5ed
worker-0.pravin-5f23.redhat.com   Ready    worker   10h   v1.18.3+b74c5ed
worker-1.pravin-5f23.redhat.com   Ready    worker   10h   v1.18.3+b74c5ed
worker-2.pravin-5f23.redhat.com   Ready    worker   10h   v1.18.3+b74c5ed

Comment 2 Seth Jennings 2020-07-28 17:21:33 UTC


*** This bug has been marked as a duplicate of bug 1858411 ***