Bug 1907770
Summary: | Recent RHCOS 47.83 builds (from rhcos-47.83.202012072210-0 on) don't allow master and worker nodes to boot | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | krmoser
Component: | Node | Assignee: | Qi Wang <qiwan>
Node sub component: | CRI-O | QA Contact: | Sunil Choudhary <schoudha>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | urgent | |
Priority: | unspecified | CC: | aos-bugs, bbreard, bgilbert, brueckner, cbaus, chanphil, christian.lapolt, dgilmore, Holger.Wolf, imcleod, jligon, miabbott, ndubrovs, nstielau, pehunt, psundara, sjenning, tsweeney, walters, wvoesch
Version: | 4.7 | Keywords: | TestBlocker, UpcomingSprint
Target Milestone: | --- | |
Target Release: | 4.7.0 | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
When installing an OCP 4.7 build from 2020-12-08 or later using an RHCOS 47.83 build from rhcos-47.83.202012072210-0 through rhcos-47.83.202012141910-0, an authentication failure occurred with cri-o version 1.20.0-0.rhaos4.7.gitb03c34a.el8.30:
```
Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:65f23040cd19b6eaf5d15ad9fc0526acff11acd65c9e21d97be1e010397e79d4: Error reading manifest sha256:65f23040cd19b6eaf5d15ad9fc0526acff11acd65c9e21d97be1e010397e79d4 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized
```
The $HOME directory in which the authfile is stored was not handled correctly, so image pulls were rejected as unauthorized. This has been fixed with this update.
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2021-02-24 15:43:55 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1903544, 1915617 | |
Description
krmoser
2020-12-15 07:56:16 UTC
What kind of environment is this? Bare metal or cloud? What architecture? Could you please provide the complete journal from a master or worker showing the full boot and the Ignition stage failing?

I currently have the 4.7 nightly running on s390x and I did not see this issue; the CoreOS version is one of the ones listed above.

```
[cbaus@rock-kvmlp-3 ~]$ oc get clusterversion
NAME      VERSION                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-s390x-2020-12-10-120744   True        False         3d18h   Cluster version is 4.7.0-0.nightly-s390x-2020-12-10-120744

[cbaus@rock-kvmlp-3 ~]$ oc describe node cbaus-ocp-zvzn6-master-0
. . .
System Info:
  Machine ID:                 22fa410fa0474cd69926781f6d0c935e
  System UUID:                22fa410fa0474cd69926781f6d0c935e
  Boot ID:                    dd8b10b0-3ff7-4cda-ad30-26a430d23298
  Kernel Version:             4.18.0-240.7.1.el8_3.s390x
  OS Image:                   Red Hat Enterprise Linux CoreOS 47.83.202012072210-0 (Ootpa)
  Operating System:           linux
  Architecture:               s390x
  Container Runtime Version:  cri-o://1.20.0-0.rhaos4.7.gitb03c34a.el8.30-dev
  Kubelet Version:            v1.19.2+ad738ba
  Kube-Proxy Version:         v1.19.2+ad738ba
. . .
```

I noticed this error in the log above, which points to a credentials issue with quay:

```
Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:65f23040cd19b6eaf5d15ad9fc0526acff11acd65c9e21d97be1e010397e79d4: Error reading manifest sha256:65f23040cd19b6eaf5d15ad9fc0526acff11acd65c9e21d97be1e010397e79d4 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized
```

Folks,

1. This issue is seen on multiple System z zVM LPARs.
2. The following RHCOS 47.83 build installs the bootstrap, and the masters and workers then boot successfully:
   1. rhcos-47.83.202012071550-0
3. The following RHCOS 47.83 builds successfully install the bootstrap, but the masters and workers do not boot successfully:
   1. rhcos-47.83.202012072210-0
   2. rhcos-47.83.202012101610-0
   3. rhcos-47.83.202012140410-0
   4. rhcos-47.83.202012141910-0
4. Practically all OCP 4.7 builds since December 12th (and many previous builds) install successfully when the rhcos-47.83.202012071550-0 build is installed on the bootstrap.
5. With rhcos-47.83.202012071550-0 installed on the bootstrap, the masters and workers then successfully install these RHCOS build levels as required by specific OCP 4.7 builds:
   1. rhcos-47.83.202012072210-0
   2. rhcos-47.83.202012101610-0
   3. rhcos-47.83.202012140410-0
   4. rhcos-47.83.202012141910-0
6. The issue in this defect occurs when any of the above four RHCOS builds is used to install the bootstrap node.

Thank you,
Kyle

I can confirm too that I am seeing this issue in a couple of places:

- libvirt IPI install with 47.83.202012101610-0 as the bootimage. I updated the bootimage just for a test and found it has this same issue. The default older image (47.83.202012030410-0) works fine.
- zVM bare metal platform with 47.83.202012101610-0 as the bootimage.

Something seems to have happened recently (inside RHCOS?) that is giving this unauthorized-request error. It doesn't seem arch-specific either; just to confirm, I will also try with an updated bootimage on x86.

@Micah - off the top of your head, can you think of any changes which would have caused this?

Is there anything like an HTTP proxy involved here?

In order to debug this I'd ssh to the bootstrap and try e.g. reverting podman/cri-o (if indeed they changed), or request debug-level output from them. Or, instead of ssh-to-bootstrap, start up an RHCOS node just providing your pull secret; that should be basically the same.

Another avenue is to look at the package-level diffs.
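A minimal sketch of that debugging route, assuming ssh access to the bootstrap node and a local copy of the last known-good cri-o RPM (the version strings come from this bug; the paths, RPM filename, and the diff step are illustrative):

```
# On the bootstrap node: confirm which cri-o build the boot image shipped.
rpm -q cri-o

# Package-level diff between deployments, if a second deployment exists
# (illustrative; rpm-ostree can diff booted vs. pending/rollback deployments).
rpm-ostree db diff

# Revert cri-o to the previous known-good build and reboot
# (assumes the older RPM has been copied onto the node).
sudo rpm-ostree override replace ./cri-o-1.20.0-0.rhaos4.7.gitb5f76f7.el8.29.s390x.rpm
sudo systemctl reboot

# Alternatively, watch cri-o's journal on the current build for the failing pull.
sudo journalctl -u crio -b --no-pager | grep -i unauthorized
```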
Ok, I see the same for x86 too after updating the bootimage to the latest (47.83.202012142132-0).

@Kyle, you mentioned that the build that started failing is 47.83.202012072210-0? In that build I see that cri-o has been updated: `cri-o 1.20.0-0.rhaos4.7.gitb5f76f7.el8.29 → 1.20.0-0.rhaos4.7.gitb03c34a.el8.30`. So this could be related to cri-o? Let me confirm with that build and the previous one to make sure.

Prashanth,

Thank you. Yes, we started seeing this issue with the RHCOS 47.83 rhcos-47.83.202012072210-0 build.

Thank you,
Kyle

(In reply to Colin Walters from comment #5)
> Is there anything like an HTTP proxy involved here?
>
> In order to debug this I'd ssh to the bootstrap and try e.g. reverting
> podman/cri-o (if indeed they changed), or request debug-level output from them.
>
> Or, instead of ssh-to-bootstrap, start up an RHCOS node just providing your
> pull secret; that should be basically the same.
>
> Another avenue is to look at the package-level diffs.

Thanks for this idea @walters! I tried this, and indeed downgrading cri-o worked. The cri-o version 1.20.0-0.rhaos4.7.gitb03c34a.el8.30 seems to have this issue. I downgraded to cri-o-1.20.0-0.rhaos4.7.gitb5f76f7.el8.29.s390x and things started working again. I will move this to the cri-o team for investigation.

There were only three commits between those two versions:

```
b03c34aa7dd60777cac081d9aa2bb67c7d35bab5 Merge pull request #4397 from saschagrunert/k-release-0.6.0
5e80372b7f65218a9814ba077d33224755d79281 Increase release-notes run timeout to 30m
7150db5ba2877e926675fc987fe1542d653a6a0c Bump k/release to v0.6.0
```

The only suspicious piece is that we bumped containers/image from v5.5.2 to v5.7.0, which is a change that could cause this (not sure why yet).

The thing I don't understand at all is why this bug isn't happening to all of CI; is there something e.g. special about your pull secrets? Could you perhaps have multiple secrets for quay.io and we're picking the wrong one now?

(In reply to Colin Walters from comment #10)
> The thing I don't understand at all is why this bug isn't happening to all
> of CI; is there something e.g. special about your pull secrets?
> Could you perhaps have multiple secrets for quay.io and we're picking the
> wrong one now?

The pull secret is the regular pull secret from try.openshift.com; no changes to that. The thing is, this issue is only seen on the bootstrap node. Once the masters are up, they seem to work fine and pull the images fine from quay, and they are using the same cri-o version as well. Maybe something to do with where the config.json is stored on the bootstrap vs. the masters? Not sure.
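For background on that hypothesis, here is a sketch (my illustration, not from the bug) of the well-known credential locations that cri-o and podman consult. Several of them depend on $HOME or XDG_RUNTIME_DIR, which is why the bootstrap and the masters can end up reading different files; the paths below are the conventional ones and may vary by install:

```
# Where do the container tools find registry credentials on this node?
echo "HOME=$HOME  XDG_RUNTIME_DIR=$XDG_RUNTIME_DIR"
ls -l /var/lib/kubelet/config.json                               # pull secret laid down for kubelet/cri-o
ls -l "$HOME/.docker/config.json"                                # docker-compatible per-user location
ls -l "${XDG_RUNTIME_DIR:-/run/user/$UID}/containers/auth.json"  # podman per-user default

# Verify the credentials actually work for the failing image:
podman pull --authfile /var/lib/kubelet/config.json \
  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:65f23040cd19b6eaf5d15ad9fc0526acff11acd65c9e21d97be1e010397e79d4
```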
Marking this as a test blocker for UPI installs as documented. Customers are pointed to the latest RHCOS images when they install following the docs, but our IPI installs (and many of our test scenarios) use a fixed RHCOS version (which is bumped when the release branch is cut). This has resulted in the up-to-date RHCOS image being broken for deployment purposes since December 7th, 2020. See https://coreos.slack.com/archives/CH76YSYSC/p1610037169365200 for the full context.

Folks,

1. This issue is also seen on the OCP 4.7 public mirror build, 4.7.0-fc.0, at: https://mirror.openshift.com/pub/openshift-v4/s390x/clients/ocp-dev-preview/4.7.0-fc.0/
2. This OCP 4.7 build uses the RHCOS 47.83.202012182210-0 build.

Thank you,
Kyle

Folks,

1. This issue is also seen with the RHCOS 47.83.202101041010-0 build.

Thank you,
Kyle

Folks,

1. This issue is also seen with the RHCOS 47.83.202101081310-0 build.

Thank you,
Kyle

@Kyle - https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.7-s390x&release=47.83.202101100110-0#47.83.202101100110-0 has the updated cri-o with the fix. Could you please test it?

Prashanth,

Thank you for the information. Using the RHCOS 47.83.202101100110-0 build, we have successfully tested the following OCP 4.7 builds, on both zVM using ECKD and FCP storage and on KVM using ECKD and FCP storage:

1. 4.7.0-0.nightly-s390x-2021-01-09-013834
2. 4.7.0-0.nightly-s390x-2021-01-09-083526
3. 4.7.0-0.nightly-s390x-2021-01-09-144807
4. 4.7.0-0.nightly-s390x-2021-01-09-202949
5. 4.7.0-fc.0
6. 4.7.0-fc.1
7. 4.7.0-fc.2

Thank you,
Kyle

Verified on 4.7.0-0.nightly-2021-01-19-033533 with RHCOS 47.83.202101171239-0 using a UPI install.

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-19-033533   True        False         89m     Cluster version is 4.7.0-0.nightly-2021-01-19-033533

$ oc get nodes -o wide
NAME                                        STATUS   ROLES    AGE    VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-56-217.us-east-2.compute.internal   Ready    master   114m   v1.20.0+d9c52cc   10.0.56.217   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
ip-10-0-59-181.us-east-2.compute.internal   Ready    master   114m   v1.20.0+d9c52cc   10.0.59.181   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
ip-10-0-63-227.us-east-2.compute.internal   Ready    worker   104m   v1.20.0+d9c52cc   10.0.63.227   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
ip-10-0-69-79.us-east-2.compute.internal    Ready    master   115m   v1.20.0+d9c52cc   10.0.69.79    <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
ip-10-0-70-235.us-east-2.compute.internal   Ready    worker   104m   v1.20.0+d9c52cc   10.0.70.235   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
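For anyone re-verifying on a later cluster, a quick spot-check along the same lines (my illustration; the jsonpath expression and the node-name placeholder are not from the bug):

```
# Confirm which cri-o build each node is running:
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'

# On a node, confirm the release payload image pulls with the node's pull secret:
oc debug node/<node-name> -- chroot /host \
  podman pull --authfile /var/lib/kubelet/config.json \
  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:65f23040cd19b6eaf5d15ad9fc0526acff11acd65c9e21d97be1e010397e79d4
```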