Bug 1907770
Summary: | Recent RHCOS 47.83 builds (from rhcos-47.83.202012072210-0 on) don't allow master and worker nodes to boot | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | krmoser
Component: | Node | Assignee: | Qi Wang <qiwan>
Node sub component: | CRI-O | QA Contact: | Sunil Choudhary <schoudha>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | urgent | |
Priority: | unspecified | CC: | aos-bugs, bbreard, bgilbert, brueckner, cbaus, chanphil, christian.lapolt, dgilmore, Holger.Wolf, imcleod, jligon, miabbott, ndubrovs, nstielau, pehunt, psundara, sjenning, tsweeney, walters, wvoesch
Version: | 4.7 | Keywords: | TestBlocker, UpcomingSprint
Target Milestone: | --- | |
Target Release: | 4.7.0 | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
When installing an OCP 4.7 build from 2020-12-08 or later using an RHCOS 47.83 build from rhcos-47.83.202012072210-0 through rhcos-47.83.202012141910-0, an authentication failure occurred with cri-o version 1.20.0-0.rhaos4.7.gitb03c34a.el8.30:
```
Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:65f23040cd19b6eaf5d15ad9fc0526acff11acd65c9e21d97be1e010397e79d4: Error reading manifest sha256:65f23040cd19b6eaf5d15ad9fc0526acff11acd65c9e21d97be1e010397e79d4 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized
```
The $HOME directory in which the authfile is stored was not handled correctly, so image pulls were rejected as unauthorized. This has been fixed with this update.
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2021-02-24 15:43:55 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1903544, 1915617 | |
Description
krmoser
2020-12-15 07:56:16 UTC
What kind of environment is this? Bare metal or cloud? What architecture? Could you please provide the complete journal from a master or worker showing the full boot and the Ignition stage failing?

I currently have the 4.7 nightly running on s390x and I did not see this issue; the CoreOS version is one of the ones listed above.

```
[cbaus@rock-kvmlp-3 ~]$ oc get clusterversion
NAME      VERSION                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-s390x-2020-12-10-120744   True        False         3d18h   Cluster version is 4.7.0-0.nightly-s390x-2020-12-10-120744

[cbaus@rock-kvmlp-3 ~]$ oc describe node cbaus-ocp-zvzn6-master-0
. . .
System Info:
  Machine ID:                 22fa410fa0474cd69926781f6d0c935e
  System UUID:                22fa410fa0474cd69926781f6d0c935e
  Boot ID:                    dd8b10b0-3ff7-4cda-ad30-26a430d23298
  Kernel Version:             4.18.0-240.7.1.el8_3.s390x
  OS Image:                   Red Hat Enterprise Linux CoreOS 47.83.202012072210-0 (Ootpa)
  Operating System:           linux
  Architecture:               s390x
  Container Runtime Version:  cri-o://1.20.0-0.rhaos4.7.gitb03c34a.el8.30-dev
  Kubelet Version:            v1.19.2+ad738ba
  Kube-Proxy Version:         v1.19.2+ad738ba
. . .
```

I noticed this error in the log above, which points to a credentials issue with quay:

```
Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:65f23040cd19b6eaf5d15ad9fc0526acff11acd65c9e21d97be1e010397e79d4: Error reading manifest sha256:65f23040cd19b6eaf5d15ad9fc0526acff11acd65c9e21d97be1e010397e79d4 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized
```

Folks,

1. This issue is seen on multiple System z zVM LPARs.
2. The following RHCOS 47.83 build installs the bootstrap, and the masters and workers then boot successfully:
   1. rhcos-47.83.202012071550-0
3. The following RHCOS 47.83 builds successfully install the bootstrap, but the masters and workers do not boot successfully:
   1. rhcos-47.83.202012072210-0
   2. rhcos-47.83.202012101610-0
   3. rhcos-47.83.202012140410-0
   4. rhcos-47.83.202012141910-0
4. Practically all OCP 4.7 builds since December 12th (and many previous builds) install successfully when the rhcos-47.83.202012071550-0 build is installed on the bootstrap.
5. With rhcos-47.83.202012071550-0 installed on the bootstrap, the masters and workers then successfully install these RHCOS build levels as required by specific OCP 4.7 builds:
   1. rhcos-47.83.202012072210-0
   2. rhcos-47.83.202012101610-0
   3. rhcos-47.83.202012140410-0
   4. rhcos-47.83.202012141910-0
6. The issue in this defect occurs when any of the above four RHCOS builds is used to install the bootstrap node.

Thank you,
Kyle

I can confirm too that I am seeing this issue in a couple of places:

- libvirt IPI install with 47.83.202012101610-0 as the bootimage. I updated the bootimage just for a test and found it has this same issue. The default older image (47.83.202012030410-0) works fine.
- zVM bare metal platform with 47.83.202012101610-0 as the bootimage.

Something seems to have happened recently (inside RHCOS?) that is giving this unauthorized-request error. It doesn't seem arch-specific either; just to confirm, I will also try with an updated bootimage on x86.

@Micah - off the top of your head, can you think of any changes which would have caused this?

Is there anything like an HTTP proxy involved here?

In order to debug this I'd ssh to the bootstrap and try e.g. reverting podman/cri-o (if indeed they changed), or request debug-level output from them. Or, instead of ssh-to-bootstrap, start up an RHCOS node just providing your pull secret; that should be basically the same.

Another avenue is to look at the package-level diffs.
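A minimal sketch of that debugging route, assuming ssh access to the bootstrap node and a local copy of the last known-good cri-o RPM (the version strings come from this bug; the paths, RPM filename, and the diff step are illustrative):

```
# On the bootstrap node: confirm which cri-o build the boot image shipped.
rpm -q cri-o

# Package-level diff between deployments, if a second deployment exists
# (illustrative; rpm-ostree can diff booted vs. pending/rollback deployments).
rpm-ostree db diff

# Revert cri-o to the previous known-good build and reboot
# (assumes the older RPM has been copied onto the node).
sudo rpm-ostree override replace ./cri-o-1.20.0-0.rhaos4.7.gitb5f76f7.el8.29.s390x.rpm
sudo systemctl reboot

# Alternatively, watch cri-o's journal on the current build for the failing pull.
sudo journalctl -u crio -b --no-pager | grep -i unauthorized
```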
Ok, I see the same for x86 too after updating the bootimage to the latest (47.83.202012142132-0).

@Kyle, you mentioned that the build that started failing is 47.83.202012072210-0? In that build I see that cri-o has been updated: `cri-o 1.20.0-0.rhaos4.7.gitb5f76f7.el8.29 → 1.20.0-0.rhaos4.7.gitb03c34a.el8.30`. So this could be related to cri-o? Let me confirm with that build and the previous one to make sure.

Prashanth,

Thank you. Yes, we started seeing this issue with the RHCOS 47.83 rhcos-47.83.202012072210-0 build.

Thank you,
Kyle

(In reply to Colin Walters from comment #5)
> Is there anything like an HTTP proxy involved here?
>
> In order to debug this I'd ssh to the bootstrap and try e.g. reverting
> podman/cri-o (if indeed they changed), or request debug-level output from them.
>
> Or, instead of ssh-to-bootstrap, start up an RHCOS node just providing your
> pull secret; that should be basically the same.
>
> Another avenue is to look at the package-level diffs.

Thanks for this idea @walters! I tried this, and indeed downgrading cri-o worked. The cri-o version 1.20.0-0.rhaos4.7.gitb03c34a.el8.30 seems to have this issue. I downgraded to cri-o-1.20.0-0.rhaos4.7.gitb5f76f7.el8.29.s390x and things started working again. I will move this to the cri-o team for investigation.

There were only three commits between those two versions:

```
b03c34aa7dd60777cac081d9aa2bb67c7d35bab5 Merge pull request #4397 from saschagrunert/k-release-0.6.0
5e80372b7f65218a9814ba077d33224755d79281 Increase release-notes run timeout to 30m
7150db5ba2877e926675fc987fe1542d653a6a0c Bump k/release to v0.6.0
```

The only suspicious piece is that we bumped containers/image from v5.5.2 to v5.7.0, which is a change that could cause this (not sure why yet).

The thing I don't understand at all is why this bug isn't happening to all of CI; is there something e.g. special about your pull secrets? Could you perhaps have multiple secrets for quay.io and we're picking the wrong one now?

(In reply to Colin Walters from comment #10)
> The thing I don't understand at all is why this bug isn't happening to all
> of CI; is there something e.g. special about your pull secrets?
> Could you perhaps have multiple secrets for quay.io and we're picking the
> wrong one now?

The pull secret is the regular pull secret from try.openshift.com; no changes to that. The thing is, this issue is only seen on the bootstrap node. Once the masters are up, they seem to work fine and pull the images fine from quay, and they are using the same cri-o version as well. Maybe something to do with where the config.json is stored on the bootstrap vs. the masters? Not sure.
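For background on that hypothesis, here is a sketch (my illustration, not from the bug) of the well-known credential locations that cri-o and podman consult. Several of them depend on $HOME or XDG_RUNTIME_DIR, which is why the bootstrap and the masters can end up reading different files; the paths below are the conventional ones and may vary by install:

```
# Where do the container tools find registry credentials on this node?
echo "HOME=$HOME  XDG_RUNTIME_DIR=$XDG_RUNTIME_DIR"
ls -l /var/lib/kubelet/config.json                               # pull secret laid down for kubelet/cri-o
ls -l "$HOME/.docker/config.json"                                # docker-compatible per-user location
ls -l "${XDG_RUNTIME_DIR:-/run/user/$UID}/containers/auth.json"  # podman per-user default

# Verify the credentials actually work for the failing image:
podman pull --authfile /var/lib/kubelet/config.json \
  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:65f23040cd19b6eaf5d15ad9fc0526acff11acd65c9e21d97be1e010397e79d4
```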
Marking this as a test blocker for UPI installs as documented. Customers are pointed to the latest RHCOS images when they install following the docs, but our IPI installs (and many of our test scenarios) use a fixed RHCOS version (which is bumped when the release branch is cut). This has resulted in the up-to-date RHCOS image being broken for deployment purposes since December 7th, 2020. See https://coreos.slack.com/archives/CH76YSYSC/p1610037169365200 for the full context.

Folks,

1. This issue is also seen on the OCP 4.7 public mirror build, 4.7.0-fc.0, at: https://mirror.openshift.com/pub/openshift-v4/s390x/clients/ocp-dev-preview/4.7.0-fc.0/
2. This OCP 4.7 build uses the RHCOS 47.83.202012182210-0 build.

Thank you,
Kyle

Folks,

1. This issue is also seen with the RHCOS 47.83.202101041010-0 build.

Thank you,
Kyle

Folks,

1. This issue is also seen with the RHCOS 47.83.202101081310-0 build.

Thank you,
Kyle

@Kyle - https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.7-s390x&release=47.83.202101100110-0#47.83.202101100110-0 has the updated cri-o with the fix. Could you please test it?

Prashanth,

Thank you for the information. Using the RHCOS 47.83.202101100110-0 build, we have successfully tested the following OCP 4.7 builds, on both zVM using ECKD and FCP storage and on KVM using ECKD and FCP storage:

1. 4.7.0-0.nightly-s390x-2021-01-09-013834
2. 4.7.0-0.nightly-s390x-2021-01-09-083526
3. 4.7.0-0.nightly-s390x-2021-01-09-144807
4. 4.7.0-0.nightly-s390x-2021-01-09-202949
5. 4.7.0-fc.0
6. 4.7.0-fc.1
7. 4.7.0-fc.2

Thank you,
Kyle

Verified on 4.7.0-0.nightly-2021-01-19-033533 with RHCOS 47.83.202101171239-0 using a UPI install.

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-19-033533   True        False         89m     Cluster version is 4.7.0-0.nightly-2021-01-19-033533

$ oc get nodes -o wide
NAME                                        STATUS   ROLES    AGE    VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-56-217.us-east-2.compute.internal   Ready    master   114m   v1.20.0+d9c52cc   10.0.56.217   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
ip-10-0-59-181.us-east-2.compute.internal   Ready    master   114m   v1.20.0+d9c52cc   10.0.59.181   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
ip-10-0-63-227.us-east-2.compute.internal   Ready    worker   104m   v1.20.0+d9c52cc   10.0.63.227   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
ip-10-0-69-79.us-east-2.compute.internal    Ready    master   115m   v1.20.0+d9c52cc   10.0.69.79    <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
ip-10-0-70-235.us-east-2.compute.internal   Ready    worker   104m   v1.20.0+d9c52cc   10.0.70.235   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
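For anyone re-verifying on a later cluster, a quick spot-check along the same lines (my illustration; the jsonpath expression and the node-name placeholder are not from the bug):

```
# Confirm which cri-o build each node is running:
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'

# On a node, confirm the release payload image pulls with the node's pull secret:
oc debug node/<node-name> -- chroot /host \
  podman pull --authfile /var/lib/kubelet/config.json \
  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:65f23040cd19b6eaf5d15ad9fc0526acff11acd65c9e21d97be1e010397e79d4
```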