Bug 1857224
Summary: | Following node restart, node becomes NotReady with "error creating container storage: layer not known" | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Robert Krawitz <rkrawitz> | |
Component: | Node | Assignee: | Ryan Phillips <rphillips> | |
Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | urgent | |||
Priority: | urgent | CC: | akamra, aos-bugs, dhellmann, dornelas, dyocum, fiezzi, gharden, harpatil, jhou, jokerman, jwang, mlammon, mpatel, scuppett, syangsao, vlaad, walters, weinliu, wking, yprokule | |
Version: | 4.3.z | Keywords: | Reopened, Upgrades | |
Target Milestone: | --- | |||
Target Release: | 4.5.z | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1858411 (view as bug list) | Environment: | ||
Last Closed: | 2020-07-22 12:20:42 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1858411 | |||
Bug Blocks: | 1186913 |
Description
Robert Krawitz
2020-07-15 13:36:10 UTC
*** Bug 1855049 has been marked as a duplicate of this bug. ***

*** Bug 1846486 has been marked as a duplicate of this bug. ***

Observed (in my case) on a baremetal cluster.

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this; we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z, or 4.y.z to 4.y.z+1

*** Bug 1855003 has been marked as a duplicate of this bug. ***

I'm adding a structured link to the PR Mrunal linked.
Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  Customers using 4.5.z would be impacted. Any updates to 4.5.z should be blocked. We don't know the exact percentage, but we expect most clusters to hit this with increasing likelihood as nodes are rebooted as part of upgrades or config changes.

What is the impact? Is it serious enough to warrant blocking edges?
  Container storage files get corrupted.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  Remediation is to log into the node, reset container storage, and restart CRI-O.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  Yes, this is a regression. A change in storage around syncing files caused this issue.

Fixes landed in cri-o-1.18.3-4.rhaos4.5.gitb5e3b15.el7 and cri-o-1.18.3-4.rhaos4.5.gitb5e3b15.el8. Still working on a new RHCOS/machine-os-content.

Retargeted this bug at 4.5.z, since it's tracking the 1.18 CRI-O PR. Cloned it forward to 4.6 as bug 1858411.
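The remediation described above (log into the node, reset container storage, restart CRI-O) can be sketched as a shell snippet. This is an assumption-laden illustration, not a procedure from this bug: the storage path and unit names are typical for RHCOS nodes but are not stated in the comments here.

```shell
#!/bin/sh
# Hypothetical sketch of the remediation described above. The storage path
# and systemd unit names are assumptions, not commands taken from this bug.
reset_container_storage() {
    systemctl stop kubelet crio          # stop everything using the layer store
    rm -rf /var/lib/containers/storage   # discard the corrupted container storage
    systemctl start crio kubelet         # CRI-O recreates storage on startup
}
# Not invoked automatically; run reset_container_storage as root on the
# affected node only.
echo "remediation sketch loaded"
```

The function is deliberately not called when the script is sourced, since the wipe is destructive and only appropriate on a node that is already NotReady with this error.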
45.82.202007171855-0 has the fix:

    $ curl -s https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.5/45.82.202007171855-0/x86_64/commitmeta.json | jq -c '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "cri-o")'
    ["cri-o","0","1.18.3","4.rhaos4.5.gitb5e3b15.el8","x86_64"]

and is working through its promotion gate now [1,2].

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.5/1284213807211614208
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.5/1284231688360038400

Reset the Target Release now that we depend on a 4.6 bug.

We have a nightly with the new machine-os-content [1]. Not sure if we need a fresh pass of Elliott to attach us to an errata and sweep us into ON_QA or not.

[1]: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.5.0-0.nightly/release/4.5.0-0.nightly-2020-07-17-221127 (has the fix)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2956

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
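The version check in the comment above can be reproduced offline against a saved copy of commitmeta.json. A sketch, using a stand-in JSON excerpt mirroring the pkglist shape shown above (the grep pattern is a crude jq-free substitute for the `select(.[0] == "cri-o")` filter, for hosts without jq):

```shell
#!/bin/sh
# Stand-in for the "rpmostree.rpmdb.pkglist" structure in commitmeta.json;
# in practice this file would be fetched from the URL in the comment above.
cat > /tmp/commitmeta.json <<'EOF'
{"rpmostree.rpmdb.pkglist": [["coreutils","0","8.30","6.el8","x86_64"],
["cri-o","0","1.18.3","4.rhaos4.5.gitb5e3b15.el8","x86_64"]]}
EOF
# Crude equivalent of: jq '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "cri-o")'
grep -o '\["cri-o"[^]]*\]' /tmp/commitmeta.json
```

A release release string ending in `4.rhaos4.5.gitb5e3b15.el8` (or `.el7`) indicates the build carries the fixed cri-o package.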