Bug 1857224 - Following node restart, node becomes NotReady with "error creating container storage: layer not known"
Summary: Following node restart, node becomes NotReady with "error creating container storage: layer not known"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.5.z
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1846486
Depends On: 1858411
Blocks: 1186913
 
Reported: 2020-07-15 13:36 UTC by Robert Krawitz
Modified: 2021-04-05 17:47 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1858411
Environment:
Last Closed: 2020-07-22 12:20:42 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
GitHub cri-o/cri-o pull 3972 (closed): [1.18] Revert "container_server: disable fdatasync() for atomic writes", last updated 2021-02-18 09:45:41 UTC
Red Hat Product Errata RHBA-2020:2956, last updated 2020-07-22 12:21:05 UTC

Description Robert Krawitz 2020-07-15 13:36:10 UTC
Description of problem:

After rebooting a node, it sometimes never transitions to the Ready state.  This may happen more frequently under load.  Typical messages are:

Jun 25 14:08:07 worker-2.ostest.test.metalkube.org podman[1424]: Error: error creating container storage: layer not known
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: nodeip-configuration.service: Main process exited, code=exited, status=125/n/a
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: nodeip-configuration.service: Failed with result 'exit-code'.
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: Failed to start Writes IP address configuration so that kubelet and crio services select a valid node IP.

The workaround is to ssh to the node, stop the crio and kubelet services, rm -rf /var/lib/containers, and restart crio and kubelet.
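
A minimal sketch of that workaround, assuming SSH access to the node and the usual systemd unit names for CRI-O and the kubelet (unit names and paths may differ per release):

  # On the affected node, as root:
  $ systemctl stop kubelet crio      # stop the services that hold container storage open
  $ rm -rf /var/lib/containers       # discard the corrupted container storage; images are re-pulled afterwards
  $ systemctl start crio kubelet     # bring the runtime and kubelet back up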


Version-Release number of selected component (if applicable): 4.5


How reproducible: Infrequent to frequent


Steps to Reproduce:
1. Have active, running node
2. Reboot it until this happens
3.

Actual results: Node stays NotReady, with the above messages


Expected results: Node reboots and becomes ready


Additional info:

Comment 1 Nir 2020-07-15 13:45:57 UTC
*** Bug 1855049 has been marked as a duplicate of this bug. ***

Comment 2 Nir 2020-07-15 13:48:50 UTC
*** Bug 1846486 has been marked as a duplicate of this bug. ***

Comment 3 Robert Krawitz 2020-07-15 13:54:03 UTC
Observed (in my case) on a baremetal cluster.

Comment 7 W. Trevor King 2020-07-17 16:17:38 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this; we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 8 Harshal Patil 2020-07-17 16:17:39 UTC
*** Bug 1855003 has been marked as a duplicate of this bug. ***

Comment 9 Mrunal Patel 2020-07-17 16:19:03 UTC
https://github.com/cri-o/cri-o/pull/3972

Comment 10 W. Trevor King 2020-07-17 16:30:38 UTC
I'm adding a structured link to the PR Mrunal linked.

Comment 11 Mrunal Patel 2020-07-17 16:30:46 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
Customers using 4.5.z would be impacted. Any updates to 4.5.z should be blocked.
We don't know the exact percentage, but we expect most clusters to hit this with increasing
likelihood as nodes get rebooted as part of upgrades or config changes.

What is the impact?  Is it serious enough to warrant blocking edges?
Container storage files get corrupted.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
Remediation requires logging into the node, resetting container storage, and restarting CRI-O.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
This is a regression: a change in container storage around file syncing (disabling fdatasync() for atomic writes) has caused this issue.

Comment 15 W. Trevor King 2020-07-17 18:54:18 UTC
Fixes landed in cri-o-1.18.3-4.rhaos4.5.gitb5e3b15.el7 and cri-o-1.18.3-4.rhaos4.5.gitb5e3b15.el8.  Still working on a new RHCOS/machine-os-content.
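
One way to confirm a rebooted node actually carries the fixed package, as a sketch (assumes cluster-admin access; <node> is a placeholder for a real node name from oc get nodes):

  $ oc debug node/<node> -- chroot /host rpm -q cri-o
  cri-o-1.18.3-4.rhaos4.5.gitb5e3b15.el8.x86_64   # expected output on RHEL 8 based nodes; .el7 on RHEL 7 workers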

Comment 20 W. Trevor King 2020-07-17 21:02:16 UTC
Retargeted this bug at 4.5.z, since it's tracking the 1.18 CRI-O PR.  Cloned it forward to 4.6 as bug 1858411.

Comment 21 W. Trevor King 2020-07-17 21:11:01 UTC
45.82.202007171855-0 has the fix:

  $ curl -s https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.5/45.82.202007171855-0/x86_64/commitmeta.json | jq -c '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "cri-o")'
  ["cri-o","0","1.18.3","4.rhaos4.5.gitb5e3b15.el8","x86_64"]

and is working through its promotion gate now [1,2].

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.5/1284213807211614208
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.5/1284231688360038400

Comment 22 W. Trevor King 2020-07-17 22:21:30 UTC
Reset the Target Release now that we depend on a 4.6 bug.  We have a nightly with the new machine-os-content [1].  Not sure if we need a fresh pass of Elliot to attach us to an errata and sweep us into ON_QE or not.

[1]: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.5.0-0.nightly/release/4.5.0-0.nightly-2020-07-17-221127 has the fix
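
For anyone double-checking which machine-os-content a given release payload carries, a rough sketch (assumes oc access to the release image; <release-pullspec> is a placeholder to replace with the nightly's actual pullspec):

  $ oc adm release info <release-pullspec> --pullspecs | grep machine-os-content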

Comment 29 Colin Walters 2020-07-18 12:52:43 UTC
My thoughts: https://github.com/containers/storage/pull/620#issuecomment-660478404

Comment 39 errata-xmlrpc 2020-07-22 12:20:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2956

Comment 40 W. Trevor King 2021-04-05 17:47:22 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

