1858411 – Following node restart, node becones NotReady with "error creating container storage: layer not known"

Bug 1858411 - Following node restart, node becones NotReady with "error creating container storage: layer not known"

Summary: Following node restart, node becones NotReady with "error creating container ...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Ryan Phillips
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1860984 (view as bug list)
Depends On:
Blocks:	1857224
TreeView+	depends on / blocked

Reported:	2020-07-17 21:01 UTC by W. Trevor King
Modified:	2021-04-05 17:47 UTC (History)
CC List:	21 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1857224
Environment:
Last Closed:	2020-10-27 16:15:58 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	cri-o cri-o pull 3975	0	None	closed	[master] Revert "container_server: disable fdatasync() for atomic writes"	2020-12-15 17:12:24 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:16:18 UTC

Description W. Trevor King 2020-07-17 21:01:05 UTC

+++ This bug was initially created as a clone of Bug #1857224 +++

Description of problem:

After rebooting a node, it sometimes never transitions to the Ready state.  This may happen more frequently under load.  Typical messages are:

Jun 25 14:08:07 worker-2.ostest.test.metalkube.org podman[1424]: Error: error creating container storage: layer not known
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: nodeip-configuration.service: Main process exited, code=exited, status=125/n/a
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: nodeip-configuration.service: Failed with result 'exit-code'.
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: Failed to start Writes IP address configuration so that kubelet and crio services select a valid node IP.

The workaround is to ssh to the node, stop the crio and kubelet services, rm -rf /var/lib/containers, and restart crio and kubelet.


Version-Release number of selected component (if applicable): 4.5


How reproducible: Infrequent to frequent


Steps to Reproduce:
1. Have active, running node
2. Reboot it until this happens
3.

Actual results:  Node stays not ready, with above messages


Expected results: Node reboots and becomes ready


Additional info:

Comment 1 Seth Jennings 2020-07-28 17:21:33 UTC

*** Bug 1860984 has been marked as a duplicate of this bug. ***

Comment 5 Dustin Black 2020-08-12 17:20:48 UTC

Given that clone BZ 1857224 is CLOSED ERRATA for 4.5.3, we would normally assume that the patch has already been brought into the 4.6 nightly builds. Can someone confirm that that is the case, and maybe this BZ can be moved into ON_QA or VERIFIED?

Comment 6 Giuseppe Scrivano 2020-08-12 19:59:25 UTC

the PR is still not merged but it is in the merge queue

Comment 7 Giuseppe Scrivano 2020-08-17 14:52:25 UTC

the PR was merged

Comment 11 Weinan Liu 2020-08-21 04:10:40 UTC

Issue not reproduced following https://bugzilla.redhat.com/show_bug.cgi?id=1857224#c37
$ oc version
Client Version: 4.5.2
Server Version: 4.6.0-0.nightly-2020-08-18-165040
Kubernetes Version: v1.19.0-rc.2+99cb93a-dirty

Comment 13 errata-xmlrpc 2020-10-27 16:15:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 14 W. Trevor King 2021-04-05 17:47:38 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

Note You need to log in before you can comment on or make changes to this bug.

aos-bugs
dblack
dhellmann
dyocum
gharden
gscrivan
harpatil
jokerman
lmcfadde
mifiedle
mkarg
mlammon
mpatel
pdsilva
rkrawitz
schoudha
scuppett
sjenning
vlaad
wking
yprokule