Bug 2052622 - [OCP 4.9] SNO cluster unable to boot after unclean shutdown
Summary: [OCP 4.9] SNO cluster unable to boot after unclean shutdown
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On: 2021202 2055019 2055049 2055244 2055272 2055318
Blocks:
 
Reported: 2022-02-09 16:53 UTC by Mario Abajo
Modified: 2022-10-21 09:00 UTC
CC: 24 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-01 18:43:52 UTC
Target Upstream Version:
Embargoed:
pehunt: needinfo-




Links:
Red Hat Knowledge Base (Solution) 6807281 (last updated 2022-03-11 11:47:28 UTC)

Description Mario Abajo 2022-02-09 16:53:59 UTC
Description of problem:
The setup consists of an SNO cluster in a disconnected network environment in which a power outage was simulated. After the hard reboot, the node was unable to boot correctly. An important detail is that the external repository containing the OpenShift cluster images was also unavailable as part of the test.
What has been observed is that after an unclean shutdown CRI-O removes the image storage; combined with the unavailability of the external repository, this makes it impossible to obtain the system's images. In addition, the system was unable to even start the CRI-O daemon, because boot was blocked on the "node-ip-configuration" service, which tries to pull an image in order to start.

Version-Release number of selected component (if applicable):
OpenShift 4.9.13-assembly.art3657

How reproducible:
Always

Steps to Reproduce:
1. Install a SNO cluster with an external repository
2. Stop/break the availability of the external repository with all the cluster images
3. Do a hard/unclean reboot of the node

Actual results:
System is unable to boot; boot waits on the "node-ip-configuration" job

Expected results:
The system recovers from the unclean reboot and the cluster starts

Additional info:

Comment 2 Peter Hunt 2022-02-10 15:10:36 UTC
This happens because CRI-O doesn't know whether images were half-pulled during the shutdown, which would corrupt the storage. Since they may have been, it wipes the images to make sure the node can bootstrap.

We could be finer grained with this check and actually verify whether the storage is corrupted, but if the node was shut down uncleanly, there is still the possibility of corruption, in which case the node won't come up.

We may need to investigate a storage configuration that protects against sudden shutdowns. It may cause performance issues due to extra syncs, so it will be opt-in.
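The boot-time wipe decision described above can be sketched roughly as follows. This is an illustration, not CRI-O's actual code; the marker path is an assumption (CRI-O tracks clean shutdowns with an internal marker file under /var/lib/crio, but the exact name may differ by version):

```shell
# Illustrative sketch of CRI-O's boot-time wipe decision, not the real code.
# The marker path below is an assumption for illustration.
marker="${CLEAN_SHUTDOWN_MARKER:-/var/lib/crio/clean.shutdown}"
if [ -f "$marker" ]; then
  echo "clean shutdown recorded: keeping image storage"
else
  # After a power loss the marker is missing, so the whole image store is
  # wiped to guard against half-pulled, possibly corrupted layers.
  echo "no clean-shutdown marker: wiping image storage"
fi
```

In the disconnected SNO scenario this is exactly the failure mode: the wipe itself succeeds, but the images can never be re-pulled because the external registry is unreachable.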

Comment 33 Keith Calligan 2022-02-23 16:50:09 UTC
I just wanted to chime in based on a POC I am working on. This may be relevant to the problem faced in this bug. I'm not sure it's the most efficient way, but it works for non-production purposes.

The customer I'm working with will be disconnected and doesn't want to have any reliance on an external DNS server.

From some testing I've performed, after rebooting a cluster (either cleanly or uncleanly, in an environment that is disconnected or has no DNS), Kubelet may not come up all the way on the nodes. This is due to the ocp-release image not being pulled (since we can't resolve or reach quay.io).

The workaround that I am testing (and that seems to work) is as follows. It requires that all images be placed in an alternate (persistent) location on each node, in a directory with a read-only attribute (we don't want CRI-O to be able to delete them). The customer doesn't care about upgrades in this case, so the images currently on the cluster are what they will keep for now.

Steps:

1.  Push a new machine config to workers/masters that updates /etc/containers/storage.conf, adding the following under the [storage.options] section:

additionalimagestores = ["/home/core/images",]

2.  While the cluster is functioning normally, copy everything from /var/lib/containers/storage to the /home/core/images directory:

cd /var/lib/containers/storage
cp -ar * /home/core/images
# Make the copies immutable and append-only so the images can't be deleted or modified
chattr -R +i +a /home/core/images

After this, a reboot of the cluster (either uncleanly or cleanly) should work.

This is a hack but seems to do what I need for now.  Hope this helps.

Comment 34 Peter Hunt 2022-03-15 19:31:27 UTC
Given we've identified both a way of disabling crio-wipe and a way of disabling pullPolicy Always, is there anything else needed here? These can serve as workarounds while the CRI-O team investigates more targeted crio-wipe behavior (only wiping when we need to, rather than always after a forced shutdown).
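The crio-wipe side of the workaround can be shipped as a MachineConfig that masks the unit. A minimal sketch, assuming the unit on the node is named crio-wipe.service (verify with `systemctl list-unit-files` first); the object name and Ignition version below are illustrative, not prescriptive:

```yaml
# Illustrative MachineConfig; the object name and Ignition version are
# assumptions -- match them to your cluster before applying.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-mask-crio-wipe
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
        - name: crio-wipe.service
          # Masking prevents the unit from ever running, so image storage
          # survives an unclean reboot.
          mask: true
```

The trade-off noted earlier still applies: with the wipe disabled, genuinely corrupted storage after a power loss can itself prevent the node from coming up.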

Comment 36 Mario Abajo 2022-03-17 07:43:50 UTC
This will be enough until the CRI-O team manages to deal with these issues. Thanks so much, Peter.

Comment 37 Mario Abajo 2022-03-17 07:58:35 UTC
Do we have a bugzilla to track the evolution of the long term fix for the cri-o issue?

Comment 38 Peter Hunt 2022-03-18 20:49:55 UTC
Not yet; I think we should track it via Jira, as it's sort of a feature request. @nalin, does a card already exist?

Comment 39 Peter Hunt 2022-04-01 18:43:52 UTC
found it! The issue is being tracked in https://issues.redhat.com/browse/RUN-1094

