Description of problem:
The setup is a SNO (single-node OpenShift) cluster in a disconnected network environment in which a power outage was simulated. After the hard reboot the node was unable to boot correctly. An important detail is that the external registry containing the OpenShift cluster images is also unavailable as part of the test. What was observed is that after an unclean shutdown CRI-O removes the image storage; combined with the unavailability of the external registry, this makes it impossible to get the system images again. The system was unable to even start the CRI-O daemon, because it hangs trying to start the "node-ip-configuration" service, which needs to pull an image in order to start.

Version-Release number of selected component (if applicable):
OpenShift 4.9.13-assembly.art3657

How reproducible:
Always

Steps to Reproduce:
1. Install a SNO cluster with an external registry
2. Stop/break the availability of the external registry holding all the cluster images
3. Do a hard/unclean reboot of the node

Actual results:
System unable to boot, waiting on job "node-ip-configuration".

Expected results:
The system recovers from the unclean reboot and the cluster starts.

Additional info:
This happens because CRI-O doesn't know whether images were half-pulled during the shutdown, which would corrupt the storage. In case they were, it wipes the images to make sure the node can bootstrap. We could be finer grained with this check and actually verify whether the storage is corrupted, but if the node is shut down uncleanly there is still a possibility of corruption, and then the node won't come up. We may need to investigate a storage configuration that protects against sudden shutdowns. It would likely cause performance issues because of the extra syncs, so it would be opt-in.
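To illustrate the behavior described above, here is a minimal sketch of the clean-shutdown check that drives the wipe decision. The sentinel path and the wipe step are assumptions for illustration; a temp directory stands in for the real storage location:

```shell
# Sketch of the clean-shutdown check (paths are assumptions; a temp dir
# stands in for /var/lib/containers/storage).
STORAGE=$(mktemp -d)
SENTINEL="$STORAGE/clean.shutdown"   # the real sentinel lives under /var/lib/crio

touch "$SENTINEL"                    # simulate an orderly shutdown dropping the sentinel

if [ -f "$SENTINEL" ]; then
    DECISION="keep"                  # storage trusted, no re-pull needed
    echo "clean shutdown: keeping image storage"
else
    DECISION="wipe"                  # storage suspect: wipe, forcing a re-pull on boot
    echo "unclean shutdown: wiping image storage"
    rm -rf "$STORAGE"
fi
```

After a hard reboot the sentinel is absent, so the wipe branch runs, and in a disconnected environment the forced re-pull then fails.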
I just wanted to chime in based on a POC I'm working on; it may be relevant to the problem in this bug. I'm not sure this is the most efficient way, but it works for non-production purposes. The customer I'm working with will be disconnected and doesn't want any reliance on an external DNS server.

From testing I've performed, after rebooting a cluster (either cleanly or uncleanly, in an environment with either no DNS or no connectivity), the kubelet may not come up all the way on the nodes. This is because the ocp-release image cannot be pulled (since we can't resolve or reach quay.io).

The workaround I'm testing (and which seems to work) is as follows. It requires putting all images in an alternate, persistent location on each node, in a directory with a read-only attribute (we don't want CRI-O to be able to delete it). The customer doesn't care about doing upgrades in this case, so the images that are on the cluster now are what they will keep for now.

Steps:

1. Push a new machine config to workers/masters for /etc/containers/storage.conf, adding:

   additionalimagestores = ["/home/core/images",]

2. While the cluster is functioning normally, copy everything from /var/lib/containers/storage to the /home/core/images directory:

   cd /var/lib/containers/storage; cp -ar * /home/core/images
   # The following makes it so the images can't be deleted or modified
   chattr -R +i +a /home/core/images

After this, a reboot of the cluster (either clean or unclean) should work. This is a hack but seems to do what I need for now. Hope this helps.
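The copy step above can be sketched end to end. This is an illustrative run against temp directories rather than the real paths (the real source and destination are /var/lib/containers/storage and /home/core/images); the file names are made up for the demonstration:

```shell
# Sketch of the workaround's copy step, run against temp dirs.
SRC=$(mktemp -d)   # stands in for /var/lib/containers/storage
DST=$(mktemp -d)   # stands in for /home/core/images (the additionalimagestore)

mkdir -p "$SRC/overlay-images"
echo layer > "$SRC/overlay-images/blob"   # fake image data for the demo

# cp -ar preserves ownership, permissions, and timestamps, as in the
# real procedure; the trailing /. copies the directory's contents.
cp -ar "$SRC"/. "$DST"

# +i (immutable) and +a (append-only) stop CRI-O from deleting the copies.
# chattr needs root and a supporting filesystem (e.g. ext4/xfs), so a
# failure is tolerated in this sketch.
chattr -R +i +a "$DST" 2>/dev/null || true
```

Note that the immutable bit also blocks legitimate updates; it has to be cleared with chattr -R -i -a before the image set can ever be refreshed.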
Given we've both identified a way of disabling crio-wipe and of disabling pullPolicy: Always, is there anything else needed here? These can serve as workarounds while the CRI-O team investigates more targeted crio-wipe behavior (only wiping when we need to, rather than always after a forced shutdown).
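For reference, the thread doesn't record the exact mechanism used to disable crio-wipe. One approach, assuming the `clean_shutdown_file` option available in recent CRI-O versions, is a configuration drop-in like the following (the drop-in path and file name are assumptions):

```toml
# /etc/crio/crio.conf.d/99-disable-wipe.conf (assumed drop-in path)
# An empty clean_shutdown_file stops CRI-O from treating a missing
# sentinel as an unclean shutdown, so image storage is not wiped on boot.
[crio]
clean_shutdown_file = ""
```

This trades off the corruption protection discussed earlier in the thread: if storage really was damaged mid-pull, the node keeps the damaged storage instead of wiping it.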
This will be enough until the CRI-O team manages to deal with these issues. Thanks so much, Peter.
Do we have a Bugzilla to track the evolution of the long-term fix for the CRI-O issue?
Not yet. I think we should track it via Jira, as it's sort of a feature request. @nalin, does a card already exist?
Found it! The issue is being tracked in https://issues.redhat.com/browse/RUN-1094