Bug 1942536

Summary: Corrupted image preventing containers from starting
Product: OpenShift Container Platform Reporter: Matthew Robson <mrobson>
Component: Node Assignee: Peter Hunt <pehunt>
Node sub component: CRI-O QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: unspecified CC: aos-bugs, apizarro, bbaude, dornelas, dwalsh, jkaur, jligon, jnovy, lsm5, mheon, pthomas, smccarty, steven.barre, tsweeney, umohnani, vrothber
Version: 4.5   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: A sudden reboot while a container or image is being committed to disk.
Consequence: Container storage is corrupted, causing failures to pull images or create containers.
Fix: Detect when a node has rebooted without a corresponding sync and, if so, clear container storage.
Result: The node is protected from storage corruption caused by sudden reboots.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:55:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1186913, 1995199    
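
The Doc Text above describes the fix as detecting a reboot that had no corresponding sync and clearing container storage in that case. A minimal sketch of that general pattern, using a marker file, might look like the following. This is not the CRI-O implementation: the function names, the marker path, and the storage path are all hypothetical and chosen only for illustration.

```python
import os
import shutil
from pathlib import Path


def mark_clean_shutdown(marker: Path) -> None:
    """Call on orderly shutdown: flush pending writes, then leave a marker."""
    os.sync()  # ask the kernel to write dirty buffers before we claim a clean state
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()


def recover_on_startup(marker: Path, storage_root: Path) -> bool:
    """Call on startup. Returns True if storage was cleared.

    A present marker means the previous shutdown synced, so storage is kept.
    The marker is consumed so that a crash during this run is detected on the
    next start. A missing marker means the previous run ended without a sync
    (or this is the very first start), so storage is treated as untrusted and
    recreated empty.
    """
    if marker.exists():
        marker.unlink()
        return False
    if storage_root.exists():
        shutil.rmtree(storage_root)
    storage_root.mkdir(parents=True, exist_ok=True)
    return True
```

In this sketch a service would call `recover_on_startup(Path("/var/lib/example/clean.shutdown"), Path("/var/lib/example/storage"))` before touching storage and `mark_clean_shutdown(...)` from its shutdown handler (both paths are assumptions). The trade-off is that images must be re-pulled after an unclean reboot, in exchange for never starting containers from partially written layers.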

Description Matthew Robson 2021-03-24 14:13:36 UTC
Description of problem:

Pods were crash looping with:

container_linux.go:348: starting container process caused "exec: \"/bin/bash\": stat /bin/bash: no such file or directory"

container_linux.go:348: starting container process caused "exec: \"/usr/bin/cluster-network-operator\": stat /usr/bin/cluster-network-operator: no such file or directory"

Looking at the image, it appears corrupt:
# podman image inspect quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c877a9f6507be9f8979f3cc96e25c3b1b9c59e861c757fa20c57bcdc7bd99af4
Error: error parsing image data "5f996790a8d8380d4b3c47f8b19febd4f8c8c0317f47beab9364743889e5e307": readlink /var/lib/containers/storage/overlay: invalid argument


After the image was deleted, it was re-pulled successfully and the containers came up:
[root@master-03 ~]# podman image rm quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c877a9f6507be9f8979f3cc96e25c3b1b9c59e861c757fa20c57bcdc7bd99af4
Untagged: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c877a9f6507be9f8979f3cc96e25c3b1b9c59e861c757fa20c57bcdc7bd99af4
Deleted: 5f996790a8d8380d4b3c47f8b19febd4f8c8c0317f47beab9364743889e5e307
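
The manual check above (an image whose `podman image inspect` fails is suspect) can be sketched as a scan over all local images. This is only an illustration under some assumptions: the helper names `run_command` and `find_unreadable_images` are hypothetical, the scan assumes `podman images --quiet` itself still works on the affected node, and it only reports candidates rather than removing anything.

```python
import subprocess
from typing import Callable, List, Tuple

# A runner takes an argv list and returns (exit code, stdout, stderr).
Runner = Callable[[List[str]], Tuple[int, str, str]]


def run_command(argv: List[str]) -> Tuple[int, str, str]:
    """Default runner: execute the command and capture its output as text."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def find_unreadable_images(run: Runner = run_command) -> List[Tuple[str, str]]:
    """Return (image ID, error text) for every image that fails to inspect."""
    code, out, err = run(["podman", "images", "--quiet"])
    if code != 0:
        raise RuntimeError(f"listing images failed: {err.strip()}")
    broken = []
    # The same ID can be listed once per tag, so de-duplicate while keeping order.
    for image_id in dict.fromkeys(out.split()):
        code, _, err = run(["podman", "image", "inspect", image_id])
        if code != 0:
            broken.append((image_id, err.strip()))
    return broken
```

Calling `find_unreadable_images()` as root on a node would list the images that are candidates for `podman image rm` and a re-pull. The runner is injectable so the logic can be exercised without podman installed.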


Version-Release number of selected component (if applicable):
RHEL 8.2 / OCP 4.5.31


How reproducible:
Once

Steps to Reproduce:
1. Unknown
2.
3.

Actual results:
The pod goes into CrashLoopBackOff, with little indication that a corrupted image is the cause.

Expected results:


Additional info:

Comment 10 Peter Hunt 2021-04-16 20:08:40 UTC
*** Bug 1950536 has been marked as a duplicate of this bug. ***

Comment 11 Peter Hunt 2021-04-16 20:10:49 UTC
We have a fix incoming for this in 4.8 (attached), but it will require some soak time and testing to make sure it doesn't break things (it has already broken some things in 4.8) before we backport it.

Comment 12 Peter Hunt 2021-04-16 20:13:59 UTC
*** Bug 1918126 has been marked as a duplicate of this bug. ***

Comment 14 Sunil Choudhary 2021-04-22 09:25:32 UTC
Followed the reproducer steps from https://bugzilla.redhat.com/show_bug.cgi?id=1921128#c25 by hard rebooting all nodes a couple of times.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-21-231018   True        False         3h21m   Cluster version is 4.8.0-0.nightly-2021-04-21-231018

$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-130-215.us-east-2.compute.internal   Ready    worker   3h41m   v1.21.0-rc.0+3ced7a9
ip-10-0-152-86.us-east-2.compute.internal    Ready    master   3h49m   v1.21.0-rc.0+3ced7a9
ip-10-0-178-57.us-east-2.compute.internal    Ready    worker   3h42m   v1.21.0-rc.0+3ced7a9
ip-10-0-184-90.us-east-2.compute.internal    Ready    master   3h49m   v1.21.0-rc.0+3ced7a9
ip-10-0-214-243.us-east-2.compute.internal   Ready    master   3h49m   v1.21.0-rc.0+3ced7a9
ip-10-0-221-20.us-east-2.compute.internal    Ready    worker   3h41m   v1.21.0-rc.0+3ced7a9

$ oc debug node/ip-10-0-152-86.us-east-2.compute.internal
Starting pod/ip-10-0-152-86us-east-2computeinternal-debug ...
...
sh-4.4# journalctl | grep -i "Error: readlink"
sh-4.4#

Comment 17 errata-xmlrpc 2021-07-27 22:55:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438