Bug 1942536 - Corrupted image preventing containers from starting
Summary: Corrupted image preventing containers from starting
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.5
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1918126 1950536
Depends On:
Blocks: 1186913 1995199
Reported: 2021-03-24 14:13 UTC by Matthew Robson
Modified: 2024-03-25 18:14 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A sudden reboot occurs while a container or image is being committed to disk.
Consequence: Container storage is corrupted, causing failures to pull images or create containers.
Fix: Detect when a node has rebooted without a corresponding sync, and clear container storage if so.
Result: The node is protected from storage corruption caused by sudden reboots.
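
The fix described in the Doc Text can be illustrated with a marker-file check. This is only a minimal sketch of the general idea, not the code from the linked pull request: the MARKER and STORAGE paths and both function names are hypothetical, chosen for illustration.

```shell
#!/bin/sh
# Hypothetical paths for illustration only; override them as needed.
MARKER="${MARKER:-/tmp/demo-node/clean.shutdown}"
STORAGE="${STORAGE:-/tmp/demo-node/storage}"

# Called at the end of an orderly shutdown: flush pending writes to disk,
# then leave a marker recording that the flush happened.
mark_clean_shutdown() {
    sync
    mkdir -p "$(dirname "$MARKER")"
    : > "$MARKER"
}

# Called at startup, before any image or container is used.
# Prints "kept" if the marker was found, "wiped" if storage was cleared.
check_on_startup() {
    if [ -e "$MARKER" ]; then
        # The previous shutdown synced: consume the marker, keep storage.
        rm -f "$MARKER"
        echo kept
    else
        # No marker: assume an unclean reboot and clear storage so that
        # images are pulled again instead of being read half-written.
        rm -rf -- "${STORAGE:?}"
        mkdir -p "$STORAGE"
        echo wiped
    fi
}
```

The marker is removed on every startup, so a later hard reboot is detected even if the one before it was clean. The trade-off is that every image has to be pulled again after an unclean reboot.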
Clone Of:
Environment:
Last Closed: 2021-07-27 22:55:12 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | cri-o cri-o pull 3999 | 0 | None | closed | crio wipe: ensure a clean shutdown | 2021-04-16 20:10:48 UTC
Red Hat Knowledge Base (Solution) | 5972661 | 0 | None | None | None | 2022-03-15 07:28:15 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:55:34 UTC

Description Matthew Robson 2021-03-24 14:13:36 UTC
Description of problem:

Pods were crash looping with:

container_linux.go:348: starting container process caused "exec: \"/bin/bash\": stat /bin/bash: no such file or directory"

container_linux.go:348: starting container process caused "exec: \"/usr/bin/cluster-network-operator\": stat /usr/bin/cluster-network-operator: no such file or directory"

Looking at the image, it appears corrupt:
# podman image inspect quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c877a9f6507be9f8979f3cc96e25c3b1b9c59e861c757fa20c57bcdc7bd99af4
Error: error parsing image data "5f996790a8d8380d4b3c47f8b19febd4f8c8c0317f47beab9364743889e5e307": readlink /var/lib/containers/storage/overlay: invalid argument


After the image was deleted, it re-pulled cleanly and the containers came up:
[root@master-03 ~]# podman image rm quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c877a9f6507be9f8979f3cc96e25c3b1b9c59e861c757fa20c57bcdc7bd99af4
Untagged: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c877a9f6507be9f8979f3cc96e25c3b1b9c59e861c757fa20c57bcdc7bd99af4
Deleted: 5f996790a8d8380d4b3c47f8b19febd4f8c8c0317f47beab9364743889e5e307
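
The manual workaround above (inspect the image, remove it if the inspect fails, let it re-pull) could be scripted roughly as follows. This is a sketch under assumptions: the function name and the PODMAN override variable are hypothetical, and only the `podman image inspect` and `podman image rm` subcommands shown in this report are used. An inspect can also fail because the image is simply absent, in which case the removal fails too and the function returns non-zero.

```shell
#!/bin/sh
# PODMAN can be overridden, for example with a stub when testing.
PODMAN="${PODMAN:-podman}"

# Removes the given image when `podman image inspect` fails on it.
# Prints "ok" if the image inspects cleanly, "removed" if it was deleted.
remove_if_uninspectable() {
    image="$1"
    if "$PODMAN" image inspect "$image" >/dev/null 2>&1; then
        echo ok
    else
        # Storage metadata could not be read; drop the image so the
        # next pod start pulls a fresh copy.
        "$PODMAN" image rm "$image" >/dev/null && echo removed
    fi
}
```

Usage on an affected node would look like `remove_if_uninspectable <image reference>`, using the digest reference reported by the failing pod.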


Version-Release number of selected component (if applicable):
RHEL 8.2 / OCP 4.5.31


How reproducible:
Once

Steps to Reproduce:
1. Unknown
2.
3.

Actual results:
The pod goes into CrashLoopBackOff, with little to indicate that a corrupted image is the cause.

Expected results:


Additional info:

Comment 10 Peter Hunt 2021-04-16 20:08:40 UTC
*** Bug 1950536 has been marked as a duplicate of this bug. ***

Comment 11 Peter Hunt 2021-04-16 20:10:49 UTC
We have a fix incoming for this in 4.8 (attached), but it will need some soak time and testing to make sure it doesn't break things (it has already broken some things in 4.8) before we backport it.

Comment 12 Peter Hunt 2021-04-16 20:13:59 UTC
*** Bug 1918126 has been marked as a duplicate of this bug. ***

Comment 14 Sunil Choudhary 2021-04-22 09:25:32 UTC
Followed the reproducer steps from https://bugzilla.redhat.com/show_bug.cgi?id=1921128#c25 by hard rebooting all nodes a couple of times.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-21-231018   True        False         3h21m   Cluster version is 4.8.0-0.nightly-2021-04-21-231018

$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-130-215.us-east-2.compute.internal   Ready    worker   3h41m   v1.21.0-rc.0+3ced7a9
ip-10-0-152-86.us-east-2.compute.internal    Ready    master   3h49m   v1.21.0-rc.0+3ced7a9
ip-10-0-178-57.us-east-2.compute.internal    Ready    worker   3h42m   v1.21.0-rc.0+3ced7a9
ip-10-0-184-90.us-east-2.compute.internal    Ready    master   3h49m   v1.21.0-rc.0+3ced7a9
ip-10-0-214-243.us-east-2.compute.internal   Ready    master   3h49m   v1.21.0-rc.0+3ced7a9
ip-10-0-221-20.us-east-2.compute.internal    Ready    worker   3h41m   v1.21.0-rc.0+3ced7a9

$ oc debug node/ip-10-0-152-86.us-east-2.compute.internal
Starting pod/ip-10-0-152-86us-east-2computeinternal-debug ...
...
sh-4.4# journalctl | grep -i "Error: readlink"
sh-4.4#
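
The empty grep result above is the pass condition. To repeat the same check on several nodes, it could be wrapped in a small helper. This is a sketch only; the function name is hypothetical and the search string is the one used in the verification above.

```shell
#!/bin/sh
# Reads log text on stdin and prints how many lines contain
# "Error: readlink" (case-insensitive, matching the grep -i above).
count_readlink_errors() {
    # grep exits non-zero when nothing matches; that is the good case
    # here, so the exit status is ignored and only the count is printed.
    grep -ci "Error: readlink" || true
}
```

On a node this would be run as `journalctl | count_readlink_errors`, where a printed count of 0 corresponds to the empty output shown above.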

Comment 17 errata-xmlrpc 2021-07-27 22:55:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

