Bug 1740079

Summary: race/corruption: podman failed to launch containers
Product: Red Hat Enterprise Linux 8
Reporter: Michele Baldessari <michele>
Component: podman
Assignee: Brent Baude <bbaude>
Status: CLOSED ERRATA
QA Contact: atomic-bugs <atomic-bugs>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 8.0
CC: ajia, dciabrin, dornelas, dwalsh, jligon, jnovy, lsm5, mheon, pthomas, ypu
Target Milestone: rc
Keywords: ZStream
Target Release: 8.1
Flags: pm-rhel: mirror+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1741110
Environment:
Last Closed: 2019-11-05 21:02:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1727325, 1734574, 1741110    
Attachments:
Description                   Flags
strace of podman run          none
ls -lr /var/lib/containers    none
bolt db on broken node        none

Description Michele Baldessari 2019-08-12 09:15:57 UTC
Created attachment 1602796 [details]
strace of podman run

Description of problem:

After some destructive testing involving many reboots of a controller node, some of them via hard reset, podman got into a completely borked state. Namely, every podman run command claims that there is no image:
# podman run -d --net=host --name=test 192.168.24.1:8787/rhosp15/openstack-haproxy:pcmklatest
must provide image ID and image name to use an image: invalid argument


This does not normally happen, and it took a few reboots to get into this state, but once there, not a single run command works:
[root@controller-0 ~]# rpm -q podman kernel runc
podman-1.0.3-1.git9d78c0c.module+el8.0.0.z+3717+fdd07b7c.x86_64
kernel-4.18.0-80.7.1.el8_0.x86_64
runc-1.0.0-55.rc5.dev.git2abd837.module+el8.0.0+3049+59fd2bba.x86_64

Note that we also reproduced this with a testing version of runc:
runc-1.0.0-60.rc8.rhaos4.2.git3cbe540.el8.x86_64

[root@controller-0 ~]# podman image inspect 41bfdd5a7361
error parsing image data "41bfdd5a7361b1ecd6233d67bd163008cb407f9098c99fb5e625f9918b1558ef": readlink /var/lib/containers/storage/overlay/l/7G7QCIMC7D5MK7NQXQC4WXJTV7: no such file or directory

Notice the uppercase there, which seems a bit suspicious?

This seems very similar to https://github.com/code-ready/crc/issues/325 ?

I am attaching the strace from the run command, the bolt db, and the ls -lR output from /var/lib/containers.
What jumps out is that on a working node we have:
[root@controller-1 ~]# ls -l /var/lib/containers/storage/overlay/l/ |wc -l
170

Whereas on a broken node we have:
[root@controller-0 ~]# ls /var/lib/containers/storage/overlay/l/
[root@controller-0 ~]#
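
The comparison above can be turned into a quick check that lists which layers have lost their entry under overlay/l/. This is only a rough sketch, not anything podman ships: it assumes each layer directory under /var/lib/containers/storage/overlay contains a "link" file holding the short name of its symlink in l/, and the function names find_missing_links and make_fake_store are made up for illustration. The demo at the bottom runs against a throwaway directory tree, not the real storage.

```python
import os
import tempfile


def find_missing_links(overlay_root):
    """Return (layer_id, link_name) pairs whose entry is absent from l/.

    Assumption: every layer directory holds a "link" file naming its
    short symlink under <overlay_root>/l/.
    """
    link_dir = os.path.join(overlay_root, "l")
    missing = []
    for layer_id in sorted(os.listdir(overlay_root)):
        if layer_id == "l":
            continue
        link_file = os.path.join(overlay_root, layer_id, "link")
        if not os.path.isfile(link_file):
            continue  # not a layer directory in the assumed layout
        with open(link_file) as fh:
            link_name = fh.read().strip()
        # lexists() is true even for a dangling symlink, so this only
        # reports links that are gone entirely, as on the broken node.
        if not os.path.lexists(os.path.join(link_dir, link_name)):
            missing.append((layer_id, link_name))
    return missing


def make_fake_store(root, layers, create_links=True):
    """Build a throwaway tree in the assumed layout (demo helper only)."""
    os.makedirs(os.path.join(root, "l"))
    for layer_id, link_name in layers.items():
        os.makedirs(os.path.join(root, layer_id, "diff"))
        with open(os.path.join(root, layer_id, "link"), "w") as fh:
            fh.write(link_name)
        if create_links:
            os.symlink(os.path.join("..", layer_id, "diff"),
                       os.path.join(root, "l", link_name))


if __name__ == "__main__":
    layers = {"layer-a": "AAAA", "layer-b": "BBBB"}  # made-up names
    with tempfile.TemporaryDirectory() as tmp:
        healthy = os.path.join(tmp, "healthy")
        broken = os.path.join(tmp, "broken")
        make_fake_store(healthy, layers)
        make_fake_store(broken, layers, create_links=False)
        print("healthy:", find_missing_links(healthy))
        print("broken: ", find_missing_links(broken))
```

On a real node one would call find_missing_links("/var/lib/containers/storage/overlay") as root. If the layout assumption holds, a node in the state shown above should report every layer as missing its link.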

Comment 1 Michele Baldessari 2019-08-12 09:16:43 UTC
Created attachment 1602797 [details]
ls -lr /var/lib/containers

Comment 2 Michele Baldessari 2019-08-12 09:17:26 UTC
Created attachment 1602798 [details]
bolt db on broken node

Comment 20 Joy Pu 2019-09-29 08:36:35 UTC
Thanks, Michele, for your feedback. I checked the code in vendor/github.com/containers/storage/drivers/overlay/overlay.go from podman-1.4.2-5.module+el8.1.0+4240+893c1ab8.src.rpm. The patches are already included, so I am setting this to verified.

Comment 22 errata-xmlrpc 2019-11-05 21:02:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:3403