Bug 1315541 - fsck.ext4 discard sometimes fails when run in Koji (results in live image compose failure)
fsck.ext4 discard sometimes fails when run in Koji (results in live image com...
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: koji (Show other bugs)
24
x86_64 Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: Mike McLean
Fedora Extras Quality Assurance
abrt_hash:bf130648ecbf297f2fe9c77554e...
:
: 1318039 1330636 (view as bug list)
Depends On:
Blocks: F25AlphaBlocker
  Show dependency treegraph
 
Reported: 2016-03-07 20:27 EST by Adam Williamson
Modified: 2016-10-19 20:10 EDT (History)
26 users (show)

See Also:
Fixed In Version: koji-1.10.1-12.fc25 koji-1.10.1-12.fc24
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-25 18:24:32 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
File: anaconda-tb (1.20 MB, text/plain)
2016-03-07 20:27 EST, Adam Williamson
no flags Details
File: anaconda.log (92.90 KB, text/plain)
2016-03-07 20:27 EST, Adam Williamson
no flags Details
File: environ (556 bytes, text/plain)
2016-03-07 20:27 EST, Adam Williamson
no flags Details
File: journalctl (588.31 KB, text/plain)
2016-03-07 20:27 EST, Adam Williamson
no flags Details
File: lsblk_output (1.59 KB, text/plain)
2016-03-07 20:27 EST, Adam Williamson
no flags Details
File: lvm.log (311.56 KB, text/plain)
2016-03-07 20:27 EST, Adam Williamson
no flags Details
File: nmcli_dev_list (1.26 KB, text/plain)
2016-03-07 20:27 EST, Adam Williamson
no flags Details
File: os_info (449 bytes, text/plain)
2016-03-07 20:27 EST, Adam Williamson
no flags Details
File: program.log (47.37 KB, text/plain)
2016-03-07 20:27 EST, Adam Williamson
no flags Details
File: storage.log (132.01 KB, text/plain)
2016-03-07 20:27 EST, Adam Williamson
no flags Details
File: ifcfg.log (2.41 KB, text/plain)
2016-03-07 20:27 EST, Adam Williamson
no flags Details

  None (edit)
Description Adam Williamson 2016-03-07 20:27:03 EST
Description of problem:
Boot the F24 2016-03-07 KDE nightly, run the installer, proceed to the main hub. It crashes immediately.

Version-Release number of selected component:
anaconda-core-24.13-1.fc24.x86_64

The following was filed automatically by anaconda:
anaconda 24.13-1 exception report
Traceback (most recent call first):
  File "/usr/lib64/python3.5/site-packages/pyanaconda/packaging/livepayload.py", line 86, in setup
    raise PayloadInstallError("Failed to mount the install tree")
  File "/usr/lib64/python3.5/site-packages/pyanaconda/packaging/__init__.py", line 1285, in _runThread
    payload.setup(storage, instClass)
  File "/usr/lib64/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.5/site-packages/pyanaconda/threads.py", line 253, in run
    threading.Thread.run(self, *args, **kwargs)
pyanaconda.packaging.PayloadInstallError: Failed to mount the install tree

Additional info:
cmdline:        /usr/bin/python3  /sbin/anaconda --liveinst --method=livecd:///dev/mapper/live-base
cmdline_file:   BOOT_IMAGE=vmlinuz initrd=initrd.img root=live:CDLABEL=Fedora-KDE-Live-24-20160307-0 rd.live.image quiet
executable:     /sbin/anaconda
hashmarkername: anaconda
kernel:         4.5.0-0.rc6.git3.1.fc24.x86_64
other involved packages: system-python-libs-3.5.1-7.fc24.x86_64
product:        Fedora
release:        Fedora release 24 (Twenty Four)
type:           anaconda
version:        24

Potential duplicate: bug 1253745
Comment 1 Adam Williamson 2016-03-07 20:27:09 EST
Created attachment 1134001 [details]
File: anaconda-tb
Comment 2 Adam Williamson 2016-03-07 20:27:11 EST
Created attachment 1134002 [details]
File: anaconda.log
Comment 3 Adam Williamson 2016-03-07 20:27:12 EST
Created attachment 1134003 [details]
File: environ
Comment 4 Adam Williamson 2016-03-07 20:27:14 EST
Created attachment 1134004 [details]
File: journalctl
Comment 5 Adam Williamson 2016-03-07 20:27:16 EST
Created attachment 1134005 [details]
File: lsblk_output
Comment 6 Adam Williamson 2016-03-07 20:27:18 EST
Created attachment 1134006 [details]
File: lvm.log
Comment 7 Adam Williamson 2016-03-07 20:27:19 EST
Created attachment 1134007 [details]
File: nmcli_dev_list
Comment 8 Adam Williamson 2016-03-07 20:27:20 EST
Created attachment 1134008 [details]
File: os_info
Comment 9 Adam Williamson 2016-03-07 20:27:21 EST
Created attachment 1134009 [details]
File: program.log
Comment 10 Adam Williamson 2016-03-07 20:27:22 EST
Created attachment 1134010 [details]
File: storage.log
Comment 11 Adam Williamson 2016-03-07 20:27:24 EST
Created attachment 1134011 [details]
File: ifcfg.log
Comment 12 Adam Williamson 2016-03-07 20:29:46 EST
This may be related to Pungi 4 (livemedia-creator) image compose, I'm not sure we've actually got any other Pungi 4-created live image to this point before.

It is a clear Alpha blocker, per criterion "The installer must be able to complete an installation to a single disk using automatic partitioning." for the KDE live image (which is a release blocking image). This may well affect the Workstation live as well, except you can't run the installer on that ATM because of https://bugzilla.redhat.com/show_bug.cgi?id=1313098 .
Comment 13 Adam Williamson 2016-03-07 20:30:48 EST
cmdline is: BOOT_IMAGE=vmlinuz initrd=initrd.img root=live:CDLABEL=Fedora-KDE-Live-24-20160307-0 rd.live.image quiet
Comment 14 Adam Williamson 2016-03-08 13:58:15 EST
So today the Workstation live switched back to X, and funnily, it doesn't hit this bug. Only KDE live does. I am not sure what the difference is, off hand.
Comment 15 Chris Murphy 2016-03-08 16:39:17 EST
Right before this traceback:
Mar 08 01:25:12 localhost kernel: EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
Mar 08 01:25:12 localhost kernel: EXT4-fs (dm-1): write access unavailable, cannot proceed


Before that:
Mar 07 20:23:36 localhost systemd[1]: Mounting /sysroot...
Mar 07 20:23:36 localhost kernel: EXT4-fs (dm-0): recovery complete
Mar 07 20:23:36 localhost systemd[1]: Mounted /sysroot.
Mar 07 20:23:36 localhost kernel: EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)


And then the mapping:

'/dev/disk/by-uuid/40d0aa23-a973-4213-9615-1ed877cbac9d '
'/dev/disk/by-label/Anaconda',
'DEVNAME': '/dev/dm-1',


'/dev/disk/by-uuid/40d0aa23-a973-4213-9615-1ed877cbac9d '
'/dev/disk/by-label/Anaconda',
'DEVNAME': '/dev/dm-0',

Why is the same volume UUID associated with two dm devices, and then both are mounted? It's ultimately the same UUID being mounted twice.
Comment 16 David Shea 2016-03-08 17:30:20 EST
The problem is:

Mar 08 01:25:12 localhost kernel: EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
Mar 08 01:25:12 localhost kernel: EXT4-fs (dm-1): write access unavailable, cannot proceed


that is, /dev/mapper/live-base has "needs_recovery" set in the flags, and it can't do that because down at the bottom of all the dm layers there's a read-only ISO.

I don't suppose you have the logs from the livemedia-creator run that created this disc? Somehow the device wasn't cleanly unmounted, or, I don't know, maybe it has something to do with the cache=unsafe option that was added a little while back.
Comment 17 Brian Lane 2016-03-08 17:50:44 EST
This has been seen a few times and as far as I know isn't reproducible. I've been meaning to add some extra checks to lmc after the filesystem is done to make sure it can be mounted, I'll do that this week.

Extra logs are appreciated, there were a couple koji runs that I looked at a few weeks back where there were errors in the lmc run complaining about unmounting /dev/ being busy or /dev/shm missing when it tried to unmount it. I ran a bunch of experiments with parallel runs thinking the mocks might be getting crossed up but that didn't pan out.
Comment 18 Adam Williamson 2016-03-08 18:03:24 EST
FWIW it looks like this happened for at least the last two days of F24 KDE live x86_64 images. the 32-bit lives don't boot due to the 32-bit kernel bug, and the Rawhide KDE lives are crashing on the 'no locales' bug (which happens earlier than this bug, so they *may* also be suffering from this, we just don't know for sure).
Comment 19 Chris Murphy 2016-03-10 02:21:42 EST
Loopback mounting rootfs.img and running e2fsck on it; 
Fedora-Workstation-Live-x86_64-24-20160305.0.iso is good.
Fedora-Workstation-Live-x86_64-24-20160308.0.iso is good.
        Fedora-KDE-Live-x86_64-24-20160308.0.iso fails.

There's something consistently happening with at least KDE composes that's resulting in a corrupt ext4 root fs.
Comment 20 Brian Lane 2016-03-10 12:25:56 EST
The thing to look for in the koji logs is:

WARNING livemedia-creator: Problem zeroing free blocks of /tmp/lmc/lmc-disk-hlmwp6ys.img

In lorax-25.1-1 I've turned this into an error instead of a warning. After Alpha I'll do the same for F24.

I also still haven't seen the output from the fsck run, koji needs to be modified to capture the lmc program.log as well (dgilmore is aware of this).
Comment 21 Chris Murphy 2016-03-11 02:56:16 EST
I'm not familiar with the compose logs, I've looked through them but can't tell what e2fsprogs or kernel version the compose system itself is running. Maybe that would be in lmc program.log?

This is where I'm looking for the KDE image that has a failing rootfs.img:
http://koji.fedoraproject.org/koji/taskinfo?taskID=13271578

I'm suspicious of the e2fsck -E discard option, but usually failures are more consistent than this, which is what, 1 in 10?
Comment 22 Brian Lane 2016-03-11 12:20:50 EST
The versions used are logged here, when the mock is setup:

https://kojipkgs.fedoraproject.org//work/tasks/1578/13271578/root.log

The -E discard makes sure that deleted block in the image are zeroed so that squashfs can compress it. It *should* also fix any errors, but doesn't. There also shouldn't actually be any errors, but at the end of many of these installation runs you can see it complaining about unmounting things under /mnt/sysimage, which should never happen.
Comment 23 Chris Murphy 2016-03-11 13:24:35 EST
(In reply to Brian Lane from comment #22)
> The versions used are logged here, when the mock is setup:
> 
> https://kojipkgs.fedoraproject.org//work/tasks/1578/13271578/root.log

Ok then I'm confused, because that looks like versions for packages to be installed, rather than running. I take it the mock environment uses the same version of everything that it'll be installing? 
 
> The -E discard makes sure that deleted block in the image are zeroed so that
> squashfs can compress it. It *should* also fix any errors, but doesn't.

Yeah I'm not sure how trim gets translated through each layer: ext4 > loop > rootfs.img > ext4 > thinp?ssd?hdd? I know it's supported by loop device for a couple years, but I'm pretty sure it has to hit an SSD or thinp for the file to return zeros. Otherwise it's like trim on an HDD, the ioctl returns an error, and data is not zero'd.

> There also shouldn't actually be any errors, but at the end of many of these
> installation runs you can see it complaining about unmounting things under
> /mnt/sysimage, which should never happen.

Right. Confusing matters is there are two errors and I'm not certain if they relate.

>Mar 08 01:25:12 localhost kernel: EXT4-fs (dm-1): INFO: recovery required on readonly filesystem

This suggests unclean unmount, and would be enough to cause this bug.

>WARNING livemedia-creator: Problem zeroing free blocks of /tmp/lmc/lmc-disk-hlmwp6ys.img

This suggests a trim/discard fail, but we don't know what/how, it may only result in a rootfs.img that's bigger than it'd be if it had worked, but hasn't caused any other problem. The thing is, so far with the limited sample size thus far, the "problem zeroing" has always coincided with rootfs.img's that need journal recovery.

Kernel messages or systemd journal for the system/VM this is running on could shed light on both.
Comment 24 Adam Williamson 2016-03-14 12:27:38 EDT
FWIW the last two F24 composes (20160313.n.1 and 20160314.n.0) got working KDE lives...
Comment 25 Kamil Páral 2016-03-14 12:38:10 EDT
Discussed at today's blocker review meeting [1]. Voted as RejectedBlocker (Alpha) AcceptedFreezeException (Alpha) - this appears to be an intermittent issue, so we think in the worst case we can just keep composing till it works. We're accepting it as a freeze exception to allow an F24 lorax build with the change to make compose fail when the bug happens, so it's easy to know when we need to rebuild.

[1] https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2016-03-14/
Comment 26 Fedora Update System 2016-03-14 18:11:03 EDT
lorax-24.15-2.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-97f7839c12
Comment 27 Adam Williamson 2016-03-15 16:08:45 EDT
*** Bug 1318039 has been marked as a duplicate of this bug. ***
Comment 28 Fedora Update System 2016-03-15 17:30:57 EDT
lorax-24.15-2.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-97f7839c12
Comment 29 Fedora Update System 2016-03-15 17:43:11 EDT
lorax-24.15-3.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-97f7839c12
Comment 30 Fedora Update System 2016-03-18 11:00:38 EDT
lorax-24.15-3.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-97f7839c12
Comment 31 Steven Usdansky 2016-03-19 09:58:31 EDT
Similar problem has been detected:

dd'd latest F24 alpha Mate to flash drive
Booted from flash drive
connected to wireless network
set timezone
updated anaconda (2nd time I tried it, first time I didn't but result was the same)
clicked desktop icon to  start installation

cmdline:        /usr/bin/python3  /sbin/anaconda --liveinst --method=livecd:///dev/mapper/live-base
cmdline_file:   BOOT_IMAGE=vmlinuz initrd=initrd.img root=live:CDLABEL=Fedora-MATE-Live-24_A-1 rd.live.image rd.live.check quiet
hashmarkername: anaconda
kernel:         4.5.0-0.rc7.git0.2.fc24.x86_64
other involved packages: system-python-libs-3.5.1-7.fc24.x86_64
package:        anaconda-core-24.13.1-1.fc24.x86_64
packaging.log:  
product:        Fedora
reason:         pyanaconda.packaging.PayloadInstallError: Failed to mount the install tree
release:        Fedora release 24 (Twenty Four)
version:        24
Comment 32 Fedora Update System 2016-03-23 15:59:28 EDT
lorax-24.15-3.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.
Comment 33 Adam Williamson 2016-03-28 11:15:16 EDT
I don't think there's any more info I can provide at this point. The updated lorax is now stable, meaning that when this bug occurs during a live compose, the compose fails (rather than completing and producing a non-installable image), but we are still looking for an actual fix.
Comment 34 David Shea 2016-03-31 15:07:00 EDT
Probably don't need that needinfo flag anymore.
Comment 35 Adam Williamson 2016-04-25 19:57:59 EDT
Updating summary. In case it tells anyone anything, this *seems* to happen a lot to the SoaS image, though I haven't done comprehensive statistical analysis or anything. It succeeded on 20160422 but failed each of the last three days, for instance. Could how full the root partition winds up after installation be related, for e.g.?
Comment 36 Chris Murphy 2016-04-26 01:43:49 EDT
I see i686 and x86_64 get hit by this as well as non-soas, for example these two x86_64 recently:
Fedora-Workstation-Live-24-20160425.n.0 succeeded
Fedora-Workstation-Live-24-20160424.n.0 failed

Consistently it's failure to umount /mnt/sysimage/dev. The program.log [1] lists 'lsof /mnt/sysimage/dev' as if to help us find the problem, but I can't make sense of the resulting list but have nothing to compare it to since lsof isn't run on the successful compose.

At least as weird to me is, the succeeding compose always fails to umount /mnt/sysimage/dev/shm, but that's because it's already no longer mounted. Why? The logs show it being mounted, and then later umount failing because it's not mounted; perhaps as a consequence there's no umount  /mnt/sysimage/dev listed in the log. 

There may be more than one bug here... and one is helping to mask the other.

I think the e2fsck -E discard failure is not the real problem, it's just a much later symptom of failed umount which happens almost a minute before e2fsck.



[1]
https://kojipkgs.fedoraproject.org//work/tasks/5149/13785149/program.log
Comment 37 Adam Williamson 2016-04-26 02:04:39 EDT
"Consistently it's failure to umount /mnt/sysimage/dev"

No, that is in fact unrelated. We know precisely what the problem is, what we don't know is why it happens or how to fix it.
Comment 38 Adam Williamson 2016-04-26 02:06:03 EDT
This is pretty trivial to demonstrate, btw: just look at a successful compose, e.g. https://kojipkgs.fedoraproject.org//work/tasks/5110/13785110/root.log . The same failure to umount /mnt/sysimage/dev occurs.
Comment 39 Chris Murphy 2016-04-26 03:42:50 EDT
(In reply to Adam Williamson from comment #37)
> "Consistently it's failure to umount /mnt/sysimage/dev"
> 
> No, that is in fact unrelated. We know precisely what the problem is, what
> we don't know is why it happens or how to fix it.

e2fsck always spits out a bunch of stuff and it'd be nice if the logs could be made to show that user space output.

Per c16, is it possible to change the qemu disk cache option from unsafe to none? That option can result in the guest being informed writes are complete, yet they aren't guaranteed to be committed on the host. Possibly by the time e2fsck comes along, the file system is still inconsistent.
Comment 40 Brian Lane 2016-04-27 16:40:38 EDT
*** Bug 1330636 has been marked as a duplicate of this bug. ***
Comment 41 Cole Robinson 2016-04-27 19:03:45 EDT
(In reply to Chris Murphy from comment #39)
> (In reply to Adam Williamson from comment #37)
> > "Consistently it's failure to umount /mnt/sysimage/dev"
> > 
> > No, that is in fact unrelated. We know precisely what the problem is, what
> > we don't know is why it happens or how to fix it.
> 
> e2fsck always spits out a bunch of stuff and it'd be nice if the logs could
> be made to show that user space output.
> 
> Per c16, is it possible to change the qemu disk cache option from unsafe to
> none? That option can result in the guest being informed writes are
> complete, yet they aren't guaranteed to be committed on the host. Possibly
> by the time e2fsck comes along, the file system is still inconsistent.

if you want to eliminate the cache=unsafe option as the culprit, don't switch to cache=none in the interim, just drop the cache=unsafe option entirely to get the safe qemu default (cache=writethrough). cache=none uses O_DIRECT on the host which has its own set of problems, like not working on tmpfs (or it did at one point)
Comment 42 Chris Murphy 2016-04-27 19:09:16 EDT
(In reply to Cole Robinson from comment #41)
> if you want to eliminate the cache=unsafe option as the culprit, don't
> switch to cache=none in the interim, just drop the cache=unsafe option
> entirely to get the safe qemu default (cache=writethrough). cache=none uses
> O_DIRECT on the host which has its own set of problems, like not working on
> tmpfs (or it did at one point)

Sure but the logic of none per is that per c16 it was being used before without problems, and also I'm pretty sure writethrough is going to slow down writes and the whole compose process, although I don't know by how much where none shouldn't have as much impact.
Comment 43 Cole Robinson 2016-04-27 19:19:06 EDT
I checked lorax.git and the commit in question seems to be:

commit f2239d7860dd125b0503cdcbe2ca2fef0adea508
Author: Fabian Deutsch <fabiand@fedoraproject.org>
Date:   Tue Dec 1 09:25:50 2015 +0100

    livemedia-creator: Use cache=unsafe for the installation disk
    
    This should improve the installation speed, as no sync is required on the
    underlying disk.
    
    Signed-off-by: Fabian Deutsch <fabiand@fedoraproject.org>

diff --git a/src/sbin/livemedia-creator b/src/sbin/livemedia-creator
index 501d221..7b698a4 100755
--- a/src/sbin/livemedia-creator
+++ b/src/sbin/livemedia-creator
@@ -141,6 +141,7 @@ class VirtualInstall(object):
             args.append(ks)
 
         disk_opts = "path={0}".format(disk_img)
+        disk_opts += ",cache=unsafe"
         if qcow2:
             disk_opts += ",format=qcow2"
         else:


Which looks like they were using the cache default before (it could be cache=none which is what virt-install will set for block devices, but for files we leave it up to qemu)
Comment 44 Brian Lane 2016-04-27 20:07:52 EDT
Note that virt is not involved with koji runs of lmc, it is being run inside a mock environment.
Comment 45 Chris Murphy 2016-04-27 20:31:37 EDT
(In reply to Brian Lane from comment #44)
> Note that virt is not involved with koji runs of lmc, it is being run inside
> a mock environment.

Then it's not the cache option.

That e2fsck finds a problem and also doesn't fix it suggests something worse than just an unclean umount. But without e2fsck output we don't know if it's fs fixup failure or discard failure.

Is there a way to get a 'journalctl --since=-5m' snippet during an affected compose? That's five minutes worth, probably enough. It's tedious to look through but I volunteer for it, maybe there's a clue.

Another option is to drop -E discard if we're not beyond poking this problem with a stick.
Comment 46 Kevin Fenzi 2016-05-02 15:08:54 EDT
So, I dug around on logs some more on this. 

Look at: 

https://kojipkgs.fedoraproject.org//work/tasks/6064/13886064/program.log
which is a working MATE compose for f24 nightly ( http://koji.fedoraproject.org/koji/taskinfo?taskID=13886064 ) 

against: 

https://kojipkgs.fedoraproject.org//work/tasks/6150/13886150/program.log
Which is a failing Workstation compose for f24 nightly ( http://koji.fedoraproject.org/koji/taskinfo?taskID=13886150 )

In the working case: 

10:15:58,693 INFO program: Running... udevadm settle --timeout=300
10:15:58,705 DEBUG program: Return code: 0
10:15:58,706 INFO program: Running... umount /mnt/sysimage/dev/shm
10:15:58,719 INFO program: stderr:
10:15:58,719 INFO program: b'umount: /mnt/sysimage/dev/shm: not mounted'
10:15:58,720 DEBUG program: Return code: 32
10:15:58,720 INFO program: Running... lsof /mnt/sysimage/dev/shm
10:15:58,819 DEBUG program: Return code: 1

In the failing case: 

10:23:25,700 INFO program: Running... udevadm settle --timeout=300
10:23:25,712 DEBUG program: Return code: 0
10:23:25,717 INFO program: Running... umount /mnt/sysimage/dev/shm
10:23:25,727 INFO program: stderr:
10:23:25,728 INFO program: b'umount: /mnt/sysimage/dev/shm: target is busy'
10:23:25,729 INFO program: b'        (In some cases useful info about processes that'
10:23:25,729 INFO program: b'         use the device is found by lsof(8) or fuser(1).)'
10:23:25,729 DEBUG program: Return code: 32
10:23:25,729 INFO program: Running... lsof /mnt/sysimage/dev/shm
10:23:25,873 INFO program: stdout:
10:23:25,873 INFO program: COMMAND    PID USER  FD   TYPE DEVICE SIZE/OFF     NODE NAME
10:23:25,873 INFO program: anaconda 29126 root DEL    REG   0,49          23540950 /var/lib/mock/f24-build-5488769-600037/root/dev/shm/RFagMb
10:23:25,873 INFO program: anaconda 29126 root DEL    REG   0,49          23540949 /var/lib/mock/f24-build-5488769-600037/root/dev/shm/B0XkTm
10:23:25,874 INFO program: anaconda 29126 root DEL    REG   0,49          23540948 /var/lib/mock/f24-build-5488769-600037/root/dev/shm/92er0x
10:23:25,874 INFO program: anaconda 29405 root DEL    REG   0,49          23540950 /var/lib/mock/f24-build-5488769-600037/root/dev/shm/RFagMb
10:23:25,874 INFO program: anaconda 29405 root DEL    REG   0,49          23540949 /var/lib/mock/f24-build-5488769-600037/root/dev/shm/B0XkTm
10:23:25,874 INFO program: anaconda 29405 root DEL    REG   0,49          23540948 /var/lib/mock/f24-build-5488769-600037/root/dev/shm/92er0x
10:23:25,874 DEBUG program: Return code: 0

So, if I am reading this right, anaconda in some cases is keeping /dev/shm busy so it cannot be unmounted. Perhaps some race condition in koji?
Comment 47 Brian Lane 2016-05-03 12:28:36 EDT
Both cases are actually wrong, but the one that works isn't fatal. There's no way I can think of for shm to have already been unmounted.


That's the first useful output from lsof that I've seen though, thanks for that.
Comment 48 Brian Lane 2016-05-03 18:10:42 EDT
Here's something else that may help, even though it doesn't get to the root cause:

https://github.com/rhinstaller/blivet/pull/406
Comment 49 Giulio 'juliuxpigface' 2016-05-04 07:51:42 EDT
(In reply to Kevin Fenzi from comment #46)
> In the failing case: 
> 
> 10:23:25,700 INFO program: Running... udevadm settle --timeout=300
> 10:23:25,712 DEBUG program: Return code: 0
> 10:23:25,717 INFO program: Running... umount /mnt/sysimage/dev/shm
> 10:23:25,727 INFO program: stderr:
> 10:23:25,728 INFO program: b'umount: /mnt/sysimage/dev/shm: target is busy'
> 10:23:25,729 INFO program: b'        (In some cases useful info about


I'm sorry if I am providing pieces information which maybe aren't useful (or already known), but I read a recent log...

In case of the failure of Beta 1.6 Xfce (https://kojipkgs.fedoraproject.org//work/tasks/1501/13911501/program.log), we've got a similar failure, but not related to /mnt/sysimage/dev/shm

03:56:53,544 INFO program: b'umount: /mnt/sysimage/dev: target is busy'
03:56:53,545 INFO program: b'        (In some cases useful info about processes that'
03:56:53,545 INFO program: b'         use the device is found by lsof(8) or fuser(1).)'
03:56:53,545 DEBUG program: Return code: 32
03:56:53,545 INFO program: Running... lsof /mnt/sysimage/dev

And then, we've got a long list of processes which seems using that path (if I understood it correctly).
Comment 50 Brian Lane 2016-05-04 13:14:15 EDT
Don't worry about providing more info, we still don't know what's going on so more is better. I've seen this as well, it is another aspect of whatever's going on.


That output is pretty strange though. It is showing things outside the mock using it (which may just be an artifact of the bind mount), but it also shows this which doesn't look right to me:

program: anaconda  14099     root    0r   CHR    1,3      0t0 1028 /dev/null
program: anaconda  14099     root    4w   CHR    1,3      0t0 1028 /var/lib/mock/f24-build-5502723-600551/root/dev/null

Which, again, may be normal, I'll have to check, but it shows anaconda using /dev/null outside AND inside the mock.
Comment 51 Kamil Páral 2016-05-09 13:56:04 EDT
After discussion with Dennis Gilmore, proposing as a Final blocker. This breaks the compose frequently, including Workstation or KDE Live images, which are release blocking. We could end up with a large delay if we get an unlucky streak and we're close to the release deadline.
Comment 52 Adam Williamson 2016-05-13 00:51:45 EDT
I don't really see the benefits of making this a blocker exactly. If we don't block the release on it, it *could* cause a delay if it's not fixed. If we do block the release on it, it'll *definitely* cause a delay if it's not fixed. Think about the last blocker test: if this was the last blocker, would people really rather just unconditionally slip a week rather than say 'let's just try a compose and see if we get all the required images'?
Comment 53 Kamil Páral 2016-05-13 03:43:07 EDT
It seems I didn't really propose it, I forgot to carefully examine Whiteboard and remove RejectedBlocker :) Anyway, I can see this both ways, no strong feeling for either one. Removing the Blocks field.
Comment 54 Bruno Wolff III 2016-05-13 10:09:20 EDT
I think what is going on here is this is a very important bug, but affects the compose and not the published images (at least when we actually get them). So it doesn't seem to make sense to call it a blocker, but we would sure like to get it fixed before we need to do the final compose.
Comment 55 Chris Murphy 2016-05-13 11:08:03 EDT
(In reply to Adam Williamson from comment #52)
Yeah it sorta turns the bug from being PITA roulette into a blocker. And there's no assurance the proposed patch for lorax will actually fix the bug, it may just expose more hints as to the cause. What would be nice is if there's a way to get releng a build with this patch in it sooner than later, since releng can't apply the patch themselves.

I'm +1FE but not less than 72 hours before go/nogo as it seems the only way to test anything is to actually put it into production.
Comment 56 satellitgo 2016-06-01 11:49:12 EDT
2016-06-01 10:12:27,387 WARNING pylorax.imgutils: failed to unmount /mnt/sysimage/dev/shm. retrying (1/3).

soas build
Comment 57 Stuart D Gathman 2016-06-10 22:36:41 EDT
This sounds like a trim bug.  Koji probably has these patches, but:

Data corruption with raid0 and trim:
http://www.spinics.net/lists/raid/msg49440.html

Blog of original enterprise production errors uncovering the bug:
https://blog.algolia.com/when-solid-state-drives-are-not-that-solid/
Comment 58 Chris Murphy 2016-06-11 11:36:44 EDT
(In reply to Stuart D Gathman from comment #57)
> This sounds like a trim bug.  Koji probably has these patches, but:

Yeah see comment 23. The compose usage of trim has more layers involved. The bug you're referring to was fixed in block/bio.c and backported far and wide. It's in 4.1.26, maybe even farther back. Even Fedora's build system uses older kernels, surely it has a version with that backported fix. But none of the logs show kernel versions, so...?

I asked Eric Sandeen if he thought this bug could be trim related, and at this point he kinda expects it to either work or fail, not corrupt. Possibly getting him a bad rootfs.img would reveal the nature of the corruption. So if the bug isn't found/fixed by F24 GA, I suggest reverting the change to fail composes on this zeroing error, so we get an image for closer inspection.
Comment 59 Fedora Blocker Bugs Application 2016-06-20 11:15:38 EDT
Proposed as a Blocker for 25-alpha by Fedora user pbrobinson using the blocker tracking app because:

 Proposing now to ensure it gets enough attention before Alpha freeze on  2016-08-09 

Per comment 12:

"It is a clear Alpha blocker, per criterion "The installer must be able to complete an installation to a single disk using automatic partitioning." for the KDE live image (which is a release blocking image). This may well affect the Workstation live as well, except you can't run the installer on that ATM because of "
Comment 60 Bruno Wolff III 2016-06-20 11:24:49 EDT
This is an abuse of the blocker system. If it's not a blocker, it's not a blocker. If we need a better way to get attention for some bugs, that aren't blockers, but cause a lot of grief, then we should talk about how to do that. Maybe this would be a good topic for next weeks QA meeting.
Comment 61 Peter Robinson 2016-06-20 11:31:39 EDT
(In reply to Bruno Wolff III from comment #60)
> This is an abuse of the blocker system. If it's not a blocker, it's not a
> blocker. If we need a better way to get attention for some bugs, that aren't
> blockers, but cause a lot of grief, then we should talk about how to do
> that. Maybe this would be a good topic for next weeks QA meeting.

Why isn't it a blocker? The intention isn't to abuse the system, I firmly believe issues with the anaconda/lorax bugs causing issues with consistent output is a blocker.
Comment 62 Adam Williamson 2016-06-20 11:42:56 EDT
Nope, still not a blocker. We already considered it and rejected it on the grounds that if a compose is missing a blocking image we can just re-compose it.

The blocker process is not for "getting attention", and in any case this bug is not lacking "attention". bcl and dgilmore spent a lot of their time in the whole f24 cycle trying to fix it. The problem is that it's really sodding difficult to figure out what the problem is and no-one managed to do that yet. The blocker process is not going to do anything to help that.
Comment 63 Dennis Gilmore 2016-08-18 11:45:07 EDT
We have no ability to recompose just an artifact. at least at the moment KDE that is a release blocking artifact is failing consistently. which means we get to start compose over and over and over agin. which is not a good use of anyones time.  This really needs to be fixed, and at least in the eyes of release engineering is a blocker because it effects our ability to deliver release blocking artifacts.
Comment 64 Dennis Gilmore 2016-08-18 11:49:42 EDT
hopefully reprosing as a blocker
Comment 65 Adam Williamson 2016-08-18 13:11:39 EDT
If we're talking about workarounds, I think it wouldn't be too difficult to adjust pungi to allow for retries; so you could give each image (or image category or whatever) a compose retry count, and if the task failed pungi wouldn't just give up on the image (and either bail or carry on depending on whether it's considered 'blocking'), but run it again, until the retry count was exhausted. Just a thought.

If we wanted to get really silly about it we could even teach it to recognize this failure case and just keep retrying until it didn't happen, probably.
Comment 66 Adam Williamson 2016-08-18 14:53:30 EDT
Discussed at 2016-08-18 go/no-go meeting, functioning as a blocker review meeting: https://meetbot-raw.fedoraproject.org/fedora-meeting/2016-08-18/f25-alpha-go_no_go-meeting.2016-08-18-17.00.html . We agreed to delay the decision on blocker status for a while as there are various perspectives and some ideas for how to debug this (and also how to work around it if necessary), so we will see where things stand at the next blocker review point.
Comment 67 Adam Williamson 2016-08-19 21:26:43 EDT
Has anyone who's tried to reproduce this locally tried running *two* live composes (same image - one of the commonly-affected ones like KDE - for two different arches) in two mock chroots at the same time?
Comment 68 Chris Murphy 2016-08-20 14:19:01 EDT
Failed compose for createLiveMedia (f25, Fedora-Security-Live-25-20160819.n.1, fedora-live-security.ks, x86_64)
http://koji.fedoraproject.org/koji/taskinfo?taskID=15312133

This one is /mnt/sysimage/dev fails to umount because it's busy. But the logs are saying contradictory things:

https://kojipkgs.fedoraproject.org//work/tasks/2133/15312133/mock_output.log

2016-08-19 22:55:47,350: failed to unmount /mnt/sysimage/dev/shm. retrying (1/3)...
2016-08-19 22:55:48,500: failed to unmount /mnt/sysimage/dev/shm. retrying (2/3)...
2016-08-19 22:55:49,615: Cleanup of /mnt/sysimage/dev/shm failed.

That appears to be bogus because if you go to the program log...
https://kojipkgs.fedoraproject.org//work/tasks/2133/15312133/anaconda-program.log

22:55:09,806 INFO program: Running... umount /mnt/sysimage/dev/shm
22:55:09,827 DEBUG program: Return code: 0
[...]
22:55:10,035 DEBUG program: Return code: 0

dev/shm umounted OK, it's only /mnt/sysimage/dev that fails to umount. Plus, program.log ends at 22:55:10 where mock_output.log continues to complain about things and says to go see program.log for more info, but of course there's no info to be found in program.log because it's ended.

So did dev/shm fail to umount? anaconda-program.log says it umounted, mock_output.log says it failed.
Comment 69 Adam Williamson 2016-08-20 15:52:43 EDT
Chris: what you're missing is this:

https://github.com/rhinstaller/lorax/blob/master/src/sbin/livemedia-creator#L688

it's not anaconda that reports that 'failed to unmount (foo). retrying...' message. It's livemedia-creator. livemedia-creator's `anaconda_cleanup()` function is run after anaconda exits, and it basically looks in the build environment's /proc/mounts for any listed mounts that are under /mnt/sysimage - the path where the install target image is mounted - and tries to unmount them. That is, just as the name suggests, it tries to clean up after anaconda.

Still, this is a bit confusing, because it suggests that in this case, anaconda (well, blivet) by all indications successfully umounted its /dev/shm mount, yet when livemedia-creator's `anaconda_cleanup()` function ran, /proc/mounts showed /mnt/sysimage/dev/shm as mounted, and livemedia-creator tried to umount it and that failed.

Note that program.log and anaconda-program.log are *different*. anaconda-program.log is the log from the anaconda run. program.log is output by lorax, and if you look at that one, you will indeed see the failed attempt to umount /mnt/sysimage/dev/shm , and some debugging info (the outputs of lsof, fuser, and ls):

Running... umount /mnt/sysimage/dev/shm
umount: /mnt/sysimage/dev/shm: target is busy
(In some cases useful info about processes that
use the device is found by lsof(8) or fuser(1).)
Return code: 32
Running... fuser -vm /mnt/sysimage/dev/shm
USER        PID ACCESS COMMAND
/mnt/sysimage/dev/shm:
root     kernel mount /mnt/sysimage/dev/shm
Return code: 0
Running... ls -lah /mnt/sysimage/dev/shm
total 0
drwxrwxrwt.  2 root root   40 Aug 19 22:38 .
drwxr-xr-x. 20 root root 4.0K Feb  3  2016 ..
Return code: 0
Running... lsof /mnt/sysimage/dev/shm
Return code: 1

So just to recap - sorry to everyone for whom this is old news, but I think it'll help some people who are trying to follow this bug - here is how everything happens from start to finish, at least as I understand it:

1. Koji creates a clean mock chroot for the image build to run in; each Koji task gets its own mock chroot (I checked, and they don't seem to be colliding at all)

2. Koji / pungi (I don't know which one is responsible for which tasks exactly, didn't look in detail) installs all the necessary bits for building live images into the chroot, checks out the spin-kickstarts repo, flattens the appropriate kickstart, fiddles with the repos a bit - just generally preparing for the image build

3. Koji / pungi runs livemedia-creator in the mock chroot, telling it to do a non-virt install with the flattened kickstart

4. livemedia-creator creates a disk image file under /tmp/lmc (with a random filename) and mounts it to /mnt/sysimage

5. livemedia-creator calls anaconda and tells it to do a directory install with the flattened kickstart that koji fed it (it doesn't specify /mnt/sysimage as the target directory, so I guess that's the default)

6. As part of the directory install process, anaconda bind mounts /dev (remember, this is the *mock chroot's* /dev, not the builder's /dev) as /mnt/sysimage/dev , then mounts a tmpfs on /mnt/sysimage/dev/shm

7. At the end of its run, anaconda tries to umount /mnt/sysimage/dev/shm and /mnt/sysimage/dev (and a few other similar mounts it does); if any of these umounts fail, anaconda crashes immediately. This is what's happening at 22:55:10 in the log you refer to: anaconda tries to umount /mnt/sysimage/dev and this fails, so it crashes.

8. When anaconda completes (whether it quit normally or crashed), livemedia-creator calls that `anaconda_cleanup()` function, which checks for left-over mounts under /mnt/sysimage (again, remember, we're still inside the mock chroot at this point) and tries to clean them up; for each mount it finds it tries three times to umount it, on each failure logging the lsof, fuser and ls output for the mount point. After three retries of each mount point, it gives up. If any of the umounts failed, livemedia-creator will (intentionally) crash; this is what we changed some months back, previously it would log a warning and carry on.

I looked into what a mock chroot has for /dev ; it looks like mock basically creates it out of whole cloth when initializing the chroot, i.e. it's not a bind mount of the host system's /dev or anything like that.
Comment 70 Adam Williamson 2016-08-20 17:01:17 EDT
heyyy...wait a bit. I do see this in the logs of koji createLiveImage tasks:

DEBUG util.py:499:  Executing command: ['/bin/mount', '-n', '--bind', '/dev', '/var/lib/mock/f25-build-6240789-628717/root/dev'] with env {'TERM': 'vt100', 'PATH': '/usr/bin:/bin:/usr/sbin:/sbin', 'LANG': 'en_US.UTF-8', 'HOME': '/builddir', 'SHELL': '/bin/sh', 'HOSTNAME': 'mock'} and shell False

it looks like they *do* bind mount /dev from the host into the mock chroot, even though mock usually doesn't do that. It looks like koji is configured to make mock do this:

https://git.fedorahosted.org/cgit/koji/tree/builder/kojid#n2500

the LiveMediaTask class inherits from that ImageTask class.

So...this seems like it could at least *potentially* be a source of trouble? If I've still got things straight, that means we have koji bind mounting the builder's /dev into the mock chroot, and anaconda bind mounting that mock chroot /dev as its /mnt/sysimage/dev ...
Comment 71 Adam Williamson 2016-08-20 17:26:10 EDT
Hmm, this may be interesting: if you look closely at the log, after both anaconda and livemedia-creator have had their umount problems, mock/koji (I'm not sure which of the two is actually *doing* this bind mounting) also fails to umount the bind mount of the builder's /dev into the mock chroot. Look near the end of the mock logs for a failed run, and you see something like this:

DEBUG util.py:421:  2016-08-19 22:55:54,195: ERROR: Image creation failed: novirt_install cleanup of anaconda mounts failed.
DEBUG util.py:546:  Child return code was: 1
DEBUG util.py:180:  kill orphans
DEBUG util.py:567:  child environment: None
DEBUG util.py:499:  Executing command: ['/bin/umount', '-n', '/var/lib/mock/f25-build-6252959-629158/root/proc/filesystems'] with env {'SHELL': '/bin/sh', 'TERM': 'vt100', 'HOSTNAME': 'mock', 'HOME': '/builddir', 'LANG': 'en_US.UTF-8', 'PATH': '/usr/bin:/bin:/usr/sbin:/sbin'} and shell False
DEBUG util.py:546:  Child return code was: 0
DEBUG util.py:567:  child environment: None
DEBUG util.py:499:  Executing command: ['/bin/umount', '-n', '/var/lib/mock/f25-build-6252959-629158/root/dev'] with env {'SHELL': '/bin/sh', 'TERM': 'vt100', 'HOSTNAME': 'mock', 'HOME': '/builddir', 'LANG': 'en_US.UTF-8', 'PATH': '/usr/bin:/bin:/usr/sbin:/sbin'} and shell False
DEBUG util.py:546:  Child return code was: 32

It seems like these bind mounts are set up at the start of each mock operation and torn down at the end; there are two mock operations in the createLiveMedia task, it seems, so you see the mounts set up twice and torn down twice. The first time they all umount cleanly, the second time /dev fails to umount.
Comment 72 Adam Williamson 2016-08-20 18:21:38 EDT
So I looked a bit into the history of *why* we bind mount the whole of /dev into the mock chroot for 'image build' tasks. When live CD compose was added to koji in 2009, the code bind mounted various bits of /dev , but not all of it:

https://git.fedorahosted.org/cgit/koji/commit/?id=d93d05ab5f995947ca39a45dacb7b94da81751be

look at the "# Here we configure mock to bind mount a set of /dev directories" bit.

The commit that changed to bind mounting all of /dev came in 2011, and is sadly lacking in information as to *why* it was being done:

https://git.fedorahosted.org/cgit/koji/commit/?id=296d2d190083b053f39212e4eab380eef135837a

but we can get at least a hint from this mail thread:

https://lists.fedoraproject.org/archives/list/infrastructure@lists.fedoraproject.org/thread/DH3JKWTWOAG6CO42Y5O3Z3W7WZWZFAA3/#DH3JKWTWOAG6CO42Y5O3Z3W7WZWZFAA3

note "in order to compose appliances in koji". So it sounds like it was to do with doing the creation of cloud images in Koji; I think that's approximately the right timeframe. I see four createAppliance tasks before that commit which somehow succeeded, but they were clearly early experiments; regular succeeded createAppliance tasks start showing up right around 2011-02-19, the date of the patch.

anyhow: it's clearly something that's just been inherited for the LiveMediaTask without any specific thought, we just had LiveMediaTask inherit from the existing ImageTask class because it was kinda 'logical', probably without deeply considering whether everything ImageTask does is actually appropriate for it.

So, *one* thing we could try doing to address this is we could try modifying koji so that LiveMediaTask does *not* bind mount /dev into the mock chroot, and see how that goes. It's possible that anaconda really will need the 'real' /dev , either to work at all or we'll find that something subtle breaks without it, but we can at least *try* without it and see how it goes, right? I've at least just run a KDE live image creation in a regular mock chroot here (without a bind mount of /dev) and it worked apparently fine.

I have *another* suggestion, coming from a different angle, that I'll put in the next comment.
Comment 73 Adam Williamson 2016-08-20 18:38:18 EDT
So here's my other angle:

Why does anaconda explicitly mount a tmpfs on /mnt/sysimage/dev/shm ?

anaconda bind mounts /dev from its own environment as /mnt/sysimage/dev . Alright. Then it explicitly mounts a tmpfs as /mnt/sysimage/dev/shm . But the funny thing is, in all the cases I can think of, /dev/shm is *already* a tmpfs; systemd creates it that way. So, I'm kinda suspicious of this behaviour. It's interesting if you boot an F25 installer image and look at 'mount | grep shm', before and after starting the install. Before:

tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)

but after:

tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)
tmpfs on /mnt/sysimage/dev/shm type tmpfs (rw,relatime,seclabel)
tmpfs on /dev/shm type tmpfs (rw,relatime,seclabel)

that...looks a bit squiggly, doesn't it? It doesn't seem to cause the installer any issues when operating in its 'regular' mode, but it still doesn't really *look* right.

So I tried to look into why anaconda does this at all. I couldn't find a clear explanation in the logs, but what I can find is that it's really *old* behaviour. I still have a Fedora 13 image lying around here, and I found that Fedora 13 behaves the same way - it mounts a tmpfs on /mnt/sysimage/dev/shm . But the interesting thing is that Fedora 13's anaconda environment does *not* have a /dev/shm at all.

So if you look at it in *that* context, you can make a guess at why this was put in anaconda in the first place: in Ye Olde Days the anaconda environment had no /dev/shm , and then we realized that the install root needed one for whatever reason - package install process needs to use it, or something - so we made anaconda set one up. But we never *stopped* anaconda doing that after anaconda's environment grew a /dev/shm (which was probably around the time of the 'noloader' change, if I had to hazard a guess).

So here's my other suggestion: we just stop anaconda from doing this /mnt/sysimage/dev/shm tmpfs mount at all, and see if that helps.
Comment 74 Chris Murphy 2016-08-20 19:14:01 EDT
(In reply to Adam Williamson from comment #71)

> DEBUG util.py:499:  Executing command: ['/bin/umount', '-n',
> '/var/lib/mock/f25-build-6252959-629158/root/dev'] with env {'SHELL':
> '/bin/sh', 'TERM': 'vt100', 'HOSTNAME': 'mock', 'HOME': '/builddir', 'LANG':
> 'en_US.UTF-8', 'PATH': '/usr/bin:/bin:/usr/sbin:/sbin'} and shell False
> DEBUG util.py:546:  Child return code was: 32

My understanding is before this (before the crash and image creation fail) there was a failure in the mock chroot to umount /mnt/sysimage/dev/shm, which presumably will make umount of anything below it fail, and there are messages to that effect.

But adding to the confusion is the lack of consistent time stamps in these logs. The anaconda-program.log says /mnt/sysimage/dev/shm umount worked:
22:55:09,806 INFO program: Running... umount /mnt/sysimage/dev/shm
22:55:09,827 DEBUG program: Return code: 0

But then program.log says it doesn't:
Running... umount /mnt/sysimage/dev/shm
umount: /mnt/sysimage/dev/shm: target is busy

So that's still WTF to me because a.) two things seems to want to umount the same thing at slightly different times b.) one of them says success the other says busy. So maybe the program.log busy one is true (koji or mock does this one? but not anaconda), and then the anaconda-program.log is also true and it's not busy anymore?
Comment 75 Adam Williamson 2016-08-20 19:52:22 EDT
I already explained a) - anaconda-program.log is *anaconda* trying to tear down its mounts, program.log is *livemedia-creator* trying to clean up any mounts that anaconda left over. The latter always happens *after* the former, it cannot be otherwise - it cannot run until the anaconda process finishes, and anaconda's own teardown is the thing it runs right *before* the process finishes.

It is indeed strange that anaconda claims to have successfully umounted /mnt/sysimage/dev/shm , but livemedia-creator subsequently finds it to still be mounted (livemedia-creator does not blindly go around trying to umount random paths - it reads /proc/mounts to find out what the kernel thinks is mounted), tries to umount it, and fails.

But my current belief is that this is simply us winding up in an area of pretty undefined behaviour, because we're doing very weird stuff here. It's just not *right* - I don't think - to bind mount a path which contains submounts , and then instead of also bind mounting the submounts , effectively replicating the submounts in the bind mount. I just don't think that's how you're supposed to do stuff at all, it certainly makes the 'mount' output look odd, and I can totally believe it could cause umount to do weird things. Which is...exactly what we're seeing, isn't it?

So, that's basically why I'm saying, maybe we should be looking at changing what anaconda does with /mnt/sysimage/dev/shm . And I'm currently testing a potential change there.
Comment 76 Adam Williamson 2016-08-21 23:02:02 EDT
for the record, dgilmore points out that infra actually runs a fork of koji:

https://pagure.io/fork/ausil/koji/branch/fedora-infra

but the relevant code looks the same there, ImageTask.makeImgBuildRoot() still configures the /dev bind mount, and LiveMediaTask still inherits from ImageTask. Which isn't surprising, since the log shows the /dev bind mount happening.

Dennis was surprised at this as he thought it was already the case that the bind mount wasn't being used for live media creation tasks, so that kinda agrees with my suggestion to try turning that off. I definitely think we could try disabling that bind mount tomorrow and running a compose, see how that turns out.
Comment 77 Brian Lane 2016-08-22 12:55:14 EDT
(In reply to Adam Williamson from comment #67)
> Has anyone who's tried to reproduce this locally tried running *two* live
> composes (same image - one of the commonly-affected ones like KDE - for two
> different arches) in two mock chroots at the same time?

Yes I've done that. But I was using mock with just the loop device nodes added, not a bind mount of my host (which gives me the willies). I'd sure like to find the root cause of this. Looks like we should try turning off the host /dev mount -- there should be no need for that with lmc.
Comment 78 Adam Williamson 2016-08-22 12:58:30 EDT
bcl: great, thanks - that's what dgilmore and I thought too, so that makes it unanimous :) dgilmore is currently running a scratch build of koji with the bind mount for live image compose tasks disabled, and we'll run a compose with that and see how it goes.

I'll probably file a separate bug and PR for blivet to do a recursive bind mount of the host's /dev instead of doing its own tmpfs mount of /dev/shm (and devpts mount of /dev/pts) at some point, as I think that behaviour is still worth discussing, but if we're right about the problem for the compose process, just disabling the mock bind mount should be sufficient to fix it, we don't also need to change anaconda immediately I don't think.
Comment 79 Geoffrey Marr 2016-08-22 17:38:07 EDT
Discussed during the 2016-08-22 blocker review meeting: [1]

The decision to classify this bug as an Accepted Blocker was made as this has been happening long enough and affects KDE often enough that we are now willing to take it as a blocker as a conditional violation of "All release-blocking images must boot in their supported configurations", the criterion which implicitly covers 'the release-blocking images must actually exist'.

[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2016-08-22/f25-blocker-review.2016-08-22-16.00.txt
Comment 80 Adam Williamson 2016-08-22 19:25:00 EDT
so https://kojipkgs.fedoraproject.org/compose//branched/Fedora-25-20160822.n.1/logs/ is running with the Koji change to not do the bind mount into the mock chroots. We'll see how that goes. Marking POST to denote this.
Comment 81 Adam Williamson 2016-08-22 22:59:15 EDT
The fix for this looks good, we had no failures of this kind in the Fedora-25-20160822.n.1 compose. Additionally, anaconda actually exited cleanly on all live composes (rather than crashing after 'umount /mnt/sysimage/dev/shm', which it was previously doing even on successful composes). appliance-creator composes still work, too. We just need a proper koji build submitted as an update now.
Comment 82 Adam Williamson 2016-08-22 23:01:39 EDT
actually, seems like some appliance-creator composes had issues, so we may have to restore the bind mount for the ApplianceTask class. But dennis' PR is here: https://pagure.io/koji/pull-request/137
Comment 83 Adam Williamson 2016-08-23 14:34:23 EDT
https://pagure.io/koji/pull-request/138 is an alternative I wrote which should only disable the /dev bind mount for createLiveMedia tasks, leaving it enabled for createAppliance.
Comment 84 Fedora Update System 2016-08-23 14:57:09 EDT
koji-1.10.1-12.fc25 has been submitted as an update to Fedora 25. https://bodhi.fedoraproject.org/updates/FEDORA-2016-789e641d4c
Comment 85 Fedora Update System 2016-08-23 14:59:48 EDT
koji-1.10.1-12.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-adf8ab8a29
Comment 86 Fedora Update System 2016-08-23 19:20:01 EDT
koji-1.10.1-12.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-789e641d4c
Comment 87 Fedora Update System 2016-08-24 13:24:33 EDT
koji-1.10.1-12.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-adf8ab8a29
Comment 88 Adam Williamson 2016-08-25 13:13:34 EDT
This definitely looks good, no compose since we changed koji has hit this bug to my knowledge.
Comment 89 Fedora Update System 2016-08-25 18:24:21 EDT
koji-1.10.1-12.fc25 has been pushed to the Fedora 25 stable repository. If problems still persist, please make note of it in this bug report.
Comment 90 Fedora Update System 2016-08-26 06:20:53 EDT
koji-1.10.1-12.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.
Comment 91 Adam Williamson 2016-10-19 20:02:21 EDT
This goddamn bug seems to be back for current Rawhide composes - it ate today's Rawhide Workstation live compose: https://kojipkgs.fedoraproject.org//work/tasks/3655/16123655

check this from the log, it seems like the bind mount is back for some reason?

DEBUG util.py:502:  Executing command: ['/bin/mount', '-n', '--bind', '/dev', '/var/lib/mock/f26-build-6674072-655679/root/dev'] with env {'LANG': 'en_US.UTF-8', 'PATH': '/usr/bin:/bin:/usr/sbin:/sbin', 'HOME': '/builddir', 'TERM': 'vt100', 'SHELL': '/bin/sh', 'HOSTNAME': 'mock'} and shell False
Comment 92 Adam Williamson 2016-10-19 20:10:40 EDT
Actually, I think it's an issue on one specific koji host. All the createLiveMedia tasks that run on buildhw-02.phx2.fedoraproject.org seem to get the /dev bind mount (which is wrong). createLiveMedia tasks run on any other host don't get it (which is right). There's something up with that host. Old koji version?

Note You need to log in before you can comment on or make changes to this bug.