The Fedora-Rawhide-20191212.n.1 compose seems completely broken. No images boot: not installer images, not live images. They all seem to fail on disk discovery during early boot and drop to the dracut shell. I'm currently trying to get more info.

Live images fail with:

Warning: /dev/disk/by-label/Fedora-KDE-Live-rawh-20191212-n- does not exist
Warning: /dev/root does not exist

(Obviously the first line differs a bit depending on the image.) Installer images show only the "/dev/root does not exist" line.

I'm assigning to util-linux for now as a guess because it's the most obviously-related thing that changed in the 1212.n.1 compose. The kernel didn't, and dracut didn't. Will update as I get more info.

This is an automatic Fedora 32 Beta blocker, per "Complete failure of any release-blocking TC/RC image to boot at all under any circumstance - "DOA" image (conditional failure is not an automatic blocker)" - https://fedoraproject.org/wiki/QA:SOP_blocker_bug_process#Automatic_blockers .
Interesting: of all the images, the Cloud qcow2 disk image did boot, but a later test of that image failed with a segfault in agetty, which is part of util-linux indeed: https://openqa.fedoraproject.org/tests/497814#step/autocloud/80 the coredump is available at https://openqa.fedoraproject.org/tests/497814/file/autocloud-coredump.tar.gz .
So, pxeboot also worked. Installer images and live images fail to boot, while a Cloud image boots and a PXE (direct kernel boot) install works; that tends to imply that the issue is with squashfs, I think.
Created attachment 1644596 [details] backtrace of agetty crash from Cloud image test Here's a backtrace of the agetty crash from the Cloud image test. Not sure yet if this crash is related to the failure of other images to boot.
Oh, URLs of images for reproducing this:

* https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20191212.n.1/compose/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-Rawhide-20191212.n.1.iso (Workstation live)
* https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20191212.n.1/compose/Everything/x86_64/iso/Fedora-Everything-netinst-x86_64-Rawhide-20191212.n.1.iso (Everything netinst)
So, if I boot an older Workstation live with `rd.break` to get to the dracut shell, and look at /dev/disk/by-label , the expected entry is there, and it's a symlink to /dev/sr0:

/dev/disk/by-label/Fedora-WS-Live-rawh-20191202-n-0 -> ../../sr0

In the live case I think the thing that ultimately fails is the dmsquash-live-root command specified in the udev rules in /etc/udev/rules.d/99-live-squash.rules , which is supposed to do the work of setting up the live root from the squashfs:

KERNEL=="disk/by-label/Fedora-WS-Live-rawh-20191202-n-0", RUN+="/sbin/initqueue --settled --onetime --unique /sbin/dmsquash-live-root /dev/disk/by-label/Fedora-WS-Live-rawh-20191202-n-0"
SYMLINK=="disk/by-label/Fedora-WS-Live-rawh-20191202-n-0", RUN+="/sbin/initqueue --settled --onetime --unique /sbin/dmsquash-live-root /dev/disk/by-label/Fedora-WS-Live-rawh-20191202-n-0"

That's obviously going to fail when the expected /dev/disk/by-label node just isn't there at all. So, I guess we need to know why that node isn't showing up.

A possibly-interesting thing is that if I run `dmsquash-live-root /dev/sr0` manually from the dracut rescue shell it actually fails, but I'm not 100% sure whether we'd expect that to work.
Okay, so. /dev/disk/by-label/ symlinks are created by /etc/udev/rules.d/61-persistent-storage.rules in this environment, and we get them based on the ID_FS_LABEL_ENC environment var. And this is definitely the problem! If I run `udevadm info /dev/sr0 | grep ID_FS` on the older, working live image, I get a bunch of output (ID_FS_UUID, ID_FS_UUID_ENC, ID_FS_BOOT_SYSTEM_ID, ID_FS_VERSION, ID_FS_LABEL, ID_FS_LABEL_ENC, ID_FS_TYPE and ID_FS_USAGE). If I run the same command on the newer, broken live image I get nothing at all. All those are generated by systemd-udev, around here: https://github.com/systemd/systemd/blob/master/src/udev/udev-builtin-blkid.c#L215 using (lib)blkid somehow (I haven't dug into the details of how yet). So we're narrowing down here for sure. Chris Murphy suggests this util-linux commit might be the culprit: https://github.com/karelzak/util-linux/commit/7ef86a08914427d6486734614d7d3bbed1f108fe and yes, it sounds very much like it to me, now I read the commit message. Especially this bit: "Clean up this situation by detecting such partitioned ISO disks in the superblock probing setup. If files of this kind are detected, we now only expose the ISO metadata attributes on the specific partition that points to the ISO data (and not the parent disk)." I'm pretty sure the path we're on here is *expecting* those attributes to be exposed on "the parent disk".
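For anyone who wants to repeat the ID_FS check across several images, here is a minimal Python sketch of it. It assumes `udevadm info` prints properties as `E: KEY=VALUE` lines; the function names and the SAMPLE text are made up for illustration and are not taken from a real run.

```python
import subprocess


def parse_id_fs(udevadm_output: str) -> dict:
    """Collect the ID_FS_* properties from `udevadm info` style output."""
    props = {}
    for line in udevadm_output.splitlines():
        # Property lines are assumed to look like "E: KEY=VALUE".
        if not line.startswith("E: "):
            continue
        key, sep, value = line[3:].partition("=")
        if sep and key.startswith("ID_FS_"):
            props[key] = value
    return props


def id_fs_properties(device: str) -> dict:
    """Run `udevadm info DEVICE` (same command as in the comment) and parse it."""
    result = subprocess.run(
        ["udevadm", "info", device],
        capture_output=True, text=True, check=True,
    )
    return parse_id_fs(result.stdout)


# Made-up sample, shaped like the output on the older, working image.
SAMPLE = """N: sr0
E: DEVNAME=/dev/sr0
E: ID_FS_LABEL=Fedora-WS-Live-rawh-20191202-n-0
E: ID_FS_TYPE=iso9660
"""

if __name__ == "__main__":
    print(parse_id_fs(SAMPLE))
```

On the broken image the expectation from the above would be an empty dict for /dev/sr0, which is why no /dev/disk/by-label symlink gets created.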
Given what the believed-to-be-offending commit does, I think this will only happen when the medium is being booted as a real or virtual optical disc (when booted as a real or virtual USB stick it'll likely work). I can try and confirm this in a bit.
I've got to run out soon, but I've got some tests running to try and prove or disprove the theory ATM:

* https://openqa.stg.fedoraproject.org/tests/overview?distri=fedora&version=31&build=Kojitask-39505805-NOREPORT&groupid=2 (unmodified util-linux 2.35, expected to FAIL)
* https://openqa.stg.fedoraproject.org/tests/overview?distri=fedora&version=31&build=Kojitask-39505893-NOREPORT&groupid=2 (2.35 with the offending commit reverted, expected to PASS)

I ran two scratch builds of util-linux for F31, one unmodified from the Rawhide source, one with patches added to revert the offending commit (and a subsequent commit that also has to be reverted to revert the commit we care about cleanly). Then I fired the openQA test which builds a boot.iso and tries to boot and install it on both scratch builds.

I did it this way because it's nice and easy to have openQA test it for us, but it's not set up to run these tests on Rawhide builds, only stable and branched releases. We'll hope there are no other complicating issues with using util-linux 2.35 in a Fedora 31 environment, if there are I'll have to test manually later. Will check back in on this later tonight.
OK, the tests turned out as expected, indicating we're right about the cause and the fix here. So I've sent an official Rawhide build with the revert: https://koji.fedoraproject.org/koji/taskinfo?taskID=39506577 that should fix things in the next compose.
In case it's useful for Karel Zak's understanding of the bug, this is the command creating the ISO for Workstation Live ISO (other ISOs have something functionally the same):

2019-12-12 17:32:50,780 DEBUG pylorax.ltmpl: template line 59: runcmd xorrisofs -o /var/tmp/lmc-work-m8uvrz8m/images/boot.iso -isohybrid-mbr /usr/share/syslinux/isohdpfx.bin -b isolinux/isolinux.bin -c isolinux/boot.cat -boot-load-size 4 -boot-info-table -no-emul-boot -eltorito-alt-boot -e images/efiboot.img -no-emul-boot -isohybrid-gpt-basdat -eltorito-alt-boot -e images/macboot.img -no-emul-boot -isohybrid-gpt-hfsplus -R -J -V Fedora-WS-Live-rawh-20191212-n-1 -graft-points isolinux=/var/tmp/lmc-work-m8uvrz8m/isolinux images/pxeboot=/var/tmp/lmc-work-m8uvrz8m/images/pxeboot LiveOS=/var/tmp/lmc-work-m8uvrz8m/LiveOS EFI/BOOT=/var/tmp/lmc-work-m8uvrz8m/EFI/BOOT images/efiboot.img=/var/tmp/lmc-work-m8uvrz8m/images/efiboot.img images/macboot.img=/var/tmp/lmc-work-m8uvrz8m/images/macboot.img

Slightly off topic: considering this bug, and also the proposed F32 system wide change to no longer block on optical media related bugs [1], are we approaching a time when it might be more sane to intentionally split optical and USB boot images? i.e. choose the timing of the breakage with a plan? I'm reluctant to imagine what this looks like change and testing wise, because the hybrid ISO is really magical: optical, USB, BIOS, UEFI, Macs... it practically makes unicorn poop cookies.

[1]
https://fedoraproject.org/wiki/Changes/Drop_Optical_Media_Criterion
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/FSRXYXL4ECPLUA7LGTINQDI5MN67LHYC/
but if we did that, where would I get my unicorn poop cookies?!
Hi,

my two cents as developer of xorriso:

Commit 7ef86a08 claims to fix a udev problem by manipulating the blkid recognition of ISO 9660 filesystems on the devices. This seems unwise in itself, from the view of proper software architecture.

My theory about the actual problem is that libblkid recognizes the partition table on /dev/sr0, whereas the Linux kernel does not. It then places its unwise bet on there being a partition device which leads to the same superblock.

My bet: grub-mkrescue ISOs will not be recognized as carriers of ISO 9660 even on a USB stick, because none of their GPT partitions expose the ISO and the APM partition exposes the HFS+ filesystem tree.

If my theory is correct, then deep thinking is needed about how bad it is if the same ISO 9660 is shown on base device and partition device. A variation of this situation is the partition offset feature of xorriso, by which two superblocks and two directory trees are produced for the same ISO 9660 data file content: one for the base device, and one for the partition which begins at CD block 16 / disk block 64. (See e.g. Knoppix 8.2 ISO.)

If this kind of situation is really undesirable, then a smarter detector is needed in libblkid, which verifies that the base device really offers the same ISO 9660 filesystem which one of the partitions offers. Simply assuming that this is true is extra unwise.

But reading this part of the commit description, it looks like the whole approach is misguided and maybe fixes the udev problem only by accident:

"Since sda2 is unformatted, it won't have any ID_FS_ attributes of it's own. And due to the following standard udev rule:

# for partitions import parent information
ENV{DEVTYPE}=="partition", IMPORT{parent}="ID_*"

sda2 will actually import all of the ID_FS_ stuff from the parent device sda."

Fedora-Workstation-Live-x86_64-Rawhide-20191213.n.0.iso has no partition without filesystem.
With an old libblkid, lsblk reports about it when on USB stick:

NAME     SIZE FSTYPE  TRAN LABEL
sdd      1.9G iso9660 usb  Fedora-WS-Live-rawh-20191213-n-0
|-sdd1   1.8G iso9660      Fedora-WS-Live-rawh-20191213-n-0
|-sdd2    11M vfat         ANACONDA
`-sdd3  22.9M hfsplus      ANACONDA

So already with the Fedora ISO, the commit reasoning is wrong. (That udev rule is a known brain bug. Its legitimate use case is unclear to me. It bites people from time to time.)

The producers of isohybrid ISOs should try to convince util-linux to improve their handling of this situation and to analyze whether the real problem isn't actually in udev.

-----------------------------------------------------------------------

Side note:

Is there a reason why Fedora ISO production gave up on exposing the HFS+ filesystem by an Apple Partition Map ?

Debian uses xorrisofs option -isohybrid-apm-hfsplus in vain on its EFI image, because there is no HFS+. But Fedora has /images/macboot.img which could be advertised by option -isohybrid-apm-hfsplus in addition to -isohybrid-gpt-basdat for the (invalid) GPT.

To answer in advance the question who came to such ideas, it was a Fedora developer: http://mjg59.dreamwidth.org/11285.html
The first implementation was in SYSLINUX program isohybrid, which mjg augmented by options --uefi and --mac. But the ISO 9660 partition stems from BIOS-only isohybrid, hpa, Sep 2008. https://github.com/geneC/syslinux/commits/master/utils/isohybrid.in

Have a nice day :)

Thomas
So, just for the record, we were right about the problem - the next Rawhide compose with the reverts booted fine again. I'm gonna leave the bug open for a bit for discussion about the correct long-term fix and the issues Thomas raised, but let's drop the blocker metadata as it's not really blocking any more.
Hi,

i have to correct a mistake in my previous post: The shown xorriso run uses for macboot.img the option -isohybrid-gpt-hfsplus. I wrote -isohybrid-gpt-basdat, which is used for the EFI image.

Nevertheless, adding to -isohybrid-gpt-hfsplus the option -isohybrid-apm-hfsplus would cause creation of an Apple Partition Map, which mjg intended when he added option --mac to SYSLINUX isohybrid. (Neither he nor Vladimir Serbinenko could tell me which Macs exactly would boot by APM and HFS+. I doubt that any HFS-needy Mac boots by MBR partition table. By tradition the GPT in SYSLINUX isohybrids is invalid, because the MBR partition table is not "protective". So if HFS+ then via APM, i'd say.)

------------------------------------------------------------------------

As for the blkid change. I now seem to remember that the udev rule

# for partitions import parent information
ENV{DEVTYPE}=="partition", IMPORT{parent}="ID_*"

shall set defaults for those properties of a partition which are defined by the overall device and might not get read from the partition device. The content properties of the base device obviously should not belong to these defaults.

In November 2013 i discussed the problem on grub-devel: https://lists.gnu.org/archive/html/grub-devel/2013-11/msg00008.html ff.
Vladimir Serbinenko refreshed his decision not to have a mountable ISO partition in grub-mkrescue ISOs: https://lists.gnu.org/archive/html/grub-devel/2013-11/msg00011.html

For a while there was a fix possible by setting the inappropriate properties to empty text, after they were imported:

ENV{DEVTYPE}=="partition", IMPORT{parent}="ID_*", \
  ENV{ID_FS_LABEL}="" , ENV{ID_FS_LABEL_ENC}="" , \
  ENV{ID_FS_TYPE}="" , ENV{ID_FS_USAGE}=""

But in April 2014 reports arose that this did not work any more. I find traces in my draft mail directory that i began to prepare a mail to Karel Zak. But i seem to have been distracted from sending it.

If the larger rule above could be revived and if udev would accept it as replacement of the current one, then the offending change in libblkid would not be necessary.

Have a nice day :)

Thomas
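To make the effect of the two rule variants concrete, here is a toy Python model of them. It is only a sketch under assumptions: it is not how udev is implemented, the function and variable names are hypothetical, and it assumes that a partition's own probe results are applied after the import and so take precedence.

```python
# Properties the larger rule resets to empty text after the import.
FS_KEYS = ("ID_FS_LABEL", "ID_FS_LABEL_ENC", "ID_FS_TYPE", "ID_FS_USAGE")


def partition_properties(parent: dict, own_probe: dict, blank_fs_keys: bool) -> dict:
    """Toy model of IMPORT{parent}="ID_*" for a partition device."""
    # Step 1: import every ID_* property from the parent (whole-disk) device.
    props = {k: v for k, v in parent.items() if k.startswith("ID_")}
    # Step 2 (larger rule only): blank the content properties of the base device.
    if blank_fs_keys:
        for key in FS_KEYS:
            props[key] = ""
    # Step 3 (assumption): the partition's own probe results override the rest.
    props.update(own_probe)
    return props


# Hypothetical parent disk carrying an ISO 9660 superblock.
parent = {
    "ID_BUS": "usb",
    "ID_FS_TYPE": "iso9660",
    "ID_FS_LABEL": "Fedora-WS-Live-rawh-20191213-n-0",
    "ID_FS_USAGE": "filesystem",
}

# An unformatted partition: nothing of its own was probed.
current_rule = partition_properties(parent, {}, blank_fs_keys=False)
larger_rule = partition_properties(parent, {}, blank_fs_keys=True)

if __name__ == "__main__":
    print(current_rule["ID_FS_TYPE"])       # inherited from the parent disk
    print(repr(larger_rule["ID_FS_TYPE"]))  # blanked by the larger rule
```

In this model the unformatted partition stops looking like an ISO 9660 filesystem under the larger rule while still inheriting device-level properties such as ID_BUS, which is the behaviour the rule was meant to give.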
Here https://github.com/karelzak/util-linux/pull/913 is an improved ISO prober from Daniel; it seems more robust than the original solution. It would be nice to test it with our Fedora stuff, volunteers? ;-)
The agetty crashes are still happening, so I filed a separate bug for that: https://bugzilla.redhat.com/show_bug.cgi?id=1784536 Karel, I can test that prober in openQA easily enough, but all openQA will do is test a Fedora image as a virtual optical device, it doesn't do anything else...
OK, well, Daniel's PR passes in that case. I'll manually test booting with it as a USB stick, I guess.
Hi,

one could test it as virtual CD-ROM:

qemu-system-x86_64 -enable-kvm -m 1024 -cdrom $iso

With iso=Fedora-Workstation-Live-x86_64-Rawhide-20191213.n.0.iso on elderly Debian i get to a graphical screen which offers me "Try Fedora".

With iso=Fedora-Workstation-Live-x86_64-Rawhide-20191212.n.1.iso i finally get to some text terminal prompt:

Warning: /dev/disk/by-label/... does not exist
Warning: /dev/root does not exist
...
dracut:/#

So the qemu run with PC-BIOS firmware seems to be able to demonstrate the problem.

Is the new ISO published somewhere ? I could test on real iron, too.

Have a nice day :)

Thomas
Thomas: that is how I tested it already (that's what openQA does). I'm also testing it manually as a USB stick at the moment. It's not...exactly...published anywhere, no.
Booting the image written to a USB stick works OK in both BIOS and UEFI modes.
Post a URL when published. I'll give it a run on DVD.

The test on USB stick would not show the original problem; it only confirms that there are no regressions.

I understand that the fix for DVD is in this piece of code in https://github.com/karelzak/util-linux/pull/913/commits/fc84bc0a463480ffb17a39b5375463b7f07d14ce

/* If no ISO9660 partition devno was found, consider the current device
 * as an appropriate owner of the filesystem. This can happen for CD/DVDs,
 * where partitions may exist in the table, but are not usually probed by
 * the kernel. */
if (!isopart_devno)
        return true;

At least this comment matches my suspicion about the reason for this bug report.
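As a way of spelling out that decision, here is a small Python sketch. It is not the PR's code: the function name is hypothetical, device numbers are modelled as (major, minor) tuples, and None stands in for "no ISO9660 partition devno was found". The first branch follows the quoted comment; the second branch is only my reading of the earlier commit message ("only expose the ISO metadata attributes on the specific partition that points to the ISO data").

```python
from typing import Optional, Tuple

Devno = Tuple[int, int]  # (major, minor), a simplification for this sketch


def exposes_iso_metadata(current: Devno, isopart: Optional[Devno]) -> bool:
    """Decide whether the device being probed should carry the ISO attributes."""
    # No partition device pointing at the ISO data exists, e.g. on CD/DVD
    # where the kernel does not usually probe the partition table:
    # the current device is the appropriate owner.
    if isopart is None:
        return True
    # Otherwise (assumption from the commit message) only the partition
    # that points at the ISO data exposes the attributes.
    return current == isopart


SR0 = (11, 0)    # /dev/sr0, optical drive, no partition devices
SDD = (8, 48)    # /dev/sdd, USB stick base device
SDD1 = (8, 49)   # /dev/sdd1, partition pointing at the ISO data

if __name__ == "__main__":
    print(exposes_iso_metadata(SR0, None))   # optical case
    print(exposes_iso_metadata(SDD, SDD1))   # USB base device
    print(exposes_iso_metadata(SDD1, SDD1))  # USB ISO partition
```

Under these assumptions /dev/sr0 keeps its ID_FS properties, which is what the by-label lookup during boot from optical media depends on.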
I think it's fine with the testing I've done, we've already established it works on optical and USB cases.
Fedora's util-linux has been upgraded to the current upstream git tree (with Daniel's https://github.com/karelzak/util-linux/pull/913 merged). Thanks Thomas and Adam for help with this issue!
This bug appears to have been reported against 'rawhide' during the Fedora 32 development cycle. Changing version to 32.
This message is a reminder that Fedora 32 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 32 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 32 changed to end-of-life (EOL) status on 2021-05-25. Fedora 32 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.