1. Please describe the problem:

A t2.small EC2 instance will not boot with the 5.14.9-200.fc34.x86_64 kernel. 5.13.19-200.fc34.x86_64 is OK. A t3.small EC2 instance starts OK with 5.14.9-200.

2. What is the Version-Release number of the kernel:

5.14.9-200

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

I keep the instance up-to-date, so it's probably used every kernel that's been released as an F34 update. All kernels up to and including 5.13.19-200 were OK. 5.14.9-200 is the first that won't boot.

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:

* Launch an EC2 t2.small instance using the Fedora 34 x86_64 Cloud Base AMI (I used London - ami-034794b0310a1d8b7)
* sudo dnf update --exclude kernel-core
* sudo reboot   # instance starts OK
* sudo dnf update   # updates kernel to 5.14.9-200.fc34
* sudo reboot   # instance does not start

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:

Will check this later.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.

I don't think anything is being saved to the journal.
> 5. Does this problem occur with the latest Rawhide kernel? To install the
> Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
> ``sudo dnf update --enablerepo=rawhide kernel``:

Replacing "kernel" with "kernel-core", this installed 5.15.0-0.rc3.20211001git4de593fb965f.30.fc36, and the instance would not boot.
Created attachment 1828618 [details]
EC2 console log with 5.14.9-200 kernel

EC2 console log attached - it looks like the disk never appears.
I think this might be because the 5.14.9-200 initramfs image doesn't include the xen-blkfront module:

$ sudo lsinitrd /boot/initramfs-5.13.19-200.fc34.x86_64.img | grep blkfront
-rw-r--r-- 1 root root 21276 Jul 9 15:18 usr/lib/modules/5.13.19-200.fc34.x86_64/kernel/drivers/block/xen-blkfront.ko.xz

$ sudo lsinitrd /boot/initramfs-5.14.9-200.fc34.x86_64.img | grep blkfront
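For anyone wanting to check all of their installed images at once, a small loop over /boot works. This is a hedged sketch, assuming the standard Fedora layout (/boot/initramfs-<version>.img) and that `lsinitrd` is installed; it needs root to read the images.

```shell
#!/bin/sh
# Sketch: report which initramfs images in /boot lack xen-blkfront.
# Assumes the standard Fedora /boot layout; run check_images as root.

# Pure helper: does a file listing (on stdin) mention the module?
has_module() {
    grep -q "$1"
}

check_images() {
    for img in /boot/initramfs-*.img; do
        if lsinitrd "$img" | has_module 'xen-blkfront\.ko'; then
            echo "OK:      $img"
        else
            echo "MISSING: $img"
        fi
    done
}
```

On an affected instance, running `check_images` as root should print MISSING for the 5.14.9-200 image and OK for 5.13.19-200.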
If I run:

$ dracut -v --debug -f 5.13.19.img 5.13.19-200.fc34.x86_64 2> 5.13.19.err

the output contains:

dracut-install: Handling /lib/modules/5.13.19-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz
dracut-install: Module xen_blkfront: symbol blk_cleanup_queue matched inclusion filter
dracut-install: dracut_install '/lib/modules/5.13.19-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz' '/lib/modules/5.13.19-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz'
dracut-install: dracut_install('/lib/modules/5.13.19-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz', '/lib/modules/5.13.19-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz', 0, 0, 1)
dracut-install: dracut_install ret = 0
dracut-install: cp '/lib/modules/5.13.19-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz' '/var/tmp/dracut.L6QFqe/initramfs/lib/modules/5.13.19-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz'
dracut-install: cp ret = 0
dracut-install: dracut_install ret = 0
dracut-install: dracut_install 'xen_blkfront' OK

whereas if I run:

$ dracut -v --debug -f 5.14.9.img 5.14.9-200.fc34.x86_64 2> 5.14.9.err

the output contains:

dracut-install: Handling /lib/modules/5.14.9-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz
dracut-install: No symbol or path match for '/lib/modules/5.14.9-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz'

Going to move this bug to dracut, as it seems like the issue is dracut failing to realise that it needs to include xen-blkfront in the initramfs.
I guess this is because xen-blkfront stopped using blk_cleanup_queue between 5.13 and 5.14: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/block/xen-blkfront.c?id=3b62c140e93d32c825ed028faca45dee58dbe37f
I hit this bug as well. This appears to be similar to https://bugzilla.redhat.com/show_bug.cgi?id=2004822 which links to https://bugzilla.opensuse.org/show_bug.cgi?id=1190326 and points to https://github.com/dracutdevs/dracut/commit/b292ce72 as an upstream fix.
Manually applying the upstream fix to /usr/lib/dracut/modules.d/90kernel-modules/module-setup.sh fixed the problem I was having.
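A quick way to tell whether an installed dracut already contains that fix is to grep module-setup.sh for one of the new filter symbols. This is a hedged sketch: my reading of the upstream commit (b292ce72) is that it adds blk_mq_alloc_disk (among others) to the symbol inclusion filter, but check the commit itself to be sure.

```shell
#!/bin/sh
# Hedged check for the upstream dracut fix. Assumes the fix works by
# adding blk_mq_alloc_disk to the symbol filter in module-setup.sh
# (my reading of upstream commit b292ce72).

has_fix() {
    # $1 = path to 90kernel-modules/module-setup.sh
    grep -q 'blk_mq_alloc_disk' "$1"
}

# Usage (standard Fedora location):
# has_fix /usr/lib/dracut/modules.d/90kernel-modules/module-setup.sh \
#     && echo "fix present" || echo "fix missing"
```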
> Manually applying the upstream fix to
> /usr/lib/dracut/modules.d/90kernel-modules/module-setup.sh fixed the problem
> I was having.

Same here. Thanks for the links!
I can't get the latest Fedora-Cloud-Base-34 images to boot on m5.2xlarge or m6i.2xlarge either. Would those be the same issue, or something else?
To note, F34 cloud-image-based VMs are not affected even with the 5.14 kernel. At least, I haven't been able to reproduce this issue; I suspect that btrfs might be a factor.
(In reply to Jakub Čajka from comment #10)
> To note f34 cloud image based VMs are not affected even with 5.14 kernel. At
> least I haven't been able to reproduce this issue, I suspect that the btrfs
> might be a factor.

Sorry, wrong BZ... I need more coffee.
(In reply to Jakub Čajka from comment #10)
> To note f34 cloud image based VMs are not affected even with 5.14 kernel. At
> least I haven't been able to reproduce this issue, I suspect that the btrfs
> might be a factor.

The F34 instances don't have btrfs; they are ext4.

Whatever the problem is, I can't get any up-to-date F34 image - whether it's the latest AMI, or a fully dnf-upgraded one - to boot on any EC2 instance type. It's a big problem, and it seems 100% reproducible. I wouldn't be surprised if others start reporting it after me. I don't really know how to troubleshoot it further.
I should perhaps clarify my comment 7 that what worked for me was manually applying the upstream fix to /usr/lib/dracut/modules.d/90kernel-modules/module-setup.sh , and then reinstalling the latest kernel. The fix needs to be applied before the kernel package is installed.
I manually applied the patch as well, and it seems to have fixed my instances running on m5 instance types. I was unable to run on m5a or m6i instance types, but I suspect that's for a different reason (possibly the new UEFI features Amazon now supports; these instances default to UEFI mode if the AMI doesn't specify, and Fedora Cloud Base images don't specify a boot mode).
(In reply to David Baron from comment #13)
> I should perhaps clarify my comment 7 that what worked for me was manually
> applying the upstream fix to
> /usr/lib/dracut/modules.d/90kernel-modules/module-setup.sh , and then
> reinstalling the latest kernel. The fix needs to be applied before the
> kernel package is installed.

Alternatively, you could rerun dracut, e.g. `dracut -f /boot/initramfs-5.10.14... 5.10.14...`, after applying the fix. The fix works!
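The rerun-dracut approach can be generalized to every installed kernel in one go. A hedged sketch, assuming the standard Fedora /boot layout and a dracut recent enough to support `--regenerate-all`; both commands need root.

```shell
#!/bin/sh
# Sketch: after patching module-setup.sh, rebuild initramfs images
# instead of reinstalling the kernel package.

# Pure helper: default initramfs path for a kernel version (assumes
# the standard Fedora /boot layout).
initramfs_path() {
    printf '/boot/initramfs-%s.img' "$1"
}

# Rebuild every installed kernel's image (uncomment; run as root):
# dracut --regenerate-all -f

# Or rebuild a single kernel's image:
# dracut -f "$(initramfs_path 5.14.9-200.fc34.x86_64)" 5.14.9-200.fc34.x86_64
```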
Proposed as a Blocker for 35-final by Fedora user richardfearn using the blocker tracking app because: The F35 Final Release Criteria includes: "Release-blocking cloud disk images must be published to Amazon EC2 as AMIs, and these must boot successfully and meet other relevant release criteria on at least one KVM-based x86 instance type, at least one KVM-based aarch64 instance type, and at least one Xen-based x86 instance type." Because the initramfs doesn't include the xen-blkfront driver, the latest F35 AMI doesn't boot on at least t2.small / c4.large / m4.large instances. I tested in eu-west-2 using Fedora-Cloud-Base-35-20211014.n.0.x86_64-hvm-eu-west-2-gp2-0 (ami-04eebe63502fadb08).
*** Bug 2014892 has been marked as a duplicate of this bug. ***
Discussed during the 2021-10-18 blocker review meeting: [0] The decision to classify this bug as a "RejectedBlocker (Final)" and an "AcceptedFreezeException (Final)" was made as this does not violate the criterion, as we do have at least one working instance type per arch (which is all the criterion requires). But it's clearly a big problem worth fixing for release (and we intend to do so). [0] https://meetbot.fedoraproject.org/fedora-blocker-review/2021-10-18/f35-blocker-review.2021-10-18-16.00.txt
FEDORA-2021-5918c936f8 has been submitted as an update to Fedora 35. https://bodhi.fedoraproject.org/updates/FEDORA-2021-5918c936f8
On which instances is it confirmed to work? I couldn't find any that worked consistently. I had to detach volumes, attach to a working outdated instance and chroot to apply the workaround. If there was a known instance that worked, I would have just changed the instance type and rebooted to apply the fix.
The OP reported that t3 works, but indeed we don't seem to have full info across all instance types. I was kinda hoping to get that before we voted, but the message got mixed up along the way (it was a busy morning). It should be a bit academic in any case, as the bug was granted an FE and I've already submitted the update. It'd be good if people could verify that the update works; then we can push it stable.
Thinking about it, I'd guess the nature of the bug implies that all Xen-type instances are likely affected, so it probably *should* be a blocker. If it becomes important I'll get it revoted; it'd be good if someone could confirm that's the case.
When I tried a t3 instance type, it did not work. I also had a strange experience where one of my m5.2xlarge instances seemed to be fine after upgrading and rebooting. However, it then stopped working after another reboot a few days later.

Are we sure the hypervisor type is part of the instance type contract? In other words, are all m4s Xen-based? Or can it vary? If it varies, then it seems the error could happen on any instance type.

Between this bug, and the fact that Fedora's Cloud Base AMIs aren't explicitly marked for "legacy-bios" boot mode (which prevents them from booting on newer instance types that default to "uefi" boot mode), there seem to be strikingly few instance types where one *can* run an up-to-date Fedora instance in EC2 out-of-the-box.

I'm not familiar with the blocker process, but I do hope this gets fixed before I upgrade to F35 images.
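On the hypervisor question: the EC2 API does report a hypervisor per instance type, so something like the following should show which types are Xen-based. A hedged sketch - it assumes a configured AWS CLI and that the Hypervisor field is "xen" or "nitro", which matches the describe-instance-types documentation as I read it.

```shell
#!/bin/sh
# Sketch: list which of a few EC2 instance types are Xen-based.

# Pure helper: given "type hypervisor" lines on stdin, keep the
# type names whose hypervisor is xen.
xen_only() {
    awk '$2 == "xen" { print $1 }'
}

# Query and filter (uncomment with the AWS CLI configured; --output
# text emits tab-separated "type<TAB>hypervisor" lines):
# aws ec2 describe-instance-types \
#     --instance-types t2.small m4.large m5.large t3.small \
#     --query 'InstanceTypes[].[InstanceType,Hypervisor]' \
#     --output text | xen_only
```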
I'm the person who reported that, of the three types of servers he manages (standalone hardware / VMware / Xen), 5.14.11 would not boot on Xen but booted fine on the others. I then followed the suggestion to regenerate the initramfs with dracut manually on the Xen servers, and they booted.

This morning, I successfully updated four Xen servers to 5.14.12, and they booted without an issue. They were all servers where I'd previously run dracut manually, so I don't know whether that's a factor.

HOWEVER - I also tested 5.14.12 on some of my lower-impact VMware servers, and they aren't coming up at all. I can't even reboot them. It may take me a bit to figure out what's happening here, and this may be a false alarm of some kind (our VMware has been known to lock up before). But I wanted to get this out there ASAP just in case there is a legit issue with 5.14.12 and VMware.
It would also be good if the fix gets pushed to F34, given that this is also a problem on F34 with all updates applied.
... a further update on VMware. I rebooted the VMware server itself (the one that's managing the virtual servers), and since doing that, I have been able to update to 5.14.12 on other virtual servers without incident. They reboot just fine. And the virtual servers that wouldn't come up before are coming up fine now too. So I conclude that the issues were 100% our VMware being in a flaky condition, and not 5.14.12.
David: oh, yeah, good point. I'll backport it to F34 too. Thanks.
FEDORA-2021-e4843341ca has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-e4843341ca
FEDORA-2021-5918c936f8 has been pushed to the Fedora 35 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-5918c936f8` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-5918c936f8 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2021-e4843341ca has been pushed to the Fedora 34 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-e4843341ca` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-e4843341ca See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
(In reply to Adam Williamson from comment #22)
> thinking about it, I'd guess the nature of the bug implies that all Xen-type
> instances are likely affected, so it probably *should* be a blocker. If it
> becomes important I'll get it revoted, it'd be good if someone could confirm
> that's the case.

That's what I thought. I proposed it as a blocker because the release criteria say:

> must boot successfully and meet other relevant release criteria on at least
> one KVM-based x86 instance type, at least one KVM-based aarch64 instance
> type, and at least one Xen-based x86 instance type

and given that the problem was due to a Xen driver not being included in the initramfs, I thought it would *not* be possible to boot it on "at least one Xen-based x86 instance type". (I only tried t2.small / c4.large / m4.large, though.)

Adam - thank you very much for backporting the fix! I've tested it on a t2.small instance, and it works fine. The driver gets included in the initramfs, and the instance starts up.
For the record, the fix is also in F35 Final RC1 (and will be in all future RCs unless something turns out to be wildly wrong with it).
FEDORA-2021-5918c936f8 has been pushed to the Fedora 35 stable repository. If problem still persists, please make note of it in this bug report.
*** Bug 2013183 has been marked as a duplicate of this bug. ***
FEDORA-2021-e4843341ca has been pushed to the Fedora 34 stable repository. If problem still persists, please make note of it in this bug report.
*** Bug 2004822 has been marked as a duplicate of this bug. ***
*** Bug 2040183 has been marked as a duplicate of this bug. ***
So if bug 2040183 is a duplicate, has this actually regressed back? I thought dracut-055-5.fc35 fixed this. I'm not sure what I am missing, but perhaps bug 2047266 is a duplicate of this, too.