Bug 2010058 - F34/F35 5.14 kernels will not boot on AWS EC2 t2.small / c4.large / m4.large instances
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: dracut
Version: 34
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: dracut-maint-list
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: RejectedBlocker AcceptedFreezeException
Duplicates: 2004822 2013183 2014892 2040183
Depends On:
Blocks: F35FinalFreezeException
Reported: 2021-10-03 10:48 UTC by Richard Fearn
Modified: 2022-02-03 06:34 UTC (History)

Fixed In Version: dracut-055-5.fc35
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-21 23:17:40 UTC
Type: Bug
Embargoed:


Attachments
EC2 console log with 5.14.9-200 kernel (62.50 KB, text/plain)
2021-10-03 11:48 UTC, Richard Fearn

Description Richard Fearn 2021-10-03 10:48:45 UTC
1. Please describe the problem:

A t2.small EC2 instance will not boot with the 5.14.9-200.fc34.x86_64 kernel. 5.13.19-200.fc34.x86_64 is OK.

A t3.small EC2 instance starts OK with 5.14.9-200.


2. What is the Version-Release number of the kernel:

5.14.9-200


3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

I keep the instance up-to-date, so it's probably used every kernel that's been released as a F34 update. All kernels up to and including 5.13.19-200 were OK. 5.14.9-200 is the first that won't boot.


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

* Launch EC2 t2.small instance using Fedora 34 x86_64 Cloud Base AMI (I used London - ami-034794b0310a1d8b7)
* sudo dnf update --exclude kernel-core
* sudo reboot      # instance starts OK
* sudo dnf update  # updates kernel to 5.14.9-200.fc34
* sudo reboot      # instance does not start


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Will check this later


6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No


7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.


I don't think anything is being saved to the journal.

Comment 1 Richard Fearn 2021-10-03 11:01:05 UTC
> 5. Does this problem occur with the latest Rawhide kernel? To install the
>    Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
>    ``sudo dnf update --enablerepo=rawhide kernel``:

Replacing "kernel" with "kernel-core", this installed 5.15.0-0.rc3.20211001git4de593fb965f.30.fc36, and the instance would not boot.

Comment 2 Richard Fearn 2021-10-03 11:48:31 UTC
Created attachment 1828618 [details]
EC2 console log with 5.14.9-200 kernel

EC2 console log attached. It looks like the disk never appears.

Comment 3 Richard Fearn 2021-10-03 12:25:33 UTC
I think this might be because the 5.14.9-200 initramfs image doesn't include the xen-blkfront module:

  $ sudo lsinitrd /boot/initramfs-5.13.19-200.fc34.x86_64.img | grep blkfront
  -rw-r--r--   1 root     root        21276 Jul  9 15:18 usr/lib/modules/5.13.19-200.fc34.x86_64/kernel/drivers/block/xen-blkfront.ko.xz

  $ sudo lsinitrd /boot/initramfs-5.14.9-200.fc34.x86_64.img | grep blkfront
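
The manual check above can be wrapped in a small helper. This is a minimal sketch, not part of the original report: the function name `initramfs_has_module` is hypothetical, and it assumes the `lsinitrd` listing format shown above, where each module appears as a path ending in `.ko`, optionally followed by a compression suffix.

```shell
#!/bin/sh
# Hypothetical helper: succeed if an lsinitrd listing read from stdin
# contains the given kernel module (for example xen-blkfront).
initramfs_has_module() {
    # Match ".../<module>.ko" with an optional compression suffix at end of line.
    grep -Eq "/$1\.ko(\.(xz|gz|zst))?\$"
}

# Example (needs root and an existing image, so it is left commented out):
# sudo lsinitrd "/boot/initramfs-$(uname -r).img" | initramfs_has_module xen-blkfront \
#     && echo "xen-blkfront present" || echo "xen-blkfront MISSING"
```

Reading the listing from stdin keeps the helper independent of where the image lives, so the same check works on a mounted volume from another instance.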

Comment 4 Richard Fearn 2021-10-03 13:54:01 UTC
If I run:

  $ dracut -v --debug -f 5.13.19.img 5.13.19-200.fc34.x86_64 2> 5.13.19.err

the output contains:

  dracut-install: Handling /lib/modules/5.13.19-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz
  dracut-install: Module xen_blkfront: symbol blk_cleanup_queue matched inclusion filter
  dracut-install: dracut_install '/lib/modules/5.13.19-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz' '/lib/modules/5.13.19-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz'
  dracut-install: dracut_install('/lib/modules/5.13.19-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz', '/lib/modules/5.13.19-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz', 0, 0, 1)
  dracut-install: dracut_install ret = 0
  dracut-install: cp '/lib/modules/5.13.19-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz' '/var/tmp/dracut.L6QFqe/initramfs/lib/modules/5.13.19-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz'
  dracut-install: cp ret = 0
  dracut-install: dracut_install ret = 0
  dracut-install: dracut_install 'xen_blkfront' OK

whereas if I run:

  $ dracut -v --debug -f 5.14.9.img 5.14.9-200.fc34.x86_64 2> 5.14.9.err

the output contains:

  dracut-install: Handling /lib/modules/5.14.9-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz
  dracut-install: No symbol or path match for '/lib/modules/5.14.9-200.fc34.x86_64//kernel/drivers/block/xen-blkfront.ko.xz'

Going to move this bug to dracut, as it seems like the issue is dracut failing to realise that it needs to include xen-blkfront in the initramfs.
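
The comparison of the two debug logs above can be scripted. The following is a sketch only: the function name `dracut_module_verdict` and its output words are hypothetical, and it relies solely on the two `dracut-install` messages quoted in this comment, so other dracut versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: report how dracut-install treated a module, based on
# the two debug messages quoted above. Reads a `dracut --debug` stderr log.
dracut_module_verdict() {
    log=$1      # for example 5.14.9.err
    module=$2   # file-name form, for example xen-blkfront
    # The "matched inclusion filter" message uses the underscore form of the name.
    modsym=$(printf '%s' "$module" | sed 's/-/_/g')
    if grep -q "Module $modsym: symbol .* matched inclusion filter" "$log"; then
        echo included
    elif grep -q "No symbol or path match for '.*/$module\.ko" "$log"; then
        echo filtered-out
    else
        echo not-seen
    fi
}

# Example (commented out; uses the log file names from the commands above):
# dracut_module_verdict 5.13.19.err xen-blkfront
# dracut_module_verdict 5.14.9.err xen-blkfront
```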

Comment 5 Richard Fearn 2021-10-03 14:19:59 UTC
I guess this is because xen-blkfront stopped using blk_cleanup_queue between 5.13 and 5.14:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/block/xen-blkfront.c?id=3b62c140e93d32c825ed028faca45dee58dbe37f
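
One way to test this guess on an installed kernel is to list the symbols the module imports. This is a sketch under assumptions: the helper name `imports_symbol` is hypothetical, it expects `nm -u` output on stdin, and decompressing with `xz -dc` before running binutils `nm` is my choice of tooling rather than anything used in this bug.

```shell
#!/bin/sh
# Hypothetical helper: succeed if `nm -u` output on stdin lists SYMBOL
# as an undefined (imported) symbol.
imports_symbol() {
    awk -v sym="$1" '$1 == "U" && $2 == sym { found = 1 } END { exit !found }'
}

# Example (commented out; assumes xz and binutils are installed):
# ko=/lib/modules/5.14.9-200.fc34.x86_64/kernel/drivers/block/xen-blkfront.ko.xz
# tmp=$(mktemp) && xz -dc "$ko" > "$tmp"
# nm -u "$tmp" | imports_symbol blk_cleanup_queue && echo "still imported"
# rm -f "$tmp"
```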

Comment 6 David Baron 2021-10-04 19:24:20 UTC
I hit this bug as well.

This appears to be similar to https://bugzilla.redhat.com/show_bug.cgi?id=2004822 which links to https://bugzilla.opensuse.org/show_bug.cgi?id=1190326 and points to https://github.com/dracutdevs/dracut/commit/b292ce72 as an upstream fix.

Comment 7 David Baron 2021-10-04 19:39:00 UTC
Manually applying the upstream fix to /usr/lib/dracut/modules.d/90kernel-modules/module-setup.sh fixed the problem I was having.

Comment 8 Richard Fearn 2021-10-06 18:02:37 UTC
> Manually applying the upstream fix to /usr/lib/dracut/modules.d/90kernel-modules/module-setup.sh fixed the problem I was having.

Same here. Thanks for the links!

Comment 9 Christopher Tubbs 2021-10-07 14:47:51 UTC
I can't get the latest Fedora-Cloud-Base-34 images to boot on m5.2xlarge or m6i.2xlarge either. Would those be the same issue, or something else?

Comment 10 Jakub Čajka 2021-10-07 19:12:17 UTC
To note f34 cloud image based VMs are not affected even with 5.14 kernel. At least I haven't been able to reproduce this issue, I suspect that the btrfs might be a factor.

Comment 11 Jakub Čajka 2021-10-07 19:14:06 UTC
(In reply to Jakub Čajka from comment #10)
> To note f34 cloud image based VMs are not affected even with 5.14 kernel. At
> least I haven't been able to reproduce this issue, I suspect that the btrfs
> might be a factor.

Sorry, wrong BZ... I need more coffee.

Comment 12 Christopher Tubbs 2021-10-07 19:38:08 UTC
(In reply to Jakub Čajka from comment #10)
> To note f34 cloud image based VMs are not affected even with 5.14 kernel. At
> least I haven't been able to reproduce this issue, I suspect that the btrfs
> might be a factor.

The F34 instances don't have btrfs. They are ext4. Whatever the problem is, I can't get any up-to-date F34 image, whether it's the latest AMI, or a fully dnf-upgraded one, to boot on any EC2 instance type. Big big problem. Seems 100% reproducible. Wouldn't be surprised if others start reporting it after me. I don't really know how to troubleshoot it further.

Comment 13 David Baron 2021-10-08 00:43:48 UTC
I should perhaps clarify my comment 7 that what worked for me was manually applying the upstream fix to /usr/lib/dracut/modules.d/90kernel-modules/module-setup.sh , and then reinstalling the latest kernel.  The fix needs to be applied before the kernel package is installed.
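
A different way to avoid the ordering problem, and not the fix discussed in this bug, might be to force the driver into every initramfs through dracut's `add_drivers` configuration option. The sketch below is untested on affected instances; the function name `write_xen_dropin` and the drop-in file name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: write a dracut drop-in that always adds xen-blkfront.
# The directory defaults to /etc/dracut.conf.d; the file name is arbitrary.
write_xen_dropin() {
    conf_dir=${1:-/etc/dracut.conf.d}
    # Driver lists in dracut.conf are space-separated, padded with spaces.
    printf 'add_drivers+=" xen-blkfront "\n' > "$conf_dir/90-xen-blkfront.conf"
}

# Example (commented out; needs root):
# write_xen_dropin
```

The drop-in only takes effect the next time an initramfs is generated, so it would still need to be in place before the kernel package is installed or before dracut is rerun.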

Comment 14 Christopher Tubbs 2021-10-08 06:23:47 UTC
I manually applied the patch as well, and it seems to have fixed my instances running on m5 instance types. I was unable to run on m5a or m6i instance types, but I suspect that's for a different reason (possibly the new UEFI features Amazon now supports; these instances default to UEFI mode if the AMI doesn't specify, and Fedora Cloud Base images don't specify a boot mode).

Comment 15 Saswat Padhi 2021-10-12 23:13:58 UTC
(In reply to David Baron from comment #13)
> I should perhaps clarify my comment 7 that what worked for me was manually
> applying the upstream fix to
> /usr/lib/dracut/modules.d/90kernel-modules/module-setup.sh , and then
> reinstalling the latest kernel.  The fix needs to be applied before the
> kernel package is installed.

Alternatively, you could rerun dracut, e.g. `dracut -f /boot/initramfs-5.10.14... 5.10.14...`, after applying the fix.

The fix works!
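
The rerun suggested above can be sketched as a pair of helpers. The function names are hypothetical, and the image path follows the `/boot/initramfs-<version>.img` naming seen in the `lsinitrd` commands earlier in this bug; other distributions may name the file differently.

```shell
#!/bin/sh
# Hypothetical helpers: rebuild the initramfs for one kernel version.
initramfs_path() {
    # Fedora-style image name, as seen earlier in this bug.
    printf '/boot/initramfs-%s.img\n' "$1"
}

regen_initramfs() {
    kver=$1
    # -f overwrites the existing image; run as root.
    dracut -f "$(initramfs_path "$kver")" "$kver"
}

# Example (commented out; needs root and the patched module-setup.sh):
# regen_initramfs 5.14.9-200.fc34.x86_64
```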

Comment 16 Fedora Blocker Bugs Application 2021-10-14 18:45:41 UTC
Proposed as a Blocker for 35-final by Fedora user richardfearn using the blocker tracking app because:

 The F35 Final Release Criteria includes:

"Release-blocking cloud disk images must be published to Amazon EC2 as AMIs, and these must boot successfully and meet other relevant release criteria on at least one KVM-based x86 instance type, at least one KVM-based aarch64 instance type, and at least one Xen-based x86 instance type."

Because the initramfs doesn't include the xen-blkfront driver, the latest F35 AMI doesn't boot on at least t2.small / c4.large / m4.large instances.

I tested in eu-west-2 using Fedora-Cloud-Base-35-20211014.n.0.x86_64-hvm-eu-west-2-gp2-0 (ami-04eebe63502fadb08).

Comment 17 Vitaly Kuznetsov 2021-10-18 12:01:17 UTC
*** Bug 2014892 has been marked as a duplicate of this bug. ***

Comment 18 Geoffrey Marr 2021-10-18 20:47:50 UTC
Discussed during the 2021-10-18 blocker review meeting: [0]

The decision to classify this bug as a "RejectedBlocker (Final)" and an "AcceptedFreezeException (Final)" was made as this does not violate the criterion, as we do have at least one working instance type per arch (which is all the criterion requires). But it's clearly a big problem worth fixing for release (and we intend to do so).

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2021-10-18/f35-blocker-review.2021-10-18-16.00.txt

Comment 19 Fedora Update System 2021-10-18 23:19:54 UTC
FEDORA-2021-5918c936f8 has been submitted as an update to Fedora 35. https://bodhi.fedoraproject.org/updates/FEDORA-2021-5918c936f8

Comment 20 Christopher Tubbs 2021-10-19 01:41:32 UTC
On which instances is it confirmed to work? I couldn't find any that worked consistently. I had to detach volumes, attach to a working outdated instance and chroot to apply the workaround. If there was a known instance that worked, I would have just changed the instance type and rebooted to apply the fix.

Comment 21 Adam Williamson 2021-10-19 06:38:16 UTC
The OP reported that t3 works, but we don't seem to have full info across all instance types indeed. I was kinda hoping to get it before we voted, but the message got mixed along the way (it was a busy morning).

It should be a bit academic in any case, as the bug was granted an FE and I already submitted the update. It'd be good if people could verify that the update works, then we can push it stable.

Comment 22 Adam Williamson 2021-10-19 06:39:56 UTC
thinking about it, I'd guess the nature of the bug implies that all Xen-type instances are likely affected, so it probably *should* be a blocker. If it becomes important I'll get it revoted, it'd be good if someone could confirm that's the case.

Comment 23 Christopher Tubbs 2021-10-19 06:52:13 UTC
When I tried a t3 instance type, it did not work. I also had a strange experience where one of my m5.2xlarge instances seemed to be fine after upgrading and rebooting. However, it then stopped working after another reboot a few days later. Are we sure the hypervisor type is part of the instance type contract? In other words, are all m4's Xen-based? Or can it vary? If it varies, then it seems the error could happen on any instance type.

Between this bug, and the fact that Fedora's Cloud Base AMIs aren't explicitly marked for "legacy-bios" boot mode, which prevents them from booting on newer instance types that default to "uefi" boot mode, there seem to be strikingly few instance types where one *can* run an up-to-date Fedora instance in EC2 out-of-the-box.

I'm not familiar with the blocker process, but I do hope this gets fixed before I upgrade to F35 images.
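
On the question above of which hypervisor an instance is actually running on, a quick check from inside the guest is possible. This is a sketch, not something used in this bug: the function name `guest_hypervisor` is hypothetical, and it assumes that Xen guests expose `/sys/hypervisor/type` while other guests typically do not.

```shell
#!/bin/sh
# Hypothetical helper: print the hypervisor a guest reports via sysfs.
# The optional argument overrides the sysfs path (useful for testing).
guest_hypervisor() {
    sysfs_file=${1:-/sys/hypervisor/type}
    if [ -r "$sysfs_file" ]; then
        cat "$sysfs_file"      # expected to read "xen" on Xen guests
    else
        echo unknown           # file not present or not readable
    fi
}

# From outside the instance, the EC2 API reports a hypervisor per instance type
# (commented out; assumes a configured AWS CLI):
# aws ec2 describe-instance-types --instance-types t2.small m5.2xlarge \
#     --query 'InstanceTypes[].[InstanceType,Hypervisor]' --output text
```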

Comment 24 Lou Duchez 2021-10-19 11:09:50 UTC
I'm the person who reported that, of the three types of servers I manage (standalone hardware / VMware / Xen), 5.14.11 would not boot on Xen but booted fine on the others. I then followed the suggestion to regenerate the initramfs with dracut manually on the Xen servers, and they booted.

This morning, I successfully updated four Xen servers to 5.14.12, and they booted without an issue. They were all servers where I'd previously done the manual dracut, so I don't know whether that's a factor.

HOWEVER, I also tested 5.14.12 on some of my lower-impact VMware servers, and they aren't coming up at all. I can't even reboot them. It may take me a bit to figure out what's happening here, and this may be a false alarm of some kind (our VMware has been known to lock up before). But I wanted to get this out there ASAP just in case there is a legit issue with 5.14.12 and VMware.

Comment 25 David Baron 2021-10-19 12:59:17 UTC
It would also be good if the fix gets pushed to F34, given that this is also a problem on F34 with all updates applied.

Comment 26 Lou Duchez 2021-10-19 13:21:53 UTC
A further update on VMware: I rebooted the VMware server itself (the one that's managing the virtual servers), and since doing that, I have been able to update to 5.14.12 on other virtual servers without incident. They reboot just fine. The virtual servers that wouldn't come up before are coming up fine now too. So I conclude that the issues were 100% our VMware being in a flaky condition, and not 5.14.12.

Comment 27 Adam Williamson 2021-10-19 21:30:22 UTC
David: oh, yeah, good point. I'll backport it to F34 too. Thanks.

Comment 28 Fedora Update System 2021-10-19 22:19:47 UTC
FEDORA-2021-e4843341ca has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-e4843341ca

Comment 29 Fedora Update System 2021-10-20 13:47:43 UTC
FEDORA-2021-5918c936f8 has been pushed to the Fedora 35 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-5918c936f8`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-5918c936f8

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 30 Fedora Update System 2021-10-20 20:03:20 UTC
FEDORA-2021-e4843341ca has been pushed to the Fedora 34 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-e4843341ca`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-e4843341ca

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 31 Richard Fearn 2021-10-20 22:36:41 UTC
(In reply to Adam Williamson from comment #22)
> thinking about it, I'd guess the nature of the bug implies that all Xen-type
> instances are likely affected, so it probably *should* be a blocker. If it
> becomes important I'll get it revoted, it'd be good if someone could confirm
> that's the case.

That's what I thought. I proposed it as a blocker because the release criteria say:

> must boot successfully and meet other relevant release criteria on at least one KVM-based x86 instance type, at least one KVM-based aarch64 instance type, and at least one Xen-based x86 instance type

and given that the problem was due to a Xen driver not being included in the initramfs, I thought it would *not* be possible to boot it on "at least one Xen-based x86 instance type". (I only tried t2.small / c4.large / m4.large, though.)

Adam - thank you very much for backporting the fix! I've tested it on a t2.small instance, and it works fine. The driver gets included in the initramfs, and the instance starts up.

Comment 32 Adam Williamson 2021-10-20 23:14:14 UTC
For the record, the fix is also in F35 Final RC1 (and will be in all future RCs unless something turns out to be wildly wrong with it).

Comment 33 Fedora Update System 2021-10-21 23:17:40 UTC
FEDORA-2021-5918c936f8 has been pushed to the Fedora 35 stable repository.
If the problem still persists, please make a note of it in this bug report.

Comment 34 Pavel Raiskup 2021-10-23 18:24:20 UTC
*** Bug 2013183 has been marked as a duplicate of this bug. ***

Comment 35 Fedora Update System 2021-10-28 19:30:57 UTC
FEDORA-2021-e4843341ca has been pushed to the Fedora 34 stable repository.
If the problem still persists, please make a note of it in this bug report.

Comment 36 Dennis Glindhart 2021-10-29 21:33:21 UTC
*** Bug 2004822 has been marked as a duplicate of this bug. ***

Comment 37 Orange Kao 2022-02-03 02:40:51 UTC
*** Bug 2040183 has been marked as a duplicate of this bug. ***

Comment 38 Pavel Raiskup 2022-02-03 06:34:32 UTC
So if bug 2040183 is a duplicate, has this actually regressed back?  I thought
dracut-055-5.fc35 fixed this.  I'm not sure what I am missing, but perhaps
bug 2047266 is a duplicate of this, too.

