Bug 2047266 - Instance i3.large fails to spawn in AWS with kernel 5.15.16
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 35
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-27 13:38 UTC by Pavel Raiskup
Modified: 2022-02-03 13:45 UTC

Fixed In Version: kernel-core-5.15.18-200.fc35.x86_64
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-02-03 08:19:24 UTC
Type: Bug
Embargoed:



Description Pavel Raiskup 2022-01-27 13:38:39 UTC
We create our own ami-* images for Fedora Copr in AWS by:

1) starting an instance of type i3.large (475G nvme storage) from
   the officially provided ami- mentioned on getfedora.org

2) updating all packages, this updates kernel-core package from
   5.14.10-300.fc35.x86_64 to 5.15.16-200.fc35.x86_64

3) we install some other packages, and modify some (not kernel related)
   files

4) we create a snapshot ami from that instance

Instances started from the updated image with the 5.15.16-200 kernel fail
to boot in about 80% of cases.  An instance either comes up very quickly
(~20% of cases) or hangs indefinitely without booting (~80% of cases).

The console on i3.* machines is not available, but when a machine fails
to boot, the screenshot from the machine shows errors like:

[  123.840114] BTRFS info (device xvda5): enabling ssd optimizations
....
[  128.991608] nvme nvme0: I/O 25 QID 0 timeout, completion polled

Note that this is ~130 seconds after the machine started.
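
As a rough illustration of reading the timestamps above, the sketch below
pulls the seconds-since-boot value out of a kernel log line and computes the
gap between the two messages. The helper name kmsg_seconds is made up for
this example and is not part of any tool; it assumes the usual
"[ seconds.micros] message" prefix.

```shell
# Hypothetical helper: print the seconds-since-boot timestamp of each
# kernel log line read from stdin (assumes a "[  123.456789] ..." prefix).
kmsg_seconds() {
    sed -n 's/^\[ *\([0-9][0-9]*\.[0-9][0-9]*\)\].*/\1/p'
}

first=$(printf '%s\n' '[  123.840114] BTRFS info (device xvda5): enabling ssd optimizations' | kmsg_seconds)
last=$(printf '%s\n' '[  128.991608] nvme nvme0: I/O 25 QID 0 timeout, completion polled' | kmsg_seconds)

# Seconds elapsed between the two messages.
gap=$(awk -v a="$first" -v b="$last" 'BEGIN { printf "%.3f\n", b - a }')
echo "$gap"
```

The gap between the two messages is only a few seconds; the notable part is
that both appear more than two minutes into the boot.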

Comment 1 Pavel Raiskup 2022-01-27 13:42:05 UTC
If we prepare the image in exactly the same way but keep the original kernel,
everything works fine (it boots in 100% of cases).

Comment 2 Orange Kao 2022-02-03 07:29:44 UTC
Hello.

I tried ami-02ace19e7faa2ba49 on i3.large
(Fedora-Cloud-Base-35-1.2.x86_64-hvm-ap-southeast-2-gp2-0).
I ran "sudo dnf update" (this upgrades the kernel to 5.15.18-200)
and "sudo shutdown -r now", and the VM booted.


Would you like to try "sudo dracut --regenerate-all --force"
before creating the snapshot? I learned (from bug #2040183) that if the kernel
is updated before dracut is updated, the old initramfs is not regenerated,
which results in a boot failure.


You may also try the following steps to see if "xen-blkfront" kernel module is
in the initramfs. If it's not in initramfs, i3.large won't boot.

sudo dnf -y install binwalk
# find out the start of gzip compressed data, like
# 31744         0x7C00          gzip compressed data, maximum compression...
sudo binwalk /boot/initramfs-5.15.18-200.fc35.x86_64.img
mkdir temp
cd temp
# specify bs according to binwalk output
sudo dd if=/boot/initramfs-5.15.18-200.fc35.x86_64.img bs=31744 skip=1 | gzip -d | cpio -idmv
# search the extracted tree for the module
find . -name xen-blkfront.ko.xz
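
A possibly simpler check, sketched here under the assumption that the
lsinitrd tool shipped with dracut is installed: list the initramfs contents
and search the listing for the module. The function name
listing_has_module is hypothetical, and the set of compression suffixes it
accepts is a guess at what may appear.

```shell
# Hypothetical helper: succeed if the file listing on stdin contains the
# named kernel module, compressed or not (e.g. .../xen-blkfront.ko.xz).
listing_has_module() {
    grep -Eq "/$1\.ko(\.(xz|gz|zst))?\$"
}

# Usage on a real system (assumes lsinitrd from dracut is available):
#   sudo lsinitrd /boot/initramfs-5.15.18-200.fc35.x86_64.img \
#       | listing_has_module xen-blkfront && echo "module present"
```

Keeping the match in a function that reads a listing from stdin means the
same check also works on the output of the find command above.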


Thank you.

Comment 3 Pavel Raiskup 2022-02-03 08:19:24 UTC
(In reply to Orange Kao from comment #2)
> I tried ami-02ace19e7faa2ba49 on i3.large
> (Fedora-Cloud-Base-35-1.2.x86_64-hvm-ap-southeast-2-gp2-0).
> I ran "sudo dnf update" (this upgrades the kernel to 5.15.18-200)
> and "sudo shutdown -r now", and the VM booted.

I have now tested with 'kernel-core-5.15.18-200.fc35.x86_64' and it does
seem to work: 4/4 machines started on the first attempt.
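
To put the 4/4 result in perspective: if each boot still succeeded only about
20% of the time, as observed with 5.15.16, four first-attempt successes in a
row would be unlikely. A small sketch of that arithmetic, assuming the boots
are independent (the function name is made up for illustration):

```shell
# Probability that n independent boots all succeed, given a per-boot
# success rate p. Both the independence assumption and the function
# name are illustrative only.
all_boot_probability() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", p ^ n }'
}

all_boot_probability 0.2 4   # prints 0.0016
```

Four machines is a small sample, but a chance of roughly 0.16% under the old
failure rate suggests the behaviour really did change with 5.15.18.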

> Would you like to try "sudo dracut --regenerate-all --force"
> before creating the snapshot? I learned (from bug #2040183) that if the kernel
> is updated before dracut is updated, the old initramfs is not regenerated,
> which results in a boot failure.

It seems to work now, though.

> You may also try the following steps to see if "xen-blkfront" kernel
> module is in the initramfs. If it's not in initramfs, i3.large won't
> boot.

I doubt _this_ is the reason.  If it were, the boot failure would be 100%
reproducible, right?  About 20% of machines booted correctly.

Comment 4 Dusty Mabe 2022-02-03 13:45:46 UTC
Take a look at https://bugzilla.redhat.com/show_bug.cgi?id=2040360 and https://github.com/coreos/fedora-coreos-tracker/issues/1066 - I suspect this is your issue.

