Bug 2221531 - kernel 6.3.x cannot boot in Hyper-V
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2023-07-09 23:16 UTC by Kamil J. Dudek
Modified: 2023-08-16 18:49 UTC (History)

Last Closed: 2023-08-08 13:12:58 UTC

Description Kamil J. Dudek 2023-07-09 23:16:23 UTC
Kernel 6.3.8 and newer cannot boot in Hyper-V: the boot freezes immediately, before any message is displayed, even in debug mode (so no kernel panic output etc.). The only result of booting a 6.3.x kernel is a non-blinking _ cursor in the top-left corner.

The problem occurs on 6.3.8-100.fc37, the latest 6.3.x-200.fc38, and the rawhide 6.5.0-0.rc0.20230706gitc17414a273b8.12.fc39. Booting 6.2 still works.

The underlying Hyper-V has not been reconfigured, and the Windows system has not been updated in the meantime. While it is true that Hyper-V does not explicitly support Fedora (as in Microsoft Learn "Supported Linux and FreeBSD virtual machines for Hyper-V on Windows Server and Windows"), Fedora does have hyperv-daemons. It is therefore some kind of regression in Hyper-V support and not some new bug introduced in Windows. I am not sure if this bug report is acceptable and if what I am attempting is a supported use case, but I decided to file this report anyway.

I don't know if 6.3.8 is the exact version that introduced the problem or if it already occurred in an earlier 6.3, but every 6.2 works. I have not tried other 6.3 kernels on Fedora (I can if instructed to do so). However, I tried 6.3.9-1-default on openSUSE Tumbleweed, where it also freezes immediately, while 6.3.2-1-default on Tumbleweed works. Ubuntu Server 23.10 with 6.3.0-7-generic works as well.

There is no report in Microsoft Answers about it, only a single question from Jul 02 2023 09:21 PM about the same thing, with no answers.

Reproducible: Always

Steps to Reproduce:
1. Install Fedora 37, 38 or Rawhide in Hyper-V 11.0 Generation 2 VM on Windows Build 22621 (current)
2. Run dnf -y update
3. Reboot
4. Attempt to boot from the newest kernel
Actual Results:
Booting from the 6.3 kernel in Hyper-V results in a freeze

Expected Results:  
Booting from the 6.3 kernel in Hyper-V should succeed

Comment 1 Kamil J. Dudek 2023-07-12 23:45:53 UTC
The kernel 6.5.0-0.rc1.11.fc39.x86_64 also has this problem.

Comment 2 Michael Kelley 2023-07-14 18:36:37 UTC
I'm working on debugging this issue.

Comment 3 Michael Kelley 2023-07-15 05:30:08 UTC
The failure to boot is due to the Linux kernel taking a panic early during the boot process.  The panic is due to an exception generated by the Intel Indirect Branch Tracking (IBT) hardware feature.  See https://lwn.net/Articles/889475/.  In order for this feature to be enabled in the Linux kernel, the kernel must be built with CONFIG_X86_KERNEL_IBT.

I was able to reproduce the problem with the 6.3.8 kernel from Fedora.  Then I grabbed the source code for the 6.3.4 kernel from kernel.org, and built my own kernel *without* CONFIG_X86_KERNEL_IBT.  This kernel boots and runs.  Then I built the same kernel source code with CONFIG_X86_KERNEL_IBT=y, and this kernel fails like the Fedora 6.3.8 kernel.  The Fedora 6.2.9-300.fc38.x86_64 kernel does *not* have CONFIG_X86_KERNEL_IBT set, while the Fedora 6.3.12-200.fc38.x86_64 kernel has CONFIG_X86_KERNEL_IBT=y.  The key determinant of the failure appears to be this kernel build parameter.  You mentioned some other openSUSE and Rawhide kernels.  Please check the corresponding config files and see if my theory holds.

For kernels that are built with CONFIG_X86_KERNEL_IBT=y, the problem can be avoided by adding ibt=off to the kernel boot line.  That's the immediate workaround.
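A minimal sketch of checking for the workaround, assuming a Linux guest with /proc/cmdline available. The helper name has_ibt_off is illustrative only and not part of any tool. The grubby command shown in the message is one common way to add a boot parameter persistently on Fedora; other distributions manage the boot line differently.

```shell
# Return success when the given kernel command line contains ibt=off
# as a standalone parameter (hypothetical helper, for illustration).
has_ibt_off() {
    case " $1 " in
        *" ibt=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Inspect the command line of the running kernel.
if has_ibt_off "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "ibt=off is active on this boot"
else
    echo "ibt=off is not set; on Fedora it can be added persistently with:"
    echo "  sudo grubby --update-kernel=ALL --args=ibt=off"
fi
```

For a one-off test, the parameter can also be appended by hand to the kernel line in the boot loader menu before booting the affected kernel.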

The underlying cause of the exception is an inconsistency in how Hyper-V is presenting the IBT feature in guest VMs. I need to have an internal discussion with the Hyper-V team to resolve the inconsistency and determine the correct approach for Linux guests on Hyper-V to work correctly when IBT is enabled.  It will probably be next week before I can have that discussion.

*Thank you* for raising this issue.  CONFIG_X86_KERNEL_IBT=y became the default in the Linux 6.2 kernel with commit 4fd5f70ce14.  I also need to track down why we didn't spot this problem sooner within Microsoft. :-(

Comment 4 Kamil J. Dudek 2023-07-15 15:25:13 UTC
(In reply to Michael Kelley from comment #3)
> Please check the corresponding config files and see if my theory holds.

Yeah, that seems about right. I'm not sure I'm doing it right, but:
# grep "^NAME=" /etc/os-release ; for c in `find /boot -name "config*"` ; do echo -n "${c}: " ; grep CONFIG_X86_KERNEL_IBT $c ; done

NAME="openSUSE Tumbleweed"
/boot/grub2/x86_64-efi/configfile.mod: /boot/config-6.3.2-1-default: # CONFIG_X86_KERNEL_IBT is not set
/boot/config-6.3.9-1-default: CONFIG_X86_KERNEL_IBT=y

NAME="Fedora Linux"
/boot/config-6.2.15-300.fc38.x86_64: # CONFIG_X86_KERNEL_IBT is not set
/boot/config-6.5.0-0.rc0.20230706gitc17414a273b8.12.fc39.x86_64: CONFIG_X86_KERNEL_IBT=y
/boot/config-6.5.0-0.rc1.11.fc39.x86_64: CONFIG_X86_KERNEL_IBT=y
/boot/config-6.5.0-0.rc1.20230711git3f01e9fed845.12.fc39.x86_64: CONFIG_X86_KERNEL_IBT=y

Maps 1:1 to the booting and failing kernels. All of them work with ibt=off
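The find pattern in the command above also matched GRUB's configfile.mod, which is why the first openSUSE line carries two paths. A tidier sketch of the same check, assuming the distribution installs kernel configs as /boot/config-<version> (true for the Fedora and openSUSE systems listed here); the helper name ibt_state is illustrative only:

```shell
# Print the IBT build setting recorded in one kernel config file
# (hypothetical helper, for illustration).
ibt_state() {
    if grep -q '^CONFIG_X86_KERNEL_IBT=y' "$1"; then
        echo "enabled"
    else
        echo "not set"
    fi
}

# Glob only the kernel config files instead of using find.
for c in /boot/config-*; do
    [ -f "$c" ] || continue   # skip the unexpanded glob when nothing matches
    printf '%s: %s\n' "$c" "$(ibt_state "$c")"
done
```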

> I also need to track down why we didn't spot this problem sooner within Microsoft. :-(

It has been spotted soon after I filed this bug, by "ht1023" on Microsoft Techcommunity. The only post by the account, no explanation provided, fully correct! :D

Regards,
K.

Comment 5 Michael Kelley 2023-07-20 20:37:21 UTC
I have posted a patch to LKML to fix the problem.  See https://lore.kernel.org/lkml/1689885237-32662-1-git-send-email-mikelley@microsoft.com/T/#u.  The technical details of the issue with Hyper-V are in the patch commit message.

Comment 6 Kamil J. Dudek 2023-07-31 23:25:47 UTC
(In reply to Michael Kelley from comment #5)
> I have posted a patch to LKML to fix the problem.  See
> https://lore.kernel.org/lkml/1689885237-32662-1-git-send-email-mikelley@microsoft.com/T/#u.
> The technical details of the issue with
> Hyper-V are in the patch commit message.

This is really interesting, thank you. I am not familiar with the details of the hypercall interface. If I understand correctly, this bug can be closed after the following set of events occurs:

 - the hv_init.c patch gets merged upstream
 - Fedora performs a source intake of the kernel that includes the patch
 - Fedora compiles its fc37, fc38, fc39 and rawhide kernels, maintaining the CONFIG_X86_KERNEL_IBT=y config
 - the kernel is bootable on Hyper-V on 12th Gen Intel without `ibt=off`

Am I right? Or is it justified to add the patch to the kernel-*.src.rpm before that happens? I don't suppose adding a *.patch to the kernel SRC.RPM is something that happens frequently. I have no idea about the process or whether this meets the threshold, especially since the fix is already being processed upstream.

Regards,
K.

Comment 7 Dexuan Cui 2023-08-01 00:45:42 UTC
Kamil, I think you should be correct.

Michael's fix is already in the upstream Hyper-V tree:

https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes

https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/commit/?h=hyperv-fixes&id=d5ace2a776442d80674eff9ed42e737f7dd95056

The fix will be merged into the mainline tree soon, e.g. within a few weeks, I suppose.

Comment 8 Michael Kelley 2023-08-01 23:12:00 UTC
For Fedora specifically, I think what will happen is that once my patch is merged into the mainline tree as Dexuan describes, then the kernel stable tree maintainers will backport the fix to the "stable" kernels as listed at kernel.org.  Currently the most recent are 6.3 and 6.4, though 6.3 will be going away. At some point, the Fedora team will create an updated stable 6.3 or 6.4 kernel with the patch, and the updated kernel will be available as a Fedora 38 update via "dnf".

Comment 9 Kamil J. Dudek 2023-08-06 22:34:47 UTC
(In reply to Dexuan Cui from comment #7)

> 
> The fix will be merged into the mainline tree soon, e.g. within a few weeks,
> I suppose.

seems to have happened already:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/hyperv/hv_init.c#n476

That's good news. It means that the kernel will most probably be operational on 12th gen before my class restarts in September :)

Regards,
K.

Comment 10 Justin M. Forbes 2023-08-08 13:12:58 UTC
As this bug was filed against rawhide, I will close it as fixed. Rawhide has the required patch. For stable Fedora releases, the patch should appear in the next build (6.4.9).

Comment 11 Kamil J. Dudek 2023-08-08 23:59:16 UTC
I couldn't find it in current kernel-6.4.9-100.fc37 and kernel-6.4.9-200.fc38 on Buildsystem but I'm probably just peeking too soon. However, it indeed is present in kernel-6.5.0-0.rc5.20230808git14f9643dc90a.37.fc39 and upstream so this specific bug can be closed as the issue has been addressed :)
Thanks MSFT for providing interesting insight. I hope that Hyper-V itself will get fixed as well.

Regards,
K.

Comment 12 Michael Kelley 2023-08-16 18:49:49 UTC
The fix went into the kernel.org 6.4.10 kernel.  Presumably it will show up in a Fedora 6.4.10 or later kernel very soon.

