After updating to edk2-ovmf-20250523-6.fc43.noarch I am no longer able to successfully run any TDX-enabled guests. The previous build edk2-ovmf-20250221-8.fc42.noarch.rpm worked fine, as do the builds from CentOS Stream 9 (edk2-ovmf-20241117-4.el9.noarch.rpm) and Stream 10 (edk2-ovmf-20250523-2.el10.noarch.rpm). The latter ought to be the same base version of EDK2 as in Fedora rawhide, so it is particularly strange that the c10s build works but the rawhide build fails.

I enabled isa-debugcon and compared the EDK2 logs from 20250523 vs 20250221, and I can see the EDK2 log stops immediately where you'd expect to see some TDX initialization:

@@ -3,2238 +3,20 @@
 ResourceAttribute: 0x7
 PhysicalStart: 0x0
 ResourceLength: 0x800000
 Owner: 00000000-0000-0000-0000-000000000000

 ResourceAttribute: 0x7
 PhysicalStart: 0x806000
 ResourceLength: 0x3000
 Owner: 00000000-0000-0000-0000-000000000000

 ResourceAttribute: 0x7
 PhysicalStart: 0x80D000
 ResourceLength: 0x3000
 Owner: 00000000-0000-0000-0000-000000000000

 ResourceAttribute: 0x7
 PhysicalStart: 0x820000
 ResourceLength: 0x9D3E0000
 Owner: 00000000-0000-0000-0000-000000000000

-SecCoreStartupWithStack(0xFFFCC000, 0x820000)
-SecMtrrSetup: Skip TD-Guest
-Tdx started with(Hob: 0x809000, Gpaw: 0x34, Cpus: 1)
-LowMemory Start and End: 820000, 9DC00000
-HobList: 820000
-InitializePlatform in Pei-less boot
-CMOS:
-00: 56 00 22 00 15 00 02 07 07 25 26 02 10 80 00 00
-10: 00 00 00 00 06 80 02 FF FF 00 00 00 00 00 00 00
-20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
-30: FF FF 20 00 C0 9C 00 20 30 00 00 00 00 12 00 00
-40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
-50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
-60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
-70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
-HostBridgeDeviceId = 0x29C0
-Select Item: 0x19
-Select Item: 0x0
-FW CFG Signature: 0x554D4551
-Select Item: 0x1
-FW CFG Revision: 0x3
-QemuFwCfg interface is supported.
-Select Item: 0x19

Reproducible: Always

Steps to Reproduce:
1. Power on any TDX guest
2. Select a kernel from grub to boot

Actual Results: VM immediately powers off before the kernel emits any console messages.
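For anyone repeating the log comparison described above, here is a minimal sketch of locating the point where two firmware logs stop agreeing. The helper name and the sample lines are illustrative only (the sample lines are taken from the excerpt above); this is not a tool shipped with EDK2 or libvirt.

```python
def common_prefix_len(broken_lines, working_lines):
    """Return how many leading lines the two logs have in common."""
    count = 0
    for broken, working in zip(broken_lines, working_lines):
        if broken != working:
            break
        count += 1
    return count


# Hypothetical in-memory logs; in practice these would be read from the
# two isa-debugcon log files.
working = [
    "Owner: 00000000-0000-0000-0000-000000000000",
    "SecCoreStartupWithStack(0xFFFCC000, 0x820000)",
    "SecMtrrSetup: Skip TD-Guest",
]
broken = [
    "Owner: 00000000-0000-0000-0000-000000000000",
]

shared = common_prefix_len(broken, working)
# If the broken log is a pure prefix of the working one, the first line it
# is missing is the next line of the working log.
if shared == len(broken) and shared < len(working):
    print("broken log ends before:", working[shared])
```

A log that is a pure prefix of the working one can mean either that the firmware stopped there or that the log file itself was cut short, so the result is only a starting point.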
Created attachment 2096380: Broken EDK2 log from edk2-ovmf-20250523-6.fc43.noarch
Created attachment 2096381: Working EDK2 log from edk2-ovmf-20250221-8.fc42.noarch
The following change makes it build with identical options to RHEL and works as expected:

diff --git a/edk2-build.fedora b/edk2-build.fedora
index 957a28b..7c1b609 100644
--- a/edk2-build.fedora
+++ b/edk2-build.fedora
@@ -181,13 +181,13 @@ dest = Fedora/ovmf
 cpy1 = FV/OVMF.fd OVMF.amdsev.fd

 [build.ovmf.inteltdx]
-desc = ovmf build for IntelTdx (2MB)
+desc = ovmf build for IntelTdx (4MB)
 conf = OvmfPkg/IntelTdx/IntelTdxX64.dsc
 arch = X64
 opts = ovmf.common
-       ovmf.2m
+       ovmf.4m
        ovmf.sb.stateless
-pcds = nx.strict
+pcds = nx.compat.x64
        la57
 plat = IntelTdx
 dest = Fedora/ovmf

This more minimal change, which exclusively re-enables the TPM/CC options, does NOT work:

diff --git a/edk2-build.fedora b/edk2-build.fedora
index 957a28b..f3366f1 100644
--- a/edk2-build.fedora
+++ b/edk2-build.fedora
@@ -17,8 +17,8 @@ FD_SIZE_4MB = TRUE
 FD_SIZE_2MB = TRUE
 NETWORK_ISCSI_ENABLE = FALSE
 NETWORK_TLS_ENABLE = FALSE
-CC_MEASUREMENT_ENABLE = FALSE
-TPM2_ENABLE = FALSE
+#CC_MEASUREMENT_ENABLE = FALSE
+#TPM2_ENABLE = FALSE
 #BUILD_SHELL = FALSE

 [opts.ovmf.sb.smm]

Similarly, this change, which exclusively switches to 4m builds, does NOT work:

diff --git a/edk2-build.fedora b/edk2-build.fedora
index 957a28b..1a6b49b 100644
--- a/edk2-build.fedora
+++ b/edk2-build.fedora
@@ -181,11 +181,11 @@ dest = Fedora/ovmf
 cpy1 = FV/OVMF.fd OVMF.amdsev.fd

 [build.ovmf.inteltdx]
-desc = ovmf build for IntelTdx (2MB)
+desc = ovmf build for IntelTdx (4MB)
 conf = OvmfPkg/IntelTdx/IntelTdxX64.dsc
 arch = X64
 opts = ovmf.common
-       ovmf.2m
+       ovmf.4m
        ovmf.sb.stateless
 pcds = nx.strict
        la57

The most minimal change that makes it work is this:

diff --git a/edk2-build.fedora b/edk2-build.fedora
index 957a28b..17b735f 100644
--- a/edk2-build.fedora
+++ b/edk2-build.fedora
@@ -187,7 +187,7 @@ arch = X64
 opts = ovmf.common
        ovmf.2m
        ovmf.sb.stateless
-pcds = nx.strict
+pcds = nx.compat.x64
        la57
 plat = IntelTdx
 dest = Fedora/ovmf

This is rather confusing as AFAICT the use of 'nx.strict' was already present in the edk2-ovmf-20250221-8.fc42.noarch.rpm build which worked correctly. So it looks like something in the rebase to 'edk2-ovmf-20250523' has caused 'nx.strict' to take effect in a way that it previously did not.
Can you try https://copr.fedorainfracloud.org/coprs/kraxel/edk2.testbuilds/ ? Also add 'grep PageFault $firmwarelog' output to this bug please. Thanks.
(In reply to Gerd Hoffmann from comment #4)
> Can you try https://copr.fedorainfracloud.org/coprs/kraxel/edk2.testbuilds/ ?

With edk2-ovmf-20250523-9.copr9233614.noarch I get new failure behaviour - a pagefault dump on the guest serial console

!!!! X64 Exception Type - 0E(#PF - Page-Fault) CPU Apic ID - 00000000 !!!!
ExceptionData - 0000000000000003 I:0 R:0 U:0 W:1 P:1 PK:0 SS:0 SGX:0
RIP - 00000000020062E0, CS - 0000000000000038, RFLAGS - 0000000000210046
RAX - 0000000002022000, RCX - 0000000002022000, RDX - 0000000000000000
RBX - 000000009A1C4B18, RSP - 000000009C89DF28, RBP - 000000009C46D018
RSI - 0000000000000000, RDI - 0000000002065078
R8 - 0000000000000000, R9 - 000000007FDA4195, R10 - 00000000990C6828
R11 - 0000000000000000, R12 - 000000007FFFF000, R13 - 0000000000000000
R14 - 000000009992E6E8, R15 - 000000009992E6F0
DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030
GS - 0000000000000030, SS - 0000000000000030
CR0 - 0000000080010031, CR2 - 0000000002022000, CR3 - 000000009C601000
CR4 - 0000000000000268, CR8 - 0000000000000000
DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 000000009C45D000 0000000000000047, LDTR - 0000000000000000
IDTR - 000000009B114018 0000000000000FFF, TR - 0000000000000000
FXSAVE_STATE - 000000009C89DB80

> Also add 'grep PageFault $firmwarelog' output to this bug please. Thanks.

No results from that in either of the logs I've attached to this bug, nor in the logs from the copr build above.
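As a reading aid for the dump above: the ExceptionData value is the x86 page-fault error code, and the flags EDK2 prints next to it are its individual bits. A minimal sketch of the decoding follows, assuming the standard bit layout from the Intel SDM; the function name is made up for illustration and is not part of EDK2.

```python
# Bit positions of the x86 page-fault error code (assumed per Intel SDM),
# labelled with the same short names the EDK2 exception dump uses.
PF_ERROR_BITS = [
    (0, "P"),     # 1 = fault on a present page (protection violation)
    (1, "W"),     # 1 = the faulting access was a write
    (2, "U"),     # 1 = the access came from user mode
    (3, "R"),     # 1 = reserved bit set in a paging-structure entry
    (4, "I"),     # 1 = the access was an instruction fetch
    (5, "PK"),    # 1 = protection-key violation
    (6, "SS"),    # 1 = shadow-stack access
    (15, "SGX"),  # 1 = SGX access-control violation
]


def decode_pf_error(code: int) -> dict:
    """Split a page-fault error code into named single-bit flags."""
    return {name: (code >> bit) & 1 for bit, name in PF_ERROR_BITS}


flags = decode_pf_error(0x3)
print(flags)
```

For 0x3 only P and W are set, which matches the "W:1 P:1" in the dump: a supervisor-mode write to a page that is mapped but not writable, at the address held in CR2 (0x2022000).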
> The most minimal change that makes it work is this:
> -pcds = nx.strict
> +pcds = nx.compat.x64
> This is rather confusing as AFAICT the use of 'nx.strict' was already
> present in the edk2-ovmf-20250221-8.fc42.noarch.rpm build which worked
> correctly.

Indeed, especially as the firmware doesn't do any NX stuff that early at boot. (I didn't notice it is failing /that/ early when checking the bug the first time.)

So it might be something totally unrelated, which is triggered by good/bad luck, maybe due to changed image layout.

The firmware simply hangs? Could be here with the messages you get:

[ in OvmfPkg/IntelTdx/Sec/SecMain.c ]
    if (TdxHelperProcessTdHob () != EFI_SUCCESS) {
      CpuDeadLoop ();
    }

> With edk2-ovmf-20250523-9.copr9233614.noarch I get new failure behaviour - a
> pagefault dump on the guest serial console

Thanks. First, strange that it makes it that far, there are no changes in the early TDX code path.

> !!!! X64 Exception Type - 0E(#PF - Page-Fault) CPU Apic ID - 00000000 !!!!
> ExceptionData - 0000000000000003 I:0 R:0 U:0 W:1 P:1 PK:0 SS:0 SGX:0

Known grub bug if the EFI_MEMORY_ATTRIBUTE_PROTOCOL is present. The build has a custom page fault handler to fix up NX faults (and warn about them, like SELinux in permissive mode), which apparently is not active in TDX mode. Need to check why.

There is a runtime switch to turn off EFI_MEMORY_ATTRIBUTE_PROTOCOL (downstream builds only):
-fw_cfg name=opt/org.tianocore/UninstallMemAttrProtocol,string=yes
New copr test builds are underway (still compiling).
(In reply to Gerd Hoffmann from comment #6)
> > The most minimal change that makes it work is this:
>
> > -pcds = nx.strict
> > +pcds = nx.compat.x64
>
> > This is rather confusing as AFAICT the use of 'nx.strict' was already
> > present in the edk2-ovmf-20250221-8.fc42.noarch.rpm build which worked
> > correctly.
>
> Indeed, especially as the firmware doesn't do any NX stuff that early at
> boot.
> (didn't notice it is failing /that/ early when checking the bug the first
> time).
>
> So it might be something totally unrelated, which is triggered by good/bad
> luck, maybe due to changed image layout.
>
> The firmware simply hangs?

Actually it isn't a hang - the whole VM resets - libvirt receives this from QEMU:

{"timestamp": {"seconds": 1751971878, "microseconds": 54790}, "event": "SHUTDOWN", "data": {"guest": true, "reason": "guest-reset"}}

NB: TDX can't do normal resets, so QEMU always shuts down for resets, and libvirt has to re-create the whole VM.

> > With edk2-ovmf-20250523-9.copr9233614.noarch I get new failure behaviour - a
> > pagefault dump on the guest serial console
>
> Thanks. First, strange that it makes it that far, there are no changes
> in the early TDX code path.
>
> > !!!! X64 Exception Type - 0E(#PF - Page-Fault) CPU Apic ID - 00000000 !!!!
> > ExceptionData - 0000000000000003 I:0 R:0 U:0 W:1 P:1 PK:0 SS:0 SGX:0
>
> Known grub bug if the EFI_MEMORY_ATTRIBUTE_PROTOCOL is present.
> The build has a custom page fault handler to fixup NX faults
> (and warn about them, like selinux in permissive mode), which
> apparently is not active in TDX mode. Need to check why.
>
> There is a runtime switch to turn off EFI_MEMORY_ATTRIBUTE_PROTOCOL
> (downstream builds only):
> -fw_cfg name=opt/org.tianocore/UninstallMemAttrProtocol,string=yes

Setting that fw_cfg flag, the copr build exhibits the same failure mode as current rawhide VMs - the VM immediately shuts down due to a guest reset, either in EFI stub or Linux early boot.
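For telling this case apart from an ordinary shutdown when watching the monitor, a minimal sketch of checking a QMP event line for the guest-reset signature quoted above. Only the fields visible in that event are used; the helper name is hypothetical and this is not libvirt's own event handling.

```python
import json


def is_guest_reset(event_line: str) -> bool:
    """True if a QMP event line is a SHUTDOWN caused by a guest reset."""
    event = json.loads(event_line)
    data = event.get("data", {})
    return (
        event.get("event") == "SHUTDOWN"
        and data.get("guest") is True
        and data.get("reason") == "guest-reset"
    )


# The event exactly as reported in this bug.
line = (
    '{"timestamp": {"seconds": 1751971878, "microseconds": 54790}, '
    '"event": "SHUTDOWN", '
    '"data": {"guest": true, "reason": "guest-reset"}}'
)
print(is_guest_reset(line))
```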
> > The firmware simply hangs?
>
> Actually it isn't a hang - the whole VM resets - libvirt receives this
> from QEMU:
>
> {"timestamp": {"seconds": 1751971878, "microseconds": 54790}, "event":
> "SHUTDOWN", "data": {"guest": true, "reason": "guest-reset"}}

OK, so the firmware does NOT sit in a CpuDeadLoop() due to unrecoverable errors. Might get a fault it can't handle -> triple-fault -> reset. Can we get details from kvm on what happened? Or is that confidential in TDX mode?

> NB: TDX can't do normal resets, so QEMU always shuts down for resets, and
> libvirt has to re-create the whole VM

Yes, much like 'qemu -no-reboot' on non-cc guests.

> > There is a runtime switch to turn off EFI_MEMORY_ATTRIBUTE_PROTOCOL
> > (downstream builds only):
> > -fw_cfg name=opt/org.tianocore/UninstallMemAttrProtocol,string=yes
>
> Setting that fw_cfg flag, the copr build exhibits the same failure mode
> as current rawhide VMs - the VM immediately shuts down due to a guest reset,
> either in EFI stub or Linux early boot.

So, with '-fw_cfg name=opt/org.tianocore/UninstallMemAttrProtocol,string=yes' you get an almost instant reset (comment #1 logfile)?

And with '-fw_cfg name=opt/org.tianocore/UninstallMemAttrProtocol,string=no' you get the page fault on the serial line like in comment #4?

[ side note: With the latest copr build this should change into 'PageFault' messages in the firmware log ]
(In reply to Gerd Hoffmann from comment #9)
> > > The firmware simply hangs?
> >
> > Actually it isn't a hang - the whole VM resets - libvirt receives this
> > from QEMU:
> >
> > {"timestamp": {"seconds": 1751971878, "microseconds": 54790}, "event":
> > "SHUTDOWN", "data": {"guest": true, "reason": "guest-reset"}}
>
> OK, so the firmware does NOT sit in a CpuDeadLoop() due to unrecoverable
> errors.
> Might get a fault it can't handle -> triple-fault -> reset.
> Can we get details from kvm on what happened? Or is that confidential in TDX
> mode?

I've now traced it in QEMU and got back to kvm_cpu_exec

    case KVM_EXIT_SHUTDOWN:
        qemu_system_reset_request(SHUTDOWN_CAUSE_GUEST_RESET);

which IIUC will happen when the guest exits with a triple-fault.

> > > There is a runtime switch to turn off EFI_MEMORY_ATTRIBUTE_PROTOCOL
> > > (downstream builds only):
> > > -fw_cfg name=opt/org.tianocore/UninstallMemAttrProtocol,string=yes
> >
> > Setting that fw_cfg flag, the copr build exhibits the same failure mode
> > as current rawhide VMs - the VM immediately shuts down due to a guest reset,
> > either in EFI stub or Linux early boot.
>
> So, with '-fw_cfg
> name=opt/org.tianocore/UninstallMemAttrProtocol,string=yes' you get an
> almost instant reset (comment #1 logfile)?

Correct.

> And with '-fw_cfg name=opt/org.tianocore/UninstallMemAttrProtocol,string=no'
> you get the page fault on the serial line like in comment #4? [ side note:
> With the latest copr build this should change into 'PageFault' messages in
> the firmware log ]

Correct, or to be more precise, I simply don't set that -fw_cfg feature at all.
> > So, with '-fw_cfg
> > name=opt/org.tianocore/UninstallMemAttrProtocol,string=yes' you get an
> > almost instant reset (comment #1 logfile)?
>
> Correct.

Log is truncated. Append mode seems to be 'off' by default, and libvirt restarting the guest will zap the old log content. That explains quite a bit of the confusion ;)

So, key log message is this (on the serial console, with EFI_MEMORY_ATTRIBUTE_PROTOCOL disabled):

EFI stub: WARNING: Unable to unprotect memory range [9a0a0000,9a0a1000]: 8000000000000003
EFI stub: WARNING: Unable to unprotect memory range [59c00000,5b200000]: 8000000000000003

First range is the trampoline to turn on 5-level paging, second range is the kernel image. EFI stub cannot clear NX + set RW -> page fault -> boom. Dunno why this happens with TDX enabled only, this should not be TDX-specific at all.

With EFI_MEMORY_ATTRIBUTE_PROTOCOL enabled this works fine, but requires the page fault handler which compensates for the NX bug in grub, leaving this trail in the firmware log:

PageFaultInit: StrictNX disabled - installing page fault handler
PageFaultInit: mCpu->RegisterInterruptHandler: Success
PageFaultInit: gBS->CreateEvent: Success
PageFaultHandler: CR2: 000000000208A000 - RIP: 000000000206F920 - ID:0 WR:1 P:1 [0x3]
PageFaultHandler: setting RW for page 0x208A000 [large pte]
PageFaultExitBoot: fixups: 0 NX, 1 RW
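To make the EFI stub warnings above easier to read, here is a rough sketch that pulls the range and status out of such a line and names the status code. It assumes the usual UEFI encoding (top bit of a 64-bit status marks an error, low bits give the code, with 3 being EFI_UNSUPPORTED) and treats the second address as an exclusive end, which is my assumption. The name table is deliberately partial and the function name is hypothetical.

```python
import re

EFI_ERROR_BIT = 1 << 63

# Partial table of UEFI error codes (values per the UEFI specification).
EFI_ERROR_NAMES = {
    1: "EFI_LOAD_ERROR",
    2: "EFI_INVALID_PARAMETER",
    3: "EFI_UNSUPPORTED",
    9: "EFI_OUT_OF_RESOURCES",
    14: "EFI_NOT_FOUND",
    15: "EFI_ACCESS_DENIED",
}

WARNING_RE = re.compile(
    r"Unable to unprotect memory range \[([0-9a-f]+),([0-9a-f]+)\]:\s*([0-9a-f]+)",
    re.IGNORECASE,
)


def parse_unprotect_warning(line: str):
    """Return range and decoded status for an EFI stub warning, else None."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    start, end, status = (int(group, 16) for group in match.groups())
    if status & EFI_ERROR_BIT:
        name = EFI_ERROR_NAMES.get(status & ~EFI_ERROR_BIT, "unknown error")
    else:
        name = "not an error"
    # Size assumes 'end' is exclusive (an assumption, see lead-in).
    return {"start": start, "end": end, "size": end - start, "status": name}


line = ("EFI stub: WARNING: Unable to unprotect memory range "
        "[9a0a0000,9a0a1000]: 8000000000000003")
print(parse_unprotect_warning(line))
```

Under these assumptions both warnings carry EFI_UNSUPPORTED, the first for a single 4 KiB page and the second for a 0x1600000-byte range.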
(In reply to Gerd Hoffmann from comment #11)
> > > So, with '-fw_cfg
> > > name=opt/org.tianocore/UninstallMemAttrProtocol,string=yes' you get an
> > > almost instant reset (comment #1 logfile)?
> >
> > Correct.
>
> Log is truncated. append mode seems to be 'off' by default and libvirt
> restarting
> the guest will zap the old log content. That explains quite a bit of the
> confusion ;)
>
> So, key log message is this (on the serial console, with
> EFI_MEMORY_ATTRIBUTE_PROTOCOL disabled):
>
> EFI stub: WARNING: Unable to unprotect memory range [9a0a0000,9a0a1000]:
> 8000000000000003
> EFI stub: WARNING: Unable to unprotect memory range [59c00000,5b200000]:
> 8000000000000003
>
> First range is the trampoline to turn on 5-level paging, second range is the
> kernel image.
> EFI stub can not clear NX + set RW -> page fault -> boom. Dunno why this
> happens with TDX
> enabled only, this should not be TDX-specific at all.
>

Urgh, I'm sorry, I don't know why I forgot to copy those lines of console output from EFI stub into the initial description :-(
FEDORA-2025-7e2a69db6b (edk2-20250523-11.fc42) has been submitted as an update to Fedora 42. https://bodhi.fedoraproject.org/updates/FEDORA-2025-7e2a69db6b
https://fedoraproject.org/wiki/Changes/Edk2Security#july_2025_update
FEDORA-2025-7e2a69db6b has been pushed to the Fedora 42 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2025-7e2a69db6b` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2025-7e2a69db6b See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2025-7e2a69db6b (edk2-20250523-11.fc42) has been pushed to the Fedora 42 stable repository. If problem still persists, please make note of it in this bug report.