Bug 2189136

Summary:

windows 11 installation broken with edk2-20230301gitf80f052277c8-1.el9

Product:

Red Hat Enterprise Linux 9

Reporter:

Guo, Zhiyi <zhguo>

Component:

edk2

Assignee:

Gerd Hoffmann <kraxel>

Status:

CLOSED ERRATA

QA Contact:

Xueqiang Wei <xuwei>

Severity:

high

Docs Contact:

Priority:

high

Version:

9.3

CC:

berrange, coli, demeng, echo, gveitmic, jinzhao, juzhang, kraxel, lersek, lkotek, mbrown, mdean, pbonzini, phou, qizhu, vgoyal, virt-maint, xuwei, yama, zhguo

Target Milestone:

Keywords:

CustomerScenariosInitiative, Regression, Triaged

Target Release:

---

Hardware:

x86_64

OS:

Windows

Whiteboard:

Fixed In Version:

edk2-20230301gitf80f052277c8-4.el9

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2023-11-07 08:24:29 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
ovmf log	none
edk2 debug patch on top of commit 2c2cb2352896	none
catch unannounced high TPL: debug patch series on top of git 94c802e108a0	none

Description Guo, Zhiyi 2023-04-24 10:48:18 UTC

Description of problem:
windows 11 installation broken with edk2-20230301gitf80f052277c8-1.el9

Version-Release number of selected component (if applicable):
edk2-20230301gitf80f052277c8-1.el9
qemu-kvm-8.0.0-1.el9

How reproducible:
100%

Steps to Reproduce:
1.Try to install windows 11 VM with below virt-install command:
virt-install --name win11_T4 --cpu host-model --ram=8192 --vcpus=4,sockets=1,cores=4,threads=1 --disk path=/home/win11_T4.qcow2,format='qcow2',cache=none,size=80,bus='scsi' --graphics vnc,listen=0.0.0.0 --video bochs --boot uefi --hostdev c8:00.0 --network bridge=br9 --noreboot --os-variant=win10 --cdrom windows.iso --disk path=unattend.iso,device='cdrom',readonly='on',bus='sata' --disk path=driver.iso,device='cdrom',readonly='on',bus='sata' --console file,path=/tmp/win11_T4.log

2.
3.

Actual results:
windows 11 installation always stuck at tianocore icon
 
Expected results:
VM installation can success
Additional info:
No such issue happen to edk2-ovmf-20221207gitfff6d81270b5-9.el9_2.noarch

Comment 1 Guo, Zhiyi 2023-04-24 11:06:58 UTC

Created attachment 1959509 [details]
ovmf log

ovmf log show below information:
...
AtaError: 0
Tcg2SubmitCommand ...
PtpCrbTpmCommand Send - 80 01 00 00 00 14 00 00 01 7E 00 00 00 01 00 0B 03 00 08 00 
PtpCrbTpmCommand ReceiveHeader - 80 01 00 00 00 3E 00 00 00 00 
PtpCrbTpmCommand Receive - 80 01 00 00 00 3E 00 00 00 00 00 00 00 34 00 00 00 01 00 0B 03 00 08 00 00 00 00 01 00 20 69 D4 B0 81 B3 89 C1 A1 08 DA 1C 63 42 EB 24 8A FC 8F A2 99 31 26 E4 91 98 E4 D2 AB 36 57 59 8B 
Tcg2SubmitCommand - Success
Tcg2SubmitCommand ...
PtpCrbTpmCommand Send - 80 02 00 00 00 41 00 00 01 82 00 00 00 0C 00 00 00 09 40 00 00 09 00 00 00 00 00 00 00 00 01 00 0B 9A 03 E3 85 CE D8 DF 9C CF 7A 45 E0 14 D4 F7 D1 ED 9B CB 95 C1 D0 56 E0 0C 2C 55 57 E9 A1 82 00 
PtpCrbTpmCommand ReceiveHeader - 80 02 00 00 00 13 00 00 00 00 
PtpCrbTpmCommand Receive - 80 02 00 00 00 13 00 00 00 00 00 00 00 00 00 00 01 00 00 
Tcg2SubmitCommand - Success
Tcg2SubmitCommand ...
PtpCrbTpmCommand Send - 80 02 00 00 00 41 00 00 01 82 00 00 00 0D 00 00 00 09 40 00 00 09 00 00 00 00 00 00 00 00 01 00 0B 01 E7 FB AF F4 19 89 9D 13 88 12 E8 0D 24 32 B9 CF 95 E6 35 25 66 AE C7 28 45 EE 53 23 05 00 DB 
PtpCrbTpmCommand ReceiveHeader - 80 02 00 00 00 13 00 00 00 00 
PtpCrbTpmCommand Receive - 80 02 00 00 00 13 00 00 00 00 00 00 00 00 00 00 01 00 00 
Tcg2SubmitCommand - Success
Tcg2SubmitCommand ...
PtpCrbTpmCommand Send - 80 02 00 00 00 41 00 00 01 82 00 00 00 0E 00 00 00 09 40 00 00 09 00 00 00 00 00 00 00 00 01 00 0B 56 B5 E2 61 1E 00 F4 75 C7 F1 C5 F1 EB 2D F5 17 3E 01 C6 10 99 42 37 75 B9 65 0E 9A 66 6E A4 A8 
PtpCrbTpmCommand ReceiveHeader - 80 02 00 00 00 13 00 00 00 00 
PtpCrbTpmCommand Receive - 80 02 00 00 00 13 00 00 00 00 00 00 00 00 00 00 01 00 00 
Tcg2SubmitCommand - Success
ASSERT /builddir/build/BUILD/edk2-f80f052277c8/OvmfPkg/Library/NestedInterruptTplLib/Tpl.c(42): InterruptedTPL < 31

Comment 9 Gerd Hoffmann 2023-04-25 15:20:59 UTC

Tried to reproduce, no success (slightly older win11 iso).

(virt-install --vnc --name win11-bz2189136 --boot uefi --vcpus=4 --os-variant=win11 --cdrom ~/iso/Win11_German_x64.iso)

Boots fine to the installer screen.

Does this happen only with latest win11?
Or does this happen with older versions (say win10) too?

Can you try remove the options from the virt-install command line one by one
to the bare minimum and see if onee of them changes the situation for you?

Comment 10 Guo, Zhiyi 2023-04-26 09:44:49 UTC

(In reply to Gerd Hoffmann from comment #9)
> Tried to reproduce, no success (slightly older win11 iso).
> 
> (virt-install --vnc --name win11-bz2189136 --boot uefi --vcpus=4
> --os-variant=win11 --cdrom ~/iso/Win11_German_x64.iso)
> 
> Boots fine to the installer screen.
> 
> Does this happen only with latest win11?
> Or does this happen with older versions (say win10) too?

It looks like the issue specific to the version of windows 11 used by QE. I have tried the same virt-install command in comment 0 to windows 10, windows server 2019 and windows server 2022, none of them can reproduce the issue.
> 
> Can you try remove the options from the virt-install command line one by one
> to the bare minimum and see if onee of them changes the situation for you?

Even the simple virt-install command can hit this issue:
virt-install --name win11 --cpu host-model --ram=8192 --vcpus=4,sockets=1,cores=4,threads=1 --video virtio --boot uefi --disk path=/home/win11.qcow2,format='qcow2',cache=none,size=80,bus='sata' --graphics vnc,listen=0.0.0.0 --noreboot --os-variant=win10 --cdrom windows.iso

Comment 19 Gerd Hoffmann 2023-04-28 13:43:12 UTC

upstream discussion
https://edk2.groups.io/g/devel/message/103770

Comment 21 Gerd Hoffmann 2023-05-02 14:51:40 UTC

unbalanced RaiseTPL in DiskIo2ReadWriteDisk

Comment 23 Laszlo Ersek 2023-05-02 15:20:02 UTC

(In reply to Gerd Hoffmann from comment #21)
> unbalanced RaiseTPL in DiskIo2ReadWriteDisk

That's a good and interesting find:

  OldTpl = gBS->RaiseTPL (TPL_CALLBACK);
...
  gBS->RaiseTPL (TPL_NOTIFY);
...
  gBS->RestoreTPL (OldTpl);

Arguments can be made for both the validity and the invalidity of this construct.

One could be tempted to call this invalid because the second Raise (to NOTIFY) is not matched with a Restore.

However, there are several arguments why this can be considered valid:
- all "rising edges" are good (from application to callback to notify)
- the one restore we have does correspond to a raise that we performed earlier
- we do exit the function with the original TPL
- when we lower (restore the TPL), the spec provides well-defined semantics for *all* functions that have been queued in the interim (see EFI_BOOT_SERVICES.CreateEvent()); so when we jump from NOTIFY to APPLICATION, the highest prio (= NOTIFY) functions will execute in FIFO order, then the next prio (= CALLBACK) ones, also in FIFO order.

If the code did:

  OldTpl = gBS->RaiseTPL (TPL_CALLBACK);
  Old2Tpl = gBS->RaiseTPL (TPL_NOTIFY);
  gBS->RestoreTPL (Old2Tpl);
  gBS->RestoreTPL (OldTpl);

that would be required to have the exact same effect (the first restore would unblock NOTIFY funcs, the second restore would unblock CALLBACK ones; assuming anyway that we entered the function at OldTpl = APPLICATION).

Comment 24 Laszlo Ersek 2023-05-03 10:00:40 UTC

This problem reproduces with commit c37cbc030d96 ("OvmfPkg: Switch timer in build time for OvmfPkg", 2022-04-02) reverted, on top of 2c2cb2352896. In other words, using the LAPIC timer, vs. the 8254 timer + the 8259 interrupt controller, makes no difference.

Comment 25 Laszlo Ersek 2023-05-03 11:23:03 UTC

Found some interesting bits.


First, I reverted commits a086f4a63bc0 ("OvmfPkg: Use
NestedInterruptTplLib in nested interrupt handlers", 2022-12-23) and
a24fbd606125 ("OvmfPkg: Add library to handle TPL from within nested
interrupt handlers", 2022-12-23) on top of upstream 2c2cb2352896, *and*
modified LocalApicTimerDxe like this:

> diff --git a/OvmfPkg/LocalApicTimerDxe/LocalApicTimerDxe.c b/OvmfPkg/LocalApicTimerDxe/LocalApicTimerDxe.c
> index 3e04b49d6bcf..4b5b4dad8196 100644
> --- a/OvmfPkg/LocalApicTimerDxe/LocalApicTimerDxe.c
> +++ b/OvmfPkg/LocalApicTimerDxe/LocalApicTimerDxe.c
> @@ -61,6 +61,9 @@ TimerInterruptHandler (
>    EFI_TPL  OriginalTPL;
>
>    OriginalTPL = gBS->RaiseTPL (TPL_HIGH_LEVEL);
> +  if (OriginalTPL == TPL_HIGH_LEVEL) {
> +    DEBUG ((DEBUG_WARN, "%a:%d\n", __func__, __LINE__));
> +  }
>
>    SendApicEoi ();
>

Sure enough, this fires; that is, even before/without the nested
interrupt handler stuff from Michael Brown, the same issue is triggered
or caused by "cdboot.efi":

> Loading driver at 0x00010000000 EntryPoint=0x00010001090 cdboot.efi
> InstallProtocolInterface: [EfiLoadedImageDevicePathProtocol] 3DEA4698
> ProtectUefiImageCommon - 0x3CEB9040
>   - 0x0000000010000000 - 0x000000000011C000
> TimerInterruptHandler:65


Second, right after this snippet, "cdboot.efi" calls
SerialSetAttributes() [MdeModulePkg/Bus/Pci/PciSioSerialDxe/SerialIo.c].
From there we get a call to VerifyUartParameters() [same file], which
logs:

> ClockRate = 1843200
> Divisor   = 1
> BaudRate/Actual (115200/115200) = 100%

Not much later "cbdoot.efi" calls gBS->ExitBootServices(), which we
learn from log messages such as:

> VirtioScsiExitBoot: Context=0x3CFF6498
> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0


Third, I copied "cboot.efi" out of the installer ISO, disassembled it
with "objdump -d" (thankfully, on RHEL9, objdump directly supports EFI
binaries!), and then looked for cli/sti instructions. Found two
interesting spots.

(a) This pair of spots is really small in itself:

    100b8ef0:   fb                      sti
    100b8ef1:   c3                      retq

and

    100b8f00:   fa                      cli
    100b8f01:   c3                      retq

What are interesting are the call sites! I'm not going to quote them in
full, just the tight contexts:

    1009beea:   44 8a f8                mov    %al,%r15b
    1009beed:   84 c0                   test   %al,%al
    1009beef:   74 05                   je     0x1009bef6
    1009bef1:   e8 0a d0 01 00          callq  0x100b8f00
    1009bef6:   41 0f 20 c4             mov    %cr0,%r12

This is effectively a conditional CLI. If the condition holds, we jump
*just over* the subroutine call that would mean a CLI. In other words,
"cdboot.efi" checks something, and then based on the result, it either
executes CLI (in a tight subroutine) or not.

    1009c133:   45 84 ff                test   %r15b,%r15b
    1009c136:   74 05                   je     0x1009c13d
    1009c138:   e8 b3 cd 01 00          callq  0x100b8ef0
    1009c13d:   48 8b 6c 24 58          mov    0x58(%rsp),%rbp

Exact same story, just with STI.

The next one is so short I can quote in full:

    1009cbfe:   f6 42 10 02             testb  $0x2,0x10(%rdx)
    1009cc02:   74 05                   je     0x1009cc09  -> jumps just
                                                              over the CLI
    1009cc04:   e8 f7 c2 01 00          callq  0x100b8f00  -> CLI
    1009cc09:   83 3b 01                cmpl   $0x1,(%rbx)
    1009cc0c:   75 16                   jne    0x1009cc24
    1009cc0e:   48 8b d7                mov    %rdi,%rdx
    1009cc11:   48 8b cb                mov    %rbx,%rcx
    1009cc14:   e8 3b 00 00 00          callq  0x1009cc54
    1009cc19:   48 8d 4b 14             lea    0x14(%rbx),%rcx
    1009cc1d:   e8 2e c2 01 00          callq  0x100b8e50
    1009cc22:   eb 14                   jmp    0x1009cc38
    1009cc24:   48 8d 4b 14             lea    0x14(%rbx),%rcx
    1009cc28:   e8 23 c2 01 00          callq  0x100b8e50
    1009cc2d:   48 8b d7                mov    %rdi,%rdx
    1009cc30:   48 8b cb                mov    %rbx,%rcx
    1009cc33:   e8 1c 00 00 00          callq  0x1009cc54
    1009cc38:   f6 43 10 02             testb  $0x2,0x10(%rbx)
    1009cc3c:   74 05                   je     0x1009cc43  -> jumps just
                                                              over the STI
    1009cc3e:   e8 ad c2 01 00          callq  0x100b8ef0  -> STI
    1009cc43:   48 8b 5c 24 30          mov    0x30(%rsp),%rbx

The same conditional patterns.

(b) Regarding the other CLI+STI pair, the CLI is in a subroutine that
has only one caller, and the STI is in a code snippet that is *not*
called by the CALL instruction, but reached by other means (haven't
figured that out for now):

Entry to the subroutine with CLI (having one call site):

    100b9020:	48 55                	rex.W push %rbp
    100b9022:	56                   	push   %rsi
    100b9023:	57                   	push   %rdi
    100b9024:	53                   	push   %rbx
    100b9025:	9c                   	pushfq
    100b9026:	83 3d 1b 30 04 00 00 	cmpl   $0x0,0x4301b(%rip)        # 0x100fc048
    100b902d:	74 01                	je     0x100b9030
    100b902f:	fa                   	cli
    100b9030:	0f 20 c0             	mov    %cr0,%rax

Note above the same pattern: a flag check (the flag is at 0x100fc048),
and then the jump that *only* skips the CLI in case the flag equals
zero.

(BTW, the flag at 0x100fc048 is written-to in just one place in the
disassembly; it's most probably a global variable. At least I think it
is a "write" access; the GNU assembly style puts the destination operand
second, right?)

    100b9033:	50                   	push   %rax
    100b9034:	0f 20 e0             	mov    %cr4,%rax
    100b9037:	50                   	push   %rax
    100b9038:	48 25 00 10 00 00    	and    $0x1000,%rax
    100b903e:	48 c1 e8 0c          	shr    $0xc,%rax
    100b9042:	50                   	push   %rax
    100b9043:	0f 20 d8             	mov    %cr3,%rax
    100b9046:	48 89 05 d3 2f 04 00 	mov    %rax,0x42fd3(%rip)        # 0x100fc020
    100b904d:	48 89 25 04 30 04 00 	mov    %rsp,0x43004(%rip)        # 0x100fc058
    100b9054:	48 8b 15 e5 2f 04 00 	mov    0x42fe5(%rip),%rdx        # 0x100fc040
    100b905b:	0f 01 12             	lgdt   (%rdx)
    100b905e:	0f 01 5a 0a          	lidt   0xa(%rdx)
    100b9062:	0f 00 52 14          	lldt   0x14(%rdx)

So this is some really quirky code, reading CR4, CR3, and loading the
global, local, and interrupt descriptor tables!

    100b9066:	48 b8 79 90 0b 10 00 	movabs $0x100b9079,%rax
    100b906d:	00 00 00

Note this value -- 100b9079.

    100b9070:	48 0f b7 4a 16       	movzwq 0x16(%rdx),%rcx
    100b9075:	51                   	push   %rcx
    100b9076:	50                   	push   %rax
    100b9077:	48 cb                	lretq

So the subroutine with the (conditional) CLI in it ends here.

It is followed immediately by some helper subroutine; listed below. Note
its start address: 100b9079.

    100b9079:	8e 5a 18             	mov    0x18(%rdx),%ds
    100b907c:	8e 42 1a             	mov    0x1a(%rdx),%es
    100b907f:	8e 6a 1e             	mov    0x1e(%rdx),%gs
    100b9082:	8e 62 1c             	mov    0x1c(%rdx),%fs
    100b9085:	8e 52 20             	mov    0x20(%rdx),%ss
    100b9088:	48 8b 0d d9 2f 04 00 	mov    0x42fd9(%rip),%rcx        # 0x100fc068
    100b908f:	48 8b 15 9a 2f 04 00 	mov    0x42f9a(%rip),%rdx        # 0x100fc030
    100b9096:	48 8b 05 d3 2f 04 00 	mov    0x42fd3(%rip),%rax        # 0x100fc070
    100b909d:	48 85 c0             	test   %rax,%rax
    100b90a0:	74 04                	je     0x100b90a6
    100b90a2:	ff d0                	callq  *%rax
    100b90a4:	eb 03                	jmp    0x100b90a9
    100b90a6:	0f 22 d9             	mov    %rcx,%cr3
    100b90a9:	48 2b ed             	sub    %rbp,%rbp
    100b90ac:	48 8b 25 75 2f 04 00 	mov    0x42f75(%rip),%rsp        # 0x100fc028
    100b90b3:	48 2b f6             	sub    %rsi,%rsi
    100b90b6:	48 8b 0d 7b 2f 04 00 	mov    0x42f7b(%rip),%rcx        # 0x100fc038
    100b90bd:	48 8b 05 8c 2f 04 00 	mov    0x42f8c(%rip),%rax        # 0x100fc050
    100b90c4:	ff d0                	callq  *%rax
    100b90c6:	48 8b 25 8b 2f 04 00 	mov    0x42f8b(%rip),%rsp        # 0x100fc058
    100b90cd:	48 8b 0d 4c 2f 04 00 	mov    0x42f4c(%rip),%rcx        # 0x100fc020
    100b90d4:	5a                   	pop    %rdx
    100b90d5:	48 8b 05 94 2f 04 00 	mov    0x42f94(%rip),%rax        # 0x100fc070
    100b90dc:	48 85 c0             	test   %rax,%rax
    100b90df:	74 04                	je     0x100b90e5
    100b90e1:	ff d0                	callq  *%rax
    100b90e3:	eb 03                	jmp    0x100b90e8
    100b90e5:	0f 22 d9             	mov    %rcx,%cr3
    100b90e8:	48 8b 15 71 2f 04 00 	mov    0x42f71(%rip),%rdx        # 0x100fc060
    100b90ef:	0f 01 12             	lgdt   (%rdx)
    100b90f2:	0f 01 5a 0a          	lidt   0xa(%rdx)
    100b90f6:	0f 00 52 14          	lldt   0x14(%rdx)

Again the descriptor table loading stuff...

    100b90fa:	48 b8 0d 91 0b 10 00 	movabs $0x100b910d,%rax
    100b9101:	00 00 00

Note this address: 100b910d.

    100b9104:	48 0f b7 4a 16       	movzwq 0x16(%rdx),%rcx
    100b9109:	51                   	push   %rcx
    100b910a:	50                   	push   %rax
    100b910b:	48 cb                	lretq

The helper subroutine ends here.

And then right after follows the subroutine with STI in it -- note that
it starts at 100b910d -- see the movabs above.

    100b910d:	8e 5a 18             	mov    0x18(%rdx),%ds
    100b9110:	8e 42 1a             	mov    0x1a(%rdx),%es
    100b9113:	8e 6a 1e             	mov    0x1e(%rdx),%gs
    100b9116:	8e 62 1c             	mov    0x1c(%rdx),%fs
    100b9119:	8e 52 20             	mov    0x20(%rdx),%ss
    100b911c:	58                   	pop    %rax
    100b911d:	0f 22 e0             	mov    %rax,%cr4
    100b9120:	58                   	pop    %rax
    100b9121:	0f 22 c0             	mov    %rax,%cr0
    100b9124:	83 3d 1d 2f 04 00 00 	cmpl   $0x0,0x42f1d(%rip)        # 0x100fc048
    100b912b:	74 01                	je     0x100b912e
    100b912d:	fb                   	sti
    100b912e:	9d                   	popfq

Same pattern as in the subroutine with the CLI: if the flag at
0x100fc048 is zero, we jump *just over* the STI. Otherwise, if the flag
is nonzero, we perform the STI.

    100b912f:	5b                   	pop    %rbx
    100b9130:	5f                   	pop    %rdi
    100b9131:	5e                   	pop    %rsi
    100b9132:	5d                   	pop    %rbp
    100b9133:	c3                   	retq

End of the STI-containing subroutine.

Here's my take: functionality (b) seems way too complex for running
before ExitBootServces()! It manipulates control registers and
descriptor tables. No sane UEFI application would do that before calling
ExitBootServices(), I presume.

Functionality (a) seems more likely, except I don't know in the
slightest what it does.

The interesting question is now to disassemble "cdboot.efi" from a
previous (non-problematic) Windows installer ISO, and see whether that
version contains similar CLI/STI pairs.

Comment 26 Laszlo Ersek 2023-05-03 12:26:10 UTC

Testing cdboot.efi from
"en_windows_10_enterprise_ltsc_2019_x64_dvd_be3c8ffb.iso" (more or less
randomly picked):

> Loading driver at 0x00010000000 EntryPoint=0x00010001090 cdboot.efi
> InstallProtocolInterface: [EfiLoadedImageDevicePathProtocol] 3DEA4698
> ProtectUefiImageCommon - 0x3CEB8040
>   - 0x0000000010000000 - 0x0000000000152000
> ClockRate = 1843200
> Divisor   = 1
> BaudRate/Actual (115200/115200) = 100%
> VirtioScsiExitBoot: Context=0x3CFF5018
> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0

Note the lack of "TimerInterruptHandler:65" just before "ClockRate".

However, in the disassembly, I find the exact same CLI/STI patterns.

Comment 27 Laszlo Ersek 2023-05-03 12:41:54 UTC

Next idea: what if edk2 services are used for reenabling interrupts, while at TPL_HIGH_LEVEL?

- modify the DXE core to reflect all changes to gEfiCurrentTpl to a predefined fixed physical address

- modify EnableInterrupts() [MdePkg/Library/BaseLib/X64/GccInlinePriv.c] to check the current TPL; complain if it is already HIGH. (EnableInterrupts() is the BaseLib interface that is called either directly by low-level drivers or via CPU protocol interfaces via higher-level drivers.)

Comment 28 Gerd Hoffmann 2023-05-03 12:50:50 UTC

> Entry to the subroutine with CLI (having one call site):
> 
>     100b9020:	48 55                	rex.W push %rbp
>     100b9022:	56                   	push   %rsi
>     100b9023:	57                   	push   %rdi
>     100b9024:	53                   	push   %rbx
>     100b9025:	9c                   	pushfq
>     100b9026:	83 3d 1b 30 04 00 00 	cmpl   $0x0,0x4301b(%rip)        #
> 0x100fc048
>     100b902d:	74 01                	je     0x100b9030
>     100b902f:	fa                   	cli
>     100b9030:	0f 20 c0             	mov    %cr0,%rax

When setting breakpoints at all four cli/sti instructions this is the
only one which fires.

>     100b9033:	50                   	push   %rax
>     100b9034:	0f 20 e0             	mov    %cr4,%rax
>     100b9037:	50                   	push   %rax
>     100b9038:	48 25 00 10 00 00    	and    $0x1000,%rax
>     100b903e:	48 c1 e8 0c          	shr    $0xc,%rax
>     100b9042:	50                   	push   %rax
>     100b9043:	0f 20 d8             	mov    %cr3,%rax
>     100b9046:	48 89 05 d3 2f 04 00 	mov    %rax,0x42fd3(%rip)        #
> 0x100fc020
>     100b904d:	48 89 25 04 30 04 00 	mov    %rsp,0x43004(%rip)        #
> 0x100fc058
>     100b9054:	48 8b 15 e5 2f 04 00 	mov    0x42fe5(%rip),%rdx        #
> 0x100fc040
>     100b905b:	0f 01 12             	lgdt   (%rdx)
>     100b905e:	0f 01 5a 0a          	lidt   0xa(%rdx)
>     100b9062:	0f 00 52 14          	lldt   0x14(%rdx)
> 
> So this is some really quirky code, reading CR4, CR3, and loading the
> global, local, and interrupt descriptor tables!

Really strange cdboot seems to do all this way before calling
exitbootservices.

>     100b90fa:	48 b8 0d 91 0b 10 00 	movabs $0x100b910d,%rax
>     100b9101:	00 00 00
> 
> Note this address: 100b910d.
> 
>     100b9104:	48 0f b7 4a 16       	movzwq 0x16(%rdx),%rcx
>     100b9109:	51                   	push   %rcx
>     100b910a:	50                   	push   %rax
>     100b910b:	48 cb                	lretq
> 
> The helper subroutine ends here.

No, this is just a jmp $0x100b910d (push target to stack and ret).

Comment 29 Gerd Hoffmann 2023-05-03 13:20:17 UTC

> >     100b90fa:	48 b8 0d 91 0b 10 00 	movabs $0x100b910d,%rax
> >     100b9101:	00 00 00
> > 
> > Note this address: 100b910d.
> > 
> >     100b9104:	48 0f b7 4a 16       	movzwq 0x16(%rdx),%rcx
> >     100b9109:	51                   	push   %rcx
> >     100b910a:	50                   	push   %rax
> >     100b910b:	48 cb                	lretq
> > 
> > The helper subroutine ends here.
> 
> No, this is just a jmp $0x100b910d (push target to stack and ret).

Well, almost, not clear whenever %rxc is zero at that point.
Maybe that is part of self-relocation.

Comment 30 Gerd Hoffmann 2023-05-03 13:28:01 UTC

(In reply to Laszlo Ersek from comment #27)
> Next idea: what if edk2 services are used for reenabling interrupts, while
> at TPL_HIGH_LEVEL?

Also note that using sti/cli is not the only way to enable/disable interrupts,
it's just a bit in the flags register, so a pushf + raisetpl(high) + popf could
also lead to IRQs being enabled at TPL=HIGH.

Comment 31 Laszlo Ersek 2023-05-03 13:30:58 UTC

Thanks for the corrections! :) Didn't realize the jump-by-ret trick was in used here; I thought that was only necessary for CPU mode switches.

Regarding popf... That sounds horrible. I'm unsure if we can ever catch that. There are probably too many popfs for us to set breakpoints on all of them.

It's interesting though that from the explicit instructions, only one CLI is reached (repeatedly perhaps, but still).

Comment 32 Gerd Hoffmann 2023-05-03 13:52:08 UTC

> > No, this is just a jmp $0x100b910d (push target to stack and ret).
> 
> Well, almost, not clear whenever %rxc is zero at that point.
> Maybe that is part of self-relocation.

Self-relocation would also explain why that sti breakpoint never triggers.

Comment 33 Gerd Hoffmann 2023-05-03 14:08:02 UTC

(In reply to Gerd Hoffmann from comment #32)
> > > No, this is just a jmp $0x100b910d (push target to stack and ret).
> > 
> > Well, almost, not clear whenever %rxc is zero at that point.
> > Maybe that is part of self-relocation.
> 
> Self-relocation would also explain why that sti breakpoint never triggers.

Hmm, trying to set a breakpoint at the lretq instruction doesn't work either.
But doesn't matter much, that is most likely re-enabling interrupts after
calling exitbootservices and setting up everything (specifically IRQ handlers)
on the boot cpu.

Comment 34 Gerd Hoffmann 2023-05-03 14:40:19 UTC

Printed some random junk to the (slow) debug log while IRQs are disabled
to let some time pass.  Had the effect that the timer IRQ was already pending
when interrupts are enabled, so it instantly fired and ran into the ASSERT().

Now we can inspect where we are:

(1) Walk up the stack to TimerInterruptHandler.
(2) print /x *SystemContext.SystemContextX64

$2 = {ExceptionData = 0x0, FxSaveState = {Fcw = 0x37f, Fsw = 0x0, Ftw = 0x0, Opcode = 0x0, 
[ ... ]
  Rflags = 0x202, Ldtr = 0x0, Tr = 0x0, Gdtr = {0x7f7dc000, 0x47}, Idtr = {0x7ee93018, 0xfff}, 
  Rip = 0x235579b, Gs = 0x30, Fs = 0x30, Es = 0x30, Ds = 0x30, Cs = 0x38, Ss = 0x30, 
  Rdi = 0x2608000, Rsi = 0x243f90, Rbp = 0xa, Rsp = 0x243d20, Rbx = 0x2607fc0, Rdx = 0x2569e3e, 
  Rcx = 0x2607fd4, Rax = 0x38, R8 = 0x7fe3f2ce, R9 = 0x0, R10 = 0x200, R11 = 0xffffffffff000, 
  R12 = 0x24403c, R13 = 0x0, R14 = 0x1, R15 = 0x4}

(3) x/10i 0x2355790

   0x2355790:	testb  $0x2,0x10(%rbx)
   0x2355794:	je     0x235579b
   0x2355796:	call   0x2569ec0
   0x235579b:	mov    0x30(%rsp),%rbx     <== RIP is here.
   0x23557a0:	mov    0x38(%rsp),%rsi

(4) x/10i 0x2569ec0

   0x2569ec0:	sti
   0x2569ec1:	ret

So apparently a tiny sti() helper function, simliar to the one found in cdboot.efi
But it's not the location where cdboot.efi was loaded to.  So either cdboot.efi
relocated itself, or it loaded stuff from cdrom and jumped to it.  The latter is
IMHO more likely.

Comment 35 Laszlo Ersek 2023-05-03 14:43:41 UTC

Created attachment 1961973 [details]
edk2 debug patch on top of commit 2c2cb2352896

(In reply to Laszlo Ersek from comment #27)
> Next idea: what if edk2 services are used for reenabling interrupts, while
> at TPL_HIGH_LEVEL?
> 
> - modify the DXE core to reflect all changes to gEfiCurrentTpl to a
> predefined fixed physical address
> 
> - modify EnableInterrupts() [MdePkg/Library/BaseLib/X64/GccInlinePriv.c] to
> check the current TPL; complain if it is already HIGH. (EnableInterrupts()
> is the BaseLib interface that is called either directly by low-level drivers
> or via CPU protocol interfaces via higher-level drivers.)

The attached debug patch does this (more or less... more precisely, the debug patch exposes the DXE Core's own gEfiCurrentTpl variable to other modules).

Unfortunately, this patch does *not* fire, even though booting Windows reproduces the symptom. (I did test the patch forcibly, by raising the TPL to HIGH manually, and then manually enabling interrupts. The patch surely fired then.) So in whatever way Windows reenables interrupts while the TPL is HIGH, Windows does not go through any edk2 service that terminates in BaseLib's EnableInterrupts() function.

Comment 36 Gerd Hoffmann 2023-05-03 14:48:53 UTC

scratch build with workaround (still building atm):
https://kojihub.stream.centos.org/koji/taskinfo?taskID=2176907

Comment 37 Laszlo Ersek 2023-05-03 15:46:04 UTC

(In reply to Gerd Hoffmann from comment #34)
> Printed some random junk to the (slow) debug log while IRQs are
> disabled to let some time pass.  Had the effect that the timer IRQ was
> already pending when interrupts are enabled, so it instantly fired and
> ran into the ASSERT().

Ah, beautiful -- it turns the STI into a synchronous interrupt,
effectively.

>
> Now we can inspect where we are:
>
> (1) Walk up the stack to TimerInterruptHandler.
> (2) print /x *SystemContext.SystemContextX64
>
> $2 = {ExceptionData = 0x0, FxSaveState = {Fcw = 0x37f, Fsw = 0x0, Ftw = 0x0,
> Opcode = 0x0,
> [ ... ]
>   Rflags = 0x202, Ldtr = 0x0, Tr = 0x0, Gdtr = {0x7f7dc000, 0x47}, Idtr =
> {0x7ee93018, 0xfff},
>   Rip = 0x235579b, Gs = 0x30, Fs = 0x30, Es = 0x30, Ds = 0x30, Cs = 0x38, Ss
> = 0x30,
>   Rdi = 0x2608000, Rsi = 0x243f90, Rbp = 0xa, Rsp = 0x243d20, Rbx =
> 0x2607fc0, Rdx = 0x2569e3e,
>   Rcx = 0x2607fd4, Rax = 0x38, R8 = 0x7fe3f2ce, R9 = 0x0, R10 = 0x200, R11 =
> 0xffffffffff000,
>   R12 = 0x24403c, R13 = 0x0, R14 = 0x1, R15 = 0x4}
>
> (3) x/10i 0x2355790
>
>    0x2355790:	testb  $0x2,0x10(%rbx)
>    0x2355794:	je     0x235579b
>    0x2355796:	call   0x2569ec0
>    0x235579b:	mov    0x30(%rsp),%rbx     <== RIP is here.
>    0x23557a0:	mov    0x38(%rsp),%rsi
>
> (4) x/10i 0x2569ec0
>
>    0x2569ec0:	sti
>    0x2569ec1:	ret
>
> So apparently a tiny sti() helper function, simliar to the one found
> in cdboot.efi But it's not the location where cdboot.efi was loaded
> to.  So either cdboot.efi relocated itself, or it loaded stuff from
> cdrom and jumped to it.  The latter is IMHO more likely.

Both ideas are consistent with what I'm seeing. I've created another
small debug patch:

> diff --git a/MdeModulePkg/Core/Dxe/Misc/Stall.c b/MdeModulePkg/Core/Dxe/Misc/Stall.c
> index 35e045d41a6c..fc371bac3769 100644
> --- a/MdeModulePkg/Core/Dxe/Misc/Stall.c
> +++ b/MdeModulePkg/Core/Dxe/Misc/Stall.c
> @@ -54,6 +54,16 @@ CoreStall (
>    UINT32  Remainder;
>    UINTN   Index;
>
> +  if (gEfiCurrentTpl == TPL_HIGH_LEVEL) {
> +    DEBUG ((
> +      DEBUG_WARN,
> +      "%a:%d:%p\n",
> +      __func__,
> +      __LINE__,
> +      (VOID *)&Microseconds
> +      ));
> +  }
> +
>    if (gMetronome == NULL) {
>      return EFI_NOT_AVAILABLE_YET;
>    }

It basically logs all gBS->Stall() calls that are made at
TPL_HIGH_LEVEL. (This debug patch is easy because the DXE Core both
implements CoreStall() and defines gEfiCurrentTpl, so the former can
access the latter.)

The log message also includes the address of the gBS->Stall() parameter
"Microseconds". The idea with that is to glean some minimal insight into
the stack upon which gBS->Stall() is called.

When I boot "en_windows_10_enterprise_ltsc_2019_x64_dvd_be3c8ffb.iso"
(successfully to the first dialog of the Windows installer), there are
*zero* such log messages.

When I boot the problematic ISO, I get the following:

> Loading driver at 0x00010000000 EntryPoint=0x00010001090 cdboot.efi
> InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 3DEA1698
> ProtectUefiImageCommon - 0x3CEBC040
>   - 0x0000000010000000 - 0x000000000011C000
> CoreStall:62:3EF13810
> CoreStall:62:3EF13810
> CoreStall:62:3EF13810

These are the first such messages in the log; that is, nothing other
than cdboot.efi calls Stall() at TPL_HIGH_LEVEL (which is BTW permitted
by the UEFI spec).

These work just fine.

Also note that the stack seems to be the "normal" edk2 stack.

Yet further, let's record the fact that the new Windows ISO differs from
the old ISO in that the new one calls Stall() at TPL_HIGH_LEVEL, while
the old one doesn't. Again, calling Stall() like this is permitted by
the spec, what I don't understand is how Windows (cdboot.efi) gets to
run *any* code at TPL_HIGH_LEVEL, including a function call to
gBS->Stall().

This suggests that "cdboot.efi" registers a notification function at
TPL_HIGH_LEVEL, which on the other hand would be a violation of the
spec. In fact it doesn't look possible; CoreCreateEventEx()
[MdeModulePkg/Core/Dxe/Event/Event.c] seems to enforce valid TPLs.

Anyway, these messages above are logged while "cdboot.efi" is displaying
the growing string of dots, after the prompt ("press any key to boot
from DVD" or some such). If I don't press a key and "cdboot.efi" times
out, then we don't get other types of CoreStall messages.

However, when I press Enter for example, the log continues as follows,
with the stack address changed:

> CoreStall:62:136D00
> CoreStall:62:136D00
> CoreStall:62:136D00
> CoreStall:62:241D80
> CoreStall:62:241D80
> TimerInterruptHandler:65
> CoreStall:62:241D80
> ClockRate = 1843200
> Divisor   = 1
> BaudRate/Actual (115200/115200) = 100%
> VirtioScsiExitBoot: Context=0x3CED0018

This indicates that the same loop somewhere in Windows continues, or
that the same (impossible?) TPL_HIGH_LEVEL notification function
continues to be queued and executed, but this time it keeps interrupting
a very different call stack. And then even *that* stack changes quite
significantly.

And then we see the interrupt being delivered while the TPL is HIGH,
there's *another* Stall() call, then we see SerialSetAttributes(), and
finally ExitBootServices().

It's inexplicable *whence* we get those gBS->Stall() calls, at
TPL_HIGH_LEVEL.

Hm, I might be able to check into that with gdb, in fact; just replace
the log with a CpuDeadLoop(), and get a backtrace?

Comment 38 Gerd Hoffmann 2023-05-04 05:14:11 UTC

> Hm, I might be able to check into that with gdb, in fact; just replace
> the log with a CpuDeadLoop(), and get a backtrace?

My process: boot, let cdboot.efi timeout, wait for efi shell prompt,
attach gdb.  All firmware modules are loaded at that point, and RIP
is somewhere in edk2, the edk2 gdb scripts manage to automatically
load the complete debug info in this state.

Now set breakpoints + continue.  In efi shell start cdboot.efi again
and watch.

For monitoring specific conditions its useful to call a empty void
function, just for being able to easily set breakpoints, like this:

--- a/MdeModulePkg/Core/Dxe/Misc/Stall.c
+++ b/MdeModulePkg/Core/Dxe/Misc/Stall.c
@@ -44,6 +44,9 @@ CoreInternalWaitForTick (
   @retval EFI_NOT_AVAILABLE_YET  gMetronome is not available yet
 
 **/
+
+void CoreStallBT(void) {}
+
 EFI_STATUS
 EFIAPI
 CoreStall (
@@ -58,6 +61,10 @@ CoreStall (
     return EFI_NOT_AVAILABLE_YET;
   }
 
+  if (gEfiCurrentTpl == TPL_HIGH_LEVEL) {
+    CoreStallBT();
+  }
+
   //
   // Counter = Microseconds * 10 / gMetronome->TickPeriod
   // 0x1999999999999999 = (2^64 - 1) / 10

to be combined with a gdb script:

break CoreStallBT
commands
bt
cont
end

First there are a few stacktraces like this:

Breakpoint 1, CoreStallBT () at /home/kraxel/projects/edk2/MdeModulePkg/Core/Dxe/Misc/Stall.c:48
48	void CoreStallBT(void) {}
#0  CoreStallBT () at /home/kraxel/projects/edk2/MdeModulePkg/Core/Dxe/Misc/Stall.c:48
#1  0x000000007fe063f9 in CoreStall (Microseconds=9000)
    at /home/kraxel/projects/edk2/MdeModulePkg/Core/Dxe/Misc/Stall.c:65
#2  0x00000000016459dc in ?? ()
#3  0x0000000000002328 in ?? ()
#4  0x000000017e170020 in ?? ()
#5  0x0000000000137f50 in ?? ()
#6  0x0000000000138000 in ?? ()
#7  0x00000021cfd76d3c in ?? ()
#8  0x000000000177ab75 in ?? ()
#9  0x00000021d40c8998 in ?? ()
#10 0x00000000017b7fe2 in ?? ()
#11 0x000000007e4a5100 in mConIn ()
#12 0x0000000000137dc0 in ?? ()
#13 0x00000000001383c6 in ?? ()
#14 0x0000000000000001 in ?? ()
#15 0x0000000000138000 in ?? ()
#16 0x0000000001779ae4 in ?? ()
#17 0x0000000000000000 in ?? ()

This is probably cdboot.efi waiting for a keypress.

Finally once this (before hitting NestedInterruptRaiseTPL ASSERT())

Breakpoint 1, CoreStallBT () at /home/kraxel/projects/edk2/MdeModulePkg/Core/Dxe/Misc/Stall.c:48
48	void CoreStallBT(void) {}
#0  CoreStallBT () at /home/kraxel/projects/edk2/MdeModulePkg/Core/Dxe/Misc/Stall.c:48
#1  0x000000007fe063f9 in CoreStall (Microseconds=1)
    at /home/kraxel/projects/edk2/MdeModulePkg/Core/Dxe/Misc/Stall.c:65
#2  0x000000000235824c in ?? ()
#3  0x0000000000000001 in ?? ()
#4  0x0000000000243f90 in ?? ()
#5  0x0000000000243f90 in ?? ()
#6  0x0000000000244000 in ?? ()
#7  0x0000000000244000 in ?? ()
#8  0x000000000249c779 in ?? ()
#9  0x0000000000000000 in ?? ()

Which again seems to be the code loaded by cdboot.efi running.

Comment 44 echo 2023-05-05 02:51:00 UTC

*** Bug 2183336 has been marked as a duplicate of this bug. ***

Comment 45 Laszlo Ersek 2023-05-05 09:52:58 UTC

Created attachment 1962534 [details]
catch unannounced high TPL: debug patch series on top of git 94c802e108a0

Bingo. Windows bug.

I said earlier that no Windows code should ever be running at
TPL_HIGH_LEVEL; per UEFI spec, TPL_HIGH_LEVEL is restricted to platform
firmware use.

https://uefi.org/specs/UEFI/2.10/07_Services_Boot_Services.html#event-timer-and-task-priority-services

> Tasks that execute at a higher priority level may interrupt tasks that
> execute at a lower priority level. For example, tasks that run at the
> TPL_NOTIFY level may interrupt tasks that run at the TPL_APPLICATION
> or TPL_CALLBACK level. While TPL_NOTIFY is the highest level exposed
> to the boot services applications, the firmware may have higher task
> priority items it deals with. For example, the firmware may have to
> deal with tasks of higher priority like timer ticks and internal
> devices. Consequently, there is a fourth TPL, TPL_HIGH_LEVEL {link
> needed}, designed for use exclusively by the firmware.

This morning I got suspicious that "cdboot.efi" was *itself* raising the
TPL to HIGH level. That would (i) violate the UEFI spec, and (ii) be
consistent with the misbehavior we've been seeing. Gerd has shown before
that "cdboot.efi" itself contained STI instructions, and that it also
executed those. If you combine that with "cdboot.efi" itself raising the
TPL to TPL_HIGH_LEVEL, we've got an explanation (and evidence that
"cdboot.efi" violates the UEFI spec, therefore the undefined behavior is
"cdboot.efi"'s bug).

So, how do we prove that "cdboot.efi" directly raises the TPL to HIGH
level?

We can't make "cdboot.efi" do things for us, but we can make the rest of
the firmware do things that "cdboot.efi" does not do. For example,
before raising the TPL to HIGH level, *announce* the intent. And then in
the RaiseTPL() boot service implementation (CoreRaiseTpl()
[MdeModulePkg/Core/Dxe/Event/Tpl.c]) *enforce* the announcement. If the
intent was announced, everything's fine; otherwise, halt. Furthermore,
in every case, clear the announcement in RaiseTPL().

Refer to the attached debug patch series (applies on top of upstream git
commit 94c802e108a0, "MdePkg/BasePeCoffLib: Deal with broken debug
directories", 2023-05-05):

 1  04adbfa5d04a Revert "OvmfPkg: Use NestedInterruptTplLib in nested
                 interrupt handlers"
 2  1aa2b0eca9a0 Revert "OvmfPkg: Add library to handle TPL from within
                 nested interrupt handlers"
 3  b7469ddb3329 LocalApicTimerDxe: warn about timer interrupt firing at
                 TPL_HIGH_LEVEL

Here's what these three patches do: the first two revert Michael's
re-fixing of the Xen issue (= interrupt storm on Xen causing stack
overflow due to nested interrupt handler invocations). The third patch
adds a warning to the interrupt handler, namely when we determine that
the interrupt was *delivered* with the TPL already being HIGH (this is
what should never happen).

This sub-series of patches only changes the *symptoms* of the
misbehavior. It does not interfere with the misbehavior in any other
way; it just makes the bug (wherever it is) more tractable for analysis.

The sub-series does not change the invariants that (a) no 3rd party code
ever should run at HIGH level, (b) no timer interrupt must be delivered
at TPL_HIGH_LEVEL.

The sub-series only returns to the naive/direct interrupt handler
nesting that we had in place *before* commit 239b50a86370 ("OvmfPkg: End
timer interrupt later to avoid stack overflow under load", 2020-06-18).
That status was actually correct, it was only problematic when exposed
to broken virtual hardware on overloaded Xen hosts.

With this sub-series applied, a timer interrupt that is delivered at
TPL_HIGH_LEVEL is simply logged. Other than that, we call the RaiseTPL
and RestoreTPL boot services directly in TimerInterruptHandler(); we
simply don't try to be smart about nesting.

 4  1b54bbd0ab94 force agents raising the TPL to HIGH to announce
                 themselves

This is the core patch that adds the "infrastructure" for announcing the
intent to raise the TPL to HIGH, and it also enforces (and clears) the
announcement in RaiseTPL. For normal edk2 standards, this patch should
be split into multiple parts (5 patches), but for debugging, it is good
enough.

Booting OVMF with the series applied up to this point will of course
trigger the new ASSERT() in a whole lot of locations -- edk2 justifiedly
raises the TPL to HIGH in a good number of places, but never "announces"
the intent. Thus, the rest of the series:

 5  aa55557a86ef DXE Core: CoreAcquireEventLock(): announce high TPL
 6  10f666d7cb57 DXE Core: CoreAcquireDispatcherLock(): announce high
                 TPL
 7  5ffc2eec2362 DXE Core: CoreCurrentSystemTime(): announce high TPL
 8  42d4708a027e OvmfPkg/LocalApicTimerDxe: TimerInterruptHandler():
                 announce high TPL
 9  4f05f8422a1a MdeModulePkg/MonotonicCounterRuntimeDxe: announce high
                 TPL in GetNextHigh
10  c2dc683e159e MdeModulePkg/MonotonicCounterRuntimeDxe: announce high
                 TPL in GetNext
11  8d17c4ad439a MdeModulePkg/PciBusDxe: announce high TPL in BarExisted

12  56b300b4ae2a MdeModulePkg/PciBusDxe: announce high TPL in
                 PciTestSupportedAttribute
13  15be7cf9fcff NetworkPkg/DpcDxe: announce high TPL in DpcDispatchDpc

14  18c3f576171a NetworkPkg/DpcDxe: announce high TPL in DpcQueueDpc
15  8b09460f0397 MdePkg/UefiLib: announce high TPL in EfiGetCurrentTpl
16  abd2e07fdc32 MdeModulePkg/UsbBusDxe: announce high TPL in
                 UsbGetCurrentTpl

instruments all those locations, with announcing the intent.

I developed each patch in this sub-series as follows:

(1) boot OVMF, passing "-s" to QEMU (= enable the gdb back-end),

(2) hit the ASSERT from patch#4 (raising the TPL to HIGH without
    announcement),

(3) run the following command from the edk2 project root:

    ( cd BaseTools/Scripts/ &&
      gdb -ex "target remote localhost:1234" -ex "source efi_gdb.py" )

(4) get a backtrace,

(5) locate the edk2 code raising the TPL, and modify it to announce the
    intent. I didn't always choose to extend the *immediate* caller of
    RaiseTPL; sometimes I went for the next outer frame. (See for
    example in patches #5 through #7.)

Again for upstream edk2 standards, this high-level ordering of the
series is inapproprite; after this last part of the series was
developed, it should have been reordered with patch #4. That is,
normally we should only "arm" the enforcement when the codebase already
complies with the new requirement. However, this more naive / original
ordering shows the development method better, in this case.

With the entire series in place, the following happens:

- If I boot "en_windows_10_enterprise_ltsc_2019_x64_dvd_be3c8ffb.iso",
  there is nothing out of the ordinary. The warning message from patch#3
  is not logged (that is, no violation of the invariant (b)). The
  ASSERT() from patch#4 is not hit (that is, no unannounced TPL bump to
  HIGH level -- no violation of the invariant (a)). And the Windows
  installer starts OK.

- If I boot
  "en-us_windows_11_business_editions_version_22h2_x64_dvd_17a08ce3.iso"
  on the other hand, then patch#4 *does catch* "cdboot.efi" at violating
  invariant (a), that is, raising the TPL to HIGH level. And this
  happens *before* patch#3 could complain about invariant (b) being
  violated (i.e., a timer interrupt being delivered at TPL_HIGH_LEVEL);
  cf. "TimerInterruptHandler:65" in comment#25.

Here's the tail of the firmware log (wrapped manually for readability):

> Loading driver at 0x00010000000 EntryPoint=0x00010001090 cdboot.efi
> InstallProtocolInterface: [EfiLoadedImageDevicePathProtocol] 3DEB5B18
> ProtectUefiImageCommon - 0x3CEC3040
>   - 0x0000000010000000 - 0x000000000011C000
> ASSERT .../edk2/MdeModulePkg/Core/Dxe/Event/Tpl.c(78):
> HighTplAnnounce == ((void *) 0) || *HighTplAnnounce

Now, in this stage, the ASSERT() is of course spinning in CpuDeadLoop(),
so we can pause the VM, attach gdb, and get a backtrace. (See steps (3)
and (4) above.)

> (gdb) where
> #0  CpuPause () at .../edk2/MdePkg/Library/BaseLib/X64/GccInline.c:45
> #1  0x000000003ef37953 in CpuDeadLoop () at
>     .../edk2/MdePkg/Library/BaseLib/CpuDeadLoop.c:30
> #2  0x000000003ef3b600 in DebugAssert (FileName=0x3ef4cba0
>     ".../edk2/MdeModulePkg/Core/Dxe/Event/Tpl.c", LineNumber=78,
>     Description=0x3ef4cbf8
>     "HighTplAnnounce == ((void *) 0) || *HighTplAnnounce") at
>     .../edk2/OvmfPkg/Library/PlatformDebugLibIoPort/DebugLib.c:230
> #3  0x000000003ef31d27 in CoreRaiseTpl (NewTpl=TPL_HIGH_LEVEL) at
>     .../edk2/MdeModulePkg/Core/Dxe/Event/Tpl.c:78
> #4  0x0000000010041681 in ?? ()
> #5  0x000000000000001f in ?? ()
> #6  0x000000001006f6c2 in ?? ()
> #7  0x000000003d3d4000 in mConIn ()
> #8  0x000000003ef138e0 in ?? ()
> #9  0x0000000010111ffc in ?? ()
> #10 0x0000000000000001 in ?? ()
> #11 0x0000000010111fc0 in ?? ()
> #12 0x000000001003dd20 in ?? ()
> #13 0x0000000000000000 in ?? ()

Note the following bits:

- Frame#3: RaiseTPL is being called with NewTpl=TPL_HIGH_LEVEL.

- Frame#4: the return address on the stack is 0x0000000010041681. This
  address (range) definitely belongs to "cdboot.efi", per earlier
  investigation (see the prior BZ comments).

OK, let's get a disassembly leading up to return address
0x0000000010041681:

> (gdb) x/20i 0x10041660
>    0x10041660:  mov    (%rax),%ebx
>    0x10041662:  cmp    $0x1,%ebx
>    0x10041665:  je     0x1004166f
>    0x10041667:  lea    -0x9(%rbp),%ecx
>    0x1004166a:  call   0x1009b478
>    0x1004166f:  mov    0xd091a(%rip),%rax        # 0x10111f90
>    0x10041676:  mov    $0x1f,%ecx
>    0x1004167b:  mov    0x18(%rax),%rax
>    0x1004167f:  call   *%rax
>    0x10041681:  nopl   (%rax)       <------ we're going to return here
>    0x10041684:  mov    %rax,%r15
>    0x10041687:  cmp    $0x1,%ebx
>    0x1004168a:  je     0x10041693
>    0x1004168c:  mov    %ebx,%ecx
>    0x1004168e:  call   0x1009b478
>    0x10041693:  mov    $0x1,%ecx
>    0x10041698:  call   0x100046ac
>    0x1004169d:  rdtsc
>    0x1004169f:  shl    $0x20,%rdx
>    0x100416a3:  mov    $0x3e8,%ecx

Consider the three instructions just before the return address:

>    0x10041676:  mov    $0x1f,%ecx
>    0x1004167b:  mov    0x18(%rax),%rax
>    0x1004167f:  call   *%rax
>    0x10041681:  nopl   (%rax)       <------ we're going to return here

I find the GNU assembly syntax hard to read, so the same byte sequence
disassembled with NASM:

> B91F000000        mov ecx,0x1f
> 488B4018          mov rax,[rax+0x18]
> FFD0              call rax

The first instruction moves 0x1F to ECX. Per EFIAPI (aka Microsoft)
calling convention
<https://en.wikipedia.org/wiki/X86_calling_conventions#Microsoft_x64_calling_convention>,
this means that we're setting the first argument for the upcoming
function call to 0x1F. And 0x1F happens to be the *published* value of
TPL_HIGH_LEVEL
<https://uefi.org/specs/UEFI/2.10/07_Services_Boot_Services.html#efi-boot-services-raisetpl>.

The second instruction uses RAX as a pointer, offsets it by 0x18 bytes,
and reads the qword from the pointed-to location back into RAX.

The third instruction calls the function that is now pointed-to by RAX.

Now consider the EFI_BOOT_SERVICES structure (quoting just the front):

https://uefi.org/specs/UEFI/2.10/04_EFI_System_Table.html#efi-boot-services

> typedef struct {
>   EFI_TABLE_HEADER     Hdr;
>
>   //
>   // Task Priority Services
>   //
>   EFI_RAISE_TPL        RaiseTPL;       // EFI 1.0+

also consider the EFI_TABLE_HEADER structure:

https://uefi.org/specs/UEFI/2.10/04_EFI_System_Table.html#id4

> typedef struct {
>   UINT64      Signature;
>   UINT32      Revision;
>   UINT32      HeaderSize;
>   UINT32      CRC32;
>   UINT32      Reserved;
> } EFI_TABLE_HEADER;

The size of EFI_TABLE_HEADER is 24 bytes (decimal) -- in hex: 0x18.
Therefore, the offset of the "RaiseTPL" field in EFI_BOOT_SERVICES is
0x18.

In other words, in the quoted "cdboot.efi" disassembly, RAX originally
points to EFI_BOOT_SERVICES; "cdboot.efi" then loads the "RaiseTPL"
field (at offset 0x18) into RAX, then calls RaiseTPL() via RAX, with
argument 0x1F (= TPL_HIGH_LEVEL) open-coded in ECX.

  ((EFI_BOOT_SERVICES *)RAX)->RaiseTPL (TPL_HIGH_LEVEL);

This is the Windows bug. Per spec, a 3rd party UEFI application must not
do this.

Comment 46 Laszlo Ersek 2023-05-05 10:50:56 UTC

Not that we need more evidence, but two further comments:

(1)

> - Frame#4: the return address on the stack is 0x0000000010041681. This
>   address (range) definitely belongs to "cdboot.efi", per earlier
>   investigation (see the prior BZ comments).

This is actually evident from the firmware log already, note the load
address of "cdboot.efi", plus the subsequent ProtectUefiImageCommon
message:

> Loading driver at 0x00010000000 EntryPoint=0x00010001090 cdboot.efi
> InstallProtocolInterface: [EfiLoadedImageDevicePathProtocol] 3DEB5B18
> ProtectUefiImageCommon - 0x3CEC3040
>   - 0x0000000010000000 - 0x000000000011C000
> ASSERT .../edk2/MdeModulePkg/Core/Dxe/Event/Tpl.c(78):
> HighTplAnnounce == ((void *) 0) || *HighTplAnnounce

ImageBase=0x10000000
ImageSize=0x11C000

Return address 0x10041681 is at offset 0x41681 from ImageBase, and that
offset is less than ImageSize.

(2)

I've just realized that "EFI_BOOT_SERVICES.Hdr.Signature" is right at
the start of the EFI_BOOT_SERVICES table; therefore, let's reconsider
the context leading up to the invalid RaiseTPL call with *just one more*
instruction:

>    0x1004166f:  mov    0xd091a(%rip),%rax        # 0x10111f90
>    0x10041676:  mov    $0x1f,%ecx
>    0x1004167b:  mov    0x18(%rax),%rax
>    0x1004167f:  call   *%rax
>    0x10041681:  nopl   (%rax)       <------ we're going to return here

Here GDB even helpfully calculates (from the RIP-relative addressing)
the address of the *global variable* that's being loaded into RAX, going
in: 0x10111f90.

We expect that this global variable is "gBS"; more precisely, some
global variable that "cdboot.efi" uses for caching the address of the
boot services table.

Let's see what this supposed "gBS" global variable contains:

> (gdb) x/g 0x10111f90
> 0x10111f90:     0x000000003ef53200

Then let's see the 8-byte signature at offset#0 from (*gBS) --
equivalently, the contents of gBS->Hdr.Signature. Both as a string, and
as a UINT64:

> (gdb) x/8c 0x000000003ef53200
> 0x3ef53200 <mBootServices>:     66 'B'  79 'O'  79 'O'  84 'T'  83 'S'  69 'E'  82 'R'  86 'V'
> (gdb) x/g 0x000000003ef53200
> 0x3ef53200 <mBootServices>:     0x56524553544f4f42

GDB even resolves the 0x000000003ef53200 address for us. It is the
address of the "mBootServices" variable -- the actual boot services
table -- from "MdeModulePkg/Core/Dxe/DxeMain/DxeMain.c".

And yes, the signature matches; it is "BOOTSERV" / 0x56524553544f4f42:

https://uefi.org/specs/UEFI/2.10/04_EFI_System_Table.html#efi-boot-services

> #define EFI_BOOT_SERVICES_SIGNATURE 0x56524553544f4f42

So yes, in the quoted assembly, "cdboot.efi" totally calls

  (&mBootServices)->RaiseTPL (TPL_HIGH_LEVEL);

violating the UEFI spec.

Comment 51 Gerd Hoffmann 2023-05-08 09:53:19 UTC

Workaround: turn on all hyperv enlightments (-cpu host,hv-passthrough=on)

Comment 52 Gerd Hoffmann 2023-05-08 10:33:26 UTC

(In reply to Gerd Hoffmann from comment #51)
> Workaround: turn on all hyperv enlightments (-cpu host,hv-passthrough=on)

-cpu host,invtsc=on works too.

Comment 53 Michael Brown 2023-05-08 21:34:57 UTC

Candidate patch at https://edk2.groups.io/g/devel/message/104290 (and identical pull request at https://github.com/tianocore/edk2/pull/4362) to work around the known Windows bug.

Comment 55 Gerd Hoffmann 2023-05-09 06:01:12 UTC

(In reply to Gerd Hoffmann from comment #52)
> (In reply to Gerd Hoffmann from comment #51)
> > Workaround: turn on all hyperv enlightments (-cpu host,hv-passthrough=on)

More specific: hv-frequencies=on enlighment is needed for this.

> -cpu host,invtsc=on works too.

Nope, doesn't, testing error.

Comment 56 Laszlo Ersek 2023-05-09 08:46:40 UTC

(In reply to Michael Brown from comment #53)
> Candidate patch at https://edk2.groups.io/g/devel/message/104290 (and
> identical pull request at https://github.com/tianocore/edk2/pull/4362) to
> work around the known Windows bug.

Thank you, Michael!

Comment 58 Michael Brown 2023-05-10 10:17:18 UTC

The patch to relax the assertion to an error message has been merged via https://github.com/tianocore/edk2/pull/4371, so Windows 11 installation should now work on current OVMF despite the dubious behaviour of calling (&mBootServices)->RaiseTPL(TPL_HIGH_LEVEL) and then enabling interrupts.

I think this issue can probably now be closed, though it would be good to sustain the separate conversation thread with Microsoft to get the root cause fixed.

Comment 59 Laszlo Ersek 2023-05-10 11:32:43 UTC

Thank you again for the patches, Michael!

Since this is technically a downstream ticket, the expected BZ Status field changes are a bit less simple.

Once we've backported your upstream patches (or inherited them via rebase), and a downstream development RPM has been built, the Status will move to MODIFIED.

Then the QE team will verify the update, using that (or a later) development RPM (-> ON_QA, -> VERIFIED, ideally).

The ticket is expected to be closed once an RPM version containing the verified updates is released (most likely with the next minor release of RHEL9).

The process could get much more complicated in some cases (dependent on what previous minor releases of RHEL9 were affected), but this is the gist of it, AFAICT.

Comment 60 Xueqiang Wei 2023-05-11 08:39:19 UTC

Add a note from QE side.

1. Not hit this bug when testing with avocado(tp-qemu).
   The cpu flags "hv_stimer,hv_synic,hv_vpindex,hv_relaxed,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_frequencies,hv_runtime,hv_tlbflush,hv_reenlightenment,hv_stimer_direct,hv_ipi,kvm_pv_unhalt=on" will be added to qemu command lines in automation scripts, so I did not hit this bug.

cpu command lines:
-cpu 'Icelake-Server',ds=on,ss=on,dtes64=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,avx512ifma=on,sha-ni=on,rdpid=on,fsrm=on,md-clear=on,stibp=on,arch-capabilities=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rdctl-no=on,ibrs-all=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,tsx-ctrl=on,hle=off,rtm=off,mpx=off,intel-pt=off,hv_stimer,hv_synic,hv_vpindex,hv_relaxed,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_frequencies,hv_runtime,hv_tlbflush,hv_reenlightenment,hv_stimer_direct,hv_ipi,kvm_pv_unhalt=on \


2. Hit this bug when testing with virt-install.
 The cpu flags were added to qemu command lines were different from avocado automation scripts when testing with virt-install.
 
cpu command lines:
-cpu Icelake-Server,ds=on,ss=on,dtes64=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,avx512ifma=on,sha-ni=on,rdpid=on,fsrm=on,md-clear=on,stibp=on,arch-capabilities=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rdctl-no=on,ibrs-all=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,tsx-ctrl=on,hle=off,rtm=off,mpx=off,intel-pt=off,hv-time=on,hv-relaxed=on,hv-vapic=on,hv-spinlocks=0x1fff \

Found the following message in edk2 debug log:
ASSERT /builddir/build/BUILD/edk2-f80f052277c8/OvmfPkg/Library/NestedInterruptTplLib/Tpl.c(42): InterruptedTPL < 31

3. Because edk2 and qemu gating test loops were tested by avocado(tp-qemu) scripts, not cover it in these two test loops.
   vfio-gpu automation test scripts were implemented by virt-install, so cover this bug by vfio-gpu test loop.

4. Tested the above scenarios with the following versions.
versions:
kernel-5.14.0-306.el9.x86_64
qemu-kvm-8.0.0-2.el9
edk2-ovmf-20230301gitf80f052277c8-3.el9.noarch

Comment 62 Laszlo Ersek 2023-05-11 08:59:26 UTC

(In reply to Xueqiang Wei from comment #60)
> Add a note from QE side.

Right, your first use case contains "hv_frequencies=on", the second one doesn't. This is consistent with Gerd's findings (comment#55).

Comment 64 Xueqiang Wei 2023-05-16 15:06:47 UTC

QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 67 Xueqiang Wei 2023-05-22 17:38:53 UTC

According to Comment 60, run the following tests, all results were passed. So set status to VERIFIED.

Versions:
kernel-5.14.0-306.el9.x86_64
qemu-kvm-8.0.0-2.el9
edk2-ovmf-20230301gitf80f052277c8-4.el9.noarch


1. tested win11 installation with avocado (tp-qemu), the results were passed.
(1/2) Host_RHEL.m9.u3.ovmf.qcow2.virtio_blk.up.virtio_net.Guest.Win11.x86_64.io-github-autotest-qemu.uefi_secureboot.q35: PASS (1161.46 s)
(2/2) Host_RHEL.m9.u3.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win11.x86_64.io-github-autotest-qemu.uefi_secureboot.q35: PASS (1897.76 s)

2. win11 installation with virt-install, the results were passed. 
# cat win11_installation_with_virt-install.sh
virt-install \
        --name win11_T4 \
        --cpu host-model \
        --ram=8192 \
        --vcpus=4,sockets=1,cores=4,threads=1 \
        --disk path=/home/win11_T4.qcow2,format='qcow2',cache=none,size=80,bus='scsi' \
        --graphics vnc,listen=0.0.0.0 \
        --video bochs \
        --boot uefi \
        --network bridge=virbr0 \
        --noreboot \
        --os-variant=win10 \
        --cdrom /home/kvm_autotest_root/iso/ISO/Win11/windows_11_x64_official_dvd.iso \
        --disk path=/home/kvm_autotest_root/iso/windows/virtio-win-latest-signed-el9.iso,device='cdrom',readonly='on',bus='sata' \
        --console file,path=/tmp/win11_T4.log \
        --disk path=/home/kvm_autotest_root/images/win11-64/autounattend.iso,device='cdrom',readonly='on',bus='sata' \

# sh win11_installation_with_virt-install.sh
Starting install...
Creating domain...                                         |         00:00:00     
Domain is still running. Installation may be in progress.
Waiting for the installation to complete.
Domain has shutdown. Continuing.
Domain creation completed.
# virsh list  --all
 Id   Name       State
---------------------------
 -    win11_T4   shut off

# virsh start win11_T4
Domain 'win11_T4' started

# virsh list --all
 Id   Name       State
--------------------------
 2    win11_T4   running

# virsh destroy win11_T4 
Domain 'win11_T4' destroyed

3. tested edk2 test loop, the results were passed.
 Job link: http://fileshare.hosts.qa.psi.pek2.redhat.com/pub/logs/edk2_test_with_edk2-20230301gitf80f052277c8-4.el9/results.html

4. tested qemu_gating_test, the results were passed.
 Job link: http://fileshare.hosts.qa.psi.pek2.redhat.com/pub/logs/qemu_gating_test_rhel9_with_edk2-20230301gitf80f052277c8-4.el9/results.html

Comment 69 errata-xmlrpc 2023-11-07 08:24:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: edk2 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6330