1158204 – Booting Xen HVM 32 guest under AMD crashes with IP: [<c042e905>] load_microcode_amd+0x25/0x4a0

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	21
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-10-28 20:03 UTC by Konrad Rzeszutek Wilk
Modified:	2014-11-03 15:21 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2014-11-03 15:21:07 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
console output (23.24 KB, text/plain) 2014-10-28 20:03 UTC, Konrad Rzeszutek Wilk
serial console with the crash (18.74 KB, text/plain) 2014-10-30 17:35 UTC, Konrad Rzeszutek Wilk
The patch I used. (1.55 KB, patch) 2014-10-30 17:35 UTC, Konrad Rzeszutek Wilk
serial console with the crash (18.73 KB, text/plain) 2014-10-31 14:44 UTC, Konrad Rzeszutek Wilk
Debug patch (1.90 KB, patch) 2014-10-31 14:44 UTC, Konrad Rzeszutek Wilk
test patch (1.42 KB, patch) 2014-10-31 16:05 UTC, Borislav Petkov
console with your patch (41.85 KB, text/plain) 2014-10-31 20:14 UTC, Konrad Rzeszutek Wilk

Description Konrad Rzeszutek Wilk 2014-10-28 20:03:52 UTC

Created attachment 951519 [details]
console output

Description of problem:

Booting on an AMD machine, I see the freshly installed non-PAE kernel crash:
4
[    0.587285] Unpacking initramfs...
[    0.861285] BUG: unable to handle kernel paging request at 35d4e304
[    0.862015] IP: [<c042e905>] load_microcode_amd+0x25/0x4a0
[    0.862015] *pde = 00000000 
[    0.862015] Oops: 0000 [#1] SMP 
[    0.862015] Modules linked in:
[    0.862015] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.17.1-302.fc21.i686 #1
[    0.862015] Hardware name: Xen HVM domU, BIOS 4.4.1 10/01/2014
[    0.862015] task: f5098000 ti: f50d0000 task.ti: f50d0000
[    0.862015] EIP: 0060:[<c042e905>] EFLAGS: 00010246 CPU: 0
[    0.862015] EIP is at load_microcode_amd+0x25/0x4a0
[    0.862015] EAX: 00000000 EBX: f6e9ec4c ECX: 00001ec4 EDX: 00000000
[    0.862015] ESI: f5d4e000 EDI: 35d4e2fc EBP: f50d1ed0 ESP: f50d1e94
[    0.862015]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[    0.862015] CR0: 8005003b CR2: 35d4e304 CR3: 00e33000 CR4: 000406d0
[    0.862015] Stack:
[    0.862015]  00000000 00000000 f50d1ebc f50d1ec4 f5d4e000 c0d7735a f50d1ed0 15a3d17f
[    0.862015]  f50d1ec4 00600f20 00001ec4 bfb83203 f6e9ec4c f5d4e000 c0d7735a f50d1ed8
[    0.862015]  c0d80861 f50d1ee0 c0d80429 f50d1ef0 c0d889a9 f5d4e000 c0000000 f50d1f04
[    0.862015] Call Trace:
[    0.862015]  [<c0d7735a>] ? unpack_to_rootfs+0x27a/0x27a
[    0.862015]  [<c0d7735a>] ? unpack_to_rootfs+0x27a/0x27a
[    0.862015]  [<c0d80861>] save_microcode_in_initrd_amd+0x95/0xbf
[    0.862015]  [<c0d80429>] save_microcode_in_initrd+0x30/0x34
[    0.862015]  [<c0d889a9>] free_initrd_mem+0xe/0x2a
[    0.862015]  [<c0d77425>] populate_rootfs+0xcb/0xee
[    0.862015]  [<c0d7735a>] ? unpack_to_rootfs+0x27a/0x27a
[    0.862015]  [<c0400496>] do_one_initcall+0xc6/0x200
[    0.862015]  [<c0d7735a>] ? unpack_to_rootfs+0x27a/0x27a
[    0.862015]  [<c0d75503>] ? repair_env_string+0x12/0x54
[    0.862015]  [<c05e6400>] ? proc_mkdir+0x20/0x20
[    0.862015]  [<c0d75c8e>] kernel_init_freeable+0x15b/0x1e9
[    0.862015]  [<c0a3afb0>] kernel_init+0x10/0xe0
[    0.862015]  [<c0a44341>] ret_from_kernel_thread+0x21/0x30
[    0.862015]  [<c0a3afa0>] ? rest_init+0x70/0x70

Version-Release number of selected component (if applicable):


How reproducible:

100%

Steps to Reproduce:
1. Install Fedora 21 on a machine, reboot
2. Install Xen (yum install xen) on said machine, reboot in Xen
3. Install the F21 Live ISO as a 32-bit HVM guest using virt-install or virt-manager.
4. See it boot and crash

Actual results:

See attachment for full serial output

Expected results:

Boot into a nice graphical screen.

Additional info:

Comment 1 Konrad Rzeszutek Wilk 2014-10-28 20:24:24 UTC

Specs of the AMD machine:

processor       : 7
vendor_id       : AuthenticAMD
cpu family      : 21
model           : 2
model name      : AMD FX(tm)-8320 Eight-Core Processor           
stepping        : 0
microcode       : 0x6000822
cpu MHz         : 3511.804
cache size      : 2048 KB

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
        Manufacturer: ASUSTeK COMPUTER INC.
        Product Name: M5A97 LE R2.0
        Version: Rev 1.xx

Comment 2 Josh Boyer 2014-10-28 20:33:31 UTC

Pass 'dis_ucode_ldr' on the kernel command line and see if that makes the issue go away.  This isn't a proper solution but it might suffice as a workaround.  I'd suggest taking the problem report upstream as well.

Comment 3 Konrad Rzeszutek Wilk 2014-10-28 21:06:13 UTC

(In reply to Josh Boyer from comment #2)
> Pass 'dis_ucode_ldr' on the kernel command line and see if that makes the
> issue go away.  This isn't a proper solution but it might suffice as a
> workaround.  I'd suggest taking the problem report upstream as well.

That did it.
Trying different builds to see what CONFIG option exposes this as I don't seem to be triggering it with my normal builds.

Comment 4 Josh Boyer 2014-10-28 21:29:48 UTC

It could just be a case of the initramfs being too large for the memory allocated to your guest.  We enabled early microcode loading a while ago, which means the ucode gets prepended to the initramfs image.  Does increasing the memory to the guest also make it boot?

Comment 5 Borislav Petkov 2014-10-29 11:31:54 UTC

Here's what the asm looks like after Josh pointed me at the kernel in
question (btw, there's another bug - 1157157 - which has the same RIP).
Annotations mine reconstructed from System.map:

c042e8e0 <load_microcode_amd>:
c042e8e0:       55                      push   %ebp
c042e8e1:       89 e5                   mov    %esp,%ebp
c042e8e3:       57                      push   %edi
c042e8e4:       56                      push   %esi
c042e8e5:       53                      push   %ebx			# ... callee-saved
c042e8e6:       83 e4 f8                and    $0xfffffff8,%esp		# align stack ptr
c042e8e9:       83 ec 2c                sub    $0x2c,%esp		# grow stack
c042e8ec:       e8 4b 68 61 00          call   0xc0a4513c		# mcount
c042e8f1:       88 44 24 1f             mov    %al,0x1f(%esp)
c042e8f5:       a1 cc d8 e3 c0          mov    0xc0e3d8cc,%eax		# equiv_cpu_table
c042e8fa:       89 d7                   mov    %edx,%edi		# data
c042e8fc:       89 4c 24 28             mov    %ecx,0x28(%esp)
c042e900:       e8 4b 7d 13 00          call   0xc0566650		# vfree()
c042e905:       8b 77 08                mov    0x8(%edi),%esi		<--- faulting insn
c042e908:       c7 05 cc d8 e3 c0 00    movl   $0x0,0xc0e3d8cc
c042e90f:       00 00 00

%edi (copied from %edx) contains the second arg to load_microcode_amd(),
which is the const u8 *data pointer, pointing to the microcode
container coming from the initrd. And that %edi looks funny: 0x35d4e2fc,
which causes the bad pointer dereference at 0x8(%edi) = 0x35d4e304 (CR2).
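
To make the arithmetic behind the faulting instruction explicit, here is a minimal user-space sketch. It is not kernel code. It assumes the default i386 3G/1G split (direct map starting at 0xC0000000, which matches the c0... and f5... addresses in the oops); the names PAGE_OFFSET_SKETCH, faulting_address() and is_kernel_virtual() are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: default i386 3G/1G split; the kernel direct map starts here. */
#define PAGE_OFFSET_SKETCH 0xC0000000u

/* Address read by "mov 0x8(%edi),%esi": base register plus displacement 8. */
static inline uint32_t faulting_address(uint32_t edi)
{
    return edi + 8u;
}

/* Nonzero if addr lies at or above the assumed start of the direct map. */
static inline int is_kernel_virtual(uint32_t addr)
{
    assert(PAGE_OFFSET_SKETCH != 0u); /* sanity check on the assumed split */
    return addr >= PAGE_OFFSET_SKETCH;
}
```

With the register values from the oops, faulting_address(0x35d4e2fc) gives 0x35d4e304, which is the CR2 value reported. is_kernel_virtual() is false for %edi but true for the 0xf5d4e000 held in %esi, which suggests %edi is not a direct-map pointer.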

And since save_microcode_in_initrd_amd() checks the container for being
0, it is probably that relocated_ramdisk fun we do which gets the
container pointer wrong.

Konrad, can you dump those values participating in the computation? It
might tell us what is going wrong:

        if (relocated_ramdisk)
                container = (u8 *)(__va(relocated_ramdisk) +
                             (cont - boot_params.hdr.ramdisk_image));
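
For readers following along, the computation above can be modelled in user space roughly as below. This is only a sketch under assumptions: va_sketch() stands in for the kernel's __va() with the default i386 PAGE_OFFSET of 0xC0000000, and the function and parameter names are hypothetical stand-ins for the values in the snippet.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: default i386 3G/1G split. */
#define PAGE_OFFSET_SKETCH 0xC0000000u

/* Stand-in for __va(): physical lowmem address to direct-map virtual address. */
static inline uint32_t va_sketch(uint32_t phys)
{
    return phys + PAGE_OFFSET_SKETCH;
}

/*
 * Mirrors: __va(relocated_ramdisk) + (cont - boot_params.hdr.ramdisk_image)
 * cont and ramdisk_image are physical addresses; the difference is the
 * container's byte offset inside the ramdisk.
 */
static inline uint32_t relocated_container(uint32_t relocated_ramdisk,
                                           uint32_t cont,
                                           uint32_t ramdisk_image)
{
    assert(cont >= ramdisk_image); /* container must lie inside the ramdisk */
    return va_sketch(relocated_ramdisk) + (cont - ramdisk_image);
}
```

As a purely illustrative case (the values are invented, not taken from the logs), a ramdisk originally at 0x35d4e000 and relocated to 0x10000000, with the container at 0x35d4e2fc, would yield 0xd00002fc. Note that this path only runs when relocated_ramdisk is nonzero.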

Thanks.

Comment 6 Konrad Rzeszutek Wilk 2014-10-30 17:35:04 UTC

Created attachment 952289 [details]
serial console with the crash

Comment 7 Konrad Rzeszutek Wilk 2014-10-30 17:35:30 UTC

Created attachment 952290 [details]
The patch I used.

Comment 8 Konrad Rzeszutek Wilk 2014-10-30 17:37:04 UTC

Ugh. Time to add more printks..

Comment 9 Konrad Rzeszutek Wilk 2014-10-31 14:44:10 UTC

Created attachment 952526 [details]
serial console with the crash

The issue appears to be in:

 	ret = load_microcode_amd(eax, container, container_size);

Comment 10 Konrad Rzeszutek Wilk 2014-10-31 14:44:43 UTC

Created attachment 952527 [details]
Debug patch

Comment 11 Borislav Petkov 2014-10-31 16:04:17 UTC

(In reply to Konrad Rzeszutek Wilk from comment #9)
> The issue appears to be in:
> 
>  	ret = load_microcode_amd(eax, container, container_size);

Yeah, I thought this was clear from comment #5...

In any case, staring at this more, it looks like this happens because
we're using the *physical* address of the container *after* we have
enabled paging and thus the #PF. Because the ramdisk is exactly there:

[    0.000000] RAMDISK: [mem 0x35e04000-0x36ef9fff]

and we fault at 0x35e04304.

And since this guest doesn't relocate the ramdisk, we don't do the
computation which will give us the correct virtual address and we end up
with the PA.
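
A small user-space sketch of that reasoning, not the actual fix: it checks that the fault address falls inside the reported RAMDISK range and shows what the matching direct-map virtual address would be. It assumes the default i386 PAGE_OFFSET of 0xC0000000 and that the ramdisk sits in lowmem; in_range() and phys_to_virt_sketch() are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: default i386 3G/1G split. */
#define PAGE_OFFSET_SKETCH 0xC0000000u

/* Nonzero if addr lies within the inclusive range [start, end]. */
static inline int in_range(uint32_t addr, uint32_t start, uint32_t end)
{
    assert(start <= end);
    return addr >= start && addr <= end;
}

/* What __va() would return for a lowmem physical address under the assumed split. */
static inline uint32_t phys_to_virt_sketch(uint32_t phys)
{
    return phys + PAGE_OFFSET_SKETCH;
}
```

With the values from this log, in_range(0x35e04304, 0x35e04000, 0x36ef9fff) holds, so the fault address is a physical address inside the ramdisk, and phys_to_virt_sketch(0x35e04304) gives 0xf5e04304, the address the access should have used once paging is on.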

So, we should actually be using virtual addresses on 32-bit by the time
we're freeing the initrd. How about the attached debug patch?

Thanks.

Comment 12 Borislav Petkov 2014-10-31 16:05:13 UTC

Created attachment 952535 [details]
test patch

Comment 13 Konrad Rzeszutek Wilk 2014-10-31 20:14:40 UTC

Created attachment 952587 [details]
console with your patch

Comment 14 Konrad Rzeszutek Wilk 2014-10-31 20:16:20 UTC

(In reply to Konrad Rzeszutek Wilk from comment #13)
> Created attachment 952587 [details]
> console with your patch

Which of course boots!

Comment 15 Borislav Petkov 2014-10-31 20:50:54 UTC

Thanks for testing, Konrad, much appreciated. I'll clean it up and send it to the tip guys soon.

@Josh: you can close this one now.

Thanks.

Comment 16 Josh Boyer 2014-11-03 15:21:07 UTC

Will be in today's 3.18.0-0.rc3.git0.1 build.  Thanks Borislav!
