Bug 697062

Summary: [PV Xen guest] hit kernel BUG on restore with multiple VCPUs in guest.
Product: Fedora
Component: kernel
Version: 15
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: unspecified
Reporter: Igor Mammedov <imammedo>
Assignee: Justin M. Forbes <jforbes>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, xen-maint
Doc Type: Bug Fix
Last Closed: 2012-07-11 17:53:01 UTC

Description Igor Mammedov 2011-04-15 17:43:59 UTC
Description of problem:
The kernel panics on a BUG_ON:

[   84.945012] microcode: CPU0: update failed (for patch_level=0x1000083)
[   84.945012] ------------[ cut here ]------------
[   84.945012] WARNING: at arch/x86/kernel/microcode_core.c:454 mc_sysdev_resume+0x35/0x61 [microcode]()
[   84.945012] Modules linked in: sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables microcode joydev xenfs uinput ipv6 xen_netfront xen_blkfront [last unloaded: scsi_wait_scan]
[   84.945012] Pid: 6, comm: migration/0 Not tainted 2.6.38.2-9.fc15.x86_64 #1
[   84.945012] Call Trace:
[   84.945012]  [<ffffffff81058ce4>] ? warn_slowpath_common+0x85/0x9d
[   84.945012]  [<ffffffff8109cb01>] ? cpu_stopper_thread+0x129/0x1aa
[   84.945012]  [<ffffffff81058d16>] ? warn_slowpath_null+0x1a/0x1c
[   84.945012]  [<ffffffffa005c08b>] ? mc_sysdev_resume+0x35/0x61 [microcode]
[   84.945012]  [<ffffffff812f9942>] ? __sysdev_resume+0x79/0xc9
[   84.945012]  [<ffffffff812f9a4a>] ? sysdev_resume+0xb8/0xfd
[   84.945012]  [<ffffffff812c00df>] ? xen_suspend+0xc9/0xd0
[   84.945012]  [<ffffffff8109cc04>] ? stop_machine_cpu_stop+0x82/0xbb
[   84.945012]  [<ffffffff8109cb82>] ? stop_machine_cpu_stop+0x0/0xbb
[   84.945012]  [<ffffffff8109cadc>] ? cpu_stopper_thread+0x104/0x1aa
[   84.945012]  [<ffffffff8148a419>] ? schedule+0x67e/0x6ca
[   84.945012]  [<ffffffff81006faf>] ? xen_restore_fl_direct_end+0x0/0x1
[   84.945012]  [<ffffffff8109c9d8>] ? cpu_stopper_thread+0x0/0x1aa
[   84.945012]  [<ffffffff81073201>] ? kthread+0x82/0x8a
[   84.945012]  [<ffffffff8100ba64>] ? kernel_thread_helper+0x4/0x10
[   84.945012]  [<ffffffff8100ae63>] ? int_ret_from_sys_call+0x7/0x1b
[   84.945012]  [<ffffffff8148c1e1>] ? retint_restore_args+0x5/0x6
[   84.945012]  [<ffffffff8100ba60>] ? kernel_thread_helper+0x0/0x10
[   84.945012] ---[ end trace 264a91fa5fb2b103 ]---
[   84.945012] ------------[ cut here ]------------
[   84.945012] kernel BUG at arch/x86/kernel/microcode_amd.c:138!
[   84.945012] invalid opcode: 0000 [#1] SMP 
[   84.945012] last sysfs file: /sys/kernel/mm/ksm/run
[   84.945012] CPU 0 
[   84.945012] Modules linked in: sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables microcode joydev xenfs uinput ipv6 xen_netfront xen_blkfront [last unloaded: scsi_wait_scan]
[   84.945012] 
[   84.945012] Pid: 6, comm: migration/0 Tainted: G        W   2.6.38.2-9.fc15.x86_64 #1  
[   84.945012] RIP: e030:[<ffffffffa005cc13>]  [<ffffffffa005cc13>] apply_microcode_amd+0x4e/0xc5 [microcode]
[   84.945012] RSP: e02b:ffff88003a9e9ce0  EFLAGS: 00010097
[   84.945012] RAX: 0000000000000079 RBX: 0000000000000000 RCX: ffff88003a9e9bb0
[   84.945012] RDX: 0000000000000000 RSI: 00000000000000fb RDI: 0000000000000004
[   84.945012] RBP: ffff88003a9e9d10 R08: 000000000000000a R09: 000000000000000a
[   84.945012] R10: 0000000000000005 R11: 0000000000000000 R12: 0000000000000001
[   84.945012] R13: ffffc90000336000 R14: 0000000000000000 R15: ffff88003aa0dd90
[   84.945012] FS:  00007f5cece9b700(0000) GS:ffff88003ff7d000(0000) knlGS:0000000000000000
[   84.945012] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[   84.945012] CR2: 0000000000000000 CR3: 0000000002d7b000 CR4: 0000000000000660
[   84.945012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   84.945012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000000
[   84.945012] Process migration/0 (pid: 6, threadinfo ffff88003a9e8000, task ffff88003a9e1720)
[   84.945012] Stack:
[   84.945012]  ffffffff81a66f38 ffff88003ffa5b28 0000000000000001 ffffffff81a66f38
[   84.945012]  ffff88003ffa5b28 ffffffff8109cb01 ffff88003a9e9d30 ffffffffa005c0b1
[   84.945012]  ffff88003d010000 ffffffffa005dfa0 ffff88003a9e9d60 ffffffff812f9942
[   84.945012] Call Trace:
[   84.945012]  [<ffffffff8109cb01>] ? cpu_stopper_thread+0x129/0x1aa
[   84.945012]  [<ffffffffa005c0b1>] mc_sysdev_resume+0x5b/0x61 [microcode]
[   84.945012]  [<ffffffff812f9942>] __sysdev_resume+0x79/0xc9
[   84.945012]  [<ffffffff812f9a4a>] sysdev_resume+0xb8/0xfd
[   84.945012]  [<ffffffff812c00df>] xen_suspend+0xc9/0xd0
[   84.945012]  [<ffffffff8109cc04>] stop_machine_cpu_stop+0x82/0xbb
[   84.945012]  [<ffffffff8109cb82>] ? stop_machine_cpu_stop+0x0/0xbb
[   84.945012]  [<ffffffff8109cadc>] cpu_stopper_thread+0x104/0x1aa
[   84.945012]  [<ffffffff8148a419>] ? schedule+0x67e/0x6ca
[   84.945012]  [<ffffffff81006faf>] ? xen_restore_fl_direct_end+0x0/0x1
[   84.945012]  [<ffffffff8109c9d8>] ? cpu_stopper_thread+0x0/0x1aa
[   84.945012]  [<ffffffff81073201>] kthread+0x82/0x8a
[   84.945012]  [<ffffffff8100ba64>] kernel_thread_helper+0x4/0x10
[   84.945012]  [<ffffffff8100ae63>] ? int_ret_from_sys_call+0x7/0x1b
[   84.945012]  [<ffffffff8148c1e1>] ? retint_restore_args+0x5/0x6
[   84.945012]  [<ffffffff8100ba60>] ? kernel_thread_helper+0x0/0x10
[   84.945012] Code: fa 4d 6b f6 18 31 c0 89 de 48 c7 c7 ad da 05 a0 4d 8b ae 20 e5 05 a0 4d 8d 86 10 e5 05 a0 4c 89 e9 e8 54 cf 42 e1 44 39 e3 74 04 <0f> 0b eb fe 31 c0 4d 85 ed 74 61 4c 89 ea 44 89 ee bf 20 00 01 
[   84.945012] RIP  [<ffffffffa005cc13>] apply_microcode_amd+0x4e/0xc5 [microcode]
[   84.945012]  RSP <ffff88003a9e9ce0>
[   84.945012] ---[ end trace 264a91fa5fb2b104 ]---
[   84.945012] ------------[ cut here ]------------



Version-Release number of selected component (if applicable):
kernel-2.6.38.2-9.fc15.x86_64

How reproducible:
100% on a 2-socket AMD server with 4 cores per socket.
On an Intel-based server we only get:
   WARNING: at arch/x86/kernel/microcode_core.c:454
without hitting this BUG_ON in microcode_amd.c:138.

Steps to Reproduce:
1. Create and start a guest with 2 VCPUs
2. virsh save guestname dumpfile
3. virsh restore dumpfile
  
Actual results:
The guest kernel panics on restore.

Expected results:
The guest should continue to work as if it had never been stopped.

Additional info:

On the AMD server, before save:
xm vcpu-list
Name           ID VCPUs   CPU State   Time(s) CPU Affinity
fc15x86_64     20     0     2   -b-       9.7 0-3
fc15x86_64     20     1     3   -b-       7.6 0-3

After restore:
xm vcpu-list
Name           ID VCPUs   CPU State   Time(s) CPU Affinity
fc15x86_64     21     0     0   -b-       0.0 0-3
fc15x86_64     21     1     0   r--      23.8 0-3

In the latter case, the call to raw_smp_processor_id() at microcode_amd.c:137
returns 0 for both VCPUs, which triggers the BUG_ON for VCPU 1.

Comment 1 Igor Mammedov 2011-04-18 10:33:14 UTC
>on intel based server we only get:
>   WARNING: at arch/x86/kernel/microcode_core.c:454
>without hitting this BUGON in microcode_amd.c:138
I meant microcode_intel.c:309 (i.e. a similar BUG_ON).

Comment 2 Igor Mammedov 2011-04-20 07:23:22 UTC
After some testing:
F13 has the same bug.

RHEL 6.0 isn't affected by the bug; that guest only gets the WARNING, and
its VCPUs are assigned to different CPUs.

Comment 3 Igor Mammedov 2011-04-22 09:29:41 UTC
This BUG_ON is hit only when the microcode module is loaded and a valid microcode blob is present on the system. It affects both AMD and Intel, since their
apply_microcode implementations have a similar BUG_ON (cpu != raw_smp_processor_id()).
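
To make the failing condition concrete, here is a minimal user-space sketch of the check quoted above. It is not the kernel code: would_hit_bug_on() and count_bug_hits() are hypothetical names, and the observed[] array simply stands in for whatever raw_smp_processor_id() returned while each VCPU's microcode was being re-applied.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the condition in the report:
 * BUG_ON(cpu != raw_smp_processor_id()).
 * 'observed_cpu' models the value raw_smp_processor_id() returned.
 */
bool would_hit_bug_on(int cpu, int observed_cpu)
{
    return cpu != observed_cpu;
}

/*
 * Count how many VCPUs would trip the check. observed[i] is the CPU id
 * that was seen while microcode was being applied for VCPU i
 * (an assumption of this model, not something read from the kernel).
 */
int count_bug_hits(const int *observed, int nr_vcpus)
{
    int hits = 0;

    assert(nr_vcpus >= 0);
    for (int cpu = 0; cpu < nr_vcpus; cpu++) {
        if (would_hit_bug_on(cpu, observed[cpu]))
            hits++;
    }
    return hits;
}
```

With the values from the description (both VCPUs see 0 after restore), calling count_bug_hits() on {0, 0} flags VCPU 1 only, while {0, 1} flags nothing.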

This can be fixed by a long-proposed patch that somehow hasn't made its
way upstream: http://marc.info/?l=linux-kernel&m=126105863415715&w=2

The patch can also be found in BZ671161.

Comment 4 Igor Mammedov 2011-04-22 11:42:48 UTC
Assigning to Justin for integration.

Comment 5 Josh Boyer 2012-06-04 18:22:43 UTC
This should be fixed as far as I know.  Justin?

Comment 6 Josh Boyer 2012-07-11 17:53:01 UTC
Fedora 15 has reached its end of life as of June 26, 2012.  As a result, we will not be fixing any remaining bugs found in Fedora 15.

In the event that you have upgraded to a newer release and the bug you reported is still present, please reopen the bug and set the version field to the newest release you have encountered the issue with.  Before doing so, please ensure you are testing the latest kernel update in that release and attach any new and relevant information you may have gathered.

Thank you for taking the time to file a report.  We hope newer versions of Fedora suit your needs.