Bug 1448312
Summary: | kernel panics in mce_register_decode_chain when booted on qemu
---|---
Product: | Red Hat Enterprise Linux 7
Reporter: | YongkuiGuo <yoguo>
Component: | kernel
Assignee: | Prarit Bhargava <prarit>
kernel sub component: | Platform Enablement
QA Contact: | Virtualization Bugs <virt-bugs>
Status: | CLOSED ERRATA
Severity: | high
Priority: | unspecified
CC: | bhu, chayang, jshortt, juzhang, kbenoit, knoel, lcheng, linl, michen, pbonzini, pmarciniak, prarit, ptoscano, qzhang, rjones, rmcswain, virt-bugs, xchen, xfu, yoguo, yuri
Version: | 7.4
Keywords: | Regression
Target Milestone: | rc
Target Release: | ---
Hardware: | x86_64
OS: | Linux
Fixed In Version: | kernel-3.10.0-680.el7
Doc Type: | If docs needed, set a value
Story Points: | ---
Last Closed: | 2017-08-02 07:28:42 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
Category: | ---
oVirt Team: | ---
Cloudforms Team: | ---
Bug Blocks: | 910269, 1353018, 1449577, 1456511
Attachments: |
|
Created attachment 1276500: the output with "lscpu"
Yongkui - which exact version of qemu is this?

(In reply to Richard W.M. Jones from comment #4)
> Yongkui - which exact version of qemu is this?

qemu-kvm-1.5.3-136.el7.x86_64
qemu-img-1.5.3-136.el7.x86_64

Dump of the function:

00000001  53              push rbx
00000002  4889FB          mov rbx,rdi
00000005  E8D9085F00      call qword 0x5f08e3
0000000A  488B5308        mov rdx,[rbx+0x8]
0000000E  488D4B08        lea rcx,[rbx+0x8]
00000012  4885D2          test rdx,rdx
00000015  7427            jz 0x3e
00000017  458B442410      mov r8d,[r12+0x10]
0000001C  443B4210        cmp r8d,[rdx+0x10]
00000020  7E0F            jng 0x31
00000022  EB1A            jmp short 0x3e
00000024  0F1F8000000000  nop dword [rax+0x0]
0000002B  44394210        cmp [rdx+0x10],r8d <<<<
0000002F  7C0D            jl 0x3e
00000031  488D4A08        lea rcx,[rdx+0x8]
00000035  488B5208        mov rdx,[rdx+0x8]
00000039  4885D2          test rdx,rdx
0000003C  75ED            jnz 0x2b

Source code:

        while ((*nl) != NULL) {
                if (n->priority > (*nl)->priority)
                        break;
                nl = &((*nl)->next);
        }

rdx is 0x10, and it should be "nl" if my reading is correct. It seems like the notifier chain got corrupted?!?

(In reply to Paolo Bonzini from comment #7)
> Dump of the function:
>
> 00000001  53              push rbx
> 00000002  4889FB          mov rbx,rdi
> 00000005  E8D9085F00      call qword 0x5f08e3
> 0000000A  488B5308        mov rdx,[rbx+0x8]
> 0000000E  488D4B08        lea rcx,[rbx+0x8]
> 00000012  4885D2          test rdx,rdx
> 00000015  7427            jz 0x3e
> 00000017  458B442410      mov r8d,[r12+0x10]
> 0000001C  443B4210        cmp r8d,[rdx+0x10]
> 00000020  7E0F            jng 0x31
> 00000022  EB1A            jmp short 0x3e
> 00000024  0F1F8000000000  nop dword [rax+0x0]
> 0000002B  44394210        cmp [rdx+0x10],r8d <<<<
> 0000002F  7C0D            jl 0x3e
> 00000031  488D4A08        lea rcx,[rdx+0x8]
> 00000035  488B5208        mov rdx,[rdx+0x8]
> 00000039  4885D2          test rdx,rdx
> 0000003C  75ED            jnz 0x2b
>
> Source code:
>
>         while ((*nl) != NULL) {
>                 if (n->priority > (*nl)->priority)
>                         break;
>                 nl = &((*nl)->next);
>         }
>
> rdx is 0x10, and it should be "nl" if my reading is correct. It seems like
> the notifier chain got corrupted?!?

See comment #6.

P.
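The marked instruction in the dump reads [rdx+0x10], so the address that faults depends on where the priority field sits inside a notifier block. The following is a minimal userspace sketch of that arithmetic, assuming a simplified struct notifier_block with the field order callback, next, priority (the kernel's real definition is not reproduced here) and an LP64 ABI. Going by the disassembly, rdx appears to hold *nl (the current block) rather than nl itself, but that reading is an interpretation. The helper names priority_address and show_fault_address are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for the kernel's struct notifier_block.
 * The field order is an assumption made for this illustration. */
struct notifier_block {
    int (*notifier_call)(struct notifier_block *nb,
                         unsigned long action, void *data);
    struct notifier_block *next;
    int priority;
};

/* Address touched when evaluating (*nl)->priority for a given,
 * possibly bogus, value of *nl (rdx in the dump). */
static uintptr_t priority_address(uintptr_t star_nl)
{
    return star_nl + offsetof(struct notifier_block, priority);
}

/* Print the field offsets and the access address for rdx == 0x10. */
static void show_fault_address(void)
{
    printf("next at +0x%zx, priority at +0x%zx\n",
           offsetof(struct notifier_block, next),
           offsetof(struct notifier_block, priority));
    printf("rdx=0x10 -> access at 0x%lx\n",
           (unsigned long)priority_address(0x10));
}
```

On an x86_64 build this layout should place next at +0x8 and priority at +0x10, matching the [rdx+0x8] and [rdx+0x10] operands in the dump. With rdx equal to 0x10 the access lands at 0x20, which is the address given in the "unable to handle kernel NULL pointer dereference" message of the original report.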
This also panics with qemu-kvm-rhev:

kernel-3.10.0-668.el7.x86_64
qemu-kvm-rhev-2.9.0-2.el7.x86_64

[ 0.902782] MCE: In-kernel MCE decoding enabled.
[ 0.903489] general protection fault: 0000 [#1] SMP
[ 0.904257] Modules linked in: edac_mce_amd(+) edac_core ghash_clmulni_intel snd_pcm snd_timer aesni_intel sg snd lrw gf128mul glue_helper ablk_helper cryptd joydev soundcore serio_raw pcspkr ata_generic pata_acpi libcrc32c crc8 crc_itu_t crc_ccitt ext4 mbcache jbd2 virtio_pci virtio_input virtio_balloon virtio_scsi sd_mod crc_t10dif nd_pmem nd_btt virtio_net virtio_console virtio_rng virtio_blk virtio_ring virtio ata_piix libata libnvdimm crct10dif_generic crc32_generic crct10dif_pclmul crct10dif_common crc32c_intel crc32_pclmul
[ 0.911808] CPU: 0 PID: 106 Comm: systemd-udevd Not tainted 3.10.0-668.el7.x86_64 #1
[ 0.912915] Hardware name: Red Hat KVM, BIOS 1.10.2-2.el7 04/01/2014
[ 0.913806] task: ffff8a171d961f60 ti: ffff8a171db2c000 task.ti: ffff8a171db2c000
[ 0.914881] RIP: 0010:[<ffffffffbeab68c8>] [<ffffffffbeab68c8>] atomic_notifier_chain_register+0x38/0x70
[ 0.916255] RSP: 0018:ffff8a171db2fd30 EFLAGS: 00010002
[ 0.916978] RAX: 0000000000000293 RBX: ffffffffbf70d810 RCX: ffffffffc00d75f8
[ 0.917985] RDX: 656e5f676e697276 RSI: ffffffffc02ee000 RDI: ffffffffbf70d810
[ 0.919017] RBP: ffff8a171db2fd40 R08: 0000000000000000 R09: 0000000000000000
[ 0.920009] R10: 000000000000014d R11: ffff8a171db2fa5e R12: ffffffffc02ee000
[ 0.921036] R13: ffffffffc0084000 R14: 0000000000000000 R15: ffffffffc02ee020
[ 0.922041] FS: 00007fde73ffc8c0(0000) GS:ffff8a171ee00000(0000) knlGS:0000000000000000
[ 0.923195] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 0.923998] CR2: 00007fde73ffb000 CR3: 000000001db27000 CR4: 00000000000407f0
[ 0.925005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.926013] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 0.927035] Stack:
[ 0.927330] ffffffffbf3f1020 ffff8a1700017d80 ffff8a171db2fd50 ffffffffbea426b1
[ 0.928455] ffff8a171db2fd60 ffffffffc0084166 ffff8a171db2fd90 ffffffffbea020e8
[ 0.929556] ffffffffc02ee038 ffff8a171db2fef0 ffffffffc02ee070 0000000000000001
[ 0.930664] Call Trace:
[ 0.931013] [<ffffffffbea426b1>] mce_register_decode_chain+0x31/0x40
[ 0.931915] [<ffffffffc0084166>] mce_amd_init+0x166/0x1000 [edac_mce_amd]
[ 0.932889] [<ffffffffbea020e8>] do_one_initcall+0xb8/0x230
[ 0.933674] [<ffffffffbeb004c4>] load_module+0x1f64/0x29e0
[ 0.934455] [<ffffffffbed4a370>] ? ddebug_proc_write+0xf0/0xf0
[ 0.935305] [<ffffffffbeb01005>] SyS_init_module+0xc5/0x110
[ 0.936095] [<ffffffffbf0b1209>] system_call_fastpath+0x16/0x1b
[ 0.936938] Code: f4 53 48 89 fb e8 39 15 5f 00 48 8b 53 08 48 8d 4b 08 48 85 d2 74 27 45 8b 44 24 10 44 3b 42 10 7e 0f eb 1a 0f 1f 80 00 00 00 00 <44> 39 42 10 7c 0d 48 8d 4a 08 48 8b 52 08 48 85 d2 75 ed 49 89
[ 0.940862] RIP [<ffffffffbeab68c8>] atomic_notifier_chain_register+0x38/0x70
[ 0.941893] RSP <ffff8a171db2fd30>
[ 0.942427] ---[ end trace 55076db2a896c8df ]---
[ 0.943092] Kernel panic - not syncing: Fatal exception
[ 0.944215] Kernel Offset: 0x3da00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

I should add to comment 11 that the hardware in this case is AMD FX(tm)-8320 Eight-Core Processor. It seems like this may be AMD specific?

I'm going to build a kernel with 9026cc82b632 reverted to see if that fixes the problem.

(In reply to Richard W.M. Jones from comment #12)
> I should add to comment 11 that the hardware in this case is
> AMD FX(tm)-8320 Eight-Core Processor. It seems like this may
> be AMD specific?

AFAICT it is AMD specific.

P.

>
> I'm going to build a kernel with 9026cc82b632 reverted to see
> if that fixes the problem.

FWIW adding (not reverting) commit 9026cc82b632 on top of the -668 kernel does NOT fix the problem.

(In reply to Richard W.M. Jones from comment #15)
> FWIW adding (not reverting) commit 9026cc82b632 on top of the -668
> kernel does NOT fix the problem.

I'm digging through the nvdimm code ... looks like it might actually be some sort of memory corruption. -640 is broken. -639 works.

FYI.

P.

(In reply to Prarit Bhargava from comment #16)
> (In reply to Richard W.M. Jones from comment #15)
> > FWIW adding (not reverting) commit 9026cc82b632 on top of the -668
> > kernel does NOT fix the problem.
>
> I'm digging through the nvdimm code ... looks like it might actually be some
> sort of memory corruption. -640 is broken. -639 works.
>
> FYI.
>
> P.

Yup ... it's the acpi nfit code. Continuing debug ...

P.

The call to nfit_mce_register() looks strange. The driver isn't succeeding (i.e. it is returning -ENODEV) but we leave the mce callback registered. I think that's the problem.

P.

This is the problem:

The nfit module loads and executes nfit_init(). This calls nfit_mce_register(), which registers an mce decoder. The module fails to load because there are no nfit devices, but the decoder remains registered. The module unloads and frees its memory.

The edac_mce_amd driver loads and attempts to attach another mce decoder to the list. The list is corrupt since the nfit_mce decoder's memory has been freed.

I'll post a patch upstream.

P.

(In reply to Prarit Bhargava from comment #20)
> v1: http://marc.info/?l=linux-acpi&m=149506537617370&w=2
>
> P.

Thanks - I'll also try it out locally and comment upstream if I get it to work.

(In reply to Richard W.M. Jones from comment #21)
> (In reply to Prarit Bhargava from comment #20)
> > v1: http://marc.info/?l=linux-acpi&m=149506537617370&w=2
> >
> > P.
>
> Thanks - I'll also try it out locally and comment upstream
> if I get it to work.

I'm not subscribed to the linux-acpi mailing list. However I tested the patch on top of the -668 kernel, and it fixes the problem for me.
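The failure mode described in the thread (a decoder registered during init, init failing with -ENODEV, the module memory being freed while the block is still on the chain) can be modelled in userspace. The sketch below is an illustration only, not the patch posted upstream: the chain is a plain singly linked list using the same priority-ordered insert loop as the quoted source, and module_init_buggy, module_init_fixed, chain_contains and the have_device flag are hypothetical names introduced here. To keep the example well-defined, nothing is actually freed; the test for a stale entry is whether the block is still reachable from the head after init has failed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct notifier_block {
    struct notifier_block *next;
    int priority;
};

struct notifier_head {
    struct notifier_block *head;
};

/* Priority-ordered insert, same loop shape as the quoted kernel source. */
static void chain_register(struct notifier_head *nh, struct notifier_block *n)
{
    struct notifier_block **nl = &nh->head;

    while (*nl != NULL) {
        if (n->priority > (*nl)->priority)
            break;
        nl = &(*nl)->next;
    }
    n->next = *nl;
    *nl = n;
}

/* Unlink n if it is on the chain; 0 on success, -ENOENT if not found. */
static int chain_unregister(struct notifier_head *nh, struct notifier_block *n)
{
    struct notifier_block **nl = &nh->head;

    while (*nl != NULL) {
        if (*nl == n) {
            *nl = n->next;
            return 0;
        }
        nl = &(*nl)->next;
    }
    return -ENOENT;
}

/* True if n is still reachable from the chain head. */
static bool chain_contains(const struct notifier_head *nh,
                           const struct notifier_block *n)
{
    for (const struct notifier_block *p = nh->head; p != NULL; p = p->next)
        if (p == n)
            return true;
    return false;
}

/* Hypothetical init that registers first and then finds no device. */
static int module_init_buggy(struct notifier_head *nh,
                             struct notifier_block *nb, bool have_device)
{
    chain_register(nh, nb);
    if (!have_device)
        return -ENODEV;            /* bug: nb is left on the chain */
    return 0;
}

/* Same init, but the registration is undone on the error path. */
static int module_init_fixed(struct notifier_head *nh,
                             struct notifier_block *nb, bool have_device)
{
    chain_register(nh, nb);
    if (!have_device) {
        chain_unregister(nh, nb);  /* undo before the caller frees nb */
        return -ENODEV;
    }
    return 0;
}
```

With the buggy variant, a failed init leaves the block reachable from the head, so once its memory is released the next chain_register call walks through freed memory, which is the situation edac_mce_amd hits in the traces. The fixed variant leaves the chain empty after the same failure.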
You can test with:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13221710

P.

(In reply to Prarit Bhargava from comment #25)
> You can test with:
>
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13221710

Yes this package also works.

The same problem also exists on the machine with cpu model of Intel(R) Xeon(R) CPU E5-2609 v3. So it's not just for AMD.

(In reply to YongkuiGuo from comment #27)
> The same problem also exists on the machine with cpu model of Intel(R)
> Xeon(R) CPU E5-2609 v3. So it's not just for AMD.

On virt guest or on bare-metal?

P.

(In reply to Prarit Bhargava from comment #28)
> (In reply to YongkuiGuo from comment #27)
> > The same problem also exists on the machine with cpu model of Intel(R)
> > Xeon(R) CPU E5-2609 v3. So it's not just for AMD.
>
> On virt guest or on bare-metal?
>
> P.

It's a bare-metal, not a virtual guest.

So the patch has a few acks upstream. I'm going to post this internally for review.

P.

Created attachment 1283822: RHEL PATCH 1/1
*** Bug 1458109 has been marked as a duplicate of this bug. ***

I have tested this bug on an AMD machine. The result of the libguestfs-test-tool command is correct. So verified.

*** Bug 1460005 has been marked as a duplicate of this bug. ***

3.10.0-679.el7.x86_64 is still broken, and this stops me from testing libguestfs (for various reasons I have a lot of AMD hardware around). Can we get the patch added to the kernel as soon as possible?

Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Patch(es) available on kernel-3.10.0-680.el7

Tested kernel-3.10.0-681.el7 on AMD hardware with libguestfs-1.36.3-5.el7.x86_64 & qemu-kvm-1.5.3-137.el7.x86_64 and it works for me.

Verified with packages:
kernel-3.10.0-680.el7.x86_64
libguestfs-1.36.3-5.el7.x86_64

1. Update all kernel packages on the Intel machine (comment 29).
2. Run the libguestfs-test-tool command
---------------------------------------------------------
...
supermin: kernel: picked kernel vmlinuz-3.10.0-680.el7.x86_64.debug
supermin: kernel: picked modules path /lib/modules/3.10.0-680.el7.x86_64.debug
supermin: kernel: kernel_version 3.10.0-680.el7.x86_64.debug
supermin: kernel: modules /lib/modules/3.10.0-680.el7.x86_64.debug
...
libguestfs: command: run: rm
libguestfs: command: run: \ -rf /tmp/libguestfskwKBKn
libguestfs: command: run: rm
libguestfs: command: run: \ -rf /tmp/libguestfstV4nyK
===== TEST FINISHED OK =====

The command can be executed successfully. So verified this bug.

*** Bug 1470216 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842
Created attachment 1276499: the output with "libguestfs-test-tool"

Description of problem:
A kernel panic occurs when executing the libguestfs-test-tool command on both machines with the CPU model 'AMD Athlon(tm) 64 X2 Dual Core Processor'.

Version-Release number of selected component (if applicable):
libguestfs-1.36.3-3.el7.x86_64
vmlinuz-3.10.0-657.el7.x86_64
RHEL-7.4-20170426.4

How reproducible:
100%

Steps:
1. On the RHEL 7.4 machine (ip: 10.66.144.53)
#libguestfs-test-tool

Actual results:
[ 1.188163] input: PC Speaker as /devices/platform/pcspkr/input/input3
[ 1.253683] sd 2:0:0:0: Attached scsi generic sg0 type 0
[ 1.255970] EDAC MC: Ver: 3.0.0
[ 1.259995] sd 2:0:1:0: Attached scsi generic sg1 type 0
[ 1.283603] MCE: In-kernel MCE decoding enabled.
[ 1.289398] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 1.290008] IP: [<ffffffff856b68c8>] atomic_notifier_chain_register+0x38/0x70
[ 1.290008] PGD 0
[ 1.290008] Oops: 0000 [#1] SMP
[ 1.290008] Modules linked in: edac_mce_amd(+) edac_core sg pcspkr ata_generic serio_raw pata_acpi libcrc32c crc8 crc_itu_t crc_ccitt ext4 mbcache jbd2 virtio_pci virtio_input virtio_balloon virtio_scsi sd_mod crc_t10dif nd_pmem nd_btt virtio_net virtio_console virtio_rng virtio_blk virtio_ring virtio ata_piix libata libnvdimm crct10dif_generic crc32_generic crct10dif_common
[ 1.290008] CPU: 0 PID: 102 Comm: systemd-udevd Not tainted 3.10.0-657.el7.x86_64 #1
[ 1.290008] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1.290008] task: ffff8ed71db2bec0 ti: ffff8ed71cd78000 task.ti: ffff8ed71cd78000
[ 1.290008] RIP: 0010:[<ffffffff856b68c8>] [<ffffffff856b68c8>] atomic_notifier_chain_register+0x38/0x70
[ 1.290008] RSP: 0018:ffff8ed71cd7bd30 EFLAGS: 00010002
[ 1.290008] RAX: 0000000000000293 RBX: ffffffff8630d810 RCX: ffffffffc02ff5f8
[ 1.290008] RDX: 0000000000000010 RSI: ffffffffc0487000 RDI: ffffffff8630d810
[ 1.290008] RBP: ffff8ed71cd7bd40 R08: 0000000000000000 R09: 0000000000000000
[ 1.290008] R10: 0000000000000145 R11: ffff8ed71cd7ba5e R12: ffffffffc0487000
[ 1.290008] R13: ffffffffc034f000 R14: 0000000000000000 R15: ffffffffc0487020
[ 1.290008] FS: 00007f617a8948c0(0000) GS:ffff8ed71ee00000(0000) knlGS:0000000000000000
[ 1.290008] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.290008] CR2: 00007f617a8a3000 CR3: 000000001cd76000 CR4: 00000000000007f0
[ 1.290008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1.290008] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1.290008] Stack:
[ 1.290008] ffffffff85ff1020 ffff8ed71d8775c0 ffff8ed71cd7bd50 ffffffff856425c1
[ 1.290008] ffff8ed71cd7bd60 ffffffffc034f166 ffff8ed71cd7bd90 ffffffff856020e8
[ 1.290008] ffffffffc0487038 ffff8ed71cd7bef0 ffffffffc0487070 0000000000000001
[ 1.290008] Call Trace:
[ 1.290008] [<ffffffff856425c1>] mce_register_decode_chain+0x31/0x40
[ 1.290008] [<ffffffffc034f166>] mce_amd_init+0x166/0x1000 [edac_mce_amd]
[ 1.290008] [<ffffffff856020e8>] do_one_initcall+0xb8/0x230
[ 1.290008] [<ffffffff85700624>] load_module+0x1f64/0x29e0
[ 1.290008] [<ffffffff8594a6e0>] ? ddebug_proc_write+0xf0/0xf0
[ 1.290008] [<ffffffff85701165>] SyS_init_module+0xc5/0x110
[ 1.290008] [<ffffffff85cb05c9>] system_call_fastpath+0x16/0x1b
[ 1.290008] Code: f4 53 48 89 fb e8 d9 08 5f 00 48 8b 53 08 48 8d 4b 08 48 85 d2 74 27 45 8b 44 24 10 44 3b 42 10 7e 0f eb 1a 0f 1f 80 00 00 00 00 <44> 39 42 10 7c 0d 48 8d 4a 08 48 8b 52 08 48 85 d2 75 ed 49 89
[ 1.290008] RIP [<ffffffff856b68c8>] atomic_notifier_chain_register+0x38/0x70
[ 1.290008] RSP <ffff8ed71cd7bd30>
[ 1.290008] CR2: 0000000000000020
[ 1.343238] ---[ end trace a4d0e7bf515f02f6 ]---
[ 1.344173] Kernel panic - not syncing: Fatal exception
[ 1.345141] Kernel Offset: 0x4600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 1.345141] Rebooting in 1 seconds..
libguestfs: error: appliance closed the connection unexpectedly, see earlier error messages

Expected results:
...
...
===== TEST FINISHED OK =====

Additional info:
I also tried an older kernel package, vmlinuz-3.10.0-514.el7.x86_64, and executed the following steps:
#export SUPERMIN_KERNEL=/boot/vmlinuz-3.10.0-514.el7.x86_64
#libguestfs-test-tool
But the error is the same as before.

Note: the password for the test machine (10.66.144.53) is redhat.
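The two oopses in this report (the -657 trace above and the -668 trace from the qemu-kvm-rhev comment) show different RIP values because each boot uses a different "Kernel Offset", and they show different garbage in RDX. The following sketch does the arithmetic for comparing them, using only values printed in the two traces. The helper names unrelocated, reg_as_ascii and compare_traces are hypothetical, and the byte-to-text rendering assumes the register is read as little-endian memory contents.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Subtract the "Kernel Offset" printed at the end of an oops to get the
 * address as it would appear without the boot-time relocation. */
static uint64_t unrelocated(uint64_t runtime_addr, uint64_t kernel_offset)
{
    return runtime_addr - kernel_offset;
}

/* Render a 64-bit register value as 8 bytes in little-endian order,
 * replacing non-printable bytes with '.'. out must hold 9 chars. */
static void reg_as_ascii(uint64_t reg, char out[9])
{
    for (int i = 0; i < 8; i++) {
        unsigned char c = (unsigned char)(reg >> (8 * i));
        out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
    }
    out[8] = '\0';
}

/* Print the normalized RIP of both traces and the -668 RDX as text. */
static void compare_traces(void)
{
    char text[9];

    printf("-657 RIP: 0x%llx\n", (unsigned long long)
           unrelocated(0xffffffff856b68c8ULL, 0x4600000ULL));
    printf("-668 RIP: 0x%llx\n", (unsigned long long)
           unrelocated(0xffffffffbeab68c8ULL, 0x3da00000ULL));
    reg_as_ascii(0x656e5f676e697276ULL, text);
    printf("-668 RDX as text: \"%s\"\n", text);
}
```

Both RIP values normalize to 0xffffffff810b68c8, which is consistent with the two panics hitting the same instruction at atomic_notifier_chain_register+0x38. The -668 RDX value decodes to the bytes "vring_ne", which looks like string data rather than a pointer; one possible reading is that the freed notifier block had been reused for unrelated data by the time the chain was walked, but that interpretation is a guess and not something stated in the thread.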