Bug 1626059 - RHEL6 guest panics on boot if hotpluggable memory (pc-dimm) is present at boot time
Summary: RHEL6 guest panics on boot if hotpluggable memory (pc-dimm) is present at boot time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Igor Mammedov
QA Contact: Yumei Huang
URL:
Whiteboard:
Depends On:
Blocks: 1635625
 
Reported: 2018-09-06 13:33 UTC by Igor Mammedov
Modified: 2018-11-08 09:28 UTC
CC List: 10 users

Fixed In Version: qemu-kvm-rhev-2.12.0-16.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1635625 (view as bug list)
Environment:
Last Closed: 2018-11-01 11:13:32 UTC
Target Upstream Version:


Attachments
full guest boot log (23.08 KB, text/plain)
2018-09-06 13:33 UTC, Igor Mammedov


Links
Red Hat Product Errata RHBA-2018:3443 (last updated 2018-11-01 11:14:44 UTC)

Description Igor Mammedov 2018-09-06 13:33:01 UTC
Created attachment 1481323 [details]
full guest boot log

Description of problem:

A RHEL6 guest crashes on boot with the following call trace:

 [<ffffffff81537a64>] ? panic+0xa7/0x16f
 [<ffffffff8153c844>] ? oops_end+0xe4/0x100
 [<ffffffff8104e8cb>] ? no_context+0xfb/0x260
 [<ffffffff8104eb55>] ? __bad_area_nosemaphore+0x125/0x1e0
 [<ffffffff8104ec23>] ? bad_area_nosemaphore+0x13/0x20
 [<ffffffff8104f31c>] ? __do_page_fault+0x30c/0x500
 [<ffffffff81290b90>] ? idr_get_empty_slot+0x110/0x2c0
 [<ffffffff81290df0>] ? ida_get_new_above+0xb0/0x210
 [<ffffffff812986dc>] ? put_dec+0x10c/0x110
 [<ffffffff8153e76e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8153bb15>] ? page_fault+0x25/0x30
 [<ffffffff8129f6e3>] ? __bitmap_weight+0x83/0xb0
 [<ffffffff81178daf>] ? init_cache_nodelists_node+0x7f/0x190
 [<ffffffff815376f9>] ? slab_memory_callback+0x43/0xf5
 [<ffffffff8153e825>] ? notifier_call_chain+0x55/0x80
 [<ffffffff810a7b2a>] ? __blocking_notifier_call_chain+0x5a/0x80
 [<ffffffff810a7b66>] ? blocking_notifier_call_chain+0x16/0x20
 [<ffffffff8138305b>] ? memory_notify+0x1b/0x20
 [<ffffffff8117b38b>] ? online_pages+0x8b/0x200
 [<ffffffff812920da>] ? kobject_get+0x1a/0x30
 [<ffffffff813827af>] ? memory_section_action+0xdf/0x100
 [<ffffffff81382875>] ? memory_block_change_state+0xa5/0x140
 [<ffffffff8138315b>] ? set_memory_state+0x7b/0xc0
 [<ffffffff8131e332>] ? acpi_memory_enable_device+0xd7/0x147
 [<ffffffff8131e4b9>] ? acpi_memory_device_add+0x117/0x120
 [<ffffffff812f4610>] ? acpi_device_probe+0x50/0x122
 [<ffffffff81373d3a>] ? driver_probe_device+0xaa/0x3a0
 [<ffffffff813740db>] ? __driver_attach+0xab/0xb0
 [<ffffffff81374030>] ? __driver_attach+0x0/0xb0
 [<ffffffff81372f24>] ? bus_for_each_dev+0x64/0x90
 [<ffffffff813739ce>] ? driver_attach+0x1e/0x20
 [<ffffffff81372738>] ? bus_add_driver+0x1e8/0x2b0
 [<ffffffff81c6f141>] ? acpi_memory_device_init+0x0/0x78
 [<ffffffff81374336>] ? driver_register+0x76/0x140
 [<ffffffff81c6f141>] ? acpi_memory_device_init+0x0/0x78
 [<ffffffff812f5c6d>] ? acpi_bus_register_driver+0x43/0x45
 [<ffffffff81c6f15d>] ? acpi_memory_device_init+0x1c/0x78
 [<ffffffff810020d0>] ? do_one_initcall+0xc0/0x280
 [<ffffffff81c37a77>] ? kernel_init+0x29b/0x2f7
 [<ffffffff8100969d>] ? __switch_to+0x7d/0x340
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff81c377dc>] ? kernel_init+0x0/0x2f7
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

when the guest is started like this:
 /usr/libexec/qemu-kvm -m 4G,slots=4,maxmem=8G -numa node -numa node -object memory-backend-ram,size=1G,id=m0 -device pc-dimm,memdev=m0,node=1 rhel6x64.img

It's a regression introduced by the fix for Bug 1609234
(RHEL6 memory hot-add used to work when the guest was started with >= 4 GB of initial memory).


Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.12.0-10.el7

How reproducible:
100%

Steps to Reproduce:
1. Start the guest with the above CLI

Actual results:
guest kernel panics

Expected results:
guest kernel boots normally

Additional info:
  - it happens only when the DIMM is plugged into node 1
  - upstream it was agreed [1] to revert the fix for bug 1609234 together with commit
     848a1cc1e (hw/acpi-build: build SRAT memory affinity structures for DIMM devices),
    which caused bug 1609234, so we end up with the old SRAT layout that worked
    fine in 2.10 (a guest-side check of that layout is sketched below)

1) http://patchwork.ozlabs.org/patch/960874/
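
For reference, which SRAT layout the guest actually received can be checked from inside a guest that boots far enough (i.e. with the fixed packages, or a RHEL7 guest). This is only a sketch using standard Linux interfaces; exact message wording and sysfs paths vary by kernel version:

  # SRAT memory affinity entries parsed by the guest kernel at boot
  dmesg | grep -i SRAT
  # memory blocks registered for (hot)pluggable memory and their online state
  grep -H . /sys/devices/system/memory/memory*/state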

Comment 6 Miroslav Rezanina 2018-09-17 08:57:55 UTC
Fix included in qemu-kvm-rhev-2.12.0-16.el7

Comment 8 Yumei Huang 2018-09-19 06:43:40 UTC
The issue is gone when booting a RHEL6.10 guest with the CLI from comment 0 on qemu-kvm-rhev-2.12.0-16.el7.

But QE can still hit a panic with the following steps:

1. Boot the guest paused (-S) with 3 NUMA nodes and 2 pc-dimms (assigned to node 1 and node 0); see the CLI [1].

2. Hotplug a pc-dimm to node 2 (monitor commands to cross-check the plugged DIMMs are sketched after these steps):
(qemu) object_add memory-backend-ram,id=mem2,host-nodes=0,policy=bind,size=1G
(qemu) device_add pc-dimm,id=dimm2,memdev=mem2,node=2

3. Resume the guest; the call trace [2] is hit and the guest panics.
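
Before resuming, the cold- and hot-plugged DIMMs can be cross-checked from the same monitor. This is a sketch using standard HMP info commands; the exact output format depends on the QEMU version:

(qemu) info memory-devices    # lists dimm-mem1 (node 1), dimm-mem2 (node 0) and dimm2 (node 2)
(qemu) info numa              # per-node CPU assignment and memory size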


[1] QEMU cli:

/usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine pc  \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2  \
    -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel610-64-virtio.qcow2 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=0x4 \
    -device virtio-net-pci,mac=9a:11:12:13:14:15,id=idHcQEHf,vectors=4,netdev=idayU70J,bus=pci.0,addr=0x5  \
    -netdev tap,id=idayU70J,vhost=on \
    -m 4096,slots=16,maxmem=32G \
    -object memory-backend-ram,policy=bind,host-nodes=0,size=1G,id=mem-mem1 \
    -device pc-dimm,node=1,id=dimm-mem1,memdev=mem-mem1 \
    -object memory-backend-ram,policy=bind,host-nodes=0,size=1G,id=mem-mem2 \
    -device pc-dimm,node=0,id=dimm-mem2,memdev=mem-mem2  \
    -smp 8,maxcpus=8,cores=4,threads=1,sockets=2  \
    -numa node,nodeid=0  \
    -numa node,nodeid=1  \
    -numa node,nodeid=2  \
    -cpu 'Opteron_G3',+kvm_pv_unhalt \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,strict=off,order=cdn,once=c \
    -enable-kvm -monitor stdio -serial tcp:0:4444,server,nowait
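
Since the CLI above puts the monitor on stdio and the guest serial console on TCP port 4444, the panic messages in [2] can be captured by connecting a TCP client before resuming (a sketch; assumes nc is available on the host and the guest kernel logs to ttyS0):

    nc localhost 4444 | tee guest-serial.log
    # then resume from the monitor on stdio:
    (qemu) cont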

[2] Call trace info:

Kernel panic - not syncing: Fatal exception
Pid: 1182, comm: udisks-part-id Tainted: G      D W  -- ------------    2.6.32-754.6.2.el6.x86_64 #1
Call Trace:
 [<ffffffff8155856a>] ? panic+0xa7/0x18b
 [<ffffffff8155e304>] ? oops_end+0xe4/0x100
 [<ffffffff8100f95b>] ? die+0x5b/0x90
 [<ffffffff8155ddc2>] ? do_general_protection+0x152/0x160
 [<ffffffff8155d235>] ? general_protection+0x25/0x30
 [<ffffffff812b84db>] ? list_del+0x1b/0xa0
 [<ffffffff8113f7d3>] ? __rmqueue+0xc3/0x4a0
 [<ffffffff8114a891>] ? lru_cache_add_lru+0x21/0x40
 [<ffffffff811418e0>] ? get_page_from_freelist+0x590/0x870
 [<ffffffff811431d9>] ? __alloc_pages_nodemask+0x129/0x960
 [<ffffffff811b1522>] ? do_lookup+0xa2/0x230
 [<ffffffff811b01d0>] ? path_to_nameidata+0x20/0x60
 [<ffffffff8117ed8a>] ? alloc_pages_vma+0x9a/0x150
 [<ffffffff8115ee4a>] ? do_wp_page+0x11a/0xa40
 [<ffffffff8115c209>] ? __do_fault+0x459/0x540
 [<ffffffff8115fa4d>] ? handle_pte_fault+0x2dd/0xc80
 [<ffffffff812a9276>] ? prio_tree_insert+0x256/0x2b0
 [<ffffffff81154e90>] ? vma_prio_tree_insert+0x30/0x60
 [<ffffffff81163c3c>] ? __vma_link_file+0x4c/0x80
 [<ffffffff811606f6>] ? handle_mm_fault+0x306/0x450
 [<ffffffff81054db1>] ? __do_page_fault+0x141/0x500
 [<ffffffff81166de5>] ? do_mmap_pgoff+0x335/0x380
 [<ffffffff81155339>] ? sys_mmap_pgoff+0x199/0x340
 [<ffffffff8156029e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8155d265>] ? page_fault+0x25/0x30

Comment 9 Igor Mammedov 2018-09-19 11:46:56 UTC
(In reply to Yumei Huang from comment #8)
> The issue is gone when booting a RHEL6.10 guest with the CLI from comment 0
> on qemu-kvm-rhev-2.12.0-16.el7.
> 
> But QE can still hit a panic with the following steps:
The backtrace indicates it's an unrelated issue;
can you reproduce it with the 2.10 version?

I suggest opening a separate bug for it.

[...]
> [2] Call trace info:
> 
> Kernel panic - not syncing: Fatal exception
> Pid: 1182, comm: udisks-part-id Tainted: G      D W  -- ------------   
> 2.6.32-754.6.2.el6.x86_64 #1
> Call Trace:
>  [<ffffffff8155856a>] ? panic+0xa7/0x18b
>  [<ffffffff8155e304>] ? oops_end+0xe4/0x100
>  [<ffffffff8100f95b>] ? die+0x5b/0x90
>  [<ffffffff8155ddc2>] ? do_general_protection+0x152/0x160
>  [<ffffffff8155d235>] ? general_protection+0x25/0x30
>  [<ffffffff812b84db>] ? list_del+0x1b/0xa0
>  [<ffffffff8113f7d3>] ? __rmqueue+0xc3/0x4a0
>  [<ffffffff8114a891>] ? lru_cache_add_lru+0x21/0x40
>  [<ffffffff811418e0>] ? get_page_from_freelist+0x590/0x870
>  [<ffffffff811431d9>] ? __alloc_pages_nodemask+0x129/0x960
>  [<ffffffff811b1522>] ? do_lookup+0xa2/0x230
>  [<ffffffff811b01d0>] ? path_to_nameidata+0x20/0x60
>  [<ffffffff8117ed8a>] ? alloc_pages_vma+0x9a/0x150
>  [<ffffffff8115ee4a>] ? do_wp_page+0x11a/0xa40
>  [<ffffffff8115c209>] ? __do_fault+0x459/0x540
>  [<ffffffff8115fa4d>] ? handle_pte_fault+0x2dd/0xc80
>  [<ffffffff812a9276>] ? prio_tree_insert+0x256/0x2b0
>  [<ffffffff81154e90>] ? vma_prio_tree_insert+0x30/0x60
>  [<ffffffff81163c3c>] ? __vma_link_file+0x4c/0x80
>  [<ffffffff811606f6>] ? handle_mm_fault+0x306/0x450
>  [<ffffffff81054db1>] ? __do_page_fault+0x141/0x500
>  [<ffffffff81166de5>] ? do_mmap_pgoff+0x335/0x380
>  [<ffffffff81155339>] ? sys_mmap_pgoff+0x199/0x340
>  [<ffffffff8156029e>] ? do_page_fault+0x3e/0xa0
>  [<ffffffff8155d265>] ? page_fault+0x25/0x30

Comment 10 Yumei Huang 2018-09-19 12:36:56 UTC
(In reply to Igor Mammedov from comment #9)
> (In reply to Yumei Huang from comment #8)
> > The issue is gone when booting a RHEL6.10 guest with the CLI from comment 0
> > on qemu-kvm-rhev-2.12.0-16.el7.
> > 
> > But QE can still hit a panic with the following steps:
> The backtrace indicates it's an unrelated issue;
> can you reproduce it with the 2.10 version?
> 
> I suggest opening a separate bug for it.
> 

Yes, it's reproducible with qemu-kvm-rhev-2.10.0-21.el7. 

A new bug[1] has been filed.

[1]https://bugzilla.redhat.com/show_bug.cgi?id=1630850

Comment 11 Yumei Huang 2018-09-20 01:23:11 UTC
Verify:
qemu-kvm-rhev-2.12.0-16.el7
Guest: RHEL6.10, RHEL7.6, Win2008sp2, Win2008r2, Win2012, Win2012r2, Win2016

QE ran the same tests as for bug 1609234 (see comments 7 & 10). Only the RHEL6.10 guest failed 3 cases, due to bug 1630850; all other guests passed all the tests.

Comment 13 errata-xmlrpc 2018-11-01 11:13:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443

