Bug 1393273 - Rebooting guest with invalid hugepage parameters hangs up
Summary: Rebooting guest with invalid hugepage parameters hangs up
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: SLOF
Version: 7.3
Hardware: ppc64le
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 7.4
Assignee: Laurent Vivier
QA Contact: Min Deng
URL:
Whiteboard:
Depends On: 1392055
Blocks: 1339117
 
Reported: 2016-11-09 09:07 UTC by Min Deng
Modified: 2017-08-01 22:33 UTC (History)
CC List: 12 users

Fixed In Version: SLOF-20170303
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-01 22:33:27 UTC
Target Upstream Version:
Embargoed:


Attachments
Screenshot (47.19 KB, image/png), 2016-11-09 09:09 UTC, Min Deng
Error log from spapr-vty console (217.69 KB, text/plain), 2016-11-15 02:49 UTC, Min Deng


Links
Red Hat Product Errata RHBA-2017:2093 (priority: normal, status: SHIPPED_LIVE): SLOF bug fix and enhancement update, last updated 2017-08-01 19:35:59 UTC

Description Min Deng 2016-11-09 09:07:17 UTC
Description of problem:
Rebooting the guest after hotplugging memory causes the guest to hang.

Version-Release number of selected component (if applicable):
kernel-3.10.0-514.el7.ppc64le
qemu-kvm-rhev-2.6.0-27.el7.ppc64le

How reproducible:
2/2

Steps to Reproduce:
1.boot up guest with cli
  /usr/libexec/qemu-kvm -S -name avocado-vt-vm1 -sandbox off -machine pseries -nodefaults -vga std -chardev socket,id=qmp_id_qmpmonitor1,path=/tmp/1,server,nowait -mon chardev=qmp_id_qmpmonitor1,mode=control -chardev socket,id=qmp_id_catch_monitor,path=/tmp/2,server,nowait -mon chardev=qmp_id_catch_monitor,mode=control -chardev socket,id=serial_id_serial0,path=/tmp/3,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -device pci-ohci,id=usb1,bus=pci.0,addr=03 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04 -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/root/test_home/mdeng/staf-kvm-devel/workspace/usr/share/avocado/data/avocado-vt/images/RHEL-Server-7.3-ppc64le-virtio-scsi.qcow2 -device scsi-hd,id=image1,drive=drive_image1 -m 2048,slots=4,maxmem=32G -object memory-backend-file,policy=bind,mem-path=/mnt/kvm_hugepage,host-nodes=0,size=1G,id=mem-mem1 -device pc-dimm,node=1,id=dimm-mem1,memdev=mem-mem1 -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 -numa node,nodeid=0 -numa node,nodeid=1 -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 -vnc :0 -rtc base=utc,clock=host -boot order=cdn,once=c,menu=off,strict=off -enable-kvm -device usb-kbd,id=input0 -device usb-mouse,id=input1 -device usb-tablet,id=input2 -monitor stdio
2.do the following memory hot plug actions
a.{"execute":"qmp_capabilities"}
b.{'execute': 'cont'}
c.{'execute': 'object-add', 'arguments': {'id': 'mem-plug', 'qom-type': 'memory-backend-file', 'props': {'policy': 'bind', 'mem-path': '/mnt/kvm_hugepage', 'host-nodes': [0], 'size': 1073741824}}}
d.{"execute": "device_add", "arguments": {"driver": "pc-dimm", "id": "dimm-plug", "memdev": "mem-plug"}}
3.reboot guest 

Actual results:
The reboot cannot complete successfully and the guest hangs; please see the attached screenshot.
Expected results:
It can reboot successfully.

Additional info:
My host's results of "numactl --show"

policy: default
preferred node: current
physcpubind: 0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136 144 152 
cpubind: 0 1 16 17 
nodebind: 0 1 16 17 
membind: 0 1 16 17

Comment 1 Min Deng 2016-11-09 09:09:57 UTC
Created attachment 1218854 [details]
Screenshot

Comment 2 Min Deng 2016-11-09 09:13:07 UTC
This has only been tried on ppc so far; QE will take a look on x86 as well and will update the bug as soon as the investigation is complete. Thanks!

Comment 4 Min Deng 2016-11-10 07:41:16 UTC
(In reply to dengmin from comment #2)
> This has only been tried on ppc so far; QE will take a look on x86 as well
> and will update the bug as soon as the investigation is complete. Thanks!
  QE could not reproduce it on the x86 platform, so it appears to be a ppc64le-specific bug. If there are any issues, please let me know, thanks!
Build info,
kernel-3.10.0-514.el7.x86_64
qemu-kvm-rhev-2.6.0-27.el7.x86_64

Comment 5 Min Deng 2016-11-15 02:48:56 UTC
Since this induces a series of automation test errors, uploading the related log info here as well, thanks.

Comment 6 Min Deng 2016-11-15 02:49:41 UTC
Created attachment 1220661 [details]
Error log from spapr-vty console

Comment 7 David Gibson 2016-11-25 05:22:28 UTC
dengmin,

Please see if you can still reproduce this with the VGA and USB virtual devices removed (using -nographic and just spapr-vty console).

In fact, in general it is best if you can check what the minimal virtual hardware is to reproduce a bug.  Many of the standard testcases include VGA and USB devices just because a graphical console is normal on x86.  On power using just the hypervisor console (spapr-vty) is the more normal mode of operation.

Comment 8 Min Deng 2016-11-28 07:43:43 UTC
QE re-tested it with the following command line:
/usr/libexec/qemu-kvm -name bug -sandbox off -machine pseries -nodefaults -nographic -chardev socket,id=qmp_id_qmpmonitor1,path=/tmp/1,server,nowait -mon chardev=qmp_id_qmpmonitor1,mode=control -chardev socket,id=qmp_id_catch_monitor,path=/tmp/2,server,nowait -mon chardev=qmp_id_catch_monitor,mode=control -chardev socket,id=serial_id_serial0,path=/tmp/3,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -device pci-ohci,id=usb1,bus=pci.0,addr=03 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04 -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=RHEL-Server-7.3-ppc64le-virtio-scsi.qcow2 -device scsi-hd,id=image1,drive=drive_image1 -m 2048,slots=4,maxmem=32G -object memory-backend-file,policy=bind,mem-path=/mnt/kvm_hugepage,host-nodes=0,size=1G,id=mem-mem1 -device pc-dimm,node=1,id=dimm-mem1,memdev=mem-mem1 -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 -numa node,nodeid=0 -numa node,nodeid=1 -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 -vnc :10 -rtc base=utc,clock=host -boot order=cdn,once=c,menu=off,strict=off -enable-kvm -device usb-kbd,id=input0 -device usb-mouse,id=input1 -device usb-tablet,id=input2 -monitor stdio

1.boot up guest and monitor guest via spapr-vty console
2.
a.{"execute":"qmp_capabilities"}
b.{'execute': 'cont'}
c.{'execute': 'object-add', 'arguments': {'id': 'mem-plug', 'qom-type': 'memory-backend-file', 'props': {'policy': 'bind', 'mem-path': '/mnt/kvm_hugepage', 'host-nodes': [0], 'size': 1073741824}}}
d.{"execute": "device_add", "arguments": {"driver": "pc-dimm", "id": "dimm-plug", "memdev": "mem-plug"}}

3.reboot guest

Actual results:
Kernel 3.10.0-514.el7.ppc64le on an ppc64le

localhost login: 

SLOF **********************************************************************
QEMU Starting
 Build Date = Aug  5 2016 01:08:50
 FW Version = mockbuild@ release 20160223
 Press "s" to enter Open Firmware.

Populating /vdevice methods
Populating /vdevice/vty@30000000
Populating /vdevice/nvram@71000000
Populating /pci@800000020000000
                     00 2000 (D) : 1af4 1004    virtio [ scsi ]
Populating /pci@800000020000000/scsi@4
 

( 300 ) Data Storage Exception [ 1012000002c ]


    R0 .. R7           R8 .. R15         R16 .. R23         R24 .. R31
000000003dbea094   000000003e556e40   0000000002b6c850   000000003dbfb028   
000000003e45ff50   0000000000000100   0000000002000000   0000000000000006   
000000003dc016d0   0000000000000080   0000000000000060   000000003dbf8800   
00000000b2220800   000000003fbadf50   000000003dbe0e58   000000003dbfae58   
0000000000000000   0000000000000000   0000000000000047   0000000000000000   
000001012000002c   c000000007b80000   000000003e537cb4   0000000000000000   
0000000000000000   000000003dc55d58   000000003e527d9d   0000000000000001   
000000003e556e38   0000000002b6c858   000000003dbe19ac   000000003e454780   

    CR / XER           LR / CTR          SRR0 / SRR1        DAR / DSISR
        84002024   000000003dbea094   000000003dbea0b4   00000000b2220800   
0000000020000000   000000003dbe2b9c   8000000000001000           42000000   


7 > 


Expected results:
The guest reboots successfully.

Comment 9 Min Deng 2016-11-28 08:07:07 UTC
Also re-tested with a simpler command line (USB and VGA devices removed this time) and tried about 6 times, but didn't hit the issue.
 /usr/libexec/qemu-kvm -name bug -sandbox off -machine pseries -nodefaults -nographic -chardev socket,id=qmp_id_catch_monitor,path=/tmp/2,server,nowait -mon chardev=qmp_id_catch_monitor,mode=control -chardev socket,id=serial_id_serial0,path=/tmp/3,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04 -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=RHEL-Server-7.3-ppc64le-virtio-scsi.qcow2 -device scsi-hd,id=image1,drive=drive_image1 -m 2048,slots=4,maxmem=32G -monitor stdio
 If there are any issues, please let me know, thanks a lot.

Comment 10 Min Deng 2016-11-28 10:40:21 UTC
QE could not reproduce the bug 100% of the time so far, but still did the following tests:
1. Without the -vga device but with "-vnc :10" in the command line (see comment 8): 25% reproducible
  /usr/libexec/qemu-kvm -S -name bug -sandbox off -machine pseries -nodefaults -nographic -chardev socket,id=qmp_id_qmpmonitor1,path=/tmp/1,server,nowait -mon chardev=qmp_id_qmpmonitor1,mode=control -chardev socket,id=qmp_id_catch_monitor,path=/tmp/2,server,nowait -mon chardev=qmp_id_catch_monitor,mode=control -chardev socket,id=serial_id_serial0,path=/tmp/3,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -device pci-ohci,id=usb1,bus=pci.0,addr=03 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04 -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=RHEL-Server-7.3-ppc64le-virtio-scsi.qcow2 -device scsi-hd,id=image1,drive=drive_image1 -m 2048,slots=4,maxmem=32G -object memory-backend-file,policy=bind,mem-path=/mnt/kvm_hugepage,host-nodes=0,size=1G,id=mem-mem1 -device pc-dimm,node=1,id=dimm-mem1,memdev=mem-mem1 -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 -numa node,nodeid=0 -numa node,nodeid=1 -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 -vnc :10 -rtc base=utc,clock=host -boot order=cdn,once=c,menu=off,strict=off -enable-kvm -device usb-kbd,id=input0 -device usb-mouse,id=input1 -device usb-tablet,id=input2 -monitor stdio

2. Without the USB devices but with VGA and "-vnc :10": 33% reproducible
  /usr/libexec/qemu-kvm -S -name bug -sandbox off -machine pseries -nodefaults -vga std -chardev socket,id=qmp_id_catch_monitor,path=/tmp/2,server,nowait -mon chardev=qmp_id_catch_monitor,mode=control -chardev socket,id=serial_id_serial0,path=/tmp/3,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04 -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=RHEL-Server-7.3-ppc64le-virtio-scsi.qcow2 -device scsi-hd,id=image1,drive=drive_image1 -m 2048,slots=4,maxmem=32G -object memory-backend-file,policy=bind,mem-path=/mnt/kvm_hugepage,host-nodes=0,size=1G,id=mem-mem1 -device pc-dimm,node=1,id=dimm-mem1,memdev=mem-mem1 -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 -numa node,nodeid=0 -numa node,nodeid=1 -vnc :10 -rtc base=utc,clock=host -enable-kvm -monitor stdio

3. Without VGA and USB devices: cannot reproduce the issue.

If there are any issues, please let me know, thanks a lot.

Min

Comment 11 xianwang 2017-01-10 07:51:48 UTC
Hit this issue in the latest test; 100% reproducible.

version:
Host:
3.10.0-514.el7.ppc64le
qemu-kvm-rhev-2.6.0-28.el7_3.2.ppc64le
SLOF-20160223-6.gitdbbfda4.el7.noarch
Guest:
3.10.0-514.6.1.el7.ppc64le

Steps are as follows:
(1) Boot a guest with NUMA nodes; the full qemu command line is as follows:
/usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine pseries  \
    -nodefaults  \
    -vga std  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_BhCv7g/monitor-qmpmonitor1-20170109-213621-EvB2O3uM,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_BhCv7g/monitor-catch_monitor-20170109-213621-EvB2O3uM,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_BhCv7g/serial-serial0-20170109-213621-EvB2O3uM,server,nowait \
    -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
    -device pci-ohci,id=usb1,bus=pci.0,addr=03 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04 \
    -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/root/staf-kvm-devel/workspace/usr/share/avocado/data/avocado-vt/images/rhel73-ppc64le-virtio-scsi.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-pci,mac=9a:ed:ee:ef:f0:f1,id=idCfRyR1,vectors=4,netdev=idGmxYYm,bus=pci.0,addr=05  \
    -netdev tap,id=idGmxYYm,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -m 1024,slots=4,maxmem=32G \
    -smp 16,maxcpus=16,cores=8,threads=1,sockets=2  \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
    -object memory-backend-file,policy=default,mem-path=/mnt/kvm_hugepage,size=1G,id=mem-mem1 \
    -device pc-dimm,node=1,id=dimm-mem1,memdev=mem-mem1  \
    -numa node,nodeid=0  \
    -numa node,nodeid=1 \
    -vnc :3  \
    -qmp tcp:0:8881,server,nowait \
    -monitor stdio \
    -rtc base=utc,clock=host  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm
(2) Start and reset the VM in HMP:
(qemu) cont
(qemu) system_reset

(3) Actual result
After executing "cont", the VM works normally.
After executing "system_reset", the VM hangs.
The log printed on the serial port is as below:
[root@dhcp113-12 ~]# 

SLOF **********************************************************************
QEMU Starting
 Build Date = Aug  5 2016 01:08:50
 FW Version = mockbuild@ release 20160223
 Press "s" to enter Open Firmware.

Populating /vdevice methods
Populating /vdevice/vty@30000000
Populating /vdevice/nvram@71000000
Populating /pci@800000020000000
                     00 2800 (D) : 1af4 1000    virtio [ net ]
                     00 2000 (D) : 1af4 1004    virtio [ scsi ]
Populating /pci@800000020000000/scsi@4
 

( 300 ) Data Storage Exception [ 1012000402c ]


    R0 .. R7           R8 .. R15         R16 .. R23         R24 .. R31
000000001dbea094   000000001e560e68   0000000002b6c950   000000001dbfb028   
000000001e45ff50   0000000000000100   0000000002000000   0000000000000006   
000000001dc016d0   0000000000000080   0000000000000060   000000001dbf8800   
000000006d720800   000000001fbadf50   000000001dbe0e58   000000001dbfae58   
0000000000000000   0000000000000000   0000000000000047   0000000000000000   
000001012000402c   c000000007b80000   000000001e5409fc   0000000000000001   
0000000000000100   000000001dc55d58   000000001e530ae5   0000000000000002   
000000001e560e60   0000000002b6c958   000000001dbe19ac   000000001e450700   

    CR / XER           LR / CTR          SRR0 / SRR1        DAR / DSISR
        84002044   000000001dbea094   000000001dbea0b4   000000006d720800   
0000000020000000   000000001dbe2b9c   8000000000001000           42000000   


7 > 
(4) Additional info
If the VM is booted without the following options:
    -object memory-backend-file,policy=default,mem-path=/mnt/kvm_hugepage,size=1G,id=mem-mem1 \
    -device pc-dimm,node=1,id=dimm-mem1,memdev=mem-mem1  \
    -numa node,nodeid=0  \
    -numa node,nodeid=1 \
the VM can be reset normally and works well.

Comment 13 Laurent Vivier 2017-01-11 13:28:23 UTC
The problem seems not to be with hugepages but with adding regular file-backed memory.

Just a reminder, we must prepare the host to provide some hugepage memory to the guest by doing:

# mkdir /mnt/kvm_hugepage
# mount -t hugetlbfs none /mnt/kvm_hugepage/ -o pagesize=16M
# echo 1024 > /proc/sys/vm/nr_hugepages
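(A quick sanity check, not part of the original steps and assuming only the standard /proc and mount interfaces: confirm the pool was actually populated before starting the guest.)

# grep Huge /proc/meminfo         (HugePages_Total should report 1024)
# mount | grep hugetlbfs          (should list /mnt/kvm_hugepage)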

But then the hugepages are not used; I get a warning message at the start of the guest:

    qemu-kvm: Huge page support disabled (n/a for main memory).

To be able to use hugepages with NUMA nodes, you must have hugepage memory on the nodes themselves:

    -m 2G,slots=4,maxmem=32G \
    -numa node,nodeid=0,memdev=mem-mem0  \
    -numa node,nodeid=1,memdev=mem-mem1 \
    -object memory-backend-file,policy=default,mem-path=/mnt/kvm_hugepage,size=1G,id=mem-mem0 \
    -object memory-backend-file,policy=default,mem-path=/mnt/kvm_hugepage,size=1G,id=mem-mem1

Then you can hotplug hugepage memory:

    -object memory-backend-file,policy=default,mem-path=/mnt/kvm_hugepage,size=1G,id=mem-mem2 \
    -device pc-dimm,node=1,id=dimm-mem2,memdev=mem-mem2  \

And in this case we can reset the system without any problem.

So the problem is not with hugepages but with file-backed memory that does not actually use hugepages.

I'm investigating to understand what happens...

Comment 14 Laurent Vivier 2017-01-11 13:54:10 UTC
The problem can be reproduced without hugepage memory, without NUMA parameters, but with at least 2 CPUs:

    -m 1G,slots=1,maxmem=32G \
    -smp 2 \
    -object memory-backend-file,policy=default,mem-path=/var/tmp/backed-mem,size=1G,id=mem-mem2 \
    -device pc-dimm,id=dimm-mem2,memdev=mem-mem2

Comment 15 Laurent Vivier 2017-01-11 14:06:23 UTC
Nice "yum update" side effect: I've update the kernel in the guest to "3.10.0-537.el7.ppc64le" and the problem disappears...

Comment 16 Laurent Vivier 2017-01-11 16:22:17 UTC
The problem seems related to the kernel:
- "system_reset" while we are waiting in SLOF doesn't trigger the problem,
- a "reboot" from linux triggers the problem (with identified broken kernel) like "system_reset" does.

Comment 17 Laurent Vivier 2017-01-11 16:50:45 UTC
The problem appears only if the memory is hotplugged before the kernel is started:
- either with "-object" and "-device",
- or HMP commands "object_add" and "device_add" BEFORE the actual start of the kernel.
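For illustration only (a sketch, not taken from the original comment; the object and device IDs are arbitrary), the HMP variant with a guest started with "-S" would look roughly like:

(qemu) object_add memory-backend-file,id=mem-plug,size=1G,mem-path=/mnt/kvm_hugepage
(qemu) device_add pc-dimm,id=dimm-plug,memdev=mem-plug
(qemu) cont

i.e. the memory is plugged while the guest is still stopped, so it is already present before the kernel ever starts.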

Comment 18 Laurent Vivier 2017-01-11 17:44:15 UTC
dengmin,

could you check:
- if you use the good parameters for hugepage memory (see comment #13),
  you don't have any problem,
- if you use parameters from comment #14, you have the problem.

Thanks

Comment 19 Min Deng 2017-01-12 10:19:09 UTC
(In reply to Laurent Vivier from comment #18)
> dengmin,
> 
> could you check:
> - if you use the good parameters for hugepage memory (see comment #13),
>   you don't have any problem,
    Yes, I don't hit the issue.
    Cli,...-smp 16,maxcpus=16,cores=8,threads=1,sockets=2 -m 2G,slots=4,maxmem=32G -numa node,nodeid=0,memdev=mem-mem0 -numa node,nodeid=1,memdev=mem-mem1 -object memory-backend-file,policy=default,mem-path=/mnt/kvm_hugepage,size=1G,id=mem-mem0 -object memory-backend-file,policy=default,mem-path=/mnt/kvm_hugepage,size=1G,id=mem-mem1 

> - if you use parameters from comment #14, you have the problem.
   I still don't hit the problem.
   cli,
    ...
    -m 1G,slots=1,maxmem=32G \
    -smp 2 \
    -object memory-backend-file,policy=default,mem-path=/var/tmp/backed-mem,size=1G,id=mem-mem2 \
    -device pc-dimm,id=dimm-mem2,memdev=mem-mem2 
    ...

   Finally, I can only reproduce the issue with my original command line:
   /usr/libexec/qemu-kvm -S -name avocado-vt-vm1 -sandbox off -machine pseries -nodefaults -vga std -chardev socket,id=qmp_id_qmpmonitor1,path=/tmp/qmp1,server,nowait -mon chardev=qmp_id_qmpmonitor1,mode=control -chardev socket,id=qmp_id_catch_monitor,path=/tmp/qmp2,server,nowait -mon chardev=qmp_id_catch_monitor,mode=control -chardev socket,id=serial_id_serial0,path=/tmp/t,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -device pci-ohci,id=usb1,bus=pci.0,addr=03 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04 -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=rhel73-ppc64le-virtio-scsi.qcow2 -device scsi-hd,id=image1,drive=drive_image1 -device virtio-net-pci,mac=9a:bf:c0:c1:c2:c3,id=hostnet0,vectors=4,netdev=hostnet1,bus=pci.0,addr=05 -netdev tap,id=hostnet1,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -m 1024,slots=4,maxmem=32G -object memory-backend-file,policy=default,mem-path=/mnt/kvm_hugepage,size=1G,id=mem-mem1 -device pc-dimm,node=1,id=dimm-mem1,memdev=mem-mem1 -smp 16,maxcpus=16,cores=8,threads=1,sockets=2 -numa node,nodeid=0 -numa node,nodeid=1 -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 -vnc :12 -rtc base=utc,clock=host -boot order=cdn,once=c,menu=off,strict=off -enable-kvm -device usb-kbd,id=input0 -device usb-mouse,id=input1 -device usb-tablet,id=input2 -monitor stdio

Comment 20 Min Deng 2017-01-12 10:22:45 UTC
Per the developer's request, QE tried with the latest kernel-3.10.0-541.el7.ppc64le.rpm installed in the guest, but can still reproduce the issue.

Comment 21 Laurent Vivier 2017-01-12 10:37:45 UTC
I confirm the problem occurs only with invalid hugepage parameters (with the warning "qemu-kvm: Huge page support disabled (n/a for main memory)."), with the latest kernel too.

Comment 22 Laurent Vivier 2017-01-12 14:17:35 UTC
The problem occurs only if we use a "virtio-scsi-pci" device; if we use a "spapr-vscsi" device to access the disk, the guest reboots fine.
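For reference (a sketch, not from the original comment; the "vscsi0" ID is arbitrary), switching the reproducer's disk from virtio-scsi-pci to spapr-vscsi would mean replacing the virtio-scsi-pci controller options with something like:

    -device spapr-vscsi,id=vscsi0 \
    -device scsi-hd,id=image1,bus=vscsi0.0,drive=drive_image1 \

which avoids the SLOF virtio code path entirely.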

As we can see in comment #8, the error happens during the SCSI bus scan:

Populating /pci@800000020000000
                     00 2800 (D) : 1af4 1000    virtio [ net ]
                     00 2000 (D) : 1af4 1004    virtio [ scsi ]
Populating /pci@800000020000000/scsi@4
 

( 300 ) Data Storage Exception [ 1012000402c ]

After the boot failure, if I unplug and replug the card, it works again:

device_del scsi0

system_reset

device_add virtio-scsi-pci,id=scsi0
__com.redhat_drive_add file=/var/lib/libvirt/images/laurent-rhel73-le.qcow2,format=qcow2,id=drive-disk0,werror=stop,rerror=stop
device_add scsi-hd,id=disk,bus=scsi0.0,bootindex=1,drive=drive-disk0

system_reset

Comment 23 Laurent Vivier 2017-01-12 15:21:03 UTC
Bisected to SLOF commit:

commit d8296da8960a3d469c41c95b93d2ac0e629755df
Author: Nikunj A Dadhania <nikunj.ibm.com>
Date:   Wed Feb 10 14:35:08 2016 +0530

    virtio-scsi: enable virtio 1.0
    
    Also add a device file for non-transitional pci device id: 0x1048
    
    Signed-off-by: Nikunj A Dadhania <nikunj.ibm.com>
    Reviewed-by: Thomas Huth <thuth>
    Signed-off-by: Alexey Kardashevskiy <aik>

So the problem seems to be with virtio 1.0 implementation in SLOF.

To work around this problem, we can add "disable-modern=true" to the virtio-scsi-pci parameters.
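Applied to the reproducer command line from comment 8, the workaround would look like this (a sketch; only the "disable-modern=true" property is the workaround itself, the other options are unchanged):

    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04,disable-modern=true

With the modern (virtio 1.0) interface disabled, the device is exposed as a legacy virtio device, so the affected virtio 1.0 code path in SLOF is not exercised.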

Comment 24 Laurent Vivier 2017-01-13 11:25:18 UTC
The involved part of SLOF is:

int virtioscsi_init(struct virtio_device *dev)
{
...
        while(1) {
                qsize = virtio_get_qsize(dev, idx);
                if (!qsize)
                        break;
                virtio_vring_size(qsize);

                vq_avail = virtio_get_vring_avail(dev, idx);
                vq_avail->flags = virtio_cpu_to_modern16(dev,
                                                    VRING_AVAIL_F_NO_INTERRUPT);
                vq_avail->idx = 0;
                idx++;
        }
...

It crashes while accessing "vq_avail->flags".

On the first boot it works and doesn't crash because "vq_avail" is always NULL.
That is not the reason for our crash, but it should be corrected as:

                if (vq_avail) {
                        vq_avail->flags = virtio_cpu_to_modern16(dev, 
                                                    VRING_AVAIL_F_NO_INTERRUPT);
                        vq_avail->idx = 0;
                }

After a "system_reset", vq_avail is not NULL, and to have hotplugged memory changes this address, this is why it crashes in this case.

Without hotplugged memory, vq_avail is:              0x196d0800
With hugepage correctly configured, vq_avail is:     0x236b0800
With hugepage not correctly configured, and disable-modern=true,
vq_avail is:                                         0x1e456680
With hugepage not correctly configured, vq_avail is: 0x7b620800 -> CRASH
[^^^ this case is like normal hotplugged memory, but this address can change
 according to parameters like "-m", "-kernel", ... and triggers the problem or not]

Thomas, is there a maximum value for the address range SLOF can use?

Comment 25 Thomas Huth 2017-01-13 11:49:04 UTC
SLOF runs in real mode, so maybe that's the issue here? Can you add a printf() to hw/ppc/spapr.c to check the value of spapr->rma_size?

Also I think vq_avail should never be NULL here during first boot - if that's the case, something very strange is going on here.

Comment 26 Laurent Vivier 2017-01-13 14:50:25 UTC
(In reply to Thomas Huth from comment #25)
> Also I think vq_avail should never be NULL here during first boot - if
> that's the case, something very strange is going on here.

Yes, you're right; the NULL pointer and the storage exception have the same cause: the queue is not enabled.

In QEMU, the "vring.avail" pointer is set when the queue is enabled, and in our case this is never done.

SLOF normally enables the queue by calling "virtio_set_qaddr()", which is done in "virtio_queue_init_vq()". For virtio-scsi, this call is never made.

If I modify the virtioscsi_init() function like this to enable the queues, everything works fine:

@@ -130,7 +130,10 @@ int virtioscsi_init(struct virtio_device *dev)
                qsize = virtio_get_qsize(dev, idx);
                if (!qsize)
                        break;
-               virtio_vring_size(qsize);
+               virtio_set_qaddr(dev, idx,
+                                 (unsigned long)SLOF_alloc_mem_aligned(
+                                                       virtio_vring_size(qsize),
+                                                       4096));
 
                vq_avail = virtio_get_vring_avail(dev, idx);
                vq_avail->flags = virtio_cpu_to_modern16(dev, VRING_AVAIL_F_NO_INTERRUPT);

Comment 27 Laurent Vivier 2017-02-06 16:14:45 UTC
Now upstream:

https://github.com/aik/SLOF/commit/007a175410f919a4368499bd8ef11c32bbf3e01e

We'll get it with the rebase to the SLOF version that will be shipped with QEMU v2.9, so moving this BZ to the POST state now.

Comment 30 Yongxue Hong 2017-03-23 08:41:47 UTC
The following are the verification steps:

1.Version:
Host:3.10.0-623.el7.ppc64le
Qemu:qemu-kvm-rhev-2.9.0-0.el7.mrezanin201703210848
SLOF:SLOF.noarch  20170303-1.git66d250e.el7

2.Steps to Verify:
Same as in the Description above

3.Actual results:
It can reboot successfully.

This bug is fixed; changing the status to VERIFIED.

Comment 31 errata-xmlrpc 2017-08-01 22:33:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2093

