Bug 1265576 - hugepage could not be used inside guest if start the guest with NUMA supported huge pages
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.2
Hardware: ppc64le
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: David Gibson
QA Contact: Virtualization Bugs
Keywords: ZStream
Depends On:
Blocks: RHV4.1PPC 1279387 1288337 RHEV4.0PPC
Reported: 2015-09-23 05:18 EDT by Gu Nini
Modified: 2016-11-07 15:42 EST

Fixed In Version: QEMU 2.4
Doc Type: Bug Fix
Clones: 1279387
Last Closed: 2016-11-07 15:42:36 EST
Type: Bug


Attachments:
Screenshot_when_guest_stall.png (260.52 KB, image/png), 2015-09-25 06:52 EDT, Gu Nini

Description Gu Nini 2015-09-23 05:18:00 EDT
Description of problem:
Hugepages cannot be used inside the guest when the guest is started with NUMA nodes backed by huge pages.

Version-Release number of selected component (if applicable):
Host kernel: 3.10.0-316.el7.ppc64le
Guest kernel: 3.10.0-316.el7.ppc64
Qemu-kvm-rhev: qemu-kvm-rhev-2.3.0-23.el7.ppc64le
SLOF: SLOF-20150313-5.gitc89b0df.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. On the host, check whether hugetlbfs is mounted; if not, mount it. Then allocate pages for it:
# mount
# mount -t hugetlbfs hugetlbfs /dev/hugepages -o pagesize=16M
# echo 384 > /proc/sys/vm/nr_hugepages

2. Start a guest with NUMA nodes backed by hugepages:

/usr/libexec/qemu-kvm -name spaprqcow2-0921 -machine pseries,accel=kvm,usb=off -m 2560M -realtime mlock=off -smp 5,sockets=1,cores=5,threads=1 ... ****-object memory-backend-file,host-nodes=0,policy=interleave,id=mem-0,size=1536M,prealloc=yes,mem-path=/dev/hugepages -numa node,memdev=mem-0,nodeid=1 -object memory-backend-file,host-nodes=1,policy=interleave,id=mem-1,size=1024M,prealloc=yes,mem-path=/dev/hugepages -numa node,memdev=mem-1,nodeid=0**** -vnc 0:06 -msg timestamp=on

3. After the guest boots, check whether hugepages can be used inside the guest:
# mount
# mount -t hugetlbfs hugetlbfs /mnt
mount: unknown filesystem type 'hugetlbfs'

4. On the host, quit the guest, then restart it with hugepages only, without NUMA:
/usr/libexec/qemu-kvm -name spaprqcow2-0921 -machine pseries,accel=kvm,usb=off -m 2560M -realtime mlock=off -smp 5,sockets=1,cores=5,threads=1 ... ****-mem-prealloc -mem-path /dev/hugepages**** -vnc 0:06 -msg timestamp=on

5. Repeat step 3 to check whether hugepages can be used inside the guest.
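
As a quick sanity check on steps 1 and 2, the page arithmetic can be sketched in Python. This is only an illustration: it assumes the 16M page size from the mount option and the two backend sizes from the command line, the helper name `pages_needed` is hypothetical, and it ignores how the pool is spread across host NUMA nodes.

```python
MIB = 1024 * 1024


def pages_needed(size_bytes, page_bytes):
    """Number of pages required to back size_bytes, rounded up."""
    return -(-size_bytes // page_bytes)


page_size = 16 * MIB          # pagesize=16M from the mount option in step 1
backends = {"mem-0": 1536 * MIB, "mem-1": 1024 * MIB}  # sizes from step 2
reserved = 384                # value written to /proc/sys/vm/nr_hugepages

needed = sum(pages_needed(size, page_size) for size in backends.values())
print(f"pages needed: {needed}, reserved: {reserved}")
```

The two backends need 160 pages in total, so the 384 reserved pages should be enough and pool exhaustion is unlikely to be the cause of the failure.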


Actual results:
In step 3, hugepages could not be used when combined with NUMA.
In step 5, hugepages could be used.
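
A less manual way to run the check from steps 3 and 5 might look like the sketch below. It assumes that a guest which rejects the mount with "unknown filesystem type" also does not list hugetlbfs in /proc/filesystems; the report does not show that file, so treat this as an assumption. The function name and the sample contents are made up for illustration.

```python
def supports_filesystem(proc_filesystems_text, fs_name):
    """Return True if fs_name appears in text shaped like /proc/filesystems.

    Each line is either "nodev<TAB>name" or "<TAB>name", so the name is
    always the last whitespace-separated field.
    """
    for line in proc_filesystems_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == fs_name:
            return True
    return False


# Hypothetical sample contents, not taken from the report.
working_guest = "nodev\tsysfs\nnodev\ttmpfs\nnodev\thugetlbfs\n\text4\n"
failing_guest = "nodev\tsysfs\nnodev\ttmpfs\n\text4\n"

print(supports_filesystem(working_guest, "hugetlbfs"))
print(supports_filesystem(failing_guest, "hugetlbfs"))
```

Inside a real guest the text would come from reading /proc/filesystems.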

Expected results:
Hugepages can be used when combined with NUMA.

Additional info:
Comment 2 David Gibson 2015-09-25 01:05:29 EDT
I think I know what is causing this.

getrampagesize() in target-ppc/kvm.c takes into account the page size for the "default" RAM pool (i.e. the one controlled by -mem-path) even if all of the guest's memory is actually taken from memory-backend-file objects that are configured for hugepages.

I believe you should be able to work around this by adding -mem-path /dev/hugepages/... to the first command line.
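
The behaviour described above, and why the workaround helps, can be paraphrased in Python. This is a sketch of the comment's reasoning only, not the actual getrampagesize() code; the function name and the 64 KiB base page size are assumptions.

```python
KIB = 1024
MIB = 1024 * KIB


def ram_page_size_before_fix(mem_path_page_size, base_page_size):
    """Page size considered before the fix, as described in this comment.

    Only the global -mem-path is consulted. Page sizes of explicit
    memory-backend-file objects are not an input at all, which is the bug.
    """
    if mem_path_page_size is None:      # no -mem-path on the command line
        return base_page_size
    return mem_path_page_size


base = 64 * KIB  # assumed base page size of the host

# Original command line: 16M backends, but no -mem-path.
print(ram_page_size_before_fix(None, base) // KIB, "KiB")

# Workaround: add -mem-path pointing at a 16M hugetlbfs mount.
print(ram_page_size_before_fix(16 * MIB, base) // KIB, "KiB")
```

In the first case only the base page size is considered, so the guest is never offered hugepages even though all of its memory is backed by them.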

This is a real bug, but given there's a workaround, I think we can punt this to 7.3.
Comment 3 Gu Nini 2015-09-25 06:50:01 EDT
(In reply to David Gibson from comment #2)

After adding '-mem-path /dev/hugepages/' and restarting the guest, hugepages can be used inside the guest, i.e. mount and libhugetlbfs work OK. The detailed qemu command line is as follows:

/usr/libexec/qemu-kvm -name spaprqcow2-0921 -machine pseries,accel=kvm,usb=off -m 2560M -realtime mlock=off -smp 5,sockets=1,cores=5,threads=1 ... ****-mem-path=/dev/hugepages -object memory-backend-file,host-nodes=0,policy=interleave,id=mem-0,size=1536M,prealloc=yes,mem-path=/dev/hugepages -numa node,memdev=mem-0,nodeid=0 -object memory-backend-file,host-nodes=1,policy=interleave,id=mem-1,size=1024M,prealloc=yes,mem-path=/dev/hugepages -numa node,memdev=mem-1,nodeid=1**** -vnc 0:06 -msg timestamp=on


However, if the guest is booted with one hugepage-backed NUMA node and one non-hugepage node, as in the following qemu command line:

/usr/libexec/qemu-kvm -name spaprqcow2-0921 -machine pseries,accel=kvm,usb=off -m 2560M -realtime mlock=off -smp 5,sockets=1,cores=5,threads=1 ... ****-mem-path=/dev/hugepages -object memory-backend-ram,host-nodes=0,policy=interleave,id=mem-0,size=1536M,prealloc=yes -numa node,memdev=mem-0,nodeid=0 -object memory-backend-file,host-nodes=1,policy=interleave,id=mem-1,size=1024M,prealloc=yes,mem-path=/dev/hugepages -numa node,memdev=mem-1,nodeid=1**** -vnc 0:06 -msg timestamp=on

The guest stalls at:
......
Calling quiesce...
returning from prom_init

For details, please refer to the attachment 'Screenshot_when_guest_stall.png'.


David,

Given the above problem, do you think bz 1211112 can be verified? In my opinion it's OK if this has the same cause as the original bug.
Comment 4 Gu Nini 2015-09-25 06:52 EDT
Created attachment 1076989 [details]
Screenshot_when_guest_stall.png
Comment 5 David Gibson 2015-09-28 23:00:56 EDT
Nini,

No, I think we'll need to fix this bug before we can verify bug 1211112.

The stall is, I think, the opposite problem to the one reported earlier in this bug: with -mem-path set, but one of the memory backend objects lacking hugepage support, the guest attempts to use hugepages, which can't actually work.

I think there is a fix for this upstream which we'll need to backport.
Comment 6 David Gibson 2015-09-29 00:22:29 EDT
I had a look at upstream, and found 2d103aa which improves the handling of hugepage sizes with memory backends, although I think it doesn't get all the edge cases right.

I've posted a backport; a brew build to test is at http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9896624
Comment 11 David Gibson 2015-10-21 19:26:25 EDT
Andrea,

We need to work out if this bug is likely to be hit in practice on RHEV.  The problem is when using the -object memory-backend-file options for qemu instead of the normal, simple way of configuring backing memory.

When you have a chance can you see if libvirt will ever use that option, and if so what sorts of XML will trigger it.
Comment 13 Andrea Bolognani 2015-10-30 10:25:46 EDT
(In reply to David Gibson from comment #11)
> Andrea,
> 
> We need to work out if this bug is likely to be hit in practice on RHEV. 
> The problem is when using the -object memory-backend-file options for qemu
> instead of the normal, simple way of configuring backing memory.
> 
> When you have a chance can you see if libvirt will ever use that option, and
> if so what sorts of XML will trigger it.

memory-backend-file is used whenever hugepages are
configured as backing memory for a guest NUMA node, eg.

  <domain>
    ...
    <memoryBacking>
      <hugepages>
        <page size='16384' unit='KiB' nodeset='0'/>
      </hugepages>
    </memoryBacking>
    <cpu>
      <topology sockets='1' cores='1' threads='8'/>
      <numa>
        <cell id='0' cpus='0-7' memory='2097152' unit='KiB'/>
      </numa>
    </cpu>
    ...
  </domain>

memory-backend-ram can be used in some other situations,
is that problematic as well?
Comment 14 David Gibson 2015-11-04 20:55:33 EST
Andrea,

So problems will only arise if some of the memory-backend-file instances have different "hugepageness" than the main backing for memory specified with -mem-path.  Will that ever be the case?
Comment 15 Andrea Bolognani 2015-11-06 08:39:50 EST
Yes, that can happen.

If your guest is configured like

  <domain type='kvm'>
    <name>abologna-rhel72-1102-le</name>
    <uuid>56050ed5-0e5a-4683-ae2d-1c53fc332954</uuid>
    <maxMemory slots='2' unit='KiB'>10485760</maxMemory>
    <memory unit='KiB'>3145728</memory>
    <currentMemory unit='KiB'>1048576</currentMemory>
    <memoryBacking>
      <hugepages>
        <page size='16384' unit='KiB' nodeset='0'/>
      </hugepages>
    </memoryBacking>
    <vcpu placement='static'>1</vcpu>
    <os>
      <type arch='ppc64le' machine='pseries-rhel7.2.0'>hvm</type>
      <boot dev='hd'/>
    </os>
    <cpu mode='custom' match='exact'>
      <model fallback='allow'>POWER8</model>
      <numa>
        <cell id='0' cpus='0' memory='2097152' unit='KiB'/>
      </numa>
    </cpu>
    <clock offset='utc'/>
    <on_poweroff>destroy</on_poweroff>
    <on_reboot>restart</on_reboot>
    <on_crash>restart</on_crash>
    <devices>
      <emulator>/usr/libexec/qemu-kvm</emulator>
      <disk type='file' device='disk'>
        <driver name='qemu' type='qcow2'/>
        <source file='/var/lib/libvirt/images/abologna-rhel72-1102-le.qcow2'/>
        <target dev='sda' bus='scsi'/>
        <address type='drive' controller='0' bus='0' target='0' unit='0'/>
      </disk>
      <controller type='usb' index='0'>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
      </controller>
      <controller type='pci' index='0' model='pci-root'/>
      <controller type='scsi' index='0'>
        <address type='spapr-vio' reg='0x2000'/>
      </controller>
      <interface type='network'>
        <mac address='52:54:00:43:ce:e6'/>
        <source network='default'/>
        <model type='virtio'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
      </interface>
      <serial type='pty'>
        <target port='0'/>
        <address type='spapr-vio' reg='0x30000000'/>
      </serial>
      <console type='pty'>
        <target type='serial' port='0'/>
        <address type='spapr-vio' reg='0x30000000'/>
      </console>
      <memballoon model='virtio'>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
      </memballoon>
      <panic/>
      <memory model='dimm'>
        <source>
          <pagesize unit='KiB'>64</pagesize>
        </source>
        <target>
          <size unit='KiB'>1048576</size>
          <node>0</node>
        </target>
      </memory>
    </devices>
  </domain>

then the resulting qemu command line will look like

  /usr/libexec/qemu-kvm
  -name abologna-rhel72-1102-le
  -S
  -machine pseries-rhel7.2.0,accel=kvm,usb=off
  -cpu POWER8
  -m size=2097152k,slots=2,maxmem=10485760k
  -realtime mlock=off
  -smp 1,sockets=1,cores=1,threads=1
  -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu,size=2147483648
  -numa node,nodeid=0,cpus=0,memdev=ram-node0
  -object memory-backend-ram,id=memdimm0,size=1073741824
  -device pc-dimm,node=0,memdev=memdimm0,id=dimm0
  -uuid 56050ed5-0e5a-4683-ae2d-1c53fc332954
  -nographic
  -no-user-config
  -nodefaults
  -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-abologna-rhel72-1102-le/monitor.sock,server,nowait
  -mon chardev=charmonitor,id=monitor,mode=control
  -rtc base=utc
  -no-shutdown
  -boot strict=on
  -device spapr-vscsi,id=scsi0,reg=0x2000
  -usb
  -drive file=/var/lib/libvirt/images/abologna-rhel72-1102-le.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2
  -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
  -netdev tap,fd=22,id=hostnet0,vhost=on,vhostfd=23
  -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:43:ce:e6,bus=pci.0,addr=0x1
  -chardev pty,id=charserial0
  -device spapr-vty,chardev=charserial0,reg=0x30000000
  -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
  -msg timestamp=on

This happens because NUMA node 0 in the guest is configured
to use 16 MiB pages, whereas the additional memory DIMM is
configured to use the default 64 KiB pages.
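
To spot this kind of mix on a generated command line, something like the following Python sketch could list the memory backends and whether each has a mem-path. It is a simplified illustration: the function name is hypothetical, and the parsing ignores QEMU's escaped commas (",,") inside option values.

```python
def memory_backends(argv):
    """Collect memory-backend-* objects from a QEMU argument list."""
    backends = []
    for flag, value in zip(argv, argv[1:]):
        if flag != "-object":
            continue
        kind, _, rest = value.partition(",")
        if not kind.startswith("memory-backend-"):
            continue
        # Simplified key=value parsing; does not handle ",," escapes.
        props = dict(item.split("=", 1) for item in rest.split(",") if "=" in item)
        backends.append(
            {"type": kind, "id": props.get("id"), "mem_path": props.get("mem-path")}
        )
    return backends


# The two -object options from the command line above.
argv = [
    "-object",
    "memory-backend-file,id=ram-node0,prealloc=yes,"
    "mem-path=/dev/hugepages/libvirt/qemu,size=2147483648",
    "-numa", "node,nodeid=0,cpus=0,memdev=ram-node0",
    "-object", "memory-backend-ram,id=memdimm0,size=1073741824",
]

for backend in memory_backends(argv):
    print(backend)
```

A backend with no mem-path is not placed on a hugetlbfs mount, so its presence next to a hugetlbfs-backed one indicates mixed page sizes.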
Comment 16 Andrea Bolognani 2015-11-06 12:17:50 EST
Reading your question a second time, I realize I did
probably misunderstand it the first time around: you were
asking whether it's possible to end up in a situation where
there are two memory-backend-file instances, each backed
by huge pages of different sizes, right?

The answer is once again yes: the following XML

  <domain type='kvm'>
    <name>centos7</name>
    <uuid>f1b5f1d6-b116-41a9-9a54-52aebda71bf4</uuid>
    <maxMemory slots='2' unit='KiB'>4194304</maxMemory>
    <memory unit='KiB'>2097152</memory>
    <currentMemory unit='KiB'>1048576</currentMemory>
    <memoryBacking>
      <hugepages>
        <page size='1048576' unit='KiB' nodeset='0'/>
        <page size='2048' unit='KiB' nodeset='1'/>
      </hugepages>
    </memoryBacking>
    <vcpu placement='static'>2</vcpu>
    <os>
      <type arch='x86_64' machine='pc-i440fx-2.3'>hvm</type>
      <boot dev='hd'/>
    </os>
    <features>
      <acpi/>
      <apic/>
      <pae/>
    </features>
    <cpu mode='host-model'>
      <model fallback='allow'/>
      <numa>
        <cell id='0' cpus='0' memory='1048576' unit='KiB'/>
        <cell id='1' cpus='1' memory='1048576' unit='KiB'/>
      </numa>
    </cpu>
    <clock offset='utc'>
      <timer name='rtc' tickpolicy='catchup'/>
      <timer name='pit' tickpolicy='delay'/>
      <timer name='hpet' present='no'/>
    </clock>
    <on_poweroff>destroy</on_poweroff>
    <on_reboot>restart</on_reboot>
    <on_crash>destroy</on_crash>
    <pm>
      <suspend-to-mem enabled='no'/>
      <suspend-to-disk enabled='no'/>
    </pm>
    <devices>
      <emulator>/usr/bin/qemu-kvm</emulator>
      <disk type='file' device='disk'>
        <driver name='qemu' type='qcow2'/>
        <source file='/var/lib/libvirt/images/centos7.qcow2'/>
        <target dev='vda' bus='virtio'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
      </disk>
      <disk type='block' device='cdrom'>
        <driver name='qemu' type='raw'/>
        <target dev='hda' bus='ide'/>
        <readonly/>
        <address type='drive' controller='0' bus='0' target='0' unit='0'/>
      </disk>
      <controller type='usb' index='0' model='ich9-ehci1'>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x7'/>
      </controller>
      <controller type='usb' index='0' model='ich9-uhci1'>
        <master startport='0'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0' multifunction='on'/>
      </controller>
      <controller type='usb' index='0' model='ich9-uhci2'>
        <master startport='2'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x1'/>
      </controller>
      <controller type='usb' index='0' model='ich9-uhci3'>
        <master startport='4'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x2'/>
      </controller>
      <controller type='pci' index='0' model='pci-root'/>
      <controller type='ide' index='0'>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
      </controller>
      <controller type='virtio-serial' index='0'>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
      </controller>
      <interface type='network'>
        <mac address='52:54:00:3e:b1:3a'/>
        <source network='default'/>
        <model type='virtio'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
      </interface>
      <serial type='pty'>
        <target port='0'/>
      </serial>
      <console type='pty'>
        <target type='serial' port='0'/>
      </console>
      <channel type='unix'>
        <source mode='bind' path='/var/lib/libvirt/qemu/channel/target/centos7.org.qemu.guest_agent.0'/>
        <target type='virtio' name='org.qemu.guest_agent.0'/>
        <address type='virtio-serial' controller='0' bus='0' port='1'/>
      </channel>
      <channel type='spicevmc'>
        <target type='virtio' name='com.redhat.spice.0'/>
        <address type='virtio-serial' controller='0' bus='0' port='2'/>
      </channel>
      <input type='tablet' bus='usb'/>
      <input type='mouse' bus='ps2'/>
      <input type='keyboard' bus='ps2'/>
      <graphics type='spice' autoport='yes'/>
      <sound model='ich6'>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
      </sound>
      <video>
        <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
      </video>
      <redirdev bus='usb' type='spicevmc'>
      </redirdev>
      <redirdev bus='usb' type='spicevmc'>
      </redirdev>
      <memballoon model='virtio'>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
      </memballoon>
      <panic/>
    </devices>
  </domain>

will be converted by libvirt to the following qemu
command

  /usr/bin/qemu-system-x86_64
  -machine accel=kvm
  -name centos7
  -S
  -machine pc-i440fx-2.3,accel=kvm,usb=off
  -cpu Haswell-noTSX,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
  -m size=2097152k,slots=2,maxmem=4194304k
  -realtime mlock=off
  -smp 2,sockets=2,cores=1,threads=1
  -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/1G/libvirt/qemu,size=1073741824
  -numa node,nodeid=0,cpus=0,memdev=ram-node0
  -object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/dev/hugepages/2M/libvirt/qemu,size=1073741824
  -numa node,nodeid=1,cpus=1,memdev=ram-node1
  -uuid f1b5f1d6-b116-41a9-9a54-52aebda71bf4
  -no-user-config
  -nodefaults
  -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/centos7.monitor,server,nowait
  -mon chardev=charmonitor,id=monitor,mode=control
  -rtc base=utc,driftfix=slew
  -global kvm-pit.lost_tick_policy=discard
  -no-hpet
  -no-shutdown
  -global PIIX4_PM.disable_s3=1
  -global PIIX4_PM.disable_s4=1
  -boot strict=on
  -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x6.0x7
  -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x6
  -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x6.0x1
  -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x6.0x2
  -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5
  -drive file=/var/lib/libvirt/images/centos7.qcow2,if=none,id=drive-virtio-disk0,format=qcow2
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
  -drive if=none,id=drive-ide0-0-0,readonly=on,format=raw
  -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
  -netdev tap,fd=23,id=hostnet0,vhost=on,vhostfd=24
  -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:3e:b1:3a,bus=pci.0,addr=0x3
  -chardev pty,id=charserial0
  -device isa-serial,chardev=charserial0,id=serial0
  -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/centos7.org.qemu.guest_agent.0,server,nowait
  -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
  -chardev spicevmc,id=charchannel1,name=vdagent
  -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0
  -device usb-tablet,id=input0
  -spice port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on
  -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16,bus=pci.0,addr=0x2
  -device intel-hda,id=sound0,bus=pci.0,addr=0x4
  -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0
  -chardev spicevmc,id=charredir0,name=usbredir
  -device usb-redir,chardev=charredir0,id=redir0
  -chardev spicevmc,id=charredir1,name=usbredir
  -device usb-redir,chardev=charredir1,id=redir1
  -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8
  -device pvpanic
  -msg timestamp=on

As you can see one of the two memory-backend-file is backed
by 2 MiB hugepages and the other one by 1 GiB hugepages.

AIUI this will never happen on ppc64 though, because the
only supported hugepage sizes are 4 KiB (default) and
16 MiB; libvirt always uses memory-backend-ram when
hugepages are requested but their size is the same as the
system default page size.
Comment 17 David Gibson 2015-11-08 20:04:04 EST
Andrea,

Thanks for the information.  In fact the first answer was more what I'm after.

So for the ppc guest kernel, the set of available hugepage sizes is a (more or less) global parameter.  It gets the available sizes from the device tree, and the way they're encoded has no way to specify per-region sizes.

So, when there are multiple backing regions, qemu needs to advertise only pagesizes which are available on all of them (which in practice is those sizes <= the minimum of the largest pagesize for each region).
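
The rule in the previous paragraph can be sketched in Python as follows. This is a minimal illustration, not QEMU code; the function name and the candidate page-size list are assumptions.

```python
KIB = 1024
MIB = 1024 * KIB


def advertisable_page_sizes(candidate_sizes, region_max_page_sizes):
    """Keep only page sizes that every backing region can provide.

    region_max_page_sizes holds the largest page size of each region, so a
    size is usable everywhere only if it is <= the minimum of those values.
    """
    limit = min(region_max_page_sizes)
    return [size for size in candidate_sizes if size <= limit]


candidates = [4 * KIB, 64 * KIB, 16 * MIB]   # hypothetical candidate sizes

# Every region is backed by 16M hugepages: all candidates survive.
print(advertisable_page_sizes(candidates, [16 * MIB, 16 * MIB]))

# One region only has 64K pages: 16M must not be advertised.
print(advertisable_page_sizes(candidates, [16 * MIB, 64 * KIB]))
```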

Prior to the patch for this BZ, qemu considered only the global backing memory, not any explicit memory-backend-file instances (the pagesize advertising code predates memory-backend-file).  So, if any memory backends had a pagesize smaller than the global -mem-path, qemu would advertise the wrong thing.

With the patch for this bug, that case is fixed - qemu advertises only pagesizes <= the minimum of all memory backends + the global setting.

The remaining problem is that it still always considers the global setting, even if all memory is actually covered by specific backends.  So, if all memory is assigned to a memory-backend-file with 16M pages, but the global page setting is still 4k (e.g. no -mem-path option at all), then only 4k pages will be allowed in the guest.

So, I guess the question is, what in libvirt controls the global -mem-path option?  Can we get a situation where all guest memory is attached to hugepage backends, but the global -mem-path option is _not_ hugepage?
Comment 18 Andrea Bolognani 2015-11-09 06:02:02 EST
There is no single bit in libvirt that controls the global
-mem-path option specifically - whether that option or
memory-backend-* is used depends on whether a NUMA topology
is configured for the guest.

The following, pretty basic guest definition

  <domain type='kvm'>
    <name>centos7</name>
    <uuid>f1b5f1d6-b116-41a9-9a54-52aebda71bf4</uuid>
    <memory unit='KiB'>1048576</memory>
    <currentMemory unit='KiB'>1048576</currentMemory>
    <memoryBacking>
      <hugepages>
        <page size='2048' unit='KiB' nodeset='0'/>
      </hugepages>
    </memoryBacking>
    <vcpu placement='static'>2</vcpu>
    <os>
      <type arch='x86_64' machine='pc-i440fx-2.3'>hvm</type>
      <boot dev='hd'/>
    </os>
    <features>
      <acpi/>
      <apic/>
      <pae/>
    </features>
    <cpu mode='host-model'>
      <model fallback='allow'/>
    </cpu>
    <clock offset='utc'>
      <timer name='rtc' tickpolicy='catchup'/>
      <timer name='pit' tickpolicy='delay'/>
      <timer name='hpet' present='no'/>
    </clock>
    <on_poweroff>destroy</on_poweroff>
    <on_reboot>restart</on_reboot>
    <on_crash>destroy</on_crash>
    <pm>
      <suspend-to-mem enabled='no'/>
      <suspend-to-disk enabled='no'/>
    </pm>
    <devices>
      <emulator>/usr/bin/qemu-kvm</emulator>
      <disk type='file' device='disk'>
        <driver name='qemu' type='qcow2'/>
        <source file='/var/lib/libvirt/images/centos7.qcow2'/>
        <target dev='vda' bus='virtio'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
      </disk>
      <disk type='block' device='cdrom'>
        <driver name='qemu' type='raw'/>
        <target dev='hda' bus='ide'/>
        <readonly/>
        <address type='drive' controller='0' bus='0' target='0' unit='0'/>
      </disk>
      <controller type='usb' index='0' model='ich9-ehci1'>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x7'/>
      </controller>
      <controller type='usb' index='0' model='ich9-uhci1'>
        <master startport='0'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0' multifunction='on'/>
      </controller>
      <controller type='usb' index='0' model='ich9-uhci2'>
        <master startport='2'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x1'/>
      </controller>
      <controller type='usb' index='0' model='ich9-uhci3'>
        <master startport='4'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x2'/>
      </controller>
      <controller type='pci' index='0' model='pci-root'/>
      <controller type='ide' index='0'>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
      </controller>
      <controller type='virtio-serial' index='0'>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
      </controller>
      <interface type='network'>
        <mac address='52:54:00:3e:b1:3a'/>
        <source network='default'/>
        <model type='virtio'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
      </interface>
      <serial type='pty'>
        <target port='0'/>
      </serial>
      <console type='pty'>
        <target type='serial' port='0'/>
      </console>
      <channel type='unix'>
        <source mode='bind' path='/var/lib/libvirt/qemu/channel/target/centos7.org.qemu.guest_agent.0'/>
        <target type='virtio' name='org.qemu.guest_agent.0'/>
        <address type='virtio-serial' controller='0' bus='0' port='1'/>
      </channel>
      <channel type='spicevmc'>
        <target type='virtio' name='com.redhat.spice.0'/>
        <address type='virtio-serial' controller='0' bus='0' port='2'/>
      </channel>
      <input type='tablet' bus='usb'/>
      <input type='mouse' bus='ps2'/>
      <input type='keyboard' bus='ps2'/>
      <graphics type='spice' autoport='yes'/>
      <sound model='ich6'>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
      </sound>
      <video>
        <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
      </video>
      <redirdev bus='usb' type='spicevmc'>
      </redirdev>
      <redirdev bus='usb' type='spicevmc'>
      </redirdev>
      <memballoon model='virtio'>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
      </memballoon>
      <panic/>
    </devices>
  </domain>

results in a qemu command line that uses -mem-path because
the guest NUMA topology is not specified:

  /usr/bin/qemu-system-x86_64
  -machine accel=kvm
  -name centos7
  -S
  -machine pc-i440fx-2.3,accel=kvm,usb=off
  -cpu Haswell-noTSX,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
  -m 1024
  -mem-prealloc
  -mem-path /dev/hugepages/libvirt/qemu
  -realtime mlock=off
  -smp 2,sockets=2,cores=1,threads=1
  -uuid f1b5f1d6-b116-41a9-9a54-52aebda71bf4
  -no-user-config
  -nodefaults
  -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/centos7.monitor,server,nowait
  -mon chardev=charmonitor,id=monitor,mode=control
  -rtc base=utc,driftfix=slew
  -global kvm-pit.lost_tick_policy=discard
  -no-hpet
  -no-shutdown
  -global PIIX4_PM.disable_s3=1
  -global PIIX4_PM.disable_s4=1
  -boot strict=on
  -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x6.0x7
  -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x6
  -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x6.0x1
  -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x6.0x2
  -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5
  -drive file=/var/lib/libvirt/images/centos7.qcow2,if=none,id=drive-virtio-disk0,format=qcow2
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
  -drive if=none,id=drive-ide0-0-0,readonly=on,format=raw
  -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
  -netdev tap,fd=23,id=hostnet0,vhost=on,vhostfd=24
  -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:3e:b1:3a,bus=pci.0,addr=0x3
  -chardev pty,id=charserial0
  -device isa-serial,chardev=charserial0,id=serial0
  -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/centos7.org.qemu.guest_agent.0,server,nowait
  -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
  -chardev spicevmc,id=charchannel1,name=vdagent
  -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0
  -device usb-tablet,id=input0
  -spice port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on
  -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16,bus=pci.0,addr=0x2
  -device intel-hda,id=sound0,bus=pci.0,addr=0x4
  -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0
  -chardev spicevmc,id=charredir0,name=usbredir
  -device usb-redir,chardev=charredir0,id=redir0
  -chardev spicevmc,id=charredir1,name=usbredir
  -device usb-redir,chardev=charredir1,id=redir1
  -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8
  -device pvpanic
  -msg timestamp=on

On the other hand, as soon as the /domain/cpu element is
changed to include NUMA topology information, like

  <cpu mode='host-model'>
    <model fallback='allow'/>
    <numa>
      <cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
    </numa>
  </cpu>

the command line becomes

  /usr/bin/qemu-system-x86_64
  -machine accel=kvm
  -name centos7
  -S
  -machine pc-i440fx-2.3,accel=kvm,usb=off
  -cpu Haswell-noTSX,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
  -m 1024
  -realtime mlock=off
  -smp 2,sockets=2,cores=1,threads=1
  -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu,size=1073741824
  -numa node,nodeid=0,cpus=0-1,memdev=ram-node0
  -uuid f1b5f1d6-b116-41a9-9a54-52aebda71bf4
  -no-user-config
  -nodefaults
  -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/centos7.monitor,server,nowait
  -mon chardev=charmonitor,id=monitor,mode=control
  -rtc base=utc,driftfix=slew
  -global kvm-pit.lost_tick_policy=discard
  -no-hpet
  -no-shutdown
  -global PIIX4_PM.disable_s3=1
  -global PIIX4_PM.disable_s4=1
  -boot strict=on
  -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x6.0x7
  -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x6
  -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x6.0x1
  -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x6.0x2
  -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5
  -drive file=/var/lib/libvirt/images/centos7.qcow2,if=none,id=drive-virtio-disk0,format=qcow2
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
  -drive if=none,id=drive-ide0-0-0,readonly=on,format=raw
  -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
  -netdev tap,fd=23,id=hostnet0,vhost=on,vhostfd=24
  -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:3e:b1:3a,bus=pci.0,addr=0x3
  -chardev pty,id=charserial0
  -device isa-serial,chardev=charserial0,id=serial0
  -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/centos7.org.qemu.guest_agent.0,server,nowait
  -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
  -chardev spicevmc,id=charchannel1,name=vdagent
  -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0
  -device usb-tablet,id=input0
  -spice port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on
  -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16,bus=pci.0,addr=0x2
  -device intel-hda,id=sound0,bus=pci.0,addr=0x4
  -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0
  -chardev spicevmc,id=charredir0,name=usbredir
  -device usb-redir,chardev=charredir0,id=redir0
  -chardev spicevmc,id=charredir1,name=usbredir
  -device usb-redir,chardev=charredir1,id=redir1
  -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8
  -device pvpanic
  -msg timestamp=on

with all the configuration moved to memory-backend-file and
no -mem-path option at all.

So using the former configuration (pretending it was ppc64,
I was testing on x86 to get three page sizes instead of two)
the guest would be able to use 2 MiB hugepages while in the
latter it could only use 4 KiB pages, is that correct?
Comment 20 David Gibson 2015-11-09 22:23:10 EST
Andrea,

Thanks again for the clarifications.  I re-read the patch posted for this bug, and realised I was mistaken about what was missing in it.  I thought it was incorrectly considering the global mem-path even when all memory was using explicit backends.  Instead, it is ignoring the global mem-path whenever any explicit backends are in use.

It looks like it should correctly handle both the cases you describe above.  The case it might not handle properly is if there is some RAM backed by an explicit backend, *and* some RAM backed by the global mem-path (i.e. not attached to an explicit backend).  In that case, the guest will probably hang (because it will try to use a bad page size) if the global mem-path has a smaller pagesize than the minimum pagesize of all the explicit backends.

Can that situation arise?
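
The hazard described above can be sketched as a small calculation. This is only an illustration, not the code from the patch: it assumes the rule is that the guest must not be offered a page size larger than the smallest host page size backing any part of its RAM, whether that RAM comes from an explicit backend or from the global mem-path. The helper name min_page_kib and the example sizes are hypothetical.

```shell
# Hypothetical helper: print the smallest of the page sizes (in KiB)
# passed as arguments. Under the assumption stated above, this is the
# largest page size that is safe to advertise to the guest.
min_page_kib() {
    min=""
    for size in "$@"; do
        if [ -z "$min" ] || [ "$size" -lt "$min" ]; then
            min=$size
        fi
    done
    printf '%s\n' "$min"
}

# Illustrative values: one explicit backend on 16 MiB hugepages and
# some RAM backed by a global mem-path with 64 KiB pages.
min_page_kib 16384 64
```

If the calculation only looked at the explicit backends, it would arrive at 16384 here, which is the mixed case the comment warns about.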
Comment 21 Miroslav Rezanina 2015-11-18 05:06:52 EST
Fix included in qemu-kvm-rhev-2.3.0-31.el7_2.2
Comment 23 Andrea Bolognani 2015-11-19 05:30:03 EST
(In reply to David Gibson from comment #20)
> Andrea,
> 
> Thanks again for the clarifications.  I re-read the patch posted for this
> bug, and realised I was mistaken about what was missing in it.  I thought
> it was incorrectly considering the global mem-path even when all memory was
> using explicit backends.  Instead, it is ignoring the global mem-path
> whenever any explicit backends are in use.
> 
> It looks like it should correctly handle both the cases you describe above. 
> The case it might not handle properly is if there is some RAM backed by an
> explicit backend, *and* some RAM backed by the global mem-path (i.e. not
> attached to an explicit backend).  In that case, the guest will probably
> hang (because it will try to use a bad page size) if the global mem-path has
> a smaller pagesize than the minimum pagesize of all the explicit backends.
> 
> Can that situation arise?

No. libvirt will use either -object memory-backend-* or -mem-path
based on whether NUMA nodes need to be configured and the
availability of memory-backend-*, but it will not mix the two
approaches so we should be safe.
Comment 24 Mike McCune 2016-03-28 18:27:32 EDT
This bug was accidentally moved from POST to MODIFIED via an error in automation. Please contact mmccune@redhat.com with any questions.
Comment 26 Xu Han 2016-06-03 02:02:34 EDT
Reproduced this issue with qemu-kvm-rhev-2.3.0-31.el7.

Steps:
1. prepare mem file for guest on host:
(host)# mount -t hugetlbfs hugetlbfs /dev/hugepages -o pagesize=16M
(host)# echo 256 > /proc/sys/vm/nr_hugepages

2. boot guest with following command line:
(host)# /usr/libexec/qemu-kvm ... \
            -m 4096 \
            -object memory-backend-file,host-nodes=0,policy=interleave,id=memdev0,size=4096M,prealloc=yes,mem-path=/dev/hugepages \
            -numa node,memdev=memdev0,nodeid=0

3. check whether hugepage can be used on guest:
(guest)# cat /proc/meminfo |grep -i HugePages
(guest)# mount -t hugetlbfs hugetlbfs /mnt -o pagesize=16M
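
The value written to nr_hugepages in step 1 lines up with the size of the memory backend in step 2. A minimal sketch of that arithmetic follows; the helper name pages_needed is hypothetical, and it assumes the whole backend must be covered by huge pages of a single size, as is presumably the case with prealloc=yes.

```shell
# Hypothetical helper: number of huge pages needed to back SIZE_MIB MiB
# of guest RAM with pages of PAGE_MIB MiB each, rounded up.
pages_needed() {
    size_mib=$1
    page_mib=$2
    echo $(( (size_mib + page_mib - 1) / page_mib ))
}

# 4096M backend on 16M pages, as in steps 1 and 2
pages_needed 4096 16
```

For the command line above this gives 256, the value echoed into /proc/sys/vm/nr_hugepages on the host.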


Result: hugepage can not be used on guest.

(guest)# cat /proc/meminfo |grep -i HugePages
AnonHugePages:         0 kB

(guest)# mount -t hugetlbfs hugetlbfs /mnt -o pagesize=16M
mount: unknown filesystem type 'hugetlbfs'

---------------------8<----------------------

Verified this with qemu-kvm-rhev-2.6.0-4.el7.

Result: hugepage can be used on guest.

(guest)# cat /proc/meminfo |grep -i HugePages
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:      16384 kB

(guest)# mount -t hugetlbfs hugetlbfs /mnt -o pagesize=16M
(guest)# echo 128 > /proc/sys/vm/nr_hugepages
(guest)# cat /proc/meminfo |grep -i HugePages
AnonHugePages:         0 kB
HugePages_Total:     121
HugePages_Free:      121
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:      16384 kB

(guest)# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 4096 MB
node 0 free: 273 MB
node distances:
node   0 
  0:  10 
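
The /proc/meminfo check used in both runs above could be scripted roughly as follows. This is a sketch, not part of the original test procedure: the function name hugepage_size_kb is hypothetical, and it assumes that a missing Hugepagesize line (as in the failing run) means the guest kernel has no usable huge page size.

```shell
# Hypothetical helper: print the default huge page size in kB from a
# meminfo-style file (default: /proc/meminfo). Returns non-zero if no
# Hugepagesize line is present, as in the failing run above.
hugepage_size_kb() {
    meminfo=${1:-/proc/meminfo}
    awk '$1 == "Hugepagesize:" { print $2; found = 1 }
         END { exit !found }' "$meminfo"
}
```

Run inside the guest without arguments, this would be expected to print 16384 for the verified build and to fail for the build that reproduces the bug.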


So based on the above test results, this bug has been fixed.
Comment 28 errata-xmlrpc 2016-11-07 15:42:36 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html
