Bug 1786303 - RFE: Support for configuring NUMA HMAT table information for guest
Summary: RFE: Support for configuring NUMA HMAT table information for guest
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 8.3
Assignee: Michal Privoznik
QA Contact: Jing Qi
URL:
Whiteboard:
Depends On: 1847230
Blocks: 1745059
 
Reported: 2019-12-24 10:03 UTC by Daniel Berrangé
Modified: 2020-11-30 07:33 UTC
CC List: 11 users

Fixed In Version: libvirt-6.6.0-4.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-17 17:46:36 UTC
Type: Feature Request
Target Upstream Version:
Embargoed:



Description Daniel Berrangé 2019-12-24 10:03:35 UTC
Description of problem:
QEMU 5.0 is gaining support for configuring NUMA HMAT tables:

   https://lists.gnu.org/archive/html/qemu-devel/2019-12/msg02583.html

This allows configuring distance (latency/bandwidth) information between NUMA resources exposed to the guest. Unlike the older SLIT tables, HMAT allows for concepts such as memory-only NUMA nodes.

Further background can be found in the Linux patches, which were initially merged in Linux 5.1:

  https://lore.kernel.org/patchwork/cover/862903/


Some info from QEMU patches:

The following example creates a machine with 2 NUMA nodes: node 0 has CPUs,
node 1 has only memory, and its initiator is node 0. Note that because
node 0 has CPUs, its initiator is itself by default and must be itself.

-machine hmat=on \
-m 2G,slots=2,maxmem=4G \
-object memory-backend-ram,size=1G,id=m0 \
-object memory-backend-ram,size=1G,id=m1 \
-numa node,nodeid=0,memdev=m0 \
-numa node,nodeid=1,memdev=m1,initiator=0 \
-smp 2,sockets=2,maxcpus=2  \
-numa cpu,node-id=0,socket-id=0 \
-numa cpu,node-id=0,socket-id=1


For example, the following options describe 2 NUMA nodes. Node 0 has 2 CPUs and
RAM, node 1 has only RAM. The processors in node 0 access memory in node 0
with an access latency of 5 nanoseconds and an access bandwidth of 200 MB/s;
the processors in NUMA node 0 access memory in NUMA node 1 with an access latency of
10 nanoseconds and an access bandwidth of 100 MB/s.

-machine hmat=on \
-m 2G \
-object memory-backend-ram,size=1G,id=m0 \
-object memory-backend-ram,size=1G,id=m1 \
-smp 2 \
-numa node,nodeid=0,memdev=m0 \
-numa node,nodeid=1,memdev=m1,initiator=0 \
-numa cpu,node-id=0,socket-id=0 \
-numa cpu,node-id=0,socket-id=1 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M


There are further options for cache size / locality information
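
For illustration, a memory-side cache for node 0 is described with an hmat-cache entry; the size/level/line values below are placeholders, but the option syntax matches the command lines that appear further down in this bug:

-numa hmat-cache,node-id=0,size=8K,level=1,associativity=direct,policy=write-back,line=5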


Configuring this level of detail for guests will likely become important when doing strict pinning of guests to hosts, in order to maximise performance of the NUMA topology.

Version-Release number of selected component (if applicable):
libvirt-5.10.0-1

Comment 2 Michal Privoznik 2020-01-09 16:18:39 UTC
I've started an XML design discussion here:

https://www.redhat.com/archives/libvir-list/2020-January/msg00422.html
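
For reference, a rough sketch of the XML shape that was eventually merged (comment 9 below has a complete example): <cell> gains an optional <cache> child, and a new <interconnects> element under <numa> holds <latency> and <bandwidth> entries, e.g.:

  <numa>
    <cell id='0' cpus='0-3' memory='512000' unit='KiB'>
      <cache level='1' associativity='direct' policy='writeback'>
        <size value='8' unit='KiB'/>
        <line value='5' unit='B'/>
      </cache>
    </cell>
    <cell id='1' memory='512000' unit='KiB'/>
    <interconnects>
      <latency initiator='0' target='0' type='access' value='5'/>
      <bandwidth initiator='0' target='0' type='access' value='204800' unit='KiB'/>
    </interconnects>
  </numa>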

Comment 3 Eduardo Habkost 2020-03-30 19:30:43 UTC
Setting ITR and Target Release to 8.3.0, because this is a dependency of bug 1745059 (which is an 8.3 candidate).

Comment 5 Michal Privoznik 2020-07-08 10:08:38 UTC
Merged upstream:

30dd4aed3c news: Document HMAT addition
aeecbc87b7 qemu: Build HMAT command line
c2f15f1b18 qemu: Introduce QEMU_CAPS_NUMA_HMAT capability
11d8ca9794 numa: expose HMAT APIs
f0611fe883 conf: Validate NUMA HMAT configuration
a89bbbac86 conf: Parse and format HMAT
a26f61ee0c Allow NUMA nodes without vCPUs
1050c6beb1 numa_conf: Make virDomainNumaSetNodeCpumask() return void
fe43b3a5a5 qemuBuildMachineCommandLine: Drop needless check
8ba1792785 qemu_command: Rename qemuBuildNumaArgStr()
68c5b0183c numa_conf: Drop CPU from name of two functions
04bd77a19f conf: Move and rename virDomainParseScaledValue()
afb1ea6776 qemuxml2xmltest: Add "numatune-distance" test case
e95da4e5bf qemuBuildMemoryBackendProps: Use boolean type for 'pmem' property


v6.5.0-58-g30dd4aed3c

Comment 9 Jing Qi 2020-08-11 10:19:12 UTC
I tried to verify the new feature and found some issues that need to be confirmed.

1. Add two caches under one cell and, in <interconnects>, reference the second cache from <latency>; it reports "non-existent NUMA node cache". According to the documentation on libvirt.org, two caches can be added:
Since 6.6.0 the cell element can have a cache child element which describes memory side cache for memory proximity domains. The cache element has a level attribute describing the cache level and thus the element can be repeated multiple times to describe different levels of the cache.

<cpu mode='host-model' check='partial'>
    <numa>
      <cell id='0' cpus='0-5' memory='512000' unit='KiB' discard='yes'>
        <distances>
          <sibling id='0' value='10'/>
          <sibling id='1' value='21'/>
        </distances>
        <cache level='3' associativity='direct' policy='writeback'>
          <size value='10' unit='KiB'/>
          <line value='8' unit='B'/>
        </cache>
        <cache level='1' associativity='direct' policy='writeback'>
          <size value='8' unit='KiB'/>
          <line value='5' unit='B'/>
        </cache>
      </cell>
      <cell id='1' memory='512000' unit='KiB'>
        <distances>
          <sibling id='0' value='21'/>
          <sibling id='1' value='10'/>
        </distances>
      </cell>
      <interconnects>
        <latency initiator='0' target='0' cache='1' type='access' value='5'/>
        <bandwidth initiator='0' target='0' type='access' value='204800' unit='KiB'/>
      </interconnects>
    </numa>
  </cpu>

virsh edit avocado-vt-vm
error: XML error: 'cache' refers to a non-existent NUMA node cache
Failed. Try again? [y,n,i,f,?]:

2. Remove cache='1' from <latency> and the domain can be started; the qemu command line is as below:
 
-numa hmat-cache,node-id=0,size=8K,level=1,associativity=direct,policy=write-back,line=5 

     <interconnects>
        <latency initiator='0' target='0'  type='access' value='5'/>
        <bandwidth initiator='0' target='0' type='access' value='204800' unit='KiB'/>
      </interconnects>

-numa hmat-cache,node-id=0,size=8K,level=1,associativity=direct,policy=write-back,line=5 

Can you please help confirm whether the two situations are as expected?

Comment 10 Michal Privoznik 2020-08-18 10:56:50 UTC
(In reply to Jing Qi from comment #9)
> I tried to verify the new feature and found some issues that need to be
> confirmed.
> 
> 1. Add two caches under one cell, and in connectors,  use the second cache
> in the latency, it reports "non-existent NUMA node cache". Two caches can be
> added from the libvirt.org -
> Since 6.6.0 the cell element can have a cache child element which describes
> memory side cache for memory proximity domains. The cache element has a
> level attribute describing the cache level and thus the element can be
> repeated multiple times to describe different levels of the cache.
> 
> <cpu mode='host-model' check='partial'>
>     <numa>
>       <cell id='0' cpus='0-5' memory='512000' unit='KiB' discard='yes'>
>         <distances>
>           <sibling id='0' value='10'/>
>           <sibling id='1' value='21'/>
>         </distances>
>         <cache level='3' associativity='direct' policy='writeback'>
>           <size value='10' unit='KiB'/>
>           <line value='8' unit='B'/>
>         </cache>
>         <cache level='1' associativity='direct' policy='writeback'>
>           <size value='8' unit='KiB'/>
>           <line value='5' unit='B'/>
>         </cache>
>       </cell>
>       <cell id='1' memory='512000' unit='KiB'>
>         <distances>
>           <sibling id='0' value='21'/>
>           <sibling id='1' value='10'/>
>         </distances>
>       </cell>
>       <interconnects>
>         <latency initiator='0' target='0' cache='1' type='access' value='5'/>
>         <bandwidth initiator='0' target='0' type='access' value='204800'
> unit='KiB'/>
>       </interconnects>
>     </numa>
>   </cpu>
> 
> virsh edit avocado-vt-vm
> error: XML error: 'cache' refers to a non-existent NUMA node cache
> Failed. Try again? [y,n,i,f,?]:

Ah, this is a bug. Patch posted here:

https://www.redhat.com/archives/libvir-list/2020-August/msg00536.html

> 
> 2. Remove the "cache='1'" from latency, the domain can be started and the
> qemu-cmdline is as below-
>  
> -numa
> hmat-cache,node-id=0,size=8K,level=1,associativity=direct,policy=write-back,
> line=5 
> 
>      <interconnects>
>         <latency initiator='0' target='0'  type='access' value='5'/>
>         <bandwidth initiator='0' target='0' type='access' value='204800'
> unit='KiB'/>
>       </interconnects>
> 
> -numa
> hmat-cache,node-id=0,size=8K,level=1,associativity=direct,policy=write-back,
> line=5 
> 
> Can you please help to confirm if the two situations are as expected ?

This looks as expected, doesn't it? Do you think there is something wrong with the generated command line?

PS sorry for the delay, was on longer PTO.

Comment 11 Jing Qi 2020-08-19 01:28:03 UTC
For the second scenario, there is no issue with the qemu command line. But if no cache='X' is specified, does it need to select one as the default: the smallest one, the first one, or the last one?

There is an issue with one more scenario; please help to confirm whether it is expected:

1. Edit the xml -
 <cpu mode='custom' match='exact' check='none'>
    <model fallback='forbid'>qemu64</model>
    <numa>
      <cell id='0' cpus='0-5' memory='512000' unit='KiB' discard='yes'>
        <distances>
          <sibling id='0' value='10'/>
          <sibling id='1' value='21'/>
        </distances>
        <cache level='1' associativity='direct' policy='writeback'>
          <size value='8' unit='KiB'/>
          <line value='5' unit='B'/>
        </cache>
        <cache level='2' associativity='direct' policy='writeback'>
          <size value='9' unit='KiB'/>
          <line value='6' unit='B'/>
        </cache>
        <cache level='3' associativity='direct' policy='writeback'>
          <size value='10' unit='KiB'/>
          <line value='8' unit='B'/>
        </cache>
      </cell>
      <cell id='1' memory='512000' unit='KiB'>
        <distances>
          <sibling id='0' value='21'/>
          <sibling id='1' value='10'/>
        </distances>
      </cell>
    <interconnects>
        <latency initiator='0' target='0' type='access' value='5'/>
        <bandwidth initiator='0' target='0' type='access' value='204800' unit='KiB'/>
      </interconnects>
    </numa>
  </cpu>

2. # virsh start vm2
error: Failed to start domain vm2
error: internal error: qemu unexpectedly closed the monitor: 2020-08-19T01:23:10.928169Z qemu-kvm: -numa hmat-cache,node-id=0,size=9K,level=2,associativity=direct,policy=write-back,line=6: Invalid size=9216, the size of level=2 should be less than the size(8192) of level=1

Comment 12 Jing Qi 2020-08-19 02:23:55 UTC
I am not sure which cache level's size value should be the smallest. It seems the deeper the level, the smaller the size should be? But both of the configurations below work:
 
1)<cache level='1' associativity='direct' policy='writeback'>
          <size value='8' unit='KiB'/>
          <line value='5' unit='B'/>
        </cache>
        <cache level='2' associativity='direct' policy='writeback'>
          <size value='7' unit='KiB'/>
          <line value='6' unit='B'/>
        </cache>
        <cache level='3' associativity='direct' policy='writeback'>
          <size value='6' unit='KiB'/>
          <line value='8' unit='B'/>
        </cache>
      
2) <cache level='1' associativity='direct' policy='writeback'>
          <size value='8' unit='KiB'/>
          <line value='5' unit='B'/>
        </cache>
        <cache level='3' associativity='direct' policy='writeback'>
          <size value='10' unit='KiB'/>
          <line value='8' unit='B'/>
        </cache>

Comment 13 Michal Privoznik 2020-08-19 09:20:19 UTC
Problem 1) from comment 9 is fixed by the patch from comment 10, which is now merged:

e41ac71fca numa_conf: Properly check for caches in virDomainNumaDefValidate()

v6.6.0-538-ge41ac71fca

Comment 14 Michal Privoznik 2020-08-19 09:50:55 UTC
(In reply to Jing Qi from comment #11)

> 2.#virsh start vm2
> error: Failed to start domain vm2
> error: internal error: qemu unexpectedly closed the monitor:
> 2020-08-19T01:23:10.928169Z qemu-kvm: -numa
> hmat-cache,node-id=0,size=9K,level=2,associativity=direct,policy=write-back,
> line=6: Invalid size=9216, the size of level=2 should be less than the
> size(8192) of level=1

I'm not sure it makes sense to replicate these checks in libvirt. Also, this looks very suspicious: out in the wild it is usually L3 > L2 > L1. Why would QEMU want it the other way?

Comment 15 Jing Qi 2020-08-20 02:24:09 UTC
@yuhuang
For the issue in comments 11 & 12, do you think QEMU works as expected, or is it a bug?

Comment 16 Michal Privoznik 2020-08-20 11:35:05 UTC
(In reply to Jing Qi from comment #15)
> @yuhuang
> For the issue in comment 11 & 12, do you think if qemu works as expected? Or
> it's a bug?

I don't know. Igor?

Comment 17 Igor Mammedov 2020-08-21 09:27:35 UTC
(In reply to Jing Qi from comment #15)
> @yuhuang
> For the issue in comment 11 & 12, do you think if qemu works as expected? Or
> it's a bug?

Looks like a bug to me.
I'll post a patch to raise the question;
I hope the original author or someone else from Intel will answer.

Comment 18 Igor Mammedov 2020-08-21 09:43:34 UTC
(In reply to Jing Qi from comment #11)
> For the second scenario, there is no issue with the qemu cmd line. But if
> there is no "cache=X" is specified, does it need to select one as the
> default? the minimal one, the first one, or the last one?
> 
> There is a issue with one more scenario,please help to confirm if it's fixed
> - 
> 
> 1. Edit the xml -
>  <cpu mode='custom' match='exact' check='none'>
>     <model fallback='forbid'>qemu64</model>
>     <numa>
>       <cell id='0' cpus='0-5' memory='512000' unit='KiB' discard='yes'>
>         <distances>
>           <sibling id='0' value='10'/>
>           <sibling id='1' value='21'/>
>         </distances>
>         <cache level='1' associativity='direct' policy='writeback'>
>           <size value='8' unit='KiB'/>
>           <line value='5' unit='B'/>
>         </cache>
>         <cache level='2' associativity='direct' policy='writeback'>
>           <size value='9' unit='KiB'/>
>           <line value='6' unit='B'/>
>         </cache>
>         <cache level='3' associativity='direct' policy='writeback'>
>           <size value='10' unit='KiB'/>
>           <line value='8' unit='B'/>
>         </cache>
>       </cell>
>       <cell id='1' memory='512000' unit='KiB'>
>         <distances>
>           <sibling id='0' value='21'/>
>           <sibling id='1' value='10'/>
>         </distances>
>       </cell>
>     <interconnects>
>         <latency initiator='0' target='0' type='access' value='5'/>
>         <bandwidth initiator='0' target='0' type='access' value='204800'
> unit='KiB'/>
>       </interconnects>
>     </numa>
>   </cpu>
> 
> 2.#virsh start vm2
> error: Failed to start domain vm2
> error: internal error: qemu unexpectedly closed the monitor:
> 2020-08-19T01:23:10.928169Z qemu-kvm: -numa
> hmat-cache,node-id=0,size=9K,level=2,associativity=direct,policy=write-back,
> line=6: Invalid size=9216, the size of level=2 should be less than the
> size(8192) of level=1

Can you provide the full QEMU CLI that generates this error?

Comment 19 Jing Qi 2020-08-21 10:26:35 UTC
Yes, the qemu command line is as below -

/usr/libexec/qemu-kvm \
-name guest=avocado-vt-vm1,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-avocado-vt-vm1/master-key.aes \
-machine pc-q35-rhel8.3.0,accel=kvm,usb=off,dump-guest-core=off,hmat=on \
-cpu Broadwell,vme=on,ss=on,vmx=off,f16c=on,rdrand=on,hypervisor=on,arat=on,tsc-adjust=on,umip=on,arch-capabilities=on,xsaveopt=on,pdpe1gb=on,abm=on,skip-l1dfl-vmentry=on,pschange-mc-no=on,rtm=on,hle=on \
-m 1000 \
-overcommit mem-lock=off \
-smp 6,sockets=6,cores=1,threads=1 \
-object memory-backend-ram,id=ram-node0,size=524288000 \
-numa node,nodeid=0,cpus=0-5,initiator=0,memdev=ram-node0 \
-object memory-backend-ram,id=ram-node1,size=524288000 \
-numa node,nodeid=1,initiator=0,memdev=ram-node1 \
-numa dist,src=0,dst=0,val=10 \
-numa dist,src=0,dst=1,val=21 \
-numa dist,src=1,dst=0,val=21 \
-numa dist,src=1,dst=1,val=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=204800K \
-numa hmat-cache,node-id=0,size=8K,level=1,associativity=direct,policy=write-back,line=5 \
-numa hmat-cache,node-id=0,size=9K,level=2,associativity=direct,policy=write-back,line=6 \
-numa hmat-cache,node-id=0,size=10K,level=3,associativity=direct,policy=write-back,line=8 \
-uuid 03040ca7-2813-4f71-bc80-5943228d6371 \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=35,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-global ICH9-LPC.disable_s3=1 \
-global ICH9-LPC.disable_s4=1 \
-boot strict=on \
-device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 \
-device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 \
-device pcie-root-port,port=0x12,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 \
-device pcie-root-port,port=0x13,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 \
-device pcie-root-port,port=0x14,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 \
-device pcie-root-port,port=0x15,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 \
-device pcie-root-port,port=0x16,chassis=7,id=pci.7,bus=pcie.0,addr=0x2.0x6 \
-device qemu-xhci,p2=15,p3=15,id=usb,bus=pci.2,addr=0x0 \
-device virtio-serial-pci,id=virtio-serial0,bus=pci.3,addr=0x0 \
-blockdev '{"driver":"file","filename":"/var/lib/avocado/data/avocado-vt/images/jeos-27-x86_64.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
-device virtio-blk-pci,bus=pci.4,addr=0x0,drive=libvirt-1-format,id=virtio-disk0,bootindex=1 \
-netdev tap,fd=37,id=hostnet0,vhost=on,vhostfd=38 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:2e:74:ba,bus=pci.1,addr=0x0 \
-chardev pty,id=charserial0 \
-device isa-serial,chardev=charserial0,id=serial0 \
-chardev socket,id=charchannel0,fd=39,server,nowait \
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \
-device usb-tablet,id=input0,bus=usb.0,port=1 \
-vnc 127.0.0.1:0 \
-device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pcie.0,addr=0x1 \
-device virtio-balloon-pci,id=balloon0,bus=pci.5,addr=0x0 \
-object rng-random,id=objrng0,filename=/dev/urandom \
-device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.6,addr=0x0 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on

Comment 21 Jing Qi 2020-08-31 08:18:05 UTC
(In reply to Michal Privoznik from comment #20)
> To POST:
> 
> http://post-office.corp.redhat.com/archives/rhvirt-patches/2020-August/
> msg00261.html
> 
> Scratch build can be found here:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=31002229
> 
> http://brew-task-repos.usersys.redhat.com/repos/scratch/mprivozn/libvirt/6.6.
> 0/3.el8_rc.9b168aa093/libvirt-6.6.0-3.el8_rc.9b168aa093-scratch.repo

Tested with the scratch build; below are the test results:

1. A level attribute (such as level='1') cannot be added to the <latency> element;
 with it, the XML does not pass validation against the schema.
<numa>
      <cell id='0' cpus='0-5' memory='524288' unit='KiB' discard='yes'>
        <distances>
          <sibling id='0' value='10'/>
          <sibling id='1' value='21'/>
        </distances>
        <cache level='1' associativity='direct' policy='writeback'>
          <size value='10' unit='KiB'/>
          <line value='5' unit='B'/>
        </cache>
      </cell>
      <cell id='1' memory='524288' unit='KiB'>
        <distances>
          <sibling id='0' value='21'/>
          <sibling id='1' value='10'/>
        </distances>
        <cache level='3' associativity='direct' policy='writeback'>
          <size value='8' unit='KiB'/>
          <line value='8' unit='B'/>
        </cache>
      </cell>
      <interconnects>
        <latency initiator='0' target='0' type='access' value='5'/>     => level='1'
        <bandwidth initiator='0' target='0' type='access' value='204800' unit='KiB'/>
      </interconnects>
    </numa>

2. Another XML validation failure is about the order of <distances> & <cache>.

<cell id='0' cpus='0-5' memory='524288' unit='KiB' discard='yes'>
        <cache level='1' associativity='direct' policy='writeback'>
          <size value='10' unit='KiB'/>
          <line value='5' unit='B'/>
        </cache>
        <distances>
          <sibling id='0' value='10'/>
          <sibling id='1' value='21'/>
        </distances>
      </cell>

Then, try to save -

Failed. Try again? [y,n,i,f,?]: 
error: XML document failed to validate against schema: Unable to validate doc against /usr/share/libvirt/schemas/domain.rng
Extra element cpu in interleave
Element domain failed to validate content

Failed. Try again? [y,n,i,f,?]: 

If <cache> is placed above <distances>, it fails to validate. Is that expected?

3. If <cache> is set for a node, but no "latency" & "bandwidth" entries are given for it:

<numa>
      <cell id='0' cpus='0-5' memory='524288' unit='KiB' discard='yes'>
        <distances>
          <sibling id='0' value='10'/>
          <sibling id='1' value='21'/>
        </distances>
        <cache level='1' associativity='direct' policy='writeback'>
          <size value='10' unit='KiB'/>
          <line value='5' unit='B'/>
        </cache>
      </cell>
      <cell id='1' memory='524288' unit='KiB'>
        <distances>
          <sibling id='0' value='21'/>
          <sibling id='1' value='10'/>
        </distances>
        <cache level='3' associativity='direct' policy='writeback'>
          <size value='8' unit='KiB'/>
          <line value='8' unit='B'/>
        </cache>
      </cell>
      <interconnects>
        <latency initiator='0' target='0' type='access' value='5'/>
        <bandwidth initiator='0' target='0' type='access' value='204800' unit='KiB'/>
      </interconnects>
    </numa>
# virsh start avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: internal error: process exited while connecting to monitor: 2020-08-31T08:15:59.794755Z qemu-kvm: -numa hmat-cache,node-id=1,size=8K,level=3,associativity=direct,policy=write-back,line=8: The latency and bandwidth information of node-id=1 should be provided before memory side cache attributes

My question is: would it be better for libvirt to add this check during XML validation? Thanks.

Comment 22 Michal Privoznik 2020-08-31 09:49:19 UTC
(In reply to Jing Qi from comment #21)
> Test with the scratch build, below is the test result:
> 
> 1. level='*'(such as 1) can't be added in "latency" attributes.
>  Otherwise, it can not pass the xml validation against the schema.
> <numa>
>       <cell id='0' cpus='0-5' memory='524288' unit='KiB' discard='yes'>
>         <distances>
>           <sibling id='0' value='10'/>
>           <sibling id='1' value='21'/>
>         </distances>
>         <cache level='1' associativity='direct' policy='writeback'>
>           <size value='10' unit='KiB'/>
>           <line value='5' unit='B'/>
>         </cache>
>       </cell>
>       <cell id='1' memory='524288' unit='KiB'>
>         <distances>
>           <sibling id='0' value='21'/>
>           <sibling id='1' value='10'/>
>         </distances>
>         <cache level='3' associativity='direct' policy='writeback'>
>           <size value='8' unit='KiB'/>
>           <line value='8' unit='B'/>
>         </cache>
>       </cell>
>       <interconnects>
>         <latency initiator='0' target='0' type='access' value='5'/>     =>
> level='1'

The attribute is named cache: cache='1' should work.
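For example (this is the form used in comment 9):

        <latency initiator='0' target='0' cache='1' type='access' value='5'/>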

>         <bandwidth initiator='0' target='0' type='access' value='204800'
> unit='KiB'/>
>       </interconnects>
>     </numa>
> 
> 2. Another xml validation failure is about the order of the <distance> &
> <cache>. 
> 
> <cell id='0' cpus='0-5' memory='524288' unit='KiB' discard='yes'>
>         <cache level='1' associativity='direct' policy='writeback'>
>           <size value='10' unit='KiB'/>
>           <line value='5' unit='B'/>
>         </cache>
>         <distances>
>           <sibling id='0' value='10'/>
>           <sibling id='1' value='21'/>
>         </distances>
>       </cell>
> 
> Then, try to save -
> 
> Failed. Try again? [y,n,i,f,?]: 
> error: XML document failed to validate against schema: Unable to validate
> doc against /usr/share/libvirt/schemas/domain.rng
> Extra element cpu in interleave
> Element domain failed to validate content
> 
> Failed. Try again? [y,n,i,f,?]: 
> 
> If the <cache> is above the <distance>, it failed to validate. Is it as
> expected?

Ouch. No, I will post a patch.

> 
> 3. If the <cache> is set, and no "latency" & "bandwidth" for the node.
> 
> <numa>
>       <cell id='0' cpus='0-5' memory='524288' unit='KiB' discard='yes'>
>         <distances>
>           <sibling id='0' value='10'/>
>           <sibling id='1' value='21'/>
>         </distances>
>         <cache level='1' associativity='direct' policy='writeback'>
>           <size value='10' unit='KiB'/>
>           <line value='5' unit='B'/>
>         </cache>
>       </cell>
>       <cell id='1' memory='524288' unit='KiB'>
>         <distances>
>           <sibling id='0' value='21'/>
>           <sibling id='1' value='10'/>
>         </distances>
>         <cache level='3' associativity='direct' policy='writeback'>
>           <size value='8' unit='KiB'/>
>           <line value='8' unit='B'/>
>         </cache>
>       </cell>
>       <interconnects>
>         <latency initiator='0' target='0' type='access' value='5'/>
>         <bandwidth initiator='0' target='0' type='access' value='204800'
> unit='KiB'/>
>       </interconnects>
>     </numa>
> # virsh start avocado-vt-vm1
> error: Failed to start domain avocado-vt-vm1
> error: internal error: process exited while connecting to monitor:
> 2020-08-31T08:15:59.794755Z qemu-kvm: -numa
> hmat-cache,node-id=1,size=8K,level=3,associativity=direct,policy=write-back,
> line=8: The latency and bandwidth information of node-id=1 should be
> provided before memory side cache attributes
> 
> My question is if libvirt is better to add the check in xml schema
> validation? Thanks.

Honestly, I don't know. I think that users of this feature will always fully specify the HMAT. For future reference, here is the generated CMD line:

-object memory-backend-memfd,id=ram-node0,hugetlb=yes,hugetlbsize=2097152,size=536870912,host-nodes=0,policy=bind \
-numa node,nodeid=0,cpus=0-3,initiator=0,memdev=ram-node0 \
-object memory-backend-memfd,id=ram-node1,hugetlb=yes,hugetlbsize=2097152,size=536870912,host-nodes=0,policy=bind \
-numa node,nodeid=1,initiator=0,memdev=ram-node1 \
-numa dist,src=0,dst=0,val=10 \
-numa dist,src=0,dst=1,val=21 \
-numa dist,src=1,dst=0,val=21 \
-numa dist,src=1,dst=1,val=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=204800K \
-numa hmat-cache,node-id=0,size=10K,level=1,associativity=direct,policy=write-back,line=5 \
-numa hmat-cache,node-id=1,size=8K,level=3,associativity=direct,policy=write-back,line=8 \

Igor, do you have an opinion?

Comment 23 Michal Privoznik 2020-08-31 09:51:47 UTC
Patch for interleaving elements under <cell/>:

https://www.redhat.com/archives/libvir-list/2020-August/msg01113.html

Comment 24 Jing Qi 2020-08-31 10:02:53 UTC
(In reply to Michal Privoznik from comment #22)
> (In reply to Jing Qi from comment #21)
> > Test with the scratch build, below is the test result:
> > 
> > 1. level='*'(such as 1) can't be added in "latency" attributes.
> >  Otherwise, it can not pass the xml validation against the schema.
> > <numa>
> >       <cell id='0' cpus='0-5' memory='524288' unit='KiB' discard='yes'>
> >         <distances>
> >           <sibling id='0' value='10'/>
> >           <sibling id='1' value='21'/>
> >         </distances>
> >         <cache level='1' associativity='direct' policy='writeback'>
> >           <size value='10' unit='KiB'/>
> >           <line value='5' unit='B'/>
> >         </cache>
> >       </cell>
> >       <cell id='1' memory='524288' unit='KiB'>
> >         <distances>
> >           <sibling id='0' value='21'/>
> >           <sibling id='1' value='10'/>
> >         </distances>
> >         <cache level='3' associativity='direct' policy='writeback'>
> >           <size value='8' unit='KiB'/>
> >           <line value='8' unit='B'/>
> >         </cache>
> >       </cell>
> >       <interconnects>
> >         <latency initiator='0' target='0' type='access' value='5'/>     =>
> > level='1'
> 
> The attribute is named cache: cache='1' should work.
> 
 Yes. It works. Thanks.

Comment 25 Igor Mammedov 2020-08-31 13:59:39 UTC
(In reply to Michal Privoznik from comment #22)
> (In reply to Jing Qi from comment #21)
[...]
> > 2020-08-31T08:15:59.794755Z qemu-kvm: -numa
> > hmat-cache,node-id=1,size=8K,level=3,associativity=direct,policy=write-back,
> > line=8: The latency and bandwidth information of node-id=1 should be
> > provided before memory side cache attributes
> > 
> > My question is if libvirt is better to add the check in xml schema
> > validation? Thanks.
> 
> Honestly, I don't know. I think that users of this feature will always fully
> specify the HMAT. For future reference, here is the generated CMD line:
> 
> -object
> memory-backend-memfd,id=ram-node0,hugetlb=yes,hugetlbsize=2097152,
> size=536870912,host-nodes=0,policy=bind \
> -numa node,nodeid=0,cpus=0-3,initiator=0,memdev=ram-node0 \
> -object
> memory-backend-memfd,id=ram-node1,hugetlb=yes,hugetlbsize=2097152,
> size=536870912,host-nodes=0,policy=bind \
> -numa node,nodeid=1,initiator=0,memdev=ram-node1 \
> -numa dist,src=0,dst=0,val=10 \
> -numa dist,src=0,dst=1,val=21 \
> -numa dist,src=1,dst=0,val=21 \
> -numa dist,src=1,dst=1,val=10 \
> -numa
> hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,
> latency=5 \
> -numa
> hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,
> bandwidth=204800K \
> -numa
> hmat-cache,node-id=0,size=10K,level=1,associativity=direct,policy=write-back,
> line=5 \
> -numa
> hmat-cache,node-id=1,size=8K,level=3,associativity=direct,policy=write-back,
> line=8 \
> 
> Igor, do you have opinion?


Well, the CLI above doesn't provide hmat-lb for node 1 but tries to use hmat-cache for node 1;
that's what QEMU doesn't like. (It works as expected.)
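
For example, adding hmat-lb entries for target=1 before the node 1 hmat-cache option (illustrative values, same syntax as in the description) should satisfy that requirement:

-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M \

In libvirt XML terms that would mean adding <latency>/<bandwidth> entries with target='1' to <interconnects>.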

Also, given that node 1 doesn't have any CPUs, it's probably pointless to specify any hmat-cache for it
(and maybe hmat-lb too, I'm not sure).

PS:
Is a CPU-less configuration on an x86 machine possible at all?

Comment 26 Michal Privoznik 2020-08-31 14:05:38 UTC
I've merged the fix mentioned in comment 23:

fd2ad818b2 RNG: Allow interleaving of /domain/cpu/numa/cell children

And backported it here:

http://post-office.corp.redhat.com/archives/rhvirt-patches/2020-August/msg00303.html

I'm not building a new scratch build, because the only change would be the RNG fix; there is no code change since the last one.

Comment 27 Michal Privoznik 2020-08-31 15:25:34 UTC
(In reply to Igor Mammedov from comment #25)
> (In reply to Michal Privoznik from comment #22)
> > (In reply to Jing Qi from comment #21)
> [...]
> > > 2020-08-31T08:15:59.794755Z qemu-kvm: -numa
> > > hmat-cache,node-id=1,size=8K,level=3,associativity=direct,policy=write-back,
> > > line=8: The latency and bandwidth information of node-id=1 should be
> > > provided before memory side cache attributes
> > > 
> > > My question is if libvirt is better to add the check in xml schema
> > > validation? Thanks.
> > 
> > Honestly, I don't know. I think that users of this feature will always fully
> > specify the HMAT. For future reference, here is the generated CMD line:
> > 
> > -object
> > memory-backend-memfd,id=ram-node0,hugetlb=yes,hugetlbsize=2097152,
> > size=536870912,host-nodes=0,policy=bind \
> > -numa node,nodeid=0,cpus=0-3,initiator=0,memdev=ram-node0 \
> > -object
> > memory-backend-memfd,id=ram-node1,hugetlb=yes,hugetlbsize=2097152,
> > size=536870912,host-nodes=0,policy=bind \
> > -numa node,nodeid=1,initiator=0,memdev=ram-node1 \
> > -numa dist,src=0,dst=0,val=10 \
> > -numa dist,src=0,dst=1,val=21 \
> > -numa dist,src=1,dst=0,val=21 \
> > -numa dist,src=1,dst=1,val=10 \
> > -numa
> > hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,
> > latency=5 \
> > -numa
> > hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,
> > bandwidth=204800K \
> > -numa
> > hmat-cache,node-id=0,size=10K,level=1,associativity=direct,policy=write-back,
> > line=5 \
> > -numa
> > hmat-cache,node-id=1,size=8K,level=3,associativity=direct,policy=write-back,
> > line=8 \
> > 
> > Igor, do you have opinion?
> 
> 
> well, CLI above doesn't provide hmat-lb for node 1, but tries to use
> hmat-cache for node 1,
> that's what QEMU doesn't like. (it works as expected)

Okay, I'd say let's track that in a different bug then. The feature works as expected.

> 
> Also looking that node 1 doesn't have any cpus, it's probably pointless to
> specify any hmat-cache for it

Makes sense. Again, if we want to track it, let's track it in a different bug.

> (and may be hmat-lb, I'm not sure).
> 
> PS:
> Is cpu-less configuration on x86 machine possible at all?

Sure it is:
$ ssh root@fedora numactl -H
X11 forwarding request failed on channel 0
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 502 MB
node 0 free: 52 MB
node 1 cpus:
node 1 size: 471 MB
node 1 free: 210 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10

Comment 28 Jing Qi 2020-09-04 10:06:20 UTC
Verified with libvirt-daemon-6.6.0-4.module+el8.3.0+7883+3d717aa8.x86_64
& qemu-kvm-5.1.0-4.module+el8.3.0+7846+ae9b566f.x86_64

1. Start the domain with the XML fragment below:

 <cpu mode='host-model' check='partial'>
    <model fallback='forbid'>qemu64</model>
    <feature policy='disable' name='svm'/>
    <numa>
      <cell id='0' cpus='0-5' memory='512000' unit='KiB' discard='yes'>
        <distances>
          <sibling id='0' value='10'/>
          <sibling id='1' value='21'/>
        </distances>
        <cache level='1' associativity='direct' policy='writeback'>
          <size value='8' unit='KiB'/>
          <line value='5' unit='B'/>
        </cache>
        <cache level='2' associativity='direct' policy='writeback'>
          <size value='7' unit='KiB'/>
          <line value='5' unit='B'/>
        </cache>
        <cache level='3' associativity='direct' policy='writeback'>
          <size value='6' unit='KiB'/>
          <line value='8' unit='B'/>
        </cache>
      </cell>
      <cell id='1' memory='512000' unit='KiB'>
        <distances>
          <sibling id='0' value='21'/>
          <sibling id='1' value='10'/>
        </distances>
      </cell>
      <interconnects>
        <latency initiator='0' target='0' type='access' value='5'/>
        <bandwidth initiator='0' target='0' type='access' value='204800' unit='KiB'/>
      </interconnects>
    </numa>
  </cpu>

# virsh start avocado-vt-vm
Domain avocado-vt-vm started

2. Successfully migrated the VM and did some checks in the guest.

#virsh migrate avocado-vt-vm qemu+ssh://***.redhat.com/system


Checked "dmesg" in the guest:

# dmesg |grep -i hmat
[    0.000000] ACPI: HMAT 0x000000003E7E1EC5 000138 (v02 BOCHS  BXPCHMAT 00000001 BXPC 00000001)
[    1.153980] acpi/hmat: HMAT: Memory Flags:0001 Processor Domain:0 Memory Domain:0
[    1.155943] acpi/hmat: HMAT: Memory Flags:0001 Processor Domain:0 Memory Domain:1
[    1.157787] acpi/hmat: HMAT: Locality: Flags:00 Type:Access Latency Initiator Domains:1 Target Domains:2 Base:1000
[    1.160003] acpi/hmat:   Initiator-Target[0-0]:5 nsec
[    1.161134] acpi/hmat:   Initiator-Target[0-1]:0 nsec
[    1.162261] acpi/hmat: HMAT: Locality: Flags:00 Type:Access Bandwidth Initiator Domains:1 Target Domains:2 Base:8
[    1.164550] acpi/hmat:   Initiator-Target[0-0]:200 MB/s
[    1.165698] acpi/hmat:   Initiator-Target[0-1]:0 MB/s
[    1.166825] acpi/hmat: HMAT: Cache: Domain:0 Size:8192 Attrs:00051113 SMBIOS Handles:0
[    1.168720] acpi/hmat: HMAT: Cache: Domain:0 Size:7168 Attrs:00051123 SMBIOS Handles:0
[    1.170616] acpi/hmat: HMAT: Cache: Domain:0 Size:6144 Attrs:00081133 SMBIOS Handles:0

# numactl --ha
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 451 MB
node 0 free: 39 MB
node 1 cpus:
node 1 size: 331 MB
node 1 free: 222 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10

Comment 31 errata-xmlrpc 2020-11-17 17:46:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5137

