Description of problem:

QEMU 5.0 is gaining support for configuring NUMA HMAT tables:

https://lists.gnu.org/archive/html/qemu-devel/2019-12/msg02583.html

This allows configuring distance information between resources on the host. Unlike the earlier SLIT tables, HMAT supports concepts such as memory-only NUMA nodes. Further background can be found in the Linux patches initially merged in Linux 5.1:

https://lore.kernel.org/patchwork/cover/862903/

Some info from the QEMU patches:

The following example creates a machine with 2 NUMA nodes. Node 0 has CPUs; node 1 has only memory, and its initiator is node 0. Note that because node 0 has CPUs, by default the initiator of node 0 is itself and must be itself.

-machine hmat=on \
-m 2G,slots=2,maxmem=4G \
-object memory-backend-ram,size=1G,id=m0 \
-object memory-backend-ram,size=1G,id=m1 \
-numa node,nodeid=0,memdev=m0 \
-numa node,nodeid=1,memdev=m1,initiator=0 \
-smp 2,sockets=2,maxcpus=2 \
-numa cpu,node-id=0,socket-id=0 \
-numa cpu,node-id=0,socket-id=1

The following options describe 2 NUMA nodes. Node 0 has 2 CPUs and RAM; node 1 has only RAM. The processors in node 0 access memory in node 0 with an access latency of 5 nanoseconds and an access bandwidth of 200 MB/s; they access memory in node 1 with an access latency of 10 nanoseconds and an access bandwidth of 100 MB/s.
-machine hmat=on \
-m 2G \
-object memory-backend-ram,size=1G,id=m0 \
-object memory-backend-ram,size=1G,id=m1 \
-smp 2 \
-numa node,nodeid=0,memdev=m0 \
-numa node,nodeid=1,memdev=m1,initiator=0 \
-numa cpu,node-id=0,socket-id=0 \
-numa cpu,node-id=0,socket-id=1 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M

There are further options for cache size / locality information. Configuring this level of detail in guests will likely become important when doing strict pinning of guests to hosts, to maximise the performance of the NUMA topology.

Version-Release number of selected component (if applicable):
libvirt-5.10.0-1
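The mapping from a latency/bandwidth pair to the two "-numa hmat-lb" arguments can be sketched as below. This is a small illustrative helper (not part of libvirt or QEMU); the function name is made up for this example.

```python
# Illustrative sketch: build the two "-numa hmat-lb" option strings for one
# initiator/target pair, matching the format used in the example above.
def hmat_lb_opts(initiator, target, latency_ns, bandwidth):
    """Return the access-latency and access-bandwidth hmat-lb arguments."""
    base = f"hmat-lb,initiator={initiator},target={target},hierarchy=memory"
    return [
        f"{base},data-type=access-latency,latency={latency_ns}",
        f"{base},data-type=access-bandwidth,bandwidth={bandwidth}",
    ]

if __name__ == "__main__":
    # Node 0 -> node 1 figures from the example: 10 ns, 100 MB/s
    for opt in hmat_lb_opts(0, 1, 10, "100M"):
        print("-numa", opt)
```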
I've started an XML design discussion here: https://www.redhat.com/archives/libvir-list/2020-January/msg00422.html
Setting ITR and Target Release to 8.3.0, because this is a dependency of bug 1745059 (which is an 8.3 candidate).
v1: https://www.redhat.com/archives/libvir-list/2020-June/msg01078.html
Merged upstream:

30dd4aed3c news: Document HMAT addition
aeecbc87b7 qemu: Build HMAT command line
c2f15f1b18 qemu: Introduce QEMU_CAPS_NUMA_HMAT capability
11d8ca9794 numa: expose HMAT APIs
f0611fe883 conf: Validate NUMA HMAT configuration
a89bbbac86 conf: Parse and format HMAT
a26f61ee0c Allow NUMA nodes without vCPUs
1050c6beb1 numa_conf: Make virDomainNumaSetNodeCpumask() return void
fe43b3a5a5 qemuBuildMachineCommandLine: Drop needless check
8ba1792785 qemu_command: Rename qemuBuildNumaArgStr()
68c5b0183c numa_conf: Drop CPU from name of two functions
04bd77a19f conf: Move and rename virDomainParseScaledValue()
afb1ea6776 qemuxml2xmltest: Add "numatune-distance" test case
e95da4e5bf qemuBuildMemoryBackendProps: Use boolean type for 'pmem' property

v6.5.0-58-g30dd4aed3c
I tried to verify the new feature and found some issues that need to be confirmed.

1. When two caches are added under one cell and the second cache is referenced from the latency element in <interconnects>, it reports "non-existent NUMA node cache". Two caches should be allowed per libvirt.org:

"Since 6.6.0 the cell element can have a cache child element which describes memory side cache for memory proximity domains. The cache element has a level attribute describing the cache level and thus the element can be repeated multiple times to describe different levels of the cache."

<cpu mode='host-model' check='partial'>
  <numa>
    <cell id='0' cpus='0-5' memory='512000' unit='KiB' discard='yes'>
      <distances>
        <sibling id='0' value='10'/>
        <sibling id='1' value='21'/>
      </distances>
      <cache level='3' associativity='direct' policy='writeback'>
        <size value='10' unit='KiB'/>
        <line value='8' unit='B'/>
      </cache>
      <cache level='1' associativity='direct' policy='writeback'>
        <size value='8' unit='KiB'/>
        <line value='5' unit='B'/>
      </cache>
    </cell>
    <cell id='1' memory='512000' unit='KiB'>
      <distances>
        <sibling id='0' value='21'/>
        <sibling id='1' value='10'/>
      </distances>
    </cell>
    <interconnects>
      <latency initiator='0' target='0' cache='1' type='access' value='5'/>
      <bandwidth initiator='0' target='0' type='access' value='204800' unit='KiB'/>
    </interconnects>
  </numa>
</cpu>

# virsh edit avocado-vt-vm
error: XML error: 'cache' refers to a non-existent NUMA node cache
Failed. Try again? [y,n,i,f,?]:

2. With "cache='1'" removed from the latency element, the domain can be started:

<interconnects>
  <latency initiator='0' target='0' type='access' value='5'/>
  <bandwidth initiator='0' target='0' type='access' value='204800' unit='KiB'/>
</interconnects>

and the qemu command line is:

-numa hmat-cache,node-id=0,size=8K,level=1,associativity=direct,policy=write-back,line=5

Can you please help to confirm whether these two situations are expected?
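The validation in question can be approximated as below. This is a hypothetical standalone re-implementation written for illustration only (libvirt's real check lives in its domain validation code, not here): confirm that a cache='N' reference in <interconnects> names a cache level actually declared in the initiator's cell.

```python
# Hypothetical sketch of the cache-reference check discussed above
# (not libvirt's actual code).
import xml.etree.ElementTree as ET

XML = """
<numa>
  <cell id='0'>
    <cache level='3'/>
    <cache level='1'/>
  </cell>
  <interconnects>
    <latency initiator='0' target='0' cache='1' type='access' value='5'/>
  </interconnects>
</numa>
"""

def check_cache_refs(xml_text):
    root = ET.fromstring(xml_text)
    # Map cell id -> set of declared cache levels
    levels = {c.get("id"): {k.get("level") for k in c.findall("cache")}
              for c in root.findall("cell")}
    errors = []
    for lat in root.findall("./interconnects/latency"):
        cache = lat.get("cache")
        if cache is not None and cache not in levels.get(lat.get("initiator"), set()):
            errors.append("'cache' refers to a non-existent NUMA node cache")
    return errors

print(check_cache_refs(XML))  # level 1 is declared in cell 0, so no errors
```

With both cache levels declared, referencing cache='1' should pass; referencing an undeclared level should produce the error seen above.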
(In reply to Jing Qi from comment #9)
> I tried to verify the new feature and found some issues that need to be
> confirmed.
>
> 1. Add two caches under one cell, and in interconnects, use the second cache
> in the latency, it reports "non-existent NUMA node cache".
[...]
> virsh edit avocado-vt-vm
> error: XML error: 'cache' refers to a non-existent NUMA node cache
> Failed. Try again? [y,n,i,f,?]:

Ah, this is a bug. Patch posted here:

https://www.redhat.com/archives/libvir-list/2020-August/msg00536.html

> 2.
> Remove the "cache='1'" from latency, the domain can be started and the
> qemu-cmdline is as below -
>
> -numa hmat-cache,node-id=0,size=8K,level=1,associativity=direct,policy=write-back,line=5
>
> <interconnects>
>   <latency initiator='0' target='0' type='access' value='5'/>
>   <bandwidth initiator='0' target='0' type='access' value='204800' unit='KiB'/>
> </interconnects>
>
> Can you please help to confirm if the two situations are as expected ?

This looks as expected, doesn't it? Do you think there is something wrong with the generated command line?

PS: sorry for the delay, I was on a longer PTO.
For the second scenario, there is no issue with the qemu command line. But if no "cache=X" is specified, should one be selected as the default - the smallest one, the first one, or the last one?

There is an issue with one more scenario; please help to confirm whether it is expected.

1. Edit the xml:

<cpu mode='custom' match='exact' check='none'>
  <model fallback='forbid'>qemu64</model>
  <numa>
    <cell id='0' cpus='0-5' memory='512000' unit='KiB' discard='yes'>
      <distances>
        <sibling id='0' value='10'/>
        <sibling id='1' value='21'/>
      </distances>
      <cache level='1' associativity='direct' policy='writeback'>
        <size value='8' unit='KiB'/>
        <line value='5' unit='B'/>
      </cache>
      <cache level='2' associativity='direct' policy='writeback'>
        <size value='9' unit='KiB'/>
        <line value='6' unit='B'/>
      </cache>
      <cache level='3' associativity='direct' policy='writeback'>
        <size value='10' unit='KiB'/>
        <line value='8' unit='B'/>
      </cache>
    </cell>
    <cell id='1' memory='512000' unit='KiB'>
      <distances>
        <sibling id='0' value='21'/>
        <sibling id='1' value='10'/>
      </distances>
    </cell>
    <interconnects>
      <latency initiator='0' target='0' type='access' value='5'/>
      <bandwidth initiator='0' target='0' type='access' value='204800' unit='KiB'/>
    </interconnects>
  </numa>
</cpu>

2. # virsh start vm2
error: Failed to start domain vm2
error: internal error: qemu unexpectedly closed the monitor: 2020-08-19T01:23:10.928169Z qemu-kvm: -numa hmat-cache,node-id=0,size=9K,level=2,associativity=direct,policy=write-back,line=6: Invalid size=9216, the size of level=2 should be less than the size(8192) of level=1
I am not sure at which level the size should be smallest. It seems the deeper the level, the smaller the value should be? But both of the configurations below work:

1)
<cache level='1' associativity='direct' policy='writeback'>
  <size value='8' unit='KiB'/>
  <line value='5' unit='B'/>
</cache>
<cache level='2' associativity='direct' policy='writeback'>
  <size value='7' unit='KiB'/>
  <line value='6' unit='B'/>
</cache>
<cache level='3' associativity='direct' policy='writeback'>
  <size value='6' unit='KiB'/>
  <line value='8' unit='B'/>
</cache>

2)
<cache level='1' associativity='direct' policy='writeback'>
  <size value='8' unit='KiB'/>
  <line value='5' unit='B'/>
</cache>
<cache level='3' associativity='direct' policy='writeback'>
  <size value='10' unit='KiB'/>
  <line value='8' unit='B'/>
</cache>
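The pattern in the three observations above can be sketched as a size check. This is an assumption inferred from the error messages, not QEMU's actual code: each cache level appears to be compared only against the immediately preceding level (N vs N-1), which would explain why a gap such as level 1 followed by level 3 escapes the check.

```python
# Sketch of the hmat-cache size check as inferred from the observed QEMU
# errors (an assumption about QEMU's behaviour, not its real implementation).
def check_hmat_cache_sizes(caches):
    """caches: dict mapping level -> size in bytes. Return a list of errors."""
    errors = []
    for level, size in sorted(caches.items()):
        prev = caches.get(level - 1)  # only the adjacent level is compared
        if prev is not None and size >= prev:
            errors.append(
                f"Invalid size={size}, the size of level={level} "
                f"should be less than the size({prev}) of level={level - 1}")
    return errors

# The failing scenario: 8K, 9K, 10K for levels 1..3
print(check_hmat_cache_sizes({1: 8192, 2: 9216, 3: 10240}))
# The two working configurations: strictly decreasing, and levels 1 + 3 only
print(check_hmat_cache_sizes({1: 8192, 2: 7168, 3: 6144}))
print(check_hmat_cache_sizes({1: 8192, 3: 10240}))
```

Under this reading, configuration 2) passes simply because level 2 is absent, even though level 3 is larger than level 1.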
Problem 1) from comment 9 is fixed with the patch from comment 10, which is now merged:

e41ac71fca numa_conf: Properly check for caches in virDomainNumaDefValidate()

v6.6.0-538-ge41ac71fca
(In reply to Jing Qi from comment #11)
> 2. # virsh start vm2
> error: Failed to start domain vm2
> error: internal error: qemu unexpectedly closed the monitor:
> 2020-08-19T01:23:10.928169Z qemu-kvm: -numa
> hmat-cache,node-id=0,size=9K,level=2,associativity=direct,policy=write-back,line=6:
> Invalid size=9216, the size of level=2 should be less than the size(8192) of level=1

I'm not sure it makes sense to replicate these checks in libvirt. Also, this looks very suspicious - usually out in the wild L3 > L2 > L1. Why would QEMU want it the other way?
@yuhuang
For the issue in comments 11 & 12, do you think qemu works as expected, or is it a bug?
(In reply to Jing Qi from comment #15)
> @yuhuang
> For the issue in comment 11 & 12, do you think if qemu works as expected? Or
> it's a bug?

I don't know. Igor?
(In reply to Jing Qi from comment #15)
> @yuhuang
> For the issue in comment 11 & 12, do you think if qemu works as expected? Or
> it's a bug?

Looks like a bug to me. I'll post a patch to raise the question; I hope the original author or someone else from Intel will answer.
(In reply to Jing Qi from comment #11)
> There is a issue with one more scenario, please help to confirm if it's fixed
[...]
> 2. # virsh start vm2
> error: Failed to start domain vm2
> error: internal error: qemu unexpectedly closed the monitor:
> 2020-08-19T01:23:10.928169Z qemu-kvm: -numa
> hmat-cache,node-id=0,size=9K,level=2,associativity=direct,policy=write-back,line=6:
> Invalid size=9216, the size of level=2 should be less than the size(8192) of level=1

Can you provide the full QEMU CLI that generates this error?
Yes, the qemu command line is as below -

/usr/libexec/qemu-kvm \
-name guest=avocado-vt-vm1,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-avocado-vt-vm1/master-key.aes \
-machine pc-q35-rhel8.3.0,accel=kvm,usb=off,dump-guest-core=off,hmat=on \
-cpu Broadwell,vme=on,ss=on,vmx=off,f16c=on,rdrand=on,hypervisor=on,arat=on,tsc-adjust=on,umip=on,arch-capabilities=on,xsaveopt=on,pdpe1gb=on,abm=on,skip-l1dfl-vmentry=on,pschange-mc-no=on,rtm=on,hle=on \
-m 1000 \
-overcommit mem-lock=off \
-smp 6,sockets=6,cores=1,threads=1 \
-object memory-backend-ram,id=ram-node0,size=524288000 \
-numa node,nodeid=0,cpus=0-5,initiator=0,memdev=ram-node0 \
-object memory-backend-ram,id=ram-node1,size=524288000 \
-numa node,nodeid=1,initiator=0,memdev=ram-node1 \
-numa dist,src=0,dst=0,val=10 \
-numa dist,src=0,dst=1,val=21 \
-numa dist,src=1,dst=0,val=21 \
-numa dist,src=1,dst=1,val=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=204800K \
-numa hmat-cache,node-id=0,size=8K,level=1,associativity=direct,policy=write-back,line=5 \
-numa hmat-cache,node-id=0,size=9K,level=2,associativity=direct,policy=write-back,line=6 \
-numa hmat-cache,node-id=0,size=10K,level=3,associativity=direct,policy=write-back,line=8 \
-uuid 03040ca7-2813-4f71-bc80-5943228d6371 \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=35,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-global ICH9-LPC.disable_s3=1 \
-global ICH9-LPC.disable_s4=1 \
-boot strict=on \
-device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 \
-device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 \
-device pcie-root-port,port=0x12,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 \
-device pcie-root-port,port=0x13,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 \
-device pcie-root-port,port=0x14,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 \
-device pcie-root-port,port=0x15,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 \
-device pcie-root-port,port=0x16,chassis=7,id=pci.7,bus=pcie.0,addr=0x2.0x6 \
-device qemu-xhci,p2=15,p3=15,id=usb,bus=pci.2,addr=0x0 \
-device virtio-serial-pci,id=virtio-serial0,bus=pci.3,addr=0x0 \
-blockdev '{"driver":"file","filename":"/var/lib/avocado/data/avocado-vt/images/jeos-27-x86_64.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
-device virtio-blk-pci,bus=pci.4,addr=0x0,drive=libvirt-1-format,id=virtio-disk0,bootindex=1 \
-netdev tap,fd=37,id=hostnet0,vhost=on,vhostfd=38 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:2e:74:ba,bus=pci.1,addr=0x0 \
-chardev pty,id=charserial0 \
-device isa-serial,chardev=charserial0,id=serial0 \
-chardev socket,id=charchannel0,fd=39,server,nowait \
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \
-device usb-tablet,id=input0,bus=usb.0,port=1 \
-vnc 127.0.0.1:0 \
-device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pcie.0,addr=0x1 \
-device virtio-balloon-pci,id=balloon0,bus=pci.5,addr=0x0 \
-object rng-random,id=objrng0,filename=/dev/urandom \
-device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.6,addr=0x0 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
To POST:

http://post-office.corp.redhat.com/archives/rhvirt-patches/2020-August/msg00261.html

Scratch build can be found here:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=31002229

http://brew-task-repos.usersys.redhat.com/repos/scratch/mprivozn/libvirt/6.6.0/3.el8_rc.9b168aa093/libvirt-6.6.0-3.el8_rc.9b168aa093-scratch.repo
(In reply to Michal Privoznik from comment #20)
> To POST:
>
> http://post-office.corp.redhat.com/archives/rhvirt-patches/2020-August/msg00261.html
>
> Scratch build can be found here:
>
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=31002229
>
> http://brew-task-repos.usersys.redhat.com/repos/scratch/mprivozn/libvirt/6.6.0/3.el8_rc.9b168aa093/libvirt-6.6.0-3.el8_rc.9b168aa093-scratch.repo

Tested with the scratch build; below are the test results:

1. level='*' (such as level='1') can't be added to the "latency" element; otherwise the XML fails validation against the schema.

<numa>
  <cell id='0' cpus='0-5' memory='524288' unit='KiB' discard='yes'>
    <distances>
      <sibling id='0' value='10'/>
      <sibling id='1' value='21'/>
    </distances>
    <cache level='1' associativity='direct' policy='writeback'>
      <size value='10' unit='KiB'/>
      <line value='5' unit='B'/>
    </cache>
  </cell>
  <cell id='1' memory='524288' unit='KiB'>
    <distances>
      <sibling id='0' value='21'/>
      <sibling id='1' value='10'/>
    </distances>
    <cache level='3' associativity='direct' policy='writeback'>
      <size value='8' unit='KiB'/>
      <line value='8' unit='B'/>
    </cache>
  </cell>
  <interconnects>
    <latency initiator='0' target='0' type='access' value='5'/>   => level='1'
    <bandwidth initiator='0' target='0' type='access' value='204800' unit='KiB'/>
  </interconnects>
</numa>

2. Another XML validation failure is about the order of <distances> & <cache>:

<cell id='0' cpus='0-5' memory='524288' unit='KiB' discard='yes'>
  <cache level='1' associativity='direct' policy='writeback'>
    <size value='10' unit='KiB'/>
    <line value='5' unit='B'/>
  </cache>
  <distances>
    <sibling id='0' value='10'/>
    <sibling id='1' value='21'/>
  </distances>
</cell>

Then, trying to save:

Failed. Try again? [y,n,i,f,?]:
error: XML document failed to validate against schema: Unable to validate doc against /usr/share/libvirt/schemas/domain.rng
Extra element cpu in interleave
Element domain failed to validate content

Failed. Try again? [y,n,i,f,?]:

If the <cache> is above the <distances>, validation fails. Is that expected?

3. If <cache> is set but no "latency" & "bandwidth" are given for the node:

<numa>
  <cell id='0' cpus='0-5' memory='524288' unit='KiB' discard='yes'>
    <distances>
      <sibling id='0' value='10'/>
      <sibling id='1' value='21'/>
    </distances>
    <cache level='1' associativity='direct' policy='writeback'>
      <size value='10' unit='KiB'/>
      <line value='5' unit='B'/>
    </cache>
  </cell>
  <cell id='1' memory='524288' unit='KiB'>
    <distances>
      <sibling id='0' value='21'/>
      <sibling id='1' value='10'/>
    </distances>
    <cache level='3' associativity='direct' policy='writeback'>
      <size value='8' unit='KiB'/>
      <line value='8' unit='B'/>
    </cache>
  </cell>
  <interconnects>
    <latency initiator='0' target='0' type='access' value='5'/>
    <bandwidth initiator='0' target='0' type='access' value='204800' unit='KiB'/>
  </interconnects>
</numa>

# virsh start avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: internal error: process exited while connecting to monitor: 2020-08-31T08:15:59.794755Z qemu-kvm: -numa hmat-cache,node-id=1,size=8K,level=3,associativity=direct,policy=write-back,line=8: The latency and bandwidth information of node-id=1 should be provided before memory side cache attributes

My question is whether it would be better for libvirt to add this check to the XML schema validation? Thanks.
(In reply to Jing Qi from comment #21)
> Test with the scratch build, below is the test result:
>
> 1. level='*'(such as 1) can't be added in "latency" attributes.
> Otherwise, it can not pass the xml validation against the schema.
[...]
> <interconnects>
> <latency initiator='0' target='0' type='access' value='5'/> => level='1'

The attribute is named cache: cache='1' should work.

> <bandwidth initiator='0' target='0' type='access' value='204800'
> unit='KiB'/>
> </interconnects>
> </numa>
>
> 2. Another xml validation failure is about the order of the <distance> &
> <cache>.
[...]
> If the <cache> is above the <distance>, it failed to validate. Is it as
> expected?

Ouch. No, I will post a patch.

> 3. If the <cache> is set, and no "latency" & "bandwidth" for the node.
[...]
> # virsh start avocado-vt-vm1
> error: Failed to start domain avocado-vt-vm1
> error: internal error: process exited while connecting to monitor:
> 2020-08-31T08:15:59.794755Z qemu-kvm: -numa
> hmat-cache,node-id=1,size=8K,level=3,associativity=direct,policy=write-back,line=8:
> The latency and bandwidth information of node-id=1 should be provided
> before memory side cache attributes
>
> My question is if libvirt is better to add the check in xml schema
> validation? Thanks.

Honestly, I don't know. I think that users of this feature will always fully specify the HMAT.
For future reference, here is the generated command line:

-object memory-backend-memfd,id=ram-node0,hugetlb=yes,hugetlbsize=2097152,size=536870912,host-nodes=0,policy=bind \
-numa node,nodeid=0,cpus=0-3,initiator=0,memdev=ram-node0 \
-object memory-backend-memfd,id=ram-node1,hugetlb=yes,hugetlbsize=2097152,size=536870912,host-nodes=0,policy=bind \
-numa node,nodeid=1,initiator=0,memdev=ram-node1 \
-numa dist,src=0,dst=0,val=10 \
-numa dist,src=0,dst=1,val=21 \
-numa dist,src=1,dst=0,val=21 \
-numa dist,src=1,dst=1,val=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=204800K \
-numa hmat-cache,node-id=0,size=10K,level=1,associativity=direct,policy=write-back,line=5 \
-numa hmat-cache,node-id=1,size=8K,level=3,associativity=direct,policy=write-back,line=8 \

Igor, do you have an opinion?
Patch for interleaving elements under <cell/>: https://www.redhat.com/archives/libvir-list/2020-August/msg01113.html
(In reply to Michal Privoznik from comment #22)
> (In reply to Jing Qi from comment #21)
[...]
> > <latency initiator='0' target='0' type='access' value='5'/> => level='1'
>
> The attribute is named cache: cache='1' should work.

Yes. It works. Thanks.
(In reply to Michal Privoznik from comment #22)
> (In reply to Jing Qi from comment #21)
[...]
> > 2020-08-31T08:15:59.794755Z qemu-kvm: -numa
> > hmat-cache,node-id=1,size=8K,level=3,associativity=direct,policy=write-back,line=8:
> > The latency and bandwidth information of node-id=1 should be provided
> > before memory side cache attributes
> >
> > My question is if libvirt is better to add the check in xml schema
> > validation? Thanks.
>
> Honestly, I don't know. I think that users of this feature will always fully
> specify the HMAT. For future reference, here is the generated CMD line:
[...]
> Igor, do you have opinion?

Well, the CLI above doesn't provide hmat-lb for node 1 but tries to use hmat-cache for node 1; that's what QEMU doesn't like. (It works as expected.)

Also, given that node 1 doesn't have any cpus, it's probably pointless to specify any hmat-cache for it (and maybe hmat-lb, I'm not sure).

PS:
Is a cpu-less configuration on an x86 machine possible at all?
I've merged the fix mentioned in comment 23:

fd2ad818b2 RNG: Allow interleaving of /domain/cpu/numa/cell children

And backported it here:

http://post-office.corp.redhat.com/archives/rhvirt-patches/2020-August/msg00303.html

I'm not building a new scratch build, because the only change since the last one is the RNG fix; there is no code change.
(In reply to Igor Mammedov from comment #25)
> (In reply to Michal Privoznik from comment #22)
[...]
> well, CLI above doesn't provide hmat-lb for node 1, but tries to use
> hmat-cache for node 1,
> that's what QEMU doesn't like. (it works as expected)

Okay, I'd say let's track that in a different bug then. The feature works as expected.
>
> Also looking that node 1 doesn't have any cpus, it's probably pointless to
> specify any hmat-cache for it

Makes sense. Again, if we want to track it, let's track it in a different bug.

> (and may be hmat-lb, I'm not sure).
>
> PS:
> Is cpu-less configuration on x86 machine possible at all?

Sure it is:

$ ssh root@fedora numactl -H
X11 forwarding request failed on channel 0
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 502 MB
node 0 free: 52 MB
node 1 cpus:
node 1 size: 471 MB
node 1 free: 210 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
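Spotting such memory-only nodes can be done by parsing "numactl -H"-style output like the listing above. The helper below is an illustrative sketch written for this thread (not part of libvirt); the sample text is copied from the output shown.

```python
# Illustrative sketch: find cpu-less (memory-only) NUMA nodes in
# "numactl -H"-style output.
import re

SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 502 MB
node 1 cpus:
node 1 size: 471 MB
"""

def memory_only_nodes(text):
    """Return ids of nodes whose 'cpus:' line lists no CPUs."""
    nodes = []
    for m in re.finditer(r"^node (\d+) cpus:(.*)$", text, re.M):
        if not m.group(2).split():  # nothing after "cpus:" -> cpu-less node
            nodes.append(int(m.group(1)))
    return nodes

print(memory_only_nodes(SAMPLE))  # node 1 has no cpus
```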
Verified with libvirt-daemon-6.6.0-4.module+el8.3.0+7883+3d717aa8.x86_64 & qemu-kvm-5.1.0-4.module+el8.3.0+7846+ae9b566f.x86_64

1. Start the domain with the below xml part:

<cpu mode='host-model' check='partial'>
  <model fallback='forbid'>qemu64</model>
  <feature policy='disable' name='svm'/>
  <numa>
    <cell id='0' cpus='0-5' memory='512000' unit='KiB' discard='yes'>
      <distances>
        <sibling id='0' value='10'/>
        <sibling id='1' value='21'/>
      </distances>
      <cache level='1' associativity='direct' policy='writeback'>
        <size value='8' unit='KiB'/>
        <line value='5' unit='B'/>
      </cache>
      <cache level='2' associativity='direct' policy='writeback'>
        <size value='7' unit='KiB'/>
        <line value='5' unit='B'/>
      </cache>
      <cache level='3' associativity='direct' policy='writeback'>
        <size value='6' unit='KiB'/>
        <line value='8' unit='B'/>
      </cache>
    </cell>
    <cell id='1' memory='512000' unit='KiB'>
      <distances>
        <sibling id='0' value='21'/>
        <sibling id='1' value='10'/>
      </distances>
    </cell>
    <interconnects>
      <latency initiator='0' target='0' type='access' value='5'/>
      <bandwidth initiator='0' target='0' type='access' value='204800' unit='KiB'/>
    </interconnects>
  </numa>
</cpu>

# virsh start avocado-vt-vm
Domain avocado-vt-vm started

2. Succeeded to migrate the vm and did some checks in the guest.
# virsh migrate avocado-vt-vm qemu+ssh://***.redhat.com/system

Check "dmesg" in the guest:

# dmesg | grep -i hmat
[ 0.000000] ACPI: HMAT 0x000000003E7E1EC5 000138 (v02 BOCHS BXPCHMAT 00000001 BXPC 00000001)
[ 1.153980] acpi/hmat: HMAT: Memory Flags:0001 Processor Domain:0 Memory Domain:0
[ 1.155943] acpi/hmat: HMAT: Memory Flags:0001 Processor Domain:0 Memory Domain:1
[ 1.157787] acpi/hmat: HMAT: Locality: Flags:00 Type:Access Latency Initiator Domains:1 Target Domains:2 Base:1000
[ 1.160003] acpi/hmat: Initiator-Target[0-0]:5 nsec
[ 1.161134] acpi/hmat: Initiator-Target[0-1]:0 nsec
[ 1.162261] acpi/hmat: HMAT: Locality: Flags:00 Type:Access Bandwidth Initiator Domains:1 Target Domains:2 Base:8
[ 1.164550] acpi/hmat: Initiator-Target[0-0]:200 MB/s
[ 1.165698] acpi/hmat: Initiator-Target[0-1]:0 MB/s
[ 1.166825] acpi/hmat: HMAT: Cache: Domain:0 Size:8192 Attrs:00051113 SMBIOS Handles:0
[ 1.168720] acpi/hmat: HMAT: Cache: Domain:0 Size:7168 Attrs:00051123 SMBIOS Handles:0
[ 1.170616] acpi/hmat: HMAT: Cache: Domain:0 Size:6144 Attrs:00081133 SMBIOS Handles:0

# numactl --ha
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 451 MB
node 0 free: 39 MB
node 1 cpus:
node 1 size: 331 MB
node 1 free: 222 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
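The dmesg lines above encode the configured initiator/target latencies, which can be recovered mechanically. The helper below is a hypothetical sketch written for this thread; the sample lines are copied from the guest output.

```python
# Illustrative sketch: recover the initiator->target access-latency figures
# from the "acpi/hmat" dmesg lines shown above.
import re

DMESG = """\
[    1.160003] acpi/hmat: Initiator-Target[0-0]:5 nsec
[    1.161134] acpi/hmat: Initiator-Target[0-1]:0 nsec
"""

def parse_latencies(text):
    """Return {(initiator, target): latency_ns} from dmesg hmat lines."""
    out = {}
    for m in re.finditer(r"Initiator-Target\[(\d+)-(\d+)\]:(\d+) nsec", text):
        out[(int(m.group(1)), int(m.group(2)))] = int(m.group(3))
    return out

print(parse_latencies(DMESG))
```

The 5 ns figure for the 0-0 pair matches the <latency initiator='0' target='0' value='5'/> element from the verified XML.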
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5137