Bug 1894053
Summary: | Don't prealloc memory for real NVDIMMs | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | Milan Zamazal <mzamazal> |
Component: | libvirt | Assignee: | Michal Privoznik <mprivozn> |
Status: | CLOSED ERRATA | QA Contact: | Jing Qi <jinqi> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 8.3 | CC: | berrange, dhildenb, hhan, imammedo, jdenemar, jinzhao, juzhang, mprivozn, virt-maint, yuhuang |
Target Milestone: | rc | Keywords: | Upstream |
Target Release: | 8.3 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | libvirt-7.0.0-1.el8 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2021-05-25 06:45:10 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | 7.0.0 |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1892800 |
Description
Milan Zamazal
2020-11-03 12:55:26 UTC
I can't reproduce it by qemu. Guest boot up in a few seconds with a 256G NVDIMM device. qemu-kvm-5.1.0-13.module+el8.3.0+8382+afc3bbea kernel-4.18.0-242.el8.x86_64 QEMU cli: # /usr/libexec/qemu-kvm \ -name 'avocado-vt-vm1' \ -sandbox on \ -machine pc,nvdimm=on \ -nodefaults \ -device VGA,bus=pci.0,addr=0x2 \ -m 30720,maxmem=3200G,slots=4 \ -object memory-backend-file,size=256G,mem-path=/dev/pmem0,share=yes,id=mem-mem1 \ -device nvdimm,id=dimm-mem1,memdev=mem-mem1 \ -smp 96,maxcpus=96,cores=48,threads=1,dies=1,sockets=2 \ -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \ -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 \ -blockdev node-name=file_image1,driver=file,aio=threads,filename=/home/kvm_autotest_root/images/rhel830-64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \ -blockdev node-name=drive_image1,driver=qcow2,cache.direct=on,cache.no-flush=off,file=file_image1 \ -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \ -device virtio-net-pci,mac=9a:cb:96:f7:25:87,id=id41v7Gq,netdev=idMRebYW,bus=pci.0,addr=0x5 \ -netdev tap,id=idMRebYW,vhost=on \ -vnc :0 \ -rtc base=utc,clock=host,driftfix=slew \ -boot menu=off,order=cdn,once=c,strict=off \ -enable-kvm -monitor stdio I was able to reproduce by libvirt. Guest takes several minutes to switch from paused to running. libvirt-client-6.6.0-8.module+el8.3.1+8648+130818f2.x86_64 qemu-kvm-5.1.0-13.module+el8.3.0+8382+afc3bbea kernel-4.18.0-242.el8.x86_64 Thank you for checking. Looking into my QEMU command line, I can see libvirt adds prealloc=yes to the memory-backend-file object. Can it be the reason? (In reply to Milan Zamazal from comment #3) > Thank you for checking. Looking into my QEMU command line, I can see libvirt > adds prealloc=yes to the memory-backend-file object. Can it be the reason? Yeah, you are right. If add prealloc=yes, it takes minutes for qemu to start and powerdown guest. I guess it would take time to prealloc, but not sure how much time it's supposed to be. Amnon - this looks like something for someone on your team dealing with the time it takes to perform preallocation for a memory-backend-file Is it possible that libvirt should not use .prealloc attribute for memory-backend-* that corresponds to '-device nvdimm'? It looks like for cases when nvdimm is backed by a real block device (/dev/pmem0) like in this case then .prealloc is just burning CPU cycles. But what about the case when it is backed by a regular file, say /tmp/nvdimm.raw? Particularly, what happens if the backend is just a regular file and the host runs out of free space on the disk? Even if /tmp/nvdimm.raw was the correct apparent size (to match the requested nvdimm size), BUT was sparse and during the guest run the host runs out of free space on the disk? My understanding of mmap() is that the write() would not be mirrored into the file, silently (!). IMHO, .prealloc shouldn't blindly be used unless explicitly specified by the user, especially not for real NVDIMMs or virtio-pmem devices. Is there any history why we are doing that? I guess some use cases for !NVDIMM might already be relying on this behaviour now. Side note: using !PMEM memory as a memory backend for NVDIMMs is dangerous - we don't have any kind of syncs working any more when using files that are not on PMEM themselves. If the guest crashes, we have data loss. This is the whole reason why virtio-pmem came into existance. So what we currently have in libvirt (since v5.0.0-rc1~48) is <pmem/> element that is used like this: <memory model='nvdimm' access='private'> <source> <path>/tmp/nvdimm</path> <pmem/> </source> <target> <size unit='KiB'>523264</size> <node>0</node> </target> <address type='dimm' slot='0'/> </memory> And it controls .pmem attribute of corresponding memory-backend-file: -object memory-backend-file,id=memnvdimm0,mem-path=/tmp/nvdimm,share=no,prealloc=yes,size=536870912,pmem=yes \ -device nvdimm,node=0,memdev=memnvdimm0,id=nvdimm0,slot=0 \ And reading the commit that added the attribute in QEMU (v3.1.0-rc0~126^2~3) sounds like it's exactly what we want. pmem should be used when the backend is real NVDIMM (in which case we don't want prealloc) and if it's just a regular file (does somebody even use it like that?) then we want prealloc. The reason libvirt used prealloc is that it's better if guest fails to start rather than if was to encounter problems later. For instance, yesterday I've experimented with sparse file as NVDIMM. I've created a 128MB loopback device, created a filesystem there and created a sparse file with 2GB apparent size. Then I've turned off prealloc and started a guest. Then, in the guest I've started writing into that NVDIMM (/dev/pmem0) and as soon as it hit 128MB mark (minus something used by FS itself) the 'dd' stopped. I was unable to kill it, though it wasn't in D state. On the host, qemu started consuming 100% of a core. I'm not saying it is a bug; but when I turned prealloc on, then qemu found the sparse file is not able to store 2GB and refused to start. Therefore, I think we actually want prealloc = !pmem. I agree that for "ordinary files" prealloc might make sense. I agree that for PMEM, prealloc does not really make sense. Note the the use case of "sparse files for NVDIMM" is far from realistic. However, for hugetlbfs/tmpfs/shmem as used for DIMMs, it destroys memory overcommit. That should clearly be a configuration knob instead of blindly adding "prealloc". How is prealloc handled with DIMMs? Overcomit is not really something we want, is it? If you configure your guest to use hugepges then prealloc is what you want, esp. because of the way how hugepages work (you have to allocate a pool, then in another step allocate pages from the pool -> it's not atomic). I'd say for memory-backend-file you almost in call cases (except for pmem) want prealloc. I'm not fully convinced about -ram/-memfd. Of course, with virtio-mem even memory-backend-file rule can be relaxed and not use prealloc. With NVDIMMs/virtio-pmem, overcommiting memory in ordinary filesystem is unlikely. I think we agree on that. Prealloc makes sense for files, prealloc does not make sense for PMEM. With virtio-mem, the size of the memory backend is not an indication if we're overcommitting memory, as the actually consumed size depends on other factors - prealloc is a waste of time as we're discarding preallcoated memory right away. (BTW: I'm planning a virtio-mem prealloc option which tries to prealloc memory before plugging new blocks, and can indicate temporary -ENOMEM to the user, so crashes are avoided as you describe them). Using free page reporting, any memory on hotplugged DIMMs will immediately be reported+discarded in QEMU again. So it does not really make sense preallocating is a waste of time. I tend to agree that for hotplugging DIMMs (even when using ordinary balloon inflation/deflation) makes sense - we don't expect to hotplug a DIMM and inflate the balloon immediately; we assume the guest needs more memory and is going to use it. So prealloc should be disabled with - PMEM - virtio-mem - "free page reporting" Does that sound like what you have in mind? (In reply to David Hildenbrand from comment #11) > So prealloc should be disabled with > - PMEM > - virtio-mem > - "free page reporting" > > Does that sound like what you have in mind? Yes, except I forgot about "free page reporting". So I guess this bug should be switched to libvirt and also, we should make sure that RHEV passes <pmem/> (@Milan - should we file a bug for that?) (In reply to Michal Privoznik from comment #12) > and also, we should make sure that RHEV passes <pmem/> (@Milan - should we file a bug for that?) As I understand the discussion above, we should pass <pmem/> unconditionally, regardless of the backend NVDIMM mode. This is a trivial change without a need for tracking so we don't need a bug for it. And I'll post it right now to not forget about it. Patch proposed upstream: https://www.redhat.com/archives/libvir-list/2020-November/msg01568.html The discussion to my patch lead us to think that this should be fixed in qemu instead, for instance: https://www.redhat.com/archives/libvir-list/2020-November/msg01577.html Switching back to QEMU. Amnon - back to your team. Is there any idea how to proceed with the bug on the QEMU side? What can RHV expect regarding NVDIMM usability? I think the desired QEMU fix is to basically ignore the "prealloc" parameter in QEMU with "-memory-backend-file ...,pmem=on" ... because there is nothing to preallocate with a real pmem. CCing Igor Igor, what do you think about David's idea? Could it be a simple fix in QEMU? (In reply to Milan Zamazal from comment #20) > Igor, what do you think about David's idea? Could it be a simple fix in QEMU? Ah, forgot to mention it here (I replied to mail thread mentioned in comment 16). It seems to be an easy way out but I'm not convinced doing it in QEMU is a good idea (read more on it below). pmem=on doesn't guarantee that backend file is on NVDIMM storage, it could be anything including RAM (conventional or hugepages). I'm of opinion that QEMU should not ignore options explicitly set by user (by default prealloc is off), and if we decide that prealloc should not be used with pmem=on, QEMU should error out instead of ignoring requested option. That will cause compatibility issues with old libvirt. I could make it a warning on QEMU side, so user would now that might run sub-optimal configuration. Bottom line, we should move this BZ to back libvirt and fix it there, so it won't use prealloc=on if backend's file is on NVDIMM storage. (QEMU is complicated enough already, and I'd avoid implementing workarounds for the sake of user (in this case defaulting to prealloc=on) using non optimal CLI). PS: I've looked into if we can use libpmem to check if opened file is on pmem enabled storage, it would be non trivial change on QEMU side (reverting consolidation between file and memfd allocation routines and adding back some code duplication). I'm not convinced that QEMU should ignore explicitly set options at all, bothering with this detection probably is not worthwhile. (In reply to Igor Mammedov from comment #21) > (In reply to Milan Zamazal from comment #20) > > Igor, what do you think about David's idea? Could it be a simple fix in QEMU? > > Ah, forgot to mention it here (I replied to mail thread mentioned in comment > 16). > > It seems to be an easy way out but I'm not convinced doing it in QEMU > is a good idea (read more on it below). > > pmem=on doesn't guarantee that backend file is on NVDIMM storage, > it could be anything including RAM (conventional or hugepages). I don't fully agree. Diving into QEMU details: "RAM_PMEM: the backend @mem_path or @fd is persistent memory" and "/* RAM is a persistent kind memory */" Conventional files as NVDIMM backend are dangerous and not used in practice. We don't consider them as "pmem". Hugepages are not persistent. (In reply to David Hildenbrand from comment #22) > (In reply to Igor Mammedov from comment #21) > > (In reply to Milan Zamazal from comment #20) > > > Igor, what do you think about David's idea? Could it be a simple fix in QEMU? > > > > Ah, forgot to mention it here (I replied to mail thread mentioned in comment > > 16). > > > > It seems to be an easy way out but I'm not convinced doing it in QEMU > > is a good idea (read more on it below). > > > > pmem=on doesn't guarantee that backend file is on NVDIMM storage, > > it could be anything including RAM (conventional or hugepages). > > I don't fully agree. Diving into QEMU details: > > "RAM_PMEM: the backend @mem_path or @fd is persistent memory" > > and > > "/* RAM is a persistent kind memory */" > > Conventional files as NVDIMM backend are dangerous and not used in practice. > We don't consider them as "pmem". Hugepages are not persistent. still it can be used with pmem=on, even if you could detect that file is actual pmem, I'd rather error out than ignore prealloc option. If one doesn't need QEMU performing preallocation, one shouldn't ask for it to begin with. (In reply to Igor Mammedov from comment #23) > (In reply to David Hildenbrand from comment #22) > > (In reply to Igor Mammedov from comment #21) > > > (In reply to Milan Zamazal from comment #20) > > > > Igor, what do you think about David's idea? Could it be a simple fix in QEMU? > > > > > > Ah, forgot to mention it here (I replied to mail thread mentioned in comment > > > 16). > > > > > > It seems to be an easy way out but I'm not convinced doing it in QEMU > > > is a good idea (read more on it below). > > > > > > pmem=on doesn't guarantee that backend file is on NVDIMM storage, > > > it could be anything including RAM (conventional or hugepages). > > > > I don't fully agree. Diving into QEMU details: > > > > "RAM_PMEM: the backend @mem_path or @fd is persistent memory" > > > > and > > > > "/* RAM is a persistent kind memory */" > > > > Conventional files as NVDIMM backend are dangerous and not used in practice. > > We don't consider them as "pmem". Hugepages are not persistent. > > still it can be used with pmem=on, > even if you could detect that file is actual pmem, I'd rather error out > than ignore prealloc option. If one doesn't need QEMU performing > preallocation, > one shouldn't ask for it to begin with. Take for example cpu features, if user turns off/on some of them guest might slowdown, QEMU doesn't magically ignore user provided CLI, it does what user has asked for. That keeps options behaviour consistent regardless of host it runs on. There is also maintenance aspect to it. From my small experience in cases where QEMU used to override/ignore user input later on we had to refactor that code to drop it and put a pile of compat glue on top to keep old behaviour. (In reply to Igor Mammedov from comment #24) > (In reply to Igor Mammedov from comment #23) > > (In reply to David Hildenbrand from comment #22) > > > (In reply to Igor Mammedov from comment #21) > > > > (In reply to Milan Zamazal from comment #20) > > > > > Igor, what do you think about David's idea? Could it be a simple fix in QEMU? > > > > > > > > Ah, forgot to mention it here (I replied to mail thread mentioned in comment > > > > 16). > > > > > > > > It seems to be an easy way out but I'm not convinced doing it in QEMU > > > > is a good idea (read more on it below). > > > > > > > > pmem=on doesn't guarantee that backend file is on NVDIMM storage, > > > > it could be anything including RAM (conventional or hugepages). > > > > > > I don't fully agree. Diving into QEMU details: > > > > > > "RAM_PMEM: the backend @mem_path or @fd is persistent memory" > > > > > > and > > > > > > "/* RAM is a persistent kind memory */" > > > > > > Conventional files as NVDIMM backend are dangerous and not used in practice. > > > We don't consider them as "pmem". Hugepages are not persistent. > > > > still it can be used with pmem=on, > > even if you could detect that file is actual pmem, I'd rather error out > > than ignore prealloc option. If one doesn't need QEMU performing > > preallocation, > > one shouldn't ask for it to begin with. > > Take for example cpu features, if user turns off/on some of them guest might > slowdown, QEMU doesn't magically ignore user provided CLI, it does what user > has asked for. That keeps options behaviour consistent regardless of host > it runs on. > There is also maintenance aspect to it. From my small experience in cases > where QEMU used to override/ignore user input later on we had to refactor > that code to drop it and put a pile of compat glue on top to keep old > behaviour. I fully agree. However, Daniel suggested that instead of the libvirt fix upstream: https://www.redhat.com/archives/libvir-list/2020-November/msg01577.html So do we have an agreement to 1. Fix it in libvirt: don't prealloc 2. Try detecting in QEMU and bailing out (when used with pmem=on? detecting if it's actually real pmem?) (In reply to David Hildenbrand from comment #25) > (In reply to Igor Mammedov from comment #24) > > (In reply to Igor Mammedov from comment #23) > > > (In reply to David Hildenbrand from comment #22) > > > > (In reply to Igor Mammedov from comment #21) > > > > > (In reply to Milan Zamazal from comment #20) > > > > > > Igor, what do you think about David's idea? Could it be a simple fix in QEMU? > > > > > > > > > > Ah, forgot to mention it here (I replied to mail thread mentioned in comment > > > > > 16). > > > > > > > > > > It seems to be an easy way out but I'm not convinced doing it in QEMU > > > > > is a good idea (read more on it below). > > > > > > > > > > pmem=on doesn't guarantee that backend file is on NVDIMM storage, > > > > > it could be anything including RAM (conventional or hugepages). > > > > > > > > I don't fully agree. Diving into QEMU details: > > > > > > > > "RAM_PMEM: the backend @mem_path or @fd is persistent memory" > > > > > > > > and > > > > > > > > "/* RAM is a persistent kind memory */" > > > > > > > > Conventional files as NVDIMM backend are dangerous and not used in practice. > > > > We don't consider them as "pmem". Hugepages are not persistent. > > > > > > still it can be used with pmem=on, > > > even if you could detect that file is actual pmem, I'd rather error out > > > than ignore prealloc option. If one doesn't need QEMU performing > > > preallocation, > > > one shouldn't ask for it to begin with. > > > > Take for example cpu features, if user turns off/on some of them guest might > > slowdown, QEMU doesn't magically ignore user provided CLI, it does what user > > has asked for. That keeps options behaviour consistent regardless of host > > it runs on. > > There is also maintenance aspect to it. From my small experience in cases > > where QEMU used to override/ignore user input later on we had to refactor > > that code to drop it and put a pile of compat glue on top to keep old > > behaviour. > > I fully agree. However, Daniel suggested that instead of the libvirt fix > upstream: > > https://www.redhat.com/archives/libvir-list/2020-November/msg01577.html I saw that and replied on that thread. > So do we have an agreement to > > 1. Fix it in libvirt: don't prealloc yes > 2. Try detecting in QEMU and bailing out (when used with pmem=on? detecting > if it's actually real pmem?) I'd like to error out but I see 2 issues with it: 2.1 it will break existing guests that run just fine (for non large VMs, users probably won't even notice slow down) (so it's more compat code to keep it working) 2.2 for detecting if backend file is actually pmem, we would need to use pmem_is_pmem() which requires file being opened and mapped with pmem_map_file(). see comment 21 (PS) why I'd not go there if possible. 3d option: For QEMU side, I suggested to print warning when prealloc + pmem is in use together. That way if user is bothered enough to read logs, it would help to pinpoint possible issue. Since this was proposed by Dan, I think we need his input too. Go ahead an send an updated patch and we'll continue discussion on list. v2: https://www.redhat.com/archives/libvir-list/2021-January/msg00109.html Let's see how discussion goes and I'll switch this back to libvirt then. Merged upstream as: bf14a9be1e qemu: Don't prealloc mem for real NVDIMMs v6.10.0-290-gbf14a9be1e Verified with libvirt upstream code version v7.0.0-rc1 & qemu-kvm-5.1.0-17.module+el8.3.1+9213+7ace09c3.x86_64 Start vm with below xml - <memory model='nvdimm' access='shared'> <source> <path>/dev/dax0.0</path> <alignsize unit='KiB'>2048</alignsize> <pmem/> </source> <target> <size unit='KiB'>262144000</size> <node>0</node> <label> <size unit='KiB'>128</size> </label> </target> <address type='dimm' slot='0'/> </memory> From the qemu-cmd line, there is no "prealloc" and there is no long waiting time when issuing "start vm" command. -object memory-backend-file,id=memnvdimm0,mem-path=/dev/dax0.0,share=yes,size=268435456000,align=2097152,pmem=yes -device nvdimm,node=0,label-size=131072,memdev=memnvdimm0,id=nvdimm0,slot=0 Verified in versions: libvirt-daemon-7.0.0-1.module+el8.4.0+9464+3e71831a.x86_64 qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f.x86_64 S1. Start vm with pmem set in nvdimm - <memory model='nvdimm' access='shared'> <source> <path>/dev/dax0.0</path> <alignsize unit='KiB'>2048</alignsize> <pmem/> </source> <target> <size unit='KiB'>262144000</size> <node>0</node> <label> <size unit='KiB'>128</size> </label> </target> <address type='dimm' slot='0'/> </memory> From the qemu-cmd line, there is no "prealloc" and there is no long waiting time when issuing "start vm" command. -object memory-backend-file,id=memnvdimm0,mem-path=/dev/dax0.0,share=yes,size=268435456000,align=2097152,pmem=yes -device nvdimm,node=0,label-size=131072,memdev=memnvdimm0,id=nvdimm0,slot=0 S2. Start vm with pmem set in an emulated nvdimm <memory model='nvdimm' access='shared'> <source> <path>/tmp/nvdimm0</path> <alignsize unit='KiB'>2048</alignsize> <pmem/> </source> <target> <size unit='KiB'>524288</size> <node>0</node> <label> <size unit='KiB'>128</size> </label> </target> <address type='dimm' slot='0'/> </memory> @Michal, There is also no prealloc in the cmd line, and there is no error message with the <pmem/> used with an emulated file. And I didn't see any different with prealloc and without prealloc. Is it as expected? Thanks! -object memory-backend-file,id=memnvdimm0,mem-path=/tmp/nvdimm0,share=yes,size=536870912,align=2097152,pmem=yes -device nvdimm,node=0,label-size=131072,memdev=memnvdimm0,id=nvdimm0,slot=0 (In reply to Jing Qi from comment #35) > Verified in versions: > libvirt-daemon-7.0.0-1.module+el8.4.0+9464+3e71831a.x86_64 > qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f.x86_64 > @Michal, > There is also no prealloc in the cmd line, and there is no error message > with the <pmem/> used with an emulated file. And I didn't see any different > with prealloc and without prealloc. Is it as expected? Thanks! Yes, <pmem/> should be added if the path is a real device. Libvirt doesn't examine the path, it doesn't know whether it is a real NVDIMM or not. Therefore, prealloc is turned off if <pmem/> is present, and it's turned on if <pmem/> is NOT present. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2098 |