Bug 1894053

Summary:	Don't prealloc memory for real NVDIMMs
Product:	Red Hat Enterprise Linux Advanced Virtualization	Reporter:	Milan Zamazal <mzamazal>
Component:	libvirt	Assignee:	Michal Privoznik <mprivozn>
Status:	CLOSED ERRATA	QA Contact:	Jing Qi <jinqi>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	8.3	CC:	berrange, dhildenb, hhan, imammedo, jdenemar, jinzhao, juzhang, mprivozn, virt-maint, yuhuang
Target Milestone:	rc	Keywords:	Upstream
Target Release:	8.3
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	libvirt-7.0.0-1.el8	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-05-25 06:45:10 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:	7.0.0
Embargoed:
Bug Depends On:
Bug Blocks:	1892800

Description Milan Zamazal 2020-11-03 12:55:26 UTC

Description of problem:

When a VM is started, via libvirt, with an NVDIMM device attached, e.g.

  <memory access="shared" model="nvdimm">
      <source>
          <path>/dev/pmem0</path>
          <alignsize unit="KiB">131072</alignsize>
      </source>
      <target>
          <size unit="KiB">5111808</size>
          <node>0</node>
          <label>
              <size unit="KiB">128</size>
          </label>
      </target>
  </memory>

it takes a long time before libvirt switches the VM to running.  For the small ~5 GB NVDIMM above it takes about 19 s on my laptop, for a 256 GB NVDIMM it takes 7-10 minutes on a server hardware (see BZ 1892800).

I can see in libvirt logs that libvirt waits during that time on an initial QEMU monitor query:

  2020-11-02 15:04:41.169+0000: 3384: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7efc1405a030 msg={"execute":"qmp_capabilities","id":"libvirt-1"}
   fd=-1
  ...
  2020-11-02 15:05:00.383+0000: 3384: debug : qemuDomainObjExitMonitorInternal:5520 : Exited monitor (mon=0x7efc1405a030 vm=0x55da46fd46a0 name=centos)

The same VMs without NVDIMMs start in a few seconds.

Version-Release number of selected component (if applicable):

qemu-kvm-5.1.0-13.module+el8.3.0+8382+afc3bbea.x86_64
libvirt-daemon-kvm-6.6.0-6.module+el8.3.0+8125+aefcf088.x86_64
kernel-4.18.0-240.el8.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Start a VM with an NVDIMM device using libvirt.
2. Look how long it takes to get the VM switched from paused to running in libvirt.
3. Compare with starting the same VM without NVDIMM.

Actual results:

A VM with NVDIMM takes a long time to start, e.g. more than 7 minutes with a 256 GB NVDIMM.

Expected results:

The VM starts without long delays.

Additional info:

BZ 1892800

Comment 1 Yumei Huang 2020-11-11 10:02:32 UTC

I can't reproduce it by qemu. Guest boot up in a few seconds with a 256G NVDIMM device. 

qemu-kvm-5.1.0-13.module+el8.3.0+8382+afc3bbea
kernel-4.18.0-242.el8.x86_64

QEMU cli:
# /usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine pc,nvdimm=on  \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2 \
    -m 30720,maxmem=3200G,slots=4 \
    -object memory-backend-file,size=256G,mem-path=/dev/pmem0,share=yes,id=mem-mem1 \
    -device nvdimm,id=dimm-mem1,memdev=mem-mem1  \
    -smp 96,maxcpus=96,cores=48,threads=1,dies=1,sockets=2  \
    -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 \
    -blockdev node-name=file_image1,driver=file,aio=threads,filename=/home/kvm_autotest_root/images/rhel830-64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
    -device virtio-net-pci,mac=9a:cb:96:f7:25:87,id=id41v7Gq,netdev=idMRebYW,bus=pci.0,addr=0x5  \
    -netdev tap,id=idMRebYW,vhost=on \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm -monitor stdio

Comment 2 Yumei Huang 2020-11-11 10:34:53 UTC

I was able to reproduce by libvirt. Guest takes several minutes to switch from paused to running.

libvirt-client-6.6.0-8.module+el8.3.1+8648+130818f2.x86_64
qemu-kvm-5.1.0-13.module+el8.3.0+8382+afc3bbea
kernel-4.18.0-242.el8.x86_64

Comment 3 Milan Zamazal 2020-11-11 11:04:50 UTC

Thank you for checking. Looking into my QEMU command line, I can see libvirt adds prealloc=yes to the memory-backend-file object. Can it be the reason?

Comment 4 Yumei Huang 2020-11-12 02:31:31 UTC

(In reply to Milan Zamazal from comment #3)
> Thank you for checking. Looking into my QEMU command line, I can see libvirt
> adds prealloc=yes to the memory-backend-file object. Can it be the reason?

Yeah, you are right. If add prealloc=yes, it takes minutes for qemu to start and powerdown guest. I guess it would take time to prealloc, but not sure how much time it's supposed to be.

Comment 5 John Ferlan 2020-11-12 18:01:49 UTC

Amnon - this looks like something for someone on your team dealing with the time it takes to perform preallocation for a memory-backend-file

Comment 6 Michal Privoznik 2020-11-23 11:24:10 UTC

Is it possible that libvirt should not use .prealloc attribute for memory-backend-* that corresponds to '-device nvdimm'? It looks like for cases when nvdimm is backed by a real block device (/dev/pmem0) like in this case then .prealloc is just burning CPU cycles. But what about the case when it is backed by a regular file, say /tmp/nvdimm.raw? Particularly, what happens if the backend is just a regular file and the host runs out of free space on the disk? Even if /tmp/nvdimm.raw was the correct apparent size (to match the requested nvdimm size), BUT was sparse and during the guest run the host runs out of free space on the disk? My understanding of mmap() is that the write() would not be mirrored into the file, silently (!).

Comment 7 David Hildenbrand 2020-11-23 20:00:56 UTC

IMHO, .prealloc shouldn't blindly be used unless explicitly specified by the user, especially not for real NVDIMMs or virtio-pmem devices.

Is there any history why we are doing that? I guess some use cases for !NVDIMM might already be relying on this behaviour now.

Side note: using !PMEM memory as a memory backend for NVDIMMs is dangerous - we don't have any kind of syncs working any more when using files that are not on PMEM themselves. If the guest crashes, we have data loss. This is the whole reason why virtio-pmem came into existance.

Comment 8 Michal Privoznik 2020-11-24 08:21:33 UTC

So what we currently have in libvirt (since v5.0.0-rc1~48) is <pmem/> element that is used like this:

    <memory model='nvdimm' access='private'>
      <source>
        <path>/tmp/nvdimm</path>
        <pmem/>
      </source>
      <target>
        <size unit='KiB'>523264</size>
        <node>0</node>
      </target>
      <address type='dimm' slot='0'/>
    </memory>


And it controls .pmem attribute of corresponding memory-backend-file:

-object memory-backend-file,id=memnvdimm0,mem-path=/tmp/nvdimm,share=no,prealloc=yes,size=536870912,pmem=yes \
-device nvdimm,node=0,memdev=memnvdimm0,id=nvdimm0,slot=0 \

And reading the commit that added the attribute in QEMU (v3.1.0-rc0~126^2~3) sounds like it's exactly what we want. pmem should be used when the backend is real NVDIMM (in which case we don't want prealloc) and if it's just a regular file (does somebody even use it like that?) then we want prealloc.

The reason libvirt used prealloc is that it's better if guest fails to start rather than if was to encounter problems later. For instance, yesterday I've experimented with sparse file as NVDIMM. I've created a 128MB loopback device, created a filesystem there and created a sparse file with 2GB apparent size. Then I've turned off prealloc and started a guest. Then, in the guest I've started writing into that NVDIMM (/dev/pmem0) and as soon as it hit 128MB mark (minus something used by FS itself) the 'dd' stopped. I was unable to kill it, though it wasn't in D state. On the host, qemu started consuming 100% of a core. I'm not saying it is a bug; but when I turned prealloc on, then qemu found the sparse file is not able to store 2GB and refused to start.

Therefore, I think we actually want prealloc = !pmem.

Comment 9 David Hildenbrand 2020-11-24 13:09:58 UTC

I agree that for "ordinary files" prealloc might make sense. I agree that for PMEM, prealloc does not really make sense. Note the the use case of "sparse files for NVDIMM" is far from realistic.

However, for hugetlbfs/tmpfs/shmem as used for DIMMs, it destroys memory overcommit. That should clearly be a configuration knob instead of blindly adding "prealloc". How is prealloc handled with DIMMs?

Comment 10 Michal Privoznik 2020-11-24 15:49:29 UTC

Overcomit is not really something we want, is it? If you configure your guest to use hugepges then prealloc is what you want, esp. because of the way how hugepages work (you have to allocate a pool, then in another step allocate pages from the pool -> it's not atomic). I'd say for memory-backend-file you almost in call cases (except for pmem) want prealloc. I'm not fully convinced about -ram/-memfd. Of course, with virtio-mem even memory-backend-file rule can be relaxed and not use prealloc.

Comment 11 David Hildenbrand 2020-11-25 08:32:32 UTC

With NVDIMMs/virtio-pmem, overcommiting memory in ordinary filesystem is unlikely. I think we agree on that. Prealloc makes sense for files, prealloc does not make sense for PMEM.

With virtio-mem, the size of the memory backend is not an indication if we're overcommitting memory, as the actually consumed size depends on other factors - prealloc is a waste of time as we're discarding preallcoated memory right away. (BTW: I'm planning a virtio-mem prealloc option which tries to prealloc memory before plugging new blocks, and can indicate temporary -ENOMEM to the user, so crashes are avoided as you describe them).

Using free page reporting, any memory on hotplugged DIMMs will immediately be reported+discarded in QEMU again. So it does not really make sense preallocating is a waste of time.

I tend to agree that for hotplugging DIMMs (even when using ordinary balloon inflation/deflation) makes sense - we don't expect to hotplug a DIMM and inflate the balloon immediately; we assume the guest needs more memory and is going to use it.

So prealloc should be disabled with
- PMEM
- virtio-mem
- "free page reporting"

Does that sound like what you have in mind?

Comment 12 Michal Privoznik 2020-11-25 20:50:27 UTC

(In reply to David Hildenbrand from comment #11)

> So prealloc should be disabled with
> - PMEM
> - virtio-mem
> - "free page reporting"
> 
> Does that sound like what you have in mind?

Yes, except I forgot about "free page reporting".
So I guess this bug should be switched to libvirt and also, we should make sure that RHEV passes <pmem/> (@Milan - should we file a bug for that?)

Comment 13 Milan Zamazal 2020-11-25 21:06:33 UTC

(In reply to Michal Privoznik from comment #12)

> and also, we should make sure that RHEV passes <pmem/> (@Milan - should we file a bug for that?)

As I understand the discussion above, we should pass <pmem/> unconditionally, regardless of the backend NVDIMM mode. This is a trivial change without a need for tracking so we don't need a bug for it. And I'll post it right now to not forget about it.

Comment 15 Michal Privoznik 2020-11-30 10:06:52 UTC

Patch proposed upstream:

https://www.redhat.com/archives/libvir-list/2020-November/msg01568.html

Comment 16 Michal Privoznik 2020-11-30 13:10:44 UTC

The discussion to my patch lead us to think that this should be fixed in qemu instead, for instance:

https://www.redhat.com/archives/libvir-list/2020-November/msg01577.html

Switching back to QEMU.

Comment 17 John Ferlan 2020-11-30 16:38:23 UTC

Amnon - back to your team.

Comment 18 Milan Zamazal 2020-12-10 11:11:35 UTC

Is there any idea how to proceed with the bug on the QEMU side? What can RHV expect regarding NVDIMM usability?

Comment 19 David Hildenbrand 2020-12-10 16:24:40 UTC

I think the desired QEMU fix is to basically ignore the "prealloc" parameter in QEMU with "-memory-backend-file ...,pmem=on" ... because there is nothing to preallocate with a real pmem.

CCing Igor

Comment 20 Milan Zamazal 2020-12-16 12:16:09 UTC

Igor, what do you think about David's idea? Could it be a simple fix in QEMU?

Comment 21 Igor Mammedov 2020-12-17 10:07:34 UTC

(In reply to Milan Zamazal from comment #20)
> Igor, what do you think about David's idea? Could it be a simple fix in QEMU?

Ah, forgot to mention it here (I replied to mail thread mentioned in comment 16).

It seems to be an easy way out but I'm not convinced doing it in QEMU
is a good idea (read more on it below). 

pmem=on doesn't guarantee that backend file is on NVDIMM storage,
it could be anything including RAM (conventional or hugepages).

I'm of opinion that QEMU should not ignore options explicitly set by user
(by default prealloc is off), and if we decide that prealloc should not be used
with pmem=on, QEMU should error out instead of ignoring requested option.
That will cause compatibility issues with old libvirt. I could make it
a warning on QEMU side, so user would now that might run sub-optimal
configuration.

Bottom line,
we should move this BZ to back libvirt and fix it there,
so it won't use prealloc=on if backend's file is on NVDIMM storage.

(QEMU is complicated enough already, and I'd avoid implementing
workarounds for the sake of user (in this case defaulting to prealloc=on) using
non optimal CLI).

PS:
I've looked into if we can use libpmem to check if opened file is on
pmem enabled storage, it would be non trivial change on QEMU side
(reverting consolidation between file and memfd allocation routines and
adding back some code duplication). I'm not convinced that QEMU should
ignore explicitly set options at all, bothering with this detection
probably is not worthwhile.

Comment 22 David Hildenbrand 2020-12-17 10:13:31 UTC

(In reply to Igor Mammedov from comment #21)
> (In reply to Milan Zamazal from comment #20)
> > Igor, what do you think about David's idea? Could it be a simple fix in QEMU?
> 
> Ah, forgot to mention it here (I replied to mail thread mentioned in comment
> 16).
> 
> It seems to be an easy way out but I'm not convinced doing it in QEMU
> is a good idea (read more on it below). 
> 
> pmem=on doesn't guarantee that backend file is on NVDIMM storage,
> it could be anything including RAM (conventional or hugepages).

I don't fully agree. Diving into QEMU details:

"RAM_PMEM: the backend @mem_path or @fd is persistent memory"

and

"/* RAM is a persistent kind memory */"

Conventional files as NVDIMM backend are dangerous and not used in practice. We don't consider them as "pmem". Hugepages are not persistent.

Comment 23 Igor Mammedov 2020-12-17 11:46:23 UTC

(In reply to David Hildenbrand from comment #22)
> (In reply to Igor Mammedov from comment #21)
> > (In reply to Milan Zamazal from comment #20)
> > > Igor, what do you think about David's idea? Could it be a simple fix in QEMU?
> > 
> > Ah, forgot to mention it here (I replied to mail thread mentioned in comment
> > 16).
> > 
> > It seems to be an easy way out but I'm not convinced doing it in QEMU
> > is a good idea (read more on it below). 
> > 
> > pmem=on doesn't guarantee that backend file is on NVDIMM storage,
> > it could be anything including RAM (conventional or hugepages).
> 
> I don't fully agree. Diving into QEMU details:
> 
> "RAM_PMEM: the backend @mem_path or @fd is persistent memory"
> 
> and
> 
> "/* RAM is a persistent kind memory */"
> 
> Conventional files as NVDIMM backend are dangerous and not used in practice.
> We don't consider them as "pmem". Hugepages are not persistent.

still it can be used with pmem=on,
even if you could detect that file is actual pmem, I'd rather error out
than ignore prealloc option. If one doesn't need QEMU performing preallocation,
one shouldn't ask for it to begin with.

Comment 24 Igor Mammedov 2020-12-17 12:02:21 UTC

(In reply to Igor Mammedov from comment #23)
> (In reply to David Hildenbrand from comment #22)
> > (In reply to Igor Mammedov from comment #21)
> > > (In reply to Milan Zamazal from comment #20)
> > > > Igor, what do you think about David's idea? Could it be a simple fix in QEMU?
> > > 
> > > Ah, forgot to mention it here (I replied to mail thread mentioned in comment
> > > 16).
> > > 
> > > It seems to be an easy way out but I'm not convinced doing it in QEMU
> > > is a good idea (read more on it below). 
> > > 
> > > pmem=on doesn't guarantee that backend file is on NVDIMM storage,
> > > it could be anything including RAM (conventional or hugepages).
> > 
> > I don't fully agree. Diving into QEMU details:
> > 
> > "RAM_PMEM: the backend @mem_path or @fd is persistent memory"
> > 
> > and
> > 
> > "/* RAM is a persistent kind memory */"
> > 
> > Conventional files as NVDIMM backend are dangerous and not used in practice.
> > We don't consider them as "pmem". Hugepages are not persistent.
> 
> still it can be used with pmem=on,
> even if you could detect that file is actual pmem, I'd rather error out
> than ignore prealloc option. If one doesn't need QEMU performing
> preallocation,
> one shouldn't ask for it to begin with.

Take for example cpu features, if user turns off/on some of them guest might
slowdown, QEMU doesn't magically ignore user provided CLI, it does what user
has asked for. That keeps options behaviour consistent regardless of host
it runs on.
There is also maintenance aspect to it. From my small experience in cases
where QEMU used to override/ignore user input later on we had to refactor
that code to drop it and put a pile of compat glue on top to keep old
behaviour.

Comment 25 David Hildenbrand 2020-12-17 19:09:41 UTC

(In reply to Igor Mammedov from comment #24)
> (In reply to Igor Mammedov from comment #23)
> > (In reply to David Hildenbrand from comment #22)
> > > (In reply to Igor Mammedov from comment #21)
> > > > (In reply to Milan Zamazal from comment #20)
> > > > > Igor, what do you think about David's idea? Could it be a simple fix in QEMU?
> > > > 
> > > > Ah, forgot to mention it here (I replied to mail thread mentioned in comment
> > > > 16).
> > > > 
> > > > It seems to be an easy way out but I'm not convinced doing it in QEMU
> > > > is a good idea (read more on it below). 
> > > > 
> > > > pmem=on doesn't guarantee that backend file is on NVDIMM storage,
> > > > it could be anything including RAM (conventional or hugepages).
> > > 
> > > I don't fully agree. Diving into QEMU details:
> > > 
> > > "RAM_PMEM: the backend @mem_path or @fd is persistent memory"
> > > 
> > > and
> > > 
> > > "/* RAM is a persistent kind memory */"
> > > 
> > > Conventional files as NVDIMM backend are dangerous and not used in practice.
> > > We don't consider them as "pmem". Hugepages are not persistent.
> > 
> > still it can be used with pmem=on,
> > even if you could detect that file is actual pmem, I'd rather error out
> > than ignore prealloc option. If one doesn't need QEMU performing
> > preallocation,
> > one shouldn't ask for it to begin with.
> 
> Take for example cpu features, if user turns off/on some of them guest might
> slowdown, QEMU doesn't magically ignore user provided CLI, it does what user
> has asked for. That keeps options behaviour consistent regardless of host
> it runs on.
> There is also maintenance aspect to it. From my small experience in cases
> where QEMU used to override/ignore user input later on we had to refactor
> that code to drop it and put a pile of compat glue on top to keep old
> behaviour.

I fully agree. However, Daniel suggested that instead of the libvirt fix upstream:

https://www.redhat.com/archives/libvir-list/2020-November/msg01577.html

So do we have an agreement to

1. Fix it in libvirt: don't prealloc
2. Try detecting in QEMU and bailing out (when used with pmem=on? detecting if it's actually real pmem?)

Comment 26 Igor Mammedov 2020-12-18 11:36:02 UTC

(In reply to David Hildenbrand from comment #25)
> (In reply to Igor Mammedov from comment #24)
> > (In reply to Igor Mammedov from comment #23)
> > > (In reply to David Hildenbrand from comment #22)
> > > > (In reply to Igor Mammedov from comment #21)
> > > > > (In reply to Milan Zamazal from comment #20)
> > > > > > Igor, what do you think about David's idea? Could it be a simple fix in QEMU?
> > > > > 
> > > > > Ah, forgot to mention it here (I replied to mail thread mentioned in comment
> > > > > 16).
> > > > > 
> > > > > It seems to be an easy way out but I'm not convinced doing it in QEMU
> > > > > is a good idea (read more on it below). 
> > > > > 
> > > > > pmem=on doesn't guarantee that backend file is on NVDIMM storage,
> > > > > it could be anything including RAM (conventional or hugepages).
> > > > 
> > > > I don't fully agree. Diving into QEMU details:
> > > > 
> > > > "RAM_PMEM: the backend @mem_path or @fd is persistent memory"
> > > > 
> > > > and
> > > > 
> > > > "/* RAM is a persistent kind memory */"
> > > > 
> > > > Conventional files as NVDIMM backend are dangerous and not used in practice.
> > > > We don't consider them as "pmem". Hugepages are not persistent.
> > > 
> > > still it can be used with pmem=on,
> > > even if you could detect that file is actual pmem, I'd rather error out
> > > than ignore prealloc option. If one doesn't need QEMU performing
> > > preallocation,
> > > one shouldn't ask for it to begin with.
> > 
> > Take for example cpu features, if user turns off/on some of them guest might
> > slowdown, QEMU doesn't magically ignore user provided CLI, it does what user
> > has asked for. That keeps options behaviour consistent regardless of host
> > it runs on.
> > There is also maintenance aspect to it. From my small experience in cases
> > where QEMU used to override/ignore user input later on we had to refactor
> > that code to drop it and put a pile of compat glue on top to keep old
> > behaviour.
> 
> I fully agree. However, Daniel suggested that instead of the libvirt fix
> upstream:
> 
> https://www.redhat.com/archives/libvir-list/2020-November/msg01577.html
I saw that and replied on that thread.

> So do we have an agreement to
> 
> 1. Fix it in libvirt: don't prealloc
yes

> 2. Try detecting in QEMU and bailing out (when used with pmem=on? detecting
> if it's actually real pmem?)
I'd like to error out but I see 2 issues with it:
2.1 it will break existing guests that run just fine (for non large VMs, users probably
won't even notice slow down) (so it's more compat code to keep it working)
2.2 for detecting if backend file is actually pmem, we would need to use
pmem_is_pmem() which requires file being opened and mapped with pmem_map_file().
see comment 21 (PS) why I'd not go there if possible.

3d option:
For QEMU side, I suggested to print warning when prealloc + pmem is in use
together. That way if user is bothered enough to read logs, it would help to
pinpoint possible issue.

Comment 27 Michal Privoznik 2020-12-18 15:21:11 UTC

Since this was proposed by Dan, I think we need his input too.

Comment 28 Daniel Berrangé 2021-01-04 11:47:37 UTC

Go ahead an send an updated patch and we'll continue discussion on list.

Comment 29 Michal Privoznik 2021-01-05 11:40:53 UTC

v2:

https://www.redhat.com/archives/libvir-list/2021-January/msg00109.html

Let's see how discussion goes and I'll switch this back to libvirt then.

Comment 30 Michal Privoznik 2021-01-05 12:23:42 UTC

Merged upstream as:

bf14a9be1e qemu: Don't prealloc mem for real NVDIMMs

v6.10.0-290-gbf14a9be1e

Comment 31 Jing Qi 2021-01-11 02:56:16 UTC

Verified with libvirt upstream code version v7.0.0-rc1
& qemu-kvm-5.1.0-17.module+el8.3.1+9213+7ace09c3.x86_64

Start vm with below xml -
<memory model='nvdimm' access='shared'>
      <source>
        <path>/dev/dax0.0</path>
        <alignsize unit='KiB'>2048</alignsize>
        <pmem/>
      </source>
      <target>
        <size unit='KiB'>262144000</size>
        <node>0</node>
        <label>
          <size unit='KiB'>128</size>
        </label>
      </target>
      <address type='dimm' slot='0'/>
    </memory>

From the qemu-cmd line, there is no "prealloc" and there is no long waiting time when issuing "start vm" command.
-object memory-backend-file,id=memnvdimm0,mem-path=/dev/dax0.0,share=yes,size=268435456000,align=2097152,pmem=yes -device nvdimm,node=0,label-size=131072,memdev=memnvdimm0,id=nvdimm0,slot=0

Comment 35 Jing Qi 2021-01-18 08:44:15 UTC

Verified in versions:
libvirt-daemon-7.0.0-1.module+el8.4.0+9464+3e71831a.x86_64
qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f.x86_64

S1. Start vm with pmem set in nvdimm   -
<memory model='nvdimm' access='shared'>
      <source>
        <path>/dev/dax0.0</path>
        <alignsize unit='KiB'>2048</alignsize>
        <pmem/>
      </source>
      <target>
        <size unit='KiB'>262144000</size>
        <node>0</node>
        <label>
          <size unit='KiB'>128</size>
        </label>
      </target>
      <address type='dimm' slot='0'/>
    </memory>

From the qemu-cmd line, there is no "prealloc" and there is no long waiting time when issuing "start vm" command.
-object memory-backend-file,id=memnvdimm0,mem-path=/dev/dax0.0,share=yes,size=268435456000,align=2097152,pmem=yes -device nvdimm,node=0,label-size=131072,memdev=memnvdimm0,id=nvdimm0,slot=0

S2. Start vm with pmem set in an emulated nvdimm

 <memory model='nvdimm' access='shared'>
      <source>
        <path>/tmp/nvdimm0</path>
        <alignsize unit='KiB'>2048</alignsize>
        <pmem/>
      </source>
      <target>
        <size unit='KiB'>524288</size>
        <node>0</node>
        <label>
          <size unit='KiB'>128</size>
        </label>
      </target>
      <address type='dimm' slot='0'/>
    </memory>

@Michal,
There is also no prealloc in the cmd line, and there is no error message with the <pmem/> used with an emulated file. And I didn't see any different with prealloc and without prealloc. Is it as expected? Thanks!


-object memory-backend-file,id=memnvdimm0,mem-path=/tmp/nvdimm0,share=yes,size=536870912,align=2097152,pmem=yes -device nvdimm,node=0,label-size=131072,memdev=memnvdimm0,id=nvdimm0,slot=0

Comment 36 Michal Privoznik 2021-01-18 09:33:38 UTC

(In reply to Jing Qi from comment #35)
> Verified in versions:
> libvirt-daemon-7.0.0-1.module+el8.4.0+9464+3e71831a.x86_64
> qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f.x86_64

> @Michal,
> There is also no prealloc in the cmd line, and there is no error message
> with the <pmem/> used with an emulated file. And I didn't see any different
> with prealloc and without prealloc. Is it as expected? Thanks!

Yes, <pmem/> should be added if the path is a real device. Libvirt doesn't examine the path, it doesn't know whether it is a real NVDIMM or not. Therefore, prealloc is turned off if <pmem/> is present, and it's turned on if <pmem/> is NOT present.

Comment 38 errata-xmlrpc 2021-05-25 06:45:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2098