Bug 2177618 - Different behaviors for hotplugging dimm memory in guest with different access attr defined when there is nvdimm device plugged
Summary: Different behaviors for hotplugging dimm memory in guest with different access attr defined when there is nvdimm device plugged
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: libvirt
Version: 9.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Michal Privoznik
QA Contact: liang cong
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-03-13 07:18 UTC by liang cong
Modified: 2023-09-22 13:26 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-22 13:26:45 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments: none


Links
Red Hat Issue Tracker RHEL-7113 - Status: Migrated - Last Updated: 2023-09-22 13:26:39 UTC
Red Hat Issue Tracker RHELPLAN-151532 - Last Updated: 2023-03-13 07:18:26 UTC

Description liang cong 2023-03-13 07:18:05 UTC
Description of problem:
Hotplugging dimm memory into the guest behaves differently depending on whether the access attr is defined on the dimm, when an nvdimm device is already plugged in.

Version-Release number of selected component (if applicable):
libvirt-9.0.0-8.el9_2.x86_64
qemu-kvm-7.2.0-11.el9_2.x86_64

Guest version:
os version: RHEL9.2
kernel version: 5.14.0-284.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a 512M file
truncate -s 512M /tmp/nvdimm



2. Define and start a guest with memory, NUMA and nvdimm related config XML as below:
<maxMemory slots='16' unit='KiB'>52428800</maxMemory>
<memory unit='KiB'>2097152</memory>
<currentMemory unit='KiB'>2097152</currentMemory>
...
<numa>
      <cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='1048576' unit='KiB'/>
    </numa>
...
<memory model='nvdimm'>
      <source>
        <path>/tmp/nvdimm</path>
      </source>
      <target>
        <size unit='KiB'>524288</size>
        <node>1</node>
        <label>
          <size unit='KiB'>256</size>
        </label>
      </target>
    </memory>
...


3. Check the guest memory
[in guest]
# cat /proc/meminfo | grep MemTotal
MemTotal:        1736156 kB


4. Prepare a dimm memory device config XML with the access attr defined:
# cat memory1.xml
<memory model='dimm' access='shared'>  <!-- or access='private' -->
      <source>
        <pagesize unit='KiB'>4</pagesize>
      </source>
      <target>
        <size unit='KiB'>524288</size>
        <node>0</node>
      </target>
    </memory>

5. Hot plug the dimm memory device with the config XML from step 4
# virsh attach-device vm1 memory1.xml
Device attached successfully

6. Check the guest memory again; it has not increased.
[in guest]
# cat /proc/meminfo | grep MemTotal
MemTotal:        1736156 kB

7. Check dmesg in the guest and find the related error
[in guest]
# dmesg
...
[  198.482981] Block size [0x8000000] unaligned hotplug range: start 0x11ffc0000, size 0x20000000
[  198.483017] acpi PNP0C80:01: add_memory failed
[  198.485362] acpi PNP0C80:01: acpi_memory_enable_device() error
[  198.486377] acpi PNP0C80:01: Enumeration failure
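
For reference, the guest-physical address QEMU assigned to the hotplugged dimm can also be inspected from the host. A hedged illustration (the domain name vm1 is taken from step 5; the exact output depends on the setup):
[on host]
# virsh qemu-monitor-command vm1 --hmp 'info memory-devices'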

8. If in step 4 the memory device is instead defined without the access attr, like:
# cat memory1.xml
<memory model='dimm'>
      <source>
        <pagesize unit='KiB'>4</pagesize>
      </source>
.....


Then in step 6 the guest memory increases:
[in guest]
# cat /proc/meminfo | grep MemTotal
MemTotal:        2260444 kB



Actual results:
Hotplugging a dimm device behaves differently in the guest depending on the access attr.



Expected results:
A dimm device with access='shared' or access='private' defined should behave the same as a dimm device with no access attr defined.


Additional info:
Also checked other scenarios:
Note: the guest area memory size of the nvdimm is 524288 KiB - 256 KiB = 524032 KiB, which is not a multiple of 128M.

If the nvdimm guest area memory (total size - label size) is a multiple of 128M, e.g. with the label size set to 0 (no label size defined), 128M, 256M or 384M, then the dimm device can be hotplugged successfully in the guest no matter how the access attr is set; see the sketch below.
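
For illustration, a sketch of one such working variant of the step-2 nvdimm target, reusing the step-2 values and only changing the label size to 256M (524288 KiB - 262144 KiB = 262144 KiB guest area, a multiple of 128M); this is an illustration, not a verbatim config from the test run:
      <target>
        <size unit='KiB'>524288</size>
        <node>1</node>
        <label>
          <size unit='KiB'>262144</size>
        </label>
      </target>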


For a dimm device with no access attr defined: if the nvdimm label size is set within [0, 2M), [128M, 130M), [256M, 258M), etc., the dimm device can be hotplugged successfully in the guest.


So, per the info above, the behavior differs depending on whether the access attr is defined.

Comment 1 Michal Privoznik 2023-03-13 12:24:06 UTC
Even though I can reproduce, I'm not convinced this is a libvirt bug. The following is QMP communication between libvirt and QEMU (when hotplugging the dimm):

2023-03-13 12:10:20.472+0000: 23974: info : qemuMonitorSend:861 : QEMU_MONITOR_SEND_MSG: mon=0x7f7bd0009180 msg={"execute":"object-add","arguments":{"qom-type":"memory-backend-file","id":"memdimm1","mem-path":"/var/lib/libvirt/qemu/ram/3-fedora/dimm1","discard-data":true,"share":true,"prealloc":true,"prealloc-threads":16,"size":536870912},"id":"libvirt-428"}
 fd=-1
2023-03-13 12:10:20.472+0000: 9301: info : qemuMonitorIOWrite:366 : QEMU_MONITOR_IO_WRITE: mon=0x7f7bd0009180 buf={"execute":"object-add","arguments":{"qom-type":"memory-backend-file","id":"memdimm1","mem-path":"/var/lib/libvirt/qemu/ram/3-fedora/dimm1","discard-data":true,"share":true,"prealloc":true,"prealloc-threads":16,"size":536870912},"id":"libvirt-428"}
 len=250 ret=250 errno=0
2023-03-13 12:10:20.506+0000: 9301: debug : qemuMonitorJSONIOProcessLine:191 : Line [{"return": {}, "id": "libvirt-428"}]
2023-03-13 12:10:20.506+0000: 9301: info : qemuMonitorJSONIOProcessLine:210 : QEMU_MONITOR_RECV_REPLY: mon=0x7f7bd0009180 reply={"return": {}, "id": "libvirt-428"}
2023-03-13 12:10:20.506+0000: 23974: debug : qemuMonitorAddDeviceProps:2604 : mon:0x7f7bd0009180 vm:0x7f7bb80a0ec0 fd:35
2023-03-13 12:10:20.506+0000: 23974: info : qemuMonitorSend:861 : QEMU_MONITOR_SEND_MSG: mon=0x7f7bd0009180 msg={"execute":"device_add","arguments":{"driver":"pc-dimm","node":0,"memdev":"memdimm1","id":"dimm1","slot":1},"id":"libvirt-429"}
 fd=-1
2023-03-13 12:10:20.506+0000: 9301: info : qemuMonitorIOWrite:366 : QEMU_MONITOR_IO_WRITE: mon=0x7f7bd0009180 buf={"execute":"device_add","arguments":{"driver":"pc-dimm","node":0,"memdev":"memdimm1","id":"dimm1","slot":1},"id":"libvirt-429"}
 len=129 ret=129 errno=0
2023-03-13 12:10:20.509+0000: 9301: debug : qemuMonitorJSONIOProcessLine:191 : Line [{"timestamp": {"seconds": 1678709420, "microseconds": 509368}, "event": "ACPI_DEVICE_OST", "data": {"info": {"device": "dimm1", "source": 1, "status": 1, "slot": "1", "slot-type": "DIMM"}}}]
2023-03-13 12:10:20.509+0000: 9301: info : qemuMonitorJSONIOProcessLine:205 : QEMU_MONITOR_RECV_EVENT: mon=0x7f7bd0009180 event={"timestamp": {"seconds": 1678709420, "microseconds": 509368}, "event": "ACPI_DEVICE_OST", "data": {"info": {"device": "dimm1", "source": 1, "status": 1, "slot": "1", "slot-type": "DIMM"}}}
2023-03-13 12:10:20.509+0000: 9301: debug : qemuMonitorJSONIOProcessEvent:154 : mon=0x7f7bd0009180 obj=0x7f7bb80d8af0
2023-03-13 12:10:20.509+0000: 9301: debug : qemuMonitorEmitEvent:1069 : mon=0x7f7bd0009180 event=ACPI_DEVICE_OST
2023-03-13 12:10:20.509+0000: 9301: debug : qemuProcessHandleEvent:546 : vm=0x7f7bb80a0ec0
2023-03-13 12:10:20.509+0000: 9301: debug : qemuMonitorJSONIOProcessEvent:177 : handle ACPI_DEVICE_OST handler=0x7f7bd4368d8c data=0x7f7bb80848b0
2023-03-13 12:10:20.509+0000: 9301: debug : qemuMonitorEmitAcpiOstInfo:1347 : mon=0x7f7bd0009180, alias='dimm1', slotType='DIMM', slot='1', source='1' status=1
2023-03-13 12:10:20.509+0000: 9301: debug : qemuProcessHandleAcpiOstInfo:1291 : ACPI OST info for device dimm1 domain 0x7f7bb80a0ec0 fedora. slotType='DIMM' slot='1' source=1 status=1
2023-03-13 12:10:20.518+0000: 9301: debug : qemuMonitorJSONIOProcessLine:191 : Line [{"return": {}, "id": "libvirt-429"}]
2023-03-13 12:10:20.518+0000: 9301: info : qemuMonitorJSONIOProcessLine:210 : QEMU_MONITOR_RECV_REPLY: mon=0x7f7bd0009180 reply={"return": {}, "id": "libvirt-429"}
2023-03-13 12:10:20.518+0000: 23974: debug : qemuDomainObjExitMonitor:6254 : Exited monitor (mon=0x7f7bd0009180 vm=0x7f7bb80a0ec0 name=fedora)

Nowhere does QEMU signal any kind of problem. Therefore, I think this is a QEMU bug (the dmesg output from the guest suggests that an unaligned address was picked). Let me switch over to QEMU for further investigation.
BTW: I can reproduce even with upstream QEMU (v7.2.0-2624-g29c8a9e31a).

Comment 2 John Ferlan 2023-03-14 13:03:03 UTC
Dave or Igor - could you please take a look.

Comment 3 David Hildenbrand 2023-03-14 13:31:08 UTC
When QEMU performs address assignment in the GPA (Guest Physical Address space), it primarily considers only the alignment requirements of the memory backend (i.e., the backend page size, such as 4k by default on x86-64). Apart from that, it simply looks for the next GPA range that can fit the new memory device and adds it.

In the "normal" case, we only add DIMMs/NVDIMMs that are reasonable in size (e.g., >= 128 MiB DIMMs) and, therefore, get automatically a reasonable alignment that Linux can use.

Linux is only able to make use of DIMMs with certain alignments (e.g., 128M by default on x86-64, 2G on larger machines). That is a Linux implementation detail, and QEMU cannot (and should not) decide how to align. Changing any defaults in QEMU would require compat handling, and we might eventually run into trouble when fragmenting the GPA (such that "maxmem" cannot be reached).

To compensate, QEMU can be instructed to place DIMMs at specific (aligned) addresses. In the case of memory-backend-file, one can specify an "align" property that will be considered when finding a GPA (e.g., set "align=128M").
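
For illustration only (this is not something libvirt currently sends): the backend object from the QMP log in comment 1 could request a 128 MiB placement alignment by adding an "align" value (134217728 bytes), roughly:

{"execute":"object-add","arguments":{"qom-type":"memory-backend-file","id":"memdimm1","mem-path":"/var/lib/libvirt/qemu/ram/3-fedora/dimm1","discard-data":true,"share":true,"prealloc":true,"prealloc-threads":16,"size":536870912,"align":134217728},"id":"libvirt-428"}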

Now, it is not quite clear to me what happens in this scenario when *not* specifying "<pagesize unit='KiB'>4</pagesize>". Are we maybe defaulting to some memory backend that has a reasonably large alignment? Let me play with it.

Comment 4 David Hildenbrand 2023-03-14 13:55:51 UTC
Okay, I think I know what's happening.

For anonymous memory (memory-backend-ram), we always align at a 2 MiB boundary on x86-64, such that we can make use of THP. For file-backed memory (memory-backend-file), we don't do that, and default to 4K alignment on x86-64.

In your example, it works by pure luck: if we align 524032 KiB up to a 2 MiB boundary, we end up at an aligned 524288 KiB size. So we automatically place the DIMM (by luck, properly aligned to 128 MiB) after the NVDIMM. With "share=on" we end up using memory-backend-file and do not place the DIMM at a 128 MiB-aligned address (no luck this time).
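
Spelled out with the numbers from this report (the NVDIMM base of 0x100000000 is inferred from the dmesg address, not taken from a log, so treat this as an illustration):

  NVDIMM base:                              0x100000000
  + guest area (524288 KiB - 256 KiB):      0x1ffc0000   (524032 KiB)
  = next free GPA:                          0x11ffc0000  (the "start" in the guest dmesg)

  rounded up to 2 MiB (memory-backend-ram):     0x120000000 -> multiple of 128 MiB (0x8000000), hotplug works
  kept at 4 KiB alignment (memory-backend-file): 0x11ffc0000 -> not a multiple of 128 MiB, Linux rejects it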


QEMU, in general, doesn't make Linux-specific address alignment choices for memory devices. Simple example:

$ qemu-system-x86_64 -m 4G,maxmem=16G,slots=16 -S -monitor stdio -nographic -nodefaults -object memory-backend-ram,id=mem0,size=124M -device pc-dimm,id=dimm0,memdev=mem0 -object memory-backend-ram,id=mem1,size=128M -device pc-dimm,id=dimm1,memdev=mem1
QEMU 7.0.0 monitor - type 'help' for more information
(qemu) info memory-devices
Memory device [dimm]: "dimm0"
  addr: 0x140000000
  slot: 0
  node: 0
  size: 130023424
  memdev: /objects/mem0
  hotplugged: false
  hotpluggable: true
Memory device [dimm]: "dimm1"
  addr: 0x147c00000
  slot: 1
  node: 0
  size: 134217728
  memdev: /objects/mem1
  hotplugged: false
  hotpluggable: true


Note that, while dimm0 has a properly aligned address, it won't be usable by Linux due to its unaligned size. Consequently, dimm1, despite its properly aligned size, gets an unaligned memory address, making it unusable for Linux as well.

We could fairly easily add an "align" property for DIMMs where one could specify "128 MiB" when running Linux guests. But QEMU shouldn't make such decisions because (a) it's guest-specific and might change in the future (again, some Linux machines even require 2 GiB alignment), and (b) the user already has to be aware of the same alignment requirements when it comes to DIMM sizes.

Comment 5 Igor Mammedov 2023-03-21 12:06:01 UTC
(In reply to David Hildenbrand from comment #4)
[...]
> 
> Note that, while dimm0 has a properly aligned address, it won't be usable by
> Linux due to the unaligned size. Consequently, also the dimm1 with a
> properly aligned size gets an unaligned memory address, turning it also
> unusable for Linux.
> 
> We could fairly easily add an "align" property for DIMMs where one could
> specify "128 MiB" when running Linux guests. But QEMU shouldn't make such
> decisions because (a) it's guest-specific and might change in the future.
> Again some Linux machines even require 2 GiB alignment. (b) the user already
> has to be aware about the same alignment requirements when it comes to DIMM
> sizes.

As far as I remember when assigning GPA at which dimm starts, it should be
aligned on 1G by default (see: enforce_aligned_dimm), backend kind shouldn't matter.
(if it doesn't work anymore, it's likely that we've regressed it somehow)

Comment 6 Igor Mammedov 2023-03-21 12:21:44 UTC
(In reply to Igor Mammedov from comment #5)
> (In reply to David Hildenbrand from comment #4)
> [...]
> > 
> > Note that, while dimm0 has a properly aligned address, it won't be usable by
> > Linux due to the unaligned size. Consequently, also the dimm1 with a
> > properly aligned size gets an unaligned memory address, turning it also
> > unusable for Linux.
> > 
> > We could fairly easily add an "align" property for DIMMs where one could
> > specify "128 MiB" when running Linux guests. But QEMU shouldn't make such
> > decisions because (a) it's guest-specific and might change in the future.
> > Again some Linux machines even require 2 GiB alignment. (b) the user already
> > has to be aware about the same alignment requirements when it comes to DIMM
> > sizes.
> 
> As far as I remember when assigning GPA at which dimm starts, it should be
> aligned on 1G by default (see: enforce_aligned_dimm), backend kind shouldn't
> matter.
> (if it doesn't work anymore, it's likely that we've regressed it somehow)

Well, I see we still reserve address space for 1Gb * max_slots, but I don't
see start address being aligned to that.

Comment 7 Igor Mammedov 2023-03-21 12:35:15 UTC
(In reply to Igor Mammedov from comment #6)
> (In reply to Igor Mammedov from comment #5)
> > (In reply to David Hildenbrand from comment #4)
> > [...]
> > > 
> > > Note that, while dimm0 has a properly aligned address, it won't be usable by
> > > Linux due to the unaligned size. Consequently, also the dimm1 with a
> > > properly aligned size gets an unaligned memory address, turning it also
> > > unusable for Linux.
> > > 
> > > We could fairly easily add an "align" property for DIMMs where one could
> > > specify "128 MiB" when running Linux guests. But QEMU shouldn't make such
> > > decisions because (a) it's guest-specific and might change in the future.
> > > Again some Linux machines even require 2 GiB alignment. (b) the user already
> > > has to be aware about the same alignment requirements when it comes to DIMM
> > > sizes.
> > 
> > As far as I remember when assigning GPA at which dimm starts, it should be
> > aligned on 1G by default (see: enforce_aligned_dimm), backend kind shouldn't
> > matter.
> > (if it doesn't work anymore, it's likely that we've regressed it somehow)
> 
> Well, I see we still reserve address space for 1Gb * max_slots, but I don't
> see start address being aligned to that.

Never-mind, I was remembering it incorrectly, enforce_aligned_dimm didn't set
address alignment from the beginning (commit 085f8e88ba73).

Comment 8 David Hildenbrand 2023-03-21 13:26:20 UTC
(In reply to Igor Mammedov from comment #7)
> (In reply to Igor Mammedov from comment #6)
> > (In reply to Igor Mammedov from comment #5)
> > > (In reply to David Hildenbrand from comment #4)
> > > [...]
> > > > 
> > > > Note that, while dimm0 has a properly aligned address, it won't be usable by
> > > > Linux due to the unaligned size. Consequently, also the dimm1 with a
> > > > properly aligned size gets an unaligned memory address, turning it also
> > > > unusable for Linux.
> > > > 
> > > > We could fairly easily add an "align" property for DIMMs where one could
> > > > specify "128 MiB" when running Linux guests. But QEMU shouldn't make such
> > > > decisions because (a) it's guest-specific and might change in the future.
> > > > Again some Linux machines even require 2 GiB alignment. (b) the user already
> > > > has to be aware about the same alignment requirements when it comes to DIMM
> > > > sizes.
> > > 
> > > As far as I remember when assigning GPA at which dimm starts, it should be
> > > aligned on 1G by default (see: enforce_aligned_dimm), backend kind shouldn't
> > > matter.
> > > (if it doesn't work anymore, it's likely that we've regressed it somehow)
> > 
> > Well, I see we still reserve address space for 1Gb * max_slots, but I don't
> > see start address being aligned to that.
> 
> Never-mind, I was remembering it incorrectly, enforce_aligned_dimm didn't set
> address alignment from the beginning (commit 085f8e88ba73).

Right.

IIRC, we only reserved the 1 GiB such that we can properly align memory devices with gigantic pages as backend (host requirement).

We don't do any kind of guest-specific alignment.

Comment 9 Mario Casquero 2023-04-11 15:25:26 UTC
This bug is reproducible directly with qemu-kvm when hotplugging a dimm device backed by a memory-backend-file object.

Test environment
kernel-5.14.0-295.el9.x86_64
qemu-kvm-7.2.0-10.el9.x86_64

Guest
RHEL.9.2.0
kernel-5.14.0-284.6.1.el9_2.x86_64

When booting a guest using a qemu-kvm command line[1] based on the information provided in the bug description, the guest starts successfully. If a dimm device is then hotplugged[2], the error is visible in the guest dmesg[3] whenever the memory is backed by a file, regardless of whether the share option is true, false, or not specified (false is the default value).

[1] /usr/libexec/qemu-kvm \
...
-smp 4,sockets=4,cores=1,threads=1 \
-m 2048,maxmem=80G,slots=20 \
-object '{"qom-type":"memory-backend-ram","id":"ram-node0","size":1073741824}' \
-numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
-object '{"qom-type":"memory-backend-ram","id":"ram-node1","size":1073741824}' \
-numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \
-object '{"qom-type":"memory-backend-file","id":"memnvdimm0","mem-path":"/tmp/nvdimm","prealloc":true,"size":536870912}' \
-device '{"driver":"nvdimm","node":1,"label-size":262144,"memdev":"memnvdimm0","id":"nvdimm0","slot":0}' \
...

[2] Hotplug a dimm device using HMP/QMP
QMP
{"execute": "object-add", "arguments": {"id": "mem1", "qom-type": "memory-backend-file", "size": 536870912, "mem-path": "/tmp/dimm1"}}
{"execute": "device_add", "arguments": {"id": "dimm1", "driver": "pc-dimm", "memdev": "mem1"}}
HMP
(qemu) object_add qom-type=memory-backend-file,id=mem1,mem-path=/tmp/dimm1,size=512M
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
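
To see where QEMU placed the dimm in GPA space, the memory device list can be queried as well (output omitted here; it varies per setup):
QMP
{"execute": "query-memory-devices"}
HMP
(qemu) info memory-devices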

[3] Guest dmesg
[  188.843152] Block size [0x8000000] unaligned hotplug range: start 0x11ffc0000, size 0x20000000
[  188.843158] acpi PNP0C80:01: add_memory failed
[  188.844013] acpi PNP0C80:01: acpi_memory_enable_device() error
[  188.844386] acpi PNP0C80:01: Enumeration failure

Comment 10 Igor Mammedov 2023-04-12 06:45:25 UTC
So it's guest-specific alignment requirements that lead to the failure.
QEMU typically doesn't cater to that (for example, a 2 MiB-aligned
block would work fine with a Windows guest while it might fail with Linux).

Libvirt is the next component in stack which actually knows what
guest OS is used in VM. Based on this knowledge it can do OS
specific dimm/nvdimm alignment validation before applying settings.

Moving the BZ to libvirt for further triage.

Comment 11 David Hildenbrand 2023-04-12 09:10:20 UTC
As expressed in comment #4, I suspect an align property on DIMMs (and/or NVDIMMs) might be helpful. In that case, libvirt could set that align property to e.g., 128 MiB on x86-64 with Linux guests and have it working in most setups. But maybe the user should just specify that alignment, because I suspect even libvirt might not know.

For example, older Linux on aarch64 can only hotplug 1 GiB DIMMs. Newer ones can hotplug 128 MiB DIMMs on aarch64 with 4k but only 512 MiB DIMMs (IIRC) on aarch64 with 64k ... so besides knowing "this is the minimum granularity I can use for that guest", we also have to know "this minimum granularity implies the alignment".

Comment 12 Michal Privoznik 2023-04-12 14:24:39 UTC
(In reply to Igor Mammedov from comment #10)
> Libvirt is the next component in stack which actually knows what
> guest OS is used in VM.

No it doesn't.

(In reply to David Hildenbrand from comment #11)
> As expressed in comment #4, I suspect an align property on DIMMs (and/or
> NVDIMMs) might be helpful. In that case, libvirt could set that align
> property to e.g., 128 MiB on x86-64 with Linux guests and have it working in
> most setups. But maybe the user should just specify that alignment, because
> I suspect even libvirt might not know.

Your suspicion is correct. User specified alignment it is.

> 
> For example, older Linux on aarch64 can only hotplug 1 GiB DIMMs. Newer ones
> can hotplug 128 MiB DIMMs on aarch64 with 4k but only 512 MiB DIMMs (IIRC)
> on aarch64 with 64k ... so besides knowing "this is the minimum granularity
> I can use for that guest", we also have to know "this minimum granularity
> implies the alignment".

Is this documented somewhere? Might be worth linking that from libvirt docs for the new attribute.

Comment 13 David Hildenbrand 2023-04-12 14:47:28 UTC
> > 
> > For example, older Linux on aarch64 can only hotplug 1 GiB DIMMs. Newer ones
> > can hotplug 128 MiB DIMMs on aarch64 with 4k but only 512 MiB DIMMs (IIRC)
> > on aarch64 with 64k ... so besides knowing "this is the minimum granularity
> > I can use for that guest", we also have to know "this minimum granularity
> > implies the alignment".
> 
> Is this documented somewhere? Might be worth linking that from libvirt docs
> for the new attribute.

Not implemented (and, therefore, not documented) upstream/downstream yet. If you agree that this alignment property could be helpful, I'll implement it and propose it upstream.

Comment 14 Michal Privoznik 2023-04-12 15:25:17 UTC
(In reply to David Hildenbrand from comment #13)
> > > 
> > > For example, older Linux on aarch64 can only hotplug 1 GiB DIMMs. Newer ones
> > > can hotplug 128 MiB DIMMs on aarch64 with 4k but only 512 MiB DIMMs (IIRC)
> > > on aarch64 with 64k ... so besides knowing "this is the minimum granularity
> > > I can use for that guest", we also have to know "this minimum granularity
> > > implies the alignment".
> > 
> > Is this documented somewhere? Might be worth linking that from libvirt docs
> > for the new attribute.
> 
> Not implemented (and, therefore) upstream/downstream yet. If you agree that
> this alignment property could be helpful, I'll implement+propose upstream.

Oh, I thought this was already implemented. But looking into QEMU's sources it isn't. Alright then, so just to make sure I understand correctly:

1) user would provide the alignment value when hotplugging memory (somewhere in the memory device XML), or maybe even in the guest XML,
2) libvirt would then pass this value down to QEMU in 'device_add' command.

Now the only question is, how should user know/guess the correct value to pass to libvirt. But maybe libvirt can document this somewhere.

Comment 15 David Hildenbrand 2023-04-13 08:20:20 UTC
(In reply to Michal Privoznik from comment #14)
> (In reply to David Hildenbrand from comment #13)
> > > > 
> > > > For example, older Linux on aarch64 can only hotplug 1 GiB DIMMs. Newer ones
> > > > can hotplug 128 MiB DIMMs on aarch64 with 4k but only 512 MiB DIMMs (IIRC)
> > > > on aarch64 with 64k ... so besides knowing "this is the minimum granularity
> > > > I can use for that guest", we also have to know "this minimum granularity
> > > > implies the alignment".
> > > 
> > > Is this documented somewhere? Might be worth linking that from libvirt docs
> > > for the new attribute.
> > 
> > Not implemented (and, therefore) upstream/downstream yet. If you agree that
> > this alignment property could be helpful, I'll implement+propose upstream.
> 
> Oh, I thought this was already implemented. But looking into QEMU's sources
> it isn't. Alright then, so just to make sure I understand correctly:
> 
> 1) user would provide the alignment value when hotplugging memory (somewhere
> in the memory device XML), or maybe even in the guest XML,
> 2) libvirt would then pass this value down to QEMU in 'device_add' command.

I'm asking myself if we want to handle this per device, or simply for the whole machine. It might make sense to use the same min alignment for all DIMMs/NVDIMMs/virtio-pmem/virtio-mem devices, instead of having per-device options.

So we could either

(1) Let the user specify it for the machine in QEMU (e.g., -m 8g,maxmem=100g,slots=10,align=128M), in libvirt we'd specify it for the domain.

(2) Let the user specify it per device in QEMU (e.g., device_add pc-dimm,....,align=128M), in libvirt we'd specify it either for the domain or per device.

Semantics of "align" are "minimum alignment of memory devices in guest physical address space".

IMHO, we might not need the flexibility of (2) -- we could add per-device overrides later -- and (1) would be sufficient for now.

> 
> Now the only question is, how should user know/guess the correct value to
> pass to libvirt. But maybe libvirt can document this somewhere.

A sane default might be 128 MiB. In Linux guests, we can figure it out by looking at "/sys/devices/system/memory/block_size_bytes". Any DIMM that's not aligned in size and start address cannot be (fully) used by Linux.
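
For illustration, in a guest like the one from this report (where dmesg shows a block size of 0x8000000, i.e. 128 MiB), reading that file would look roughly like:

[in guest]
# cat /sys/devices/system/memory/block_size_bytes
8000000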

Comment 16 Igor Mammedov 2023-04-13 11:15:20 UTC
QEMU already reserves 1G of GPA per device, so why not align every one on 1G border (without adding any new options)?

Comment 17 David Hildenbrand 2023-04-14 08:05:08 UTC
(In reply to Igor Mammedov from comment #16)
> QEMU already reserves 1G of GPA per device, so why not align every one on 1G
> border (without adding any new options)?

We only do that on x86 so far IIRC, and only for memory devices that require an ACPI slot (we don't know how many other devices we might have). The underlying reason, IIRC, was to handle memory backends with gigantic pages that require a certain alignment in GPA. So on x86 we could eventually align only such devices (DIMMs/NVDIMMs) to 1 GiB without further changes. For everything else, we could eventually break existing setups and would require some compat handling (I recall that any such GPA layout changes might require compat handling, but at least libvirt should be able to deal with that). A user option won't require gluing that to compat machines.

Aligning all DIMMs to 1 GiB is also not really desired IMHO. If you hotplug multiple smaller DIMMs (< 1 GiB, which apparently users do for Kata and such), you'd get quite a lot of (large) GPA holes in between, implying that PFN walkers (like compaction) inside the VM get more expensive (i.e., zones not contiguous) and that such memory can never get used for larger contiguous allocations (such as gigantic pages).

Ideally, we don't get any holes, even when hotplugging DIMMs that are any multiples of 128 MiB (on x86), which is the common case and only doesn't work because NVDIMMs do weird stuff with the labels. But that 128 MiB alignment is both guest and arch specific.

Getting that intended minimum alignment from the user is IMHO better than hard-coding it in QEMU and having to deal with compat handling.

Comment 19 Michal Privoznik 2023-04-18 14:35:10 UTC
(In reply to David Hildenbrand from comment #17)
> Getting that intended minimum alignment from the user is IMHO better than
> hard-coding it in QEMU and having to deal with compat handling.

But the problem is whether the user will know what value to put in. To sum up:

QEMU knows what values are acceptable, but not which OS is running in the guest,
libvirt does not know what value to pass, nor which OS is running in the guest,
user does not know what value to pass, but it knows what OS is running in the guest.

So I wonder whether we should:
a) choose a reasonable default in QEMU, and possibly
b) offer users a way to tweak the alignment.

Comment 20 David Hildenbrand 2023-04-18 14:57:15 UTC
(In reply to Michal Privoznik from comment #19)
> (In reply to David Hildenbrand from comment #17)
> > Getting that intended minimum alignment from the user is IMHO better than
> > hard-coding it in QEMU and having to deal with compat handling.
> 
> But problem is whether user will know what value to put in. To sum up:
> 
> QEMU knows what values are acceptable, but not which OS is running in the
> guest,
> libvirt does not know what value to pass, nor which OS is running in the
> guest,
> user does not know what value to pass, but it knows what OS is running in
> the guest.

QEMU most certainly knows the least ;)

Again, the user already has to be aware of guest OS restrictions. While hotplugging a 128 MiB DIMM to a VM running an arm64 Linux kernel with 4k page size will work, it's unusable by an arm64 Linux kernel with a 64k page size. Just like the minimum granularity, the alignment is guest-OS specific.

> 
> So I wonder whether we should:
> a) chose a reasonable default in QEMU, and possibly

I'm afraid that will require compat machine changes.

And there is no reasonable default for arm64, for example, without knowing what's running inside the VM. Using an alignment of 512MiB just because the guest could be running a 64k kernel fragments guest physical address space when hotplugging 128 MiB DIMMs.

Comment 21 David Hildenbrand 2023-04-18 15:36:50 UTC
(In reply to David Hildenbrand from comment #20)
> (In reply to Michal Privoznik from comment #19)
> > (In reply to David Hildenbrand from comment #17)
> > > Getting that intended minimum alignment from the user is IMHO better than
> > > hard-coding it in QEMU and having to deal with compat handling.
> > 
> > But problem is whether user will know what value to put in. To sum up:
> > 
> > QEMU knows what values are acceptable, but not which OS is running in the
> > guest,
> > libvirt does not know what value to pass, nor which OS is running in the
> > guest,
> > user does not know what value to pass, but it knows what OS is running in
> > the guest.
> 
> QEMU most certainly knows the least ;)
> 
> Again, the user already has to be aware of guest OS restrictions. While
> hotplugging a 128 MiB DIMM to a VM running an arm64 Linux kernel with 4k
> page size will work, it's unusable by an arm64 Linux kernel with a 64k page
> size. Just like the minimum granularity, the alignment is guest-OS specific.
> 
> > 
> > So I wonder whether we should:
> > a) chose a reasonable default in QEMU, and possibly
> 
> I'm afraid that will require compat machine changes.
> 
> And there is no reasonable default for arm64, for example, without knowing
> what's running inside the VM. Using an alignment of 512MiB just because the
> guest could be running a 64k kernel fragments guest physical address space
> when hotplugging 128 MiB DIMMs.

BTW, I was playing with the idea of deciding the alignment based on the size.

DIMM size is multiples of 128 MiB -> align to 128 MiB
DIMM size is multiples of 256 MiB -> align to 256 MiB
DIMM size is multiples of 512 MiB -> align to 512 MiB

It's still sub-optimal, though. Hotplugging a 128 MiB DIMM first followed by a 256 MiB DIMM would unnecessarily create a hole ...
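
For example (a hedged illustration reusing the 0x140000000 hotplug base seen in comment 4): a 128 MiB DIMM would occupy 0x140000000-0x147ffffff; aligning the following 256 MiB DIMM to a 256 MiB boundary would place it at 0x150000000, leaving a 128 MiB hole starting at 0x148000000.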

Comment 22 Michal Privoznik 2023-04-19 06:49:36 UTC
(In reply to David Hildenbrand from comment #21)

Spoiler alert: I know next to nothing about memory mgmt.

> It's still sub-optimal, though. Hotplugging a 128 MiB DIMM first followed by
> a 256 MiB DIMM would unnecessarily create a hole ...

Can you enlighten me please - why are holes bad? Is it because if a DIMM is backed by a hugepage then it's wasteful?
Also - how is this solved at the real HW level? I mean, when I plug a DIMM into a slot, it might create a hole too, couldn't it?

Comment 23 David Hildenbrand 2023-04-19 07:12:50 UTC
(In reply to Michal Privoznik from comment #22)
> (In reply to David Hildenbrand from comment #21)
> 
> Spoiler alert: I know next to nothing about memory mgmt.
> 
> > It's still sub-optimal, though. Hotplugging a 128 MiB DIMM first followed by
> > a 256 MiB DIMM would unnecessarily create a hole ...
> 
> Can you enlighten me please - why are holes bad? Is it because if a DIMM is
> backed by a hugepage then it's wasteful?

Because the GPA will be fragmented. For Linux, this implies that certain operations, such as memory compaction, get more expensive because Linux has to consider holes in memory zones and has to scan over these holes.

Further, Linux cannot make use of that memory for larger allocations (such as gigantic pages). It's a secondary concern, though.


> Also - how is this solved at real HW level? I mean, when I plug a DIMM into
> a slot, it might too create a hole, couldn't it?

I was told by Intel a while ago that real HW does not support hotplug of individual DIMMs, but only complete NUMA nodes. Holes between other nodes are less of a concern (in Linux, it's separate memory zones either way). So it's not really an issue on real HW.

Comment 24 RHEL Program Management 2023-09-22 13:24:31 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 25 RHEL Program Management 2023-09-22 13:26:45 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

