Bug 1441938

Summary: When booting a Windows guest with two NUMA nodes and a pc-dimm assigned to the second node, the dimm is not recognized by the guest
Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.4
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: Yumei Huang <yuhuang>
Assignee: Virtualization Maintenance <virt-maint>
QA Contact: Yumei Huang <yuhuang>
CC: chayang, imammedo, jinzhao, juzhang, knoel, michen, mrezanin, virt-maint
Target Milestone: rc
Fixed In Version: qemu-kvm-rhev-2.10.0-1.el7
Last Closed: 2018-04-11 00:16:25 UTC
Type: Bug
Bug Blocks: 1473046

Description Yumei Huang 2017-04-13 06:55:39 UTC
Description of problem:
Boot a Windows guest with two NUMA nodes (node 0 and node 1) and one pc-dimm assigned to node 1; the dimm is not recognized by the guest.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.9.0-0.el7.patchwork201703291116
kernel-3.10.0-643.el7.x86_64

How reproducible:
always

Steps to Reproduce:
1. Boot a Windows guest with two NUMA nodes and one pc-dimm (node=1):

# /usr/libexec/qemu-kvm -machine pc \
-drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/win2012-64r2-virtio.qcow2 \
-device virtio-blk-pci,id=image1,drive=drive_image1 \
-m 1024,slots=4,maxmem=32G \
-object memory-backend-ram,size=1G,id=mem-mem1 \
-device pc-dimm,id=dimm-mem1,memdev=mem-mem1,node=1 \
-smp 16,cores=8,threads=1,sockets=2 \
-numa node,nodeid=0 \
-numa node,nodeid=1 \
-vnc :0 -monitor stdio

2. Check the guest total memory (e.g. with Task Manager or systeminfo inside the guest).


Actual results:
The guest total memory is 1G.

Expected results:
The guest total memory should be 2G.

Additional info:
1. The issue is hit with both win2012r2 and win2008r2.
2. After booting the guest, a dimm hotplugged to node 1 via HMP is recognized (see the example below).
3. If the dimm on the qemu command line is assigned to node 0, it is recognized.
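
For reference, the hotplug in item 2 can be done with an HMP sequence along these lines (the IDs are illustrative, not from the original test):

(qemu) object_add memory-backend-ram,id=mem-hp1,size=1G
(qemu) device_add pc-dimm,id=dimm-hp1,memdev=mem-hp1,node=1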

Comment 3 Ladi Prosek 2017-05-22 14:47:52 UTC
I can reproduce this with Windows Server 2016 as well.

Reverting the following commit (thanks Igor for the pointer!) fixes it for 2016 but not for 2008R2 and 2012R2. In fact, it makes older Windows even more broken: assigning the DIMM to node 0 then doesn't work either.

  commit cec65193d41099519f14fb744440eeabbfa6e4e3
  Author: Igor Mammedov <imammedo>
  Date:   Mon Jun 2 15:25:28 2014 +0200

      pc: ACPI BIOS: reserve SRAT entry for hotplug mem hole
    
      Needed for Windows to use hotplugged memory device, otherwise
      it complains that server is not configured for memory hotplug.
      Tests show that afterwards it uses the dynamically provided
      proximity value from the _PXM() method if available.


So if nothing else, we could add a switch to disable the old workaround because whatever was broken in Windows before is now fixed and the workaround is counterproductive.

I'll see if I can debug the relevant Windows code and check what Hyper-V does. Maybe we can come up with something that would work for all Windows versions.

Comment 4 Igor Mammedov 2017-05-23 05:56:18 UTC
Thinking more about cec65193d, I recall that Linux relies on it too when the guest is started with less than 4G of present memory:

  x86/mm/64: Enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN

so we can't just remove it.

Comment 5 Ladi Prosek 2017-05-23 14:55:44 UTC
Thanks, that makes sense. Reading the ACPI spec, I wonder if the most correct thing to do wouldn't be declaring multiple MEM_AFFINITY_HOTPLUGGABLE regions, one for each NUMA node, and then plugging DIMMs into the respective region at run time. I understand that it would require changes to the current model, though, and would have its drawbacks. A rough sketch of that idea follows.
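
As a hypothetical sketch only (not the fix that was eventually posted), reusing the helpers visible in the diff below and assuming an even split of the hotplug area across nodes:

    /* Hypothetical: one MEM_AFFINITY_HOTPLUGGABLE SRAT entry per NUMA
     * node, splitting the hotplug hole evenly. Real code would need a
     * policy for per-node sizing and alignment. */
    uint64_t base = pcms->hotplug_memory.base;
    uint64_t per_node = hotplugabble_address_space_size / pcms->numa_nodes;
    for (int i = 0; i < pcms->numa_nodes; i++) {
        numamem = acpi_data_push(table_data, sizeof *numamem);
        build_srat_memory(numamem, base + i * per_node, per_node, i,
                          MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
    }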

I'm having trouble understanding Windows ACPI internals due to indirections and asynchrony. Windows reads the acpi-mem-hotplug region in a very generic-looking routine running on a separate kernel thread:

ACPI!WriteSystemIO+0x1c842
ACPI!AccessBaseField+0x236
ACPI!WriteFieldObj+0x14e
ACPI!RunContext+0x1e0
ACPI!InsertReadyQueue+0x403
ACPI!RestartCtxtPassive+0x2f
ACPI!ACPIWorkerThread+0xed
nt_fffff80084285000!PspSystemThreadStartup+0x41
nt_fffff80084285000!KiStartSystemThread+0x16


However, I have confirmed that this trivial change:

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index afcadac..9e56d4e 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2411,7 +2411,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
     if (hotplugabble_address_space_size) {
         numamem = acpi_data_push(table_data, sizeof *numamem);
         build_srat_memory(numamem, pcms->hotplug_memory.base,
-                          hotplugabble_address_space_size, 0,
+                          hotplugabble_address_space_size, pcms->numa_nodes - 1,
                           MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
     }

fixes this BZ on all OSes I've tested: 2008R2, 2012R2, 2016. It also fixes hot-plug on 2016 and 2012R2, no longer requiring the first DIMM to be plugged into node 0. 2008R2 is still sensitive to the hot-plug order, and this change is not making it any worse.

Igor, do you see any problem with declaring the MEM_AFFINITY_HOTPLUGGABLE region for the last node instead of the first?

Comment 6 Igor Mammedov 2017-05-24 06:15:16 UTC
If the above doesn't break the case in comment 4 and makes Windows happier, then I'm fine with it.

Comment 7 Ladi Prosek 2017-05-24 08:21:08 UTC
Fix posted:
https://lists.nongnu.org/archive/html/qemu-devel/2017-05/msg05467.html

I have verified that Linux enables SWIOTLB just like before.
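
For completeness, a quick way to double-check that in a Linux guest (an illustrative command, not part of the original verification):

# dmesg | grep -i swiotlb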

Comment 9 Ladi Prosek 2017-08-31 06:43:47 UTC
The fix has been merged upstream as

  ede24a0 pc: ACPI BIOS: use highest NUMA node for hotplug mem hole SRAT entry

Comment 11 Yumei Huang 2017-11-08 08:19:26 UTC
Verify:
qemu-kvm-rhev-2.10.0-4.el7
kernel-3.10.0-765.el7.x86_64

Tested with win2016, win2012r2, and win2008r2 guests: whether the dimm is assigned to the first node or the second node on the qemu command line, it is recognized by the guest.

Comment 14 errata-xmlrpc 2018-04-11 00:16:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1104