Bug 872524
Summary: Windows Server 2012 guest w/ 256GB memory is always killed, but only when numad is enabled on the host (w/ 512GB memory)
Product: Red Hat Enterprise Linux 6
Reporter: Mike Cao <bcao>
Component: numad
Assignee: Bill Gray <bgray>
Status: CLOSED ERRATA
QA Contact: Jakub Prokes <jprokes>
Severity: urgent
Priority: urgent
Version: 6.4
CC: andebjor, areis, bcao, bgray, bsarathy, cpelland, ddumas, drjones, jherrman, jprokes, jsynacek, juzhang, leiwang, lijin, lnovich, michen, mkenneth, nobody, perfbz, psklenar, qe-baseos-daemons, qzhang, rbalakri, sradvan, virt-maint, xfu
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
Previously, running the numad daemon on a system executing a process with very large resident memory (such as a Windows Server 2012 guest) could cause memory swapping. As a consequence, significant latencies under some circumstances occurred on the system, which could in turn lead to other processes (such as qemu-kvm) becoming unresponsive. With this update, numad no longer causes memory swapping in the above scenario, and the consequent latencies and hangs no longer occur.
Story Points: ---
Clone Of:
Clones: 1112280 (view as bug list)
Environment:
Last Closed: 2014-10-14 08:21:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 871829, 883516, 957226, 1002699, 1112280
Attachments:
Description
Mike Cao
2012-11-02 10:42:31 UTC
Created attachment 637035 [details]
dmesg
processor       : 47
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 9
model name      : AMD Opteron(tm) Processor 6172
stepping        : 1
cpu MHz         : 2100.142
cache size      : 512 KB
physical id     : 1
siblings        : 12
core id         : 5
cpu cores       : 12
apicid          : 27
initial apicid  : 27
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid amd_dcm pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr npt lbrv svm_lock nrip_save pausefilter
bogomips        : 4200.41
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

# cat /proc/meminfo
MemTotal:       529297552 kB
MemFree:        514979580 kB
Buffers:        27648 kB
Cached:         3732784 kB
SwapCached:     3792 kB
Active:         7912824 kB
Inactive:       129600 kB
Active(anon):   4275460 kB
Inactive(anon): 10096 kB
Active(file):   3637364 kB
Inactive(file): 119504 kB
Unevictable:    34992 kB
Mlocked:        10464 kB
SwapTotal:      4194296 kB
SwapFree:       4168956 kB
Dirty:          40 kB
Writeback:      0 kB
AnonPages:      4331988 kB
Mapped:         26516 kB
Shmem:          924 kB
Slab:           1124940 kB
SReclaimable:   122964 kB
SUnreclaim:     1001976 kB
KernelStack:    10736 kB
PageTables:     20564 kB
NFS_Unstable:   0 kB
Bounce:         0 kB
WritebackTmp:   0 kB
CommitLimit:    268843072 kB
Committed_AS:   4643860 kB
VmallocTotal:   34359738367 kB
VmallocUsed:    894832 kB
VmallocChunk:   33888258768 kB
HardwareCorrupted: 0 kB
AnonHugePages:  886784 kB
HugePages_Total: 0
HugePages_Free:  0
HugePages_Rsvd:  0
HugePages_Surp:  0
Hugepagesize:   2048 kB
DirectMap4k:    6756 kB
DirectMap2M:    3129344 kB
DirectMap1G:    533725184 kB
[root@dell-per815-01 cgroup]#

Since I was asked to run the SVVP test over the Windows Server 2012 platform on a RHEL 6.3.z host,
this issue is a test blocker for me. This issue only occurs when numad is enabled on the host:

# mount cgroup -t cgroup -o cpuset /cgroup
# numad -D /cgroup

After removing it, this issue is gone, but the guest always hangs at a blank screen. I will report a new bug to track that.

CLI:
/usr/libexec/qemu-kvm -boot menu=on -m 256G -smp 48,cores=48,sockets=1,threads=1 -cpu Opteron_G3,family=0xf -drive file=windows_server_2012_max_amd,format=raw,if=none,id=drive-ide0,cache=none,werror=stop,rerror=stop -device ide-drive,drive=drive-ide0,id=ide0,bootindex=1 -netdev tap,sndbuf=0,id=hostnet0,script=/etc/qemu-ifup,downscript=no -device e1000,netdev=hostnet0,mac=00:52:1a:21:62:01,bus=pci.0,addr=0x4,id=virtio-net-pci0 -uuid ac64c74a-a8d5-4c24-9839-fcc491439493 -rtc base=localtime,clock=host,driftfix=slew -no-kvm-pit-reinjection -device usb-ehci,id=ehci0 -drive file=usb_storage_max,format=raw,if=none,id=drive-usb0,cache=none,werror=stop,rerror=stop -device usb-storage,drive=drive-usb0,removable=on,bus=ehci0.0 -chardev socket,id=111a,path=/tmp/amd-max-sut,server,nowait -mon chardev=111a,mode=readline -name amd-max-sut -vnc :0 -drive file=en_windows_server_2012_x64_dvd_915478.iso,id=drive-cdrom,format=raw,if=none,werror=stop,rerror=stop,media=cdrom -device ide-drive,drive=drive-cdrom,id=cdrom -vga std -numa node,mem=32G,cpus=0,4,8,12,16,20,nodeid=0 -numa node,mem=32G,cpus=24,28,32,36,40,44,nodeid=1 -numa node,mem=32G,cpus=3,7,11,15,19,23,nodeid=2 -numa node,mem=32G,cpus=27,31,35,39,43,47,nodeid=3 -numa node,mem=32G,cpus=2,6,10,14,18,22,nodeid=4 -numa node,mem=32G,cpus=26,30,34,38,42,46,nodeid=5 -numa node,mem=32G,cpus=1,5,9,13,17,21,nodeid=6 -numa node,mem=32G,cpus=25,29,33,37,41,45,nodeid=7

Reproduced this issue with the same command line, and:
1) This issue only happens for the win2012 guest. I tried win2k8r2 and win7 and had no problem.
2) Tested with "-m 256G", assigning each NUMA node 32G of memory. Failed (5/5 times). Tested with "-m 192G" / "-m 184G" / "-m 144G"; all failed.
Tested with "-m 128G", assigning each NUMA node 16G of memory (8 NUMA nodes in total): passed. The guest boots up successfully (3/3 times). Tested with "-m 120G", assigning each NUMA node 15G of memory: 1/4 failed, 3/4 passed.
3) Stopped the numad service and re-tested; this issue does not appear.

This issue also happens on the latest RHEL 6.4 host:
kernel-2.6.32-341.el6.x86_64
qemu-kvm-0.12.1.2-2.334.el6.x86_64

I'm assuming numad is the culprit here. Please investigate and let us know if you believe qemu-kvm is the faulty component instead.

Disable KSM and see if that helps. Bill Gray, any other suggestions/input?

(In reply to comment #7)
> this issue only occurs when enable numad on the host
> #mount cgroup -t cgroup -o cpuset /cgroup
> #numad -D /cgroup
>
> After remove it ,this issue has gone ,but guest always hang at a blank screen
> Will report a new bug to track it .

What state are the guest and its vcpus in at this point? Is the guest still "running"? What state are the vcpus in? (Check 'ps -eLo pid,comm,s | grep qemu' frequently - or maybe just watch top.)

(In reply to comment #10)
> I'm assuming numad is the culprit here. Please investigate and let us know
> if you believe qemu-kvm is the faulty component instead.

I'm guessing it's not numad's fault, but rather that the state of the qemu threads is causing cgroups to choke when attempting to add them - which leads to them getting killed somehow (or "cleaned up"). We should focus on the hang without numad first. I assume the bug opened for it is bug 874406?

Try reproducing this without any guests using e1000 NICs (I see e1000 in the cmdline in comment 7). If this issue still reproduces then it's separate; otherwise we can dup this bug to bug 874406.
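The 'ps -eLo pid,comm,s | grep qemu' check suggested above can be scripted. A minimal sketch (the helper name and the sample output below are illustrative, not from this report) that flags qemu threads stuck in uninterruptible sleep (state D), the symptom seen later in this thread:

```python
# Hypothetical helper: given `ps -eLo pid,comm,s` output, return the
# threads matching a name filter that are in D (uninterruptible sleep).
def stuck_threads(ps_output, name_filter="qemu"):
    """Return (pid, comm) tuples for D-state threads in ps output."""
    stuck = []
    for line in ps_output.strip().splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        # pid is first column, state is last, comm is everything between
        pid, comm, state = parts[0], " ".join(parts[1:-1]), parts[-1]
        if name_filter in comm and state == "D":
            stuck.append((pid, comm))
    return stuck

sample = """\
4023 qemu-kvm D
4024 qemu-kvm S
4025 qemu-kvm D
1234 bash S"""
print(stuck_threads(sample))  # -> [('4023', 'qemu-kvm'), ('4025', 'qemu-kvm')]
```

In practice the input would come from `subprocess.check_output(["ps", "-eLo", "pid,comm,s"])`; a non-empty result on a hung guest points at blocked vcpu threads rather than a dead process.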
(In reply to comment #12)
> What state is the guest and its vcpus in at this point? Guest is still
> "running"? vcpus are ?? (check 'ps -eLo pid,comm,s | grep qemu' frequently -
> or maybe just watch top)

Details are in https://bugzilla.redhat.com/show_bug.cgi?id=873613

> I'm guessing it's not numad's fault, but rather the state the qemu threads
> are causing cgroups to choke when attempting to add them - which leads to
> them getting killed somehow (or "cleaned up"). We should focus on the hang
> without numad first. I assume the bug opened for it is bug 874406?

No, it should be https://bugzilla.redhat.com/show_bug.cgi?id=873613

(In reply to comment #13)
> Try reproducing this without any guests using e1000 nics (I see e1000 in the
> cmdline in comment 7). If this issue still reproduces then it's separate,
> otherwise we can dup this bug to bug 874406.

I can still reproduce this issue with only virtio-net-pci and rtl8139 emulated NICs.

Mike

(In reply to comment #15)
> No.
> it should be https://bugzilla.redhat.com/show_bug.cgi?id=873613

That bug says that > 32 vcpus won't work; the cmdline in comment 7 has 48. Did you reproduce this issue with only 32 vcpus and no e1000 NICs? We need to remove all known issues in order to see if there is anything left.

I suggest using a known-good config, but with 256G of memory.
a) test without numad - make sure it works
b) test with numad - see what happens

(In reply to comment #17)
> That bug says that > 32 vcpus won't work, the cmdline in comment 7 has 48.
> Did you reproduce this issue with only 32 vcpus? and no e1000 nics? We need
> to remove all known issues in order to see if there is anything left.

I am using -smp 48 to test this issue; I did not try vcpu=32. Do we need to?

> I suggest using a known-good config, but with 256G of memory.
> a) test without numad - make sure it works

-smp 48 + w/o numad ---> guest always hangs

> b) test with numad - see what happens

-smp 48 + w/ numad ---> qemu-kvm process killed

Mike

(In reply to comment #18)
> I am using -smp 48 to test this issue, I did not try vcpu=32.
> Do we need it?

Yes, based on bug 873613 comment 3, I would say so. In general, we need to eliminate all config options that have found other bugs (and other bugs have already been opened for them). This bug is to address a possible problem with 256G configs and numad. So the config for the test guest should be a known-good config (i.e. one that works) plus 256G. Then, the numad variable should be toggled, as I outlined.

> -smp 48 + w/o numad ---> guest always hangs
> -smp 48 + w/ numad ---> qemu-kvm process killed

The problem with debugging like this is that both of these symptoms can be from the same problem. As I wrote at the bottom of comment 12, the process getting killed is likely just a result of numad trying to manage a broken guest.
To debug we need to compare working vs. not-working - not not-working-one-way vs. not-working-another-way.

(In reply to comment #19)
> Yes, based on bug 873613 comment 3, I would say so. In general, we need to
> eliminate all config options that have found other bugs (and other bugs have
> already been opened for them). This bug is to address a possible problem
> with 256G configs and numad. So the config for the test guest should be a
> known-good config (i.e. one that works) plus 256G. Then, the numad variable
> should be toggled, as I outlined.

Will try with -smp 32 with numad enabled.

> The problem with debugging like this is that both of these symptoms can be
> from the same problem. As I wrote at the bottom of comment 12, the process
> getting killed is likely just a result of numad trying to manage a broken
> guest. To debug we need to compare working vs. not-working.

Hi, Andrew,

What does "broken guest" mean? The image I am using is an image I use for the SVVP test now, so it should not be a broken guest. After reproducing this bug, I am still using the image for the SVVP test, and it works fine.
For this bug, when the numad service is running, the qemu-kvm process gets killed; the qemu-kvm process survives when numad is stopped. Any idea why this may be a dup of 873613?

Thanks,
Mike

(In reply to comment #20)
> Will try w/ -smp 32 with numad enabled .

AND with it disabled first to make sure that works. I.e. get a clean baseline FIRST.

> What's the "broken guest" mean ? the image I am using is a image which I
> used for SVVP test now ,it should not be a broken guest .after reproducing
> this bug ,the image I am still using it for SVVP test ,and it works fine .

Not a broken image, but a broken config. If there's an existing bug that says win2012 guests don't work with >32 vcpus, then why are we still creating configs with >32 vcpus?

> For this bug ,when numad service running,qemu-kvm process has been killed
> ,qemu-kvm process will exist when numad stopped .why idea why this may dup
> of 873613?

If testing a 256G guest with numad doesn't produce any problems, then this bug could be closed as a dup, or just NOTABUG.

(In reply to comment #21)
> AND with it disabled first to make sure that works. I.e. get a clean
> baseline FIRST.
e1000 + -smp 32 + 256GB + without numad service ---> guest works
e1000 + -smp 32 + 256GB + with numad service ---> the terminal which runs the qemu-kvm process froze; I used vncviewer to track the guest and found the guest freezes during boot.

Will try the test w/o e1000 later.

(In reply to comment #22)
> e1000+ -smp32 + 256GB +without numa service-->guest works
> e1000+ -smp32 + 256GB +with numa service --->the terminal which runs
> qemu-kvm process freezed ,I use vncviewer to track guest ,find guest freeze
> during boot
>
> Will try test w/o e1000 later

Need to mention I waited for more than 1 hour. I tried ps -eaf|grep qemu while numad was running, but the output hangs. Then I stopped the numad service; the terminal running the ps process works, and it shows:

[qemu-kvm] <defunct> grep qemu

Re-tested this issue on kernel-351 & qemu-kvm-rhev-348 without e1000.

Results:
rtl8139 + smp32 + 256GB + w/o numad ---> guest works fine
rtl8139 + smp32 + 256GB + w/ numad ---> guest has been killed

cat /var/log/numad
PID 4023 moved to node(s) 1 in 103.2 seconds
Removing obsolete cpuset: /cgroup/numad.4023

(In reply to comment #24)
> Results:
>
> rtl8139+ smp32 + 256GB + w/o numad ---> guest works fine
> rtl8139+ smp32 + 256GB + w numad ---> guest has been killed
>
> cat /var/log/numad
> PID 4023 moved to node(s) 1 in 103.2 seconds
> Removing obsolete cpuset : /cgroup/numad.4023

OK, this data is starting to point at numad/cgroups. We now have a clean baseline (the guest works without numad), and we have logs from numad stating a migration took 103 seconds. If the guest was blocked for 103 seconds, then it wouldn't be too surprising that it died.

I tried to reproduce this on my small NUMA system with no luck. Likely the large guest memory configuration is required - which would also make for longer migration times. Jan, I suggest we find a system where we can reproduce this, and then experiment with rate-limiting the migrations, or just avoiding large migrations altogether. Yes, we might need to release note this, since clearly identifying the problem is unlikely in time for 6.4.

Numad does nothing different for various types of guests, yet per comment 8, other varieties of Windows guests seem to work fine with numad.
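The /var/log/numad lines quoted above can be scanned mechanically for slow migrations like the 103.2-second one. A hedged sketch (the function name and the 10-second threshold are my own choices, not part of numad) that parses log lines in the quoted format:

```python
import re

# Sketch: extract migration durations from numad log lines of the form
#   "PID 4023 moved to node(s) 1 in 103.2 seconds"
# (format as quoted in this thread) and flag the slow ones.
MOVE_RE = re.compile(r"PID (\d+) moved to node\(s\) ([\d,]+) in ([\d.]+) seconds")

def slow_migrations(log_text, threshold_secs=10.0):
    """Return (pid, nodes, seconds) for migrations exceeding the threshold."""
    slow = []
    for m in MOVE_RE.finditer(log_text):
        pid, nodes, secs = int(m.group(1)), m.group(2), float(m.group(3))
        if secs > threshold_secs:
            slow.append((pid, nodes, secs))
    return slow

log = ("PID 4023 moved to node(s) 1 in 103.2 seconds\n"
       "PID 5000 moved to node(s) 2 in 0.4 seconds\n")
print(slow_migrations(log))  # -> [(4023, '1', 103.2)]
```

A migration that takes minutes rather than seconds is exactly the condition that, per the comments above, can block the guest long enough for it to appear dead.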
It might also be relevant that the Windows 2012 guest seems to hang without numad (if I read the comments correctly). Though maybe the more recent Windows 2012 guest tests without numad are working OK? Thanks very much for running so many tests! One more I might suggest would be using less than 30% of the system resources for the guest: so about 160GB RAM and about 30 vCPUs -- just to be sure 2x the guest fits well within the system resources. Assuming this also fails when numad is running, please start numad with the -l7 option before starting the guest, to capture more detailed debugging information. Thanks!

Added a DocText. The additional information about swapping leading to latencies comes from my experimentation on amd-dinar-08.lab.bos.redhat.com (32G 4 node system running 28G win2012 guest). I saw with top that shortly after numad kicked in a bunch of migration threads, we got this:

13190 root 20 0 18592 528 388 D 3.3 0.0 5:46.18 numad
  339 root 20 0 0 0 0 D 1.7 0.0 0:26.51 kswapd1
  341 root 20 0 0 0 0 D 1.7 0.0 0:28.07 kswapd3
  340 root 20 0 0 0 0 D 1.3 0.0 0:30.57 kswapd2
22808 root 20 0 15436 1720 944 R 0.7 0.0 0:01.29 top
  141 root 20 0 0 0 0 S 0.3 0.0 0:10.12 events/10
 1933 root 20 0 0 0 0 S 0.3 0.0 10:51.94 kondemand/0
 1964 root 20 0 0 0 0 S 0.3 0.0 0:45.07 kondemand/31
    1 root 20 0 19352 1152 984 S 0.0 0.0 0:02.87 init
    2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
    3 root RT 0 0 0 0 S 0.0 0.0 0:12.61 migration/0
    4 root 20 0 0 0 0 S 0.0 0.0 0:03.73 ksoftirqd/0
    5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
    6 root RT 0 0 0 0 S 0.0 0.0 0:00.84 watchdog/0
    7 root RT 0 0 0 0 S 0.0 0.0 0:17.21 migration/1
    8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
    9 root 20 0 0 0 0 S 0.0 0.0 0:05.04 ksoftirqd/1
   10 root RT 0 0 0 0 S 0.0 0.0 0:00.53 watchdog/1
   11 root RT 0 0 0 0 S 0.0 0.0 0:06.48 migration/2
   12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
   13 root 20 0 0 0 0 S 0.0 0.0 0:00.78 ksoftirqd/2
   14 root RT 0 0 0 0 S 0.0 0.0 0:00.61 watchdog/2
   15 root RT 0 0 0 0 S 0.0 0.0 0:05.61 migration/3
   16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
   17 root 20 0 0 0 0 S 0.0 0.0 0:01.30 ksoftirqd/3
   18 root RT 0 0 0 0 S 0.0 0.0 0:00.55 watchdog/3
   19 root RT 0 0 0 0 S 0.0 0.0 0:07.01 migration/4
   20 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
   21 root 20 0 0 0 0 S 0.0 0.0 0:00.58 ksoftirqd/4
   22 root RT 0 0 0 0 S 0.0 0.0 0:00.74 watchdog/4
   23 root RT 0 0 0 0 S 0.0 0.0 0:08.23 migration/5
   24 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/5
   25 root 20 0 0 0 0 S 0.0 0.0 0:01.51 ksoftirqd/5
   26 root RT 0 0 0 0 S 0.0 0.0 0:00.54 watchdog/5
   27 root RT 0 0 0 0 S 0.0 0.0 0:05.08 migration/6
   28 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/6
   29 root 20 0 0 0 0 S 0.0 0.0 0:01.36 ksoftirqd/6
   30 root RT 0 0 0 0 S 0.0 0.0 0:00.56 watchdog/6
   31 root RT 0 0 0 0 S 0.0 0.0 0:04.36 migration/7
   32 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/7
   33 root 20 0 0 0 0 S 0.0 0.0 0:01.08 ksoftirqd/7
   34 root RT 0 0 0 0 S 0.0 0.0 0:00.54 watchdog/7
   35 root RT 0 0 0 0 S 0.0 0.0 0:04.84 migration/8
   36 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/8
   37 root 20 0 0 0 0 S 0.0 0.0 0:03.47 ksoftirqd/8

(In reply to comment #31)
> Added a DocText. The additional information about swapping leading to
> latencies comes from my experimentation on amd-dinar-08.lab.bos.redhat.com
> (32G 4 node system running 28G win2012 guest).

Changed the doctext from "Bug Fix" to "Known Issue", as there's no fix at the moment.

(In reply to comment #32)
> Changed the doctext from "Bug Fix" to "Known Issue", as there's no fix at
> the moment.

Had to change the text as well, otherwise bugzilla reverts it back to "Bug Fix".

Thanks for adding the DocText. Yes, it is probably true that the system will be forced to swap when trying to use too much memory.
However, per comment 8 ("This issue only happens for win2012 guest. I tried win2k8r2 and win7, have no problem."), the DocText should indicate this appears to cause trouble only with very large Windows 2012 guests.

(In reply to comment #34)
> However, per comment 8 ("This issue only happens for win2012 guest. I tried
> win2k8r2 and win7, have no problem.") the DocText should indicate this
> appears to cause trouble only with very large Windows 2012 guests.

The latencies will occur for any huge task being migrated, which would be all Windows VMs that have large amounts of memory allocated to them in their configs - due to their page zeroing. Maybe only 2012 was overly sensitive to that latency, though? I'm not opposed to adding 2012 to the DocText, but it probably wouldn't hurt to keep it more general.

There's really no point in using numad in these scenarios anyway (moving a 4 node guest to 3 nodes). The potential for problems seems to outweigh the benefit.

Thanks. Since there appears to be such a clear Windows 2012 specific component here, it would be odd not to mention it. I agree there is marginal (if any) benefit from moving a huge guest from 100% of the system to 75% of the system -- but there might be benefit in moving to 50% of the system nodes, depending on the workload.

While we know moving huge amounts of memory takes time, and we suspect Windows 2012 is more sensitive to it than other guest types, we don't yet clearly know the root cause of the problem. It might be most accurate to explicitly communicate the facts as we know them: Windows 2008r2 and Windows 7 guests as large as 256GB appear to work correctly in a numad environment, but Windows 2012 guests as small as 120GB sometimes seem to hang in a numad environment, depending on the system memory quantity and configuration.
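The sizing rule of thumb discussed in this thread (a guest numad can safely manage should fit inside a single NUMA node) can be sketched as a quick calculation. This illustrates the guidance in the comments, not a tool from this bug; the 90% headroom factor is an assumption:

```python
# Sketch of the "guest should fit in one node" rule of thumb from this
# thread. The headroom factor is an assumed safety margin for host
# overhead, not a value from the bug report.
def max_safe_guest_kb(host_mem_kb, num_nodes, headroom=0.9):
    """Largest guest size (kB) that fits within a single NUMA node,
    assuming memory is evenly distributed across nodes."""
    per_node = host_mem_kb // num_nodes
    return int(per_node * headroom)

# Using the MemTotal from /proc/meminfo quoted earlier (~512GB host),
# treated as a typical 4-node system per the workaround text above.
host_kb = 529297552
print(max_safe_guest_kb(host_kb, 4) // (1024 * 1024), "GB")  # -> 113 GB
```

By this estimate, a 256G guest on the reporter's host exceeds a single node several times over, which is exactly the configuration that triggered the long numad migrations.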
We have reproduced this issue and learned more about it. The issue is specific to Windows 2012 guests that use more memory than exists in a single node. Windows 2012 guests appear to allocate memory more gradually than other Windows guest types, which triggers the issue. Note that other varieties of Windows guests do not seem to experience this problem.

You can work around this problem by: (1) limiting Windows 2012 guests to less memory than exists in a given node -- so on a typical 4 node system with even memory distribution, the guest would need less than the total amount of system memory divided by 4; or (2) allowing the Windows 2012 guest to finish allocating all of its memory before allowing numad to run. Numad will handle extremely huge Windows 2012 guests correctly after allowing a few minutes for the guest to finish allocating all of its memory. We will work on a general fix to handle all Windows 2012 guests, for some subsequent release.

(In reply to comment #42)
> (In reply to comment #40)
> > (In reply to comment #39)
> > > The testing environment for this bug is um.. nontrivial (both hw and sw).
> > > Mike, could we reuse your setup for testing the fix during rhel6.5 testing
> > > phase? Or would you retest it yourself once the fixed packages are available?
> > >
> > > Thanks in advance

I tested this issue with the same machine as comment 2, and the latest kernel and qemu on the host:
kernel: 2.6.32-376.el6.x86_64
qemu-kvm: qemu-kvm-0.12.1.2-2.369.el6.x86_64 and qemu-kvm-rhev-0.12.1.2-2.369.el6.x86_64

command line:
/usr/libexec/qemu-kvm -boot menu=on -m 256G -smp 48,cores=48,sockets=1,threads=1 -cpu Opteron_G3,family=0xf -drive file=/home/win2012-64-virtio.qcow2,format=qcow2,if=none,id=drive-ide0,cache=none,werror=stop,rerror=stop -device ide-drive,drive=drive-ide0,id=ide0,bootindex=1 -netdev tap,sndbuf=0,id=hostnet0,script=/etc/qemu-ifup,downscript=no -device e1000,netdev=hostnet0,mac=00:52:1a:21:62:01,bus=pci.0,addr=0x4,id=virtio-net-pci0 -uuid ac64c74a-a8d5-4c24-9839-fcc491439493 -rtc base=localtime,clock=host,driftfix=slew -no-kvm-pit-reinjection -chardev socket,id=111a,path=/tmp/amd-max-sut,server,nowait -mon chardev=111a,mode=readline -name amd-max-sut -vnc :0 -vga std -numa node,mem=32G,cpus=0,4,8,12,16,20,nodeid=0 -numa node,mem=32G,cpus=24,28,32,36,40,44,nodeid=1 -numa node,mem=32G,cpus=3,7,11,15,19,23,nodeid=2 -numa node,mem=32G,cpus=27,31,35,39,43,47,nodeid=3 -numa node,mem=32G,cpus=2,6,10,14,18,22,nodeid=4 -numa node,mem=32G,cpus=26,30,34,38,42,46,nodeid=5 -numa node,mem=32G,cpus=1,5,9,13,17,21,nodeid=6 -numa node,mem=32G,cpus=25,29,33,37,41,45,nodeid=7

Test result (with or without numa): guest works well; don't hit this issue.

Ales and bcao, if I have anything wrong, please correct me.

Pay attention to the bug component. This bug is a numad issue, but I did not see you start the numad service on the host. Please retest with the numad service running.

Re-tested this issue with 2.6.32-376.el6.x86_64 and qemu-kvm-rhev-0.12.1.2-2.369.el6.x86_64. If the numad service is started on the host, this bug is reproduced.

1. # mount cgroup -t cgroup -o cpuset /cgroup
   # numad -D /cgroup
   # /etc/rc.d/init.d/numad status
   numad (pid 13514) is running...

2.
Boot the guest as in comment 42:
# /usr/libexec/qemu-kvm -boot menu=on -m 256G -smp 48,cores=48,sockets=1,threads=1 -cpu Opteron_G3,family=0xf -drive file=/home/win2012-64-virtio.qcow2,format=qcow2,if=none,id=drive-ide0,cache=none,werror=stop,rerror=stop -device ide-drive,drive=drive-ide0,id=ide0,bootindex=1 -netdev tap,sndbuf=0,id=hostnet0,script=/etc/qemu-ifup,downscript=no -device e1000,netdev=hostnet0,mac=00:52:1a:21:62:01,bus=pci.0,addr=0x4,id=virtio-net-pci0 -uuid ac64c74a-a8d5-4c24-9839-fcc491439493 -rtc base=localtime,clock=host,driftfix=slew -no-kvm-pit-reinjection -chardev socket,id=111a,path=/tmp/amd-max-sut,server,nowait -mon chardev=111a,mode=readline -name amd-max-sut -vnc :0 -vga std -numa node,mem=32G,cpus=0,4,8,12,16,20,nodeid=0 -numa node,mem=32G,cpus=24,28,32,36,40,44,nodeid=1 -numa node,mem=32G,cpus=3,7,11,15,19,23,nodeid=2 -numa node,mem=32G,cpus=27,31,35,39,43,47,nodeid=3 -numa node,mem=32G,cpus=2,6,10,14,18,22,nodeid=4 -numa node,mem=32G,cpus=26,30,34,38,42,46,nodeid=5 -numa node,mem=32G,cpus=1,5,9,13,17,21,nodeid=6 -numa node,mem=32G,cpus=25,29,33,37,41,45,nodeid=7

Killed

numad version:
# rpm -qa | grep numa
numad-0.5-8.20121015git.el6.x86_64
numactl-2.0.7-6.el6.x86_64
numactl-devel-2.0.7-6.el6.x86_64

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.

wrt comment 52 asking if this can be resolved in 6.5 at all: Yes, I will try to get the fixed version into 6.5.z shortly after the verified fix is in an early 6.6 base level. Attempting to get it into 6.6 ...

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1594.html