Bug 1286917

Summary: QEMU process is killed when exceed memory limitation of cgroup
Product: Red Hat Enterprise Linux 6 Reporter: Yumei Huang <yuhuang>
Component: qemu-kvm Assignee: Andrew Jones <drjones>
Status: CLOSED NOTABUG QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.8 CC: ailan, chayang, drjones, juzhang, michen, mkenneth, ngu, qzhang, rbalakri, virt-maint, xfu
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-12-07 18:10:11 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Yumei Huang 2015-12-01 05:51:01 UTC
Description of problem:
When a guest uses more memory than the cgroup limit, the qemu process is killed outright. It may be correct for the cgroup to kill a process that exceeds the limit, but QE thinks qemu-kvm should avoid being killed outright when a single guest process tries to use more memory than allowed. It would be better for the guest to kill that process than for the whole guest to be killed.

Version-Release number of selected component (if applicable):
qemu: qemu-kvm-0.12.1.2-2.481.el6
kernel: 2.6.32-583.el6.x86_64 

How reproducible:
always

Steps to Reproduce:
1. create a cgroup and set the limits:
 #mount -t cgroup -o memory memory /cgroup/memory
 #echo 1G > /cgroup/memory/memory.limit_in_bytes
 #echo 2G > /cgroup/memory/memory.memsw.limit_in_bytes
2. start a guest with 4G memory:
#/usr/libexec/qemu-kvm -name rhel6-1 -m 4G -smp 4 \
-nodefaults -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -boot menu=on,strict=on \
-drive file=/home/guest/rhel6-1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-netdev tap,id=idinWyYp,vhost=on -device virtio-net-pci,mac=41:ce:a9:d2:4d:d7,id=idlbq7eA,netdev=idinWyYp \
-usb -device usb-tablet,id=input0 -vga qxl \
-spice port=5901,addr=0.0.0.0,disable-ticketing,image-compression=off,seamless-migration=on -monitor stdio
3. echo 'qemu_pid' >  /cgroup/memory/tasks  
4. in the guest, use dd to allocate 3G of memory
  #mount -t tmpfs none /mnt
  #dd if=/dev/zero of=/mnt/zero bs=1024k count=3072

Actual results:
(qemu) kvm_run: Bad address
Killed

Expected results:
The dd process of guest is killed instead of the qemu process being killed.

Additional info:

Comment 2 Gu Nini 2015-12-01 08:18:17 UTC
Hi Andrew,

This bug came from the following test case, although we made some changes to it. It also occurs on RHEL 7.2.

https://polarion.engineering.redhat.com/polarion/#/project/RHEL6/workitem?id=RHEL6-6638


In case you cannot see the test case, I list the steps here:
# mount -t cgroup -o memory none /cgroup
# cd /cgroup
1.create cgroup
# mkdir memory
2.set limit.
# echo 1G > ...memory/memory.limit_in_bytes
# echo 2G > ...memory.memsw.limit_in_bytes	
3.boot guest with 2G mem.	
4.# echo `pidof qemu-kvm` > ...memory/tasks **(contain threads)**	
5.in the guest.
# mount -t tmpfs none /mnt/
# dd if=/dev/zero of=/mnt/mem bs=1M count=2000	
6.in host:
# top

The expected result is: after step 6, the RES value in top is < 1G.

However, the actual result at present is that RES in top grows beyond 1G, in addition to the problem reported in this bz. That is, RES can be seen to exceed 1G before the guest is killed. Do you think this is normal, given that we set 'memory.limit_in_bytes' to 1G in step 2?
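
The check in step 6 could also be scripted rather than read off top. This is only a minimal sketch, assuming a Linux host with /proc mounted; the helper names rss_kb and under_limit are hypothetical and not part of the test case.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original test case.

# Print the resident set size (VmRSS, in kB) of process $1 from /proc.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Succeed if the RSS of process $1 is below the limit $2 (in kB).
under_limit() {
    [ "$(rss_kb "$1")" -lt "$2" ]
}

# Usage on the host (1G = 1048576 kB), assuming a single qemu-kvm process:
#   under_limit "$(pidof qemu-kvm)" 1048576 && echo "RES below 1G"
```

Note that the cgroup's own accounting is exposed in memory.usage_in_bytes, which need not match the RES column exactly.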

Comment 3 Andrew Jones 2015-12-02 18:19:30 UTC
> Description of problem:
> When guest use more memory than the cgroup limitation, the qemu process is
> killed directly. Maybe it's right for cgroup to kill process which exceed
> the limitation.

Yes. According to section 9.8 of the kernel documentation file Documentation/cgroups/memcg_test.txt, pasted below, your test case is "Case B".

 9.8 OOM-Killer
        Out-of-memory caused by memcg's limit will kill tasks under
        the memcg. When hierarchy is used, a task under hierarchy
        will be killed by the kernel.
        In this case, panic_on_oom shouldn't be invoked and tasks
        in other groups shouldn't be killed.

        It's not difficult to cause OOM under memcg as following.
        Case A) when you can swapoff
        #swapoff -a
        #echo 50M > /memory.limit_in_bytes
        run 51M of malloc

        Case B) when you use mem+swap limitation.
        #echo 50M > memory.limit_in_bytes
        #echo 50M > memory.memsw.limit_in_bytes
        run 51M of malloc

(In reply to Yumei Huang from comment #0)

> But QE thought qemu-kvm should prevent being killed directly
> when one process of guest try to use exceeded memory. It's better that guest
> kill this process instead of the guest being killed. 
> 

The guest would need to be configured with memory less than or equal to the cgroup allowance. Your test case gives the guest 2G, but then limits its cgroup to only 1G. Your test case does allow memory+swap to be 2G, but the guest kernel isn't going to start swapping until it gets much closer to exhausting its main memory (which is 2G). This is why it consumes more than 1G, and is then eventually killed by the host.
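
A test script could check this precondition before booting the guest. The sketch below is only an illustration; to_bytes and guest_fits are hypothetical helper names, and the sizes use the K/M/G suffixes seen in the steps above, read as powers of 1024.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original test case.

# Convert a size such as 512M or 1G to bytes (K, M, G are powers of 1024).
to_bytes() {
    case "$1" in
        *K) echo $(( ${1%K} * 1024 )) ;;
        *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
        *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

# Succeed if guest memory $1 fits within cgroup limit $2.
guest_fits() {
    [ "$(to_bytes "$1")" -le "$(to_bytes "$2")" ]
}

# The configuration from the test case: 2G guest, 1G cgroup limit.
guest_fits 2G 1G || echo "guest memory exceeds the cgroup limit"
```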

> 
> Expected results:
> The dd process of guest is killed instead of the qemu process being killed.

The expectation is wrong for the test case. Also, even if you configure the guest memory and cgroup limits correctly (in order to give the guest kernel a chance to kill its memory hog), then I'm not sure that the guest kernel would always choose the dd process. It may choose something else. The only way to be sure it will choose the dd process is to also run that process in a cgroup inside the guest, limiting it to something low enough that the kernel will function correctly with the remaining memory, and pick the right task to kill.
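
Confining the memory hog inside the guest might look roughly like the sketch below. It assumes the memory controller is mounted in the guest at /cgroup/memory, as on the host in comment 0. The group name ddtest, the 512M limit and the helper run_in_memcg are illustrative choices, not values from this report.

```shell
#!/bin/sh
# Hypothetical helper: run a command inside a memory cgroup.
#   $1 = cgroup directory, $2 = memory limit, rest = command to run
run_in_memcg() {
    cg=$1; limit=$2; shift 2
    mkdir -p "$cg" || return 1
    echo "$limit" > "$cg/memory.limit_in_bytes" || return 1
    # The inner shell adds itself to the group and then execs the command,
    # so the command starts out already inside the cgroup.
    sh -c 'echo $$ > "$0/tasks" && exec "$@"' "$cg" "$@"
}

# Usage inside the guest (as root):
#   run_in_memcg /cgroup/memory/ddtest 512M \
#       dd if=/dev/zero of=/mnt/zero bs=1024k count=3072
```

If the guest has swap enabled, memory.memsw.limit_in_bytes would presumably need a limit in that group as well.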

Comment 4 Yumei Huang 2015-12-03 11:35:34 UTC
Hi Andrew, I tried configuring the guest memory equal to the cgroup limit. The guest terminated the dd process with "No space left on device" and kept working.

Then how do we test cgroups with qemu-kvm if we always set guest memory less than or equal to the cgroup limit? In that case, what difference does the guest see between running with cgroup limits and without them?


By the way, I also ran the test below with guest memory > cgroup limit:
On the host, with swap on and swap used at 0, echo 1G > memory.limit_in_bytes.
Then start a guest with 4G memory and echo qemu_pid > tasks.
In the guest, use dd to allocate 3000M of memory.
#mount -t tmpfs none /mnt
#dd if=/dev/zero of=/mnt/aa bs=1M count=3000
The guest terminated the dd process with "No space left on device".

In the guest, free memory is about 1.5G, swap used is 0, and the aa file is about 1.9G.
On the host, swap used is 897M:
#free -m
             total       used       free     shared    buffers     cached
Mem:        242157       7363     234794          0         51       3763
-/+ buffers/cache:       3548     238609 
Swap:         4095        897       3198

And RES of this qemu process is 1.8G:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                  
 8038 root      20   0 6508m 1.8g 6336 S 10.0  0.8   1:05.46 qemu-kvm 

I also checked memory.memsw.limit_in_bytes; it has the default value "9223372036854775807".

Can you explain why the aa file is 1.9G and RES > 1G (which is the cgroup limit)? The cgroup definitely does limit the guest memory, but how? Or is there anything wrong with my steps?

Comment 5 Andrew Jones 2015-12-03 21:25:22 UTC
(In reply to Yumei Huang from comment #4)
> Hi Andrew, I tried to configure the guest memory equal to cgroup limits,
> guest terminate dd process with "No space left on device" and continue
> working. 
> 
> Then how do we test cgroup  on qemu-kvm if we always set guest memory less
> than or equal to cgroup limits?

It doesn't make sense to give a guest 2G memory, and then restrict its QEMU process to 1G, as you're just setting the guest up for a crash when it tries to use the memory it was told that it has. Thus, the original test case was testing a nonsense config. However, it was correctly testing cgroup limits.

> On this condition, What's the difference to
> guest between with cgroup limits and without cgroup limits? 

The difference is that if a QEMU process starts using more memory, either from a malicious guest finding a way to force it, or due to some QEMU memory leaks, then only that QEMU process will be killed, rather than the whole host eventually running out of memory. Actually, a better value for the cgroup limit, than just being equal to the guest memory allocation, would be something a bit higher, something allowing for QEMU data structures etc.
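
As a rough illustration of that sizing, the limit can be computed as guest memory plus some headroom. The 256 MiB default below is an assumed placeholder, not a measured QEMU overhead, and memcg_limit_mb is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: suggest a cgroup limit in MiB.
#   $1 = guest memory in MiB
#   $2 = headroom for QEMU itself in MiB (default 256, an assumed value)
memcg_limit_mb() {
    echo $(( $1 + ${2:-256} ))
}

memcg_limit_mb 2048        # prints 2304
memcg_limit_mb 4096 512    # prints 4608

# Usage on the host (as root), with the cgroup path from comment 0:
#   echo "$(memcg_limit_mb 2048)M" > /cgroup/memory/memory.limit_in_bytes
```

The right headroom depends on the guest's devices and configuration, so it would have to be found by measurement.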

> 
> 
> By the way, I also did below test with guest memory > cgroup limit:
> In host, with swap on, and swap used is 0,  echo 1G > memory.limit_in_bytes. 
> Then start a guest with 4G memory, echo qemu_pid > tasks. 
> In guest, use dd to allocate 3000M memory. 
> #mount -t tmpfs none /mnt
> #dd if=/dev/zero of=/mnt/aa bs=1M count=3000
> The guest terminate dd process with "No space left on device". 
> 
> Check the guest, guest free memory is about 1.5G, swap used is 0, the size
> of aa file is about 1.9G.
> Check the host, the swap used is 897M:
> #free -m
>              total       used       free     shared    buffers     cached
> Mem:        242157       7363     234794          0         51       3763
> -/+ buffers/cache:       3548     238609 
> Swap:         4095        897       3198
> 
> And RES of this qemu process is 1.8G:
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  8038 root      20   0 6508m 1.8g 6336 S 10.0  0.8   1:05.46 qemu-kvm
> 
> Also checked memory.memsw.limit_in_bytes, it's the default value
> "9223372036854775807".
> 
> Can you explain why the size of aa file is 1.9G and RES > 1G (which is the
> cgroup limit) ? The cgroup definitely does limit the guest memory, but how ?
> or is there anything wrong about my steps?

If you don't turn the swap off in the host, then you need to set memory.memsw.limit_in_bytes to some limit. In this case you should set it to 1G. See both "Case A" and "Case B" from section 9.8 of the documentation I pasted in comment 3.
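
Setting both limits to the same value, as in "Case B", could be wrapped up as in the sketch below. The helper name cap_mem_and_swap is hypothetical; the two file names are the ones used throughout this report.

```shell
#!/bin/sh
# Hypothetical helper: cap memory and memory+swap of a cgroup at the same value.
#   $1 = cgroup directory, $2 = limit (for example 1G)
cap_mem_and_swap() {
    # When lowering from the defaults, write the memory limit first:
    # the kernel requires memory.limit_in_bytes to stay at or below
    # memory.memsw.limit_in_bytes.
    echo "$2" > "$1/memory.limit_in_bytes" &&
    echo "$2" > "$1/memory.memsw.limit_in_bytes"
}

# Usage on the host (as root), with the cgroup path from comment 0:
#   cap_mem_and_swap /cgroup/memory 1G
```

With both limits equal, the group cannot spill into host swap, so RES should stay at or below the limit until the OOM killer steps in.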