Bug 1286917
| Summary: | QEMU process is killed when exceed memory limitation of cgroup | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Yumei Huang <yuhuang> |
| Component: | qemu-kvm | Assignee: | Andrew Jones <drjones> |
| Status: | CLOSED NOTABUG | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.8 | CC: | ailan, chayang, drjones, juzhang, michen, mkenneth, ngu, qzhang, rbalakri, virt-maint, xfu |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-12-07 18:10:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
**Description** (Yumei Huang, 2015-12-01 05:51:01 UTC)
Hi Andrew, this bug came from the following test case, although we have made some changes to it. It also occurred on RHEL 7.2.

https://polarion.engineering.redhat.com/polarion/#/project/RHEL6/workitem?id=RHEL6-6638

In case you cannot see the test case, I list the steps here:

```
# mount -t cgroup -o memory none /cgroup
# cd /cgroup
```

1. Create a cgroup:

   ```
   # mkdir memory
   ```

2. Set the limits:

   ```
   # echo 1G > ...memory/memory.limit_in_bytes
   # echo 2G > ...memory.memsw.limit_in_bytes
   ```

3. Boot a guest with 2G of memory.

4. Add the QEMU process to the cgroup **(contains threads)**:

   ```
   # echo `pidof qemu-kvm` > ...memory/tasks
   ```

5. In the guest:

   ```
   # mount -t tmpfs none /mnt/
   # dd if=/dev/zero of=/mnt/mem bs=1M count=2000
   ```

6. In the host:

   ```
   # top
   ```

The expected result is that after step 6 the RES shown in top stays below 1G. The actual result at present is that RES increases to larger than 1G. Besides the problem in this bz, i.e. that RES can be seen to grow past 1G before the guest is killed, do you think this is normal, given that we set 'memory.limit_in_bytes' to 1G in step 2?

**Andrew Jones (comment 3):**

> Description of problem:
> When the guest uses more memory than the cgroup limit, the qemu process is
> killed directly. Maybe it's right for cgroup to kill a process which
> exceeds the limit.

Yes. From section 9.8 of the kernel documentation, Documentation/cgroups/memcg_test.txt, which is pasted below, your test case is "Case B".

9.8 OOM-Killer

Out-of-memory caused by memcg's limit will kill tasks under the memcg. When hierarchy is used, a task under the hierarchy will be killed by the kernel. In this case, panic_on_oom shouldn't be invoked and tasks in other groups shouldn't be killed. It's not difficult to cause OOM under memcg as follows.

Case A) when you can swapoff:

```
# swapoff -a
# echo 50M > /memory.limit_in_bytes
```

then run 51M of malloc.

Case B) when you use mem+swap limitation:
```
# echo 50M > memory.limit_in_bytes
# echo 50M > memory.memsw.limit_in_bytes
```

then run 51M of malloc.

(In reply to Yumei Huang from comment #0)

> But QE thought qemu-kvm should be prevented from being killed directly
> when one process in the guest tries to use excess memory. It would be
> better for the guest to kill that process instead of the guest itself
> being killed.

The guest would need to be configured with memory less than or equal to the cgroup allowance. Your test case gives the guest 2G, but then limits its cgroup to only 1G. The test case does allow memory+swap to be 2G, but the guest kernel isn't going to start swapping until it gets much closer to exhausting main memory (which is 2G). This is why it consumes more than 1G and is then eventually killed by the host.

> Expected results:
> The dd process in the guest is killed instead of the qemu process.

The expectation is wrong for this test case. Also, even if you configure the guest memory and cgroup limits correctly (in order to give the guest kernel a chance to kill its memory hog), I'm not sure the guest kernel would always choose the dd process. It may choose something else. The only way to be sure it chooses the dd process is to also run that process in a cgroup inside the guest, limiting it to something low enough that the kernel can function correctly with the remaining memory and pick the right task to kill.

**Yumei Huang (comment 4):**

Hi Andrew, I tried configuring the guest memory equal to the cgroup limit; the guest terminated the dd process with "No space left on device" and continued working.

Then how do we test cgroups on qemu-kvm if we always set the guest memory less than or equal to the cgroup limit? Under that condition, what difference do the cgroup limits make to the guest?

By the way, I also did the test below with guest memory greater than the cgroup limit. In the host, with swap on and swap usage at 0, echo 1G > memory.limit_in_bytes. Then start a guest with 4G of memory and echo qemu_pid > tasks. In the guest, use dd to allocate 3000M of memory:
```
# mount -t tmpfs none /mnt
# dd if=/dev/zero of=/mnt/aa bs=1M count=3000
```

The guest terminates the dd process with "No space left on device".

In the guest, free memory is about 1.5G, swap used is 0, and the size of the aa file is about 1.9G. In the host, swap used is 897M:

```
# free -m
             total       used       free     shared    buffers     cached
Mem:        242157       7363     234794          0         51       3763
-/+ buffers/cache:       3548     238609
Swap:         4095        897       3198
```

And the RES of this qemu process is 1.8G:

```
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
 8038 root  20   0 6508m 1.8g 6336 S 10.0  0.8 1:05.46  qemu-kvm
```

I also checked memory.memsw.limit_in_bytes; it is at the default value, "9223372036854775807".

Can you explain why the size of the aa file is 1.9G and RES > 1G (the cgroup limit)? The cgroup definitely does limit the guest memory, but how? Or is there anything wrong with my steps?

**Andrew Jones:**

(In reply to Yumei Huang from comment #4)

> Hi Andrew, I tried configuring the guest memory equal to the cgroup limit;
> the guest terminated the dd process with "No space left on device" and
> continued working.
>
> Then how do we test cgroups on qemu-kvm if we always set the guest memory
> less than or equal to the cgroup limit?

It doesn't make sense to give a guest 2G of memory and then restrict its QEMU process to 1G, as you're just setting the guest up for a crash when it tries to use the memory it was told it has. Thus, the original test case was testing a nonsense config. However, it was correctly testing the cgroup limits.

> Under that condition, what difference do the cgroup limits make to the
> guest?

The difference is that if a QEMU process starts using more memory, either because a malicious guest finds a way to force it or due to a QEMU memory leak, then only that QEMU process will be killed, rather than the whole host eventually running out of memory.
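The "default value" Yumei observed above is the cgroup-v1 sentinel for "unlimited" (2^63 - 1 bytes), which is how the excess guest memory was able to spill into host swap. A minimal sketch of checking for it, assuming the /cgroup/memory path from the reported steps; the `memsw_limited` helper name is my own, not from this report:

```shell
#!/bin/sh
# Hypothetical helper: report whether a memcg mem+swap limit is in effect.
# 9223372036854775807 (2^63 - 1) is the cgroup-v1 "unlimited" sentinel.
memsw_limited() {
    limit=$1
    if [ "$limit" -eq 9223372036854775807 ]; then
        echo "unlimited: excess guest pages can spill into host swap"
        return 1
    fi
    echo "limited to $limit bytes"
}

# Against the cgroup from the report (requires root):
#   memsw_limited "$(cat /cgroup/memory/memory.memsw.limit_in_bytes)"
```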
Actually, a better value for the cgroup limit than one merely equal to the guest memory allocation would be something a bit higher, allowing room for QEMU's own data structures, etc.

> By the way, I also did the test below with guest memory greater than the
> cgroup limit. In the host, with swap on and swap usage at 0, echo 1G >
> memory.limit_in_bytes. Then start a guest with 4G of memory and echo
> qemu_pid > tasks. In the guest, use dd to allocate 3000M of memory:
>
>     # mount -t tmpfs none /mnt
>     # dd if=/dev/zero of=/mnt/aa bs=1M count=3000
>
> The guest terminates the dd process with "No space left on device".
>
> In the guest, free memory is about 1.5G, swap used is 0, and the size of
> the aa file is about 1.9G. In the host, swap used is 897M:
>
>     # free -m
>                  total       used       free     shared    buffers     cached
>     Mem:        242157       7363     234794          0         51       3763
>     -/+ buffers/cache:       3548     238609
>     Swap:         4095        897       3198
>
> And the RES of this qemu process is 1.8G:
>
>       PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
>      8038 root  20   0 6508m 1.8g 6336 S 10.0  0.8 1:05.46  qemu-kvm
>
> I also checked memory.memsw.limit_in_bytes; it is at the default value,
> "9223372036854775807".
>
> Can you explain why the size of the aa file is 1.9G and RES > 1G (the
> cgroup limit)? The cgroup definitely does limit the guest memory, but how?
> Or is there anything wrong with my steps?

If you don't turn swap off in the host, then you need to set memory.memsw.limit_in_bytes to some limit; in this case you should set it to 1G. See both "Case A" and "Case B" in section 9.8 of the documentation I pasted in comment 3.
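Putting Andrew's advice together, the corrected host-side configuration might look like the sketch below. The 512 MiB of headroom for QEMU's own data structures is an assumed illustrative value, not one taken from this report, and the /cgroup/memory path mirrors the steps in the description:

```shell
#!/bin/sh
# Sketch: size the cgroup limit a bit above the guest's memory allocation,
# and cap mem+swap as well ("Case B") since host swap remains enabled.
GUEST_MEM_MIB=1024      # what the guest is booted with (e.g. -m 1024)
HEADROOM_MIB=512        # assumed allowance for QEMU's own overhead
LIMIT_MIB=$((GUEST_MEM_MIB + HEADROOM_MIB))
echo "cgroup limit: ${LIMIT_MIB}M"

# Applying it (requires root; cgroup created as in the reported steps):
#   echo "${LIMIT_MIB}M" > /cgroup/memory/memory.limit_in_bytes
#   echo "${LIMIT_MIB}M" > /cgroup/memory/memory.memsw.limit_in_bytes
#   for pid in $(pidof qemu-kvm); do echo "$pid" > /cgroup/memory/tasks; done
```

With both limits equal, QEMU's charged pages cannot spill into host swap, so exceeding the limit triggers the memcg OOM killer against the QEMU process alone, as described in "Case B" of memcg_test.txt.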