Description: [cgroup_v2] qemu crashed when memory.max is set to a too-small value

Versions:
qemu-kvm-4.0.0-6.module+el8.1.0+3736+a2aefea3.x86_64
kernel-4.18.0-107.el8.x86_64

How reproducible: 100%

Steps:
0. Enable cgroup v2.
1. Have a running guest:
$ virsh list
 Id   Name             State
--------------------------------
 1    avocado-vt-vm1   running

$ echo "" > /var/log/libvirt/qemu/avocado-vt-vm1.log
2. Set its memory.max cgroup parameter with the following command:
$ virsh memtune avocado-vt-vm1 1000
(or use the raw cgroup interface:
echo 1000 > /sys/fs/cgroup/machine.slice/machine-qemu\\x2d1\\x2davocado\\x2dvt\\x2dvm1.scope/memory.max)
3. The crash happens:
$ virsh list
 Id   Name   State
--------------------

$ cat /var/log/libvirt/qemu/avocado-vt-vm1.log
2019-07-31 08:57:21.418+0000: shutting down, reason=crashed
4. The backtrace is as follows:
$ gdb -p [PID_OF_VM]
...
(gdb) bt
#0  0x00007f6f7eb8b2d6 in __GI_ppoll (fds=0x55a353f9e8b0, nfds=72, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x000055a3530fc575 in qemu_poll_ns ()
Backtrace stopped: Cannot access memory at address 0x7ffc1b9ab9a8

Expected result:
qemu should not crash.

Additional info:
With cgroup v1 the memtune command fails and no crash happens:
# virsh memtune avocado-vt-vm1 1000
error: Unable to change memory parameters
error: Unable to write to '/sys/fs/cgroup/memory/machine.slice/machine-qemu\x2d2\x2davocado\x2dvt\x2dvm1.scope/memory.limit_in_bytes': Device or resource busy
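Step 0 ("enable cgroup v2") can be verified before reproducing. A minimal check, assuming the cgroup filesystem is mounted at the usual /sys/fs/cgroup location:

```shell
# Print which cgroup hierarchy is mounted at /sys/fs/cgroup.
# "cgroup2fs" means the unified (v2) hierarchy; "tmpfs" usually means
# a legacy v1 or hybrid setup.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo unknown)
if [ "$fstype" = "cgroup2fs" ]; then
    echo "cgroup v2 (unified) hierarchy"
else
    echo "not a pure cgroup v2 hierarchy (fstype: $fstype)"
fi
```

On RHEL 8 the unified hierarchy is typically enabled by booting with systemd.unified_cgroup_hierarchy=1 on the kernel command line.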
The qemu command line is as follows:
qemu 3146 1 8 06:31 ? 00:00:21 /usr/libexec/qemu-kvm -name guest=avocado-vt-vm1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-avocado-vt-vm1/master-key.aes -machine pc-q35-rhel8.0.0,accel=kvm,usb=off,dump-guest-core=off -cpu IvyBridge-IBRS,ss=on,vmx=off,pcid=on,hypervisor=on,arat=on,tsc_adjust=on,umip=on,md-clear=on,stibp=on,arch-capabilities=on,ssbd=on,xsaveopt=on -m 1024 -overcommit mem-lock=off -smp 2,sockets=2,cores=1,threads=1 -uuid aaff70d2-782a-4acb-accc-61ee9c4ed673 -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=30,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 -boot strict=on -device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 -device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 -device pcie-root-port,port=0x12,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 -device pcie-root-port,port=0x13,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 -device pcie-root-port,port=0x14,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 -device pcie-root-port,port=0x15,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 -device pcie-root-port,port=0x16,chassis=7,id=pci.7,bus=pcie.0,addr=0x2.0x6 -device qemu-xhci,p2=15,p3=15,id=usb,bus=pci.2,addr=0x0 -device virtio-serial-pci,id=virtio-serial0,bus=pci.3,addr=0x0 -drive file=/var/lib/avocado/data/avocado-vt/images/jeos-27-x86_64.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.4,addr=0x0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=32,id=hostnet0,vhost=on,vhostfd=33 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:8e:07:3e,bus=pci.1,addr=0x0 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,fd=34,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 127.0.0.1:0 -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pcie.0,addr=0x1 -device virtio-balloon-pci,id=balloon0,bus=pci.5,addr=0x0 -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.6,addr=0x0 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
Today I gave this rabbit hole a dive and I think I finally have a full picture of what is going on. Without going too much into the details, it is more or less like this:

1. virsh memtune sets the _hard_ memory limit on the virtual machine cgroup. The hard limit is supposed to make the tasks inside the cgroup get a visit from the OOM killer when the kernel is unable to swap out/reclaim some of the memory while the process allocates memory.

2. In cgroups v1, memory.limit_in_bytes is the hard limit on the cgroup's memory; however, when _setting_ it to a lower value than the cgroup currently uses, the kernel just tries a few times to reclaim memory and, if that fails, it returns -EBUSY. I confirmed this with the source of the 5.3 kernel. The kernel documentation is silent about this.

3. In cgroups v2, however, the hard limit is memory.max, and when the user sets it to a low value the kernel again tries to reclaim the memory and on failure the OOM killer is summoned. This is also verified from the source of the same kernel, plus it is explicitly mentioned in the cgroups v2 documentation in the kernel.

The cgroups v2 documentation explicitly frowns upon using memory.max and strongly advises using the soft limit, memory.high, instead:

memory.max
...
> This is the ultimate protection mechanism. As long as the
> high limit is used and monitored properly, this limit's
> utility is limited to providing the final safety net.

The soft limit, memory.high, is instead supposed to heavily throttle the process if breached but never invoke the OOM killer:

memory.high
...
> Memory usage throttle limit. This is the main mechanism to
> control memory usage of a cgroup. If a cgroup's usage goes
> over the high boundary, the processes of the cgroup are
> throttled and put under heavy reclaim pressure.
>
> Going over the high limit never invokes the OOM killer and
> under extreme conditions the limit may be breached.
...
> Usage Guidelines
> ~~~~~~~~~~~~~~~~
>
> "memory.high" is the main mechanism to control memory usage.
> Over-committing on high limit (sum of high limits > available memory)
> and letting global memory pressure to distribute memory according to
> usage is a viable strategy.
>
> Because breach of the high limit doesn't trigger the OOM killer but
> throttles the offending cgroup, a management agent has ample
> opportunities to monitor and take appropriate actions such as granting
> more memory or terminating the workload.
>
> Determining whether a cgroup has enough memory is not trivial as
> memory usage doesn't indicate whether the workload can benefit from
> more memory. For example, a workload which writes data received from
> network to a file can use all available memory but can also operate as
> performant with a small amount of memory. A measure of memory
> pressure - how much the workload is being impacted due to lack of
> memory - is necessary to determine whether a workload needs more
> memory; unfortunately, memory pressure monitoring mechanism isn't
...

That is basically it. A few options:

- Make libvirt refuse to set the hard limit when used with cgroups v2, or at least add some protection against setting it too low, e.g. below the amount of memory the guest was given.
- Or at least make virsh set the soft limit by default.
- Or raise the issue again (I am sure this API change was done on purpose and was debated already, but still).
- Or keep things as they are and just let the user shoot themselves in the foot, while adding advice not to set the hard limit to anything that could remotely crash the VM.

Let me know what you think.

Best regards,
Maxim Levitsky
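The suggested protection against setting the hard limit too low could be sketched as a small management-side guard. This is a hypothetical helper, not libvirt code; the guest memory size is hard-coded for illustration (matching the -m 1024 guest above), and the actual write to memory.max is only echoed rather than performed:

```shell
# Hypothetical guard: refuse a hard limit below the guest's configured RAM.
guest_mem_kib=1048576    # assumption: guest started with -m 1024 (1 GiB)

set_hard_limit() {
    limit_kib=$1
    if [ "$limit_kib" -lt "$guest_mem_kib" ]; then
        echo "refusing: ${limit_kib} KiB is below guest memory (${guest_mem_kib} KiB)" >&2
        return 1
    fi
    # On a real cgroup v2 host this would be something like:
    #   echo $((limit_kib * 1024)) > /sys/fs/cgroup/<scope>/memory.max
    echo "would set memory.max to $((limit_kib * 1024)) bytes"
}

set_hard_limit 1000 || echo "rejected 1000 KiB"
set_hard_limit 2097152
```

With such a check, the reproducer's `virsh memtune avocado-vt-vm1 1000` would be rejected up front instead of letting the kernel OOM-kill the qemu process.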
(In reply to Maxim Levitsky from comment #3)

> [full analysis quoted; trimmed, see comment #3 above]

Thanks for the detailed info!
Since this is "by design" from the kernel's perspective, and libvirt does little error handling for cgroups itself but lets the cgroup writes report their own errors, maybe we could just add some information about this to libvirt's documentation.

Could you please move this to libvirt? Thanks.
Done.

Note that I mistakenly presented this as a quote from the kernel docs: "The cgroups v2 documentation explicitly frowns upon using memory.max and strongly advises using the soft limit, memory.high, instead." That is just my observation, but it is more or less what the kernel docs say later in the 'Usage Guidelines' section anyway.
Pavel, can you please review and comment on this, and eventually move it to the documentation component? Thanks.
It's already documented [1]. If you use 'virsh memtune vm 1000' it will set the 'hard_limit'. [1] <https://libvirt.org/formatdomain.html#elementsMemoryTuning>
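For reference, the persistent way to configure these limits is the <memtune> element in the domain XML rather than ad-hoc virsh memtune calls. A minimal sketch with illustrative values (1 GiB guest; units in KiB as in the libvirt docs):

```xml
<domain type='kvm'>
  ...
  <memory unit='KiB'>1048576</memory>
  <memtune>
    <!-- hard_limit is written to memory.max on cgroup v2:
         keep it well above guest RAM, or the OOM killer may hit qemu -->
    <hard_limit unit='KiB'>1572864</hard_limit>
    <!-- soft_limit applies reclaim pressure without invoking the OOM
         killer; the exact cgroup v2 knob it maps to is a libvirt
         implementation detail -->
    <soft_limit unit='KiB'>1048576</soft_limit>
  </memtune>
  ...
</domain>
```

The libvirt documentation itself warns that setting hard_limit is risky and should generally be avoided unless required.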