Description: [cgroup_v2] qemu crashed when memory.max is set to a too-small value

Versions:
qemu-kvm-4.0.0-6.module+el8.1.0+3736+a2aefea3.x86_64
kernel-4.18.0-107.el8.x86_64

How reproducible: 100%

Steps:
0. Enable cgroup v2.
1. Have a running guest:
$ virsh list
 Id   Name             State
--------------------------------
 1    avocado-vt-vm1   running

$ echo "" > /var/log/libvirt/qemu/avocado-vt-vm1.log
2. Set its memory.max cgroup parameter with the following command:
$ virsh memtune avocado-vt-vm1 1000
(or use the raw cgroup interface:
echo 1000 > /sys/fs/cgroup/machine.slice/machine-qemu\\x2d1\\x2davocado\\x2dvt\\x2dvm1.scope/memory.max)
3. The crash happens:
$ virsh list
 Id   Name   State
--------------------

$ cat /var/log/libvirt/qemu/avocado-vt-vm1.log
2019-07-31 08:57:21.418+0000: shutting down, reason=crashed
4. The backtrace is as follows:
$ gdb -p [PID_OF_VM]
...
(gdb) bt
#0  0x00007f6f7eb8b2d6 in __GI_ppoll (fds=0x55a353f9e8b0, nfds=72, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x000055a3530fc575 in qemu_poll_ns ()
Backtrace stopped: Cannot access memory at address 0x7ffc1b9ab9a8

Expected result:
qemu should not crash.

Additional info:
With cgroup v1 the memtune command fails and no crash happens:
# virsh memtune avocado-vt-vm1 1000
error: Unable to change memory parameters
error: Unable to write to '/sys/fs/cgroup/memory/machine.slice/machine-qemu\x2d2\x2davocado\x2dvt\x2dvm1.scope/memory.limit_in_bytes': Device or resource busy
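Step 0 ("enable cgroup v2") can be verified before reproducing. A minimal check, assuming the cgroup filesystem is mounted at the usual /sys/fs/cgroup location:

```shell
# Print which cgroup hierarchy is mounted at /sys/fs/cgroup.
# "cgroup2fs" means the unified (v2) hierarchy; "tmpfs" usually means
# a legacy v1 or hybrid setup.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo unknown)
if [ "$fstype" = "cgroup2fs" ]; then
    echo "cgroup v2 (unified) hierarchy"
else
    echo "not a pure cgroup v2 hierarchy (fstype: $fstype)"
fi
```

On RHEL 8 the unified hierarchy is typically enabled by booting with systemd.unified_cgroup_hierarchy=1 on the kernel command line.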
The qemu command line is as follows:
qemu 3146 1 8 06:31 ? 00:00:21 /usr/libexec/qemu-kvm -name guest=avocado-vt-vm1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-avocado-vt-vm1/master-key.aes -machine pc-q35-rhel8.0.0,accel=kvm,usb=off,dump-guest-core=off -cpu IvyBridge-IBRS,ss=on,vmx=off,pcid=on,hypervisor=on,arat=on,tsc_adjust=on,umip=on,md-clear=on,stibp=on,arch-capabilities=on,ssbd=on,xsaveopt=on -m 1024 -overcommit mem-lock=off -smp 2,sockets=2,cores=1,threads=1 -uuid aaff70d2-782a-4acb-accc-61ee9c4ed673 -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=30,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 -boot strict=on -device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 -device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 -device pcie-root-port,port=0x12,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 -device pcie-root-port,port=0x13,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 -device pcie-root-port,port=0x14,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 -device pcie-root-port,port=0x15,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 -device pcie-root-port,port=0x16,chassis=7,id=pci.7,bus=pcie.0,addr=0x2.0x6 -device qemu-xhci,p2=15,p3=15,id=usb,bus=pci.2,addr=0x0 -device virtio-serial-pci,id=virtio-serial0,bus=pci.3,addr=0x0 -drive file=/var/lib/avocado/data/avocado-vt/images/jeos-27-x86_64.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.4,addr=0x0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=32,id=hostnet0,vhost=on,vhostfd=33 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:8e:07:3e,bus=pci.1,addr=0x0 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,fd=34,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 127.0.0.1:0 -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pcie.0,addr=0x1 -device virtio-balloon-pci,id=balloon0,bus=pci.5,addr=0x0 -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.6,addr=0x0 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
Today I gave this rabbit hole a dive and I think I finally have a full picture of what is going on. Without going too much into the details, it is more or less like this:

1. virsh memtune sets the _hard_ memory limit on the virtual machine cgroup. The hard limit is supposed to make the tasks inside the cgroup get a visit from the OOM killer when the kernel is unable to swap out/reclaim some of the memory while the process allocates memory.

2. In cgroups v1, memory.limit_in_bytes is the hard limit on the cgroup's memory; however, when _setting_ it to a lower value than the cgroup currently uses, the kernel just tries a few times to reclaim memory and, if that fails, it returns -EBUSY. I confirmed this with the source of the 5.3 kernel. The kernel documentation is silent about this.

3. In cgroups v2, however, the hard limit is memory.max, and when the user sets it to a low value the kernel again tries to reclaim the memory and on failure the OOM killer is summoned. This is also verified from the source of the same kernel, plus it is explicitly mentioned in the cgroups v2 documentation in the kernel.

The cgroups v2 documentation explicitly frowns upon using memory.max and strongly advises using the soft limit, memory.high, instead:

memory.max
...
> This is the ultimate protection mechanism. As long as the
> high limit is used and monitored properly, this limit's
> utility is limited to providing the final safety net.

The soft limit, memory.high, is instead supposed to heavily throttle the process if breached but never invoke the OOM killer:

memory.high
...
> Memory usage throttle limit. This is the main mechanism to
> control memory usage of a cgroup. If a cgroup's usage goes
> over the high boundary, the processes of the cgroup are
> throttled and put under heavy reclaim pressure.
>
> Going over the high limit never invokes the OOM killer and
> under extreme conditions the limit may be breached.
...
> Usage Guidelines
> ~~~~~~~~~~~~~~~~
>
> "memory.high" is the main mechanism to control memory usage.
> Over-committing on high limit (sum of high limits > available memory)
> and letting global memory pressure to distribute memory according to
> usage is a viable strategy.
>
> Because breach of the high limit doesn't trigger the OOM killer but
> throttles the offending cgroup, a management agent has ample
> opportunities to monitor and take appropriate actions such as granting
> more memory or terminating the workload.
>
> Determining whether a cgroup has enough memory is not trivial as
> memory usage doesn't indicate whether the workload can benefit from
> more memory. For example, a workload which writes data received from
> network to a file can use all available memory but can also operate as
> performant with a small amount of memory. A measure of memory
> pressure - how much the workload is being impacted due to lack of
> memory - is necessary to determine whether a workload needs more
> memory; unfortunately, memory pressure monitoring mechanism isn't
...

That is basically it. A few options:

- Make libvirt refuse to set the hard limit when used with cgroups v2, or at least add some protection against setting it too low, e.g. below the amount of memory the guest was given.
- Or at least make virsh set the soft limit by default.
- Or raise the issue again (I am sure this API change was done on purpose and was debated already, but still).
- Or keep things as they are and just let the user shoot themselves in the foot, while adding advice not to set the hard limit to anything that could remotely crash the VM.

Let me know what you think.

Best regards,
Maxim Levitsky
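The suggested protection against setting the hard limit too low could be sketched as a small management-side guard. This is a hypothetical helper, not libvirt code; the guest memory size is hard-coded for illustration (matching the -m 1024 guest above), and the actual write to memory.max is only echoed rather than performed:

```shell
# Hypothetical guard: refuse a hard limit below the guest's configured RAM.
guest_mem_kib=1048576    # assumption: guest started with -m 1024 (1 GiB)

set_hard_limit() {
    limit_kib=$1
    if [ "$limit_kib" -lt "$guest_mem_kib" ]; then
        echo "refusing: ${limit_kib} KiB is below guest memory (${guest_mem_kib} KiB)" >&2
        return 1
    fi
    # On a real cgroup v2 host this would be something like:
    #   echo $((limit_kib * 1024)) > /sys/fs/cgroup/<scope>/memory.max
    echo "would set memory.max to $((limit_kib * 1024)) bytes"
}

set_hard_limit 1000 || echo "rejected 1000 KiB"
set_hard_limit 2097152
```

With such a check, the reproducer's `virsh memtune avocado-vt-vm1 1000` would be rejected up front instead of letting the kernel OOM-kill the qemu process.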
(In reply to Maxim Levitsky from comment #3)

> [full analysis quoted; trimmed, see comment #3 above]

Thanks for the detailed info!
Since this is "by design" from the kernel's perspective, and libvirt does little error handling for cgroups itself but lets the cgroup writes report their own errors, maybe we could just add some information about this to libvirt's documentation.

Could you please move this to libvirt? Thanks.
Done.

Note that I mistakenly presented this as a quote from the kernel docs: "The cgroups v2 documentation explicitly frowns upon using memory.max and strongly advises using the soft limit, memory.high, instead." That is just my observation, but it is more or less what the kernel docs say later in the 'Usage Guidelines' section anyway.
Pavel, can you please review and comment on this, and eventually move it to the documentation component? Thanks.
It's already documented [1]. If you use 'virsh memtune vm 1000' it will set the 'hard_limit'. [1] <https://libvirt.org/formatdomain.html#elementsMemoryTuning>
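For reference, the persistent way to configure these limits is the <memtune> element in the domain XML rather than ad-hoc virsh memtune calls. A minimal sketch with illustrative values (1 GiB guest; units in KiB as in the libvirt docs):

```xml
<domain type='kvm'>
  ...
  <memory unit='KiB'>1048576</memory>
  <memtune>
    <!-- hard_limit is written to memory.max on cgroup v2:
         keep it well above guest RAM, or the OOM killer may hit qemu -->
    <hard_limit unit='KiB'>1572864</hard_limit>
    <!-- soft_limit applies reclaim pressure without invoking the OOM
         killer; the exact cgroup v2 knob it maps to is a libvirt
         implementation detail -->
    <soft_limit unit='KiB'>1048576</soft_limit>
  </memtune>
  ...
</domain>
```

The libvirt documentation itself warns that setting hard_limit is risky and should generally be avoided unless required.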