This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 2168346 - VM stuck at CrashLoopBackOff state after it hits OOM
Summary: VM stuck at CrashLoopBackOff state after it hits OOM
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.11.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.15.1
Assignee: Igor Bezukh
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks: 2173980 2225204
 
Reported: 2023-02-08 18:51 UTC by Boaz
Modified: 2024-04-13 04:25 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2173980
Environment:
Last Closed: 2023-12-14 16:12:47 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Issue Tracker CNV-25200 (last updated 2023-12-14 16:12:46 UTC)

Description Boaz 2023-02-08 18:51:37 UTC
--------------------------------------------------
Description of problem:
--------------------------------------------------
As part of an OOM investigation, I deliberately tried to hit OOM on a VM by straining its memory with a heavy IO workload. Something unexpected happened: instead of just an OOM kill followed by a QEMU reboot, the VM failed to run again:

NAME            AGE   STATUS             READY
rhel82-vm0001   26h   CrashLoopBackOff   False
rhel82-vm0002   26h   Running            True
rhel82-vm0003   26h   Stopped            False

Pod logs:

{"component":"virt-launcher","kind":"","level":"error","msg":"Failed to sync vmi","name":"rhel82-vm0001","namespace":"default","pos":"server.go:184","reason":"virError(Code=1, Domain=10, Message='internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-212-default_rhel82-vm000/org.qemu.guest_agent.0' too long')","timestamp":"2023-02-08T18:36:49.591084Z","uid":"0a6404c5-2ba7-4cc0-ad2a-307018174023"}
{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel82-vm0001, open /run/libvirt/qemu/run/default_rhel82-vm0001.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-02-08T18:36:49.664614Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-213-default_rhel82-vm000/org.qemu.guest_agent.0' too long","pos":"qemuOpenChrChardevUNIXSocket:5223","subcomponent":"libvirt","thread":"30","timestamp":"2023-02-08T18:36:50.627000Z"}
{"component":"virt-launcher","kind":"","level":"error","msg":"Failed to start VirtualMachineInstance with flags 0.","name":"rhel82-vm0001","namespace":"default","pos":"manager.go:880","reason":"virError(Code=1, Domain=10, Message='internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-213-default_rhel82-vm000/org.qemu.guest_agent.0' too long')","timestamp":"2023-02-08T18:36:50.628244Z","uid":"0a6404c5-2ba7-4cc0-ad2a-307018174023"}
{"component":"virt-launcher","kind":"","level":"error","msg":"Failed to sync vmi","name":"rhel82-vm0001","namespace":"default","pos":"server.go:184","reason":"virError(Code=1, Domain=10, Message='internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-213-default_rhel82-vm000/org.qemu.guest_agent.0' too long')","timestamp":"2023-02-08T18:36:50.628304Z","uid":"0a6404c5-2ba7-4cc0-ad2a-307018174023"}
{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel82-vm0001, open /run/libvirt/qemu/run/default_rhel82-vm0001.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-02-08T18:36:50.663725Z"}
{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel82-vm0001, open /run/libvirt/qemu/run/default_rhel82-vm0001.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-02-08T18:36:51.663684Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-214-default_rhel82-vm000/org.qemu.guest_agent.0' too long","pos":"qemuOpenChrChardevUNIXSocket:5223","subcomponent":"libvirt","thread":"29","timestamp":"2023-02-08T18:36:51.663000Z"}
{"component":"virt-launcher","kind":"","level":"error","msg":"Failed to start VirtualMachineInstance with flags 0.","name":"rhel82-vm0001","namespace":"default","pos":"manager.go:880","reason":"virError(Code=1, Domain=10, Message='internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-214-default_rhel82-vm000/org.qemu.guest_agent.0' too long')","timestamp":"2023-02-08T18:36:51.664818Z","uid":"0a6404c5-2ba7-4cc0-ad2a-307018174023"}
{"component":"virt-launcher","kind":"","level":"error","msg":"Failed to sync vmi","name":"rhel82-vm0001","namespace":"default","pos":"server.go:184","reason":"virError(Code=1, Domain=10, Message='internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-214-default_rhel82-vm00
0/org.qemu.guest_agent.0' too long')","timestamp":"2023-02-08T18:36:51.664884Z","uid":"0a6404c5-2ba7-4cc0-ad2a-307018174023"}




OOM record:
[Wed Feb  8 12:19:15 2023] worker invoked oom-killer: gfp_mask=0x620100(GFP_NOIO|__GFP_HARDWALL|__GFP_WRITE), order=0, oom_score_adj=979
[Wed Feb  8 12:19:15 2023]  oom_kill_process.cold.32+0xb/0x10
[Wed Feb  8 12:19:15 2023] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Wed Feb  8 12:19:15 2023] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-a2021e5dd93338ba5e39cef21c773838a294ab95a466c7887054e9e24f72e8e4.scope,mems_allowed=0-1,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6553054d_e923_4628_b36c_c6754eb6e0b1.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6553054d_e923_4628_b36c_c6754eb6e0b1.slice/crio-a2021e5dd93338ba5e39cef21c773838a294ab95a466c7887054e9e24f72e8e4.scope,task=qemu-kvm,pid=3196344,uid=107
[Wed Feb  8 12:19:15 2023] Memory cgroup out of memory: Killed process 3196344 (qemu-kvm) total-vm:64560756kB, anon-rss:58285188kB, file-rss:17672kB, shmem-rss:4kB, UID:107 pgtables:115428kB oom_score_adj:979
[Wed Feb  8 12:19:15 2023] oom_reaper: reaped process 3196344 (qemu-kvm), now anon-rss:0kB, file-rss:68kB, shmem-rss:4kB


--------------------------------------------------
Version-Release number of selected component (if applicable):
--------------------------------------------------
kubevirt-hyperconverged-operator.v4.11.3
local-storage-operator.v4.12.0-202301042354
mcg-operator.v4.11.4
ocs-operator.v4.11.4

--------------------------------------------------
How reproducible:
--------------------------------------------------

Not sure, but the broken state persists across all subsequent restart attempts.

--------------------------------------------------
Steps to Reproduce:
--------------------------------------------------

1. Strain the VM with a heavy-duty memory/IO workload (see the sketch after this list)
2. Reach OOM
3. Repeat
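
A minimal sketch of that kind of workload, run inside the guest (hypothetical; the exact commands used are not recorded here). It keeps anonymous memory resident while doing heavy file IO until the memory limit is reached and the OOM killer fires:

// memstress.go: grow resident memory while writing to disk, to push the
// guest (and with it the virt-launcher pod) toward its memory limit.
package main

import (
	"log"
	"os"
)

func main() {
	const chunk = 64 << 20 // 64 MiB per allocation

	f, err := os.CreateTemp("", "iostress-*")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())

	var hold [][]byte
	buf := make([]byte, chunk)
	for {
		// Touch every page so the allocation actually counts as anon RSS.
		block := make([]byte, chunk)
		for i := range block {
			block[i] = 1
		}
		hold = append(hold, block)

		// Heavy IO alongside the memory pressure.
		if _, err := f.Write(buf); err != nil {
			log.Fatal(err)
		}
		log.Printf("resident: ~%d MiB", len(hold)*64)
	}
}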

--------------------------------------------------
Actual results:
--------------------------------------------------
The VM no longer boots.

--------------------------------------------------
Expected results:
--------------------------------------------------
The VM reboots and starts normally.

--------------------------------------------------
logs:
--------------------------------------------------
I collected both the must-gather and the SOS report from the node that ran the VM:
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/vm_doesnt_boot_after_oom.tar.gz

Comment 1 Yan Du 2023-02-15 13:14:50 UTC
This doesn't look like a storage issue; moving to the Virt component.

Comment 2 Jed Lejosne 2023-02-27 18:55:41 UTC
There's an interesting error there:
internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-214-default_rhel82-vm000/org.qemu.guest_agent.0' too long

This is indeed 108 characters long, 1 more than the 107 allowed by Linux. I think "214" here is the number of times the VM has been rebooted.
This means VMs can only be rebooted 98 times. We need to address that.
I don't see why, as far as libvirt is concerned, VMs couldn't just be called "vm" instead of "<namespace>_<VMI name>".
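
For illustration, a quick arithmetic check of the failing path against that limit (a sketch; the 107-character figure is the usable sockaddr_un.sun_path length on Linux, i.e. 108 bytes minus the trailing NUL):

// pathcheck.go: show that the guest-agent socket path from the log exceeds
// the usable sun_path length, which is why libvirt rejects it.
package main

import "fmt"

func main() {
	// Path copied from the virt-launcher log above (domain index 214).
	path := "/var/run/kubevirt-private/libvirt/qemu/channel/target/" +
		"domain-214-default_rhel82-vm000/org.qemu.guest_agent.0"

	const sunPathMax = 107 // 108-byte sun_path minus the trailing NUL

	fmt.Printf("len=%d, limit=%d\n", len(path), sunPathMax) // prints len=108, limit=107
	if len(path) > sunPathMax {
		fmt.Println("too long: libvirt refuses to create the guest-agent socket")
	}
}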

Comment 3 Jed Lejosne 2023-02-27 19:44:51 UTC
(In reply to Jed Lejosne from comment #2)
> [...]
> This means VMs can only be rebooted 98 times. We need to address that.

This is actually incorrect: VMs need to actually crash for that number to increase, so it's not such a big deal.
However, @bbenshab, please give more information on how you managed to trigger the OOM killer.
If that was solely by doing things from inside the guest, then that's a problem. No matter what guests do, they should never be able to cause virt-launcher to run out of memory...

Comment 5 Antonio Cardace 2023-03-03 16:48:21 UTC
Deferring to 4.14 due to capacity.

Comment 6 Igor Bezukh 2023-08-23 08:56:17 UTC
Clone of libvirt fix for RHEL 9.2.0.z: https://bugzilla.redhat.com/show_bug.cgi?id=2233744

Comment 7 Igor Bezukh 2023-08-31 08:00:46 UTC
Hi,

The libvirt fix will be available in RHEL 9.2.0.z batch update 3, which will be released on 12-09-2023.

The CNV blocker-only date is 05-09-2023.

I would suggest deferring the bug to 4.15.

Comment 8 Antonio Cardace 2023-10-11 12:40:16 UTC
Deferring to 4.15.1 due to capacity.

Comment 9 Red Hat Bugzilla 2024-04-13 04:25:08 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

