Bug 2225204 - The path to the guest agent socket file can become too long and cause problems
Summary: The path to the guest agent socket file can become too long and cause problems
Keywords:
Status: CLOSED DUPLICATE of bug 2173980
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: libvirt
Version: CentOS Stream
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Michal Privoznik
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On: 2168346 2173980
Blocks:
 
Reported: 2023-07-24 14:33 UTC by Igor Bezukh
Modified: 2023-07-25 08:10 UTC
CC List: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2173980
Environment:
Last Closed: 2023-07-25 08:10:05 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-163159 0 None None None 2023-07-24 14:37:26 UTC

Description Igor Bezukh 2023-07-24 14:33:09 UTC
+++ This bug was initially created as a clone of Bug #2173980 +++

+++ This bug was initially created as a clone of Bug #2168346 +++

--------------------------------------------------
Description of problem:
--------------------------------------------------
As part of an OOM investigation, I deliberately tried to trigger an OOM on a VM by straining its memory with a heavy I/O workload. Something unexpected happened: instead of just the OOM (which causes a QEMU reboot), the VM then failed to run again:

NAME            AGE   STATUS             READY
rhel82-vm0001   26h   CrashLoopBackOff   False
rhel82-vm0002   26h   Running            True
rhel82-vm0003   26h   Stopped            False

pod logs :

{"component":"virt-launcher","kind":"","level":"error","msg":"Failed to sync vmi","name":"rhel82-vm0001","namespace":"default","pos":"server.go:184","reason":"virError(Code=1, Domain=10, Message='internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-212-default_rhel82-vm000/org.qemu.guest_agent.0' too long')","timestamp":"2023-02-08T18:36:49.591084Z","uid":"0a6404c5-2ba7-4cc0-ad2a-307018174023"}
{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel82-vm0001, open /run/libvirt/qemu/run/default_rhel82-vm0001.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-02-08T18:36:49.664614Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-213-default_rhel82-vm000/org.qemu.guest_agent.0' too long","pos":"qemuOpenChrChardevUNIXSocket:5223","subcomponent":"libvirt","thread":"30","timestamp":"2023-02-08T18:36:50.627000Z"}
{"component":"virt-launcher","kind":"","level":"error","msg":"Failed to start VirtualMachineInstance with flags 0.","name":"rhel82-vm0001","namespace":"default","pos":"manager.go:880","reason":"virError(Code=1, Domain=10, Message='internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-213-default_rhel82-vm000/org.qemu.guest_agent.0' too long')","timestamp":"2023-02-08T18:36:50.628244Z","uid":"0a6404c5-2ba7-4cc0-ad2a-307018174023"}
{"component":"virt-launcher","kind":"","level":"error","msg":"Failed to sync vmi","name":"rhel82-vm0001","namespace":"default","pos":"server.go:184","reason":"virError(Code=1, Domain=10, Message='internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-213-default_rhel82-vm000/org.qemu.guest_agent.0' too long')","timestamp":"2023-02-08T18:36:50.628304Z","uid":"0a6404c5-2ba7-4cc0-ad2a-307018174023"}
{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel82-vm0001, open /run/libvirt/qemu/run/default_rhel82-vm0001.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-02-08T18:36:50.663725Z"}
{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel82-vm0001, open /run/libvirt/qemu/run/default_rhel82-vm0001.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-02-08T18:36:51.663684Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-214-default_rhel82-vm000/org.qemu.guest_agent.0' too long","pos":"qemuOpenChrChardevUNIXSocket:5223","subcomponent":"libvirt","thread":"29","timestamp":"2023-02-08T18:36:51.663000Z"}
{"component":"virt-launcher","kind":"","level":"error","msg":"Failed to start VirtualMachineInstance with flags 0.","name":"rhel82-vm0001","namespace":"default","pos":"manager.go:880","reason":"virError(Code=1, Domain=10, Message='internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-214-default_rhel82-vm000/org.qemu.guest_agent.0' too long')","timestamp":"2023-02-08T18:36:51.664818Z","uid":"0a6404c5-2ba7-4cc0-ad2a-307018174023"}
{"component":"virt-launcher","kind":"","level":"error","msg":"Failed to sync vmi","name":"rhel82-vm0001","namespace":"default","pos":"server.go:184","reason":"virError(Code=1, Domain=10, Message='internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-214-default_rhel82-vm00
0/org.qemu.guest_agent.0' too long')","timestamp":"2023-02-08T18:36:51.664884Z","uid":"0a6404c5-2ba7-4cc0-ad2a-307018174023"}




OOM record:
[Wed Feb  8 12:19:15 2023] worker invoked oom-killer: gfp_mask=0x620100(GFP_NOIO|__GFP_HARDWALL|__GFP_WRITE), order=0, oom_score_adj=979
[Wed Feb  8 12:19:15 2023]  oom_kill_process.cold.32+0xb/0x10
[Wed Feb  8 12:19:15 2023] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Wed Feb  8 12:19:15 2023] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-a2021e5dd93338ba5e39cef21c773838a294ab95a466c7887054e9e24f72e8e4.scope,mems_allowed=0-1,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6553054d_e923_4628_b36c_c6754eb6e0b1.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6553054d_e923_4628_b36c_c6754eb6e0b1.slice/crio-a2021e5dd93338ba5e39cef21c773838a294ab95a466c7887054e9e24f72e8e4.scope,task=qemu-kvm,pid=3196344,uid=107
[Wed Feb  8 12:19:15 2023] Memory cgroup out of memory: Killed process 3196344 (qemu-kvm) total-vm:64560756kB, anon-rss:58285188kB, file-rss:17672kB, shmem-rss:4kB, UID:107 pgtables:115428kB oom_score_adj:979
[Wed Feb  8 12:19:15 2023] oom_reaper: reaped process 3196344 (qemu-kvm), now anon-rss:0kB, file-rss:68kB, shmem-rss:4kB


--------------------------------------------------
Version-Release number of selected component (if applicable):
--------------------------------------------------
kubevirt-hyperconverged-operator.v4.11.3
local-storage-operator.v4.12.0-202301042354
mcg-operator.v4.11.4
ocs-operator.v4.11.4

--------------------------------------------------
How reproducible:
--------------------------------------------------

No idea, but the current broken state is persistent.

--------------------------------------------------
Steps to Reproduce:
--------------------------------------------------

1. strain the VM using a heavy-duty workload
2. reach OOM
3. repeat  

--------------------------------------------------
Actual results:
--------------------------------------------------
The VM no longer boots.

--------------------------------------------------
Expected results:
--------------------------------------------------
The VM reboots and starts normally.

--------------------------------------------------
logs:
--------------------------------------------------
I collected both the must-gather and the SOS report from the node that ran the VM:
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/vm_doesnt_boot_after_oom.tar.gz

--- Additional comment from Yan Du on 2023-02-15 13:14:50 UTC ---

This doesn't look like a storage component issue; moving to the Virt component.

--- Additional comment from Jed Lejosne on 2023-02-27 18:55:41 UTC ---

There's an interesting error there:
internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-214-default_rhel82-vm000/org.qemu.guest_agent.0' too long

This is indeed 108 characters long, 1 more than the 107 allowed by Linux. I think "214" here is the number of times the VM rebooted.
This means VMs can only be rebooted 98 times. We need to address that.
I don't see why, as far as libvirt is concerned, VMs couldn't just be called "vm" instead of "<namespace>_<VMI name>".
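
For reference, a minimal C sketch (an illustration of the limit, not libvirt's actual check) showing where the 107-character ceiling comes from: sun_path in struct sockaddr_un is 108 bytes on Linux, so a NUL-terminated path can be at most 107 characters. The example path is the one from the log above.

/* Sketch only: demonstrates the sun_path size limit behind the
 * "UNIX socket path ... too long" error. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
    const char *path =
        "/var/run/kubevirt-private/libvirt/qemu/channel/target/"
        "domain-214-default_rhel82-vm000/org.qemu.guest_agent.0";
    struct sockaddr_un addr;

    /* sizeof(addr.sun_path) is 108 on Linux; a NUL-terminated path must
     * be at most 107 characters to fit. */
    printf("path length: %zu, sun_path capacity: %zu\n",
           strlen(path), sizeof(addr.sun_path));
    if (strlen(path) >= sizeof(addr.sun_path))
        fprintf(stderr, "UNIX socket path too long\n");
    return 0;
}

With the path above, strlen() is 108, so the check fails by exactly one character.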

--- Additional comment from Jed Lejosne on 2023-02-27 19:44:51 UTC ---

(In reply to Jed Lejosne from comment #2)
> [...]
> This means VMs can only be rebooted 98 times. We need to address that.

This is actually incorrect: VMs need to actually crash for that number to increase, so it's not such a big deal.
However, @bbenshab, please give more information on how you managed to trigger the OOM killer.
If that was solely by doing things from inside the guest, then that's a problem: no matter what a guest does, it should not be able to cause virt-launcher to run out of memory...

--- Additional comment from Boaz on 2023-02-28 10:29:42 UTC ---

@jlejosne this was an investigation of a customer case that is described here:
https://docs.google.com/document/d/1bMWAkw7Scp98XgXmtVH-vRD_YOjB2xUW7KUcs8SeUSg

--- Additional comment from Jed Lejosne on 2023-02-28 15:03:49 UTC ---

The maximum socket path length in Linux is 107 characters. In KubeVirt, we can end up with a 108-character path for the guest agent, like:
/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-214-default_rhel82-vm000/org.qemu.guest_agent.0

We need to find a way to either shorten the path or replace it with something else, like a file descriptor.

--- Additional comment from Michal Privoznik on 2023-02-28 17:04:17 UTC ---

I see a couple of ways to fix this issue:

1) make cfg->channelTargetDir configurable so that mgmt apps can skip the "/libvirt/qemu/channel/target/" part, or

2) KubeVirt may use shorter VM names, or

3) Libvirt may generate even shorter "shortened names" (currently 20 chars), since we're guaranteed to be unique regardless of short name length (due to domain ID in the path).
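
To illustrate option 3, here is a rough sketch in C (the channel_dir() helper and the truncation lengths are hypothetical; libvirt's real shortened-name logic may differ): the per-domain directory is "domain-<id>-<shortname>", uniqueness comes from the domain ID, so the short name could be trimmed further.

/* Hypothetical illustration of option 3: the shorter the "shortened
 * name", the shorter the generated channel directory. Not libvirt code. */
#include <stdio.h>
#include <string.h>

static void channel_dir(char *buf, size_t buflen, const char *base,
                        int domid, const char *name, int short_max)
{
    /* "domain-<id>-<shortname>": <id> already guarantees uniqueness. */
    snprintf(buf, buflen, "%s/domain-%d-%.*s", base, domid, short_max, name);
}

int main(void)
{
    const char *base = "/var/run/kubevirt-private/libvirt/qemu/channel/target";
    char dir[256];

    channel_dir(dir, sizeof(dir), base, 214, "default_rhel82-vm0001", 20);
    printf("20-char short name: %s (%zu chars)\n", dir, strlen(dir));

    channel_dir(dir, sizeof(dir), base, 214, "default_rhel82-vm0001", 8);
    printf(" 8-char short name: %s (%zu chars)\n", dir, strlen(dir));
    return 0;
}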

It was also suggested that libvirt could just pass an FD to QEMU. I did not realize it back then, but libvirt is already doing that, e.g.:

  -chardev socket,id=charchannel0,fd=32,server=on,wait=off \
  -device {"driver":"virtserialport","bus":"virtio-serial0.0","nr":1,"chardev":"charchannel0","id":"channel0","name":"org.qemu.guest_agent.0"}

but that won't really fly when libvirtd restarts and needs to reconnect to the socket, unless there's a way in QEMU to provide a new FD, in which case libvirt could just use an unnamed socket and tell QEMU the new FD on reconnect.
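
For illustration, a minimal, self-contained sketch of the general pattern behind that fd= chardev (a toy example, not libvirt's implementation; the /tmp socket path and the bare qemu-kvm invocation are placeholders):

/* Toy sketch of passing a pre-created listening UNIX socket to QEMU via
 * "-chardev socket,fd=N". Placeholder paths and arguments; not libvirt code. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    char chardev[128];
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0)
        return 1;
    strncpy(addr.sun_path, "/tmp/org.qemu.guest_agent.0",
            sizeof(addr.sun_path) - 1);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 || listen(fd, 1) < 0)
        return 1;

    /* The fd must stay open across exec so QEMU can inherit it. */
    fcntl(fd, F_SETFD, fcntl(fd, F_GETFD) & ~FD_CLOEXEC);

    snprintf(chardev, sizeof(chardev),
             "socket,id=charchannel0,fd=%d,server=on,wait=off", fd);
    /* A real invocation carries many more arguments; this only shows the
     * chardev wiring. */
    execlp("qemu-kvm", "qemu-kvm", "-chardev", chardev, (char *)NULL);
    perror("execlp");
    return 1;
}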

--- Additional comment from Lili Zhu on 2023-03-01 08:44:30 UTC ---

Reproduce this bug with:
libvirt-9.0.0-7.el9.x86_64

1. Create a directory for the guest agent socket source:
# mkdir  /var/run/kubevirt-private/libvirt/qemu/channel/target/domain-212-default_rhel82-vm000/ -p


2. Change the owner and SELinux context of the directory:
   # chown qemu:qemu /var/run/kubevirt-private/libvirt/qemu/channel/target/domain-212-default_rhel82-vm000/
   # chcon system_u:object_r:qemu_var_run_t:s0 /var/run/kubevirt-private/libvirt/qemu/channel/target/domain-212-default_rhel82-vm000/

3. prepare a guest with the following xml snippet
   <channel type='unix'>
      <source mode='bind' path='/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-212-default_rhel82-vm000/org.qemu.guest_agent.1'/>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='2'/>
    </channel>

4. start the guest
# virsh start vm1
error: Failed to start domain 'vm1'
error: internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-212-default_rhel82-vm000/org.qemu.guest_agent.1' too long


Additional info:
If I change the last character of the path to '0', i.e., /var/run/kubevirt-private/libvirt/qemu/channel/target/domain-212-default_rhel82-vm000/org.qemu.guest_agent.0
<channel type='unix'>
      <source mode='bind' path='/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-212-default_rhel82-vm000/org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='2'/>
    </channel>

libvirt will remove the specified path and use the default socket path instead:
  <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channel/target/domain-1-vm1/org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='disconnected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='2'/>
    </channel>

Hi, Michal

I am curious why the above path "/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-212-default_rhel82-vm000/org.qemu.guest_agent.0" was not removed in cnv. Please help to check if you are available. Thanks very much.

--- Additional comment from Michal Privoznik on 2023-03-01 10:26:11 UTC ---

(In reply to Lili Zhu from comment #3)
> Hi, Michal
> 
> I am curious why the above path
> "/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-212-
> default_rhel82-vm000/org.qemu.guest_agent.0" was not removed in cnv. Please
> help to check if you are available. Thanks very much.

I believe they don't specify the path. For the guest agent, libvirt puts the socket into the so-called channelTargetDir, which is derived as follows:

virQEMUDriverConfig *
virQEMUDriverConfigNew(bool privileged,
                       const char *root)
{
    if (root != NULL) {
        /* embedded driver: explicit root prefix supplied by the caller */
        cfg->channelTargetDir = g_strdup_printf("%s/channel/target", cfg->libDir);
    } else if (privileged) {
        /* privileged daemon, i.e. qemu:///system */
        cfg->channelTargetDir = g_strdup_printf("%s/channel/target", cfg->libDir);
    } else {
        /* session daemon, i.e. qemu:///session -- the branch CNV hits */
        cfg->configBaseDir = virGetUserConfigDirectory();

        cfg->channelTargetDir = g_strdup_printf("%s/qemu/channel/target",
                                                cfg->configBaseDir);
    }
}

and since CNV uses qemu:///session, the third branch is taken. Here, virGetUserConfigDirectory() basically returns "$XDG_CONFIG_HOME/libvirt". In other words, this path is autogenerated by libvirt, and as the domain ID grows it lengthens the socket path. And since the path was long to begin with, after a couple of iterations the domain ID becomes large enough that the generated path no longer fits within the limit.
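
To make that concrete, a small sketch (assuming the layout quoted above, $XDG_CONFIG_HOME set to /var/run/kubevirt-private as the observed paths suggest, and the 20-character shortened name seen in the logs) that computes the domain ID at which the generated path stops fitting into sun_path:

/* Sketch: the autogenerated session-mode socket path grows with the
 * domain ID. Layout taken from the snippet and logs above; not libvirt code. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
    const char *config_home = "/var/run/kubevirt-private"; /* $XDG_CONFIG_HOME */
    const char *shortname = "default_rhel82-vm000";        /* 20-char short name */
    struct sockaddr_un addr;
    char path[512];
    int id;

    for (id = 1; id < 1000; id++) {
        snprintf(path, sizeof(path),
                 "%s/libvirt/qemu/channel/target/domain-%d-%s/org.qemu.guest_agent.0",
                 config_home, id, shortname);
        if (strlen(path) >= sizeof(addr.sun_path)) {
            printf("path no longer fits at domain ID %d (%zu chars)\n",
                   id, strlen(path));
            break;
        }
    }
    return 0;
}

With these inputs the limit is crossed once the domain ID reaches three digits, which is consistent with the failing domain-214 paths in the logs.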

--- Additional comment from Igor Bezukh on 2023-07-10 12:16:09 UTC ---

Hi Michal, Lili,

Can you please give an update on the bug status? Would you pick it up for the upcoming release?

TIA,
Igor

--- Additional comment from Michal Privoznik on 2023-07-12 15:02:40 UTC ---

Patches proposed on the list:

https://listman.redhat.com/archives/libvir-list/2023-July/240680.html

--- Additional comment from Michal Privoznik on 2023-07-24 10:12:07 UTC ---

v2:

https://listman.redhat.com/archives/libvir-list/2023-July/240904.html

Comment 1 Michal Privoznik 2023-07-25 08:10:05 UTC
This should not have been cloned. There's a process we need to follow in order to get this fixed in older RHEL releases. I'll close this and set corresponding flags on the original bug to start the process.

*** This bug has been marked as a duplicate of bug 2173980 ***

