Bug 2057048 - consume libvirt fix for: Failed to connect socket to '/run/libvirt/virtlogd-sock' - possibly caused by Too many open files from libvirtd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: redhat-virtualization-host
Version: 4.4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.10-3
Target Release: 4.4.10
Assignee: Sandro Bonazzola
QA Contact: cshao
URL:
Whiteboard:
Depends On: 2045879 2060945 2063286
Blocks:
 
Reported: 2022-02-22 15:57 UTC by amashah
Modified: 2022-03-24 13:31 UTC
CC List: 11 users

Fixed In Version: libvirt-7.6.0-6.1.module+el8.5.0+14474+b3410d40
Doc Type: Rebase: Bug Fixes Only
Doc Text:
Rebase package(s) to version: libvirt-7.6.0-6.1.module+el8.5.0+14474+b3410d40 Highlights and important bug fixes: consume libvirt fix for failure to connect socket to '/run/libvirt/virtlogd-sock' - possibly caused by too many open files from libvirtd.
Clone Of:
Environment:
Last Closed: 2022-03-24 13:30:59 UTC
oVirt Team: Node
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-44753 0 None None None 2022-02-22 16:03:42 UTC
Red Hat Knowledge Base (Solution) 6812471 0 None None None 2022-03-14 08:59:10 UTC
Red Hat Product Errata RHSA-2022:1053 0 None None None 2022-03-24 13:31:27 UTC

Description amashah 2022-02-22 15:57:40 UTC
Description of problem:
On the host we see:

~~~
Feb 11 00:19:03 hostname.example.com libvirtd[4742]: Failed to connect socket to '/run/libvirt/virtlogd-sock': Permission denied
~~~

The file is present and has the correct permissions:

~~~
$ ls -laZ /run/libvirt/virtlogd-sock

srw-------. 1 root root system_u:object_r:virtlogd_var_run_t:s0 0 Dec  1 07:44 /run/libvirt/virtlogd-sock
~~~

Just before this occurs, libvirtd is reporting Too many open files:

~~~
Feb 11 00:19:03 hostname.example.com libvirtd[4742]: Unable to create pipes: Too many open files
Feb 11 00:19:03 hostname.example.com libvirtd[4742]: Unable to create pipes: Too many open files
Feb 11 00:19:03 hostname.example.com libvirtd[4742]: unable to clear socket security context 'system_u:system_r:svirt_t:s0:c434,c889': Too many open files
Feb 11 00:19:03 hostname.example.com libvirtd[4742]: Unable to create UNIX socket: Too many open files
Feb 11 00:19:03 hostname.example.com libvirtd[4742]: unable to clear socket security context 'system_u:system_r:svirt_t:s0:c434,c889': Too many open files
Feb 11 00:19:03 hostname.example.com libvirtd[4742]: Unable to create pipes: Too many open files
Feb 11 00:19:03 hostname.example.com libvirtd[4742]: Unable to create pipes: Too many open files
Feb 11 00:19:03 hostname.example.com libvirtd[4742]: cannot open directory '/var/lib/libvirt/qemu/domain-5010-fsc-q83exxgb-windows': Too many open files
Feb 11 00:19:03 hostname.example.com libvirtd[4742]: Unable to create pipes: Too many open files
Feb 11 00:19:03 hostname.example.com libvirtd[4742]: Hook script execution failed: Unable to create pipes: Too many open files
Feb 11 00:19:03 hostname.example.com libvirtd[4742]: Unable to create pipes: Too many open files
Feb 11 00:19:03 hostname.example.com libvirtd[4742]: Hook script execution failed: Unable to create pipes: Too many open files
~~~


If we check lsof output:

~~~
$ grep ^libvirtd lsof | wc -l
8270
~~~


Not sure what these STREAM entries are exactly, but they make up the majority:
~~~
$ grep ^libvirtd lsof | grep STREAM | wc -l
8101
~~~

Most look something like:

~~~
libvirtd     4742        0    1u     unix 0xffff8d1f9626f080       0t0        652 type=STREAM
libvirtd     4742        0    2u     unix 0xffff8d1f9626f080       0t0        652 type=STREAM
~~~
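
These counts and sample entries come from the lsof output captured in the sosreport. On a live host, the same check could be done directly; a minimal sketch, assuming libvirtd runs as a single process so `pidof` returns exactly one PID:

~~~
# Count all FDs libvirtd currently holds on a live host
ls /proc/$(pidof libvirtd)/fd | wc -l

# Count only the unix STREAM sockets, which dominate the entries above
lsof -p $(pidof libvirtd) 2>/dev/null | grep 'type=STREAM' | wc -l
~~~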

In the libvirtd.service file:

~~~
# At least 1 FD per guest, often 2 (eg qemu monitor + qemu agent).
# eg if we want to support 4096 guests, we'll typically need 8192 FDs
# If changing this, also consider virtlogd.service & virtlockd.service
# limits which are also related to number of guests
LimitNOFILE=8192
~~~


Currently there are no guests running:

~~~
 $ cat sos_commands/virsh/virsh_-r_list_--all 
 Id   Name   State
--------------------
~~~

On another host with similar symptoms, we can see the VM ID count is above 4000, although only 9 VMs are actually running:

~~~
$ cat sos_commands/virsh/virsh_-r_list_--all 
 Id     Name                   State
----------------------------------------
 1      redacted-name-here       running
 1679   redacted-name-here       running
 1904   redacted-name-here       running
 3109   redacted-name-here       running
 3407   redacted-name-here       running
 3455   redacted-name-here       running
 3459   redacted-name-here       running
 3502   redacted-name-here       running
 4123   redacted-name-here       running
~~~

It *seems* like VMs that have been shut down do not release libvirtd's open files, which eventually causes libvirtd to exceed the limit, and then we get a cascade of failures.

It's also possible I am looking at this the wrong way, but the virtlogd socket issue seems related to the above.

Rebooting the host resolved the issue, but it will likely recur as time goes on.


Version-Release number of selected component (if applicable):
RHV-H 4.4.9 

How reproducible:
Only in 1 customer case as of now.


Actual results:
libvirtd fails to connect to the virtlogd socket, and it holds many open files that it should not.

Expected results:
The virtlogd socket connection works, and libvirtd does not hold unneeded open files.

Additional info:
We could increase LimitNOFILE for libvirtd, but it will likely hit that limit eventually as well if the file descriptors continue to stay open.
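
For reference, such an increase would typically be done with a systemd drop-in rather than by editing the packaged unit file; a minimal sketch (the value 16384 is only an example and does not fix the underlying leak):

~~~
# /etc/systemd/system/libvirtd.service.d/override.conf
# (created via `systemctl edit libvirtd`; 16384 is an example value only)
[Service]
LimitNOFILE=16384
~~~

followed by `systemctl restart libvirtd` (plus `systemctl daemon-reload` if the file is created by hand).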

Comment 2 Arik 2022-02-22 16:16:13 UTC
Milan, can you please check this?

Comment 3 Milan Zamazal 2022-02-23 08:30:29 UTC
A similar issue has been mentioned on libvirt-users mailing list recently: https://listman.redhat.com/archives/libvirt-users/2022-February/msg00044.html
There is a related bug: BZ 2007168
And a fix: https://gitlab.com/libvirt/libvirt/-/commit/5de203f8795d96229d2663e9ea1a24fba5db38fc

If we had a way to reproduce the problem, we could check whether the fix above helps. Maybe we could set LimitNOFILE to a low value, start and stop VMs, and see if the error occurs.
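
A rough sketch of such a reproduction attempt, assuming LimitNOFILE has already been lowered for libvirtd and a defined domain exists; `testvm` is a placeholder name:

~~~
# Churn a VM until libvirtd runs out of file descriptors ("testvm" is a placeholder)
for i in $(seq 1 200); do
    virsh start testvm || { echo "start failed on iteration $i"; break; }
    virsh destroy testvm
    echo "iteration $i: $(ls /proc/$(pidof libvirtd)/fd | wc -l) FDs open"
done
~~~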

Comment 4 Milan Zamazal 2022-02-28 11:22:03 UTC
I can reproduce a file descriptor leak by setting LimitNOFILE=54 and starting and restarting several VMs quickly. After a while, no VM can be started anymore, with the error "Hook script execution failed: Unable to create pipes: Too many open files". This happens with both libvirt-daemon-7.6.0-6.el8s.x86_64 and libvirt-daemon-8.0.0-1.module+el8.6.0+13896+a8fa8f67.x86_64.

Michal, any idea about this?

Comment 5 Michal Privoznik 2022-03-01 10:29:35 UTC
(In reply to Milan Zamazal from comment #3)
> A similar issue has been mentioned on libvirt-users mailing list recently:
> https://listman.redhat.com/archives/libvirt-users/2022-February/msg00044.html
> There is a related bug: BZ 2007168
> And a fix:

This ^^^

> https://gitlab.com/libvirt/libvirt/-/commit/
> 5de203f8795d96229d2663e9ea1a24fba5db38fc
> 

and this ^^^ are two separate problems. In both cases the problem lies in glib, and libvirt is merely working around broken glib. I admit my initial gut feeling when reading the libvirt-users e-mail was that it's the same issue, but it turned out the ML issue is the bug you've linked. Also, the ML issue is reported against CentOS-8, where CentOS is still behind RHEL (which is already fixed).

(In reply to Milan Zamazal from comment #4)
> I can reproduce a file descriptor leak by setting LimitNOFILE=54 and
> starting and restarting several VMs quickly. After a while, no VM can be
> started anymore, with the error "Hook script execution failed: Unable to
> create pipes: Too many open files". This happens with both libvirt
> libvirt-daemon-7.6.0-6.el8s.x86_64 and
> 8.0.0-1.module+el8.6.0+13896+a8fa8f67.x86_64.
> 
> Michal, any idea about this?

Just to be perfectly clear, you have to allow libvirt to have FDs, otherwise it won't be able to spawn any machines. Therefore, setting too tight a limit on NOFILE and seeing errors when starting VMs is expected behaviour. Each VM requires at least one FD for the QEMU monitor, another one for the guest agent, and possibly other FDs (depending on config). Not to mention, libvirt opens (and closes) various files/pipes/sockets when starting a VM.
Therefore, in order to be sure that it is an FD leak you're seeing you have to check the number of opened FDs *BEFORE* any VM is started, then start VM(s), shut them down, disconnect all the clients and check the number of FDs again. At this point, the numbers should be equal.
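
A minimal sketch of that before/after comparison, assuming a single libvirtd process and a defined domain; `testvm` is a placeholder name:

~~~
PID=$(pidof libvirtd)
BEFORE=$(lsof -p "$PID" | wc -l)   # FD count before any VM is started

virsh start testvm
virsh destroy testvm               # forced power-off; no guest OS needed
sleep 10                           # give libvirtd time to finish cleanup

AFTER=$(lsof -p "$PID" | wc -l)
echo "FDs before: $BEFORE, after: $AFTER"   # equal numbers mean no leak
~~~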

Comment 6 Milan Zamazal 2022-03-02 08:54:40 UTC
> Therefore, in order to be sure that it is an FD leak you're seeing you have to check the number of opened FDs *BEFORE* any VM is started, then start VM(s), shut them down, disconnect all the clients and check the number of FDs again. At this point, the numbers should be equal.

Checking with `lsof -p PID', where PID is the one reported by `systemctl status libvirtd'. Starting a VM without a guest OS and powering it off leaves one more open file descriptor per VM run. After restarting vdsmd and supervdsmd, when no VMs are running, the number of open file descriptors is the same as before the restart. The leaked file descriptors apparently are (after 4 VM runs):

  libvirtd 661466 root   41u     unix 0xffff990fbf70da00      0t0   29580486 type=STREAM
  libvirtd 661466 root   42u     unix 0xffff991030b7cc80      0t0   29583488 type=STREAM
  libvirtd 661466 root   43u     unix 0xffff991001def500      0t0   29586122 type=STREAM
  libvirtd 661466 root   44u     unix 0xffff991030b7ba80      0t0   29602433 type=STREAM

libvirt-daemon-8.0.0-1.module+el8.6.0+13896+a8fa8f67.x86_64
glib2-2.56.4-158.el8.x86_64 (the same also with glib2-2.56.4-156.el8.x86_64)

Is it a known or unknown leak or is there anything else I should check?

Comment 8 Michal Privoznik 2022-03-03 12:07:50 UTC
(In reply to Milan Zamazal from comment #6)
> > Therefore, in order to be sure that it is an FD leak you're seeing you have to check the number of opened FDs *BEFORE* any VM is started, then start VM(s), shut them down, disconnect all the clients and check the number of FDs again. At this point, the numbers should be equal.
> 
> Checking with `lsof -p PID' where PID is the one reported by `systemctl
> status libvirtd'. Starting a VM without a guest OS and powering it off makes
> one more open file descriptor left per each VM run. After restarting vdsmd
> and supervdsmd, when no VMs are running, the number of open file descriptors
> is the same as before the restarted. The leaked file descriptors apparently
> are (after 4 VM runs):
> 
>   libvirtd 661466 root   41u     unix 0xffff990fbf70da00      0t0   29580486
> type=STREAM
>   libvirtd 661466 root   42u     unix 0xffff991030b7cc80      0t0   29583488
> type=STREAM
>   libvirtd 661466 root   43u     unix 0xffff991001def500      0t0   29586122
> type=STREAM
>   libvirtd 661466 root   44u     unix 0xffff991030b7ba80      0t0   29602433
> type=STREAM
> 

This smells like bug 2045879. Basically, glib screwed up a backport, which broke the way we work around another of its bugs. So we needed another workaround on top of the workaround.

> libvirt-daemon-8.0.0-1.module+el8.6.0+13896+a8fa8f67.x86_64
> glib2-2.56.4-158.el8.x86_64 (the same also with glib2-2.56.4-156.el8.x86_64)

And this assures me that you are seeing that bug. It should be fixed with libvirt-8.0.0-3.module+el8.6.0+14098+5bee65f4.
Thanks glib!

Comment 9 Milan Zamazal 2022-03-03 12:34:45 UTC
Thank you, Michal, for the explanation. Indeed, after upgrading libvirt to libvirt-daemon-kvm-8.0.0-4.module+el8.6.0+14253+42d4e2d6.x86_64, the problem is gone.

So this problem will be fixed in the next RHV release, once we switch to RHEL 8.6.

Comment 20 cshao 2022-03-22 08:47:03 UTC
Test version:
RHVH-4.4-20220321.0-RHVH-x86_64-dvd1.iso 
libvirt-7.6.0-6.1.module+el8.5.0+14474+b3410d40.x86_64

# imgbase w
You are on rhvh-4.4.10.3-0.20220321.0+1

# rpm -qa | grep libvirt-7.6.0-6.1.module+el8.5.0+14474+b3410d40
libvirt-7.6.0-6.1.module+el8.5.0+14474+b3410d40.x86_64


RHVH includes the correct libvirt package, so the bug is fixed; changing bug status to VERIFIED.

Comment 22 errata-xmlrpc 2022-03-24 13:30:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Virtualization Host security and enhancement update [ovirt-4.4.10] Async #2), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1053

