Bug 1442043
Summary: | Default FD limit prevents running more than ~470 VMs | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Jaroslav Reznik <jreznik> |
Component: | libvirt | Assignee: | Laine Stump <laine> |
Status: | CLOSED ERRATA | QA Contact: | chhu |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 7.3 | CC: | berrange, dyuan, emarcian, jsuchane, laine, mgoldboi, mtessun, nsoffer, pkrempa, rbalakri, xuzhang, yalzhang, ykaul |
Target Milestone: | rc | Keywords: | Performance, Upstream, ZStream |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | libvirt-2.0.0-10.el7_3.7 | Doc Type: | Bug Fix |
Doc Text: |
On very large installations, libvirt could fail to start some guests, giving a "Too many open files" error. The default open file handle limits have been substantially increased for the libvirtd, virtlockd, and virtlogd processes, allowing for several thousand guests on a host (if these generous limits are exceeded, the limits can be further increased in the systemd service files for the processes).
|
Story Points: | --- |
Clone Of: | 1429551 | Environment: | |
Last Closed: | 2017-05-25 15:36:50 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1429551 | ||
Bug Blocks: |
Description
Jaroslav Reznik
2017-04-13 11:53:51 UTC
Hi Laine,

I checked the configuration files (libvirtd.service, virtlogd.service, virtlockd.service); they have the correct LimitNOFILE values. The effective "max open files" limits of the running services are also correct.

However, when I tried to define and start 4100 guests, I hit the following error in the libvirtd log as virtlockd approached its limit of 16384:

    error : virNetSocketReadWire:1615 : Cannot recv data: Connection reset by peer

I used ABRT; there was no core dump.

1. I think this is by design. What do you think?
2. The errors on rhel7.3.z and rhel7.4 are different, and I think the rhel7.4 error is better. Will we make any change here for rhel7.3.z?

1) rhel7.3.z (libvirt-2.0.0-10.el7_3.9.x86_64):

    # virsh start r7_test3275
    error: Failed to start domain r7_test3275
    error: Cannot recv data: Connection reset by peer

2) rhel7.4 (libvirt-3.2.0-4.el7.x86_64): "Too many open files"

    # virsh start r7_test3275
    error: Failed to start domain r7_test3275
    error: Unable to open/create resource /var/lib/libvirt/lockd/files/403e582c39cc88f45b039d796a9cc4ae9175dce9b5323667562d14543436cbac: Too many open files

Tested on packages:
libvirt-2.0.0-10.el7_3.9.x86_64
kernel-3.10.0-514.21.1.el7.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64

Test steps:

1. Check the configuration files: PASS

    # cat /usr/lib/systemd/system/libvirtd.service | grep LimitNOFILE
    LimitNOFILE=8192
    # cat /usr/lib/systemd/system/virtlogd.service | grep LimitNOFILE
    LimitNOFILE=8192
    # cat /usr/lib/systemd/system/virtlockd.service | grep LimitNOFILE
    LimitNOFILE=16384

2. Check the service limit settings: PASS

    # cat /proc/`pidof libvirtd`/limits | grep open
    Max open files 8192 8192 files
    # cat /proc/`pidof virtlogd`/limits | grep open
    Max open files 8192 8192 files
    # cat /proc/`pidof virtlockd`/limits | grep open
    Max open files 16384 16384 files

3. Check the virtlockd limit (16384): try to start 4100 guests with 4 disks in each guest, with the virtlockd service enabled.
Settings:

1) Edit /etc/libvirt/qemu.conf:

    lock_manager = "lockd"
    max_processes = 65535
    max_files = 65535

2) Edit /etc/libvirt/qemu-lockd.conf:

    auto_disk_leases = 1
    file_lockspace_dir = "/var/lib/libvirt/lockd/files"

3) Edit max_clients in /etc/libvirt/virtlockd.conf:

    max_clients = 16385

4) # ulimit -n 65535

5) Restart service:

    # systemctl start virtlockd
    # systemctl restart libvirtd

Steps:

1) Run define.sh to define and start 4100 guests. I hit this error in the libvirtd log when near the virtlockd limit of 16384 (used ABRT, no core dump):

    error : virNetSocketReadWire:1615 : Cannot recv data: Connection reset by peer

    # virsh start r7_test3275
    error: Failed to start domain r7_test3275
    error: Cannot recv data: Connection reset by peer

2) Check that it is near the virtlockd limit of 16384:

    # ls /proc/`pidof libvirtd`/fd/ | wc -l
    3296
    # ls /proc/`pidof virtlogd`/fd/ | wc -l
    6559
    # ls /proc/`pidof virtlockd`/fd/ | wc -l
    16382

You're apparently seeing some other limit hit that has been remedied in the newer libvirt. I don't think that should stop you from verifying this BZ. You may want to install the libvirt-debuginfo package (if it's not there already), then attach gdb to the libvirtd process before running your test, and grab "thread apply all bt" when it segfaults. Then you could look for an existing bug with the same signature, and if you don't find one, file a new one.

4. Check the virtlogd limit (8192): PASS

Try to start 2048 guests with the XML below, with 1 disk in each guest; hit an error in libvirtd.log when near the virtlogd limit of 8192:

    error : virNetClientProgramDispatchError:177 : Unable to open file: /var/log/libvirt/serial_test2045.log: Too many open files

1) XML with serial log and guest agent:
    <serial type='file'>
      <source path='/var/log/libvirt/serial_##.log' append='off'/>
      <target port='0'/>
    </serial>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>

    # virsh start r7_test2045
    error: Failed to start domain r7_test2045
    error: Unable to open file: /var/log/libvirt/serial_test2045.log: Too many open files

2) Check the limit; near the virtlogd limit of 8192:

    # ls /proc/`pidof virtlogd`/fd/ | wc -l
    8187
    # ls /proc/`pidof libvirtd`/fd/ | wc -l
    2066
    # ls /proc/`pidof virtlockd`/fd/ | wc -l
    4100

5. Check the libvirtd limit (8192): PASS

Try to start 8192 guests with 1 disk in each guest; hit an error in libvirtd.log when near the libvirtd limit of 8192:

    error : getDevNull:399 : cannot open /dev/null: Too many open files

Settings:

1) Edit the virtlogd limit to 65535:

    # cat /usr/lib/systemd/system/virtlogd.service | grep LimitNOFILE
    #LimitNOFILE=8192
    LimitNOFILE=65535

2) Check the limits:

    # cat /proc/`pidof virtlockd`/limits | grep open
    Max open files 16384 16384 files
    # cat /proc/`pidof virtlogd`/limits | grep open
    Max open files 65535 65535 files
    # cat /proc/`pidof libvirtd`/limits | grep open
    Max open files 8192 8192 files

Steps:

1) Try to start the 8192 guests with a script.
Failed to start the VM and hit the error "Too many open files" in libvirtd.log when near the libvirtd limit of 8192:

--------------------------------------------------------------------
error : getDevNull:399 : cannot open /dev/null: Too many open files
debug : virCommandRunAsync:2432 : Command result -1, with PID -1
debug : virFileClose:102 : Closed fd 8189
debug : virFileClose:102 : Closed fd 8190
debug : qemuProcessLaunch:5267 : QEMU vm=0x7f47c5e3ca10 name=r7_test8162 failed to spawn
debug : qemuProcessLaunch:5270 : Writing early domain status to disk
debug : virFileMakePathHelper:2837 : path=/var/run/libvirt/qemu mode=0777
debug : virFileClose:102 : Closed fd 8189
debug : qemuProcessLaunch:5274 : Waiting for handshake from child
error : virCommandHandshakeWait:2653 : internal error: invalid use of command API
---------------------------------------------------------------

2) Check that the libvirtd open files are near the limit of 8192:

    # ls /proc/`pidof libvirtd`/fd/ | wc -l
    8183
    # ls /proc/`pidof virtlockd`/fd/ | wc -l
    16334
    # ls /proc/`pidof virtlogd`/fd/ | wc -l
    16333

3) Check with virsh list --all: 8161 guests are running.

(In reply to chhu from comment #7)
> Hi, Laine
>
> I checked the configuration files (libvirtd.service/virtlogd.service/
> virtlockd.service), which are with the correct LimitNOFILE.
> And the service limit settings for "max open files" are also correct.
>
> However, when I tried to define and start 4100 guests.
> Met error in libvirtd log:
> "error : virNetSocketReadWire:1615 : Cannot recv data: Connection reset by
> peer",
> when near the virtlockd limit 16384. I used ABRT, no core dump.
>

Hi Laine,

Thank you for your help!
I used GDB and found that no segfault happened when trying to start a VM near the virtlockd limit. SIGTERM was sent to qemu-kvm by libvirtd at that moment, which caused the connection reset.

Would you like to change the error message to "Too many open files" as in rhel7.4?
Steps:

1) Installed the libvirt-debuginfo package, attached gdb to the libvirtd and virtlockd processes, then tried to start a VM when near the virtlockd limit of 16384: no segfaults.

    # virsh start r7_test3275
    error: Failed to start domain r7_test3275
    error: Cannot recv data: Connection reset by peer

2) Ran virsh list --all: 3274 VMs are running and r7_test3275 is shut off. No error is found in /var/log/libvirt/qemu/r7_test3275.log or r7_test3274.log.

3) Traced signals while trying to start VM r7_test3275: "SIGTERM was sent to qemu-kvm (pid:42586) by libvirtd".

    # stap kill.stp
    SIGTERM was sent to systemd-udevd (pid:42559) by systemd-udevd uid:0
    SIGTERM was sent to qemu-kvm (pid:42586) by libvirtd uid:0
    SIGTERM was sent to systemd-udevd (pid:42589) by systemd-udevd uid:0
    SIGKILL was sent to dbus-daemon (pid:42680) by dbus-daemon uid:81

More information:

1) Errors in libvirtd.log:

    error : virNetSocketReadWire:1615 : Cannot recv data: Connection reset by peer
    debug : virNetClientMarkClose:639 : client=0x7ff860def820, reason=0
    debug : virNetSocketRemoveIOCallback:2021 : Watch not registered on socket 0x7ff860def5d0
    debug : virNetClientIOEventLoop:1619 : error on socket: Cannot recv data: Connection reset by peer

2) kill.stp
------------------------------------------
    # cat kill.stp
    #! /usr/bin/env stap
    # sigkill.stp
    # Copyright (C) 2007 Red Hat, Inc., Eugene Teo <eteo>
    #
    # This program is free software; you can redistribute it and/or modify
    # it under the terms of the GNU General Public License version 2 as
    # published by the Free Software Foundation.
    #
    # /usr/share/systemtap/tapset/signal.stp:
    # [...]
    # probe signal.send = _signal.send.*
    # {
    # sig=$sig
    # sig_name = _signal_name($sig)
    # sig_pid = task_pid(task)
    # pid_name = task_execname(task)
    # [...]
    probe signal.send {
      if (sig_name == "SIGKILL" | sig_name == "SIGTERM" | sig_name == "SIGQUIT"
          | sig_name == "SIGABRT" | sig_name == "SIGSEGV" | sig_name == "SIGPIPE"
          | sig_name == "SIGSTOP")
        printf("%s was sent to %s (pid:%d) by %s uid:%d\n",
               sig_name, pid_name, sig_pid, execname(), uid())
    }

(In reply to chhu from comment #10)
>
> I used GDB and found no segfaults happened when try to start VM near the
> virlockd limit. And SIGTERM was sent to qemu-kvm by libvirtd at that moment,
> which caused the connection reset.
>
> Would you like to modify the error message to "Too many open files" as in
> rhel7.4 ?

"Connection reset by peer" is a generic error in a generic place, and could be encountered in many different situations. The error in this particular situation is "Too many open files" with the 7.4 libvirt because "something else somewhere else" changed. Since the version difference is 2.0.0 vs 3.2.0, finding which patch makes the behavior change in this one situation, and determining which other patches must also be backported due to dependencies, would likely be a long process, and prone to causing regressions. For those reasons, I don't think it's a good idea to try to change the error message.

Thanks Laine!
According to comments 7, 9, 10, and 11, changing the bug status to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1304
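
The verification in this report repeatedly compares each daemon's open descriptor count (`ls /proc/<pid>/fd | wc -l`) with its "Max open files" limit from `/proc/<pid>/limits`. A minimal sketch that combines both checks is shown below. It is not part of the original report: the helper name `fd_usage` is hypothetical, and the script assumes a Linux `/proc` filesystem and sufficient privileges (typically root) to read another process's `fd` directory.

```shell
#!/bin/sh
# fd_usage PID NAME  (hypothetical helper, not from the bug report)
# Prints "<name> <open fds> <soft limit>" for one process.
fd_usage() {
    pid=$1
    name=$2
    # Each entry under /proc/<pid>/fd is one open file descriptor.
    open=$(ls "/proc/$pid/fd" | wc -l)
    # In the "Max open files" row, the fourth field is the soft limit.
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "$name $open $soft"
}

# Report usage for the three daemons discussed in this bug.
# Daemons that are not running (or a missing pidof) are skipped.
for daemon in libvirtd virtlogd virtlockd; do
    pid=$(pidof -s "$daemon" 2>/dev/null) || continue
    fd_usage "$pid" "$daemon"
done
```

Running this as root while guests are being started prints one line per running daemon, so a count approaching its limit (as with virtlockd at 16382 of 16384 above) is visible before the next `virsh start` fails.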