Created attachment 513257 [details]
Logs

Environment: RHEVM 3.0 on a dev env, libvirt-0.9.3-2.el6.x86_64, vdsm-4.9-81.el6.x86_64.

I used the QA automation framework, which performs operations on ~30 VMs: snapshot/template creation and deletion, disk/NIC addition and deletion, and stop/start/suspend of VMs. It tries to perform the same operations on a few VMs almost in parallel. As a result, libvirtd opened storage devices (for instance /dev/dm-23) and they remained open due to a deadlock.

(gdb) info threads
  7 Thread 0x7f2c8b6b0700 (LWP 1305)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  6 Thread 0x7f2c8acaf700 (LWP 1306)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  5 Thread 0x7f2c8a2ae700 (LWP 1307)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  4 Thread 0x7f2c898ad700 (LWP 1308)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  3 Thread 0x7f2c88eac700 (LWP 1309)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  2 Thread 0x7f2c870f1700 (LWP 2775)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
* 1 Thread 0x7f2c91542800 (LWP 1304)  0x0000003027644582 in _IO_vfprintf_internal (s=<value optimized out>, format=<value optimized out>, ap=<value optimized out>) at vfprintf.c:1977

(gdb) where
#0  0x0000003027644582 in _IO_vfprintf_internal (s=<value optimized out>, format=<value optimized out>, ap=<value optimized out>) at vfprintf.c:1977
#1  0x000000302770000c in __vasprintf_chk (result_ptr=0x7fff1d886140, flags=1, format=0x3636939c68 "%02d:%02d:%02d.%03d: %d: %s : %s:%lld : %s\n", args=0x7fff1d885f00) at vasprintf_chk.c:68
#2  0x0000003636850d54 in vasprintf (strp=<value optimized out>, fmt=<value optimized out>, list=<value optimized out>) at /usr/include/bits/stdio2.h:199
#3  virVasprintf (strp=<value optimized out>, fmt=<value optimized out>, list=<value optimized out>) at util/util.c:1648
#4  0x0000003636850df8 in virAsprintf (strp=<value optimized out>, fmt=<value optimized out>) at util/util.c:1666
#5  0x0000003636840423 in virLogFormatString (msg=0x7fff1d886140, funcname=0x3636938fd0 "virEventPollDispatchHandles", linenr=454, time_info=0x7fff1d8860c0, cur_time=<value optimized out>, priority=<value optimized out>, str=0x880910 "i=25 w=55") at util/logging.c:634
#6  0x0000003636840c06 in virLogMessage (category=0x3636938a8b "file.util/event_poll.c", priority=1, funcname=0x3636938fd0 "virEventPollDispatchHandles", linenr=454, flags=0, fmt=<value optimized out>) at util/logging.c:729
#7  0x000000363683a356 in virEventPollDispatchHandles () at util/event_poll.c:454
#8  virEventPollRunOnce () at util/event_poll.c:610
#9  0x0000003636839397 in virEventRunDefaultImpl () at util/event.c:247
#10 0x000000000043c86d in virNetServerRun (srv=0x814ae0) at rpc/virnetserver.c:662
#11 0x000000000041d708 in main (argc=<value optimized out>, argv=<value optimized out>) at libvirtd.c:1552

(gdb) break virNetTLSSessionRead
Breakpoint 1 at 0x442250: file rpc/virnettlscontext.c, line 809. (2 locations)
(gdb) c
Continuing.
Breakpoint 1, virNetTLSSessionRead (sess=0x7f2c78024f40, buf=0x7f2c78073bc8 "", len=4) at rpc/virnettlscontext.c:809
809 {
(gdb) n
812     ret = gnutls_record_recv(sess->session, buf, len);
(gdb)
814     if (ret >= 0)
(gdb) p ret
$1 = -10

[root@intg-vdsb ~]# lsof -p 6632
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
libvirtd 6632 root cwd DIR 253,0 4096 2 /
libvirtd 6632 root rtd DIR 253,0 4096 2 /
libvirtd 6632 root txt REG 253,0 1105328 403877 /usr/sbin/libvirtd (deleted)
libvirtd 6632 root mem REG 253,0 155696 783375 /lib64/ld-2.12.so
libvirtd 6632 root mem REG 253,0 31984 396738 /usr/lib64/libpciaccess.so.0.10.8
libvirtd 6632 root mem REG 253,0 1912928 783379 /lib64/libc-2.12.so
libvirtd 6632 root mem REG 253,0 22536 783450 /lib64/libdl-2.12.so
libvirtd 6632 root mem REG 253,0 145672 783381 /lib64/libpthread-2.12.so
libvirtd 6632 root mem REG 253,0 598816 783435 /lib64/libm-2.12.so
libvirtd 6632 root mem REG 253,0 88240 783399 /lib64/libz.so.1.2.3
libvirtd 6632 root mem REG 253,0 124624 783641 /lib64/libselinux.so.1
libvirtd 6632 root mem REG 253,0 47072 783383 /lib64/librt-2.12.so
libvirtd 6632 root mem REG 253,0 268200 783437 /lib64/libdbus-1.so.3.4.0
libvirtd 6632 root mem REG 253,0 113904 783687 /lib64/libresolv-2.12.so
libvirtd 6632 root mem REG 253,0 387880 783460 /lib64/libfreebl3.so
libvirtd 6632 root mem REG 253,0 43392 783475 /lib64/libcrypt-2.12.so
libvirtd 6632 root mem REG 253,0 248304 783647 /lib64/libsepol.so.1
libvirtd 6632 root mem REG 253,0 14800 784383 /lib64/libcom_err.so.2.1
libvirtd 6632 root mem REG 253,0 127480 395372 /usr/lib64/libnssutil3.so
libvirtd 6632 root mem REG 253,0 241536 783441 /lib64/libnspr4.so
libvirtd 6632 root mem REG 253,0 18864 784397 /lib64/libplc4.so
libvirtd 6632 root mem REG 253,0 1287424 395499 /usr/lib64/libnss3.so
libvirtd 6632 root mem REG 253,0 14704 784398 /lib64/libplds4.so
libvirtd 6632 root mem REG 253,0 215792 395501 /usr/lib64/libssl3.so
libvirtd 6632 root mem REG 253,0 95016 783442 /lib64/libaudit.so.1.0.0
libvirtd 6632 root mem REG 253,0 1651160 395817 /usr/lib64/libcrypto.so.1.0.0
libvirtd 6632 root mem REG 253,0 183440 395503 /usr/lib64/libsmime3.so
libvirtd 6632 root mem REG 253,0 181608 784382 /lib64/libk5crypto.so.3.1
libvirtd 6632 root mem REG 253,0 10328 783668 /lib64/libkeyutils.so.1.3
libvirtd 6632 root mem REG 253,0 35752 395505 /usr/lib64/libnuma.so.1
libvirtd 6632 root mem REG 253,0 43320 784381 /lib64/libkrb5support.so.0.1
libvirtd 6632 root mem REG 253,0 915104 784384 /lib64/libkrb5.so.3.3
libvirtd 6632 root mem REG 253,0 15616 783664 /lib64/libgpg-error.so.0.5.0
libvirtd 6632 root mem REG 253,0 225808 397181 /usr/lib64/libpcap.so.1.0.0
libvirtd 6632 root mem REG 253,0 109664 398139 /usr/lib64/libsasl2.so.2.0.23
libvirtd 6632 root mem REG 253,0 269104 784400 /lib64/libgssapi_krb5.so.2.2
libvirtd 6632 root mem REG 253,0 1379080 394916 /usr/lib64/libxml2.so.2.7.6
libvirtd 6632 root mem REG 253,0 479856 783665 /lib64/libgcrypt.so.11.5.3
libvirtd 6632 root mem REG 253,0 60816 785052 /lib64/liblber-2.4.so.2.5.6
libvirtd 6632 root mem REG 253,0 307224 785053 /lib64/libldap-2.4.so.2.5.6
libvirtd 6632 root mem REG 253,0 61504 783649 /lib64/libudev.so.0.5.1
libvirtd 6632 root mem REG 253,0 374128 395921 /usr/lib64/libssl.so.1.0.0
libvirtd 6632 root mem REG 253,0 209088 783446 /lib64/libidn.so.11.6.1
libvirtd 6632 root mem REG 253,0 71456 394775 /usr/lib64/libtasn1.so.3.1.6
libvirtd 6632 root mem REG 253,0 164024 397192 /usr/lib64/libssh2.so.1.0.1
libvirtd 6632 root mem REG 253,0 662656 394864 /usr/lib64/libgnutls.so.26.14.12
libvirtd 6632 root mem REG 253,0 343864 399107 /usr/lib64/libcurl.so.4.1.1
libvirtd 6632 root mem REG 253,0 340072 783495 /lib64/libnl.so.1.1
libvirtd 6632 root mem REG 253,0 21160 785075 /lib64/libcap-ng.so.0.0.0
libvirtd 6632 root mem REG 253,0 31352 394890 /usr/lib64/libyajl.so.1.0.7
libvirtd 6632 root mem REG 253,0 51056 395369 /usr/lib64/libavahi-common.so.3.5.1
libvirtd 6632 root mem REG 253,0 84296 399312 /usr/lib64/libexslt.so.0.8.15
libvirtd 6632 root mem REG 253,0 68768 393585 /usr/lib64/libavahi-client.so.3.2.5
libvirtd 6632 root mem REG 253,0 248720 399028 /usr/lib64/libxslt.so.1.1.26
libvirtd 6632 root mem REG 253,0 136744 394932 /usr/lib64/libfa.so.1.3.1
libvirtd 6632 root mem REG 253,0 304424 398772 /usr/lib64/libaugeas.so.0.11.0
libvirtd 6632 root mem REG 253,0 61480 404455 /usr/lib64/libnetcf.so.1.4.0
libvirtd 6632 root mem REG 253,0 158912 787495 /lib64/libdevmapper.so.1.02
libvirtd 6632 root mem REG 253,0 27424 783388 /lib64/libnss_dns-2.12.so
libvirtd 6632 root mem REG 253,0 17048 399244 /usr/lib64/sasl2/liblogin.so.2.0.23
libvirtd 6632 root mem REG 253,0 19352 403817 /usr/lib64/sasl2/libcrammd5.so.2.0.23
libvirtd 6632 root mem REG 253,0 49240 403820 /usr/lib64/sasl2/libdigestmd5.so.2.0.23
libvirtd 6632 root mem REG 253,0 16408 393090 /usr/lib64/sasl2/libanonymous.so.2.0.23
libvirtd 6632 root mem REG 253,0 1526984 783432 /lib64/libdb-4.7.so
libvirtd 6632 root mem REG 253,0 22272 393093 /usr/lib64/sasl2/libsasldb.so.2.0.23
libvirtd 6632 root mem REG 253,0 29560 393484 /usr/lib64/sasl2/libgssapiv2.so.2.0.23
libvirtd 6632 root mem REG 253,0 16952 399247 /usr/lib64/sasl2/libplain.so.2.0.23
libvirtd 6632 root mem REG 253,0 61624 783400 /lib64/libnss_files-2.12.so
libvirtd 6632 root mem REG 253,0 26050 394826 /usr/lib64/gconv/gconv-modules.cache
libvirtd 6632 root mem REG 253,0 99158752 404858 /usr/lib/locale/locale-archive
libvirtd 6632 root DEL REG 253,0 404151 /usr/lib64/libvirt.so.0.9.3.#prelink#.A1z8f8
libvirtd 6632 root DEL REG 253,0 404055 /usr/lib64/libvirt-qemu.so.0.9.3.#prelink#.BWRrqE
libvirtd 6632 root 0r CHR 1,3 0t0 3779 /dev/null
libvirtd 6632 root 1w CHR 1,3 0t0 3779 /dev/null
libvirtd 6632 root 2w CHR 1,3 0t0 3779 /dev/null
libvirtd 6632 root 3w REG 253,0 1238563079 915495 /var/log/libvirtd.log
libvirtd 6632 root 4r FIFO 0,8 0t0 19732 pipe
libvirtd 6632 root 5u unix 0xffff88040baf90c0 0t0 19744 socket
libvirtd 6632 root 6w FIFO 0,8 0t0 19732 pipe
libvirtd 6632 root 7r FIFO 0,8 0t0 19733 pipe
libvirtd 6632 root 8w FIFO 0,8 0t0 19733 pipe
libvirtd 6632 root 9u sock 0,6 0t0 19734 can't identify protocol
libvirtd 6632 root 10u unix 0xffff88040cc56c80 0t0 19737 /var/run/libvirt/libvirt-sock
libvirtd 6632 root 11u unix 0xffff88040ba63680 0t0 19739 /var/run/libvirt/libvirt-sock-ro
libvirtd 6632 root 12r CHR 1,9 0t0 3784 /dev/urandom
libvirtd 6632 root 13u IPv4 19742 0t0 TCP *:16514 (LISTEN)
libvirtd 6632 root 14u sock 0,6 0t0 19750 can't identify protocol
libvirtd 6632 root 15u unix 0xffff88040ab66080 0t0 19743 socket
libvirtd 6632 root 16w REG 0,3 0 4026531946 /proc/mtrr
libvirtd 6632 root 17u REG 0,0 4096 9935 /sys/devices/pci0000:00/0000:00:03.0/0000:02:00.1/config
libvirtd 6632 root 18u sock 0,6 0t0 19793 can't identify protocol
libvirtd 6632 root 19u unix 0xffff8807f56eb680 0t0 6321290 /var/run/libvirt/libvirt-sock
libvirtd 6632 root 20u sock 0,6 0t0 21690 can't identify protocol
libvirtd 6632 root 21u sock 0,6 0t0 21691 can't identify protocol
libvirtd 6632 root 22u sock 0,6 0t0 6321358 can't identify protocol
libvirtd 6632 root 23u sock 0,6 0t0 6321359 can't identify protocol
libvirtd 6632 root 24u unix 0xffff8807c24e8380 0t0 6322577 /var/run/libvirt/libvirt-sock
libvirtd 6632 root 25u IPv4 158404 0t0 TCP intg-vdsb.qa.lab.tlv.redhat.com:44223->nott-vdsa.qa.lab.tlv.redhat.com:16514 (CLOSE_WAIT)
libvirtd 6632 root 26u sock 0,6 0t0 138909 can't identify protocol
libvirtd 6632 root 27r FIFO 0,8 0t0 158405 pipe
libvirtd 6632 root 28u IPv4 167029 0t0 TCP intg-vdsb.qa.lab.tlv.redhat.com:60217->nott-vdsa.qa.lab.tlv.redhat.com:16514 (CLOSE_WAIT)
libvirtd 6632 root 29r FIFO 0,8 0t0 167030 pipe
libvirtd 6632 root 30w FIFO 0,8 0t0 167030 pipe
libvirtd 6632 root 31u sock 0,6 0t0 6322584 can't identify protocol
libvirtd 6632 root 32r BLK 253,23 0x90000000 143969 /dev/dm-23
libvirtd 6632 root 33u sock 0,6 0t0 6322585 can't identify protocol
libvirtd 6632 root 34u IPv4 163644 0t0 TCP intg-vdsb.qa.lab.tlv.redhat.com:60213->nott-vdsa.qa.lab.tlv.redhat.com:16514 (CLOSE_WAIT)
libvirtd 6632 root 36r FIFO 0,8 0t0 163645 pipe
libvirtd 6632 root 39w FIFO 0,8 0t0 163645 pipe
libvirtd 6632 root 40r BLK 253,21 0x40000000 141784 /dev/dm-21
libvirtd 6632 root 42w FIFO 0,8 0t0 158405 pipe
libvirtd 6632 root 45r BLK 253,22 0x90000000 142904 /dev/dm-22
libvirtd 6632 root 48u IPv4 159265 0t0 TCP intg-vdsb.qa.lab.tlv.redhat.com:44225->nott-vdsa.qa.lab.tlv.redhat.com:16514 (CLOSE_WAIT)
libvirtd 6632 root 49r FIFO 0,8 0t0 159266 pipe
libvirtd 6632 root 50w FIFO 0,8 0t0 159266 pipe
libvirtd 6632 root 57u IPv4 159832 0t0 TCP intg-vdsb.qa.lab.tlv.redhat.com:44227->nott-vdsa.qa.lab.tlv.redhat.com:16514 (CLOSE_WAIT)
libvirtd 6632 root 58r FIFO 0,8 0t0 159833 pipe
libvirtd 6632 root 60w FIFO 0,8 0t0 159833 pipe
Are you in a setup where you could use valgrind to track leaked fds? valgrind --track-fds=yes might be able to pinpoint whether libvirt code is opening a block device and forgetting to close it.
I'm not sure I see what you think is evidence that there is a deadlock. Is this a case of libvirt being unresponsive, at a point where it happens to have some device fds open, although the deadlock has nothing in particular to do with device code? Or is this a case of the file system itself going into deadlock at the kernel level and thus blocking libvirt as a result?

Are there any particular strings you want me to look at in the log files to back up your claim of deadlock?

How reproducible is the setup that caused this problem?
From the GDB output I see that gnutls_record_recv returns -10, which is GNUTLS_E_INVALID_SESSION. This happens when a session is closed incorrectly. I helped rvaknin debug this, and it was returning over and over again, which results in a great many 'I/O error' messages in the logs. Therefore I think this might be a duplicate of BZ#721411, or at least related to it.
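For context, here is a minimal sketch (not libvirt's actual rpc code; the helper names handle_readable and close_connection are hypothetical) of how an event-loop read callback built on gnutls_record_recv() can spin and flood the log if GNUTLS_E_INVALID_SESSION is treated as non-fatal:

/* Minimal sketch, assuming a poll()-driven daemon; illustrative only. */
#include <stdio.h>
#include <sys/types.h>
#include <gnutls/gnutls.h>

static void close_connection(void)
{
    /* hypothetical cleanup hook: unregister the fd, free the session */
}

static ssize_t handle_readable(gnutls_session_t session, char *buf, size_t len)
{
    ssize_t ret = gnutls_record_recv(session, buf, len);

    if (ret >= 0)
        return ret;                       /* data received, or 0 on clean EOF */

    if (ret == GNUTLS_E_AGAIN || ret == GNUTLS_E_INTERRUPTED)
        return 0;                         /* transient: retry on the next poll */

    /* Fatal errors such as GNUTLS_E_INVALID_SESSION (-10): if we only log
     * here and leave the fd registered with the event loop, poll() keeps
     * reporting it readable, this callback is invoked again and again, and
     * the log fills with repeated I/O errors instead of the connection
     * being torn down. */
    fprintf(stderr, "TLS read failed: %s\n", gnutls_strerror((int)ret));
    close_connection();
    return -1;
}

The point is only that a fatal TLS error which does not tear the connection down leaves the fd active in the event loop, which matches the flood of 'I/O error' messages described above.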
This bug reproduces on libvirt-0.9.3-5.el6.x86_64.

I ran libvirtd under valgrind using the following command line:

valgrind --track-fds=yes --verbose --log-file=/root/valgrind_fd_nott /usr/sbin/libvirtd --daemon --listen

The output of "lsof -p" shows that /dev/dm-31 remains open:

memcheck- 15350 root 38r BLK 253,31 0x90000000 1655163 /dev/dm-31

I'll attach valgrind's log.
Created attachment 513647 [details]
valgrind's log
According to valgrind's log, it seems that the fd leak is in a dup2 of stderr (look for the following VM name in valgrind's log: "stress_new_pool-02").
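To illustrate what a leaked fd that is a dup of stderr typically looks like, here is a minimal sketch of the common pattern when a daemon redirects a child's stderr into a per-VM log file. This is illustrative only, not libvirt's actual code; the helper spawn_with_stderr_log and the log path argument are hypothetical:

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch: spawn a child with its stderr redirected to a log file. */
static int spawn_with_stderr_log(const char *logpath, char *const argv[])
{
    int logfd = open(logpath, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (logfd < 0)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {                       /* child */
        dup2(logfd, STDERR_FILENO);       /* redirect stderr to the log */
        close(logfd);
        execvp(argv[0], argv);
        _exit(127);
    }

    /* The bug being illustrated: if the parent forgets this close(), logfd
     * stays open in the long-running daemon, which is exactly the kind of
     * leak valgrind --track-fds=yes reports as still open. */
    close(logfd);
    return (int)pid;
}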
(In reply to comment #4)
> From the GDB output I see that gnutls_record_recv returns -10, which is
> GNUTLS_E_INVALID_SESSION. This happens when a session is closed incorrectly.
> I helped rvaknin debug this, and it was returning over and over again, which
> results in a great many 'I/O error' messages in the logs. Therefore I think
> this might be a duplicate of BZ#721411, or at least related to it.

Note that this bug reproduces on libvirt-0.9.3-7.el6.x86_64, although BZ#721411 has already been verified.
Rami: a leaked fd that is a dup of stderr is unrelated to the problem you've reported - stderr would never be a block device.

Also, the word "deadlock" is in the summary of the report, but I don't see anything in the report that suggests there is a deadlock anywhere. Can you explain where you think there is a deadlock?

To be sure I understand this problem correctly:

The symptoms are that after some number of domain life cycle operations, there are no domains left running on the system, but there are several block devices still open (in the example given in Comment 0, it's /dev/dm-21, /dev/dm-22, and /dev/dm-23). Other than these leaked fds that are open on block devices, there are no other operationally visible symptoms. Is that correct?

What exactly are these block devices used for? The image files of the guests?

Is this easy to reproduce? Any chance you can come up with a reproduction case simple enough to be done on a single machine? Alternately, can you install the latest libvirt build from brew (including the debuginfo), attempt to reproduce the problem, and then give me access to the machine to poke around?

Michal: was there a particular reason you thought the failure of gnutls_record_recv might be the cause of these leaked block device fds? Or did that just happen to be the only interesting error in the logs?
Regarding a run with valgrind - can you try that again, but let it run until you see the extra block devices open in the output of lsof? At that point valgrind will be able to tell us what line of code opened the block device, which will help to narrow down who should be closing it.
(In reply to comment #9)
> Rami: a leaked fd that is a dup of stderr is unrelated to the problem you've
> reported - stderr would never be a block device.

I thought it might be related. BTW, is the leaked fd that is a dup of stderr a known issue? Is it covered by any BZ?

> Also, the word "deadlock" is in the summary of the report, but I don't see
> anything in the report that suggests there is a deadlock anywhere. Can you
> explain where you think there is a deadlock?

That's the impression mprivozn and I got while trying to debug the issue for the first time.

> To be sure I understand this problem correctly:
>
> The symptoms are that after some number of domain life cycle operations, there
> are no domains left running on the system, but there are several block devices
> still open (in the example given in Comment 0, it's /dev/dm-21, /dev/dm-22, and
> /dev/dm-23). Other than these leaked fds that are open on block devices, there
> are no other operationally visible symptoms. Is that correct?

You're right, except for the operationally visible symptom: the symptom is on the vdsm side, which can't do any operation on those VMs, as their disks remain open by another process.

> What exactly are these block devices used for? The image files of the guests?

Correct.

> Is this easy to reproduce? Any chance you can come up with a reproduction case
> simple enough to be done on a single machine? Alternately, can you install the
> latest libvirt build from brew (including the debuginfo), and attempt to
> reproduce the problem, then give me access to the machine to poke around?

It will take a lot of time to spot the scenario, since the automatic test has a lot of steps; I assume that reproducing it again with the newest libvirt will be easy. BTW, I reproduced it on every libvirt I've tested (0.9.3-2 to 0.9.3-7); let me know exactly which libvirt to use.

> Michal: was there a particular reason you thought the failure of
> gnutls_record_recv might be the cause of these leaked block device fds? Or did
> that just happen to be the only interesting error in the logs?
(In reply to comment #10)
> Regarding a run with valgrind - can you try that again, but let it run until
> you see the extra block devices open in the output of lsof? At that point
> valgrind will be able to tell us what line of code opened the block device,
> which will help to narrow down who should be closing it.

Please let me know which valgrind parameters will help to spot the issue.
Laine: yes, that was the first lead toward the error. I was thinking that we got stuck in there (as I noted above, we called this function over and over again) and therefore never reached the point where the fds are closed. But that was only a very first (and rough) idea of what may be happening.
Based on the other information you gave, it sounds likely that it's not a resource leak but just some thread not finishing up a task, so don't spend your time on a valgrind run yet, as it may not show anything anyway.

However, we discussed this in the weekly libvirt meeting, and I learned that all the versions you've tested with had the newly rewritten rpc code, which still had a number of problems; it wasn't stabilized until sometime later. To get to a known "working" baseline, can you run again with "0.9.4-rc1.2" from the hybrid repo? This has several well-known fixes to the new rpc code.

We should also try to arrange to do the testing at a time when I can be around to poke at it with gdb if/when it fails.
I ran the same automation test twice, and it seems that the bug is fixed and does not reproduce anymore. Tested with libvirt-0.9.4-0rc1.2.el6.x86_64.
Since the problem no longer occurs with the fixed rpc code when the same test is run by the same person, I think we can safely assume it has been fixed, so I'm closing the bug.