Bug 1271183
Summary: XML-RPC error : Cannot write data: Transport endpoint is not connected
Product: [Fedora] Fedora
Reporter: Richard W.M. Jones <rjones>
Component: libvirt
Assignee: Libvirt Maintainers <libvirt-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: high
Version: 23
CC: agedosier, alexey.kosilin, berrange, cfergeau, clalancette, crobinso, dcaroest, debarshir, dshetty, eedri, fdeutsch, itamar, jforbes, jfrieben, laine, libvirt-maint, mcatanzaro+wrong-account-do-not-cc, rsawhill, veillard, virt-maint, zeenix
Keywords: Reopened
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libvirt-1.2.18.2-2.fc23 libvirt-1.2.18.2-3.fc23
Doc Type: Bug Fix
Clone Of:
: 1297690 (view as bug list)
Last Closed: 2016-03-22 19:53:35 UTC
Type: Bug
Bug Blocks: 910269, 1287260, 1297690
Description
Richard W.M. Jones
2015-10-13 10:52:49 UTC
Here's a good way to reproduce this. *NB* I am running everything (except the sudo command) as NON-root:

killall libvirtd
sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches'
libguestfs-test-tool

With those 3 commands, I can reproduce this pretty much 100% of the time on my laptop. If I run 'libguestfs-test-tool' a second time after that, it works, showing that it's a race condition caused by libvirtd being slow to start up.

I tried your reproducer on a Rawhide VM, but it didn't trigger for me. src/rpc/virnetsocket.c has some retry code in this path, but it's only catching connect(2) errors, while yours seems to be hitting on send/write. Did this just start failing?

This has been failing randomly for a long time - over a year. Unfortunately I can only reproduce it when the machine has slow disks. With SSDs it's not reproducible, or at least that's the common factor among the machines I have where it fails or works.

Still failing for me every so often (only on my laptop with HDDs). There is also a bug filed against GNOME Boxes which looks suspiciously similar: bug 1269933

Yeah, I've heard the Boxes guys say they hit occasional issues with this error... probably the same root cause.

*** Bug 1269933 has been marked as a duplicate of this bug. ***

I can reproduce this almost 100% of the time when starting Boxes for the first time after booting my computer; none of the VMs appear. Closing and reopening Boxes always fixes the issue.

On my new laptop (with SSD) this happens all the time, to the point where I've just disabled the libvirt backend. This is a fresh Fedora 23 install. What information needs to be collected to debug this?

Someone is going to need to instrument the code to figure out where the problem is. The libvirt client autolaunches the session libvirtd, then src/rpc/virnetsocket.c:virNetSocketNewConnectUNIX keeps trying to connect(2) to the daemon's unix socket path. It will retry on ENOENT (socket file doesn't exist yet) and ECONNREFUSED (nothing is listening on the other end of the socket).
That's all the code assumes we need to worry about. But this bug is about an error thrown when we attempt a write(2) on the connected socket, which seems weird... the docs make it sound like connect(2) is supposed to handle this, so maybe it's a kernel bug or something. The first thing I'd suggest, though, is to start sticking sleep() calls in the daemon startup code after the various server socket operations, to see if we can identify where the race window is.

The same behaviour on fc23. What is more, the virt-manager application doesn't have this effect - so it seems that the session connection procedure is implemented more carefully there.

Just checked with libvirt 1.2.18.1-2 and virt-manager 1.2.1-3 - the behaviour is the same in both virt-manager and gnome-boxes. Sorry for the wrong information. )

(In reply to Alexey Kosilin from comment #10)
> The same behaviour on fc23. What is more, the virt-manager application
> doesn't have this effect - so it seems that the session connection
> procedure is implemented more carefully there.

The common feature appears to be qemu:///session connections. virt-manager is probably using a qemu:///system connection.

- - -

On F24 the error message is slightly different, but it's still happening. Unfortunately it is infuriatingly difficult to reproduce. As in, sometimes it happens consistently, then it doesn't happen, with no apparent reason. I spent about a day last month trying to make a reliable reproducer, and it just stopped happening -- the act of enabling debugging in libvirt seemed to have stopped it that time.

(In reply to Richard W.M. Jones from comment #12)
> On F24 the error message is slightly different, but it's still
> happening. Unfortunately it is infuriatingly difficult to reproduce.
> As in, sometimes it happens consistently, then it doesn't happen, with
> no apparent reason.
>
> I spent about a day last month trying to make a reliable reproducer,
> and it just stopped happening -- the act of enabling debugging in libvirt
> seemed to have stopped it that time.

Thanks for trying. It seems some people here are able to reproduce it more easily than others. If you (or any libvirt dev) could provide instructions on how to debug this issue, maybe those folks could help? If we fail to resolve this before FOSDEM, let's get together at the canteen for some live debugging. :)

Unfortunately debugging libvirt is a lot more complicated than it should be, but here goes. You have to create ~/.config/libvirt/libvirtd.conf as described here:

http://libguestfs.org/guestfs-faq.1.html#debugging-libvirt

Then killall libvirtd (as NON-root, so it kills any session libvirtd, not the global libvirtd). Then:

export LIBVIRT_DEBUG=1

Then run whatever command is failing (also as non-root). You'll get lots of debugging from the libvirt library side on stderr, plus you'll get even more debugging from the session libvirtd process in /tmp/libvirtd.log. We will need both streams of debugging in order to make sense of any problems, since messages are exchanged between the client and the daemon. (The above steps describe debugging qemu:///session problems. For qemu:///system problems -- unlikely to be the case for the current bug -- see https://fedoraproject.org/wiki/How_to_debug_Virtualization_problems#libvirt ).

(I'm not going to FOSDEM this year)

We are also seeing this issue on CentOS 7.

Created attachment 1112489 [details]
libvirtd.log

Created attachment 1112490 [details]
output of failing virsh 'list' command
I'm pretty sure these patches fix things: https://www.redhat.com/archives/libvir-list/2016-January/msg00422.html

Thanks a bunch, Cole; I rely on libvirt (via Boxes) and it's annoying when it doesn't start up properly. If you do a Fedora build, I'd be happy to test it!

Scratch build for Rawhide: http://koji.fedoraproject.org/koji/taskinfo?taskID=12513210
Scratch build for F23: http://koji.fedoraproject.org/koji/taskinfo?taskID=12513229

*** Bug 1300010 has been marked as a duplicate of this bug. ***

Fixed if you build libvirt-1.3.1-1.fc24, which is now in dist-git (but not yet built in Rawhide). http://koji.fedoraproject.org/koji/packageinfo?packageID=91

libvirt-1.2.18.2-2.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-02dc87c44e

libvirt-1.2.18.2-2.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-02dc87c44e

libvirt-1.2.18.2-2.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

I just hit this again today with libvirt-1.2.18.2-2.fc23, unfortunately. :(

Can you export:

LIBVIRT_LOG_FILTERS="1:rpc" LIBVIRT_OUTPUT_FILTERS="1:stderr"

and run gnome-boxes from a terminal, and when you reproduce again please post the full output here.

Also please ensure you're running the updated libvirt-client as well, not just the plain libvirt package.

Also you'll need to make sure that the session libvirtd process was restarted, so you're not running the old version. You can usually do (as NON-root):

killall libvirtd

(In reply to Cole Robinson from comment #27)
> Can you export:
>
> LIBVIRT_LOG_FILTERS="1:rpc" LIBVIRT_OUTPUT_FILTERS="1:stderr"
>
> and run gnome-boxes from a terminal, and when you reproduce again please
> post the full output here

OK, I'll try to do this whenever starting Boxes for the next few days.

(In reply to Cole Robinson from comment #28)
> Also please ensure you're running the updated libvirt-client as well, not
> just the plain libvirt package

(In reply to Richard W.M. Jones from comment #29)
> Also you'll need to make sure that the session libvirtd process
> was restarted, so you're not running the old version. You can
> usually do (as NON-root):
>
> killall libvirtd

I'm certain I was running the updated libvirt-client and libvirt-daemon, as I've rebooted my computer since installing the updates a couple of days ago. I have version 1.2.18.2-2.fc23 of libvirt-client and libvirt-daemon.

*** Bug 1301391 has been marked as a duplicate of this bug. ***

Created attachment 1119326 [details]
Debug output
Output when running Boxes with:
LIBVIRT_LOG_FILTERS="1:rpc"
LIBVIRT_OUTPUT_FILTERS="1:stderr"
G_MESSAGES_DEBUG=all
Okay, the error is at least different now:

Error: Failed to connect socket to '/run/user/1000/libvirt/libvirt-sock': No such file or directory

The debug logs show the repeated connect() attempts all failing with error code ENOENT=2. So for some reason the daemon isn't starting up / the socket isn't showing up.

Michael, is this random or regularly reproducible? I'm wondering if there's any discernible pattern. I guess we need more debugging to see what's going on. So on the Boxes side, can you enable LIBVIRT_DEBUG=1 to get the full debug firehose, and follow Rich's suggestion in comment #14 to enable debugging for the daemon to /tmp/libvirtd.log. Then post both when reproducing again.

(In reply to Cole Robinson from comment #33)
> Michael, is this random or regularly reproducible? I'm wondering if there's
> any discernible pattern.

It almost always happens when launching Boxes for the first time after booting (but not always -- it worked properly after a reboot when I tested the last update, but I guess that was just by chance). It also frequently happens when launching Boxes after having had it closed for a long time (I guess ~30 minutes or so). It never occurs when reopening Boxes shortly after closing it. Running 'killall libvirtd' with or without sudo does not make any difference in whether the bug can be reproduced. Logging out of my user account and back in does not make any difference either. The only way I've found to reproduce easily is to restart my computer.

> I guess we need more debugging to see what's going on. So on the Boxes side,
> can you enable LIBVIRT_DEBUG=1 to get the full debug firehose, and follow
> Rich's suggestion in comment #14 to enable debugging for the daemon to
> /tmp/libvirtd.log. Then post both when reproducing again.

Unfortunately this affects the timing and causes the bug to not occur with Boxes on my machine. :/ I got the original debug log in comments #16 and #17 by testing with 'virsh list' instead of gnome-boxes.

I can reproduce a similar issue reliably by running 'sudo virsh list':

$ sudo virsh list
error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory

This occurs 100% of the time no matter how many times I execute it, but I'm not sure if that's a bug or not. It works fine when run without sudo, and I don't see any debug log generated even after editing /etc/libvirt/libvirt.conf as described in the FAQ. I can get the debug logs just fine when running 'virsh list' as non-root, or when running Boxes, but as I said, I unfortunately cannot reproduce the issue with debugging turned on.

(In reply to Cole Robinson from comment #33)
> Error: Failed to connect socket to '/run/user/1000/libvirt/libvirt-sock': No
> such file or directory

This one has been reported several times before:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=763141
https://bugzilla.redhat.com/show_bug.cgi?id=842114
https://bugzilla.redhat.com/show_bug.cgi?id=1140624

Note the latter was marked as a duplicate, but I'm not sure the duplicate is correct. In any case, it still happens for me from time to time but is very hard to reproduce. It's (a) likely a different bug from the current one and (b) not fixed.

Another thing to check is to just see if there are any libvirt messages in (non-root) journalctl -b. It should report some messages every time the daemon starts up.
Maybe it's reporting an error or warning:

Feb 02 11:50:41 victory-road libvirtd[19024]: libvirt version: 1.2.18.2, package: 2.fc23 (Fedora Project, 2016-01-21-00:54:13, buildhw-01-nfs.phx2.fedoraproject.org)
Feb 02 11:50:41 victory-road libvirtd[19024]: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_xen.so not accessible
Feb 02 11:50:41 victory-road libvirtd[19024]: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_libxl.so not accessible
Feb 02 11:50:41 victory-road libvirtd[19024]: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_lxc.so not accessible
Feb 02 11:50:41 victory-road libvirtd[19024]: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_uml.so not accessible
Feb 02 11:50:41 victory-road libvirtd[19024]: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_vbox.so not accessible
Feb 02 11:50:43 victory-road org.gnome.Boxes[2252]: (gnome-boxes:18974): Boxes-WARNING **: libvirt-broker.vala:117: Unable to open qemu+unix:///session: Failed to connect socket to '/run/user/1000/libvirt/libvir
Feb 02 11:50:47 victory-road libvirtd[19024]: internal error: Network is already in use by interface virbr0

This is very serious and is making Boxes look really bad. Any progress by libvirt developers on this?
I can reliably reproduce this "unable to open qemu+unix:///session: Failed to connect socket to '/run/user/1000/libvirt/libvir" issue with libvirt.git with:

killall libvirtd
sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches'
virsh list --all

(this is on a 2-year-old laptop with a spinning hard drive)

After experimenting with:

diff --git a/src/rpc/virnetsocket.c b/src/rpc/virnetsocket.c
index b0d5b1c..02194d9 100644
--- a/src/rpc/virnetsocket.c
+++ b/src/rpc/virnetsocket.c
@@ -704,10 +704,11 @@ int virNetSocketNewConnectUNIX(const char *path,
             if (virNetSocketForkDaemon(binary) < 0)
                 goto cleanup;
             daemonLaunched = true;
         }
-        usleep(5000);
+        usleep(50000);
     }

     localAddr.len = sizeof(localAddr.data);

which is not failing, it looks like the session libvirtd takes about 1 second to start on that machine, but virNetSocketNewConnectUNIX only waits for about 100 * 5 ms.

Created attachment 1136251 [details]
virsh list --all debug output on failure
danpb mentioned be78814ae on IRC. I can reproduce a similar failure with be78814ae by following the steps in #39 (the error message is different though). Then I tested be78814ae^ and I haven't been able to reproduce this problem (the steps in #39 give me a list of VMs).

This type of race condition will exist in libvirt forever, as long as we have code which loops on connect, sleeping a finite amount of time. Before be78814ae was applied, we explicitly passed the pre-opened listener down to libvirtd, which would totally avoid any race between connect() and listen(). The explanation in be78814ae as to why it was necessary to revert it doesn't really make sense, because it is talking about a race between daemon startup + listen() + connect() which would not even exist in the code it was removing. I vote for reverting be78814ae and re-examining whatever problem led to that patch.

Here's the referenced commit:

commit be78814ae07f092d9c4e71fd82dd1947aba2f029
Author: Michal Privoznik <mprivozn>
Date:   Thu Apr 2 14:41:17 2015 +0200

    virNetSocketNewConnectUNIX: Use flocks when spawning a daemon

    https://bugzilla.redhat.com/show_bug.cgi?id=1200149

    Even though we have a mutex mechanism so that two clients don't spawn
    two daemons, it's not strong enough. It can happen that while one
    client is spawning the daemon, the other one fails to connect.
    Basically two possible errors can happen:

      error: Failed to connect socket to '/home/mprivozn/.cache/libvirt/libvirt-sock': Connection refused

    or:

      error: Failed to connect socket to '/home/mprivozn/.cache/libvirt/libvirt-sock': No such file or directory

    The problem in both cases is, the daemon is only starting up, while
    we are trying to connect (and fail). We should postpone the
    connecting phase until the daemon is started (by the other thread
    that is spawning it). In order to do that, create a file lock
    'libvirt-lock' in the directory where the session daemon would
    create its socket.
    So even when called from multiple processes, spawning a daemon will
    serialize on the file lock. So only the first to come will spawn the
    daemon.

    Tested-by: Richard W. M. Jones <rjones>
    Signed-off-by: Michal Privoznik <mprivozn>

Unwinding that isn't going to be easy: there are several changes in between, and TBH the previous code had a really confusing flow, IMO. I think it's better to bump the retry timeout now while we discuss/review the proper fix...

Patch to bump up the retry timeout sent upstream: http://www.redhat.com/archives/libvir-list/2016-March/msg00675.html

libvirt-1.2.18.2-3.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-399269e155

libvirt-1.2.18.2-3.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-399269e155

libvirt-1.2.18.2-3.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

*** Bug 1287260 has been marked as a duplicate of this bug. ***

(In reply to Cole Robinson from comment #44)
> Patch to bump up the retry timeout sent upstream:
>
> http://www.redhat.com/archives/libvir-list/2016-March/msg00675.html

Thanks for looking into and fixing this.