Bug 1271183
Summary: XML-RPC error : Cannot write data: Transport endpoint is not connected
Product: [Fedora] Fedora
Reporter: Richard W.M. Jones <rjones>
Component: libvirt
Assignee: Libvirt Maintainers <libvirt-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: high
Version: 23
CC: agedosier, alexey.kosilin, berrange, cfergeau, clalancette, crobinso, dcaroest, debarshir, dshetty, eedri, fdeutsch, itamar, jforbes, jfrieben, laine, libvirt-maint, mcatanzaro+wrong-account-do-not-cc, rsawhill, veillard, virt-maint, zeenix
Keywords: Reopened
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libvirt-1.2.18.2-2.fc23 libvirt-1.2.18.2-3.fc23
Doc Type: Bug Fix
Clone Of:
: 1297690 (view as bug list)
Last Closed: 2016-03-22 19:53:35 UTC
Type: Bug
Bug Blocks: 910269, 1287260, 1297690
Description
Richard W.M. Jones
2015-10-13 10:52:49 UTC
Here's a good way to reproduce this. *NB* I am running everything (except the sudo command) as NON-root:

killall libvirtd
sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches'
libguestfs-test-tool

With those 3 commands, I can reproduce this pretty much 100% of the time on my laptop. If I run 'libguestfs-test-tool' a second time after that, it works, showing that it's a race condition caused by libvirtd being slow to start up.

I tried your reproducer on a Rawhide VM, but it didn't trigger for me. src/rpc/virnetsocket.c has some retry code in this path, but it's only catching connect(2) errors, while yours seems to be hitting on send/write. Did this just start failing?

This has been failing randomly for a long time - over a year. Unfortunately I can only reproduce it when the machine has slow disks. With SSDs it's not reproducible, or at least that's the common factor among the machines I have where it fails or works.

Still failing for me every so often (only on my laptop with HDDs). There is also a bug filed against GNOME Boxes which looks suspiciously similar: bug 1269933

Yeah, I've heard the Boxes guys say they hit occasional issues with this error... probably the same root cause.

*** Bug 1269933 has been marked as a duplicate of this bug. ***

I can reproduce this almost 100% of the time when starting Boxes for the first time after booting my computer; none of the VMs appear. Closing and reopening Boxes always fixes the issue.

On my new laptop (with SSD) this happens all the time, to the point where I've just disabled the libvirt backend. This is a fresh Fedora 23 install. What information needs to be collected to debug this?

Someone is going to need to instrument the code to figure out where the problem is. The libvirt client autolaunches the session libvirtd, then src/rpc/virnetsocket.c:virNetSocketNewConnectUNIX keeps trying to connect(2) to the daemon's unix socket path. It will retry on ENOENT (socket file doesn't exist yet) and ECONNREFUSED (nothing is listening on the other end of the socket).
That's all the code assumes we need to worry about. But this bug is about an error thrown when we attempt a write(2) on the connected socket, which seems weird... the docs make it sound like connect(2) is supposed to handle this, so maybe it's a kernel bug or something. The first thing I'd suggest, though, is to start sticking sleep() calls in the daemon startup code after the various server socket operations, to see if we can identify where the race window is.

The same behaviour on fc23. What is more, the virt-manager application doesn't have this effect - so it seems that the session connection procedure is implemented more carefully there.

Just checked with libvirt 1.2.18.1-2 and virt-manager 1.2.1-3 - the behaviour is the same in both virt-manager and gnome-boxes. Sorry for the wrong information. )

(In reply to Alexey Kosilin from comment #10)
> The same behaviour on fc23. What is more, the virt-manager application
> doesn't have this effect - so it seems that the session connection
> procedure is implemented more carefully there.

The common feature appears to be qemu:///session connections. virt-manager is probably using a qemu:///system connection.

- - -

On F24 the error message is slightly different, but it's still happening. Unfortunately it is infuriatingly difficult to reproduce. As in, sometimes it happens consistently, then it doesn't happen, with no apparent reason. I spent about a day last month trying to make a reliable reproducer, and it just stopped happening -- the act of enabling debugging in libvirt seemed to have stopped it that time.

(In reply to Richard W.M. Jones from comment #12)
> On F24 the error message is slightly different, but it's still
> happening. Unfortunately it is infuriatingly difficult to reproduce.
> As in, sometimes it happens consistently, then it doesn't happen, with
> no apparent reason.
>
> I spent about a day last month trying to make a reliable reproducer,
> and it just stopped happening -- the act of enabling debugging in libvirt
> seemed to have stopped it that time.

Thanks for trying. It seems some people here are able to reproduce it more easily than others. If you (or any libvirt dev) could provide instructions on how to debug this issue, maybe those folks could help? If we fail to resolve this before FOSDEM, let's get together at the canteen for some live debugging. :)

Unfortunately debugging libvirt is a lot more complicated than it should be, but here goes. You have to create ~/.config/libvirt/libvirtd.conf as described here:

http://libguestfs.org/guestfs-faq.1.html#debugging-libvirt

Then killall libvirtd (as NON-root, so it kills any session libvirtd, not the global libvirtd). Then:

export LIBVIRT_DEBUG=1

Then run whatever command is failing (also as non-root). You'll get lots of debugging from the libvirt library side on stderr, plus you'll get even more debugging from the session libvirtd process in /tmp/libvirtd.log. We will need both streams of debugging in order to make sense of any problems, since messages are exchanged between the client and the daemon. (The above steps describe debugging qemu:///session problems. For qemu:///system problems -- unlikely to be the case for the current bug -- see https://fedoraproject.org/wiki/How_to_debug_Virtualization_problems#libvirt ).

(I'm not going to FOSDEM this year)

We are also seeing this issue on CentOS 7.

Created attachment 1112489 [details]
libvirtd.log

Created attachment 1112490 [details]
output of failing virsh 'list' command
I'm pretty sure these patches fix things: https://www.redhat.com/archives/libvir-list/2016-January/msg00422.html

Thanks a bunch, Cole; I rely on libvirt (via Boxes) and it's annoying when it doesn't start up properly. If you do a Fedora build, I'd be happy to test it!

Scratch build for Rawhide: http://koji.fedoraproject.org/koji/taskinfo?taskID=12513210
Scratch build for F23: http://koji.fedoraproject.org/koji/taskinfo?taskID=12513229

*** Bug 1300010 has been marked as a duplicate of this bug. ***

Fixed if you build libvirt-1.3.1-1.fc24, which is now in dist-git (but not yet built in Rawhide). http://koji.fedoraproject.org/koji/packageinfo?packageID=91

libvirt-1.2.18.2-2.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-02dc87c44e

libvirt-1.2.18.2-2.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-02dc87c44e

libvirt-1.2.18.2-2.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

I just hit this again today with libvirt-1.2.18.2-2.fc23, unfortunately. :(

Can you export:

LIBVIRT_LOG_FILTERS="1:rpc" LIBVIRT_OUTPUT_FILTERS="1:stderr"

and run gnome-boxes from a terminal, and when you reproduce again please post the full output here.

Also please ensure you're running the updated libvirt-client as well, not just the plain libvirt package.

Also you'll need to make sure that the session libvirtd process was restarted, so you're not running the old version. You can usually do (as NON-root):

killall libvirtd

(In reply to Cole Robinson from comment #27)
> Can you export:
>
> LIBVIRT_LOG_FILTERS="1:rpc" LIBVIRT_OUTPUT_FILTERS="1:stderr"
>
> and run gnome-boxes from a terminal, and when you reproduce again please
> post the full output here

OK, I'll try to do this whenever starting Boxes for the next few days.

(In reply to Cole Robinson from comment #28)
> Also please ensure you're running the updated libvirt-client as well, not
> just the plain libvirt package

(In reply to Richard W.M. Jones from comment #29)
> Also you'll need to make sure that the session libvirtd process
> was restarted, so you're not running the old version. You can
> usually do (as NON-root):
>
> killall libvirtd

I'm certain I was running the updated libvirt-client and libvirt-daemon, as I've rebooted my computer since installing the updates a couple of days ago. I have version 1.2.18.2-2.fc23 of libvirt-client and libvirt-daemon.

*** Bug 1301391 has been marked as a duplicate of this bug. ***

Created attachment 1119326 [details]
Debug output
Output when running Boxes with:
LIBVIRT_LOG_FILTERS="1:rpc"
LIBVIRT_OUTPUT_FILTERS="1:stderr"
G_MESSAGES_DEBUG=all
Okay, the error is at least different now:

Error: Failed to connect socket to '/run/user/1000/libvirt/libvirt-sock': No such file or directory

The debug logs show the repeated connect() attempts all failing with error code ENOENT=2. So for some reason the daemon isn't starting up / the socket isn't showing up.

Michael, is this random or regularly reproducible? I'm wondering if there's any discernible pattern. I guess we need more debugging to see what's going on. So on the Boxes side, can you enable LIBVIRT_DEBUG=1 to get the full debug firehose, and follow Rich's suggestion in comment #14 to enable debugging for the daemon to /tmp/libvirtd.log. Then post both when reproducing again.

(In reply to Cole Robinson from comment #33)
> Michael, is this random or regularly reproducible? I'm wondering if there's
> any discernible pattern.

It almost always happens when launching Boxes for the first time after booting (but not always -- it worked properly after a reboot when I tested the last update, but I guess that was just by chance). It also frequently happens when launching Boxes after having had it closed for a long time (I guess ~30 minutes or so). It never occurs when reopening Boxes shortly after closing it. Running 'killall libvirtd' with or without sudo does not make any difference in whether the bug can be reproduced. Logging out of my user account and back in does not make any difference either. The only way I've found to reproduce easily is to restart my computer.

> I guess we need more debugging to see what's going on. So on the Boxes side,
> can you enable LIBVIRT_DEBUG=1 to get the full debug firehose, and follow
> Rich's suggestion in comment #14 to enable debugging for the daemon to
> /tmp/libvirtd.log. Then post both when reproducing again.

Unfortunately this affects the timing and causes the bug to not occur with Boxes on my machine. :/ I got the original debug log in comments #16 and #17 by testing with 'virsh list' instead of gnome-boxes.

I can reproduce a similar issue reliably by running 'sudo virsh list':

$ sudo virsh list
error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory

This occurs 100% of the time no matter how many times I execute it, but I'm not sure if that's a bug or not. It works fine when run without sudo, and I don't see any debug log generated even after editing /etc/libvirt/libvirt.conf as described in the FAQ. I can get the debug logs just fine when running 'virsh list' as non-root, or when running Boxes, but as I said, I unfortunately cannot reproduce the issue with debugging turned on.

(In reply to Cole Robinson from comment #33)
> Error: Failed to connect socket to '/run/user/1000/libvirt/libvirt-sock': No
> such file or directory

This one has been reported several times before:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=763141
https://bugzilla.redhat.com/show_bug.cgi?id=842114
https://bugzilla.redhat.com/show_bug.cgi?id=1140624

Note the latter was marked as a duplicate, but I'm not sure the duplicate is correct. In any case, it still happens for me from time to time but is very hard to reproduce. It's (a) likely a different bug from the current one and (b) not fixed.

Another thing to check is to just see if there are any libvirt messages in (non-root) journalctl -b. It should report some messages every time the daemon starts up.
Maybe it's reporting an error or warning:

Feb 02 11:50:41 victory-road libvirtd[19024]: libvirt version: 1.2.18.2, package: 2.fc23 (Fedora Project, 2016-01-21-00:54:13, buildhw-01-nfs.phx2.fedoraproject.org)
Feb 02 11:50:41 victory-road libvirtd[19024]: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_xen.so not accessible
Feb 02 11:50:41 victory-road libvirtd[19024]: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_libxl.so not accessible
Feb 02 11:50:41 victory-road libvirtd[19024]: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_lxc.so not accessible
Feb 02 11:50:41 victory-road libvirtd[19024]: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_uml.so not accessible
Feb 02 11:50:41 victory-road libvirtd[19024]: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_vbox.so not accessible
Feb 02 11:50:43 victory-road org.gnome.Boxes[2252]: (gnome-boxes:18974): Boxes-WARNING **: libvirt-broker.vala:117: Unable to open qemu+unix:///session: Failed to connect socket to '/run/user/1000/libvirt/libvir
Feb 02 11:50:47 victory-road libvirtd[19024]: internal error: Network is already in use by interface virbr0

This is very serious and is making Boxes look really bad. Any progress by libvirt developers on this?
I can reliably reproduce this "unable to open qemu+unix:///session: Failed to connect socket to '/run/user/1000/libvirt/libvir" issue with libvirt.git with:

killall libvirtd
sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches'
virsh list --all

(this is on a 2-year-old laptop with a spinning hard drive)

After experimenting with:

diff --git a/src/rpc/virnetsocket.c b/src/rpc/virnetsocket.c
index b0d5b1c..02194d9 100644
--- a/src/rpc/virnetsocket.c
+++ b/src/rpc/virnetsocket.c
@@ -704,10 +704,11 @@ int virNetSocketNewConnectUNIX(const char *path,
             if (virNetSocketForkDaemon(binary) < 0)
                 goto cleanup;
             daemonLaunched = true;
         }
-        usleep(5000);
+        usleep(50000);
     }

     localAddr.len = sizeof(localAddr.data);

which is not failing, it looks like the session libvirtd takes about 1 second to start on that machine, but virNetSocketNewConnectUNIX only waits for about 100 * 5 ms.

Created attachment 1136251 [details]
virsh list --all debug output on failure
danpb mentioned be78814ae on IRC. I can reproduce a similar failure with be78814ae by following the steps in #39 (the error message is different though). Then I tested be78814ae^ and I haven't been able to reproduce this problem (the steps in #39 give me a list of VMs).

This type of race condition will exist in libvirt forever, as long as we have code which loops on connect, sleeping a finite amount of time. Before be78814ae was applied, we explicitly passed the pre-opened listener down to libvirtd, which would totally avoid any race between connect() and listen(). The explanation in be78814ae as to why it was necessary to revert it doesn't really make sense, because it is talking about a race between daemon startup + listen() + connect() which would not even exist in the code it was removing. I vote for reverting be78814ae and re-examining whatever problem led to that patch.

Here's the referenced commit:

commit be78814ae07f092d9c4e71fd82dd1947aba2f029
Author: Michal Privoznik <mprivozn>
Date:   Thu Apr 2 14:41:17 2015 +0200

    virNetSocketNewConnectUNIX: Use flocks when spawning a daemon

    https://bugzilla.redhat.com/show_bug.cgi?id=1200149

    Even though we have a mutex mechanism so that two clients don't spawn
    two daemons, it's not strong enough. It can happen that while one
    client is spawning the daemon, the other one fails to connect.
    Basically two possible errors can happen:

      error: Failed to connect socket to '/home/mprivozn/.cache/libvirt/libvirt-sock': Connection refused

    or:

      error: Failed to connect socket to '/home/mprivozn/.cache/libvirt/libvirt-sock': No such file or directory

    The problem in both cases is, the daemon is only starting up, while
    we are trying to connect (and fail). We should postpone the
    connecting phase until the daemon is started (by the other thread
    that is spawning it). In order to do that, create a file lock
    'libvirt-lock' in the directory where the session daemon would
    create its socket.
    So even when called from multiple processes, spawning a daemon will
    serialize on the file lock. So only the first to come will spawn the
    daemon.

    Tested-by: Richard W. M. Jones <rjones>
    Signed-off-by: Michal Privoznik <mprivozn>

Unwinding that isn't going to be easy: there are several changes in between, and TBH the previous code had a really confusing flow, IMO. I think it's better to bump the retry timeout now while we discuss/review the proper fix...

Patch to bump up the retry timeout sent upstream: http://www.redhat.com/archives/libvir-list/2016-March/msg00675.html

libvirt-1.2.18.2-3.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-399269e155

libvirt-1.2.18.2-3.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-399269e155

libvirt-1.2.18.2-3.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

*** Bug 1287260 has been marked as a duplicate of this bug. ***

(In reply to Cole Robinson from comment #44)
> Patch to bump up the retry timeout sent upstream:
>
> http://www.redhat.com/archives/libvir-list/2016-March/msg00675.html

Thanks for looking into and fixing this.