Bug 895901 - libvirt monitor socket sometimes does not show up
Summary: libvirt monitor socket sometimes does not show up
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Libvirt Maintainers
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: TRACKER-bugs-affecting-libguestfs 892273 1051364
 
Reported: 2013-01-16 09:03 UTC by Richard W.M. Jones
Modified: 2014-01-17 07:45 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 892273
Environment:
Last Closed: 2014-01-17 07:45:53 UTC
Embargoed:


Attachments
build.log (1.46 MB, text/plain)
2014-01-04 08:55 UTC, Richard W.M. Jones
root.log (172.32 KB, text/plain)
2014-01-04 08:55 UTC, Richard W.M. Jones

Description Richard W.M. Jones 2013-01-16 09:03:23 UTC
+++ This bug was initially created as a clone of Bug #892273 +++

Description of problem:

Sometimes the libvirt monitor socket does not show up, which makes libguestfs fail to start. I hit this issue from time to time and don't know how to reproduce it reliably;
a possible reproducer is to start more than 20 appliances.

12/22 13:18:32 INFO |libguestfs:0069| ---------------- Test output ----------------
12/22 13:18:34 INFO |guestfs_su:0686| GS:libguestfs: trace: add_drive_ro "/usr/local/staf/test/RHEV/libguestfs-autotest/autotest/client/tests/libguestfs/images/Augeas.ext2.raw"
12/22 13:18:34 INFO |guestfs_su:0686| GS:libguestfs: trace: add_drive_ro = 0
12/22 13:18:34 INFO |guestfs_su:0686| GS:libguestfs: trace: launch
12/22 13:18:34 INFO |guestfs_su:0686| GS:libguestfs: trace: get_tmpdir
12/22 13:18:34 INFO |guestfs_su:0686| GS:libguestfs: trace: get_tmpdir = "/tmp"
12/22 13:18:36 INFO |guestfs_su:0686| GS:libguestfs: trace: get_cachedir
12/22 13:18:36 INFO |guestfs_su:0686| GS:libguestfs: trace: get_cachedir = "/var/tmp"
12/22 13:18:36 INFO |guestfs_su:0686| GS:libguestfs: trace: get_cachedir
12/22 13:18:36 INFO |guestfs_su:0686| GS:libguestfs: trace: get_cachedir = "/var/tmp"
12/22 13:19:14 INFO |guestfs_su:0686| GS:libguestfs: trace: get_cachedir
12/22 13:19:14 INFO |guestfs_su:0686| GS:libguestfs: trace: get_cachedir = "/var/tmp"
12/22 13:19:18 INFO |guestfs_su:0686| GS:libguestfs: error: could not create appliance through libvirt: monitor socket did not show up.: No such file or directory [code=38 domain=10]
12/22 13:19:18 INFO |guestfs_su:0686| GS:libguestfs: trace: launch = -1 (error)
12/22 13:19:18 INFO |guestfs_su:0686| GS:libguestfs: trace: close
12/22 13:19:18 INFO |guestfs_su:0686| GS:(Process terminated with status 1)
12/22 13:19:18 INFO |libguestfs:0072| ---------------- End of test output ----------------
12/22 13:19:18 ERROR|libguestfs:0092| Test failed: Test 'aug_get' ended with 'libguestfs: error: could not create appliance through libvirt: monitor socket did not show up.: No such file or directory [code=38 domain=10]' 
12/22 13:19:18 DEBUG|libguestfs:0093| Postprocessing on error...
12/22 13:19:18 ERROR|      test:0415| Exception escaping from test:
Traceback (most recent call last):




Version-Release number of selected component (if applicable):
libguestfs-1.20.1-4.el7.x86_64

How reproducible:
sometimes

Steps to Reproduce:
1. Start up more than 20 guestfish appliances (an illustrative reproducer sketch follows below).
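
Illustrative reproducer sketch (the appliance count, disk size and paths are
assumptions, and it uses the libguestfs Python bindings rather than guestfish
for brevity):

#!/usr/bin/env python
# Launch many libguestfs appliances in parallel; under enough load some
# launches may fail with "monitor socket did not show up".

import os
import threading

import guestfs

APPLIANCES = 25                             # "more than 20"
DISK_TEMPLATE = "/var/tmp/monitor-race-%d.img"

def start_appliance(i):
    disk = DISK_TEMPLATE % i
    with open(disk, "wb") as f:
        f.truncate(100 * 1024 * 1024)       # small sparse scratch disk
    g = guestfs.GuestFS()
    g.add_drive_ro(disk)
    try:
        g.launch()                          # the call that intermittently fails
    finally:
        g.close()
        os.unlink(disk)

threads = [threading.Thread(target=start_appliance, args=(i,))
           for i in range(APPLIANCES)]
for t in threads:
    t.start()
for t in threads:
    t.join()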
  
Actual results:


Expected results:


Additional info:

--- Additional comment from Mohua Li on 2013-01-10 21:51:37 EST ---

I have tried several test runs and now hit this problem quite often, as shown below:


01/10 21:06:30 INFO |guestfs_su:0686| GS:lvm partition...
01/10 21:06:31 INFO |guestfs_su:0686| GS:libguestfs: trace: add_drive "/usr/local/staf/test/RHEV/libguestfs-autotest/autotest/client/tests/libguestfs/images/fs_mount.btrfs.qcow2"
01/10 21:06:31 INFO |guestfs_su:0686| GS:libguestfs: trace: add_drive = 0
01/10 21:06:31 INFO |guestfs_su:0686| GS:libguestfs: trace: launch
01/10 21:06:31 INFO |guestfs_su:0686| GS:libguestfs: trace: get_tmpdir
01/10 21:06:31 INFO |guestfs_su:0686| GS:libguestfs: trace: get_tmpdir = "/tmp"
01/10 21:06:31 INFO |guestfs_su:0686| GS:libvir: XML-RPC error : Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused
01/10 21:06:31 INFO |guestfs_su:0686| GS:*stdin*:1: libguestfs: error: could not connect to libvirt (URI = NULL): Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused [code=38 domain=7]
01/10 21:06:31 INFO |guestfs_su:0686| GS:libguestfs: trace: launch = -1 (error)
01/10 21:06:31 INFO |guestfs_su:0686| GS:libguestfs: trace: close

--- Additional comment from Richard W.M. Jones on 2013-01-16 04:02:51 EST ---

I'm pretty sure this has to be a libvirt bug.  After all,
we send libvirt the same XML every time!  Or perhaps a qemu bug,
if for some reason there is a race or delay between libvirt and
qemu creating the monitor.
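
To make the suspected race concrete, here is a minimal illustrative sketch
(Python, not libvirt's actual code, which is C; the socket path, timeout and
polling interval are invented): a waiter that polls for the monitor socket
and gives up after a fixed timeout reports exactly this kind of "monitor
socket did not show up" failure whenever qemu is slow to start, e.g. on a
heavily loaded host.

#!/usr/bin/env python
# Illustrative only: poll for qemu's monitor socket, give up after a
# fixed timeout.

import os
import socket
import time

def wait_for_monitor(path, timeout=3.0, interval=0.1):
    # Poll until the UNIX socket exists and accepts a connection, or
    # until `timeout` seconds have elapsed.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path):
            s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            try:
                s.connect(path)
                return s                  # monitor is up and usable
            except socket.error:
                s.close()                 # created but not accepting yet
        time.sleep(interval)
    raise RuntimeError("monitor socket did not show up: %s" % path)

# Example (hypothetical path):
#   wait_for_monitor("/var/lib/libvirt/qemu/guest.monitor")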

Comment 1 Richard W.M. Jones 2014-01-04 08:55:09 UTC
Created attachment 845269 [details]
build.log

Occurred again during a Fedora 20 build.  The build log is attached.
Search for 'FAIL:' in the log file.

Comment 2 Richard W.M. Jones 2014-01-04 08:55:34 UTC
Created attachment 845292 [details]
root.log

Corresponding root.log.

Comment 3 Richard W.M. Jones 2014-01-04 08:56:21 UTC
The other problem is that libvirt loses information about
the failure.  The failures go to an obscure log file, which
was lost after the build.  Instead, the failures should go
into the libvirt error message.

Comment 4 Martin Kletzander 2014-01-07 10:21:44 UTC
This may be related to the default timeout for which libvirt waits for qemu to create the monitor socket.  We currently wait at most 3 seconds, and if the load on the system is high enough, that may not be enough for qemu.

To reproduce (no guarantee), you might decrease the value of the "timeout" variable in the qemuMonitorOpenUnix() function in "src/qemu/qemu_monitor.c", or write a wrapper for qemu which waits a certain amount of time before launching qemu when it is supposed to run a guest.  The following works for me when used as the emulator for the machine:

#!/usr/bin/env python
# qemu wrapper that delays guest start-up, simulating a heavily loaded
# host where qemu is slow to create its monitor socket.

import os
import sys
import time

TIMEOUT = 4                          # seconds to delay qemu
QEMU_PATH = "/usr/bin/qemu-kvm"      # real emulator to run

# libvirt passes "-uuid" only when starting an actual guest, so
# capability probes are not delayed.
if "-uuid" in sys.argv:
    time.sleep(TIMEOUT)

# Replace this process with the real qemu, keeping all arguments.
sys.argv[0] = QEMU_PATH
os.execv(QEMU_PATH, sys.argv)
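
For reference (an assumption, since the comment does not spell it out): the wrapper above is used by pointing the guest at it instead of the real qemu binary, for example via the <emulator> element of the domain XML, so that each guest start is delayed by TIMEOUT seconds.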

A timeout of 2 seconds is OK; 4 is not and makes the guest fail to start.  If that is a true reproducer for you (you might, for example, try a timeout of 2 and run multiple appliances), then this upstream proposal might make it configurable:

https://www.redhat.com/archives/libvir-list/2014-January/msg00060.html

Any reply in the thread is appreciated; I'll try to review it ASAP.

Comment 5 Richard W.M. Jones 2014-01-07 10:24:24 UTC
We have several bugs tracking similar issues, but that's essentially
the same test that we did in
https://bugzilla.redhat.com/show_bug.cgi?id=1048818

Comment 6 Martin Kletzander 2014-01-17 07:45:53 UTC
Hopefully fixed upstream by v1.2.1-11-gfe89b68:

commit fe89b687a02d1a8e1dce695a67b4f9d2c254d7b9
Author: Martin Kletzander <mkletzan>
Date:   Thu Jan 9 07:57:59 2014 +0100

    qemu: Change the default unix monitor timeout

