Bug 859596 - libvirtd doesn't let you overwrite some environment variables (eg $TMPDIR) using <qemu:env>
Summary: libvirtd doesn't let you overwrite some environment variables (eg $TMPDIR) us...
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Libvirt Maintainers
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-09-22 09:11 UTC by Richard W.M. Jones
Modified: 2013-06-24 16:08 UTC (History)
4 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2013-06-24 16:08:17 UTC
Embargoed:


Attachments (Terms of Use)
test-launch-race-libvirtd.log (968.67 KB, text/x-log)
2012-09-22 09:11 UTC, Richard W.M. Jones
no flags Details
test-launch-race-libvirtd-2.log (1.78 MB, text/x-log)
2012-09-22 09:44 UTC, Richard W.M. Jones
no flags Details
libvirtd.log corresponding to comment 4 (3.08 MB, text/plain)
2012-09-22 13:37 UTC, Richard W.M. Jones
no flags Details

Description Richard W.M. Jones 2012-09-22 09:11:10 UTC
Created attachment 615718 [details]
test-launch-race-libvirtd.log

Description of problem:

Sometimes, but certainly not all the time, test-launch-race.pl fails
when tested with the libvirt backend.  The error message is very
unexpected:

/tmp/whenjobs89b451727722b7fedf12c26aeba48fa1/libguestfs-1.19.43/run --test ./test-launch-race.pl
could not create appliance through libvirt: unable to set security context 'unconfined_u:object_r:svirt_image_t:s0:c498,c594' on '/tmp/whenjobs89b451727722b7fedf12c26aeba48fa1/libguestfs-1.19.43/tmp/pyrdBdaqbA/libguestfs4Hj3ae/console.sock': No such file or directory [code=38 domain=24] at /tmp/whenjobs89b451727722b7fedf12c26aeba48fa1/libguestfs-1.19.43/tests/protocol/test-launch-race.pl line 41.
could not create appliance through libvirt: unable to set security context 'unconfined_u:object_r:svirt_image_t:s0:c323,c513' on '/tmp/whenjobs89b451727722b7fedf12c26aeba48fa1/libguestfs-1.19.43/tmp/pyrdBdaqbA/libguestfs4LsW9d/console.sock': No such file or directory [code=38 domain=24] at /tmp/whenjobs89b451727722b7fedf12c26aeba48fa1/libguestfs-1.19.43/tests/protocol/test-launch-race.pl line 47.
/tmp/whenjobs89b451727722b7fedf12c26aeba48fa1/libguestfs-1.19.43/run: command failed with exit code 2
FAIL: test-launch-race.pl

The full libvirtd logfile is attached.

Notes:

(1) This is a parallel test.  Two handles are created in
separate processes.

(2) Libguestfs should be creating console.sock.  It's not clear
if the socket really doesn't exist (eg. libguestfs creates it
in the wrong place) or if libvirt cannot find it or if something
deletes it.

Version-Release number of selected component (if applicable):

libguestfs 1.19.43

How reproducible:

? Not always, but often.

Steps to Reproduce:
1. make -C tests/protocol check TESTS=test-launch-race.pl

Comment 1 Richard W.M. Jones 2012-09-22 09:12:59 UTC
Here is the Perl test:

https://github.com/libguestfs/libguestfs/blob/8acdc23622ea95a233ac831d67c143517aa94bba/tests/protocol/test-launch-race.pl

Looking at this, I'm not going to exclude the possibility
that Perl File::Temp itself is deleting the temporary
directory, if one process exits significantly before the
other process.

Comment 2 Richard W.M. Jones 2012-09-22 09:35:45 UTC
I think the call to _exit (line 42) means that the tempdir
is not being cleaned up early.

Interestingly if I make sure that libvirtd is started up
and running before starting the test, then I can run many
iterations of the test no problem.  The bug seems to happen
(or happens most reliably) when there is no libvirtd running
already.  Perhaps both processes start separate libvirtd
daemons?

$ killall libvirtd lt-libvirtd
libvirtd(8138): Operation not permitted
lt-libvirtd: no process found
$ make -C tests/protocol check TESTS=test-launch-race.pl
make: Entering directory `/home/rjones/d/libguestfs/tests/protocol'
make  check-TESTS
make[1]: Entering directory `/home/rjones/d/libguestfs/tests/protocol'
/home/rjones/d/libguestfs/run --test ./test-launch-race.pl
could not create appliance through libvirt: internal error process exited while connecting to monitor: 2012-09-22 09:35:14.863+0000: 7237: debug : virFileClose:72 : Closed fd 24
2012-09-22 09:35:14.863+0000: 7237: debug : virFileClose:72 : Closed fd 29
2012-09-22 09:35:14.869+0000: 7237: debug : virFileClose:72 : Closed fd 3
 [code=1 domain=10] at /home/rjones/d/libguestfs/tests/protocol/test-launch-race.pl line 47.
cannot remove directory for /home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/.guestfs-1000: Directory not empty at /usr/share/perl5/File/Temp.pm line 902.
cannot remove directory for /home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z: Directory not empty at /usr/share/perl5/File/Temp.pm line 902.
/home/rjones/d/libguestfs/run: command failed with exit code 2
FAIL: test-launch-race.pl
==================
1 of 1 test failed
==================
make[1]: *** [check-TESTS] Error 1
make[1]: Leaving directory `/home/rjones/d/libguestfs/tests/protocol'
make: *** [check-am] Error 2
make: Leaving directory `/home/rjones/d/libguestfs/tests/protocol'

Comment 3 Richard W.M. Jones 2012-09-22 09:44:20 UTC
Created attachment 615720 [details]
test-launch-race-libvirtd-2.log

The libvirtd log from the test in the comment above.  This is
probably easier to understand than the original log that
was attached.

There's some strange stuff going on in this log file.  Try
grepping the log for 'console.sock'.  Here is an edited
extract of that grep with comments:

# The first guest.
<source mode="connect" path="/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfsZJ1eCg/console.sock"/>
/bin/qemu-kvm -name guestfs-49vtnm0nub8bqvob [...] -chardev socket,id=charserial0,path=/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfsZJ1eCg/console.sock

# Why do these messages appear _after_ qemu has started up?
2012-09-22 09:35:14.884+0000: 7073: info : virSecuritySELinuxSetFileconHelper:754 : Setting SELinux context on '/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfsZJ1eCg/console.sock' to 'unconfined_u:object_r:svirt_image_t:s0:c268,c436'
2012-09-22 09:35:15.104+0000: 7073: info : virSecuritySELinuxRestoreSecurityFileLabel:884 : Restoring SELinux context on '/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfsZJ1eCg/console.sock'

# Why is libvirtd setting the context on this socket again?
2012-09-22 09:35:15.114+0000: 7073: info : virSecuritySELinuxSetFileconHelper:754 : Setting SELinux context on '/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfsZJ1eCg/console.sock' to 'unconfined_u:object_r:user_home_t:s0'

# The second guest.  Note a slightly different console.sock.
<source mode="connect" path="/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfslC80Cg/console.sock"/>
/bin/qemu-kvm -name guestfs-qoi7f8gqsg9mkqvi [...] -chardev socket,id=charserial0,path=/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfslC80Cg/console.sock

# Why did libvirtd not try to label the second socket at all?


So libvirtd (or multiple libvirtd's?) label one guest socket twice, and the
other not at all.

Comment 4 Richard W.M. Jones 2012-09-22 13:09:02 UTC
Full strace here:

  http://oirase.annexia.org/tmp/strace.log.xz

The two test processes are 12159 & 12237.

The libvirtd processes are 12268 & 12267.  12267 loses the race
to get the PID file, and 12268 becomes the libvirtd.  I notice
that both test processes correctly connect to the winning libvirtd,
so it's not that the multiprocess case in libvirt has been forgotten.

It's still something of a mystery why this wouldn't work.
Perhaps there's some locking or data separation problem inside
libvirtd?

Comment 5 Richard W.M. Jones 2012-09-22 13:33:34 UTC
(In reply to comment #3)
> 2012-09-22 09:35:15.104+0000: 7073: info :
> virSecuritySELinuxRestoreSecurityFileLabel:884 : Restoring SELinux context
> on '/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfsZJ1eCg/console.sock'
> 
> # Why is libvirtd setting the context on this socket again?
> 2012-09-22 09:35:15.114+0000: 7073: info :
> virSecuritySELinuxSetFileconHelper:754 : Setting SELinux context on
> '/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfsZJ1eCg/console.sock' to
> 'unconfined_u:object_r:user_home_t:s0'

Looking at this more closely, I see this is the message from
console.sock being restored back to user_home_t, thus this
is not the problem.

Still leaves the question why the second console socket is
not touched.

Comment 6 Richard W.M. Jones 2012-09-22 13:37:34 UTC
Created attachment 615815 [details]
libvirtd.log corresponding to comment 4

libvirtd.log corresponding to comment 4, so you can see
the strace and libvirtd log from the same run.

Comment 7 Richard W.M. Jones 2012-09-24 14:24:16 UTC
Changing the component to libvirt.  I can reproduce this
bug very often when running libguestfs tests in parallel.

Comment 8 Daniel Berrangé 2012-09-24 14:46:21 UTC
Ok, so from the QEMU VM log file the last 3 lines are these:

2012-09-22 09:35:14.885+0000: 7238: debug : virFileClose:72 : Closed fd 27
2012-09-22 09:35:14.885+0000: 7238: debug : virCommandHook:2115 : Hook is done 0
2012-09-22 09:35:15.090+0000: shutting down

So at 09:35:14.885  libvirt has successfully exec()'d the QEMU process, after doing all SELinux labelling. The way QEMU starts up, should have guaranteed that the socket is open & listening at this point (unless the recent QEMU code has broken this)

At 09:35:15.090 libvirt has recorded that QEMU has completed shutdown

From the libvirtd.log file at the matching timestamp we see

2012-09-22 09:35:14.888+0000: 7073: debug : virObjectNew:110 : OBJECT_NEW: obj=0x7f32900c6df0 classname=qemuMonitor
2012-09-22 09:35:15.089+0000: 7073: error : qemuMonitorOpenUnix:293 : failed to connect to monitor socket: No such process

So libvirtd was unable to connect to the QEMU monitor. Either the QEMU socket has not been created (unlikely, since QEMU startup sequence should guarantee this can't happen), or is an indication that QEMU has exited.

I would normally vote for the second scenario, but for the fact that QEMU has written absolutely no error messages to the logfile about problems starting up.
I'm unclear whether the builder is running with SELinux, but if it is, then conceivably SELinux could be preventing QEMU writing the actual erorr to the logs, and also causing the failure in the first place.

Comment 9 Daniel Berrangé 2012-09-24 15:38:30 UTC
Ohh, in the strace file from comment #4, I see these interesting lines, and some error message that QEMU is printing to stderr (and thanks to SELinux, not getting into the log):


12084 open("/tmp/whenjobs393242c33db7a83e72d4a1785617ba3b/build-test/tests/qemu/test1.img", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 11
12084 fstat(11, {st_mode=S_IFREG|0664, st_size=104857600, ...}) = 0
12084 close(11)                         = 0
12084 open("/tmp/whenjobs393242c33db7a83e72d4a1785617ba3b/build-test/tests/qemu/test1.img", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 11
12084 fstat(11, {st_mode=S_IFREG|0664, st_size=104857600, ...}) = 0
12084 close(11)                         = 0
12084 stat("/tmp/whenjobs393242c33db7a83e72d4a1785617ba3b/build-test/tests/qemu/test1.img", {st_mode=S_IFREG|0664, st_size=104857600, ...}) = 0
12084 open("/tmp/whenjobs393242c33db7a83e72d4a1785617ba3b/build-test/tests/qemu/test1.img", O_RDONLY|O_CLOEXEC) = 11
12084 pipe2([12, 13], O_CLOEXEC)        = 0
12084 fcntl(12, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
12084 fcntl(13, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
12084 write(10, "\1\0\0\0\0\0\0\0", 8)  = 8
12084 fstatfs(11, {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=12901535, f_bfree=8495659, f_bavail=8364609, f_files=3276800, f_ffree=2945651, f_fsid={-1192703646, 1551807752}, f_namelen=255, f_frsize=4096}) = 0
12084 lseek(11, 0, SEEK_END)            = 104857600
12084 lseek(11, 0, SEEK_END)            = 104857600
12084 lseek(11, 0, SEEK_END)            = 104857600
12084 rt_sigprocmask(SIG_BLOCK, NULL, [BUS USR1 ALRM IO], 8) = 0
12084 mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f909db99000
12084 rt_sigprocmask(SIG_SETMASK, [BUS USR1 ALRM IO], [BUS USR1 ALRM IO], 8) = 0
12084 close(11)                         = 0
12084 open("/home/rjones/d/libguestfs/tmp/wOItsTBzJj/vl.fhR6na", O_RDWR|O_CREAT|O_EXCL, 0600) = -1 ENOENT (No such file or directory)
12084 write(2, "qemu-kvm:", 9)          = 9
12084 write(2, " -drive", 7)            = 7
12084 write(2, " file=/tmp/whenjobs393242c33db7a83e72d4a1785617ba3b/build-test/tests/qemu/test1.img,if=none,id=drive-scsi0-0-0-0,format=raw", 123) = 123
12084 write(2, ": ", 2)                 = 2
12084 write(2, "could not open disk image /tmp/whenjobs393242c33db7a83e72d4a1785617ba3b/build-test/tests/qemu/test1.img: No such file or directory", 130) = 130
12084 write(2, "\n", 1)                 = 1


The error message is in fact incredibly misleading. The file /tmp/whenjobs393242c33db7a83e72d4a1785617ba3b/build-test/tests/qemu/test1.img *does* exist because the strace logs show QEMU just successfully opened it several times ! What does not exist is "/home/rjones/d/libguestfs/tmp/wOItsTBzJj/vl.fhR6na"

Comment 10 Daniel Berrangé 2012-09-24 15:58:06 UTC
Further info from the log:

12299 write(21, ": starting up\n", 14)  = 14
12299 write(21, "LC_ALL=C", 8)          = 8
12299 write(21, " ", 1)                 = 1
12299 write(21, "LD_LIBRARY_PATH=/home/rjones/d/libguestfs/src/.libs:/home/rjones/d/libguestfs/gobject/.libs:/home/rjones/d/libguestfs/ruby/ext/guestfs", 134) =
 134
12299 write(21, " ", 1)                 = 1
12299 write(21, "PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjones/.local/bin:/home/rjones/bin", 100) = 100
12299 write(21, " ", 1)                 = 1
12299 write(21, "HOME=/home/rjones", 17) = 17
12299 write(21, " ", 1)                 = 1
12299 write(21, "USER=rjones", 11)      = 11
12299 write(21, " ", 1)                 = 1
12299 write(21, "LOGNAME=rjones", 14)   = 14
12299 write(21, " ", 1)                 = 1
12299 write(21, "TMPDIR=/home/rjones/d/libguestfs/tmp/wOItsTBzJj", 47) = 47
12299 write(21, " ", 1)                 = 1
12299 write(21, "TMPDIR=/tmp/whenjobs393242c33db7a83e72d4a1785617ba3b/build-test/tmp", 67) = 67
12299 write(21, " ", 1)                 = 1


Notice how TMPDIR is set twice, once to the valid inherited from libvirtd, and once to the value libguestfs specified in the XML via qemu:env custom args.

The one that's being used is the former, while it needs to be the latter.

Now obviously this is an example of the kind of things that go wrong with using <qemu:env> and why this is unsupported in general. We could, however, deal with this particular case better.

src/util/command.c just appends env vars to the end of the array. It could first check whether the named var already exists, and if so, replace its value. This would let apps override TMPDIR (and other vars libvirt sets by default for QEMU).

Comment 11 Richard W.M. Jones 2012-09-24 18:56:33 UTC
Posted patches:
https://www.redhat.com/archives/libvir-list/2012-September/msg01624.html

Comment 13 Richard W.M. Jones 2012-09-24 20:08:55 UTC
The (or another) error I see now is:
libvir:  error : libvirtd quit during handshake: Input/output error

Comment 14 Richard W.M. Jones 2012-09-25 11:10:49 UTC
(In reply to comment #13)
> The (or another) error I see now is:
> libvir:  error : libvirtd quit during handshake: Input/output error

This is caused by SELinux policy problems: bug 860235

The environment variable fixes upstream, plus SELinux
policy changes, are both needed to fix this bug.
The env changes are included in libvirt >= 0.10.2-3.

Comment 15 Richard W.M. Jones 2012-10-09 12:13:55 UTC
I still don't think the fix for this is correct.  I
still see occasional (but not always) failure like this:

/tmp/whenjobs4dd17406bcb80d1544f4dc7708c03ebc/libguestfs-1.19.49/run --test ./test-launch-race.pl
could not create appliance through libvirt: unable to set security context 'unconfined_u:object_r:svirt_image_t:s0:c114,c690' on '/tmp/whenjobs4dd17406bcb80d1544f4dc7708c03ebc/libguestfs-1.19.49/tmp/M9ZRm53mhB/libguestfs56NJGt/console.sock': No such file or directory [code=38 domain=24] at /tmp/whenjobs4dd17406bcb80d1544f4dc7708c03ebc/libguestfs-1.19.49/tests/protocol/test-launch-race.pl line 41.
could not create appliance through libvirt: unable to set security context 'unconfined_u:object_r:svirt_image_t:s0:c165,c590' on '/tmp/whenjobs4dd17406bcb80d1544f4dc7708c03ebc/libguestfs-1.19.49/tmp/M9ZRm53mhB/libguestfsnPrxFt/console.sock': No such file or directory [code=38 domain=24] at /tmp/whenjobs4dd17406bcb80d1544f4dc7708c03ebc/libguestfs-1.19.49/tests/protocol/test-launch-race.pl line 47.
/tmp/whenjobs4dd17406bcb80d1544f4dc7708c03ebc/libguestfs-1.19.49/run: command failed with exit code 2
FAIL: test-launch-race.pl

where on further inspection of libvirtd.log, the paths from
two different instances of the test suite seem to have got
mixed up.

Comment 16 Richard W.M. Jones 2012-10-09 12:18:08 UTC
I think I'll revise comment 15.  I'm going to open another bug
for what I think may be a different issue.


Note You need to log in before you can comment on or make changes to this bug.