Bug 859596
Summary: | libvirtd doesn't let you overwrite some environment variables (eg $TMPDIR) using <qemu:env> | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | [Community] Virtualization Tools | Reporter: | Richard W.M. Jones <rjones> | ||||||||
Component: | libvirt | Assignee: | Libvirt Maintainers <libvirt-maint> | ||||||||
Status: | CLOSED UPSTREAM | QA Contact: | |||||||||
Severity: | unspecified | Docs Contact: | |||||||||
Priority: | unspecified | ||||||||||
Version: | unspecified | CC: | berrange, crobinso, dyasny, mbooth | ||||||||
Target Milestone: | --- | ||||||||||
Target Release: | --- | ||||||||||
Hardware: | Unspecified | ||||||||||
OS: | Unspecified | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | Environment: | ||||||||||
Last Closed: | 2013-06-24 16:08:17 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Attachments: |
|
Here is the Perl test: https://github.com/libguestfs/libguestfs/blob/8acdc23622ea95a233ac831d67c143517aa94bba/tests/protocol/test-launch-race.pl Looking at this, I'm not going to exclude the possibility that Perl File::Temp itself is deleting the temporary directory, if one process exits significantly before the other process. I think the call to _exit (line 42) means that the tempdir is not being cleaned up early. Interestingly if I make sure that libvirtd is started up and running before starting the test, then I can run many iterations of the test no problem. The bug seems to happen (or happens most reliably) when there is no libvirtd running already. Perhaps both processes start separate libvirtd daemons? $ killall libvirtd lt-libvirtd libvirtd(8138): Operation not permitted lt-libvirtd: no process found $ make -C tests/protocol check TESTS=test-launch-race.pl make: Entering directory `/home/rjones/d/libguestfs/tests/protocol' make check-TESTS make[1]: Entering directory `/home/rjones/d/libguestfs/tests/protocol' /home/rjones/d/libguestfs/run --test ./test-launch-race.pl could not create appliance through libvirt: internal error process exited while connecting to monitor: 2012-09-22 09:35:14.863+0000: 7237: debug : virFileClose:72 : Closed fd 24 2012-09-22 09:35:14.863+0000: 7237: debug : virFileClose:72 : Closed fd 29 2012-09-22 09:35:14.869+0000: 7237: debug : virFileClose:72 : Closed fd 3 [code=1 domain=10] at /home/rjones/d/libguestfs/tests/protocol/test-launch-race.pl line 47. cannot remove directory for /home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/.guestfs-1000: Directory not empty at /usr/share/perl5/File/Temp.pm line 902. cannot remove directory for /home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z: Directory not empty at /usr/share/perl5/File/Temp.pm line 902. /home/rjones/d/libguestfs/run: command failed with exit code 2 FAIL: test-launch-race.pl ================== 1 of 1 test failed ================== make[1]: *** [check-TESTS] Error 1 make[1]: Leaving directory `/home/rjones/d/libguestfs/tests/protocol' make: *** [check-am] Error 2 make: Leaving directory `/home/rjones/d/libguestfs/tests/protocol' Created attachment 615720 [details]
test-launch-race-libvirtd-2.log
The libvirtd log from the test in the comment above. This is
probably easier to understand than the original log that
was attached.
There's some strange stuff going on in this log file. Try
grepping the log for 'console.sock'. Here is an edited
extract of that grep with comments:
# The first guest.
<source mode="connect" path="/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfsZJ1eCg/console.sock"/>
/bin/qemu-kvm -name guestfs-49vtnm0nub8bqvob [...] -chardev socket,id=charserial0,path=/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfsZJ1eCg/console.sock
# Why do these messages appear _after_ qemu has started up?
2012-09-22 09:35:14.884+0000: 7073: info : virSecuritySELinuxSetFileconHelper:754 : Setting SELinux context on '/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfsZJ1eCg/console.sock' to 'unconfined_u:object_r:svirt_image_t:s0:c268,c436'
2012-09-22 09:35:15.104+0000: 7073: info : virSecuritySELinuxRestoreSecurityFileLabel:884 : Restoring SELinux context on '/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfsZJ1eCg/console.sock'
# Why is libvirtd setting the context on this socket again?
2012-09-22 09:35:15.114+0000: 7073: info : virSecuritySELinuxSetFileconHelper:754 : Setting SELinux context on '/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfsZJ1eCg/console.sock' to 'unconfined_u:object_r:user_home_t:s0'
# The second guest. Note a slightly different console.sock.
<source mode="connect" path="/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfslC80Cg/console.sock"/>
/bin/qemu-kvm -name guestfs-qoi7f8gqsg9mkqvi [...] -chardev socket,id=charserial0,path=/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfslC80Cg/console.sock
# Why did libvirtd not try to label the second socket at all?
So libvirtd (or multiple libvirtd's?) label one guest socket twice, and the
other not at all.
Full strace here: http://oirase.annexia.org/tmp/strace.log.xz The two test processes are 12159 & 12237. The libvirtd processes are 12268 & 12267. 12267 loses the race to get the PID file, and 12268 becomes the libvirtd. I notice that both test processes correctly connect to the winning libvirtd, so it's not that the multiprocess case in libvirt has been forgotten. It's still something of a mystery why this wouldn't work. Perhaps there's some locking or data separation problem inside libvirtd? (In reply to comment #3) > 2012-09-22 09:35:15.104+0000: 7073: info : > virSecuritySELinuxRestoreSecurityFileLabel:884 : Restoring SELinux context > on '/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfsZJ1eCg/console.sock' > > # Why is libvirtd setting the context on this socket again? > 2012-09-22 09:35:15.114+0000: 7073: info : > virSecuritySELinuxSetFileconHelper:754 : Setting SELinux context on > '/home/rjones/d/libguestfs/tmp/GqXOSuGJ4Z/libguestfsZJ1eCg/console.sock' to > 'unconfined_u:object_r:user_home_t:s0' Looking at this more closely, I see this is the message from console.sock being restored back to user_home_t, thus this is not the problem. Still leaves the question why the second console socket is not touched. Created attachment 615815 [details] libvirtd.log corresponding to comment 4 libvirtd.log corresponding to comment 4, so you can see the strace and libvirtd log from the same run. Changing the component to libvirt. I can reproduce this bug very often when running libguestfs tests in parallel. Ok, so from the QEMU VM log file the last 3 lines are these: 2012-09-22 09:35:14.885+0000: 7238: debug : virFileClose:72 : Closed fd 27 2012-09-22 09:35:14.885+0000: 7238: debug : virCommandHook:2115 : Hook is done 0 2012-09-22 09:35:15.090+0000: shutting down So at 09:35:14.885 libvirt has successfully exec()'d the QEMU process, after doing all SELinux labelling. The way QEMU starts up, should have guaranteed that the socket is open & listening at this point (unless the recent QEMU code has broken this) At 09:35:15.090 libvirt has recorded that QEMU has completed shutdown From the libvirtd.log file at the matching timestamp we see 2012-09-22 09:35:14.888+0000: 7073: debug : virObjectNew:110 : OBJECT_NEW: obj=0x7f32900c6df0 classname=qemuMonitor 2012-09-22 09:35:15.089+0000: 7073: error : qemuMonitorOpenUnix:293 : failed to connect to monitor socket: No such process So libvirtd was unable to connect to the QEMU monitor. Either the QEMU socket has not been created (unlikely, since QEMU startup sequence should guarantee this can't happen), or is an indication that QEMU has exited. I would normally vote for the second scenario, but for the fact that QEMU has written absolutely no error messages to the logfile about problems starting up. I'm unclear whether the builder is running with SELinux, but if it is, then conceivably SELinux could be preventing QEMU writing the actual erorr to the logs, and also causing the failure in the first place. Ohh, in the strace file from comment #4, I see these interesting lines, and some error message that QEMU is printing to stderr (and thanks to SELinux, not getting into the log): 12084 open("/tmp/whenjobs393242c33db7a83e72d4a1785617ba3b/build-test/tests/qemu/test1.img", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 11 12084 fstat(11, {st_mode=S_IFREG|0664, st_size=104857600, ...}) = 0 12084 close(11) = 0 12084 open("/tmp/whenjobs393242c33db7a83e72d4a1785617ba3b/build-test/tests/qemu/test1.img", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 11 12084 fstat(11, {st_mode=S_IFREG|0664, st_size=104857600, ...}) = 0 12084 close(11) = 0 12084 stat("/tmp/whenjobs393242c33db7a83e72d4a1785617ba3b/build-test/tests/qemu/test1.img", {st_mode=S_IFREG|0664, st_size=104857600, ...}) = 0 12084 open("/tmp/whenjobs393242c33db7a83e72d4a1785617ba3b/build-test/tests/qemu/test1.img", O_RDONLY|O_CLOEXEC) = 11 12084 pipe2([12, 13], O_CLOEXEC) = 0 12084 fcntl(12, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 12084 fcntl(13, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 12084 write(10, "\1\0\0\0\0\0\0\0", 8) = 8 12084 fstatfs(11, {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=12901535, f_bfree=8495659, f_bavail=8364609, f_files=3276800, f_ffree=2945651, f_fsid={-1192703646, 1551807752}, f_namelen=255, f_frsize=4096}) = 0 12084 lseek(11, 0, SEEK_END) = 104857600 12084 lseek(11, 0, SEEK_END) = 104857600 12084 lseek(11, 0, SEEK_END) = 104857600 12084 rt_sigprocmask(SIG_BLOCK, NULL, [BUS USR1 ALRM IO], 8) = 0 12084 mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f909db99000 12084 rt_sigprocmask(SIG_SETMASK, [BUS USR1 ALRM IO], [BUS USR1 ALRM IO], 8) = 0 12084 close(11) = 0 12084 open("/home/rjones/d/libguestfs/tmp/wOItsTBzJj/vl.fhR6na", O_RDWR|O_CREAT|O_EXCL, 0600) = -1 ENOENT (No such file or directory) 12084 write(2, "qemu-kvm:", 9) = 9 12084 write(2, " -drive", 7) = 7 12084 write(2, " file=/tmp/whenjobs393242c33db7a83e72d4a1785617ba3b/build-test/tests/qemu/test1.img,if=none,id=drive-scsi0-0-0-0,format=raw", 123) = 123 12084 write(2, ": ", 2) = 2 12084 write(2, "could not open disk image /tmp/whenjobs393242c33db7a83e72d4a1785617ba3b/build-test/tests/qemu/test1.img: No such file or directory", 130) = 130 12084 write(2, "\n", 1) = 1 The error message is in fact incredibly misleading. The file /tmp/whenjobs393242c33db7a83e72d4a1785617ba3b/build-test/tests/qemu/test1.img *does* exist because the strace logs show QEMU just successfully opened it several times ! What does not exist is "/home/rjones/d/libguestfs/tmp/wOItsTBzJj/vl.fhR6na" Further info from the log: 12299 write(21, ": starting up\n", 14) = 14 12299 write(21, "LC_ALL=C", 8) = 8 12299 write(21, " ", 1) = 1 12299 write(21, "LD_LIBRARY_PATH=/home/rjones/d/libguestfs/src/.libs:/home/rjones/d/libguestfs/gobject/.libs:/home/rjones/d/libguestfs/ruby/ext/guestfs", 134) = 134 12299 write(21, " ", 1) = 1 12299 write(21, "PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjones/.local/bin:/home/rjones/bin", 100) = 100 12299 write(21, " ", 1) = 1 12299 write(21, "HOME=/home/rjones", 17) = 17 12299 write(21, " ", 1) = 1 12299 write(21, "USER=rjones", 11) = 11 12299 write(21, " ", 1) = 1 12299 write(21, "LOGNAME=rjones", 14) = 14 12299 write(21, " ", 1) = 1 12299 write(21, "TMPDIR=/home/rjones/d/libguestfs/tmp/wOItsTBzJj", 47) = 47 12299 write(21, " ", 1) = 1 12299 write(21, "TMPDIR=/tmp/whenjobs393242c33db7a83e72d4a1785617ba3b/build-test/tmp", 67) = 67 12299 write(21, " ", 1) = 1 Notice how TMPDIR is set twice, once to the valid inherited from libvirtd, and once to the value libguestfs specified in the XML via qemu:env custom args. The one that's being used is the former, while it needs to be the latter. Now obviously this is an example of the kind of things that go wrong with using <qemu:env> and why this is unsupported in general. We could, however, deal with this particular case better. src/util/command.c just appends env vars to the end of the array. It could first check whether the named var already exists, and if so, replace its value. This would let apps override TMPDIR (and other vars libvirt sets by default for QEMU). Patches went upstream: http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=f644361b1eeb78fd59be4cd7ec85567bbf300506 http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=2b32735af480055e27400068d27364d521071117 However this does NOT fix the current bug. The (or another) error I see now is: libvir: error : libvirtd quit during handshake: Input/output error (In reply to comment #13) > The (or another) error I see now is: > libvir: error : libvirtd quit during handshake: Input/output error This is caused by SELinux policy problems: bug 860235 The environment variable fixes upstream, plus SELinux policy changes, are both needed to fix this bug. The env changes are included in libvirt >= 0.10.2-3. I still don't think the fix for this is correct. I still see occasional (but not always) failure like this: /tmp/whenjobs4dd17406bcb80d1544f4dc7708c03ebc/libguestfs-1.19.49/run --test ./test-launch-race.pl could not create appliance through libvirt: unable to set security context 'unconfined_u:object_r:svirt_image_t:s0:c114,c690' on '/tmp/whenjobs4dd17406bcb80d1544f4dc7708c03ebc/libguestfs-1.19.49/tmp/M9ZRm53mhB/libguestfs56NJGt/console.sock': No such file or directory [code=38 domain=24] at /tmp/whenjobs4dd17406bcb80d1544f4dc7708c03ebc/libguestfs-1.19.49/tests/protocol/test-launch-race.pl line 41. could not create appliance through libvirt: unable to set security context 'unconfined_u:object_r:svirt_image_t:s0:c165,c590' on '/tmp/whenjobs4dd17406bcb80d1544f4dc7708c03ebc/libguestfs-1.19.49/tmp/M9ZRm53mhB/libguestfsnPrxFt/console.sock': No such file or directory [code=38 domain=24] at /tmp/whenjobs4dd17406bcb80d1544f4dc7708c03ebc/libguestfs-1.19.49/tests/protocol/test-launch-race.pl line 47. /tmp/whenjobs4dd17406bcb80d1544f4dc7708c03ebc/libguestfs-1.19.49/run: command failed with exit code 2 FAIL: test-launch-race.pl where on further inspection of libvirtd.log, the paths from two different instances of the test suite seem to have got mixed up. I think I'll revise comment 15. I'm going to open another bug for what I think may be a different issue. |
Created attachment 615718 [details] test-launch-race-libvirtd.log Description of problem: Sometimes, but certainly not all the time, test-launch-race.pl fails when tested with the libvirt backend. The error message is very unexpected: /tmp/whenjobs89b451727722b7fedf12c26aeba48fa1/libguestfs-1.19.43/run --test ./test-launch-race.pl could not create appliance through libvirt: unable to set security context 'unconfined_u:object_r:svirt_image_t:s0:c498,c594' on '/tmp/whenjobs89b451727722b7fedf12c26aeba48fa1/libguestfs-1.19.43/tmp/pyrdBdaqbA/libguestfs4Hj3ae/console.sock': No such file or directory [code=38 domain=24] at /tmp/whenjobs89b451727722b7fedf12c26aeba48fa1/libguestfs-1.19.43/tests/protocol/test-launch-race.pl line 41. could not create appliance through libvirt: unable to set security context 'unconfined_u:object_r:svirt_image_t:s0:c323,c513' on '/tmp/whenjobs89b451727722b7fedf12c26aeba48fa1/libguestfs-1.19.43/tmp/pyrdBdaqbA/libguestfs4LsW9d/console.sock': No such file or directory [code=38 domain=24] at /tmp/whenjobs89b451727722b7fedf12c26aeba48fa1/libguestfs-1.19.43/tests/protocol/test-launch-race.pl line 47. /tmp/whenjobs89b451727722b7fedf12c26aeba48fa1/libguestfs-1.19.43/run: command failed with exit code 2 FAIL: test-launch-race.pl The full libvirtd logfile is attached. Notes: (1) This is a parallel test. Two handles are created in separate processes. (2) Libguestfs should be creating console.sock. It's not clear if the socket really doesn't exist (eg. libguestfs creates it in the wrong place) or if libvirt cannot find it or if something deletes it. Version-Release number of selected component (if applicable): libguestfs 1.19.43 How reproducible: ? Not always, but often. Steps to Reproduce: 1. make -C tests/protocol check TESTS=test-launch-race.pl