Created attachment 562013 [details] factory log Description of problem: 2012-02-14 13:39:40,849 DEBUG imgfac.builders.BaseBuilder.RHEL6_vsphere_Builder thread(c1c0c696) Message: Exception caught in ImageFactory 2012-02-14 13:39:40,850 DEBUG imgfac.builders.BaseBuilder.RHEL6_vsphere_Builder thread(c1c0c696) Message: Traceback (most recent call last): File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_vsphere_Builder.py", line 49, in build_image self.build_upload(build_id) File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_vsphere_Builder.py", line 115, in build_upload self.output_descriptor = self.guest.customize_and_generate_icicle(libvirt_xml) File "/usr/lib/python2.6/site-packages/oz/RedHat.py", line 1178, in customize_and_generate_icicle return self._internal_customize(libvirt_xml, True) File "/usr/lib/python2.6/site-packages/oz/RedHat.py", line 1143, in _internal_customize self._collect_setup(modified_xml) File "/usr/lib/python2.6/site-packages/oz/RedHat.py", line 427, in _collect_setup g_handle = self._guestfs_handle_setup(libvirt_xml) File "/usr/lib/python2.6/site-packages/oz/Guest.py", line 882, in _guestfs_handle_setup g.launch() File "/usr/lib/python2.6/site-packages/guestfs.py", line 152, in launch return libguestfsmod.launch (self._o) RuntimeError: link: /tmp/.guestfs-0/kernel /tmp/.guestfs-0/kernel.10139: File exists Attaching template and factory log
Created attachment 562014 [details] template template
Created attachment 562018 [details] aeolus-debug logs
[root@hp-sl2x170zg6-01 ~]# rpm -qa | grep aeolus aeolus-conductor-daemons-0.8.0-26.el6.noarch aeolus-conductor-0.8.0-26.el6.noarch aeolus-conductor-doc-0.8.0-26.el6.noarch rubygem-aeolus-cli-0.3.0-8.el6.noarch aeolus-configure-2.5.0-13.el6.noarch aeolus-all-0.8.0-26.el6.noarch rubygem-aeolus-image-0.3.0-7.el6.noarch imagefactory-1.0.0rc4_2_gf16dfc5-1.el6.noarch rubygem-imagefactory-console-0.4.0-1.el6.noarch imagefactory-jeosconf-ec2-fedora-1.0.0rc4_2_gf16dfc5-1.el6.noarch imagefactory-jeosconf-ec2-rhel-1.0.0rc4_2_gf16dfc5-1.el6.noarch
FYI.. I ran a rebuild via the webui and I do *not* see the same failure in the logs. I will continue to poke at this..
So, this error seems to originate in the code that creates symlinks for launching the appliance VM that libguestfs uses to do its job. The relevant code is here: https://github.com/libguestfs/libguestfs/blob/master/src/appliance.c The comments at the top make it sound as if this activity _should_ be thread/concurrency safe. However, maybe not. Guestfs is using a read lock on the checksum file to avoid walking on itself when creating the links. Oz did something similar to avoid having two concurrent Oz instances download the same ISO file into the same canonical location. The issue with Oz, which may come into play here, is that we (Factory) are a single process multi-threaded application and multiple threads of the same process seem to be able to acquire a lock on the same file at the same time without error. So, perhaps we have two threads doing g.launch() at the same time. Both libguestfs instances try, and succeeded, in acquiring a read lock on the checksum file (as they are both part of the same process), one creates the kernel.$PID symlink first and the second one fails because the link is already there. Am going to ask Mr. Jones to weigh in on my largely uninformed speculation above to see if there's any sense in it. If this is indeed the case we may be able to work around the race in the short term by launching and then stopping a dummy libguestfs instance during Factory startup, before we start the server and "go multithreaded".
Thanks for the analysis. It does indeed sound like this is a bug in libguestfs's handling of locks. I'm going to clone this bug for libguestfs upstream.
Patch posted for the upstream libguestfs bug here: https://www.redhat.com/archives/libguestfs/2012-February/msg00068.html
Workaround commit here: https://github.com/aeolusproject/imagefactory/commit/d75d7dae33efdbd1e400f88c2f8d2a319a324541 I'll brew this for the puddle being formed today. It will be rc6 (and will include one other fix).
Moving this to POST since a patch exists. Ian, please move to MODIFIED when the build goes into Brew.
Patch brewed as part of imagefactory-1.0.0rc6-1
so far I'm not able to reproduce this issue with packages: iwhd-1.2-3.el6.x86_64 rubygem-imagefactory-console-0.5.0-4.20110824113238gitd9debef.el6.noarch imagefactory-jeosconf-ec2-fedora-1.0.0rc6-1.el6.noarch rubygem-aeolus-cli-0.4.0-0.20120118121626git6fddd65.el6.noarch imagefactory-1.0.0rc6-1.el6.noarch aeolus-conductor-0.9.0-0.20120118181603git71cd8bc.el6.noarch oz-0.8.0-5.el6.noarch aeolus-all-0.9.0-0.20120118181603git71cd8bc.el6.noarch rubygem-arel-2.0.10-0.aeolus.el6.noarch aeolus-conductor-doc-0.9.0-0.20120118181603git71cd8bc.el6.noarch imagefactory-jeosconf-ec2-rhel-1.0.0rc6-1.el6.noarch aeolus-conductor-daemons-0.9.0-0.20120118181603git71cd8bc.el6.noarch rubygem-aeolus-image-0.4.0-0.20120118121635git0d31a37.el6.noarch aeolus-configure-2.6.0-0.20120118121620gita996371.el6.noarch
So I just hit this running the following... [root@hp-sl390s-01 ~]# rpm -qa | grep imagefactory imagefactory-1.0.0rc8-1.el6.noarch rubygem-imagefactory-console-0.4.0-1.el6.noarch imagefactory-jeosconf-ec2-rhel-1.0.0rc8-1.el6.noarch imagefactory-jeosconf-ec2-fedora-1.0.0rc8-1.el6.noarch [root@hp-sl390s-01 ~]# rpm -qa | grep libguest libguestfs-1.7.17-26.el6.x86_64 python-libguestfs-1.7.17-26.el6.x86_64 Reopening for at a minimum, a review... it wasn't clear to me what release the fix to libguestfs was going in
(In reply to comment #14) > So I just hit this running the following... > > [root@hp-sl390s-01 ~]# rpm -qa | grep imagefactory > imagefactory-1.0.0rc8-1.el6.noarch > rubygem-imagefactory-console-0.4.0-1.el6.noarch > imagefactory-jeosconf-ec2-rhel-1.0.0rc8-1.el6.noarch > imagefactory-jeosconf-ec2-fedora-1.0.0rc8-1.el6.noarch > [root@hp-sl390s-01 ~]# rpm -qa | grep libguest > libguestfs-1.7.17-26.el6.x86_64 > python-libguestfs-1.7.17-26.el6.x86_64 > > > Reopening for at a minimum, a review... it wasn't clear to me what release the > fix to libguestfs was going in libguestfs 1.7.17-26 is known to have this bug. Please try libguestfs >= 1.16.8-1 from: http://people.redhat.com/~rjones/libguestfs-RHEL-6.3-preview/
Dave, That Factory package does indeed contain the workaround mentioned in Comment 9. Richard, All I'm doing to avoid this issue is to create a single libguestfs instance before our app ever launches any worker threads. https://github.com/aeolusproject/imagefactory/commit/d75d7dae33efdbd1e400f88c2f8d2a319a324541 As expected, this seems to create the process-local set of links for the mini-root. I actually can't see how this can still be happening. Any thoughts? (I understand that the preferred solution is the libguestfs package update. We put in this workaround mainly to save the effort of pressing the package update into the 6.2 update stream, which is our beta release target platform.)
I had a think about this last night, and I'm not so sure that the workaround (comment 16) is correct. Unfortunately even though the links are created, they will still be recreated, racily, every time a handle is launched. Can you put a mutex around all calls to g.launch? Something like: # global variable glk = threading.Lock() #... glk.acquire() try: g.launch() finally: glk.release() This is essentially what the upstream change to libguestfs does.
Yup. We can mutex launch calls. I'll put something together this afternoon.
This looks like a blocker to ansmith and I, set blocker?
Revised workaround is here: https://github.com/aeolusproject/imagefactory/commit/7c469ee8e6301cffd65599364d948ee94fcc217b And is brewed as 1.0.0rc9
Haven't reproduced the issue with versions: #rpm -qa | grep 'aeolus\|imagefactory-\|oz-\|iwhd' rubygem-aeolus-cli-0.3.0-14.el6.noarch aeolus-conductor-doc-0.8.0-41.el6.noarch rubygem-aeolus-image-0.3.0-12.el6.noarch imagefactory-jeosconf-ec2-rhel-1.0.0rc9-1.el6.noarch aeolus-conductor-0.8.0-41.el6.noarch aeolus-configure-2.5.0-18.el6.noarch imagefactory-jeosconf-ec2-fedora-1.0.0rc9-1.el6.noarch iwhd-1.2-3.el6.x86_64 oz-0.8.0-5.el6.noarch aeolus-all-0.8.0-41.el6.noarch imagefactory-1.0.0rc9-1.el6.noarch aeolus-conductor-daemons-0.8.0-41.el6.noarch Moving bug to VERIFIED.
*** Bug 801925 has been marked as a duplicate of this bug. ***
https://bugzilla.redhat.com/show_bug.cgi?id=801907 fyi.. recreate in ^
This should be MODIFIED as imcleod claims it's fixed in imagefactory-1.0.0rc9-1.el6.noarch which is brewed and slated for the beta 5 puddle.
(In reply to comment #23) > https://bugzilla.redhat.com/show_bug.cgi?id=801907 > > fyi.. recreate in ^ I don't believe it was with version imagefactory-1.0.0rc9-1.el6.noarch, but some older one...so I believe bug can be in VERIFIED status since I have successfully tested version rc9.
Switching to ON_QA. Martin, on the basis of your last comment, can you then switch it to VERIFIED?
(In reply to comment #26) > Switching to ON_QA. > > Martin, on the basis of your last comment, can you then switch it to VERIFIED? yes, from the bug https://bugzilla.redhat.com/show_bug.cgi?id=801907 I see it was recreated with version rc8. rc9 has been tested, I mean I wasn't able to reproduce it with this version
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2012-0588.html