Bug 790528 - multiprovider build error: RuntimeError: link: /tmp/.guestfs-0/kernel /tmp/.guestfs-0/kernel.10139: File exists
Summary: multiprovider build error: RuntimeError: link: /tmp/.guestfs-0/kernel /tmp/.g...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: CloudForms Cloud Engine
Classification: Retired
Component: imagefactory
Version: 1.0.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: beta5
Assignee: Ian McLeod
QA Contact: Martin Kočí
URL:
Whiteboard:
: 801925 (view as bug list)
Depends On: 790721 790958
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-02-14 18:49 UTC by wes hayutin
Modified: 2012-05-15 20:21 UTC (History)
11 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 790721 (view as bug list)
Environment:
Last Closed: 2012-05-15 20:21:58 UTC
Embargoed:


Attachments (Terms of Use)
factory log (34.99 KB, text/plain)
2012-02-14 18:49 UTC, wes hayutin
no flags Details
template (856 bytes, text/plain)
2012-02-14 18:49 UTC, wes hayutin
no flags Details
aeolus-debug logs (362.60 KB, application/x-gzip)
2012-02-14 18:52 UTC, wes hayutin
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2012:0588 0 normal SHIPPED_LIVE new packages: imagefactory 2012-05-15 22:31:27 UTC

Description wes hayutin 2012-02-14 18:49:17 UTC
Created attachment 562013 [details]
factory log

Description of problem:

2012-02-14 13:39:40,849 DEBUG imgfac.builders.BaseBuilder.RHEL6_vsphere_Builder thread(c1c0c696) Message: Exception caught in ImageFactory
2012-02-14 13:39:40,850 DEBUG imgfac.builders.BaseBuilder.RHEL6_vsphere_Builder thread(c1c0c696) Message: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_vsphere_Builder.py", line 49, in build_image
    self.build_upload(build_id)
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_vsphere_Builder.py", line 115, in build_upload
    self.output_descriptor = self.guest.customize_and_generate_icicle(libvirt_xml)
  File "/usr/lib/python2.6/site-packages/oz/RedHat.py", line 1178, in customize_and_generate_icicle
    return self._internal_customize(libvirt_xml, True)
  File "/usr/lib/python2.6/site-packages/oz/RedHat.py", line 1143, in _internal_customize
    self._collect_setup(modified_xml)
  File "/usr/lib/python2.6/site-packages/oz/RedHat.py", line 427, in _collect_setup
    g_handle = self._guestfs_handle_setup(libvirt_xml)
  File "/usr/lib/python2.6/site-packages/oz/Guest.py", line 882, in _guestfs_handle_setup
    g.launch()
  File "/usr/lib/python2.6/site-packages/guestfs.py", line 152, in launch
    return libguestfsmod.launch (self._o)
RuntimeError: link: /tmp/.guestfs-0/kernel /tmp/.guestfs-0/kernel.10139: File exists



Attaching template and factory log

Comment 1 wes hayutin 2012-02-14 18:49:51 UTC
Created attachment 562014 [details]
template

template

Comment 2 wes hayutin 2012-02-14 18:52:03 UTC
Created attachment 562018 [details]
aeolus-debug logs

Comment 3 wes hayutin 2012-02-14 18:53:17 UTC
[root@hp-sl2x170zg6-01 ~]# rpm -qa | grep aeolus
aeolus-conductor-daemons-0.8.0-26.el6.noarch
aeolus-conductor-0.8.0-26.el6.noarch
aeolus-conductor-doc-0.8.0-26.el6.noarch
rubygem-aeolus-cli-0.3.0-8.el6.noarch
aeolus-configure-2.5.0-13.el6.noarch
aeolus-all-0.8.0-26.el6.noarch
rubygem-aeolus-image-0.3.0-7.el6.noarch
imagefactory-1.0.0rc4_2_gf16dfc5-1.el6.noarch
rubygem-imagefactory-console-0.4.0-1.el6.noarch
imagefactory-jeosconf-ec2-fedora-1.0.0rc4_2_gf16dfc5-1.el6.noarch
imagefactory-jeosconf-ec2-rhel-1.0.0rc4_2_gf16dfc5-1.el6.noarch

Comment 4 wes hayutin 2012-02-14 19:44:38 UTC
FYI.. I ran a rebuild via the webui and I do *not* see the same failure in the logs. I will continue to poke at this..

Comment 5 Ian McLeod 2012-02-14 22:39:45 UTC
So, this error seems to originate in the code that creates symlinks for launching the appliance VM that libguestfs uses to do its job.  The relevant code is here:

https://github.com/libguestfs/libguestfs/blob/master/src/appliance.c

The comments at the top make it sound as if this activity _should_ be thread/concurrency safe.

However, maybe not.  

Guestfs is using a read lock on the checksum file to avoid walking on itself when creating the links.  Oz did something similar to avoid having two concurrent Oz instances download the same ISO file into the same canonical location.

The issue with Oz, which may come into play here, is that we (Factory) are a single process multi-threaded application and multiple threads of the same process seem to be able to acquire a lock on the same file at the same time without error.

So, perhaps we have two threads doing g.launch() at the same time.  Both libguestfs instances try, and succeeded, in acquiring a read lock on the checksum file (as they are both part of the same process), one creates the kernel.$PID symlink first and the second one fails because the link is already there.

Am going to ask Mr. Jones to weigh in on my largely uninformed speculation above to see if there's any sense in it.

If this is indeed the case we may be able to work around the race in the short term by launching and then stopping a dummy libguestfs instance during Factory startup, before we start the server and "go multithreaded".

Comment 6 Richard W.M. Jones 2012-02-15 09:19:50 UTC
Thanks for the analysis.  It does indeed sound like
this is a bug in libguestfs's handling of locks.
I'm going to clone this bug for libguestfs upstream.

Comment 7 Richard W.M. Jones 2012-02-15 19:33:55 UTC
Patch posted for the upstream libguestfs bug here:
https://www.redhat.com/archives/libguestfs/2012-February/msg00068.html

Comment 9 Ian McLeod 2012-02-22 16:02:36 UTC
Workaround commit here:

https://github.com/aeolusproject/imagefactory/commit/d75d7dae33efdbd1e400f88c2f8d2a319a324541

I'll brew this for the puddle being formed today.  It will be rc6 (and will include one other fix).

Comment 10 Hugh Brock 2012-02-22 17:59:03 UTC
Moving this to POST since a patch exists. Ian, please move to MODIFIED when the build goes into Brew.

Comment 11 Ian McLeod 2012-02-22 18:20:11 UTC
Patch brewed as part of imagefactory-1.0.0rc6-1

Comment 12 Martin Kočí 2012-02-23 14:17:08 UTC
so far I'm not able to reproduce this issue with packages: 

iwhd-1.2-3.el6.x86_64
rubygem-imagefactory-console-0.5.0-4.20110824113238gitd9debef.el6.noarch
imagefactory-jeosconf-ec2-fedora-1.0.0rc6-1.el6.noarch
rubygem-aeolus-cli-0.4.0-0.20120118121626git6fddd65.el6.noarch
imagefactory-1.0.0rc6-1.el6.noarch
aeolus-conductor-0.9.0-0.20120118181603git71cd8bc.el6.noarch
oz-0.8.0-5.el6.noarch
aeolus-all-0.9.0-0.20120118181603git71cd8bc.el6.noarch
rubygem-arel-2.0.10-0.aeolus.el6.noarch
aeolus-conductor-doc-0.9.0-0.20120118181603git71cd8bc.el6.noarch
imagefactory-jeosconf-ec2-rhel-1.0.0rc6-1.el6.noarch
aeolus-conductor-daemons-0.9.0-0.20120118181603git71cd8bc.el6.noarch
rubygem-aeolus-image-0.4.0-0.20120118121635git0d31a37.el6.noarch
aeolus-configure-2.6.0-0.20120118121620gita996371.el6.noarch

Comment 14 Dave Johnson 2012-03-07 20:58:41 UTC
So I just hit this running the following...

[root@hp-sl390s-01 ~]# rpm -qa | grep imagefactory
imagefactory-1.0.0rc8-1.el6.noarch
rubygem-imagefactory-console-0.4.0-1.el6.noarch
imagefactory-jeosconf-ec2-rhel-1.0.0rc8-1.el6.noarch
imagefactory-jeosconf-ec2-fedora-1.0.0rc8-1.el6.noarch
[root@hp-sl390s-01 ~]# rpm -qa | grep libguest
libguestfs-1.7.17-26.el6.x86_64
python-libguestfs-1.7.17-26.el6.x86_64


Reopening for at a minimum, a review... it wasn't clear to me what release the fix to libguestfs was going in

Comment 15 Richard W.M. Jones 2012-03-07 22:12:40 UTC
(In reply to comment #14)
> So I just hit this running the following...
> 
> [root@hp-sl390s-01 ~]# rpm -qa | grep imagefactory
> imagefactory-1.0.0rc8-1.el6.noarch
> rubygem-imagefactory-console-0.4.0-1.el6.noarch
> imagefactory-jeosconf-ec2-rhel-1.0.0rc8-1.el6.noarch
> imagefactory-jeosconf-ec2-fedora-1.0.0rc8-1.el6.noarch
> [root@hp-sl390s-01 ~]# rpm -qa | grep libguest
> libguestfs-1.7.17-26.el6.x86_64
> python-libguestfs-1.7.17-26.el6.x86_64
> 
> 
> Reopening for at a minimum, a review... it wasn't clear to me what release the
> fix to libguestfs was going in

libguestfs 1.7.17-26 is known to have this bug.
Please try libguestfs >= 1.16.8-1 from:

http://people.redhat.com/~rjones/libguestfs-RHEL-6.3-preview/

Comment 16 Ian McLeod 2012-03-07 22:27:27 UTC
Dave,

That Factory package does indeed contain the workaround mentioned in Comment 9.

Richard,

All I'm doing to avoid this issue is to create a single libguestfs instance before our app ever launches any worker threads.  

https://github.com/aeolusproject/imagefactory/commit/d75d7dae33efdbd1e400f88c2f8d2a319a324541

As expected, this seems to create the process-local set of links for the mini-root.  I actually can't see how this can still be happening.  Any thoughts?

(I understand that the preferred solution is the libguestfs package update.  We put in this workaround mainly to save the effort of pressing the package update into the 6.2 update stream, which is our beta release target platform.)

Comment 17 Richard W.M. Jones 2012-03-08 18:04:11 UTC
I had a think about this last night, and I'm not
so sure that the workaround (comment 16) is correct.
Unfortunately even though the links are created, they
will still be recreated, racily, every time a handle
is launched.

Can you put a mutex around all calls to g.launch?
Something like:

  # global variable
  glk = threading.Lock()
  
  #...
  glk.acquire()
  try:
      g.launch()
  finally:
      glk.release()

This is essentially what the upstream change to
libguestfs does.

Comment 18 Ian McLeod 2012-03-08 21:05:15 UTC
Yup.  We can mutex launch calls.  I'll put something together this afternoon.

Comment 19 Hugh Brock 2012-03-09 19:50:49 UTC
This looks like a blocker to ansmith and I, set blocker?

Comment 20 Ian McLeod 2012-03-09 22:43:26 UTC
Revised workaround is here:

https://github.com/aeolusproject/imagefactory/commit/7c469ee8e6301cffd65599364d948ee94fcc217b

And is brewed as 1.0.0rc9

Comment 21 Martin Kočí 2012-03-12 10:54:16 UTC
Haven't reproduced the issue with versions:

#rpm -qa | grep 'aeolus\|imagefactory-\|oz-\|iwhd'
rubygem-aeolus-cli-0.3.0-14.el6.noarch
aeolus-conductor-doc-0.8.0-41.el6.noarch 
rubygem-aeolus-image-0.3.0-12.el6.noarch 
imagefactory-jeosconf-ec2-rhel-1.0.0rc9-1.el6.noarch 
aeolus-conductor-0.8.0-41.el6.noarch 
aeolus-configure-2.5.0-18.el6.noarch 
imagefactory-jeosconf-ec2-fedora-1.0.0rc9-1.el6.noarch 
iwhd-1.2-3.el6.x86_64 
oz-0.8.0-5.el6.noarch 
aeolus-all-0.8.0-41.el6.noarch 
imagefactory-1.0.0rc9-1.el6.noarch 
aeolus-conductor-daemons-0.8.0-41.el6.noarch

Moving bug to VERIFIED.

Comment 22 Brad P. Crochet 2012-03-12 11:42:20 UTC
*** Bug 801925 has been marked as a duplicate of this bug. ***

Comment 23 wes hayutin 2012-03-12 20:08:51 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=801907

fyi.. recreate in ^

Comment 24 Hugh Brock 2012-03-12 20:26:31 UTC
This should be MODIFIED as imcleod claims it's fixed in imagefactory-1.0.0rc9-1.el6.noarch which is brewed and slated for the beta 5 puddle.

Comment 25 Martin Kočí 2012-03-12 20:35:02 UTC
(In reply to comment #23)
> https://bugzilla.redhat.com/show_bug.cgi?id=801907
> 
> fyi.. recreate in ^

I don't believe it was with version imagefactory-1.0.0rc9-1.el6.noarch, but some older one...so I believe bug can be in VERIFIED status since I have successfully tested version rc9.

Comment 26 Ian McLeod 2012-03-20 16:12:46 UTC
Switching to ON_QA.  

Martin, on the basis of your last comment, can you then switch it to VERIFIED?

Comment 27 Martin Kočí 2012-03-27 12:58:23 UTC
(In reply to comment #26)
> Switching to ON_QA.  
> 
> Martin, on the basis of your last comment, can you then switch it to VERIFIED?

yes, from the bug https://bugzilla.redhat.com/show_bug.cgi?id=801907 I see it was recreated with version rc8. rc9 has been tested, I mean I wasn't able to reproduce it with this version

Comment 28 errata-xmlrpc 2012-05-15 20:21:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2012-0588.html


Note You need to log in before you can comment on or make changes to this bug.