Bug 472122 - Long running xen hosts can eat up all 32000 xenbl* files
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.2
Hardware: x86_64 Linux
Priority: medium  Severity: low
Target Milestone: rc
Target Release: 5.6
Assigned To: Xen Maintainance List
QA Contact: Virtualization Bugs
Docs Contact:
Depends On:
Blocks: 514500
Reported: 2008-11-18 15:39 EST by Greg Blomquist
Modified: 2010-11-09 08:08 EST
CC List: 4 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-06-25 05:23:52 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use): None
Description Greg Blomquist 2008-11-18 15:39:52 EST
Description of problem:

We have a xen host running RHEL5.2 with up to 7 xen guests.  The host has been running for several months, and the VMs are typically running for several days at a time, perhaps as long as a month.

We tried stopping and starting a single xen guest and it wouldn't start again.  No errors or output reported to the command line; the `xm create` command simply didn't respond.

After a day of debugging, someone noticed that there were tens of thousands of "xenbl*" files in /var/lib/xen.  We cleared out those files and restarted xend.  Now we are able to stop and start VMs without issue.

I see this block in our /var/log/xen/xend.log file:

[2008-11-17 12:07:50 xend.XendDomainInfo 15074] INFO (XendDomainInfo:234) Recreating domain 8, UUID 4bd90068-c83b-e964-7323-e7eeae01a0b2.
[2008-11-17 12:07:50 xend 15074] ERROR (XendDomain:221) Failed to recreate information for domain 8.  Destroying it in the hope of recovery.
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomain.py", line 215, in refresh
    self._add_domain(
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 258, in recreate
    vm = XendDomainInfo(xeninfo, domid, dompath, True, priv)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 468, in __init__
    self.validateInfo()
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 618, in validateInfo
    raise VmError('Invalid memory size')
VmError: Invalid memory size

The xen guest in question seemed to have correct memory settings for maxmem and memory.  But, it's possible that this error was leaving behind the "xenbl*" files in /var/lib/xen/.

Based on https://bugzilla.redhat.com/show_bug.cgi?id=182328, it looks like code was added to deal with errors that resulted in dangling xenbl files.  But the "finally" block suggested by the reporter was apparently never integrated into the XendBootloader code.  A finally block is definitely in order for that code.
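A minimal sketch of the kind of try/finally cleanup being suggested here. This is illustrative only, not the actual XendBootloader code: the function name, the use of a plain scratch file instead of a real fifo, and the simulated failure are all assumptions for the example.

```python
import os
import tempfile


def run_bootloader(workdir, fail=False):
    """Create a scratch 'xenbl'-style file, run a (simulated) bootloader
    step, and guarantee the file is removed even if that step raises."""
    fd, path = tempfile.mkstemp(prefix="xenbl.", dir=workdir)
    os.close(fd)
    try:
        if fail:
            # Simulate the bootloader dying partway through, e.g. the
            # VmError('Invalid memory size') seen in the log above.
            raise RuntimeError("bootloader failed")
        return "boot ok"
    finally:
        # The cleanup runs on both the success and the error path,
        # so no xenbl.* file is left behind either way.
        if os.path.exists(path):
            os.unlink(path)
```

Without the finally block, the error path would skip the unlink and leave one xenbl.* file behind per failed start, which is exactly the slow accumulation described above.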


Version-Release number of selected component (if applicable): xen 3.0.3-64


How reproducible:

Two of our hosts have experienced this problem so far.  But, now that we have a workaround (see below), we shouldn't have any serious problems moving forward due to this bug.


Steps to Reproduce:
1.  Run a long-lived xen host with VMs that produce errors on startup

  
Actual results:

The Xen host can exhaust all 32,000 xenbl file names allowed by the bootloader function in XendBootloader.
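The silent hang of `xm create` is consistent with a name-selection loop of the shape sketched below. This is a simplified stand-in for the bootloader's logic of picking an unused xenbl.<n> name from a fixed range, not the actual code: in the real loop there is no retry cap, so once every name in the range exists the loop simply never terminates.

```python
import os
import random


def pick_fifo_name(directory, limit=32000, max_tries=100000):
    """Pick an unused xenbl.<n> name from a fixed range of candidates.

    If every name in the range already exists, the real loop spins
    forever, which looks like an unresponsive `xm create` from the
    command line.  This sketch bails out after max_tries instead."""
    for _ in range(max_tries):
        candidate = os.path.join(directory,
                                 "xenbl.%d" % random.randint(0, limit))
        if not os.path.exists(candidate):
            return candidate
    raise RuntimeError("all %d candidate names are taken" % (limit + 1))
```

With a tiny limit you can see both behaviours: the function returns a free name while any remain, and fails (rather than hanging, as the real code would) once the range is full.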


Expected results:

The bootloader function cleans up all xenbl files even if it errors on startup.


Additional info:

Workaround:

Remove the xenbl* files from /var/lib/xen/ and restart the xend service.
Comment 3 Michal Novotny 2010-03-16 08:47:18 EDT
(In reply to comment #0)

Could you please try using the latest versions of xen and kernel-xen packages?

Thanks,
Michal
Comment 4 Michal Novotny 2010-06-24 09:34:49 EDT
Well, I did investigate this in the XendBootloader.py code, and there is a working os.unlink(fifo) call, so the /var/lib/xen/xenbl.* files no longer accumulate.  I tried both booting the PV guest properly (i.e., with the right files set up for pygrub) and simulating a boot failure (by pointing pygrub at a wrong file); in both cases the /var/lib/xen/xenbl.* files were created and then deleted/unlinked.

Greg, could you please try the xen packages available in RHEL-5.5?  I was unable to see the problem there, since those files are deleted automatically.

Michal
Comment 5 Greg Blomquist 2010-06-24 22:05:41 EDT
Hi Michal,

Sorry I didn't get back to you sooner.

To be honest, we're not really running Xen on many hosts anymore.  Those that do run Xen are older RHEL5 boxes that aren't really accessible to me to be upgraded.

Pretty much everything we're running is KVM-based.

Honestly, if no one else is running into this problem and you've verified that it's not an issue, I'd mark this as closed.

Sorry I couldn't be of more help.  :(
Comment 6 Michal Novotny 2010-06-25 05:23:52 EDT
I've tried to reproduce this with the RHEL-5.5 packages and was unable to, so I'm closing it as CURRENTRELEASE.

Michal
