Bug 472122 - Long running xen hosts can eat up all 32000 xenbl* files
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.2
Hardware: x86_64 Linux
Priority: medium  Severity: low
Target Milestone: rc
Target Release: 5.6
Assigned To: Xen Maintainance List
QA Contact: Virtualization Bugs
Docs Contact:
Depends On:
Blocks: 514500
Reported: 2008-11-18 15:39 EST by Greg Blomquist
Modified: 2010-11-09 08:08 EST
CC List: 4 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-06-25 05:23:52 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use): None
Description Greg Blomquist 2008-11-18 15:39:52 EST
Description of problem:

We have a xen host running RHEL5.2 with up to 7 xen guests.  The host has been running for several months, and the VMs are typically running for several days at a time, perhaps as long as a month.

We tried stopping and starting a single xen guest and it wouldn't start again.  No errors or output reported to the command line; the `xm create` command simply didn't respond.

After a day of debugging, someone noticed that there were tens of thousands of "xenbl*" files in /var/lib/xen.  We cleared out those files and restarted xend.  Now we are able to stop and start VMs without issue.

I see this block in our /var/log/xen/xend.log file:

[2008-11-17 12:07:50 xend.XendDomainInfo 15074] INFO (XendDomainInfo:234) Recreating domain 8, UUID 4bd90068-c83b-e964-7323-e7eeae01a0b2.
[2008-11-17 12:07:50 xend 15074] ERROR (XendDomain:221) Failed to recreate information for domain 8.  Destroying it in the hope of recovery.
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomain.py", line 215, in refresh
    self._add_domain(
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 258, in recreate
    vm = XendDomainInfo(xeninfo, domid, dompath, True, priv)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 468, in __init__
    self.validateInfo()
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 618, in validateInfo
    raise VmError('Invalid memory size')
VmError: Invalid memory size

The xen guest in question seemed to have correct memory settings for maxmem and memory.  But, it's possible that this error was leaving behind the "xenbl*" files in /var/lib/xen/.

Based on https://bugzilla.redhat.com/show_bug.cgi?id=182328, it looks like code was added to deal with errors that resulted in dangling xenbl files.  But the "finally" block suggested by the reporter was apparently never integrated into the XendBootloader code.  A finally block is definitely in order for that code.
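A minimal sketch of the kind of try/finally cleanup being suggested here. This is illustrative only, not the actual XendBootloader code: the function name, the use of a plain scratch file instead of a real fifo, and the simulated failure are all assumptions for the example.

```python
import os
import tempfile


def run_bootloader(workdir, fail=False):
    """Create a scratch 'xenbl'-style file, run a (simulated) bootloader
    step, and guarantee the file is removed even if that step raises."""
    fd, path = tempfile.mkstemp(prefix="xenbl.", dir=workdir)
    os.close(fd)
    try:
        if fail:
            # Simulate the bootloader dying partway through, e.g. the
            # VmError('Invalid memory size') seen in the log above.
            raise RuntimeError("bootloader failed")
        return "boot ok"
    finally:
        # The cleanup runs on both the success and the error path,
        # so no xenbl.* file is left behind either way.
        if os.path.exists(path):
            os.unlink(path)
```

Without the finally block, the error path would skip the unlink and leave one xenbl.* file behind per failed start, which is exactly the slow accumulation described above.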


Version-Release number of selected component (if applicable): xen 3.0.3-64


How reproducible:

Two of our hosts have experienced this problem so far.  But, now that we have a workaround (see below), we shouldn't have any serious problems moving forward due to this bug.


Steps to Reproduce:
1.  Run a long-lived xen host with VMs that produce errors on startup

  
Actual results:

The Xen host can exhaust all 32,000 xenbl file names allowed by the bootloader function in XendBootloader.
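The silent hang of `xm create` is consistent with a name-selection loop of the shape sketched below. This is a simplified stand-in for the bootloader's logic of picking an unused xenbl.<n> name from a fixed range, not the actual code: in the real loop there is no retry cap, so once every name in the range exists the loop simply never terminates.

```python
import os
import random


def pick_fifo_name(directory, limit=32000, max_tries=100000):
    """Pick an unused xenbl.<n> name from a fixed range of candidates.

    If every name in the range already exists, the real loop spins
    forever, which looks like an unresponsive `xm create` from the
    command line.  This sketch bails out after max_tries instead."""
    for _ in range(max_tries):
        candidate = os.path.join(directory,
                                 "xenbl.%d" % random.randint(0, limit))
        if not os.path.exists(candidate):
            return candidate
    raise RuntimeError("all %d candidate names are taken" % (limit + 1))
```

With a tiny limit you can see both behaviours: the function returns a free name while any remain, and fails (rather than hanging, as the real code would) once the range is full.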


Expected results:

The bootloader function cleans up all xenbl files even if it errors on startup.


Additional info:

Workaround:

Remove the xenbl* files from /var/lib/xen/ and restart the xend service.
Comment 3 Michal Novotny 2010-03-16 08:47:18 EDT
(In reply to comment #0)

Could you please try using the latest versions of xen and kernel-xen packages?

Thanks,
Michal
Comment 4 Michal Novotny 2010-06-24 09:34:49 EDT
Well, I did investigate this in the XendBootloader.py code, and there is a working os.unlink(fifo) call, so the /var/lib/xen/xenbl.* files no longer accumulate.  I tried both booting the PV guest properly (i.e., with the right files set up for pygrub) and simulating a boot failure (by pointing pygrub at a wrong file); in both cases the /var/lib/xen/xenbl.* files were created and then deleted/unlinked.

Greg, could you please try the xen packages available in RHEL-5.5?  I was unable to see the problem there, since those files are deleted automatically.

Michal
Comment 5 Greg Blomquist 2010-06-24 22:05:41 EDT
Hi Michal,

Sorry I didn't get back to you sooner.

To be honest, we're not really running Xen on many hosts anymore.  Those that do run Xen are older RHEL5 boxes that aren't really accessible to me to be upgraded.

Pretty much everything we're running is KVM-based.

Honestly, if no one else is running into this problem and you've verified that it's not an issue, I'd mark this as closed.

Sorry I couldn't be of more help.  :(
Comment 6 Michal Novotny 2010-06-25 05:23:52 EDT
I've tried to reproduce this with the RHEL-5.5 packages and was unable to, so I'm closing it as CURRENTRELEASE.

Michal
