Bug 439182

Summary: [RHEL5.2]: Running "xm destroy" on a domain with PVFB causes a xenstore leak
Product: Red Hat Enterprise Linux 5 Reporter: Chris Lalancette <clalance>
Component: xen Assignee: Michal Novotny <minovotn>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: low Docs Contact:
Priority: low    
Version: 5.2 CC: areis, berrange, cward, gozen, jdenemar, llim, minovotn, tao, xen-maint, yoyzhang
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: xen-3.0.3-85.el5 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-09-02 10:11:00 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 448899, 492190    
Attachments:
Description (Flags)
Configuration file for a RHEL-5 PV guest (none)
Output of "xenstore-ls" right after dom0 boot (none)
Output of "xenstore-ls" right after "xm create" of PV domain (none)
Output of "xenstore-ls" right after "xm destroy" of PV domain (none)
PVFB device detach fix (none)

Description Chris Lalancette 2008-03-27 13:56:02 UTC
Description of problem:
I was poking around xenstore last night, and I ran into what looks like a leak
in the PVFB xenstore entries.  If I create a PV domain with this in the
configuration file:

vfb = [ "type=vnc,vncunused=1" ]

Then it (properly) creates a vfb and a vkbd xenstore entry for those devices in
xenstore.  However, if I then "xm destroy" that same domain, the entries never
go away.  While it's not an immediate problem, it will probably cause xenstore
to slow down over time, and probably leak memory.  I'll attach my domain
configuration, and the output of xenstore-ls right after boot, right after
booting a single PV domain, and right after running "xm destroy" on that same
domain.
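
One rough way to spot the leftover entries is to compare the "xenstore-ls" dump taken after dom0 boot with the one taken after "xm destroy". The sketch below assumes both dumps are available as plain text; the function name and the sample text are hypothetical, and it compares stripped lines only, so it ignores where a line sits in the xenstore tree.

```python
def leftover_entries(before, after, device_classes=("vfb", "vkbd")):
    """Return lines found in `after` but not in `before` that mention a device class.

    `before` and `after` are text dumps (for example, saved "xenstore-ls" output).
    Lines are compared after stripping whitespace, so tree position is ignored.
    """
    seen = set(line.strip() for line in before.splitlines())
    leftovers = []
    for line in after.splitlines():
        stripped = line.strip()
        if not stripped or stripped in seen:
            continue
        if any(cls in stripped for cls in device_classes):
            leftovers.append(stripped)
    return leftovers


# Hypothetical sample text, not taken from the attachments on this bug.
sample_before = 'local = ""\n domain = ""\n  0 = ""\n'
sample_after = sample_before + '   backend = ""\n    vfb = ""\n     1 = ""\n    vkbd = ""\n'

for entry in leftover_entries(sample_before, sample_after):
    print(entry)
```

An empty result after "xm destroy" would suggest that the vfb and vkbd entries were cleaned up.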

Comment 1 Chris Lalancette 2008-03-27 13:57:16 UTC
Created attachment 299317 [details]
Configuration file for a RHEL-5 PV guest

Comment 2 Chris Lalancette 2008-03-27 13:57:44 UTC
Created attachment 299318 [details]
Output of "xenstore-ls" right after dom0 boot

Comment 3 Chris Lalancette 2008-03-27 13:58:12 UTC
Created attachment 299319 [details]
Output of "xenstore-ls" right after "xm create" of PV domain

Comment 4 Chris Lalancette 2008-03-27 13:58:35 UTC
Created attachment 299320 [details]
Output of "xenstore-ls" right after "xm destroy" of PV domain

Comment 5 RHEL Program Management 2008-06-10 10:05:10 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 7 Michal Novotny 2009-04-08 15:00:04 UTC
Well, the problem is that something prevents the detach at the time the device is being detached. There are two entries for each device class (vkbd and vfb): one is located in /local/domain/0/backend/{deviceClass}/{domid} and the other in /vm/{uuid}/device/{deviceClass}.

I did some research and found that deleting /vm/{uuid}/device/{deviceClass} didn't return an error as described in BZ #478868, but the vfb-* and vkbd-* entries still exist in /sys/bus/xen-backend/devices/, as described in BZ #438439. I think the problem is that something prevents the device detach at that time. I have tried adding some PVFB cleanup code there and it works fine. I was unable to find any reference to an upstream changeset, but according to information from kraxel in http://lists.xensource.com/archives/html/xen-devel/2009-03/msg00307.html upstream doesn't have this issue.
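
For reference, the two xenstore locations named above can be spelled out per device class with a small helper. This is only an illustrative sketch: the function name and the example domid and UUID are made up, and it is not the code from the attached patch.

```python
def pvfb_xenstore_paths(domid, uuid, device_classes=("vfb", "vkbd")):
    """Build the two xenstore paths mentioned in this comment for each device class."""
    paths = []
    for cls in device_classes:
        # Backend entry kept under dom0.
        paths.append("/local/domain/0/backend/%s/%s" % (cls, domid))
        # Per-VM device entry.
        paths.append("/vm/%s/device/%s" % (uuid, cls))
    return paths


# Hypothetical domid and UUID, for illustration only.
for path in pvfb_xenstore_paths(5, "00000000-0000-0000-0000-000000000000"):
    print(path)
```

After a clean destroy, none of these paths should still show up in "xenstore-ls".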

Comment 8 Michal Novotny 2009-04-08 15:42:53 UTC
Created attachment 338736 [details]
PVFB device detach fix

Well, this is the patch for it that works for me. Please try it.

Comment 9 Chris Lalancette 2009-04-10 09:44:53 UTC
*** Bug 438439 has been marked as a duplicate of this bug. ***

Comment 10 Chris Lalancette 2009-04-10 09:46:55 UTC
Cool, thanks for the patch, I'll try it out next week.  The thing is, it would be good to understand *why* upstream doesn't have the problem.  Maybe we can leverage the upstream patch, but if not, at least we can understand what the difference is between RHEL-5 and upstream.

Chris Lalancette

Comment 11 Michal Novotny 2009-04-14 07:53:45 UTC
Yeah Chris, I completely agree with you, and that's why I am trying to recompile the upstream version of Xen on my Fedora 10 box (the box I am running on my laptop, it should work, right?), but no luck yet. I need to investigate it on a working upstream version of Xen, but there are some issues when booting the required Xen kernel, so I need to investigate further.

Comment 12 Michal Novotny 2009-04-23 08:54:27 UTC
Well Chris, I've finally tested this on my desktop box (I was unable to get it running on Fedora 10 because my laptop's hardware seems to be too new for the kernel required by upstream Xen). Upstream doesn't have this issue, and the thing is that it has a different codebase. I don't know why exactly, but the problem is that in upstream's codebase the device classes are defined in XendDevices.py, whereas we have them defined in XendDomainInfo.py. Something prevents the detach in our codebase, because when I tested it some time ago the 'vkbd' and 'vfb' entries were there in the 'for' loop as well. Maybe the reason the upstream code for that has been overwritten is that it caused errors in their codebase as well. But the strange thing is that we have a function for releasing devices that's now working, so this was just a workaround/patch to make it work.

Comment 13 Jiri Denemark 2009-05-11 13:40:26 UTC
Fix built into xen-3.0.3-85.el5

Comment 14 Jiri Denemark 2009-05-21 11:47:42 UTC
*** Bug 478868 has been marked as a duplicate of this bug. ***

Comment 16 Issue Tracker 2009-06-09 14:17:37 UTC
Event posted on 06-09-2009 08:01am EDT by Glen Johnson

------- Comment From santwana.samantray.com 2009-06-09 07:53 EDT-------
Hi,

I tried to reproduce this issue in the RHEL5.4-pre Alpha
release (k.v-2.6.18-151.el5xen). It's very hard to reproduce this issue in
general. After many iterations, I was able to reproduce this issue, i.e. the
guest failed to restart after rebooting.
Below is a snippet from /var/log/xen/xend.log:

<snip>
[2009-06-09 21:31:34 xend.XendDomainInfo 5670] ERROR (XendDomainInfo:2555) Failed to restart domain 22.
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 2539, in restart
    for x in prev_vm_xend[0][1]:
TypeError: iteration over non-sequence
</snip>

List of Xen rpm versions:
xen-3.0.3-86.el5
xen-libs-3.0.3-86.el5
kernel-xen-devel-2.6.18-151.el5
kernel-xen-2.6.18-151.el5

Thanks,
Santwana


This event sent from IssueTracker by jkachuck 
 issue 252463

Comment 17 Jiri Denemark 2009-06-09 15:36:07 UTC
Hmm, this error is most likely a different issue. Could you, please, add the following line:
log.debug("/vm/UUID/xend: %s", prev_vm_xend)

above line 2539 in /usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py, restart xend and try to reproduce it again? And attach the whole xend.log next time, please.
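
As a rough illustration of what that debug line helps reveal: "iteration over non-sequence" is what Python 2.4 reports when a 'for' loop is given something that cannot be iterated, such as None. The sketch below logs the value and then only returns it for looping if it is a list or tuple. The logger name, the helper name, and the sample values are hypothetical, and this is not a change that went into xend.

```python
import logging

log = logging.getLogger("xend.sketch")  # hypothetical logger name


def children_or_empty(prev_vm_xend):
    """Log the value, then return prev_vm_xend[0][1] if it is a list or tuple, else []."""
    log.debug("/vm/UUID/xend: %s", prev_vm_xend)
    try:
        children = prev_vm_xend[0][1]
    except (TypeError, IndexError, KeyError):
        # The value is None, empty, or not shaped the way the loop expects.
        return []
    if isinstance(children, (list, tuple)):
        return list(children)
    return []


# Hypothetical values, for illustration only.
print(children_or_empty(None))
print(children_or_empty([["xend", None]]))
print(children_or_empty([["xend", [["restart_count", "1"]]]]))
```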

Thanks

Comment 18 Michal Novotny 2009-06-10 08:28:33 UTC
Well, my patch was about removing the remaining PVFB backend after a domain has been shut down, rebooted, or destroyed, so it has nothing to do with prev_vm_xend or anything similar, i.e. this is definitely a different issue.

Also, I have created a package with some logging of this applied, and investigation showed me that only reboot calls this function. You can download and test my new RPMs at:

http://people.redhat.com/minovotn/xen/

Thanks,
Michal

Comment 19 Chris Ward 2009-07-03 18:01:33 UTC
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 24 zhanghaiyan 2009-07-29 10:29:48 UTC
Could you please provide test steps for this bug? Thanks a lot.

Comment 25 Chris Lalancette 2009-07-29 11:00:14 UTC
(In reply to comment #24)
> Could you please provide test steps for this bug? Thanks a lot.

I think that Gurhan proved it in Comment #23, so we should be good to go.

Chris Lalancette

Comment 27 zhanghaiyan 2009-07-30 01:50:44 UTC
Thanks a lot for your guidance!

Comment 31 errata-xmlrpc 2009-09-02 10:11:00 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1328.html