Bug 532648
| Field | Value |
|---|---|
| Summary | Backport upstream fixes to vbd hotplug |
| Product | Red Hat Enterprise Linux 5 |
| Component | xen |
| Version | 5.4 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | low |
| Target Milestone | rc |
| Keywords | ZStream |
| Reporter | Jiri Denemark <jdenemar> |
| Assignee | Michal Novotny <minovotn> |
| QA Contact | Virtualization Bugs <virt-bugs> |
| CC | areis, jzheng, leiwang, llim, moshiro, mrezanin, pbonzini, plyons, xen-maint, yuzhang |
| Fixed In Version | xen-3.0.3-118.el5 |
| Doc Type | Bug Fix |
| Last Closed | 2011-01-13 22:19:22 UTC |
| Bug Blocks | 514498, 678282 |
Description
Jiri Denemark
2009-11-03 10:00:51 UTC
Created attachment 442611 [details]
Backport of upstream fixes
These are backports of the fixes from upstream changesets c/s 20392 and c/s 20393.
They were tested on an x86_64 RHEL-5.5 dom0 by starting 5 PV guests in a row several times. All guests started successfully and everything worked fine, including localhost migrations.
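For reference, a minimal sketch of the kind of test described above, starting several PV guests in a row and exercising localhost (live) migration. The config path and guest names are placeholders, not taken from this bug report.

```bash
#!/bin/bash
# Minimal sketch of the test described above: start 5 PV guests in a row and
# exercise localhost migration for each of them. Config path and guest names
# are placeholders.
set -e

for i in $(seq 1 5); do
    xm create /etc/xen/rhel5-pv.cfg name="pv$i"
done

# Live-migrate each guest back to the same host ("localhost migration").
for i in $(seq 1 5); do
    xm migrate --live "pv$i" localhost
done

xm list
```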
Michal
Created attachment 443456 [details]
New version of the backport
This is the new, direct version of the backport, excluding the localhost migration check.
Michal
Created attachment 445971 [details]
Patch version 3
New version of the backport for a codebase with the localhost patch reverted
This is the new, direct version of the backport for a codebase that does not have the localhost patch in the tree (i.e. with the local migration patch reverted).
Michal
Fix built into xen-3.0.3-118.el5

This bug is reproduced on xen-3.0.3-117.el5 as follows:

1. Copy a set of PV images to be used to create the domains, e.g.:

   for i in `seq -w 01 80`; do dd if=rhel-server-32-pv.img of=/tmp/img$i bs=1M count=50; done

   (The first 50 MB of the full image should be enough to create a domain.)

2. Create many PV domains in quick succession, e.g.:

   for i in `seq -w 01 80`; do (xm create test.cfg name="pv$i" disk="file:/tmp/img$i,hda,w" &); done

This requires a large amount of memory. I allocated 64M of memory for each domain and ran this on a machine with 8G of physical memory installed. This way, the "Error: Device 0 (vbd) could not be connected. Hotplug scripts not working" message shows up with a probability of about 30%.

Unfortunately, upgrading to xen-3.0.3-118.el5 did not solve the issue: the error message can still show up in multiple tests. Please consider the case I have described above.

I'd like to ask a few questions:

Is there a difference in success rate on -117 and -118? If yes, what is the exact difference?

Can you provide xend.log and xen-hotplug.log for this test?

Created attachment 461433 [details]
xend.log
Created attachment 461434 [details]
xen-hotplug.log
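Pulled together, the reproducer from the comments above looks roughly like the following sketch. The image name, config file, guest count and memory sizes are the ones quoted in this report; adjust them for your environment.

```bash
#!/bin/bash
# Consolidated sketch of the reproduction steps described above
# (xen-3.0.3-117.el5, ~80 PV guests created in quick succession).
SRC_IMG=rhel-server-32-pv.img   # full PV guest image
CFG=test.cfg                    # PV guest config (pygrub, 64M per guest)

# 1. Clone the first 50 MB of the image for each guest.
for i in $(seq -w 01 80); do
    dd if="$SRC_IMG" of="/tmp/img$i" bs=1M count=50
done

# 2. Start all guests in parallel, each with its own disk image.
for i in $(seq -w 01 80); do
    (xm create "$CFG" name="pv$i" disk="file:/tmp/img$i,hda,w" &)
done

# Watch for "Error: Device 0 (vbd) could not be connected. Hotplug scripts
# not working." in the output and in /var/log/xen/xen-hotplug.log.
xm list
```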
(In reply to comment #8)
> I'd like to ask a few questions:
>
> Is there a difference in success rate on -117 and -118? If yes, what is the
> exact difference?

I cannot tell the 'exact' difference, but they do differ to some extent. I would say it is more reproducible on -118 than on -117, but that is only a rough estimate.

> Can you provide xend.log and xen-hotplug.log for this test?

I replaced file with tap:aio (s/file/tap:aio/) in the reproduction script and tried again:

for i in `seq -w 01 80`; do (xm create test.cfg name="pv$i" disk="tap:aio:/tmp/img$i,hda,w" &); done

and this one:

for i in `seq -w 01 80`; do (xm create test.cfg name="pv$i" disk="tap:aio:/tmp/img$i,xvda,w" &); done

Both reproduced the error message.

Well, how did you achieve that? Running:

for i in `seq -w 01 80`; do (xm create rhel5-32pv name="pv$i"; disk="tap:aio:/tmp/img$i,xvda,w" &); done

always wanted to use the default image (as found in the config file) as the disk, so I generated the configuration files first. I also discovered that 50M is not enough, since it was returning an error about the boot loader returning no data, so I had to use the first 100M of those files (RHEL-5 i386 PV guest). I also changed names and UUIDs, not just disks.

In the process of creation there were some "Error: (4, 'Out of memory', "xc_dom_boot_mem_init: can't allocate low memory for domain\n")" errors, but this could be caused by setting the memory per guest to 32M. Despite that, the domains were started ("Started domain pv48") using the latest virttest tree, so I can't reproduce the issue myself. I'm not sure whether the backports are really in the -118 version. Unfortunately this made the host machine pretty slow, but in xm list output (and XenD-related operations) I was able to see the machines (except those that returned the out-of-memory error, i.e. 5 of the 80 machines).

I also ran the test with 64M RHEL-5 i386 PV guests and the results were exactly the same, although the host machine (dom0) was very slow to list the domains. Since the image was not complete, all the domains crashed, so I set the on_crash/on_poweroff/on_restart conditions to "preserve" for those guests to be sure all the domains stayed listed.

Michal

(In reply to comment #13)
> Well, how did you achieve that? Running:
>
> for i in `seq -w 01 80`; do (xm create rhel5-32pv name="pv$i";
> disk="tap:aio:/tmp/img$i,xvda,w" &); done
>
> always wanted to use the default image (as found in the config file) as the
> disk, so I generated the configuration files first.

Is that true? It's totally different here on my system. If the command is exactly what you used, I think you should remove the semicolon ';' after name="pv$i" ...

This was the config I used in the test:

bootloader = "/usr/bin/pygrub"
vif = ['script=vif-bridge,bridge=xenbr0']
on_reboot = "restart"
localtime = "0"
apic = "1"
on_poweroff = "destroy"
on_crash = "preserve"
vcpus = "1"
pae = "1"
memory = "64"
vnclisten = "0.0.0.0"
vnc = "1"
#disk = ['tap:aio:/root/RHEL-Server-5.5-64-pv.raw,xvda,w']
acpi = "1"
maxmem = "64"

> I also discovered that 50M is not enough, since it was returning an error
> about the boot loader returning no data, so I had to use the first 100M of
> those files (RHEL-5 i386 PV guest). I also changed names and UUIDs, not just
> disks.

I used RHEL-Server-5.5-64-pv; 50M is enough for me. The domain boots into a crashed state and can be preserved. I'll do this test again using a 32-bit guest and see what's different.
> In the process of creation there were some "Error: (4, 'Out of memory',
> "xc_dom_boot_mem_init: can't allocate low memory for domain\n")" errors, but
> this could be caused by setting the memory per guest to 32M. Despite that,
> the domains were started ("Started domain pv48") using the latest virttest
> tree, so I can't reproduce the issue myself. I'm not sure whether the
> backports are really in the -118 version. Unfortunately this made the host
> machine pretty slow, but in xm list output (and XenD-related operations) I
> was able to see the machines (except those that returned the out-of-memory
> error, i.e. 5 of the 80 machines).

I don't get this 'out of memory' error. All of them report 'Started domain pv??'. I can see them crashed in xm list, though it is very slow.

> I also ran the test with 64M RHEL-5 i386 PV guests and the results were
> exactly the same, although the host machine (dom0) was very slow to list the
> domains. Since the image was not complete, all the domains crashed, so I set
> the on_crash/on_poweroff/on_restart conditions to "preserve" for those guests
> to be sure all the domains stayed listed.
>
> Michal

Created attachment 462211 [details]
test output
This was the output of the test command that reproduced the error message.
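As a rough illustration of the approach Michal describes above (generating one config file per guest with a unique name, UUID and disk, instead of overriding them on the xm create command line), something like the following could be used. The /etc/xen/pv$i paths and the use of uuidgen are assumptions for the sketch, not taken from the report; the base settings mirror the config quoted above.

```bash
#!/bin/bash
# Sketch: generate one config per guest with unique name, UUID and disk image,
# then start them all in quick succession, as in the reproducer above.
for i in $(seq -w 01 80); do
    cat > "/etc/xen/pv$i" <<EOF
name = "pv$i"
uuid = "$(uuidgen)"
bootloader = "/usr/bin/pygrub"
vif = ['script=vif-bridge,bridge=xenbr0']
disk = ['tap:aio:/tmp/img$i,xvda,w']
memory = "64"
maxmem = "64"
vcpus = "1"
# Preserve crashed/halted domains so they remain visible in xm list.
on_poweroff = "preserve"
on_reboot = "preserve"
on_crash = "preserve"
EOF
done

for i in $(seq -w 01 80); do
    (xm create "/etc/xen/pv$i" &)
done
```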
I tried with RHEL-Server-5.5-32-pv and have not successfully reproduced it yet.

My impression is that it is faster to create a 32-bit domain than a 64-bit domain. With a 64-bit guest the bug is much easier to reproduce, but with a 32-bit guest there is no reproduction yet.

(In reply to comment #16)
> I tried with RHEL-Server-5.5-32-pv and have not successfully reproduced it
> yet.
>
> My impression is that it is faster to create a 32-bit domain than a 64-bit
> domain. With a 64-bit guest the bug is much easier to reproduce, but with a
> 32-bit guest there is no reproduction yet.

This may be the issue; I'll try with an x86_64 guest then. With an i386 guest I saw no "Error: Device 0 (vif) could not be connected. Hotplug scripts not working." messages, so if you say it's much easier to reproduce with an x86_64 guest, I'll try it and comment on this BZ with the results. One more note: I was able to see those "Device could not be connected" messages with the i386 guest without my patch applied, and now I am unable to see them anymore, so the patch helped at least with i386 guests. I need to test with x86_64 guests now.

Michal

(In reply to comment #17)
> This may be the issue; I'll try with an x86_64 guest then. With an i386 guest
> I saw no "Error: Device 0 (vif) could not be connected. Hotplug scripts not
> working." messages, so if you say it's much easier to reproduce with an
> x86_64 guest, I'll try it and comment on this BZ with the results. One more
> note: I was able to see those "Device could not be connected" messages with
> the i386 guest without my patch applied, and now I am unable to see them
> anymore, so the patch helped at least with i386 guests. I need to test with
> x86_64 guests now.
>
> Michal

Well, I did try with a RHEL-5 x86_64 PV guest (and 50M of the disk image really was enough), but I saw all the domains start successfully (though crashed) with 64M RAM assigned to each (80 guests in total). The version used was xen-3.0.3-118.el5virttest34.gb1c76b9.x86_64.rpm, available for testing at http://people.redhat.com/minovotn/xen/test/ (x86_64 version only).

Michal

Michal, I've tested the xen-3.0.3-118.el5virttest34.gb1c76b9.x86_64.rpm you provided with an x64 guest. I still get "Error: Device 0 (vif) could not be connected. Hotplug scripts not working.". The i386 guest behaves the same as on -118.

I noticed that we were actually expecting "Error: Device 0 (vbd)..." but this test outputs "Error: Device 0 (vif)...". Does it make any difference?

(In reply to comment #20)
> Michal, I've tested the xen-3.0.3-118.el5virttest34.gb1c76b9.x86_64.rpm you
> provided with an x64 guest. I still get "Error: Device 0 (vif) could not be
> connected. Hotplug scripts not working.". The i386 guest behaves the same as
> on -118.
>
> I noticed that we were actually expecting "Error: Device 0 (vbd)..." but this
> test outputs "Error: Device 0 (vif)...". Does it make any difference?

Honestly, it does make a huge difference, since vif is the network interface, which is handled by a different one of the Xen hotplug scripts. It is not connected to vbd at all, since vbd is the disk device. I can't see the network issue in my environment (although I do have a guest network set up), so I guess this is something reproducible only on your machine or on a setup different from mine.
This is an issue with the networking scripts rather than with vbd.

Michal

Hi Jinxin, based on the log, where vif is the problem, this should be VERIFIED, as we are handling vbd here. If you have a vif problem, please report a new BZ.

OK. Since the patch handles vbd and I cannot reproduce the vbd error on -118, I'll put this into VERIFIED. The vif error seems to be another problem, for which I'll file a separate bug later. Sorry for the confusion.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html

This bugfix has to be backported to 5.4.z, as the fix for bug 666800 increases the chance of hitting this problem. To safely apply 666800 without regressions, we need this bugfix.