Bug 532648
| Field | Value |
|---|---|
| Summary | Backport upstream fixes to vbd hotplug |
| Product | Red Hat Enterprise Linux 5 |
| Component | xen |
| Version | 5.4 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | low |
| Target Milestone | rc |
| Keywords | ZStream |
| Reporter | Jiri Denemark <jdenemar> |
| Assignee | Michal Novotny <minovotn> |
| QA Contact | Virtualization Bugs <virt-bugs> |
| CC | areis, jzheng, leiwang, llim, moshiro, mrezanin, pbonzini, plyons, xen-maint, yuzhang |
| Fixed In Version | xen-3.0.3-118.el5 |
| Doc Type | Bug Fix |
| Last Closed | 2011-01-13 22:19:22 UTC |
| Bug Blocks | 514498, 678282 |
Description
Jiri Denemark
2009-11-03 10:00:51 UTC
Created attachment 442611 [details]
Backport of upstream fixes
These are backports of the fixes from upstream changesets c/s 20392 and c/s 20393.
They were tested on an x86_64 RHEL-5.5 dom0 by starting 5 PV guests in a row several times. All guests started successfully and everything worked fine, including localhost migrations.
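For reference, a minimal sketch of the kind of test described above, starting several PV guests in a row and exercising localhost (live) migration. The config path and guest names are placeholders, not taken from this bug report.

```bash
#!/bin/bash
# Minimal sketch of the test described above: start 5 PV guests in a row and
# exercise localhost migration for each of them. Config path and guest names
# are placeholders.
set -e

for i in $(seq 1 5); do
    xm create /etc/xen/rhel5-pv.cfg name="pv$i"
done

# Live-migrate each guest back to the same host ("localhost migration").
for i in $(seq 1 5); do
    xm migrate --live "pv$i" localhost
done

xm list
```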
Michal
Created attachment 443456 [details]
New version of the backport
This is the new, direct version of the backport, excluding the localhost migration check.
Michal
Created attachment 445971 [details]
Patch version 3
New version of the backport for a codebase with the localhost patch reverted
This is the new, direct version of the backport for a codebase that does not have the localhost patch in the tree (i.e. with the local migration patch reverted).
Michal
Fix built into xen-3.0.3-118.el5

This bug is reproduced on xen-3.0.3-117.el5 as follows:

1. Copy a set of PV images to be used to create the domains, e.g.:

   for i in `seq -w 01 80`; do dd if=rhel-server-32-pv.img of=/tmp/img$i bs=1M count=50; done

   (The first 50 MB of the full image should be enough to create a domain.)

2. Create many PV domains in quick succession, e.g.:

   for i in `seq -w 01 80`; do (xm create test.cfg name="pv$i" disk="file:/tmp/img$i,hda,w" &); done

This requires a large amount of memory. I allocated 64M of memory for each domain and ran this on a machine with 8G of physical memory installed. This way, the "Error: Device 0 (vbd) could not be connected. Hotplug scripts not working" message shows up with a probability of about 30%.

Unfortunately, upgrading to xen-3.0.3-118.el5 did not solve the issue: the error message can still show up in multiple tests. Please consider the case I have described above.

I'd like to ask a few questions:

Is there a difference in success rate on -117 and -118? If yes, what is the exact difference?

Can you provide xend.log and xen-hotplug.log for this test?

Created attachment 461433 [details]
xend.log
Created attachment 461434 [details]
xen-hotplug.log
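Pulled together, the reproducer from the comments above looks roughly like the following sketch. The image name, config file, guest count and memory sizes are the ones quoted in this report; adjust them for your environment.

```bash
#!/bin/bash
# Consolidated sketch of the reproduction steps described above
# (xen-3.0.3-117.el5, ~80 PV guests created in quick succession).
SRC_IMG=rhel-server-32-pv.img   # full PV guest image
CFG=test.cfg                    # PV guest config (pygrub, 64M per guest)

# 1. Clone the first 50 MB of the image for each guest.
for i in $(seq -w 01 80); do
    dd if="$SRC_IMG" of="/tmp/img$i" bs=1M count=50
done

# 2. Start all guests in parallel, each with its own disk image.
for i in $(seq -w 01 80); do
    (xm create "$CFG" name="pv$i" disk="file:/tmp/img$i,hda,w" &)
done

# Watch for "Error: Device 0 (vbd) could not be connected. Hotplug scripts
# not working." in the output and in /var/log/xen/xen-hotplug.log.
xm list
```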
(In reply to comment #8)
> I'd like to ask a few questions:
>
> Is there a difference in success rate on -117 and -118? If yes, what is the
> exact difference?

I cannot tell the 'exact' difference, but they do differ to some extent. I would say it is more reproducible on -118 than on -117, but that is only a rough estimate.

> Can you provide xend.log and xen-hotplug.log for this test?

I replaced file with tap:aio (s/file/tap:aio/) in the reproduction script and tried again:

for i in `seq -w 01 80`; do (xm create test.cfg name="pv$i" disk="tap:aio:/tmp/img$i,hda,w" &); done

and this one:

for i in `seq -w 01 80`; do (xm create test.cfg name="pv$i" disk="tap:aio:/tmp/img$i,xvda,w" &); done

Both reproduced the error message.

Well, how did you achieve that? Running:

for i in `seq -w 01 80`; do (xm create rhel5-32pv name="pv$i"; disk="tap:aio:/tmp/img$i,xvda,w" &); done

always wanted to use the default image (as found in the config file) as the disk, so I generated the configuration files first. I also discovered that 50M is not enough, since it was returning an error about the boot loader returning no data, so I had to use the first 100M of those files (RHEL-5 i386 PV guest). I also changed names and UUIDs, not just disks.

In the process of creation there were some "Error: (4, 'Out of memory', "xc_dom_boot_mem_init: can't allocate low memory for domain\n")" errors, but this could be caused by setting the memory per guest to 32M. Despite that, the domains were started ("Started domain pv48") using the latest virttest tree, so I can't reproduce the issue myself. I'm not sure whether the backports are really in the -118 version. Unfortunately this made the host machine pretty slow, but in xm list output (and XenD-related operations) I was able to see the machines (except those that returned the out-of-memory error, i.e. 5 of the 80 machines).

I also ran the test with 64M RHEL-5 i386 PV guests and the results were exactly the same, although the host machine (dom0) was very slow to list the domains. Since the image was not complete, all the domains crashed, so I set the on_crash/on_poweroff/on_restart conditions to "preserve" for those guests to be sure all the domains stayed listed.

Michal

(In reply to comment #13)
> Well, how did you achieve that? Running:
>
> for i in `seq -w 01 80`; do (xm create rhel5-32pv name="pv$i";
> disk="tap:aio:/tmp/img$i,xvda,w" &); done
>
> always wanted to use the default image (as found in the config file) as the
> disk, so I generated the configuration files first.

Is that true? It's totally different here on my system. If the command is exactly what you used, I think you should remove the semicolon ';' after name="pv$i" ...

This was the config I used in the test:

bootloader = "/usr/bin/pygrub"
vif = ['script=vif-bridge,bridge=xenbr0']
on_reboot = "restart"
localtime = "0"
apic = "1"
on_poweroff = "destroy"
on_crash = "preserve"
vcpus = "1"
pae = "1"
memory = "64"
vnclisten = "0.0.0.0"
vnc = "1"
#disk = ['tap:aio:/root/RHEL-Server-5.5-64-pv.raw,xvda,w']
acpi = "1"
maxmem = "64"

> I also discovered that 50M is not enough, since it was returning an error
> about the boot loader returning no data, so I had to use the first 100M of
> those files (RHEL-5 i386 PV guest). I also changed names and UUIDs, not just
> disks.

I used RHEL-Server-5.5-64-pv; 50M is enough for me. The domain boots into a crashed state and can be preserved. I'll do this test again using a 32-bit guest and see what's different.
> In the process of creation there were some "Error: (4, 'Out of memory',
> "xc_dom_boot_mem_init: can't allocate low memory for domain\n")" errors, but
> this could be caused by setting the memory per guest to 32M. Despite that,
> the domains were started ("Started domain pv48") using the latest virttest
> tree, so I can't reproduce the issue myself. I'm not sure whether the
> backports are really in the -118 version. Unfortunately this made the host
> machine pretty slow, but in xm list output (and XenD-related operations) I
> was able to see the machines (except those that returned the out-of-memory
> error, i.e. 5 of the 80 machines).

I don't get this 'out of memory' error. All of them report 'Started domain pv??'. I can see them crashed in xm list, though it is very slow.

> I also ran the test with 64M RHEL-5 i386 PV guests and the results were
> exactly the same, although the host machine (dom0) was very slow to list the
> domains. Since the image was not complete, all the domains crashed, so I set
> the on_crash/on_poweroff/on_restart conditions to "preserve" for those guests
> to be sure all the domains stayed listed.
>
> Michal

Created attachment 462211 [details]
test output
This was the output of the test command that reproduced the error message.
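As a rough illustration of the approach Michal describes above (generating one config file per guest with a unique name, UUID and disk, instead of overriding them on the xm create command line), something like the following could be used. The /etc/xen/pv$i paths and the use of uuidgen are assumptions for the sketch, not taken from the report; the base settings mirror the config quoted above.

```bash
#!/bin/bash
# Sketch: generate one config per guest with unique name, UUID and disk image,
# then start them all in quick succession, as in the reproducer above.
for i in $(seq -w 01 80); do
    cat > "/etc/xen/pv$i" <<EOF
name = "pv$i"
uuid = "$(uuidgen)"
bootloader = "/usr/bin/pygrub"
vif = ['script=vif-bridge,bridge=xenbr0']
disk = ['tap:aio:/tmp/img$i,xvda,w']
memory = "64"
maxmem = "64"
vcpus = "1"
# Preserve crashed/halted domains so they remain visible in xm list.
on_poweroff = "preserve"
on_reboot = "preserve"
on_crash = "preserve"
EOF
done

for i in $(seq -w 01 80); do
    (xm create "/etc/xen/pv$i" &)
done
```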
I tried with RHEL-Server-5.5-32-pv and have not successfully reproduced it yet.

My impression is that it is faster to create a 32-bit domain than a 64-bit domain. With a 64-bit guest the bug is much easier to reproduce, but with a 32-bit guest there is no reproduction yet.

(In reply to comment #16)
> I tried with RHEL-Server-5.5-32-pv and have not successfully reproduced it
> yet.
>
> My impression is that it is faster to create a 32-bit domain than a 64-bit
> domain. With a 64-bit guest the bug is much easier to reproduce, but with a
> 32-bit guest there is no reproduction yet.

This may be the issue; I'll try with an x86_64 guest then. With an i386 guest I saw no "Error: Device 0 (vif) could not be connected. Hotplug scripts not working." messages, so if you say it's much easier to reproduce with an x86_64 guest, I'll try it and comment on this BZ with the results. One more note: I was able to see those "Device could not be connected" messages with the i386 guest without my patch applied, and now I am unable to see them anymore, so the patch helped at least with i386 guests. I need to test with x86_64 guests now.

Michal

(In reply to comment #17)
> This may be the issue; I'll try with an x86_64 guest then. With an i386 guest
> I saw no "Error: Device 0 (vif) could not be connected. Hotplug scripts not
> working." messages, so if you say it's much easier to reproduce with an
> x86_64 guest, I'll try it and comment on this BZ with the results. One more
> note: I was able to see those "Device could not be connected" messages with
> the i386 guest without my patch applied, and now I am unable to see them
> anymore, so the patch helped at least with i386 guests. I need to test with
> x86_64 guests now.
>
> Michal

Well, I did try with a RHEL-5 x86_64 PV guest (and 50M of the disk image really was enough), but I saw all the domains start successfully (though crashed) with 64M RAM assigned to each (80 guests in total). The version used was xen-3.0.3-118.el5virttest34.gb1c76b9.x86_64.rpm, available for testing at http://people.redhat.com/minovotn/xen/test/ (x86_64 version only).

Michal

Michal, I've tested the xen-3.0.3-118.el5virttest34.gb1c76b9.x86_64.rpm you provided with an x64 guest. I still get "Error: Device 0 (vif) could not be connected. Hotplug scripts not working.". The i386 guest behaves the same as on -118.

I noticed that we were actually expecting "Error: Device 0 (vbd)..." but this test outputs "Error: Device 0 (vif)...". Does it make any difference?

(In reply to comment #20)
> Michal, I've tested the xen-3.0.3-118.el5virttest34.gb1c76b9.x86_64.rpm you
> provided with an x64 guest. I still get "Error: Device 0 (vif) could not be
> connected. Hotplug scripts not working.". The i386 guest behaves the same as
> on -118.
>
> I noticed that we were actually expecting "Error: Device 0 (vbd)..." but this
> test outputs "Error: Device 0 (vif)...". Does it make any difference?

Honestly, it does make a huge difference, since vif is the network interface, which is handled by a different one of the Xen hotplug scripts. It is not connected to vbd at all, since vbd is the disk device. I can't see the network issue in my environment (although I do have a guest network set up), so I guess this is something reproducible only on your machine or on a setup different from mine.
This is an issue with the networking scripts rather than with vbd.

Michal

Hi Jinxin, based on the log, where vif is the problem, this should be VERIFIED, as we are handling vbd here. If you have a vif problem, please report a new BZ.

OK. Since the patch handles vbd and I cannot reproduce the vbd error on -118, I'll put this into VERIFIED. The vif error seems to be another problem, for which I'll file a separate bug later. Sorry for the confusion.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html

This bugfix has to be backported to 5.4.z, as the fix for bug 666800 increases the chance of hitting this problem. To safely apply 666800 without regressions, we need this bugfix.