Bug 509099
Created attachment 350086 [details]
pv domain xml file
Created attachment 350087 [details]
perl script to do pv domain migration stress test
Created attachment 350088 [details]
dependency of perl script virshmig.pl
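The attached virshmig.pl is not reproduced in this report. A minimal shell sketch of the kind of back-and-forth migration stress loop it performs might look like the following; the domain name, connection URIs, and the VIRSH override are placeholders/assumptions, not taken from the attachment:

```shell
# Hypothetical sketch of a pv-domain migration stress loop like the
# attached virshmig.pl; domain name and connection URIs are placeholders.
migrate_rounds() {
    local domain=$1 remote_uri=$2 local_uri=$3 rounds=$4
    local virsh=${VIRSH:-virsh}   # set VIRSH=echo for a dry run
    local i
    for i in $(seq 1 "$rounds"); do
        # one round = migrate to the remote host, then back to the local one
        $virsh migrate --live "$domain" "$remote_uri" || return 1
        $virsh -c "$remote_uri" migrate --live "$domain" "$local_uri" || return 1
        # dump the XML after each round so lost <graphics>/<interface>
        # elements can be spotted by diffing successive dumps
        $virsh dumpxml "$domain" > "dump-round-$i.xml" || return 1
    done
}
```

Run as, e.g., `migrate_rounds rhel54pv xen+ssh://hostB xen+ssh://hostA 20` and diff the resulting dump-round-*.xml files to see which devices disappear.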
Created attachment 350089 [details]
xml dump when this pv domain migrated to remote for the 1st time
Created attachment 350090 [details]
xml dump when this pv domain migrated back to local for the 1st time
Created attachment 350091 [details]
xml dump when this pv domain migrated to remote for the 2nd time
Created attachment 350092 [details]
xml dump when this pv domain migrated back to local for the 2nd time
So are the devices lost in the XML before the hang occurs? If so, this doesn't sound like a kernel problem.

This really isn't a libvirt bug. libvirt XML just shows what XenD provides, so missing devices are XenD's fault. The missing <graphics> tag just sounds like another case of https://bugzilla.redhat.com/show_bug.cgi?id=507765

Hi, could you please try this one with RPMs from http://people.redhat.com/minovotn/xen ? There is a new patch that *may be* affecting this one. Please provide test results. Thanks, Michal

(In reply to comment #15)
> This really isn't a libvirt bug. libvirt XML just shows what XenD provides.

Daniel, it seems you're right that it's a dup of BZ #507765, but I've been working on this one since yesterday and the patch is available in the RPMs I wrote about in comment #17, so I need Edward to install those RPMs and try to reproduce it with them. Most probably it's a dup, but I won't close it until Edward tests it and it passes, to be sure it's not something different. Michal

(In reply to comment #17)
> Could you please try this one with RPMs from http://people.redhat.com/minovotn/xen ?

Michal, this bug should be fixed by your new xen package. With the same steps as in comment #0, after 20 rounds of stress migration of the xen pv domain, the network is still alive and no devices are lost from the domain XML config. Related package info:
xen - xen-3.0.3-90mig.elt
libvirt - libvirt-0.6.3-15.el5
kernel - 2.6.18-158.el5xen
Thanks, Edward Wang

(In reply to comment #18)
> Most probably it's a dup but I won't close it until Edward tests it.

Michal, is the root cause of this bug the same as BZ #507765? The symptoms seem different. Thanks.

Edward, BZ #507765 was only about PVFB devices, i.e. the virtual keyboard and framebuffer. It had nothing to do with networking, but since those RPMs are based on the -90 version of the xen package (the official version with just 2 new patches applied, neither of which touched networking), it appears that the current official RPMs already have this one fixed. The lost graphics devices were caused by 507765, but not the network issue. If you are unable to reproduce it now, I would wait for the next official version, try on that one, and close this BZ if it's no longer reproducible on the *official* version... ;) Michal

(In reply to comment #20)
> Is the root cause of this bug the same as BZ #507765? The symptoms seem different.

Edward, since there's no xend.log attached to this BZ I cannot be 100% sure, but I'm still pretty sure that the root cause of this bug and https://bugzilla.redhat.com/show_bug.cgi?id=507765 is the same. What you see as a device loss most likely prevents successful migration of the guest. I guess the guest is not even running at the time it loses network connection...

(In reply to comment #22)
> I'm still pretty sure that the root cause of this bug and 507765 is the same.

I don't think so, because 507765 was caused by a "bad" fix for a xenstore leak around PVFB devices, and that code touches nothing other than the vkbd/vfb devices. In BZ #439182 a xenstore leak was reported, so I fixed it, but the fix caused the regression reported in BZ #507765. The only thing I was manipulating was vkbd/vfb, and I didn't even read any frontend/backend vif device, so it's highly unlikely that the network issue should be fixed by this patch for leaking PVFB devices; it may have been corrected by some networking BZ whose patch was applied there. Anyway, xend.log would be helpful, so could you please attach xend.log from at least 2 migration attempts with the new RPMs? Thanks, Michal

Michal, I've tried 9 times and xend.log is attached for your reference. Thanks, Edward

Created attachment 354655 [details]
xend.log
Edward, from the log you attached it seems you were trying it with a package which had the latest patch for 507765 applied. And you still experienced the issue with lost network, right?

(In reply to comment #26)
> And you still experienced the issue with lost network, right?

No. Neither the network issue nor the device loss (in the XML) occurs any more after I upgraded to Michal's private package from comment #17. Michal's package already fixed this bug; we are just discussing whether the root cause of this bug is the same as BZ #507765. Thanks, Edward

Interesting, I saw some errors in the log regarding a vbd device... But it doesn't matter. Could you please also provide xend.log from a stress migration with the older xen package (i.e. -88.el5, which you reported this bug against), that is, from a run where the guest loses network connection? Thanks a lot.

(In reply to comment #28)
> Could you please also provide xend.log from a stress migration with the older xen package?

I reproduced this bug with the following components:
xen: -90.el5
libvirt: -15.el5
kernel: -158.el5xen
The xend.log is attached for your reference; the log name is xend(reproducible).log. Then I upgraded xen to Michal's package from comment #17 to check whether the bug is fixed in the same environment: it is no longer reproduced, i.e. Michal's package from comment #17 fixed this bug. Thanks, Edward

Created attachment 354659 [details]
xend.log when this bug is reproduced
Thanks for that, although it's kind of interesting that there is no error in the log at all. Any chance you could send the xend.log from the other machine?

Actually, I did not reproduce this bug on my own two machines; I had already tried to find another two machines from colleagues. Though there is no error in the xend.log I attached, Michal's private package really has fixed this bug, and I have to switch to other project issues now... If you are really interested in the root cause of this bug, maybe you can try to find another two machines to reproduce it? :) Thanks for your understanding, Edward

Edward, my package should fix the PVFB issues, that's right, but I did nothing about the vif interface. So you are saying you were able to reproduce the bug on the -90 version of the Xen package for both PVFB devices and network (vif) devices? My PVFB patch did nothing about vif devices at all, so I don't understand what the issue was there. In fact I don't know whether it's really a XenD fault where networking is concerned... According to the XML there was a <graphics> tag missing, which refers to PVFB devices (BZ #507765), but there was a bridge interface and a networking device with a MAC address in all the XMLs:

<interface type='bridge'>
  <mac address='00:31:a3:14:3e:f0'/>
  <source bridge='xenbr0'/>
  <script path='vif-bridge'/>
  <target dev='vif23.0'/>
</interface>

So if this is no longer reproducible, something fixed it, but since the patch for BZ #507765 touches nothing other than PVFB devices, it's highly unlikely it was corrected by that patch. So I would like to ask: did you lose networking with the -90 version you were testing now? Thanks, Michal

(In reply to comment #33)
> Did you lose networking with the -90 version you were testing now?

Michal, yesterday I did the testing on two Intel machines with the following versions:
- xen: -90.el5
- libvirt: -15.el5
- kernel: -158.el5xen
and found that the domain loses devices (the input and graphics XML nodes) and also cannot be pinged during the 2nd round of migration. One round of migration means migrating the domain to the remote host and then back from the remote to the local one. Then I upgraded the xen package to the one from comment #17 and this bug disappeared. The environment is identical except that I upgraded to the xen package you provided. Not sure whether this is helpful to you? Thanks, Edward

Edward, this is exactly what I meant, thanks. I am just a little confused because my PVFB patch did nothing with networking devices. In fact, the official -91 version of the xen package should contain this PVFB patch (BZ #507765)... could you try with that version when it's available? Thanks, Michal

Michal, I've tested xen pv domain migration for 5 cycles (1 cycle means migrating the domain from local to remote and then back to local): no network or device loss. The component versions are:
kernel - 158.el5
xen - 91.el5
libvirt - 16.el5
This bug is really fixed now. Thanks, Edward

Thanks for testing, Edward, so this one is fixed... That's good ;) Thanks, Michal

Per comment #38, changing this bug's status to VERIFIED. Clearing the needinfo flag.

OK. Since it seems this is fixed, I'm going to close this as a dup of 507765, even though we don't know exactly why that patch fixed this problem. Chris Lalancette

*** This bug has been marked as a duplicate of bug 507765 ***

Can reproduce on RHEL5.4-i386 pv guest migration. Versions:
- RHEL5.4-Server-i386
- libvirt-0.6.3-17.el5
- xen-3.0.3-92.el5
Migrated a pv guest from system A to system B through virt-manager. After the migration:
- On system A, no error info; the guest is shut off.
- On system B, the guest migrated over from system A is in running status, but lost its interface info.
Attaching the rhel5u4-pv XML from system B after migration.

Update to comment #45: the guest is a RHEL5U3 system. Attaching the rhel5u3-pv-before-mig XML from system A before migration and the rhel5u3-pv-after-mig XML from system B after migration.

Created attachment 355924 [details]
rhel5u3-pv before migration
Created attachment 355925 [details]
rhel5u3-pv after migration
Can reproduce on RHEL5.4-i386 fv guest migration. Versions:
- RHEL5.4-Server-i386
- libvirt-0.6.3-17.el5
- xen-3.0.3-92.el5
Migrated an fv guest from system A to system B through virt-manager. After the migration:
- On system A, the guest is shut off, and virt-manager reports an error:

Error migrating domain: Domain not found: xenUnifiedDomainLookupByName
Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/engine.py", line 561, in migrate_domain
    vm.migrate(destconn)
  File "/usr/share/virt-manager/virtManager/domain.py", line 1387, in migrate
    self.vm.migrate(conn, flags, None, uri, 0)
  File "/usr/lib/python2.4/site-packages/libvirt.py", line 378, in migrate
    if ret is None:raise libvirtError('virDomainMigrate() failed', dom=self)
libvirtError: Domain not found: xenUnifiedDomainLookupByName

- On system B, the guest migrated over from system A is in running status, but there is no console screen and it cannot be pinged, although it does not lose its interface info in the XML.
Attaching the rhel5u4-fv-before-mig XML from before migration and the rhel5u4-fv-after-mig XML from after migration.

Created attachment 355926 [details]
rhel5u4-fv before migration
Created attachment 355927 [details]
rhel5u4-fv after migration
Attaching /var/log/xen/xend.log. Created attachment 355928 [details]
xend.log
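Several of the reports in this thread hinge on whether the guest answers ping after a migration. A small helper that times how long a target stays unreachable makes such reports quantifiable; a sketch (the target address is a placeholder, and the PING override is an assumption for dry runs):

```shell
# Time how long a target stays unreachable after a migration.
# Run right after the migration finishes; target is a placeholder address.
wait_for_ping() {
    local target=$1 timeout=${2:-600}
    local ping=${PING:-ping}     # overridable for dry runs
    local start now
    start=$(date +%s)
    until $ping -c 1 -W 1 "$target" > /dev/null 2>&1; do
        now=$(date +%s)
        if [ $((now - start)) -ge "$timeout" ]; then
            echo "$target still unreachable after ${timeout}s" >&2
            return 1
        fi
        sleep 1
    done
    echo "$target reachable after $(( $(date +%s) - start ))s"
}
```

Running this inside the guest against its host (or vice versa) right after a migration would put a number on the multi-minute outage reported later in this thread.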
This bug can be reproduced on xen-3.0.3-93.el5, with the same test result as in comment #45 and comment #49.

Could you also provide xend.log from the target machine (system B)? When migrating, the xend.log from the source machine is rarely useful. Thanks.

(In reply to comment #54)
> This bug can be reproduced on xen-3.0.3-93.el5

So you can reproduce it on the -93 version several times in a row, but you can't on -91? I think it's most likely reproducible on -91 as well when you run the test several times in a row, isn't it? In that case this one was not fixed in any version; it's just hard to reproduce. Could you do some more testing on that? Thanks, Michal

This bug cannot be reproduced with xen -91 on an x86_64 system, but it can be reproduced with xen -92 and -93 on an i386 system. So I think it may be an i386-related bug.

This seems strange. Why did you test xen -91 on x86_64 but -92 and -93 on an i386 system? Could you please test xen -91 on i386 as well and report whether it works fine?

Well, can somebody provide me more information about that? Information on whether it's reproducible on -91 i386 would be appreciated, as I currently don't have an i386 dom0.

Please ignore the previous test results on i386 xen -92 and -93; I finally found that one of my machines has a hardware issue. Sorry for that. I retested pv guest migration on i386 xen -94 on 2 good machines. Test steps and results:

1. Create one rhel5u4 guest on host A.
2. Migrate it from host A to host B with the virsh command (2 rounds):
   # virsh migrate --live rhel5u4 xen+ssh://*.*.*.*
   ---> Successful

Results:
1. Network info and device info are not lost during migration.
2. From the host, pinging the guest succeeds during migration.
3. After the guest migrates to host A, the guest can ping the host successfully; after the guest migrates to host B, pinging the host from the guest fails.

It seems strange that on host B the guest fails to ping the host, while the host can ping the guest successfully.

(In reply to comment #61)
Ok. So was it working fine on the machines with no hardware issues? On point 1, what do you mean: that the information persists while the domain is migrating, or something else? On point 2, you can ping the domain while it is migrating? It is a live migration, right? That should work during live migration, with only a minor outage. On point 3, you mean pinging the host machine from the guest machine? And does this problem persist after the migration has completed successfully? Maybe there is some outage for a while, so you can't ping it immediately but would be able to if you tried again later (doing nothing else in between).

(In reply to comment #62)
Yes, the retest machines had no hardware issues. (Also, 'issues' in my report should have read 'results': points 1 and 2 are expected results.) On point 3: yes, I mean pinging the host from the guest machine, and yes, the problem persists after the migration is successfully done. I will try pinging it again after a while and tell you the result later.

(In reply to comment #63)
Ok, so no software issue was found and the earlier failure turned out to be caused by a hardware issue, right? About point 3: what about trying to ping another machine? What does the guest report in ifconfig and in the stats for the network interface? Please do retest; if this is the issue, some logs from both the host machine and the guest machine will be useful, mainly information about the network device configuration in the guest, I think.

(In reply to comment #64)
Yes: after waiting about 8 minutes after the migration, the guest can ping the host.

I did several migrations (about 10 per host machine) and found no issue. I've been using the -94 version with some new patches applied, available at http://people.redhat.com/minovotn/xen . Could you please try with this one, since I was unable to reproduce it? I used this version of the xen package together with kernel-xen-2.6.18-164 ... Could you please try to reproduce with it and tell us whether the issue is still present? Thanks, Michal

Was any testing done with this one? I was unable to reproduce it at all. Michal

Michal, sorry for the late update. Verified this bug on two Intel machines.

Version-Release number of selected component (if applicable):
libvirt: libvirt-0.6.3-20.el5
xen: xen-3.0.3-105.el5
kernel: kernel-2.6.18-194.el5
Host is RHEL-Server-5.4.x86_64; PV guest is RHEL-Server-5.4.i386/x86_64.

As this bug is said to have nothing to do with libvirt, I test migration only with the "xm migrate -l" command. There is already a migration sub-test in xen autotest, ping_pong_migration, which uses "xm migrate -l" to migrate the vm to a remote host, then back, then to the remote again, then back... After migrating the vm for 20 cycles (to the remote host and back), no devices are lost; all network/input/graphics devices exist as when the vm was created. That is to say, the bug was not reproduced.

(In reply to comment #71)
> That is to say the bug was not reproduced.

Ok, so does that mean the bug is no longer reproducible but was reproducible before? On what version of the xen package was it reproducible? You say you tried it with xen-3.0.3-105.el5 and it was not reproducible. Originally it was closed as a dup of bug 507765, and the patch for that bug was already built into xen-3.0.3-91.el5. Also, I did some testing with the -94 version of the xen package as described in comment #66 and found no issue. But the bug got reopened while the reporter was using the -93 version of the xen package, so could you please retest using both i386 and x86_64 versions of both dom0 and domU to be sure it's working fine? Thanks, Michal

Tried to reproduce this bug with xen-3.0.3-80.el5 (there is no xen -81 ~ -93 package available in brew now); the pv guest suspended just after 2 rounds of migration.

host: RHEL-Server-5.4 x86_64 (xen-3.0.3-80.el5, kernel-xen-2.6.18-164.el5)
pv guest: RHEL-Server-5.4 i386

$ xm list
Name                 ID Mem(MiB) VCPUs State   Time(s)
Domain-0              0     7352     2 r-----   1061.5
migrating-rhel54pv    2      511     1 ---s--      0.0

$ cat /var/log/xen/xend.log
...
[2010-05-05 14:33:34 xend 4453] DEBUG (DevController:496) hotplugStatusCallback /local/domain/0/backend/tap/2/51712/hotplug-status.
[2010-05-05 14:33:34 xend 4453] DEBUG (DevController:510) hotplugStatusCallback 1.
[2010-05-05 14:33:34 xend 4453] DEBUG (DevController:154) Waiting for devices vtpm.
[2010-05-05 14:33:34 xend.XendDomainInfo 4453] DEBUG (XendDomainInfo:1036) XendDomainInfo.handleShutdownWatch
[2010-05-05 14:33:34 xend.XendDomainInfo 4453] INFO (XendDomainInfo:994) Domain has shutdown: name=migrating-rhel54pv id=2 reason=suspend.
[2010-05-05 14:33:34 xend 4453] INFO (XendCheckpoint:99) Domain 2 suspended.
[2010-05-05 14:33:34 xend 4453] DEBUG (XendCheckpoint:108) Written done
[2010-05-05 14:35:53 xend.XendDomainInfo 4453] INFO (XendDomainInfo:994) Domain has shutdown: name=migrating-rhel54pv id=2 reason=suspend.
[2010-05-05 14:35:53 xend.XendDomainInfo 4453] INFO (XendDomainInfo:994) Domain has shutdown: name=migrating-rhel54pv id=2 reason=suspend.
[2010-05-05 14:35:53 xend.XendDomainInfo 4453] INFO (XendDomainInfo:994) Domain has shutdown: name=migrating-rhel54pv id=2 reason=suspend.
[2010-05-05 14:36:46 xend.XendDomainInfo 4453] INFO (XendDomainInfo:994) Domain has shutdown: name=migrating-rhel54pv id=2 reason=suspend.
...

The test result in comment #73 is invalid; I probably tried to migrate a guest that was already migrating. I re-tested against xen-3.0.3-80.el5 with the following steps:

1. Create a pv guest with its disk placed on shared NFS storage (mounted on /data by the hosts in this case):

$ cat rhel54pv.cfg
name = "rhel54pv"
maxmem = 512
memory = 512
vcpus = 1
bootloader = "/usr/bin/pygrub"
pae = 1
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "restart"
vfb = [ 'type=vnc,vncunused=1,keymap=en-us,vnclisten=0.0.0.0' ]
disk = [ "tap:aio:/data/rhel-server-5.4-32-pv.img,xvda,w" ]
vif = [ "mac=00:16:36:63:75:48,bridge=xenbr0,script=vif-bridge" ]

2. Start X windows in the guest and issue the following command in a gnome terminal, while pinging the guest from outside:
$ i=0; while sleep 1; do echo $((i++)); done

3. Migrate the pv guest between the two hosts (src and dst) for 100 rounds:
[src]$ xm migrate -l rhel54pv $(ip_of_dst)
[dst]$ xm migrate -l rhel54pv $(ip_of_src)

4. After 100 rounds of migration, the command from step 2 keeps running in the guest, and the network and the vfb session of the guest also work well.

I also tested against xen-3.0.3-94.el5 with the same environment and the same steps as above, and got the same results as with xen-3.0.3-80.el5 (migrating without the '-l' parameter was also covered). So I can't reproduce this bug for now.

(In reply to comment #74)
> so I can't reproduce this bug by now.

Good. What about the latest version of the xen packages, i.e. xen-3.0.3-107.el5 [virttest25]?
Thanks, Michal

Tested against xen-3.0.3-107.el5 with the same steps as in comment #74:

$ uname -a
Linux intel-8400-8-1 2.6.18-164.el5xen #1 SMP Tue Aug 18 15:59:52 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
$ rpm -q xen
xen-3.0.3-107.el5

1. Migrating the guest with 'xm migrate $dom $host': after 100 rounds of migration, the pv guest stays healthy.
2. Live-migrating the guest with 'xm migrate -l $dom $host': after about 50 rounds of migration, the pv guest disappeared completely from the 'xm list' output on both the source and destination hosts.

The xend.log from both the source host and the destination host will be attached soon.
xend.log on the source host
Created attachment 412064 [details]
xend.log on the source host
Created attachment 412065 [details]
xend.log on the destination host
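The ping-pong procedure used in the tests above (comments #74 and #76) can be driven from the source host. A sketch, with placeholder host addresses and assuming passwordless ssh to the peer (the XM/SSH overrides are assumptions for dry runs, not part of the original test):

```shell
# Sketch of the xm-based ping-pong live migration from comments #74/#76.
# src/dst are placeholder host addresses; ssh access to dst is assumed.
ping_pong() {
    local domain=$1 src=$2 dst=$3 rounds=$4
    local xm=${XM:-xm} ssh=${SSH:-ssh}   # overridable for dry runs
    local i
    for i in $(seq 1 "$rounds"); do
        # live-migrate to the destination, then have it send the guest back
        $xm migrate -l "$domain" "$dst"            || return 1
        $ssh "$dst" xm migrate -l "$domain" "$src" || return 1
        # fail fast if the guest vanished from 'xm list' on the way
        # (the symptom reported in comment #76)
        $xm list "$domain" > /dev/null             || return 1
    done
}
```

Invoked as, e.g., `ping_pong rhel54pv 192.168.0.1 192.168.0.2 50`, this stops at the first round where a migration fails or the domain disappears, which makes the intermittent comment #76 failure easier to catch.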
(In reply to comment #79)
> Created an attachment (id=412065) [details]
> xend.log on the destination host

Well, it looks like this is something different, but investigation of this log revealed that "[2010-05-06 13:27:11 xend 3456] INFO (XendCheckpoint:364) ERROR Internal error: Failed to pin batch of 164 page tables" is the key error that makes xc_restore fail. This is code for PV guests only, which calls the xc_mmuext_op() function to pin the page tables to the guest once a batch of MAX_PIN_BATCH pins (which is set to 1024) has accumulated. The xc_mmuext_op() function issues a __HYPERVISOR_mmuext_op hypercall to the hypervisor, so I need to ask you a few more things to be sure:

1. Was the kernel-xen version the same all the time? I.e. kernel-xen-2.6.18-194.el5?
2. Was the only package changed the xen user-space package?
3. Did the error occur only during live migration, and was it repeated behavior? I.e. is it 100% reproducible or not?

I've been looking at the user-space changes from xen-3.0.3-94.el5 but I can see anything that could cause this behaviour, so the information I am asking for could be helpful.

Thanks,
Michal

Oh, sorry. I meant I can't see anything that could have caused this behavior.

Michal

(In reply to comment #80)
> 1. Was the kernel-xen version the same all the time? I.e.
> kernel-xen-2.6.18-194.el5?

Yes, it's always kernel-xen-2.6.18-164.el5, which comes with the released RHEL5.4.

> 2. Was the only package changed the xen user-space package?

Yes, I installed the released RHEL5.4 server and then updated only the xen user-space package.

> 3. Did the error occur only during live migration, and was it repeated
> behavior? I.e. is it 100% reproducible or not?

I only saw this problem the first time I live migrated the PV guest, and cannot reproduce it after that. I tested live migration in the same environment with the same steps another 3 times, live migrating the PV guest for 400 rounds, and the guest lived well after the migrations.

(In reply to comment #82)
> I only saw this problem the first time I live migrated the PV guest, and
> cannot reproduce it after that.

What do you mean by you saw the problem the first time only? Is it not reproducible any longer? How is it reproducible? Not 100%, I guess.

Could you please try using the latest kernel-xen and xen packages and provide us test results?

Thanks a lot,
Michal

(In reply to comment #83)
> What do you mean by you saw the problem the first time only? Is it not
> reproducible any longer? How is it reproducible? Not 100%, I guess.

The problem mentioned in comment #76 (guest lost while doing live migration) is not reproducible; I only hit that problem 1 time during the testing (I migrated the PV guest 400 rounds, about 10 times).

> Could you please try using the latest kernel-xen and xen packages and
> provide us test results?

Tested against kernel-xen-2.6.18-199.el5 and xen-3.0.3-109.el5, live migrating the PV guest for 200 rounds with the same steps as in comment #74. Tested 5 times; no problem found during testing, and the guest works well after the migrations.

(In reply to comment #84)
> Tested 5 times; no problem found during testing, and the guest works well
> after the migrations.

So, does it mean you cannot run into those issues in all the testing you have been doing using the latest kernel-xen and xen versions? If so, feel free to close this bug.

Thanks,
Michal

(In reply to comment #85)
> So, does it mean you cannot run into those issues in all the testing you
> have been doing using the latest kernel-xen and xen versions? If so, feel
> free to close this bug.

Yes, I'm closing it as VERIFIED.

(In reply to comment #86)
> Yes, I'm closing it as VERIFIED.

Great. Thanks!
Michal
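The pin-batch behavior described above — accumulating pin requests and submitting them to the hypervisor once MAX_PIN_BATCH (1024) is reached, with a final flush for whatever remains — can be sketched as follows. This is an illustrative Python model of the batching pattern, not the real libxc C code: pin_page_tables and submit_batch are hypothetical names, with submit_batch standing in for the xc_mmuext_op() hypercall wrapper. It also shows why the failing batch in the log held only 164 entries: that is the leftover partial batch flushed at the end.

```python
# Illustrative sketch of the batch-and-flush pinning pattern (hypothetical
# names; the real implementation lives in libxc's C restore path).
MAX_PIN_BATCH = 1024  # batch size cited in the log analysis above

def pin_page_tables(mfns, submit_batch):
    """Accumulate pin requests and flush them in batches of MAX_PIN_BATCH.

    `submit_batch` stands in for the xc_mmuext_op() hypercall wrapper; it
    receives one list of pending pin operations per hypercall and may raise
    on failure (the "Failed to pin batch of N page tables" case).
    """
    batch = []
    flushed = 0
    for mfn in mfns:
        batch.append(mfn)
        if len(batch) == MAX_PIN_BATCH:
            submit_batch(batch)   # one full-size hypercall
            flushed += len(batch)
            batch = []
    if batch:
        submit_batch(batch)       # final partial batch, e.g. 164 entries
        flushed += len(batch)
    return flushed
```

With 1188 page tables to pin, this issues one full batch of 1024 followed by a partial batch of 164 — matching the error message in the xend.log.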
Created attachment 350085 [details]
xml file for creating pool1

Description of problem:
During PV domain stress two-way migration testing on the Xen hypervisor, the domain loses its network and the migration hangs without responding any more.

Version-Release number of selected component (if applicable):
libvirt: libvirt-0.6.3-11.el5
xen: xen-3.0.3-88.el5
kernel: 2.6.18-155.el5xen

How reproducible:
100%, every time.

Setup:
There are two hosts with the same hardware & software configuration, host A and host B. These two hosts trust each other over SSH: I created an SSH public key on host A and copied it to host B, and vice versa, so migration requires no password prompt.

Steps to Reproduce:
1. Create "pool1" on host A and host B with "virsh pool-define pool1.xml" & "virsh pool-start pool1"; its type is "netfs", so the two hosts share the same NFS pool (pool1.xml is attached).
2. wget a PV domain disk image into the pool target.
3. Define and start domain "rhel5u4" with "virsh define rhel5u4.xml" & "virsh start rhel5u4" on host A; the domain disk points to the disk image downloaded in step 2 (rhel5u4.xml is attached).
4. Run "perl virshmig.pl --guestname=rhel5u4 --mac=00:31:a3:14:3e:f0 --peerip=<host B ip address> --myip=<host A ip address>" on host A to start the stress migration test.

Note that:
a. "virshmig.pl" is a perl script which invokes virsh commands to do domain migration stress testing; please find it and its dependency ipget.sh in the attachments.
b. --mac is the domain MAC address taken from its dumped xml file.

Actual results:
1. This PV domain lost its network during stress migration testing.
2. From the xml file of the domain dumped dynamically, some devices (graphics, input) are missing; see remote2.xml and local2.xml attached, which were dumped when the domain migrated to remote and back to local for the 2nd time.
3. Once devices are lost, the subsequent migration tests hang without responding any more.

Expected results:
1. The PV domain network should not be lost.
2. The devices (graphics, input) should not disappear from the dumped xml file.

Additional info:
For Xen full virtualization domain stress migration testing, this bug can NOT be reproduced. So this bug only exists for Xen PV domain migration.
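For reference, the bi-directional stress loop that step 4 drives through virshmig.pl can be sketched as below. This is a hypothetical Python model, not the attached Perl script: the xen+ssh:// connection URIs and the `run` callback are assumptions, and `run` is injected so the ping-pong logic can be exercised without two live hosts.

```python
# Hypothetical sketch of the two-way migration stress loop (the real test
# driver is the attached virshmig.pl). `run` abstracts shelling out to virsh.
def stress_migrate(run, guest, host_a, host_b, rounds):
    """Live-migrate `guest` back and forth between two hosts `rounds` times.

    Each round is one A->B plus one B->A live migration, mirroring the
    "migrated to remote / migrated back to local" steps in the report.
    Returns the list of commands issued, for inspection.
    """
    src, dst = host_a, host_b
    cmds = []
    for _ in range(rounds):
        for _hop in range(2):
            cmd = ["virsh", "-c", "xen+ssh://%s" % src,
                   "migrate", "--live", guest, "xen+ssh://%s/" % dst]
            run(cmd)                # would raise or log on a hung migration
            cmds.append(cmd)
            src, dst = dst, src     # reverse direction for the return trip
    return cmds
```

In the real test, the script additionally pings the guest's IP (resolved from the --mac argument via ipget.sh) after each hop, which is how the lost-network symptom above was detected.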