Bug 509099

Summary: xen pv domain network lost during stress migration test
Product: Red Hat Enterprise Linux 5
Component: xen
Version: 5.4
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: medium
Target Milestone: rc
Keywords: Reopened
Reporter: Edward Wang <edwang>
Assignee: Michal Novotny <minovotn>
QA Contact: Virtualization Bugs <virt-bugs>
CC: areis, clalance, ddutile, leiwang, llim, mrezanin, mshao, qwan, tyan, virt-maint, xen-maint, yoyzhang, yshao
Doc Type: Bug Fix
Last Closed: 2010-06-07 07:45:30 UTC
Bug Blocks: 514498
Attachments:
- xml file for creating pool1
- pv domain xml file
- perl script to do pv domain migration stress test
- dependency of perl script virshmig.pl
- xml dump when this pv domain migrated to remote for the 1st time
- xml dump when this pv domain migrated back to local for the 1st time
- xml dump when this pv domain migrated to remote for the 2nd time
- xml dump when this pv domain migrated back to local for the 2nd time
- xend.log
- xend.log when this bug is reproduced
- rhel5u3-pv before migration
- rhel5u3-pv after migration
- rhel5u4-fv before migration
- rhel5u4-fv after migration
- xend.log
- xend.log on the source host
- xend.log on the source host
- xend.log on the destination host

Description Edward Wang 2009-07-01 11:08:54 UTC
Created attachment 350085 [details]
xml file for creating pool1

Description of problem:
During pv domain stress two-way migration testing on the xen hypervisor, the domain loses its network and the migration hangs without response.

Version-Release number of selected component (if applicable):
libvirt: libvirt-0.6.3-11.el5
xen: xen-3.0.3-88.el5
kernel: 2.6.18-155.el5xen

How reproducible:
100%, every time.

Setup:
There are two hosts with the same hardware & software configuration, host A and host B. The two hosts trust each other over ssh, that is, I created an ssh key pair on host A and copied its public key to host B, and vice versa.
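In other words, each host can log in to the other without a password. The trust setup was roughly the following (a sketch, assuming the root account and default key locations on both hosts):

[host A]$ ssh-keygen -t rsa            # accept the defaults, empty passphrase
[host A]$ ssh-copy-id root@<host B>    # install host A's public key on host B
[host B]$ ssh-keygen -t rsa
[host B]$ ssh-copy-id root@<host A>    # and vice versa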

Steps to Reproduce:
1. Create "pool1" on host A and host B with "virsh pool-define pool1.xml" and "virsh pool-start pool1"; its type is "netfs", so the two hosts share the same nfs pool (pool1.xml is attached).
2. wget the pv domain's disk image into the pool target.
3. Define and start domain "rhel5u4" with "virsh define rhel5u4.xml" and "virsh start rhel5u4" on host A; the domain's disk points to the image downloaded in step 2 (rhel5u4.xml is attached).
4. Run "perl virshmig.pl --guestname=rhel5u4 --mac=00:31:a3:14:3e:f0 --peerip=<host B ip address> --myip=<host A ip address>" on host A to start the stress migration test.
Note that:
a. "virshmig.pl" is a perl script which invokes virsh to do domain migration stress testing; see it and its dependency ipget.sh in the attachments. A simplified sketch of its migration loop follows below.
b. --mac is the domain's mac address taken from its dumped xml file.
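In essence, virshmig.pl loops back-and-forth live migrations something like the following (a simplified sketch, not the actual script; the real one also verifies the guest's IP via ipget.sh between rounds, and the values in angle brackets are placeholders):

while true; do
    # push the guest to host B, then have host B push it back
    virsh migrate --live rhel5u4 xen+ssh://<host B ip address>
    ssh <host B ip address> "virsh migrate --live rhel5u4 xen+ssh://<host A ip address>"
    # stop as soon as the guest's network is gone
    ping -c 3 <guest ip address> || break
done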

  
Actual results:
1. This pv domain lost its network during stress migration testing.
2. From the dynamically dumped xml of the domain, some devices (graphics, input) are missing; see the attached remote2.xml and local2.xml, dumped when the domain migrated to remote and back to local for the 2nd time.
3. Once devices are lost, subsequent migration attempts hang without response.

Expected results:
1. The pv domain's network should not be lost.
2. The devices (graphics, input) should not be missing from the dumped xml.

Additional info:
This bug can NOT be reproduced with xen full virtualization domain stress migration testing, so it only affects xen pv domain migration.

Comment 1 Edward Wang 2009-07-01 11:10:34 UTC
Created attachment 350086 [details]
pv domain xml file

Comment 2 Edward Wang 2009-07-01 11:11:47 UTC
Created attachment 350087 [details]
perl script to do pv domain migration stress test

Comment 3 Edward Wang 2009-07-01 11:12:30 UTC
Created attachment 350088 [details]
dependency of perl script virshmig.pl

Comment 4 Edward Wang 2009-07-01 11:13:36 UTC
Created attachment 350089 [details]
xml dump when this pv domain migrated to remote for the 1st time

Comment 5 Edward Wang 2009-07-01 11:14:17 UTC
Created attachment 350090 [details]
xml dump when this pv domain migrated back to local for the 1st time

Comment 6 Edward Wang 2009-07-01 11:14:45 UTC
Created attachment 350091 [details]
xml dump when this pv domain migrated to remote for the 2nd time

Comment 7 Edward Wang 2009-07-01 11:15:23 UTC
Created attachment 350092 [details]
xml dump when this pv domain migrated back to local for the 2nd time

Comment 11 Herbert Xu 2009-07-17 00:40:26 UTC
So are the devices lost in the xml before the hang occurs? If so, this doesn't sound like a kernel problem.

Comment 15 Daniel Berrangé 2009-07-20 20:59:54 UTC
This really isn't a libvirt bug. libvirt XML just shows what XenD provides. So missing devices are XenD's fault.

Missing <graphics> tag just sounds like another case of this

https://bugzilla.redhat.com/show_bug.cgi?id=507765

Comment 17 Michal Novotny 2009-07-21 09:45:10 UTC
Hi,
could you please try this one with RPMs from:

http://people.redhat.com/minovotn/xen

There is a new patch that *may be* affecting this one.

Please provide test results.

Thanks,
Michal

Comment 18 Michal Novotny 2009-07-21 10:52:19 UTC
(In reply to comment #15)
> This really isn't a libvirt bug. libvirt XML just shows what XenD provides. So
> missing devices are XenD's fault.
> 
> Missing <graphics> tag just sounds like another case of this
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=507765  

Daniel, it seems you're right that it's a dup of BZ #507765, but I was working on this one yesterday and the patch is available in the RPMs I wrote about in comment #17, so I need Edward to install those RPMs and try to reproduce it with them. Most probably it's a dup, but I won't close it until Edward tests it and it passes, to be sure it's not something different.

Michal

Comment 19 Edward Wang 2009-07-21 11:33:17 UTC
(In reply to comment #17)
> Hi,
> could you please try this one with RPMs from:
> 
> http://people.redhat.com/minovotn/xen
> 
> There is a new patch that *may be* affecting this one.
> 
> Please provide test results.
> 
> Thanks,
> Michal  

Michal,

This bug should be fixed by your new xen package.

With the same steps as in comment #0, after 20 rounds of stress migration of the xen pv domain, the network is still alive and no devices are lost from the domain xml config file.

Related package info is as below:
xen - xen-3.0.3-90mig.elt
libvirt - libvirt-0.6.3-15.el5
kernel - 2.6.18-158.el5xen

Thanks
Edward Wang

Comment 20 Edward Wang 2009-07-21 11:35:04 UTC
(In reply to comment #18)
> (In reply to comment #15)
> > This really isn't a libvirt bug. libvirt XML just shows what XenD provides. So
> > missing devices are XenD's fault.
> > 
> > Missing <graphics> tag just sounds like another case of this
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=507765  
> 
> Daniel, it seems you're right that it's a dup of BZ #507765, but I was working
> on this one yesterday and the patch is available in the RPMs I wrote about in
> comment #17, so I need Edward to install those RPMs and try to reproduce it
> with them. Most probably it's a dup, but I won't close it until Edward tests
> it and it passes, to be sure it's not something different.
> 
> Michal  

Michal,

Is the root cause of this bug the same as that of BZ #507765? The phenomena seem different.

Thanks.

Comment 21 Michal Novotny 2009-07-21 12:24:59 UTC
Edward,
BZ #507765 was only about PVFB devices, i.e. the virtual keyboard and framebuffer. It had nothing to do with networking, but since those RPMs are based on the -90 version of the xen package (the official version with just 2 new patches applied, neither of which had anything to do with networking), it appears that the current official RPMs already have this one fixed. The lost graphics devices were caused by 507765, but not the network issue. Since you were unable to reproduce it now, I would wait for the next official version, retest on that one, and close this BZ if it's no longer reproducible on the *official* version... ;)

Michal

Comment 22 Jiri Denemark 2009-07-21 12:48:44 UTC
(In reply to comment #20)
> Is the root cause of this bug the same as that of BZ #507765? The phenomena
> seem different.

Edward,

Since there's no xend.log attached to this BZ I cannot be 100% sure but still I'm pretty sure that the root cause of this bug and https://bugzilla.redhat.com/show_bug.cgi?id=507765 is the same. What you see as a device loss most likely prevents successful migration of the guest. I guess the guest is not even running at the time it loses network connection...

Comment 23 Michal Novotny 2009-07-21 13:01:10 UTC
(In reply to comment #22)
> (In reply to comment #20)
> > Is the root cause of this bug the same as that of BZ #507765? The phenomena
> > seem different.
> 
> Edward,
> 
> Since there's no xend.log attached to this BZ I cannot be 100% sure but still
> I'm pretty sure that the root cause of this bug and
> https://bugzilla.redhat.com/show_bug.cgi?id=507765 is the same. What you see as
> a device loss most likely prevents successful migration of the guest. I guess
> the guest is not even running at the time it loses network connection...  

I don't think so, because 507765 was caused by a "bad" fix for a xenstore leak involving PVFB devices, so I am not touching anything other than the vkbd/vfb devices. BZ #439182 reported a leak in xenstore, and my fix for it caused the regression reported in BZ #507765. The only thing I manipulated was vkbd/vfb, and I didn't even read any frontend/backend vif device, so it's highly unlikely this was fixed by the patch for leaking PVFB devices; it may have been corrected by some networking BZ whose fix was applied there.

Anyway, xend.log would be helpful, so could you please attach the xend.log from at least 2 migration attempts with the new RPMs?

Thanks,
Michal

Comment 24 Edward Wang 2009-07-22 06:34:21 UTC
Michal,

I've tried 9 times and xend.log is attached for your reference.

Thanks,
Edward

Comment 25 Edward Wang 2009-07-22 06:35:03 UTC
Created attachment 354655 [details]
xend.log

Comment 26 Jiri Denemark 2009-07-22 06:45:04 UTC
Edward,

From the log you attached, it seems you were trying it with a package which had the latest patch for 507765 applied. And you still experienced the issue with lost network, right?

Comment 27 Edward Wang 2009-07-22 07:07:28 UTC
(In reply to comment #26)
> Edward,
> 
> From the log you attached, it seems you were trying it with a package which had
> the latest patch for 507765 applied. And you still experienced the issue with
> lost network, right?  

No. 

Neither the network issue nor the device loss (in the xml) occurs any more since I upgraded to Michal's private package from comment #17. Michal's package has already fixed this bug. We are just discussing whether the root cause of this bug is the same as that of BZ #507765.

Thanks
Edward

Comment 28 Jiri Denemark 2009-07-22 07:18:40 UTC
Interesting, I saw some errors in the log regarding vbd device... But it doesn't matter, could you please also provide xend.log when you try stress migration with older xen package (i.e. -88.el5 which you reported this bug for), that is when the guest loses network connection.

Thanks a lot.

Comment 29 Edward Wang 2009-07-22 09:21:08 UTC
(In reply to comment #28)
> Interesting, I saw some errors in the log regarding vbd device... But it
> doesn't matter, could you please also provide xend.log when you try stress
> migration with older xen package (i.e. -88.el5 which you reported this bug
> for), that is when the guest loses network connection.
> 
> Thanks a lot.  

I reproduced this bug with the following components:
xen: -90.el5
libvirt: -15.el5
kernel: -158.el5xen

The xend.log is attached for your reference; its name is xend(reproducible).log.

Then I upgraded xen to Michal's package from comment #17 to check whether this bug is fixed in the same environment; the bug was not reproduced, that is, Michal's package from comment #17 fixed this bug.

Thanks
Edward

Comment 30 Edward Wang 2009-07-22 09:22:45 UTC
Created attachment 354659 [details]
xend.log when this bug is reproduced

Comment 31 Jiri Denemark 2009-07-22 09:35:30 UTC
Thanks for that. Although it's kind of interesting that there is no error in the log at all. Any chance you could send the xend.log from the other machine?

Comment 32 Edward Wang 2009-07-22 09:46:18 UTC
Actually, I could not reproduce this bug on my own two machines; I had to find another two machines from colleagues to reproduce it.

Though there is no error in the xend.log I attached, Michal's private package really has fixed this bug, and I have to switch to other project issues now...

If you are really interested in the root cause of this bug, maybe you can try to find another two machines to reproduce it? :)

Thanks for your understanding
Edward

Comment 33 Michal Novotny 2009-07-22 11:35:19 UTC
Edward, my package should fix the PVFB issues, that's right, but I did nothing about the vif interface. So you're saying you were able to reproduce the bug on the -90 version of the Xen package for both PVFB devices and network (vif interface) devices? My PVFB patch did nothing about vif devices at all, so I don't understand what the issue was there.

In fact I don't know whether it's really a XenD fault where networking is concerned... According to the XML there was a <graphics> tag missing, which refers to PVFB devices (BZ #507765), but there was a bridge interface and a networking device with a MAC address in all the XMLs:

    <interface type='bridge'>
      <mac address='00:31:a3:14:3e:f0'/>
      <source bridge='xenbr0'/>
      <script path='vif-bridge'/>
      <target dev='vif23.0'/>
    </interface>

So I guess if this is no longer reproducible, something fixed it, but since the patch for BZ #507765 doesn't touch anything other than PVFB devices, it's highly unlikely it was corrected by that patch. So I would like to ask whether you lost networking with the -90 version you were testing now... Did you lose your networking?

Thanks,
Michal

Comment 34 Edward Wang 2009-07-23 02:10:03 UTC
Michal,

(In reply to comment #33)
> Edward, my package should fix the PVFB issues, that's right, but I did
> nothing about the vif interface. So you're saying you were able to reproduce
> the bug on the -90 version of the Xen package for both PVFB devices and
> network (vif interface) devices? My PVFB patch did nothing about vif devices
> at all, so I don't understand what the issue was there.
> 
> In fact I don't know whether it's really a XenD fault where networking is
> concerned... According to the XML there was a <graphics> tag missing, which
> refers to PVFB devices (BZ #507765), but there was a bridge interface and a
> networking device with a MAC address in all the XMLs:
> 
>     <interface type='bridge'>
>       <mac address='00:31:a3:14:3e:f0'/>
>       <source bridge='xenbr0'/>
>       <script path='vif-bridge'/>
>       <target dev='vif23.0'/>
>     </interface>
> 
> So I guess if this is no longer reproducible, something fixed it, but since
> the patch for BZ #507765 doesn't touch anything other than PVFB devices, it's
> highly unlikely it was corrected by that patch. So I would like to ask
> whether you lost networking with the -90 version you were testing now... Did
> you lose your networking?
> 
> Thanks,
> Michal

Michal,

Yesterday I did the testing on two Intel machines with the following versions:
- xen: -90.el5
- libvirt: -15.el5
- kernel: -158.el5xen
and found that the domain's devices were lost (the input and graphics xml nodes) and that the domain could not be pinged during the 2nd round of migration. One round of migration means migrating the domain to the remote host and then back to the local host.

Then I upgraded the xen package to the one from comment #17 and the bug disappeared. The environment is exactly the same, except for the upgraded xen package you provided to me.

Not sure whether this is helpful to you?

Thanks
Edward

Comment 35 Michal Novotny 2009-07-23 08:44:39 UTC
Edward,
this is exactly what I meant. Thanks. I am just a little confused, because my PVFB patch did nothing to networking devices. In fact, the official -91 version of the xen package should contain this PVFB patch (BZ #507765)... could you try with that version when it's available?

Thanks,
Michal

Comment 38 Edward Wang 2009-07-23 10:20:32 UTC
Michal,

I've tested xen pv domain migration for 5 cycles (1 cycle means migrating the domain from local to remote and then back to local); no network or device loss. The component versions are:
kernel - 158.el5
xen - 91.el5
libvirt - 16.el5

This bug is really fixed now.

Thanks
Edward

Comment 39 Michal Novotny 2009-07-26 21:34:00 UTC
Thanks for testing, Edward, so this one is fixed... That's good ;)

Thanks,
Michal

Comment 42 zhanghaiyan 2009-07-30 09:06:33 UTC
Referring to comment #38, changing this bug's status to VERIFIED.

Comment 43 zhanghaiyan 2009-07-30 09:07:00 UTC
Clear needinfo flag

Comment 44 Chris Lalancette 2009-07-30 11:55:30 UTC
OK.  Since it seems this is fixed, I'm going to close this as a dup of 507765, even though we don't know exactly why that patch fixed this problem.

Chris Lalancette

*** This bug has been marked as a duplicate of bug 507765 ***

Comment 45 zhanghaiyan 2009-08-02 07:32:31 UTC
This can be reproduced with RHEL5.4-i386 pv guest migration.
Version:
- RHEL5.4-Server-i386
- libvirt-0.6.3-17.el5
- xen-3.0.3-92.el5

Migrate a pv guest from system A to system B through Virt-manager.

After migration:
- On system A,
  no error info; the guest is shut off.

- On system B,
  the guest has been migrated from system A and is in running status,
  but it has lost its interface info.

Attaching the rhel5u4-pv xml from system B after migration.

Comment 46 zhanghaiyan 2009-08-02 07:36:26 UTC
Update to comment #45: the guest is a RHEL5U3 system.
Attaching the rhel5u3-pv-before-mig xml from system A before migration.
Attaching the rhel5u3-pv-after-mig xml from system B after migration.

Comment 47 zhanghaiyan 2009-08-02 07:40:58 UTC
Created attachment 355924 [details]
rhel5u3-pv before migration

Comment 48 zhanghaiyan 2009-08-02 07:41:46 UTC
Created attachment 355925 [details]
rhel5u3-pv after migration

Comment 49 zhanghaiyan 2009-08-02 07:49:01 UTC
This can be reproduced with RHEL5.4-i386 fv guest migration.
Version:
- RHEL5.4-Server-i386
- libvirt-0.6.3-17.el5
- xen-3.0.3-92.el5

Migrate an fv guest from system A to system B through Virt-manager.

After migration:
- On system A,
  the guest is shut off, and virt-manager reports an error:
  
Error migrating domain: Domain not found: xenUnifiedDomainLookupByName
Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/engine.py", line 561, in migrate_domain
    vm.migrate(destconn)
  File "/usr/share/virt-manager/virtManager/domain.py", line 1387, in migrate
    self.vm.migrate(conn, flags, None, uri, 0)
  File "/usr/lib/python2.4/site-packages/libvirt.py", line 378, in migrate
    if ret is None:raise libvirtError('virDomainMigrate() failed', dom=self)
libvirtError: Domain not found: xenUnifiedDomainLookupByName


- On system B,
  the guest has been migrated from system A and is in running status.
  But there is no console screen, and although the interface info is not lost from the xml, the guest cannot be pinged.

Attaching the rhel5u4-fv-before-mig xml from before migration.
Attaching the rhel5u4-fv-after-mig xml from after migration.

Comment 50 zhanghaiyan 2009-08-02 07:49:50 UTC
Created attachment 355926 [details]
rhel5u4-fv before migration

Comment 51 zhanghaiyan 2009-08-02 07:50:37 UTC
Created attachment 355927 [details]
rhel5u4-fv after migration

Comment 52 zhanghaiyan 2009-08-02 07:52:07 UTC
Attaching /var/log/xen/xend.log.

Comment 53 zhanghaiyan 2009-08-02 07:52:30 UTC
Created attachment 355928 [details]
xend.log

Comment 54 zhanghaiyan 2009-08-03 08:01:53 UTC
This bug can be reproduced on xen-3.0.3-93.el5, with the same test results as in comment #45 and comment #49.

Comment 55 Jiri Denemark 2009-08-03 09:24:09 UTC
Could you also provide the xend.log from the target machine (system B)? When migrating, the xend.log from the source machine is rarely useful.

Thanks.

Comment 56 Michal Novotny 2009-08-03 09:50:38 UTC
(In reply to comment #54)
> This bug can be reproduced on xen-3.0.3-93.el5, which have the same test result
> with comment #45, comment #49  

So, you can reproduce it on the -93 version several times in a row, but you can't on -91? I think it's most likely reproducible on -91 as well if you run the test several times in a row, isn't it? Perhaps this one was not fixed in any version but is just hard to reproduce? Could you do some more testing on that?

Thanks,
Michal

Comment 57 zhanghaiyan 2009-08-03 09:58:54 UTC
This bug cannot be reproduced with xen -91 on an x86_64 system,
but it can be reproduced with xen -92 and -93 on an i386 system.

So I think it may be an i386-related bug.

Comment 58 Michal Novotny 2009-08-03 11:07:42 UTC
This seems strange. Why did you test xen -91 on x86_64 but -92 and -93 on an i386 system? Could you please test xen -91 on i386 as well and report whether it works fine?

Comment 60 Michal Novotny 2009-08-13 15:26:01 UTC
Well, could somebody provide me more information about this? Information on whether it's reproducible with -91 on i386 would be appreciated, but I don't currently have an i386 version of dom0.

Comment 61 zhanghaiyan 2009-08-14 09:46:11 UTC
Please ignore the previous test result on i386-xen-92 i386-xen-93, because finally I found that one of my machine has some hardware issue. Sorry for that.

I retested pv guest migration on i386-xen-94 on 2 good machine, the following is test step and result:

1. Create 1 guest rhel5u4 on host A
2. Migrate from host A to host B with virsh command (2 rounds)
   # virsh migrate --live rhel5u4 xen+ssh://*.*.*.*
---> Successfully

But I found the following issues:
1. Network info and device info does not lost during migration
2. In host, using ping to guest successfully during migration
3. After guest migrate to host A, guest can ping host successfully
   After guest migrate to host B, guest ping host failed

It seems strange that on host B, guest ping host failed, but host ping guest successfully.

Comment 62 Michal Novotny 2009-08-14 09:57:08 UTC
(In reply to comment #61)
> Please ignore the previous test result on i386-xen-92 i386-xen-93, because
> finally I found that one of my machine has some hardware issue. Sorry for that.
> 

Ok.

> I retested pv guest migration on i386-xen-94 on 2 good machine, the following
> is test step and result:
> 
> 1. Create 1 guest rhel5u4 on host A
> 2. Migrate from host A to host B with virsh command (2 rounds)
>    # virsh migrate --live rhel5u4 xen+ssh://*.*.*.*
> ---> Successfully
> 

So, was it working fine on machines with no hardware issues?


> But I found the following issues:
> 1. Network info and device info does not lost during migration


What do you mean by that? The information persists here when the domain is migrating and it should not or what do you mean?


> 2. In host, using ping to guest successfully during migration


So, you can ping the domain when migrating? It is live migration, right? I think this should be working in live migration and there should be just minor outage...


> 3. After guest migrate to host A, guest can ping host successfully
>    After guest migrate to host B, guest ping host failed
> 

You mean pinging host machine from the guest machine?


> It seems strange that on host B, guest ping host failed, but host ping guest
> successfully.  


Well, does this problem persist after migration is successfully done? Maybe there is some outage for a while so therefore you can't ping it but you should be able to ping it if you try again later (with doing nothing else).

Comment 63 zhanghaiyan 2009-08-14 10:08:19 UTC
(In reply to comment #62)
> (In reply to comment #61)
> > Please ignore the previous test result on i386-xen-92 i386-xen-93, because
> > finally I found that one of my machine has some hardware issue. Sorry for that.
> > 
> 
> Ok.
> 
> > I retested pv guest migration on i386-xen-94 on 2 good machine, the following
> > is test step and result:
> > 
> > 1. Create 1 guest rhel5u4 on host A
> > 2. Migrate from host A to host B with virsh command (2 rounds)
> >    # virsh migrate --live rhel5u4 xen+ssh://*.*.*.*
> > ---> Successfully
> > 
> 
> So, was it working fine on machines with no hardware issues?
---->Yes, no hardware issues
> 
> 
> > But I found the following issues:
------> Should update 'issues' to 'result'
> > 1. Network info and device info does not lost during migration
> 
> 
> What do you mean by that? The information persists here when the domain is
> migrating and it should not or what do you mean?
-----> This is expected result
> 
> 
> > 2. In host, using ping to guest successfully during migration
> 
> 
> So, you can ping the domain when migrating? It is live migration, right? I
> think this should be working in live migration and there should be just minor
> outage...
-----> This is expected result
> 
> 
> > 3. After guest migrate to host A, guest can ping host successfully
> >    After guest migrate to host B, guest ping host failed
> > 
> 
> You mean pinging host machine from the guest machine?
> 
-----> Yes, ping host from guest machine
> 
> > It seems strange that on host B, guest ping host failed, but host ping guest
> > successfully.  
> 
> 
> Well, does this problem persist after migration is successfully done? Maybe
> there is some outage for a while so therefore you can't ping it but you should
> be able to ping it if you try again later (with doing nothing else).  
----> Yes, this problem persist after migration is successfully done.
I will try to test ping it after a while and tell you result later

Comment 64 Michal Novotny 2009-08-14 10:34:51 UTC
(In reply to comment #63)
> (In reply to comment #62)
> > (In reply to comment #61)
> > > Please ignore the previous test result on i386-xen-92 i386-xen-93, because
> > > finally I found that one of my machine has some hardware issue. Sorry for that.
> > > 
> > 
> > Ok.
> > 
> > > I retested pv guest migration on i386-xen-94 on 2 good machine, the following
> > > is test step and result:
> > > 
> > > 1. Create 1 guest rhel5u4 on host A
> > > 2. Migrate from host A to host B with virsh command (2 rounds)
> > >    # virsh migrate --live rhel5u4 xen+ssh://*.*.*.*
> > > ---> Successfully
> > > 
> > 
> > So, was it working fine on machines with no hardware issues?
> ---->Yes, no hardware issues
> > 
> > 

Ok, so no software issue was found, and the problem turned out to be caused by a hardware issue, right?

> > > But I found the following issues:
> ------> Should update 'issues' to 'result'

Ok, good.

> > > 1. Network info and device info does not lost during migration
> > 
> > 
> > What do you mean by that? The information persists here when the domain is
> > migrating and it should not or what do you mean?
> -----> This is expected result
> > 

Ok.


> > 
> > > 2. In host, using ping to guest successfully during migration
> > 
> > 
> > So, you can ping the domain when migrating? It is live migration, right? I
> > think this should be working in live migration and there should be just minor
> > outage...
> -----> This is expected result
> > 
> > 

Ok, great.



> > > 3. After guest migrate to host A, guest can ping host successfully
> > >    After guest migrate to host B, guest ping host failed
> > > 
> > 
> > You mean pinging host machine from the guest machine?
> > 
> -----> Yes, ping host from guest machine
> > 


Ok, what about trying to ping another machine? What does the guest report in ifconfig and in the stats for the network interface?


> > > It seems strange that on host B, guest ping host failed, but host ping guest
> > > successfully.  
> > 
> > 
> > Well, does this problem persist after migration is successfully done? Maybe
> > there is some outage for a while so therefore you can't ping it but you should
> > be able to ping it if you try again later (with doing nothing else).  
> ----> Yes, this problem persist after migration is successfully done.
> I will try to test ping it after a while and tell you result later  



Please do, if this is the issue, some logs from both host machine and guest machine will be useful. Mainly information about network device configuration in the guest could help I think.

Comment 65 zhanghaiyan 2009-08-20 02:01:26 UTC
(In reply to comment #64)
> (In reply to comment #63)
> > (In reply to comment #62)
> > > (In reply to comment #61)
> > > > It seems strange that on host B, guest ping host failed, but host ping guest
> > > > successfully.  
> > > 
> > > 
> > > Well, does this problem persist after migration is successfully done? Maybe
> > > there is some outage for a while so therefore you can't ping it but you should
> > > be able to ping it if you try again later (with doing nothing else).  
> > ----> Yes, this problem persist after migration is successfully done.
> > I will try to test ping it after a while and tell you result later  
> 
> 
> 
> Please do, if this is the issue, some logs from both host machine and guest
> machine will be useful. Mainly information about network device configuration
> in the guest could help I think.  

Yes, after waiting about 8 minutes after the migration, the guest can ping the host.

Comment 66 Michal Novotny 2009-10-05 11:39:34 UTC
(In reply to comment #65)
> (In reply to comment #64)
> > (In reply to comment #63)
> > > (In reply to comment #62)
> > > > (In reply to comment #61)
> > > > > It seems strange that on host B, guest ping host failed, but host ping guest
> > > > > successfully.  
> > > > 
> > > > 
> > > > Well, does this problem persist after migration is successfully done? Maybe
> > > > there is some outage for a while so therefore you can't ping it but you should
> > > > be able to ping it if you try again later (with doing nothing else).  
> > > ----> Yes, this problem persist after migration is successfully done.
> > > I will try to test ping it after a while and tell you result later  
> > 
> > 
> > 
> > Please do, if this is the issue, some logs from both host machine and guest
> > machine will be useful. Mainly information about network device configuration
> > in the guest could help I think.  
> 
> Yes, after waiting about 8 minutes after the migration, the guest can ping the host.

I did several migrations (about 10 per host machine) and found no issue. I've been using the -94 version with some new patches applied, available at http://people.redhat.com/minovotn/xen, together with kernel-xen-2.6.18-164. Could you please try to reproduce with this version, since I was unable to, and tell us whether the issue is still there?

Thanks,
Michal

Comment 70 Michal Novotny 2010-05-05 07:46:49 UTC
Was any testing done with this one? I was unable to reproduce it at all.

Michal

Comment 71 Lei Wang 2010-05-05 08:35:27 UTC
Michal, sorry for the late update.

Verified this bug on two Intel machines.

Version-Release number of selected component (if applicable):
libvirt: libvirt-0.6.3-20.el5
xen: xen-3.0.3-105.el5
kernel: kernel-2.6.18-194.el5

Host is RHEL-Server-5.4.x86_64
PV Guest is RHEL-server-5.4.i386/x86_64

As this bug is said to have nothing to do with libvirt, I tested migration only with the "xm migrate -l" command.
There is already a migration subtest in xen autotest: ping_pong_migration (which uses "xm migrate -l" to migrate the vm to the remote host, then migrate it back, then to the remote host again, then back...).

After migrating the vm for 20 cycles (migrate to remote and back), there is no device loss; all network/input/graphics devices exist just as when the vm was created.
That is to say, the bug was not reproduced.
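For reference, the ping-pong loop boils down to something like this (a simplified sketch of what the subtest does, with the host names as placeholders):

for i in $(seq 1 20); do
    # one cycle: migrate the vm to the remote host, then bring it back
    xm migrate -l rhel5u4 <remote host>
    ssh <remote host> "xm migrate -l rhel5u4 <local host>"
done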

Comment 72 Michal Novotny 2010-05-05 08:47:30 UTC
(In reply to comment #71)
> Michal, sorry for the late update.
> 
> Verified this bug on two Intel machines.
> 
> Version-Release number of selected component (if applicable):
> libvirt: libvirt-0.6.3-20.el5
> xen: xen-3.0.3-105.el5
> kernel: kernel-2.6.18-194.el5
> 
> Host is RHEL-Server-5.4.x86_64
> PV Guest is RHEL-server-5.4.i386/x86_64
> 
> As this bug is said to have nothing to do with libvirt, I tested migration
> only with the "xm migrate -l" command.
> There is already a migration subtest in xen autotest: ping_pong_migration
> (which uses "xm migrate -l" to migrate the vm to the remote host, then
> migrate it back, then to the remote host again, then back...).
> 
> After migrating the vm for 20 cycles (migrate to remote and back), there is
> no device loss; all network/input/graphics devices exist just as when the vm
> was created.
> That is to say, the bug was not reproduced.

Ok, so does that mean the bug is no longer reproducible but was reproducible before? On what version of the xen package was it reproducible? You say you tried it with xen-3.0.3-105.el5 and it was not reproducible. Originally it was closed as a dup of bug 507765, and the patch for that bug was already built into xen-3.0.3-91.el5. Also, I did some testing with the -94 version of the xen package, as described in comment #66, and found no issue. But the bug was reopened at a time when the reporter was using the -93 version of the xen package, so could you please retest using both the i386 and x86_64 versions of both dom0 and domU to be sure it's working fine?

Thanks,
Michal

Comment 73 Qixiang Wan 2010-05-05 14:53:39 UTC
I tried to reproduce this bug with xen-3.0.3-80.el5 (there is no xen -81 ~ -93 package available in brew now); the pv guest ended up suspended after just 2 rounds of migration.

host: RHEL-Server-5.4 x86_64 (xen-3.0.3-80.el5 kernel-xen-2.6.18-164.el5)
pv guest: RHEL-Server-5.4 i386

$ xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     7352     2 r-----   1061.5
migrating-rhel54pv                         2      511     1 ---s--      0.0

$ cat /var/log/xen/xend.log
...
[2010-05-05 14:33:34 xend 4453] DEBUG (DevController:496) hotplugStatusCallback /local/domain/0/backend/tap/2/51712/hotplug-status.
[2010-05-05 14:33:34 xend 4453] DEBUG (DevController:510) hotplugStatusCallback 1.
[2010-05-05 14:33:34 xend 4453] DEBUG (DevController:154) Waiting for devices vtpm.
[2010-05-05 14:33:34 xend.XendDomainInfo 4453] DEBUG (XendDomainInfo:1036) XendDomainInfo.handleShutdownWatch
[2010-05-05 14:33:34 xend.XendDomainInfo 4453] INFO (XendDomainInfo:994) Domain has shutdown: name=migrating-rhel54pv id=2 reason=suspend.
[2010-05-05 14:33:34 xend 4453] INFO (XendCheckpoint:99) Domain 2 suspended.
[2010-05-05 14:33:34 xend 4453] DEBUG (XendCheckpoint:108) Written done
[2010-05-05 14:35:53 xend.XendDomainInfo 4453] INFO (XendDomainInfo:994) Domain has shutdown: name=migrating-rhel54pv id=2 reason=suspend.
[2010-05-05 14:35:53 xend.XendDomainInfo 4453] INFO (XendDomainInfo:994) Domain has shutdown: name=migrating-rhel54pv id=2 reason=suspend.
[2010-05-05 14:35:53 xend.XendDomainInfo 4453] INFO (XendDomainInfo:994) Domain has shutdown: name=migrating-rhel54pv id=2 reason=suspend.
[2010-05-05 14:36:46 xend.XendDomainInfo 4453] INFO (XendDomainInfo:994) Domain has shutdown: name=migrating-rhel54pv id=2 reason=suspend.
...

Comment 74 Qixiang Wan 2010-05-05 16:49:30 UTC
The test result in comment #73 is invalid; I probably tried to migrate a guest that was already migrating.

I re-tested against xen-3.0.3-80.el5 with the following steps:

1. Create a pv guest with its disk placed on the shared NFS storage (mounted on /data by the hosts in this case):

$ cat rhel54pv.cfg
name = "rhel54pv"
maxmem = 512
memory = 512
vcpus = 1
bootloader = "/usr/bin/pygrub"
pae = 1
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "restart"
vfb = [ 'type=vnc,vncunused=1,keymap=en-us,vnclisten=0.0.0.0' ]
disk = [ "tap:aio:/data/rhel-server-5.4-32-pv.img,xvda,w" ]
vif = [ "mac=00:16:36:63:75:48,bridge=xenbr0,script=vif-bridge" ]

2. Start X in the guest and issue the following command in a gnome terminal:
$ i=0; while sleep 1; do echo $((i++));done

Keep pinging the guest from outside.

3. Migrate the pv guest between the two hosts (src and dst) for 100 rounds:
[src]$ xm migrate -l rhel54pv $(ip_of_dst)
[dst]$ xm migrate -l rhel54pv $(ip_of_src)

4. After 100 rounds of migration, the command from step 2 keeps running in the guest, and the guest's network and vfb session also work well.

I also tested against xen-3.0.3-94.el5 in the same environment with the same steps as above, and got the same results as with xen-3.0.3-80.el5 (migration without the '-l' parameter was also covered).

So I can't reproduce this bug for now.

Comment 75 Michal Novotny 2010-05-06 10:51:53 UTC
(In reply to comment #74)
> The test result in comment #73 is invalid; I probably tried to migrate a
> guest that was already migrating.
> 
> I re-tested against xen-3.0.3-80.el5 with the following steps:
> 
> 1. Create a pv guest with its disk placed on the shared NFS storage (mounted
> on /data by the hosts in this case):
> 
> $ cat rhel54pv.cfg
> name = "rhel54pv"
> maxmem = 512
> memory = 512
> vcpus = 1
> bootloader = "/usr/bin/pygrub"
> pae = 1
> on_poweroff = "destroy"
> on_reboot = "restart"
> on_crash = "restart"
> vfb = [ 'type=vnc,vncunused=1,keymap=en-us,vnclisten=0.0.0.0' ]
> disk = [ "tap:aio:/data/rhel-server-5.4-32-pv.img,xvda,w" ]
> vif = [ "mac=00:16:36:63:75:48,bridge=xenbr0,script=vif-bridge" ]
> 
> 2. Start X in the guest and issue the following command in a gnome terminal:
> $ i=0; while sleep 1; do echo $((i++));done
> 
> Keep pinging the guest from outside.
> 
> 3. Migrate the pv guest between the two hosts (src and dst) for 100 rounds:
> [src]$ xm migrate -l rhel54pv $(ip_of_dst)
> [dst]$ xm migrate -l rhel54pv $(ip_of_src)
> 
> 4. After 100 rounds of migration, the command from step 2 keeps running in
> the guest, and the guest's network and vfb session also work well.
> 
> I also tested against xen-3.0.3-94.el5 in the same environment with the same
> steps as above, and got the same results as with xen-3.0.3-80.el5 (migration
> without the '-l' parameter was also covered).
> 
> So I can't reproduce this bug for now.

Good. What about the latest version of the xen packages, i.e. xen-3.0.3-107.el5 [virttest25]?

Thanks,
Michal

Comment 76 Qixiang Wan 2010-05-06 13:44:18 UTC
Tested against xen-3.0.3-107.el5 with the same steps as in comment #74.

$ uname -a
Linux intel-8400-8-1 2.6.18-164.el5xen #1 SMP Tue Aug 18 15:59:52 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
$ rpm -q xen
xen-3.0.3-107.el5

1. Migrating the guest with 'xm migrate $dom $host': after 100 rounds of migration, the pv guest is alive and healthy.

2. Live migrating the guest with 'xm migrate -l $dom $host': after about 50 rounds of migration, the pv guest disappeared completely from the 'xm list' output on both the source and destination hosts.
The xend.log from both the source host and the destination host will be attached soon.

Comment 77 Qixiang Wan 2010-05-06 13:52:30 UTC
Created attachment 412062 [details]
xend.log on the source host

Comment 78 Qixiang Wan 2010-05-06 13:56:55 UTC
Created attachment 412064 [details]
xend.log on the source host

Comment 79 Qixiang Wan 2010-05-06 13:57:56 UTC
Created attachment 412065 [details]
xend.log on the destination host

Comment 80 Michal Novotny 2010-05-07 14:27:12 UTC
(In reply to comment #79)
> Created an attachment (id=412065) [details]
> xend.log on the destination host    

Well, it looks like this is something different, but investigation of this log revealed that "[2010-05-06 13:27:11 xend 3456] INFO (XendCheckpoint:364) ERROR Internal error: Failed to pin batch of 164 page tables" is the key error that makes xc_restore fail. This is code for PV guests only; it calls the xc_mmuext_op() function to pin the page tables for the guest whenever a batch of MAX_PIN_BATCH pins (which is set to 1024) has accumulated. The xc_mmuext_op() function issues a __HYPERVISOR_mmuext_op hypercall to the hypervisor, so I need to ask you a few more things to be sure:

1. Was the kernel-xen version same all the time? I.e. kernel-xen-2.6.18-194.el5 ?
2. Was the only package changed just the xen user-space package?
3. Did the error occur only during live migration and was it a repeated behavior? I.e. is it 100% reproducible or not ?

I've been looking at the user-space changes from xen-3.0.3-94.el5 but I can see anything that could cause this behaviour so those information I am asking about could be helpful.

Thanks,
Michal

Comment 81 Michal Novotny 2010-05-07 14:28:45 UTC
Oh, sorry. I meant I can't see anything that could have caused this behavior.

Michal

Comment 82 Qixiang Wan 2010-05-10 09:54:48 UTC
(In reply to comment #80)
> (In reply to comment #79)
> > Created an attachment (id=412065) [details] [details]
> > xend.log on the destination host    
> 
> Well, it looks like this is something different, but investigation of this
> log revealed that "[2010-05-06 13:27:11 xend 3456] INFO (XendCheckpoint:364)
> ERROR Internal error: Failed to pin batch of 164 page tables" is the key
> error that makes xc_restore fail. This is code for PV guests only; it calls
> the xc_mmuext_op() function to pin the page tables for the guest whenever a
> batch of MAX_PIN_BATCH pins (which is set to 1024) has accumulated. The
> xc_mmuext_op() function issues a __HYPERVISOR_mmuext_op hypercall to the
> hypervisor, so I need to ask you a few more things to be sure:
> 
> 1. Was the kernel-xen version same all the time? I.e. kernel-xen-2.6.18-194.el5
> ?
Yes, it's always kernel-xen-2.6.18-164.el5, which comes with the released RHEL5.4.

> 2. Was the only package changed just the xen user-space package?
Yes, I installed the released RHEL5.4 server and then updated the xen user-space package.

> 3. Did the error occur only during live migration and was it a repeated
> behavior? I.e. is it 100% reproducible or not ?
I only saw this problem the first time I live migrated the PV guest, and could not reproduce it after that.
I tested live migration in the same environment with the same steps another 3 times, live migrating the pv guest for 400 rounds; the guest lived well after the migrations.

> 
> I've been looking at the user-space changes from xen-3.0.3-94.el5 but I can see
> anything that could cause this behaviour so those information I am asking about
> could be helpful.
> 
> Thanks,
> Michal

Comment 83 Michal Novotny 2010-05-19 14:54:08 UTC
(In reply to comment #82)
> (In reply to comment #80)
> > (In reply to comment #79)
> > > Created an attachment (id=412065) [details] [details] [details]
> > > xend.log on the destination host    
> > 
> > Well, it looks like this is something different, but investigation of this
> > log revealed that "[2010-05-06 13:27:11 xend 3456] INFO (XendCheckpoint:364)
> > ERROR Internal error: Failed to pin batch of 164 page tables" is the key
> > error that makes xc_restore fail. This is code for PV guests only; it calls
> > the xc_mmuext_op() function to pin the page tables for the guest whenever a
> > batch of MAX_PIN_BATCH pins (which is set to 1024) has accumulated. The
> > xc_mmuext_op() function issues a __HYPERVISOR_mmuext_op hypercall to the
> > hypervisor, so I need to ask you a few more things to be sure:
> > 
> > 1. Was the kernel-xen version same all the time? I.e. kernel-xen-2.6.18-194.el5
> > ?
> Yes, it's always kernel-xen-2.6.18-164.el5, which comes with the released RHEL5.4.
> 
> > 2. Was the only package changed just the xen user-space package?
> Yes, I installed the released RHEL5.4 server and then updated the xen
> user-space package.
> 
> > 3. Did the error occur only during live migration and was it a repeated
> > behavior? I.e. is it 100% reproducible or not ?
> I only saw this problem the first time I live migrated the PV guest, and
> could not reproduce it after that.
> I tested live migration in the same environment with the same steps another
> 3 times, live migrating the pv guest for 400 rounds; the guest lived well
> after the migrations.
> 

What do you mean by seeing the problem only the first time? Is it no longer reproducible? How reproducible is it? Not 100%, I guess. Could you please try the latest kernel-xen and xen packages and provide us with the test results?

Thanks a lot,
Michal

Comment 84 Qixiang Wan 2010-05-20 16:24:32 UTC
(In reply to comment #83)
> 
> What do you mean by seeing the problem only the first time? Is it no longer
> reproducible? How reproducible is it? Not 100%, I guess.

The problem mentioned in comment #76 (guest lost while doing live migration) is not reproducible; I only hit it once during testing (I migrated the pv guest for 400 rounds about 10 times).

> Could you please try the latest kernel-xen and xen packages and
> provide us with the test results?
> 

Tested against kernel-xen-2.6.18-199.el5 and xen-3.0.3-109.el5, live migrating the pv guest for 200 rounds with the same steps as in comment #74. Tested 5 times; no problem was found during testing, and the guest works well after the migrations.

Comment 85 Michal Novotny 2010-05-31 10:33:08 UTC
(In reply to comment #84)
> (In reply to comment #83)
> > 
> > What do you mean by seeing the problem only the first time? Is it no longer
> > reproducible? How reproducible is it? Not 100%, I guess.
> 
> The problem mentioned in comment #76 (guest lost while doing live migration)
> is not reproducible; I only hit it once during testing (I migrated the pv
> guest for 400 rounds about 10 times).
> 
> > Could you please try the latest kernel-xen and xen packages and
> > provide us with the test results?
> > 
> 
> Tested against kernel-xen-2.6.18-199.el5 and xen-3.0.3-109.el5, live
> migrating the pv guest for 200 rounds with the same steps as in comment #74.
> Tested 5 times; no problem was found during testing, and the guest works
> well after the migrations.

So, does that mean you did not run into those issues in any of the testing you have been doing with the latest kernel-xen and xen versions? If so, feel free to close this bug.

Thanks,
Michal

Comment 86 Qixiang Wan 2010-06-02 13:07:46 UTC
(In reply to comment #85)

> So, does that mean you did not run into those issues in any of the testing
> you have been doing with the latest kernel-xen and xen versions? If so, feel
> free to close this bug.
Yes, I'm closing it as VERIFIED.

Comment 87 Michal Novotny 2010-06-02 13:17:06 UTC
(In reply to comment #86)
> (In reply to comment #85)
> 
> > So, does that mean you did not run into those issues in any of the testing
> > you have been doing with the latest kernel-xen and xen versions? If so,
> > feel free to close this bug.
> Yes, I'm closing it as VERIFIED.    

Great. Thanks!

Michal