Bug 622501 - HVM guest disk becomes read-only after local migration
Summary: HVM guest disk becomes read-only after local migration
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.6
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Michal Novotny
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Duplicates: 608964
Depends On:
Blocks: 514498 605617
 
Reported: 2010-08-09 14:37 UTC by Pengzhen Cao
Modified: 2014-02-02 22:38 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-02-28 09:59:17 UTC
Target Upstream Version:
Embargoed:


Attachments
Patch to fix local migrations (3.14 KB, patch) - 2010-08-10 11:56 UTC, Michal Novotny
Patch to fix local migrations v2 (3.73 KB, patch) - 2010-08-16 10:40 UTC, Michal Novotny
Patch to fix local migrations v3 (6.01 KB, patch) - 2010-08-17 13:42 UTC, Michal Novotny
Patch to fix local migration v4 (3.51 KB, patch) - 2010-11-22 13:15 UTC, Michal Novotny
Patch v5 (2.58 KB, patch) - 2011-01-06 11:19 UTC, Michal Novotny

Description Pengzhen Cao 2010-08-09 14:37:13 UTC
Description of problem:
HVM guest disk becomes read-only after local migration

Version-Release number of selected component (if applicable):
xen-3.0.3-115.el5
kernel-xen-2.6.18-210.el5


How reproducible:
100%

Steps to Reproduce:
1. Start a Linux HVM guest with "xm create RHEL-5.4-64-hvm.conf".
2. Run a local migration: "xm migrate -l ID localhost".
3. After the migration completes, run step 2 again (a scripted version is sketched below).
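
For convenience, the reproduction can be scripted. The sketch below is illustrative only; it assumes the guest name "RHEL5.4-64bit-hv" from the config file in comment 2 and that 30 seconds is enough settle time between migrations:

-----------
#!/bin/sh
# Sketch: repeated localhost migrations of one HVM guest.
NAME=RHEL5.4-64bit-hv                 # guest name taken from the attached config

xm create RHEL-5.4-64-hvm.conf

for i in 1 2; do
    ID=$(xm domid "$NAME")            # the domain ID changes after every migration
    xm migrate -l "$ID" localhost
    sleep 30                          # give the guest time to settle
done

# Inside the guest, check whether the root filesystem went read-only, e.g.:
#   grep ' ro[ ,]' /proc/mounts ; dmesg | grep -i 'read-only'
-----------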
  
Actual results:
The guest VM gets disk read-only errors after the 2nd migration

Expected results:
The guest should keep running normally, with a writable disk, after every migration

Additional info:
1. For a WinXP HVM guest, the guest hangs after the first migration.

2. This issue happens both with and without the PV drivers.

3. Comparing "xenstore-ls" output before and after every migration, the VM's /vm path changes oddly:
before migration:    "/vm/1efb30c3-86fd-9dd7-4934-9b32b6a84432"
after 1st migration: "/vm/1efb30c3-86fd-9dd7-4934-9b32b6a84432-1" 
after 2nd migration: "/vm/1efb30c3-86fd-9dd7-4934-9b32b6a84432" 

4. There is an error message in xend's log after the 1st migration:
-----------
[2010-08-10 06:01:48 xend 6808] DEBUG (DevController:160) Waiting for devices usb.
[2010-08-10 06:01:48 xend 6808] DEBUG (DevController:160) Waiting for devices vbd.
[2010-08-10 06:01:48 xend 6808] DEBUG (DevController:166) Waiting for 768.
[2010-08-10 06:01:48 xend 6808] DEBUG (DevController:538) hotplugStatusCallback /local/domain/0/backend/vbd/26/768/hotplug-status.
[2010-08-10 06:01:48 xend 6808] DEBUG (DevController:552) hotplugStatusCallback 5.
[2010-08-10 06:01:48 xend 6808] ERROR (XendCheckpoint:356) Device 768 (vbd) could not be connected.
File /home/ovirt-VMs/RHEL-Server-5.4-64-hvm.raw is loopback-mounted through /dev/loop0,
which is mounted in a guest domain,
and so cannot be mounted now.
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 354, in restore
    dominfo.waitForDevices() # Wait for backends to set up
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 2440, in waitForDevices
    self.waitForDevices_(c)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1453, in waitForDevices_
    return self.getDeviceController(deviceClass).waitForDevices()
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 162, in waitForDevices
    return map(self.waitForDevice, self.deviceIDs())
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 196, in waitForDevice
    raise VmError("Device %s (%s) could not be connected.\n%s" %
VmError: Device 768 (vbd) could not be connected.
File /home/ovirt-VMs/RHEL-Server-5.4-64-hvm.raw is loopback-mounted through /dev/loop0,
which is mounted in a guest domain,
and so cannot be mounted now.
-------------------

Comment 1 Pengzhen Cao 2010-08-09 14:42:26 UTC
Created attachment 437615 [details]
logs for rhbz622501

---logs in /var/log/xen/--------
qemu-dm.16446.log
qemu-dm.16857.log
qemu-dm.17080.log
qemu-dm.17204.log
qemu-dm.17628.log
qemu-dm.17850.log
qemu-dm.17980.log
qemu-dm.18380.log
xend.log
xen-hotplug.log

---xenstore-ls log, before, 1st migration, 2nd migration, with pv driver----
xenstore-ls-ID22-rhel54-before-migrate
xenstore-ls-ID23-rhel54-after-migrate-1st
xenstore-ls-ID24-rhel54-after-migrate-2nd

---xenstore-ls log, before, 1st migration, 2nd migration, without pv driver----
xenstore-ls-ID25-rhel54-nopv-before-migrate
xenstore-ls-ID26-rhel54-nopv-after-migrate-1st
xenstore-ls-ID27-rhel54-nopv-after-migrate-2nd

---xenstore-ls log, before, 1st migration,WinXP 32bit with pv driver----
xenstore-ls-ID28-winxp-before-migrate
xenstore-ls-ID29-winxp-after-migrate-1st

---xm dmesg ---
xm-dmesg.log

Comment 2 Pengzhen Cao 2010-08-09 14:44:30 UTC
-----------winxp config file------------
# Xen configuration generated by xen-autotest
vncunused = "1"
kernel = "/usr/lib/xen/boot/hvmloader"
uuid = "1efb30c3-86fd-9dd7-4934-9b72b6a84422"
on_poweroff = "destroy"
vif = ['mac=00:21:7F:B7:11:02,script=vif-bridge,bridge=xenbr0,type=netfront']
name = "winXP-32bit"
on_reboot = "restart"
localtime = "0"
builder = "hvm"
apic = "1"
sdl = "0"
device_model = "/usr/lib64/xen/bin/qemu-dm"
vcpus = "4"
pae = "1"
memory = "512"
vnclisten = "0.0.0.0"
vnc = "1"
disk = ['file:/home/ovirt-VMs/WinXP-32-hvm.raw,xvda,w']
acpi = "1"
maxmem = "512"
soundhw = "es1370"

---------RHEL5.4-64bit config file------------
# Xen configuration generated by xen-autotest
vncunused = "1"
kernel = "/usr/lib/xen/boot/hvmloader"
uuid = "1efb30c3-86fd-9dd7-4934-9b32b6a84432"
on_poweroff = "destroy"
vif = ['mac=00:21:7F:B7:43:02,script=vif-bridge,bridge=xenbr0']
name = "RHEL5.4-64bit-hv"
on_reboot = "restart"
localtime = "0"
builder = "hvm"
apic = "1"
sdl = "0"
device_model = "/usr/lib64/xen/bin/qemu-dm"
vcpus = "2"
pae = "1"
memory = "1024"
vnclisten = "0.0.0.0"
vnc = "1"
#disk = ['file:/home/ovirt-VMs/RHEL-Server-5.4-64-hvm.raw,xvda,w']
disk = ['file:/home/ovirt-VMs/RHEL-Server-5.4-64-hvm.raw,hda,w']
acpi = "1"
maxmem = "1024"
soundhw = "sb16"

Comment 3 Michal Novotny 2010-08-09 15:05:36 UTC
This is strange. Studying those logs, it appears that before the first migration the VBD device 51712 is defined in /vm/$UUID and has a frontend entry:

   vbd = ""
    51712 = ""
     frontend = "/local/domain/22/device/vbd/51712"
     frontend-id = "22"
     backend-id = "0"
     backend = "/local/domain/0/backend/vbd/22/51712"

...

and also for backend:

    vbd = ""
     22 = ""
      51712 = ""
       domain = "RHEL5.4-64bit-hv"
       frontend = "/local/domain/22/device/vbd/51712"
       dev = "xvda"
       state = "4"
       params = "/home/ovirt-VMs/RHEL-Server-5.4-64-hvm.raw"
       mode = "w"
       online = "1"
       frontend-id = "22"
       type = "file"
       node = "/dev/loop0"
       physical-device = "7:0"
       hotplug-status = "connected"
       sectors = "16777216"
       info = "0"
       sector-size = "512"

So all the paths are valid, but after the first migration the /vm/$UUID entry is missing and there is no trace of the vbd in the dump. A diff between the state after the first and after the second migration shows no change for the vbd device, i.e. the device is still missing, so it appears that after the first migration the guest still has the information it needs to keep running (which is strange; it should not have that information any more). The Windows HVM guest, on the other hand, still has its vbd entries in xenstore, but the error message shows up there:

+     29 = ""
       51712 = ""
        domain = "winXP-32bit"
-       frontend = "/local/domain/28/device/vbd/51712"
+       frontend = "/local/domain/29/device/vbd/51712"
        dev = "xvda"
-       state = "4"
+       state = "5"
        params = "/home/ovirt-VMs/WinXP-32-hvm.raw"
        mode = "w"
-       online = "1"
-       frontend-id = "28"
+       online = "0"
+       frontend-id = "29"
        type = "file"
-       node = "/dev/loop0"
-       physical-device = "7:0"
-       hotplug-status = "connected"
-       sectors = "20971520"
-       info = "0"
-       sector-size = "512"
+       hotplug-error = "File /home/ovirt-VMs/WinXP-32-hvm.raw is loopback-mo..."
+       hotplug-status = "busy"

This means that the drive parameters are removed, the state changes from connected (4) to closing (5), and online is set to 0 - that is where the problem comes from.
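
(For reference, these keys can be checked directly across the migrations. The snippet below is a sketch only, using the domain ID 22 and device 51712 from the listings above; both change with every migration.)

-----------
DOMID=22
DEV=51712
BE=/local/domain/0/backend/vbd/$DOMID/$DEV

xenstore-read $BE/state             # 4 = Connected, 5 = Closing
xenstore-read $BE/hotplug-status    # "connected" vs. "busy"
xenstore-read $BE/hotplug-error     # only present once the block script has failed
xenstore-ls /vm                     # shows the "-1" suffix added for a localhost migration
-----------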

So, Pengzhen: the Windows HVM guest with PV drivers hangs, and the Linux guest has the drive mounted read-only - does the Linux guest actually see the disk as read-only, or does it hang/panic because it does not see the disk at all?

Thanks,
Michal

Comment 4 Pengzhen Cao 2010-08-10 02:09:45 UTC
(In reply to comment #3)

> So, Pengzhen, the Windows HVM guest with PV drivers will hang and Linux guest
> is having the drive mounted as read-only and is it seeing it as read-only? Are
> does it hang/panic on not seeing the disk at all?
> 
> Thanks,
> Michal    

Hi Michal,

1. Yes, the Windows HVM guest hangs (after the 1st migration), while the Linux HVM guest does not hang and has the drive mounted read-only (after the 2nd migration).
2. Yes, it can still see the drive; fdisk shows the drive is still online. You can check "guest-vm-dmesg" in the attached tarball.

Comment 5 Michal Novotny 2010-08-10 11:56:02 UTC
Created attachment 437852 [details]
Patch to fix local migrations

Well, I was able to reproduce this and investigated it. I finally found that it is caused by the check that produces the message "File /home/ovirt-VMs/RHEL-Server-5.4-64-hvm.raw is loopback-mounted through /dev/loop0". For Linux guests the block device script does not write this error into xenstore, which is why the guest still works after the first live localhost migration but fails after the second one. The Windows PV drivers most probably have a slightly different implementation, which is why the guest stops working immediately after the first live localhost migration.

The patch basically checks whether the same path is already in use and, if it is, compares the domain names. If the names match (we cannot compare the UUID, because the UUID is changed for localhost migration purposes), we can assume the migration is a localhost one and skip the sharing check for the image file.
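
To illustrate the idea (a sketch only, not the attached patch; the helper name is made up and the xenstore paths are the ones visible in the listings in comment 3): when the block script finds another vbd backend already pointing at the same image, it compares the two domains' names instead of failing right away:

-----------
# Illustrative sketch, not the actual patch.  Returns 0 (true) when the
# other user of $file is a domain with the same name as the new domain,
# i.e. the incoming half of a localhost migration.
is_localhost_migration()
{
    local file="$1" newdom="$2"
    local newname othername d dev params

    newname=$(xenstore-read "/local/domain/$newdom/name")

    for d in $(xenstore-list /local/domain/0/backend/vbd)
    do
        [ "$d" = "$newdom" ] && continue
        for dev in $(xenstore-list "/local/domain/0/backend/vbd/$d")
        do
            params=$(xenstore-read "/local/domain/0/backend/vbd/$d/$dev/params")
            if [ "$params" = "$file" ]
            then
                othername=$(xenstore-read "/local/domain/$d/name")
                # Same name but a different domid: treat it as a localhost
                # migration and do not report the image as shared.
                [ "$othername" = "$newname" ] && return 0
                return 1
            fi
        done
    done
    return 1
}
-----------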

I tested this with a RHEL-5 x86_64 guest on a RHEL-5 x86_64 dom0: without the patch it returned I/O errors after the second migration; with the patch applied I did 10 localhost migrations of the same guest in a row and the disk was always mounted read-write with no I/O errors.

I also tried a Windows 2003 x86 guest without PV drivers, and a loop of several live localhost migrations in a row worked fine, but the very same guest with PV drivers failed even the first localhost migration, so I think that part is purely a Windows PV driver issue.

Thanks,
Michal

Comment 6 Michal Novotny 2010-08-10 11:57:15 UTC
Pengzhen, I recommend filing a bug against xenpv-win for this issue too, since migration worked fine without the PV drivers but not with them. Could you please file that bug yourself and include the XenPV Windows driver version information and the relevant testing details?

Thanks,
Michal

Comment 7 Pengzhen Cao 2010-08-10 12:54:38 UTC
(In reply to comment #5)
> Created an attachment (id=437852) [details]
> Patch to fix local migrations
> [...]

Hi Michal,
Verified that the patch works for the Linux HVM guest: local migration succeeded 3 times in a row.
However, it still fails for the Windows HVM guest, even on the first local migration, so I do not think it is purely a xenpv-win driver issue. Maybe Windows handles the block device differently from Linux.
You can use a WinXP 32-bit guest for testing; local migration of a Win2003 32-bit guest PASSes even with the xenpv-win driver.


Regards,
Pengzhen

Comment 8 Michal Novotny 2010-08-10 13:08:27 UTC
(In reply to comment #7)
> (In reply to comment #5)
> > [...]
> 
> Hi Michal,
> Verified the patch  work for linux hvm guest. Local migration succeed with 3
> times. 
> However, It still failed for windows hvm guest even for the first local
> migration. So I do not think it is purely xenpv-win driver issue. Maybe it is
> due to windows handle block device in a different way to linux.
> You can use WinXP 32bit guest for test. Local migration for win2003 32bit will
> PASS even with xenpv-win driver.
> 
> 
> Regards,
> Pengzhen    

Pengzhen,
please read comment 6. I said this does not work for Windows 2003 *with* PV drivers but works for the same guest *without* PV drivers, so that is the issue you should file against the xenpv-win component, which is the component for the Windows PV drivers. It may pass for Windows 2003 x86 (32-bit) now because I have since realised I may have been using an older version of the PV drivers and the latest ones may have it fixed. In any case, if a guest that fails localhost migration is using PV drivers, try uninstalling them and retest.

I tried both Windows 2003 x86 and Windows XP x86 without PV drivers and could not reproduce the problem, but I did manage to reproduce it with the PV drivers installed. If it is reproducible only with PV drivers and not otherwise, you should file a bug against xenpv-win instead.

Regards,
Michal

Comment 9 Pengzhen Cao 2010-08-11 07:30:31 UTC
(In reply to comment #8)
> [...]

Hi Michal,

I have tried WinXP *without* the xenpv-win driver and it can migrate.
However, when you click the Start menu to shut down the VM, or run "dir" on C:\, the guest hangs immediately, and there is a chance the WinXP VM hangs after migration without any operation inside the VM at all. I did the above testing with your patch applied, so this is not only a xenpv-win issue.

Regards,
Pengzhen

Comment 10 Pengzhen Cao 2010-08-12 03:19:53 UTC
Could you check again with a Windows HVM guest without the PV driver?

Comment 11 Michal Novotny 2010-08-12 07:25:05 UTC
Yes, I did it again and it works fine for Windows XP and 2003 *without* PV drivers.

What version of Xen are you using? Do you have this patch applied on top of the latest virttest packages? I was unable to reproduce the results from comment 9, so I am not sure what is going on.

Michal

Comment 12 Michal Novotny 2010-08-12 07:26:22 UTC
Oh, one more thing: this bug is about migrations, so if the hang happens without any migration while simply running the guest, that is pretty strange. Please file a new Bugzilla with the *exact* steps to reproduce it, including the version information and details about the guest (Windows XP? 2003? 32-bit? 64-bit? etc.).

Thanks,
Michal

Comment 13 Pengzhen Cao 2010-08-12 07:50:54 UTC
(In reply to comment #12)
> Oh, one more thing: This is about migrations and if it happens without any
> migrations when you run the guest it's pretty strange. Please file a new
> bugzilla with *exact* steps to reproduce it including the version information
> (also including information about the guest - Windows XP? 2003? 32-bit? 64-bit?
> etc.)
> 
> Thanks,
> Michal    

Hi Michal,

I am using "xen-3.0.3-115_x86_64" and "kernel-xen-2.6.18-210".
The guest is "WinXP 32bit".
I mean the VM is fine without migration; it only hangs after a migration, even without the PV driver.

Regards,
Pengzhen

Comment 14 Michal Novotny 2010-08-12 12:09:39 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > [...]
> 
> Hi Michal,
> 
> I am using "xen-3.0.3-115_x86_64" and "kernel-xen-2.6.18-210". 
> Guest is "WinXP 32bit".
> I mean the vm is OK without migration. It will only hang after migration, even
> without PV driver.
> 
> Regards,
> Pengzhen    

Hi Pengzhen,
that is the issue then. The -115 version of the Xen package does not have the fix applied. I checked mrezanin's virttest package and it is not applied there either, so I built my own version of the Xen package with this patch applied - until you told me you were using the -115 package, I assumed you had your own rebuilt Xen package with the patch applied, but apparently you do not.

The version I rebuilt is based on -virttest31 and is named 'xen-3.0.3-115.el5virttest31.g7e4798b'; the package is located at:

http://people.redhat.com/minovotn/xen/

Please test with this version of the Xen package.
Michal

Note: I did not mean this is not a bug; it certainly is (that is why there is already a patch and why the bug is in POST state). But there is also a bug in the Windows PV drivers, and for that you still need to file a new bug against xenpv-win.

Comment 15 Pengzhen Cao 2010-08-12 13:30:05 UTC
(In reply to comment #14)
> [...]

Hi Michal,

I was using xen-115, but I had patched "/etc/xen/scripts/block" manually with your patch. I have also just tried "xen-3.0.3-115.el5virttest31.g7e4798b"; it still does not work for the Windows HVM guest. Can you take a look at my server?

Regards,
Pengzhen

Comment 16 Michal Novotny 2010-08-12 15:21:34 UTC
Well, Pengzhen, we have been investigating this and found that both local and remote migration work fine on Intel, but neither local nor remote migration works on AMD.

I don't know whether it is relevant to the hypervisor/kernel, but I am seeing the following messages on your AMD machine:

Aug 12 20:57:16 amd-B95-8-1 kernel: Warning Timer ISR/1: Time went backwards: delta=-11000403 delta_cpu=636999597 shadow=18650886977204 off=210022848 processed=18651108000000 cpu_processed=18650460000000
Aug 12 20:57:16 amd-B95-8-1 kernel:  0: 18651084000000
Aug 12 20:57:16 amd-B95-8-1 kernel:  1: 18650460000000
Aug 12 20:57:16 amd-B95-8-1 kernel:  2: 18651032000000
Aug 12 20:57:16 amd-B95-8-1 kernel:  3: 18651104000000
Aug 12 20:57:16 amd-B95-8-1 kernel: Warning Timer ISR/0: Time went backwards: delta=-10955058 delta_cpu=13044942 shadow=18650696978825 off=400067434 processed=18651108000000 cpu_processed=18651084000000
Aug 12 20:57:16 amd-B95-8-1 kernel:  0: 18651084000000
Aug 12 20:57:16 amd-B95-8-1 kernel:  1: 18651092000000
Aug 12 20:57:16 amd-B95-8-1 kernel:  2: 18651032000000
Aug 12 20:57:16 amd-B95-8-1 kernel:  3: 18651104000000

I can see no error in xend.log, but in the `xm dmesg` output I found the following messages at the end:

(XEN) traps.c:1877:d0 Domain attempted WRMSR 00000000c001001f from 00582000:00000008 to 00586000:00000008.
(XEN) save.c:174:d0 HVM restore: Xen changeset was not saved.
(XEN) lapic_load to rearm the actimer:bus cycle is 10ns, saved tmict count 116770000, period 1167700000ns, irq=253
(XEN) save.c:174:d0 HVM restore: Xen changeset was not saved.
(XEN) lapic_load to rearm the actimer:bus cycle is 10ns, saved tmict count 116770000, period 1167700000ns, irq=253
(XEN) save.c:174:d0 HVM restore: Xen changeset was not saved.
(XEN) lapic_load to rearm the actimer:bus cycle is 10ns, saved tmict count 2562350000, period 4148663520ns, irq=253
(XEN) save.c:174:d0 HVM restore: Xen changeset was not saved.
(XEN) lapic_load to rearm the actimer:bus cycle is 10ns, saved tmict count 2562350000, period 4148663520ns, irq=253
(XEN) save.c:174:d0 HVM restore: Xen changeset was not saved.
(XEN) lapic_load to rearm the actimer:bus cycle is 10ns, saved tmict count 3403050000, period 3965728928ns, irq=253

So I guess this is something hypervisor-related, since according to the testing it always works on Intel but never on AMD.

Regards,
Michal

Comment 17 Andrew Jones 2010-08-12 15:53:16 UTC
(In reply to comment #16)
> So I guess this is something hypervisor related since according to the testing
> it's always working on Intel but never on AMD.
> 

Could be. This looks like the biggest clue to me:

(XEN) save.c:174:d0 HVM restore: Xen changeset was not saved.

I hear we can't even save+restore there, though, so this is unrelated to the local migration problem and needs its own bug. In that bug we need to figure out whether it is machine dependent, processor dependent, guest dependent, etc.

For this bug, QA should avoid trying to test on machines where they can't even save and restore.

Comment 18 Pengzhen Cao 2010-08-13 02:37:13 UTC
(In reply to comment #16)
> Well, Pengzhen, we've been investigating this and we've found out both local
> and remote migration is working fine on Intel but neither local nor remote
> migration was not working on AMD.
> [...]
> So I guess this is something hypervisor related since according to the testing
> it's always working on Intel but never on AMD.

Maybe. I have already cloned a bug for this error message; it is a regression of rhbz 437252.
https://bugzilla.redhat.com/show_bug.cgi?id=617043
https://bugzilla.redhat.com/show_bug.cgi?id=437252

Regards,
Pengzhen

Comment 19 Pengzhen Cao 2010-08-13 02:41:59 UTC
(In reply to comment #17)
> (In reply to comment #16)
> > So I guess this is something hypervisor related since according to the testing
> > it's always working on Intel but never on AMD.
> > 
> 
> Could be. This looks like the biggest clue to me
> 
> (XEN) save.c:174:d0 HVM restore: Xen changeset was not saved.
> 
> I hear we can't even save+restore though, so this is unrelated to the local
> migration problem and needs its own bug. In that bug we need to figure out if
> it's machine dependant, processor dependant, guest dependent, etc.
> 
> For this bug, QA should avoid trying to test on machines where they can't even
> save and restore.    

Hi Andrew,

Yes, the fix works for both Windows and Linux HVM guests on the Intel machine, and this issue should be considered fixed.
Then there should be a separate bug for the migration and save/restore issue on the AMD machine - what do you think?

Regards,
Pengzhen

Comment 20 Andrew Jones 2010-08-13 06:32:20 UTC
(In reply to comment #19)
> Then there should be a separate bug for the migration and save/restore issue on
> AMD machine, what do you think?

Agreed. Although I also think we should try to hunt down an AMD machine that doesn't have the "time went backwards" issues in order to do a clean test, i.e. determine whether we're looking at a machine-dependent or a processor-dependent problem here.

Drew

Comment 21 Andrew Jones 2010-08-13 06:41:08 UTC
A new BZ has already been opened: bug 623729.

Comment 22 Bill Burns 2010-08-13 12:38:14 UTC
*** Bug 608964 has been marked as a duplicate of this bug. ***

Comment 23 Michal Novotny 2010-08-13 15:42:44 UTC
(In reply to comment #20)
> (In reply to comment #19)
> > Then there should be a separate bug for the migration and save/restore issue on
> > AMD machine, what do you think?
> 
> Agreed. Although, I also think we should try to hunt down an AMD machine that
> doesn't have the "time went backwards" issues in order to do a clean test, i.e.
> determine whether we're looking at a machine dependant problem or processor
> dependant problem here.
> 
> Drew    

Oh, just for clarification: this was not on colossus. It was some other AMD machine; I had access to 2 machines (for remote migration testing) and I saw this on both of them. I don't remember the exact CPUs now, but I know for sure that one machine had a Phenom B2 processor.

Michal

Comment 24 Michal Novotny 2010-08-16 10:40:51 UTC
Created attachment 438920 [details]
Patch to fix local migrations v2

This is the patch for BZ 622501 that basically checks for a local migration
in progress. If two guests are trying to use the same image file, it checks
whether the names of the guests match (we cannot compare the UUID, because
the UUID is changed for localhost migration purposes); if the names are the
same, we can assume the migration is a localhost one and skip the sharing
check for the image file.

Differences between version 1 and version 2 (this one):
 - Fixed bugs in the comparison operators and chose a different approach
 - Tested with multiple guests to make sure the check is not disabled entirely
 - A few optimizations of the xenstore-read calls

Michal

Comment 25 Michal Novotny 2010-08-17 13:42:59 UTC
Created attachment 439115 [details]
Patch to fix local migrations v3

This is the patch for BZ 622501 that basically checks for a local migration
in progress. If two guests are trying to use the same image file, it checks
whether the names of the guests match (we cannot compare the UUID, because
the UUID is changed for localhost migration purposes); if the names are the
same, we can assume the migration is a localhost one and skip the sharing
check for the image file.

Differences between version 1 and version 2:
 - Fixed bugs in the comparison operators and chose a different approach
 - Tested with multiple guests to make sure the check is not disabled entirely
 - A few optimizations of the xenstore-read calls

Differences between version 2 and version 3 (this one):
 - The new check loop has been merged into the previous loop to avoid
   using two almost identical loops

Michal

Comment 28 Michal Novotny 2010-11-22 13:15:28 UTC
Created attachment 462020 [details]
Patch to fix local migration v4

Patch rebased onto the new codebase with the vbd backports implemented.

Michal

Comment 31 Michal Novotny 2011-01-06 11:19:32 UTC
Created attachment 472037 [details]
Patch v5

New version of the patch, in which the local migration check has been moved into the check_sharing function.

Michal

Comment 32 RHEL Program Management 2011-01-11 19:51:31 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 33 RHEL Program Management 2011-01-12 15:21:05 UTC
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 34 Michal Novotny 2011-02-28 09:59:17 UTC
Not reproducible with the -124.el5 version of the Xen package, so closing as CURRENTRELEASE.

Michal

Comment 35 Paolo Bonzini 2011-03-09 16:02:21 UTC
It was fixed by the patch from bug 679280.

