Bug 1301713 - Live disk migration causes data corruption if source disk is on Microsoft NFS
Summary: Live disk migration causes data corruption if source disk is on Microsoft NFS
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.17.13
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ovirt-3.6.7
Target Release: ---
Assignee: Nir Soffer
QA Contact: Aharon Canan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-01-25 18:56 UTC by Pavel Gashev
Modified: 2017-03-06 12:23 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-02 11:15:15 UTC
oVirt Team: Storage
Embargoed:
pax: needinfo-
ylavi: ovirt-3.6.z?
ylavi: exception?
ylavi: planning_ack+
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments
/var/log/vdsm/vdsm.log (172.35 KB, text/plain)
2016-01-25 18:56 UTC, Pavel Gashev
script to reproduce the issue (3.79 KB, text/x-python)
2016-01-27 08:52 UTC, Pavel Gashev
screenshot of checking filesystem (21.65 KB, image/png)
2016-01-27 09:32 UTC, Pavel Gashev

Description Pavel Gashev 2016-01-25 18:56:04 UTC
Created attachment 1118164
/var/log/vdsm/vdsm.log

Description of problem:
Live disk migration causes data corruption if source disk is on Microsoft NFS

Version-Release number of selected component (if applicable):
vdsm-4.17.13-1.el7.noarch
libvirt-daemon-1.2.17-13.el7_2.2.x86_64
qemu-kvm-ev-2.3.0-31.el7_2.4.1.x86_64

Steps to Reproduce:
1. Create a VM on MS NFS storage
2. Live-migrate the disk to another storage (Linux NFS or iSCSI)
3. Shut down the VM
4. Boot the VM into recovery mode and check the filesystem

Actual results:
There are filesystem errors.

Expected results:
No filesystem errors.

Additional info:
1. If you roll back to the "Auto-Generated for Live Storage Migration" snapshot, the filesystem has no errors.
2. It does not depend on the presence of the guest agent.
3. It does not depend on the SPM: the data is corrupted whether the SPM is on the same host or on another one.
4. Live disk migration between other storages works with no corruption.

Comment 1 Nir Soffer 2016-01-26 21:12:59 UTC
Reported on users mailing list:
http://lists.ovirt.org/pipermail/users/2016-January/037392.html

Info extracted from the mailing list so far:

- Happens only when doing live storage migration from MS NFS server, and the
  guest is running W2K12. Cannot be reproduced when the guest is running Linux.

- Can be reproduced using oVirt and virsh: starting an oVirt VM, creating a live
  snapshot, and mirroring the drive to another file.

Reproducing using virsh:

1. Create a VM on MS NFS
2. Start VM
3. Create a disk-only snapshot
4. virsh blockcopy VM1 vda /some/file --wait --verbose --reuse-external --shallow
5. virsh blockjob VM1 vda --abort --pivot
6. Shut down VM
7. Copy the /some/file back to /rhev/data-center/..the.latest.snapshot.of.VM..
8. Start VM and check filesystem
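
Steps 3-6 above can be sketched as a list of commands. This is only a
minimal sketch and not the attached script: the helper name, the snapshot
name "repro-snap", and the qemu-img create step are assumptions (the create
step is included because --reuse-external expects the destination file to
exist already, with a backing file matching the active layer's backing file).
The function only builds the argument lists; nothing is executed.

```python
from typing import List


def build_repro_commands(domain: str, disk: str, dest: str,
                         backing: str, backing_fmt: str) -> List[List[str]]:
    """Return the argv lists for reproduction steps 3-6, in order.

    Nothing is executed here; each list can be passed to
    subprocess.run() on a test host.
    """
    return [
        # Step 3: disk-only (external) snapshot of the running VM.
        # The snapshot name is a placeholder.
        ["virsh", "snapshot-create-as", domain, "repro-snap", "--disk-only"],
        # Assumption: pre-create the destination for --reuse-external,
        # pointing at the same backing file as the active layer.
        ["qemu-img", "create", "-f", "qcow2",
         "-b", backing, "-F", backing_fmt, dest],
        # Step 4: mirror only the top layer into the destination.
        ["virsh", "blockcopy", domain, disk, dest,
         "--wait", "--verbose", "--reuse-external", "--shallow"],
        # Step 5: switch the VM over to the copy.
        ["virsh", "blockjob", domain, disk, "--abort", "--pivot"],
        # Step 6: shut the VM down.
        ["virsh", "shutdown", domain],
    ]
```

For example, build_repro_commands("VM1", "vda", "/some/file", backing, "raw")
yields the sequence for the VM named in the steps, where backing is the path
of the image below the new snapshot layer.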

Pavel,

- What is MS NFS server?
- What is W2K12?
- What is the issue seen on the guest when the file system is corrupted?
- To make sure the reproducer is correct, please provide the output of
  qemu-img info for /some/file after step 4, and for the latest vm snapshot
  after step 7
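
One possible way to collect the requested qemu-img info output in a
comparable form is sketched below. The function names are hypothetical; the
sketch assumes qemu-img is on the PATH and relies on its JSON output mode
(qemu-img info --output=json), reading only the "filename", "format", and
"backing-filename" keys.

```python
import json
import subprocess
from typing import Dict, Optional


def summarize_image_info(raw_json: str) -> Dict[str, Optional[str]]:
    """Pick the fields that matter for checking a backing chain."""
    info = json.loads(raw_json)
    return {
        "filename": info.get("filename"),
        "format": info.get("format"),
        # Absent for images without a backing file.
        "backing": info.get("backing-filename"),
    }


def image_info(path: str) -> Dict[str, Optional[str]]:
    """Run qemu-img info on path and summarize the result."""
    out = subprocess.run(
        ["qemu-img", "info", "--output=json", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return summarize_image_info(out)
```

Running image_info("/some/file") after step 4, and again on the latest VM
snapshot after step 7, would show whether the format and backing file of the
two images agree.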

Comment 2 Nir Soffer 2016-01-26 21:26:35 UTC
Eric, this looks like a libvirt or qemu issue. Can you take a look?

Comment 3 Eric Blake 2016-01-26 21:50:19 UTC
What command was used in step 3 to create the disk-only snapshot? I echo the request for the output of qemu-img info for /some/file, after both step 3 and step 4. After running the domain in step 8, does the <domain> XML sanely reflect your data layout after the copy of /some/file back to the data center?
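
A rough way to check the data layout in the <domain> XML is to list the image
chain of each disk. The sketch below is an illustration under assumptions:
the function name is hypothetical, and it expects the text printed by
virsh dumpxml for a running domain, where each <disk> carries a <source>
element and nested <backingStore> elements.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def disk_chains(domain_xml: str) -> Dict[str, List[str]]:
    """Map each disk target (e.g. 'vda') to its image chain, top first."""
    chains: Dict[str, List[str]] = {}
    root = ET.fromstring(domain_xml)
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        if target is None:
            continue
        chain: List[str] = []
        node = disk
        # Walk <source> of the disk, then of each nested <backingStore>.
        while node is not None:
            source = node.find("source")
            if source is None:
                break
            # File-backed disks use 'file', block devices use 'dev'.
            chain.append(source.get("file") or source.get("dev") or "")
            node = node.find("backingStore")
        chains[target.get("dev", "")] = chain
    return chains
```

After the pivot, the first entry for vda would be expected to be the copy
destination, followed by the same backing files as before the copy.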

The claim that the choice of guest OS affects things is weird; if it is a libvirt or qemu problem, I'd suspect it to be independent of the guest.

Comment 4 Pavel Gashev 2016-01-27 08:52:38 UTC
Created attachment 1118708
script to reproduce the issue

Comment 5 Pavel Gashev 2016-01-27 09:32:46 UTC
Created attachment 1118712
screenshot of checking filesystem

Comment 6 Pavel Gashev 2016-01-27 10:14:51 UTC
(In reply to Nir Soffer from comment #1)
> Reproducing using virsh:
> 
> 1. Create a VM on MS NFS
> 2. Start VM
> 3. Create a disk-only snapshot
> 4. virsh blockcopy VM1 vda /some/file --wait --verbose --reuse-external --shallow
> 5. virsh blockjob VM1 vda --abort --pivot
> 6. Shut down VM
> 7. Copy the /some/file back to
> /rhev/data-center/..the.latest.snapshot.of.VM..
> 8. Start VM and check filesystem

Please find attached a script to reproduce the issue.

> Pavel,
> 
> - What is MS NFS server?

Windows 2012 R2

> - What is W2K12?

Windows 2012, specifically Windows 2012 R2 x64

> - What is the issue seen on the guest when the file system is corrupted?

Please find the screenshot attached.

> - To make sure the reproducer is correct, please provide the output of
>   qemu-img info for /some/file after step 4, and for the latest vm snapshot
>   after step 7

Sorry, I forgot to mention qemu-img commands in the sequence. Please take a look at the script. Is it ok?

Comment 7 Nir Soffer 2016-03-03 15:19:26 UTC
Pavel,

Thanks for the script! I think we need to remove the dependency on oVirt
in this script to make it useful to libvirt developers.

Can you try to answer Eric's questions from comment 3?

Comment 8 Yaniv Lavi 2016-05-02 11:15:15 UTC
Please reopen if you can add the needed info.

