Bug 1762944 - Win10 disk corruption using virtio-scsi
Summary: Win10 disk corruption using virtio-scsi
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Virtualization Tools
Classification: Community
Component: virtio-win
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Vadim Rozenfeld
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-10-17 20:42 UTC by bugzilla
Modified: 2019-11-10 13:12 UTC

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-10 13:12:17 UTC
Embargoed:


Description bugzilla 2019-10-17 20:42:17 UTC
Description of problem:

Disk corruption reported by both qemu-img check and chkdsk within win10 guest.

Version-Release number of selected component (if applicable):

qemu 4.1, libvirt 5.6.0, virtio-win 0.1.172

How reproducible:

100% - win10 home 1903-v2 or pro 1903-v1, cache=writethrough or cache=writeback, BIOS or UEFI guests.

Steps to Reproduce:

virt-install \
--virt-type kvm \
--name=windows10 \
--os-variant=win10 \
--vcpus 2 \
--cpu host-passthrough \
--memory 4096 \
--features kvm_hidden=on \
--disk path=~/win10.qcow2,size=50,format=qcow2,sparse=true,bus=scsi,discard=unmap,io=threads  \
--controller type=scsi,model=virtio-scsi \
--graphics spice \
--channel spicevmc,target_type=virtio,name=com.redhat.spice.0 \
--video model=qxl,vgamem=32768,ram=131072,vram=131072,heads=1 \
--network bridge=br0,model=virtio \
--input type=tablet,bus=virtio \
--disk ~/virtio-win-0.1.172.iso,device=cdrom \
--cdrom ~/Win10_1903_V2_English_x64.iso

Actual results:

After a couple of boot/reboot cycles the guest warns of disk errors. "qemu-img check -r all" repairs the qcow2 image, but the result is sometimes a totally unbootable VM.
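
For anyone scripting the check described above, here is a minimal sketch of how the exit status of "qemu-img check" could be turned into a one-line verdict. The status meanings follow the qemu-img man page; the function names describe_check_status and check_image are hypothetical helpers, not part of any tool.

```shell
#!/bin/sh
# Sketch only: interpret the exit status of `qemu-img check`.
# describe_check_status and check_image are hypothetical names.

describe_check_status() {
    case "$1" in
        0)  echo "clean" ;;          # check completed, image is consistent
        1)  echo "check-failed" ;;   # check could not complete (internal error)
        2)  echo "corrupt" ;;        # check completed, corruption found
        3)  echo "leaks-only" ;;     # leaked clusters, but no corruption
        63) echo "unsupported" ;;    # image format does not support checks
        *)  echo "unknown" ;;
    esac
}

# Run a read-only check (no -r) and print "<image>: <verdict>".
# Assumption: the guest is shut off, so no running QEMU holds the image open.
check_image() {
    img=$1
    qemu-img check "$img" >/dev/null 2>&1
    status=$?
    echo "$img: $(describe_check_status "$status")"
}
```

Calling `check_image ~/win10.qcow2` after each guest shutdown leaves the image untouched, so a "corrupt" verdict can be recorded before deciding whether to run a repair.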

Expected results:

The VM boots reliably 100% of the time, just like Debian 10 guests running virtio-scsi do. The host SSD is not the culprit; I believe it to be a vioscsi Windows driver bug.

Additional info:

The host SSD is not mounted with the discard option; fstrim.timer is used instead. The filesystem is ext4 and the host runs kernel 5.2.17-1.

Comment 1 Vadim Rozenfeld 2019-10-17 23:41:49 UTC
Hi Li Jin,

Can QE try to reproduce this problem?

Thanks,
Vadim.

Comment 2 lijin 2019-10-18 01:17:56 UTC
(In reply to Vadim Rozenfeld from comment #1)
> Hi Li Jin,
> 
> Can QE try to reproduce this problem?
> 
> Thanks,
> Vadim.

Hi Yu,

Could you help to reproduce this issue?

Thanks

Comment 4 bugzilla 2019-10-20 09:29:30 UTC
Some more info on the types of problems found by qemu-img check:

Repairing cluster 196604 refcount=1 reference=0
Repairing cluster 196605 refcount=1 reference=0
Repairing cluster 196606 refcount=1 reference=0
Repairing cluster 196607 refcount=1 reference=0
Repairing OFLAG_COPIED data cluster: l2_entry=8000000279430000 refcount=2
Repairing OFLAG_COPIED data cluster: l2_entry=8000000279440000 refcount=2
Repairing OFLAG_COPIED data cluster: l2_entry=8000000279430000 refcount=2
Repairing OFLAG_COPIED data cluster: l2_entry=8000000279440000 refcount=2
The following inconsistencies were found and repaired:

    4101 leaked clusters
    6 corruptions

Also:

Repairing OFLAG_COPIED data cluster: l2_entry=80000001fdf90000 refcount=2
Repairing OFLAG_COPIED data cluster: l2_entry=800000021a690000 refcount=2
Repairing OFLAG_COPIED data cluster: l2_entry=800000021a6a0000 refcount=2
Repairing OFLAG_COPIED data cluster: l2_entry=800000021a6b0000 refcount=2
Repairing OFLAG_COPIED data cluster: l2_entry=80000001b44c0000 refcount=2
The following inconsistencies were found and repaired:

    1 leaked clusters
    4253 corruptions

Oddly enough Windows can get stuck in a repair/reboot loop as it's detecting errors from chkdsk even when qemu-img isn't.

I cannot reproduce this on Win2019 (which seems to mark the drive as not optimizable), so it seems specific to Win10.

All the latest spice-tools, qemu-agent, virtio-win drivers installed.

Comment 5 bugzilla 2019-10-20 09:42:10 UTC
Also, the number of errors and fixes seems odd: 3 errors found but 6 listed, then 3 fixed and 9 reported.


$ qemu-img check win10uefi.qcow2
ERROR cluster 143838 refcount=1 reference=2
ERROR cluster 143839 refcount=1 reference=2
ERROR cluster 143840 refcount=1 reference=2

3 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
237450/819200 = 28.99% allocated, 11.01% fragmented, 0.00% compressed clusters
Image end offset: 17673814016

$ qemu-img check -r leaks win10uefi.qcow2
ERROR cluster 143838 refcount=1 reference=2
ERROR cluster 143839 refcount=1 reference=2
ERROR cluster 143840 refcount=1 reference=2
ERROR cluster 143838 refcount=1 reference=2
ERROR cluster 143839 refcount=1 reference=2
ERROR cluster 143840 refcount=1 reference=2

3 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
237450/819200 = 28.99% allocated, 11.01% fragmented, 0.00% compressed clusters
Image end offset: 17673814016

$ qemu-img check win10uefi.qcow2
ERROR cluster 143838 refcount=1 reference=2
ERROR cluster 143839 refcount=1 reference=2
ERROR cluster 143840 refcount=1 reference=2

3 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
237450/819200 = 28.99% allocated, 11.01% fragmented, 0.00% compressed clusters
Image end offset: 17673814016

$ qemu-img check -r all win10uefi.qcow2
ERROR cluster 143838 refcount=1 reference=2
ERROR cluster 143839 refcount=1 reference=2
ERROR cluster 143840 refcount=1 reference=2
Repairing cluster 143838 refcount=1 reference=2
Repairing cluster 143839 refcount=1 reference=2
Repairing cluster 143840 refcount=1 reference=2
Repairing OFLAG_COPIED data cluster: l2_entry=8000000231df0000 refcount=2
Repairing OFLAG_COPIED data cluster: l2_entry=8000000231de0000 refcount=2
Repairing OFLAG_COPIED data cluster: l2_entry=8000000231e00000 refcount=2
Repairing OFLAG_COPIED data cluster: l2_entry=8000000231de0000 refcount=2
Repairing OFLAG_COPIED data cluster: l2_entry=8000000231df0000 refcount=2
Repairing OFLAG_COPIED data cluster: l2_entry=8000000231e00000 refcount=2
The following inconsistencies were found and repaired:

    0 leaked clusters
    9 corruptions

Double checking the fixed image now...
No errors were found on the image.
237450/819200 = 28.99% allocated, 11.01% fragmented, 0.00% compressed clusters
Image end offset: 17673814016

$ qemu-img check win10uefi.qcow2
No errors were found on the image.
237450/819200 = 28.99% allocated, 11.01% fragmented, 0.00% compressed clusters
Image end offset: 17673814016
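
One possible reading of the counts above, not confirmed anywhere in this thread: the "9 corruptions" figure equals the number of "Repairing" lines in the "-r all" run (3 refcount repairs plus 6 OFLAG_COPIED repairs). A minimal sketch for tallying a saved log follows; tally_check_log is a hypothetical helper that reads the log on stdin.

```shell
#!/bin/sh
# Sketch: count the line types in saved `qemu-img check -r all` output.
# tally_check_log is a hypothetical helper; it reads the log on stdin.
tally_check_log() {
    awk '
        /^ERROR cluster/          { errors++ }
        /^Repairing cluster/      { refcount_fixes++ }
        /^Repairing OFLAG_COPIED/ { oflag_fixes++ }
        END {
            # Uninitialised awk counters print as 0 with %d.
            printf "errors=%d refcount_fixes=%d oflag_fixes=%d total_repairs=%d\n",
                   errors, refcount_fixes, oflag_fixes,
                   refcount_fixes + oflag_fixes
        }
    '
}
```

Usage, assuming the output was saved to a file named check.log (a hypothetical name): `tally_check_log < check.log`. Working from a saved log avoids re-running a repair on the image just to count lines.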

Comment 6 Yu Wang 2019-10-22 06:52:18 UTC
Hi,

I cannot reproduce this in my environment (with either qemu or libvirt).

Steps:

1 Boot a win10-64 guest installed from en_windows_10_business_editions_version_1903_x64_dvd_37200948
virt-install \
--virt-type kvm \
--name=windows10 \
--os-variant=win10 \
--vcpus 2 \
--cpu host-passthrough \
--memory 4096 \
--features kvm_hidden=on \
--disk path=~/win10.qcow2,size=50,format=qcow2,sparse=true,bus=scsi,discard=unmap,io=threads  \
--controller type=scsi,model=virtio-scsi \
--graphics spice \
--channel spicevmc,target_type=virtio,name=com.redhat.spice.0 \
--video model=qxl,vgamem=32768,ram=131072,vram=131072,heads=1 \
--network bridge=br0,model=virtio \
--input type=tablet,bus=virtio \

2 reboot guest continuously in 24 hours

3 check with command
qemu-img check win10-64-virtio-scsi.qcow2
qemu-img check -r leaks win10-64-virtio-scsi.qcow2
qemu-img check -r all win10-64-virtio-scsi.qcow2
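
The reboot-and-check procedure in the steps above could be automated roughly as follows. This is a sketch under assumptions, not the script QE used: the domain is managed by libvirt, the guest honours an ACPI shutdown request from "virsh shutdown", and a fixed sleep is long enough for Windows to boot. The function names and the VIRSH, QEMU_IMG, BOOT_SECS and POLL_SECS variables are hypothetical.

```shell
#!/bin/sh
# Sketch: one start/shutdown/check cycle for a libvirt guest.
# VIRSH and QEMU_IMG can be overridden, e.g. to point at test stubs.
VIRSH=${VIRSH:-virsh}
QEMU_IMG=${QEMU_IMG:-qemu-img}

# Poll `virsh domstate` until it prints the wanted state or tries run out.
wait_for_state() {
    dom=$1 want=$2 tries=$3
    while [ "$tries" -gt 0 ]; do
        [ "$($VIRSH domstate "$dom")" = "$want" ] && return 0
        sleep "${POLL_SECS:-5}"
        tries=$((tries - 1))
    done
    return 1
}

# Boot the guest, let it settle, shut it down, then check the image.
# Returns the exit status of `qemu-img check` (0 means no errors found),
# or 1 if the guest could not be started or stopped.
soak_cycle() {
    dom=$1 img=$2
    $VIRSH start "$dom" >/dev/null || return 1
    sleep "${BOOT_SECS:-120}"
    $VIRSH shutdown "$dom" >/dev/null || return 1
    wait_for_state "$dom" "shut off" 60 || return 1
    $QEMU_IMG check "$img" >/dev/null 2>&1
}
```

A driver loop such as `while soak_cycle windows10 ~/win10.qcow2; do :; done` stops at the first cycle whose check reports a problem, which narrows down how many reboots it takes for the image to go bad.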

Results:
No errors were detected on the host.

[root@dell-per440-05 images]# qemu-img check win10-64-virtio-scsi.qcow2
No errors were found on the image.
172971/491520 = 35.19% allocated, 30.01% fragmented, 0.00% compressed clusters
Image end offset: 11803033600
[root@dell-per440-05 images]# qemu-img check -r leaks win10-64-virtio-scsi.qcow2
No errors were found on the image.
172971/491520 = 35.19% allocated, 30.01% fragmented, 0.00% compressed clusters
Image end offset: 11803033600
[root@dell-per440-05 images]# qemu-img check -r all win10-64-virtio-scsi.qcow2
No errors were found on the image.
172971/491520 = 35.19% allocated, 30.01% fragmented, 0.00% compressed clusters
Image end offset: 11803033600
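
As a cross-check on the "allocated" lines in these reports, the cluster counts can be converted to sizes. The sketch below assumes the default qcow2 cluster size of 64 KiB, which nobody in the thread states, and assumes the denominator printed by qemu-img is the virtual disk size in clusters. Under those assumptions 819200 clusters (the reporter's image) is 50 GiB, matching size=50, while 491520 clusters (the image checked here) is 30 GiB. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: arithmetic on the "allocated" line printed by qemu-img check.
# Assumption: 64 KiB clusters unless a different size in KiB is passed.

# clusters_to_gib CLUSTERS [CLUSTER_KIB] -> whole GiB (integer division)
clusters_to_gib() {
    echo $(( $1 * ${2:-64} / 1048576 ))
}

# allocated_pct ALLOCATED TOTAL -> percentage with two decimals
allocated_pct() {
    awk -v a="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * a / t }'
}
```

For example, `allocated_pct 172971 491520` reproduces the 35.19% printed above, and `clusters_to_gib 491520` gives the 30 GiB figure.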


Version
Guest:en_windows_10_business_editions_version_1903_x64_dvd_37200948
      virtio-win-prewhql-172.iso

Host: qemu-kvm-4.1.0-13.module+el8.1.0+4313+ef76ec61.x86_64
      seabios-1.12.0-5.module+el8.1.0+4022+29a53beb.x86_64
      kernel-4.18.0-137.el8.x86_64

I cannot find the ISO version you used on the MSDN subscriber downloads. Could you try with the ISO for the 1903 release?
This version may not hit the issue. Alternatively, could you upload the ISO you used?

Thanks
Yu Wang

Comment 7 bugzilla 2019-10-22 08:53:49 UTC
It's the regular 1903 image from: https://www.microsoft.com/en-gb/software-download/windows10ISO

I don't have an MSDN subscription. I had the same issue with the Home and Pro versions.

Comment 8 Yu Wang 2019-10-25 02:46:49 UTC
I tried to reproduce this issue with the 1903 V2 image from https://www.microsoft.com/en-gb/software-download/windows10ISO but still cannot reproduce it. No errors occurred during 24 hours of continuous reboots (same as comment#6).


Guest:Win10_1903_V2_English_x64.iso (Home)
      virtio-win-prewhql-172.iso

Host: qemu-kvm-4.1.0-13.module+el8.1.0+4313+ef76ec61.x86_64
      seabios-1.12.0-5.module+el8.1.0+4022+29a53beb.x86_64
      kernel-4.18.0-137.el8.x86_64

Thanks
Yu Wang

Comment 9 bugzilla 2019-10-25 08:48:44 UTC
That kernel is hideously old; I'm testing with 5.2/5.3 kernels.

It seems this may be a qcow2 bug rather than a virtio-scsi or virtio-win one:

https://bugs.launchpad.net/qemu/+bug/1847793

Comment 10 Vadim Rozenfeld 2019-10-25 09:41:52 UTC
(In reply to bugzilla from comment #9)
> That kernel is hideously old; I'm testing with 5.2/5.3 kernels.
> 
> It seems this may be a qcow2 bug rather than a virtio-scsi or virtio-win one:
> 
> https://bugs.launchpad.net/qemu/+bug/1847793

Please take a look at https://bugzilla.redhat.com/show_bug.cgi?id=1743176
It might be related as well.

Best,
Vadim.

Comment 11 bugzilla 2019-11-10 13:12:17 UTC
I have just seen qcow2 corruption on a Debian 10 guest, so it's not a virtio-win issue. It is probably one of the multitude of qcow2 corruption bugs in qemu 4.1.

