|Summary:||Data loss when writing to qcow2-format disk files|
|Product:||[Community] Virtualization Tools||Reporter:||Richard W.M. Jones <rjones>|
|Component:||libguestfs||Assignee:||Richard W.M. Jones <rjones>|
|Status:||CLOSED NEXTRELEASE||QA Contact:|
|Version:||unspecified||CC:||dyasny, knoel, leiwang, mbooth, moli, qguan|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|:||837691 (view as bug list)||Environment:|
|Last Closed:||2012-07-07 20:55:05 UTC||Type:||Bug|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Bug Depends On:|
|Bug Blocks:||837691, 837941|
Description Richard W.M. Jones 2012-06-30 08:30:38 UTC
Description of problem:

The virt-resize test fails occasionally with the following:

[...]
libguestfs: trace: umount_all
libguestfs: trace: umount_all = 0
libguestfs: trace: sync
libguestfs: trace: sync = 0
libguestfs: trace: close
libguestfs: trace: internal_autosync
libguestfs: trace: internal_autosync = 0
libguestfs: trace: kill_subprocess
libguestfs: trace: kill_subprocess = 0
libguestfs: trace: add_drive_opts "test2.img" "readonly:false" "format:qcow2"
libguestfs: trace: add_drive_opts = 0
libguestfs: trace: launch
libguestfs: trace: launch = 0
Expanding /dev/sda2 using the 'pvresize' method ...
libguestfs: trace: pvresize "/dev/sda2"
libguestfs: trace: pvresize = -1 (error)
Fatal error: exception Guestfs.Error("pvresize: pvresize_stub: /dev/sda2: No such file or directory")
libguestfs: trace: close
libguestfs: trace: internal_autosync
libguestfs: trace: internal_autosync = 0
libguestfs: trace: kill_subprocess
libguestfs: trace: kill_subprocess = 0
/home/rjones/d/libguestfs/run: command failed with exit code 2
FAIL: test-virt-resize.sh

Could be a race when detecting partitions? We're not sure if this is related to virtio-scsi.

Version-Release number of selected component (if applicable):
libguestfs 1.19.15

How reproducible:
Unknown, but not very often.

Steps to Reproduce:
1. make -C resize check
Comment 1 Richard W.M. Jones 2012-07-01 19:32:34 UTC
The following command will hit the bug on a fast machine after many iterations. Enabling debugging seems to make the bug harder to reproduce; it is easier to hit with debugging turned off, but the output is then obviously less useful.

while make -C resize check LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 > /tmp/log 2>&1 ; do echo -n . ; done

Debugging output to be attached as soon as I can get Bugzilla to work ...
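Because the failure is rare, it can help to record how many runs it took to hit it. The following is only a sketch of the same loop wrapped in a helper; the function name run_until_failure and the LOG variable are hypothetical and are not part of libguestfs or its test suite.

```sh
#!/bin/sh
# Hypothetical helper: run a command repeatedly until it fails or a cap is hit.
# Usage: run_until_failure MAX_RUNS COMMAND [ARGS...]
# The output of the most recent run is kept in $LOG (default /tmp/log).
# Prints "failed on run N" and returns 1 on the first failure;
# returns 0 if every run passed. Progress dots go to stderr.
# Note: plain sh has no local variables, so max_runs and i are global.
run_until_failure() {
    max_runs=$1
    shift
    i=1
    while [ "$i" -le "$max_runs" ]; do
        if ! "$@" >"${LOG:-/tmp/log}" 2>&1; then
            echo "failed on run $i"
            return 1
        fi
        printf . >&2
        i=$((i + 1))
    done
    echo "no failure in $max_runs runs"
    return 0
}
```

For example, `run_until_failure 1000 make -C resize check` stops at the first failing run and leaves that run's output in the log file.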
Comment 2 Richard W.M. Jones 2012-07-01 19:35:15 UTC
Created attachment 595560 [details] LIBGUESTFS_DEBUG=1 output when the virt-resize test fails
Comment 3 Richard W.M. Jones 2012-07-01 21:25:37 UTC
Interestingly, even a seemingly innocent test case fails, again, only very rarely.

------------------------------
#!/bin/sh -

set -e
cd /tmp

qemu-img create -f qcow2 test.img 500M >/dev/null 2>&1

guestfish <<EOF
add test.img readonly:false format:qcow2
run
part-init /dev/sda gpt
part-add /dev/sda primary 64 65599
part-add /dev/sda primary 65664 1019647
copy-device-to-device /dev/zero /dev/sda2 size:385810944
EOF

# This command will fail if the partition cannot be found, so
# effectively it's a test of whether Linux recognized the
# partition table on disk.
guestfish <<EOF
add test.img
run
blockdev-getsize64 /dev/sda2
EOF
------------------------------
Comment 4 Richard W.M. Jones 2012-07-02 07:06:06 UTC
Commenting out the copy-device-to-device line makes the bug disappear, which is very strange. It does seem like a qemu data corruptor bug.
Comment 5 Richard W.M. Jones 2012-07-02 07:41:39 UTC
Created attachment 595646 [details]
reread.sh

Self-contained test. Download the attached file.

chmod +x reread.sh
./reread.sh

Output will look like:

Testing: ..............................................

with each dot corresponding to one run of the test. After perhaps 100-500 runs it may exit, indicating a test failure.

After it fails, look at the script, the log file and the data file (the data file will probably be an all-zeroes virtual disk, which it certainly should not be).

Commenting out the line copy-device-to-device seems to make the test pass every time (at least, I tested over 10000 iterations like this without seeing the bug).

Failure observed on:

Fedora 17 (w/ virtio-blk)
Fedora Rawhide (w/ virtio-scsi)

Both are using the same version of qemu.
Comment 6 Avi Kivity 2012-07-02 09:36:03 UTC
Could this be just a problem with udev not bringing up the device quickly enough?

Try adding 'udev settle' after adding the partition and see.
Comment 7 Richard W.M. Jones 2012-07-02 09:44:21 UTC
(In reply to comment #6)
> Could this be just a problem with udev not bringing up the device quickly
> enough?
>
> Try adding 'udev settle' after adding the partition and see.

I don't think so. Two reasons why not:

(a) The disk image, examined after the test failed, is completely blank, so it doesn't contain any partitions. This would indicate that the writes are failing in the first run of qemu.

(b) The second boot of the kernel doesn't see any partitions. From https://bugzilla.redhat.com/attachment.cgi?id=595560 :

[ 0.983353] sda: unknown partition table

(This is of course not surprising given fact (a)).
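One way to check point (a) from the host, without booting a second appliance, is to convert the qcow2 file to raw and test whether the result is all zero bytes. This is a sketch only, and not necessarily how the image was examined here; the function names is_all_zero and qcow2_is_blank are hypothetical, and the temporary raw file can be as large as the virtual disk.

```sh
#!/bin/sh
# is_all_zero FILE: succeed if FILE (a raw image) contains only zero bytes.
is_all_zero() {
    # Arithmetic expansion strips any padding that wc may print.
    size=$(($(wc -c <"$1")))
    # Compare FILE against the same number of bytes read from /dev/zero.
    head -c "$size" /dev/zero | cmp -s - "$1"
}

# qcow2_is_blank IMAGE: convert a qcow2 image to a temporary raw file and
# test the raw data. Returns 0 if blank, 1 if not, 2 if the conversion failed.
qcow2_is_blank() {
    raw=$(mktemp) || return 2
    qemu-img convert -f qcow2 -O raw "$1" "$raw" || { rm -f "$raw"; return 2; }
    status=0
    is_all_zero "$raw" || status=$?
    rm -f "$raw"
    return "$status"
}
```

After a failed run of the test script, `qcow2_is_blank /tmp/test.img && echo "image is blank"` would be expected to print the message if the writes were lost.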
Comment 8 Richard W.M. Jones 2012-07-02 09:47:37 UTC
In addition: We are doing udev settle after adding the partitions during the first run of the kernel. parted does an ioctl to reread the partition table. We know the kernel in the first run sees the new partition table, because the copy from /dev/zero to /dev/sda2 works. Yet the disk is blank. This indicates to me a qemu bug of some sort.
Comment 9 Avi Kivity 2012-07-02 09:59:45 UTC
Try using raw instead of qcow2. This takes qemu out of the equation as far as caching is concerned.
Comment 10 Richard W.M. Jones 2012-07-03 18:05:22 UTC
Comment 11 Richard W.M. Jones 2012-07-03 18:05:47 UTC
(In reply to comment #9)
> Try using raw instead of qcow2. This takes qemu out of the equation as far
> as caching is concerned.

Yup, works fine for raw.
Comment 12 Richard W.M. Jones 2012-07-04 15:04:49 UTC
The underlying qemu issue is fixed in qemu-kvm >= 1.1.0 (see bug 836913). The libguestfs issue is fixed in >= 1.19.16 which we'll probably backport to older Fedora and RHEL 6.3.
Comment 13 Richard W.M. Jones 2012-07-04 19:44:25 UTC
I've written what I hope is the definitive guide to this bug here: https://www.redhat.com/archives/libguestfs/2012-July/msg00020.html
Comment 14 Richard W.M. Jones 2012-07-05 18:35:37 UTC
Fix is upstream and published in 1.19.16.
Comment 15 Richard W.M. Jones 2012-07-07 20:55:05 UTC
Also in stable branch versions >= 1.18.4, >= 1.16.27.
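To summarize the fixed versions listed in this bug (>= 1.19.16, >= 1.18.4, >= 1.16.27), here is a small sketch that checks a libguestfs version string against them. The function names version_ge and has_qcow2_fix are hypothetical, and the helper assumes a three-component numeric version; release series not mentioned in this report are reported as unknown rather than guessed.

```sh
#!/bin/sh
# version_ge A B: succeed if dotted version A >= B.
# Assumes exactly three numeric components, e.g. 1.19.16.
version_ge() {
    old_ifs=$IFS
    IFS=.
    # Split both versions on dots: $1 $2 $3 = A, $4 $5 $6 = B.
    set -- $1 $2
    IFS=$old_ifs
    [ "$1" -gt "$4" ] && return 0
    [ "$1" -lt "$4" ] && return 1
    [ "$2" -gt "$5" ] && return 0
    [ "$2" -lt "$5" ] && return 1
    [ "$3" -ge "$6" ]
}

# has_qcow2_fix VERSION: 0 = fix present, 1 = fix absent,
# 2 = series not mentioned in this bug report (unknown).
has_qcow2_fix() {
    case $1 in
        1.16.*) version_ge "$1" 1.16.27 ;;
        1.18.*) version_ge "$1" 1.18.4 ;;
        1.19.*) version_ge "$1" 1.19.16 ;;
        *) return 2 ;;
    esac
}
```

For example, `has_qcow2_fix 1.19.15` returns 1, matching the version in which the bug was originally reported.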