Bug 836710
Summary: | Data loss when writing to qcow2-format disk files |
---|---|
Product: | [Community] Virtualization Tools |
Component: | libguestfs |
Reporter: | Richard W.M. Jones <rjones> |
Assignee: | Richard W.M. Jones <rjones> |
Status: | CLOSED NEXTRELEASE |
Severity: | unspecified |
Priority: | unspecified |
Version: | unspecified |
CC: | dyasny, knoel, leiwang, mbooth, moli, qguan |
Hardware: | Unspecified |
OS: | Unspecified |
Doc Type: | Bug Fix |
Type: | Bug |
: | 837691 |
Last Closed: | 2012-07-07 20:55:05 UTC |
Bug Blocks: | 837691, 837941 |

Description
Richard W.M. Jones
2012-06-30 08:30:38 UTC
The following command will hit the bug on a fast machine after many iterations (enabling debugging seems to negatively affect the ability to reproduce the bug; it's simpler to reproduce if debugging is turned off, but obviously less useful).

  while make -C resize check LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 > /tmp/log 2>&1 ; do echo -n . ; done

Debugging output to be attached as soon as I can get Bugzilla to work ...

Created attachment 595560
LIBGUESTFS_DEBUG=1 output when the virt-resize test fails
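
The retry loop in the description runs until the first failure but does not say how many iterations it took. A minimal sketch of a counted variant follows; the function name `run_until_fail` and the `RUN_LOG` variable are hypothetical helpers, not part of libguestfs or its test suite.

```shell
#!/bin/sh
# Hypothetical helper, not part of libguestfs.
# run_until_fail MAX CMD [ARGS...]
#   Runs CMD up to MAX times, overwriting the log on every run so that
#   the log left behind belongs to the failing iteration.
#   Returns 1 at the first failure, 0 if every run succeeded.
RUN_LOG=${RUN_LOG:-/tmp/log}

run_until_fail() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        if ! "$@" >"$RUN_LOG" 2>&1; then
            echo "failed on iteration $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $max iterations"
    return 0
}

# Example use (not executed here):
#   run_until_fail 1000 make -C resize check LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1
```

Capping the number of iterations is only a convenience; the original unbounded `while` loop is equally valid for reproducing the bug.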
Interestingly, even a seemingly innocent test case fails, again, only very rarely.

------------------------------
#!/bin/sh -

set -e

cd /tmp

qemu-img create -f qcow2 test.img 500M >/dev/null 2>&1

guestfish <<EOF
add test.img readonly:false format:qcow2
run
part-init /dev/sda gpt
part-add /dev/sda primary 64 65599
part-add /dev/sda primary 65664 1019647
copy-device-to-device /dev/zero /dev/sda2 size:385810944
EOF

# This command will fail if the partition cannot be found, so
# effectively it's a test of whether Linux recognized the
# partition table on disk.
guestfish <<EOF
add test.img
run
blockdev-getsize64 /dev/sda2
EOF
------------------------------

Commenting out the copy-device-to-device line makes the bug disappear, which is very strange.

It does seem like a qemu data corruptor bug.

Created attachment 595646
reread.sh
Self-contained test.
Download the attached file.
chmod +x reread.sh
./reread.sh
Output will look like:
Testing: ..............................................
with each dot corresponding to one run of the test.
After perhaps 100-500 runs it may exit, indicating a test failure.
After it fails, look at the script, the log file and the data file
(the data file will probably be an all-zeroes virtual disk, which it
certainly should not be).
Commenting out the line copy-device-to-device seems to make the
test pass every time (at least, I tested over 10000 iterations like
this without seeing the bug).
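
One way to confirm that the data file really is an all-zeroes virtual disk is sketched below. This is an illustration only: `is_all_zero` is a hypothetical helper, and it assumes the qcow2 image has first been converted to raw, because the qcow2 file itself always contains a non-zero header.

```shell
#!/bin/sh
# Hypothetical helper, not part of reread.sh or libguestfs.
# is_all_zero FILE
#   Succeeds if FILE contains only NUL bytes (or is empty).
is_all_zero() {
    # Delete every NUL byte and count what is left; zero remaining
    # bytes means the file held nothing but zeroes.
    [ "$(LC_ALL=C tr -d '\000' <"$1" | wc -c)" -eq 0 ]
}

# Example use (not executed here; file names are assumptions):
#   qemu-img convert -f qcow2 -O raw test.img test.raw
#   is_all_zero test.raw && echo "virtual disk is blank"
```

A disk that had been partitioned correctly would fail this check, since the GPT headers are non-zero.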
Failure observed on:
Fedora 17 (w/ virtio-blk)
Fedora Rawhide (w/ virtio-scsi)
Both are using the same version of qemu.
Could this just be a problem with udev not bringing up the device quickly enough?

Try adding 'udev settle' after adding the partition and see.

(In reply to comment #6)
> Could this just be a problem with udev not bringing up the device quickly
> enough?
>
> Try adding 'udev settle' after adding the partition and see.

I don't think so. Two reasons why not:

(a) The disk image, examined after the test failed, is completely blank, so it doesn't contain any partitions. This would indicate that the writes are failing in the first run of qemu.

(b) The second boot of the kernel doesn't see any partitions. From https://bugzilla.redhat.com/attachment.cgi?id=595560 :

  [ 0.983353] sda: unknown partition table

(This is of course not surprising given fact (a)).

In addition: We are doing udev settle after adding the partitions during the first run of the kernel. parted does an ioctl to reread the partition table. We know the kernel in the first run sees the new partition table, because the copy from /dev/zero to /dev/sda2 works. Yet the disk is blank. This indicates to me a qemu bug of some sort.

Try using raw instead of qcow2. This takes qemu out of the equation as far as caching is concerned.

(In reply to comment #9)
> Try using raw instead of qcow2. This takes qemu out of the equation as far
> as caching is concerned.

Yup, works fine for raw.

The underlying qemu issue is fixed in qemu-kvm >= 1.1.0 (see bug 836913). The libguestfs issue is fixed in >= 1.19.16, which we'll probably backport to older Fedora and RHEL 6.3.

I've written what I hope is the definitive guide to this bug here:
https://www.redhat.com/archives/libguestfs/2012-July/msg00020.html

Fix is upstream and published in 1.19.16. Also in stable branch versions >= 1.18.4, >= 1.16.27.
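
The version thresholds in the closing comments can be turned into a quick check. The sketch below is hedged: `has_qcow2_fix` is a hypothetical helper, and it assumes that any series newer than 1.19 carries the fix and that any series not named in this bug (for example 1.17.x) does not.

```shell
#!/bin/sh
# Hypothetical helper, not part of libguestfs.
# has_qcow2_fix VERSION
#   VERSION is MAJOR.MINOR.RELEASE, e.g. 1.18.4.
#   Succeeds if that libguestfs version is at or above the fixed
#   versions named in this bug: 1.19.16, 1.18.4, 1.16.27.
has_qcow2_fix() {
    # Split the version string on dots into positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    major=${1:-0}
    minor=${2:-0}
    release=${3:-0}

    case "$major.$minor" in
        1.16) [ "$release" -ge 27 ] ;;
        1.18) [ "$release" -ge 4 ] ;;
        1.19) [ "$release" -ge 16 ] ;;
        *)
            # Assumption: series newer than 1.19 include the fix;
            # every other series is treated as unfixed.
            [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -gt 19 ]; }
            ;;
    esac
}

# Example use (not executed here):
#   has_qcow2_fix 1.18.3 || echo "this libguestfs may still hit the bug"
```

This covers only the libguestfs side; the qemu-kvm >= 1.1.0 requirement would need a separate check.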