836710 – Data loss when writing to qcow2-format disk files

Summary: Data loss when writing to qcow2-format disk files

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	None
Product:	Virtualization Tools
Classification:	Community
Component:	libguestfs
Sub Component:
Version:	unspecified
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Richard W.M. Jones
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	837691 837941

Reported:	2012-06-30 08:30 UTC by Richard W.M. Jones
Modified:	2013-01-09 12:04 UTC
CC List:	6 users
Fixed In Version:
Clone Of:
Clones:	837691
Environment:
Last Closed:	2012-07-07 20:55:05 UTC
Embargoed:
Dependent Products:

Attachments
LIBGUESTFS_DEBUG=1 output when the virt-resize test fails (148.28 KB, text/plain) 2012-07-01 19:35 UTC, Richard W.M. Jones
reread.sh (966 bytes, text/plain) 2012-07-02 07:41 UTC, Richard W.M. Jones

Description Richard W.M. Jones 2012-06-30 08:30:38 UTC

Description of problem:

The virt-resize test fails occasionally with the following:

[...]
libguestfs: trace: umount_all
libguestfs: trace: umount_all = 0
libguestfs: trace: sync
libguestfs: trace: sync = 0
libguestfs: trace: close
libguestfs: trace: internal_autosync
libguestfs: trace: internal_autosync = 0
libguestfs: trace: kill_subprocess
libguestfs: trace: kill_subprocess = 0
libguestfs: trace: add_drive_opts "test2.img" "readonly:false" "format:qcow2"
libguestfs: trace: add_drive_opts = 0
libguestfs: trace: launch
libguestfs: trace: launch = 0
Expanding /dev/sda2 using the 'pvresize' method ...
libguestfs: trace: pvresize "/dev/sda2"
libguestfs: trace: pvresize = -1 (error)
Fatal error: exception Guestfs.Error("pvresize: pvresize_stub: /dev/sda2: No such file or directory")
libguestfs: trace: close
libguestfs: trace: internal_autosync
libguestfs: trace: internal_autosync = 0
libguestfs: trace: kill_subprocess
libguestfs: trace: kill_subprocess = 0
/home/rjones/d/libguestfs/run: command failed with exit code 2
FAIL: test-virt-resize.sh

Could be a race when detecting partitions?

We're not sure if this is related to virtio-scsi.


Version-Release number of selected component (if applicable):

libguestfs 1.19.15

How reproducible:

Unknown, but not very often.

Steps to Reproduce:
1. make -C resize check

Comment 1 Richard W.M. Jones 2012-07-01 19:32:34 UTC

The following command hits the bug on a fast machine after
many iterations.  Enabling debugging seems to make the bug
harder to reproduce; it reproduces more readily with debugging
turned off, but the output is then obviously less useful.

while make -C resize check LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 > /tmp/log 2>&1 ; do echo -n . ; done

Debugging output to be attached as soon as I can get Bugzilla
to work ...

Comment 2 Richard W.M. Jones 2012-07-01 19:35:15 UTC

Created attachment 595560
LIBGUESTFS_DEBUG=1 output when the virt-resize test fails

Comment 3 Richard W.M. Jones 2012-07-01 21:25:37 UTC

Interestingly, even a seemingly innocent test case fails,
although again only very rarely.

------------------------------
#!/bin/sh -
set -e
cd /tmp

qemu-img create -f qcow2 test.img 500M >/dev/null 2>&1

guestfish <<EOF
  add test.img readonly:false format:qcow2
  run
  part-init /dev/sda gpt
  part-add /dev/sda primary 64 65599
  part-add /dev/sda primary 65664 1019647
  copy-device-to-device /dev/zero /dev/sda2 size:385810944
EOF

# This command will fail if the partition cannot be found, so
# effectively it's a test of whether Linux recognized the
# partition table on disk.
guestfish <<EOF
  add test.img
  run
  blockdev-getsize64 /dev/sda2
EOF
------------------------------

Comment 4 Richard W.M. Jones 2012-07-02 07:06:06 UTC

Commenting out the copy-device-to-device line makes
the bug disappear, which is very strange.  It does
seem like a qemu data corruptor bug.

Comment 5 Richard W.M. Jones 2012-07-02 07:41:39 UTC

Created attachment 595646
reread.sh

Self-contained test.

Download the attached file.

  chmod +x reread.sh
  ./reread.sh

Output will look like:

  Testing: ..............................................

with each dot corresponding to one run of the test.

After perhaps 100-500 runs it may exit, indicating a test failure.

After it fails, look at the script, the log file and the data file
(the data file will probably be an all-zeroes virtual disk, which it
certainly should not be).

Commenting out the copy-device-to-device line seems to make the
test pass every time (at least, I ran over 10000 iterations like
this without seeing the bug).

Failure observed on:

Fedora 17 (w/ virtio-blk)
Fedora Rawhide (w/ virtio-scsi)

Both are using the same version of qemu.
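
For anyone without the attachment to hand, the general shape of such a
loop can be sketched as below.  This is not the attached reread.sh: the
function name repeat_until_failure, the MAX limit and the log file
location are assumptions made for illustration.

```sh
#!/bin/sh
# repeat_until_failure MAX CMD [ARGS...]   (hypothetical helper)
# Run CMD up to MAX times, printing one dot per passing run.
# Returns 1 as soon as a run fails, 0 if all MAX runs pass.
repeat_until_failure() {
    max=$1
    shift
    # Assumed log location; the output of the most recent run is kept here.
    log=${TMPDIR:-/tmp}/repeat-test.log
    i=0
    printf 'Testing: '
    while [ "$i" -lt "$max" ]; do
        if ! "$@" > "$log" 2>&1; then
            printf '\nrun %d failed, see %s\n' "$((i + 1))" "$log"
            return 1
        fi
        printf .
        i=$((i + 1))
    done
    printf '\n'
    return 0
}
```

With the test case from comment 3 saved as a script (the name
./testcase.sh is made up), it could be run as
"repeat_until_failure 1000 ./testcase.sh".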

Comment 6 Avi Kivity 2012-07-02 09:36:03 UTC

Could this just be a problem with udev not bringing up the device quickly enough?

Try adding 'udev settle' after adding the partition and see.

Comment 7 Richard W.M. Jones 2012-07-02 09:44:21 UTC

(In reply to comment #6)
> Could this just be a problem with udev not bringing up the device quickly
> enough?
> 
> Try adding 'udev settle' after adding the partition and see.

I don't think so.  Two reasons why not:

(a) The disk image, examined after the test failed, is completely blank,
so it doesn't contain any partitions.  This would indicate that the
writes are failing in the first run of qemu.

(b) The second boot of the kernel doesn't see any partitions.
From https://bugzilla.redhat.com/attachment.cgi?id=595560 :
[    0.983353]  sda: unknown partition table
(This is of course not surprising given fact (a)).
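
Point (a) can be checked mechanically.  A minimal sketch, assuming the
qcow2 file has first been converted to raw (for example with
"qemu-img convert -f qcow2 -O raw test.img test.raw"), since the qcow2
file itself always starts with a non-zero header.  is_all_zero is a
hypothetical helper name.

```sh
#!/bin/sh
# is_all_zero FILE   (hypothetical helper)
# Exit status: 0 if FILE contains only zero bytes, 1 if it contains
# any other byte, 2 if FILE cannot be read.
# Run it on a raw image, not directly on the qcow2 file.
is_all_zero() {
    [ -r "$1" ] || return 2
    # Delete every NUL byte and count what is left over.
    nonzero=$(tr -d '\000' < "$1" | wc -c)
    [ "$nonzero" -eq 0 ]
}
```

This reads the whole image, so on a 500M disk it takes a moment, but it
needs nothing beyond tr and wc.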

Comment 8 Richard W.M. Jones 2012-07-02 09:47:37 UTC

In addition:

We are doing udev settle after adding the partitions during
the first run of the kernel.

parted does an ioctl to reread the partition table.

We know the kernel in the first run sees the new partition table,
because the copy from /dev/zero to /dev/sda2 works.

Yet the disk is blank.  This indicates to me a qemu bug of
some sort.

Comment 9 Avi Kivity 2012-07-02 09:59:45 UTC

Try using raw instead of qcow2.  This takes qemu out of the equation as far as caching is concerned.

Comment 10 Richard W.M. Jones 2012-07-03 18:05:22 UTC

Patches posted:
https://www.redhat.com/archives/libguestfs/2012-July/msg00008.html

Comment 11 Richard W.M. Jones 2012-07-03 18:05:47 UTC

(In reply to comment #9)
> Try using raw instead of qcow2.  This takes qemu out of the equation as far
> as caching is concerned.

Yup, works fine for raw.
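
The comparison can be repeated by making the disk format a parameter of
the test case from comment 3.  A sketch only: write_phase_commands and
run_format_test are hypothetical names, and the image file name is made
up.

```sh
#!/bin/sh
# write_phase_commands IMAGE FORMAT   (hypothetical helper)
# Print the guestfish commands from the comment 3 test case, with the
# image name and disk format filled in from the arguments.
write_phase_commands() {
    cat <<EOF
add $1 readonly:false format:$2
run
part-init /dev/sda gpt
part-add /dev/sda primary 64 65599
part-add /dev/sda primary 65664 1019647
copy-device-to-device /dev/zero /dev/sda2 size:385810944
EOF
}

# run_format_test FORMAT   (hypothetical helper)
# Create a fresh image, write to it, then reopen it.  Fails if
# /dev/sda2 cannot be found when the image is opened the second time.
run_format_test() {
    img=test-$1.img
    qemu-img create -f "$1" "$img" 500M >/dev/null 2>&1 || return 1
    write_phase_commands "$img" "$1" | guestfish || return 1
    printf 'add %s\nrun\nblockdev-getsize64 /dev/sda2\n' "$img" | guestfish
}
```

For example: for fmt in qcow2 raw; do run_format_test "$fmt" || echo "$fmt failed"; done
Since the failure is rare, each format needs many runs before a pass
means anything.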

Comment 12 Richard W.M. Jones 2012-07-04 15:04:49 UTC

The underlying qemu issue is fixed in qemu-kvm >= 1.1.0
(see bug 836913).

The libguestfs issue is fixed in >= 1.19.16, and we'll probably
backport the fix to older Fedora and RHEL 6.3.

Comment 13 Richard W.M. Jones 2012-07-04 19:44:25 UTC

I've written what I hope is the definitive guide to
this bug here:

https://www.redhat.com/archives/libguestfs/2012-July/msg00020.html

Comment 14 Richard W.M. Jones 2012-07-05 18:35:37 UTC

Fix is upstream and published in 1.19.16.

Comment 15 Richard W.M. Jones 2012-07-07 20:55:05 UTC

The fix is also in stable branch versions >= 1.18.4 and >= 1.16.27.
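
Putting comments 12, 14 and 15 together, whether a given libguestfs
version carries the fix can be decided from its version string.  A rough
sketch: has_qcow2_fix is a hypothetical helper, and it deliberately
reports "unknown" for any branch not named in this bug rather than
guessing.

```sh
#!/bin/sh
# has_qcow2_fix VERSION   (hypothetical helper, not part of libguestfs)
# Exit status: 0 = fix present, 1 = fix absent,
#              2 = branch not mentioned in this bug, so unknown.
has_qcow2_fix() {
    # Split "MAJOR.MINOR.RELEASE" into three fields.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    major=$1 minor=$2 release=$3

    [ "$major" = 1 ] || return 2
    # First fixed release on each branch, taken from the comments above.
    case "$minor" in
        16) min=27 ;;
        18) min=4 ;;
        19) min=16 ;;
        *)  return 2 ;;
    esac
    [ "${release:-0}" -ge "$min" ]
}
```

For example, "has_qcow2_fix 1.19.15" fails (that is the version this bug
was reported against) and "has_qcow2_fix 1.18.4" succeeds.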
