Bug 1029344 - qcow2 images can become corrupted if host crashes,leading to full data loss/unavailability
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: qemu-kvm
Version: 6.5
Hardware: x86_64 Linux
Priority: unspecified, Severity: high
Target Milestone: rc
Assigned To: Ademar Reis
QA Contact: Virtualization Bugs
Reported: 2013-11-12 03:36 EST by Vali Dragnuta
Modified: 2013-12-06 00:11 EST (History)
Last Closed: 2013-11-25 13:24:06 EST
External Trackers: CentOS 0006733
Description Vali Dragnuta 2013-11-12 03:36:11 EST
Description of problem:
After a power failure, disk images in qcow2 format that were in use at the time can become logically corrupt: they then appear unused (full of zeroes).
The data still seems to be in the file, but at this moment I cannot find any reliable method to recover it. With a raw image a recovery path would be available, but a qcow2 image presents only zeroes once it gets corrupted. My understanding is that the block map of the image gets reset and the image is then assumed to be unused.
My detailed setup :

Kernel 2.6.32-358.18.1.el6.x86_64
qemu-kvm-0.12.1.2-2.355.0.1.el6.centos.7.x86_64
Used via libvirt libvirt-0.10.2-18.el6_4.14.x86_64
The image was used from an NFS share (the NFS server did NOT crash and remained active throughout).

qemu-img check finds no corruption;
qemu-img convert converts the image to raw without errors, but the resulting raw image is full of zeroes. However, there is data in the file, and the storage backend was not restarted or deactivated during the incident.
I encountered this issue on two different machines; in both cases I was not able to recover the data.
The image was qcow2, thin provisioned, created like this:
 qemu-img create -f qcow2 -o cluster_size=2M imagename.img

While addressing the root cause so that the issue does not repeat would be ideal, a temporary workaround that can be run on the affected qcow2 image to "patch" it and recover the data (possibly followed by a full fsck/recovery inside the guest) would also be good. Otherwise we are basically losing data on a large scale when using qcow2.

 

Version-Release number of selected component (if applicable):
Kernel 2.6.32-358.18.1.el6.x86_64
qemu-kvm-0.12.1.2-2.355.0.1.el6.centos.7.x86_64
Used via libvirt libvirt-0.10.2-18.el6_4.14.x86_64

How reproducible:
I am not able to reproduce it manually (and at the moment do not have enough resources to try), but the probability of the issue seems quite high, as this is the second case of such corruption in weeks.
Additional info:
I can privately provide an image displaying the corruption.
Comment 2 Ademar Reis 2013-11-12 06:36:13 EST
Vali, thanks for taking the time to enter a bug report with us. We appreciate
the feedback and look to use reports such as this to guide our efforts at
improving our products. That being said, we're not able to guarantee the
timeliness or suitability of a resolution for issues entered here because this
is not a mechanism for requesting support.

If this issue is critical or in any way time sensitive, please raise a ticket
through your regular Red Hat support channels to make certain  it receives the
proper attention and prioritization to assure a timely resolution.

For information on how to contact the Red Hat production support team, please
visit: https://www.redhat.com/support/process/production/#howto
Comment 3 Kevin Wolf 2013-11-12 07:50:19 EST
What is your qemu command line (in particular, the cache option used for the
image) and your guest OS?

In the default cache mode, qcow2 optimises performance by writing out metadata
only if the guest requests a disk flush. However, especially older guest OSes
neglect to correctly issue flush commands. In this case, in order to be safe,
you would have to change the cache mode e.g. to writethrough.
Comment 4 Vali Dragnuta 2013-11-12 08:24:01 EST
Hello everybody.

The cache option was explicitly set to "none" in the libvirt definition of the virtual machine, like this:
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/mnt/ARRAY1-NFS4/hostnameXXXX.disk2.qcow.img'/>
      <target dev='vdb' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>

Also, fortunately, it's not critical data that I have lost on that image and I can actually recreate it. However, the issue is "nasty" and I'm afraid this might happen again and cause much more serious damage, without even a command line tool to scan the image in detail and repair it to the point where it is usable again and recoverable with fsck tools inside the guest.

Back to your questions :
@Ademar: unfortunately, we do not have access to formal support channels. Fortunately, it is not a critical issue (yet). But it looks to me like a serious problem with potentially grave consequences, hence my marking it as of high importance.
@Kevin: cache was set to none (libvirt requires this anyway in order to be able to live migrate guests from node to node). The guest OS is CentOS 6.4. The only special thing I can think of is the cluster size of 2M.
Comment 5 Kevin Wolf 2013-11-12 09:13:09 EST
cache=none is not a writethrough cache mode, it merely bypasses the host kernel's
page cache. You could well be hitting the scenario I mentioned. Switching to
cache=writethrough would solve this, at the cost of write performance.

However, if correctly configured, a CentOS 6.4 guest should be sending the disk
flush commands. Which file system are you using on the guest? If ext3/4, please
make sure that write barriers are enabled (they are by default on ext4, but not
on ext3). Otherwise check if there are similar mount options for your FS.
Comment 6 Vali Dragnuta 2013-11-12 09:48:35 EST
Hm, I thought that cache=none is the barest of them all and would send the data to the storage as soon as the guest issues the request.

The storage was NFS, anyway, so I do not think that barriers are even relevant here. Also, barriers on the NFS SERVER should not be relevant, as in both cases the NAS was available. So whatever data loss happened, it happened between the kvm process and the OS NFS layer. I was expecting that cache=none would not insert any delayed buffering at the virtualization host level in this chain, with the added benefit of not having the host page cache used for (arguably) useless caching at all (let the guest manage its own caching within the limits of its own allocated memory).
And to avoid issues like this: https://bugzilla.redhat.com/show_bug.cgi?id=974798
Now, the manual page is also quite confusing: it only warns of possible data loss for cache=writeback, not for the other modes.
So, at this point the philosophical questions are:
- How does one recover data (that should still exist) in such a qcow image?
- What would be the recommended cache mode to have both data consistency AND avoid the host page cache for guest disk I/O? AND be libvirt supported :)
- I myself would find it logical that having no cache at all would also mean that there is nowhere the data could be lost in case of power failure.
Also, we should at least have better documentation for all the options and the implications of using each of them.
- Why would such a failure corrupt the WHOLE image? Because this is what actually happens: after such a failure the image appears to be made of only zeroes.
Isn't qcow2 in this case too weak and prone to failures? OK, I get that you could lose a block or two or three, or N blocks in such an event, but it seems that such an event invalidates the whole image. This does not happen even on a physical disk used only with writeback caches and a filesystem without journals :) Maybe an internal journal for qcow would be useful for such scenarios.
Comment 7 Kevin Wolf 2013-11-12 11:10:23 EST
(In reply to Vali Dragnuta from comment #6)
> The storage was NFS, anyway, so I do not think that barriers are even
> relevant here.

They are relevant. In this case not because they are passed to the NFS server
(which they are as well), but for qemu. I am specifically talking about
barriers in the guest, not on the host. Missing barriers on the host can cause
trouble as well, but it would look different (e.g. qemu-img check failing).

> -how one recovers data (that should still exist) in such a qcow image ?

File system recovery tools may or may not work.

> -what would be the recommended cache mode to have both data consistency, AND
> avoiding the host pagecache for guest disk i/o ?  AND be libvirt supported :)

The recommended mode of operation would be to tell the guest to flush the disk
(use write barriers) when needed. qemu respects disk flushes, and if your image
contains only zeros, the guest hasn't sent any flush command.

Upstream qemu knows cache=directsync, which is basically writethrough + none,
but RHEL 6 doesn't, and it makes writes _slow_, so if you can, you want to
avoid it.
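
As a rough illustration of the cache modes named in this thread, the sketch below encodes each mode as two flags: whether the host page cache is bypassed, and whether a write may be reported complete before it is stable. The flag combinations follow the upstream QEMU documentation; the class and function names are made up for this example and are not a qemu or libvirt API.

```python
from typing import NamedTuple


class CacheMode(NamedTuple):
    direct: bool     # True: host page cache is bypassed (O_DIRECT)
    writeback: bool  # True: a write may complete before it is on stable storage


# Flag combinations as listed in the upstream QEMU documentation.
CACHE_MODES = {
    "writeback":    CacheMode(direct=False, writeback=True),
    "none":         CacheMode(direct=True,  writeback=True),
    "writethrough": CacheMode(direct=False, writeback=False),
    "directsync":   CacheMode(direct=True,  writeback=False),
}


def bypasses_host_page_cache(mode: str) -> bool:
    return CACHE_MODES[mode].direct


def needs_guest_flush(mode: str) -> bool:
    """True if data is only guaranteed stable once the guest issues a flush."""
    return CACHE_MODES[mode].writeback


if __name__ == "__main__":
    for name in CACHE_MODES:
        print(f"{name:13} bypass_page_cache={bypasses_host_page_cache(name)!s:5} "
              f"needs_guest_flush={needs_guest_flush(name)}")
```

Read this way, directsync takes the page cache bypass from none and the write semantics from writethrough, while cache=none still depends on the guest sending flushes.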

> - why would such a failure corrupt the WHOLE image ? Because this is what
> actually happens, after such a failure the image appears as made of only
> zeroes.

It can only cause new allocations to not be written out. If you happen to run
the VM from an empty image (i.e. the same qemu instance that installs the
system or a live system), and the guest didn't send a flush, this may mean the
whole content of the image.

After a clean qemu shutdown or a flush command data won't be lost.
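
The behaviour described in this comment can be sketched with a toy model. This is not how qemu is implemented; it only assumes, for illustration, that cluster data reaches the file immediately while the guest-to-file mapping is held in memory until a flush or clean shutdown. `ToyImage` and its methods are hypothetical names.

```python
class ToyImage:
    """Toy model: cluster data goes to the file at once, the mapping is cached."""

    def __init__(self):
        self.file_clusters = []  # data clusters physically present in the file
        self.on_disk_map = {}    # guest cluster -> file cluster, as stored in the file
        self.cached_map = {}     # the same mapping, as held in memory while running

    def write(self, guest_cluster, data):
        if guest_cluster not in self.cached_map:
            # New allocation: append the data and remember the mapping in memory.
            self.file_clusters.append(data)
            self.cached_map[guest_cluster] = len(self.file_clusters) - 1
        else:
            # Already allocated: overwrite in place, mapping is unchanged.
            self.file_clusters[self.cached_map[guest_cluster]] = data

    def flush(self):
        """Guest flush or clean shutdown: persist the mapping."""
        self.on_disk_map = dict(self.cached_map)

    def crash(self):
        """Host power failure: only what was stored in the file survives."""
        self.cached_map = dict(self.on_disk_map)

    def read(self, guest_cluster):
        idx = self.cached_map.get(guest_cluster)
        return b"\0" if idx is None else self.file_clusters[idx]


if __name__ == "__main__":
    img = ToyImage()
    img.write(0, b"A")
    img.flush()          # mapping for cluster 0 is now persistent
    img.write(1, b"B")   # new allocation, never flushed
    img.crash()
    print(img.read(0), img.read(1), b"B" in img.file_clusters)
```

In this model only the unflushed allocation reads back as zeroes after the crash, although its bytes are still present in the file; anything allocated before the last flush stays reachable.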
Comment 8 Vali Dragnuta 2013-11-12 12:46:17 EST
(In reply to Kevin Wolf from comment #7)
> (In reply to Vali Dragnuta from comment #6)
> > The storage was NFS, anyway, so I do not think that barriers are even
> > relevant here.
> 
> They are relevant. In this case not because they are passed to the NFS server
> (which they are as well), but for qemu. I am specifically talking about
> > barriers in the guest, not on the host. Missing barriers on the host can cause
> > trouble as well, but it would look different (e.g. qemu-img check failing).

The guest was also CentOS 6.4; the affected image was hosting an LVM physical volume containing a swap device and an ext4 filesystem with default mount options. However, I still fail to see how mount options in the guest can lead to corruption of the qcow image: the corruption resides in the qcow metadata, which the guest is not even able to see, so any (wrong) mount option in the guest should not produce effects at the qcow metadata level.



> > -how one recovers data (that should still exist) in such a qcow image ?
> 
> File system recovery tools may or may not work.

Did I already mention that even though a binary editor shows data and strings in the actual file, once this file is passed to kvm or the qemu-img tools it shows only ZEROES, as if the system forgot any association between the previously allocated sectors in the image and the data written to the qcow image?
If I run qemu-img convert on this image and ask it to produce a raw image, it produces an image full of zeroes (actually a sparse file of the size declared in the image). The error is not in the guest filesystem; the error is that although data exists in the file, the existence of this data is no longer known to the code that reads the image and maps file regions to the blocks provided to the "consumer" running above.



> 
> > -what would be the recommended cache mode to have both data consistency, AND
> > avoiding the host pagecache for guest disk i/o ?  AND be libvirt supported :)
> 
> > The recommended mode of operation would be to tell the guest to flush the disk
> > (use write barriers) when needed. qemu respects disk flushes, and if your image
> > contains only zeros, the guest hasn't sent any flush command.
> 
> Upstream qemu knows cache=directsync, which is basically writethrough + none,
> but RHEL 6 doesn't, and it makes writes _slow_, so if you can, you want to
> avoid it.


> 
> > - why would such a failure corrupt the WHOLE image ? Because this is what
> > actually happens, after such a failure the image appears as made of only
> > zeroes.
> 
> It can only cause new allocations to not be written out. If you happen to run
> the VM from an empty image (i.e. the same qemu instance that installs the
> system or a live system), and the guest didn't send a flush, this may mean the
> whole content of the image.

Well, it's clear that this is NOT the case here, because that virtual machine was stopped and started a few times (during its provisioning, installation of application software and some basic tuning/testing). However, the image lost all the allocations, suggesting to me that something else happened here.

> 
> After a clean qemu shutdown or a flush command data won't be lost.

As I said, this seems to be another type of corruption: we lost all the previously allocated sectors.
Comment 9 Kevin Wolf 2013-11-13 05:33:39 EST
Can you make the image available for download somewhere? I'd like to see if
there's any metadata in the image and just the link to it has been lost. Though
qemu-img check succeeding suggests otherwise, it would find a lot of leaked
clusters otherwise.
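
For anyone who wants to look at such an image themselves, a minimal sketch of reading the fixed leading part of the qcow2 header and the raw L1 table is shown below. The field layout follows the published qcow2 specification; the function names are hypothetical, and this is an inspection aid under those assumptions, not a repair tool and not what was actually run on the reporter's image.

```python
import struct

# Fixed leading part of the qcow2 header (big-endian), per the qcow2 specification.
HEADER = struct.Struct(">4sIQIIQIIQQIIQ")
FIELDS = ("magic", "version", "backing_file_offset", "backing_file_size",
          "cluster_bits", "size", "crypt_method", "l1_size", "l1_table_offset",
          "refcount_table_offset", "refcount_table_clusters",
          "nb_snapshots", "snapshots_offset")


def parse_qcow2_header(raw: bytes) -> dict:
    """Decode the first 72 header bytes into a dict of named fields."""
    if len(raw) < HEADER.size:
        raise ValueError("need at least %d bytes" % HEADER.size)
    header = dict(zip(FIELDS, HEADER.unpack_from(raw)))
    if header["magic"] != b"QFI\xfb":
        raise ValueError("not a qcow2 image")
    header["cluster_size"] = 1 << header["cluster_bits"]
    return header


def read_l1_table(path: str) -> list:
    """Return the raw 64-bit L1 entries; a zero entry links to no L2 table."""
    with open(path, "rb") as f:
        header = parse_qcow2_header(f.read(HEADER.size))
        f.seek(header["l1_table_offset"])
        raw = f.read(8 * header["l1_size"])
    return list(struct.unpack(">%dQ" % header["l1_size"], raw))
```

An L1 table that is entirely zero, in a file that still contains recognisable guest data, would correspond to the symptom described in this report.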
Comment 10 Vali Dragnuta 2013-11-13 05:42:29 EST
Hello Kevin. 
I will send you an email with a link to download at least one image.
Please note that I most likely reallocated ONE cluster in the qcow image, because initially I did not realize that the whole image was full of zeroes and the first thing I tried was to recreate the partition table, so this may have forced the allocation of one block, the first.
Comment 11 Vali Dragnuta 2013-11-13 05:58:23 EST
Sent email with download link
Comment 12 Vali Dragnuta 2013-11-16 05:13:26 EST
I just realized another fact: if cache=none is really buffered and unsafe (and I still doubt this is the case, but let's go with this), then why does libvirt refuse to migrate any VM that has disks with a cache mode other than 'none'?

virsh migrate --persistent --undefinesource h-node-02 qemu+ssh://root@ps20/system
error: Unsafe migration: Migration may lead to data corruption if disks use cache != none
Comment 13 Kevin Wolf 2013-11-19 04:55:38 EST
I couldn't find any signs of old metadata in the image file. It looks exactly
like an image that was newly created and got only the first 2 MB allocated. On
the other hand, with a 2 MB cluster size, a single cluster is enough to hold
the mapping for the entire image file, so if somehow this cluster got
overwritten with invalid data, it's only logical that no old metadata is left
over, because an additional metadata cluster never existed.
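
The arithmetic behind this point can be sketched as follows, assuming the qcow2 layout in which an L2 table occupies exactly one cluster and each L2 entry is 8 bytes. The helper name is made up for the example, and the 64 KiB row is included only for comparison with the format's default cluster size.

```python
L2_ENTRY_SIZE = 8  # bytes per L2 entry in the qcow2 format


def l2_table_coverage(cluster_size: int) -> int:
    """Bytes of guest disk addressed by one L2 table (one cluster of entries)."""
    entries_per_table = cluster_size // L2_ENTRY_SIZE
    return entries_per_table * cluster_size


if __name__ == "__main__":
    for size in (64 * 1024, 2 * 1024 * 1024):
        gib = l2_table_coverage(size) / 2**30
        print(f"cluster_size={size:>8} -> one L2 table maps {gib:g} GiB")
```

With 2 MiB clusters a single L2 table addresses 512 GiB of guest disk, so for any image up to that size the whole mapping lives in one cluster.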

However, qcow2 has some redundancy in that you not only have the mapping, but
also the reference count, and any inconsistencies would be obvious. We do not
have any metadata inconsistencies on this image, though.
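
The redundancy mentioned here can be illustrated with a simplified cross-check: compare the stored reference counts with the set of clusters actually reachable from the header, tables and mappings. This is only a sketch of the idea with hypothetical names; the real qemu-img check compares exact counts and handles many more cases.

```python
def check_consistency(refcounts: dict, referenced: set) -> dict:
    """Cross-check refcounts against the clusters reachable from the metadata.

    refcounts:  file cluster index -> stored reference count
    referenced: file cluster indexes reachable from header, tables and mappings
    """
    # In use according to the refcounts, but nothing points at the cluster.
    leaked = sorted(c for c, n in refcounts.items() if n > 0 and c not in referenced)
    # Something points at the cluster, but its refcount says it is free.
    corrupt = sorted(c for c in referenced if refcounts.get(c, 0) == 0)
    return {"leaked": leaked, "corrupt": corrupt}


if __name__ == "__main__":
    # Mapping lost while refcounts survive: data clusters 2 and 3 show up as leaked.
    print(check_consistency({0: 1, 1: 1, 2: 1, 3: 1}, {0, 1}))
```

If only the mapping had been lost, the old data clusters would be reported as leaked; an image where both views agree looks, from the metadata alone, like one that never held those allocations.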

The reason why libvirt denies live migration without cache=none is because the
destination host could have stale data in its kernel page cache, so it needs to
be bypassed to get the current data.
Comment 14 Vali Dragnuta 2013-11-20 05:41:09 EST
Maybe the reference count you are talking about got re-initialized at the first rewrite of the first block of data? Initially I was not sure what had happened; it looked to me like (just) a corrupt partition table, and I recreated the partition table the same way I remembered creating it in the first place. Could the rewrite of the first sector also have set the reference count to a correct one?
Either way, I have two images with the same issue; something must have caused it.
I dumped the strings from one of the images previously provided to you. I can clearly see log messages and configuration files from the software that was previously installed there; it is clear that the image did contain data.

Back to the cache issue, you are saying:

- cache = none may lead to data loss between the guest and its storage, so it should not be used;
- cache = anything other than none would prevent libvirt guest migration from one host to another.
From the two above one can only assume that there is actually no safe way to use the official, supported/recommended virtualization method and at the same time have guest migration. I think proper documentation of the available cache modes, the issues with each, and the supported/recommended configurations should be created.

It would also be quite improbable for the second (target) host to have its cache already primed with data from images opened by the source host (except in the very special case of two guests opening the same image at the same time, which is usually not the case).
Even so, as I have already mentioned, the corruption is not on the path between the guest and its storage but on the path between the hypervisor and the host storage.

Anyway, I will try to watch these systems closely; maybe I can find something else of relevance. Meanwhile, I will probably move all guests from qcow images to sparse raw files; at least with those there is no block mapping that can get trashed. (OK, except the mapping in the backing filesystem itself, but I have no issue with that.) Also, I'm not sure if this is the right place to suggest this, but I'll go ahead and suggest that future qcow versions get a more robust implementation, possibly with mappings (optionally) kept in an external file, possibly holding a few copies or a few versions, and/or some kind of journal for operations made on the qcow file...
As far as I'm concerned, this is still an issue, and I will definitely come back here with more information if I find anything else.
Thank you for your time, I still hope we'll be able to sort this out somehow.
Comment 15 Kevin Wolf 2013-11-20 06:11:35 EST
(In reply to Vali Dragnuta from comment #14)
> Maybe the reference count you are talking about got re-initialized at the
> first rewrite of the first block of data ? This because initially I was not
> sure what happened, it looked to me like (just) a corrupt partition table,
> and I recreated the partition table the same way I remembered i created it
> in the first place. The rewrite of the first sector could have also set the
> reference count to a correct one ?

If your theory is that the refcounts and the mappings got just zeroed out at
some point for some unknown reason: It's not only data clusters that have the
correct refcount of 1, but also metadata clusters like the image header. Their
refcount is only set during image creation, so it's unlikely that it got zeroed
out first and then recreated.

> - cache = none may lead to data loss between the guest and its storage, so
> it should not be used;

No, I'm not saying it shouldn't be used. I'm saying you need to configure your
guest correctly to issue flushes (ext4 does this by default).

It's the same thing as with physical hard disks with a volatile write cache. I
wouldn't say they shouldn't be used, but you need to flush the cache if you
want to be sure that the data persists even in case of a power failure.

> It would also be quite improbable to have the second (target) host with
> cache already primed with data from the images already opened by the source
> host (except the very special cases of two guests opening the same image at
> the same time, which is usually not the case).

That's unfortunately not true with the way qemu works. It first opens the image
file on the destination, creates all devices and only then the actual migration
starts. Between this point and when the migration finishes, the source will
write more data to the image which is already open on the destination. A stale
cache is a very real problem there.

Also, even without this behaviour, imagine migrating from A to B, and later
back to A.
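
The stale-cache scenario can be sketched with a toy model of two hosts sharing one storage backend. It deliberately ignores how a real NFS client revalidates its cache and only assumes that a host with a page cache may serve a block it has read before without asking the storage again. All class and function names are hypothetical.

```python
class Host:
    """Toy host: optionally keeps a private page cache in front of shared storage."""

    def __init__(self, storage: dict, use_page_cache: bool):
        self.storage = storage
        self.use_page_cache = use_page_cache
        self.page_cache = {}

    def read(self, block: int) -> bytes:
        if self.use_page_cache and block in self.page_cache:
            return self.page_cache[block]  # possibly stale
        data = self.storage.get(block, b"\0")
        if self.use_page_cache:
            self.page_cache[block] = data
        return data

    def write(self, block: int, data: bytes) -> None:
        self.storage[block] = data
        if self.use_page_cache:
            self.page_cache[block] = data


def migrate_demo(destination_uses_page_cache: bool) -> bytes:
    storage = {}
    source = Host(storage, use_page_cache=False)
    destination = Host(storage, use_page_cache=destination_uses_page_cache)
    source.write(0, b"old")
    destination.read(0)        # destination opens the image before migration starts
    source.write(0, b"new")    # source keeps writing while migration is running
    return destination.read(0) # what the guest sees after switching hosts


if __name__ == "__main__":
    print(migrate_demo(True), migrate_demo(False))
```

In this model the destination returns the old block when it uses a page cache and the current one when it bypasses it, which is the situation the libvirt check is guarding against.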

> Also, I'm not sure if
> this is the right place to suggest this, but I'll go on and suggest that for
> the future qcow versions a more robust implementation should be implemented.
> Eventually with mappings (optionally) in an external files, eventually
> holding a few copies or a few versions, and/or some kind of journal for
> operations made on the qcow file...

If you'd like to contribute such code, feel free to start a discussion on the
qemu-devel mailing list.

> As far as I'm concerned, this is still an issue, and I will definitely come
> back here with more information if I find anything else.
> Thank you for your time, I still hope we'll be able to sort this out somehow.

I hope you'll find more information, or even a consistent reproducer. I am
interested in fixing this, but I simply don't have enough information in this
report yet. And as Ademar already said in comment 2, bugs from the community
can't always take my priority, so I don't have a lot of time to invest for
trying to reproduce it myself.
Comment 16 Vali Dragnuta 2013-11-20 10:34:38 EST
Hello,
Yes, I'm very aware of this being a "community bug" and of its low priority; I understand and have no problem with this :).

BTW, how do you explain the actual data that can be retrieved from the corrupt image? There are actual full logs from an application server there...
The strings are also consistent with the hostname of the guest and with actual log files that were produced on that machine. Something DID write legitimate data there, but all references to that data got lost, whether the refcounts are valid or not :) This is the most annoying fact: I can clearly identify pieces of data that were there, yet what you say is that the image is consistent with a just-created image. I do not doubt that what you say is true; it's just that there is a contradiction between what the current metadata suggests and what the pieces of leftover raw data suggest. And this holds true for both images.
Anyway, if I find something else, I'll come back. Even if it's just a new mangled image :)
Comment 17 Kevin Wolf 2013-11-20 10:51:01 EST
(In reply to Vali Dragnuta from comment #16)
> BTW, how do you explain the actual data that can be retrieved from the
> corrupt image ?

The answer is surprisingly simple: I don't. :-)

This is in fact the puzzling part of the bug: This looks exactly as if it had
crashed with metadata caches not written out (the thing with cache=none we were
discussing above). But you're using a recent OS that does send flushes, and you
even said that you did quit qemu after the installation and restarted it later,
which also flushes the cache. So this is completely ruled out as the real
cause.

The metadata must have been present in the file at the point that you shut down
qemu, or starting it again wouldn't have worked. So something must have caused
this metadata to disappear.

At the moment I don't really have a theory for this.

Or let me ask another stupid question: You never accessed this image file from
more than one process (be it qemu, qemu-img or something else) at the same
time, did you?
Comment 18 Vali Dragnuta 2013-11-20 11:39:25 EST
I thought about that, too, and no, I did not access this from another process :(
I will keep watching; maybe it happens again.
As for the metadata not being flushed to the disk, this is why I was suggesting keeping at least a copy of the previous valid metadata. I don't think it would even have such a big performance impact, and we would always have something valid to come back to in case of emergency. This would also be good for cases with otherwise known causes of corruption (e.g. bad blocks). Anyway, let's see what happens next.
Comment 19 Ademar Reis 2013-11-25 13:24:06 EST
Vali: Given we can't reproduce it or find what's wrong, I'm closing this bug as part of our triaging efforts. If you find a way to reproduce it or have more details, please reopen. Thanks.
