Bug 1700189 - Unsafe creation of volumes for snapshots can cause outages and data corruption
Summary: Unsafe creation of volumes for snapshots can cause outages and data corruption
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.3.5
Target Release: ---
Assignee: Eyal Shenitzky
QA Contact: Elad
URL:
Whiteboard:
Depends On: 1700623
Blocks: gss_rhv_4_3_4
 
Reported: 2019-04-16 05:57 UTC by Germano Veit Michel
Modified: 2020-08-03 15:16 UTC
CC: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-02 10:30:58 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-




Links
Red Hat Knowledge Base (Solution) 4065541 (last updated 2019-04-16 06:04:35 UTC)

Description Germano Veit Michel 2019-04-16 05:57:04 UTC
Description of problem:

We know the SIZE key in a volume's metadata can be wrong for a few reasons, such as past bugs like BZ1308375 and BZ1201268, and probably others I am not aware of.

IIUC, since the change below [1], vdsm no longer activates the backing volume to check the actual virtual size reported by qemu-img. Instead it relies on the size that comes from the volume metadata [2].

This is not safe, because if the size from the metadata is wrong, the new leaf volume will have the incorrect size. I'm attaching to this bug a case where the VM saw its disk go from 350GB to 150GB after a snapshot creation, causing an outage and apparent data corruption.

Since we have customers upgrading from older versions, which may have a bunch of volumes with an incorrect size, this is very dangerous and may cause more outages and corruption for more customers. Please evaluate reverting this.
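
To make the mechanism concrete, here is a minimal sketch (the paths and the 350G/150G figures are only illustrative, not taken from the attached case): with -u, qemu-img never opens the backing file, so whatever size the caller passes becomes the virtual size of the new leaf.

~~~
import json
import subprocess

def virtual_size(path):
    # qemu-img info reports the virtual size stored in the qcow2 header.
    out = subprocess.check_output(
        ["qemu-img", "info", "--output", "json", path])
    return json.loads(out)["virtual-size"]

# The base volume really is 350G...
subprocess.check_call(
    ["qemu-img", "create", "-f", "qcow2", "base.qcow2", "350G"])

# ...but if the SIZE key in the volume metadata wrongly says 150G, the
# unsafe create takes that value at face value, because -u skips opening
# the backing file entirely.
subprocess.check_call(
    ["qemu-img", "create", "-u", "-f", "qcow2",
     "-o", "backing_file=base.qcow2,backing_fmt=qcow2",
     "top.qcow2", "150G"])

print(virtual_size("base.qcow2"))  # 375809638400 (350G)
print(virtual_size("top.qcow2"))   # 161061273600 (150G) -> the guest now
                                   # sees a 150G disk
~~~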

Version-Release number of selected component (if applicable):
vdsm-4.20.39.1-1.el7ev.x86_64

Actual results:
VM sees its disk change size, causing IO errors, pausing and corruption.

Expected results:
Do more checks, activate the backing volume and check the actual capacity of the volume.

Additional info:

[1] 
~~~
commit 7222af8805b4a642c288971edb102cf82103938c

    create-snapshot: Use unsafe option when creating snapshots
    
    With new qemu-img support for unsafe operations (added in qemu-img
    2.10.0), and as the VM may be running while creating snapshots,
    we have to use new qemu-img support for unsafe operations. When
    using the new unsafe support, qemu-img doesn't access the backing file,
    hence we have to provide the size.
    
    Providing the size of the volume will save preparing and tearing down
    the volume, thus improving the performance of snapshot creation.
    
    However, when working with qemu-img < 2.10, we do have to prepare and
    tear down the backing volume despite the fact that we add the size,
    because it is NOT documented that qemu-img will not access the backing
    chain.
    
    Change-Id: If7525e5f3d9e18eaf20aefd4ce25e8dcc48ee675
~~~

[2]

    def clone(self, dstPath, volFormat, size_blk):
        """
        Clone self volume to the specified dst_image_dir/dst_volUUID
        """
        wasleaf = False
        taskName = "parent volume rollback: " + self.volUUID
        vars.task.pushRecovery(
            task.Recovery(taskName, "volume", "Volume",
                          "parentVolumeRollback",
                          [self.sdUUID, self.imgUUID, self.volUUID]))
        if self.isLeaf():
            wasleaf = True
            self.setInternal()
        try:
            self.log.debug('cloning volume %s to %s', self.volumePath,
                           dstPath)
            parent = getBackingVolumePath(self.imgUUID, self.volUUID)
            domain = sdCache.produce(self.sdUUID)
            # Using unsafe=True in order to create volumes when the backing
            # chain isn't available. In this case qemu-img cannot get the size,
            # hence we have to provide it.
            operation = qemuimg.create(
                dstPath,
                size=size_blk * sc.BLOCK_SIZE,
                backing=parent,
                format=sc.fmt2str(volFormat),
                qcow2Compat=domain.qcow2_compat(),
                backingFormat=sc.fmt2str(self.getFormat()),
                unsafe=True)
            operation.run()

Comment 2 Tal Nisan 2019-04-16 16:08:36 UTC
Nir / Eyal, review

Comment 3 Nir Soffer 2019-04-16 17:20:37 UTC
(In reply to Germano Veit Michel from comment #0)
> Description of problem:
> 
> We know the SIZE key in a volume metadata can be wrong for a few reasons,
> like past bugs, like what we saw in BZ1308375, 

I don't think this is related; that bug is about the actual size, not the virtual size.

> BZ1201268 

This is also probably not related; the fix mentioned addresses extending a thin block
volume, which does not change the virtual size.

> and probably others
> I am not aware of.

I think we don't have any evidence yet that the SIZE key contains an incorrect size.

There is one case when the volume size is not available: during image resize.
We set the size to 0 before the resize, and set it back to the correct
size when qemu finishes the resize. If the operation fails, the size will be
corrected the next time we try to access the image.
 
> IIUC, since the change below[1] vdsm does not activate the backing volume
> anymore to check the actual virtual size reported by qemu-img. Instead it
> relies on the size that comes from the volume metadata[2].

True, but this info must be correct, as it is used for many operations on images.

The qcow2 virtual size is controlled by RHV. RHV manages the virtual size in
the engine database and on storage, and performs resize operations either via
libvirt (when the VM is running) or with qemu-img resize. Is it possible that
a user or support person resized the image behind RHV's back?

> This is not safe, because if the size from the metadata is wrong, the new
> leaf volume will have the incorrect size. I'm attaching to this bug a case
> where the VM saw its disk go from 350GB to 150GB after a snapshot creation,
> causing an outage and apparent data corruption.

Unfortunately qemu started to take locks on images, and creating a snapshot
on a running VM cannot succeed because qemu-img will fail to lock the image
locked by qemu. So we must use unsafe mode when creating a snapshot on
a running VM.

If the VM is not running, we can use safe mode, but activating and deactivating
the entire chain is a performance issue when there are a lot of snapshots.

> Since we have customers upgrading from older versions which may have a bunch
> of volumes with incorrect size this is very dangerous and may cause more
> outages and corruption for more customers. Please evaluate reverting this.

If you think we can have old disks with incorrect metadata, we need to fix
the metadata. This can be done with a tool that checks all disks and warns
about or repairs incorrect metadata.
 
> Version-Release number of selected component (if applicable):
> vdsm-4.20.39.1-1.el7ev.x86_64
> 
> Actual results:
> VM sees its disk change size, causing IO errors, pausing and corruption.
> 
> Expected results:
> Do more checks, activate the backing volume and check the actual capacity of
> the volume.

As explained, this is not possible in the most interesting cases, unless qemu
provides a way to access image info with qemu-img create while a VM is running.

I think the issue is not the unsafe qemu-img operation, but the wrong size in the
image metadata. Do we have info on why the size was incorrect?

Kevin, do we have a way to create an external snapshot on a running VM, without
tripping on qemu locks?

Example setup:

    VM running with /path/to/disk
    Creating external snapshot disk:
     
        qemu-img create -f qcow2 -b /path/to/disk /path/to/snapshot

What we do today is:

        qemu-img create -u -f qcow2 -b /path/to/disk /path/to/snapshot size
 
Regardless of having a way to create a snapshot without unsafe mode, we need
to make sure that RHV metadata is correct, because this will break other flows.

I see two ways to attack this:

- When preparing an image, validate the image metadata
  (e.g. SIZE/CAP matches the qcow2 virtual size).
  If the image capacity does not match the qcow2 metadata, the image must not be used.
  In this case both the engine database and the RHV metadata on storage may need a fix.

- Add a tool to check and repair volume metadata, similar to the tool for
  checking and repairing leases.
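
For illustration, a minimal sketch of the first approach, checking at prepare time that the capacity recorded in the volume metadata matches the qcow2 header. How the SIZE value is read from the metadata is out of scope here and left as an argument; the 512-byte block unit is taken from the clone() snippet above (size_blk * sc.BLOCK_SIZE), not from a vdsm API.

~~~
import json
import subprocess

def qcow2_virtual_size(vol_path):
    # -U (--force-share) lets qemu-img read an image that may already be
    # locked by a running VM.
    out = subprocess.check_output(
        ["qemu-img", "info", "-U", "--output", "json", vol_path])
    return json.loads(out)["virtual-size"]

def validate_capacity(vol_path, metadata_size_blk, block_size=512):
    # metadata_size_blk is the SIZE key from the RHV volume metadata, in
    # 512-byte blocks; obtaining it is not shown in this sketch.
    expected = metadata_size_blk * block_size
    actual = qcow2_virtual_size(vol_path)
    if actual != expected:
        raise RuntimeError(
            "capacity mismatch for %s: metadata says %d bytes, "
            "qcow2 header says %d bytes" % (vol_path, expected, actual))
~~~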


Comment 4 Kevin Wolf 2019-04-16 19:12:33 UTC
(In reply to Nir Soffer from comment #3)
> Kevin, do we have a way to create an external snapshot on a running VM,
> without tripping on qemu locks?

Unless I'm missing something, qemu-img can't be convinced to automatically probe the size when that would require ignoring locks. The assumption is that management tools know the size of their images (and indeed, you have the supposed image size stored).

You could of course just query the image size (with QMP query-block or qemu-img info -U) right before creating the snapshot, because you know that the image size isn't going to change. But as you say, inaccurate tracking of the image size will be a problem in other contexts too, so I think you need to get the metadata fixed anyway; not using it would just be papering over the real problem.
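
A rough sketch of that query-then-create approach, assuming it is acceptable to read the size from the image right before the create (paths are illustrative and the backing format is assumed to be qcow2; this is not the vdsm implementation):

~~~
import json
import subprocess

def live_virtual_size(disk_path):
    # -U (--force-share) lets qemu-img info read the image even while a
    # running VM holds the qemu image lock.
    out = subprocess.check_output(
        ["qemu-img", "info", "-U", "--output", "json", disk_path])
    return json.loads(out)["virtual-size"]

def create_external_snapshot(disk_path, snapshot_path):
    # Still an unsafe create, but the size comes from the image itself
    # rather than from possibly stale RHV metadata.
    size = live_virtual_size(disk_path)
    subprocess.check_call(
        ["qemu-img", "create", "-u", "-f", "qcow2",
         "-o", "backing_file=%s,backing_fmt=qcow2" % disk_path,
         snapshot_path, str(size)])
~~~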

Comment 5 Germano Veit Michel 2019-04-16 22:56:05 UTC
Hi Nir,

Thanks for the quick response.

(In reply to Nir Soffer from comment #3)
> (In reply to Germano Veit Michel from comment #0)
> > Description of problem:
> > 
> > We know the SIZE key in a volume metadata can be wrong for a few reasons,
> > like past bugs, like what we saw in BZ1308375, 
> 
> I don't think this is related, this bug is about actual size, not virtual
> size.

Why not?

1. Preallocated/raw base
2. Create snapshot
3. Delete snapshot, base extended
4. Create snapshot (4.20.14 and lower), qcow2 virtual size is SIZE+extension (qemu-img reads the size from the backing file in older versions, no?)
5. Cycle through 2-3 a few times, and you get a huge difference between them.
6. Upgrade to 4.20.15 and create snapshot: size is reduced.

But you are right, there are probably more problems related to having a wrong SIZE in the metadata. I'll try to find out in some cases why this happened or is happening; it is indeed the actual root cause. If SIZE were always right, this would not be a problem. The problem is that by the time we detect this and ask for logs, they are most likely already rotated.

> I think we don't have any evidence yet that the SIZE key contains an incorrect size.

The truth is we deal with a significant number of customer tickets where SIZE is wrong; there are KCS articles and tickets everywhere. We see this mostly when cloning/importing/moving VMs to block storage. I currently have a case with 20+ VMs (on 3.6) that won't clone due to a wrong SIZE in the metadata. IMHO the metadata SIZE cannot be trusted; I think it's a widespread problem, especially on older versions.

There is another case being worked on (not by me, Nijin was on it) where another VM apparently met the same fate as the one I opened this BZ for. So maybe we already have 2+ tickets with this.

> qcow2 virtual size is controlled by RHV. RHV managed the virtual size in 
> engine database and on storage, and perform resize operation either via
> libvirt (when vm is running) or with qemu-img resize. It is possible that
> a user or support person resized the image behind RHV back?

In that case the size in the DB was 350GB; the customer extended the disk via RHV. It does not look like the extension bug above, but something else.
It is possible that sometimes support does something wrong, but we have had too many cases with a wrong SIZE for this to be human error.

> Unfortunately qemu started to take locks on images, and creating a snapshot
> on a running VM cannot succeed because qemu-img will fail to lock the image
> locked by qemu. So we must use unsafe mode when creating a snapshot on
> a running VM.
> 
> If the VM is not running, we can use safe mode, but activating and deactivating
> the entire chain is a performance issue when there are a lot of snapshots.

Understood, makes sense.

> If you think we can have old disk with incorrect metadata, we need to fix
> the metadata. This can be done with a tool that check all disks and warn or
> repair incorrect metadata.
>
> Regardless of having a way to create a snapshot without unsafe mode, we need
> to make sure that RHV metadata is correct, because this will break other
> flows.

Totally agreed, but how do we do this before the problem happens? The problem is the severity: it's corruption and downtime. A wrong size in the metadata in other flows does not cause severe problems: the VM just fails to clone/import/LSM, we fix the size and all is good. But not in this one, especially if the customer starts clicking around to delete the new snapshot, etc.

This is why a wrong metadata size is much more severe here than in other places, and why I raised this bug. It does not seem to be safe.

> I see two ways to attack this:
> 
> - When preparing an image, validate image metadata
>   (e.g. SIZE/CAP matches qcow2 virtual size).
>   If the image capacity does not match qcow metadata, image must not be used.
>   In this case both engine database and RHV on storage metadata may need a
> fix.

Yes, this would be a good solution. Just aborting is fine, support can deal with fixing.

> 
> - Add a tool to check and repair volumes metadata, similar to the tool for 
>   checking and repairing leases.

I'm not sure about a tool. It would help, but most customers will not run it proactively, and corruption/outages will still happen.
It would be very helpful for fixing the problem, but I'm afraid it won't prevent it in many cases.
After my PTO I can help write this.

Comment 6 Germano Veit Michel 2019-04-17 00:10:22 UTC
Here we go: https://bugzilla.redhat.com/show_bug.cgi?id=1700623

Comment 8 Germano Veit Michel 2019-04-30 06:17:19 UTC
Isn't https://gerrit.ovirt.org/#/c/99541/ going to help prevent this? Shouldn't it be attached to this BZ too?

Comment 10 Nir Soffer 2019-05-05 09:49:44 UTC
(In reply to Germano Veit Michel from comment #8)
> Isn't https://gerrit.ovirt.org/#/c/99541/ going to help prevent this?
> Shouldn't it be attached to this BZ too?

I think we don't have a way to avoid unsafe creation of qcow2 volumes with
a parent, see comment 4.

This bug will be resolved when bug 1700623 (which is the root cause) is
resolved.

Comment 12 Germano Veit Michel 2019-05-14 01:28:44 UTC
Hi Nir,

I'm back from leave now and willing to write the tool to check and repair the volumes.

(In reply to Nir Soffer from comment #3)
> I see two ways to attack this:
> 
> - When preparing an image, validate image metadata
>   (e.g. SIZE/CAP matches qcow2 virtual size).
>   If the image capacity does not match qcow metadata, image must not be used.
>   In this case both engine database and RHV on storage metadata may need a
> fix.
> 
> - Add a tool to check and repair volumes metadata, similar to the tool for 
>   checking and repairing leases.

I started writing the tool, which is rather simple. 

However, the tool needs to use Image.prepare before doing the Volume.getQemuImageInfo to retrieve the real size from the qcow2 header. But https://gerrit.ovirt.org/#/c/99793/ should fix the metadata on Image.prepare without raising an exception (different from your first point), so it is a bit confusing what the tool should do.

I think the tool would:
1) Retrieve the metadata size of all volumes first
2) Prepare all volumes and get qemu size (which would fix the metadata)
3) Print a list of all the volumes that had a size mismatch (which should already be fixed by this point)

What do you think?

Thanks

Comment 13 Nir Soffer 2019-05-14 15:07:06 UTC
(In reply to Germano Veit Michel from comment #12)

Germano, let's not use this bug, since the root cause is bug 1700623.

> I started writing the tool, which is rather simple. 
> 
> However, the tool needs to use Image.prepare before doing the
> Volume.getQemuImageInfo to retrieve the real size from the qcow2 header. But
> https://gerrit.ovirt.org/#/c/99793/ should fix the metadata on Image.prepare
> without raising exception (different from your first point), so this is a
> bit confusing regarding what the tool should do.
> 
> I think the tool would:
> 1) Retrieve the metadata size of all volumes first
> 2) Prepare all volumes and get qemu size (which would fix the metadata)
> 3) Print a list of all the volumes that have size mismatch (and should be
> already fixed by this point)
> 
> What do you think?

A system including the patches does not need any tool; volumes will be fixed
automatically when you start a VM or when you prepare a disk.

A tool for fixing volumes is needed only for systems without the fix (e.g. 4.2.8),
and in that case preparing a volume will not fix anything.

I think we can start with a dump-volume-chain call, getting a JSON or SQL dump with
all volumes.

Then we can find the top volume in every chain and:

1. Check if the image is prepared for a VM or by a storage operation.
   I don't know a good way to do this; this is something the engine manages.

2. If the image is not used, prepare it with Image.prepare

3. Get the backing chain info with all the info about the disk

    qemu-img info -U --backing-chain --output json /path/to/volume

4. If the image was not used, tear it down with Image.teardown

With this you have all the info to report which volumes' metadata is not in
sync with the qcow2 metadata (qcow2 image) or the actual file size (raw image).
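
A minimal sketch of step 3 plus the final comparison. The prepare/teardown handling of steps 1, 2 and 4 is left out, and metadata_sizes is a hypothetical input collected elsewhere from the RHV volume metadata.

~~~
import json
import subprocess

def chain_virtual_sizes(top_volume_path):
    # Step 3: a single qemu-img call returns info for every image in the
    # backing chain; -U is needed because the chain may be in use by a VM.
    out = subprocess.check_output(
        ["qemu-img", "info", "-U", "--backing-chain", "--output", "json",
         top_volume_path])
    # The JSON output is a list with one entry per image in the chain.
    return {img["filename"]: img["virtual-size"] for img in json.loads(out)}

def report_mismatches(top_volume_path, metadata_sizes):
    # metadata_sizes: {volume_path: capacity_in_bytes} taken from the RHV
    # volume metadata (how it is collected is not shown here).
    for path, qemu_size in chain_virtual_sizes(top_volume_path).items():
        md_size = metadata_sizes.get(path)
        if md_size is not None and md_size != qemu_size:
            print("MISMATCH %s: metadata=%d qemu=%d"
                  % (path, md_size, qemu_size))
~~~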

Comment 14 Eyal Shenitzky 2019-05-26 07:13:25 UTC
Is this bug still needed, now that bug 1700623 is MODIFIED?

Comment 15 Germano Veit Michel 2019-05-26 22:26:05 UTC
(In reply to Eyal Shenitzky from comment #14)
> Is this bug still needed now when bug 1700623 is modified?

I don't think so, since Nir and Kevin explained there isn't much that can be done to make the volume creation safer.

