Bug 2018947 - vm: do not ignore errors when syncing volume chain
Summary: vm: do not ignore errors when syncing volume chain
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.40.90.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.5.0
Target Release: 4.50.0.2
Assignee: Roman Bednář
QA Contact: Shir Fishbain
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-11-01 10:52 UTC by Roman Bednář
Modified: 2022-05-03 06:46 UTC
CC List: 7 users

Fixed In Version: vdsm-4.50.0.2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-03 06:46:19 UTC
oVirt Team: Storage
Embargoed:
ahadas: ovirt-4.5?
ahadas: devel_ack+


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-43889 0 None None None 2021-11-01 10:52:52 UTC
oVirt gerrit 117344 0 master MERGED vm: add imageSyncVolumeChain() wrapper 2021-11-02 14:36:05 UTC
oVirt gerrit 117421 0 master MERGED livemerge: do not ignore errors when syncing volume chain 2021-11-02 14:36:07 UTC
oVirt gerrit 117424 0 ovirt-4.4.z ABANDONED vm: add imageSyncVolumeChain() wrapper 2021-12-20 00:53:45 UTC
oVirt gerrit 117425 0 ovirt-4.4.z ABANDONED livemerge: do not ignore errors when syncing volume chain 2021-12-20 00:53:51 UTC

Description Roman Bednář 2021-11-01 10:52:00 UTC
HSM.imageSyncVolumeChain() currently does *not* raise on errors, because the storage
dispatcher only returns a status dict and callers must check result["status"]["code"]
to detect errors.

VM.sync_volume_chain() handles this correctly and can serve as an example.
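
For reference, a minimal self-contained sketch of the status-dict convention described
above (illustrative only, not vdsm code; the error code and message are made up): the
dispatcher-style call returns a dict, so the caller must inspect result["status"]["code"]
itself instead of relying on an exception.

    # Illustrative sketch of the status-dict convention (not vdsm code).
    def check_status(result):
        """Raise if a dispatcher-style result dict reports an error."""
        status = result["status"]
        if status["code"] != 0:
            raise RuntimeError(
                "Operation failed: code=%d message=%r"
                % (status["code"], status["message"]))

    # Example results mimicking a dispatcher return value:
    ok = {"status": {"code": 0, "message": "Done"}}
    failed = {"status": {"code": 100, "message": "General Exception"}}

    check_status(ok)            # returns silently
    try:
        check_status(failed)    # raises
    except RuntimeError as e:
        print(e)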

If imageSyncVolumeChain() fails before a pivot, we ignore the error and pivot
*without* marking the leaf as illegal. This may lead to data corruption if the 
VM is restarted after the pivot with the old leaf.

This should most likely be fixed in the VM class by adding a wrapper similar to
VM.setVolumeSize(), instead of checking res["status"]["code"] everywhere.
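
Below is a hedged sketch of what such a wrapper could look like, assuming a VM-level
method analogous to VM.setVolumeSize(). All names here (the Vm class shape,
StorageUnavailableError, the irs attribute, FakeIRS) are illustrative and do not
necessarily match the actual vdsm patch linked in the gerrit changes above.

    # Hypothetical sketch only - not the actual vdsm patch. The idea: wrap
    # the dispatcher call once in the VM class so that a non-zero status
    # code becomes an exception, and live merge cannot silently continue
    # to pivot after a failed volume chain sync.

    class StorageUnavailableError(Exception):
        pass

    class Vm:
        def __init__(self, irs):
            self.irs = irs  # storage dispatcher proxy (assumed interface)

        def imageSyncVolumeChain(self, sdUUID, imgUUID, volUUID, newChain):
            res = self.irs.imageSyncVolumeChain(
                sdUUID, imgUUID, volUUID, newChain)
            if res["status"]["code"] != 0:
                raise StorageUnavailableError(
                    "Unable to sync volume chain for image %s: %s"
                    % (imgUUID, res["status"]["message"]))
            return res

    # Usage with a fake dispatcher that reports a failure:
    class FakeIRS:
        def imageSyncVolumeChain(self, *args):
            return {"status": {"code": 100, "message": "General Exception"}}

    try:
        Vm(FakeIRS()).imageSyncVolumeChain("sd", "img", "vol", ["vol"])
    except StorageUnavailableError as e:
        print(e)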

Comment 1 Nir Soffer 2021-11-01 15:57:31 UTC
This issue can cause a volume to remain LEGAL after successful pivot.

The events that may lead to this issue:
1. Calling imageSyncVolumeChain fails
2. The pivot succeeds
3. Vdsm stops or crashes before calling imageSyncVolumeChain again.

If this VM is started again, it may start from the old top volume, which
will corrupt the disk.

This is very unlikely but the results are bad.

Comment 2 Nir Soffer 2021-11-02 14:27:15 UTC
Fixing version - this was found when working on 4.4.9 code.

Proposing for 4.4.10.

Comment 3 Avihai 2021-11-22 12:51:21 UTC
(In reply to Nir Soffer from comment #2)
> Fixing version - this was found when working on 4.4.9 code.
> 
> Proposing for 4.4.10.

Hi Nir, 

Can you provide a clear verification scenario? (As I read your previous comments it's very unlikely, so if regression testing is enough, so be it.)

BTW, I'm not sure how this is 4.4.10 material if it's hardly reproducible and does not have a customer ticket.
As this can cause data corruption I understand the need to try to fix it, but if it's that hard to reproduce I'm still not sure it's worth it.

QE has very limited capacity for 4.4.10, and bugs that are neither main-flow nor customer-reported should not be pushed here.

Comment 4 Nir Soffer 2021-11-22 13:11:42 UTC
(In reply to Avihai from comment #3)
...
> Can you provide a clear verification scenario? (As I read your previous
> comments it's very unlikely, so if regression testing is enough, so be it.)

There is no practical way to reproduce this issue. The only thing to test
is to run the regular regression tests to make sure that the fix does not
cause any regressions.
 
> BTW, Not sure how this is 4.4.10 material if it's hardly reproducible and
> does not have a customer ticket.

If this fails in a real system, it can cause data corruption. This kind
of fix should be delivered to users before they are hit by the issue.

A customer bug means we already failed and a customer hit the issue.

Comment 6 Arik 2021-12-05 14:32:03 UTC
Yeah, if we are able to verify that the common flows keep working (using regression tests), and given that the negative flow could have resulted in data corruption, I think it is worth taking the risk of backporting the changes (looking at the patches, they are fairly simple).
Roman, were you able to simulate it in an artificial way and see that the system reacts well to these changes?

Comment 8 Roman Bednář 2021-12-14 11:37:41 UTC
As mentioned in previous comments, this is difficult to reproduce. I tested the patches on my system, just doing a "sanity check" verifying that the livemerge flow works. But I think adequate regression testing should be performed by the QE team to increase confidence.

Comment 9 Arik 2021-12-14 16:43:58 UTC
(In reply to Roman Bednář from comment #8)
> As mentioned in previous comments, this is difficult to reproduce. I tested
> the patches on my system, just doing a "sanity check" verifying that the
> livemerge flow works. But I think adequate regression testing should be
> performed by the QE team to increase confidence.

Ack, thanks
Since we didn't reproduce the flow in which errors arise during the synchronization of a volume chain in order to exercise the new code,
let's take a more conservative approach here and push it out to 4.5.

Comment 12 Shir Fishbain 2022-04-30 19:47:36 UTC
Verified - after reviewing the tier1, tier2, and tier3 runs, the fix doesn't cause regressions related to LSM or snapshots.

Versions:
vdsm-4.50.0.13-1.el8ev.x86_64
ovirt-engine-4.5.0.4-0.1.el8ev.noarch

Comment 13 Sandro Bonazzola 2022-05-03 06:46:19 UTC
This bug is included in the oVirt 4.5.0 release, published on April 20th 2022.

Since the problem described in this bug report should be resolved in the oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

