Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2018947

Summary: vm: do not ignore errors when syncing volume chain
Product: [oVirt] vdsm
Reporter: Roman Bednář <rbednar>
Component: Core
Assignee: Roman Bednář <rbednar>
Status: CLOSED CURRENTRELEASE
QA Contact: Shir Fishbain <sfishbai>
Severity: high
Priority: high
Docs Contact:
Version: 4.40.90.4
CC: aefrat, ahadas, bugs, lsurette, nsoffer, srevivo, ycui
Target Milestone: ovirt-4.5.0
Keywords: ZStream
Target Release: 4.50.0.2
Flags: ahadas: ovirt-4.5?, ahadas: devel_ack+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: vdsm-4.50.0.2
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-03 06:46:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Roman Bednář 2021-11-01 10:52:00 UTC
HSM.imageSyncVolumeChain() currently does *not* raise on errors, because the storage
dispatcher only returns a status dict; callers must check result["status"]["code"]
to detect errors.

VM.sync_volume_chain() handles this correctly and can serve as an example.

If imageSyncVolumeChain() fails before a pivot, we ignore the error and pivot
*without* marking the leaf as illegal. This may lead to data corruption if the 
VM is restarted after the pivot with the old leaf.

This should most likely be fixed in the VM class by adding a wrapper similar to
VM.setVolumeSize(), instead of checking res["status"]["code"] everywhere.
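
For illustration, a minimal sketch of such a wrapper is shown below. The helper
name sync_volume_chain_or_raise and the ImageSyncError exception are made up for
this example, and the dispatcher result is assumed to be the usual
{"status": {"code": ..., "message": ...}} dict; this is not the actual patch.

    class ImageSyncError(Exception):
        """Raised when a storage verb reports a non-zero status code."""


    def sync_volume_chain_or_raise(irs, sd_id, img_id, vol_id, actual_chain):
        # Call the verb through the dispatcher; it never raises, it only
        # returns a status dict.
        res = irs.imageSyncVolumeChain(sd_id, img_id, vol_id, actual_chain)
        code = res["status"]["code"]
        if code != 0:
            # Fail loudly instead of silently continuing to the pivot.
            raise ImageSyncError(
                "imageSyncVolumeChain failed: code=%s message=%s"
                % (code, res["status"]["message"]))
        return res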

Comment 1 Nir Soffer 2021-11-01 15:57:31 UTC
This issue can cause a volume to remain LEGAL after successful pivot.

The events that may lead to this issue:
1. Calling imageSyncVolumeChain fails.
2. The pivot succeeds.
3. Vdsm stops or crashes before calling imageSyncVolumeChain again.

If this VM is started again, it may start from the old top volume, which
will corrupt the disk.

This is very unlikely but the results are bad.
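
For illustration only, the ordering that the fix needs to guarantee might be
sketched like this, reusing the hypothetical sync_volume_chain_or_raise() helper
from the description; finalize_merge(), vm.pivot(), vm.cif.irs and the drive
attributes are treated as placeholders here, not the actual vdsm live-merge code.

    def finalize_merge(vm, drive, actual_chain):
        # Sync the chain *before* the pivot; if this raises we abort, so the
        # pivot in step 2 never happens after a failed step 1.
        sync_volume_chain_or_raise(vm.cif.irs, drive.domainID,
                                   drive.imageID, drive.volumeID, actual_chain)
        # Only after a successful sync is it safe to pivot to the new top volume.
        vm.pivot(drive)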

Comment 2 Nir Soffer 2021-11-02 14:27:15 UTC
Fixing version - this was found when working on 4.4.9 code.

Proposing for 4.4.10.

Comment 3 Avihai 2021-11-22 12:51:21 UTC
(In reply to Nir Soffer from comment #2)
> Fixing version - this was found when working on 4.4.9 code.
> 
> Proposing for 4.4.10.

Hi Nir, 

Can you provide a clear verification scenario? (As I read your previous comments it's very unlikely, so if regression testing is enough, so be it.)

BTW, not sure how this is 4.4.10 material if it's hardly reproducible and does not have a customer ticket.
As this can cause data corruption I understand the need to try to fix it, but if it's that hard to reproduce I'm still not sure it's worth it.

QE has very limited capacity for 4.4.10, and non-main-flow, non-customer bugs should not be pushed there.

Comment 4 Nir Soffer 2021-11-22 13:11:42 UTC
(In reply to Avihai from comment #3)
...
> Can you provide a clear verification scenario? (As I read your previous
> comments it's very unlikely, so if regression testing is enough, so be it.)

There is no practical way to reproduce this issue. The only thing to test
is to run the regular regression tests and make sure that the fix does not
cause any regressions.
 
> BTW, not sure how this is 4.4.10 material if it's hardly reproducible and
> does not have a customer ticket.

If this fails in a real system, it can cause data corruption. These kinds
of fixes should be delivered to users before they are hit by the issue.

A customer bug means we already failed and a customer hit the issue.

Comment 6 Arik 2021-12-05 14:32:03 UTC
Yeah, if we are able to verify that the common flows keep working (using regression tests), and given that the negative flow could have resulted in data corruption, I think it is worth taking the risk of backporting the changes (looking at the patches, they are fairly simple).
Roman, were you able to simulate it in an artificial way and see that the system reacts well to these changes?

Comment 8 Roman Bednář 2021-12-14 11:37:41 UTC
As mentioned in previous comments, this is difficult to reproduce. I tested the patches on my system, just doing a "sanity check" to verify that the live merge flow works. But I think adequate regression testing should be performed by the QE team to increase confidence.

Comment 9 Arik 2021-12-14 16:43:58 UTC
(In reply to Roman Bednář from comment #8)
> As mentioned in previous comments, this is difficult to reproduce. I tested
> the patches on my system, just doing a "sanity check" to verify that the
> live merge flow works. But I think adequate regression testing should be
> performed by the QE team to increase confidence.

Ack, thanks
So, since we didn't reproduce the flow in which errors arise during the synchronization of a volume chain (which would exercise the new code),
let's take a more conservative approach here and push it out to 4.5.

Comment 12 Shir Fishbain 2022-04-30 19:47:36 UTC
Verified - after reviewing the tier1, tier2, and tier3 runs, the fix doesn't cause regressions related to LSM or snapshots.

Versions:
vdsm-4.50.0.13-1.el8ev.x86_64
ovirt-engine-4.5.0.4-0.1.el8ev.noarch

Comment 13 Sandro Bonazzola 2022-05-03 06:46:19 UTC
This bugzilla is included in the oVirt 4.5.0 release, published on April 20th 2022.

Since the problem described in this bug report should be resolved in the oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.