Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2018947

Summary: vm: do not ignore errors when syncing volume chain
Product: [oVirt] vdsm
Reporter: Roman Bednář <rbednar>
Component: Core
Assignee: Roman Bednář <rbednar>
Status: CLOSED CURRENTRELEASE
QA Contact: Shir Fishbain <sfishbai>
Severity: high
Priority: high
Docs Contact:
Version: 4.40.90.4
CC: aefrat, ahadas, bugs, lsurette, nsoffer, srevivo, ycui
Target Milestone: ovirt-4.5.0
Keywords: ZStream
Target Release: 4.50.0.2
Flags: ahadas: ovirt-4.5?, ahadas: devel_ack+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: vdsm-4.50.0.2
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-03 06:46:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Roman Bednář 2021-11-01 10:52:00 UTC
HSM.imageSyncVolumeChain() currently does *not* raise on errors, because the storage
dispatcher only returns a status dict; callers must check result["status"]["code"]
to detect errors.

VM.sync_volume_chain() handles this correctly and can serve as an example.

If imageSyncVolumeChain() fails before a pivot, we ignore the error and pivot
*without* marking the leaf as illegal. This may lead to data corruption if the 
VM is restarted after the pivot with the old leaf.

This should most likely be fixed in the VM class by adding a wrapper similar to
VM.setVolumeSize(), instead of checking res["status"]["code"] everywhere.
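
For illustration, a minimal sketch of such a wrapper is shown below. The helper
name sync_volume_chain_or_raise and the ImageSyncError exception are made up for
this example, and the dispatcher result is assumed to be the usual
{"status": {"code": ..., "message": ...}} dict; this is not the actual patch.

    class ImageSyncError(Exception):
        """Raised when a storage verb reports a non-zero status code."""


    def sync_volume_chain_or_raise(irs, sd_id, img_id, vol_id, actual_chain):
        # Call the verb through the dispatcher; it never raises, it only
        # returns a status dict.
        res = irs.imageSyncVolumeChain(sd_id, img_id, vol_id, actual_chain)
        code = res["status"]["code"]
        if code != 0:
            # Fail loudly instead of silently continuing to the pivot.
            raise ImageSyncError(
                "imageSyncVolumeChain failed: code=%s message=%s"
                % (code, res["status"]["message"]))
        return res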

Comment 1 Nir Soffer 2021-11-01 15:57:31 UTC
This issue can cause a volume to remain LEGAL after successful pivot.

The events that may lead to this issue:
1. Calling imageSyncVolumeChain fails.
2. The pivot succeeds.
3. Vdsm stops or crashes before calling imageSyncVolumeChain again.

If this VM is started again, it may start from the old top volume, which
will corrupt the disk.

This is very unlikely but the results are bad.
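
For illustration only, the ordering that the fix needs to guarantee might be
sketched like this, reusing the hypothetical sync_volume_chain_or_raise() helper
from the description; finalize_merge(), vm.pivot(), vm.cif.irs and the drive
attributes are treated as placeholders here, not the actual vdsm live-merge code.

    def finalize_merge(vm, drive, actual_chain):
        # Sync the chain *before* the pivot; if this raises we abort, so the
        # pivot in step 2 never happens after a failed step 1.
        sync_volume_chain_or_raise(vm.cif.irs, drive.domainID,
                                   drive.imageID, drive.volumeID, actual_chain)
        # Only after a successful sync is it safe to pivot to the new top volume.
        vm.pivot(drive)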

Comment 2 Nir Soffer 2021-11-02 14:27:15 UTC
Fixing version - this was found when working on 4.4.9 code.

Proposing for 4.4.10.

Comment 3 Avihai 2021-11-22 12:51:21 UTC
(In reply to Nir Soffer from comment #2)
> Fixing version - this was found when working on 4.4.9 code.
> 
> Proposing for 4.4.10.

Hi Nir, 

Can you provide a clear verification scenario? (As I read your previous comments it's very unlikely, so if regression testing is enough, so be it.)

BTW, not sure how this is 4.4.10 material if it's hardly reproducible and does not have a customer ticket.
As this can cause data corruption I understand the need to try to fix it, but if it's that hard to reproduce I'm still not sure it's worth it.

QE has very limited capacity for 4.4.10, and non-main-flow, non-customer bugs should not be pushed there.

Comment 4 Nir Soffer 2021-11-22 13:11:42 UTC
(In reply to Avihai from comment #3)
...
> Can you provide a clear verification scenario? (As I read your previous
> comments it's very unlikely, so if regression testing is enough, so be it.)

There is no practical way to reproduce this issue. The only thing to test
is to run the regular regression tests and make sure that the fix does not
cause any regressions.
 
> BTW, not sure how this is 4.4.10 material if it's hardly reproducible and
> does not have a customer ticket.

If this fails in a real system, it can cause data corruption. These kinds
of fixes should be delivered to users before they are hit by the issue.

A customer bug means we already failed and a customer hit the issue.

Comment 6 Arik 2021-12-05 14:32:03 UTC
Yeah, if we are able to verify that the common flows keep working (using regression tests), and given that the negative flow could have resulted in data corruption, I think it is worth taking the risk of backporting the changes (looking at the patches, they are fairly simple).
Roman, were you able to simulate it in an artificial way and see that the system reacts well to these changes?

Comment 8 Roman Bednář 2021-12-14 11:37:41 UTC
As mentioned in previous comments, this is difficult to reproduce. I tested the patches on my system, just doing a "sanity check" to verify that the live merge flow works. But I think adequate regression testing should be performed by the QE team to increase confidence.

Comment 9 Arik 2021-12-14 16:43:58 UTC
(In reply to Roman Bednář from comment #8)
> As mentioned in previous comments, this is difficult to reproduce. I tested
> the patches on my system, just doing a "sanity check" to verify that the
> live merge flow works. But I think adequate regression testing should be
> performed by the QE team to increase confidence.

Ack, thanks
So, since we didn't reproduce the flow in which errors arise during the synchronization of a volume chain (which would exercise the new code),
let's take a more conservative approach here and push it out to 4.5.

Comment 12 Shir Fishbain 2022-04-30 19:47:36 UTC
Verified - after reviewing the tier1, tier2, and tier3 runs, the fix doesn't cause regressions related to LSM or snapshots.

Versions:
vdsm-4.50.0.13-1.el8ev.x86_64
ovirt-engine-4.5.0.4-0.1.el8ev.noarch

Comment 13 Sandro Bonazzola 2022-05-03 06:46:19 UTC
This bugzilla is included in the oVirt 4.5.0 release, published on April 20th 2022.

Since the problem described in this bug report should be resolved in the oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.