HSM.imageSyncVolumeChain() does *not* raise on errors, because the storage dispatcher only returns a status dict, so callers must check result["status"]["code"] to detect errors. VM.sync_volume_chain() handles this correctly, for example. If imageSyncVolumeChain() fails before a pivot, we ignore the error and pivot *without* marking the leaf as illegal. This may lead to data corruption if the VM is restarted after the pivot with the old leaf. The most likely fix is in the VM class: add a wrapper similar to VM.setVolumeSize() that raises on error, instead of checking res["status"]["code"] everywhere.
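A minimal sketch of the kind of wrapper this suggests, modeled on the status-dict-to-exception pattern of VM.setVolumeSize(); the wrapper name, exception class, and parameter names below are illustrative assumptions, not the actual vdsm patch:

    class VolumeChainSyncError(Exception):
        """Raised when imageSyncVolumeChain reports a non-zero status code."""


    def sync_volume_chain(irs, sd_id, img_id, vol_id, actual_chain):
        # The storage dispatcher never raises; it only reports failures
        # through the returned status dict.
        res = irs.imageSyncVolumeChain(sd_id, img_id, vol_id, actual_chain)
        if res["status"]["code"] != 0:
            # Surfacing the error lets live merge stop before the pivot
            # instead of pivoting while the old leaf is still LEGAL.
            raise VolumeChainSyncError(
                "Cannot sync volume chain for image {}: {}".format(
                    img_id, res["status"].get("message")))
        return res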
This issue can cause a volume to remain LEGAL after a successful pivot. The sequence of events that may lead to this issue:

1. Calling imageSyncVolumeChain fails.
2. The pivot succeeds.
3. Vdsm stops or crashes before calling imageSyncVolumeChain again.

If this VM is started again, it may start from the old top volume, which will corrupt the disk. This is very unlikely, but the results are bad.
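To make the ordering concrete, here is a rough sketch of the intended flow, assuming the wrapper sketched above exists; finish_live_merge(), pivot(), and abort_pivot() are hypothetical placeholders, not vdsm API:

    def finish_live_merge(irs, job):
        # Sync the chain metadata before pivoting; if the sync fails we
        # must not pivot, otherwise the old leaf stays LEGAL and a later
        # VM start may run from it and corrupt the disk.
        try:
            sync_volume_chain(irs, job.sd_id, job.img_id, job.vol_id,
                              job.actual_chain)
        except VolumeChainSyncError:
            abort_pivot(job)  # hypothetical placeholder
            raise
        pivot(job)  # hypothetical placeholder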
Fixing version - this was found when working on 4.4.9 code. Proposing for 4.4.10.
(In reply to Nir Soffer from comment #2)
> Fixing version - this was found when working on 4.4.9 code.
>
> Proposing for 4.4.10.

Hi Nir,

Can you provide a clear verification scenario? (As I read your previous comments it's very unlikely, so if regression testing is enough, so be it.)

BTW, I'm not sure how this is 4.4.10 material if it's hardly reproducible and does not have a customer ticket. As this can cause data corruption I understand the need to try to fix it, but if it's that hard to reproduce I'm still not sure it's worth it. QE has very limited capacity for 4.4.10, and non-main-flow/non-customer bugs should not be pushed here.
(In reply to Avihai from comment #3)
...
> Can you provide a clear verification scenario? (As I read your previous
> comments it's very unlikely, so if regression testing is enough, so be it.)

There is no practical way to reproduce this issue. The only thing to test is to run the regular regression tests to make sure that the fix does not cause any regressions.

> BTW, I'm not sure how this is 4.4.10 material if it's hardly reproducible
> and does not have a customer ticket.

If this fails in a real system, it can cause data corruption. These kinds of fixes should be delivered to users before they hit the issue. A customer bug means we have already failed and a customer has hit the issue.
Yeah, if we are able to verify that the common flows keep working (using regression tests), and given that the negative flow could result in data corruption, I think it is worth taking the risk of backporting the changes (looking at the patches, they are fairly simple).

Roman, were you able to simulate it in an artificial way and see that the system reacts well to these changes?
As mentioned in previous comments, this is difficult to reproduce. I tested the patches on my system, just doing a "sanity check" verifying that the live merge flow works. But I think adequate regression testing should be performed by the QE team to increase confidence.
(In reply to Roman Bednář from comment #8)
> As mentioned in previous comments, this is difficult to reproduce. I tested
> the patches on my system, just doing a "sanity check" verifying that the
> live merge flow works. But I think adequate regression testing should be
> performed by the QE team to increase confidence.

Ack, thanks.

Since we didn't reproduce the flow in which errors arise during the synchronization of a volume chain, and so couldn't exercise the new code, let's take a more conservative approach here and push it out to 4.5.
Verified - after reviewing the tier1, tier2, and tier3 runs, the fix doesn't cause regressions related to LSM or snapshots.

Versions:
vdsm-4.50.0.13-1.el8ev.x86_64
ovirt-engine-4.5.0.4-0.1.el8ev.noarch
This bugzilla is included in the oVirt 4.5.0 release, published on April 20th 2022. Since the problem described in this bug report should be resolved in the oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.