Bug 1916122
Summary: | Disk corruption caused by removing in use snapshot LV | |
---|---|---|---
Product: | [oVirt] ovirt-engine | Reporter: | Jean-Louis Dupond <jean-louis>
Component: | BLL.Storage | Assignee: | Liran Rotenberg <lrotenbe>
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | meital avital <mavital>
Severity: | high | Docs Contact: |
Priority: | unspecified | CC: | ahadas, bugs, bzlotnik, eshenitz, jean-louis, lrotenbe, tnisan
Version: | 4.4.4.5 | Target Milestone: | ---
Hardware: | x86_64 | Target Release: | ---
OS: | Unspecified | Whiteboard: |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-03-14 08:03:20 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Jean-Louis Dupond
2021-01-14 09:26:48 UTC
The documentation text flag should only be set after the 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Arik, this looks like the FS freeze bug you guys fixed, can you please share the configuration needed in order to prevent this issue?

So we're talking about a live-snapshot that doesn't include memory, right? It also seems that VDSM initiated the freeze, so I suspect the config value LiveSnapshotPerformFreezeInEngine is set to 'false'. But even if LiveSnapshotPerformFreezeInEngine is set to 'true', the freeze would fail, as it appears the guest agent is inaccessible. Is there qemu-guest-agent running in the guest?

^ That would explain the failure though, just why the FS wasn't frozen

That *wouldn't* explain the failure though, just why the FS wasn't frozen

This is indeed a live-snapshot without memory (it's used in our backup flow at the moment).

# engine-config -g LiveSnapshotPerformFreezeInEngine
LiveSnapshotPerformFreezeInEngine: false version: general

(which is the default — are there advantages to enabling this?)

The VM is already removed... but there is a very good chance it did not have an agent installed (because some of the OSes are quite old :(). That the freeze failed isn't such an issue here though :)

Arik, anything else that we could do here?

I think we need to check the error handling of create-snapshot, since it's not the first time we're told that when it fails, bad things happen on the storage.

Arik, from the logs it seems that we should recognize and handle the case of failure in creating the snapshot on the VDSM side. Please have a look.

(In reply to Eyal Shenitzky from comment #9)
> Arik, from the logs it seems that we should recognize and handle the case of
> failure in creating the snapshot on the VDSM side.
> Please have a look.

It makes a lot of sense to improve the error handling on the VDSM side, and if you've investigated the logs, could you please share your findings to save us time and help us determine whether it's virt or storage?

From the logs in the description, it seems like the operation failed in the live-snapshot creation process (snapshot.py). So it is probably around the creation of it or in the recovery mechanism. Do you have other thoughts on it?

I think there is an issue in the error handling in vdsm, since we see this log [1]:

2021-01-13 22:12:26,832+0100 INFO (virt/cf577e3d) [virt.vm] (vmId='fc00512f-7d52-42d9-81b5-4c3fbc2eb0a3') Snapshot timeout reached, operation aborted (snapshot:80)

But the operation started at:

2021-01-13 22:12:26,003+0100 INFO (virt/cf577e3d) [root] Running job 'cf577e3d-c6e5-4b16-9ce3-34532b8ad110'... (jobs:185)

Not even a second earlier, while the timeout is 30 minutes. So this abort is treated like it was initiated by vdsm, but it is actually done by libvirt. What is unclear to me is how this snapshot ended up being usable by the VM given how fast this failed.

Regarding the stale volume removal: it is not an actual lvremove, it just removes an LV uuid from vdsm's internal cache after seeing it's not present in the latest reload (invocation of the `lvs` command). The LV was likely removed before it was used by the VM, because LVM would have prevented the removal of an in-use LV. Full logs (including engine) would help better understand the chain of events.

[1] https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L464
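For readers unfamiliar with the cache behaviour described in the previous comment, here is a minimal sketch of the idea: the LV list is reloaded from the `lvs` command, and any cached LV that is no longer reported is dropped from the in-memory cache only, with no lvremove involved. This is not vdsm's actual implementation; the function names and output handling are simplified assumptions for illustration.

```python
# Simplified sketch of "stale LV removal": prune an in-memory cache based on
# the latest `lvs` reload. No LV is removed from storage here.
import subprocess


def reload_lvs(vg_name):
    """Return {lv_name: lv_uuid} as currently reported by LVM."""
    out = subprocess.run(
        ["lvs", "--noheadings", "--separator", "|",
         "-o", "lv_name,lv_uuid", vg_name],
        capture_output=True, text=True, check=True,
    ).stdout
    lvs = {}
    for line in out.splitlines():
        if not line.strip():
            continue
        name, uuid = [field.strip() for field in line.split("|", 1)]
        lvs[name] = uuid
    return lvs


def prune_stale(cache, vg_name):
    """Drop cached LVs that the latest reload no longer reports."""
    current = reload_lvs(vg_name)
    for name in list(cache):
        if name not in current:
            # No lvremove happens here -- the entry simply disappears from
            # the cache because LVM no longer reports the LV.
            del cache[name]
    cache.update(current)
    return cache
```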
(In reply to Benny Zlotnik from comment #12)
> So this abort is treated like it was initiated by vdsm, but it is actually
> done by libvirt. What is unclear to me is how this snapshot ended up being
> usable by the VM given how fast this failed.

Most likely something in libvirt failed because we tried incremental backup stuff on the VM. After restarting libvirt, the VM was restarted and the issue was resolved (so no libvirt debug logs).

> Regarding the stale volume removal: it is not an actual lvremove, it just
> removes an LV uuid from vdsm's internal cache after seeing it's not present
> in the latest reload (invocation of the `lvs` command). The LV was likely
> removed before it was used by the VM, because LVM would have prevented the
> removal of an in-use LV.

Is that the case? Because isn't the 'lvremove' executed on the SPM? And if the VM is running on another node, it will not know it's in use?

The logs of the delete:

2021-01-13T22:12:17.362+01:00 xxxx [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='d497efe5-2344-4d58-8985-7b053d3c35a3', ignoreFailoverLimit='false', storageDomainId='cc29e364-6bf2-4a52-8213-3c83649fc067', imageGroupId='e811105b-987a-4511-9378-ecc167413466', imageSizeInBytes='26843545600', volumeFormat='COW', newImageId='efb6948e-4e27-4224-8715-f555cbf3c01a', imageType='Sparse', newImageDescription='', imageInitialSizeInBytes='0', imageId='cb6cee06-1b8a-4e34-88a8-c96257a9a2f3', sourceImageGroupId='e811105b-987a-4511-9378-ecc167413466', shouldAddBitmaps='false'}), log id: 37f7989a

2021-01-13T22:12:17.412+01:00 xxxx [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] FINISH, CreateVolumeVDSCommand, return: efb6948e-4e27-4224-8715-f555cbf3c01a, log id: 37f7989a

2021-01-13T22:12:31.302+01:00 xxxx [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] START, DestroyImageVDSCommand( DestroyImageVDSCommandParameters:{storagePoolId='d497efe5-2344-4d58-8985-7b053d3c35a3', ignoreFailoverLimit='false', storageDomainId='cc29e364-6bf2-4a52-8213-3c83649fc067', imageGroupId='e811105b-987a-4511-9378-ecc167413466', imageId='00000000-0000-0000-0000-000000000000', imageList='[efb6948e-4e27-4224-8715-f555cbf3c01a]', postZero='false', force='false'}), log id: 6f053449

> Full logs (including engine) would help better understand the chain of
> events.

I can send you the logs if wanted. But I'd prefer via email then, as they might contain data that should not be public :)

> [1]
> https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L464

(In reply to Benny Zlotnik from comment #12)
> I think there is an issue in the error handling in vdsm, since we see this
> log [1]:
> 2021-01-13 22:12:26,832+0100 INFO (virt/cf577e3d) [virt.vm]
> (vmId='fc00512f-7d52-42d9-81b5-4c3fbc2eb0a3') Snapshot timeout reached,
> operation aborted (snapshot:80)
>
> But the operation started at:
> 2021-01-13 22:12:26,003+0100 INFO (virt/cf577e3d) [root] Running job
> 'cf577e3d-c6e5-4b16-9ce3-34532b8ad110'... (jobs:185)
> Not even a second earlier, while the timeout is 30 minutes.
>
> So this abort is treated like it was initiated by vdsm, but it is actually
> done by libvirt. What is unclear to me is how this snapshot ended up being
> usable by the VM given how fast this failed.
>
> [1]
> https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L464

Yes, the VDSM logs might be confusing: in line 464, as you indicated, we say it's a timeout. This is because in [2] we are calling libvirt to abort the job. But in this case it seems we got the abort from libvirt itself.

I opened BZ1933669.

[2] https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L640
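To make the two abort paths discussed in the last few comments concrete, here is a simplified, hypothetical sketch; it is not vdsm's actual snapshot.py code. A timer arms a caller-initiated abortJob() after a timeout, while libvirt may also abort the snapshot operation on its own, and both paths surface as the same VIR_ERR_OPERATION_ABORTED error unless extra state is kept.

```python
# Hypothetical illustration of why a libvirt-initiated abort can be mistaken
# for a timeout-initiated one. The timeout value, function names and error
# messages are placeholders.
import threading
import libvirt

TIMEOUT = 30 * 60  # seconds


def create_disk_snapshot(dom, snapshot_xml):
    timed_out = threading.Event()

    def _abort():
        # Caller-initiated abort, armed only when the timeout expires.
        timed_out.set()
        dom.abortJob()

    timer = threading.Timer(TIMEOUT, _abort)
    timer.start()
    try:
        dom.snapshotCreateXML(
            snapshot_xml,
            libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY)
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_OPERATION_ABORTED:
            # Without the timed_out flag the two cases are indistinguishable:
            # the same error is raised whether we aborted the job ourselves
            # or libvirt aborted it for its own reasons.
            if timed_out.is_set():
                raise RuntimeError("snapshot timed out and was aborted")
            raise RuntimeError("snapshot aborted by libvirt")
        raise
    finally:
        timer.cancel()
```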
(In reply to Jean-Louis Dupond from comment #13)
> Is that the case? Because isn't the 'lvremove' executed on the SPM? And if
> the VM is running on another node, it will not know it's in use?

You're right, for some reason I assumed there was just one host :)
Regardless, I do think the image was not reported at the time of the removal, as we have a check in the engine to see whether an image is used by a VM.

> The logs of the delete:
> [CreateVolumeVDSCommand / DestroyImageVDSCommand log lines quoted above]
>
> > Full logs (including engine) would help better understand the chain of
> > events.
>
> I can send you the logs if wanted. But I'd prefer via email then, as they
> might contain data that should not be public :)

You can send it in the email.

> [1]
> https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L464
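The guard mentioned in the previous comment (checking whether an image is used by a VM before removing it) reduces to the rule at the heart of this bug: an orphaned volume is recoverable, an in-use volume that gets removed is not. The following is only a conceptual sketch in Python; the real check lives in the Java engine code and looks different, and all names here are hypothetical.

```python
# Conceptual sketch: only destroy a newly created volume during rollback if
# no running VM reports it in its active volume chain.
def safe_to_destroy(volume_id, vm_volume_chains):
    """vm_volume_chains: hypothetical mapping of VM id -> list of volume ids
    reported by the host running that VM."""
    return all(volume_id not in chain for chain in vm_volume_chains.values())


def rollback_snapshot(volume_id, vm_volume_chains, destroy_image):
    if safe_to_destroy(volume_id, vm_volume_chains):
        # destroy_image stands in for the engine-side image removal call.
        destroy_image(volume_id)
    # Otherwise leave the volume in place: cleaning up a leftover volume
    # later is safe, removing an in-use volume corrupts the disk.
```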
(In reply to Liran Rotenberg from comment #14)
> Yes, the VDSM logs might be confusing: in line 464, as you indicated, we say
> it's a timeout. This is because in [2] we are calling libvirt to abort the
> job. But in this case it seems we got the abort from libvirt itself.
>
> I opened BZ1933669.
>
> [2]
> https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L640

It's more than the logging; in this case it seems that we can't actually trust libvirt's abort, as the image ended up being used by the VM. So it seems we may need a distinct error to be sent to the engine to indicate that it should not try to roll back.

(In reply to Benny Zlotnik from comment #16)
> It's more than the logging; in this case it seems that we can't actually
> trust libvirt's abort, as the image ended up being used by the VM. So it
> seems we may need a distinct error to be sent to the engine to indicate that
> it should not try to roll back.

Yes, I see your point. But we have never encountered that: when we get to the part of updating drives in VDSM, we know rollback is too late (we didn't reach that point here). libvirt reports the domain XML once it has changed, and in that case we should get the new chain, including the new volume in use. We also have the check to prevent this rollback in case it's there (your fix). Unless there is some new behavior in libvirt, it's surprising to get a new volume in use that doesn't appear in the chain, making us delete it. The question is how it happened and why. I don't think it should happen from libvirt's point of view, and in that case returning a special error to the engine in order to prevent the volume removal (always, in such a case?) is wrong.

After a discussion with Benny, we might try to prevent rolling back in such a situation. This means the VM will have the new volume and use it, while it won't exist in the domain XML volume chain. It should first be tested to make sure there is no big regression because of it.

(In reply to Liran Rotenberg from comment #18)
> It should first be tested to make sure there is no big regression because of
> it.

It seems we already choose not to remove a volume when we are not able to retrieve the volume chain [1], which is similar in a way.

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/snapshots/CreateSnapshotCommand.java#L236-L239
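As a practical illustration of the "new chain" discussed above: the active volume chain of a running VM can be read from the domain XML that libvirt reports, where each disk carries a <source> element and nested <backingStore> elements. This is a hedged, standalone sketch, not oVirt/vdsm code; the connection URI and domain name are placeholders.

```python
# Read each disk's active chain (top volume plus backing volumes) from the
# domain XML reported by libvirt.
import xml.etree.ElementTree as ET
import libvirt


def active_chain(dom):
    """Return {target_dev: [source paths, from the top volume downwards]}."""
    root = ET.fromstring(dom.XMLDesc(0))
    chains = {}
    for disk in root.findall("./devices/disk"):
        target = disk.find("target").get("dev")
        chain = []
        node = disk
        while node is not None:
            src = node.find("source")
            if src is not None:
                # block-based oVirt volumes use 'dev', file-based use 'file'
                chain.append(src.get("dev") or src.get("file"))
            node = node.find("backingStore")
        chains[target] = chain
    return chains


if __name__ == "__main__":
    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("example-vm")  # placeholder domain name
    for dev, chain in active_chain(dom).items():
        print(dev, "->", chain)
```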
Hi, we tried to figure out what could be wrong, but these are just speculations; without proper logs it will be hard to tell what went wrong.

A volume in use shouldn't be removed, of course. I tried to see whether the engine got the volume in the chain like you pointed out in comment #0, but didn't find anything. Also, if it had been in the chain, the code referenced in comment #19 would have saved us from removing it.

Given that, if you can provide full logs (hopefully including libvirt), we can look into it further. If that is not possible, I recommend closing this bug until someone is able to provide them.

From an offline discussion with Jean-Louis Dupond:

Thanks for having an in-depth look. I don't have more to share at this moment, so I guess we'll have to close the bug then. If it occurs again I'll try to gather more info (and directly archive the engine and vdsm logs).

Therefore, I'm closing the bug. If it comes up again with more info, please re-open this bug or file a new one.