Bug 1152529
| Field | Value |
|---|---|
| Summary | Problem with vm snapshots and disk |
| Product | [oVirt] ovirt-engine |
| Component | Frontend.WebAdmin |
| Status | CLOSED WORKSFORME |
| Severity | high |
| Priority | medium |
| Version | --- |
| Reporter | Shirly Radco <sradco> |
| Assignee | Shmuel Melamud <smelamud> |
| QA Contact | meital avital <mavital> |
| CC | bugs, ecohen, gklein, istein, lsurette, mgoldboi, michal.skrivanek, rbalakri, Rhev-m-bugs, smelamud, sradco, yeylon |
| Target Milestone | ovirt-4.0.0-alpha |
| Flags | ylavi: ovirt-4.0.0? rule-engine: planning_ack? rule-engine: devel_ack? rule-engine: testing_ack? |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | virt |
| Doc Type | Bug Fix |
| Story Points | --- |
| Last Closed | 2016-01-03 16:43:29 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | Virt |
| Cloudforms Team | --- |
Description (Shirly Radco, 2014-10-14 11:16:53 UTC)
Created attachment 946840 [details]: screenshot1
Created attachment 946842 [details]: screenshot2
Created attachment 946843 [details]: screenshot3
Created attachment 946844 [details]: screenshot4
Created attachment 946845 [details]: screenshot5
Comment (Omer Frenkel): Can you please explain what you did to get to this? Also, please attach engine + SPM logs.

Comment (Shirly Radco):
(In reply to Omer Frenkel from comment #6)
> can you please explain what did you do to get to this?
> also please attach engine+spm logs

This happened to me on a production environment of the eng lab. First I shut down the VM and tried to go back to a previous snapshot using "preview" and then commit, but it created multiple "Active VM before the preview" snapshots. I tried this several times and also tried to delete the previous snapshots, but with no success.

Comment: Any chance for the logs?

Comment: Created attachment 949743 [details]: engine logs for 2013-10-14
Comment (Omer Frenkel):
Looks like an issue with rollback/compensation of RestoreAllSnapshotsCommand that causes duplicate entries in the snapshots table.

I see in the log that DeleteImageGroupVDSCommand fails:

2014-10-14 08:53:36,974 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (ajp-/127.0.0.1:8702-14) [502dbcc9] IrsBroker::Failed::DeleteImageGroupVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to DeleteImageGroupVDS, error = Storage domain does not exist

This means the snapshot had a memory volume, which was on a storage domain that was already deleted...

Comment: Still want to consider this "soon".

Comment: Shmuel, is this a duplicate of bug 1236061?

Comment (Shmuel Melamud): I don't think so. In bug 1236061 the problem appears when an exception is thrown in CreateAllSnapshotsFromVmCommand; I don't see anything similar in the log here. Here the problem appeared after running RestoreAllSnapshotsCommand. I've taken a look at it and I see that the command itself is transactional, but compensation is used somewhere. If an exception is thrown, this may cause a snapshot that should be deleted to appear twice: once as the result of the transaction rollback, and again as the result of compensation of the same removal. But this is just a guess.

Comment (Ilanit): Shmuel, can you please provide steps to reproduce? Thanks, Ilanit.

Comment:
(In reply to Omer Frenkel from comment #10)
> Looks like an issue with rollback/compensation of RestoreAllSnapshotsCommand
> that causes duplicate entries in the snapshots table
>
> I see in the log that DeleteImageGroupVDSCommand fails:
>
> 2014-10-14 08:53:36,974 ERROR
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (ajp-/127.0.0.1:8702-14) [502dbcc9] IrsBroker::Failed::DeleteImageGroupVDS
> due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to
> DeleteImageGroupVDS, error = Storage domain does not exist
>
> this means the snapshot had memory volume, which was on a storage domain
> that was already deleted...
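Shmuel's hypothesis, that a transaction rollback and a compensation replay can each restore the same deleted snapshot row, can be illustrated with a toy model. This is simplified Python, not oVirt code (the real engine is Java); all names here are invented for illustration only.

```python
# Toy model: a snapshot row is deleted inside a transaction, and a
# compensation entry is recorded for the same deletion. On failure, the
# transaction rollback restores the row AND the compensation replay
# re-inserts it, leaving a duplicate entry in the table.

class SnapshotsTable:
    def __init__(self, rows):
        self.rows = list(rows)      # committed state of the snapshots table
        self.tx_backup = None       # state captured at transaction start

    def begin(self):
        self.tx_backup = list(self.rows)

    def delete(self, snap_id):
        self.rows.remove(snap_id)

    def rollback(self):
        self.rows = self.tx_backup  # undo everything done in the transaction

def restore_all_snapshots(table, compensation_log, fail=False):
    table.begin()
    # Compensation entry recorded before the delete, so the removal could be
    # undone even without an enclosing transaction.
    compensation_log.append(("reinsert", "snap-1"))
    table.delete("snap-1")
    if fail:
        table.rollback()                          # undo #1: tx rollback
        for action, snap_id in compensation_log:  # undo #2: compensation
            if action == "reinsert":
                table.rows.append(snap_id)
        return
    compensation_log.clear()  # success path: nothing to compensate

table = SnapshotsTable(["snap-1", "snap-2"])
restore_all_snapshots(table, [], fail=True)
print(table.rows)  # -> ['snap-1', 'snap-2', 'snap-1']
```

The double undo is exactly the duplicate-entries symptom described in comment #10: either mechanism alone would restore the table correctly, but mixing both for the same removal inserts the row twice.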
In this case, shouldn't we ignore the error and just continue instead of rolling back?

Comment: Lowering priority as we don't have a clear reproducer for this (Shmuel couldn't reproduce so far).

Comment: From the logs it seems that on a storage issue there is a problem with the rollback. We should consider pushing this to 4.0.

Comment (Shmuel Melamud): It is unclear where the broken snapshot comes from; the log doesn't give an answer. The error condition in RestoreAllSnapshotsCommand appears only as a result of it, it is not the cause of the problem. In this situation, if we cannot find where the problem originally lies, we can at least ignore the StorageDomainDoesNotExist error in DeleteImage. It looks logical and safe, and will allow removing the broken snapshot without modifying the DB directly. See the patch in Gerrit: https://gerrit.ovirt.org/46706

Comment (Shmuel Melamud): After some research, I don't think this is a good solution anymore. I created a VM with a disk on a separate storage domain and then manually cleaned up the storage directory. The storage domain went down and I detached it from the data center. After that I tried to remove the VM and got an error message about the absence of the storage domain. From my point of view, it would be good to give the user the possibility to remove the disk in such a situation. But simply ignoring this error from RemoveImageCommand doesn't help; we need to remove this check from RemoveImageCommand.canDoAction(). The logic would be: if RemoveImageCommand is called to remove an image but the storage domain is not available, delete just the record from the DB and don't execute the VDSM action. But this logic is flawed: if the storage domain is only temporarily unavailable and the user executed RemoveImageCommand without knowing it, this would leave the image orphaned on the storage. It would be more correct to check whether the storage domain is completely detached from the DC, or even no longer known to the engine at all. In the regular scenario this causes all links to the storage domain to be deleted.
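The distinction being proposed here, skipping the VDSM delete only when the storage domain is gone for good rather than merely temporarily down, could be sketched roughly as follows. This is a hypothetical Python sketch; the real ovirt-engine code is Java, and the status names and function signature are invented for illustration.

```python
# Hypothetical decision logic for removing an image record (invented names,
# not the real ovirt-engine API). Only skip the VDSM call when the storage
# domain is permanently gone; refuse when it is merely temporarily down,
# since deleting just the DB record would orphan the image on storage.

UNKNOWN, DETACHED, INACTIVE, ACTIVE = "unknown", "detached", "inactive", "active"

def remove_image(domain_status, db_records, image_id, vdsm_calls):
    if domain_status in (UNKNOWN, DETACHED):
        # Domain is no longer part of the DC, so a DB record still pointing
        # at it is already an error; deleting just the record is safe.
        db_records.discard(image_id)
        return "db-only"
    if domain_status == INACTIVE:
        # Temporarily unavailable: removing the record now would leave the
        # image orphaned on storage once the domain comes back.
        return "refused"
    # Normal path: delete on storage via VDSM, then remove the DB record.
    vdsm_calls.append(("DeleteImageGroup", image_id))
    db_records.discard(image_id)
    return "full"

records = {"img-1"}
vdsm_calls = []
print(remove_image(DETACHED, records, "img-1", vdsm_calls))  # -> db-only
print(vdsm_calls)  # -> [] (no VDSM call for a detached domain)
```

The key design point matches the comment above: the "temporarily unavailable" and "permanently detached" cases must be told apart, because only the latter makes a DB-only removal safe.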
If there is a record in the DB pointing to an image on such a storage domain, it means an error occurred. In that case we can safely allow RemoveImage to remove just the record in the DB. Michal, Omer, what do you think?

Comment (Shmuel Melamud): In the case I've described above it is possible to remove the VM using the 'Destroy' command from the right-click menu of the VM. This command removes the VM and all links to its disk images, even when the images themselves are not accessible. If the situation described in the bug is similar, the user can also use 'Destroy' to remove the dysfunctional VM.

Shirly, can you give some additional information about the VM?

1. Content of the 'Storage' tab.
2. Content of the 'Disk' subtab of the VM.
3. Name of the storage domain where each disk of the VM is located.

Is it possible to run the VM? Are the disks accessible?

Comment: This bug is flagged for 3.6, yet the milestone is for the 4.0 version, therefore the milestone has been reset. Please set the correct milestone or add the flag.

Comment (Shirly Radco):
(In reply to Shmuel Melamud from comment #20)
> In the case I've described above it is possible to remove the VM using
> 'Destroy' command from the right-click menu of the VM. This command removes
> the VM and all links to its disk images, even when the images itself are not
> accessible.
>
> If the situation described in the bug is similar, user can also use
> 'Destroy' to remove the disfunctional VM.
>
> Shirly, can you give some additional information about the VM?
>
> 1. Content of 'Storage' tab.
> 2. Content of 'Disk' subtab of the VM.
> 3. Name of the storage domain where each of the disk of the VM is located.
>
> Is it possible to run the VM? Are the disks accessible?

The VM was deleted from nott4, so I can't give you more details. Sorry.

Comment: Pending closure if there is no news in ~14 days.

Comment: No news since a month ago, closing.