Bug 1213157
| Summary: | VM block snapshot disks become illegal after failed Live Delete Snapshot Merge | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Kevin Alon Goldblatt <kgoldbla> |
| Component: | ovirt-engine | Assignee: | Greg Padgett <gpadgett> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Kevin Alon Goldblatt <kgoldbla> |
| Severity: | urgent | Priority: | unspecified |
| Version: | 3.5.1 | Target Release: | 3.6.0 |
| Target Milestone: | ovirt-3.6.0-rc | Keywords: | ZStream |
| Hardware: | x86_64 | OS: | Unspecified |
| Fixed In Version: | ovirt-engine-3.6.0_alpha1 | Doc Type: | Bug Fix |
| Type: | Bug | oVirt Team: | Storage |
| Cloned to: | 1227497, 1231535 | Bug Blocks: | 1231535 |
| CC: | ahino, alitke, amureini, dev-unix-virtualization, gklein, gpadgett, kgoldbla, lpeer, lsurette, ratamir, rbalakri, Rhev-m-bugs, tnisan, v.astafiev, yeylon, ykaul, ylavi | | |
Description (Kevin Alon Goldblatt, 2015-04-19 15:00:57 UTC)
Libvirt rpms on host:
libvirt-client-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.2.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.2.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.2.x86_64

Created attachment 1016104 [details]
engine, server, vdsm and libvirt logs
Added logs
NOTE: THERE IS A 3-HOUR DIFFERENCE BETWEEN THE ENGINE AND HOST, e.g.:
Current Engine time: Sun Apr 19 18:18:05 IDT 2015
Current Host time: Sun Apr 19 21:18:35 IDT 2015
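Since the state at issue here is recorded per volume in the storage domain metadata, it can also be inspected directly on the host. A minimal sketch, assuming the vdsClient tool from the vdsm-cli package is installed on the host; all four UUIDs are placeholders to be filled in from the vdsm log:

```
# Print a volume's metadata; the output includes a "legality" field
# (LEGAL or ILLEGAL). All four UUIDs are placeholders.
vdsClient -s 0 getVolumeInfo <sdUUID> <spUUID> <imgUUID> <volUUID>
```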
The error seems unrelated to the restart (missing PVs?), so I'm not quite sure what I'd want the disk status to be... Adam/Greg, please take a look.

Kevin - assuming the domain is still functioning (which isn't necessarily the case, judging from the error), if you manually update the DB to mark those disks as OK [1], can you run the VM?

[1] In psql, use:
UPDATE images SET imagestatus = 0 WHERE image_group_id = <the disk's ID>; COMMIT;

Just to verify, the illegal state referred to here is illegal on the host? (I don't see any references to illegal disks in the supplied engine log.)

(In reply to Greg Padgett from comment #4)
> Just to verify, the illegal state referred to here is illegal on the host?
> (I don't see any references to illegal disks in the supplied engine log.)

Then again, the vdsm log has no errors at all.
Kevin, are you sure these are the right logs?

(In reply to Greg Padgett from comment #4)
> Just to verify, the illegal state referred to here is illegal on the host?
> (I don't see any references to illegal disks in the supplied engine log.)

1. NOTE: THERE IS A 3-HOUR DIFFERENCE BETWEEN THE ENGINE AND HOST, e.g.:
Current Engine time: Sun Apr 19 18:18:05 IDT 2015
Current Host time: Sun Apr 19 21:18:35 IDT 2015

2. ILLEGAL status appears in the host vdsm log, e.g.:
cbfe5fbe-9f6a-441f-bb5b-4bac3de41fdc::INFO::2015-04-19 19:01:20,332::volume::617::Storage.Volume::(setLegality) sdUUID=17eac76f-4aaa-4d99-8f1a-b7f6e68e4c70 imgUUID=4a31b260-8136-411d-9743-b3e937859097 volUUID = 54c94300-650a-4da9-9656-87969a6c60a4 legality = ILLEGAL
0bb73a86-25be-42d5-9695-3b6bf1411d5b::INFO::2015-04-19 19:01:24,958::volume::617::Storage.Volume::(setLegality) sdUUID=3eae5544-0c43-483b-bf06-118660fcbfb1 imgUUID=32d03b22-d9a8-40e8-a7f6-fae12cdbb840 volUUID = 2655ffd3-5c64-4cd4-8aac-16599acae719 legality = ILLEGAL
e4ece9d5-2b01-4fda-b1af-2a6e0a766bed::INFO::2015-04-19

3. In the GUI, the ILLEGAL status is displayed under Storage > <Storage Domain> > Disk Snapshots > Status > illegal. Added screen shot.

4. 2 of the disk snapshots are displayed as Illegal since the failed Live Delete Snapshot operation. Subsequently I can no longer delete this snapshot.

(In reply to Allon Mureinik from comment #5)
> (In reply to Greg Padgett from comment #4)
> > Just to verify, the illegal state referred to here is illegal on the host?
> > (I don't see any references to illegal disks in the supplied engine log.)
> Then again, the vdsm log has no errors at all.
> Kevin, are you sure these are the right logs?

Apologies, the vdsm log rotated before I uploaded them. Adding logs again.

Created attachment 1016393 [details]
engine, server, vdsm and libvirt logs
Added logs again as the vdsm log was old due to log rotation
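For reference, the DB workaround suggested above (marking the disk's images OK via psql) would look roughly like this on the engine host. This is a sketch, assuming the default engine database name of "engine" and local access as the postgres user; <DISK_ID> is a placeholder for the disk's image group UUID:

```
# Inspect the disk's images and their current status first.
su - postgres -c "psql engine -c \"SELECT image_guid, imagestatus FROM images WHERE image_group_id = '<DISK_ID>';\""

# Mark them OK (imagestatus = 0) as suggested above; a single psql -c
# statement is auto-committed, so no explicit COMMIT is needed.
su - postgres -c "psql engine -c \"UPDATE images SET imagestatus = 0 WHERE image_group_id = '<DISK_ID>';\""
```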
Edited the summary of the bz to reflect the problem.

The included vdsm log does not show the actual merge operation, so it's hard to tell if there were any specific errors with that flow. What I see here are some errors from LVM indicating missing PVs. It seems like there could have been some external corruption of the storage domain (perhaps someone removed some LUNs on the server side?).

(In reply to Kevin Alon Goldblatt from comment #6)
> 4. 2 of the disk snapshots are displayed as Illegal since the failed Live
> Delete Snapshot operation. Subsequently I can no longer delete this snapshot.

Just to clarify - it's just the snapshot that's marked illegal?
The Disk object is OK?
Can you stop the VM and re-run it?

(In reply to Allon Mureinik from comment #11)
> (In reply to Kevin Alon Goldblatt from comment #6)
> > 4. 2 of the disk snapshots are displayed as Illegal since the failed Live
> > Delete Snapshot operation. Subsequently I can no longer delete this snapshot.
>
> Just to clarify - it's just the snapshot that's marked illegal?
> The Disk object is OK?
> Can you stop the VM and re-run it?

Yes, I can stop and re-run the VM, but any attempt to delete the same snapshot again fails.

(In reply to Adam Litke from comment #10)
> The included vdsm log does not show the actual merge operation, so it's hard
> to tell if there were any specific errors with that flow. What I see here
> are some errors from LVM indicating missing PVs. It seems like there could
> have been some external corruption of the storage domain (perhaps someone
> removed some LUNs on the server side?).

No LUNs are missing on the storage.

Created attachment 1016713 [details]
engine, server, vdsm and libvirt logs
Added an additional vdsm log that includes the earlier time from before the merge
vdsm.log.22.xz
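As a side note on the missing-PV errors mentioned above: whether the storage domain's VG has lost physical volumes can be checked on the host with the standard LVM tools. A sketch, assuming direct access to the block storage domain, whose VG is named after the storage domain UUID (the UUID below is taken from the log excerpt earlier in this bug):

```
# List PVs and the VGs they belong to; a missing PV typically shows up
# as "unknown device" in this output.
pvs -o pv_name,vg_name,pv_size

# Report on the storage domain's VG; LVM prints a warning here if any
# of the VG's PVs cannot be found.
vgs 17eac76f-4aaa-4d99-8f1a-b7f6e68e4c70
```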
Upon further investigation I found that the illegal disk is displayed in Storage > Disks > Disk Snapshots but not in VM > Snapshots > Disks. This means that the database has 2 tables that are not synchronized.

(In reply to Kevin Alon Goldblatt from comment #15)
> Upon further investigation I found that the illegal disk is displayed in the
> Storage > Disks > Disk snapshots but not in the VM > Snapshots > Disks. This
> means that the database has 2 tables that are not synchronized

No, it doesn't. It means the disk as a whole is OK (i.e., can be used for VM operations like running it), but there's a problem somewhere down the chain which prevents things like merging and requires some intervention.

(In reply to Kevin Alon Goldblatt from comment #15)
> Upon further investigation I found that the illegal disk is displayed in the
> Storage > Disks > Disk snapshots but not in the VM > Snapshots > Disks.
<snip>

Okay, I think I see what's going on. The reboot happened during the destroy_image phase of the merge attempts, and thus the merges were successful but the snapshot volumes are still on the storage. We see they're illegal, as they should be; however, it's designed so that you can retry the merge on the illegal volumes in order to clean them up. This last bit isn't working for some reason or another. I'm looking into it.

After further testing, it appears the live merge could not be retried because the command would not converge after a failure to delete the image. [1] With the linked patch, the failure causes overall command failure, leaving the snapshot and image unlocked and in the proper state to allow a retry. There may still be a failure due to the timing of the restart, but note that live merge isn't designed to necessarily succeed if restarted at any arbitrary point; rather, it is designed to recover gracefully if re-executed after a failure such as this. Please test with this patch once it's in the build and let me know if you still see any problems.

[1] In this case, DestroyImageCommand failed, which is the third part (of 3) of the live merge flow on the engine. This child command is an old-style command (it runs an SPM task) whereas its parent, which controls the flow, uses newer infrastructure (CommandCoordinator). The two don't always play well together, and we had already added a special case to get the endAction flow working between the two. This was extended further to avoid an endless loop of status-checking if the child command fails.

Found some issues later in the retry flow, moving back to assigned.

A partial fix is posted to gerrit. Looks like there's still a case where the engine can "forget" that it performed a merge, which will cause an error when it retries during command recovery.

(In reply to Greg Padgett from comment #20)
> A partial fix is posted to gerrit. Looks like there's still a case where
> the engine can "forget" that it performed a merge, which will cause an error
> when it retries during command recovery.

So let's clone the bug for better tracking.
We'll have this bug for 3.5.4 to track the fix we DO have, and use the other one to track the remaining problem.

(In reply to Allon Mureinik from comment #21)
> (In reply to Greg Padgett from comment #20)
> > A partial fix is posted to gerrit. Looks like there's still a case where
> > the engine can "forget" that it performed a merge, which will cause an error
> > when it retries during command recovery.
> So let's clone the bug for better tracking.
> We'll have this bug for 3.5.4 to track the fix we DO have, and use the other
> one to track the remaining problem.

Opened bug 1227497 to track that issue.

Verified using version:
rpm -qa:
vdsm-4.17.0-1054.git562e711.el7.noarch
ovirt-engine-3.6.0-0.0.master.20150627185750.git6f063c1.el6.noarch

Verified with the following scenario:

Steps to Reproduce:
1. Create a VM with several disks, including block preallocated and thin and NFS preallocated and thin
2. Started the VM
3. Created 3 snapshots: snsa1, snsa2, snsa3
4. Deleted snapshot snsa2 - while the snapshot was locked, restarted vdsm
>>>>> The snapshot delete operation was SUCCESSFUL
5. Deleted snapshot snsa1 - while the snapshot was locked, restarted vdsm
>>>>> The snapshot delete operation was SUCCESSFUL

Verified using a nested host! Moving to VERIFIED!

Hi,

Is it possible to port this fix back into RHEV 3.5 please? This is the version we've integrated against, and we would really like to see a fix for that version.

-Amit Mitkar

(In reply to dev-unix-virtualization from comment #25)
> Hi,
>
> Is it possible to port this fix back into RHEV 3.5 please? This is the
> version we've integrated against, and we would really like to see a fix for
> that version.
>
> -Amit Mitkar

It was included in 3.5.4.

RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE