Bug 1227497
| Summary: | Live Merge - in certain failure cases engine may not correctly mark merged snapshots in the database | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Greg Padgett <gpadgett> |
| Component: | General | Assignee: | Greg Padgett <gpadgett> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Elad <ebenahar> |
| Severity: | medium | Docs Contact: | Kevin Alon Goldblatt <kgoldbla> |
| Priority: | high | | |
| Version: | 3.5.1 | CC: | acanan, alitke, amureini, bugs, gklein, gpadgett, kgoldbla, lpeer, lsurette, rbalakri, Rhev-m-bugs, tnisan, v.astafiev, yeylon, ykaul, ylavi |
| Target Milestone: | ovirt-3.6.2 | Flags: | rule-engine: ovirt-3.6.z+, ylavi: planning_ack+, amureini: devel_ack+, rule-engine: testing_ack+ |
| Target Release: | 3.6.2.5 | | |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1213157 | Environment: | |
| Last Closed: | 2016-02-18 11:03:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1306732 | | |
| Attachments: | VDSM patch for reproduction (attachment 1074236); Success/failure engine log excerpts (attachment 1074237) | | |
Description Greg Padgett 2015-06-02 19:59:00 UTC
The cloned comments in the description don't provide much detail on this particular defect. To elaborate: occasionally a Live Merge execution would result in a failure that could not be recovered by retries (as is the convention for Live Merge). The problem turns out to be improper tracking of the image deletion after the merge is complete. If the vdsm deleteVolume call is made and completes, but the engine can't immediately retrieve the task status, the engine never discovers the actual state of the image. Instead, upon retry the engine calls deleteVolume again, which then fails and causes the Live Merge to fail. This will be properly addressed once SPM removal is complete; in the meantime we'll add another command to the RemoveSnapshotSingleDiskLiveCommand chain that checks whether the image has already been deleted, so that in the scenario above the command can complete successfully even if the deletion succeeded without the engine's knowledge.

Verification:
This one is a bit tricky because the problem only occurs if vdsm dies at a very specific point in time. Vdsm must be killed right after the image deletion occurs but before the task status is saved. The best way that I've found to accomplish this is the following:
1. Stop vdsm on the host
2. Edit /usr/share/vdsm/storage/task.py and apply the code in the attached patch file (listed below [1] for convenience; applying the patch file is preferred because of Python's whitespace handling)
3. Edit /usr/lib/systemd/system/vdsmd.service, change "Restart=always" to "Restart=no" to prevent vdsm from restarting with the patched code during the test
There may very well be some other way to kill vdsm at just the right moment, but this is the most reliable way I could think of.
Now to reproduce:
1. Ensure the host to be used for snapshot removal is running a vdsm with the patched code from above.
2. Create a snapshot for a vm
3. Ensure the vm is running
4. Remove the snapshot
5. Wait for vdsm to exit; the end of the log should show "HACK: Found deleteVolume in finished state, exiting". The engine will start raising exceptions due to connection errors; this is normal.
6. Remove the vdsm code patch (best to reverse it with `patch -R` to avoid possible syntax errors due to whitespace issues)
7. Start vdsm
8. Remove the same snapshot again
If the engine fix works, this second attempt in step 8 will result in successful snapshot removal. Prior to the fix, the second attempt would fail, leaving the snapshot behind.
I've attached log samples showing excerpts of the engine log for a failure and successful removal. The errors and start/end of the logs are the most obvious signs; if needed, these can be used as a guide to help correlate with the behavior you're seeing during testing.
[1] patch contents:
--- vdsm/storage/task.py 2015-09-16 23:00:26.936111622 -0400
+++ vdsm/storage/task.py.new 2015-09-16 23:00:16.492108251 -0400
@@ -751,4 +751,9 @@
taskDir = os.path.join(storPath, self.id + TEMP_EXT)
self.log.debug("_save: orig %s temp %s", origTaskDir, taskDir)
+ # TODO hack
+ if self.state == State.finished and self.name == 'deleteVolume':
+ self.log.error("HACK: Found deleteVolume in finished state, exiting")
+ os._exit(1)
+ # TODO end hack
if getProcPool().os.path.exists(taskDir):
getProcPool().fileUtils.cleanupdir(taskDir)
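For reference, this is roughly how the patched region of task.py should read once the hunk above is applied. The surrounding lines come from the diff's own context; the absolute indentation is an assumption and must simply match the existing context lines (spaces only, per Python's whitespace rules):

        taskDir = os.path.join(storPath, self.id + TEMP_EXT)
        self.log.debug("_save: orig %s temp %s", origTaskDir, taskDir)
        # TODO hack
        if self.state == State.finished and self.name == 'deleteVolume':
            self.log.error("HACK: Found deleteVolume in finished state, exiting")
            os._exit(1)
        # TODO end hack
        if getProcPool().os.path.exists(taskDir):
            getProcPool().fileUtils.cleanupdir(taskDir)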
Created attachment 1074236 [details]
VDSM patch for reproduction
Created attachment 1074237 [details]
Success/failure engine log excerpts
Yaniv, this bug should be a rare occurrence; however, when it happens the snapshot cannot be removed without some manual intervention (i.e. editing the db). There's more detail about the bug in general in my preceding comments. Given all that, should this be backported to z-stream?

(In reply to Greg Padgett from comment #5)
> Yaniv, this bug should be a rare occurrence; however, when it happens the
> snapshot cannot be removed without some manual intervention (i.e. editing the
> db). There's more detail about the bug in general in my preceding comments.
> Given all that, should this be backported to z-stream?

Can you please add steps to correct the db in case a user hits this?

This is an automated message. oVirt 3.6.0 RC3 has been released and GA is targeted for next week, Nov 4th 2015. Please review this bug and, if it is not a blocker, please postpone it to a later release. All bugs not postponed on the GA release will be automatically re-targeted to:
- 3.6.1 if severity >= high
- 4.0 if severity < high

Greg, we currently targeted it to 3.6.1 but it depends on the risk; how risky is this change?

(In reply to Tal Nisan from comment #8)
> Greg, we currently targeted it to 3.6.1 but it depends on the risk; how
> risky is this change?

I'd consider this relatively low risk. The main live merge flow has been updated, but the change is to call a well-baked vds command to verify the deletion. Testing for regressions should be straightforward.

(In reply to Yaniv Dary from comment #6)
> Can you please add steps to correct the db in case a user hits this?

The steps here are rather intricate and involve updating rows in multiple tables as well as altering the vm configuration text stored in the snapshot record. I'm not sure yet if there is a way to trigger an update to the vm configuration or if it must be done by hand. Regardless, it may be safer to override the vdsm deleteVolume verb return value, re-run Live Merge, and let the Live Merge flow update the records itself. Thoughts on this alternative?
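To illustrate the shape of the engine-side change described above, here is a rough sketch of the idea only; it is not the actual engine code (the real change is in the Java RemoveSnapshotSingleDiskLiveCommand chain) and every name below is hypothetical:

    # Illustrative sketch only; the real fix is in the engine's Java command chain,
    # and all names here are hypothetical.
    def destroy_image_step(vds, sd_uuid, img_uuid, vol_uuids):
        # Before (re)issuing deleteVolume, verify whether the volume is already gone,
        # e.g. because vdsm completed the deletion but died before reporting the task status.
        if not vds.volume_exists(sd_uuid, img_uuid, vol_uuids):   # hypothetical verification call
            return "already_deleted"   # record success instead of failing the merge
        vds.delete_volume(sd_uuid, img_uuid, vol_uuids)           # hypothetical delete call
        return "deleted"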
(In reply to Greg Padgett from comment #9)
> The steps here are rather intricate and involve updating rows in multiple
> tables as well as altering the vm configuration text stored in the snapshot
> record. I'm not sure yet if there is a way to trigger an update to the vm
> configuration or if it must be done by hand. Regardless, it may be safer to
> override the vdsm deleteVolume verb return value, re-run Live Merge, and let
> the Live Merge flow update the records itself. Thoughts on this alternative?

I don't have knowledge of other steps. Tal?

Seems better and less intrusive than the manual update; I think it can be backported.

Bug tickets must have version flags set prior to targeting them to a release. Please ask the maintainer to set the correct version flags and only then set the target milestone.

Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

(In reply to Yaniv Dary from comment #10)
> (In reply to Greg Padgett from comment #9)
> [trimmed]
> > Regardless, it may be safer to override the vdsm deleteVolume verb return
> > value, re-run Live Merge, and let the Live Merge flow update the records
> > itself. Thoughts on this alternative?
>
> I don't have knowledge of other steps. Tal?

Hello, could you please tell me how to override the vdsm deleteVolume verb return value?
(In reply to boruvka.michal from comment #14)
> Hello, could you please tell me how to override the vdsm deleteVolume verb
> return value?

Hi Michal,

Those were instructions to our QA team to attempt to reproduce the bug in order to see if this fix works. What exactly are you trying to achieve?

(In reply to Allon Mureinik from comment #15)
> Those were instructions to our QA team to attempt to reproduce the bug in
> order to see if this fix works.
> What exactly are you trying to achieve?

I have a VM snapshot with a disk in illegal state and I can't delete it. The disk is in illegal state after a Live Merge operation. Could you give me some advice on how to delete this snapshot?

(In reply to boruvka.michal from comment #16)
> [trimmed]
> I have a VM snapshot with a disk in illegal state and I can't delete it. The
> disk is in illegal state after a Live Merge operation. Could you give me some
> advice on how to delete this snapshot?

Hi Michal,

Are the symptoms the same as this bug: even after stopping/re-starting the vm, live merge fails during the removal phase (the engine log indicates "DESTROY_IMAGE") and the error on the vdsm side is something like "image file not found"?
If not, please check the engine/vdsm/libvirt logs for the root cause of the failure. However, if this is the case and you don't want to wait for the fix, then overriding the vdsm return value may be the easiest way to resolve the situation. To do this:

1. Note that you should not perform any volume/storage operations other than the required live merge while this change is in effect, as it may cause inconsistencies within the engine database and/or possible corruption down the road.
2. On the host running the VM, edit /usr/share/vdsm/storage/sp.py and search for the line starting with "def deleteVolume" (preceded by 8 spaces). On the very next line, add the following line:
   return
   Note that the word "return" is preceded by 8 spaces (no tabs).
3. Restart vdsm on the host: service vdsmd restart
4. Wait for the engine to discover the host is back up (possibly re-electing the SPM if one is needed)
5. Perform the merge
6. Undo the change in step 2
7. Restart vdsm as in step 3

(In reply to Greg Padgett from comment #17)
> [trimmed]
> 2. On the host running the VM, edit /usr/share/vdsm/storage/sp.py and search
> for the line starting with "def deleteVolume" (preceded by 8 spaces). On the
> very next line, add the following line:
>    return

Correction: the deletion is done on the SPM host, not the host running the VM. Please perform the steps on your SPM host, making sure this host remains SPM throughout the operation (re-elect it if necessary).
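For illustration, a minimal sketch of what the relevant region of sp.py might look like after the step 2 edit above; the class name and method signature are placeholders (assumptions), not verbatim vdsm code, and the only real change is the added bare "return":

    # Sketch only -- class name and signature are placeholders, not verbatim vdsm code.
    class StoragePool(object):
        def deleteVolume(self, *args, **kwargs):   # placeholder signature
            return   # added line, indented with spaces (no tabs): deletion becomes a no-op
            # ... the original method body below is now unreachable ...

Per step 1 and the correction above, apply this only on the SPM host, perform no other volume/storage operations while it is in effect, and revert it (steps 6-7) as soon as the merge completes.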
(In reply to Greg Padgett from comment #2)
> Verification:
>
> This one is a bit tricky because the problem only occurs if vdsm dies at a
> very specific point in time. Vdsm must be killed right after the image
> deletion occurs but before the task status is saved.
> [trimmed]

Ran the verification with the following code:
-------------------------------------------------
vdsm-4.17.19-0.el7ev.noarch
rhevm-3.6.3-0.1.el6.noarch

Ran the test using the following scenario:
------------------------------------------------
Used the scenario as described in comment #2 above. After starting vdsm in step 7, the snapshot was automatically deleted in the cleanup; I did not have to delete it again.

Moving to VERIFIED!