Bug 1585455
| Summary: | [downstream clone - 4.2.4] Move disk failed but delete was called on source sd, losing all the data | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | RHV bug bot <rhv-bugzilla-bot> |
| Component: | ovirt-engine | Assignee: | Benny Zlotnik <bzlotnik> |
| Status: | CLOSED ERRATA | QA Contact: | Kevin Alon Goldblatt <kgoldbla> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.1.9 | CC: | bzlotnik, ebenahar, frolland, gveitmic, lsurette, lveyde, rbalakri, Rhev-m-bugs, srevivo, tnisan, ykaul |
| Target Milestone: | ovirt-4.2.4 | Keywords: | Performance, Reopened, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-engine-4.2.4.1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1574346 | Environment: | |
| Last Closed: | 2018-06-27 10:02:42 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1574346 | | |
| Bug Blocks: | | | |

Description

RHV bug bot 2018-06-03 07:08:59 UTC

2 problems?
- MoveImageGroupCommand did not finish successfully?
- a deadlock somewhere?

(Originally by Germano Veit Michel)

Freddy, you fixed the same bug 1538840 for 4.2, how complicated will it be to backport to 4.1?

(Originally by Tal Nisan)

Folks, I might be missing something, but IMO bug 1538840 has nothing to do with this.

bug 1538840:
- only happens via UI (c#11)
- user triggers 2 simultaneous moves for the same disk
- second move deletes the source img
- move/copy commands are sent to the vds

this bug:
- via sdk
- there are no 2 simultaneous moves for the same disk
- there are several simultaneous disk moves, which look deadlocked (DB)
- first move deletes the source img
- move/copy commands never sent to the vds

And looking at the logs from both BZs, they are totally different.

https://gerrit.ovirt.org/#/c/87315/ adds a check to not run a move twice; there is no moving twice here that I can see. Each disk was moved once, by SDK.

I would love to be proven wrong, but I am afraid this bug is still in the wild and it can happen again. We have a customer waiting on this to move several hundred disks; we must be 100% sure this is fixed. Reopening.

(Originally by Germano Veit Michel)

The root cause seems to be the same: the end action of the move command wrongly triggers the delete of the source image. But Freddy, let's double-check indeed.

(Originally by Tal Nisan)

It looks like a different issue. Benny and I started to look at it. There are two things to handle here:
- The parent command should not be considered successful if the child command failed.
- Investigate the root cause of the timeout in the transaction. (Deadlock?)

(Originally by Fred Rolland)

Germano, any chance of a thread dump from the time when the move waited? I'm wondering why the vds call didn't time out after 3 minutes.

(Originally by Roy Golan)

(In reply to Roy Golan from comment #13)
> Germano any chance for a thread dump from the time where the move waited?
> I'm wondering why the vds call didn't get timeout after 3 minutes

Hi Roy,

Nope. It's quite rare for us to get a chance to spot these "in the wild" and request a thread dump while it is still happening.

Because we are quite unlikely to get thread dumps unless the problem can be reproduced, maybe there could be a way to generate them automatically when a thread is blocked for more than X time, similar to what the kernel does?

(Originally by Germano Veit Michel)

Verified with the following code:
--------------------------------------
ovirt-engine-4.2.4.4-0.1.el7_3.noarch
vdsm-4.20.31-1.el7ev.x86_64

Verified with the following scenario:
--------------------------------------
1. Ran a cold move of an iSCSI disk to another iSCSI domain, which failed due to error injection >>>> the target is successfully cleaned up and the source is no longer removed as before
2. Ran the LSM again and this time it completed successfully

Moving to VERIFIED!

(In reply to Kevin Alon Goldblatt from comment #21)
> Verified with the following code:
> --------------------------------------
> ovirt-engine-4.2.4.4-0.1.el7_3.noarch
> vdsm-4.20.31-1.el7ev.x86_64
>
> Verified with the following scenario:
> --------------------------------------
> 1. Ran a cold move of iscsi disk to another iscsi domain which failed due to
> error injection >>>> the target is successfully clean up and the source is
> no longer removed as before
> 2. Ran the LSM again and this time is completed successfully

CORRECTION - Ran the COLD MOVE again and this time it completed successfully

> Moving to VERIFIED!

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2071

BZ<2>Jira Resync
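
For readers following the thread, the first item in Fred Rolland's list (the parent command must not be treated as successful when its child failed) can be illustrated with a minimal sketch. This is not the ovirt-engine code or the actual fix: `Storage`, `copy_image`, `move_disk` and the `inject_error` flag are hypothetical names invented for illustration, and the "storage domains" are plain in-memory sets. It only shows the intended behaviour from the verification scenario: on a failed copy the target is cleaned up and the source is kept, and the source is deleted only after the copy succeeds.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Storage:
    """Hypothetical stand-in for a source and a target storage domain."""
    source: set = field(default_factory=set)
    target: set = field(default_factory=set)


def copy_image(storage, disk_id, inject_error=False):
    """Hypothetical child command: copy the image to the target domain."""
    if inject_error:
        # Simulate a failure part-way through, leaving a partial target image.
        storage.target.add(disk_id + ".partial")
        return Status.FAILED
    storage.target.add(disk_id)
    return Status.SUCCEEDED


def move_disk(storage, disk_id, inject_error=False):
    """Hypothetical parent command: its status follows the child's status."""
    child_status = copy_image(storage, disk_id, inject_error)
    if child_status is not Status.SUCCEEDED:
        # Child failed: clean up the target and leave the source untouched.
        storage.target.discard(disk_id + ".partial")
        return Status.FAILED
    # Only a successful copy allows the source image to be removed.
    storage.source.discard(disk_id)
    return Status.SUCCEEDED
```

Calling `move_disk(storage, "disk-1", inject_error=True)` followed by `move_disk(storage, "disk-1")` mirrors the two verification steps: a failed cold move that keeps the source, then a retry that completes.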