
Bug 952107

Summary: Under certain circumstances, live storage migration failure leaves images split between old and new storage, as well as tasks left running in the database
Product: Red Hat Enterprise Virtualization Manager
Reporter: Tomas Dosek <tdosek>
Component: ovirt-engine
Assignee: Federico Simoncelli <fsimonce>
Status: CLOSED ERRATA
QA Contact: Meital Bourvine <mbourvin>
Severity: high
Docs Contact:
Priority: high
Version: 3.1.3
CC: abaron, acanan, acathrow, amureini, derez, fsimonce, iheim, jkt, laravot, lpeer, lyarwood, nlevinki, Rhev-m-bugs, scohen, yeylon
Target Milestone: ---
Flags: abaron: Triaged+
Target Release: 3.3.0   
Hardware: All   
OS: Linux   
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-21 17:17:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1019461    
Attachments:
Logcollector (flags: none)
DB dump (flags: none)
vdsm logs (flags: none)
VDSM+engine logs (flags: none)

Description Tomas Dosek 2013-04-15 08:24:20 UTC
Created attachment 735811 [details]
Logcollector

Description of problem:

If live storage migration fails, it leaves images split between the old storage and the new one. The relevant tasks are never finished in the database and all entities remain locked.

Version-Release number of selected component (if applicable):
3.1.3

How reproducible:
60 % - perfect timing is needed

Steps to Reproduce:
1. Have a VM with 2 disks and 6 snapshots on them; each disk is about 50 GB in size
2. Perform live storage migration
3. Block the hosts' connection to storage
  
Actual results:
Live storage migration fails and leaves half of the images on the old storage; a task keeps the VM locked in the database.

Expected results:
rollback should be performed

Additional info:
Attaching logs and db dump

Comment 1 Tomas Dosek 2013-04-15 08:29:10 UTC
Created attachment 735823 [details]
DB dump

Comment 4 Vered Volansky 2013-04-23 07:31:48 UTC
How do you block connection of hosts to storage (exactly)?
Do you reconnect, if so when and how?
How many hosts in this scenario?
Perfect timing is needed to do what? Please add some hint as to how to nail this.

Comment 5 Tomas Dosek 2013-04-23 07:48:42 UTC
(In reply to comment #4)
> How do you block connection of hosts to storage (exactly)?

If you use iSCSI, you can simply do this using iptables DROP rules on the hosts.

> Do you reconnect, if so when and how?

I did, after the vdsm timeouts (300 seconds); if you want to be safe, do so after 10 minutes.

> How many hosts in this scenario?

Just 2 should be enough

> Perfect timing is needed to do what? Please add some hint as to how to nail
> this.

You need to execute the iptables rules at the very moment when the async tasks have already started on the hosts, ideally when one of these tasks has already finished (i.e. one image was already moved and the other ones are still in the queue).
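
Below is a minimal sketch of that outage window, assuming the iSCSI target uses the default port 3260 and that iptables is available on the hosts; the script name, structure, and timing are illustrative only and not taken from this report:

#!/usr/bin/python
# Illustrative sketch (not part of the original report): cut and restore the
# host's iSCSI storage connection to reproduce the outage window described
# above. Assumptions: default iSCSI port 3260, iptables available, run as
# root on each host whose storage connection should be blocked.
import subprocess
import time

ISCSI_PORT = "3260"

def block_iscsi():
    # Insert a DROP rule for outgoing traffic to the iSCSI target port.
    subprocess.check_call(["iptables", "-I", "OUTPUT", "-p", "tcp",
                           "--dport", ISCSI_PORT, "-j", "DROP"])

def unblock_iscsi():
    # Delete the rule inserted by block_iscsi().
    subprocess.check_call(["iptables", "-D", "OUTPUT", "-p", "tcp",
                           "--dport", ISCSI_PORT, "-j", "DROP"])

if __name__ == "__main__":
    block_iscsi()
    try:
        # Keep the connection down well past the vdsm timeout (300 seconds);
        # 10 minutes, as suggested above, stays on the safe side.
        time.sleep(600)
    finally:
        unblock_iscsi()

Blocking only port 3260 cuts the host off from the iSCSI storage while leaving the engine-to-host connection intact, which should match the scenario above where the async tasks are already running and then fail against the unreachable storage.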

Comment 6 Ayal Baron 2013-05-02 11:26:32 UTC
Tomas, did you wait for the SPM to come back? You will not be able to roll back without it.
What I'm trying to understand is what happens after the SPM comes back (does it sort itself out).

Comment 7 Tomas Dosek 2013-05-02 11:33:05 UTC
Yes, all hosts were up after this action, including the SPM host. The network outage was really short: I just pulled out the cables and put them back in right after that.

The rollback didn't trigger and the images were left spread across the two SDs.

Comment 15 Tomas Dosek 2013-05-09 16:16:33 UTC
Created attachment 745723 [details]
vdsm logs

Comment 16 Tomas Dosek 2013-05-10 10:14:52 UTC
Created attachment 746049 [details]
VDSM+engine logs

Comment 30 Allon Mureinik 2013-07-10 07:27:10 UTC
Moving to ON_QA based on comment 29 - probably does not reproduce.

Comment 31 Meital Bourvine 2013-08-27 12:02:51 UTC
Verified on vdsm-4.12.0-61.git8178ec2.el6ev.x86_64

Comment 32 Charlie 2013-11-28 00:23:22 UTC
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 34 errata-xmlrpc 2014-01-21 17:17:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html