Bug 952107 - Under certain circumstances live storage migration failure leaves images split among old and new storage as well as tasks running in database
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.1.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.3.0
Assigned To: Federico Simoncelli
QA Contact: Meital Bourvine
Whiteboard: storage
Depends On:
Blocks: 1019461
Reported: 2013-04-15 04:24 EDT by Tomas Dosek
Modified: 2016-02-10 15:24 EST
CC List: 15 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-21 12:17:24 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
abaron: Triaged+


Attachments
Logcollector (63.30 MB, application/x-xz), 2013-04-15 04:24 EDT, Tomas Dosek
DB dump (1.14 MB, application/x-xz), 2013-04-15 04:29 EDT, Tomas Dosek
vdsm logs (1.99 MB, application/zip), 2013-05-09 12:16 EDT, Tomas Dosek
VDSM+engine logs (1.31 MB, application/zip), 2013-05-10 06:14 EDT, Tomas Dosek


External Trackers
Red Hat Product Errata RHSA-2014:0038 (normal, SHIPPED_LIVE): Important: Red Hat Enterprise Virtualization Manager 3.3.0 update. Last updated 2014-01-21 17:03:06 EST.

Description Tomas Dosek 2013-04-15 04:24:20 EDT
Created attachment 735811 [details]
Logcollector

Description of problem:

If live storage migration fails, it leaves images split between the old storage and the new one. The relevant tasks are never finished in the database and all entities remain locked.

Version-Release number of selected component (if applicable):
3.1.3

How reproducible:
About 60%; perfect timing is needed.

Steps to Reproduce:
1. Have a VM with 2 disks, each about 50 GB in size, and 6 snapshots on them
2. Perform storage live migration
3. Block connection of hosts to storage
  
Actual results:
Storage migration fails and leaves half of the images on the old storage; the task locks the VM in the database.

Expected results:
A rollback should be performed.

Additional info:
Attaching logs and db dump
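
For anyone triaging a similar state, a minimal sketch of how the stuck tasks can be inspected on the RHEV-M machine. The database name (engine), user, and column names are assumptions based on the 3.x engine schema and may differ on your installation:

    # List async tasks left behind in the engine database (column names
    # are an assumption based on the 3.x schema; adjust as needed)
    psql -U postgres engine -c "SELECT task_id, action_type, status, started_at FROM async_tasks;"

Rows that persist long after the migration has failed correspond to the locked entities described above.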
Comment 1 Tomas Dosek 2013-04-15 04:29:10 EDT
Created attachment 735823 [details]
DB dump
Comment 4 Vered Volansky 2013-04-23 03:31:48 EDT
How do you block connection of hosts to storage (exactly)?
Do you reconnect, if so when and how?
How many hosts in this scenario?
Perfect timing is needed to do what? Please add some hint as to how to nail this.
Comment 5 Tomas Dosek 2013-04-23 03:48:42 EDT
(In reply to comment #4)
> How do you block connection of hosts to storage (exactly)?

If you use iSCSI, you can simply do this using iptables DROP rules on the hosts.

> Do you reconnect, if so when and how?

I did, after the VDSM timeout (300 seconds); if you want to be safe, do so after 10 minutes.
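
As a concrete sketch of the block-and-restore sequence on a host, assuming the storage is iSCSI on its default port 3260 and that no other traffic uses that port:

    # Drop outgoing iSCSI traffic from this host to the storage server
    iptables -A OUTPUT -p tcp --dport 3260 -j DROP

    # ... wait out the VDSM timeout (or ~10 minutes to be safe) ...

    # Delete the same rule to restore connectivity
    iptables -D OUTPUT -p tcp --dport 3260 -j DROP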

> How many hosts in this scenario?

Just 2 should be enough.

> Perfect timing is needed to do what? Please add some hint as to how to nail
> this.

You need to execute the iptables rules at the very moment the async tasks have already started on the hosts, ideally when one of these tasks has already finished (i.e. one image has already been moved and the others are still queued).
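
One way to catch that window is to watch the task statuses on the SPM host and fire the DROP rule as soon as the first task finishes while others are still running; this assumes the vdsClient tool and its getAllTasksStatuses verb from that VDSM generation are available:

    # Refresh the SPM task statuses every second; trigger the iptables
    # DROP rule manually once one task reports finished while others
    # are still running
    watch -n 1 'vdsClient -s 0 getAllTasksStatuses'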
Comment 6 Ayal Baron 2013-05-02 07:26:32 EDT
Tomas, did you wait for the SPM to come back? You will not be able to roll back without it.
What I'm trying to understand is what happens after the SPM comes back (does it sort itself out?).
Comment 7 Tomas Dosek 2013-05-02 07:33:05 EDT
Yes, all hosts were up after this action, including the SPM host; the network outage was really short. I just pulled the cables out and put them back in right after that.

The rollback didn't trigger, and images were left spread across the two SDs.
Comment 15 Tomas Dosek 2013-05-09 12:16:33 EDT
Created attachment 745723 [details]
vdsm logs
Comment 16 Tomas Dosek 2013-05-10 06:14:52 EDT
Created attachment 746049 [details]
VDSM+engine logs
Comment 30 Allon Mureinik 2013-07-10 03:27:10 EDT
Moving to ON_QA based on comment 29; the issue probably does not reproduce anymore.
Comment 31 Meital Bourvine 2013-08-27 08:02:51 EDT
Verified on vdsm-4.12.0-61.git8178ec2.el6ev.x86_64
Comment 32 Charlie 2013-11-27 19:23:22 EST
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise, to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.
Comment 34 errata-xmlrpc 2014-01-21 12:17:24 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html
