Bug 952107
| Summary: | Under certain circumstances live storage migration failure leaves images split among old and new storage as well as tasks running in database |
|---|---|
| Product: | Red Hat Enterprise Virtualization Manager |
| Component: | ovirt-engine |
| Version: | 3.1.3 |
| Target Release: | 3.3.0 |
| Status: | CLOSED ERRATA |
| Type: | Bug |
| Severity: | high |
| Priority: | high |
| Hardware: | All |
| OS: | Linux |
| Whiteboard: | storage |
| Doc Type: | Bug Fix |
| Reporter: | Tomas Dosek <tdosek> |
| Assignee: | Federico Simoncelli <fsimonce> |
| QA Contact: | Meital Bourvine <mbourvin> |
| CC: | abaron, acanan, acathrow, amureini, derez, fsimonce, iheim, jkt, laravot, lpeer, lyarwood, nlevinki, Rhev-m-bugs, scohen, yeylon |
| Flags: | abaron: Triaged+ |
| oVirt Team: | Storage |
| Bug Blocks: | 1019461 |
| Last Closed: | 2014-01-21 17:17:24 UTC |
Created attachment 735823 [details]
DB dump

---

How do you block the connection of hosts to storage (exactly)? Do you reconnect, and if so, when and how? How many hosts in this scenario? Perfect timing is needed to do what? Please add some hint as to how to nail this.

---

(In reply to comment #4)

> How do you block connection of hosts to storage (exactly)?

If you use iSCSI, you can simply do this with iptables drop rules on the hosts.

> Do you reconnect, if so when and how?

I did, after the VDSM timeout (300 seconds); if you want to be safe, do so after 10 minutes.

> How many hosts in this scenario?

Just 2 should be enough.

> Perfect timing is needed to do what? Please add some hint as to how to nail this.

You need to apply the iptables rules at the very moment when the async tasks have already started on the hosts, ideally when one of these tasks has already finished (i.e., one image has already been moved and the other ones are still in the queue).

---

Tomas, did you wait for the SPM to come back? You will not be able to roll back without it. What I'm trying to understand is what happens after the SPM comes back (does it sort itself out?).

---

Yes, all hosts were up after this action, including the SPM host. The network outage was really short: I just pulled the cables out and put them back in right after that. The rollback didn't trigger, and the images were left spread among the two storage domains.

---

Created attachment 745723 [details]
vdsm logs
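The "iptables drop rules" reproduction step described in the thread above can be sketched as below. This is a minimal sketch, not the reporter's exact commands: it assumes storage traffic uses the default iSCSI port (TCP 3260), and the functions only print the rules (a dry run) rather than applying them, so the simulated outage can be reviewed before it is triggered on a host.

```shell
#!/bin/sh
# Sketch of the "block connection of hosts to storage" step for iSCSI.
# Assumption: storage traffic uses the default iSCSI port, TCP 3260.
# The functions print the iptables rules instead of applying them (dry run);
# pipe the output to `sh` on each host to actually trigger the outage.

block_iscsi_rules() {
  echo "iptables -A OUTPUT -p tcp --dport 3260 -j DROP"
  echo "iptables -A INPUT -p tcp --sport 3260 -j DROP"
}

unblock_iscsi_rules() {
  # Reverse of the above, to restore connectivity afterwards.
  echo "iptables -D OUTPUT -p tcp --dport 3260 -j DROP"
  echo "iptables -D INPUT -p tcp --sport 3260 -j DROP"
}

block_iscsi_rules
```

Per the timing described above, the rules would be applied only once the async tasks have started on the hosts (ideally after one image has already been moved), and removed again after the VDSM timeout (300 seconds), or after 10 minutes to be safe.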
Created attachment 746049 [details]
VDSM + engine logs

---

Moving to ON_QA based on comment 29; probably does not reproduce.

---

Verified on vdsm-4.12.0-61.git8178ec2.el6ev.x86_64.

---

This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise, to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as "the bug doesn't present anymore.")

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug. For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes

Thanks in advance.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html
Created attachment 735811 [details]
Logcollector

Description of problem:
If live storage migration fails, it leaves images split between the old storage and the new one. The relevant tasks are never finished in the database, and all entities remain locked.

Version-Release number of selected component (if applicable):
3.1.3

How reproducible:
60% (perfect timing is needed)

Steps to Reproduce:
1. Have a VM with 2 disks and 6 snapshots on them; each disk is about 50 GB in size.
2. Perform live storage migration.
3. Block the connection of the hosts to the storage.

Actual results:
Storage migration fails and leaves half of the images on the old storage; the task keeps the VM locked in the database.

Expected results:
A rollback should be performed.

Additional info:
Attaching logs and DB dump.
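The "tasks are never finished in the database" symptom above can be checked with a query against the engine database. The sketch below only prints the query; the table name `async_tasks` and the column names are assumptions based on typical oVirt engine schemas of this era, and the `psql` connection parameters are placeholders, so verify both against your installation before running it for real.

```shell
#!/bin/sh
# Sketch: inspect leftover async tasks in the engine database after a failed
# live storage migration. Table and column names are assumptions; check your
# engine's actual schema before relying on them.
QUERY="SELECT task_id, action_type, status FROM async_tasks;"

# Connection parameters below are placeholders for a typical engine setup;
# uncomment to run against the database for real:
#   psql -U engine -d engine -c "$QUERY"
echo "$QUERY"
```

Rows lingering in this table after the migration has failed would correspond to the locked VM and the never-finished tasks described in the report.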