Bug 1248119 - Migrating VM corrupted when hypervisor was fenced during attempt to put host into Maintenance mode
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nobody
QA Contact:
URL:
Whiteboard: virt
Depends On: migration_improvements
Blocks:
 
Reported: 2015-07-29 15:57 UTC by Robert McSwain
Modified: 2023-09-14 03:02 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-09-17 10:06:51 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:



Description Robert McSwain 2015-07-29 15:57:34 UTC
Version: 
rhevm-3.5.3.1-1.4.el6ev.noarch

Description of problem:
Hypervisors were put into maintenance mode in order to upgrade them, and the running VMs were live migrated to other hosts. The first two hypervisor upgrades ran successfully, but on the third the live migration hung on the last two VMs still running on it. RHEV fenced this hypervisor because of its non-responsive state while the two machines were still live migrating away from it. After it was fenced and back up, the filesystem of the VM yeager was totally corrupted and we had to do a full restore of it.

Why would a hypervisor be fenced while VMs are still migrating away from it, and why can't the engine handle that? It's not the first time we have had corrupted VM filesystems after updating/live migrating.

How reproducible:
Unknown in practice

Steps to Reproduce:
1. Have power management enabled and put hosts into maintenance mode to initiate migrations (see the sketch after these steps)
2. Observe the hosts go into a non-responsive state and RHEV-M fence the host mid-migration
3. Watch the migrations fail due to the fencing, and observe VM corruption on the next VM boot.
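
For step 1, putting a host into maintenance can also be driven through the RHEV-M REST API. Below is a minimal Python sketch, assuming the 3.5-era XML API; the engine URL, credentials and use of the requests library are illustrative placeholders, not values from this report:

    # Hypothetical sketch only: ENGINE, AUTH and the host name are placeholders,
    # and the XML element names assume the RHEV 3.5 REST API.
    import requests
    import xml.etree.ElementTree as ET

    ENGINE = "https://rhevm.example.com/ovirt-engine/api"   # placeholder
    AUTH = ("admin@internal", "secret")                     # placeholder

    def move_to_maintenance(host_name):
        # Find the host by name in the hosts collection.
        resp = requests.get(ENGINE + "/hosts", auth=AUTH, verify=False,
                            headers={"Accept": "application/xml"})
        resp.raise_for_status()
        for host in ET.fromstring(resp.content).findall("host"):
            if host.findtext("name") == host_name:
                # The deactivate action asks the engine to migrate all
                # running VMs away and then put the host into Maintenance.
                act = requests.post(
                    ENGINE + "/hosts/" + host.get("id") + "/deactivate",
                    auth=AUTH, verify=False, data="<action/>",
                    headers={"Content-Type": "application/xml"})
                act.raise_for_status()
                return
        raise LookupError("host not found: " + host_name)

    move_to_maintenance("apollo")

The deactivate action is what makes the engine live migrate all running VMs off the host before it reaches Maintenance, which is the window in which the fencing described above occurred.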

Actual results:
The VM was corrupted and had to be restored from backup due to the fencing.

Expected results:
The logic for fencing a host should allow VMs to finish migrating before the host is hard powered off.

Additional info:
Customer data will be in the first comment.

Comment 11 Michal Skrivanek 2015-08-07 11:31:18 UTC
comments from virt

23.7. engine log:
18:11:38 incoming migration of yageri machine into apollo
19:10:09 migration starts from apollo -> armstrong
19:34:42 migration fails MigFrom -> Up, rerun migration to republici
19:48:57 apollo not responding
19:49:03 set to Unknown, fencing starts
host rebooted
20:38:20 start yeageri because it is HA

There are a lot of migrations going on all the time, likely due to Move to Maintenance (though there are plenty of migrations even days before the event).

The fencing settings should be changed as suggested earlier, but a more thorough review of the migration parameters is recommended.


I'm moving the bug back to infra as I think we should fix the default behavior: I think disabling fencing during a request for maintenance should be automatic (or make skipping it the default when storage is active).
MoveToMaintenance will always trigger migrations, which often overload the network and cause temporary unresponsiveness.

Comment 12 Oved Ourfali 2015-08-07 17:27:34 UTC
Skipping fencing when storage is active isn't the right thing to do, as it will harm HA VMs.

Michal - it is also worth considering limiting the network bandwidth if such a case is common.

So as far as I can see, the only option on the infra side is indeed to prevent fencing when the host is preparing for maintenance.

Severity here is due to the split brain. 
Reducing it to high.

Comment 13 Oved Ourfali 2015-08-09 09:28:56 UTC
After discussing this with Barak: the idea of disabling fencing when preparing for maintenance can cause other discrepancies, as the host will be moved to non-responsive.
Also, it seems that skipping fencing when a storage connection exists IS the default for new 3.5 clusters.

Robert - is this a new 3.5 cluster or an upgraded one?

Comment 14 Oved Ourfali 2015-08-09 09:30:54 UTC
(Well, it's worth just checking with them whether the check-box that disables fencing if the SD is active is checked in the UI.)
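
For reference, the check-box state can also be read over the REST API instead of the UI. A minimal Python sketch, assuming the 3.5 fencing-policy element names (fencing_policy/skip_if_sd_active); the engine URL and credentials are placeholders:

    # Hedged sketch: list each cluster's "skip fencing if SD active" setting.
    # Element paths assume the fencing policy feature added in 3.5.
    import requests
    import xml.etree.ElementTree as ET

    ENGINE = "https://rhevm.example.com/ovirt-engine/api"   # placeholder
    AUTH = ("admin@internal", "secret")                     # placeholder

    resp = requests.get(ENGINE + "/clusters", auth=AUTH, verify=False,
                        headers={"Accept": "application/xml"})
    resp.raise_for_status()
    for cluster in ET.fromstring(resp.content).findall("cluster"):
        # The UI check-box maps to fencing_policy/skip_if_sd_active/enabled.
        flag = cluster.findtext("fencing_policy/skip_if_sd_active/enabled")
        print("%s: skip_if_sd_active=%s" % (cluster.findtext("name"), flag))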

Comment 15 Barak 2015-08-09 12:36:40 UTC
Implementing the title as is (not allowing fencing while a host is in "preparing for maintenance") is problematic, as it assumes the cause for fencing is always network saturation. Network saturation may be only one of the reasons; there can be a real problem with the network, and skipping fencing would then leave hosts in non-responsive status forever without recovering automatically.

Dealing with such use cases is done by a proper design of the system, meaning:
- Separation of networks (VM, migration and storage) and capping each network's
  bandwidth according to its usage pattern and the bandwidth available.
- Use of the flag "skip fencing if host maintains storage lease" (the default
  for new 3.5 clusters; it can be set through Edit Cluster on an upgraded
  cluster, see the sketch after this list). This by itself achieves the desired
  behavior (no fencing if the host is alive and running or migrating VMs).
- This flag ("skip fencing ...") will not solve the problem where there is no
  separation between the migration and storage networks, because in those
  cases the network saturation also affects storage connectivity.
- For those cases there is no other way but to cap network activity per role
  (even when using the same network); this falls under the SLA group.
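
As referenced in the second bullet above, here is a minimal Python sketch for enabling the flag on an upgraded cluster over the REST API, the equivalent of ticking it in Edit Cluster. The cluster ID, engine URL and credentials are placeholders, and the fencing_policy XML shape assumes the 3.5 API:

    # Hedged sketch: enable "skip fencing if host maintains storage lease"
    # on an existing cluster. All identifiers below are placeholders.
    import requests

    ENGINE = "https://rhevm.example.com/ovirt-engine/api"   # placeholder
    AUTH = ("admin@internal", "secret")                     # placeholder
    CLUSTER_ID = "00000000-0000-0000-0000-000000000000"     # placeholder

    BODY = ("<cluster><fencing_policy>"
            "<skip_if_sd_active><enabled>true</enabled></skip_if_sd_active>"
            "</fencing_policy></cluster>")

    resp = requests.put(ENGINE + "/clusters/" + CLUSTER_ID,
                        auth=AUTH, verify=False, data=BODY,
                        headers={"Content-Type": "application/xml"})
    resp.raise_for_status()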

Comment 18 Michal Skrivanek 2015-08-10 09:03:02 UTC
The only thing virt can do is to improve migration capping (e.g. today we cap outgoing migrations at 32 MB/s, without any limit on incoming). Improving migration convergence (dynamic bandwidth allocation, a different convergence algorithm) could also have helped, as those VMs were migrating for 20+ minutes.

However, I believe infra does need to improve the detection of non-responsiveness. There is no way to avoid overload in general; we can only try to avoid it, and without physical separation of the management network it's never 100%.
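
For context on the capping mentioned above: on the hypervisor side, vdsm takes its outgoing cap from migration_max_bandwidth in /etc/vdsm/vdsm.conf. A minimal Python sketch to inspect it, assuming a stock 3.5-era vdsm layout; treat the path, section and key as illustrative:

    # Hedged sketch: show the per-host outgoing migration bandwidth cap.
    # vdsm reads migration_max_bandwidth (in MiB/s) from the [vars] section;
    # an absent key means the built-in default applies.
    try:
        from configparser import ConfigParser    # Python 3
    except ImportError:
        from ConfigParser import ConfigParser    # Python 2 (RHEL 6 era)

    conf = ConfigParser()
    conf.read("/etc/vdsm/vdsm.conf")
    if conf.has_option("vars", "migration_max_bandwidth"):
        print("migration_max_bandwidth = %s MiB/s"
              % conf.get("vars", "migration_max_bandwidth"))
    else:
        print("migration_max_bandwidth unset; vdsm built-in default applies")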

Comment 19 Oved Ourfali 2015-08-10 09:05:49 UTC
(In reply to Michal Skrivanek from comment #18)
> The only thing virt can do is to improve migration capping (e.g. today we
> cap outgoing migrations at 32 MB/s, without any limit on incoming).
> Improving migration convergence (dynamic bandwidth allocation, a different
> convergence algorithm) could also have helped, as those VMs were migrating
> for 20+ minutes.
> 
> However, I believe infra does need to improve the detection of
> non-responsiveness. There is no way to avoid overload in general; we can
> only try to avoid it, and without physical separation of the management
> network it's never 100%.

And we made those improvements when we added the fencing policy in 3.5, supporting identification of the percentage of non-responsive hosts, and of storage connectivity.
One can set these up as suitable for one's environment, or in maintenance use cases even disable fencing entirely.
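
The "percentage of non-responsive hosts" policy mentioned here maps to skip_if_connectivity_broken plus a threshold in the 3.5 REST API. A minimal Python sketch in the same style as the earlier ones, with the same placeholders:

    # Hedged sketch: skip fencing while more than half the cluster's hosts
    # have lost connectivity, which points at a network problem rather than
    # a single dead host. All identifiers below are placeholders.
    import requests

    ENGINE = "https://rhevm.example.com/ovirt-engine/api"   # placeholder
    AUTH = ("admin@internal", "secret")                     # placeholder
    CLUSTER_ID = "00000000-0000-0000-0000-000000000000"     # placeholder

    BODY = ("<cluster><fencing_policy>"
            "<skip_if_connectivity_broken>"
            "<enabled>true</enabled><threshold>50</threshold>"
            "</skip_if_connectivity_broken>"
            "</fencing_policy></cluster>")

    resp = requests.put(ENGINE + "/clusters/" + CLUSTER_ID,
                        auth=AUTH, verify=False, data=BODY,
                        headers={"Content-Type": "application/xml"})
    resp.raise_for_status()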

Comment 20 Michal Skrivanek 2015-08-11 11:56:18 UTC
Reducing priority, as the core issue is incorrect fencing settings and/or setup issues. We have options to change the false-positive fencing, and we have the migration network to isolate the management communication.

We are working on improving migration behavior in the next release, tracked in bug 1252426.

Comment 21 Moran Goldboim 2015-09-17 10:06:51 UTC
(In reply to Michal Skrivanek from comment #20)
> Reducing priority, as the core issue is incorrect fencing settings and/or
> setup issues. We have options to change the false-positive fencing, and we
> have the migration network to isolate the management communication.
> 
> We are working on improving migration behavior in the next release, tracked
> in bug 1252426.

It seems that a combination of the fencing configuration, a dedicated migration network and the migration optimizations in 4.0 (bug 1252426), along with the ones which will be introduced in 3.6 (bugs 1109154 and 867453), will prevent this issue from occurring.

Comment 22 Red Hat Bugzilla 2023-09-14 03:02:44 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

