Bug 1111520
Summary: | [RFE] change the behaviour of soft-fencing when "required" networks are missing on a single node cluster | | |
---|---|---|---
Product: | [Retired] oVirt | Reporter: | Sven Kieske <s.kieske> |
Component: | ovirt-engine-core | Assignee: | bugs <bugs> |
Status: | CLOSED NOTABUG | QA Contact: | Pavel Stehlik <pstehlik> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | | |
Version: | 3.5 | CC: | acathrow, gklein, iheim, ofrenkel, oourfali, s.kieske, yeylon |
Target Milestone: | --- | Keywords: | FutureFeature, Triaged |
Target Release: | 3.5.0 | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Enhancement |
Doc Text: | | Story Points: | --- |
Clone Of: | | | |
: | 1112933 | Environment: | |
Last Closed: | 2014-06-25 05:12:12 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 1112933 | | |
Description
Sven Kieske
2014-06-20 08:39:42 UTC
PS: I got my understanding of how the fencing process works from: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.3/html/Administration_Guide/Virtual_Machine_Networks_and_Optional_Networks.html If this is wrong, please open a bug on the RHEV docs (I can't do that myself as I have no running subscription; by the way, this limits your access to user-reported bugs in the official docs, since we can't help improve them without a subscription). Additionally, the docs are somewhat unclear about what it means that the VMs get "fenced": do they get soft-fenced via an ACPI call to ovirt-guest-agent, or do they get hard-fenced (powered off)?

Please attach engine && vdsm logs.

As far as I know the information in the documentation is wrong. There isn't supposed to be any "fencing" of any kind in such a use case. If the host is responsive (the management network is up and the host is responding) and another network isn't up, then the host will become non-operational. That should trigger a migration of all the VMs on this host. However, if there is no other "Up" host in the cluster, the migration will not happen and the VMs will keep running on the host. Can you attach logs?

Sorry for keeping you waiting; I'll attach the logs today, I just need to tar them together and anonymize them.

I'd like to provide some logs in private, as they contain sensitive information which may be important for debugging this problem and thus can't be removed from the logs. Is this possible?

I submitted the logs in private to Omer Frenkel. If anything is still missing, just ping me via IRC, mail or BZ.

Thanks Sven! After looking at the logs, and verifying with Sven online on IRC, we see that there was first a kernel panic on the host, which caused it to reboot; when the host came back to Up, the network was missing (and of course all the VMs were gone), so the host moved to non-operational. So the VMs going down was not related to the host moving to non-operational.

Maybe this bug needs to be moved to documentation, because I find this sentence confusing - from the link supplied in comment 1: "When a required network becomes non-operational, the virtual machines running on the network are fenced and migrated to another host." Saying "fenced and migrated" is somewhat contradictory; it should be changed to explain that all VMs on the host will migrate (not only the VMs that use the network), and that this also depends on the cluster policy (migration can be none/HA/All).

I've opened a documentation bug, and I'm closing this one.
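For reference, below is a minimal sketch of how the behaviour described above (a host going non-operational and the cluster's migrate-on-error policy) can be inspected from a script. It assumes the Python oVirt SDK (ovirtsdk4), which postdates the 3.5 release discussed in this bug, and uses a hypothetical engine URL and placeholder credentials:

```python
# Sketch only: list non-operational hosts and each cluster's migrate-on-error
# policy using the Python oVirt SDK (ovirtsdk4). The engine URL and
# credentials below are placeholders, not values taken from this bug.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # hypothetical engine
    username='admin@internal',
    password='secret',
    insecure=True,  # skip TLS verification only for this sketch
)
system = connection.system_service()

# A host whose required network is down stays responsive but is marked
# NON_OPERATIONAL; its VMs keep running if there is no other 'Up' host
# in the cluster to migrate them to.
for host in system.hosts_service().list():
    if host.status == types.HostStatus.NON_OPERATIONAL:
        print('non-operational host:', host.name)

# The cluster's error-handling policy decides which VMs are migrated away
# from a non-operational host: DO_NOT_MIGRATE, MIGRATE_HIGHLY_AVAILABLE
# (HA VMs only) or MIGRATE (all VMs).
for cluster in system.clusters_service().list():
    on_error = cluster.error_handling.on_error if cluster.error_handling else None
    print('cluster', cluster.name, 'migrate-on-error:', on_error)

connection.close()
```

The three MigrateOnError values shown in the sketch correspond to the none/HA/All migration policy mentioned in the closing comment.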