Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1112933

Summary: [Docs][Async][Admin] Change documentation describing what happens in case of network issue on hosts
Product: Red Hat Enterprise Virtualization Manager
Reporter: Oved Ourfali <oourfali>
Component: Documentation
Assignee: Lucy Bopf <lbopf>
Status: CLOSED CURRENTRELEASE
QA Contact: Andrew Burden <aburden>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.5.0
CC: gklein, iheim, juwu, ofrenkel, oourfali, rbalakri, s.kieske, yeylon
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1111520
Environment:
Last Closed: 2014-09-19 01:32:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1111520
Bug Blocks:

Description Oved Ourfali 2014-06-25 05:09:33 UTC
+++ This bug was initially created as a clone of Bug #1111520 +++

Description of problem:
This problem occurs on every cluster in ovirt that contains just
one single host (e.g. local storage).

When a logical network gets created, the default is to make it "required"
on the cluster it gets attached to.

If the single host inside the cluster is missing one logical network,
_all_ vms get fenced, because ovirt wants to restart them on another host
in the cluster which has all required networks.

Version-Release number of selected component (if applicable):
latest master

How reproducible:
always

Steps to Reproduce:
1. create logical networks for each vm in a single node cluster
2. remove one of the networks from the host
3. watch all vms shutdown

Actual results:
all vms get fenced

Expected results:
here are some superior alternatives (which also can be combined):

a) fence just the vms which have the missing required network attached
b) do no fencing at all when just one host is up in the cluster, because
you can't restart the vms on a different host anyway. Make sure _before_ you
fence vms that a suitable host is available which meets all requirements and has enough resources to run those fenced vms (this requires some computation!)
c) do not mark newly created logical networks as "required" by default, without
user interaction.
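
The pre-check proposed in (b) could be sketched roughly as below. This is only an illustration of the idea, not oVirt's scheduler: the `Host` class, its fields, and `find_target_host` are hypothetical names, and free memory stands in for "enough resources".

```python
from dataclasses import dataclass


@dataclass
class Host:
    # Hypothetical model of a cluster host; not an oVirt data structure.
    name: str
    networks: set[str]
    free_memory_mb: int
    up: bool = True


def find_target_host(hosts, source, required_networks, needed_memory_mb):
    """Return the first Up host other than `source` that has every required
    network and enough free memory, or None if no such host exists."""
    for host in hosts:
        if host is source or not host.up:
            continue
        has_networks = required_networks <= host.networks
        has_memory = host.free_memory_mb >= needed_memory_mb
        if has_networks and has_memory:
            return host
    return None


# In a single-host cluster there is never a target, so nothing should be fenced.
single = Host("node1", {"ovirtmgmt"}, 4096)
print(find_target_host([single], single, {"ovirtmgmt", "vmnet1"}, 2048))  # prints None
```

Under this sketch, any action on the vms would be taken only when the function returns a host.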

Additional info:
I doubt the infrastructure level is the correct point where you want to decide
what happens when a given network connection is not available.
ovirt makes this decision by default in the following way:

if the (default) required network is not there, shut down all vms and start them
on a different host.

This makes no sense if vm "a" doesn't have this network attached to any of its interfaces.

It does not even make sense if vm "b" has this network attached to its interface, because you can't tell if this vm really "requires" this network
(it might have other networks attached, which may be enough for this vm).

Again: this is an application-level problem; that is, it depends on the application inside the vm whether a network outage is serious or not.

ovirt does not know anything about the application inside the vm, thus ovirt
can't make assumptions about the importance of a vanished logical network.

There are 3 cases:

1. the app does not need networking at all
2. the app can handle network failure itself (wait for network to come up again, logging, alarming etc.)
3. the app is poorly designed and crashes/does not work anymore

Only in case 3 would ovirt be justified in doing what it does by default.

In my opinion it should be up to the network/ovirt administrator to decide
this per case, rather than making it the default for all cases.

I do not know if I selected the correct ovirt component, please reassign, if applicable.

--- Additional comment from Sven Kieske on 2014-06-20 04:53:19 EDT ---

PS: I got my understanding how the fencing process works from:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.3/html/Administration_Guide/Virtual_Machine_Networks_and_Optional_Networks.html

If this is wrong, please open a bug on the rhev docs. (I can't do that as I have no running subscription. By the way, this limits your access to user-reported bugs in the official docs, as we can't help improve the docs without a subscription.)

Additionally, the docs are somewhat unclear about what it means that the vms get "fenced".

do they get soft-fenced by acpi call to ovirt-guest-agent or do they
get hard-fenced (powered off)?

--- Additional comment from Omer Frenkel on 2014-06-23 03:29:29 EDT ---

Please attach the engine and vdsm logs.

--- Additional comment from Oved Ourfali on 2014-06-23 03:31:48 EDT ---

As far as I know the information in the documentation is wrong.
There isn't supposed to be any "fencing" of any kind in such a use-case.
If the host is responsive (management network is up and the host is responding), and another network isn't up, then the host will become non-operational.
That should trigger a migration of all the VMs on this host.

However, if there is no other "Up" host in the cluster then the migration will not happen, and the VMs will keep on running on the host.
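
The behaviour described in this comment can be summarised in a small sketch. It assumes the host is responsive (management network up), as stated above; `evaluate_host` and its parameters are hypothetical names for illustration and are not taken from the engine's code.

```python
def evaluate_host(required_networks, networks_up_on_host, other_up_hosts):
    """Return (host_status, migrate_vms) for a responsive host.

    required_networks:   networks marked "required" on the cluster
    networks_up_on_host: networks currently operational on this host
    other_up_hosts:      number of other hosts in the cluster with status Up
    """
    missing = set(required_networks) - set(networks_up_on_host)
    if not missing:
        # All required networks are present: nothing changes.
        return "up", False
    # A required network is missing: the host becomes non-operational.
    # Migration is attempted only if another Up host exists; otherwise
    # the vms keep running where they are.
    return "non-operational", other_up_hosts > 0


print(evaluate_host({"ovirtmgmt", "vmnet1"}, {"ovirtmgmt"}, 0))
# prints ('non-operational', False)
```

No branch of this sketch fences or shuts down a vm, which is the point of the comment.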

Can you attach logs?

--- Additional comment from Sven Kieske on 2014-06-23 03:43:17 EDT ---

Sorry for keeping you waiting. I'll attach the logs today; I just need to tar them together and anonymize them.

--- Additional comment from Sven Kieske on 2014-06-23 09:08:08 EDT ---

I'd like to provide some logs in private, as they contain sensitive information
which may be important in debugging this problem and thus can't be deleted from
the logs, is this possible?

--- Additional comment from Sven Kieske on 2014-06-23 09:23:08 EDT ---

I submitted the logs in private to Omer Frenkel.
If anything is still missing, just ping me via IRC, mail or BZ.

--- Additional comment from Omer Frenkel on 2014-06-24 03:39:49 EDT ---

Thanks Sven! After looking at the logs and verifying with Sven online on IRC,
we see that there was first a kernel panic on the host, which caused it to reboot. When the host came back up, the network was missing (and of course so were all the vms), so the host moved to non-operational.

So the vms going down was not related to the host moving to non-operational.

Maybe this bug needs to be moved to documentation, because I find this sentence from the link supplied in comment 1 confusing:
"
When a required network becomes non-operational, the virtual machines running on the network are fenced and migrated to another host.
"
Saying "fenced and migrated" is somewhat contradictory. It should be changed to explain that all VMs on the host will migrate (not only the vms that use the network), and that this also depends on the cluster policy (migration can be none/HA/All).
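
As a rough illustration of the none/HA/All policy mentioned here: the sketch below assumes that "HA" means only vms flagged as highly available are migrated. `MigrationPolicy` and `vms_to_migrate` are hypothetical names, not engine identifiers.

```python
from enum import Enum


class MigrationPolicy(Enum):
    # Hypothetical names for the three cluster policy values.
    NONE = "none"
    HA = "ha"
    ALL = "all"


def vms_to_migrate(vms, policy):
    """Select vms to migrate off a non-operational host.

    vms is a list of (name, highly_available) tuples. Which networks a vm
    uses is deliberately not an input: selection is per host, not per network.
    """
    if policy is MigrationPolicy.NONE:
        return []
    if policy is MigrationPolicy.HA:
        return [name for name, highly_available in vms if highly_available]
    return [name for name, _ in vms]


vms = [("db", True), ("web", False)]
print(vms_to_migrate(vms, MigrationPolicy.HA))  # prints ['db']
```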

Comment 1 Sven Kieske 2014-06-25 07:07:31 UTC
Please keep in mind to update not only the rhev 3.5 docs but also the
3.3 and, I guess, 3.4 docs, as the incorrect documentation also occurs there.

thanks!

Comment 2 Lucy Bopf 2014-07-09 00:53:06 UTC
The section "Required Networks, Optional Networks, and Virtual Machine Networks" (topic 10994) has been updated to reflect the changes requested above.

"When a required network becomes non-operational, the virtual machines running on the network are fenced and migrated to another host. This is beneficial if you have machines running mission critical workloads."

has been replaced with

"When a host's required network becomes non-operational, virtual machines running on that host are migrated to another host; the extent of this migration is dependent upon the chosen cluster policy. This is beneficial if you have machines running mission critical workloads."

Comment 4 Lucy Bopf 2014-07-31 01:03:29 UTC
Documentation Link
------------------------------
https://documentation-devel.engineering.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.5-Beta/html-single/Administration_Guide/index.html#Virtual_Machine_Networks_and_Optional_Networks

What Changed
------------------------------
The following topic was revised to correct the description of virtual machine migration behavior in the event of a required network failure; specifically, there is no longer any reference to virtual machines being fenced. The two versions of the text also appear in Comment #2 above.

Required Networks, Optional Networks, and Virtual Machine Networks [10994-682286]

Updated revision history: [34613-687010]

NVR
------------------------------
Red_Hat_Enterprise_Virtualization-Administration_Guide-3.5-Beta-web-en-US-3.5-5.el6eng

Moving to ON_QA.