Description of problem:
This problem occurs in every oVirt cluster that contains just a single host (e.g. local storage). When a logical network is created, the default is to mark it "required" on the cluster it gets attached to. If the single host inside the cluster is missing one logical network, _all_ VMs get fenced, because oVirt wants to restart them on another host in the cluster which has all required networks.

Version-Release number of selected component (if applicable):
latest master

How reproducible:
always

Steps to Reproduce:
1. Create logical networks for each VM in a single-node cluster.
2. Remove one of the networks from the host.
3. Watch all VMs shut down.

Actual results:
All VMs get fenced.

Expected results:
Here are some superior alternatives (which can also be combined):
a) Fence only the VMs that have the missing required network attached.
b) Do no fencing at all when only one host is up in the cluster, because you can't restart the VMs on a different host anyway. Make sure _before_ fencing VMs that a suitable host is available which meets all requirements and has enough resources to run the fenced VMs (this requires some computation!).
c) Do not mark newly created logical networks as "required" by default, without user interaction (a sketch of clearing this flag per cluster follows below this comment).

Additional info:
I doubt the infrastructure level is the correct place to decide what happens when a given network connection is not available. By default oVirt decides as follows: if the (by default) required network is not there, shut down all VMs and start them on a different host. This makes no sense if VM "a" does not have this network attached to any of its interfaces. It does not even make sense if VM "b" has this network attached to one of its interfaces, because you can't tell whether this VM really "requires" the network (it might have other networks attached, which may be enough for it).

Again: this is an application-level problem, meaning it depends on the application inside the VM whether a network outage is serious or not. oVirt does not know anything about the application inside the VM, so oVirt can't make assumptions about the importance of a vanished logical network. There are three cases:
1. The app does not need networking at all.
2. The app can handle a network failure itself (wait for the network to come up again, logging, alarming, etc.).
3. The app is poorly designed and crashes/stops working.
Only in case 3 would oVirt be justified in doing what it does by default. In my opinion it should be up to the network/oVirt administrator to decide this, rather than making it the default for all cases.

I do not know whether I selected the correct oVirt component; please reassign if applicable.
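As a possible interim workaround related to alternative (c), the "required" flag can be cleared per cluster network after creation. Below is a minimal sketch of doing that through the engine's REST API with Python and the requests library; the endpoint path, XML payload, engine URL, credentials, UUIDs and CA path are my assumptions based on the 3.3-era API layout and should be checked against your own engine before use.

#!/usr/bin/env python
# Hedged sketch: clear the "required" flag of a logical network on a cluster
# via the oVirt engine REST API. Endpoint path and payload are assumptions
# from the 3.3-era API; verify against your engine's API description first.
import requests

ENGINE = "https://engine.example.com"        # hypothetical engine address
AUTH = ("admin@internal", "password")        # hypothetical credentials
CLUSTER_ID = "<cluster-uuid>"                # fill in the real cluster UUID
NETWORK_ID = "<network-uuid>"                # fill in the real network UUID

url = "%s/api/clusters/%s/networks/%s" % (ENGINE, CLUSTER_ID, NETWORK_ID)
payload = "<network><required>false</required></network>"
resp = requests.put(url, data=payload, auth=AUTH,
                    headers={"Content-Type": "application/xml"},
                    verify="/etc/pki/ovirt-engine/ca.pem")  # engine CA; path may differ
resp.raise_for_status()
print("network is no longer marked as required on the cluster")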
PS: I got my understanding of how the fencing process works from: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.3/html/Administration_Guide/Virtual_Machine_Networks_and_Optional_Networks.html If this is wrong, please open a bug on the RHEV docs (I can't do that myself, as I have no running subscription; this, by the way, limits your access to user-reported bugs in the official docs, since we can't help improve the docs without a subscription). Additionally, the docs are somewhat unclear about what it means that the VMs get "fenced": do they get soft-fenced by an ACPI call to the ovirt-guest-agent, or do they get hard-fenced (powered off)?
Please attach engine && vdsm logs.
As far as I know, the information in the documentation is wrong. There isn't supposed to be any "fencing" of any kind in such a use case. If the host is responsive (the management network is up and the host is responding) and another network isn't up, then the host will become non-operational. That should trigger a migration of all the VMs on this host. However, if there is no other "Up" host in the cluster, then the migration will not happen, and the VMs will keep running on the host. Can you attach logs?
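To make that flow concrete, here is a small illustrative sketch in Python; the class and function names are made up for illustration and this is not actual engine code:

# Illustrative sketch only -- not engine code. A responsive host missing a
# required network goes NonOperational; its VMs migrate only if another host
# in the cluster is Up, otherwise they keep running where they are.
class Host(object):
    def __init__(self, name, responsive=True, state="Up", vms=None):
        self.name = name
        self.responsive = responsive
        self.state = state
        self.vms = vms or []

def handle_missing_required_network(host, cluster_hosts):
    if not host.responsive:
        return []  # unresponsive hosts go through real host fencing, not this flow
    host.state = "NonOperational"
    targets = [h for h in cluster_hosts if h is not host and h.state == "Up"]
    if not targets:
        return []  # no other Up host: no migration, VMs keep running
    return [(vm, targets[0].name) for vm in host.vms]  # naive placement, just for illustration

# Single-host cluster, as in this report: nothing should be migrated or fenced.
only_host = Host("node1", vms=["vm-a", "vm-b"])
print(handle_missing_required_network(only_host, [only_host]))  # -> []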
Sorry for keeping you waiting; I'll attach the logs today. I just need to tar them together and anonymize them.
I'd like to provide some logs in private, as they contain sensitive information which may be important for debugging this problem and therefore can't be removed from the logs. Is this possible?
I submitted the logs in private to Omer Frenkel. If anything is still missing, just ping me via IRC, mail, or BZ.
Thanks Sven! After looking at the logs and verifying with Sven online on IRC, we see that there was first a kernel panic on the host, which caused it to reboot. When the host came back up, the network was missing (and, of course, all the VMs were down as well), so the host moved to non-operational. So the VMs going down was not related to the host moving to non-operational.

Maybe this bug needs to be moved to documentation, because I find this sentence, from the link supplied in comment 1, confusing: "When a required network becomes non-operational, the virtual machines running on the network are fenced and migrated to another host." Saying "fenced and migrated" is a bit of a contradiction. It should be changed to explain that all VMs on the host will migrate (not only the VMs that use the network), and also according to the cluster policy (migration can be none/HA/All).
I've opened a documentation bug and am closing this one.