Bug 1395087

Summary: Node becomes non operational, HA VMs go to paused state when glusternw is down on the host.
Product: [oVirt] ovirt-engine
Reporter: RamaKasturi <knarra>
Component: BLL.Infra
Assignee: Gobinda Das <godas>
Status: CLOSED CURRENTRELEASE
QA Contact: SATHEESARAN <sasundar>
Severity: high
Priority: high
Docs Contact:
Version: 4.1.0
CC: ahadas, angystardust, bugs, knarra, mavital, mgoldboi, mperina, sabose, sasundar
Target Milestone: ovirt-4.2.6
Flags: rule-engine: ovirt-4.2?
       sabose: planning_ack?
       rule-engine: devel_ack+
       rule-engine: testing_ack+
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1619646 (view as bug list)
Environment:
Last Closed: 2018-09-13 07:41:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Gluster
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1433896, 1619646

Description RamaKasturi 2016-11-15 06:11:44 UTC
Description of problem:
With power management and Gluster fencing policies in place, VMs should be highly available. When the gluster network goes down on a host, the hosted engine gets restarted on another node, but all other application VMs go to a paused state.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.0.master.20161024211322.gitfc0de31.el7.centos.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install an HC (hyperconverged) setup using oVirt upstream.
2. Create VMs.
3. Bring down the gluster network on one of the hosts by running 'ifdown <devicename>'.

Actual results:
The HostedEngine VM on that node gets restarted on another node, but all the application VMs go to a paused state.

Expected results:
Fencing should happen on that host and all application VMs should be restarted on another host.

Additional info:

Comment 1 Sahina Bose 2016-12-06 14:27:03 UTC
Is it the same behaviour if you mark the glusternw as Required?

Comment 2 Ramesh N 2016-12-08 09:40:12 UTC
(In reply to Sahina Bose from comment #1)
> Is it the same behaviour if you mark the glusternw as Required?

I could not reproduce the error when glusternw is marked as required. When I tested this with the glusternw network, the VMs were not going to a paused state, but I need to test again both with and without glusternw.

Comment 3 Sahina Bose 2016-12-16 06:00:00 UTC
Could you retest with the network configured as required?

Comment 4 RamaKasturi 2016-12-30 09:19:33 UTC
Hi Sahina,

    I have retested this by marking glusternw as required (Clusters -> Logical Networks -> glusternw -> Manage Networks -> select Required -> OK). I still see the issue: my app VM goes to a paused state, and the node on which glusternw was brought down goes to a Non Operational state.

Thanks
Kasturi.

Comment 5 RamaKasturi 2016-12-30 09:20:49 UTC
Another issue I am observing here is that even though my HE VM resides on a different node, the HE VM goes through the poweroff -> engine start -> engine up cycle. Not sure whether this is expected behaviour.

Comment 6 RamaKasturi 2016-12-30 09:42:14 UTC
(In reply to RamaKasturi from comment #4)
> Hi Sahina,
> 
>     I have retested this by marking glusternw as required (Clusters ->
> Logical Networks -> glusternw -> Manage Networks -> select Required -> OK).
> I still see the issue: my app VM goes to a paused state, and the node on
> which glusternw was brought down goes to a Non Operational state.
> 
> Thanks
> Kasturi.

Sometimes I see that the host on which the HostedEngine resides simply restarts, and the HE is inaccessible for some time, i.e. until the rebooted host comes back up. When I try to execute hosted-engine --vm-status, I see the following errors on stdout on both hosts:

Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 173, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 103, in print_status
    all_host_stats = self._get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 73, in _get_all_host_stats
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'glusterfs', 'sd_uuid': '31488ded-cf31-477c-9d96-495cb08a3c35'}: Request failed: <class 'ovirt_hosted_engine_ha.lib.storage_backends.BackendFailureException'>
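
For what it's worth, the failure above can also be caught programmatically instead of letting hosted-engine --vm-status crash with a traceback. The following is only a minimal sketch against the HA client library on these hosts (Python 2); the HAClient class name is an assumption based on the module path shown in the traceback, and the script has to run on a hosted-engine host:

from ovirt_hosted_engine_ha.client import client
from ovirt_hosted_engine_ha.lib import exceptions as ha_exceptions


def get_host_stats_or_none():
    # Same call that fails at client.py line 160 in the traceback above.
    ha_cli = client.HAClient()
    try:
        return ha_cli.get_all_host_stats()
    except ha_exceptions.RequestError as err:
        # Raised when the broker cannot attach the HE storage domain, e.g.
        # while glusternw is down and the gluster mount is unreachable.
        print("HA broker/storage unreachable: %s" % err)
        return None


if __name__ == '__main__':
    print(get_host_stats_or_none())

Returning None here only makes the storage outage visible without a traceback; it does not change the underlying behaviour.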



I am not sure how to debug the issue of the host getting rebooted for no apparent reason. I looked at /var/log/messages and saw the following errors before the system started to reboot.

Dec 30 14:59:34 rhsqa-grafton2 wdmd[1357]: test failed rem 46 now 72593 ping 72569 close 72579 renewal 72500 expire 72580 client 1317 sanlock_f4f18e6e-8b59-4181-a999-bf347b49e24b:2
Dec 30 14:59:34 rhsqa-grafton2 journal: vdsm ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out
Dec 30 14:59:34 rhsqa-grafton2 journal: vdsm root ERROR failed to retrieve Hosted Engine HA info#012Traceback (most recent call last):#012  File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 231, in _getHaInfo#012    stats = instance.get_all_stats()#012  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats#012    self._configure_broker_conn(broker)#012  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn#012    dom_type=dom_type)#012  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain#012    .format(sd_type, options, e))#012RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'glusterfs', 'sd_uuid': '31488ded-cf31-477c-9d96-495cb08a3c35'}: Connection timed out
Dec 30 14:59:35 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:35+0530 72594 [1317]: s9 kill 3589 sig 15 count 15
Dec 30 14:59:35 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:35+0530 72594 [1317]: s8 kill 16271 sig 15 count 14
Dec 30 14:59:35 rhsqa-grafton2 wdmd[1357]: test failed rem 45 now 72594 ping 72569 close 72579 renewal 72501 expire 72581 client 1317 sanlock_31488ded-cf31-477c-9d96-495cb08a3c35:2
Dec 30 14:59:35 rhsqa-grafton2 wdmd[1357]: test failed rem 45 now 72594 ping 72569 close 72579 renewal 72500 expire 72580 client 1317 sanlock_f4f18e6e-8b59-4181-a999-bf347b49e24b:2
Dec 30 14:59:36 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:36+0530 72595 [1317]: s9 kill 3589 sig 15 count 16
Dec 30 14:59:36 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:36+0530 72595 [1317]: s8 kill 16271 sig 15 count 15
Dec 30 14:59:36 rhsqa-grafton2 wdmd[1357]: test failed rem 44 now 72595 ping 72569 close 72579 renewal 72501 expire 72581 client 1317 sanlock_31488ded-cf31-477c-9d96-495cb08a3c35:2
Dec 30 14:59:36 rhsqa-grafton2 wdmd[1357]: test failed rem 44 now 72595 ping 72569 close 72579 renewal 72500 expire 72580 client 1317 sanlock_f4f18e6e-8b59-4181-a999-bf347b49e24b:2
Dec 30 14:59:37 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:37+0530 72596 [1317]: s9 kill 3589 sig 15 count 17
Dec 30 14:59:37 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:37+0530 72596 [1317]: s8 kill 16271 sig 15 count 16
Dec 30 14:59:37 rhsqa-grafton2 wdmd[1357]: test failed rem 43 now 72596 ping 72569 close 72579 renewal 72501 expire 72581 client 1317 sanlock_31488ded-cf31-477c-9d96-495cb08a3c35:2
Dec 30 14:59:37 rhsqa-grafton2 wdmd[1357]: test failed rem 43 now 72596 ping 72569 close 72579 renewal 72500 expire 72580 client 1317 sanlock_f4f18e6e-8b59-4181-a999-bf347b49e24b:2
Dec 30 14:59:38 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:38+0530 72597 [1317]: s9 kill 3589 sig 15 count 18
Dec 30 14:59:38 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:38+0530 72597 [1317]: s8 kill 16271 sig 15 count 17
Dec 30 14:59:38 rhsqa-grafton2 wdmd[1357]: test failed rem 42 now 72597 ping 72569 close 72579 renewal 72501 expire 72581 client 1317 sanlock_31488ded-cf31-477c-9d96-495cb08a3c35:2
Dec 30 14:59:38 rhsqa-grafton2 wdmd[1357]: test failed rem 42 now 72597 ping 72569 close 72579 renewal 72500 expire 72580 client 1317 sanlock_f4f18e6e-8b59-4181-a999-bf347b49e24b:2
Dec 30 14:59:39 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:39+0530 72598 [1317]: s9 kill 3589 sig 15 count 19
Dec 30 14:59:39 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:39+0530 72598 [1317]: s8 kill 16271 sig 15 count 18
Dec 30 14:59:39 rhsqa-grafton2 dhclient[3453]: DHCPREQUEST on enp4s0f0 to 10.70.34.2 port 67 (xid=0x40638b6a)
Dec 30 14:59:39 rhsqa-grafton2 wdmd[1357]: test failed rem 41 now 72598 ping 72569 close 72579 renewal 72501 expire 72581 client 1317 sanlock_31488ded-cf31-477c-9d96-495cb08a3c35:2
Dec 30 14:59:39 rhsqa-grafton2 wdmd[1357]: test failed rem 41 now 72598 ping 72569 close 72579 renewal 72500 expire 72580 client 1317 sanlock_f4f18e6e-8b59-4181-a999-bf347b49e24b:2
Dec 30 14:59:40 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:40+0530 72599 [1317]: s9 kill 3589 sig 15 count 20
Dec 30 14:59:40 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:40+0530 72599 [1317]: s8 kill 16271 sig 15 count 19
:


I see another error related to ovirt-imageio-daemon:
==========================================================

Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: Traceback (most recent call last):
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/bin/ovirt-imageio-daemon", line 14, in <module>
Dec 30 15:03:55 rhsqa-grafton2 systemd: Started NTP client/server.
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: server.main(sys.argv)
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/lib/python2.7/site-packages/ovirt_imageio_daemon/server.py", line 50, in main
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: configure_logger()
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/lib/python2.7/site-packages/ovirt_imageio_daemon/server.py", line 68, in configure_logger
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: logging.config.fileConfig(conf, disable_existing_loggers=False)
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/lib64/python2.7/logging/config.py", line 78, in fileConfig
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: handlers = _install_handlers(cp, formatters)
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/lib64/python2.7/logging/config.py", line 156, in _install_handlers
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: h = klass(*args)
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/lib64/python2.7/logging/handlers.py", line 117, in __init__
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: BaseRotatingHandler.__init__(self, filename, mode, encoding, delay)
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/lib64/python2.7/logging/handlers.py", line 64, in __init__
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: logging.FileHandler.__init__(self, filename, mode, encoding, delay)
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/lib64/python2.7/logging/__init__.py", line 902, in __init__
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: StreamHandler.__init__(self, self._open())
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/lib64/python2.7/logging/__init__.py", line 925, in _open
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: stream = open(self.baseFilename, self.mode)
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: IOError: [Errno 13] Permission denied: '/var/log/ovirt-imageio-daemon/daemon.log'

Comment 7 RamaKasturi 2016-12-30 12:24:51 UTC
(In reply to RamaKasturi from comment #4)
> Hi Sahina,
> 
>     I have retested this by marking glusternw as required (Clusters ->
> Logical Networks -> glusternw -> Manage Networks -> select Required -> OK).
> I still see the issue: my app VM goes to a NotResponding state, and the node
> on which glusternw was brought down goes to a Non Operational state due to
> "Host hosted_engine1 moved to Non-Operational state because interfaces which
> are down are needed by required networks in the current cluster: 'enp4s0f0 (glusternw)'."
> 
> Thanks
> Kasturi.

Comment 8 Sandro Bonazzola 2017-02-01 16:01:44 UTC
oVirt 4.1.0 GA has been released, re-targeting to 4.1.1.
Please check if this issue is correctly targeted or already included in 4.1.0.

Comment 9 Ramesh N 2017-02-21 10:57:29 UTC
I am not able to reproduce this issue. I see the following when I bring down the gluster network.

System details:

I have 3 hosts in HC mode with Gluster. There is a gluster network which is used for gluster data traffic and migration traffic, and it is marked as a required network for the cluster. I have a HostedEngine VM and an HA VM (vm1) running on host2. Power management is configured for all the hosts and the fencing policy is enabled correctly.

Steps executed:
On 'Host2', which runs the HE VM and the HA VM, bring down the NIC (e.g. ens4f0) which is part of the gluster network.

Results: 
1. The HE VM and the RHV-M portal are not accessible for a short while; then the RHV-M portal becomes accessible again.
2. The host is moved to Non-Operational with the event 'Host host2 moved to Non-Operational state because interfaces which are down are needed by required networks in the current cluster: 'ens4f0 (glusternw)'.'
3. Glusterd and all other brick processes on host2 are killed because of server quorum loss.
4. The HA VM is migrated to another host, 'Host3'.
5. The HostedEngine VM continues to run on 'Host2' without any issue.

I am not sure what the expected behavior is in this scenario. Also, the HA VM is migrated without any issue, so HA is satisfied.

Comment 10 Ramesh N 2017-02-23 10:46:39 UTC
This issue is reproducible only when the storage is completely inaccessible from a host which has HA VMs running.

I can reproduce this issue in a Gluster and oVirt HC setup using the following steps.

Environment:
 - Three-node HC setup with hosted engine.
 - There is a separate gluster network defined in the cluster for gluster data traffic, and it is marked as the migration network as well.
 - The gluster network is marked as a mandatory (required) network.
 - Power management is enabled for all the hosts.
 - Fencing is enabled.
 - A VM is created with HA enabled.

Test case:
  - Ensure that both the Hosted Engine VM and the HA VM run on Host2.
  - Simulate a gluster network disconnect on Host2 by configuring the firewall to reject network traffic from the other two hosts (only on the gluster network). A scripted version of this is sketched below.
    Firewall rule used to block gluster storage access through the ovirt-mgmt network:
        iptables -A OUTPUT -p all --destination 10.70.36.74,10.70.36.76 -j REJECT; iptables -A INPUT -p all --source 10.70.36.74,10.70.36.76 -j REJECT
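
    The sketch mentioned above: a hedged Python helper (illustrative only, not part of any test tooling) that applies and then removes the same REJECT rules via subprocess, so the outage can be scripted and undone cleanly. The peer IPs are the ones used in this setup and would need to be adjusted:

import subprocess

PEERS = "10.70.36.74,10.70.36.76"   # gluster peers of Host2 in this setup


def _peer_rules(action):
    # action is "-A" to add the REJECT rules, "-D" to delete them again.
    for chain, direction in (("OUTPUT", "--destination"), ("INPUT", "--source")):
        subprocess.check_call(
            ["iptables", action, chain, "-p", "all", direction, PEERS, "-j", "REJECT"])


def block_gluster_peers():
    _peer_rules("-A")


def unblock_gluster_peers():
    _peer_rules("-D")


if __name__ == "__main__":
    block_gluster_peers()    # simulate the storage disconnect
    # ... observe engine/VM behaviour, then restore connectivity:
    # unblock_gluster_peers()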

Results:
I observe the following when I completely disconnect the storage using the firewall rules and bring down the gluster network.

1. The HE VM is not accessible for some time until it is restarted on another node.
2. The HE VM is restarted on another node.
3. Host2, which is disconnected from the storage, moves to Non-Operational.
4. oVirt tries to migrate the HA VMs running on that host with the reason "Host preparing for maintenance".
5. The following two events are logged:
   "VM appvm01 has been paused due to unknown storage error."
    "VM appvm01 has been paused."

Steps 4 and 5 keep repeating every 5 minutes, but the HA VM never gets migrated or restarted on another node.

6. The following errors are seen repeatedly in the engine log.

2017-02-23 02:16:38,207-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (org.ovirt.thread.pool-6-thread-42) [71a98cc8] FINISH, MigrateBrokerVDSCommand, log id: 5e696908
2017-02-23 02:16:38,213-05 INFO  [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (org.ovirt.thread.pool-6-thread-42) [71a98cc8] FINISH, MigrateVDSCommand, return: MigratingFrom, log id: 11400f71
2017-02-23 02:16:38,228-05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-42) [71a98cc8] EVENT_ID: VM_MIGRATION_START_SYSTEM_INITIATED(67), Correlation ID: 71a98cc8, Job ID: 845482a9-66e1-413e-8eb5-c54112c07232, Call Stack: null, Custom Event ID: -1, Message: Migration initiated by system (VM: appvm01, Source: host2, Destination: host1, Reason: Host preparing for maintenance).
2017-02-23 02:16:38,821-05 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM '36cbe058-a866-4b96-b099-5598efce77e9' was reported as Down on VDS 'cd82ef18-d1fe-437e-b3e5-8301a66cf0d5'(host1)
2017-02-23 02:16:38,825-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-13) [] START, DestroyVDSCommand(HostName = host1, DestroyVmVDSCommandParameters:{runAsync='true', hostId='cd82ef18-d1fe-437e-b3e5-8301a66cf0d5', vmId='36cbe058-a866-4b96-b099-5598efce77e9', force='false', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 52d2f6cf
2017-02-23 02:16:38,827-05 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler7) [7cef992a] FINISH, GlusterServersListVDSCommand, return: [10.70.36.75/23:CONNECTED, cambridge-nic2.lab.eng.blr.redhat.com:CONNECTED, 10.70.36.74:DISCONNECTED], log id: 5c822b82
2017-02-23 02:16:38,838-05 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler7) [7cef992a] START, GlusterVolumesListVDSCommand(HostName = host3, GlusterVolumesListVDSParameters:{runAsync='true', hostId='6b1e524b-e224-4fe1-95e1-b2e5577a6ec9'}), log id: eb27a41
2017-02-23 02:16:39,185-05 WARN  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn] (DefaultQuartzScheduler7) [7cef992a] Could not associate brick '10.70.36.74:/gluster_bricks/engine/engine' of volume 'dbca77ec-74cb-45ae-bac8-a1f5b1a4e5e8' with correct network as no gluster network found in cluster '00000002-0002-0002-0002-00000000017a'
2017-02-23 02:16:39,191-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-13) [] Failed to destroy VM '36cbe058-a866-4b96-b099-5598efce77e9' because VM does not exist, ignoring
2017-02-23 02:16:39,192-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-13) [] FINISH, DestroyVDSCommand, log id: 52d2f6cf
2017-02-23 02:16:39,192-05 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM '36cbe058-a866-4b96-b099-5598efce77e9'(appvm01) was unexpectedly detected as 'Down' on VDS 'cd82ef18-d1fe-437e-b3e5-8301a66cf0d5'(host1) (expected on '22f77c5e-21c6-4b4e-9e89-45b27e3db57d')
2017-02-23 02:16:39,199-05 WARN  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn] (DefaultQuartzScheduler7) [7cef992a] Could not associate brick '10.70.36.74:/gluster_bricks/data/data' of volume 'f4d6fc61-24d9-4660-8f75-cea21f8e69eb' with correct network as no gluster network found in cluster '00000002-0002-0002-0002-00000000017a'
2017-02-23 02:16:39,213-05 WARN  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn] (DefaultQuartzScheduler7) [7cef992a] Could not associate brick '10.70.36.74:/gluster_bricks/vmstore/vmstore' of volume '69dcc5d0-4f81-480b-99f7-6e7d3c1b9cb9' with correct network as no gluster network found in cluster '00000002-0002-0002-0002-00000000017a'
2017-02-23 02:16:39,217-05 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler7) [7cef992a] FINISH, GlusterVolumesListVDSCommand, return: {dbca77ec-74cb-45ae-bac8-a1f5b1a4e5e8=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@1cd5711e, f4d6fc61-24d9-4660-8f75-cea21f8e69eb=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@5e7f8486, 69dcc5d0-4f81-480b-99f7-6e7d3c1b9cb9=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@14f0b0ac}, log id: eb27a41
2017-02-23 02:16:40,199-05 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-4) [] VM '36cbe058-a866-4b96-b099-5598efce77e9'(appvm01) moved from 'MigratingFrom' --> 'Paused'

Comment 11 Ramesh N 2017-02-23 11:29:58 UTC
I discussed the right component for this issue with mskrivanek, and I am moving it to the Virt component as per his suggestion.

Comment 12 Michal Skrivanek 2017-02-23 11:49:04 UTC
To sum up:
- There is no bug in the existing behavior; it is just that the existing HA and fencing behavior has certain shortcomings.

- The intention to try to migrate VMs when a host was Up and now becomes NonOperational is still valid, I believe. There is a good chance that some VMs are not affected by the specific reason the host status changed, e.g. in this case VMs not using disks on that gluster network, diskless VMs, or VMs which do not do any I/O at that time.
For that we use an equivalent of "move host to maintenance" where we try to evacuate the host.

- The problem with the above is that when you are unable to migrate the VMs, they will just remain stuck there in case of disk access issues (the VM gets Paused with an I/O error), and the only resolution is to kill that process, or in many cases on NFS you need to use power management to reboot/power-cycle the host.

- Fencing can do that, but it is currently designed to resolve issues when the engine cannot talk to the host at all. That is not the case here, as the ovirtmgmt communication is OK.

I would propose to keep both and improve our fencing aggressiveness. In the "preparing for maintenance" part of SetNonOperationalVdsCommand we can check whether any running VMs are present (excluding those in the Paused state), and if not, trigger fencing and kill the host. A rough sketch of the proposed check follows below.
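
The sketch below is purely illustrative (the real change would live in the engine's Java SetNonOperationalVdsCommand); the entity stand-ins and helper names are hypothetical, but it shows the check being proposed: fence only when power management is configured and nothing on the host is still genuinely running.

from collections import namedtuple

# Hypothetical, simplified stand-ins for the engine's host/VM entities.
Vm = namedtuple("Vm", "name status")
Host = namedtuple("Host", "name pm_enabled")


def should_fence_non_operational_host(host, vms_on_host):
    # Proposed check: VMs in the Paused state are treated as stuck, not running.
    still_running = [vm for vm in vms_on_host if vm.status not in ("Paused", "Down")]
    return host.pm_enabled and not still_running


if __name__ == "__main__":
    host2 = Host("host2", pm_enabled=True)
    vms = [Vm("appvm01", "Paused")]   # stuck after the storage outage
    # True here would mean: safe to fence host2 and let the HA VM restart elsewhere.
    print(should_fence_non_operational_host(host2, vms))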

Alternatively, this could be resolved via VM leases, but they would need to be configured for all the storage domains the VM is using; currently I believe we allow only one.

Does that sound feasible?

Comment 13 Martin Perina 2017-03-22 13:00:36 UTC
We would need to implement a new flow to execute power management fencing for this use case, as the current fencing flow is tightly coupled with host non-responsiveness, and several of its steps, such as SSH Soft Fencing or Kdump detection, are completely useless for this use case.

Comment 17 Michal Skrivanek 2017-11-27 11:13:52 UTC
Based on comment #13, moving to infra for targeting.
Sorry for the delay.

Comment 18 Yaniv Kaul 2017-11-29 10:53:57 UTC
Martin, can anyone look at this?

Comment 19 Martin Perina 2018-01-10 12:33:12 UTC
Removing the 4.2 target; unfortunately this item was moved to infra too late for 4.2. Adding a fencing flow for hosts that go non-operational because the gluster network is unavailable seems to me like quite a dangerous change, which will require time to design properly and a lot of testing.

AFAIK the workaround for now is to use VM leases configured on a storage domain provided by Gluster; a hedged example of configuring a lease follows below.
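
A lease can be set from the VM's High Availability settings in the UI or via the Python SDK. The following is only a hedged sketch against ovirtsdk4; the engine URL, credentials, VM name and storage domain name are placeholders, and the exact type/field names should be checked against the SDK version in use:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=appvm01')[0]

sds_service = connection.system_service().storage_domains_service()
sd = sds_service.list(search='name=vmstore')[0]   # gluster-backed storage domain

# Place the VM lease on the gluster storage domain so the VM can be
# restarted on another host when this host loses storage access.
vms_service.vm_service(vm.id).update(
    types.Vm(
        lease=types.StorageDomainLease(
            storage_domain=types.StorageDomain(id=sd.id),
        ),
    ),
)

connection.close()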

Comment 20 Sahina Bose 2018-08-09 05:05:46 UTC
Can we test this flow and ensure configuring leases solves the issue?

Comment 21 Sahina Bose 2018-08-16 09:51:37 UTC
Moving to ON_QA to test the scenario with VM leases configured.

Comment 22 SATHEESARAN 2018-09-04 13:27:44 UTC
I have configured a VM lease for the VM. The HE VM and the HA app VM were running on the same host. I brought down the gluster network on that node. The HE VM restarted on another node, but the app VM went paused; once I rebooted the host, the VM restarted on another host.

Comment 23 SATHEESARAN 2018-09-04 13:28:24 UTC
Verified with RHV 4.2.6-4 and glusterfs-3.8.4