Bug 1395087
Summary: Node becomes non-operational, HA VMs go to paused state when glusternw is down on the host
| Field | Value |
|---|---|
| Product | [oVirt] ovirt-engine |
| Component | BLL.Infra |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | high |
| Version | 4.1.0 |
| Target Milestone | ovirt-4.2.6 |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | RamaKasturi <knarra> |
| Assignee | Gobinda Das <godas> |
| QA Contact | SATHEESARAN <sasundar> |
| CC | ahadas, angystardust, bugs, knarra, mavital, mgoldboi, mperina, sabose, sasundar |
| Flags | rule-engine: ovirt-4.2?; sabose: planning_ack?; rule-engine: devel_ack+; rule-engine: testing_ack+ |
| Doc Type | If docs needed, set a value |
| Clones | 1619646 (view as bug list) |
| Last Closed | 2018-09-13 07:41:18 UTC |
| Type | Bug |
| oVirt Team | Gluster |
| Bug Blocks | 1433896, 1619646 |
Description (RamaKasturi, 2016-11-15 06:11:44 UTC)
Is it the same behaviour if you mark the glusternw as Required?

(In reply to Sahina Bose from comment #1)
> Is it the same behaviour if you mark the glusternw as Required?

I could not reproduce the error when glusternw is marked as required. When I tested this with the glusternw network, VMs were not going to paused state. But I need to test again both with and without glusternw.

Could you retest with the network configured as required?

Hi Sahina,

I have retested this by marking glusternw as required (Clusters -> Logical Networks -> glusternw -> Manage Networks -> select Required -> OK). I still see the issue where my app VM goes to paused state and the node on which glusternw was brought down goes to Non Operational state.

Thanks,
kasturi

Another issue I am observing here is that even though my HE VM resides on a different node, I see that the HE VM goes through the process of poweroff - engine start - engine up. Not sure if this is expected behaviour.

(In reply to RamaKasturi from comment #4)
> Hi sahina,
>
> I have retested this by marking glusternw was required(clusters->logical
> Networks->glusternw->manage networks->select required -> ok). I still see
> the issue where my app vm goes to paused state and node on which the
> glusternw was brought down goes to Non Operational state.
>
> Thanks
> kasturi.
Sometimes I see that one of the hosts, the one where HostedEngine resides, simply restarts, and HE is inaccessible for some time, i.e. until the rebooted host comes back up. When I try to execute `hosted-engine --vm-status` I see the following errors on stdout on both hosts:

```
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 173, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 103, in print_status
    all_host_stats = self._get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 73, in _get_all_host_stats
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'glusterfs', 'sd_uuid': '31488ded-cf31-477c-9d96-495cb08a3c35'}: Request failed: <class 'ovirt_hosted_engine_ha.lib.storage_backends.BackendFailureException'>
```

I am not sure how to debug the issue of the host getting rebooted for no apparent reason. I looked at /var/log/messages and see the following errors before the system starts to reboot:
```
Dec 30 14:59:34 rhsqa-grafton2 wdmd[1357]: test failed rem 46 now 72593 ping 72569 close 72579 renewal 72500 expire 72580 client 1317 sanlock_f4f18e6e-8b59-4181-a999-bf347b49e24b:2
Dec 30 14:59:34 rhsqa-grafton2 journal: vdsm ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out
Dec 30 14:59:34 rhsqa-grafton2 journal: vdsm root ERROR failed to retrieve Hosted Engine HA info#012Traceback (most recent call last):#012 File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 231, in _getHaInfo#012 stats = instance.get_all_stats()#012 File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats#012 self._configure_broker_conn(broker)#012 File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn#012 dom_type=dom_type)#012 File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain#012 .format(sd_type, options, e))#012RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'glusterfs', 'sd_uuid': '31488ded-cf31-477c-9d96-495cb08a3c35'}: Connection timed out
Dec 30 14:59:35 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:35+0530 72594 [1317]: s9 kill 3589 sig 15 count 15
Dec 30 14:59:35 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:35+0530 72594 [1317]: s8 kill 16271 sig 15 count 14
Dec 30 14:59:35 rhsqa-grafton2 wdmd[1357]: test failed rem 45 now 72594 ping 72569 close 72579 renewal 72501 expire 72581 client 1317 sanlock_31488ded-cf31-477c-9d96-495cb08a3c35:2
Dec 30 14:59:35 rhsqa-grafton2 wdmd[1357]: test failed rem 45 now 72594 ping 72569 close 72579 renewal 72500 expire 72580 client 1317 sanlock_f4f18e6e-8b59-4181-a999-bf347b49e24b:2
Dec 30 14:59:36 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:36+0530 72595 [1317]: s9 kill 3589 sig 15 count 16
Dec 30 14:59:36 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:36+0530 72595 [1317]: s8 kill 16271 sig 15 count 15
Dec 30 14:59:36 rhsqa-grafton2 wdmd[1357]: test failed rem 44 now 72595 ping 72569 close 72579 renewal 72501 expire 72581 client 1317 sanlock_31488ded-cf31-477c-9d96-495cb08a3c35:2
Dec 30 14:59:36 rhsqa-grafton2 wdmd[1357]: test failed rem 44 now 72595 ping 72569 close 72579 renewal 72500 expire 72580 client 1317 sanlock_f4f18e6e-8b59-4181-a999-bf347b49e24b:2
Dec 30 14:59:37 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:37+0530 72596 [1317]: s9 kill 3589 sig 15 count 17
Dec 30 14:59:37 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:37+0530 72596 [1317]: s8 kill 16271 sig 15 count 16
Dec 30 14:59:37 rhsqa-grafton2 wdmd[1357]: test failed rem 43 now 72596 ping 72569 close 72579 renewal 72501 expire 72581 client 1317 sanlock_31488ded-cf31-477c-9d96-495cb08a3c35:2
Dec 30 14:59:37 rhsqa-grafton2 wdmd[1357]: test failed rem 43 now 72596 ping 72569 close 72579 renewal 72500 expire 72580 client 1317 sanlock_f4f18e6e-8b59-4181-a999-bf347b49e24b:2
Dec 30 14:59:38 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:38+0530 72597 [1317]: s9 kill 3589 sig 15 count 18
Dec 30 14:59:38 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:38+0530 72597 [1317]: s8 kill 16271 sig 15 count 17
Dec 30 14:59:38 rhsqa-grafton2 wdmd[1357]: test failed rem 42 now 72597 ping 72569 close 72579 renewal 72501 expire 72581 client 1317 sanlock_31488ded-cf31-477c-9d96-495cb08a3c35:2
Dec 30 14:59:38 rhsqa-grafton2 wdmd[1357]: test failed rem 42 now 72597 ping 72569 close 72579 renewal 72500 expire 72580 client 1317 sanlock_f4f18e6e-8b59-4181-a999-bf347b49e24b:2
Dec 30 14:59:39 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:39+0530 72598 [1317]: s9 kill 3589 sig 15 count 19
Dec 30 14:59:39 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:39+0530 72598 [1317]: s8 kill 16271 sig 15 count 18
Dec 30 14:59:39 rhsqa-grafton2 dhclient[3453]: DHCPREQUEST on enp4s0f0 to 10.70.34.2 port 67 (xid=0x40638b6a)
Dec 30 14:59:39 rhsqa-grafton2 wdmd[1357]: test failed rem 41 now 72598 ping 72569 close 72579 renewal 72501 expire 72581 client 1317 sanlock_31488ded-cf31-477c-9d96-495cb08a3c35:2
Dec 30 14:59:39 rhsqa-grafton2 wdmd[1357]: test failed rem 41 now 72598 ping 72569 close 72579 renewal 72500 expire 72580 client 1317 sanlock_f4f18e6e-8b59-4181-a999-bf347b49e24b:2
Dec 30 14:59:40 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:40+0530 72599 [1317]: s9 kill 3589 sig 15 count 20
Dec 30 14:59:40 rhsqa-grafton2 sanlock[1317]: 2016-12-30 14:59:40+0530 72599 [1317]: s8 kill 16271 sig 15 count 19
```

I see another error related to ovirt-imageio-daemon:

```
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: Traceback (most recent call last):
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/bin/ovirt-imageio-daemon", line 14, in <module>
Dec 30 15:03:55 rhsqa-grafton2 systemd: Started NTP client/server.
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: server.main(sys.argv)
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/lib/python2.7/site-packages/ovirt_imageio_daemon/server.py", line 50, in main
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: configure_logger()
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/lib/python2.7/site-packages/ovirt_imageio_daemon/server.py", line 68, in configure_logger
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: logging.config.fileConfig(conf, disable_existing_loggers=False)
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/lib64/python2.7/logging/config.py", line 78, in fileConfig
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: handlers = _install_handlers(cp, formatters)
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/lib64/python2.7/logging/config.py", line 156, in _install_handlers
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: h = klass(*args)
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/lib64/python2.7/logging/handlers.py", line 117, in __init__
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: BaseRotatingHandler.__init__(self, filename, mode, encoding, delay)
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/lib64/python2.7/logging/handlers.py", line 64, in __init__
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: logging.FileHandler.__init__(self, filename, mode, encoding, delay)
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/lib64/python2.7/logging/__init__.py", line 902, in __init__
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: StreamHandler.__init__(self, self._open())
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: File "/usr/lib64/python2.7/logging/__init__.py", line 925, in _open
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: stream = open(self.baseFilename, self.mode)
Dec 30 15:03:55 rhsqa-grafton2 ovirt-imageio-daemon: IOError: [Errno 13] Permission denied: '/var/log/ovirt-imageio-daemon/daemon.log'
```

(In reply to RamaKasturi from comment #4)
> Hi sahina,
>
> I have retested this by marking glusternw as required (Clusters -> Logical
> Networks -> glusternw -> Manage Networks -> select Required -> OK). I still see
> the issue where my app VM goes to NotResponding state and the node on which the
> glusternw was brought down goes to Non Operational state due to "Host
> hosted_engine1 moved to Non-Operational state because interfaces which are down
> are needed by required networks in the current cluster: 'enp4s0f0 (glusternw)'."
>
> Thanks
> kasturi.

oVirt 4.1.0 GA has been released; re-targeting to 4.1.1. Please check if this issue is correctly targeted or already included in 4.1.0.

I am not able to reproduce this issue. I see the following when I bring down the gluster network.

System details: I have 3 hosts in HC mode with Gluster. There is a gluster network which is used for gluster data traffic and migration traffic. The gluster network is marked as a required network for the cluster. I have a HostedEngine VM and an HA VM (vm1) running on host2.
Power management is configured for all the hosts and the fencing policy is enabled correctly.

Steps executed: On host2, which runs the HE VM and the HA VM, bring down the NIC (e.g. ens4f0) which is in the gluster network.

Results:
1. The HE VM and the RHV-M portal are not accessible for a moment; then the RHV-M portal is accessible again.
2. The host is moved to Non-Operational with the event 'Host host2 moved to Non-Operational state because interfaces which are down are needed by required networks in the current cluster: 'ens4f0 (glusternw)'.'
3. Glusterd and all other brick processes on host2 are killed because of server quorum loss.
4. The HA VM is migrated to another host, host3.
5. The HostedEngine VM continues to run on host2 without any issue.

I am not sure what the expected behavior is in this scenario. The HA VM is migrated without any issue and HA is satisfied. This issue is reproducible only when storage is completely inaccessible from a host which has HA VMs running.

I can reproduce this issue in a Gluster and oVirt HC setup using the following steps.

Environment:
- Three-node HC setup with hosted engine.
- A separate gluster network is defined in the cluster for gluster data traffic; it is also marked as the migration network.
- The gluster network is marked as a mandatory (required) network.
- Power management is enabled for all the hosts.
- Fencing is enabled.
- A VM is created with HA enabled.

Test case:
- Ensure both the Hosted Engine and the HA VM run on host2.
- Simulate a gluster network disconnect on host2 by configuring the firewall to reject network traffic from the other two hosts (only on the gluster network).

Firewall rules used to block gluster storage access through the ovirt-mgmt network:

```
iptables -A OUTPUT -p all --destination 10.70.36.74,10.70.36.76 -j REJECT
iptables -A INPUT -p all --source 10.70.36.74,10.70.36.76 -j REJECT
```

Results: I observe the following when I completely disconnect the storage using the firewall rules and bring down the gluster network.
1. The HE VM is not accessible for some time, until it is restarted on another node.
2. The HE VM is restarted on another node.
3. host2, which is disconnected from storage, moves to Non-Operational.
4. oVirt tries to migrate the HA VMs running on that host with the reason "Host preparing for maintenance".
5. The following two events are logged:
   "VM appvm01 has been paused due to unknown storage error."
   "VM appvm01 has been paused."
   Steps 4 and 5 keep repeating every 5 minutes, but the HA VM never gets migrated or restarted on another node.
6. The following errors are seen repeatedly in the engine log:

```
2017-02-23 02:16:38,207-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (org.ovirt.thread.pool-6-thread-42) [71a98cc8] FINISH, MigrateBrokerVDSCommand, log id: 5e696908
2017-02-23 02:16:38,213-05 INFO [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (org.ovirt.thread.pool-6-thread-42) [71a98cc8] FINISH, MigrateVDSCommand, return: MigratingFrom, log id: 11400f71
2017-02-23 02:16:38,228-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-42) [71a98cc8] EVENT_ID: VM_MIGRATION_START_SYSTEM_INITIATED(67), Correlation ID: 71a98cc8, Job ID: 845482a9-66e1-413e-8eb5-c54112c07232, Call Stack: null, Custom Event ID: -1, Message: Migration initiated by system (VM: appvm01, Source: host2, Destination: host1, Reason: Host preparing for maintenance).
```
```
2017-02-23 02:16:38,821-05 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM '36cbe058-a866-4b96-b099-5598efce77e9' was reported as Down on VDS 'cd82ef18-d1fe-437e-b3e5-8301a66cf0d5'(host1)
2017-02-23 02:16:38,825-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-13) [] START, DestroyVDSCommand(HostName = host1, DestroyVmVDSCommandParameters:{runAsync='true', hostId='cd82ef18-d1fe-437e-b3e5-8301a66cf0d5', vmId='36cbe058-a866-4b96-b099-5598efce77e9', force='false', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 52d2f6cf
2017-02-23 02:16:38,827-05 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler7) [7cef992a] FINISH, GlusterServersListVDSCommand, return: [10.70.36.75/23:CONNECTED, cambridge-nic2.lab.eng.blr.redhat.com:CONNECTED, 10.70.36.74:DISCONNECTED], log id: 5c822b82
2017-02-23 02:16:38,838-05 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler7) [7cef992a] START, GlusterVolumesListVDSCommand(HostName = host3, GlusterVolumesListVDSParameters:{runAsync='true', hostId='6b1e524b-e224-4fe1-95e1-b2e5577a6ec9'}), log id: eb27a41
2017-02-23 02:16:39,185-05 WARN [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn] (DefaultQuartzScheduler7) [7cef992a] Could not associate brick '10.70.36.74:/gluster_bricks/engine/engine' of volume 'dbca77ec-74cb-45ae-bac8-a1f5b1a4e5e8' with correct network as no gluster network found in cluster '00000002-0002-0002-0002-00000000017a'
2017-02-23 02:16:39,191-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-13) [] Failed to destroy VM '36cbe058-a866-4b96-b099-5598efce77e9' because VM does not exist, ignoring
2017-02-23 02:16:39,192-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-13) [] FINISH, DestroyVDSCommand, log id: 52d2f6cf
2017-02-23 02:16:39,192-05 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM '36cbe058-a866-4b96-b099-5598efce77e9'(appvm01) was unexpectedly detected as 'Down' on VDS 'cd82ef18-d1fe-437e-b3e5-8301a66cf0d5'(host1) (expected on '22f77c5e-21c6-4b4e-9e89-45b27e3db57d')
2017-02-23 02:16:39,199-05 WARN [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn] (DefaultQuartzScheduler7) [7cef992a] Could not associate brick '10.70.36.74:/gluster_bricks/data/data' of volume 'f4d6fc61-24d9-4660-8f75-cea21f8e69eb' with correct network as no gluster network found in cluster '00000002-0002-0002-0002-00000000017a'
2017-02-23 02:16:39,213-05 WARN [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn] (DefaultQuartzScheduler7) [7cef992a] Could not associate brick '10.70.36.74:/gluster_bricks/vmstore/vmstore' of volume '69dcc5d0-4f81-480b-99f7-6e7d3c1b9cb9' with correct network as no gluster network found in cluster '00000002-0002-0002-0002-00000000017a'
2017-02-23 02:16:39,217-05 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler7) [7cef992a] FINISH, GlusterVolumesListVDSCommand, return: {dbca77ec-74cb-45ae-bac8-a1f5b1a4e5e8=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@1cd5711e, f4d6fc61-24d9-4660-8f75-cea21f8e69eb=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@5e7f8486, 69dcc5d0-4f81-480b-99f7-6e7d3c1b9cb9=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@14f0b0ac}, log id: eb27a41
2017-02-23 02:16:40,199-05 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-4) [] VM '36cbe058-a866-4b96-b099-5598efce77e9'(appvm01) moved from 'MigratingFrom' --> 'Paused'
```

I discussed with mskrivanek about the right component for this issue and am moving it to the Virt component as per his suggestion.
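The firewall-based storage disconnect used in the reproduction above can be scripted. Below is a minimal sketch; the helper name `gluster_block_rules` is hypothetical, and the peer addresses are the ones from this setup:

```python
def gluster_block_rules(peers, revert=False):
    """Build iptables commands that REJECT all traffic to/from the given
    gluster peers. With revert=True, build the matching -D (delete) commands
    that remove the rules again after the test."""
    action = "-D" if revert else "-A"
    addrs = ",".join(peers)
    return [
        "iptables {} OUTPUT -p all --destination {} -j REJECT".format(action, addrs),
        "iptables {} INPUT -p all --source {} -j REJECT".format(action, addrs),
    ]

# Rules matching the reproduction in this bug (run on host2 as root):
for cmd in gluster_block_rules(["10.70.36.74", "10.70.36.76"]):
    print(cmd)
```

Generating the delete counterparts with `revert=True` makes it easy to restore connectivity once the test is done.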
To sum up, there is no bug in the existing behavior; it is just that the existing HA and fencing behavior has certain shortcomings:

- The intention to try to migrate VMs when a host was Up and now becomes Non-Operational is still valid, I believe. There is a good chance that some VMs are not affected by the specific reason the host status changed, e.g. in this case VMs not using disks on that gluster network, diskless VMs, or VMs which do no I/O at that time. For that we use an equivalent of "move host to maintenance", where we try to evacuate the host.
- The problem with the above is that when you are unable to migrate the VMs, they just remain stuck there; in case of disk access issues the VM gets Paused with an I/O error, and the only resolution is to kill that process, or in many cases on NFS you need to use power management to reboot/power-cycle the host.
- Fencing can do that, but it is currently designed to resolve issues when the engine cannot talk to the host at all. That is not the case here, as the ovirtmgmt communication is fine.

I would propose to keep both and improve our fencing aggressiveness. In the "preparing for maintenance" part of SetNonOperationalVdsCommand we can check whether there are any running VMs present (excluding those in Paused state), and if not, trigger fencing and kill the host.

Alternatively, this could be resolved via VM leases, but they would need to be configured for all the storage domains the VM is using; currently I believe we allow only one.

Does that sound feasible?

We would need to implement a new flow to execute power management fencing for this use case, as the current fencing flow is tightly coupled with host non-responsiveness, and several steps like SSH Soft Fencing or kdump detection are completely useless for this use case.

Based on comment #13, moving to infra for targeting. Sorry for the delay.

Martin, can anyone look at this?
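The proposed check can be sketched as follows. This is illustrative logic only, not the actual SetNonOperationalVdsCommand code, and the state names are simplified:

```python
# Sketch of the fencing check proposed above: when a host goes Non-Operational
# and evacuation is attempted, fence the host only if no VM on it is actively
# running. Paused VMs cannot be migrated anyway, so power-cycling the host
# would let HA restart them elsewhere.
RUNNING_STATES = {"Up", "PoweringUp", "MigratingFrom", "RebootInProgress"}

def should_fence_non_operational_host(vm_states):
    """Return True when no VM on the host is in a running (non-Paused) state,
    i.e. nothing would be harmed by power-cycling the host."""
    return not any(state in RUNNING_STATES for state in vm_states)

print(should_fence_non_operational_host(["Paused", "Down"]))  # nothing running: safe to fence
print(should_fence_non_operational_host(["Up", "Paused"]))    # a VM is still running: do not fence
```

The design choice here mirrors the comment: migration is still preferred for unaffected VMs, and fencing becomes the fallback once only stuck (Paused) VMs remain.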
Removing the 4.2 target; unfortunately this item was moved to infra too late for 4.2. Adding a fencing flow for hosts that go non-operational because the gluster network is unavailable seems to me like a quite dangerous change, which will require time to design properly and a lot of testing. AFAIK the workaround for now is to use VM leases configured on a storage domain provided by Gluster.

Can we test this flow and ensure configuring leases solves the issue? Moving to ON_QA to test the scenario with VM leases configured.

I have configured a VM lease for the VM. The HE VM and the HA app VM were running on one host. I brought down the gluster network on that node. The HE VM restarted on another node. The app VM went paused, but after I rebooted the host, the VM restarted on another host.

Verified with RHV 4.2.6-4 and glusterfs-3.8.4.
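The VM-lease workaround relies on sanlock-style lease expiry: a host that loses storage access can no longer renew its lease, so after the expiry interval another host may safely acquire the lease and restart the VM. A conceptual sketch follows; the 80-second interval is illustrative (not a value this bug asserts), and the timestamps echo the `renewal 72500` / `now 72593` pair from the wdmd log above:

```python
# Conceptual model of lease-based HA: a VM lease on shared storage must be
# renewed periodically; once renewal stops for a full expiry interval, the
# holder is assumed dead and another host may take over the lease.
LEASE_EXPIRY_SECONDS = 80  # illustrative expiry interval, not an asserted default

def lease_expired(last_renewal_ts, now_ts, expiry=LEASE_EXPIRY_SECONDS):
    """True when the holder has failed to renew for a full expiry interval."""
    return (now_ts - last_renewal_ts) >= expiry

def can_restart_elsewhere(last_renewal_ts, now_ts):
    # Only after the old holder's lease has provably expired is it safe for
    # another host to acquire the lease and restart the HA VM.
    return lease_expired(last_renewal_ts, now_ts)

print(can_restart_elsewhere(last_renewal_ts=72500, now_ts=72593))  # 93 s without renewal
```

This is why the lease workaround behaves as QA observed: the paused VM on the cut-off host could only be restarted elsewhere once that host was rebooted and its lease could no longer be held.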