Description of problem:
In a hyperconverged oVirt setup consisting of three nodes, I turned off networking on one of the hosts. The HE VM migrates successfully, but the non-responsive host is never fenced.

How reproducible: always

Steps to Reproduce:
1. Install a 3-node hyperconverged cluster.
2. Configure and enable power management for all hosts.
3. Execute ifdown ovirtmgmt on the host running the engine VM.
4. Wait for the HE VM to be respawned.

Actual results: The HE VM is respawned, but the previous HE VM host is not fenced.

Expected results: The previous HE VM host should be fenced.

Here are some engine log extracts:

2017-04-07 08:39:51,187Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler3) [] Command 'GetAllVmStatsVDSCommand(HostName = hc-lion.eng.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='1a788dad-ed85-4d90-85b5-290be4f8d99b'})' execution failed: java.net.NoRouteToHostException: No route to host
2017-04-07 08:39:51,188Z INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler3) [] Failed to fetch vms info for host 'hc-lion.eng.lab.tlv.redhat.com' - skipping VMs monitoring.
2017-04-07 08:39:51,253Z WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-5-thread-3) [] Host 'hc-lion.eng.lab.tlv.redhat.com' is not responding. It will stay in Connecting state for a grace period of 80 seconds and after that an attempt to fence the host will be issued.
2017-04-07 08:39:51,680Z WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-5-thread-3) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hc-lion.eng.lab.tlv.redhat.com is not responding. It will stay in Connecting state for a grace period of 80 seconds and after that an attempt to fence the host will be issued.
2017-04-07 08:40:27,332Z INFO [org.ovirt.engine.core.bll.VdsEventListener] (org.ovirt.thread.pool-5-thread-6) [] ResourceManager::vdsNotResponding entered for Host '1a788dad-ed85-4d90-85b5-290be4f8d99b', 'hc-lion.eng.lab.tlv.redhat.com'
2017-04-07 08:40:42,218Z INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to hc-lion.eng.lab.tlv.redhat.com/10.35.16.155
2017-04-07 08:40:45,226Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler5) [] Command 'GetAllVmStatsVDSCommand(HostName = hc-lion.eng.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='1a788dad-ed85-4d90-85b5-290be4f8d99b'})' execution failed: java.net.NoRouteToHostException: No route to host

No fence events were seen in the logs.
Please attach complete engine and hosts logs.
They are too big and not accepted by Bugzilla. Please use this link: https://drive.google.com/open?id=0B2yzqx8M1bM-bk83cGIzTUhNbVU
Was it still connected to the storage and alive?
The OS was alive, but connectivity was completely broken: storage was not available and engine connectivity was lost.
Eli, can you please take a look?
Denis, the bug was opened on "future" so I targeted it to 4.2. However, was "future" put there on purpose? Are you testing master? 4.1? 4.0?
(In reply to Denis Chaplygin from comment #2)
> They are too big and not accepted by Bugzilla. Please use this link:
> https://drive.google.com/open?id=0B2yzqx8M1bM-bk83cGIzTUhNbVU

Please allow access so I can get the files; I have sent a request for that.
Oved, I can't remember any reason why it's on 'future', so I believe that was a mistake. I discovered the issue on master. Eli, please try it again.
(In reply to Denis Chaplygin from comment #0)
> No fence events were seen in logs.

Before that I see a few errors that might point to a problem in the fencing agent definition:

"unknown option '--ssl-insecure', , ERROR:root:Unable to connect/login to fencing device, , , Unable to connect/login to fencing device"

Can you please check the PM definitions?
Yes, I had problems configuring the fencing agent earlier. I eventually figured out the correct settings and tried again, with no luck. For the last attempt the agent was configured correctly.
After investigation, this looks like a race condition. The non-responding host was in a fencing flow while the engine VM was being restarted on another machine: the fencing flow completed only the <stop> step, and then the engine VM went down. Since on engine start we look only for hosts with <non-responding> status, this host, which is actually in <down> status, will not be started.

Proposed solution: add an in_fencing_flow boolean column to vds_dynamic.
1) The flag is set at the start of the hard fencing flow (stop command) and reset at the end of the hard fencing flow (start command), when the origin of those commands is the non-responding-host treatment and not manual fencing.
2) In the non-responding treatment that occurs on engine start, we will also take into account hosts that have in_fencing_flow set, and reset the in_fencing_flow flag.
3) When the host is UP, either by a successful start command or by Host Monitoring, in_fencing_flow should be reset as well.

This will ensure that the engine will try to fence hosts that were stuck in the middle of a fencing flow as a result of an engine restart.
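The proposed flag handling could be sketched roughly as below. This is only an illustrative simulation under stated assumptions, not the actual oVirt engine code: VdsDynamic here is a simplified stand-in for the real vds_dynamic row, and the method names (fenceStop, fenceStart, hostsToTreatOnEngineStart) are invented for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class FencingFlowSketch {
    enum HostStatus { UP, CONNECTING, NON_RESPONSIVE, DOWN }

    // Simplified stand-in for a vds_dynamic row (hypothetical).
    static class VdsDynamic {
        final String name;
        HostStatus status;
        boolean inFencingFlow; // the proposed new boolean column
        VdsDynamic(String name, HostStatus status) { this.name = name; this.status = status; }
    }

    // (1) Hard fencing flow: set the flag on <stop>, clear it on <start>.
    static void fenceStop(VdsDynamic host) {
        host.inFencingFlow = true;      // set at the start of the hard fencing flow
        host.status = HostStatus.DOWN;  // stop command powers the host off
    }
    static void fenceStart(VdsDynamic host) {
        host.status = HostStatus.UP;
        host.inFencingFlow = false;     // reset at the end of the flow (and on host UP)
    }

    // (2) Non-responding treatment on engine start: besides NON_RESPONSIVE
    // hosts, also pick up hosts stuck mid-flow (flag still set) and reset the flag.
    static List<VdsDynamic> hostsToTreatOnEngineStart(List<VdsDynamic> all) {
        List<VdsDynamic> toTreat = new ArrayList<>();
        for (VdsDynamic h : all) {
            if (h.status == HostStatus.NON_RESPONSIVE || h.inFencingFlow) {
                h.inFencingFlow = false;
                toTreat.add(h);
            }
        }
        return toTreat;
    }

    public static void main(String[] args) {
        VdsDynamic host = new VdsDynamic("hc-lion", HostStatus.NON_RESPONSIVE);
        fenceStop(host); // engine VM dies here, so fenceStart() never runs
        // After engine restart: the DOWN host is still selected for treatment.
        List<VdsDynamic> stuck = hostsToTreatOnEngineStart(List.of(host));
        System.out.println(stuck.size() + " host(s) to re-fence: " + stuck.get(0).name);
    }
}
```

The key point the sketch tries to show is that without the flag, a host left in DOWN status after an interrupted <stop> would never match the NON_RESPONSIVE-only query on engine start.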
Verified on ovirt-engine-4.2.0-0.0.master.20171013142622.git15e767c.el7.centos.noarch
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.