Bug 1440101 - Host that lost network connectivity is not fenced
Summary: Host that lost network connectivity is not fenced
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: future
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.0
Target Release: 4.2.0
Assignee: Eli Mesika
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On: 1488554 1492791 1493070
Blocks:
Reported: 2017-04-07 09:50 UTC by Denis Chaplygin
Modified: 2017-12-20 10:46 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-20 10:46:22 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.2+
lsvaty: testing_ack+


Attachments: none


Links
System        ID     Private  Priority  Status  Summary                                   Last Updated
oVirt gerrit  79882  0        None      MERGED  core: handle HE VM hosting host fencing  2020-11-23 13:53:14 UTC

Description Denis Chaplygin 2017-04-07 09:50:30 UTC
Description of problem: On a hyperconverged oVirt setup consisting of three nodes, I turned off networking on one of the hosts. The VM migrates successfully, but the non-responsive host is never fenced.


How reproducible: always


Steps to Reproduce:
1. Install a 3-node hyperconverged cluster.
2. Configure and enable power management for all hosts.
3. Execute ifdown ovirtmgmt on the host running the engine VM.
4. Wait for the HE VM to be respawned.

Actual results: The HE VM is respawned, but the previous HE VM host is not fenced.


Expected results: The previous HE VM host should be fenced.


Here are some engine log extracts:

2017-04-07 08:39:51,187Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler3) [] Command 'GetAllVmStatsVDSCommand(HostName = hc-lion.eng.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='1a788dad-ed85-4d90-85b5-290be4f8d99b'})' execution failed: java.net.NoRouteToHostException: No route to host
2017-04-07 08:39:51,188Z INFO  [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler3) [] Failed to fetch vms info for host 'hc-lion.eng.lab.tlv.redhat.com' - skipping VMs monitoring.
2017-04-07 08:39:51,253Z WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-5-thread-3) [] Host 'hc-lion.eng.lab.tlv.redhat.com' is not responding. It will stay in Connecting state for a grace period of 80 seconds and after that an attempt to fence the host will be issued.
2017-04-07 08:39:51,680Z WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-5-thread-3) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hc-lion.eng.lab.tlv.redhat.com is not responding. It will stay in Connecting state for a grace period of 80 seconds and after that an attempt to fence the host will be issued.

2017-04-07 08:40:27,332Z INFO  [org.ovirt.engine.core.bll.VdsEventListener] (org.ovirt.thread.pool-5-thread-6) [] ResourceManager::vdsNotResponding entered for Host '1a788dad-ed85-4d90-85b5-290be4f8d99b', 'hc-lion.eng.lab.tlv.redhat.com'
2017-04-07 08:40:42,218Z INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to hc-lion.eng.lab.tlv.redhat.com/10.35.16.155
2017-04-07 08:40:45,226Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler5) [] Command 'GetAllVmStatsVDSCommand(HostName = hc-lion.eng.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='1a788dad-ed85-4d90-85b5-290be4f8d99b'})' execution failed: java.net.NoRouteToHostException: No route to host


No fence events were seen in logs.

Comment 1 Oved Ourfali 2017-04-07 10:29:42 UTC
Please attach complete engine and host logs.

Comment 2 Denis Chaplygin 2017-04-07 10:46:10 UTC
They are too big and are not accepted by Bugzilla. Please use this link: https://drive.google.com/open?id=0B2yzqx8M1bM-bk83cGIzTUhNbVU

Comment 3 Yaniv Kaul 2017-04-08 16:57:04 UTC
Was it still connected to the storage and alive?

Comment 4 Denis Chaplygin 2017-04-10 08:36:52 UTC
The OS was alive, but connectivity was completely broken: storage was not available and engine connectivity was lost.

Comment 5 Oved Ourfali 2017-04-28 17:48:28 UTC
Eli, can you please take a look?

Comment 6 Oved Ourfali 2017-04-28 17:49:36 UTC
Denis, the bug was opened on "future" so I targeted it to 4.2.
However, was "future" put there on purpose? Are you testing master? 4.1? 4.0?

Comment 7 Eli Mesika 2017-04-30 09:06:35 UTC
(In reply to Denis Chaplygin from comment #2)
> They are too big and not accepted by bugzilla. Please use that link
> https://drive.google.com/open?id=0B2yzqx8M1bM-bk83cGIzTUhNbVU

Please allow access so that I can get the files; I have sent a request for that.

Comment 8 Denis Chaplygin 2017-05-02 12:31:14 UTC
Oved, I can't remember any reason why it's on 'future', so I believe it was a mistake. I discovered it on master.

Eli, please try it again.

Comment 9 Eli Mesika 2017-05-25 10:40:17 UTC
(In reply to Denis Chaplygin from comment #0)
> [Description of problem and engine log extracts quoted verbatim; see comment #0.]

Before that, I see a few errors that might point to a problem in the fencing agent definition:
"unknown option '--ssl-insecure', , ERROR:root:Unable to connect/login to fencing device, , , Unable to connect/login to fencing device"

Can you please check the PM definitions?

Comment 10 Denis Chaplygin 2017-05-25 10:43:56 UTC
Yes, I had a problem configuring the fencing agent earlier. I eventually figured out the correct settings and tried again, with no luck. For the last attempt the agent was configured correctly.

Comment 11 Eli Mesika 2017-07-17 18:57:26 UTC
After investigation, this looks like a race condition.

The non-responding host was in the middle of a fencing flow while the engine VM was restarted on another machine: the flow had executed only the <stop> step when the engine VM went down. Since on engine start we look only for hosts in <non-responding> status, this host, which is actually in <down> status, will never be started.

The proposed solution is to add an in_fencing_flow boolean column to vds_dynamic:

1) The flag is set at the start of the hard fencing flow (stop command) and reset at the end of it (start command), when those commands originate from non-responding-host treatment rather than from manual fencing.

2) The non-responding treatment that runs on engine start will also take into account hosts that have in_fencing_flow set, and will reset the flag.

3) When the host comes UP, either via a successful start command or via host monitoring, in_fencing_flow should be reset as well.

This ensures that the engine will try to fence hosts that were left stuck in the middle of a fencing flow by an engine restart; a sketch of the flag lifecycle follows.
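
To make the proposed flag lifecycle concrete, here is a minimal, self-contained Java sketch. All class and member names below (HostRecord, FencingCoordinator, etc.) are hypothetical illustrations and not the actual engine classes; the real change lives in the engine's fencing code (see gerrit 79882).

import java.util.ArrayList;
import java.util.List;

enum HostStatus { UP, CONNECTING, NON_RESPONSIVE, DOWN }

// Hypothetical stand-in for a row of the vds_dynamic table.
class HostRecord {
    final String name;
    HostStatus status = HostStatus.UP;
    boolean inFencingFlow = false; // the proposed in_fencing_flow column

    HostRecord(String name) { this.name = name; }
}

class FencingCoordinator {

    // (1) Hard fencing flow triggered by non-responding treatment (not manual fencing).
    void hardFence(HostRecord host) {
        host.inFencingFlow = true;   // set at the start of the flow (stop command)
        stop(host);
        // If the engine VM dies here, the flag persists in the database, so the
        // host is picked up again on engine start (see hostsToTreatOnStartup).
        start(host);
        host.inFencingFlow = false;  // reset at the end of the flow (start command)
    }

    // (2) Non-responding treatment on engine start: besides NON_RESPONSIVE hosts,
    // also pick up hosts left stuck mid-flow, and reset their flag.
    List<HostRecord> hostsToTreatOnStartup(List<HostRecord> hosts) {
        List<HostRecord> toTreat = new ArrayList<>();
        for (HostRecord h : hosts) {
            if (h.status == HostStatus.NON_RESPONSIVE || h.inFencingFlow) {
                h.inFencingFlow = false;
                toTreat.add(h);
            }
        }
        return toTreat;
    }

    // (3) Host monitoring: once the host is UP again, reset the flag as well.
    void onHostUp(HostRecord host) {
        host.status = HostStatus.UP;
        host.inFencingFlow = false;
    }

    private void stop(HostRecord host)  { host.status = HostStatus.DOWN; }
    private void start(HostRecord host) { host.status = HostStatus.UP; }
}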

Comment 12 Petr Matyáš 2017-10-19 09:29:37 UTC
Verified on ovirt-engine-4.2.0-0.0.master.20171013142622.git15e767c.el7.centos.noarch

Comment 13 Sandro Bonazzola 2017-12-20 10:46:22 UTC
This bug is included in the oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be resolved in the oVirt 4.2.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

