Bug 1440101 - Host that lost network connectivity is not fenced
Summary: Host that lost network connectivity is not fenced
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: future
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.0
Target Release: 4.2.0
Assignee: Eli Mesika
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On: 1488554 1492791 1493070
Blocks:
Reported: 2017-04-07 09:50 UTC by Denis Chaplygin
Modified: 2017-12-20 10:46 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-20 10:46:22 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.2+
lsvaty: testing_ack+


Attachments: none


Links
System        ID     Private  Priority  Status  Summary                                   Last Updated
oVirt gerrit  79882  0        None      MERGED  core: handle HE VM hosting host fencing  2020-11-23 13:53:14 UTC

Description Denis Chaplygin 2017-04-07 09:50:30 UTC
Description of problem: On a hyperconverged oVirt setup consisting of three nodes, I turned off networking on one of the hosts. The VM migrates successfully, but the non-responsive host is never fenced.


How reproducible: always


Steps to Reproduce:
1. Install a 3-node hyperconverged cluster.
2. Configure and enable power management for all hosts.
3. Execute ifdown ovirtmgmt on the host running the engine VM.
4. Wait for the HE VM to be respawned.

Actual results: The HE VM is respawned, but the previous HE VM host is not fenced.


Expected results: The previous HE VM host should be fenced.


Here are some engine log extracts:

2017-04-07 08:39:51,187Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler3) [] Command 'GetAllVmStatsVDSCommand(HostName = hc-lion.eng.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='1a788dad-ed85-4d90-85b5-290be4f8d99b'})' execution failed: java.net.NoRouteToHostException: No route to host
2017-04-07 08:39:51,188Z INFO  [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler3) [] Failed to fetch vms info for host 'hc-lion.eng.lab.tlv.redhat.com' - skipping VMs monitoring.
2017-04-07 08:39:51,253Z WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-5-thread-3) [] Host 'hc-lion.eng.lab.tlv.redhat.com' is not responding. It will stay in Connecting state for a grace period of 80 seconds and after that an attempt to fence the host will be issued.
2017-04-07 08:39:51,680Z WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-5-thread-3) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hc-lion.eng.lab.tlv.redhat.com is not responding. It will stay in Connecting state for a grace period of 80 seconds and after that an attempt to fence the host will be issued.

2017-04-07 08:40:27,332Z INFO  [org.ovirt.engine.core.bll.VdsEventListener] (org.ovirt.thread.pool-5-thread-6) [] ResourceManager::vdsNotResponding entered for Host '1a788dad-ed85-4d90-85b5-290be4f8d99b', 'hc-lion.eng.lab.tlv.redhat.com'
2017-04-07 08:40:42,218Z INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to hc-lion.eng.lab.tlv.redhat.com/10.35.16.155
2017-04-07 08:40:45,226Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler5) [] Command 'GetAllVmStatsVDSCommand(HostName = hc-lion.eng.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='1a788dad-ed85-4d90-85b5-290be4f8d99b'})' execution failed: java.net.NoRouteToHostException: No route to host


No fence events were seen in logs.

Comment 1 Oved Ourfali 2017-04-07 10:29:42 UTC
Please attach complete engine and host logs.

Comment 2 Denis Chaplygin 2017-04-07 10:46:10 UTC
They are too big and are not accepted by Bugzilla. Please use this link: https://drive.google.com/open?id=0B2yzqx8M1bM-bk83cGIzTUhNbVU

Comment 3 Yaniv Kaul 2017-04-08 16:57:04 UTC
Was it still connected to the storage and alive?

Comment 4 Denis Chaplygin 2017-04-10 08:36:52 UTC
The OS was alive, but connectivity was completely broken: storage was not available and engine connectivity was lost.

Comment 5 Oved Ourfali 2017-04-28 17:48:28 UTC
Eli, can you please take a look?

Comment 6 Oved Ourfali 2017-04-28 17:49:36 UTC
Denis, the bug was opened on "future" so I targeted it to 4.2.
However, was "future" put there on purpose? Are you testing master? 4.1? 4.0?

Comment 7 Eli Mesika 2017-04-30 09:06:35 UTC
(In reply to Denis Chaplygin from comment #2)
> They are too big and not accepted by bugzilla. Please use that link
> https://drive.google.com/open?id=0B2yzqx8M1bM-bk83cGIzTUhNbVU

Please allow access so that I can get the files; I have sent a request for that.

Comment 8 Denis Chaplygin 2017-05-02 12:31:14 UTC
Oved, I can't remember any reason why it's on 'future', so I believe it was a mistake. I discovered it on master.

Eli, please try it again.

Comment 9 Eli Mesika 2017-05-25 10:40:17 UTC
(In reply to Denis Chaplygin from comment #0)
> [Description of problem and engine log extracts quoted verbatim; see comment #0.]

Before that, I see a few errors that might point to a problem in the fencing agent definition:
"unknown option '--ssl-insecure', , ERROR:root:Unable to connect/login to fencing device, , , Unable to connect/login to fencing device"

Can you please check the PM definitions?

Comment 10 Denis Chaplygin 2017-05-25 10:43:56 UTC
Yes, I had a problem configuring the fencing agent earlier. I eventually figured out the correct settings and tried again, with no luck. For the last attempt the agent was configured correctly.

Comment 11 Eli Mesika 2017-07-17 18:57:26 UTC
After investigation, this looks like a race condition.

The non-responding host was in the middle of a fencing flow while the engine VM was restarted on another machine: the flow had executed only the <stop> step when the engine VM went down. Since on engine start we look only for hosts in <non-responding> status, this host, which is actually in <down> status, will never be started.

The proposed solution is to add an in_fencing_flow boolean column to vds_dynamic:

1) The flag is set at the start of the hard fencing flow (stop command) and reset at the end of it (start command), when those commands originate from non-responding-host treatment rather than from manual fencing.

2) The non-responding treatment that runs on engine start will also take into account hosts that have in_fencing_flow set, and will reset the flag.

3) When the host comes UP, either via a successful start command or via host monitoring, in_fencing_flow should be reset as well.

This ensures that the engine will try to fence hosts that were left stuck in the middle of a fencing flow by an engine restart; a sketch of the flag lifecycle follows.
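
To make the proposed flag lifecycle concrete, here is a minimal, self-contained Java sketch. All class and member names below (HostRecord, FencingCoordinator, etc.) are hypothetical illustrations and not the actual engine classes; the real change lives in the engine's fencing code (see gerrit 79882).

import java.util.ArrayList;
import java.util.List;

enum HostStatus { UP, CONNECTING, NON_RESPONSIVE, DOWN }

// Hypothetical stand-in for a row of the vds_dynamic table.
class HostRecord {
    final String name;
    HostStatus status = HostStatus.UP;
    boolean inFencingFlow = false; // the proposed in_fencing_flow column

    HostRecord(String name) { this.name = name; }
}

class FencingCoordinator {

    // (1) Hard fencing flow triggered by non-responding treatment (not manual fencing).
    void hardFence(HostRecord host) {
        host.inFencingFlow = true;   // set at the start of the flow (stop command)
        stop(host);
        // If the engine VM dies here, the flag persists in the database, so the
        // host is picked up again on engine start (see hostsToTreatOnStartup).
        start(host);
        host.inFencingFlow = false;  // reset at the end of the flow (start command)
    }

    // (2) Non-responding treatment on engine start: besides NON_RESPONSIVE hosts,
    // also pick up hosts left stuck mid-flow, and reset their flag.
    List<HostRecord> hostsToTreatOnStartup(List<HostRecord> hosts) {
        List<HostRecord> toTreat = new ArrayList<>();
        for (HostRecord h : hosts) {
            if (h.status == HostStatus.NON_RESPONSIVE || h.inFencingFlow) {
                h.inFencingFlow = false;
                toTreat.add(h);
            }
        }
        return toTreat;
    }

    // (3) Host monitoring: once the host is UP again, reset the flag as well.
    void onHostUp(HostRecord host) {
        host.status = HostStatus.UP;
        host.inFencingFlow = false;
    }

    private void stop(HostRecord host)  { host.status = HostStatus.DOWN; }
    private void start(HostRecord host) { host.status = HostStatus.UP; }
}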

Comment 12 Petr Matyáš 2017-10-19 09:29:37 UTC
Verified on ovirt-engine-4.2.0-0.0.master.20171013142622.git15e767c.el7.centos.noarch

Comment 13 Sandro Bonazzola 2017-12-20 10:46:22 UTC
This bug is included in the oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be resolved in the oVirt 4.2.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

