Bug 861220 - backend; If you can't connect to a host due to name resolution failure, don't fence it
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.3.0
Assigned To: Yaniv Bronhaim
infra
Depends On:
Blocks:
 
Reported: 2012-09-27 16:59 EDT by Yaniv Kaul
Modified: 2016-02-10 14:21 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-07-08 09:22:01 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Yaniv Kaul 2012-09-27 16:59:04 EDT
Comment 1 Yaniv Kaul 2012-09-27 17:01:10 EDT
If the backend can't resolve the name of a host, it can't connect to it, and it will consider the host non-responsive. However, it will then also try to fence it. The fencing device is often defined by IP address or in a different domain, so the fencing might even succeed.
But seriously, it's a dire mistake to fence a host just because you failed to resolve its hostname to an IP address.
Comment 2 Yaniv Kaul 2012-09-27 17:13:59 EDT
One may argue that you should not fence a host if the connection attempt times out (as opposed to being rejected). However, a DNS resolution failure is clearly a case where it is not the host's fault, and rebooting will rarely solve it (it might if it was dynamic DNS, the server registered via DHCP, yada-yada-yada).
Comment 4 Barak 2012-10-02 08:38:54 EDT
I don't understand how the bootstrapping succeeded when the hostname could not be resolved.
Comment 5 Yaniv Kaul 2012-10-02 09:10:34 EDT
Both comments 3 and 4 miss the issue:
All is well, hosts are installed and working, and now *RHEVM* loses its DNS connectivity. It will therefore fail to communicate with the hosts, and might try to fence them.
Comment 6 Yair Zaslavsky 2012-10-04 02:26:59 EDT
What about resolving the hostname during registration (obtaining the IP address), and keeping this information as well for the host?
Comment 7 Barak 2012-10-04 03:58:19 EDT
(In reply to comment #6)
> What about resolving the hostname during registration (obtaining the IP
> address), and keeping this information as well for the host?

IPs might change over time.

The fix should be to detect that we can't resolve the hostname (maybe add an alert about it) and skip fencing in this case.

I'm not sure how this may influence the rest of the flows.
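
A minimal sketch of the behaviour suggested here, under stated assumptions: when a host is reported non-responsive, retry name resolution first, and skip fencing (raising an alert instead) if the name does not resolve. The names `FenceDecisionSketch`, `Resolver` and `Action` are hypothetical and are not taken from the ovirt-engine code; only `java.net.InetAddress` and `java.net.UnknownHostException` are real JDK classes.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical sketch; not the actual ovirt-engine fencing flow.
final class FenceDecisionSketch {

    // Abstraction over name resolution so the decision can be exercised without a DNS server.
    interface Resolver {
        InetAddress resolve(String hostname) throws UnknownHostException;
    }

    enum Action { PROCEED_WITH_FENCING, SKIP_FENCING_AND_ALERT }

    // Called once a host has been declared non-responsive.
    static Action onNonResponsive(String hostname, Resolver resolver) {
        try {
            resolver.resolve(hostname);
            // The name resolves, so the failure lies elsewhere: keep the existing fence flow.
            return Action.PROCEED_WITH_FENCING;
        } catch (UnknownHostException e) {
            // The engine cannot even resolve the name: not evidence that the host is down.
            return Action.SKIP_FENCING_AND_ALERT;
        }
    }
}
```

With the JDK resolver this would be called as `FenceDecisionSketch.onNonResponsive(hostname, InetAddress::getByName)`. How such a check interacts with the rest of the flows is exactly the open question raised above.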
Comment 8 Yair Zaslavsky 2012-10-09 08:45:41 EDT
This looks like 3.2 material, since we need to extend NetworkException and track the exception reason.
Currently NetworkException simply inherits from GenericException and does not carry the failure reason.
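
One possible shape for such an extension, as a sketch only: the classes below are simplified stand-ins that reuse the names NetworkException and GenericException from this comment, not the real engine classes, and the `NetworkFailureReason` enum and its values are assumptions made for illustration.

```java
// Simplified stand-in for the engine's base exception; hypothetical.
class GenericException extends RuntimeException {
    GenericException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical set of reasons a caller might want to tell apart.
enum NetworkFailureReason { NAME_RESOLUTION, CONNECT_FAILED, TIMEOUT, UNKNOWN }

// NetworkException extended to carry the reason alongside the original cause.
class NetworkException extends GenericException {
    private final NetworkFailureReason reason;

    NetworkException(String message, Throwable cause, NetworkFailureReason reason) {
        super(message, cause);
        // Fall back to UNKNOWN so callers never have to handle a null reason.
        this.reason = (reason == null) ? NetworkFailureReason.UNKNOWN : reason;
    }

    NetworkFailureReason getReason() {
        return reason;
    }
}
```

A caller in the non-responsive handling could then branch on `getReason()` before deciding whether to start fencing.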
Comment 9 Yair Zaslavsky 2013-05-12 11:25:59 EDT
Yaniv B, can you please check what the implications are here?
Comment 10 Yaniv Bronhaim 2013-05-16 08:35:25 EDT
It doesn't seem that we can distinguish between "types" of network exceptions, unless we check for specific issues when the exception is raised. Generally, this exception means that the TCP connection between the engine and the host is broken. That can be because of routing errors, a hard disconnection, iptables and more; all of them lead to a broken session. When the DNS server is not responsive, or the hostname can't be resolved by the host, the engine does not keep the IP address and use it, nor does it try to communicate using a different hostname. It just keeps trying to reach the declared hostname, either by reaching the DNS server or by reading the /etc/hosts that is defined on the host.

As you mentioned, without a reply from vdsm we continue to the fence flow.
It doesn't matter whether the reason for the non-responsive state is the host itself or a communication problem.
As the fence mechanism describes, if a host is unreachable for more than ** seconds, this must lead to fencing, to allow all used resources to be released. It sounds like a classic reason to fence.
Comment 11 Yair Zaslavsky 2013-05-16 08:38:13 EDT
Yaniv, DNS code in Java is based on the javax.naming package, which has an exception (NamingException) that can be caught. So if you have a name resolution error, maybe we can still distinguish it?
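
As a rough illustration of how the distinction could be made, the sketch below walks an exception's cause chain and looks for a resolution failure. Note that plain `java.net` lookups such as `InetAddress.getByName` report an unresolvable name as `java.net.UnknownHostException`; `javax.naming.NamingException` only appears when the lookup goes through JNDI, so the sketch checks for both. `FailureClassifier` and its `Reason` values are hypothetical names, and whether the engine's transport layer preserves the original cause is an assumption.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import javax.naming.NamingException;

// Hypothetical helper; not part of ovirt-engine.
final class FailureClassifier {

    enum Reason { NAME_RESOLUTION, CONNECT_FAILED, TIMEOUT, UNKNOWN }

    private static final int MAX_DEPTH = 32; // guard against cyclic cause chains

    static Reason classify(Throwable error) {
        Throwable t = error;
        for (int depth = 0; t != null && depth < MAX_DEPTH; depth++) {
            if (t instanceof UnknownHostException || t instanceof NamingException) {
                // Treating every NamingException as a resolution problem is a simplification.
                return Reason.NAME_RESOLUTION;
            }
            if (t instanceof ConnectException) {
                return Reason.CONNECT_FAILED;
            }
            if (t instanceof SocketTimeoutException) {
                return Reason.TIMEOUT;
            }
            t = t.getCause();
        }
        return Reason.UNKNOWN;
    }
}
```

This only works if the layer that raises the network exception wraps the underlying exception rather than discarding it.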
