Red Hat Bugzilla – Bug 861220
backend; If you can't connect to a host due to name resolution failure, don't fence it
Last modified: 2016-02-10 14:21:40 EST
Description of problem:
If the backend can't resolve the name of a host, it can't connect to it, and it will think the host is non-responsive. However, it will also try to fence it. The fencing device is often defined by IP address or sits in a different domain, so the fencing might even succeed.
But seriously, it's a dire mistake to fence a host just because you failed to resolve its hostname to an IP address.
One may argue that you should not fence a host if the connection to it times out (as opposed to being rejected). However, DNS resolution is clearly a case where the failure is not the host's fault, and rebooting will rarely solve it (it might if dynamic DNS is in use, the server registers via DHCP, yada-yada-yada).
I don't understand how the bootstrapping succeeded when the hostname could not be resolved?
Both comment 3 and 4 miss the issue:
All is well, hosts are installed and working - and now *RHEVM* loses its DNS connectivity. It will therefore fail to communicate with the hosts, and might try to fence them.
What about resolving the hostname during registration (obtaining the IP address), and keeping this information as well for the host?
(In reply to comment #6)
> What about resolving the hostname during registration (obtaining the IP
> address), and keeping this information as well for the host?
IPs might change over time.
The resolution should be to detect that we can't resolve the hostname (maybe add an alert about it) and skip fencing in this case.
I'm not sure how this may influence the rest of the flows.
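A minimal sketch of the "detect the resolution failure and skip fencing" idea, not the engine's actual code. `FencingGuard`, `Resolver` and `Decision` are hypothetical names invented for illustration; only `InetAddress.getByName` and `UnknownHostException` are real JDK APIs. It assumes the check runs on the engine side, right before the fence flow would start for a host that is already considered non-responsive.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical guard: veto fencing when the engine itself cannot resolve the host name. */
final class FencingGuard {

    /** Hypothetical seam so the lookup can be swapped out (for example in tests). */
    interface Resolver {
        InetAddress resolve(String hostname) throws UnknownHostException;
    }

    enum Decision { PROCEED_WITH_FENCING, SKIP_FENCING_AND_ALERT }

    private final Resolver resolver;

    FencingGuard(Resolver resolver) {
        this.resolver = resolver;
    }

    /** Uses the JDK's normal lookup path (hosts file, DNS, ...). */
    static FencingGuard usingSystemResolver() {
        return new FencingGuard(InetAddress::getByName);
    }

    /** Called only for a host that the engine already considers non-responsive. */
    Decision decide(String hostname) {
        try {
            resolver.resolve(hostname);
            // The name resolves, so the failure is something else; keep the existing behaviour.
            return Decision.PROCEED_WITH_FENCING;
        } catch (UnknownHostException e) {
            // The engine cannot even look the host up: not the host's fault, so do not fence.
            return Decision.SKIP_FENCING_AND_ALERT;
        }
    }
}
```

A caller would raise the alert and leave the host alone when `decide` returns `SKIP_FENCING_AND_ALERT`. How such a veto interacts with the rest of the non-responsive flows is exactly the open question raised above.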
This seems to be 3.2 material, since we need to extend the NetworkException and track the exception reason.
Currently NetworkException just inherits from GenericException and does not carry the failure reason.
Yaniv B, can you please check what the implications are here?
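To make the proposed extension concrete, here is a rough sketch of a NetworkException that carries a failure reason. The classes below are simplified stand-ins, not the engine's real `GenericException` or `NetworkException`; the `NetworkFailureReason` enum, its constants and `shouldSkipFencing` are assumptions made purely for illustration.

```java
/** Simplified stand-in for the engine's base exception (assumption: the real class differs). */
class GenericException extends RuntimeException {
    GenericException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical failure reasons; the names are illustrative only. */
enum NetworkFailureReason {
    NAME_RESOLUTION_FAILED,
    CONNECTION_FAILED,
    UNKNOWN
}

/** Stand-in NetworkException that records why communication failed. */
class NetworkException extends GenericException {
    private final NetworkFailureReason reason;

    NetworkException(String message, Throwable cause, NetworkFailureReason reason) {
        super(message, cause);
        // Fall back to UNKNOWN so callers never have to handle a null reason.
        this.reason = (reason == null) ? NetworkFailureReason.UNKNOWN : reason;
    }

    NetworkFailureReason getReason() {
        return reason;
    }

    /** Hypothetical policy hook: only a name-resolution failure vetoes fencing. */
    boolean shouldSkipFencing() {
        return reason == NetworkFailureReason.NAME_RESOLUTION_FAILED;
    }
}
```

With something like this, the code that decides whether to fence could inspect `getReason()` instead of treating every network failure the same way.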
It doesn't seem like we can distinguish "types" of network exceptions, unless we check for specific issues when the exception is raised. Generally this exception means the TCP connection between engine and host is broken, which can be caused by routing errors, a hard disconnection, iptables and more; all of them lead to a broken session. When the DNS server is not responsive, or the hostname can't be resolved by the host, the engine doesn't keep the IP address and use it, nor does it try to communicate with a different hostname. It just keeps trying to reach the declared hostname by querying the DNS server or reading the /etc/hosts file defined on the host.
As you mentioned, without a reply from vdsm we continue to the fence flow.
It doesn't matter if the reason for the non-responsive state is the host itself or a communication problem.
As the fence mechanism describes, if a host is unreachable for more than ** seconds, it must lead to a fence to allow the release of all used resources. It sounds like a classic reason to fence.
Yaniv, DNS code in Java is based on the javax.naming package, which contains an exception (NamingException) that can be caught. So if you have a name resolution error, maybe we can still distinguish it?
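As a hedged illustration of how such a check might look, the helper below walks the cause chain of whatever exception surfaced and reports whether a name-resolution error is in it. `NameResolutionErrors` and `isNameResolutionFailure` are hypothetical names. Which exception actually appears depends on how the engine's transport performs the lookup: JNDI lookups raise `javax.naming.NamingException`, while plain `java.net` lookups such as `InetAddress.getByName` raise `java.net.UnknownHostException`, so the sketch checks for both.

```java
import java.net.UnknownHostException;
import javax.naming.NamingException;

/** Hypothetical helper: detect a name-resolution error anywhere in a cause chain. */
final class NameResolutionErrors {

    // Arbitrary bound so a malformed, self-referencing cause chain cannot loop forever.
    private static final int MAX_DEPTH = 32;

    private NameResolutionErrors() {
    }

    static boolean isNameResolutionFailure(Throwable error) {
        int depth = 0;
        for (Throwable t = error; t != null && depth < MAX_DEPTH; t = t.getCause(), depth++) {
            if (t instanceof UnknownHostException || t instanceof NamingException) {
                return true;
            }
        }
        return false;
    }
}
```

This only works if the layer that wraps the low-level error keeps the original exception as the cause; if the cause is dropped before the network exception is raised, there is nothing left to inspect.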