Bug 1158023

Summary: DNS Failure of first DNS server causes engine lockup: java.io.EOFException: SSL peer shut down incorrectly
Product: [Retired] oVirt
Reporter: Daniel Helgenberger <daniel.helgenberger>
Component: ovirt-engine-core
Assignee: Eli Mesika <emesika>
Status: CLOSED WONTFIX
QA Contact: Pavel Stehlik <pstehlik>
Severity: urgent
Priority: unspecified
Version: 3.5
CC: alonbl, bugs, ecohen, gklein, iheim, lsurette, oourfali, rbalakri, s.kieske, yeylon
Target Release: 3.6.0
Hardware: x86_64
OS: Linux
Whiteboard: infra
Doc Type: Bug Fix
Last Closed: 2015-03-10 09:30:25 UTC
Type: Bug
oVirt Team: Infra
Attachments: Sample engine.log: SSL peer shut down incorrectly (flags: none)

Description Daniel Helgenberger 2014-10-28 10:52:22 UTC
Created attachment 951353
Sample engine.log: SSL peer shut down incorrectly

Description of problem:
If the engine fails to look up hosts through the first DNS server, host names are no longer resolved and it cannot connect to the hosts. The engine is effectively locked up, endlessly trying to fence and reconnect to the hosts, and the logs fill with 'java.io.EOFException: SSL peer shut down incorrectly'.

Version-Release number of selected component (if applicable):
3.5

Engine Host:
CentOS 6.5

How reproducible:
Always

Steps to Reproduce:
On the engine host:
1. Make sure DNS is working and hosts are looked up correctly from the engine
2. Shut down the engine
3. Optional: Clear the DNS cache
4. Edit /etc/resolv.conf and add an invalid NS as the first entry (to simulate a failed DNS server)
5. Test with nslookup; it should report ';; connection timed out; trying next origin' for the first NS
6. Start the engine
7. Watch engine.log
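
To observe the same lookup from the JVM's point of view rather than through nslookup, a small probe along the following lines could be run on the engine host. This is only a sketch: the class name `ResolveProbe` is made up for illustration and is not part of the engine code. It times a lookup through `InetAddress`, the JVM's built-in resolver.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of ovirt-engine: times one lookup through
// the JVM's built-in resolver (the one used by InetAddress).
final class ResolveProbe {

    /** Outcome of one lookup: the address text (null on failure) and the elapsed time. */
    static final class Result {
        final String address;
        final long elapsedMillis;

        Result(String address, long elapsedMillis) {
            this.address = address;
            this.elapsedMillis = elapsedMillis;
        }
    }

    static Result resolve(String host) {
        long start = System.nanoTime();
        String address = null;
        try {
            address = InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            // Leave address as null; the caller decides how to report the failure.
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        return new Result(address, elapsedMillis);
    }

    public static void main(String[] args) {
        // Pass one or more host names of the virtualization hosts as arguments.
        for (String host : args) {
            Result r = resolve(host);
            System.out.println(host + " -> " + r.address + " (" + r.elapsedMillis + " ms)");
        }
    }
}
```

Running it once before step 4 and once after would show whether, and how slowly, the JVM resolves the hosts with the invalid first entry in place.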

Actual results:
The engine is locked up. It keeps trying to fence the hosts. The web UI is inaccessible.

Expected results:
The second and third DNS entries in resolv.conf are used.

Additional info:
I ran into this issue while my main internal DNS server was undergoing maintenance and the engine host was restarted because of an (unrelated) failure of the storage appliance at the same time. Note that a secondary NS entry was present and working perfectly the whole time.

Comment 1 Alon Bar-Lev 2014-10-29 12:58:23 UTC
This is a known issue with Java name resolution; see for example [1].

To overcome this, a custom resolver and a custom host verifier would have to be implemented.

This is not an engine bug but a Java one.

[1] http://www.rexconsulting.net/tip-java-does-not-honor-dns-ttl-recommendation-in-enterprise-environment.html
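
Judging by its title, the article in [1] concerns the JVM's own DNS cache. As a rough sketch of what tuning that cache looks like, the standard security properties `networkaddress.cache.ttl` and `networkaddress.cache.negative.ttl` can be set before the first lookup. The class name `DnsCacheSettings` and the values used are illustrative assumptions. This only affects caching; it is not claimed to fix the failover problem reported here.

```java
import java.security.Security;

// Hypothetical helper: sets the JVM-wide DNS cache lifetimes.
// These must be set before the first name lookup to take effect reliably.
final class DnsCacheSettings {

    static void apply(int positiveTtlSeconds, int negativeTtlSeconds) {
        // Lifetime of successful lookups in the JVM cache, in seconds.
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        // Lifetime of failed lookups in the JVM cache, in seconds.
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }

    public static void main(String[] args) {
        apply(60, 10); // example values only
        System.out.println("ttl=" + Security.getProperty("networkaddress.cache.ttl"));
        System.out.println("negative ttl="
                + Security.getProperty("networkaddress.cache.negative.ttl"));
    }
}
```

The same properties can also be set in the JRE's `java.security` file instead of in code.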

Comment 2 Daniel Helgenberger 2014-10-29 13:50:54 UTC
(In reply to Alon Bar-Lev from comment #1)
> This is a known issue with Java name resolution; see for example [1].
> 
Thanks, Alon, for pointing to Java and this document (I have to admit I was expecting something like this).

As your source states, this is a 'feature' of Java rather than a bug. In turn, it should affect JBoss as a whole (enterprise environments).
I opened an RFE for this custom resolver [1]. I think this matter is rather severe; I wonder why nobody else has reported something like this yet.

[1] BZ1158487

Comment 3 Alon Bar-Lev 2014-10-29 13:53:17 UTC
(In reply to Daniel Helgenberger from comment #2)
> (In reply to Alon Bar-Lev from comment #1)
> > This is a known issue with Java name resolution; see for example [1].
> > 
> Thanks, Alon, for pointing to Java and this document (I have to admit I was
> expecting something like this).
> 
> As your source states, this is a 'feature' of Java rather than a bug. In
> turn, it should affect JBoss as a whole (enterprise environments).
> I opened an RFE for this custom resolver [1]. I think this matter is
> rather severe; I wonder why nobody else has reported something like this yet.
> 
> [1] BZ1158487

I just worked very hard on a different component [1] to enable dynamic resolution. Java has very poor support for this within its native base classes; I am not sure why, as it is a recent technology.

[1] http://gerrit.ovirt.org/gitweb?p=ovirt-engine-extension-aaa-ldap.git;a=blob;f=README.unboundid-ldapsdk;hb=HEAD

Comment 4 Daniel Helgenberger 2014-10-29 15:09:20 UTC
(In reply to Alon Bar-Lev from comment #3)
> Java has very poor support for this within its native base classes; I am
> not sure why, as it is a recent technology.

I think we need to ask Oracle or alternatively move JBoss to DjangoBoss or RailsBoss ;)

Back to the subject, reading the README you linked in Gerrit:
> The UnboundID LDAP SDK for Java provides the RoundRobinDNSServerSet to
> provide some remedy, it does that by duplicating functionality of the basic
> ServerSets, it includes support for fail over, random, round robin modes.

If I am not mistaken, this functionality should do for this case? In particular, the failover mode mimics the OS behavior.

Comment 5 Alon Bar-Lev 2014-10-29 15:13:31 UTC
(In reply to Daniel Helgenberger from comment #4)
> > The UnboundID LDAP SDK for Java provides the RoundRobinDNSServerSet to
> > provide some remedy, it does that by duplicating functionality of the basic
> > ServerSets, it includes support for fail over, random, round robin modes.
> 
> If I am not mistaken, this functionality should do for this case?
> In particular, the failover mode mimics the OS behavior.

If dynamic DNS processing is required, the application should use the JNDI DNS provider instead of letting Java resolve names automatically. This is possible only in some cases, because Java has classes, especially JNDI LDAP, that cannot be used with a custom resolver implementation.
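
As a minimal sketch of what using the JNDI DNS provider can look like: the provider accepts several space-separated `dns://` URLs and tries the listed servers itself, with timeout and retry settings taken from the environment. The class name `JndiDnsLookup`, the server addresses, and the timeout values below are placeholders chosen for illustration. How the engine would route its host connections through such a lookup is not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Hashtable;
import java.util.List;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

// Hypothetical helper: resolves A records through the JNDI DNS provider
// with an explicit list of name servers.
final class JndiDnsLookup {

    // Builds a space-separated provider URL, e.g. "dns://192.0.2.1 dns://192.0.2.2".
    static String providerUrl(List<String> servers) {
        StringBuilder sb = new StringBuilder();
        for (String server : servers) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("dns://").append(server);
        }
        return sb.toString();
    }

    static Hashtable<String, Object> environment(List<String> servers,
                                                 int initialTimeoutMillis, int retries) {
        Hashtable<String, Object> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
        env.put(Context.PROVIDER_URL, providerUrl(servers));
        // Provider-specific settings: initial timeout in milliseconds and retry count.
        env.put("com.sun.jndi.dns.timeout.initial", Integer.toString(initialTimeoutMillis));
        env.put("com.sun.jndi.dns.timeout.retries", Integer.toString(retries));
        return env;
    }

    static List<String> lookupA(String host, Hashtable<String, Object> env)
            throws NamingException {
        DirContext ctx = new InitialDirContext(env);
        try {
            Attributes attrs = ctx.getAttributes(host, new String[] {"A"});
            List<String> addresses = new ArrayList<>();
            Attribute a = attrs.get("A");
            if (a != null) {
                NamingEnumeration<?> values = a.getAll();
                while (values.hasMore()) {
                    addresses.add(String.valueOf(values.next()));
                }
            }
            return addresses;
        } finally {
            ctx.close();
        }
    }

    public static void main(String[] args) throws NamingException {
        // Placeholder addresses; replace with the entries from resolv.conf.
        List<String> servers = Arrays.asList("192.0.2.1", "192.0.2.2");
        String host = args.length > 0 ? args[0] : "example.com";
        System.out.println(lookupA(host, environment(servers, 1000, 2)));
    }
}
```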

Comment 6 Daniel Helgenberger 2014-10-29 16:04:50 UTC
(In reply to Alon Bar-Lev from comment #5)
> If dynamic DNS processing is required, the application should use the JNDI
> DNS provider instead of letting Java resolve names automatically. This is
> possible only in some cases, because Java has classes, especially JNDI
> LDAP, that cannot be used with a custom resolver implementation.

Sorry, I cannot quite follow here. If I understand correctly, the goal would be to have one DNS resolver/proxy function that covers all use cases (LDAP SRV records, host names, ...), which is a very sensible approach.

I am no developer and can only provide ideas to the extent of my knowledge.
That said, I would solve the problem by calling the nslookup or dig binaries from Java, which should not be very expensive but is quite dirty, to say the least, and may create other issues (like recreating your 'own' DNS cache).

Dynamic processing is only required to the extent it is used by the engine. I never really gave this much thought. I prefer a clean setup where hosts are resolvable in DNS, so I put some name servers in the engine's resolv.conf. I guess the engine did a lookup when I registered the hosts, which was fine, and now it looks up the hosts by host name on startup?

A truly dynamic setup, as I understand it, involves short DNS TTLs (Google, Facebook, EC2 style load balancing). Of course, this is not applicable to oVirt and most likely never will be. Hosts (and DNS records, for that matter) in production will have (semi-)static IPs most of the time.
For example, I just put my hosts in /etc/hosts to work around this issue. Maybe adding host/IP pairs to some file under /var/lib could fix the immediate issue?
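
The host/IP file idea could be sketched roughly as follows, assuming an /etc/hosts-style format (`<ip> <name> [<alias>...]`, with `#` starting a comment). The class name `StaticHostTable` and the file format are assumptions for illustration only; nothing like this exists in the engine.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical fallback table: maps host names to IP address strings
// read from a hosts-style file.
final class StaticHostTable {

    // Parses lines of the form "<ip> <name> [<alias>...]"; '#' starts a comment.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> byName = new LinkedHashMap<>();
        for (String raw : lines) {
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\\s+");
            if (fields.length < 2) {
                continue; // an address without any name is ignored
            }
            for (int i = 1; i < fields.length; i++) {
                // Host names are case-insensitive; keep the first mapping seen for a name.
                byName.putIfAbsent(fields[i].toLowerCase(Locale.ROOT), fields[0]);
            }
        }
        return byName;
    }

    // Loads the table from a file whose location is up to the caller.
    static Map<String, String> load(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

A caller would consult such a table first and fall back to a normal lookup only when the name is missing, at the cost of having to keep the file in sync with DNS.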

Whatever you come up with will surely be much better, and I am happy to test it if needed.

Comment 7 Oved Ourfali 2015-03-08 09:04:46 UTC
*** Bug 1158487 has been marked as a duplicate of this bug. ***

Comment 8 Oved Ourfali 2015-03-10 09:30:25 UTC
Per the comments above, this is a Java issue and not an engine one, and we don't plan to work around it in the engine code.
Closing as WONTFIX.