Bug 1039115

Summary: Web Admin unresponsive after loss of one dns domain server
Product: [Retired] oVirt Reporter: Ryan Womer <ryan.womer>
Component: ovirt-engine-core Assignee: Ravi Nori <rnori>
Status: CLOSED INSUFFICIENT_DATA QA Contact: bugs <bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 3.3 CC: acathrow, bazulay, emesika, iheim, ryan.womer, yeylon, yzaslavs
Target Milestone: --- Flags: bazulay: needinfo? (ryan.womer)
Target Release: 3.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: infra
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-03-03 14:10:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description: Most recent crash. Took 12 hours to get back up. Flags: none

Description Ryan Womer 2013-12-06 17:16:05 UTC
Description of problem:  In my environment, I have two Active Directory domain controllers providing LDAP and DNS.  The engine is aware of both DNS servers.  One DC runs as a guest within oVirt; the other is outside on bare metal.  If either DC is powered off, the Web UI becomes unresponsive.  Local login stalls for about 5 minutes.  Login then appears to complete, but all containers are empty.  Once the DC is restored, all functions return.


Version-Release number of selected component (if applicable):  3.3.1


How reproducible:  Every time.


Steps to Reproduce:
1.  Power off Domain Controller
2.  Attempt to log in with local admin account on Web UI


Actual results:  Sluggish UI with no guests listed in containers.


Expected results:  Access to virtual machines.


Additional info:  I'm willing to provide any logs that are needed, I just don't know where to start.

Comment 1 Ravi Nori 2014-01-10 19:53:45 UTC
Can you provide the relevant engine log?

Comment 2 Ryan Womer 2014-01-14 15:12:57 UTC
This has happened several times.  Would you like the historical logs, or just the most recent one from around the time it occurred?

Comment 3 Ravi Nori 2014-01-14 17:04:39 UTC
The most recent would be sufficient

Comment 4 Ryan Womer 2014-01-17 16:06:14 UTC
Created attachment 851683
Most recent crash.  Took 12 hours to get back up.

Comment 5 Ryan Womer 2014-01-17 16:06:48 UTC
Comment on attachment 851683
Most recent crash.  Took 12 hours to get back up.

This is the requested engine.log

Comment 6 Ryan Womer 2014-01-17 16:07:57 UTC
Also of note, these crashes occurred during host migrations.  (Two occurred during this 12 hour period.)   This caused several guests to be listed as "UNKNOWN".   I had to edit the entries in postgres to get them back online.

Comment 7 Ravi Nori 2014-01-17 19:40:43 UTC
I was able to reproduce it in my environment.

The issue seems to be that engine can no longer find the host on which the database is installed. The cause of all exceptions seems to be: 

java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource


I was able to fix it by specifying the actual IP address of the database host instead of the FQDN in /etc/ovirt-engine/engine.conf.d/10-setup-database.conf

ENGINE_DB_HOST="<ip of db host>"
ENGINE_DB_PORT="5432"
....
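
To check whether an engine is exposed to this, one option is to read the file above and see whether ENGINE_DB_HOST is a literal IP address or a name that needs DNS. The following is only a minimal sketch in Python: the helper names parse_conf and is_ip_literal are hypothetical, and it assumes the file consists of simple KEY="value" lines like the two shown above.

```python
import ipaddress


def parse_conf(text):
    """Parse simple KEY="value" lines; skip blanks and # comments.

    Assumption: the file uses the same format as the two lines quoted
    in this comment. This is not an official oVirt parser.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def is_ip_literal(host):
    """Return True if host is an IPv4/IPv6 literal (no DNS lookup needed)."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    path = "/etc/ovirt-engine/engine.conf.d/10-setup-database.conf"
    with open(path) as handle:
        conf = parse_conf(handle.read())
    host = conf.get("ENGINE_DB_HOST", "")
    if is_ip_literal(host):
        print("ENGINE_DB_HOST is an IP address; no DNS lookup is involved")
    else:
        print("ENGINE_DB_HOST is a name and depends on DNS:", host)
```

If the script reports a name rather than an address, the database connection depends on DNS resolution and may be affected by the failure described in this bug.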

Comment 8 Itamar Heim 2014-01-17 23:28:10 UTC
domain is such a loaded term, changing bug summary

Comment 9 Yair Zaslavsky 2014-01-28 07:40:21 UTC
I also looked at the logs, and you also have connection errors to the hosts, once again due to the DNS issues.

Any chance we can also see the resolv.conf file?

Comment 10 Barak 2014-01-28 09:35:33 UTC
If both DNS servers are in /etc/resolv.conf and both can resolve the DB host, it should be O.K. (or use the workaround suggested in comment #7).

Ryan, can you please verify that?
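
As a starting point for that verification, the nameserver entries can be listed from /etc/resolv.conf. This is a rough Python sketch, not a full resolv.conf parser; parse_nameservers is a hypothetical helper name. It only confirms which servers are configured. Whether each one can actually resolve the DB host still has to be checked against each server separately (for example with a DNS lookup tool pointed at that server).

```python
def parse_nameservers(text):
    """Return the addresses from 'nameserver' lines, in file order.

    Lines starting with '#' or ';' are treated as comments.
    """
    servers = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


if __name__ == "__main__":
    with open("/etc/resolv.conf") as handle:
        servers = parse_nameservers(handle.read())
    print("configured nameservers:", servers)
    if len(servers) < 2:
        print("fewer than two nameservers listed; no DNS fallback available")
```

The resolver normally tries the servers in the order listed, so the order printed here is the order in which a powered-off DC would be hit.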

Comment 11 Ryan Womer 2014-01-28 15:34:03 UTC
The system in question is a production cluster.  I'll schedule an outage to test it as soon as possible.   

I planned on adding a bare-metal ADS as a back-up, so I'll add the test in line with that.

Comment 12 Barak 2014-02-04 09:52:32 UTC
Reduced urgency as this looks like a system/environmental issue.
Ryan, when do you think you'll get back with the answers?