Bug 1039115 - Web Admin unresponsive after loss of one dns domain server [NEEDINFO]
Summary: Web Admin unresponsive after loss of one dns domain server
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.4.0
Assignee: Ravi Nori
QA Contact: bugs@ovirt.org
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2013-12-06 17:16 UTC by Ryan Womer
Modified: 2014-03-03 14:10 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-03-03 14:10:33 UTC
oVirt Team: ---
bazulay: needinfo? (ryan.womer)


Attachments
Most recent crash. Took 12 hours to get back up. (1.75 MB, text/plain)
2014-01-17 16:06 UTC, Ryan Womer

Description Ryan Womer 2013-12-06 17:16:05 UTC
Description of problem:  In my environment, I have two Active Directory domain controllers providing LDAP and DNS.  The engine is aware of both DNS servers.  One DC runs as a guest within oVirt; the other is outside on bare metal.  If either DC is powered off, the Web UI becomes unresponsive.  Local login stalls for about 5 minutes.  Login then appears to complete, but all containers are empty.  Once the DC is restored, all functions return.


Version-Release number of selected component (if applicable):  3.3.1


How reproducible:  Every time.


Steps to Reproduce:
1.  Power off Domain Controller
2.  Attempt to log in with local admin account on Web UI


Actual results:  Sluggish UI with no guests listed in containers.


Expected results:  Access to virtual machines.


Additional info:  I'm willing to provide any logs that are needed; I just don't know where to start.

Comment 1 Ravi Nori 2014-01-10 19:53:45 UTC
Can you provide the relevant engine log?

Comment 2 Ryan Womer 2014-01-14 15:12:57 UTC
This has happened several times.  Would you like the historical logs, or just the most recent one from around the time it occurred?

Comment 3 Ravi Nori 2014-01-14 17:04:39 UTC
The most recent would be sufficient.

Comment 4 Ryan Womer 2014-01-17 16:06:14 UTC
Created attachment 851683
Most recent crash.  Took 12 hours to get back up.

Comment 5 Ryan Womer 2014-01-17 16:06:48 UTC
Comment on attachment 851683
Most recent crash.  Took 12 hours to get back up.

This is the requested engine.log

Comment 6 Ryan Womer 2014-01-17 16:07:57 UTC
Also of note, these crashes occurred during host migrations.  (Two occurred during this 12 hour period.)   This caused several guests to be listed as "UNKNOWN".   I had to edit the entries in postgres to get them back online.

Comment 7 Ravi Nori 2014-01-17 19:40:43 UTC
I was able to reproduce it in my environment.

The issue seems to be that the engine can no longer resolve the host on which the database is installed. The cause of all the exceptions seems to be:

java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource


I was able to fix it by specifying the actual IP address of the database host instead of its FQDN in /etc/ovirt-engine/engine.conf.d/10-setup-database.conf

ENGINE_DB_HOST="<ip of db host>"
ENGINE_DB_PORT="5432"
....
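
As a rough illustration of the workaround above, the sketch below parses KEY="value" lines such as those in 10-setup-database.conf and reports whether ENGINE_DB_HOST is an IP literal or a name that still depends on DNS. The function names parse_conf and is_ip_literal are hypothetical, 192.0.2.10 is only a documentation placeholder address, and the parser assumes simple double-quoted assignments; it is not part of oVirt.

```python
import ipaddress


def parse_conf(text):
    """Parse KEY="value" lines into a dict, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and the optional double quotes.
        settings[key.strip()] = value.strip().strip('"')
    return settings


def is_ip_literal(host):
    """Return True if host is an IPv4/IPv6 literal, so no DNS lookup is needed."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


# Hypothetical sample content; a real file would be read from disk instead.
sample = 'ENGINE_DB_HOST="192.0.2.10"\nENGINE_DB_PORT="5432"\n'
config = parse_conf(sample)
needs_dns = not is_ip_literal(config["ENGINE_DB_HOST"])
```

If needs_dns is true, the database connection still depends on name resolution and would be affected by the DNS outage described in this bug.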

Comment 8 Itamar Heim 2014-01-17 23:28:10 UTC
domain is such a loaded term, changing bug summary

Comment 9 Yair Zaslavsky 2014-01-28 07:40:21 UTC
I also looked at the logs, and you also have connection errors to the hosts, once again due to the DNS issues.

Any chance we can also see the resolv.conf file?

Comment 10 Barak 2014-01-28 09:35:33 UTC
If both DNS servers are listed in /etc/resolv.conf and both can resolve the DB host, it should be O.K. then (or use the workaround suggested in comment #7).

Ryan, can you please verify that?
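
One way to check the first half of that condition is to list the nameserver entries the resolver will actually use. The following is a minimal sketch that extracts them from resolv.conf-style text; the function name list_nameservers is hypothetical and the addresses in the sample are documentation placeholders, not values from this environment.

```python
def list_nameservers(resolv_conf_text):
    """Return the addresses from 'nameserver' lines, in file order."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        # Strip comments, which resolv.conf marks with '#' or ';'.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


# Hypothetical sample; a real check would read /etc/resolv.conf instead.
sample = """\
# generated by hand
search example.com
nameserver 192.0.2.1
nameserver 192.0.2.2
"""
configured = list_nameservers(sample)
```

This only confirms that both servers are configured; whether each one can resolve the database host still has to be tested against each server separately.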

Comment 11 Ryan Womer 2014-01-28 15:34:03 UTC
The system in question is a production cluster.  I'll schedule an outage to test it as soon as possible.   

I planned on adding a bare-metal ADS as a backup, so I'll run the test in line with that.

Comment 12 Barak 2014-02-04 09:52:32 UTC
Reduced urgency as this looks like a system/environmental issue.
Ryan - when do you think you'll get back with the answers?

