Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1039115

Summary:

Web Admin unresponsive after loss of one dns domain server

Product:

[Retired] oVirt

Reporter:

Ryan Womer <ryan.womer>

Component:

ovirt-engine-core

Assignee:

Ravi Nori <rnori>

Status:

CLOSED INSUFFICIENT_DATA

QA Contact:

bugs <bugs>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

3.3

CC:

acathrow, bazulay, emesika, iheim, ryan.womer, yeylon, yzaslavs

Target Milestone:

---

Target Release:

3.4.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

infra

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2014-03-03 14:10:33 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
Most recent crash. Took 12 hours to get back up.	none

Description Ryan Womer 2013-12-06 17:16:05 UTC

Description of problem:  In my environment, two Active Directory domain controllers provide LDAP and DNS.  The engine is aware of both DNS servers.  One DC runs inside oVirt; the other runs outside it on bare metal.  If either DC is powered off, the Web UI becomes unresponsive.  Local login stalls for about 5 minutes.  Login then appears to complete, but all containers are empty.  Once the DC is restored, all functions return.


Version-Release number of selected component (if applicable):  3.3.1


How reproducible:  Every time.


Steps to Reproduce:
1.  Power off Domain Controller
2.  Attempt to log in with local admin account on Web UI


Actual results:  Sluggish UI with no guests listed in containers.


Expected results:  Access to virtual machines.


Additional info:  I'm willing to provide any logs that are needed; I just don't know where to start.

Comment 1 Ravi Nori 2014-01-10 19:53:45 UTC

Can you provide the relevant engine log?

Comment 2 Ryan Womer 2014-01-14 15:12:57 UTC

This has happened several times.  Would you like the historical logs, or just the most recent one from around the time it occurred?

Comment 3 Ravi Nori 2014-01-14 17:04:39 UTC

The most recent would be sufficient.

Comment 4 Ryan Womer 2014-01-17 16:06:14 UTC

Created attachment 851683 [details]
Most recent crash.  Took 12 hours to get back up.

Comment 5 Ryan Womer 2014-01-17 16:06:48 UTC

Comment on attachment 851683 [details]
Most recent crash.  Took 12 hours to get back up.

This is the requested engine.log

Comment 6 Ryan Womer 2014-01-17 16:07:57 UTC

Also of note, these crashes occurred during host migrations.  (Two occurred during this 12-hour period.)  This caused several guests to be listed as "UNKNOWN".  I had to edit the entries in postgres to get them back online.

Comment 7 Ravi Nori 2014-01-17 19:40:43 UTC

I was able to reproduce it in my environment.

The issue seems to be that the engine can no longer find the host on which the database is installed. The cause of all the exceptions seems to be:

java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource


I was able to fix it by specifying the actual IP address of the database host instead of the FQDN in /etc/ovirt-engine/engine.conf.d/10-setup-database.conf

ENGINE_DB_HOST="<ip of db host>"
ENGINE_DB_PORT="5432"
....
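
The workaround above removes the DNS lookup from the database connection path. As a rough sketch only, the following Python checks whether an ENGINE_DB_HOST value in shell-style KEY="value" text is a literal IP address or a name that still depends on DNS. The function names read_db_host and depends_on_dns are hypothetical helpers written for this illustration and are not part of ovirt-engine.

```python
import ipaddress
import re


def read_db_host(conf_text):
    """Return the ENGINE_DB_HOST value from KEY="value" lines, or None if absent."""
    for line in conf_text.splitlines():
        match = re.match(r'\s*ENGINE_DB_HOST\s*=\s*"?([^"\s]*)"?\s*$', line)
        if match:
            return match.group(1)
    return None


def depends_on_dns(host):
    """Return True if host is a name to be resolved, False if it is a literal IP."""
    try:
        ipaddress.ip_address(host)
        return False
    except ValueError:
        return True
```

To use it, read the contents of the configuration file into a string, pass it to read_db_host, and pass the result to depends_on_dns. A True result means the engine still needs a working resolver to reach the database.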

Comment 8 Itamar Heim 2014-01-17 23:28:10 UTC

"Domain" is such a loaded term; changing the bug summary.

Comment 9 Yair Zaslavsky 2014-01-28 07:40:21 UTC

I also looked at the logs, and you also have connection errors to the hosts, once again due to the DNS issues.

Any chance we can also see the resolv.conf file?

Comment 10 Barak 2014-01-28 09:35:33 UTC

If both DNS servers are listed in /etc/resolv.conf and both can resolve the db host, it should be O.K. then (or use the workaround suggested in comment #7).

Ryan, can you please verify that?
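
One way to check the first half of that request is to list the nameserver entries the engine host is actually configured with. The sketch below is an illustration only: parse_nameservers is a hypothetical helper, and it assumes the standard resolv.conf layout of one "nameserver <address>" entry per line, with comment lines starting with "#" or ";".

```python
def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text, in order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines.
        if not stripped or stripped[0] in "#;":
            continue
        parts = stripped.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers
```

If the returned list holds only one address, or omits one of the two domain controllers, the loss of a single DNS server would be enough to break name resolution on the engine host.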

Comment 11 Ryan Womer 2014-01-28 15:34:03 UTC

The system in question is a production cluster.  I'll schedule an outage to test it as soon as possible.   

I planned on adding a bare-metal ADS as a back-up, so I'll add the test in line with that.

Comment 12 Barak 2014-02-04 09:52:32 UTC

Reduced urgency as this looks like a system/environmental issue.
Ryan, when do you think you'll get back with the answers?

Comment 13 Red Hat Bugzilla 2023-09-14 01:54:58 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days