Bug 879904

Summary: engine: engine fails to connect to DB and cannot be started with NPE
Product: Red Hat Enterprise Virtualization Manager Reporter: Dafna Ron <dron>
Component: ovirt-engineAssignee: Juan Hernández <juan.hernandez>
Status: CLOSED ERRATA QA Contact: Pavel Stehlik <pstehlik>
Severity: urgent Docs Contact:
Priority: low    
Version: 3.1.0CC: acathrow, bazulay, bdagan, bhubbard, byount, cpelland, fharshav, iheim, jkt, jlivings, jraju, juan.hernandez, lpeer, lyarwood, ngupta, Rhev-m-bugs, yeylon, yzaslavs
Target Milestone: ---   
Target Release: 3.3.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: infra
Fixed In Version: is1 Doc Type: Bug Fix
Doc Text:
The engine performed reverse DNS lookups of IP addresses when it failed to connect to the database. This includes looking up its own name and delaying the rest of the operations, including the deployment of the connection pool, which raises a null pointer exception in the administration and user portals. This update modifies the engine so that it locates the data source in a loop and retries until it is successful .
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-01-21 17:31:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
logs
none
logs
none
new logs
none
tcpdump
none
engine log file from rhs-c issue none

Description Dafna Ron 2012-11-25 10:33:30 UTC
Created attachment 651461 [details]
logs

Description of problem:

I added a domain to my setup and changed resolved.com so that the first server in the list is a server that's currently down. 
engine failed to start with NPE which is also shown in the webadmin and user portal when we try to log in. 

Version-Release number of selected component (if applicable):

si24.4

How reproducible:

100%

Steps to Reproduce:
1. add a domain to your setup using rhevm-manage-domains
2. change resolve.conf so that the first server on the list is a dns server which is down
3. restart ovirt engine
  
Actual results:

engine fails to start with NPE which is also shown in webadmin and user portal

Expected results:

we need to jump to the the next server on the list 

Additional info: logs

2012-11-25 11:56:56,368 ERROR [org.ovirt.engine.core.utils.ejb.EJBUtilsStrategy] (ServerService Thread Pool -- 37) Error looking up resource DATA_SOURCE
2012-11-25 11:56:56,376 ERROR [org.ovirt.engine.core.dal.dbbroker.DbFacadeLocator] (ServerService Thread Pool -- 37) Unable to locate DbFacade instance: java.lang.RuntimeException: Datasource is not defined 
        at org.ovirt.engine.core.dal.dbbroker.DbFacadeLocator.<clinit>(DbFacadeLocator.java:34) [engine-dal.jar:]
        at org.ovirt.engine.core.dal.dbbroker.DbFacade.getInstance(DbFacade.java:209) [engine-dal.jar:]
        at org.ovirt.engine.core.bll.Backend.checkDBConnectivity(Backend.java:124) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.Backend.create(Backend.java:118) [engine-bll.jar:]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_09-icedtea]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_09-icedtea]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_09-icedtea]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_09-icedtea]
        at org.jboss.as.ee.component.ManagedReferenceLifecycleMethodInterceptorFactory$ManagedReferenceLifecycleMethodInterceptor.processInvocation(ManagedReferenceLifecycleMethodInterceptorFactory.java:130) [jboss-as-ee.jar:7.1.3.Final-redhat-3]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ee.component.ManagedReferenceFieldInjectionInterceptorFactory$ManagedReferenceFieldInjectionInterceptor.processInvocation(ManagedReferenceFieldInjectionInterceptorFactory.java:112) [jboss-as-ee.jar:7.1.3.Final-redhat-3]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ee.component.ManagedReferenceInterceptorFactory$ManagedReferenceInterceptor.processInvocation(ManagedReferenceInterceptorFactory.java:95) [jboss-as-ee.jar:7.1.3.Final-redhat-3]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ee.component.ManagedReferenceInterceptorFactory$ManagedReferenceInterceptor.processInvocation(ManagedReferenceInterceptorFactory.java:95) [jboss-as-ee.jar:7.1.3.Final-redhat-3]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) [jboss-as-ee.jar:7.1.3.Final-redhat-3]
:

Comment 1 Itamar Heim 2012-11-25 10:52:37 UTC
was this a local postgres or a remote postgres?

Comment 2 Dafna Ron 2012-11-25 12:13:21 UTC
(In reply to comment #1)
> was this a local postgres or a remote postgres?

local

Comment 3 Dafna Ron 2012-11-26 12:51:47 UTC
Barak had the same issue on his setup but we are not sure why its happening. 
its not a dns issue. 
Kublin thinks its a security issue this time from error we found in jboss log: 

2012-11-26 14:31:59,130 ERROR [org.jboss.as.controller.management-operation] (ServerService Thread Pool -- 16) JBAS014613: Operation ("enable") failed - address: ([
    ("subsystem" => "datasources"),
    ("data-source" => "ENGINEDataSource")
]) - failure description: "JBAS010436: Failed to create DataSource instance for [{
    \"operation\" => \"enable\",
    \"address\" => [
        (\"subsystem\" => \"datasources\"),
        (\"data-source\" => \"ENGINEDataSource\")
    ],
    \"persistent\" => true
}]
 reason: IJ010073: Invalid <security> configuration"
2012-11-26 14:31:59,400 INFO  [org.jboss.as.connector.subsystems.datasources] (ServerService Thread Pool -- 16) JBAS010404: Deploying non-JDBC-compliant driver class org.postgresql.Driver (version 8.4)
2012-11-26 14:31:59,465 INFO  [org.jboss.as.naming] (MSC service thread 1-4) JBAS011802: Starting Naming Service


I am attaching all the logs from Barak's setup.

Comment 4 Dafna Ron 2012-11-26 12:58:57 UTC
Created attachment 651942 [details]
logs

Comment 7 Barak 2012-12-02 11:27:59 UTC
Juan - do you have any idea?
We are suspecting eap 6.0.1

Comment 8 Juan Hernández 2012-12-02 20:11:49 UTC
The problem described in comment #3 we already had, see bug 868659, and it has already been solved.

The initial problem is different, and I don't think it is related to 6.0.1. My guess is that the fact that the first DNS server is not responding means that all DNS operations are delayed. By default the resolver waits 5 seconds for a response and retry two times, so it is 10 seconds before it tries with the second nameserver. This is time enough to delay the deployment of the database connection pool so that the application starts before it is deployed, thus the NPE. However DNS should not be used to resolve the address of the database, as it is localhost for local database. I would check the "files" is the first option in /etc/nsswitch.conf for the hosts database and that there is a "localhost" entry in the /etc/hosts file.

If this is the problem we could maybe use 127.0.0.1 in the database connection pool configuration, to avoid DNS operations in any case.

Comment 9 mkublin 2012-12-03 08:52:31 UTC
The problem of Barak is solved in the way as described at bug https://bugzilla.redhat.com/show_bug.cgi?id=868659.
Dafna, please use workaround described by Juan

Comment 13 Juan Hernández 2012-12-10 09:12:00 UTC
Dafna, did you check if replacing "localhost" with "127.0.0.1" in the "/etc/sysconfig/ovirt-engine file resolved the issue? It is in the ENGINE_DB_URL variable, its value will probably be this:

  ENGINE_DB_URL=jdbc:postgresql://localhost:5432/engine

I am suggesting to change it to this:

  ENGINE_DB_URL=jdbc:postgresql://127.0.0.1:5432/engine

If that solves the issue (with the DNS configuration still broken) then the problem is that we start the engine before the data source is deployed, and there are two things we can do to improve it:

1. Resolve the database host name to an IP address during setup and store the IP address instead of the host name. This is perfectly fine for local installations, but is not wise for remote databases as the system will start to fail if the database IP address changes.

2. Modify the code so that we wait till the database connection pool is deployed before starting to use it. This is much more reliable in my opinion.

Comment 14 Dafna Ron 2012-12-16 16:10:42 UTC
I tested with 127.0.0.1 and we are still getting NPE when engine starts (attaching logs again): 

ENGINE_DB_URL=jdbc:postgresql://127.0.0.1:5432/engine


2012-12-16 17:54:41,472 ERROR [org.ovirt.engine.core.dal.dbbroker.DbFacadeLocator] (MSC service thread 1-3) Unable to locate DbFacade instance: java.lang.RuntimeException: Datasource is not defined 
        at org.ovirt.engine.core.dal.dbbroker.DbFacadeLocator.<clinit>(DbFacadeLocator.java:34) [engine-dal.jar:]
        at org.ovirt.engine.core.dal.dbbroker.DbFacade.getInstance(DbFacade.java:209) [engine-dal.jar:]
        at org.ovirt.engine.core.bll.Backend.checkDBConnectivity(Backend.java:124) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.Backend.create(Backend.java:118) [engine-bll.jar:]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_09-icedtea]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_09-icedtea]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_09-icedtea]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_09-icedtea]
        at org.jboss.as.ee.component.ManagedReferenceLifecycleMethodInterceptorFactory$ManagedReferenceLifecycleMethodInterceptor.processInvocation(ManagedReferenceLifecycleMethodInterceptorFactory.java:130) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.ManagedReferenceFieldInjectionInterceptorFactory$ManagedReferenceFieldInjectionInterceptor.processInvocation(ManagedReferenceFieldInjectionInterceptorFactory.java:112) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.ManagedReferenceInterceptorFactory$ManagedReferenceInterceptor.processInvocation(ManagedReferenceInterceptorFactory.java:95) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.ManagedReferenceInterceptorFactory$ManagedReferenceInterceptor.processInvocation(ManagedReferenceInterceptorFactory.java:95) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInOurTx(CMTTxInterceptor.java:227) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.requiresNew(CMTTxInterceptor.java:332) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.as.ejb3.tx.SingletonLifecycleCMTTxInterceptor.processInvocation(SingletonLifecycleCMTTxInterceptor.java:57) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.BasicComponent.constructComponentInstance(BasicComponent.java:161) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.as.ee.component.BasicComponent.createInstance(BasicComponent.java:85) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.as.ejb3.component.singleton.SingletonComponent.getComponentInstance(SingletonComponent.java:119) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.as.ejb3.component.singleton.SingletonComponent.start(SingletonComponent.java:134) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.as.ee.component.ComponentStartService.start(ComponentStartService.java:44) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.msc.service.ServiceControllerImpl$StartTask.startService(ServiceControllerImpl.java:1811)
        at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1746)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [rt.jar:1.7.0_09-icedtea]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [rt.jar:1.7.0_09-icedtea]
        at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_09-icedtea]

Comment 15 Dafna Ron 2012-12-16 16:11:09 UTC
Created attachment 664346 [details]
new logs

Comment 16 Juan Hernández 2012-12-17 10:13:20 UTC
Thanks for the information Dafna.

Then only reason for this that I can imagine is that the database is doing a reverse DNS lookup of of the IP address of the client. Can you check that Dafna? Maybe take a dump of the traffic to the DNS servers while you start the engine? Or try manually the following command:

  psql --host 127.0.0.1 --user engine --password engine

Does this command take longer when the DNS configuration is wrong?

Anyhow, I think this will need a modification in the code to make the engine wait till the datasource is deployed before trying to use it.

Comment 17 Dafna Ron 2012-12-17 17:03:04 UTC
Created attachment 664982 [details]
tcpdump

I did not see any effect when running psql --host 127.0.0.1 --user engine --password engine

I ran tcpdump on port 53, qa2 file is when we give a dead dns and qa1 is when we have a functional dns.

Comment 18 Juan Hernández 2012-12-18 10:49:08 UTC
Checking the provided dumps I see that there are reverse lookups of the IP address of the machine that are failing to contact the first server, thus delaying things. I checked in the machine what process makes those lookups, and it is the ovirt-engine service, not the database. So this means that the application server is looking up its own host name and probably delaying the rest of the operations, including the deployment of the connection pool.

I prepared the following patch to modify the engine so that it will locate the data source in a loop, retrying till it is successful:

http://gerrit.ovirt.org/10189

With this patch applied I verified that the behaviour when the DNS is slow is different. It prints this new warning and then retries the operation:

2012-12-18 12:32:13,851 WARN  [org.ovirt.engine.core.dal.dbbroker.DbFacadeLocator] (MSC service thread 1-3) The datasource can't be located. This probably means that DNS is not working correctly or is slow, please check it. Will try again in a few seconds.

After a few seconds, when the data source is deployed, the application continues as usual.

Comment 19 Paul Cuzner 2012-12-20 22:10:43 UTC
Created attachment 666974 [details]
engine log file from rhs-c issue

Comment 20 Paul Cuzner 2012-12-20 22:12:53 UTC
I'm seeing the same dbfacade error on my red hat storage console instance (ovirt based).

Has been working for months until I applied maintenance tonight ... that'll teach me ;o)

Comment 27 Juan Hernández 2013-03-21 11:43:23 UTC
I have restored the change (I abandoned it some time ago). moved the data source lookup loop out of the static initializer and added a time limit.

Comment 28 Juan Hernández 2013-04-08 14:03:27 UTC
The upstream change has been merged, so it will be in 3.3 after the rebase.

Comment 29 Yair Zaslavsky 2013-06-11 12:59:22 UTC
*** Bug 972678 has been marked as a duplicate of this bug. ***

Comment 35 Yair Zaslavsky 2013-06-30 11:48:10 UTC
*** Bug 975445 has been marked as a duplicate of this bug. ***

Comment 37 Charlie 2013-11-28 00:24:03 UTC
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 39 Juan Hernández 2014-01-15 08:50:53 UTC
The first part of the doc text is not accurate. Initially we thought that this was caused by a reverse DNS lookup, but we don't have proof of that. What actually happens is that the relative deployment order of the application and the data source aren't guaranteed by the application server, so in some situations the data source may be deployed after the application is deployed and executed. Thus the application has to wait till the data source is deployed.

Comment 41 errata-xmlrpc 2014-01-21 17:31:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html