Bug 1516256 - [scale] Unable to get managed connection for java:/ENGINEDataSource
Summary: [scale] Unable to get managed connection for java:/ENGINEDataSource
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Database.Core
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
: ---
Assignee: Ravi Nori
QA Contact: mlehrer
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-22 11:23 UTC by mlehrer
Modified: 2021-12-10 15:42 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-22 19:55:24 UTC
oVirt Team: Infra
Embargoed:


Attachments
server.log, engine.log, postgres log and thread dump of ovirt-engine (601.07 KB, application/zip)
2017-11-22 11:23 UTC, mlehrer
Previous engine.log was not correct; the updated version, named 'correct_engine.log', is attached in this zip (1.44 MB, application/zip)
2017-11-23 13:52 UTC, mlehrer
engine log and log from the VDSM host that holds the VMs of the nested hosts (8.86 MB, application/zip)
2017-12-14 13:45 UTC, mlehrer


Links
Red Hat Issue Tracker RHV-44200 (last updated 2021-12-10 15:42:54 UTC)

Description mlehrer 2017-11-22 11:23:37 UTC
Created attachment 1357401 [details]
server.log, engine.log, postgres log and thread dump of ovirt-engine

Description of problem:
When attempting to log in to the oVirt web admin, the message
"Unable to get managed connection for java:/ENGINEDataSource" is shown and the login fails.

Postgres shows many idle (likely orphan) connections.
The env contains: 
- 889 VMs, 6 active physical hosts, and 290 nested hosts in maintenance.

The issue was originally seen when starting the ovirt-engine service began to take increasingly long.  Eventually a login would succeed, but only after a connection error about timeouts while getting a managed connection was thrown.

Currently a successful ovirt-engine login is not possible unless one of the following is done (a sketch of the first option is below):
  - kill one of the idle queries to the engine on postgres
  - increase ENGINE_DB_MAX_CONNECTIONS and postgres's max_connections value
  - reboot the engine; restarting postgres/ovirt* services doesn't release the connections.
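(For reference, the first workaround was roughly of the following form; the 'engine' database name comes from this setup, and the pg_stat_activity/pg_terminate_backend usage assumes PostgreSQL 9.2 or later:)

# as the postgres user, list idle connections held against the engine database
su - postgres -c "psql -c \"SELECT pid, state, query_start FROM pg_stat_activity WHERE datname = 'engine' AND state = 'idle';\""
# terminate one of them to free a slot for the connection pool
su - postgres -c "psql -c \"SELECT pg_terminate_backend(<pid>);\""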







Version-Release number of selected component (if applicable):
PACKAGE_VERSION="4.2.0_master"
4.2.0-0.0.master.20171025204923.git6f4cbc5.el7.centos


How reproducible:
Very

Steps to Reproduce:
1. Create an env of 200 hosts and 800 VMs
2. Start and stop the VMs over several days (basic use, not constant stress)
3. Attempt to restart ovirt-engine

Actual results:
Cannot log in

2017-11-22 11:45:03,503+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-95) [] Exception:: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource
        at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:80) [spring-jdbc.jar:4.3.9.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:619) [spring-jdbc.jar:4.3.9.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:684) [spring-jdbc.jar:4.3.9.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:716) [spring-jdbc.jar:4.3.9.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:766) [spring-jdbc.jar:4.3.9.RELEASE]
        at org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall.executeCallInternal(PostgresDbEngineDialect.java:152) [dal.jar:]
        at org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall.doExecute(PostgresDbEngineDialect.java:118) [dal.jar:]
        at org.springframework.jdbc.core.simple.SimpleJdbcCall.execute(SimpleJdbcCall.java:198) [spring-jdbc.jar:4.3.9.RELEASE]
        at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeImpl(SimpleJdbcCallsHandler.java:135) [dal.jar:]
        at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeReadList(SimpleJdbcCallsHandler.java:105) [dal.jar:]
        at org.ovirt.engine.core.dao.network.VmNetworkInterfaceDaoImpl.getAllForMonitoredVm(VmNetworkInterfaceDaoImpl.java:51) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.loadVmNetworkInterfaces(VmAnalyzer.java:1057) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.updateInterfaceStatistics(VmAnalyzer.java:890) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.updateStatistics(VmAnalyzer.java:858) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.proceedVmReportedOnTheSameHost(VmAnalyzer.java:627) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.analyze(VmAnalyzer.java:155) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.lambda$analyzeVms$1(VmsMonitoring.java:136) [vdsbroker.jar:]
        at java.util.ArrayList.forEach(ArrayList.java:1249) [rt.jar:1.8.0_144]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.analyzeVms(VmsMonitoring.java:131) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.perform(VmsMonitoring.java:94) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:43) [vdsbroker.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_144]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_144]
        at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) [javax.enterprise.concurrent-1.0.jar:]
        at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) [javax.enterprise.concurrent-1.0.jar:]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_144]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_144]
        at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent-1.0.jar:]
        at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)
Caused by: java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource
        at org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:146)
        at org.jboss.as.connector.subsystems.datasources.WildFlyDataSource.getConnection(WildFlyDataSource.java:64)
        at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111) [spring-jdbc.jar:4.3.9.RELEASE]
        at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:77) [spring-jdbc.jar:4.3.9.RELEASE]
        ... 29 more
Caused by: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource
        at org.jboss.jca.core.connectionmanager.AbstractConnectionManager.getManagedConnection(AbstractConnectionManager.java:690)
        at org.jboss.jca.core.connectionmanager.tx.TxConnectionManagerImpl.getManagedConnection(TxConnectionManagerImpl.java:430)
        at org.jboss.jca.core.connectionmanager.AbstractConnectionManager.allocateConnection(AbstractConnectionManager.java:789)
        at org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:138)
        ... 32 more
Caused by: javax.resource.ResourceException: IJ000655: No managed connections available within configured blocking timeout (30000 [ms])
        at org.jboss.jca.core.connectionmanager.pool.mcp.SemaphoreConcurrentLinkedDequeManagedConnectionPool.getConnection(SemaphoreConcurrentLinkedDequeManagedConnectionPool.java:570)
        at org.jboss.jca.core.connectionmanager.pool.AbstractPool.getSimpleConnection(AbstractPool.java:632)
        at org.jboss.jca.core.connectionmanager.pool.AbstractPool.getConnection(AbstractPool.java:604)
        at org.jboss.jca.core.connectionmanager.AbstractConnectionManager.getManagedConnection(AbstractConnectionManager.java:624)
        ... 35 more



Expected results:
Able to log in; connections are closed after restart and not left in 'idle' state.

Additional info:
Attempted to use the cached connection manager (CCM) to identify connections that were not closed correctly, per dev recommendation.  We had some configuration issues with this and are still looking into it.

We will also look at the number of connections from the JVM side by querying /ENGINEDataSource/statistics/pool?operation=resource&include-runtime.
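(A rough sketch of that query over the WildFly HTTP management API; the management port 8706 and the admin@internal credentials are assumptions that depend on how the engine's management interface is configured, and include-runtime is passed as include-runtime=true:)

curl --digest -u admin@internal 'http://127.0.0.1:8706/management/subsystem/datasources/data-source/ENGINEDataSource/statistics/pool?operation=resource&include-runtime=true'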

Comment 2 Martin Perina 2017-11-23 09:12:45 UTC
Several comments from my side:

1. Please upgrade WildFly to 11.0.0.Final, which was released a month ago

2. Please upgrade to the latest engine, and especially make sure you are using vdsm-jsonrpc-java-1.4.9; according to Ravi's measurements it significantly improves the ability to handle host status changes and makes the engine much faster to react

3. Either I'm blind or you have provided incorrect logs, because I cannot find a single occurrence of CannotGetJdbcConnectionException in the provided logs

4. I don't understand the issue from the description: are those errors about failing to get a DB connection happening only after an engine restart, or always?

5. Looking at the postgresql logs, the messages around 'found orphan temp table' should probably be fixed by BZ 1510432

Comment 3 mlehrer 2017-11-23 13:52:44 UTC
Created attachment 1358237 [details]
Previous engine.log was not correct; the updated version, named 'correct_engine.log', is attached in this zip

Attaching the correct engine.log file, named 'correct_engine.log', inside this zip

Comment 4 mlehrer 2017-11-23 14:11:49 UTC
Upgraded to 4.2.0-0.0.master.20171121184703.git173fe83.el7.centos and the issue still occurs as described in this bug.
If this build (4.2.0-0.0.master.20171121184703.git173fe83) is not sufficient, please cite which build should be used for reproduction.

(In reply to Martin Perina from comment #2)
> Several comments from my side:
> 
> 1. Please upgrade WildFly to 11.0.0.Final, which was release month ago

Current Version: ovirt-engine-wildfly-11.0.0-1.el7.centos.x86_64, does this include the upgrade to 11.0.0.Final as per your own change note:

Tue Oct 24 2017 Martin Perina <mperina> - 11.0.0-1
- Update to upstream version 11.0.0.Final 

If I am incorrect and this wildfly version: '11.0.0.Final' is not default in ovirt build '4.2.0-0.0.master.20171121184703.git173fe83.el7.centos' please supply this as a formal request to eberman/dagur so that we'll know to prioritize this change for our regression efforts and bug verifications, as we are currently requested to use the standard upstream build without modifications, or any application tuning.

> 
> 2. Please upgrade to latest engine, especially please make sure you are
> using vdsm-jsonrpc-java-1.4.9, according to Ravi's measurement it
> significantly improves ability to handle host status changes and makes
> engine much faster to react
> 

The engine was updated to 4.2.0-0.0.master.20171121184703.git173fe83.el7 
This is the version found: vdsm-jsonrpc-java-1.4.9-1.20171120122807.git1a5ecf6.el7.centos.noarch

> 3. Either I'm blind or you have provided incorrect logs, because I cannot
> found a single occurrence of CannotGetJdbcConnectionException in provided
> logs
> 

It's obviously the wrong log, uploaded by mistake; I've uploaded the corrected version after the engine was upgraded.

> 4. I don't understand the issue from the description: are those errors about
> failing to get db connection happening only after engine restart or always?
> 

Always. The engine starts and you can't log in because of these errors.
Originally when this happened, you could wait a few more minutes, get the 'Unable to get managed connection for java:/ENGINEDataSource' exception, and then log in successfully; today you cannot log in even after restarting the services.

Currently a successful ovirt-engine login is ***not possible*** unless one of the following is done:
  - kill one of the idle queries to the engine on postgres
  - increase ENGINE_DB_MAX_CONNECTIONS and postgres's max_connections value
  - reboot the engine machine, since restarting postgres/ovirt* doesn't release the connections to postgres

> 5. Looking at postgresql logs, messages around 'found orphan temp table'
> should probably be fixed by BZ1510432

Understood.

Comment 5 mlehrer 2017-11-24 06:01:34 UTC
Verified that 11.0.0.Final is the running 'product version' of WildFly via the JBoss CLI.

Comment 6 Martin Perina 2017-11-24 13:13:07 UTC
(In reply to mlehrer from comment #4)
> Upgraded to 4.2.0-0.0.master.20171121184703.git173fe83.el7.centos and the
> issue still occurs as described in this bug.
> If this build of: 4.2.0-0.0.master.20171121184703.git173fe83 is not
> sufficient please cite which build should be used for reproduction.
> 
> (In reply to Martin Perina from comment #2)
> > Several comments from my side:
> > 
> > 1. Please upgrade WildFly to 11.0.0.Final, which was release month ago
> 
> Current Version: ovirt-engine-wildfly-11.0.0-1.el7.centos.x86_64, does this
> include the upgrade to 11.0.0.Final as per your own change note:
> 
> Tue Oct 24 2017 Martin Perina <mperina> - 11.0.0-1
> - Update to upstream version 11.0.0.Final 
> 
> If I am incorrect and this wildfly version: '11.0.0.Final' is not default in
> ovirt build '4.2.0-0.0.master.20171121184703.git173fe83.el7.centos' please
> supply this as a formal request to eberman/dagur so that we'll know to
> prioritize this change for our regression efforts and bug verifications, as
> we are currently requested to use the standard upstream build without
> modifications, or any application tuning.

Yes, this is the right version. I was asking about it because in the provided server.log I still see 'WildFly Full 11.0.0.CR1', which was provided by ovirt-engine-wildfly-11.0.0.0-0.2

> 
> > 
> > 2. Please upgrade to latest engine, especially please make sure you are
> > using vdsm-jsonrpc-java-1.4.9, according to Ravi's measurement it
> > significantly improves ability to handle host status changes and makes
> > engine much faster to react
> > 
> 
> The engine was updated to 4.2.0-0.0.master.20171121184703.git173fe83.el7 
> This is the version found:
> vdsm-jsonrpc-java-1.4.9-1.20171120122807.git1a5ecf6.el7.centos.noarch

Great

> 
> > 3. Either I'm blind or you have provided incorrect logs, because I cannot
> > found a single occurrence of CannotGetJdbcConnectionException in provided
> > logs
> > 
> 
> It's obviously the wrong log, uploaded by mistake; I've uploaded the
> corrected version after the engine was upgraded.

OK, I see the exception now in engine.log, but unfortunately server.log and thread dump are not aligned with engine.log so I cannot get the full picture.

Also, for some unknown reason there are empty exception messages from VdsManager, and the full stacktrace is visible only when debug logging is enabled.

So could you please provide the following:

1. Turn on DEBUG logging for all org.ovirt classes, which can be done by editing /usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.xml.in and changing INFO to DEBUG in the following sections:

...
      <logger category="org.ovirt" use-parent-handlers="false">
        <level name="INFO"/>
        <handlers>
...
     <logger category="org.ovirt.engine.core.bll">
        <level name="INFO"/>
      </logger>
...

 and restart ovirt-engine afterwards

2. Wait until issue is happening

3. Do the thread dump by executing 'kill -3 <PID>' for ovirt-engine JVM process

4. Gather all engine logs (including console.log, server.log and engine.log) and postgresql logs
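(Steps 3 and 4 could look roughly like the following on a default install; the pgrep pattern and the PostgreSQL log path are assumptions, so adjust them to the actual setup:)

# send SIGQUIT to the engine JVM; the thread dump is written to the JVM's console log
ENGINE_PID=$(pgrep -f 'jboss-modules.jar')   # assumes only the engine JVM matches this pattern
kill -3 "$ENGINE_PID"
# collect engine and PostgreSQL logs for the affected time window
tar czf ovirt-debug-logs.tar.gz \
    /var/log/ovirt-engine/engine.log \
    /var/log/ovirt-engine/server.log \
    /var/log/ovirt-engine/console.log \
    /var/lib/pgsql/data/pg_log/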

> 
> > 4. I don't understand the issue from the description: are those errors about
> > failing to get db connection happening only after engine restart or always?
> > 
> 
> Always. The engine starts and you can't log in because of these errors.

Is it also happening before engine startup? Or does this issue start only after startup (where we know we have a problem, as the whole monitoring is started at once) and things return back to normal after about 5 minutes?

Have you added hosts/VMs gradually, so you could say that, for example, up to 200 hosts with 2000 VMs everything was fine and issues started to appear with a higher number of hosts/VMs?

> Originally when this happened, you could wait a few more minutes, and get
> this exception thrown of 'Unable to get managed connection for
> java:/ENGINEDataSource' and then you could log on successfully, today you
> cannot log in even after services restart.
> 
> Currently a successful ovirt-enigne 'log on' is ***not possible*** unless
> one of these options are done:
>   - kill one of the idle queries to the engine on postgres
>   - increase ENGINE_DB_MAX_CONNECTIONS and postgres's max connection value
>   - reboot the engine machince since restarting postgres/ovirt* doesn't
> release the connections to postgres

Changing those settings may be required, because we don't have enough HW to test and improve. The following configurations ran fine without any issues on Ravi's setup:

1. engine with 1 real and 750 fakevdsm hosts without any VMs

2. engine with 1 real and 500 fakevdsm hosts with 2000 VMs (but VMs were stopped)

Comment 7 eberman 2017-11-28 10:55:57 UTC
(In reply to Martin Perina from comment #6)
> (In reply to mlehrer from comment #4)
> > Upgraded to 4.2.0-0.0.master.20171121184703.git173fe83.el7.centos and the
> > issue still occurs as described in this bug.
> > If this build of: 4.2.0-0.0.master.20171121184703.git173fe83 is not
> > sufficient please cite which build should be used for reproduction.
> > 
> > (In reply to Martin Perina from comment #2)
> > > Several comments from my side:
> > > 
> > > 1. Please upgrade WildFly to 11.0.0.Final, which was release month ago
> > 
> > Current Version: ovirt-engine-wildfly-11.0.0-1.el7.centos.x86_64, does this
> > include the upgrade to 11.0.0.Final as per your own change note:
> > 
> > Tue Oct 24 2017 Martin Perina <mperina> - 11.0.0-1
> > - Update to upstream version 11.0.0.Final 
> > 
> > If I am incorrect and this wildfly version: '11.0.0.Final' is not default in
> > ovirt build '4.2.0-0.0.master.20171121184703.git173fe83.el7.centos' please
> > supply this as a formal request to eberman/dagur so that we'll know to
> > prioritize this change for our regression efforts and bug verifications, as
> > we are currently requested to use the standard upstream build without
> > modifications, or any application tuning.
> 
> Yes, this is the right version. I was asking about it, because in provided
> server.log I still 'WildFly Full 11.0.0.CR1' which was provided by
> ovirt-engine-wildfly-11.0.0.0-0.2
> 
> > 
> > > 
> > > 2. Please upgrade to latest engine, especially please make sure you are
> > > using vdsm-jsonrpc-java-1.4.9, according to Ravi's measurement it
> > > significantly improves ability to handle host status changes and makes
> > > engine much faster to react
> > > 
> > 
> > The engine was updated to 4.2.0-0.0.master.20171121184703.git173fe83.el7 
> > This is the version found:
> > vdsm-jsonrpc-java-1.4.9-1.20171120122807.git1a5ecf6.el7.centos.noarch
> 
> Great
> 
> > 
> > > 3. Either I'm blind or you have provided incorrect logs, because I cannot
> > > found a single occurrence of CannotGetJdbcConnectionException in provided
> > > logs
> > > 
> > 
> > It's obviously the wrong log, uploaded by mistake; I've uploaded the
> > corrected version after the engine was upgraded.
> 
> OK, I see the exception now in engine.log, but unfortunately server.log and
> thread dump are not aligned with engine.log so I cannot get the full picture.
> 
> Also for some unknown reason there are empty exception message from
> VdsManager and full stacktrace is visible only when debug log is enabled.
> 
> So could you please provided following:
> 
> 1. Turn on DEBUG logging for all org.ovirt class which can be done by
> editing /usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.xml.in
> and changing below to INFO to DEBUG in following sections:
> 
> ...
>       <logger category="org.ovirt" use-parent-handlers="false">
>         <level name="INFO"/>
>         <handlers>
> ...
>      <logger category="org.ovirt.engine.core.bll">
>         <level name="INFO"/>
>       </logger>
> ...
> 
>  and restart ovirt-engine afterwards
> 
> 2. Wait until issue is happening
> 
> 3. Do the thread dump by executing 'kill -3 <PID>' for ovirt-engine JVM
> process
> 
> 4. Gather all engine logs (including console.log, server.log and engine.log)
> and postgresql logs
> 
> > 
> > > 4. I don't understand the issue from the description: are those errors about
> > > failing to get db connection happening only after engine restart or always?
> > > 
> > 
> > Always. The engine starts and you can't log in because of these errors.
> 
> Is it also happening before engine startup? Or does this issue start only
> after startup (where we know we have a problem as whole monitoring is
> started at once) and things returns back to normal after cca 5 minutes?
> 
> Have you added hosts/VMs gradually, so you could say that for example up to
> 200 hosts with 2000 VMs everything was fine and issues started to appear
> with higher number of hosts/VMs?
> 
> > Originally when this happened, you could wait a few more minutes, and get
> > this exception thrown of 'Unable to get managed connection for
> > java:/ENGINEDataSource' and then you could log on successfully, today you
> > cannot log in even after services restart.
> > 
> > Currently a successful ovirt-enigne 'log on' is ***not possible*** unless
> > one of these options are done:
> >   - kill one of the idle queries to the engine on postgres
> >   - increase ENGINE_DB_MAX_CONNECTIONS and postgres's max connection value
> >   - reboot the engine machince since restarting postgres/ovirt* doesn't
> > release the connections to postgres
> 
> Changing those settings may be required, because we don't have enough HW to
> test and improve. Following configurations went fine without any issues on
> Ravi's setup:
> 
> 1. engine with 1 real and 750 fakevdsm hosts without any VMs
> 
> 2. engine with 1 real and 500 fakevdsm hosts with 2000 VMs (but VMs were
> stopped)

Just to update: we have some env issues with getting MAC addresses for the nested hosts. Opened a ticket for eng-ops; will update once we are able to bring the env back up and validate stability.

Comment 8 Yaniv Kaul 2017-11-30 13:42:15 UTC
Waiting for debug logs.

Comment 9 mlehrer 2017-11-30 19:47:50 UTC
(In reply to Yaniv Kaul from comment #8)
> Waiting for debug logs.

##Current status##
Currently not reproducing the issue at 238 hosts; will add more hosts to reach 290 hosts.

## Env ##
Version: 4.2.0-0.0.master.20171121184703.git173fe83.el7.centos
238 total hosts (of which 6 are physical and 232 nested; 211 in up state)
838 total VMs (including those running nested hosts; 835 in up state)


## Details ##
After we upgraded to 4.2.0-0.0.master.20171121184703, we saw the issue remained and updated this thread. We then enabled debug logging and restarted the system, and the issue did not reproduce.

At this point there were 6 physical hosts in up state and 290 nested hosts in non-responsive state.  It was not possible to activate the nested hosts because the VMs holding them had changed IP, and the nested hosts that were installed on the old IPs could no longer activate.  Additionally, it was not possible to add more nested hosts because we received a message upon VM creation that we were out of MAC pool addresses.  Ops investigated and the MAC pool range was adjusted from being listed as 3 pools to 1 pool.

Following this change we were able to remove and recreate the VMs and install 232 nested hosts, of which 211 have been in active state for over 24 hours. The following are the pool statistics in the current state of 232 hosts.

#Pool Statistics for Engine Datasource#
ActiveCount 3
AvailableCount 100
AverageBlockingTime 13
AverageCreationTime 115
AverageGetTime 2
AveragePoolTime 315
AverageUsageTime 1
BlockingFailureCount 0
CreatedCount 3
DestroyedCount 0
IdleCount 3
InUseCount 0
MaxCreationTime 133
MaxGetTime 154
MaxPoolTime 11183055
MaxUsageTime 1504
MaxUsedCount 100
MaxWaitCount 89
MaxWaitTime 71
TimedOut 0
TotalBlockingTime 29964
TotalCreationTime 347
TotalGetTime 42913
TotalPoolTime 1059819132
TotalUsageTime 5706727
WaitCount 12


Postgres total connections: 110

After more than 24 hours, the issue has not reproduced.
I will add the additional hosts, and if the error does not re-appear after that time, the bug can be closed.

## Timeline ##
Bug reproducible on 4.2.0-0.0.master.20171025204923; this BZ opened.
Enabled cached connection manager (CCM) for leak detection
Upgraded to build 4.2.0-0.0.master.20171121184703
BZ Thread Updated
Disabled cached connection manager (CCM) 
System Rebooted
Issue not reproducing
Nested Hosts in nonresponsive state
Mac Pool Adjusted
Nested Hosts recreated
211 activated hosts, and issue not reproducing.

@Martin, if you are interested in reviewing the engine.log with debug enabled prior to the restart, the log can be found zipped here: https://drive.google.com/a/redhat.com/file/d/1hEWTxnh38dWq7-f16_UGOoaCTyto3h-7/view?usp=sharing

## Next Steps ##
Increase the number of hosts to 290 to see if the issue presents again or if AvailableCount decreases over time.

Comment 10 Martin Perina 2017-12-13 11:21:15 UTC
Closing for now, feel free to reopen if issue is reproduced

Comment 11 mlehrer 2017-12-14 13:00:12 UTC
(In reply to Martin Perina from comment #10)
> Closing for now, feel free to reopen if issue is reproduced

The issue is currently reproducing.
The following things have changed since the bug was opened.
- engine/vdsm upstream version
- # of hosts/vms
- engine location (same hw specs, different physical host)

Spoke with mperina directly in order to provide direct access to the environment.

Comment 12 Yaniv Kaul 2017-12-14 13:07:33 UTC
(In reply to mlehrer from comment #11)
> (In reply to Martin Perina from comment #10)
> > Closing for now, feel free to reopen if issue is reproduced
> 
> The issue is currently reproducing.
> The following things have changed since the bug was opened.
> - engine/vdsm upstream version

version?

> - # of hosts/vms

data?

> - engine location (same hw specs, different physical host)
> 
> Spoke with mperina directly, in order to provide direct access to the
> enviroment.

Logs?

Comment 13 mlehrer 2017-12-14 13:45:39 UTC
Created attachment 1367987 [details]
engine log and log from the VDSM host that holds the VMs of the nested hosts

Comment 14 mlehrer 2017-12-14 13:51:44 UTC
(In reply to Yaniv Kaul from comment #12)
> (In reply to mlehrer from comment #11)
> > (In reply to Martin Perina from comment #10)
> > > Closing for now, feel free to reopen if issue is reproduced
> > 
> > The issue is currently reproducing.
> > The following things have changed since the bug was opened.
> > - engine/vdsm upstream version
> 
> version?
ovirt-engine-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch

> 
> > - # of hosts/vms
> 
> data?
> 

3 DCs
3 clusters
406 hosts:
  Hera: 4 hosts
  leopard: 2 hosts
  nested hosts: 400 VMs
3 SDs
1200 VMs:
  Hera: 685 VMs
  leopard: 515 VMs

> > - engine location (same hw specs, different physical host)
> > 
> > Spoke with mperina directly, in order to provide direct access to the
> > enviroment.
> 
> Logs?

attached

Comment 16 Ravi Nori 2017-12-14 17:36:41 UTC
Here is a list of things from my analysis:

1. The thread dump looks normal 
   118 threads TIMED_WAITING
   110 threads WAITING
   34 threads RUNNABLE

2. The server and engine logs list only one error 
   No managed connections available within configured blocking timeout (30000 [ms])

3. I checked the pool statistics in JMX. The max connection count is 109 before the errors, but if everything is normal we should be able to handle that.

4. The problem could be in what is actually being executed on the database.
   On postgres there were 112 queries, 10 of which were VACUUM statements on the ovirt_engine_history database.

I stopped both ovirt-engine and ovirt-engine-dwhd and did a full vacuum of the engine and ovirt_engine_history databases. I have restarted the engine and everything is running normally. I have not started ovirt-engine-dwhd yet. Let's see if this fixes the issue.
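(A minimal sketch of that procedure, assuming the stock service names and plain vacuumdb rather than oVirt's own vacuum tooling:)

# stop the engine and DWH so nothing holds connections during the vacuum
systemctl stop ovirt-engine ovirt-engine-dwhd
# full vacuum of both databases as the postgres superuser
su - postgres -c "vacuumdb --full --analyze engine"
su - postgres -c "vacuumdb --full --analyze ovirt_engine_history"
# bring the engine back up (dwhd left stopped for now)
systemctl start ovirt-engine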

Comment 17 Ravi Nori 2017-12-15 14:18:09 UTC
After a day the engine seems to be doing fine with dwh service stopped.

Just started dwh, will monitor the setup to see if things hold up.

Comment 18 Roy Golan 2018-01-15 13:31:53 UTC
We should document how a busy system influences the deployment, but I think we need to retest it to first verify this is the root cause and, if it happens again, to monitor more specifically so we understand under what conditions we should recommend moving DWH to a separate instance.

Comment 19 Yaniv Kaul 2018-01-15 14:22:35 UTC
(In reply to Roy Golan from comment #18)
> We should document how a busy system influences the deployment but I think
> we need to retest it again to first verify it is the root cause and if it
> happens again to try to monitor more specifically to understand under what
> conditions we should recommend to move DWH to separate instance.

Agree, + on 7.5.
Also note that it was a low-performing host. While I'm not sure if the bottleneck in vacuuming is CPU only, I assume it directly affects it.

I'm more concerned about lower numbers (much lower numbers) with hosted-engine-based deployments, where it's not as feasible to move the DWH to a different machine/VM.

Comment 21 Ravi Nori 2018-02-05 17:35:34 UTC
In my test environment with 300 hosts and 600 running VMs the database was a bottleneck. With postgres max_connections of 150 in postgresql.conf and ENGINE_DB_MAX_CONNECTIONS of 100 on the engine, the engine would run into database connection issues when hosts were stopped or started, toggling their state.

I had to increase max_connections in postgresql.conf to 200 and set ENGINE_DB_MAX_CONNECTIONS on the engine to 150 for the engine to work properly without any database connection issues.
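(A rough sketch of those two changes; the postgresql.conf path, the engine.conf.d override file name, and the PostgreSQL service name are assumptions that depend on the install:)

# PostgreSQL side: raise the server-wide limit, e.g. in /var/lib/pgsql/data/postgresql.conf
#   max_connections = 200
# Engine side: override the pool size via a drop-in config file
cat > /etc/ovirt-engine/engine.conf.d/99-db-connections.conf <<'EOF'
ENGINE_DB_MAX_CONNECTIONS=150
EOF
# restart both services so the new limits take effect
systemctl restart postgresql ovirt-engine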

Comment 22 Yaniv Kaul 2018-02-05 17:39:13 UTC
(In reply to Ravi Nori from comment #21)
> In my test environment with 300 hosts and 600 running VMs the database was a
> bottle neck. With 150 postgres max_connections in postgresql.conf and engine
> ENGINE_DB_MAX_CONNECTIONS of 100 the engine would run into database
> connection issues when hosts were stopped or started to toggle their state. 
> 
> I had to increase the max_connections in postgresql.conf to 200 and set
> ENGINE_DB_MAX_CONNECTIONS in engine to 150 for the engine to work properly
> with out any database connection issues.

What's the drawback to increasing the default?

Comment 23 Ravi Nori 2018-02-06 14:33:43 UTC
I don't see any drawback in increasing the defaults. In my setup I did not have to increase the shared_buffers memory for postgresql; just increasing max_connections on postgresql and ENGINE_DB_MAX_CONNECTIONS on the engine side made the UI a lot snappier under heavy load, and there were no database connection exceptions.

Comment 24 Yaniv Kaul 2018-02-06 14:40:34 UTC
(In reply to Ravi Nori from comment #23)
> I don't see any drawback in increasing the defaults. In my setup I did not
> have to increase the shared_buffers memory for postgresql just increasing
> the max_connection on postgresql and ENGINE_DB_MAX_CONNECTIONS on engine
> side made the UI a lot snappier under heavy load and there were no database
> connection exceptions.

I assume it has memory consumption implications. They may be benign - need to check.
If they are, we can change the defaults. If they are not, I assume engine installation should look at total host memory, and if it's large enough (say >32GB), then set it accordingly to higher values.

Comment 25 Roy Golan 2018-02-06 16:49:43 UTC
(In reply to Yaniv Kaul from comment #24)
> (In reply to Ravi Nori from comment #23)
> > I don't see any drawback in increasing the defaults. In my setup I did not
> > have to increase the shared_buffers memory for postgresql just increasing
> > the max_connection on postgresql and ENGINE_DB_MAX_CONNECTIONS on engine
> > side made the UI a lot snappier under heavy load and there were no database
> > connection exceptions.
> 
> I assume it has memory consumption implications. They may be benign - need
> to check.

The base calculation is
   # of connections * work_mem
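(As a rough worked example, with a work_mem of, say, 4 MB, 200 connections could claim on the order of 200 * 4 MB = 800 MB for sorts/hashes in the worst case, on top of shared_buffers; the 4 MB figure is only an assumed value, not necessarily this setup's setting.)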

So it is not negligible, especially in hosted engine cases.

We also need to think about how to tackle the 'DB update storm' when the engine starts and we query the state of the DC. In other words, we have a
trade-off to make between code changes and configuration changes, and this we need to discuss further.

> If they are, we can change the defaults. If they are not, I assume engine
> installation should look at total host memory, and if it's large enough (say
> >32GB), then set it accordingly to higher values.

Comment 26 mlehrer 2018-02-21 20:48:20 UTC
Currently running downstream 4.2.2-0.1.el7 on RHEL 7.5 (3.10.0-829.el7.x86_64).
We have the DWH DB on a remote host, and the engine and engine DB on the same tier.

reproducing "2018-02-21 19:54:40,662Z ERROR [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [] Exception: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource"

Total VMs: 1418
Total Hosts: 284 of which 34 are physical hosts.
All VMs & Hosts are in up status.

SoS logs:
https://drive.google.com/open?id=1vjJw2jcsHCF1g9c3Wg89px9-aIFK4Jwf

I'm still able to log in to the web admin UI; unfortunately the JMX public interface is not set, so I can't get JMX access without a service restart, which may clear this, so I'll hold off on that.

pg_stat_activity shows that the total number of connections for datname 'engine' is 105, of which 1 is active and 104 are idle.
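(The check was roughly of this form; the column names assume PostgreSQL 9.2 or later:)

su - postgres -c "psql -c \"SELECT datname, state, count(*) FROM pg_stat_activity GROUP BY datname, state ORDER BY datname;\""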

We are running with the default connection settings; it seems like it would make sense to increase the connection settings and monitor the pool via JMX.
Thoughts?

Comment 27 Roy Golan 2018-02-22 08:05:06 UTC
(In reply to mlehrer from comment #26)

I'm going over the logs now.

You can definitely enable the engine datasource pool statistics at runtime:

$ $JBOSS_HOME/bin/jboss-cli.sh --controller=remote+http://127.0.0.1:8706 --connect --user=admin@internal

# enable stats collections
/subsystem=datasources/data-source=ENGINEDataSource:write-attribute(name=statistics-enabled, value=true)

 /subsystem=datasources/data-source=ENGINEDataSource/statistics=jdbc:write-attribute(name=statistics-enabled, value=true)

/subsystem=datasources/data-source=ENGINEDataSource/statistics=pool:write-attribute(name=statistics-enabled, value=true)

# read them
ls /subsystem=datasources/data-source=ENGINEDataSource/statistics=jdbc

ls /subsystem=datasources/data-source=ENGINEDataSource/statistics=pool

Comment 33 Yaniv Kaul 2018-03-11 06:45:52 UTC
Please perform the further testing needed (as mentioned in comment 32 above) and open a docs bug.

Comment 34 Martin Perina 2018-04-05 13:13:35 UTC
Mordechai, have you already opened a docs bug about how to increase the number of connections, so we can close this bug?

Comment 35 mlehrer 2018-04-22 19:55:24 UTC
(In reply to Martin Perina from comment #34)
> Mordechai, have you already open docs bug how to increase number of
> connections so we can close this bug?

Yes. Updated https://bugzilla.redhat.com/show_bug.cgi?id=1570424

