Created attachment 1357401 [details]
server.log, engine.log, postgres log, and thread dump of ovirt-engine

Description of problem:

When attempting to log into the oVirt web admin, we receive the message "Unable to get a managed connection for java:/ENGINEDataSource" and cannot log in successfully. Postgres shows many idle (likely orphan) connections.

The env contains: 889 VMs, 6 active physical hosts, and 290 nested hosts in maintenance.

The issue was seen originally when starting the ovirt-engine service began to take longer amounts of time. Eventually a log in would succeed, but only after throwing a connection error regarding timeouts for getting a managed connection. Currently a successful ovirt-engine log on is not possible unless one of these options is done:
- kill one of the idle queries to the engine on postgres
- increase ENGINE_DB_MAX_CONNECTIONS and postgres's max_connections value
- reboot the engine; restarting postgres/ovirt* doesn't release the connections

Version-Release number of selected component (if applicable):
PACKAGE_VERSION="4.2.0_master"
4.2.0-0.0.master.20171025204923.git6f4cbc5.el7.centos

How reproducible:
Very

Steps to Reproduce:
1. Create an env of 200 hosts and 800 VMs
2. Start and stop the VMs over several days (basic use, not constant stress)
3. Attempt to restart ovirt-engine

Actual results:
Cannot log in

2017-11-22 11:45:03,503+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-95) [] Exception:: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource
    at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:80) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:619) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:684) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:716) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:766) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall.executeCallInternal(PostgresDbEngineDialect.java:152) [dal.jar:]
    at org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall.doExecute(PostgresDbEngineDialect.java:118) [dal.jar:]
    at org.springframework.jdbc.core.simple.SimpleJdbcCall.execute(SimpleJdbcCall.java:198) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeImpl(SimpleJdbcCallsHandler.java:135) [dal.jar:]
    at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeReadList(SimpleJdbcCallsHandler.java:105) [dal.jar:]
    at org.ovirt.engine.core.dao.network.VmNetworkInterfaceDaoImpl.getAllForMonitoredVm(VmNetworkInterfaceDaoImpl.java:51) [dal.jar:]
    at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.loadVmNetworkInterfaces(VmAnalyzer.java:1057) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.updateInterfaceStatistics(VmAnalyzer.java:890) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.updateStatistics(VmAnalyzer.java:858) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.proceedVmReportedOnTheSameHost(VmAnalyzer.java:627) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.analyze(VmAnalyzer.java:155) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.lambda$analyzeVms$1(VmsMonitoring.java:136) [vdsbroker.jar:]
    at java.util.ArrayList.forEach(ArrayList.java:1249) [rt.jar:1.8.0_144]
    at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.analyzeVms(VmsMonitoring.java:131) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.perform(VmsMonitoring.java:94) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:43) [vdsbroker.jar:]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_144]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_144]
    at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) [javax.enterprise.concurrent-1.0.jar:]
    at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) [javax.enterprise.concurrent-1.0.jar:]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_144]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_144]
    at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_144]
    at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent-1.0.jar:]
    at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)
Caused by: java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource
    at org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:146)
    at org.jboss.as.connector.subsystems.datasources.WildFlyDataSource.getConnection(WildFlyDataSource.java:64)
    at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:77) [spring-jdbc.jar:4.3.9.RELEASE]
    ... 29 more
Caused by: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource
    at org.jboss.jca.core.connectionmanager.AbstractConnectionManager.getManagedConnection(AbstractConnectionManager.java:690)
    at org.jboss.jca.core.connectionmanager.tx.TxConnectionManagerImpl.getManagedConnection(TxConnectionManagerImpl.java:430)
    at org.jboss.jca.core.connectionmanager.AbstractConnectionManager.allocateConnection(AbstractConnectionManager.java:789)
    at org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:138)
    ... 32 more
Caused by: javax.resource.ResourceException: IJ000655: No managed connections available within configured blocking timeout (30000 [ms])
    at org.jboss.jca.core.connectionmanager.pool.mcp.SemaphoreConcurrentLinkedDequeManagedConnectionPool.getConnection(SemaphoreConcurrentLinkedDequeManagedConnectionPool.java:570)
    at org.jboss.jca.core.connectionmanager.pool.AbstractPool.getSimpleConnection(AbstractPool.java:632)
    at org.jboss.jca.core.connectionmanager.pool.AbstractPool.getConnection(AbstractPool.java:604)
    at org.jboss.jca.core.connectionmanager.AbstractConnectionManager.getManagedConnection(AbstractConnectionManager.java:624)
    ... 35 more

Expected results:
Able to log in; connections closed after restart and not left in 'idle' state.

Additional info:
Attempted to use the cached connection manager (CCM) to identify connections which were not closed correctly, per dev recommendation. We had some configuration issues with this and are still looking at it. We will also look at the number of connections from the JVM side via /ENGINEDataSource/statistics/pool?operation=resource&include-runtime
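For reference, the pool-statistics path mentioned above can be read over the WildFly HTTP management API. A minimal sketch, assuming the engine's management interface listens on 127.0.0.1:8706 (as in this setup) and that valid management credentials exist — both are assumptions to verify on your install:

```shell
# Hedged sketch: read ENGINEDataSource pool statistics via the WildFly HTTP
# management API. Port 8706 and the 'admin@internal' credentials are
# assumptions for this environment; adjust for yours.
curl --digest -u 'admin@internal:PASSWORD' \
  'http://127.0.0.1:8706/management/subsystem/datasources/data-source/ENGINEDataSource/statistics/pool?operation=resource&include-runtime=true'
```

Statistics collection must be enabled on the datasource first, or most counters will read zero.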
Several comments from my side:

1. Please upgrade WildFly to 11.0.0.Final, which was released a month ago.

2. Please upgrade to the latest engine, and especially make sure you are using vdsm-jsonrpc-java-1.4.9; according to Ravi's measurements it significantly improves the ability to handle host status changes and makes the engine much faster to react.

3. Either I'm blind or you have provided incorrect logs, because I cannot find a single occurrence of CannotGetJdbcConnectionException in the provided logs.

4. I don't understand the issue from the description: are those errors about failing to get a db connection happening only after engine restart, or always?

5. Looking at the postgresql logs, the messages around 'found orphan temp table' should probably be fixed by BZ1510432.
Created attachment 1358237 [details]
Previous engine.log was not correct; the updated version, called 'correct_engine.log', is attached in this zip.
Upgraded to 4.2.0-0.0.master.20171121184703.git173fe83.el7.centos and the issue still occurs as described in this bug. If this build (4.2.0-0.0.master.20171121184703.git173fe83) is not sufficient, please cite which build should be used for reproduction.

(In reply to Martin Perina from comment #2)
> Several comments from my side:
>
> 1. Please upgrade WildFly to 11.0.0.Final, which was released a month ago

Current version: ovirt-engine-wildfly-11.0.0-1.el7.centos.x86_64. Does this include the upgrade to 11.0.0.Final, as per your own change note?

Tue Oct 24 2017 Martin Perina <mperina> - 11.0.0-1
- Update to upstream version 11.0.0.Final

If I am incorrect and this WildFly version ('11.0.0.Final') is not the default in oVirt build '4.2.0-0.0.master.20171121184703.git173fe83.el7.centos', please supply this as a formal request to eberman/dagur so that we'll know to prioritize this change for our regression efforts and bug verifications, as we are currently requested to use the standard upstream build without modifications or any application tuning.

> 2. Please upgrade to latest engine, especially please make sure you are
> using vdsm-jsonrpc-java-1.4.9, according to Ravi's measurement it
> significantly improves ability to handle host status changes and makes
> engine much faster to react

The engine was updated to 4.2.0-0.0.master.20171121184703.git173fe83.el7. This is the version found: vdsm-jsonrpc-java-1.4.9-1.20171120122807.git1a5ecf6.el7.centos.noarch

> 3. Either I'm blind or you have provided incorrect logs, because I cannot
> find a single occurrence of CannotGetJdbcConnectionException in provided
> logs

It's obviously the wrong log, uploaded by mistake; I've uploaded the corrected version after the engine was upgraded.

> 4. I don't understand the issue from the description: are those errors about
> failing to get db connection happening only after engine restart or always?

Always. The engine starts and you can't log in because of these errors.
Originally when this happened, you could wait a few more minutes, get the 'Unable to get managed connection for java:/ENGINEDataSource' exception thrown, and then log on successfully; today you cannot log in even after restarting the services.

Currently a successful ovirt-engine log on is ***not possible*** unless one of these options is done:
- kill one of the idle queries to the engine on postgres
- increase ENGINE_DB_MAX_CONNECTIONS and postgres's max_connections value
- reboot the engine machine, since restarting postgres/ovirt* doesn't release the connections to postgres

> 5. Looking at postgresql logs, messages around 'found orphan temp table'
> should probably be fixed by BZ1510432

Understood.
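The first workaround above (killing an idle engine query in postgres) can be sketched as follows. The database name 'engine' and psql access as the postgres user are assumptions based on a default oVirt install:

```shell
# List idle engine connections, oldest first (assumes DB name 'engine').
su - postgres -c "psql -d engine -c \"SELECT pid, state, query_start FROM pg_stat_activity WHERE datname = 'engine' AND state = 'idle' ORDER BY query_start;\""

# Then terminate a chosen backend by pid, which frees one pooled connection
# (substitute a real pid from the list above):
# su - postgres -c "psql -d engine -c 'SELECT pg_terminate_backend(<pid>);'"
```

Note that pg_terminate_backend() only treats the symptom; the pool on the engine side may still believe the connection is in use.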
Verified that 11.0.0.Final is the running 'product version' of WildFly via the JBoss CLI.
(In reply to mlehrer from comment #4)
> Upgraded to 4.2.0-0.0.master.20171121184703.git173fe83.el7.centos and the
> issue still occurs as described in this bug.
> If this build of: 4.2.0-0.0.master.20171121184703.git173fe83 is not
> sufficient please cite which build should be used for reproduction.
>
> (In reply to Martin Perina from comment #2)
> > Several comments from my side:
> >
> > 1. Please upgrade WildFly to 11.0.0.Final, which was released a month ago
>
> Current Version: ovirt-engine-wildfly-11.0.0-1.el7.centos.x86_64, does this
> include the upgrade to 11.0.0.Final as per your own change note:
>
> Tue Oct 24 2017 Martin Perina <mperina> - 11.0.0-1
> - Update to upstream version 11.0.0.Final
>
> If I am incorrect and this wildfly version: '11.0.0.Final' is not default in
> ovirt build '4.2.0-0.0.master.20171121184703.git173fe83.el7.centos' please
> supply this as a formal request to eberman/dagur so that we'll know to
> prioritize this change for our regression efforts and bug verifications, as
> we are currently requested to use the standard upstream build without
> modifications, or any application tuning.

Yes, this is the right version. I was asking about it because in the provided server.log I still see 'WildFly Full 11.0.0.CR1', which was provided by ovirt-engine-wildfly-11.0.0.0-0.2.

> > 2. Please upgrade to latest engine, especially please make sure you are
> > using vdsm-jsonrpc-java-1.4.9, according to Ravi's measurement it
> > significantly improves ability to handle host status changes and makes
> > engine much faster to react
>
> The engine was updated to 4.2.0-0.0.master.20171121184703.git173fe83.el7
> This is the version found:
> vdsm-jsonrpc-java-1.4.9-1.20171120122807.git1a5ecf6.el7.centos.noarch

Great

> > 3. Either I'm blind or you have provided incorrect logs, because I cannot
> > find a single occurrence of CannotGetJdbcConnectionException in provided
> > logs
>
> It's obviously the wrong log, uploaded by mistake; I've uploaded the
> corrected version after the engine was upgraded.

OK, I see the exception now in engine.log, but unfortunately server.log and the thread dump are not aligned with engine.log, so I cannot get the full picture.

Also, for some unknown reason there are empty exception messages from VdsManager, and the full stacktrace is visible only when debug logging is enabled.

So could you please provide the following:

1. Turn on DEBUG logging for all org.ovirt classes, which can be done by editing /usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.xml.in and changing INFO to DEBUG in the following sections:

  ...
  <logger category="org.ovirt" use-parent-handlers="false">
    <level name="INFO"/>
    <handlers>
  ...
  <logger category="org.ovirt.engine.core.bll">
    <level name="INFO"/>
  </logger>
  ...

and restart ovirt-engine afterwards.

2. Wait until the issue is happening.

3. Take a thread dump by executing 'kill -3 <PID>' for the ovirt-engine JVM process.

4. Gather all engine logs (including console.log, server.log and engine.log) and the postgresql logs.

> > 4. I don't understand the issue from the description: are those errors about
> > failing to get db connection happening only after engine restart or always?
>
> Always. The engine starts and you can't log in because of these errors.

Is it also happening before engine startup? Or does this issue start only after startup (where we know we have a problem, as all monitoring is started at once), with things returning to normal after about 5 minutes?

Have you added hosts/VMs gradually, so you could say that, for example, up to 200 hosts with 2000 VMs everything was fine and issues started to appear with a higher number of hosts/VMs?
> Originally when this happened, you could wait a few more minutes, and get
> this exception thrown of 'Unable to get managed connection for
> java:/ENGINEDataSource' and then you could log on successfully, today you
> cannot log in even after services restart.
>
> Currently a successful ovirt-engine 'log on' is ***not possible*** unless
> one of these options are done:
> - kill one of the idle queries to the engine on postgres
> - increase ENGINE_DB_MAX_CONNECTIONS and postgres's max connection value
> - reboot the engine machine since restarting postgres/ovirt* doesn't
> release the connections to postgres

Changing those settings may be required, because we don't have enough HW to test and improve. The following configurations ran fine without any issues on Ravi's setup:

1. engine with 1 real and 750 fakevdsm hosts without any VMs

2. engine with 1 real and 500 fakevdsm hosts with 2000 VMs (but the VMs were stopped)
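The debug-collection steps requested above can be sketched as one shell sequence. Paths are upstream oVirt defaults and the pgrep pattern is an assumption; verify both on the affected machine:

```shell
# Flip INFO loggers to DEBUG in the engine logging template. Note: this
# changes every INFO level in the file, which is broader than strictly
# needed but covers the org.ovirt loggers named above.
sed -i 's|<level name="INFO"/>|<level name="DEBUG"/>|g' \
    /usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.xml.in
systemctl restart ovirt-engine

# Once the issue reproduces, request a JVM thread dump; with kill -3 the
# dump is written to the engine's console.log, not to stdout.
ENGINE_PID=$(pgrep -f 'jboss-modules.*ovirt-engine')   # assumed match pattern
kill -3 "$ENGINE_PID"

# Collect everything for analysis (postgres log dir assumed; adjust per PG
# version and distro layout).
tar czf engine-debug-logs.tar.gz \
    /var/log/ovirt-engine/engine.log \
    /var/log/ovirt-engine/server.log \
    /var/log/ovirt-engine/console.log \
    /var/lib/pgsql/data/pg_log/
```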
(In reply to Martin Perina from comment #6)
> [full quote of comment #6 trimmed]
Just to update: we have some env issues with getting MAC addresses for the nested hosts. Opened a ticket for eng-ops; will update once we are able to bring the env back up and validate stability.
Waiting for debug logs.
(In reply to Yaniv Kaul from comment #8)
> Waiting for debug logs.

## Current status ##

Currently not reproducing the issue at 238 hosts; will add more hosts to reach 290.

## Env ##

Version: 4.2.0-0.0.master.20171121184703.git173fe83.el7.centos
238 total hosts (of which 6 are physical and 232 nested; 211 in up state)
838 total VMs (including those running nested; 835 in up state)

## Details ##

After we upgraded to 4.2.0-0.0.master.20171121184703, we saw the issue remained, and updated this thread. We then enabled debug and restarted the system, and the issue did not reproduce. At this point there were 6 physical hosts in up state and 290 nested hosts in non-responsive state. It was not possible to activate the nested hosts because the VMs holding them had changed IP, and the nested hosts that were installed on the old IPs could no longer activate. Additionally, it was not possible to add more nested hosts because we received a message upon VM creation that we were out of MAC pool addresses. Ops investigated, and the MAC pool range was adjusted from being listed as 3 pools to 1 pool. Upon this change we were able to remove and recreate the VMs and install 232 nested hosts, of which 211 have been in active state for over 24 hrs.

The following are the pool statistics in the current state of 232 hosts.

# Pool Statistics for Engine Datasource #

ActiveCount           3
AvailableCount        100
AverageBlockingTime   13
AverageCreationTime   115
AverageGetTime        2
AveragePoolTime       315
AverageUsageTime      1
BlockingFailureCount  0
CreatedCount          3
DestroyedCount        0
IdleCount             3
InUseCount            0
MaxCreationTime       133
MaxGetTime            154
MaxPoolTime           11183055
MaxUsageTime          1504
MaxUsedCount          100
MaxWaitCount          89
MaxWaitTime           71
TimedOut              0
TotalBlockingTime     29964
TotalCreationTime     347
TotalGetTime          42913
TotalPoolTime         1059819132
TotalUsageTime        5706727
WaitCount             12

Postgres total connections: 110

After more than 24 hours, the issue has not reproduced.
I will add the additional hosts, and if the error does not re-appear after that time, the bug can be closed.

## Timeline ##

- Bug reproducible on 4.2.0-0.0.master.20171025204923; this BZ opened
- Enabled cached connection manager (CCM) for leak detection
- Upgraded to build 4.2.0-0.0.master.20171121184703
- BZ thread updated
- Disabled cached connection manager (CCM)
- System rebooted
- Issue not reproducing
- Nested hosts in non-responsive state
- MAC pool adjusted
- Nested hosts recreated
- 211 activated hosts, and the issue not reproducing

@Martin, if you are interested in reviewing the engine.log on debug prior to the restart, the log can be found zipped here:
https://drive.google.com/a/redhat.com/file/d/1hEWTxnh38dWq7-f16_UGOoaCTyto3h-7/view?usp=sharing

## Next Steps ##

Up the number of hosts to 290 to see if the issue presents again, or if AvailableCount decreases over time.
Closing for now; feel free to reopen if the issue reproduces.
(In reply to Martin Perina from comment #10)
> Closing for now, feel free to reopen if issue is reproduced

The issue is currently reproducing. The following things have changed since the bug was opened:
- engine/vdsm upstream version
- # of hosts/vms
- engine location (same hw specs, different physical host)

Spoke with mperina directly in order to provide direct access to the environment.
(In reply to mlehrer from comment #11)
> The issue is currently reproducing.
> The following things have changed since the bug was opened.
> - engine/vdsm upstream version

version?

> - # of hosts/vms

data?

> - engine location (same hw specs, different physical host)
>
> Spoke with mperina directly, in order to provide direct access to the
> environment.

Logs?
Created attachment 1367987 [details] engine log and log from vdsm that holds vms of nested hosts
(In reply to Yaniv Kaul from comment #12)
> version?

ovirt-engine-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch

> data?

3 DCs, 3 clusters

406 hosts:
- Hera: 4 hosts
- Leopard: 2 hosts
- Nested hosts: 400

3 SDs

1200 VMs:
- Hera: 685 VMs
- Leopard: 515 VMs

> Logs?

Attached.
Here is a list of things from my analysis:

1. The thread dump looks normal:
   118 threads TIMED_WAITING
   110 threads WAITING
   34 threads RUNNABLE

2. The server and engine logs list only one error:
   No managed connections available within configured blocking timeout (30000 [ms])

3. I checked the pool statistics in JMX. The max connection count is 109 before the errors. But if everything is normal we should be able to handle that.

4. The problem could be in what is actually being executed on the database. On postgres there were 112 queries, 10 of which were VACUUM statements on the ovirt_engine_history database.

I stopped both ovirt-engine and ovirt-engine-dwhd and did a full vacuum of the engine and ovirt_engine_history databases. I have restarted the engine and I see everything is running normally. I have not started ovirt-engine-dwhd yet. Let's see if this fixes the issue.
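The recovery procedure described above can be sketched as follows. The database names 'engine' and 'ovirt_engine_history' are the oVirt defaults, and vacuumdb is assumed to run as the postgres user:

```shell
# Stop the services holding engine DB connections before a full vacuum,
# since VACUUM FULL takes exclusive locks on the tables it rewrites.
systemctl stop ovirt-engine ovirt-engine-dwhd

# VACUUM FULL + ANALYZE both databases (reclaims space, refreshes stats).
su - postgres -c "vacuumdb --full --analyze --dbname=engine"
su - postgres -c "vacuumdb --full --analyze --dbname=ovirt_engine_history"

# Restart only the engine; ovirt-engine-dwhd is left stopped for now,
# matching the test described in the comment above.
systemctl start ovirt-engine
```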
After a day the engine seems to be doing fine with the dwh service stopped. Just started dwh; will monitor the setup to see if things hold up.
We should document how a busy system influences the deployment, but I think we need to retest it again to first verify this is the root cause. If it happens again, we should monitor more specifically, to understand under what conditions we should recommend moving DWH to a separate instance.
(In reply to Roy Golan from comment #18)
> We should document how a busy system influences the deployment but I think
> we need to retest it again to first verify it is the root cause and if it
> happens again to try to monitor more specifically to understand under what
> conditions we should recommend to move DWH to separate instance.

Agree, + on 7.5. Also note that it was a low-performing host. While I'm not sure if the bottleneck in vacuuming is CPU only, I assume CPU directly affects it.

I'm more concerned about lower numbers (much lower numbers) with hosted-engine based deployments, where it's not as feasible to move the DWH to a different machine/VM.
In my test environment with 300 hosts and 600 running VMs the database was a bottleneck. With postgres max_connections of 150 in postgresql.conf and an engine ENGINE_DB_MAX_CONNECTIONS of 100, the engine would run into database connection issues when hosts were stopped or started to toggle their state.

I had to increase max_connections in postgresql.conf to 200 and set ENGINE_DB_MAX_CONNECTIONS on the engine side to 150 for the engine to work properly without any database connection issues.
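The tuning just described can be sketched as follows. The postgresql.conf path is typical for PostgreSQL 9.x on el7 and the drop-in file name is illustrative; both are assumptions to verify:

```shell
# PostgreSQL side: raise max_connections to 200 (assumes the default el7
# data directory; check 'SHOW config_file;' in psql if unsure).
sed -i 's/^max_connections.*/max_connections = 200/' \
    /var/lib/pgsql/data/postgresql.conf

# Engine side: engine.conf.d drop-ins override defaults; the file name
# '99-connections.conf' is only an example. Keep ENGINE_DB_MAX_CONNECTIONS
# below max_connections to leave headroom for DWH and ad-hoc psql sessions.
cat > /etc/ovirt-engine/engine.conf.d/99-connections.conf <<'EOF'
ENGINE_DB_MAX_CONNECTIONS=150
EOF

systemctl restart postgresql ovirt-engine
```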
(In reply to Ravi Nori from comment #21) > In my test environment with 300 hosts and 600 running VMs the database was a > bottle neck. With 150 postgres max_connections in postgresql.conf and engine > ENGINE_DB_MAX_CONNECTIONS of 100 the engine would run into database > connection issues when hosts were stopped or started to toggle their state. > > I had to increase the max_connections in postgresql.conf to 200 and set > ENGINE_DB_MAX_CONNECTIONS in engine to 150 for the engine to work properly > with out any database connection issues. What's the drawback to increasing the default?
I don't see any drawback in increasing the defaults. In my setup I did not have to increase the shared_buffers memory for postgresql; just increasing max_connections on the postgresql side and ENGINE_DB_MAX_CONNECTIONS on the engine side made the UI a lot snappier under heavy load, and there were no database connection exceptions.
(In reply to Ravi Nori from comment #23) > I don't see any drawback in increasing the defaults. In my setup I did not > have to increase the shared_buffers memory for postgresql just increasing > the max_connection on postgresql and ENGINE_DB_MAX_CONNECTIONS on engine > side made the UI a lot snappier under heavy load and there were no database > connection exceptions. I assume it has memory consumption implications. They may be benign - need to check. If they are, we can change the defaults. If they are not, I assume engine installation should look at total host memory, and if it's large enough (say >32GB), then set it accordingly to higher values.
(In reply to Yaniv Kaul from comment #24)
> (In reply to Ravi Nori from comment #23)
> > I don't see any drawback in increasing the defaults. In my setup I did not
> > have to increase the shared_buffers memory for postgresql just increasing
> > the max_connection on postgresql and ENGINE_DB_MAX_CONNECTIONS on engine
> > side made the UI a lot snappier under heavy load and there were no database
> > connection exceptions.
>
> I assume it has memory consumption implications. They may be benign - need
> to check.

The base calculation is:

  # of connections * work_mem

So it is not negligible, especially in hosted-engine cases.

We also need to think about how to tackle the 'db update storm' when the engine starts and we inquire about the state of the DC. In other words, we have a trade-off to make between code changes and configuration changes, and this we need to discuss further.

> If they are, we can change the defaults. If they are not, I assume engine
> installation should look at total host memory, and if it's large enough (say
> >32GB), then set it accordingly to higher values.
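The connections * work_mem calculation above can be made concrete. A rough worked example, assuming the proposed max_connections of 200 and PostgreSQL's default work_mem of 4MB (note this is only a floor: a single complex query can allocate work_mem several times, once per sort/hash node):

```shell
# Back-of-the-envelope worst case for the connection-count increase:
# roughly max_connections * work_mem of extra sort/hash memory.
max_connections=200
work_mem_mb=4          # PostgreSQL default work_mem is 4MB
estimate_mb=$((max_connections * work_mem_mb))
echo "worst-case extra sort/hash memory: ${estimate_mb}MB"
```

At these values the floor is 800MB, which is why the calculation matters on memory-constrained hosted-engine hosts.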
Currently running downstream 4.2.2-0.1.el7 on RHEL 7.5 (3.10.0-829.el7.x86_64). We have the DWH DB on a remote host, and the engine and engine DB on the same tier.

Reproducing:

"2018-02-21 19:54:40,662Z ERROR [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [] Exception: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource"

Total VMs: 1418
Total hosts: 284, of which 34 are physical.
All VMs & hosts are in up status.

SoS logs: https://drive.google.com/open?id=1vjJw2jcsHCF1g9c3Wg89px9-aIFK4Jwf

I'm still able to log in via UI to web admin. Unfortunately the JMX public interface is not set, so I can't get JMX access without a service restart, which may clear this; so I'll hold off on that.

pg_stat_activity shows that the total connections for datname 'engine' is 105, of which 1 is active and 104 idle.

We are running on default connection settings; it seems like it would make sense to increase the connection settings and monitor the pool via JMX. Thoughts?
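The pg_stat_activity check described above can be sketched as a single query, again assuming psql access as the postgres user and the default 'engine' database name:

```shell
# Count engine connections grouped by state (active / idle / idle in
# transaction), which is enough to spot the idle build-up described above.
su - postgres -c "psql -d engine -c \"SELECT state, count(*) FROM pg_stat_activity WHERE datname = 'engine' GROUP BY state ORDER BY count(*) DESC;\""
```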
(In reply to mlehrer from comment #26)

I'm going over the logs now. You definitely can enable and read the engine datasource pool statistics at runtime:

$ $JBOSS_HOME/bin/jboss-cli.sh --controller=remote+http://127.0.0.1:8706 --connect --user=admin@internal

# enable stats collection
/subsystem=datasources/data-source=ENGINEDataSource:write-attribute(name=statistics-enabled, value=true)
/subsystem=datasources/data-source=ENGINEDataSource/statistics=jdbc:write-attribute(name=statistics-enabled, value=true)
/subsystem=datasources/data-source=ENGINEDataSource/statistics=pool:write-attribute(name=statistics-enabled, value=true)

# read them
ls /subsystem=datasources/data-source=ENGINEDataSource/statistics=jdbc
ls /subsystem=datasources/data-source=ENGINEDataSource/statistics=pool
Please perform the further testing needed (as mentioned in comment 32 above) and open a docs bug.
Mordechai, have you already opened a docs bug on how to increase the number of connections, so we can close this bug?
(In reply to Martin Perina from comment #34)
> Mordechai, have you already opened a docs bug on how to increase the number
> of connections, so we can close this bug?

Yes. Updated https://bugzilla.redhat.com/show_bug.cgi?id=1570424