It seems that in larger deployments, when a significant number of hosts lose connectivity, there is a huge impact on the rest of the system: healthy hosts are starved out by the reconnect attempts to the unreachable hosts.
It will take a lot of resources just to simulate the environment, observe whether we hit any issues, and find out what causes them, so moving to 4.2, where we can investigate properly.
Moving to MODIFIED as the non-blocking thread patches were merged to 4.2; please retest.
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.2-ga': '?'}', ] For more info please contact: rhv-devops
Now that we have a scaled RHV setup in our RDU RHV scale lab, could you please provide suggested validation steps?

Should we shut down part of the hosts and monitor? Please suggest how many hosts in total were up and down when you saw the issue, and what should be monitored.
(In reply to Daniel Gur from comment #9)
> Now that we have scaled RHV setup in our RDU RHV scale lab
> Could you please provide suggested validation steps?
>
> Should we Shut down part of the hosts and monitor?
> Please suggest how many total hosts where up and down when you saw the issue,
> Please suggest what should be monitored.

We don't have exact reproduction steps, but here's my suggestion:

1. Deploy a reasonably high number of hosts and VMs.
2. Observe that everything is stable.
3. Make 1/3 of the hosts non-responsive and watch how the system behaves (are all hosts properly fenced, are there any issues during that, ...).
4. Put 1/3 of the hosts into Maintenance and watch how the system behaves.
5. Shut down 1/4 of the hosts. When done, shut down another 1/4 and at the same time start the first 1/4, and watch how the system behaves.

I'm sure that the virt/storage/network teams could add more stress tests around their features, but the above seems OK from the infra point of view.
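The staged fractions in the steps above can be sketched with a small helper. This is a hypothetical planning aid, not part of any RHV tooling: it just splits the host inventory into well-defined batches (thirds for steps 3-4, quarters for step 5) so each action targets a known set.

```python
# Hypothetical helper for planning the staged stress test above:
# split a host list into equal consecutive batches, distributing the
# remainder so batch sizes differ by at most one.

def batches(hosts, denominator):
    """Split `hosts` into `denominator` consecutive, near-equal batches."""
    size, rem = divmod(len(hosts), denominator)
    out, start = [], 0
    for i in range(denominator):
        end = start + size + (1 if i < rem else 0)
        out.append(hosts[start:end])
        start = end
    return out

# Example: the 400-host lab mentioned later in this bug.
hosts = [f"host-{n:03d}" for n in range(400)]

thirds = batches(hosts, 3)    # steps 3/4 act on thirds[0]
quarters = batches(hosts, 4)  # step 5: stop quarters[0], then stop
                              # quarters[1] while restarting quarters[0]

print([len(b) for b in thirds])    # -> [134, 133, 133]
print([len(b) for b in quarters])  # -> [100, 100, 100, 100]
```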
If this is test only, please move to QE
Daniel, you need to set qa_ack, otherwise we cannot put it into the errata and move it to QA.
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [No external trackers attached] For more info please contact: rhv-devops
There's no specific fix, but we have done a lot of work in this area. Moving to QA; suggested tests are mentioned in comment 10.
(In reply to Martin Perina from comment #10)
> (In reply to Daniel Gur from comment #9)
> > Now that we have scaled RHV setup in our RDU RHV scale lab
> > Could you please provide suggested validation steps?
> >
> > Should we Shut down part of the hosts and monitor?
> > Please suggest how many total hosts where up and down when you saw the issue,
> > Please suggest what should be monitored.
>
> We don't have exact reproducing steps, but here's my suggestion:

With default settings?

> 1. Deploy reasonable high amount of hosts and VMs
> 2. Observe that everything is stable
> 3. Try to make 1/3 of hosts non-responsive and watch how system behaves (are
> all hosts properly fenced, are there any issues during that, ...)
> 4. Try to put 1/3 of hosts to Maintenance and watch how system behaves
> 5. Try to shutdown 1/4 of hosts. When done try to shutdown another 1/4 of
> hosts and at the same time start 1/4 of hosts and watch how system behaves
>
> I'm sure that virt/storage/network teams could add more stress tests around
> their feature but above seems to OK from infra point of view

Is the expectation that these changes resolve the pool connection issues, or that these tests complete successfully once the pool is updated? It's unclear, since there's no specific fix...
(In reply to mlehrer from comment #16)
> (In reply to Martin Perina from comment #10)
> > We don't have exact reproducing steps, but here's my suggestion:
>
> With default settings?

Yes, AFAIK when you ran scale tests on previous releases you always used the default settings. Or am I wrong?

> Is the expectations that these changes resolve pool connection issues, or
> that these tests complete successfully once the pool is updated? Unclear as
> there's no specific fix...

This is not strictly related to the connection pool; it is about using fewer threads to manage the same number of hosts. Of course, when you use fewer threads you need fewer connections, so with the same default configuration you should be able to manage more hosts than in previous versions.
(In reply to Martin Perina from comment #17)
> This is not strictly related to connection pool, this is about using less
> threads to manage the same number of hosts. Of course when you use less
> number of threads, you need less number of connections, so with the same
> default configuration you should be able to manage more hosts than in
> previous versions

Failed test on 150 hosts going from non-responsive to activation.

Ilan saw that testing failed with "Unable to get managed connection for java:/ENGINEDataSource" [1] on the test case "1/3 of hosts non-responsive => (activate)". In our environment, 150 VMs with nested virt enabled were powered off to simulate 150 hosts being made non-responsive and then set to activation/maintenance.

# Env:
hosts: 400
VMs: 4000

/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf
ENGINE_DB_MIN_CONNECTIONS=1
ENGINE_DB_MAX_CONNECTIONS=100

/var/opt/rh/rh-postgresql95/lib/pgsql/data/postgresql.conf
max_connections = 150

# Engine log: https://drive.google.com/open?id=1FjwMjTYLwV3VYmppnVvog3fwXBl6Nrj5

[1] Exception:
2018-03-28 09:29:01,182Z ERROR [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engineScheduled-Thread-74) [] Exception:
org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource
    at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:80) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:619) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:684) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:716) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:766) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall.executeCallInternal(PostgresDbEngineDialect.java:152) [dal.jar:]
    at org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall.doExecute(PostgresDbEngineDialect.java:118) [dal.jar:]
    at org.springframework.jdbc.core.simple.SimpleJdbcCall.execute(SimpleJdbcCall.java:198) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeImpl(SimpleJdbcCallsHandler.java:135) [dal.jar:]
    at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeReadList(SimpleJdbcCallsHandler.java:105) [dal.jar:]
    at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeRead(SimpleJdbcCallsHandler.java:97) [dal.jar:]
    at org.ovirt.engine.core.dao.DefaultReadDao.get(DefaultReadDao.java:73) [dal.jar:]
    at org.ovirt.engine.core.dao.network.InterfaceDaoImpl$1.mapRow(InterfaceDaoImpl.java:330) [dal.jar:]
    at org.ovirt.engine.core.dao.network.InterfaceDaoImpl$1.mapRow(InterfaceDaoImpl.java:303) [dal.jar:]
    at org.springframework.jdbc.core.RowMapperResultSetExtractor.extractData(RowMapperResultSetExtractor.java:93) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.springframework.jdbc.core.RowMapperResultSetExtractor.extractData(RowMapperResultSetExtractor.java:60) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.springframework.jdbc.core.JdbcTemplate$1.doInPreparedStatement(JdbcTemplate.java:697) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:633) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:684) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:716) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:766) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall.executeCallInternal(PostgresDbEngineDialect.java:152) [dal.jar:]
    at org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall.doExecute(PostgresDbEngineDialect.java:118) [dal.jar:]
    at org.springframework.jdbc.core.simple.SimpleJdbcCall.execute(SimpleJdbcCall.java:198) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeImpl(SimpleJdbcCallsHandler.java:135) [dal.jar:]
    at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeReadList(SimpleJdbcCallsHandler.java:105) [dal.jar:]
    at org.ovirt.engine.core.dao.network.InterfaceDaoImpl.getAllInterfacesForVds(InterfaceDaoImpl.java:195) [dal.jar:]
    at org.ovirt.engine.core.dao.network.InterfaceDaoImpl.getAllInterfacesForVds(InterfaceDaoImpl.java:158) [dal.jar:]
    at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.fetchHostInterfaces(HostMonitoring.java:534) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsStats(HostMonitoring.java:467) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:121) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refresh(HostMonitoring.java:86) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsManager.refreshImpl(VdsManager.java:278) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsManager.refresh(VdsManager.java:246) [vdsbroker.jar:]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_161]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_161]
    at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
    at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_161]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_161]
    at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_161]
    at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
    at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)
Caused by: java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource
    at org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:146)
    at org.jboss.as.connector.subsystems.datasources.WildFlyDataSource.getConnection(WildFlyDataSource.java:64)
    at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111) [spring-jdbc.jar:4.3.9.RELEASE]
    at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:77) [spring-jdbc.jar:4.3.9.RELEASE]
    ... 42 more
Caused by: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource
    at org.jboss.jca.core.connectionmanager.AbstractConnectionManager.getManagedConnection(AbstractConnectionManager.java:690)
    at org.jboss.jca.core.connectionmanager.tx.TxConnectionManagerImpl.getManagedConnection(TxConnectionManagerImpl.java:430)
    at org.jboss.jca.core.connectionmanager.AbstractConnectionManager.allocateConnection(AbstractConnectionManager.java:789)
    at org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:138)
    ... 45 more
Caused by: javax.resource.ResourceException: IJ000655: No managed connections available within configured blocking timeout (30000 [ms])
    at org.jboss.jca.core.connectionmanager.pool.mcp.SemaphoreConcurrentLinkedDequeManagedConnectionPool.getConnection(SemaphoreConcurrentLinkedDequeManagedConnectionPool.java:570)
    at org.jboss.jca.core.connectionmanager.pool.AbstractPool.getSimpleConnection(AbstractPool.java:632)
    at org.jboss.jca.core.connectionmanager.pool.AbstractPool.getConnection(AbstractPool.java:604)
    at org.jboss.jca.core.connectionmanager.AbstractConnectionManager.getManagedConnection(AbstractConnectionManager.java:624)
    ... 48 more
(In reply to mlehrer from comment #18)
> Failed test on 150 hosts from non-responsive to activation.
>
> Ilan saw that testing failed with "Unable to get managed connection for
> java:/ENGINEDataSource"[1] on test case of "1/3 of hosts non-responsive =>
> (activate)." In our enviroment 150 vm with nested virt enabled were powered
> off to simulate 150 hosts being made non-responsive and then set to
> activation/maintenance.

Yes, it can always happen if the database is not able to perform fast enough, for example when it is installed on slow storage, the CPU of the database host is too loaded, or the network doesn't have enough throughput in the case of a remote DB.

There is no absolute statement here; the only thing we can say is that if the database performs fast enough, the engine is able to serve 400 hosts with only 150 DB connections. So if the above exception appears, the customer needs to investigate database performance first, and if that doesn't help, increase the number of DB connections.
(In reply to Martin Perina from comment #19)
> Yes, it can always happen if database is not able to perform fast enough for
> example when it's installed on slow storage of CPU of database host is too
> loaded or a network doesn't have enough throughput in case of remote db.
>
> There is not absolute statement, the only thing we can say if database is
> able to perform fast enough, engine is able to serve 400 hosts with only 150
> db connections. So if above exception appear, customer either need to
> investigate database performance and if it doesn't help, then increase
> number of db connections

These issues happen when the connection pool is not sized correctly. We have seen all of these test cases pass on the 400-host/4000-VM setup with a correctly sized pool, as described in https://bugzilla.redhat.com/show_bug.cgi?id=1570424

Bottom line: with the pool/connections sized correctly it passes; with the default settings it fails.
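For reference, resizing the pool as discussed would touch the two limits shown in the env dump earlier in this bug. The fragment below is a sketch only: the override file name is a convention, and the concrete values are illustrative, not the validated sizing from bug 1570424.

```
# /etc/ovirt-engine/engine.conf.d/99-db-connections.conf
# (engine.conf.d overrides the packaged ovirt-engine.conf defaults;
#  the file name here is an assumed convention)
ENGINE_DB_MAX_CONNECTIONS=300

# /var/opt/rh/rh-postgresql95/lib/pgsql/data/postgresql.conf
# (must stay above the engine pool cap, with headroom for DWH and
#  admin sessions; value illustrative)
max_connections = 350

# Restart both services afterwards, e.g.:
#   systemctl restart rh-postgresql95-postgresql ovirt-engine
```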
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1488