Bug 1347644 - engine thread pool issues during major DC outage
Summary: engine thread pool issues during major DC outage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.2.2
Assignee: Ravi Nori
QA Contact: mlehrer
URL:
Whiteboard:
Depends On: 1570424
Blocks:
 
Reported: 2016-06-17 10:17 UTC by Michal Skrivanek
Modified: 2019-05-16 13:08 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-15 17:38:32 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments:


Links
Red Hat Product Errata RHEA-2018:1488 (last updated 2018-05-15 17:40:08 UTC)

Description Michal Skrivanek 2016-06-17 10:17:36 UTC
It seems that in larger deployments, when a significant number of hosts lose connectivity, there is a huge impact on the rest of the system: healthy hosts are starved out by the reconnect attempts to the unreachable hosts.
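
A minimal, self-contained sketch of this starvation pattern (illustrative only, not engine code; pool size, host names, and timeouts are made up): blocking reconnect attempts to unreachable hosts occupy every worker of a shared fixed-size pool, so cheap polls of healthy hosts sit in the queue behind them.

# Illustrative only -- not ovirt-engine code. Shows how blocking reconnect
# attempts to dead hosts can starve monitoring of healthy hosts when both
# share one fixed-size thread pool.
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4
RECONNECT_TIMEOUT = 5.0              # each attempt to a dead host blocks this long

def reconnect(host):
    time.sleep(RECONNECT_TIMEOUT)    # simulates a connect() that times out
    return f"{host}: reconnect failed"

def poll_healthy(host):
    time.sleep(0.1)                  # a healthy host answers quickly
    return f"{host}: stats refreshed"

with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    start = time.monotonic()
    # Eight dead hosts occupy all four workers for two full timeout rounds...
    dead = [pool.submit(reconnect, f"dead-{i}") for i in range(8)]
    # ...so these cheap polls of healthy hosts wait in the queue behind them.
    healthy = [pool.submit(poll_healthy, f"ok-{i}") for i in range(4)]
    for f in healthy:
        print(f.result(), f"after {time.monotonic() - start:.1f}s")
    # Each healthy poll completes only after ~10 s instead of ~0.1 s.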

Comment 2 Martin Perina 2016-11-28 12:23:37 UTC
This will take a lot of resources just to simulate the environment and observe whether we have any issues and what they cause, so moving to 4.2 where we can investigate this properly.

Comment 7 Martin Perina 2018-02-14 20:39:40 UTC
Moving to MODIFIED as non-blocking thread patches were merged to 4.2, so please retest.

Comment 8 RHV bug bot 2018-02-16 16:24:14 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2-ga': '?'}', ]

For more info please contact: rhv-devops

Comment 9 Daniel Gur 2018-02-21 21:29:58 UTC
Now that we have a scaled RHV setup in our RDU RHV scale lab, could you please provide suggested validation steps?

Should we shut down part of the hosts and monitor?
Please suggest how many total hosts were up and down when you saw the issue,
and what should be monitored.

Comment 10 Martin Perina 2018-02-22 07:46:42 UTC
(In reply to Daniel Gur from comment #9)
> Now that we have scaled RHV setup in our RDU RHV scale lab
> Could you please provide suggested validation steps? 
> 
> Should we Shut down part of the hosts and monitor?
> Please suggest how many total hosts where up and down when you saw the issue,
> Please suggest what should be monitored.

We don't have exact reproduction steps, but here's my suggestion:

1. Deploy a reasonably high number of hosts and VMs
2. Observe that everything is stable
3. Make 1/3 of the hosts non-responsive and watch how the system behaves (are all hosts properly fenced, are there any issues during that, ...)
4. Put 1/3 of the hosts into Maintenance and watch how the system behaves
5. Shut down 1/4 of the hosts. When done, shut down another 1/4 of the hosts and at the same time start 1/4 of the hosts, and watch how the system behaves

I'm sure the virt/storage/network teams could add more stress tests around their features, but the above seems OK from the infra point of view.
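
A rough sketch of driving the Maintenance/activate cycle from step 4 with the oVirt Python SDK (ovirtsdk4). The engine URL, credentials, and timeouts below are placeholders, and the exact service calls should be verified against the installed SDK version.

# Hedged automation sketch for the test plan above (placeholder credentials).
import time
import ovirtsdk4 as sdk
from ovirtsdk4 import types

conn = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                      username='admin@internal', password='secret',
                      ca_file='ca.pem')
hosts_service = conn.system_service().hosts_service()

def wait_for(host_id, wanted, timeout=1800):
    # Poll the host status until it reaches the wanted state or we give up.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if hosts_service.host_service(host_id).get().status == wanted:
            return True
        time.sleep(10)
    return False

all_hosts = hosts_service.list()
third = all_hosts[:len(all_hosts) // 3]

# Move 1/3 of the hosts to Maintenance, then reactivate them, recording how
# long each transition takes while the rest of the system stays under load.
for h in third:
    hosts_service.host_service(h.id).deactivate()
for h in third:
    print(h.name, 'maintenance:', wait_for(h.id, types.HostStatus.MAINTENANCE))
for h in third:
    hosts_service.host_service(h.id).activate()
for h in third:
    print(h.name, 'up:', wait_for(h.id, types.HostStatus.UP))

conn.close()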

Comment 11 Sandro Bonazzola 2018-02-22 11:09:14 UTC
If this is test-only, please move it to QE.

Comment 12 Martin Perina 2018-02-22 11:30:01 UTC
Daniel, you need to set qa_ack, otherwise we cannot put this into the errata and move it to QA.

Comment 13 RHV bug bot 2018-02-23 16:00:43 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2-ga': '?'}', ]

For more info please contact: rhv-devops

Comment 14 RHV bug bot 2018-03-16 15:01:39 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No external trackers attached]

For more info please contact: rhv-devops

Comment 15 Martin Perina 2018-03-26 12:35:01 UTC
There's no specific fix, but we have done a lot of work in this area. Moving to QA; the suggested tests are mentioned in Comment 10.

Comment 16 mlehrer 2018-04-12 11:24:30 UTC
(In reply to Martin Perina from comment #10)
> (In reply to Daniel Gur from comment #9)
> > Now that we have scaled RHV setup in our RDU RHV scale lab
> > Could you please provide suggested validation steps? 
> > 
> > Should we Shut down part of the hosts and monitor?
> > Please suggest how many total hosts where up and down when you saw the issue,
> > Please suggest what should be monitored.
> 
> We don't have exact reproducing steps, but here's my suggestion:
> 

With default settings?

> 1. Deploy reasonable high amount of hosts and VMs
> 2. Observe that everything is stable
> 3. Try to make 1/3 of hosts non-responsive and watch how system behaves (are
> all hosts properly fenced, are there any issues during that, ...)
> 4. Try to put 1/3 of hosts to Maintenance  and watch how system behaves
> 5. Try to shutdown 1/4 of hosts. When done try to shutdown another 1/4 of
> hosts and at the same time start 1/4 of hosts and watch how system behaves
> 
> I'm sure that virt/storage/network teams could add more stress tests around
> their feature but above seems to OK from infra point of view


Is the expectation that these changes resolve the pool connection issues, or that these tests complete successfully once the pool is updated? Unclear, as there's no specific fix...

Comment 17 Martin Perina 2018-04-12 12:47:26 UTC
(In reply to mlehrer from comment #16)
> (In reply to Martin Perina from comment #10)
> > (In reply to Daniel Gur from comment #9)
> > > Now that we have scaled RHV setup in our RDU RHV scale lab
> > > Could you please provide suggested validation steps? 
> > > 
> > > Should we Shut down part of the hosts and monitor?
> > > Please suggest how many total hosts where up and down when you saw the issue,
> > > Please suggest what should be monitored.
> > 
> > We don't have exact reproducing steps, but here's my suggestion:
> > 
> 
> With default settings?

Yes, AFAIK when you ran scale tests on previous releases you always used the default settings. Or am I wrong?

> 
> > 1. Deploy reasonable high amount of hosts and VMs
> > 2. Observe that everything is stable
> > 3. Try to make 1/3 of hosts non-responsive and watch how system behaves (are
> > all hosts properly fenced, are there any issues during that, ...)
> > 4. Try to put 1/3 of hosts to Maintenance  and watch how system behaves
> > 5. Try to shutdown 1/4 of hosts. When done try to shutdown another 1/4 of
> > hosts and at the same time start 1/4 of hosts and watch how system behaves
> > 
> > I'm sure that virt/storage/network teams could add more stress tests around
> > their feature but above seems to OK from infra point of view
> 
> 
> Is the expectations that these changes resolve pool connection issues, or
> that these tests complete successfully once the pool is updated?  Unclear as
> there's no specific fix...

This is not strictly related to the connection pool; this is about using fewer threads to manage the same number of hosts. Of course, when you use fewer threads, you need fewer connections, so with the same default configuration you should be able to manage more hosts than in previous versions.
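
A toy model of this point (the numbers are illustrative, not engine defaults): if each in-flight monitoring task may hold one DB connection while it runs, peak connection demand tracks the number of concurrently running worker threads rather than the number of hosts.

# Toy model only: peak "DB connections" track concurrent worker threads,
# not host count. WORKERS and HOSTS are made-up numbers.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

WORKERS = 10       # monitoring threads shared by all hosts
HOSTS = 400

in_use = 0
peak = 0
lock = threading.Lock()

def refresh(host):
    global in_use, peak
    with lock:
        in_use += 1
        peak = max(peak, in_use)     # connections held at this instant
    time.sleep(0.01)                 # pretend to run this host's queries
    with lock:
        in_use -= 1

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    list(pool.map(refresh, range(HOSTS)))

# peak never exceeds WORKERS, so shrinking the thread pool shrinks how many
# connections the engine needs from the datasource pool at any one time.
print(f"{HOSTS} hosts refreshed, peak simultaneous connections: {peak}")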

Comment 18 mlehrer 2018-04-16 10:55:50 UTC
(In reply to Martin Perina from comment #17)
> This is not strictly related to connection pool, this is about using less
> threads to manage the same number of hosts. Of course when you use less
> number of threads, you need less number of connections, so with the same
> default configuration you should be able to manage more hosts than in
> previous versions

Failed test on 150 hosts going from non-responsive to activation.

Ilan saw that testing failed with "Unable to get managed connection for java:/ENGINEDataSource" [1] in the test case "1/3 of hosts non-responsive => (activate)". In our environment, 150 VMs with nested virt enabled were powered off to simulate 150 hosts being made non-responsive and then set to activation/maintenance.

#Env:
hosts: 400
VMs: 4000

/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf 
ENGINE_DB_MIN_CONNECTIONS=1
ENGINE_DB_MAX_CONNECTIONS=100

/var/opt/rh/rh-postgresql95/lib/pgsql/data/postgresql.conf 
max_connections = 150

#Engine log: 
https://drive.google.com/open?id=1FjwMjTYLwV3VYmppnVvog3fwXBl6Nrj5


[1] Exception:
2018-03-28 09:29:01,182Z ERROR [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engineScheduled-Thread-74) [] Exception: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource
	at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:80) [spring-jdbc.jar:4.3.9.RELEASE]
	at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:619) [spring-jdbc.jar:4.3.9.RELEASE]
	at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:684) [spring-jdbc.jar:4.3.9.RELEASE]
	at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:716) [spring-jdbc.jar:4.3.9.RELEASE]
	at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:766) [spring-jdbc.jar:4.3.9.RELEASE]
	at org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall.executeCallInternal(PostgresDbEngineDialect.java:152) [dal.jar:]
	at org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall.doExecute(PostgresDbEngineDialect.java:118) [dal.jar:]
	at org.springframework.jdbc.core.simple.SimpleJdbcCall.execute(SimpleJdbcCall.java:198) [spring-jdbc.jar:4.3.9.RELEASE]
	at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeImpl(SimpleJdbcCallsHandler.java:135) [dal.jar:]
	at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeReadList(SimpleJdbcCallsHandler.java:105) [dal.jar:]
	at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeRead(SimpleJdbcCallsHandler.java:97) [dal.jar:]
	at org.ovirt.engine.core.dao.DefaultReadDao.get(DefaultReadDao.java:73) [dal.jar:]
	at org.ovirt.engine.core.dao.network.InterfaceDaoImpl$1.mapRow(InterfaceDaoImpl.java:330) [dal.jar:]
	at org.ovirt.engine.core.dao.network.InterfaceDaoImpl$1.mapRow(InterfaceDaoImpl.java:303) [dal.jar:]
	at org.springframework.jdbc.core.RowMapperResultSetExtractor.extractData(RowMapperResultSetExtractor.java:93) [spring-jdbc.jar:4.3.9.RELEASE]
	at org.springframework.jdbc.core.RowMapperResultSetExtractor.extractData(RowMapperResultSetExtractor.java:60) [spring-jdbc.jar:4.3.9.RELEASE]
	at org.springframework.jdbc.core.JdbcTemplate$1.doInPreparedStatement(JdbcTemplate.java:697) [spring-jdbc.jar:4.3.9.RELEASE]
	at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:633) [spring-jdbc.jar:4.3.9.RELEASE]
	at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:684) [spring-jdbc.jar:4.3.9.RELEASE]
	at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:716) [spring-jdbc.jar:4.3.9.RELEASE]
	at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:766) [spring-jdbc.jar:4.3.9.RELEASE]
	at org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall.executeCallInternal(PostgresDbEngineDialect.java:152) [dal.jar:]
	at org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall.doExecute(PostgresDbEngineDialect.java:118) [dal.jar:]
	at org.springframework.jdbc.core.simple.SimpleJdbcCall.execute(SimpleJdbcCall.java:198) [spring-jdbc.jar:4.3.9.RELEASE]
	at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeImpl(SimpleJdbcCallsHandler.java:135) [dal.jar:]
	at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeReadList(SimpleJdbcCallsHandler.java:105) [dal.jar:]
	at org.ovirt.engine.core.dao.network.InterfaceDaoImpl.getAllInterfacesForVds(InterfaceDaoImpl.java:195) [dal.jar:]
	at org.ovirt.engine.core.dao.network.InterfaceDaoImpl.getAllInterfacesForVds(InterfaceDaoImpl.java:158) [dal.jar:]
	at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.fetchHostInterfaces(HostMonitoring.java:534) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsStats(HostMonitoring.java:467) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:121) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refresh(HostMonitoring.java:86) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VdsManager.refreshImpl(VdsManager.java:278) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VdsManager.refresh(VdsManager.java:246) [vdsbroker.jar:]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_161]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_161]
	at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
	at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_161]
	at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
	at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)
Caused by: java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource
	at org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:146)
	at org.jboss.as.connector.subsystems.datasources.WildFlyDataSource.getConnection(WildFlyDataSource.java:64)
	at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111) [spring-jdbc.jar:4.3.9.RELEASE]
	at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:77) [spring-jdbc.jar:4.3.9.RELEASE]
	... 42 more
Caused by: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/ENGINEDataSource
	at org.jboss.jca.core.connectionmanager.AbstractConnectionManager.getManagedConnection(AbstractConnectionManager.java:690)
	at org.jboss.jca.core.connectionmanager.tx.TxConnectionManagerImpl.getManagedConnection(TxConnectionManagerImpl.java:430)
	at org.jboss.jca.core.connectionmanager.AbstractConnectionManager.allocateConnection(AbstractConnectionManager.java:789)
	at org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:138)
	... 45 more
Caused by: javax.resource.ResourceException: IJ000655: No managed connections available within configured blocking timeout (30000 [ms])
	at org.jboss.jca.core.connectionmanager.pool.mcp.SemaphoreConcurrentLinkedDequeManagedConnectionPool.getConnection(SemaphoreConcurrentLinkedDequeManagedConnectionPool.java:570)
	at org.jboss.jca.core.connectionmanager.pool.AbstractPool.getSimpleConnection(AbstractPool.java:632)
	at org.jboss.jca.core.connectionmanager.pool.AbstractPool.getConnection(AbstractPool.java:604)
	at org.jboss.jca.core.connectionmanager.AbstractConnectionManager.getManagedConnection(AbstractConnectionManager.java:624)
	... 48 more
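
The IJ000655 error at the bottom of the trace means the WildFly datasource pool (capped at ENGINE_DB_MAX_CONNECTIONS=100 in the environment above) could not hand out a connection within the 30 s blocking timeout. A hedged diagnostic sketch for watching actual connection usage on the PostgreSQL side while the test runs, assuming psycopg2 and local access to the engine database (credentials are placeholders):

# Diagnostic sketch (not part of any fix): sample how many connections the
# 'engine' database is using, to compare against ENGINE_DB_MAX_CONNECTIONS
# (100) and PostgreSQL max_connections (150) from the environment above.
import time
import psycopg2

conn = psycopg2.connect(dbname='engine', user='engine',
                        password='secret', host='localhost')
conn.autocommit = True

with conn.cursor() as cur:
    for _ in range(60):              # one sample per second for a minute
        cur.execute("""
            SELECT count(*) FILTER (WHERE state = 'active'), count(*)
            FROM pg_stat_activity
            WHERE datname = 'engine'
        """)
        active, total = cur.fetchone()
        print(f"active={active} total={total}")
        time.sleep(1)

conn.close()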

Comment 19 Martin Perina 2018-04-19 12:35:04 UTC
(In reply to mlehrer from comment #18)
> (In reply to Martin Perina from comment #17)
> > This is not strictly related to connection pool, this is about using less
> > threads to manage the same number of hosts. Of course when you use less
> > number of threads, you need less number of connections, so with the same
> > default configuration you should be able to manage more hosts than in
> > previous versions
> 
> Failed test on 150 hosts from non-responsive to activation.
> 
> Ilan saw that testing failed with "Unable to get managed connection for
> java:/ENGINEDataSource"[1] on test case of "1/3 of hosts non-responsive =>
> (activate)."  In our enviroment 150 vm with nested virt enabled were powered
> off to simulate 150 hosts being made non-responsive and then set to
> activation/maintenance.

Yes, this can always happen if the database is not able to perform fast enough, for example when it's installed on slow storage, or the CPU of the database host is too loaded, or the network doesn't have enough throughput in the case of a remote DB.

There is no absolute statement; the only thing we can say is that if the database is able to perform fast enough, the engine is able to serve 400 hosts with only 150 DB connections. So if the above exception appears, the customer needs to investigate database performance and, if that doesn't help, increase the number of DB connections.

Comment 20 mlehrer 2018-04-22 19:52:13 UTC
(In reply to Martin Perina from comment #19)
> (In reply to mlehrer from comment #18)
> > (In reply to Martin Perina from comment #17)
> > > This is not strictly related to connection pool, this is about using less
> > > threads to manage the same number of hosts. Of course when you use less
> > > number of threads, you need less number of connections, so with the same
> > > default configuration you should be able to manage more hosts than in
> > > previous versions
> > 
> > Failed test on 150 hosts from non-responsive to activation.
> > 
> > Ilan saw that testing failed with "Unable to get managed connection for
> > java:/ENGINEDataSource"[1] on test case of "1/3 of hosts non-responsive =>
> > (activate)."  In our enviroment 150 vm with nested virt enabled were powered
> > off to simulate 150 hosts being made non-responsive and then set to
> > activation/maintenance.
> 
> Yes, it can always happen if database is not able to perform fast enough for
> example when it's installed on slow storage of CPU of database host is too
> loaded or a network doesn't have enough throughput in case of remote db.
> 
> There is not absolute statement, the only thing we can say if database is
> able to perform fast enough, engine is able to serve 400 hosts with only 150
> db connections. So if above exception appear, customer either need to
> investigate database performance and if it doesn't help, then increase
> number of db connections

These issues happen when the connection pool is not sized correctly.
We have seen all of these test cases pass on a 400 hosts / 4k VMs setup with a correctly sized pool, as described in https://bugzilla.redhat.com/show_bug.cgi?id=1570424

Bottom line: with the pool/connections sized correctly, it passes.
With the default settings, it fails.

Comment 24 errata-xmlrpc 2018-05-15 17:38:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 25 Franta Kust 2019-05-16 13:08:30 UTC
BZ<2>Jira Resync

