Bug 880969

Summary: ovirt-engine-backend [Scalability]: Problematic query 'getallfromvms' causes the user portal to become stuck after user login.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Omri Hochman <ohochman>
Component: ovirt-engine
Assignee: Yair Zaslavsky <yzaslavs>
Status: CLOSED ERRATA
QA Contact: vvyazmin <vvyazmin>
Severity: medium
Docs Contact:
Priority: urgent
Version: 3.1.0
CC: acathrow, amureini, bazulay, chetan, dyasny, emesika, hateya, iheim, italkohe, lpeer, mkublin, oourfali, Rhev-m-bugs, rvaknin, sgrinber, tdosek, tvvcox, yeylon, ykaul, yzaslavs
Target Milestone: ---
Keywords: Reopened, VerifiedOnDev, ZStream
Target Release: 3.2.0
Hardware: x86_64
OS: All
Whiteboard: infra
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, certain backend calls would sometimes cause environments with a high number of virtual machines and users to lock up the user portal, regardless of how many virtual machines were allocated to the individual user. Now, the calls have been organised more efficiently to prevent scaling issues.
Story Points: ---
Clone Of:
: 906368 (view as bug list)
Environment:
Last Closed: 2013-06-10 17:23:35 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Bug Depends On:
Bug Blocks: 906368, 915537
Attachments:
thread_dump (flags: none)
pg_log (flags: none)

Description Omri Hochman 2012-11-28 05:02:53 EST
ovirt-engine-backend [Scalability]: Problematic query 'getallfromvms' causes the user portal to become stuck after user login.

Description:
************
When a user attempts to log in to the User Portal in a scale environment (with many VMs), the User Portal becomes stuck and the browser stops functioning. This happens because of the 'getallfromvms' query that runs when the 'Virtual Machines' tab is presented.

Note:
*******
The user portal will get stuck even if the user has no VMs attached to him, or has only a single VM attached.

RHEVM Environment:
*******************
- RHEVM (Build IC24.4) installed on a physical machine.
- PostgreSQL remote DB on another physical machine.

Objects in RHEVM: 
*****************
- Total: 31 hosts.
- Total: 50 iSCSI storage domains + 1 ISO + 1 export.
- Total: ~1800 VMs, none running.
- Total: 2300 users/groups.


Thread dump:
**************
at java.net.SocketInputStream.socketRead0(Native Method)
         at java.net.SocketInputStream.read(SocketInputStream.java:150)
         at java.net.SocketInputStream.read(SocketInputStream.java:121)
         at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:135)
         at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:104)
         at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:73)
         at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:259)
         at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1620)
         at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
         - locked <0x00000000c4f15c88> (a org.postgresql.core.v3.QueryExecutorImpl)
         at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:479)
         at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:367)
         at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:271)
         at org.jboss.jca.adapters.jdbc.CachedPreparedStatement.executeQuery(CachedPreparedStatement.java:107)
         at org.jboss.jca.adapters.jdbc.WrappedPreparedStatement.executeQuery(WrappedPreparedStatement.java:462)
         at org.springframework.jdbc.core.JdbcTemplate$1.doInPreparedStatement(JdbcTemplate.java:644)
         at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:587)
         at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:637)
         at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:666)
         at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:706)
         at org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall.executeCallInternal(PostgresDbEngineDialect.java:155)
         at org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall.doExecute(PostgresDbEngineDialect.java:121)
         at org.springframework.jdbc.core.simple.SimpleJdbcCall.execute(SimpleJdbcCall.java:181)
         at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeImpl(SimpleJdbcCallsHandler.java:124)
         at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeReadAndReturnMap(SimpleJdbcCallsHandler.java:75)
         at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeReadList(SimpleJdbcCallsHandler.java:66)
         at org.ovirt.engine.core.dao.VmDAODbFacadeImpl.getAll(VmDAODbFacadeImpl.java:181)
         at org.ovirt.engine.core.bll.GetAllVmsQuery.executeQueryCommand(GetAllVmsQuery.java:15)
         at org.ovirt.engine.core.bll.QueriesCommandBase.ExecuteCommand(QueriesCommandBase.java:71)
         at org.ovirt.engine.core.dal.VdcCommandBase.Execute(VdcCommandBase.java:41)
         at org.ovirt.engine.core.bll.Backend.runQueryImpl(Backend.java:400)
         at org.ovirt.engine.core.bll.Backend.RunQuery(Backend.java:379)
         at sun.reflect.GeneratedMethodAccessor99.invoke(Unknown Source)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
         at java.lang.reflect.Method.invoke(Method.java:601)
         at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodIn

/var/pg_data/pg_log (those queries return after almost 10 minutes) :
*********************************************************************
LOG:  duration: 522376.064 ms  execute <unnamed>: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = '9b9002d1-ec33-4083-8a7b-31f6b8931648', $2 = 't'
LOG:  duration: 532152.857 ms  execute <unnamed>: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = '9b9002d1-ec33-4083-8a7b-31f6b8931648', $2 = 't'
LOG:  duration: 292911.526 ms  execute <unnamed>: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = '2159ee18-04d8-4549-b6c7-0c1104bc6535', $2 = 't'
LOG:  duration: 291835.261 ms  execute <unnamed>: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = '2159ee18-04d8-4549-b6c7-0c1104bc6535', $2 = 't'
LOG:  duration: 281379.708 ms  execute <unnamed>: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = '2159ee18-04d8-4549-b6c7-0c1104bc6535', $2 = 't'
LOG:  duration: 286763.031 ms  execute <unnamed>: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = '2159ee18-04d8-4549-b6c7-0c1104bc6535', $2 = 't'
LOG:  duration: 292041.869 ms  execute <unnamed>: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = '2159ee18-04d8-4549-b6c7-0c1104bc6535', $2 = 't'
LOG:  duration: 535423.412 ms  execute <unnamed>: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = '9b9002d1-ec33-4083-8a7b-31f6b8931648', $2 = 't'
LOG:  duration: 540449.231 ms  execute <unnamed>: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = '9b9002d1-ec33-4083-8a7b-31f6b8931648', $2 = 't'
Comment 1 Omri Hochman 2012-11-28 05:08:29 EST
Created attachment 653408 [details]
thread_dump
Comment 2 Omri Hochman 2012-11-28 05:09:59 EST
Created attachment 653410 [details]
pg_log
Comment 3 Allon Mureinik 2012-11-28 10:19:22 EST
Having the execution plans for these queries would be a great help.

Omri, if it's possible, I'd like the following done:

1. Edit postgresql.conf and add the following lines:
shared_preload_libraries = 'auto_explain'
custom_variable_classes = 'auto_explain'
auto_explain.log_min_duration = '60s'

2. Restart the postgres daemon

3. Run the same scenario and attach /var/pg_data/pg_log
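The steps above can be sketched as a shell snippet. This is only a rough sketch: the config path follows the /var/pg_data layout mentioned in this report, and the postgresql.conf file name and restart command are assumptions about this particular deployment.

```shell
# Step 1: append the auto_explain settings to the PostgreSQL config.
# Path and file name are assumed; adjust to the actual installation.
PGCONF=/var/pg_data/postgresql.conf
cat >> "$PGCONF" <<'EOF'
shared_preload_libraries = 'auto_explain'
custom_variable_classes = 'auto_explain'   # needed on PostgreSQL < 9.2 only
auto_explain.log_min_duration = '60s'
EOF

# Step 2: restart the daemon so shared_preload_libraries takes effect
# (auto_explain cannot be loaded into an already-running postmaster).
service postgresql restart

# Step 3: re-run the scenario, then collect /var/pg_data/pg_log.
```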
Comment 4 mkublin 2012-11-29 08:54:18 EST
LOG:  duration: 90313.548 ms  execute <unnamed>: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = 'fdfc627c-d875-11e0-90f0-83df133b58cc', $2 = 't'
LOG:  duration: 90313.544 ms  plan:
        Function Scan on getallfromvms  (cost=0.00..260.00 rows=1000 width=12580)
LOG:  duration: 90385.997 ms  execute <unnamed>: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = 'fdfc627c-d875-11e0-90f0-83df133b58cc', $2 = 't'
LOG:  duration: 90385.993 ms  plan:
        Function Scan on getallfromvms  (cost=0.00..260.00 rows=1000 width=12580)
LOG:  duration: 90890.892 ms  execute <unnamed>: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = 'fdfc627c-d875-11e0-90f0-83df133b58cc', $2 = 't'
LOG:  duration: 90890.889 ms  plan:
        Function Scan on getallfromvms  (cost=0.00..260.00 rows=1000 width=12580)
LOG:  duration: 91612.826 ms  execute <unnamed>: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = 'fdfc627c-d875-11e0-90f0-83df133b58cc', $2 = 't'
LOG:  duration: 91612.821 ms  plan:
        Function Scan on getallfromvms  (cost=0.00..260.00 rows=1000 width=12580)
LOG:  duration: 89354.231 ms  execute <unnamed>: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = 'fdfc627c-d875-11e0-90f0-83df133b58cc', $2 = 't'
LOG:  duration: 89354.227 ms  plan:
        Function Scan on getallfromvms  (cost=0.00..260.00 rows=1000 width=12580)
Comment 5 Yair Zaslavsky 2012-11-29 10:11:56 EST
Looks like a serious performance issue with fnSplitter (we split group ids, then perform a full table scan on the groups table).
Comment 6 Eli Mesika 2012-11-29 10:26:50 EST
(In reply to comment #5)
> looks like a serious performance issue with fn splitter (we split group ids,
> then perform full table scan on groups table).

We will have to get rid of the comma-separated values of group ids and store them in a table, plus handle all the upgrade aspects. The fnSplitter performance is bad!
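The direction described above can be sketched in SQL; every table, column, and index name below is illustrative only and not the actual ovirt-engine schema:

```sql
-- Anti-pattern (as described above): group ids kept as one
-- comma-separated string per user, so each lookup must split the
-- string with fnSplitter and then scan the groups table.

-- Sketch of the normalized alternative: one row per membership,
-- so the lookup becomes an indexed join instead of split + scan.
CREATE TABLE user_group_map (
    user_id  UUID NOT NULL,
    group_id UUID NOT NULL,
    PRIMARY KEY (user_id, group_id)
);
CREATE INDEX idx_user_group_map_group_id ON user_group_map (group_id);

-- The membership lookup for one user then reads:
-- SELECT g.*
-- FROM   groups g
-- JOIN   user_group_map m ON m.group_id = g.id
-- WHERE  m.user_id = :user_id;
```

As the comment notes, any such change would also have to migrate the existing comma-separated values during upgrade.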
Comment 8 Itamar Heim 2012-11-29 18:11:35 EST
(In reply to comment #6)
> (In reply to comment #5)
> > looks like a serious performance issue with fn splitter (we split group ids,
> > then perform full table scan on groups table).
> 
> We will have to get rid of the comma separated values of group ids and host
> them in a table + handling all upgrade aspects , the fnSplitter performance
> is bad !!!

why wasn't this an issue until now?
Comment 14 Omri Hochman 2012-12-05 09:19:23 EST
Apparently the problem was the number of objects that the user portal had to present. In my case, the user portal had to present 2000 VMs, because the users I logged in with had 'UserRole' as a 'System Permission' -->
that means the users had permission to see all 2000 VMs (this permission is not shown under the VM's 'Permissions' sub-tab), and since there is no paging in the user portal's Virtual Machines tab, the user portal probably fails to handle a large number of VMs and the browser gets stuck.

I'm not sure if this scenario is relevant.
Comment 16 Barak 2012-12-06 08:05:23 EST
To recap the issues discussed here, there were 2 issues:

- 2000 VMs with 2000 users (1 vm per user)
  * this seems to be an inefficiency in our user/permissions queries, will 
    require a DB schema change.

- 2000 VMs all belong to 1 user:
  * looks like User-Portal UI paging issue


Are the 2 issues above valid ?
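For the second issue, one possible shape of a server-side mitigation is paging the query itself. This is only a sketch (the actual fix may be purely in the UI, and the table/column names are illustrative):

```sql
-- Fetch one page of VMs at a time instead of all ~2000 rows at once.
SELECT *
FROM   vms
ORDER  BY vm_name      -- a stable ordering is required for paging
LIMIT  50              -- page size
OFFSET 0;              -- first page; OFFSET 50 for the second, etc.
```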
Comment 18 Omri Hochman 2012-12-11 10:05:51 EST
(In reply to comment #16)
> To recap the issues discussed here, there were 2 issues:
> 
> - 2000 VMs with 2000 users (1 vm per user)
>   * this seems to be an inefficiency in our user/permissions queries, will 
>     require a DB schema change.
We haven't tested this yet. We will create a script that simulates the scenario and post the results.
> 
> - 2000 VMs all belong to 1 user:
>   * looks like User-Portal UI paging issue
This scenario is valid in case we would like to allow superuser/admin@internal, or any other user with 'system-permission', to log in to the user portal.
> 
> 
> Are the 2 issues above valid ?
Comment 19 Barak 2012-12-16 07:12:41 EST
patch by mkublin

http://gerrit.ovirt.org/#/c/9825/1
Comment 22 Yair Zaslavsky 2012-12-16 11:56:09 EST
A patch to improve getUserAndGroupsById (at create_functions.sh) should be added as well.

http://gerrit.ovirt.org/#/c/9825/1 - should also handle fix for

http://gerrit.ovirt.org/#/c/9545/
Comment 25 Omri Hochman 2012-12-31 07:37:27 EST
Tested the following scenario (RHEVM 3.1, build SI25.2): an environment with ~400 users, each user associated with 3-4 VMs.

One user is logged in to the user portal: every time the user portal refreshes (which can happen automatically or manually), the 'getallfromvms' query runs and takes 3 seconds to answer.

pg_log:
*******
LOG:  duration: 3412.774 ms  execute S_38: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = '2d1cbea3-26db-4a70-81a7-f5763076533e', $2 = 't'
LOG:  duration: 3449.508 ms  execute S_37: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = '2d1cbea3-26db-4a70-81a7-f5763076533e', $2 = 't'

When two different users are logged in from two different computers
and refresh the user portal together, the query takes 8 seconds to answer.

pg_log:
*******
DETAIL:  parameters: $1 = '2d1cbea3-26db-4a70-81a7-f5763076533e', $2 = 't'
LOG:  duration: 7849.535 ms  execute S_11: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = '2d1cbea3-26db-4a70-81a7-f5763076533e', $2 = 't'
LOG:  duration: 8133.534 ms  execute S_21: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = '2d1cbea3-26db-4a70-81a7-f5763076533e', $2 = 't'
 
When three different users are logged in from two different computers
and refresh the user portal together, the query takes 12 seconds to answer.

pg_log:
*******
LOG:  duration: 12171.432 ms  execute S_11: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = '238cb967-22b3-470f-8c8c-fb9c9670aa5c', $2 = 't'
LOG:  duration: 11034.096 ms  execute S_20: select * from  getallfromvms($1, $2)
DETAIL:  parameters: $1 = '1aeca5b2-c933-47bb-b69e-45376c1ed851', $2 = 't'  

When attempting to log in to the user portal with a user that has UserRole permission for all the VMs in the system (e.g., admin@internal), the browser gets stuck. But in that case, I don't see any long-running 'getallfromvms' query in pg_log.
Comment 26 Yair Zaslavsky 2012-12-31 08:17:51 EST
Omri tested with a patch based on http://gerrit.ovirt.org/#/c/9825/9
Comment 27 Omri Hochman 2012-12-31 09:00:26 EST
(In reply to comment #26)
> Omri tested with a patch based on http://gerrit.ovirt.org/#/c/9825/9

Tested the above patch using the same environment/scenario described in Comment #25; the results showed a significant improvement in 'getallfromvms' query time.

pg_log (3 users logged in and manually refreshing the user portal):
*****************************************************************
LOG:  duration: 216.841 ms  execute S_20: select * from  getdisksvmguid($1, $2, $3)
DETAIL:  parameters: $1 = '5a41bec1-f7b4-4e4e-8910-23bc21375356', $2 = NULL, $3 = 'f'
LOG:  duration: 218.164 ms  execute S_19: select * from  getdisksvmguid($1, $2, $3)
DETAIL:  parameters: $1 = 'ea578c06-8050-4684-a168-d5b09d86d277', $2 = NULL, $3 = 'f'
LOG:  duration: 217.223 ms  execute S_20: select * from  getdisksvmguid($1, $2, $3)

Note:
*****
Logging in to the user portal with a user that has permission to see all VMs will still cause the browser to get stuck (the no-paging issue).
Comment 30 RHEL Product and Program Management 2013-01-06 03:46:30 EST
Quality Engineering Management has reviewed and declined this request.
You may appeal this decision by reopening this request.
Comment 40 vvyazmin@redhat.com 2013-06-06 09:37:57 EDT
No issues were found.

Verified on RHEVM 3.2 - SF17.1 environment:

RHEVM: rhevm-3.2.0-11.28.el6ev.noarch
VDSM: vdsm-4.10.2-21.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64


Tested in the following environment:
RHEVM: installed on a physical machine
DB: installed on a remote physical machine

Steps to Reproduce:
1. On an iSCSI DC, create 1800 VMs.
2. All VMs are in “Down” state.
3. Add a user under “Permissions” management with the “PowerUserRole” role for all those VMs.
4. Log in with this user to the User Portal.

Other problems affecting the user experience in this scenario were found; separate bugs are already open for them.
Comment 41 errata-xmlrpc 2013-06-10 17:23:35 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0888.html