ovirt-engine-backend [Scalability]:Problematic query 'getallfromvms' causes user portal to become stuck after user login. Description: ************ When user attempts to login to the User Portal on a scale environment (with many VM's), User Portal will become stuck and the browser will stop functioning. it happens due to 'getallfromvms' query that runs when presenting the 'Virtual Machines' tab. Note: ******* user portal will get stuck even If the user does not have Any VM's attached to him or have only a single VM attached to him. RHEVM Environment: ******************* - RHEVM (Build IC24.4) installed on physical - Postgresql remote DB on another physical machine. Objects in RHEVM: ***************** - Total 31 Hosts. - Total 50 iSCSI Storage Domains + 1 ISO + 1 Export - Total ~1800 not running VM's , - 2300 Users/Groups Thread dump: ************** at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:150) at java.net.SocketInputStream.read(SocketInputStream.java:121) at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:135) at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:104) at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:73) at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:259) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1620) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) - locked <0x00000000c4f15c88> (a org.postgresql.core.v3.QueryExecutorImpl) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:479) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:367) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:271) at org.jboss.jca.adapters.jdbc.CachedPreparedStatement.executeQuery(CachedPreparedStatement.java:107) at org.jboss.jca.adapters.jdbc.WrappedPreparedStatement.executeQuery(WrappedPreparedStatement.java:462) at org.springframework.jdbc.core.JdbcTemplate$1.doInPreparedStatement(JdbcTemplate.java:644) at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:587) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:637) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:666) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:706) at org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall.executeCallInternal(PostgresDbEngineDialect.java:155) at org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall.doExecute(PostgresDbEngineDialect.java:121) at org.springframework.jdbc.core.simple.SimpleJdbcCall.execute(SimpleJdbcCall.java:181) at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeImpl(SimpleJdbcCallsHandler.java:124) at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeReadAndReturnMap(SimpleJdbcCallsHandler.java:75) at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeReadList(SimpleJdbcCallsHandler.java:66) at org.ovirt.engine.core.dao.VmDAODbFacadeImpl.getAll(VmDAODbFacadeImpl.java:181) at org.ovirt.engine.core.bll.GetAllVmsQuery.executeQueryCommand(GetAllVmsQuery.java:15) at org.ovirt.engine.core.bll.QueriesCommandBase.ExecuteCommand(QueriesCommandBase.java:71) at org.ovirt.engine.core.dal.VdcCommandBase.Execute(VdcCommandBase.java:41) at org.ovirt.engine.core.bll.Backend.runQueryImpl(Backend.java:400) at org.ovirt.engine.core.bll.Backend.RunQuery(Backend.java:379) at sun.reflect.GeneratedMethodAccessor99.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodIn /var/pg_data/pg_log (those queries return after almost 10 minutes) : ********************************************************************* LOG: duration: 522376.064 ms execute <unnamed>: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = '9b9002d1-ec33-4083-8a7b-31f6b8931648', $2 = 't' LOG: duration: 532152.857 ms execute <unnamed>: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = '9b9002d1-ec33-4083-8a7b-31f6b8931648', $2 = 't' LOG: duration: 292911.526 ms execute <unnamed>: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = '2159ee18-04d8-4549-b6c7-0c1104bc6535', $2 = 't' LOG: duration: 291835.261 ms execute <unnamed>: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = '2159ee18-04d8-4549-b6c7-0c1104bc6535', $2 = 't' LOG: duration: 281379.708 ms execute <unnamed>: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = '2159ee18-04d8-4549-b6c7-0c1104bc6535', $2 = 't' LOG: duration: 286763.031 ms execute <unnamed>: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = '2159ee18-04d8-4549-b6c7-0c1104bc6535', $2 = 't' LOG: duration: 292041.869 ms execute <unnamed>: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = '2159ee18-04d8-4549-b6c7-0c1104bc6535', $2 = 't' LOG: duration: 535423.412 ms execute <unnamed>: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = '9b9002d1-ec33-4083-8a7b-31f6b8931648', $2 = 't' LOG: duration: 540449.231 ms execute <unnamed>: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = '9b9002d1-ec33-4083-8a7b-31f6b8931648', $2 = 't'
Created attachment 653408 [details] thread_dump
Created attachment 653410 [details] pg_log
Having the execution plan for these quereis would be a great help. Omri, if it's possible, I'd like the following done: 1. Edit postgres.conf and add the following lines: shared_preload_libraries = 'auto_explain' custom_variable_classes = 'auto_explain' auto_explain.log_min_duration = '60s' 2. Restart postgres demon 3. Run the same scenario and attach /var/pg_data/pg_log
LOG: duration: 90313.548 ms execute <unnamed>: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = 'fdfc627c-d875-11e0-90f0-83df133b58cc', $2 = 't' LOG: duration: 90313.544 ms plan: Function Scan on getallfromvms (cost=0.00..260.00 rows=1000 width=12580) LOG: duration: 90385.997 ms execute <unnamed>: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = 'fdfc627c-d875-11e0-90f0-83df133b58cc', $2 = 't' LOG: duration: 90385.993 ms plan: Function Scan on getallfromvms (cost=0.00..260.00 rows=1000 width=12580) LOG: duration: 90890.892 ms execute <unnamed>: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = 'fdfc627c-d875-11e0-90f0-83df133b58cc', $2 = 't' LOG: duration: 90890.889 ms plan: Function Scan on getallfromvms (cost=0.00..260.00 rows=1000 width=12580) LOG: duration: 91612.826 ms execute <unnamed>: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = 'fdfc627c-d875-11e0-90f0-83df133b58cc', $2 = 't' LOG: duration: 91612.821 ms plan: Function Scan on getallfromvms (cost=0.00..260.00 rows=1000 width=12580) LOG: duration: 89354.231 ms execute <unnamed>: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = 'fdfc627c-d875-11e0-90f0-83df133b58cc', $2 = 't' LOG: duration: 89354.227 ms plan: Function Scan on getallfromvms (cost=0.00..260.00 rows=1000 width=12580)
looks like a serious performance issue with fn splitter (we split group ids, then perform full table scan on groups table).
(In reply to comment #5) > looks like a serious performance issue with fn splitter (we split group ids, > then perform full table scan on groups table). We will have to get rid of the comma separated values of group ids and host them in a table + handling all upgrade aspects , the fnSplitter performance is bad !!!
(In reply to comment #6) > (In reply to comment #5) > > looks like a serious performance issue with fn splitter (we split group ids, > > then perform full table scan on groups table). > > We will have to get rid of the comma separated values of group ids and host > them in a table + handling all upgrade aspects , the fnSplitter performance > is bad !!! why wasn't this an issue until now?
Apparently the problem was the number of object that the user portal had to present. In my case : the user portal had to present 2000 VMs, because the users I login with the user portal - had 'UserRole' As 'System Permissions' --> that means that the users had permission to see all 2000 VMs (this permission is not shown under the VM's 'Permission' sub-tab), and since there's no paging in the user portal virtual machine tab, user-portal probably fail to handle large number of VM's and the browser stuck. I'm not sure If this scenario is relevant.
To recap the issues discussed here, there were 2 issues: - 2000 VMs with 2000 users (1 vm per user) * this seems to be an inefficiency in our user/permissions queries, will require a DB schema change. - 2000 VMs all belong to 1 user: * looks like User-Portal UI paging issue Are the 2 issues above valid ?
(In reply to comment #16) > To recap the issues discussed here, there were 2 issues: > > - 2000 VMs with 2000 users (1 vm per user) > * this seems to be an inefficiency in our user/permissions queries, will > require a DB schema change. We haven't tested this yet, We will create a script that stimulate the scenario and post the results. > > - 2000 VMs all belong to 1 user: > * looks like User-Portal UI paging issue This scenario is valid, in case: we would like to allow : superuser/admin@interanl/ or any other user with 'system-permission' to login the user portal. > > > Are the 2 issues above valid ?
patch by mkublin http://gerrit.ovirt.org/#/c/9825/1
A patch to improve getUserAndGroupsById (at create_functions.sh) Should be added as well. http://gerrit.ovirt.org/#/c/9825/1 - should also handle fix for http://gerrit.ovirt.org/#/c/9545/
Tested the following scenario (rhevm 3.1, build SI25.2) - Environment with ~400 users, each user associated with 3-4 Vm's. One User is logged-in to user portal: every time the user portal 'refresh' (it can happen automatically or manually), the query 'getallfromvms' occur and takes 3 seconds to answer. pg_log: ******* LOG: duration: 3412.774 ms execute S_38: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = '2d1cbea3-26db-4a70-81a7-f5763076533e', $2 = 't' LOG: duration: 3449.508 ms execute S_37: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = '2d1cbea3-26db-4a70-81a7-f5763076533e', $2 = 't' When two different users are logged-in from two different computers and refreshing the user portal together - the query takes 8 second to answer. pg_log: ******* DETAIL: parameters: $1 = '2d1cbea3-26db-4a70-81a7-f5763076533e', $2 = 't' LOG: duration: 7849.535 ms execute S_11: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = '2d1cbea3-26db-4a70-81a7-f5763076533e', $2 = 't' LOG: duration: 8133.534 ms execute S_21: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = '2d1cbea3-26db-4a70-81a7-f5763076533e', $2 = 't' When three different users are logged-in from two different computers and refreshing the user portal together- the query takes 12 second to answer. pg_log: ******* LOG: duration: 12171.432 ms execute S_11: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = '238cb967-22b3-470f-8c8c-fb9c9670aa5c', $2 = 't' LOG: duration: 11034.096 ms execute S_20: select * from getallfromvms($1, $2) DETAIL: parameters: $1 = '1aeca5b2-c933-47bb-b69e-45376c1ed851', $2 = 't' When attempting to login to user-portal with user that has user-role permission, for All the VM's in the system (i.e, admin@internal) the browser will get stuck, But in that case, I don't see any large 'getallfromvms' query in pg_log.
Omri tested with a patch based on http://gerrit.ovirt.org/#/c/9825/9
(In reply to comment #26) > Omri tested with a patch based on http://gerrit.ovirt.org/#/c/9825/9 Test the above patch using the same environment/scenario described in Comment #25 the results showed significant improvement in 'getallfromvms' query time. pg_log : (3 users logged-in and manually refreshing user-portal) ***************************************************************** LOG: duration: 216.841 ms execute S_20: select * from getdisksvmguid($1, $2, $3) DETAIL: parameters: $1 = '5a41bec1-f7b4-4e4e-8910-23bc21375356', $2 = NULL, $3 = 'f' LOG: duration: 218.164 ms execute S_19: select * from getdisksvmguid($1, $2, $3) DETAIL: parameters: $1 = 'ea578c06-8050-4684-a168-d5b09d86d277', $2 = NULL, $3 = 'f' LOG: duration: 217.223 ms execute S_20: select * from getdisksvmguid($1, $2, $3) Note: ***** login to user-portal, with user that has permission to see All VM's - will still cause the browser to stuck (no paging issue).
Quality Engineering Management has reviewed and declined this request. You may appeal this decision by reopening this request.
No issues are found. Verified on RHEVM 3.2 - SF17.1 environment: RHEVM: rhevm-3.2.0-11.28.el6ev.noarch VDSM: vdsm-4.10.2-21.0.el6ev.x86_64 LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64 QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64 SANLOCK: sanlock-2.6-2.el6.x86_64 Tested in following environment: RHEVM: install on physical machine DB: install on remote physical machine Steps to Reproduce:: 1. On iSCSI DC create 1800 VM's 2. All VM's in “Down” state 3. Add user to “Permissions” management with “PowerUserRole” role, to all those VM's. 4. Login with this user to User Portal An others problem are found, such user experience in this scenario, already open separate bugs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0888.html