Description of problem: Customers scenario describes large system load being generated when using api to create VM from a template. Even showing the templates in Webadmin portal takes noticeable amount of time. Description of function we use to list all templates: engine=# \df+ getallfromvmtemplates List of functions -[ RECORD 1 ]----------+------------------------------------------------------------------------------------------------ Schema | public Name | getallfromvmtemplates Data type of result | SETOF vm_templates_view Data type of parameter | v_user_id uuid, v_is_filtered boolean Type | normal Volatility | volatile Owner | postgres Language | plpgsql Source code | : BEGIN : RETURN QUERY SELECT vm_templates.* : FROM vm_templates_view vm_templates : WHERE (NOT v_is_filtered OR EXISTS (SELECT 1 : FROM user_vm_template_permissions_view : WHERE user_id = v_user_id AND entity_id = vmt_guid)) : ORDER BY name; : END; Description | Taking a look at explain analyze of the query (run using user that has no template, except the default blank on, assigned and filtered, in environment that has only 2 templates). It takes 20ms to get the result: engine=# EXPLAIN ANALYZE VERBOSE SELECT * FROM getallfromvmtemplates('c1736ed2-ef83-40b0-9222-dbcbb24538a9','t'); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Function Scan on getallfromvmtemplates (cost=0.00..260.00 rows=1000 width=3805) (actual time=19.920..19.920 rows=1 loops=1) Output: vmt_guid, name, mem_size_mb, os, creation_date, child_count, num_of_sockets, cpu_per_socket, num_of_cpus, description, vds_group_id, domain, num_of_monitors, allow_console_reconnect, sta tus, usb_policy, time_zone, is_auto_suspend, fail_back, vds_group_name, vm_type, hypervisor_type, operation_mode, nice_level, storage_pool_id, storage_pool_name, quota_enforcement_type, default_boo t_sequence, default_display_type, priority, auto_startup, is_stateless, iso_path, origin, initrd_url, kernel_url, kernel_params, quota_id, quota_name, migration_support, dedicated_vm_for_vds Total runtime: 20.015 ms (3 rows) Version-Release number of selected component (if applicable): 3.2.4 How reproducible: 100 % Steps to Reproduce: 1. Have large amount of templates 2. Navigate to templates tab Actual results: Takes noticeable amount of time and on database level generates quite noticeable load Expected results: Should not take so long and should not generate such load.
Hi Tomas, 1. How many templates do the customer have? 2. How many VMs? 3. Can you provide 3 jstacks of the engine while running the customer scenario? 4. Is it a dedicated DB or an internal one?
Hi Tomas, A very interesting usecase. Wild guess - the constant insert/delete flows might hurt Postgres Analyzer performance. Can you run "ANALYZE vm_static" on the database and see if the performance is improved?
Hi Liran, Here's the output of my testing environment engine=# analyze verbose vm_static; INFO: analyzing "public.vm_static" INFO: "vm_static": scanned 7 of 7 pages, containing 197 live rows and 0 dead rows; 197 rows in sample, 197 estimated total rows ANALYZE engine=# explain ANALYZE select * from getallfromvmtemplates('c1736ed2-ef83-40b0-9222-dbcbb24538a9','t'); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Function Scan on getallfromvmtemplates (cost=0.00..260.00 rows=1000 width=3805) (actual time=9.884..9.884 rows=1 loops=1) Total runtime: 9.948 ms (2 rows) Apparently collecting fresh stats and refreshing the catalog statistics works quite well here. (Still 10 ms for finding the result given the amount of data - 8 users, 2 templates, no rights given to the user on any template - that quite a lot) But 100 % improvement in speed is quite nice. The question is whether the performance degradation in customer's scenario is by any chance influenced by massive changes in vm_static and nonconsistent index. (my wild guess is it could be the case).
Our current implementation stores templates and vms in the same table. So if they remove 800 rows from the table, and insert new ones - it probably hurt the optimizer. Try to run it on their environment, and tell me the outcome. If this is the problem, we'll need to find a good way to do it periodically.
Got it. Will review and get back to you.
Hi Tomas, I've checked. Are you 100% sure we use the function to list all the templates? From a debug session it seems like we're using a dynamic search for that. The query we're running for the System-->Templates tab is: SELECT * FROM (SELECT * FROM vm_templates_view WHERE ( vmt_guid IN (SELECT vm_templates_storage_domain.vmt_guid FROM vm_templates_storage_domain )) ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100. Now, this is of course not optimal, and it so happens that I have a patch that is being tested that can be extended to solve the templates issue. If not, and the function is called, please detail the scenario again, as I can't reproduce this error. 10x.
Hi Tomas, I apologize for the ping-pong. Can you confirm the customer uses the following API - /ovirt-engine/api/templates If so, it uses a dynamic search query. Also, which user is customer uses to authenticate? Is it an admin user?
is this going to be relevant for all customers? or is there any risk?
Always a risk. This patch will solve this specific scenario (API of getting a list of templates based on a data center)
Hi Ron, I can't confirm permissions will make a difference in this case. From my tests it seems like there is an issue with our query generation. I'm working on the patch.
Verified in rhevm-3.3.0-0.40.rc.el6ev.noarch (is27). I used RHEVM with 700 VMs (from template) and 80 templates (I added and removed them several times in a row). The Templates tab in webadmin then took cca 3 seconds to load. Executing the query from comment 15 ( /api/templates?search=datacenter...) took about 4-5 seconds. I used admin@internal as well as an IPA user, but it didn't make any difference. The system load on RHEVM during the query execution was also within acceptable limits - in my case ~50% CPU load (on a weaker HW than is officialy recommended in docs).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2014-0038.html