Description of problem:
When a large number of users are created in the environment and these users are assigned to different entities, the system can become overloaded due to PostgreSQL CPU load.

Version-Release number of selected component (if applicable):
4.1.10

How reproducible:
100%

Steps to Reproduce:
1. Create 100 users
2. Assign these users to 100 VMs (DiskOperator role)
3. Go to the Users tab, or execute the SQL:
SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM vdc_users_with_tags WHERE vdc_users.user_group = 'user' )) ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100;

Actual results:
Each thread handling the execution of this SQL query takes 100% of a CPU core.

Expected results:
The SQL executes quickly.

Additional info:
This issue is caused by the WHERE condition inside the subquery: it references the outer table (vdc_users.user_group), which forces PostgreSQL into an inefficient execution plan. The query's performance improves dramatically by simply moving the WHERE condition outside the subquery.
Created attachment 1436781 [details]
execution plan (condition outside of the subquery)

explain analyze SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM vdc_users_with_tags )) and vdc_users.user_group = 'user' ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100;
Created attachment 1436782 [details]
execution plan (condition inside of the subquery)

The SQL is not analyzed, as the execution does not finish in reasonable time.

explain SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM vdc_users_with_tags WHERE vdc_users.user_group = 'user' )) ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100;
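The essential difference between the two attached plans, shown here with the queries simplified (paging and ordering removed): when the user_group filter sits inside the IN subquery, it references the outer vdc_users row, making the subquery correlated and forcing a re-evaluation per outer row; in the fast variant the subquery is uncorrelated and the filter applies to the outer rows directly.

```sql
-- Slow: vdc_users.user_group refers to the OUTER table, so the
-- subquery is correlated and must be re-evaluated for every
-- vdc_users row instead of being planned as a simple semi-join.
SELECT * FROM vdc_users
WHERE user_id IN (SELECT DISTINCT user_id FROM vdc_users_with_tags
                  WHERE vdc_users.user_group = 'user');

-- Fast: the subquery is uncorrelated (a plain semi-join source),
-- and the user_group filter applies to the outer rows on its own.
SELECT * FROM vdc_users
WHERE user_id IN (SELECT DISTINCT user_id FROM vdc_users_with_tags)
  AND user_group = 'user';
```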
# select count(*) from vdc_users_with_tags;
 count
--------
 711945
(1 row)

# select count(*) from users;
 count
-------
  9353
(1 row)

I have done the same tests with Postgres 9.5. The unmodified SQL takes ~16 sec and the modified query takes almost no time. As this is generated SQL, can we modify the code to move the condition outside of the subquery?
Bug 1510432, which is merged to 4.2, will possibly solve that; I just need to verify the fix does so for tag searches like the one here. It handles two things: it removes the generated 'distinct', thus choosing simpler query plans for most searches, and specifically it avoids the heaviest part here, which is the huge sort that spills to disk. Bug 1532802 might be related as well; I didn't dive deep into the details. Roman, your suggestion may be good, though I suspect it may miss results.
Regarding the work_mem: the execution plan does not use external sorts; everything seems to fit into memory. I have also done some tests with a much higher work_mem than the one we introduced (256MB). The result is the same. I have also checked with RHV 4.2.2.6 and the results are the same; the SQL query is generated in the same way as before.
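For reference, this kind of test can be done per session with standard PostgreSQL settings (the 256MB value matches the test above; the query is abbreviated from the one in the problem description):

```sql
-- Standard PostgreSQL session-level settings for this kind of test.
SHOW work_mem;              -- current value
SET work_mem = '256MB';     -- raise it for this session only

-- Re-run the slow query under the new limit and inspect the plan:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM vdc_users
WHERE user_id IN (SELECT DISTINCT user_id FROM vdc_users_with_tags
                  WHERE vdc_users.user_group = 'user')
ORDER BY name ASC LIMIT 100;
```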
(In reply to Roman Hodain from comment #6)
> Regarding the work_mem. The execution plan does not use external sorts.
> Everything seems to fit into memory.

You mentioned the analyze didn't complete in reasonable time, so we don't see the whole plan. Also, in the modified query analysis I clearly see an external sort:
"Sort Method: external merge Disk: 91384kB"

Is it possible to wait for the analysis to be done?
(In reply to Roy Golan from comment #8)
> You mentioned the analyze didn't complete in reasonable time, so we don't
> see the whole plan. Also in the modified query analysis I clearly see
> external sort -
> "Sort Method: external merge Disk: 91384kB"
>
> Is it possible to wait for the analysis to be done?

Regarding the external sort: yes, it is in the modified query; I missed that. Anyway, it is 90MB, which may be too much to cover by increasing work_mem. I have tried to let the analysis finish, but it took more than one hour and was still running. I have increased work_mem to 1GB just out of curiosity. I will let you know if it finishes.
Roman, I agree the WHERE condition should be outside, instead of generating a huge inner dataset to pass on (that's a completely foolish thing to do inside the inner query on the right table... as always, generated queries eventually hit you). We can either fix it that way, or, as another option, completely remove the join on the tag table, which I seriously doubt is needed.

Eli, do you know why the join on vdc_users_with_tags is needed at all?
(In reply to Roy Golan from comment #17)
> We can either fix that by doing so and another option is to completly remove
> the join on the tag table, which I seriously doubt is needed.
>
> Eli do you know why the join on vdc_users_with_tags is needed at all?

This join should be used only when the search includes tags; if it is generated in other cases, then this is a bug that we should fix.
(In reply to Roy Golan from comment #17)
> Roman, I agree the WHERE condition should be outside, instead of generating
> a huge inner dataset to pass on (that's completely foolish thing to do
> inside the inner query on a the right table... like always, generated
> queries eventually hit you).
>
> We can either fix that by doing so and another option is to completly remove
> the join on the tag table, which I seriously doubt is needed.
>
> Eli do you know why the join on vdc_users_with_tags is needed at all?

Hi Roy, would it be worth going both ways? We can remove the tags, but moving the condition outside could help in other cases. Sooner or later we may hit a similar issue in a different environment with a similar query.
(In reply to Roy Golan from comment #17)
> Roman, I agree the WHERE condition should be outside, instead of generating
> a huge inner dataset to pass on (that's completely foolish thing to do
> inside the inner query on a the right table... like always, generated
> queries eventually hit you).
>
> We can either fix that by doing so and another option is to completly remove
> the join on the tag table, which I seriously doubt is needed.

Let's begin with the WHERE outside. Roman, does that help?

> Eli do you know why the join on vdc_users_with_tags is needed at all?

Roman, are they using tags?
(In reply to Yaniv Kaul from comment #22)
> (In reply to Roy Golan from comment #17)
> > Roman, I agree the WHERE condition should be outside, instead of generating
> > a huge inner dataset to pass on (that's completely foolish thing to do
> > inside the inner query on a the right table... like always, generated
> > queries eventually hit you).
> >
> > We can either fix that by doing so and another option is to completly remove
> > the join on the tag table, which I seriously doubt is needed.
>
> Let's begin with the WHERE outside. Roman, does that help?

Yes, it helps significantly.

> > Eli do you know why the join on vdc_users_with_tags is needed at all?
>
> Roman, are they using tags?

No, they currently suffer from this issue when simply clicking on the tab or calling the REST API on /users.
(In reply to Roman Hodain from comment #24)
> (In reply to Yaniv Kaul from comment #22)
> > (In reply to Roy Golan from comment #17)
> > > Roman, I agree the WHERE condition should be outside, instead of generating
> > > a huge inner dataset to pass on (that's completely foolish thing to do
> > > inside the inner query on a the right table... like always, generated
> > > queries eventually hit you).
> > >
> > > We can either fix that by doing so and another option is to completly remove
> > > the join on the tag table, which I seriously doubt is needed.
> >
> > Let's begin with the WHERE outside. Roman, does that help?
> Yes it helps significantly.
>
> > > Eli do you know why the join on vdc_users_with_tags is needed at all?
> >
> > Roman, are they using tags?
> No, they currently suffer from this issue when simply clicking on the tab or
> calling RestAPI on /users

So the above sounds like our hotfix.
(In reply to Yaniv Kaul from comment #25)
> (In reply to Roman Hodain from comment #24)
> > (In reply to Yaniv Kaul from comment #22)
> > > Let's begin with the WHERE outside. Roman, does that help?
> > Yes it helps significantly.
> >
> > > Roman, are they using tags?
> > No, they currently suffer from this issue when simply clicking on the tab or
> > calling RestAPI on /users
>
> So the above sounds like our hotfix.

So far there is no hotfix; the SQL is generated automatically by the SearchEngine, so there is no way to use the modified query without changing SearchEngine SQL generation. And this is very delicate; it will probably take a few more days to have a patch which doesn't break other flows.
> --------
> 711945
> (1 row)
>
> # select count(*) from users;
> count
> -------
> 9353
> (1 row)

Another small observation: the view 'vdc_users_with_tags' has 700,000 rows while there are only 10,000 users. Are there 70 tags for each user? This is possibly a bug in the view. Can we get a dump to look into it? If there is a bug there, we might be able to tweak the view, which is way easier and less risky.

More thoughts on this:
- the users-with-tags view should not be cross-joined if the search query isn't looking for tags at all.
- I think Roman's suggestion of moving the WHERE to the outer query is correct in any case and should be included.
Roman,

Can you please specify the search string as it is passed to the search engine? AFAIK, tag views are not involved as long as you are not searching for tags.

Thanks
Apart from the initial fix, I propose to revisit the vdc_users_with_tags view so that it contains only records whose tag is not null. This should narrow down the view dramatically. Moreover, it will serve as an easy hotfix.

Currently:

select user_id,tag_id from vdc_users_with_tags;
               user_id                | tag_id
--------------------------------------+--------
 c4335808-3cdb-11e8-867d-0242ac130002 |
...

And the row count is >= the number of VMs this user has permissions on.
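A minimal sketch of the effect of the proposal (note the later comments conclude this alone would break tag-less searches): filtering out the null-tag rows shrinks the dataset the search has to process. The real fix would change the view definition itself; this only illustrates the filter on top of the existing view.

```sql
-- Sketch only: drop the rows whose tag_id is null, which per the
-- output above make up the bulk of vdc_users_with_tags.
SELECT user_id, tag_id
FROM vdc_users_with_tags
WHERE tag_id IS NOT NULL;
```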
Extending my last comment: the change to the view will not be enough, since the search generates a WHERE on the user_id in the tag view. So the default (wrongly generated) search would return zero results.
I think we should do both, the patches to fix SearchEngine and another patch to remove null tags from users_with_tags view.
(In reply to Martin Perina from comment #31)
> I think we should do both, the patches to fix SearchEngine and another patch
> to remove null tags from users_with_tags view.

Worth a different BZ IMO.
I think we can't do it, because this search is meant as a general search on users/groups + VMs + tags. So this would break:

Users: vm=vm1

It would yield no results if we make the view free of null tags, in the case where the user doesn't have a tag, even though the user has permissions on that VM. The name of the view is confusing, and the implementation of the search is poor because it performs a sub-optimal query, to say the least, just to aggregate info on users with VMs or tags.
That is probably why the code was always including the vdc_users_with_tags in the search. So this obsoletes my first fix.
We can validate the bug in the 4.2.4 version.

Could you please provide validation instructions/notes for this fix?

Should we just follow the reproduction steps described in the problem description?
1. Create 100 users
2. Assign these users to 100VM (disk operator) - [Daniel] WHAT is "100VM"?
3. Go to tab Users or execute the SQL:
SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM vdc_users_with_tags WHERE vdc_users.user_group = 'user' )) ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100;

Is there any significance to the distribution of users to a specific role to assess?
(In reply to Daniel Gur from comment #39)
> We can validate the bug in 4.2.4 version.
>
> Could you please provide validation instructions/notes for this fix?
>
> Should we just follow reproduction steps descried in the problem description:
> 1.Crete 100 users
> 2.Assign these users to 100VM (disk operator) - [Daniel] WHAT is "100VM"?

Each user should have a DiskOperator (and probably UserRole) permission on a single VM.

> 3.Go to tab Users or execute sql:
> SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct
> vdc_users_with_tags.user_id FROM vdc_users_with_tags WHERE
> vdc_users.user_group = 'user' )) ORDER BY name ASC ) as T1 OFFSET (1 -1)
> LIMIT 100;

If you click on the Users tab, you should observe the following:

1. It's much faster than in the previous version.
2. You should verify that the SQL executed to fetch data into the view does not use a join to vdc_users_with_tags unless a specific VM or tag is added to the search.
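For the verification, slow queries can be surfaced with PostgreSQL's standard statement-duration logging (the 500 ms threshold here is just an example value; it requires superuser access on the engine database):

```sql
-- Standard PostgreSQL knobs for logging slow statements.
-- Statements running longer than the threshold are written to the log.
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();

-- Verify the setting took effect:
SHOW log_min_duration_statement;
```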
(In reply to Martin Perina from comment #41)
> Each user should have a DiskOperator (and UserRole probably) permission on
> single VM
>
> If you click on Users tab, you should observe following:
>
> 1. It's much faster than in previous version
> 2. You should verify that SQL execuited to fetch data into the view is not
> using joing to vdc_users_with_tags table unless specific VM or tag is added
> to the search.

#What was done:
Created 5 pools of 100 VMs each.
Created 380 users in the internal domain via ovirt-aaa-jdbc-tool user add, then used the ovirt_users and ovirt_permissions modules to update permissions.
Clicked the "Users" tab (very slow; paging between users is slow). Postgres shows:

2018-06-02 20:02:59.712 UTC LOG: duration: 60818.421 ms execute <unnamed>: SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM vdc_users_with_tags WHERE vdc_users.user_group = 'user' )) ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100
2018-06-02 20:03:05.418 UTC LOG: duration: 65654.598 ms execute <unnamed>: SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM vdc_users_with_tags WHERE vdc_users.user_group = 'user' )) ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100
2018-06-02 20:04:23.450 UTC LOG: duration: 78011.641 ms execute <unnamed>: SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM vdc_users_with_tags WHERE vdc_users.user_group = 'user' )) ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100

#Upgraded to 4.2.4
From the UI, clicking the Users tab is snappy and fast; no queries listed over 500 ms duration via Postgres duration logging. Checked browsing the Users tab and paginating Users tab result pages, both of which were previously very difficult to use.

Additionally checked these searches and they were fine:
"Users:", "Users: type=user", "Users: type=group", "Hosts:", "Vms:", "Datacenters:", "Clusters:"

engine=# select count(*) from vdc_users_with_tags;
 count
-------
 14221
(1 row)
#What was done:
Created 5 pools of 100 VMs each.
Created 500 users in the internal domain via ovirt-aaa-jdbc-tool user add, then used the ovirt_users and ovirt_permissions modules to update permissions.

engine=# select count(*) from vdc_users_with_tags;
 count
-------
 18590
(1 row)

Running ovirt-engine-4.3.0.4-0.1.el7.noarch & vdsm-4.30.9-1.el7ev.x86_64.

Checked browsing the Users tab and paginating Users tab result pages. Additionally checked the searches in [1] and they were fine.

Did see:
db=engine,user=engine,app=PostgreSQL JDBC Driver,client=127.0.0.1 LOG: duration: 593.434 ms execute <unnamed>: select * from getpermissionsbyadelementid($1, $2, $3, $4)
but this database is running both ovirt_engine and ovirt_engine_history, so that seems fair.

Moving to verified.

[1] Searches: "Users:", "Users: type=user", "Users: type=group", "Hosts:", "Vms:", "Datacenters:", "Clusters:"
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:1085
sync2jira