Description of problem: Some time ago, in p4 change 28606, the implementation of inSubqueryFilter was changed so that instead of generating SQL looking like: foo.id in (....subquery....) it instead generates exists (select subquery_id from (...subquery...) where foo.id = subquery_id) This was specifically to solve known performance problems on PostgreSQL 7.2: "Made in subquery filters generate SQL that uses exists instead of in. This is known to be faster on postgres and is reputed to be faster on oracle." Unfortunately an increasing number of our Oracle APLAWS installations are having performance problems as the size of their dataset increases. Take the following query, for example, select * from (select outerResults.*, rownum as fakeRownum from (select * from (select acs_objects.default_domain_class as cs_objcts__dflt_dmn_clss, acs_objects.display_name as acs_objects__display_name, acs_objects.object_id as object_id, acs_objects.object_type as acs_objects__object_type, cms_items.ancestors as cms_items__ancestors, cms_items.name as cms_items__name, cms_items.version as cms_items__version, cms_pages.launch_date as cms_pages__launch_date, cms_pages.title as cms_pages__title, vc_objects.is_deleted as vc_objects__is_deleted from acs_objects acs_objects, cms_items cms_items, cms_pages cms_pages, vc_objects vc_objects where cms_items.item_id=cms_pages.item_id and vc_objects.object_id=acs_objects.object_id and vc_objects.object_id=cms_items.item_id) results where ((cms_items__version='live' and vc_objects__is_deleted='0' and not exists (select subquery_id from (select version_id as subquery_id from (select v.version_id from cms_version_map v, cat_cat_subcat_trans_index t, cat_object_category_map m where v.item_id=m.object_id and m.category_id=t.subcategory_id and t.category_id='41557' and m.index_p='1') insub1) insub2 where insub2.subquery_id=object_id) and exists (select subquery_id from (select object_id as subquery_id from (select m.object_id from cat_cat_subcat_trans_index t, cat_object_category_map m where t.category_id='41557' and t.subcategory_id=m.category_id) insub1) insub2 where insub2.subquery_id=object_id))) order by object_id desc) outerResults where rownum<'31') where fakeRownum>='1'; This takes about 6 seconds to run on Oracle. The explain plan for this shows full table scans on acs_objects, cms_items, cms_pages & a 'fast full scan' on vc_objects: Execution Plan ---------------------------------------------------------- 0 SELECT STATEMENT Optimizer=CHOOSE (Cost=241 Card=22 Bytes=64 944) 1 0 VIEW (Cost=241 Card=22 Bytes=64944) 2 1 COUNT (STOPKEY) 3 2 VIEW (Cost=241 Card=22 Bytes=64658) 4 3 SORT (ORDER BY STOPKEY) (Cost=109 Card=22 Bytes=4510 ) 5 4 FILTER 6 5 HASH JOIN (Cost=107 Card=22 Bytes=4510) 7 6 HASH JOIN (Cost=101 Card=51 Bytes=8262) 8 7 TABLE ACCESS (FULL) OF 'ACS_OBJECTS' (Cost=6 1 Card=103 Bytes=10609) 9 7 HASH JOIN (Cost=39 Card=9164 Bytes=540676) 10 9 INDEX (FAST FULL SCAN) OF 'VC_OBJECTS_DELE TED_IDX' (NON-UNIQUE) (Cost=4 Card=18328 Bytes=128296) 11 9 TABLE ACCESS (FULL) OF 'CMS_ITEMS' (Cost=1 9 Card=9164 Bytes=476528) 12 6 TABLE ACCESS (FULL) OF 'CMS_PAGES' (Cost=5 Car d=3844 Bytes=165292) 13 5 NESTED LOOPS (Cost=4 Card=1 Bytes=32) 14 13 NESTED LOOPS (Cost=4 Card=1 Bytes=22) 15 14 TABLE ACCESS (BY INDEX ROWID) OF 'CMS_VERSIO N_MAP' (Cost=2 Card=1 Bytes=10) 16 15 INDEX (UNIQUE SCAN) OF 'CMS_VERSION_MAP_VE RSION_ID_UN' (UNIQUE) (Cost=1 Card=10227) 17 14 TABLE ACCESS (BY INDEX ROWID) OF 'CAT_OBJECT _CATEGORY_MAP' (Cost=2 Card=1 Bytes=12) 18 17 INDEX (RANGE SCAN) OF 'CAT_OBJ_CAT_MAP_OBJ ECT_ID_IDX' (NON-UNIQUE) (Cost=1 Card=2) 19 13 INDEX (UNIQUE SCAN) OF 'CAT_CAT_SUBCAT_INDEX_P K' (UNIQUE) 20 5 NESTED LOOPS (Cost=2 Card=2 Bytes=40) 21 20 TABLE ACCESS (BY INDEX ROWID) OF 'CAT_OBJECT_C ATEGORY_MAP' (Cost=2 Card=2 Bytes=20) 22 21 INDEX (RANGE SCAN) OF 'CAT_OBJ_CAT_MAP_OBJEC T_ID_IDX' (NON-UNIQUE) (Cost=1 Card=2) 23 20 INDEX (UNIQUE SCAN) OF 'CAT_CAT_SUBCAT_INDEX_P K' (UNIQUE) The first change we made was to alter the way the exist clause was written so that instead of: where insub2.subquery_id=object_id It joined to a lower-table - cms_items.item_id. This removed the full table scans on all but cms_items. The total query time was still 0.7 of a second. Remembering that inSubqueryFilter was implemented differently on 5.0 (which we are just upgrading this customer from), we replaced the 'exists' with the equivalent 'in' clause as generated by 5.0 persistence. The query now looks like: select * from (select outerResults.*, rownum as fakeRownum from (select * from (select acs_objects.default_domain_class as cs_objcts__dflt_dmn_clss, acs_objects.display_name as acs_objects__display_name, acs_objects.object_id as object_id, acs_objects.object_type as acs_objects__object_type, cms_items.ancestors as cms_items__ancestors, cms_items.name as cms_items__name, cms_items.version as cms_items__version, cms_pages.launch_date as cms_pages__launch_date, cms_pages.title as cms_pages__title, vc_objects.is_deleted as vc_objects__is_deleted from acs_objects acs_objects, cms_items cms_items, cms_pages cms_pages, vc_objects vc_objects where cms_items.item_id=cms_pages.item_id and vc_objects.object_id=acs_objects.object_id and vc_objects.object_id=cms_items.item_id) results where ((cms_items__version='live' and vc_objects__is_deleted='0' and not exists (select subquery_id from (select version_id as subquery_id from (select v.version_id from cms_version_map v, cat_cat_subcat_trans_index t, cat_object_category_map m where v.item_id=m.object_id and m.category_id=t.subcategory_id and t.category_id='41557' and m.index_p='1') insub1) insub2 where insub2.subquery_id=object_id) and exists (select subquery_id from (select object_id as subquery_id from (select m.object_id from cat_cat_subcat_trans_index t, cat_object_category_map m where t.category_id='41557' and t.subcategory_id=m.category_id) insub1) insub2 where insub2.subquery_id=object_id))) order by object_id desc) outerResults where rownum<'31') where fakeRownum>='1'; The explain plan for this query is: Execution Plan ---------------------------------------------------------- 0 SELECT STATEMENT Optimizer=CHOOSE (Cost=85 Card=1 Bytes=13) 1 0 SORT (AGGREGATE) 2 1 VIEW (Cost=85 Card=3 Bytes=39) 3 2 COUNT (STOPKEY) 4 3 VIEW (Cost=85 Card=3) 5 4 SORT (ORDER BY STOPKEY) (Cost=85 Card=3 Bytes=645) 6 5 HASH JOIN (ANTI) (Cost=84 Card=3 Bytes=645) 7 6 NESTED LOOPS (Cost=74 Card=3 Bytes=630) 8 7 NESTED LOOPS (Cost=68 Card=6 Bytes=1002) 9 8 NESTED LOOPS (Cost=56 Card=12 Bytes=1380) 10 9 NESTED LOOPS (Cost=31 Card=25 Bytes=2700 ) 11 10 VIEW OF 'VW_NSO_2' (Cost=4 Card=25 Byt es=125) 12 11 SORT (UNIQUE) 13 12 NESTED LOOPS (Cost=4 Card=25 Bytes =500) 14 13 INDEX (FAST FULL SCAN) OF 'CAT_O BJ_CAT_MAP_UN' (UNIQUE) (Cost=4 Card=7542 Bytes=75420) 15 13 INDEX (UNIQUE SCAN) OF 'CAT_CAT_ SUBCAT_INDEX_PK' (UNIQUE) 16 10 TABLE ACCESS (BY INDEX ROWID) OF 'ACS_ OBJECTS' (Cost=1 Card=1 Bytes=103) 17 16 INDEX (UNIQUE SCAN) OF 'ACS_OBJECTS_ OBJECT_ID_P_HHKB1' (UNIQUE) 18 9 TABLE ACCESS (BY INDEX ROWID) OF 'VC_OBJ ECTS' (Cost=1 Card=1 Bytes=7) 19 18 INDEX (UNIQUE SCAN) OF 'VC_OBJECTS_PK' (UNIQUE) 20 8 TABLE ACCESS (BY INDEX ROWID) OF 'CMS_ITEM S' (Cost=1 Card=1 Bytes=52) 21 20 INDEX (UNIQUE SCAN) OF 'CMS_ITEMS_PK' (U NIQUE) 22 7 TABLE ACCESS (BY INDEX ROWID) OF 'CMS_PAGES' (Cost=1 Card=1 Bytes=43) 23 22 INDEX (UNIQUE SCAN) OF 'CMS_PAGES_ITEM_ID_ P_RNEE1' (UNIQUE) 24 6 VIEW OF 'VW_NSO_1' (Cost=9 Card=12 Bytes=60) 25 24 NESTED LOOPS (Cost=9 Card=12 Bytes=384) 26 25 HASH JOIN (Cost=9 Card=1973 Bytes=43406) 27 26 TABLE ACCESS (FULL) OF 'CAT_OBJECT_CATEG ORY_MAP' (Cost=4 Card=1021 Bytes=12252) 28 26 INDEX (FAST FULL SCAN) OF 'CMS_VERSION_M AP_ITEM_ID_PK' (UNIQUE) (Cost=4 Card=10227 Bytes=102270) 29 25 INDEX (UNIQUE SCAN) OF 'CAT_CAT_SUBCAT_IND EX_PK' (UNIQUE) The only full table scan remaining is now on cat_object_category_map. Running the query now takes 0.24 seconds - which is comparable with what we were seeing on 5.0 version of APLAWS for this customer. We have now had the same problems on 4 major Oracle/APLAWS-5.2 based installations. The only other Oracle/APLAWS-5.2 installs where we are not seeing this problem are those with only a small dataset. Thus is would appear that, while on PG, the 'in' query is best, on Oracle, we should be using 'exists'. Thus I suggest that c.a.persistence.InFilter.java needs to be re-written to use appropriate syntax for each database. One final point, the release notes for 7.4 postgresql suggest 'in' filters will now also be faster than 'exists' for PG too. So we probably need to check which DB version we're running on when making the choice of what filter to use - or perhaps make it configurable. Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
Created attachment 96087 [details] Query using 'in' filter as generated by 5.0 persistence
Created attachment 96088 [details] Query using 'exists' filter as generated by 5.2 persistence
Comments from Aram wrt to performance of 'exists' on Oracle: <aram> exists is generally better that in, except few case <aram> rather you shouldnt use exists if it is major clause restricting nubmer of rows In the above queries we are using the exists as the major restricting clause against cms_items, which is evidently why performance is suffering in this case. I'm not sure what exactly we can do about this, since I'm not sure persistence would be able to accurately guage when to use 'in' vs 'exists' on any given 'addInSubqueryFilter' call. However, a solution *does* need to be found since the current query is unusably slow. Since (to my knowledge at least) we never had any concrete performance problems on Oracle with 'in' my thought is to simply revert to using 'in' instead of 'exists' on Oracle. Any better ideas ? Removing the use of these two sub-query filters isn't an option for 5.2, since that would basically require the new categorization API from Rickshaw.
NB, In the first comment, the 2nd query I posted was incorrect. Look at the attachments for the real ones. The 'addInSubqueryFilter' method is used in a very similar way to the above example (ie filtering a large list of objects by category) & thus have potential for this catastrophic performance in certain scenarios: cms/src/com/arsdigita/cms/ui/IntermediaSearchListModelBuilder.java core/src/com/arsdigita/categorization/Category.java forum/src/com/arsdigita/forum/Forum.java navigation/src/com/arsdigita/london/navigation/ui/AbstractItemList.java navigation/src/com/arsdigita/london/navigation/ui/CategoryItemNav.java navigation/src/com/arsdigita/london/navigation/ui/CategoryTypeItemList.java navigation/src/com/arsdigita/london/navigation/ui/ContentTypeList.java navigation/src/com/arsdigita/london/navigation/ui/L3ContentPane.java navigation/src/com/arsdigita/london/navigation/ui/LevelContentPane.java navigation/src/com/arsdigita/london/navigation/ui/RecentItems.java navigation/src/com/arsdigita/london/navigation/ui/Selection.java navigation/src/com/arsdigita/london/navigation/ui/CategoryWorkspaceSelectionModel.java portal/src/com/arsdigita/london/portal/ui/admin/CategoryComponent.java search/src/com/arsdigita/london/search/IntermediaSearcher.java search/src/com/arsdigita/london/search/Reindexer.java subsite/src/com/arsdigita/london/subsite/IntermediaSearcherForSubsites.java The remaining uses of addInSubqueryFilter are: Exmination of what they are doing suggests they could well have the same problem with 'exists' as the others. cms/src/com/arsdigita/cms/contenttypes/ContentGroup.java cms/src/com/arsdigita/cms/publishToFile/PublishToFile.java content-types/src/com/arsdigita/cms/contenttypes/Link.java core/src/com/arsdigita/kernel/Group.java core/src/com/arsdigita/lucene/Indexer.java core/src/com/arsdigita/ui/admin/GroupTreeModel.java For the category base subquery filters we have verified that using 'in' instead of 'exists' gives better performance on 2 customer deployments of Oracle 9i. We have two further customers for whom the current 'exists' based query is giving horrific performance (ie 10->60 seconds per query), & although we have yet to verify it, we suspect that switching back to 'in' will solve the problem, since they didn't have such a problem while on 5.0 based APLAWS.
This is fixed in the London 5.2 core in changelist 38219. This produced a query which ran in 6ms rather than 6.5 seconds before the change for one particular customer. This changelist makes an exception of postgres, which still uses exists. Could we also make sure this gets fixed in Rickshaw?
my test database has about 7000 items in cms_items and we are seeing similar behavior. My query was taking about 15 seconds with "exists" but it only takes about 20ms with "in". We are running on 6.0. The query with exists is: # select cpg.page_gen_id as c_1, ao.object_type as c_2, ao.display_name as c_3, ao.default_domain_class as c_4, vo.is_deleted as c_5, ci.version as c_6, ci.name as c_7, ci.language as c_8, ci.ancestors as c_9, cp.title as c_10, cp.launch_date as c_11, ca.lead as c_12, cpg.subtitle as c_13, cpg.text as c_14, cpg.author as c_15, cpg.dategen as c_16, cpg.template_id as c_17, cpg.present_on_home_p as c_18, cpg.url as c_19, cpg.paragraph as c_20 from cf_page_gen cpg, acs_objects ao, vc_objects vo, cms_items ci, cms_pages cp, ct_articles ca where ao.object_id = cpg.page_gen_id and vo.object_id = cpg.page_gen_id and ci.item_id = cpg.page_gen_id and cp.item_id = cpg.page_gen_id and ca.item_id = cpg.page_gen_id and (exists ( select "subquery_id" from (select "ITEM_ID" as "subquery_id" from ( select cms_items.item_id from cms_items, cms_bundles, cms_items bi where bi.parent_id in (select item_id from cms_items where ancestors like ? || '%') and cms_bundles.bundle_id = cms_items.parent_id and cms_bundles.bundle_id = bi.item_id and cms_items.version = ?) "insub1" ) "insub2" where "insub2"."subquery_id" = cpg.page_gen_id) and cpg.present_on_home_p = ? and exists ( select "subquery_id" from (select "ITEM_ID" as "subquery_id" from ( select substr(max(to_char(start_date_time, 'YYYY-MM-DD HH24:mm:ss')||map.item_id),20) as item_id from acs_object_lifecycle_map map, lifecycles, cf_page_gen, cms_items where lifecycles.cycle_id = map.cycle_id and cms_items.item_id = map.item_id and cf_page_gen.present_on_home_p = '1' and cf_page_gen.page_gen_id in (select substr(min(to_char(rank,'099')||cii.item_id),5) as item_id from user_lng_prefs ul, cms_items cii where cii.language = ul.language and cms_items.parent_id = cii.parent_id) and cf_page_gen.page_gen_id = cms_items.item_id and ancestors like ? || '%') "insub1" ) "insub2" where "insub2"."subquery_id" = cpg.page_gen_id)) order by ao.display_name # Bindvars: {4=615/614/5011/, 3=1, 2=live, 1=615/614/5011/} Operation Object ------------------------------------------------------------------------------------------ ------------------------------ SELECT STATEMENT () SORT (AGGREGATE) FILTER () NESTED LOOPS () NESTED LOOPS () NESTED LOOPS () NESTED LOOPS () NESTED LOOPS () TABLE ACCESS (FULL) CT_ARTICLES TABLE ACCESS (BY INDEX CF_PAGE_GEN INDEX (UNIQUE SCAN) CF_PAG_GEN_PAGE_GEN_ID_P_ON442 Operation Object ------------------------------------------------------------------------------------------ ------------------------------ INDEX (UNIQUE SCAN) CMS_PAGES_ITEM_ID_P_RNEE1 INDEX (UNIQUE SCAN) CMS_ITEMS_PK INDEX (UNIQUE SCAN) VC_OBJECTS_PK INDEX (UNIQUE SCAN) ACS_OBJECTS_OBJECT_ID_P_HHKB1 NESTED LOOPS () NESTED LOOPS () NESTED LOOPS () TABLE ACCESS (BY INDEX R CMS_ITEMS INDEX (UNIQUE SCAN) CMS_ITEMS_PK INDEX (UNIQUE SCAN) CMS_BUNDLES_BUNDLE_ID_P_HM39D TABLE ACCESS (BY INDEX RO CMS_ITEMS Operation Object ------------------------------------------------------------------------------------------ ------------------------------ INDEX (UNIQUE SCAN) CMS_ITEMS_PK TABLE ACCESS (BY INDEX ROW CMS_ITEMS INDEX (UNIQUE SCAN) CMS_ITEMS_PK VIEW () FILTER () SORT (AGGREGATE) NESTED LOOPS () TABLE ACCESS (BY INDEX LIFECYCLES INDEX (UNIQUE SCAN) LIFECYCLES_CYCLE_ID_P_8JYYQ NESTED LOOPS () NESTED LOOPS () Operation Object ------------------------------------------------------------------------------------------ ------------------------------ TABLE ACCESS (BY INDE CMS_ITEMS INDEX (RANGE SCAN) CMS_ITEMS_ANCESTORS_IDX TABLE ACCESS (BY INDE CF_PAGE_GEN INDEX (UNIQUE SCAN) CF_PAG_GEN_PAGE_GEN_ID_P_ON442 SORT (AGGREGATE) NESTED LOOPS () TABLE ACCESS (BY CMS_ITEMS INDEX (RANGE SCA CMS_ITEMS_PARENT_ID_IDX TABLE ACCESS (BY USER_LNG_PREFS INDEX (UNIQUE SC USER_LNG_PREFS_PK TABLE ACCESS (BY INDEX ACS_OBJECT_LIFECYCLE_MAP Operation Object ------------------------------------------------------------------------------------------ ------------------------------ INDEX (RANGE SCAN) ACS_OBJECT_CYCL_MAP_ITM_IDX The "in" query: select cpg.page_gen_id as c_1, ao.object_type as c_2, ao.display_name as c_3, ao.default_domain_class as c_4, vo.is_deleted as c_5, ci.version as c_6, ci.name as c_7, ci.language as c_8, ci.ancestors as c_9, cp.title as c_10, cp.launch_date as c_11, ca.lead as c_12, cpg.subtitle as c_13, cpg.text as c_14, cpg.author as c_15, cpg.dategen as c_16, cpg.template_id as c_17, cpg.present_on_home_p as c_18, cpg.url as c_19, cpg.paragraph as c_20 from cf_page_gen cpg, acs_objects ao, vc_objects vo, cms_items ci, cms_pages cp, ct_articles ca where ao.object_id = cpg.page_gen_id and vo.object_id = cpg.page_gen_id and ci.item_id = cpg.page_gen_id and cp.item_id = cpg.page_gen_id and ca.item_id = cpg.page_gen_id and (cpg.present_on_home_p = ? and cpg.page_gen_id in ( select cms_items.item_id from cms_items, cms_bundles, cms_items bi where bi.parent_id in (select item_id from cms_items where ancestors like ? || '%') and cms_bundles.bundle_id = cms_items.parent_id and cms_bundles.bundle_id = bi.item_id and cms_items.version = ?) and cpg.page_gen_id in ( select substr(max(to_char(start_date_time, 'YYYY-MM-DD HH24:mm:ss')||map.item_id),20) as item_id from acs_object_lifecycle_map map, lifecycles, cf_page_gen, cms_items where lifecycles.cycle_id = map.cycle_id and cms_items.item_id = map.item_id and cf_page_gen.present_on_home_p = '1' and cf_page_gen.page_gen_id in (select substr(min(to_char(rank,'099')||cii.item_id),5) as item_id from user_lng_prefs ul, cms_items cii where cii.language = ul.language and cms_items.parent_id = cii.parent_id) and cf_page_gen.page_gen_id = cms_items.item_id and ancestors like ? || '%') ) order by ao.display_name BINDS: {4=615/614/5011/, 3=live, 2=615/614/5011/, 1=1} Operation Object ------------------------------------------------------------------------------------------ ------------------------------ SELECT STATEMENT () MERGE JOIN () NESTED LOOPS () NESTED LOOPS () NESTED LOOPS () NESTED LOOPS () NESTED LOOPS () TABLE ACCESS (BY INDEX CF_PAGE_GEN INDEX (UNIQUE SCAN) CF_PAG_GEN_PAGE_GEN_ID_P_ON442 SORT (AGGREGATE) NESTED LOOPS () Operation Object ------------------------------------------------------------------------------------------ ------------------------------ NESTED LOOPS () NESTED LOOPS () TABLE ACCESS (BY CMS_ITEMS INDEX (RANGE SCA CMS_ITEMS_ANCESTORS_IDX TABLE ACCESS (BY CF_PAGE_GEN INDEX (UNIQUE SC CF_PAG_GEN_PAGE_GEN_ID_P_ON442 SORT (AGGREGATE NESTED LOOPS ( TABLE ACCESS CMS_ITEMS INDEX (RANGE CMS_ITEMS_PARENT_ID_IDX TABLE ACCESS USER_LNG_PREFS Operation Object ------------------------------------------------------------------------------------------ ------------------------------ INDEX (UNIQU USER_LNG_PREFS_PK TABLE ACCESS (BY I ACS_OBJECT_LIFECYCLE_MAP INDEX (RANGE SCAN ACS_OBJECT_CYCL_MAP_ITM_IDX TABLE ACCESS (BY IN LIFECYCLES INDEX (UNIQUE SCAN LIFECYCLES_CYCLE_ID_P_8JYYQ TABLE ACCESS (BY INDEX CT_ARTICLES INDEX (UNIQUE SCAN) CT_ARTICLES_ITEM_ID_P_2BE9I TABLE ACCESS (BY INDEX R CMS_PAGES INDEX (UNIQUE SCAN) CMS_PAGES_ITEM_ID_P_RNEE1 TABLE ACCESS (BY INDEX RO CMS_ITEMS INDEX (UNIQUE SCAN) CMS_ITEMS_PK Operation Object ------------------------------------------------------------------------------------------ ------------------------------ TABLE ACCESS (BY INDEX ROW VC_OBJECTS INDEX (UNIQUE SCAN) VC_OBJECTS_PK TABLE ACCESS (BY INDEX ROWI ACS_OBJECTS INDEX (UNIQUE SCAN) ACS_OBJECTS_OBJECT_ID_P_HHKB1 FIRST ROW () VIEW () VW_NSO_1 NESTED LOOPS () NESTED LOOPS () NESTED LOOPS () TABLE ACCESS (BY INDEX CMS_ITEMS INDEX (RANGE SCAN) CMS_ITEMS_VMI_IDX Operation Object ------------------------------------------------------------------------------------------ ------------------------------ INDEX (UNIQUE SCAN) CMS_BUNDLES_BUNDLE_ID_P_HM39D TABLE ACCESS (BY INDEX R CMS_ITEMS INDEX (UNIQUE SCAN) CMS_ITEMS_PK TABLE ACCESS (BY INDEX RO CMS_ITEMS INDEX (UNIQUE SCAN) CMS_ITEMS_PK
A summary is thus: On PG 7.2 & 7.3, 'exists' is practically always faster than 'in' On PG 7.4, the reverse is true - 'in' is practically always faster than 'exists'. On Oracle, the decision depends on how the 'exists' clause is used, as this dicussion details... <aram> danpb_sloth: correlated 'exists' is faster than 'in' most of the time <danpb_sloth> aram: can you explain what you mean by 'correlated' vs 'non-correlated' ? <aram> k, not correlated exists example: select col1 from tab1 where col2 in (select col3 from tabl2) <aram> correlated exists example: select col1 from tab1 where exists (select col3 from tabl2 and col4 = tab1.col2) <aram> danpb_sloth: basically than __properly__ used exists with correlated suquery much faster than in <danpb_sloth> the problem is that using addInSubquery API, there is not enough info to construct a query that looks like the 2nd example <aram> danpb_sloth: but from what you are saying i understand why exists is dog slow <danpb_sloth> => addInSubquery should use 'in' on oracle <aram> danpb_sloth: completely agree now <danpb_sloth> ok, i'll post this transcript to BZ <aram> non correlated 'exists' not good most of the time <aram> danpb_sloth: see the main difference 'in' tries to be computed once for dureation of sql statement <aram> danpb_sloth: 'exists' for every row So since we are unable to construct a correlated 'exists' query using the information available, for Oracle we should revert to using 'in' for the 'addInSubquery' method.
Maybe I'm confused about what exactly Aram means by correlated vs non correlated, but based on the examples he gives I don't actually understand how the exists we generate is non correlated. Based on simple textual mapping it would seem to correspond to his example of a correlated exists.
This page seems to have a longer description: http://www.smart-soft.co.uk/Oracle/oracle-performance-tuning-part7.htm There are a bunch of other resources from Google http://www.google.com/search?q=correlated+subquery http://www.google.com/search?q=correlated+subquery+performance
All the definitions seem to agree that the subquery we're using inside the exists is in fact a correlated subquery. So the rule of thumb would say that exists the way we use it should be faster than in. Since this is empirically demonstrated to be false in some cases this makes me think we're using the wrong thumb.
This post would seem to suggest that the determining criteria (at least for postgres) is the number of rows returned by the subquery: http://archives.postgresql.org/pgsql-performance/2002-12/msg00185.php This makes some sense to me since any occurance of an "in" expression can be trivially transformed into an exists of a correlated subquery, so using whether the subquery is correlated or not to choose between "in" and "exists" doesn't make much sense since the "exists" subquery will be guaranteed to be correlated if it is derived from an "in" expression.
> All the definitions seem to agree that the subquery we're using > inside the exists is in fact a correlated subquery. The other possibilities are that we're lacking some important indexes in all our slow queries or that the wrapped select ... from (select ... from (...subquery...) ...) is screwing up the optimizer. Both seem rather unlikely to me. > This post would seem to suggest that the determining criteria (at > least for postgres) is the number of rows returned by the subquery: It seems unlikely that persistence can't possibly know at the time how many rows a subquery will return, since its not got the DB's schema statistics to hand. An application developer /might/ have enough knowledge to make an educated guess in some cases but then we'd be putting the burden of deciding best impl per DB & version into application code. From the mailing thread > The speed gain for IN on small lists is not as dramatic as the speed > loss for EXISTS on large lists. So, IMO, we should focus on making sure that we avoid the dramatic loss exhibited by 'exists' on oracle & accept a small penalty that 'in' may impose (i say 'may', since I've actually yet found a query on Oracle in WAF where 'exists' is faster as its reputed to be). As a point of interest from http://archives.postgresql.org/pgsql-performance/2002-12/msg00195.php > > I wonder if "[NOT] IN (subselect)" could be improved with a hash > > table in similar fashion to the hash aggregate solution Tom > > recently implemented? > > Its being worked on > > Assuming I get this done, the conventional wisdom that "EXISTS > outperforms IN" will be stood on its head Seems, like Tom got this done for PG 7.4, so good bye conventional wisdom :-)
Here is a simpler exists query that is slow on postgres: -- ID: #7 -- Duration: 4238ms select st_.display_name as c_1, st_.object_id as c_2, st_.default_domain_class as c_3, st_.object_type as c_4 from ( select a.object_id, a.object_type, a.display_name, a.default_domain_class from acs_objects a where exists (select 1 from cms_items i where i.item_id = a.object_id and i.version = 'live' and i.parent_id = '5521' and name = 'admin')) st_; generated by logging in as site-wide admin to /content/admin/
> First, how would you choose between using in and exists in > a query specifically formulated for postgres versions ranging from 7.2 > to 7.4? IN (sub-select) is generally to be avoided in PG versions before 7.4, but I think it should perform all right (as well as, or possibly better than an equivalent EXISTS) in most cases in 7.4. You should run some tests to see how your particular queries behave, though. Note that IN (list-of-values) is a quite different critter, but I assume that's not what you're talking about. > Second, as someone with a reasonable understanding of sql semantics, but > no direct understanding of sql implementations its not at all obvious to > me why there should be such a large difference in performance between a > query using in and an equivalent query formulated to use exists. Well, the reason it sucked in PG < 7.4 was that there was some optimization for the EXISTS case and none whatever for the IN case. Specifically, IN was always done by re-scanning the subquery for each outer row until we found a match or exhausted the subquery. That's more or less equivalent to the stupidest form of nested-loop join. The EXISTS cases that work reasonably well in PG (this is about the same with any recent version) are of the form WHERE ... EXISTS(SELECT ... FROM inner WHERE inner.id = outer.id) where there is an index on inner.id. This structure will typically be turned into an index probe on inner.id for each row of the outer query; so you don't scan all of the inner table, with savings proportional to the size of the inner table. In 7.4 PG can apply a similar optimization to IN, and we can also consider merge and hash joins for IN, so there's a great deal more flexibility in the planner. It sounds like Oracle is smart about IN and not very smart about EXISTS. It may be implementing EXISTS as a nestloop even when better possibilities are available. Your bugzilla discussion quickly dismissed the idea that Oracle's planner might get fooled by nested sub-selects in this context, but I'm not sure that idea is wrong. > I would greatly appreciate any insight you might have on this issue, > or any reading you could recommend on the subject. There's a wealth of info available in the Postgres archives (particularly pgsql-performance), but it's mostly PG-specific. No doubt you can find Oracle-specific tuning info fairly readily as well. I dunno where you go to learn about cross-DB techniques per se, but perhaps you can find overlap between PG and Oracle material. As far as the particular bug goes, I think you have little choice but to emit DB-dependent SQL. You clearly want to use IN for Oracle, and I think you will find that it's also the preferred choice for PG >= 7.4 (but you'd best check, especially with such complex queries). You will need to keep using the EXISTS format for PG < 7.4, though, for as long as you want to support those versions. regards, tom lane
Looks like we're going to have to add both a db and db version dependent default in order to get the behavior we want, and we may at some point want to add an option in PDL that allows the behavior to be explicitly specified for a given query.
Also, see Aram's observation about PostgreSQL: http://post-office.corp.redhat.com/archives/ccm-engineering-list/2004-February/msg00000.html
Fixed on the trunk in change 40807. The addInSubqueryFilter method is only used in a handful of places: cms/src/com/arsdigita/cms/contenttypes/Link.java cms/src/com/arsdigita/cms/search/IntermediaQueryEngine.java cms/src/com/arsdigita/cms/ui/authoring/ApplyWorkflowFormSection.java cms/src/com/arsdigita/cms/ui/item/Summary.java core/src/com/arsdigita/domain/DomainQuery.java core/src/com/arsdigita/kernel/Group.java core/src/com/arsdigita/search/intermedia/BaseQueryEngine.java core/src/com/arsdigita/search/lucene/Indexer.java core/src/com/arsdigita/ui/admin/GroupTreeModel.java I only verified that the fix does not cause any apparent damage in two cases: (a) com.arsdigita.search.lucene.Indexer (b) com.arsdigita.cms.ui.item.Summary The Indexer class runs in a background Timer thread. The Summary class exercises its inSubqueryFilter when you view a content item's Summary tab. (I think the item needs to be published first. Viewing the summary of an item that has not been published yet does not seem to trigger the execution of inSubqueryFilter.) For Oracle, the lucene Indexer's query changed from |select ld.document_id as c_1, | ld.type as c_2, | ld.type_info as c_3, | ld.timestamp as c_4, | ld.dirty as c_5, | ld.is_deleted as c_6, | ld.language as c_7, | ld.country as c_8, | ld.title as c_9, | ld.summary as c_10, | ld.content as c_11, | ld.creation_date as c_12, | ld.creation_party as c_13, | ld.last_modified_date as c_14, | ld.last_modified_party as c_15 |from lucene_docs ld |where (exisTs ( select subquery_id from (select document_id as subquery_id from ( | select document_id | from lucene_docs | where bitand(dirty, ?) > 0) insub1 ) insub2 where insub2.subquery_id = ld.document_id)) to |select ld.document_id as c_1, | ld.type as c_2, | ld.type_info as c_3, | ld.timestamp as c_4, | ld.dirty as c_5, | ld.is_deleted as c_6, | ld.language as c_7, | ld.country as c_8, | ld.title as c_9, | ld.summary as c_10, | ld.content as c_11, | ld.creation_date as c_12, | ld.creation_party as c_13, | ld.last_modified_date as c_14, | ld.last_modified_party as c_15 |from lucene_docs ld |where (ld.document_id in (seLect document_id from ( | select document_id | from lucene_docs | where bitand(dirty, ?) > 0) insub)) (Grep for "exisTs" and "seLect" to find the respective subquery clauses.)
QA_READY has been deprecated in favor of ON_QA. Please use ON_QA in the future. Moving to ON_QA.
Closing old tickets