Bug 110509 - SQL generated for in subquery filter using 'exists' is much slower than old SQL using 'in' on Oracle
Summary: SQL generated for in subquery filter using 'exists' is much slower than old S...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Web Application Framework
Classification: Retired
Component: persistence
Version: 5.2
Hardware: All
OS: Linux
medium
high
Target Milestone: ---
Assignee: Vadim Nasardinov
QA Contact: Jon Orris
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2003-11-20 15:26 UTC by Daniel Berrangé
Modified: 2007-04-18 16:59 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-09-02 17:44:50 UTC
Embargoed:


Attachments (Terms of Use)
Query using 'in' filter as generated by 5.0 persistence (2.39 KB, text/plain)
2003-11-20 15:27 UTC, Daniel Berrangé
no flags Details
Query using 'exists' filter as generated by 5.2 persistence (2.71 KB, text/plain)
2003-11-20 15:27 UTC, Daniel Berrangé
no flags Details

Description Daniel Berrangé 2003-11-20 15:26:06 UTC
Description of problem:
Some time ago, in p4 change 28606, the implementation of
inSubqueryFilter was changed so that instead of generating SQL looking
like:

    foo.id in (....subquery....)

it instead generates

    exists (select subquery_id from (...subquery...) where foo.id = 
subquery_id)

This was specifically to solve known performance problems on
PostgreSQL 7.2:

     "Made in subquery filters generate SQL that uses exists instead
      of in. This is known to be faster on postgres and is reputed
      to be faster on oracle."

Unfortunately an increasing number of our Oracle APLAWS installations
are having performance problems as the size of their dataset increases. 

Take the following query, for example,

select *
from (select outerResults.*,
             rownum as fakeRownum
      from (select *
            from (select acs_objects.default_domain_class as
cs_objcts__dflt_dmn_clss,
                         acs_objects.display_name as
acs_objects__display_name,
                         acs_objects.object_id as object_id,
                         acs_objects.object_type as
acs_objects__object_type,
                         cms_items.ancestors as cms_items__ancestors,
                         cms_items.name as cms_items__name,
                         cms_items.version as cms_items__version,
                         cms_pages.launch_date as cms_pages__launch_date,
                         cms_pages.title as cms_pages__title,
                         vc_objects.is_deleted as vc_objects__is_deleted
                  from acs_objects acs_objects,
                       cms_items cms_items,
                       cms_pages cms_pages,
                       vc_objects vc_objects
                  where cms_items.item_id=cms_pages.item_id
                  and vc_objects.object_id=acs_objects.object_id
                  and vc_objects.object_id=cms_items.item_id) results
            where ((cms_items__version='live'
                    and vc_objects__is_deleted='0'
                    and not exists (select subquery_id
                                    from (select version_id as subquery_id
                                          from (select v.version_id
                                                from cms_version_map v,
                                                    
cat_cat_subcat_trans_index t,
                                                    
cat_object_category_map m
                                                where
v.item_id=m.object_id
                                                and
m.category_id=t.subcategory_id
                                                and t.category_id='41557'
                                                and m.index_p='1')
insub1) insub2
                                    where insub2.subquery_id=object_id)
                    and exists (select subquery_id
                                from (select object_id as subquery_id
                                      from (select m.object_id
                                            from
cat_cat_subcat_trans_index t,
                                                 cat_object_category_map m
                                            where t.category_id='41557'
                                            and
t.subcategory_id=m.category_id) insub1) insub2
                                where insub2.subquery_id=object_id)))
            order by object_id desc) outerResults
      where rownum<'31')
where fakeRownum>='1';


This takes about 6 seconds to run on Oracle. The explain plan for this
shows full table scans on acs_objects, cms_items, cms_pages & a 'fast
full scan' on vc_objects:

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=241 Card=22 Bytes=64
          944)

   1    0   VIEW (Cost=241 Card=22 Bytes=64944)
   2    1     COUNT (STOPKEY)
   3    2       VIEW (Cost=241 Card=22 Bytes=64658)
   4    3         SORT (ORDER BY STOPKEY) (Cost=109 Card=22 Bytes=4510
          )

   5    4           FILTER
   6    5             HASH JOIN (Cost=107 Card=22 Bytes=4510)
   7    6               HASH JOIN (Cost=101 Card=51 Bytes=8262)
   8    7                 TABLE ACCESS (FULL) OF 'ACS_OBJECTS' (Cost=6
          1 Card=103 Bytes=10609)

   9    7                 HASH JOIN (Cost=39 Card=9164 Bytes=540676)
  10    9                   INDEX (FAST FULL SCAN) OF 'VC_OBJECTS_DELE
          TED_IDX' (NON-UNIQUE) (Cost=4 Card=18328 Bytes=128296)

  11    9                   TABLE ACCESS (FULL) OF 'CMS_ITEMS' (Cost=1
          9 Card=9164 Bytes=476528)

  12    6               TABLE ACCESS (FULL) OF 'CMS_PAGES' (Cost=5 Car
          d=3844 Bytes=165292)

  13    5             NESTED LOOPS (Cost=4 Card=1 Bytes=32)
  14   13               NESTED LOOPS (Cost=4 Card=1 Bytes=22)
  15   14                 TABLE ACCESS (BY INDEX ROWID) OF 'CMS_VERSIO
          N_MAP' (Cost=2 Card=1 Bytes=10)

  16   15                   INDEX (UNIQUE SCAN) OF 'CMS_VERSION_MAP_VE
          RSION_ID_UN' (UNIQUE) (Cost=1 Card=10227)

  17   14                 TABLE ACCESS (BY INDEX ROWID) OF 'CAT_OBJECT
          _CATEGORY_MAP' (Cost=2 Card=1 Bytes=12)

  18   17                   INDEX (RANGE SCAN) OF 'CAT_OBJ_CAT_MAP_OBJ
          ECT_ID_IDX' (NON-UNIQUE) (Cost=1 Card=2)

  19   13               INDEX (UNIQUE SCAN) OF 'CAT_CAT_SUBCAT_INDEX_P
          K' (UNIQUE)

  20    5             NESTED LOOPS (Cost=2 Card=2 Bytes=40)
  21   20               TABLE ACCESS (BY INDEX ROWID) OF 'CAT_OBJECT_C
          ATEGORY_MAP' (Cost=2 Card=2 Bytes=20)

  22   21                 INDEX (RANGE SCAN) OF 'CAT_OBJ_CAT_MAP_OBJEC
          T_ID_IDX' (NON-UNIQUE) (Cost=1 Card=2)

  23   20               INDEX (UNIQUE SCAN) OF 'CAT_CAT_SUBCAT_INDEX_P
          K' (UNIQUE)


The first change we made was to alter the way the exist clause was
written so that instead of:

   where insub2.subquery_id=object_id
 
It joined to a lower-table - cms_items.item_id. This  removed the full
table scans on all but cms_items. The total query time was still 0.7
of a second.

Remembering that inSubqueryFilter was implemented differently on 5.0
(which we are just upgrading this customer from), we replaced the
'exists' with the equivalent 'in' clause as generated by 5.0
persistence. The query now looks like:

select *
from (select outerResults.*,
             rownum as fakeRownum
      from (select *
            from (select acs_objects.default_domain_class as
cs_objcts__dflt_dmn_clss,
                         acs_objects.display_name as
acs_objects__display_name,
                         acs_objects.object_id as object_id,
                         acs_objects.object_type as
acs_objects__object_type,
                         cms_items.ancestors as cms_items__ancestors,
                         cms_items.name as cms_items__name,
                         cms_items.version as cms_items__version,
                         cms_pages.launch_date as cms_pages__launch_date,
                         cms_pages.title as cms_pages__title,
                         vc_objects.is_deleted as vc_objects__is_deleted
                  from acs_objects acs_objects,
                       cms_items cms_items,
                       cms_pages cms_pages,
                       vc_objects vc_objects
                  where cms_items.item_id=cms_pages.item_id
                  and vc_objects.object_id=acs_objects.object_id
                  and vc_objects.object_id=cms_items.item_id) results
            where ((cms_items__version='live'
                    and vc_objects__is_deleted='0'
                    and not exists (select subquery_id
                                    from (select version_id as subquery_id
                                          from (select v.version_id
                                                from cms_version_map v,
                                                    
cat_cat_subcat_trans_index t,
                                                    
cat_object_category_map m
                                                where
v.item_id=m.object_id
                                                and
m.category_id=t.subcategory_id
                                                and t.category_id='41557'
                                                and m.index_p='1')
insub1) insub2
                                    where insub2.subquery_id=object_id)
                    and exists (select subquery_id
                                from (select object_id as subquery_id
                                      from (select m.object_id
                                            from
cat_cat_subcat_trans_index t,
                                                 cat_object_category_map m
                                            where t.category_id='41557'
                                            and
t.subcategory_id=m.category_id) insub1) insub2
                                where insub2.subquery_id=object_id)))
            order by object_id desc) outerResults
      where rownum<'31')
where fakeRownum>='1';


The explain plan for this query is:

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=85 Card=1 Bytes=13)
   1    0   SORT (AGGREGATE)
   2    1     VIEW (Cost=85 Card=3 Bytes=39)
   3    2       COUNT (STOPKEY)
   4    3         VIEW (Cost=85 Card=3)
   5    4           SORT (ORDER BY STOPKEY) (Cost=85 Card=3 Bytes=645)
   6    5             HASH JOIN (ANTI) (Cost=84 Card=3 Bytes=645)
   7    6               NESTED LOOPS (Cost=74 Card=3 Bytes=630)
   8    7                 NESTED LOOPS (Cost=68 Card=6 Bytes=1002)
   9    8                   NESTED LOOPS (Cost=56 Card=12 Bytes=1380)
  10    9                     NESTED LOOPS (Cost=31 Card=25 Bytes=2700
          )

  11   10                       VIEW OF 'VW_NSO_2' (Cost=4 Card=25 Byt
          es=125)

  12   11                         SORT (UNIQUE)
  13   12                           NESTED LOOPS (Cost=4 Card=25 Bytes
          =500)

  14   13                             INDEX (FAST FULL SCAN) OF 'CAT_O
          BJ_CAT_MAP_UN' (UNIQUE) (Cost=4 Card=7542 Bytes=75420)

  15   13                             INDEX (UNIQUE SCAN) OF 'CAT_CAT_
          SUBCAT_INDEX_PK' (UNIQUE)

  16   10                       TABLE ACCESS (BY INDEX ROWID) OF 'ACS_
          OBJECTS' (Cost=1 Card=1 Bytes=103)

  17   16                         INDEX (UNIQUE SCAN) OF 'ACS_OBJECTS_
          OBJECT_ID_P_HHKB1' (UNIQUE)

  18    9                     TABLE ACCESS (BY INDEX ROWID) OF 'VC_OBJ
          ECTS' (Cost=1 Card=1 Bytes=7)

  19   18                       INDEX (UNIQUE SCAN) OF 'VC_OBJECTS_PK'
           (UNIQUE)

  20    8                   TABLE ACCESS (BY INDEX ROWID) OF 'CMS_ITEM
          S' (Cost=1 Card=1 Bytes=52)

  21   20                     INDEX (UNIQUE SCAN) OF 'CMS_ITEMS_PK' (U
          NIQUE)

  22    7                 TABLE ACCESS (BY INDEX ROWID) OF 'CMS_PAGES'
           (Cost=1 Card=1 Bytes=43)

  23   22                   INDEX (UNIQUE SCAN) OF 'CMS_PAGES_ITEM_ID_
          P_RNEE1' (UNIQUE)

  24    6               VIEW OF 'VW_NSO_1' (Cost=9 Card=12 Bytes=60)
  25   24                 NESTED LOOPS (Cost=9 Card=12 Bytes=384)
  26   25                   HASH JOIN (Cost=9 Card=1973 Bytes=43406)
  27   26                     TABLE ACCESS (FULL) OF 'CAT_OBJECT_CATEG
          ORY_MAP' (Cost=4 Card=1021 Bytes=12252)

  28   26                     INDEX (FAST FULL SCAN) OF 'CMS_VERSION_M
          AP_ITEM_ID_PK' (UNIQUE) (Cost=4 Card=10227 Bytes=102270)

  29   25                   INDEX (UNIQUE SCAN) OF 'CAT_CAT_SUBCAT_IND
          EX_PK' (UNIQUE)


The only full table scan remaining is now on cat_object_category_map.
  Running the query now takes 0.24 seconds - which is comparable with
what we were seeing on 5.0 version of APLAWS for this customer.

We have now had the same problems on 4 major Oracle/APLAWS-5.2 based
installations. The only other Oracle/APLAWS-5.2 installs where we are
not seeing this problem are those with only a small dataset. Thus is
would appear that, while on PG, the 'in' query is best, on Oracle, we
should be using 'exists'. Thus I suggest that
c.a.persistence.InFilter.java needs to be re-written to use
appropriate syntax for each database.

One final point, the release notes for 7.4 postgresql suggest 'in'
filters will now also be faster than 'exists' for PG too. So we
probably need to check which DB version we're running on when making
the choice of what filter to use - or perhaps make it configurable.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Daniel Berrangé 2003-11-20 15:27:28 UTC
Created attachment 96087 [details]
Query using 'in' filter as generated by 5.0 persistence

Comment 2 Daniel Berrangé 2003-11-20 15:27:56 UTC
Created attachment 96088 [details]
Query using 'exists' filter as generated by 5.2 persistence

Comment 3 Daniel Berrangé 2003-11-20 16:47:39 UTC
Comments from Aram wrt to performance of 'exists' on Oracle:

<aram> exists is generally better that in, except few case
<aram> rather you shouldnt use exists if it is major clause
restricting nubmer of rows


In the above queries we are using the exists as the major restricting
clause against cms_items, which is evidently why performance is
suffering in this case.

I'm not sure what exactly we can do about this, since I'm not sure
persistence would be able to accurately guage when to use 'in' vs
'exists' on any given 'addInSubqueryFilter' call. 

However, a solution *does* need to be found since the current query is
unusably slow. Since (to my knowledge at least) we never had any
concrete performance problems on Oracle with 'in' my thought is to
simply revert to using 'in' instead of 'exists' on Oracle.

Any better ideas ?

Removing the use of these two sub-query filters isn't an option for
5.2, since that would basically require the new categorization API
from Rickshaw.

Comment 4 Daniel Berrangé 2003-11-21 15:54:32 UTC
NB, In the first comment, the 2nd query I posted was incorrect. Look
at the attachments for the real ones.

The 'addInSubqueryFilter' method is used in a very similar way to the
above example (ie filtering a large list of objects by category) &
thus have potential for this catastrophic performance in certain
scenarios:

cms/src/com/arsdigita/cms/ui/IntermediaSearchListModelBuilder.java
core/src/com/arsdigita/categorization/Category.java
forum/src/com/arsdigita/forum/Forum.java
navigation/src/com/arsdigita/london/navigation/ui/AbstractItemList.java
navigation/src/com/arsdigita/london/navigation/ui/CategoryItemNav.java
navigation/src/com/arsdigita/london/navigation/ui/CategoryTypeItemList.java
navigation/src/com/arsdigita/london/navigation/ui/ContentTypeList.java
navigation/src/com/arsdigita/london/navigation/ui/L3ContentPane.java
navigation/src/com/arsdigita/london/navigation/ui/LevelContentPane.java
navigation/src/com/arsdigita/london/navigation/ui/RecentItems.java
navigation/src/com/arsdigita/london/navigation/ui/Selection.java
navigation/src/com/arsdigita/london/navigation/ui/CategoryWorkspaceSelectionModel.java
portal/src/com/arsdigita/london/portal/ui/admin/CategoryComponent.java
search/src/com/arsdigita/london/search/IntermediaSearcher.java
search/src/com/arsdigita/london/search/Reindexer.java
subsite/src/com/arsdigita/london/subsite/IntermediaSearcherForSubsites.java

The remaining uses of addInSubqueryFilter are:

Exmination of what they are doing suggests they could well have the
same problem with 'exists' as the others.

cms/src/com/arsdigita/cms/contenttypes/ContentGroup.java
cms/src/com/arsdigita/cms/publishToFile/PublishToFile.java
content-types/src/com/arsdigita/cms/contenttypes/Link.java
core/src/com/arsdigita/kernel/Group.java
core/src/com/arsdigita/lucene/Indexer.java
core/src/com/arsdigita/ui/admin/GroupTreeModel.java

For the category base subquery filters we have verified that using
'in' instead of 'exists' gives better performance on 2 customer
deployments of Oracle 9i. We have two further customers for whom the
current 'exists' based query is giving horrific performance (ie 10->60
seconds per query), & although we have yet to verify it, we suspect
that switching back to 'in' will solve the problem, since they didn't
have such a problem while on 5.0 based APLAWS.
 


Comment 5 Matthew Booth 2003-11-21 18:31:48 UTC
This is fixed in the London 5.2 core in changelist 38219. This
produced a query which ran in 6ms rather than 6.5 seconds before the
change for one particular customer.

This changelist makes an exception of postgres, which still uses exists.

Could we also make sure this gets fixed in Rickshaw?

Comment 6 Randy Graebner 2003-12-05 15:49:23 UTC
my test database has about 7000 items in cms_items and we are seeing
similar behavior.  My query was taking about 15 seconds with "exists"
but it only takes about 20ms with "in".  We are running on 6.0.  The
query with exists is:

#

select cpg.page_gen_id as c_1,
                              ao.object_type as c_2,
                              ao.display_name as c_3,
                              ao.default_domain_class as c_4,
                              vo.is_deleted as c_5,
                              ci.version as c_6,
                              ci.name as c_7,
                              ci.language as c_8,
                              ci.ancestors as c_9,
                              cp.title as c_10,
                              cp.launch_date as c_11,
                              ca.lead as c_12,
                              cpg.subtitle as c_13,
                              cpg.text as c_14,
                              cpg.author as c_15,
                              cpg.dategen as c_16,
                              cpg.template_id as c_17,
                              cpg.present_on_home_p as c_18,
                              cpg.url as c_19,
                              cpg.paragraph as c_20
                              from cf_page_gen cpg,
                              acs_objects ao,
                              vc_objects vo,
                              cms_items ci,
                              cms_pages cp,
                              ct_articles ca
                              where ao.object_id = cpg.page_gen_id and
vo.object_id = cpg.page_gen_id and ci.item_id = cpg.page_gen_id and
cp.item_id =
                              cpg.page_gen_id and ca.item_id =
cpg.page_gen_id and (exists ( select "subquery_id" from (select "ITEM_ID"
                              as "subquery_id" from (
                              select cms_items.item_id 
                              from cms_items,
                              cms_bundles,
                              cms_items bi
                              where bi.parent_id in (select item_id
from cms_items where ancestors like ? || '%')
                              and cms_bundles.bundle_id =
cms_items.parent_id
                              and cms_bundles.bundle_id = bi.item_id
                              and cms_items.version = ?) "insub1" )
"insub2" where "insub2"."subquery_id" = cpg.page_gen_id)
                              and cpg.present_on_home_p = ? and exists
( select "subquery_id" from (select "ITEM_ID" as "subquery_id"
                              from (
                              select
substr(max(to_char(start_date_time, 'YYYY-MM-DD
HH24:mm:ss')||map.item_id),20) as item_id
                              from acs_object_lifecycle_map map,
                              lifecycles, 
                              cf_page_gen,
                              cms_items
                              where lifecycles.cycle_id = map.cycle_id
                              and cms_items.item_id = map.item_id
                              and cf_page_gen.present_on_home_p = '1'
                              and cf_page_gen.page_gen_id in (select
substr(min(to_char(rank,'099')||cii.item_id),5) as item_id
                              from user_lng_prefs ul, cms_items cii
                              where cii.language = ul.language
                              and cms_items.parent_id = cii.parent_id)
                              and cf_page_gen.page_gen_id =
cms_items.item_id
                              and ancestors like ? || '%') "insub1" )
"insub2" where "insub2"."subquery_id" = cpg.page_gen_id))
                              order by ao.display_name

# Bindvars: {4=615/614/5011/, 3=1, 2=live, 1=615/614/5011/}

Operation										   Object
------------------------------------------------------------------------------------------
------------------------------
SELECT STATEMENT ()
 SORT (AGGREGATE)
  FILTER ()
   NESTED LOOPS ()
    NESTED LOOPS ()
     NESTED LOOPS ()
      NESTED LOOPS ()
       NESTED LOOPS ()
	TABLE ACCESS (FULL)								   CT_ARTICLES
	TABLE ACCESS (BY INDEX								   CF_PAGE_GEN
	 INDEX (UNIQUE SCAN)								   CF_PAG_GEN_PAGE_GEN_ID_P_ON442

Operation										   Object
------------------------------------------------------------------------------------------
------------------------------
       INDEX (UNIQUE SCAN)								   CMS_PAGES_ITEM_ID_P_RNEE1
      INDEX (UNIQUE SCAN)								   CMS_ITEMS_PK
     INDEX (UNIQUE SCAN)								   VC_OBJECTS_PK
    INDEX (UNIQUE SCAN) 								   ACS_OBJECTS_OBJECT_ID_P_HHKB1
   NESTED LOOPS ()
    NESTED LOOPS ()
     NESTED LOOPS ()
      TABLE ACCESS (BY INDEX R								   CMS_ITEMS
       INDEX (UNIQUE SCAN)								   CMS_ITEMS_PK
      INDEX (UNIQUE SCAN)								   CMS_BUNDLES_BUNDLE_ID_P_HM39D
     TABLE ACCESS (BY INDEX RO								   CMS_ITEMS

Operation										   Object
------------------------------------------------------------------------------------------
------------------------------
      INDEX (UNIQUE SCAN)								   CMS_ITEMS_PK
    TABLE ACCESS (BY INDEX ROW								   CMS_ITEMS
     INDEX (UNIQUE SCAN)								   CMS_ITEMS_PK
   VIEW ()
    FILTER ()
     SORT (AGGREGATE)
      NESTED LOOPS ()
       TABLE ACCESS (BY INDEX								   LIFECYCLES
	INDEX (UNIQUE SCAN)								   LIFECYCLES_CYCLE_ID_P_8JYYQ
       NESTED LOOPS ()
	NESTED LOOPS ()

Operation										   Object
------------------------------------------------------------------------------------------
------------------------------
	 TABLE ACCESS (BY INDE								   CMS_ITEMS
	  INDEX (RANGE SCAN)								   CMS_ITEMS_ANCESTORS_IDX
	 TABLE ACCESS (BY INDE								   CF_PAGE_GEN
	  INDEX (UNIQUE SCAN)								   CF_PAG_GEN_PAGE_GEN_ID_P_ON442
	   SORT (AGGREGATE)
	    NESTED LOOPS ()
	     TABLE ACCESS (BY								   CMS_ITEMS
	      INDEX (RANGE SCA								   CMS_ITEMS_PARENT_ID_IDX
	     TABLE ACCESS (BY								   USER_LNG_PREFS
	      INDEX (UNIQUE SC								   USER_LNG_PREFS_PK
	TABLE ACCESS (BY INDEX								   ACS_OBJECT_LIFECYCLE_MAP

Operation										   Object
------------------------------------------------------------------------------------------
------------------------------
	 INDEX (RANGE SCAN)								   ACS_OBJECT_CYCL_MAP_ITM_IDX






The "in" query:
select cpg.page_gen_id as c_1,
                                    ao.object_type as c_2,
                                    ao.display_name as c_3,
                                    ao.default_domain_class as c_4,
                                    vo.is_deleted as c_5,
                                    ci.version as c_6,
                                    ci.name as c_7,
                                    ci.language as c_8,
                                    ci.ancestors as c_9,
                                    cp.title as c_10,
                                    cp.launch_date as c_11,
                                    ca.lead as c_12,
                                    cpg.subtitle as c_13,
                                    cpg.text as c_14,
                                    cpg.author as c_15,
                                    cpg.dategen as c_16,
                                    cpg.template_id as c_17,
                                    cpg.present_on_home_p as c_18,
                                    cpg.url as c_19,
                                    cpg.paragraph as c_20
                                    from cf_page_gen cpg,
                                    acs_objects ao,
                                    vc_objects vo,
                                    cms_items ci,
                                    cms_pages cp,
                                    ct_articles ca
                                    where ao.object_id =
cpg.page_gen_id and vo.object_id = cpg.page_gen_id and ci.item_id =
cpg.page_gen_id and cp.item_id =
                                    cpg.page_gen_id and ca.item_id =
cpg.page_gen_id and (cpg.present_on_home_p = ? and  cpg.page_gen_id in ( 
                                    select cms_items.item_id 
                                    from cms_items,
                                    cms_bundles,
                                    cms_items bi
                                    where bi.parent_id in (select
item_id from cms_items where ancestors like ? || '%')
                                    and cms_bundles.bundle_id =
cms_items.parent_id
                                    and cms_bundles.bundle_id = bi.item_id
                                    and cms_items.version = ?)  and 
cpg.page_gen_id in ( 
                                    select
substr(max(to_char(start_date_time, 'YYYY-MM-DD
HH24:mm:ss')||map.item_id),20) as item_id
                                    from acs_object_lifecycle_map map,
                                    lifecycles, 
                                    cf_page_gen,
                                    cms_items
                                    where lifecycles.cycle_id =
map.cycle_id
                                    and cms_items.item_id = map.item_id
                                    and cf_page_gen.present_on_home_p
= '1'
                                    and cf_page_gen.page_gen_id in
(select substr(min(to_char(rank,'099')||cii.item_id),5) as item_id
                                    from user_lng_prefs ul, cms_items cii
                                    where cii.language = ul.language
                                    and cms_items.parent_id =
cii.parent_id)
                                    and cf_page_gen.page_gen_id =
cms_items.item_id
                                    and ancestors like ? || '%') )
                                    order by ao.display_name
BINDS: {4=615/614/5011/, 3=live, 2=615/614/5011/, 1=1}

Operation										   Object
------------------------------------------------------------------------------------------
------------------------------
SELECT STATEMENT ()
 MERGE JOIN ()
  NESTED LOOPS ()
   NESTED LOOPS ()
    NESTED LOOPS ()
     NESTED LOOPS ()
      NESTED LOOPS ()
       TABLE ACCESS (BY INDEX								   CF_PAGE_GEN
	INDEX (UNIQUE SCAN)								   CF_PAG_GEN_PAGE_GEN_ID_P_ON442
	 SORT (AGGREGATE)
	  NESTED LOOPS ()

Operation										   Object
------------------------------------------------------------------------------------------
------------------------------
	   NESTED LOOPS ()
	    NESTED LOOPS ()
	     TABLE ACCESS (BY								   CMS_ITEMS
	      INDEX (RANGE SCA								   CMS_ITEMS_ANCESTORS_IDX
	     TABLE ACCESS (BY								   CF_PAGE_GEN
	      INDEX (UNIQUE SC								   CF_PAG_GEN_PAGE_GEN_ID_P_ON442
	       SORT (AGGREGATE
		NESTED LOOPS (
		 TABLE ACCESS								   CMS_ITEMS
		  INDEX (RANGE								   CMS_ITEMS_PARENT_ID_IDX
		 TABLE ACCESS								   USER_LNG_PREFS

Operation										   Object
------------------------------------------------------------------------------------------
------------------------------
		  INDEX (UNIQU								   USER_LNG_PREFS_PK
	    TABLE ACCESS (BY I								   ACS_OBJECT_LIFECYCLE_MAP
	     INDEX (RANGE SCAN								   ACS_OBJECT_CYCL_MAP_ITM_IDX
	   TABLE ACCESS (BY IN								   LIFECYCLES
	    INDEX (UNIQUE SCAN								   LIFECYCLES_CYCLE_ID_P_8JYYQ
       TABLE ACCESS (BY INDEX								   CT_ARTICLES
	INDEX (UNIQUE SCAN)								   CT_ARTICLES_ITEM_ID_P_2BE9I
      TABLE ACCESS (BY INDEX R								   CMS_PAGES
       INDEX (UNIQUE SCAN)								   CMS_PAGES_ITEM_ID_P_RNEE1
     TABLE ACCESS (BY INDEX RO								   CMS_ITEMS
      INDEX (UNIQUE SCAN)								   CMS_ITEMS_PK

Operation										   Object
------------------------------------------------------------------------------------------
------------------------------
    TABLE ACCESS (BY INDEX ROW								   VC_OBJECTS
     INDEX (UNIQUE SCAN)								   VC_OBJECTS_PK
   TABLE ACCESS (BY INDEX ROWI								   ACS_OBJECTS
    INDEX (UNIQUE SCAN) 								   ACS_OBJECTS_OBJECT_ID_P_HHKB1
  FIRST ROW ()
   VIEW ()										   VW_NSO_1
    NESTED LOOPS ()
     NESTED LOOPS ()
      NESTED LOOPS ()
       TABLE ACCESS (BY INDEX								   CMS_ITEMS
	INDEX (RANGE SCAN)								   CMS_ITEMS_VMI_IDX

Operation										   Object
------------------------------------------------------------------------------------------
------------------------------
       INDEX (UNIQUE SCAN)								   CMS_BUNDLES_BUNDLE_ID_P_HM39D
      TABLE ACCESS (BY INDEX R								   CMS_ITEMS
       INDEX (UNIQUE SCAN)								   CMS_ITEMS_PK
     TABLE ACCESS (BY INDEX RO								   CMS_ITEMS
      INDEX (UNIQUE SCAN)								   CMS_ITEMS_PK





Comment 7 Daniel Berrangé 2003-12-11 18:33:25 UTC
A summary is thus:

On PG 7.2 & 7.3, 'exists' is practically always faster than 'in'
On PG 7.4, the reverse is true - 'in' is practically always faster
than 'exists'.

On Oracle, the decision depends on how the 'exists' clause is used, as
this dicussion details...

<aram> danpb_sloth: correlated 'exists' is faster than 'in' most of
the time
<danpb_sloth> aram: can you explain what you mean by 'correlated' vs
'non-correlated' ?
<aram> k, not correlated exists example: select col1 from tab1 where
col2 in (select col3 from tabl2)
<aram> correlated exists example: select col1 from tab1 where exists
(select col3 from tabl2 and col4 = tab1.col2)
<aram> danpb_sloth: basically than __properly__ used exists with
correlated suquery much faster than in
<danpb_sloth> the problem is that using addInSubquery API, there is
not enough info to construct a query that looks like the 2nd example
<aram> danpb_sloth: but from what you are saying i understand why
exists is dog slow
<danpb_sloth> => addInSubquery should use 'in' on oracle
<aram> danpb_sloth: completely agree now
<danpb_sloth> ok, i'll post this transcript to BZ
<aram> non correlated 'exists' not good most of the time
<aram> danpb_sloth: see the main difference 'in' tries to be computed
once for dureation of sql statement
<aram> danpb_sloth: 'exists' for every row


So since we are unable to construct a correlated 'exists' query using
the information available, for Oracle we should revert to using 'in'
for the 'addInSubquery' method.


Comment 8 Rafael H. Schloming 2003-12-11 19:47:36 UTC
Maybe I'm confused about what exactly Aram means by correlated vs non
correlated, but based on the examples he gives I don't actually
understand how the exists we generate is non correlated. Based on
simple textual mapping it would seem to correspond to his example of a
correlated exists.

Comment 9 Daniel Berrangé 2003-12-11 20:01:56 UTC
This page seems to have a longer description:

http://www.smart-soft.co.uk/Oracle/oracle-performance-tuning-part7.htm

There are a bunch of other resources from Google

http://www.google.com/search?q=correlated+subquery
http://www.google.com/search?q=correlated+subquery+performance

Comment 10 Rafael H. Schloming 2003-12-11 21:27:56 UTC
All the definitions seem to agree that the subquery we're using inside
the exists is in fact a correlated subquery. So the rule of thumb
would say that exists the way we use it should be faster than in.
Since this is empirically demonstrated to be false in some cases this
makes me think we're using the wrong thumb.

Comment 11 Rafael H. Schloming 2003-12-12 16:46:48 UTC
This post would seem to suggest that the determining criteria (at
least for postgres) is the number of rows returned by the subquery:

http://archives.postgresql.org/pgsql-performance/2002-12/msg00185.php

This makes some sense to me since any occurance of an "in" expression
can be trivially transformed into an exists of a correlated subquery,
so using whether the subquery is correlated or not to choose between
"in" and "exists" doesn't make much sense since the "exists" subquery
will be guaranteed to be correlated if it is derived from an "in"
expression.

Comment 12 Daniel Berrangé 2003-12-12 17:18:02 UTC
> All the definitions seem to agree that the subquery we're using 
> inside the exists is in fact a correlated subquery.

The other possibilities are that we're lacking some important indexes
in all our slow queries or that the wrapped select ... from (select
... from (...subquery...) ...) is screwing up the optimizer. Both seem
rather unlikely to me.

> This post would seem to suggest that the determining criteria (at
> least for postgres) is the number of rows returned by the subquery:

It seems unlikely that persistence can't possibly know at the time how
many rows a subquery will return, since its not got the DB's schema
statistics to hand. An application developer /might/ have enough
knowledge to make an educated guess in some cases but then we'd be
putting the burden of deciding best impl per DB & version into
application code.

From the mailing thread

> The speed gain for IN on small lists is not as dramatic as the speed 
> loss for EXISTS on large lists.

So, IMO, we should focus on making sure that we avoid the dramatic
loss exhibited by 'exists' on oracle & accept a small penalty that
'in' may impose (i say 'may', since I've actually yet found a query on
Oracle in WAF where 'exists' is faster as its reputed to be).

As a point of interest from 

http://archives.postgresql.org/pgsql-performance/2002-12/msg00195.php

> > I wonder if "[NOT] IN (subselect)" could be improved with a hash
> > table in similar fashion to the hash aggregate solution Tom 
> > recently implemented?
>
> Its being worked on
> 
> Assuming I get this done, the conventional wisdom that "EXISTS
> outperforms IN" will be stood on its head

Seems, like Tom got this done for PG 7.4, so good bye conventional
wisdom :-)



Comment 13 Bryan Che 2004-01-14 16:39:02 UTC
Here is a simpler exists query that is slow on postgres:

-- ID: #7
-- Duration: 4238ms
select st_.display_name as c_1,
       st_.object_id as c_2,
       st_.default_domain_class as c_3,
       st_.object_type as c_4
from (
    select
      a.object_id, a.object_type, a.display_name, a.default_domain_class
    from
      acs_objects a
    where exists (select 1 from cms_items i
                  where i.item_id = a.object_id
		  and i.version = 'live'
		  and i.parent_id = '5521'
		  and name = 'admin')) st_;

generated by logging in as site-wide admin to /content/admin/

Comment 14 Rafael H. Schloming 2004-01-15 17:23:39 UTC
> First, how would you choose between using in and exists in
> a query specifically formulated for postgres versions ranging from 7.2
> to 7.4?

IN (sub-select) is generally to be avoided in PG versions before 7.4,
but I think it should perform all right (as well as, or possibly better
than an equivalent EXISTS) in most cases in 7.4.  You should run some
tests to see how your particular queries behave, though.

Note that IN (list-of-values) is a quite different critter, but I assume
that's not what you're talking about.

> Second, as someone with a reasonable understanding of sql semantics, but
> no direct understanding of sql implementations its not at all obvious to
> me why there should be such a large difference in performance between a
> query using in and an equivalent query formulated to use exists.

Well, the reason it sucked in PG < 7.4 was that there was some
optimization for the EXISTS case and none whatever for the IN case.
Specifically, IN was always done by re-scanning the subquery for each
outer row until we found a match or exhausted the subquery.  That's
more or less equivalent to the stupidest form of nested-loop join.

The EXISTS cases that work reasonably well in PG (this is about the same
with any recent version) are of the form
        WHERE ... EXISTS(SELECT ... FROM inner WHERE inner.id = outer.id)
where there is an index on inner.id.  This structure will typically be
turned into an index probe on inner.id for each row of the outer query;
so you don't scan all of the inner table, with savings proportional to
the size of the inner table.

In 7.4 PG can apply a similar optimization to IN, and we can also
consider merge and hash joins for IN, so there's a great deal more
flexibility in the planner.

It sounds like Oracle is smart about IN and not very smart about EXISTS.
It may be implementing EXISTS as a nestloop even when better
possibilities are available.  Your bugzilla discussion quickly dismissed
the idea that Oracle's planner might get fooled by nested sub-selects
in this context, but I'm not sure that idea is wrong.

> I would greatly appreciate any insight you might have on this issue,
> or any reading you could recommend on the subject.

There's a wealth of info available in the Postgres archives
(particularly pgsql-performance), but it's mostly PG-specific.
No doubt you can find Oracle-specific tuning info fairly readily
as well.  I dunno where you go to learn about cross-DB techniques
per se, but perhaps you can find overlap between PG and Oracle material.

As far as the particular bug goes, I think you have little choice but
to emit DB-dependent SQL.  You clearly want to use IN for Oracle, and I
think you will find that it's also the preferred choice for PG >= 7.4
(but you'd best check, especially with such complex queries).  You will
need to keep using the EXISTS format for PG < 7.4, though, for as long
as you want to support those versions.

                        regards, tom lane

Comment 15 Rafael H. Schloming 2004-01-15 17:34:43 UTC
Looks like we're going to have to add both a db and db version
dependent default in order to get the behavior we want, and we may at
some point want to add an option in PDL that allows the behavior to be
explicitly specified for a given query.

Comment 16 Richard Li 2004-02-02 21:23:42 UTC
Also, see Aram's observation about PostgreSQL:

http://post-office.corp.redhat.com/archives/ccm-engineering-list/2004-February/msg00000.html

Comment 17 Vadim Nasardinov 2004-02-26 17:49:16 UTC
Fixed on the trunk in change 40807.

The addInSubqueryFilter method is only used in a handful of places:

  cms/src/com/arsdigita/cms/contenttypes/Link.java
  cms/src/com/arsdigita/cms/search/IntermediaQueryEngine.java
  cms/src/com/arsdigita/cms/ui/authoring/ApplyWorkflowFormSection.java
  cms/src/com/arsdigita/cms/ui/item/Summary.java
  core/src/com/arsdigita/domain/DomainQuery.java
  core/src/com/arsdigita/kernel/Group.java
  core/src/com/arsdigita/search/intermedia/BaseQueryEngine.java
  core/src/com/arsdigita/search/lucene/Indexer.java
  core/src/com/arsdigita/ui/admin/GroupTreeModel.java

I only verified that the fix does not cause any apparent damage in two
cases:
  (a) com.arsdigita.search.lucene.Indexer
  (b) com.arsdigita.cms.ui.item.Summary

The Indexer class runs in a background Timer thread.

The Summary class exercises its inSubqueryFilter when you view a
content item's Summary tab.  (I think the item needs to be published
first.  Viewing the summary of an item that has not been published yet
does not seem to trigger the execution of inSubqueryFilter.)

For Oracle, the lucene Indexer's query changed from

|select ld.document_id as c_1,
|       ld.type as c_2,
|       ld.type_info as c_3,
|       ld.timestamp as c_4,
|       ld.dirty as c_5,
|       ld.is_deleted as c_6,
|       ld.language as c_7,
|       ld.country as c_8,
|       ld.title as c_9,
|       ld.summary as c_10,
|       ld.content as c_11,
|       ld.creation_date as c_12,
|       ld.creation_party as c_13,
|       ld.last_modified_date as c_14,
|       ld.last_modified_party as c_15
|from lucene_docs ld
|where (exisTs ( select subquery_id from (select document_id as subquery_id from (
|    select document_id 
|      from lucene_docs
|     where bitand(dirty, ?) > 0) insub1 ) insub2 where insub2.subquery_id = ld.document_id))

to

|select ld.document_id as c_1,
|       ld.type as c_2,
|       ld.type_info as c_3,
|       ld.timestamp as c_4,
|       ld.dirty as c_5,
|       ld.is_deleted as c_6,
|       ld.language as c_7,
|       ld.country as c_8,
|       ld.title as c_9,
|       ld.summary as c_10,
|       ld.content as c_11,
|       ld.creation_date as c_12,
|       ld.creation_party as c_13,
|       ld.last_modified_date as c_14,
|       ld.last_modified_party as c_15
|from lucene_docs ld
|where (ld.document_id in (seLect document_id from (
|    select document_id 
|      from lucene_docs
|     where bitand(dirty, ?) > 0) insub))

(Grep for "exisTs" and "seLect" to find the respective subquery
clauses.)


Comment 18 David Lawrence 2006-07-18 03:48:34 UTC
QA_READY has been deprecated in favor of ON_QA. Please use ON_QA in the future.
Moving to ON_QA.

Comment 19 Daniel Berrangé 2006-09-02 17:44:50 UTC
Closing old tickets


Note You need to log in before you can comment on or make changes to this bug.