Bug 1022279

Summary: Errata queries constantly running
Product: [Community] Spacewalk Reporter: JDavis4102
Component: ServerAssignee: Michael Mráka <mmraka>
Status: CLOSED CURRENTRELEASE QA Contact: Red Hat Satellite QA List <satqe-list>
Severity: high Docs Contact:
Priority: unspecified    
Version: 2.1CC: akrherz, JDavis4102, jfenal, ldelouw, mhuth, mmello, xdmoon
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: spacewalk-java-2.1.64-1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1025636 (view as bug list) Environment:
Last Closed: 2014-03-04 13:06:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1025636, 1069560    
Attachments:
Description Flags
query and query plan none

Description JDavis4102 2013-10-22 22:15:54 UTC
Description of problem: Errata tasks in taskomatic never seem to finish and in the DB logs I am seeing the following query all the time. Seems like it never finishes and usually takes up 2 cores and 100% cpu on those two cores.

SELECT DISTINCT S.id AS server_id, S.org_id org_id FROM rhnServer S JOIN (select rhnChannelErrata.errata_id, rhnChannelErrata.channel_id, rhnServerChannel.server_id, rhnErrataPackage.package_id from rhnChannelErrata, rhnErrataPackage, rhnChannelNewestPackage, rhnPackageEVR, rhnServerChannel, rhnServerPackage, rhnPackageUpgradeArchCompat where rhnChannelErrata.errata_id = rhnErrataPackage.errata_id and rhnChannelErrata.channel_id = rhnChannelNewestPackage.channel_id and rhnErrataPackage.package_id = rhnChannelNewestPackage.package_id and rhnChannelErrata.channel_id = rhnServerChannel.channel_id and rhnChannelNewestPackage.name_id = rhnServerPackage.name_id and rhnServerChannel.server_id = rhnServerPackage.server_id and rhnChannelNewestPackage.evr_id = rhnPackageEVR.id and rhnServerPackage.package_arch_id = rhnPackageUpgradeArchCompat.package_arch_id and rhnPackageUpgradeArchCompat.package_upgrade_arch_id = rhnCh


How reproducible:
Not sure

Steps to Reproduce:
Not sure how to reproduce except to have a lot of channels and servers (8800+)

Comment 1 daryl herzmann 2013-10-30 21:41:44 UTC
Created attachment 817650 [details]
query and query plan

Comment 2 daryl herzmann 2013-10-30 21:42:50 UTC
I've attached the offending query (took 1600 seconds) and am experimenting with adding various indices to remove those massive sequence scans!

Comment 3 daryl herzmann 2013-10-31 20:26:56 UTC
Well, even setting "set enable_seqscan='off';" did not help.  Looking at the query, it seems the SNV subquery is materializing a ton of data!  The outer query limit of SNV.channel_id = and SNV.errata_id = does not make it down into the subquery.  For a test, I modified the sub query and added the explicit (and rhnChannelErrata.errata_id = 19411). The query returned in a second or so.... 

I am not sure if that is the right thing to do....

Comment 4 Xixi 2013-11-01 01:45:31 UTC
See also: similar / possibly related bug currently being investigated for Red Hat Satellite 5.6 (the highlighted queries are different): 
Bug 1023557 - Satellite 5.6 postgres takes a long time to satellite-sync

Comment 5 Michael Mráka 2013-11-04 11:59:17 UTC
This seems to be (again) issue with PostgreSQL 8.4 optimizer which can't "unfold" inner sub-queries so it starts execution with  cartesian joins in these sub-queries and then later on filtering out majority of the rows...


From explain plan problematic parts, according to number of rows, are

a)
              ->  Hash  (cost=488538.14..488538.14 rows=31777694 width=60)
                     ->  Hash Join  (cost=31433.37..488538.14 rows=31777694 width=60)
                           Hash Cond: (csc.channel_id = ocnp.channel_id)

which corresponds to the third "not exits" sub-query (-- rule out records where other channel provides the same package name with higher evr),

b)
                           ->  Materialize  (cost=0.00..284594.16 rows=11626006 width=26)
                                 ->  Nested Loop  (cost=0.00..168334.10 rows=11626006 width=26)
                                       ->  Index Scan using rhn_sc_sid_cid_uq on rhnserverchannel csc  (cost=0.00..237.98 rows=3229 width=18)
                                       ->  Index Scan using rhn_cp_cp_uq on rhnchannelpackage cp  (cost=0.00..32.42 rows=1571 width=15)
                                             Index Cond: (cp.channel_id = csc.channel_id)

which corresponds to the first "not exits" sub-query (-- rule out records where channels with lower id exist == get min)

and c)
                     ->  Hash  (cost=130497.67..130497.67 rows=5737384 width=60)
                           ->  Hash Join  (cost=754.39..130497.67 rows=5737384 width=60)
                                 Hash Cond: (sp.package_arch_id = puac.package_upgrade_arch_id)
                                 ->  Hash Join  (cost=750.91..37615.95 rows=946784 width=60)
                                       Hash Cond: (sp.evr_id = sevr.id)

which corresponds to the second "not exits" sub-query (-- rule out records where the server has the same package name with higher evr).

Comment 6 Michael Mráka 2013-11-04 12:03:04 UTC
For a) and b) we can add condition to filter out just necessary channels,
i.e.
  AND CP.CHANNEL_ID = :channel_id
resp.
  AND CSC.CHANNEL_ID = :channel_id

Comment 7 Michael Mráka 2013-11-04 12:09:16 UTC
The changes from comment #6 actually make huge difference on my test system
from original

                                    QUERY PLAN
---------------------------------------------------------------
 Unique  (cost=2172993.26..3044634.48 rows=1 width=18) (actual time=673.244..673.244 rows=0 loops=1)

to new
                                 QUERY PLAN
--------------------------------------------
 Unique  (cost=711198.84..711198.85 rows=1 width=18) (actual time=22.454..22.454 rows=0 loops=1)

which is 3 times lower cost / faster.

Comment 8 Michael Mráka 2013-11-04 12:18:20 UTC
The c) is more tricky because there's no channel_id nor errata_id we can use to downsize query results. So we actually have to add some more tables to filter out redundant packages. E.g.

                -- rule out records where the server has the same package name with higher evr
                select 1 from rhnServerPackage sp, rhnPackageEVR sevr, rhnPackageUpgradeArchCompat puac
+, RHNERRATAPACKAGE JEP, RHNPACKAGE JP
                where rhnServerPackage.server_id = sp.server_id
+AND JEP.ERRATA_ID = :errata_id
+AND JEP.PACKAGE_ID = JP.ID
+AND JP.NAME_ID = SP.NAME_ID
                        and rhnServerPackage.name_id = sp.name_id
                        and sp.evr_id = sevr.id
                        and rhnPackageEVR.evr <= sevr.evr
                        and rhnServerPackage.package_arch_id = puac.package_arch_id
                        and puac.package_upgrade_arch_id = sp.package_arch_id

Comment 9 Michael Mráka 2013-11-04 12:22:40 UTC
After comment #8 change explain plan shows even
                                                                                                                   QUERY PLAN
---------------------------------------------------------------
 Unique  (cost=431174.58..431174.59 rows=1 width=18) (actual time=626.219..626.219 rows=0 loops=1)

which is 5 times better than original one.

Comment 10 Michael Mráka 2013-11-04 12:34:37 UTC
Fixed in spacewalk master with
commit ccd106a5ca81fb7a45ddcbe9c01c6e7dcb4bcb40
    1022279 - modified query to work better with PostgreSQL 8.4 optimizer
commit e431dc48db2d5197a1ef1888c6ef14fcfe869730
    1022279 - added hints for PostgreSQL 8.4 optimizer

Comment 12 JDavis4102 2014-01-15 23:34:13 UTC
Issues seems to not be fixed Still having issues with the following Errata query:

SELECT DISTINCT +
| | S.id AS server_id, S.org_id org_id +
| | FROM rhnServer S +
| | JOIN ( +
| | -- +
| | select rhnChannelErrata.errata_id, rhnChannelErrata.channel_id, rhnServerChannel.server_id, rhnErrataPackage.package_id+
| | from rhnChannelErrata, rhnErrataPackage, rhnChannelNewestPackage, rhnPackageEVR, +
| | rhnServerChannel, rhnServerPackage, rhnPackageUpgradeArchCompat +
| | where rhnChannelErrata.errata_id = rhnErrataPackage.errata_id +
| | -- +
| | and rhnChannelErrata.channel_id = rhnChannelNewestPackage.channel_id +
| | and rhnErrataPackage.package_id = rhnChannelNewestPackage.package_id +
| | -- +
| | and rhnChannelErrata.channel_id = rhnServerChannel.channel_id +
| | and rhnChannelNewestPackage.name_id = rhnServerPackage.name_id +
| | and rhnServerChannel.server_id = rhnServerPackage.server_id +
| | -- +
| | and rhnChannelNewestPackage.evr_id = rhnPackageEVR.id +
| | -- +
| | and rhnServerPackage.package_arch_id = rhnPackageUpgradeArchCompat.package_arch_id +
| | and rhnPackageUpgradeArchCompat.package_upgrade_arch_id = rhnCh

Comment 13 Matej Kollar 2014-01-17 12:15:41 UTC
Switching MODIFIED Spacewalk bugs to ON_QA before 2.1 release.

Comment 14 JDavis4102 2014-01-17 16:48:12 UTC
This bug is not resolved and needs further research. As the issue is not fixed. This bug needs to be reverted back to open and not modified.

Comment 15 Matej Kollar 2014-03-04 13:06:43 UTC
Spacewalk 2.1 has been released.
https://fedorahosted.org/spacewalk/wiki/ReleaseNotes21

Comment 16 Matej Kollar 2014-03-04 13:08:37 UTC
Spacewalk 2.1 has been released.
https://fedorahosted.org/spacewalk/wiki/ReleaseNotes21