Bug 1099938 - trying to view Extra Packages page in Satellite web UI for >1800 systems takes too long
Summary: trying to view Extra Packages page in Satellite web UI for >1800 systems take...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Spacewalk
Classification: Community
Component: WebUI
Version: 2.2
Hardware: All
OS: All
medium
medium
Target Milestone: ---
Assignee: Stephen Herr
QA Contact: Red Hat Satellite QA List
URL:
Whiteboard:
Depends On: 1075808
Blocks: space22
TreeView+ depends on / blocked
 
Reported: 2014-05-21 14:43 UTC by Stephen Herr
Modified: 2014-07-17 08:41 UTC (History)
7 users (show)

Fixed In Version: spacewalk-reports-2.2.6-1 spacewalk-java-2.2.72-1
Doc Type: Bug Fix
Doc Text:
Clone Of: 1075808
Environment:
Last Closed: 2014-07-17 08:41:19 UTC
Embargoed:


Attachments (Terms of Use)

Description Stephen Herr 2014-05-21 14:43:40 UTC
+++ This bug was initially created as a clone of Bug #1075808 +++

Description of problem:

On a Satellite 5.6 w/ embedded postgres db with 1,845 registered systems, clicking on Systems > Systems > Extra Packages gives an Internal Server Error after ~80 seconds.

Version-Release number of selected component (if applicable):

spacewalk-java-2.0.2-62.el6sat.noarch

How reproducible:

100%

Steps to Reproduce:
1.) On a Satellite w/ close to 2,000 systems, click on Systems > Systems > Extra Packages.
2.) Wait for a while, and get an Internal Server Error.

Reproduce in postgres directly, with the same query (using user_id = 1 and org_id = 1 in the example below):

****
# spacewalk-sql -i

rhnschema=#  select s.id as id,
rhnschema-#        s.name as name,
rhnschema-#        1 as selectable,
rhnschema-#        count(sp.name_id) as extra_pkg_count
rhnschema-#        from rhnServer s
rhnschema-#   join rhnUserServerPerms usp on (usp.server_id = s.id and
rhnschema(#                                   usp.user_id = 1)
rhnschema-#   join rhnServerPackage sp on (s.id = sp.server_id)
rhnschema-#        left outer join (select sc.server_id,
rhnschema(#                                cp.package_id,
rhnschema(#                                p.name_id,
rhnschema(#                                p.evr_id,
rhnschema(#                                p.package_arch_id
rhnschema(#                           from rhnPackage p,
rhnschema(#                                rhnServerChannel sc
rhnschema(#                           join rhnChannelPackage cp on (sc.channel_id = cp.channel_id)
rhnschema(#                          where cp.package_id = p.id
rhnschema(#                        ) scp on (scp.server_id = sp.server_id and
rhnschema(#                                  sp.name_id = scp.name_id and
rhnschema(#                                  sp.evr_id = scp.evr_id and
rhnschema(#                                  sp.package_arch_id = scp.package_arch_id)
rhnschema-#  where scp.package_id is null and
rhnschema-#        s.org_id = 1
rhnschema-#  group by s.name, s.id
rhnschema-#  order by name, id;


Actual results:

Internal Server Error when trying to via "Extra Packages" report in Systems tab.

Expected results:

No Internal Server Error, relatively quick page load.

Additional info:

The relevant query is:

./com/redhat/rhn/common/db/datasource/xml/System_queries.xml

****

<mode name="extra_packages_systems_count"
      class="com.redhat.rhn.frontend.dto.SystemOverview">
  <query params="userid, orgid">
select s.id as id,
       s.name as name,
       1 as selectable,
       count(sp.name_id) as extra_pkg_count
       from rhnServer s
  join rhnUserServerPerms usp on (usp.server_id = s.id and
                                  usp.user_id = :userid)
  join rhnServerPackage sp on (s.id = sp.server_id)
       left outer join (select sc.server_id,
                               cp.package_id,
                               p.name_id,
                               p.evr_id,
                               p.package_arch_id
                          from rhnPackage p,
                               rhnServerChannel sc
                          join rhnChannelPackage cp on (sc.channel_id = cp.channel_id)
                         where cp.package_id = p.id
                       ) scp on (scp.server_id = sp.server_id and
                                 sp.name_id = scp.name_id and
                                 sp.evr_id = scp.evr_id and
                                 sp.package_arch_id = scp.package_arch_id)
 where scp.package_id is null and
       s.org_id = :orgid
 group by s.name, s.id
 order by name, id
  </query>
  <elaborator name="system_overview" />
  <elaborator name="entitlements" />
  <elaborator name="is_virtual_guest" />
  <elaborator name="is_virtual_host" />
</mode>

****

The inner query returns a row (server id, package id, name id, evr id, package arch id) for all packages in all of the channels to which each server is subscribed. In the reproducer system, this inner query returns 30,411,385 rows. The outer query returns a row for each name id / evr id / package arch id combination installed on each server (basically all of rhnServerPackage). There are 1,357,909 rows in rhnServerPackage on this reproducer. Then the outer query results are left joined with the inner query results, to find rows where the installed packages don't correspond to packages available in any of the server's subscribed channels.

I can prevent the query from erroring out by paring down the inner query to return only those channel-available packages that are actually installed on the system, so that the join is only between "packages on this system" and "packages on this system that are in its channels". That reduces the inner query from 30,411,385 rows to 1,318,596 rows, but the total query still takes over two minutes:

rhnschema=# explain analyze select s.id as id,
       s.name as name,
       1 as selectable,
       count(sp.name_id) as extra_pkg_count
       from rhnServer s
  join rhnUserServerPerms usp on (usp.server_id = s.id and
                                  usp.user_id = 1)
  join rhnServerPackage sp on (s.id = sp.server_id)
       left outer join (select sp2.server_id,
                               sp2.name_id,
                               sp2.evr_id,
                               sp2.package_arch_id,
                               cp.package_id
                          from rhnServerChannel sc,
                               rhnChannelPackage cp,
                               rhnPackage p,
                               rhnServerPackage sp2
                         where sc.server_id = sp2.server_id
                           and sp2.name_id = p.name_id
                           and sp2.evr_id = p.evr_id
                           and sp2.package_arch_id = p.package_arch_id
                           and cp.package_id = p.id
                           and sc.channel_id = cp.channel_id
                       ) scp on (scp.server_id = sp.server_id and
                                 sp.name_id = scp.name_id and
                                 sp.evr_id = scp.evr_id and
                                 sp.package_arch_id = scp.package_arch_id)
 where scp.package_id is null and
       s.org_id = 1
 group by s.name, s.id
 order by name, id;

                                                                                            QUERY PLAN                                                                                             
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=346169.74..346169.77 rows=1 width=28) (actual time=142237.813..142291.037 rows=1732 loops=1)
   ->  Sort  (cost=346169.74..346169.75 rows=1 width=28) (actual time=142237.582..142271.114 rows=42743 loops=1)
         Sort Key: s.name, s.id
         Sort Method:  external merge  Disk: 1624kB
         ->  Nested Loop  (cost=298583.41..346169.73 rows=1 width=28) (actual time=137924.427..141920.480 rows=42743 loops=1)
               ->  Nested Loop  (cost=298583.41..346169.43 rows=1 width=39) (actual time=137924.403..141661.591 rows=42743 loops=1)
                     ->  Hash Left Join  (cost=298583.41..346169.15 rows=1 width=18) (actual time=137924.258..141444.114 rows=42743 loops=1)
                           Hash Cond: ((sp.name_id = sp2.name_id) AND (sp.evr_id = sp2.evr_id) AND (sp.package_arch_id = sp2.package_arch_id) AND (sp.server_id = sp2.server_id))
                           Filter: (cp.package_id IS NULL)
                           ->  Seq Scan on rhnserverpackage sp  (cost=0.00..27217.09 rows=1357909 width=33) (actual time=0.010..427.711 rows=1357909 loops=1)
                           ->  Hash  (cost=298582.59..298582.59 rows=41 width=41) (actual time=137920.324..137920.324 rows=1318596 loops=1)
                                 ->  Merge Join  (cost=297786.54..298582.59 rows=41 width=41) (actual time=129372.955..136863.138 rows=1318596 loops=1)
                                       Merge Cond: ((sc.server_id = sp2.server_id) AND (sc.channel_id = cp.channel_id))
                                       ->  Index Scan using rhn_sc_sid_cid_uq on rhnserverchannel sc  (cost=0.00..731.07 rows=10935 width=18) (actual time=0.036..16.160 rows=10935 loops=1)
                                       ->  Sort  (cost=297786.47..297789.82 rows=1339 width=48) (actual time=129370.858..132623.274 rows=9635704 loops=1)
                                             Sort Key: sp2.server_id, cp.channel_id
                                             Sort Method:  external sort  Disk: 593224kB
                                             ->  Nested Loop  (cost=274240.59..297716.93 rows=1339 width=48) (actual time=11847.035..35419.168 rows=9637241 loops=1)
                                                   ->  Merge Join  (cost=274240.59..291135.98 rows=402 width=41) (actual time=11846.950..18775.401 rows=1340854 loops=1)
                                                         Merge Cond: ((sp2.package_arch_id = p.package_arch_id) AND (sp2.name_id = p.name_id) AND (sp2.evr_id = p.evr_id))
                                                         ->  Sort  (cost=239803.19..243197.96 rows=1357909 width=33) (actual time=10834.956..14759.475 rows=1347258 loops=1)
                                                               Sort Key: sp2.package_arch_id, sp2.name_id, sp2.evr_id
                                                               Sort Method:  external merge  Disk: 57728kB
                                                               ->  Seq Scan on rhnserverpackage sp2  (cost=0.00..27217.09 rows=1357909 width=33) (actual time=0.004..540.515 rows=1357909 loops=1)
                                                         ->  Materialize  (cost=34437.40..36575.19 rows=171023 width=30) (actual time=1008.751..1818.624 rows=1490784 loops=1)
                                                               ->  Sort  (cost=34437.40..34864.96 rows=171023 width=30) (actual time=1008.741..1306.285 rows=171011 loops=1)
                                                                     Sort Key: p.package_arch_id, p.name_id, p.evr_id
                                                                     Sort Method:  external merge  Disk: 7016kB
                                                                     ->  Seq Scan on rhnpackage p  (cost=0.00..15477.23 rows=171023 width=30) (actual time=0.055..146.966 rows=171020 loops=1)
                                                   ->  Index Scan using rhn_cp_pid_idx on rhnchannelpackage cp  (cost=0.00..16.31 rows=5 width=15) (actual time=0.004..0.008 rows=7 loops=1340854)
                                                         Index Cond: (cp.package_id = p.id)
                     ->  Index Scan using rhn_server_id_pk on rhnserver s  (cost=0.00..0.27 rows=1 width=21) (actual time=0.003..0.004 rows=1 loops=42743)
                           Index Cond: (s.id = sp.server_id)
                           Filter: (s.org_id = 1::numeric)
               ->  Index Scan using rhn_usperms_uid_sid_uq on rhnuserserverperms usp  (cost=0.00..0.29 rows=1 width=11) (actual time=0.004..0.005 rows=1 loops=42743)
                     Index Cond: ((usp.user_id = 1::numeric) AND (usp.server_id = s.id))
 Total runtime: 142426.398 ms
(37 rows)


The inner query alone takes 38 seconds to complete:

rhnschema=# explain analyze select sp2.server_id,
                               sp2.name_id,
                               sp2.evr_id,
                               sp2.package_arch_id,
                               cp.package_id
                          from rhnServerChannel sc,
                               rhnChannelPackage cp,
                               rhnPackage p,
                               rhnServerPackage sp2
                         where sc.server_id = sp2.server_id
                           and sp2.name_id = p.name_id
                           and sp2.evr_id = p.evr_id
                           and sp2.package_arch_id = p.package_arch_id
                           and cp.package_id = p.id
                           and sc.channel_id = cp.channel_id;
                                                                       QUERY PLAN                                                                        
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=274605.97..291793.51 rows=41 width=41) (actual time=10985.470..37963.990 rows=1318596 loops=1)
   Hash Cond: ((sp2.server_id = sc.server_id) AND (cp.channel_id = sc.channel_id))
   ->  Nested Loop  (cost=274240.59..291390.90 rows=1339 width=48) (actual time=10976.689..31633.594 rows=9637241 loops=1)
         ->  Merge Join  (cost=274240.59..291135.98 rows=402 width=41) (actual time=10976.631..17299.590 rows=1340854 loops=1)
               Merge Cond: ((sp2.package_arch_id = p.package_arch_id) AND (sp2.name_id = p.name_id) AND (sp2.evr_id = p.evr_id))
               ->  Sort  (cost=239803.19..243197.96 rows=1357909 width=33) (actual time=10164.575..13810.504 rows=1347258 loops=1)
                     Sort Key: sp2.package_arch_id, sp2.name_id, sp2.evr_id
                     Sort Method:  external merge  Disk: 57728kB
                     ->  Seq Scan on rhnserverpackage sp2  (cost=0.00..27217.09 rows=1357909 width=33) (actual time=0.023..534.438 rows=1357909 loops=1)
               ->  Materialize  (cost=34437.40..36575.19 rows=171023 width=30) (actual time=809.142..1471.421 rows=1490784 loops=1)
                     ->  Sort  (cost=34437.40..34864.96 rows=171023 width=30) (actual time=809.137..1079.748 rows=171011 loops=1)
                           Sort Key: p.package_arch_id, p.name_id, p.evr_id
                           Sort Method:  external merge  Disk: 7016kB
                           ->  Seq Scan on rhnpackage p  (cost=0.00..15477.23 rows=171023 width=30) (actual time=0.040..116.852 rows=171020 loops=1)
         ->  Index Scan using rhn_cp_pid_idx on rhnchannelpackage cp  (cost=0.00..0.57 rows=5 width=15) (actual time=0.003..0.007 rows=7 loops=1340854)
               Index Cond: (cp.package_id = p.id)
   ->  Hash  (cost=201.35..201.35 rows=10935 width=18) (actual time=8.759..8.759 rows=10935 loops=1)
         ->  Seq Scan on rhnserverchannel sc  (cost=0.00..201.35 rows=10935 width=18) (actual time=0.011..3.406 rows=10935 loops=1)
 Total runtime: 38291.635 ms
(19 rows)

So, there may need to be some schema changes to get this report into a form that returns results in a timely manner.

--- Additional comment from Stephen Herr on 2014-05-20 18:24:17 EDT ---

Thanks for the investigation Tasos. I think you're right, limiting the inner query to "installed packages on the server that are in channels the server is subscribed to" is the best thing we can do, even though it's still not very fast. The problem is that this is an extremely difficult thing to determine given our current data model. We basically have to join all of the largest tables and then anti-join them with rhnServerPackage, and there's no other way to do it. I've tried several other formulations of the query and they all had (much in some cases) worse performance than this version. I've improved the inner query somewhat by adding a "only servers this user can see" condition, but that will only make a difference if there are many servers that are in other orgs or that the user otherwise can't see. Sadly, I think this is as good as it's going to get.

select s.id as id,
       s.name as name,
       1 as selectable,
       count(sp.name_id) as extra_pkg_count
  from rhnServer s
  join rhnUserServerPerms usp on (usp.server_id = s.id
                             and usp.user_id = :user_id)
  join rhnServerPackage sp on s.id = sp.server_id
  left join (select sc.server_id,
                    cp.package_id,
                    p.name_id,
                    p.evr_id,
                    p.package_arch_id
               from rhnPackage p,
                    rhnServerChannel sc,
                    rhnChannelPackage cp,
                    rhnServerPackage sp2,
                    rhnUserServerPerms usp2
              where cp.package_id = p.id
                and sc.server_id = usp2.server_id
                and usp2.user_id = :user_id
                and sc.channel_id = cp.channel_id
                and sc.server_id = sp2.server_id
                and sp2.package_arch_id = p.package_arch_id
                and sp2.name_id = p.name_id
                and sp2.evr_id = p.evr_id
       ) scp on (scp.server_id = sp.server_id
            and sp.name_id = scp.name_id
            and sp.evr_id = scp.evr_id
            and sp.package_arch_id = scp.package_arch_id)
 where scp.package_id is null
   and s.org_id = :org_id
 group by s.name, s.id
 order by name, id;

--- Additional comment from Stephen Herr on 2014-05-21 10:17:34 EDT ---

The developers discussed this problem today I'll record the salient points here for posterity.

The only solution we could really find that would dramatically improve the performance of this page would be to change the page behavior from "showing a list of systems that have extra packages" to "showing a list of all systems, and indicate how many extra packages they have". The difference being that in the second case every system the user can see is always shown, and if they have no extra packages the list just displays a zero. This would make a difference because then we could move the complicated part of the query into an elaborator that only runs on the systems currently visible on the page, so we would be operating on at most 500 systems at a time which would be performant no matter how large your installation was.

The problem is that this is semantically different from the current behavior and all the other pages in this section of the UI. The "Requiring Reboot" list only lists systems that require a reboot. The "Proxy" list only lists systems that are a Proxy. etc. That change would not only make this page different from everything else but also less useful; you'd have to page through the list to find the systems that have extra packages. You'd have to manually count them if you wanted to know how many there are.

We've decided to proceed with the modest performance improvement that you found Tasos, which will make the page working but slow for large installations and unusable for extremely large installations. I'll also add a report to spacewalk-reports so that customers who have extremely large installations can still get this information from somewhere. The report will also be slow, but it will eventually return with the information.

Comment 1 Stephen Herr 2014-05-21 15:36:41 UTC
Committing to Spacewalk master:
b4e09eccc547f8c7199ab505cd78ca9fa7f787ad
0d2dd9ea24ba93b4e4e8b67825a8cb07056e808d

Comment 2 Milan Zázrivec 2014-07-17 08:41:19 UTC
Spacewalk 2.2 has been released:

    https://fedorahosted.org/spacewalk/wiki/ReleaseNotes22


Note You need to log in before you can comment on or make changes to this bug.