999453 – registering RHEL5-Server-GOLD to Satellite@PostgreSQL into RHEL5 base channel is very slow

Bug 999453 - registering RHEL5-Server-GOLD to Satellite@PostgreSQL into RHEL5 base channel is very slow

Summary: registering RHEL5-Server-GOLD to Satellite@PostgreSQL into RHEL5 base channel...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Satellite 5
Classification:	Red Hat
Component:	Server
Sub Component:
Version:	560
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Stephen Herr
QA Contact:	Jan Hutař
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1004309 (view as bug list)
Depends On:
Blocks:	sat570-blockers
TreeView+	depends on / blocked

Reported:	2013-08-21 11:00 UTC by Jan Hutař
Modified:	2015-01-13 10:25 UTC (History)
CC List:	4 users (show)
Fixed In Version:	spacewalk-java-2.3.8-18-sat
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-01-13 10:25:05 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Jan Hutař 2013-08-21 11:00:50 UTC

Description of problem:
Registering RHEL5-Server-GOLD to Satellite@PostgreSQL into RHEL5 base channel is very slow. Also adding or removing it's child channels is slow. Maybe it is cases by errata cache (rhn_server.update_needed_cache(systemID) (?))?


Version-Release number of selected component (if applicable):
Satellite-5.6.0-RHEL6-re20130807.0


How reproducible:
always


Steps to Reproduce:
1. Have Satellite@PostgreSQL with RHEL5 base channel synced
2. Add "log_min_duration_statement = 500" into DB config
3. Register RHEL5-Server-GOLD (default package set - i.e. desktop... => lot
   of packages to upgrade)


Actual results:
It takes too long:

Done 3 attempts, and this is an average on "real":
Sat560@EmbPostgreSQL:
	2m46.001s
	2m41.942s
	2m32.043s
	=> AVG = 159.995s
Sat560@ExtOracleSameSyst:
	0m5.540s
	0m1.828s
	0m1.674s
	=> AVG = 3.014s


Expected results:
Should be quick as against Satellite@Oracle


Additional info:
Maybe this is a duplicate of bug 985947.

Comment 1 Jan Hutař 2013-08-21 11:03:14 UTC

E.g. when adding/removing custom child channel to such a system you can see this:

2013-08-21 06:46:27.925 EDT LOG:  duration: 1736.834 ms  execute <unnamed>: select * from rhn_channel.subscribe_server($1, $2, 1, $3) as result
2013-08-21 06:46:27.925 EDT DETAIL:  parameters: $1 = '1000010032', $2 = '106', $3 = '1'
2013-08-21 06:46:32.751 EDT LOG:  duration: 886.580 ms  execute <unnamed>: select * from rhn_channel.unsubscribe_server($1, $2) as result
2013-08-21 06:46:32.751 EDT DETAIL:  parameters: $1 = '1000010032', $2 = '106'
2013-08-21 06:50:01.550 EDT LOG:  duration: 1259.027 ms  execute <unnamed>: select * from rhn_server.update_needed_cache($1) as result
2013-08-21 06:50:01.551 EDT DETAIL:  parameters: $1 = '1000010032'

Comment 8 Stephen Herr 2013-08-29 18:01:10 UTC

The problem is that the query optimizer in Postgres 8.4 is not smart enough to be optimize away a sub-select in the rhnServerNeededView when you are calling through the rhn_server.update_needed_cache function (which adds yet another level of complexity to the query that the optimizer is trying to execute).

rhnServerNeededView:
######################################
CREATE OR REPLACE VIEW
rhnServerNeededView
(
    org_id,
    server_id,
    errata_id,
    package_id,
    package_name_id,
    channel_id
)
AS
SELECT s.org_id,
       sp.server_id,
       x.errata_id,
       up.id,
       up.name_id,
       x.channel_id
    FROM rhnServer s
        join (SELECT sp_sp.server_id, sp_sp.name_id, sp_sp.package_arch_id, max(sp_pe.evr) AS max_evr
                FROM rhnServerPackage sp_sp
                    join rhnPackageEvr sp_pe ON sp_pe.id = sp_sp.evr_id
                    GROUP BY sp_sp.server_id, sp_sp.name_id, sp_sp.package_arch_id) sp ON sp.server_id = s.id
        join rhnPackage up ON up.name_id = sp.name_id
        join rhnPackageEvr upe ON upe.id = up.evr_id AND sp.max_evr < upe.evr
        join rhnPackageUpgradeArchCompat puac ON puac.package_arch_id = sp.package_arch_id AND puac.package_upgrade_arch_id = up.package_arch_id
        join rhnServerChannel sc ON sc.server_id = sp.server_id
        join rhnChannelPackage cp ON cp.package_id = up.id AND cp.channel_id = sc.channel_id
        left join
        (SELECT ep.errata_id, cp.channel_id, ep.package_id
         FROM rhnChannelErrata cp
             join rhnErrataPackage ep ON ep.errata_id = cp.errata_id) x
            ON x.channel_id = sc.channel_id AND x.package_id = cp.package_id
;
######################################

It is not strictly necessary to do the sub-select at the bottom there as a sub-select (there are no aggregate functions or other things that would prevent you from in-lining it), it's just written that way because that is similar to the logic we are trying to implement. However in this case Postgres 8.4's optimizer is not able to optimize it away, so it does that entire join on those two huge tables, only to throw out most of the results when it left joins the results to the results of the rest of the query.

It is that behavior that causes the call to update_needed_cache to take the large amount of time it is taking, recorded at over a minute in comment 6 and I've seen it take as long as two minutes. In Oracle and Postgres 9.x this does not appear to be a problem, the optimizer is smart enough to transform the sub-select into joins that take much less time.

In order to resolve this problem I have re-written the view to return the same data but through the join method instead of the sub-select method:


######################################
CREATE OR REPLACE VIEW
rhnServerNeededView
(
    org_id,
    server_id,
    errata_id,
    package_id,
    package_name_id,
    channel_id
)
AS
SELECT DISTINCT s.org_id,
       sp.server_id,
       ce.errata_id,
       up.id,
       up.name_id,
       ce.channel_id
    FROM rhnServer s
        join (SELECT sp_sp.server_id, sp_sp.name_id, sp_sp.package_arch_id, max(sp_pe.evr) AS max_evr
                FROM rhnServerPackage sp_sp
                    join rhnPackageEvr sp_pe ON sp_pe.id = sp_sp.evr_id
                    GROUP BY sp_sp.server_id, sp_sp.name_id, sp_sp.package_arch_id) sp ON sp.server_id = s.id
        join rhnPackage up ON up.name_id = sp.name_id
        join rhnPackageEvr upe ON upe.id = up.evr_id AND sp.max_evr < upe.evr
        join rhnPackageUpgradeArchCompat puac ON puac.package_arch_id = sp.package_arch_id AND puac.package_upgrade_arch_id = up.package_arch_id
        join rhnServerChannel sc ON sc.server_id = sp.server_id
        join rhnChannelPackage cp ON cp.package_id = up.id AND cp.channel_id = sc.channel_id
        left join rhnErrataPackage ep ON cp.package_id = ep.package_id
        left join rhnChannelErrata ce ON sc.channel_id = ce.channel_id AND ep.errata_id = ce.errata_id;
######################################

I have tested this with both channels that contain and do not contain errata, and the output is the same as the original view in both cases. However since we have explicitly transformed the query into the join variation the optimizer has less work to do and the query returns quickly, sub-1-second, even on Postgres 8.4.

This should enable the server registration to happen quickly like it does on Oracle and Postgres 9.

Committing to Spacewalk master:
a1f165cd894dd4d5affbc3d62ae49e034f979a6b

Comment 9 Stephen Herr 2013-08-29 18:12:21 UTC

As a side note:
Since the slow registration behavior comes down to how well the postgres optimizer is able to perform it's job this is not an inherently reproducible problem. There could be any number of factors that would influence the different things the optimizer tries in the limited amount of time it has to come up with the execution plan, including table sizes. This may be a problem on most Postgres 8.4 installs, or only a few of them. It may go away over time as postgres gathers additional statistics and trains its optimizer. It may even happen under postgres 9 if given the right conditions. This update to simplify the query should keep it from happening, but I don't know how reproducible the actual problem is prior to doing the update.

The real question is does this updated view definition always return the same data as the old one (and I think it does), if so then we should use the simplified (and faster) definition.

Comment 13 Stephen Herr 2013-09-05 12:53:28 UTC

It was discovered that my original fix for this bug was flawed, see bug 1004309.

Reverting the original fix in Spacewalk master:
08e13a23fd034569520b55540bf85b22c0f9d67a

And fixing it correctly:
f70898d4a5a3ea33d4870c02ba3f45e592e8e9c7

Comment 14 Stephen Herr 2013-09-05 12:55:14 UTC

*** Bug 1004309 has been marked as a duplicate of this bug. ***

Comment 15 Stephen Herr 2013-09-05 13:23:53 UTC

Forgot to update the sha1 hashes:
cb150900f64f60e3f5e5df03e2f618eb2709e671

Comment 18 Stephen Herr 2013-09-10 18:17:27 UTC

Also need:
cd6c31a1be359ce02477da4b2c4535decb13ff68

One file deletion did not make it into the original commit. Also breaking the operations on various entities in the upgrade script into separate files.

Comment 19 Stephen Herr 2013-09-10 18:33:59 UTC

Missed another sha1 hash :(
b1566cbdfb8171d6ec6ee6979b01543623959c9e

Comment 32 Clifford Perry 2015-01-13 10:25:05 UTC

With the release of Red Hat Satellite 5.7 on January 12th 2015 this bug is
being moved to a Closed Current Release state. 

The Satellite 5.7 GA Errata:
 - https://rhn.redhat.com/errata/RHSA-2015-0033.html 

Satellite 5.7 Release Notes:
 -
https://access.redhat.com/documentation/en-US/Red_Hat_Satellite/5.7/html-single/Release_Notes/index.html

Satellite Customer Portal Blog announcement for release:
 - https://access.redhat.com/blogs/1169563/posts/1315743 

Cliff

NOTE: This bug has not been re-verified (moved to RELEASE_PENDING) prior to release. We assume that the bug has indeed been fixed and not regressed since we initially verified it. Please re-open in the future if needed.

Note You need to log in before you can comment on or make changes to this bug.