Bug 1119459 - Package deletion actions scheduled via SSM for osad clients fail
Summary: Package deletion actions scheduled via SSM for osad clients fail
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Spacewalk
Classification: Community
Component: Server
Version: 2.2
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Stephen Herr
QA Contact: Red Hat Satellite QA List
URL:
Whiteboard:
Depends On:
Blocks: 1119460 space23
 
Reported: 2014-07-14 20:13 UTC by Tasos Papaioannou
Modified: 2015-04-14 19:03 UTC (History)
0 users

Fixed In Version: spacewalk-backend-2.3.8-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1119460 (view as bug list)
Environment:
Last Closed: 2015-04-14 19:03:42 UTC





Internal Links: 1119460

Description Tasos Papaioannou 2014-07-14 20:13:07 UTC
Description of problem:

When scheduling a package removal for a third-party rpm in SSM for ~250 osad clients, the server experiences high load and becomes unusable for ~5-10 minutes. Viewing the list of currently-running queries in postgres with "select procpid,query_start,current_query from pg_stat_activity where current_query != '<IDLE>';" while the server is experiencing high load shows ~250 calls to the update_needed_cache stored function, like the following:

1512 | 2014-03-18 14:51:05.950363+01 | SELECT rhn_server.update_needed_cache(1000010574)

Many of the actions (90 out of 285 in one example run) end up with a status of Failed, with a reason of "Packages scheduled in action <X> for server <Y> could not be found". All of the systems did have the package installed, and the client-side logs show that the package actually was removed when it picked up the scheduled action.

It looks like the update_needed_cache function is timing out while waiting on other systems to complete, resulting in the actions being marked as failed.

The relevant code is:

./backend/server/rhnServer/server_packages.py:

    def save_packages_byid(self, sysid, schedule=1):
        """ save the package list """
        log_debug(3, sysid, "Errata cache to run:", schedule, 
            "Changed:", self.__changed, "%d total packages" % len(self.__p))
[...]
        # get rid of the deleted packages
        dlist = filter(lambda a: a.real and a.status in (DELETED, UPDATED), self.__p.values())
        if dlist:
[...]
        # And now add packages
        alist = filter(lambda a: a.status in (ADDED, UPDATED), self.__p.values())
        if alist:
[...]
        if schedule:
            # queue this server for an errata update
            update_errata_cache(sysid)

def update_errata_cache(server_id):

    """ Function that updates rhnServerNeededPackageCache by deltas (as opposed to
        calling queue_server which removes the old entries and inserts new ones).
        It now also updates rhnServerNeededErrataCache, but as the entries there
        are a subset of rhnServerNeededPackageCache's entries, it still gives
        statistics regarding only rhnServerNeededPackageCache.
    """
    log_debug(2, "Updating the errata cache", server_id)
    update_needed_cache = rhnSQL.Procedure("rhn_server.update_needed_cache")
    update_needed_cache(server_id)

The docstring in this update_errata_cache function is no longer accurate. The stored rhn_server.update_needed_cache function no longer updates the cache by deltas for the packages that were added or deleted; it simply deletes the server's entire cache and repopulates it from scratch:

This method makes sense when subscribing a system to a new channel or doing batch updates to a channel, but when adding or removing a small number of packages, updating rhnServerNeededCache via deltas (using the dlist and alist variables available in save_packages_byid) would be more efficient, and prevent the errors reported above.
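The delta approach proposed above can be sketched in plain Python (a simulation only: the function and variable names are hypothetical stand-ins for the database tables and the rhnSQL-backed code, and dlist/alist are reduced to plain name lists):

```python
# Hypothetical sketch of a delta-based cache update: instead of rebuilding
# rhnServerNeededCache from scratch, touch only the rows affected by the
# packages that were deleted (dlist) or added (alist). Plain sets and dicts
# stand in for the database tables.

def apply_cache_deltas(needed_cache, available_updates, deleted, added):
    """Update the per-server needed-package cache by deltas.

    needed_cache:      set of package ids currently cached as "needed"
    available_updates: dict name -> set of newer package ids available in
                       the server's channels (stand-in for the SQL joins)
    deleted:           names of packages removed from the server
    added:             names of packages newly installed
    """
    # A removed package no longer needs updates: drop only its cache rows.
    for name in deleted:
        needed_cache -= available_updates.get(name, set())
    # A newly installed package may need updates: add only its rows.
    for name in added:
        needed_cache |= available_updates.get(name, set())
    return needed_cache

cache = {101, 102}
updates = {"bash": {101}, "zsh": {201, 202}}
cache = apply_cache_deltas(cache, updates, deleted=["bash"], added=["zsh"])
# cache now reflects zsh's available updates and drops bash's,
# without rescanning every package in every subscribed channel
```

The work done is proportional to the size of the delta rather than to the full package list, which is the efficiency argument made above.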

schema/spacewalk/postgres/packages/rhn_server.pkb

    create or replace function update_needed_cache(
        server_id_in in numeric
        ) returns void as $$
    begin
      delete from rhnServerNeededCache
        where server_id = server_id_in;
      insert into rhnServerNeededCache
             (server_id, errata_id, package_id)
        (select distinct sp.server_id, x.errata_id, p.id
           FROM (SELECT sp_sp.server_id, sp_sp.name_id,
                        sp_sp.package_arch_id, max(sp_pe.evr) AS max_evr
                   FROM rhnServerPackage sp_sp
                   join rhnPackageEvr sp_pe ON sp_pe.id = sp_sp.evr_id
                  GROUP BY sp_sp.server_id, sp_sp.name_id, sp_sp.package_arch_id) sp
           join rhnPackage p ON p.name_id = sp.name_id
           join rhnPackageEvr pe ON pe.id = p.evr_id AND sp.max_evr < pe.evr
           join rhnPackageUpgradeArchCompat puac
                    ON puac.package_arch_id = sp.package_arch_id
                    AND puac.package_upgrade_arch_id = p.package_arch_id
           join rhnServerChannel sc ON sc.server_id = sp.server_id
           join rhnChannelPackage cp ON cp.package_id = p.id
                    AND cp.channel_id = sc.channel_id
           left join (SELECT ep.errata_id, ce.channel_id, ep.package_id
                        FROM rhnChannelErrata ce
                        join rhnErrataPackage ep
                                 ON ep.errata_id = ce.errata_id
                        join rhnServerChannel sc_sc
                                 ON sc_sc.channel_id = ce.channel_id
                       WHERE sc_sc.server_id = server_id_in) x
             ON x.channel_id = sc.channel_id AND x.package_id = cp.package_id
          where sp.server_id = server_id_in);
        end$$ language plpgsql;
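For comparison, the full rebuild that the stored function performs can be simulated in a few lines of Python (heavily simplified and hypothetical: EVRs are plain integers and the channel/arch joins are collapsed into one list):

```python
# Simplified simulation of rhn_server.update_needed_cache: discard every
# cached row for the server, then recompute the complete set of newer
# packages by comparing the installed EVR per name against everything
# available in the subscribed channels.

def rebuild_needed_cache(installed, channel_packages):
    """installed: dict name -> installed EVR (integers as a stand-in)
    channel_packages: list of (package_id, name, evr) tuples for all
    packages in the server's subscribed channels.
    Returns the rebuilt set of needed package ids."""
    cache = set()  # "delete from rhnServerNeededCache where server_id = ..."
    for pkg_id, name, evr in channel_packages:
        # "sp.max_evr < pe.evr": only strictly newer packages qualify
        if name in installed and installed[name] < evr:
            cache.add(pkg_id)
    return cache

installed = {"bash": 4, "zsh": 1}
channel = [(10, "bash", 5), (11, "bash", 3), (12, "zsh", 2), (13, "vim", 9)]
needed = rebuild_needed_cache(installed, channel)
# Every call walks the entire channel contents, even when only a single
# package changed on the client, which is why ~250 concurrent calls
# produce the load spike described above.
```

This illustrates why the cost of each call is independent of how small the actual change was.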



Version-Release number of selected component (if applicable):

spacewalk-java-2.0.2-79.el6sat.noarch
spacewalk-schema-2.0.2-13.el6sat.noarch
satellite-schema-5.6.0.18-1.el6sat.noarch


How reproducible:

100% for Satellite customer reporting the issue.

Steps to Reproduce:
1.) See description section above.

Actual results:

Failed SSM package removal actions when using osad on a couple hundred systems, due to inefficient rhnServerNeededCache regeneration.

Expected results:

No failed actions.

Additional info:

Comment 1 Stephen Herr 2014-08-19 19:43:08 UTC
Committing fix to Spacewalk master:
6ba5a4799adcf73deb2f4d5b3cf28a8c5811edb5

Comment 2 Stephen Herr 2014-08-19 19:48:21 UTC
The fix is to just queue the update of the errata cache data for taskomatic to take care of later, instead of trying to do it inline when yum reports a package list. This will resolve any concurrency problems with updating hundreds of systems at a time. The downside is that information in Satellite about what updates a server has available will not be updated instantaneously, but rather when taskomatic next runs the Errata Cache task.

The worst case is that someone would try to schedule the update again before taskomatic gets around to updating the Errata Cache, which would result in a no-op on the client and a successful action with a message of "Requested packages already installed".
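The queue-and-defer pattern described in this fix can be sketched as follows (a hypothetical in-memory model, not the actual rhnTaskQueue/taskomatic code; class and method names are invented for illustration):

```python
# Hypothetical sketch of the deferred approach: each client package-list
# report only enqueues the server id. A background job (taskomatic's
# Errata Cache task in Spacewalk) later rebuilds the cache once per
# distinct queued server, so 250 concurrent reports cause 250 cheap
# enqueues instead of 250 concurrent full rebuilds.

class ErrataCacheQueue:
    def __init__(self):
        self._pending = set()  # a set dedups repeated enqueues per server

    def enqueue(self, server_id):
        """Cheap inline operation done when the client reports packages."""
        self._pending.add(server_id)

    def drain(self, rebuild):
        """Background pass: run `rebuild` once per distinct queued server."""
        batch = sorted(self._pending)
        self._pending.clear()
        for sid in batch:
            rebuild(sid)
        return batch

queue = ErrataCacheQueue()
for sid in [1, 2, 1, 3, 2]:          # concurrent client reports, with repeats
    queue.enqueue(sid)
rebuilt = queue.drain(lambda sid: None)
# rebuilt == [1, 2, 3]: one rebuild per server, regardless of report count
```

Deduplicating by server id is what bounds the background work: a server that reports several times before the task runs is still rebuilt only once.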

Comment 3 Grant Gainey 2015-03-23 16:59:22 UTC
Moving bugs to ON_QA as we move to release Spacewalk 2.3

Comment 4 Grant Gainey 2015-04-14 19:03:42 UTC
Spacewalk 2.3 has been released. See

https://fedorahosted.org/spacewalk/wiki/ReleaseNotes23

