Description of problem:

When scheduling a package removal for a third-party rpm in SSM for ~250 osad clients, the server experiences high load and becomes unusable for ~5-10 minutes. Viewing the list of currently running queries in postgres while the server is under high load:

    select procpid, query_start, current_query
      from pg_stat_activity
     where current_query != '<IDLE>';

shows ~250 concurrent calls to the update_needed_cache stored function, like the following:

    1512 | 2014-03-18 14:51:05.950363+01 | SELECT rhn_server.update_needed_cache(1000010574)

Many of the actions (90 out of 285 in one example run) end up with a status of Failed, with a reason of "Packages scheduled in action <X> for server <Y> could not be found". All of the systems did have the package installed, and the client-side logs show that the package actually was removed when the client picked up the scheduled action. It looks like the update_needed_cache function is timing out while waiting on the other systems to complete, resulting in the actions being marked as failed.

The relevant code is in ./backend/server/rhnServer/server_packages.py:

    def save_packages_byid(self, sysid, schedule=1):
        """ save the package list """
        log_debug(3, sysid, "Errata cache to run:", schedule,
                  "Changed:", self.__changed, "%d total packages" % len(self.__p))
        [...]
        # get rid of the deleted packages
        dlist = filter(lambda a: a.real and a.status in (DELETED, UPDATED),
                       self.__p.values())
        if dlist:
            [...]
        # And now add packages
        alist = filter(lambda a: a.status in (ADDED, UPDATED), self.__p.values())
        if alist:
            [...]
        if schedule:
            # queue this server for an errata update
            update_errata_cache(sysid)

    def update_errata_cache(server_id):
        """ Function that updates rhnServerNeededPackageCache by deltas (as
            opposed to calling queue_server which removes the old entries
            and inserts new ones).
            It now also updates rhnServerNeededErrataCache, but as the
            entries there are a subset of rhnServerNeededPackageCache's
            entries, it still gives statistics regarding only
            rhnServerNeededPackageCache.
        """
        log_debug(2, "Updating the errata cache", server_id)
        update_needed_cache = rhnSQL.Procedure("rhn_server.update_needed_cache")
        update_needed_cache(server_id)

The docstring of update_errata_cache is no longer accurate. The stored update_needed_cache function no longer updates the cache by deltas (only the packages that were deleted or added); it simply deletes the entire cache for the server and repopulates it from scratch. That approach makes sense when subscribing a system to a new channel or doing batch updates to a channel, but when adding or removing a small number of packages, updating rhnServerNeededCache via deltas (using the dlist and alist variables already available in save_packages_byid) would be more efficient and would prevent the errors reported above.
The stored function, from schema/spacewalk/postgres/packages/rhn_server.pkb:

    create or replace function update_needed_cache(server_id_in in numeric)
    returns void
    as $$
    begin
        delete from rhnServerNeededCache
         where server_id = server_id_in;
        insert into rhnServerNeededCache (server_id, errata_id, package_id)
            (select distinct sp.server_id, x.errata_id, p.id
               from (select sp_sp.server_id, sp_sp.name_id,
                            sp_sp.package_arch_id, max(sp_pe.evr) as max_evr
                       from rhnServerPackage sp_sp
                       join rhnPackageEvr sp_pe on sp_pe.id = sp_sp.evr_id
                      group by sp_sp.server_id, sp_sp.name_id,
                               sp_sp.package_arch_id) sp
               join rhnPackage p on p.name_id = sp.name_id
               join rhnPackageEvr pe on pe.id = p.evr_id
                                    and sp.max_evr < pe.evr
               join rhnPackageUpgradeArchCompat puac
                    on puac.package_arch_id = sp.package_arch_id
                   and puac.package_upgrade_arch_id = p.package_arch_id
               join rhnServerChannel sc on sc.server_id = sp.server_id
               join rhnChannelPackage cp on cp.package_id = p.id
                                        and cp.channel_id = sc.channel_id
               left join (select ep.errata_id, ce.channel_id, ep.package_id
                            from rhnChannelErrata ce
                            join rhnErrataPackage ep
                                 on ep.errata_id = ce.errata_id
                            join rhnServerChannel sc_sc
                                 on sc_sc.channel_id = ce.channel_id
                           where sc_sc.server_id = server_id_in) x
                    on x.channel_id = sc.channel_id
                   and x.package_id = cp.package_id
              where sp.server_id = server_id_in);
    end$$ language plpgsql;

    -- restore the original setting
    update pg_settings set setting = overlay( setting placing '' from 1
        for (length('rhn_server')+1) ) where name = 'search_path';

Version-Release number of selected component (if applicable):
spacewalk-java-2.0.2-79.el6sat.noarch
spacewalk-schema-2.0.2-13.el6sat.noarch
satellite-schema-5.6.0.18-1.el6sat.noarch

How reproducible:
100% for the Satellite customer reporting the issue.

Steps to Reproduce:
1. See the Description section above.

Actual results:
Failed SSM package removal actions when using osad on a couple hundred systems, due to the inefficient full regeneration of rhnServerNeededCache.

Expected results:
No failed actions.

Additional info:
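For illustration, the suggested delta-based approach could look roughly like the following. This is a self-contained sketch only, not Spacewalk code: the dict-based cache and the newer_than mapping are hypothetical stand-ins for rhnServerNeededCache and for the result of the big SQL join, and a real implementation would issue targeted DELETE/INSERT statements via rhnSQL instead.

```python
# Sketch of the delta idea: instead of rebuilding the whole needed-package
# cache, apply only the additions and deletions that save_packages_byid
# already computes as dlist and alist.  Status constants mirror the ones
# used in server_packages.py; everything else is an illustrative stand-in.
ADDED, DELETED, UPDATED = range(3)

def compute_deltas(packages):
    """Split a server's package records into delete and add lists,
    the same way save_packages_byid builds dlist and alist."""
    dlist = [p for p in packages if p["real"] and p["status"] in (DELETED, UPDATED)]
    alist = [p for p in packages if p["status"] in (ADDED, UPDATED)]
    return dlist, alist

def apply_deltas(cache, server_id, dlist, alist, newer_than):
    """Update an in-memory stand-in for rhnServerNeededCache.

    cache: server_id -> set of "needed" (available update) package ids.
    newer_than: (name, evr) -> ids of channel packages newer than that
    installed version (what the SQL join in update_needed_cache derives).

    Note: a real implementation would treat UPDATED as a delete of the
    old EVR plus an insert for the new EVR; this sketch only exercises
    plain adds and removals.
    """
    needed = cache.setdefault(server_id, set())
    # A removed package no longer makes its pending updates "needed".
    for pkg in dlist:
        needed.difference_update(newer_than.get((pkg["name"], pkg["evr"]), set()))
    # A newly installed package may have newer versions in the channels.
    for pkg in alist:
        needed.update(newer_than.get((pkg["name"], pkg["evr"]), set()))
    return needed
```

The point of the sketch is that the cost is proportional to the number of changed packages (usually one or two for an SSM action) rather than to the server's full package list joined against every subscribed channel.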
Committing fix to Spacewalk master: 6ba5a4799adcf73deb2f4d5b3cf28a8c5811edb5
The fix is to queue the update of the errata cache data for taskomatic to handle later, instead of trying to do it inline when yum reports a package list. This resolves the concurrency problems seen when updating hundreds of systems at a time. The downside is that the information in Satellite about which updates a server has available will not be updated instantaneously, but only when taskomatic next runs the Errata Cache task. The worst case is that someone schedules the update again before taskomatic gets around to updating the errata cache, which results in a no-op on the client and a successful action with a message of "Requested packages already installed".
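The queueing behaviour can be illustrated with a small self-contained model (plain Python; the class and method names are illustrative, not the actual Spacewalk/taskomatic API): each client request does only a cheap enqueue, duplicate notifications for the same server coalesce, and the periodic task performs the expensive rebuild once per dirty server, outside the client request path.

```python
# Toy model of the fix: clients enqueue an "errata cache is dirty" note
# instead of rebuilding the cache inline; a periodic task (taskomatic's
# Errata Cache task in the real system) drains the queue later, so 250
# concurrent clients never contend on the same expensive stored procedure.
from collections import OrderedDict

class ErrataCacheQueue:
    def __init__(self):
        # OrderedDict used as an ordered set: duplicate enqueues coalesce.
        self._pending = OrderedDict()

    def enqueue(self, server_id):
        """Cheap inline operation that replaces the inline cache rebuild."""
        self._pending[server_id] = True

    def drain(self, rebuild):
        """Periodic task: rebuild each dirty server's cache exactly once."""
        dirty = list(self._pending)
        self._pending.clear()
        for server_id in dirty:
            rebuild(server_id)
        return dirty

queue = ErrataCacheQueue()
rebuilt = []
# 250 clients report package changes, a couple of them more than once:
for sid in list(range(1000, 1250)) + [1000, 1001]:
    queue.enqueue(sid)
queue.drain(rebuilt.append)
# Each server's cache is rebuilt once, asynchronously.
```

The coalescing is what makes the worst case benign: however many actions touch a server before the task runs, the full cache regeneration happens once per server per task run.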
Moving bugs to ON_QA as we move to release Spacewalk 2.3
Spacewalk 2.3 has been released. See https://fedorahosted.org/spacewalk/wiki/ReleaseNotes23