Bug 867614

Summary:	[RFE] osa dispatch runs all tasks at the same time, overloading the server
Product:	[Community] Spacewalk	Reporter:	Kevin Stange <kevin>
Component:	Server	Assignee:	Tomas Lestach <tlestach>
Status:	CLOSED DEFERRED	QA Contact:	Red Hat Satellite QA List <satqe-list>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	1.8	CC:	duncan, james.hogarth, jochen
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-03-23 12:21:23 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	737830

Description Kevin Stange 2012-10-17 20:45:58 UTC

Description of problem:

I have all my system enrollments using osad to ensure that I can push real-time requests to them.  When I create a system set to run a package task, such as a mass update that applies to more than a handful of systems, the update is immediately sent out to every system from the set.  The systems then reciprocate by all immediately running rhn_check and pulling their package updates, which causes a huge spike in activity.

My system has 4 GB of RAM and the base usage for spacewalk 1.8 is consuming about 2 GB of RAM.  The activity spike for about 30 systems pushes my RAM usage over 4 GB and causes thrashing.

This could be mitigated by scaling up the amount of RAM in my system, but this solution would not scale to 60, 100, 200 systems updated at once.

My preference would be to insert a deferral process into the osa system which works by staggering the systems it notifies so that it won't activate more than a set number of tasks simultaneously.  A task would be considered complete when it reports as completed or failed back to spacewalk.
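As a rough illustration of the staggering idea, here is a minimal sketch in Python. It is not how osa-dispatcher works today; `StaggeredDispatcher`, `max_active` and the `notify` callback are hypothetical names standing in for whatever would actually push the notification to a client.

```python
from collections import deque


class StaggeredDispatcher:
    """Notify at most `max_active` systems at a time (hypothetical helper)."""

    def __init__(self, system_ids, max_active, notify):
        self.pending = deque(system_ids)   # systems not yet notified
        self.active = set()                # notified, no result reported yet
        self.max_active = max_active
        self.notify = notify               # callback standing in for the osa push
        self._fill()

    def _fill(self):
        # Notify waiting systems until the concurrency cap is reached.
        while self.pending and len(self.active) < self.max_active:
            system_id = self.pending.popleft()
            self.active.add(system_id)
            self.notify(system_id)

    def report(self, system_id):
        # Called when a system reports its task as completed or failed.
        self.active.discard(system_id)
        self._fill()

    def done(self):
        return not self.pending and not self.active
```

Each call to `report()` frees one slot, so the next waiting system is notified only after an earlier one has finished or failed.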

Alternatively, spacewalk could simply respond to agents requesting package updates by saying "busy, try again in 5 minutes", at which point the client system could sleep and retry until it is allowed to complete its request.
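The client side of that "busy, try again" exchange could look roughly like the sketch below. No such reply exists in spacewalk as far as this report knows; `request_update` is a hypothetical callable that returns either "busy" or "ok", and the 300-second default simply mirrors the 5 minutes mentioned above.

```python
import time


def run_when_allowed(request_update, retry_seconds=300, max_attempts=12,
                     sleep=time.sleep):
    """Call request_update() until it stops answering "busy".

    request_update is a hypothetical callable returning "busy" or "ok".
    Returns True once the request is accepted, False if attempts run out.
    """
    for attempt in range(max_attempts):
        if request_update() != "busy":
            return True
        if attempt < max_attempts - 1:
            sleep(retry_seconds)  # wait before asking the server again
    return False
```

The `max_attempts` cap is an added assumption so that a client does not wait forever if the server stays busy.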

Version-Release number of selected component (if applicable):

spacewalk-base-minimal-1.8.42-1.el6.noarch

Comment 1 Jan Pazdziora (Red Hat) 2012-10-18 07:10:54 UTC

Let me mark this as RFE.

Comment 2 James Hogarth 2012-11-14 09:58:02 UTC

I think I'm encountering a similar situation (but it could be something else too) related to spacewalk 1.8 on postgres... are you using a postgres or oracle backend?

I'm considering configuring rhn_check_command in /etc/sysconfig/rhn/osad.conf to instead be a script that might be a little smarter... at the simple end, just add a random sleep before the rhn_check itself to minimise the impact of all clients calling in at once, or perhaps, at the more complicated end, carry out a check somehow to see how busy the server is prior to initiating a request...
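For the simple end of that idea, a wrapper could be sketched like this in Python. The 300-second upper bound and the /usr/sbin/rhn_check path are assumptions to adjust for your own install, and `delayed_check` is just an illustrative name.

```python
import random
import subprocess
import time

MAX_DELAY_SECONDS = 300            # assumed upper bound for the random sleep
RHN_CHECK = "/usr/sbin/rhn_check"  # assumed install path of rhn_check


def delayed_check(max_delay=MAX_DELAY_SECONDS, command=(RHN_CHECK,),
                  sleep=time.sleep, run=subprocess.call):
    # Sleep a random time so clients do not all call in at once,
    # then run the real check and hand back its exit status.
    delay = random.uniform(0, max_delay)
    sleep(delay)
    return run(list(command))
```

To use it as a wrapper, the script would end by calling `delayed_check()` and exiting with its return value, and rhn_check_command in osad.conf would point at the script instead of at rhn_check.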

If you are on a postgres backend it's worth keeping an eye on bug #873379 as well...

Comment 3 Kevin Stange 2012-11-14 18:20:27 UTC

Yes, my backend is postgres.  I will look into deploying a replacement rhn_check command with a random sleep as a mitigation option.  Thanks for that hint.

Comment 4 Duncan Innes 2013-02-07 12:37:13 UTC

This sounds similar to an issue I had with "push-to-client".  My idea to get round it was to allow a choice at event submission time:

1) Push to client
2) Next regular rhn_check
3) Not before a specific time

I figured that allowing bulk tasks such as you describe to be picked up by the next regular rhn_check, rather than prompting each client to check now, would spread the load out in a reasonable manner.  How well this works depends on the INTERVAL setting in your rhnsd config file.

I was more thinking about spreading out loads on hundreds of virtual clients running on a handful of virtual hosts rather than worrying about the impact on my Satellite/Spacewalk.  I think the solution could be similar, however.

To be honest, the task scheduling could be quite a bit better.  The "Schedule no sooner than" date could be completely separate from whether to use osad or rhn_check.  If I could set a task to happen at midnight, I might either want it all done as soon as possible, or just whenever the regular rhn_check happens.

Comment 5 Duncan Innes 2013-02-07 12:59:18 UTC

The problem with deploying an alternative rhn_check for osad to use is that sometimes you *do* want a task to happen immediately.

service nscd restart (not the best example, but hopefully it gets the point across)

If you deploy an alternative wrapper for rhn_check with a built-in random time delay, you lose the ability to get every client reporting back immediately.

Having some kind of jitter delay factor that could be sent to each client and executed before running rhn_check (as in first comment) would work well.
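A rough sketch of how such a jitter factor might be computed and applied, in Python. Nothing like this exists in osad as far as I know; `jitter_window`, `client_delay`, `checks_per_minute` and the `urgent` flag are all hypothetical. The server would pick a window based on how many systems are targeted and send it with the push, and each client would wait a random time inside that window before running rhn_check.

```python
import random


def jitter_window(system_count, checks_per_minute=5, urgent=False):
    # Server side: window (in seconds) over which clients should spread out.
    # Urgent or small jobs get a zero window, i.e. run immediately.
    if urgent or system_count <= checks_per_minute:
        return 0
    return int(60.0 * system_count / checks_per_minute)


def client_delay(window_seconds, rng=random):
    # Client side: how long to wait before running rhn_check.
    if window_seconds <= 0:
        return 0.0
    return rng.uniform(0, window_seconds)
```

With these assumed defaults, 30 systems would be spread over 360 seconds, while a task marked urgent would still run on every client straight away.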

Comment 6 jochen 2014-05-07 07:41:30 UTC

> This sounds similar to an issue I had with "push-to-client".  My idea
> to get round it was to allow a choice at event submission time:
> 
> 1) Push to client
> 2) Next regular rhn_check
> 3) Not before a specific time

In the 2.1 GUI it might be best to add a checkbox "Push to client" which is unchecked by default if more than one system is selected.

Comment 8 Michael Mráka 2020-03-23 12:21:23 UTC

Spacewalk project as an upstream for Red Hat Satellite 5 product is going to be End Of Life on May 31 2020.

Spacewalk 2.10 has been released as the last release of this project.
https://github.com/spacewalkproject/spacewalk/wiki/ReleaseNotes210

No new features will be included, so the remaining RFEs are being closed to set expectations properly.