Bug 867614 - [RFE] osa dispatch runs all tasks at the same time, overloading the server
Status: NEW
Product: Spacewalk
Classification: Community
Component: Server
Assigned To: Tomas Lestach
QA Contact: Red Hat Satellite QA List
Blocks: spacewalk-rfe
Reported: 2012-10-17 16:45 EDT by Kevin Stange
Modified: 2015-10-09 04:42 EDT
Doc Type: Bug Fix
Type: Bug
Description Kevin Stange 2012-10-17 16:45:58 EDT
Description of problem:

All of my enrolled systems use osad so that I can push real-time requests to them.  When I schedule a package task for a system set, such as a mass update that applies to more than a handful of systems, the update is immediately sent out to every system in the set.  The systems then all run rhn_check at once and pull their package updates, which causes a huge spike in activity.

My system has 4 GB of RAM and the base usage for spacewalk 1.8 is consuming about 2 GB of RAM.  The activity spike for about 30 systems pushes my RAM usage over 4 GB and causes thrashing.

This could be mitigated by scaling up the amount of RAM in my system, but this solution would not scale to 60, 100, 200 systems updated at once.

My preference would be to insert a deferral process into the osa system which works by staggering the systems it notifies so that it won't activate more than a set number of tasks simultaneously.  A task would be considered complete when it reports as completed or failed back to spacewalk.
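A minimal sketch of that staggering idea, in Python. It is not taken from the osa dispatcher code; the class name, the `notify` callback, and `max_active` are hypothetical names chosen for illustration. At most `max_active` systems are notified at once, and the next pending system is notified only when an active one reports completed or failed.

```python
from collections import deque


class StaggeredDispatcher:
    """Notify at most max_active systems at a time (hypothetical sketch)."""

    def __init__(self, systems, notify, max_active):
        if max_active < 1:
            raise ValueError("max_active must be at least 1")
        self.pending = deque(systems)  # systems not yet notified
        self.active = set()            # systems notified, task still running
        self.notify = notify           # assumed callback that pings one system
        self.max_active = max_active

    def start(self):
        """Notify the first batch of systems."""
        self._fill()

    def task_finished(self, system):
        """Call when a system reports its task as completed or failed."""
        self.active.discard(system)
        self._fill()

    def _fill(self):
        # Top up the active set until the limit is reached or nothing is pending.
        while self.pending and len(self.active) < self.max_active:
            system = self.pending.popleft()
            self.active.add(system)
            self.notify(system)
```

With 30 systems and `max_active=5`, only five systems would be running rhn_check at any moment, and each report back frees one slot for the next system.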

Alternatively, spacewalk could simply respond to agents requesting package updates by saying "busy, try again in 5 minutes", at which point the requesting system could sleep and retry until it is allowed to complete its request.
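A rough Python sketch of the "busy, try again" alternative, under stated assumptions: `BusyGate` stands in for whatever admission check the server would perform, and `request_with_retry` for the loop on the requesting side. Neither name exists in spacewalk; both are hypothetical. The 300-second default mirrors the 5 minutes suggested above.

```python
import threading
import time


class BusyGate:
    """Admit at most max_active concurrent requests (hypothetical sketch)."""

    def __init__(self, max_active):
        self.max_active = max_active
        self.active = 0
        self.lock = threading.Lock()

    def try_acquire(self):
        """Return True if the request may proceed, False for 'busy'."""
        with self.lock:
            if self.active >= self.max_active:
                return False
            self.active += 1
            return True

    def release(self):
        """Free one slot once a request has finished."""
        with self.lock:
            if self.active > 0:
                self.active -= 1


def request_with_retry(gate, do_request, retry_delay=300, max_attempts=None,
                       sleep=time.sleep):
    """Run do_request once the gate admits it, sleeping between attempts."""
    attempts = 0
    while True:
        if gate.try_acquire():
            try:
                return do_request()
            finally:
                gate.release()
        attempts += 1
        if max_attempts is not None and attempts >= max_attempts:
            raise RuntimeError("server still busy after %d attempts" % attempts)
        sleep(retry_delay)
```

The `sleep` parameter is injectable only so the loop can be exercised without waiting; in real use the default `time.sleep` applies.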

Version-Release number of selected component (if applicable):

Comment 1 Jan Pazdziora 2012-10-18 03:10:54 EDT
Let me mark this as RFE.
Comment 2 James Hogarth 2012-11-14 04:58:02 EST
I think I'm encountering a similar situation (but it could be something else too) related to spacewalk 1.8 on postgres... are you using a postgres or oracle backend?

I'm considering configuring rhn_check_command in /etc/sysconfig/rhn/osad.conf to instead be a script that might be a little smarter... at the simple end just add a random sleep before the rhn_check itself to minimise the impact of all calling in at once, or perhaps at the more complicated end carrying out a check somehow to see how busy the server is prior to initiating a request...
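The simple end of that idea could look roughly like the Python sketch below. The path /usr/sbin/rhn_check and the 300-second ceiling are assumptions, not values from this report, and `main` is a hypothetical name; adjust them to the local install.

```python
import os
import random
import sys
import time

RHN_CHECK = "/usr/sbin/rhn_check"  # assumed location of the real rhn_check
MAX_DELAY_SECONDS = 300            # arbitrary ceiling for the random sleep


def pick_delay(max_delay=MAX_DELAY_SECONDS, rng=random):
    """Pick a random delay in seconds between 0 and max_delay."""
    return rng.uniform(0, max_delay)


def main(argv, sleep=time.sleep, execv=os.execv):
    """Sleep for a random time, then replace this process with rhn_check."""
    sleep(pick_delay())
    # Pass any arguments osad supplied straight through to rhn_check.
    execv(RHN_CHECK, [RHN_CHECK] + list(argv[1:]))


# Entry point when installed as the wrapper script:
#     main(sys.argv)
```

To use it, the script would end with the `main(sys.argv)` call shown in the final comment, and rhn_check_command would point at the script instead of rhn_check.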

If you are on a postgres backend it's worth keeping an eye on bug #873379 as well...
Comment 3 Kevin Stange 2012-11-14 13:20:27 EST
Yes, my backend is postgres.  I will look into deploying a replacement rhn_check command with a random sleep as a mitigation option.  Thanks for that hint.
Comment 4 Duncan Innes 2013-02-07 07:37:13 EST
This sounds similar to an issue I had with "push-to-client".  My idea to get round it was to allow a choice at event submission time:

1) Push to client
2) Next regular rhn_check
3) Not before a specific time

I figured that allowing bulk tasks such as you describe to be picked up by the next regular rhn_check, rather than prompting each client to check now, would spread the load out in a reasonable manner.  How well it spreads depends on the INTERVAL setting in your rhnsd config file.

I was more thinking about spreading out loads on hundreds of virtual clients running on a handful of virtual hosts rather than worrying about the impact on my Satellite/Spacewalk.  I think the solution could be similar, however.

To be honest, the task scheduling could be quite a bit better.  The "Schedule no sooner than" date could be completely separate from whether to use osad or rhn_check.  If I could set a task to happen at midnight, I might either want it all done as soon as possible, or just whenever the regular rhn_check happens.
Comment 5 Duncan Innes 2013-02-07 07:59:18 EST
The problem with deploying an alternative rhn_check for osad to be using is that sometimes you *do* want a task happening immediately.

service nscd restart (not the best example, but hopefully it gets the point across)

If you deploy an alternative wrapper for rhn_check with a built-in random time delay, you lose the ability to get every client reporting back immediately.

Having some kind of jitter delay factor that could be sent to each client and executed before running rhn_check (as in first comment) would work well.
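One way to picture that per-task jitter factor, as a hedged Python sketch: the server picks a delay for each client from a window chosen at submission time, and a window of 0 keeps the task immediate. The function `assign_jitter` and its parameters are hypothetical; nothing like this exists in osad as far as this report shows.

```python
import random


def assign_jitter(system_ids, window_seconds, rng=random):
    """Map each system to a delay in seconds within [0, window_seconds].

    A window of 0 means "run now", for tasks that must happen immediately.
    """
    if window_seconds < 0:
        raise ValueError("window_seconds must not be negative")
    delays = {}
    for system_id in system_ids:
        if window_seconds == 0:
            delays[system_id] = 0.0
        else:
            delays[system_id] = rng.uniform(0, window_seconds)
    return delays
```

A bulk package update might be submitted with `window_seconds=600`, while something like the nscd restart would use `window_seconds=0` so every client reports back immediately.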
Comment 6 jochen 2014-05-07 03:41:30 EDT
> This sounds similar to a similar issue I had with "push-to-client".  My idea
> to get round it was to allow a choice at event submission time:
> 1) Push to client
> 2) Next regular rhn_check
> 3) Not before a specific time

In the 2.1 GUI it might be best to add a checkbox "Push to client" which is unchecked by default if more than one system is selected.
