740628 – Potential issue in how AsyncTasks run in the task subsystem

Summary: Potential issue in how AsyncTasks run in the task subsystem

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Pulp
Classification:	Retired
Component:	z_other
Sub Component:
Version:	1.0.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	Sprint 29
Assignee:	Jason Connor
QA Contact:	Preethi Thomas
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-09-22 17:56 UTC by Jay Dobies
Modified:	2014-03-31 01:39 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-02-24 20:15:51 UTC
Embargoed:

Description Jay Dobies 2011-09-22 17:56:13 UTC

This wasn't observed as an actual bug; it's the result of some conversations and investigation I've been doing. From a conversation with jconnor it's definitely an issue, and at the very least I hope this bug prompts a deeper investigation than I've done so far.

= Background =
In the task subsystem, the AsyncTask subclass of Task handles gofer's asynchronous, correlated-reply setup. The task's kernel-level thread ends quickly once the request is made into gofer. When the consumer replies from the operation, the gofer callback updates the task with the response. This way, we can leverage the task subsystem for the auditing/results of async consumer operations.

The task remains in a "running" state until the callback is received, even though the kernel-level thread ends. That part is key.

= Issue =
The thread for the task ends quickly once gofer acknowledges the invocation. That's good.

The bad part is that a task left in the "running" state effectively takes up a spot in our running task queue. The number of running spots defaults to 4 and applies to _all_ Pulp task-based operations (repo sync, cds sync).

The issue is easy to see in the install-package-to-consumer-group implementation. Each consumer install is tracked as an individual task in the task subsystem (that makes sense). That means that any consumer group of 4 or more consumers will completely take over Pulp.

This is especially a problem in the consumer package install case. The majority of the work there is done on the consumer itself (barring the bandwidth on Pulp for the consumer to download the packages), which means the Pulp server is sitting idle during those times.

Keeping in mind the desired consumer scale of Pulp, we can't have a consumer group of 1000 consumers blocking all Pulp operations (repo sync, cds sync, etc) while each consumer takes the time to install packages. Since the queue is FIFO, all syncs will be backed up until the install is complete on all 1000 consumers (ok, 997 of them).
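The "997" figure can be checked with a small simulation. This is only a model of the behavior described above, not Pulp's task code: the function name is made up, and it assumes consumer callbacks arrive one at a time and that the sync is queued behind every install.

```python
from collections import deque

MAX_RUNNING = 4  # default number of running spots described above


def installs_finished_before_sync_starts(num_installs, max_running=MAX_RUNNING):
    """Count install callbacks that must arrive before a sync queued
    behind the installs gets a running spot (hypothetical model)."""
    waiting = deque(["install"] * num_installs + ["sync"])
    running = 0   # installs in "running" state, each waiting on its callback
    finished = 0
    while True:
        # Promote tasks from the FIFO queue while running spots are free.
        while waiting and running < max_running:
            task = waiting.popleft()
            if task == "sync":
                return finished
            running += 1
        # All spots are taken: one consumer callback arrives and frees a spot.
        running -= 1
        finished += 1


print(installs_finished_before_sync_starts(1000))  # 997
print(installs_finished_before_sync_starts(3))     # 0
```

With three installs the sync still finds a free spot; with four or more it has to wait for at least one callback.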

= Solutions =

== Introduce an Idle State ==
Instead of leaving tasks in the "running" state while waiting on the callback, they go into an idle state. This would allow other tasks to enter the running state while the idle ones wait on their callbacks.

I don't know enough of the details of the task subsystem, but the prospect of having that many idle tasks sitting at once just feels wrong.

We will also probably need to add logic to cap the number of idle tasks and block more from being created. So when 500 consumer installs are going on and a task comes in attempting to trigger the 501st, the task subsystem detects that the next state of that new task would be idle and prevents it from running until there's room for it to transition to idle.

Of course, that sort of look ahead logic isn't trivial. :)
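To make the look-ahead concrete, here is a rough sketch of a queue with an idle state and an idle cap. Everything in it (the Task and IdleAwareQueue classes, the waits_on_callback flag, the request_sent hook, strict FIFO dispatch) is invented for illustration and is not how the task subsystem is actually structured.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(eq=False)  # compare by identity so list.remove() finds the exact task
class Task:
    name: str
    waits_on_callback: bool = False  # True for AsyncTask-style work (assumption)
    state: str = "waiting"


class IdleAwareQueue:
    """Hypothetical FIFO queue where callback tasks park in an idle state."""

    def __init__(self, max_running=4, max_idle=500):
        self.max_running = max_running
        self.max_idle = max_idle
        self.waiting = deque()
        self.running = []
        self.idle = []

    def enqueue(self, task):
        self.waiting.append(task)
        self._dispatch()

    def _dispatch(self):
        while self.waiting and len(self.running) < self.max_running:
            task = self.waiting[0]
            # Look ahead: a callback task will go idle next, so count tasks
            # already idle plus running tasks that are about to become idle.
            headed_for_idle = len(self.idle) + sum(
                t.waits_on_callback for t in self.running
            )
            if task.waits_on_callback and headed_for_idle >= self.max_idle:
                break  # strict FIFO: the head of the queue blocks the rest
            self.waiting.popleft()
            task.state = "running"
            self.running.append(task)

    def request_sent(self, task):
        """The thread has exited; park the task until the callback arrives."""
        self.running.remove(task)
        task.state = "idle"
        self.idle.append(task)
        self._dispatch()

    def finish(self, task):
        """Callback received (idle task) or thread finished (running task)."""
        (self.idle if task.state == "idle" else self.running).remove(task)
        task.state = "finished"
        self._dispatch()
```

In this sketch, idle tasks no longer count against max_running, so a sync enqueued after six installs have gone idle starts immediately. The idle cap still blocks the head of the queue once it is reached.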

== Split Pulp-centric and Consumer-centric Tasks into Two Queues ==
We're in this position because of the nature of tasks that require a gofer callback. That implies that the actual work of the task is being done off of the Pulp server. Therefore, we may be able to track those tasks separately from ones that will hammer the server.

For example, we could continue to allow 4 concurrent Pulp server tasks but, say, 100 concurrent consumer tasks, under the rationale that those 100 are not actively working to melt the Pulp server. That leaves the bulk of the current task infrastructure in place; tasks remain as running but the kernel-level thread dies.

The only complication is in the manager layer, which has to determine which queue to use both when adding a task and potentially when looking up a task's status. I suspect this is simpler than adding in an idle state and all of the rules that have to go in place for idle tasks, but again, I'm not closely familiar with the task code.
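A minimal sketch of the two-queue idea, assuming the 4 and 100 limits from the example above. The BoundedQueue and TaskManager classes and the "server"/"consumer" labels are hypothetical and only show where the routing decision would live.

```python
from collections import deque


class BoundedQueue:
    """FIFO queue that allows at most max_running tasks in the running state."""

    def __init__(self, max_running):
        self.max_running = max_running
        self.waiting = deque()
        self.running = set()

    def add(self, task_id):
        self.waiting.append(task_id)
        self._dispatch()

    def complete(self, task_id):
        self.running.discard(task_id)
        self._dispatch()

    def _dispatch(self):
        while self.waiting and len(self.running) < self.max_running:
            self.running.add(self.waiting.popleft())

    def status(self, task_id):
        if task_id in self.running:
            return "running"
        if task_id in self.waiting:
            return "waiting"
        return None  # unknown or already completed


class TaskManager:
    """Routes each task to the server queue or the consumer queue."""

    def __init__(self, server_slots=4, consumer_slots=100):
        self.queues = {
            "server": BoundedQueue(server_slots),
            "consumer": BoundedQueue(consumer_slots),
        }
        self.queue_of = {}  # task_id -> queue name, needed for status lookups

    def add(self, task_id, kind):
        self.queue_of[task_id] = kind
        self.queues[kind].add(task_id)

    def complete(self, task_id):
        self.queues[self.queue_of[task_id]].complete(task_id)

    def status(self, task_id):
        return self.queues[self.queue_of[task_id]].status(task_id)
```

The queue_of mapping is the extra bookkeeping mentioned above: the manager has to remember which queue a task went into before it can answer a status lookup.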

Comment 1 Jay Dobies 2011-09-22 18:01:10 UTC

Conversation with jconnor:

<jdob> i was talking to jortel about gofer stuffs
<linear> ok
<jdob> and he mentioned how we have AsyncTask in the task subsystem
<jdob> that will wait on the correlated reply from gofer before marking the task as complete
<jdob> i was curious what state the task sits in while waiting for that and if it holds up one of the queue threads
<jdob> he said it doesnt, but based on my limited understanding of the task queue i'm not sure how the task would reflect as "running" but not be holding up a thread
<linear> hmmm
<linear> lemme look
<jdob> he said he ran it past you, and I'm not really concerned there's an issue so much as I am curious how it's handled
<linear> ok, so he's right and he's not

<linear> the async task's actual kernel-level thread will exit, more or less, immediately
<jdob> right, as soon as gofer acknowledges it got the request and is gonna dispatch it
<linear> however, the task sits in a 'running' state, which will count towards the queue's max running quota 
<linear> so it can block other threads from being fired off
<jdob> so 4 of those will lock up pulp solid.
<linear> yep

<jdob> that's a huge ass problem
<jdob> cause he added install to consumer group
<jdob> which does one task per consumer
<jdob> so you just stopped your pulp server dead in its tracks from doing anything else
<linear> until everything is installed, hmmm
<jdob> which is really bad in this case since all of the actual work is being done on consumers
<jdob> so the pulp server is grossly idle
<linear> perhaps we need a new state for tasks, something like: idle
<jdob> thats what my gut had expected to see in the task subsystem
<linear> that will not blocks other tasks from being launched 
<jdob> that those tasks changed to something like that

Comment 2 Jason Connor 2011-10-12 22:52:41 UTC

We've solved this with the new weighted tasks proposal. AsyncTasks have a default weight of 0, and therefore don't block the tasks behind them while they're running.

https://fedorahosted.org/pulp/wiki/WeightedTasks
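For readers without access to the wiki page, the effect of a zero weight can be illustrated roughly as follows. The admission rule used here (a task starts only if the summed weight of running tasks plus its own weight stays within a threshold) is an assumption made for this sketch, not a description of the actual implementation. The function name and the weight of 1 for a repo sync are also made up; the threshold of 4 matches the default number of running spots from the description.

```python
def start_tasks(queue, threshold=4):
    """Return the names of tasks that may start, in FIFO order.

    queue is a list of (name, weight) pairs. A task is admitted only if the
    total weight of admitted tasks stays within threshold (assumed rule).
    """
    started, load = [], 0
    for name, weight in queue:
        if load + weight > threshold:
            break  # FIFO: the first task that does not fit blocks the rest
        started.append(name)
        load += weight
    return started


# 1000 zero-weight installs followed by one sync of weight 1
queue = [("install-%d" % i, 0) for i in range(1000)] + [("sync", 1)]
print("sync" in start_tasks(queue))  # True

# Five weight-1 syncs: only the first four fit under the threshold
print(start_tasks([("sync-%d" % i, 1) for i in range(5)]))
```

Under this rule, zero-weight tasks never add to the load, so any number of them can sit in the running state without holding back the sync.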

Comment 3 Jeff Ortel 2011-10-13 00:49:15 UTC

build: 0.238

Comment 4 Jason Connor 2011-10-20 18:08:06 UTC

To test this, kick off a number of AsyncTasks (weight: 0; consumer package installs, for instance) equal to the [tasking] concurrency_threshold, which defaults to 4. Then kick off a weighted task (a repo sync, for instance).

Notice that the weighted task is able to run right away (state of running or finished in task status).

If the weighted task has to wait for the AsyncTasks, this bug fails QA; if it doesn't, it passes.

Comment 5 Preethi Thomas 2011-10-21 19:52:40 UTC

verified
[root@preethi ~]# rpm -q pulp
pulp-0.0.240-1.fc15.noarch

I was able to schedule repo sync & package install at the same time and repo sync ran without waiting for package install to finish.

Comment 6 Preethi Thomas 2012-02-24 20:15:51 UTC

Pulp v1.0 is released
Closed Current Release.

Comment 7 Preethi Thomas 2012-02-24 20:17:26 UTC

Pulp v1.0 is released.