141276 – [RHEL3] Have pthread_yield() only yield the thread, not the entire process under NPTL

Bug 141276 - [RHEL3] Have pthread_yield() only yield the thread, not the entire process under NPTL

Summary: [RHEL3] Have pthread_yield() only yield the thread, not the entire process un...

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Ingo Molnar
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2004-11-30 00:05 UTC by Elena Zannoni
Modified:	2007-11-30 22:07 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-10-04 00:13:57 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
patch for glibc (663 bytes, patch) 2004-11-30 00:08 UTC, Elena Zannoni	no flags	Details \| Diff
patch for kernel (1.66 KB, patch) 2004-11-30 00:09 UTC, Elena Zannoni	no flags	Details \| Diff
testcase (73.09 KB, text/plain) 2004-11-30 00:10 UTC, Elena Zannoni	no flags	Details
View All

Description Elena Zannoni 2004-11-30 00:05:35 UTC

Issue:

 What is considered the most "correct" way to yield a single thread on
RHEL 3.0?
- sched_yield behaviour has changed so that it now yields the whole
process
- pthread_yield sounds right but is in fact just a wrapper for sched_yield
- futex may be appropriate in some instances (how do I use it?  'man
futex' references linux/futex.h but this only exists in kernel source,
not in /usr/include),  is pthread_mutex a wrapper?  Let me check on
that...
- nanosleep with very short values has been suggested in the past but
we see very poor performance
- Anything else I'm missing?

jrfuller:
 I am investigating.

Johnray

jrfuller assigned to issue for Morgan_Stanley.

Category set to: Applications
Status set to: Waiting on Tech

jrfuller:
 Just a note, it seems that sced_yield does not yield the whole
process, but rather the "task" (which in this case is roughly
equivalent to the thread).

/**
* sys_sched_yield - yield the current processor to other threads.
*
* this function yields the current CPU by moving the calling thread
* to the expired array. If there are no other threads running on this
* CPU then this function will return.
*/
asmlinkage long sys_sched_yield(void)
{
       runqueue_t *rq = this_rq_lock();
       prio_array_t *array = current->array;
                                                                     
                                                    
       /*
        * We implement yielding by moving the task into the expired
        * queue.
        *
        * (special rule: RT tasks will just roundrobin in the active
        *  array.)
        */
       if (likely(!rt_task(current))) {
               dequeue_task(current, array);
               enqueue_task(current, rq->expired);
       } else {
               list_del(&current->run_list);
               list_add_tail(&current->run_list, array->queue +
current->prio);
       }
       /*
        * Since we are going to call schedule() anyway, there's
        * no need to preempt:
        */
       spin_unlock_irq(&rq->lock);
                                                                     
                                                    
       schedule();
                                                                     
                                                    
       return 0;
}
       
--> from kernel/sched.c
                                                                     
                                           
J

thomas.walker:
 My understanding of this (I haven't looked at the code extensively
but there was a lot of discussion about this when RHEL 3.0 was in Beta
because the behaviour change caused a lot of problems, especially with
JVM's, openoffice, etc.) was that in 2.4 when you called sched_yield
you were essentially only yielding a thread (since the linux threads
model is so messed up) and you waited at the top of the run queue.  In
2.6 the logical unit was a process (maybe this is the bit I have
wrong) but you got put into the bottom of the run queue and thread
prioirty took over... what this meant was that if you had a number of
different processes/threads sched_yield'ing with similar priorities
you would get some *very* poor performance.
Regardless of the underlying bits (I'll try to dig through the code
this weekend), we were told *not* to call sched_yield in a loop when
waiting for contention to free up over something (by Jeff Law among
others) and there is quite a bit about this on the 'net and lkml.
jrfuller I'm teaching today so I don't have a chance to take a closer
look at the threading code but I have people who are interested in
trying to use futexes explicitely.

'man futex' references a system header that doesn't exist (it only
exists in the kernel sources which you're generally not supposed to
include).

Can we use futexes directly?  Do we just use functions that normally
would involve a mutex and get a futex automatically on RHEL 3.0?

I have someone who would like to do work on this today so if I could
get an answer on this asap and we can tackle the larger question later
I'd appreciate it greatly...

Tom.

jrfuller:
 Here is some feedback from a developer here form your original query:

> - sched_yield behaviour has changed so that it now yields the whole
process

Yes, sched_yield has changed between LinuxThreads and NPTL. AIUI, it
never has worked at the process level. However vanilla 2.4.22 will try
not to yield if it thinks nothing else will run, so it does "yield
more" (there was a big discussion when NPTL was created due to some
people not expecting this).

> - pthread_yield sounds right but is in fact just a wrapper for
> sched_yield

This is true in NPTL, at least.

> - futex may be appropriate in some instances (how do I use it?  'man
> futex' references linux/futex.h but this only exists in kernel source,
> not in /usr/include),  is pthread_mutex a wrapper?  Let me check on
> that...

Yes, pthread_mutex is a wrapper ... in NPTL.

> - nanosleep with very short values has been suggested in the past but we
> see very poor performance

With nanosleep(1) you'll get scheduled, which is sometimes what people
want when they call sched_yield().

> - Anything else I'm missing?"
>
> Any advice here would be most appreciated...

The biggest thing we need to know to answer this is _what_ are you
yielding for? You will need to determine what that event (or set of
events) is and have some way to wait for it to trigger.

A way work around this depends on the code, and what tyou really want
to do. Possibly using some of select/nanosleep/sched_yield will work.

Status set to: Waiting on Client

thomas.walker:
We've seen generally poor performance with sched_yield, pthread_mutex, and
nanosleep (all individually), variable amounts of time spinning before
calling any of the above, pthread_spinlock, etc.  Basically what we have
here is a multithreaded memory allocator with (usually) a number of heaps
equal to the number of threads (this is tuneable) and
allocations/deallocations
happening from all threads (i.e. allocator thread is generally not the one
freeing).
When a thread tries to free() it first looks in the last heap it free()d
in (in case someone is freeing a whole bunch of things in sequence) and if
it doesn't find what it is looking for then starts searching the other
heaps.  If it comes upon a heap that is in use by another thread (custom
assembly lock) then it *used* to sched_yield in a loop waiting for the
heap to free up.
We're seeing generally poor perf (~10% off AS 2.1 vs. 3.0 smp best case)
and have tried replacing the sched_yield with the above mentioned methods
without much luck.  We also notice that the number of "missed frees" (i.e.
the number of times that we don't find the allocation in the "current"
heap and have to go looking for it increases rather dramatically under
AS3.0 kernel and think it might be due to a change in the scheduler).
We either need to find a more efficient way of wating for the lock (I
assume pthread_mutex should be best here?) and/or understand why the new
scheduler seems to be throwing us off so much.

FYI - the app in question is mtfilter (from a few other tickets) and the
memory allocator is mts (we don't have source but I've been working
closely with the vendor trying to suggest various approaches, etc.)


--

NOTICE: If received in error, please destroy and notify sender.  Sender
does not waive confidentiality or privilege, and use is prohibited.

jrfuller:
Here is more feedback from a developer. I will post more as I get it.

"
<snip>

> When a thread tries to free() it first looks in the last heap it free()d
> in (in case someone is freeing a whole bunch of things in sequence) and
> if it doesn't find what it is looking for then starts searching the
other
> heaps.  If it comes upon a heap that is in use by another thread (custom
> assembly lock) then it *used* to sched_yield in a loop waiting for the
> heap to free up.

Ok, this is the important bit ... the thread is waiting for the event
"other thread stops using "allocator heap" ... and you approximated
that previously with "assume other thread is blocking by us running, so
give some of our timeslice to it and then return to us" -- this is done by
calling sched_yield().

This is prone to races that can starve the current thread, by giving
too much or too little CPU time to the thread holding the resource (and
also problems if multiple threads are "waiting" for the resource).
If there is a real event mechanism (like a FUTEX -- but other options
are available) then the kernel scheduler knows what is going on and can
make the right choices, starvation is avoided and fairness with multiple
threads contending can also be achieved.

> We're seeing generally poor perf (~10% off AS 2.1 vs. 3.0 smp best case)
> and have tried replacing the sched_yield with the above mentioned
> methods without much luck.  We also notice that the number of
"missed frees"
> (i.e. the number of times that we don't find the allocation in the
"current"
> heap and have to go looking for it increases rather dramatically under
> AS3.0 kernel and think it might be due to a change in the scheduler).

The previous explanation said:

       When a thread tries to free() it first looks in the last heap it
       free()d in (in case someone is freeing a whole bunch of things
       in sequence) and if it doesn't find what it is looking for then
       starts searching the other heaps.

Which implies that locality would only change if things were free'd in
a different order. The second explanation implies that allocating
changes the current heap for malloc and free.

This could mean that due to the latency of getting the real event by
using sched_yield() threads are stealing each others current heap, so
when it does "malloc() ... free()" it's lost the heap by the time the
free comes around..."

Status set to: Waiting on Client

jrfuller:
 Thomas,
Here is another approach Allen Cox mentioned in response to the above.
Thought I'd post it just in case there are useful bits in:

"Are their any LGPL lockless memory pool managers, or near lockless
ones, using
per cpu lists and cmpxchg8b to do list merges ?

>  Which implies that locality would only change if things were free'd in
> a different order. The second explanation implies that allocating
> changes the current heap for malloc and free.

You shouldnt not need to spin in this case for a well designed heap
because you can add the block you want to free to a "todo" list to
be freed by the lock holder.

       free(list):=
top:
       cmxpchg8b two list pointers with a lock bit in the low bit
               [ie atomic list append with lock bit]
       if was already locked ret        # Other thread does it
       cmpxchg 8 list with "null locked"
       free list contents
       if cmpxchg list,  null,unlocked == null,locked
               ret
       # We unlocked so back to top holding pending list
       # If we retake lock we free it, if not we queue it for the
       # next party.
       goto top

If I remember the text book example in my head. There is a more clean
version where you have a single memory freeing thread, and also a really
wild variant on the above where you are guaranteed no looping

Alan "

I don't know if this will be helpful, but I thought I'd throw it out
there just in case.

Johnray

thomas.walker 
---------- Forwarded message ----------
Date: Tue, 13 Apr 2004 07:24:18 -0700 (PDT)
From: Ze'ev Mehler <zeevmehler>
To: Thomas.Walker
Cc: Grant McKenzie <Grant.McKenzie>
Subject: Re: (UPDATED) Issue #36940 (Yielding a thread)[Morgan_Stanley]
   (fwd)

Tom,

First, we need to keep in mind that changes to the i/o scheduler could
also effect the overall thoughput.

Now to the comments here:

Out of about 60 million malloc/free operations, we are seeing about
150,000 more 'bad frees' under 3.0, Bad frees are where free has to
hunt for the right heap to free from. I don't think that this is where
we are loosing 10%-20% of the overall performance.

A thread bound to a particular heap always allocates from that heap.
It's current heap to check for a free is a) the last heap that it freed
from b) it's original malloc heap, then it serches all other heaps.

The difference between 2.1 and 3.0 is actually about double if you look
at case 'a' alone, i.e., 3.0 runs, produce about double the misses of
free from previously 'freed from' heap (about 750,000 cases). It drops
to 150,000 when looking at combined a,b cases.

This really reinforces the fact that the consumer threads are being run
in a very different sequence under 3.0. which would explain why there
last free is not in sequence. There seems to be much more 'choppiness'
in the thread scheduling.

>From reading the lit. with my very dim understanding of the scheduler,
I get the feeling that the blocked threads are being given new
timeslices much too quickly thereby using up system resources to get
themselves reblocked again, instead of letting the threads that have
the locks run to completion faster.

Again, the i/o scheduling might also be a factor here.

--Ze'ev

jrfuller:
 Thomas,

Is there a smaller testcase than mtfilter that illustrates this
adequately?

Thanks,
J

Status set to: Waiting on Client

thomas.walker:
 I'm working on it (digging through old hoard test cases to see if one
reproduces the problem nicely).  I'll let you know if I find one.

--

NOTICE: If received in error, please destroy and notify sender.  Sender
does not waive confidentiality or privilege, and use is prohibited.

thomas.walker:
I've taken a pair of test cases from hoard (gpl'd and I'm even being a
good citizen and including a copy of the license) and cleaned them up
a bit to compile with gcc 3.2.3-20 both with and without mts (one line
in the makefile that can be commented/uncommented).
consume.C is a simple producer/consumer model which seems to exhibit
some of the same performance issues that we're seeing with mtfilter. 
rrconsume.C is a similar example that currently seems to have serious
problems with mts on AS3 (segv).  I'm talking to Ze'ev about it and
have passed on the cases... probably best to focus on the simpler case
for now until I get some feedback about the problems with the
round-robin one from him.
My tests were done on saias90/1 (machines you've seen before and have
sysreports for).  Identical hardware (2x3.2 GHZ w/ 6.0GB memory, HT
on) one with something akin to AS 2.1 U3 and the other with RHEL 3 U1.
 RHEL 3 does indeed seem to perform better w/o mts but when we include
mts (and run into the aforementioned heap locking / thread yielding
issues detailed earlier in this issue) we see AS2.1 surge ahead
whereas RHEL 3 stays more or less flat.
I've included a copy of the mts library and header we are using (w/
Ze'ev's ok) so that you can play with the tests and rebuild in-house.

Let me know if there is anything else I can provide.

File uploaded: testcase.tgz

jrfuller:
Thomas,

There are some new findings on IT 36605 that may be of interest (post
dated 04-14-2004 11:01am).

Is the pthread_cond_wait() call an option for you?

J

Status set to: Waiting on Client

thomas.walker:
possibly, I don't know how much overhead explicitely managing this through
thread signaling would add.  let me try it out.

--

NOTICE: If received in error, please destroy and notify sender.  Sender
does not waive confidentiality or privilege, and use is prohibited.

jrfuller:
In response to Event posted 04-12-2004 12:24pm by jrfuller: (a post
from a Thomas Walker email)

The only docs on futex's that I've found are

The "Fuss, Futexes and Furwocks" paper within:
 http://www.linux.org.uk/~ajh/ols2002_proceedings.pdf.gz

As well as the paper "Futexes are Tricky" here (on Uli's People page):
 http://people.redhat.com/~drepper/futex.pdf

> Can we use futexes directly?  Do we just use functions that normally
would involve a mutex and get a futex automatically on RHEL 3.0?

Yes, you can use futex's directly, but one should assume that the
kernel interface to futexs will change in future release of the
kernel, and one should be prepared to adjust to that.

RHEL3 nptl mutexs are implemented (in part) with futexs, but futexs
and mutexs are not the same thing.

Status set to: Waiting on Client

thomas.walker:
how much am I likely to get out of pthread_cond_wait?  Isn't this
going to essentially amount to pthread_mutex when all is said and
done?  At this point, I think we're looking at scheduler behaviour. 
I'm not seeing huge amounts of heap contention but we're seeing pretty
high idle times in very simple test cases which makes me think that
some of our runqueues are running dry on us.  Couple of patches I want
to try...
And btw- I noticed when comparing AS2.1, RHEL3, and 2.6.5 kernel
sources that the RHEL3 version is missing an "imbalance /=2" in the
load_balance() funtcion in sched.c that is present in both AS2.1 and
2.6.5.  Doesn't seem to have an impact on our test cases to put it
back in but this is probably a mistake left over from backporting
scheduler patches that should be fixed.

thomas.walker:
 I take the bit about the imbalance back, it was just moved into
find_busiest_queue for some odd reason...

--

NOTICE: If received in error, please destroy and notify sender.  Sender
does not waive confidentiality or privilege, and use is prohibited.

jrfuller:
 ok, we are trying to make sure our understanding of the supplied test
case is correct.

The test case (actually two but similar logic) has a producer and
consumers.

The producer generates blocks of memory and consumers use (delete) them.

The consumers wait for the producer to tell them the memory is ready
via semaphores. The sempahore count equals to consumer count -
implying each consumer waits for its own semaphore.

But so far we don't see thread_yield (or sched_yield, etc) in this
picture.  So I assume in the real application, they don't use (many)
semaphores but use one single "home-made" lock. All the threads are
competing for the same lock. If the thread can't get the lock, it has
to "yield" the CPU and they want to know what is the best way to do
this "yield".

Is this correct?

Status set to: Waiting on Client

thomas.walker:
The producer/consumer stuff is only to drive the memory allocator in a
fashion similar to what we see in filter.  In fact, the round-robin
example (currently borken) is a much more accurate test.  The memory
allocator (as explained earlier) uses a custom lock when one of the
threads is working with a heap.  The original question centered on what
was the most appropriate way to have another thread that needs access to
that heap yield to wait for the lock to free since sched_yield doesn't
seem to be doing it for us (I've been through the scheduler code quite a
bit lately and my earlier misunderstaning of the "new" behaviour is
considerably clearer now).
Anthony has been seeing results that are more consistent with the ones
we've been seeing with the real app using the test case that you provided
yesterday so I think that is what we're working with now...

--

NOTICE: If received in error, please destroy and notify sender.  Sender
does not waive confidentiality or privilege, and use is prohibited.

jrfuller:
Great, thanks for the clarification. We are now looking into it.

Thanks,
J


fhirtz:
We've been looking into this and currently have it classed as a "nice
to have" enhancement. How important is this in your implementation? I
need to know that so that I can gauge how hard this needs to be pushed.

Status set to: Waiting on Client

Comment 1 Elena Zannoni 2004-11-30 00:06:33 UTC

This is across kernel too, Peter, can you take a look?

Comment 2 Elena Zannoni 2004-11-30 00:08:35 UTC

Created attachment 107589 [details]
patch for glibc

Comment 3 Elena Zannoni 2004-11-30 00:09:51 UTC

Created attachment 107590 [details]
patch for kernel

Comment 4 Elena Zannoni 2004-11-30 00:10:24 UTC

Created attachment 107591 [details]
testcase

Comment 8 Ernie Petrides 2005-10-04 00:13:57 UTC

Closing, since there no longer is an outstanding issue.

Note You need to log in before you can comment on or make changes to this bug.