Bug 655096

Summary: sort eats CPU when used with pipe
Product: [Fedora] Fedora Reporter: Zdenek Kabelac <zkabelac>
Component: coreutils Assignee: Ondrej Vasik <ovasik>
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: low    
Version: rawhide CC: aquini, kdudka, meyering, ovasik, p, twaugh
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: coreutils-8.8-1.fc15 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-12-23 08:26:31 EST Type: ---

Description Zdenek Kabelac 2010-11-19 10:44:48 EST
Description of problem:

Not sure where the problem is - but on my machine (kernel 2.6.37-rc2) even a simple 'sort file | less' is able to eat 100% CPU. Here is the 'bt' output from gdb, attached while it was looping:

(gdb) bt
#0  0x00007f37374fe2bd in write () from /lib64/libc.so.6
#1  0x00007f3737499943 in _IO_new_file_write () from /lib64/libc.so.6
#2  0x00007f373749958a in new_do_write () from /lib64/libc.so.6
#3  0x00007f373749aaf5 in _IO_new_do_write () from /lib64/libc.so.6
#4  0x00007f37374996ed in _IO_new_file_xsputn () from /lib64/libc.so.6
#5  0x00007f3737498d45 in fwrite_unlocked () from /lib64/libc.so.6
#6  0x0000000000403976 in write_line (line=<value optimized out>, fp=<value optimized out>, output_file=0x0) at sort.c:3301
#7  0x000000000040507c in mergelines_node (temp_output=0x0, tfp=0x7f37377c48c0, total_lines=627, node=0x7fff3f35c210) at sort.c:3888
#8  merge_loop (temp_output=0x0, tfp=0x7f37377c48c0, total_lines=627, queue=0x7fff3f35c710) at sort.c:3974
#9  sortlines (lines=<value optimized out>, dest=<value optimized out>, nthreads=140734253877056, total_lines=627, parent=<value optimized out>, 
    lo_child=<value optimized out>, merge_queue=0x7fff3f35c710, tfp=0x7f37377c48c0, temp_output=0x0) at sort.c:4088
#10 0x0000000000405448 in sortlines (lines=0x7f3731595330, dest=<value optimized out>, nthreads=2, total_lines=627, parent=<value optimized out>, 
    lo_child=<value optimized out>, merge_queue=0x7fff3f35c710, tfp=0x7f37377c48c0, temp_output=0x0) at sort.c:4067
#11 0x000000000040b81d in sort (nthreads=2, output_file=0x0, nfiles=0, files=0x12673e8) at sort.c:4374
#12 main (argc=<value optimized out>, argv=0x12674b0) at sort.c:5191

Version-Release number of selected component (if applicable):
vanilla kernel 2.6.37-rc2 - I hope sort does not depend on some new feature being built into the kernel.

How reproducible:

Steps to Reproduce:
1.  sort file | less
Actual results:

Expected results:

Additional info:
Comment 1 Zdenek Kabelac 2010-11-19 11:06:51 EST
I've downgraded to version coreutils-8.5-10.fc15.x86_64,
which seems to be the last properly working release.

Release coreutils-8.6-1.fc15.x86_64  is the first broken one for F15.
Comment 2 Ondrej Vasik 2010-11-19 11:14:50 EST
One more thing - the file is not just any common file (Zdenek is able to reproduce it on his x86_64 T61 machine with /var/log/messages, but not with e.g. /etc/passwd). I was not able to reproduce it on my i686 T60 machine, though.
Comment 3 Pádraig Brady 2010-11-22 05:09:06 EST
Does it spin forever?
If not, how much longer does sort-8.7 take than sort-8.5?
Does adding the --parallel=1 option change things?
Comment 4 Zdenek Kabelac 2010-11-22 06:06:03 EST
Yes, it's spinning forever as long as it waits for 'less' to print more text.

i.e. a workaround is this:   sort file | tee /dev/null | less
though it's not really practical....

And on the positive side, adding '--parallel=1' fixes the problem.
Comment 5 Pádraig Brady 2010-11-22 06:52:51 EST
Drats. I'd been worried about that. Pity my laptop isn't multicore :(

You could try changing spinlocks to mutexes: configure with HAVE_PTHREAD_SPINLOCK_T undefined, or manually put the following at the top of src/sort.c:

typedef pthread_mutex_t pthread_spinlock_t;

static inline int
pthread_spin_init (pthread_spinlock_t *lock, int pshared)
{
  return pthread_mutex_init (lock, NULL);
}

static inline int
pthread_spin_destroy (pthread_spinlock_t *lock)
{
  return pthread_mutex_destroy (lock);
}

static inline int
pthread_spin_lock (pthread_spinlock_t *lock)
{
  return pthread_mutex_lock (lock);
}

static inline int
pthread_spin_trylock (pthread_spinlock_t *lock)
{
  return pthread_mutex_trylock (lock);
}

static inline int
pthread_spin_unlock (pthread_spinlock_t *lock)
{
  return pthread_mutex_unlock (lock);
}
Comment 6 Zdenek Kabelac 2010-11-22 08:04:11 EST
Here is the actual backtrace from the spinlocked thread in the current build:

(gdb) bt
#0  0x00007fd4275cd425 in ?? () from /lib64/libpthread.so.0
#1  0x0000000000404fde in lock_node (node=<value optimized out>) at sort.c:3766
#2  update_parent (queue=0x7fff8098f2a0, merged=<value optimized out>, node=0x7fd3db4d5da0) at sort.c:3943
#3  merge_loop (temp_output=0x0, tfp=0x7fd4275ba8c0, total_lines=257991, queue=0x7fff8098f2a0) at sort.c:3977
#4  sortlines (lines=<value optimized out>, dest=<value optimized out>, nthreads=140735350895272, total_lines=257991, parent=<value optimized out>, 
    lo_child=<value optimized out>, merge_queue=0x7fff8098f2a0, tfp=0x7fd4275ba8c0, temp_output=0x0) at sort.c:4088
#5  0x00000000004054ab in sortlines_thread (data=<value optimized out>) at sort.c:4011
#6  0x00007fd4275c7d5b in start_thread () from /lib64/libpthread.so.0
#7  0x00007fd427301aad in clone () from /lib64/libc.so.6

btw - it's very nicely visible, e.g. on a quad core, that even though real time improves, user time is actually much higher.
(i.e. on a larger file: 22s real & user single-threaded -> 13s real / 47s user multi-core - so in effect the current implementation takes roughly twice as much CPU.)

I don't think it is a good plan to 'randomly' replace spinlocks with mutexes - that's not how pthreads works - I would expect some conditional waits somewhere.

Oh and btw. I really don't understand how such a patch could have been acked....
Comment 7 Pádraig Brady 2010-11-22 09:11:22 EST
> Oh and btw. I really don't understand how such patch could have been acked

Mea culpa. I've no dedicated multicore hardware, but mainly no time.
Full performance testing is on my todo list (you could help here :))
I thought it better to release early to get varied testing,
given I was happy that it was functionally correct.
Comment 8 Ondrej Vasik 2010-12-23 08:26:31 EST
coreutils-8.8-1.fc15 with various upstream fixes for parallel sorting has been built; closing RAWHIDE.