Bug 1181685

Summary: tcsh hangs on wait
Product: [Fedora] Fedora Reporter: Filip Krska <fkrska>
Component: tcshAssignee: Pavel Raiskup <praiskup>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rawhideCC: jchaloup, ovasik, praiskup, qe-baseos-apps, rkollar
Target Milestone: ---Keywords: Reproducer
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: tcsh-6.18.01-13.fc21 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1181682 Environment:
Last Closed: 2015-02-06 04:02:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Fix proposed upstream. none

Description Filip Krska 2015-01-13 15:10:52 UTC
tcsh-6.18.01-12.fc22.x86_64 reproduces as well

+++ This bug was initially created as a clone of Bug #1181682 +++

tcsh-6.18.01-7.el7.x86_64 reproduces as well

+++ This bug was initially created as a clone of Bug #1181680 +++

Description of problem:

tcsh hangs on wait builtin on system with more than one CPU.


Version-Release number of selected component (if applicable):

tcsh-6.17-24.el6.x86_64

How reproducible:

Always

Steps to Reproduce:
1. create executable script /tmp/reproduce.csh

#!/bin/csh
/bin/date &
/bin/true
wait

2. execute it in loop in csh shell:
csh$ while ( 1 )
while? /tmp/reproduce.csh > & /dev/null
while? end

3. in another terminal watch `ps -C reproduce.csh` command

Actual results:

after some time (several seconds or minutes) the script hangs (you observe the PID stops changing)


Expected results:

the watched PID keeps changing

Additional info:

Reproduces with tcsh-6.18.01-12.fc22.x86_64, tcsh-6.18.01-7.el7.x86_64 as well.

(gdb) bt
#0  0x00000034bf832c2a in do_sigsuspend (set=0x7fff735ec840) at ../sysdeps/unix/sysv/linux/sigsuspend.c:63
#1  __sigsuspend (set=0x7fff735ec840) at ../sysdeps/unix/sysv/linux/sigsuspend.c:74
#2  0x00000000004201c5 in dowait (v=<value optimized out>, c=<value optimized out>) at sh.proc.c:604
#3  0x000000000042197f in execute (t=0x2413850, wanttty=-1, pipein=0x0, pipeout=0x0, do_glob=1) at sh.sem.c:632
#4  0x0000000000421396 in execute (t=0x23cb1e0, wanttty=-1, pipein=0x0, pipeout=0x0, do_glob=1) at sh.sem.c:720
#5  0x0000000000403f38 in process (catch=1) at sh.c:2043
#6  0x000000000040605b in main (argc=<value optimized out>, argv=0x7fff735ed1a8) at sh.c:1290
(gdb) f 2
#2  0x00000000004201c5 in dowait (v=<value optimized out>, c=<value optimized out>) at sh.proc.c:604
604		    sigsuspend(&pause_mask);
(gdb) l
599	loop:
600	    for (pp = proclist.p_next; pp; pp = pp->p_next)
601		if (pp->p_procid &&	/* pp->p_procid == pp->p_jobid && */
602		    pp->p_flags & PRUNNING) {
603		    handle_pending_signals();
604		    sigsuspend(&pause_mask);
605		    handle_pending_signals();
606		    goto loop;
607		}
608	    pjobs = 0;

Comment 1 Filip Krska 2015-01-15 15:48:22 UTC
#2  0x0000000000420445 in dowait (v=<optimized out>, c=<optimized out>) at sh.proc.c:611
611		    sigsuspend(&pause_mask);
(gdb) l
606	loop:
607	    for (pp = proclist.p_next; pp; pp = pp->p_next)
608		if (pp->p_procid &&	/* pp->p_procid == pp->p_jobid && */
609		    pp->p_flags & PRUNNING) {
610		    (void)handle_pending_signals();
611		    sigsuspend(&pause_mask);
612		    opintr_disabled = pintr_disabled;
613		    pintr_disabled = 0;
614		    gotsig = handle_pending_signals();
615		    pintr_disabled = opintr_disabled;
(gdb) p proclist 
$15 = {p_next = 0x778908, p_friends = 0x0, p_cwd = 0x0, p_flags = 0, p_reason = 0 '\000', p_index = 0, p_parentid = 0, p_procid = 0, 
  p_jobid = 0, p_btime = {tv_sec = 0, tv_usec = 0}, p_etime = {tv_sec = 0, tv_usec = 0}, p_rusage = {ru_utime = {tv_sec = 0, 
      tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 0}, {ru_maxrss = 0, __ru_maxrss_word = 0}, {ru_ixrss = 0, 
      __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 0, 
      __ru_minflt_word = 0}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, 
      __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, 
      __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word = 0}, {ru_nvcsw = 0, __ru_nvcsw_word = 0}, {ru_nivcsw = 0, 
      __ru_nivcsw_word = 0}}, p_command = 0x0}
(gdb) p *proclist->p_next 
$17 = {p_next = 0x733508, p_friends = 0x778908, p_cwd = 0x7416c8, p_flags = 4, p_reason = 0 '\000', p_index = 0, p_parentid = 14001, 
  p_procid = 0, p_jobid = 14017, p_btime = {tv_sec = 1421314672, tv_usec = 366544}, p_etime = {tv_sec = 0, tv_usec = 0}, p_rusage = {
    ru_utime = {tv_sec = 0, tv_usec = 2000}, ru_stime = {tv_sec = 0, tv_usec = 0}, {ru_maxrss = 2124, __ru_maxrss_word = 2124}, {
      ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 85, 
      __ru_minflt_word = 85}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, 
      __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, 
      __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word = 0}, {ru_nvcsw = 1, __ru_nvcsw_word = 1}, {ru_nivcsw = 1, 
      __ru_nivcsw_word = 1}}, p_command = 0x7f3988 L"/bin/true"}
(gdb) p *proclist->p_next->p_next
$21 = {p_next = 0x0, p_friends = 0x733508, p_cwd = 0x7416c8, p_flags = 32772, p_reason = 0 '\000', p_index = 1, p_parentid = 14001, 
  p_procid = 14016, p_jobid = 14016, p_btime = {tv_sec = 1421314672, tv_usec = 365773}, p_etime = {tv_sec = 0, tv_usec = 0}, 
  p_rusage = {ru_utime = {tv_sec = 0, tv_usec = 4000}, ru_stime = {tv_sec = 0, tv_usec = 0}, {ru_maxrss = 2124, 
      __ru_maxrss_word = 2124}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, 
      __ru_isrss_word = 0}, {ru_minflt = 100, __ru_minflt_word = 100}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, 
      __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0}, {ru_msgsnd = 0, 
      __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word = 0}, {ru_nvcsw = 1, 
      __ru_nvcsw_word = 1}, {ru_nivcsw = 4, __ru_nivcsw_word = 4}}, p_command = 0x7f3dc8 L"/bin/date"}

Race condition theory:

/bin/date exits
SIGCHLD is emited then
pchild() is called then
dowait() is called when wait met in the script

606	loop:
607	    for (pp = proclist.p_next; pp; pp = pp->p_next)
608		if (pp->p_procid &&	/* pp->p_procid == pp->p_jobid && */
609		    pp->p_flags & PRUNNING) {

reaches /bin/date process, however PRUNNING flag still may be in place
sigsuspend() is called and waits forever...
pchild() sets pp->p_flags &= ~(PRUNNING | PSTOPPED | PREPORTED), however, too late

Does it make sense, Ondrej?

Comment 3 Pavel Raiskup 2015-01-16 16:32:16 UTC
Reproduced.  Sending SIGCHLD to tcsh then starts the loop again, yep it seems
like SIGCHLD leak.  I'll have a look.

Comment 4 Pavel Raiskup 2015-01-19 08:26:39 UTC
Created attachment 981414 [details]
Fix proposed upstream.

I sent the attached patch to tcsh-bugs.com, this needs to be fixed
upstream first definitely.  Mail archives are slow now so I'll post the link
to proposal later.

Comment 8 Fedora Update System 2015-01-27 06:58:19 UTC
tcsh-6.18.01-13.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/tcsh-6.18.01-13.fc21

Comment 9 Fedora Update System 2015-01-27 06:59:37 UTC
tcsh-6.18.01-6.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/tcsh-6.18.01-6.fc20

Comment 10 Fedora Update System 2015-01-28 19:57:20 UTC
Package tcsh-6.18.01-6.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing tcsh-6.18.01-6.fc20'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-1280/tcsh-6.18.01-6.fc20
then log in and leave karma (feedback).

Comment 11 Fedora Update System 2015-02-06 04:02:47 UTC
tcsh-6.18.01-6.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 12 Fedora Update System 2015-02-06 04:03:07 UTC
tcsh-6.18.01-13.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.