Bug 690356

Summary: tcsh doesn't exit when child killed, consumes 100% cpu
Product: Red Hat Enterprise Linux 6 Reporter: Bob Arendt <rda>
Component: tcshAssignee: Vojtech Vitek <vvitek>
Status: CLOSED ERRATA QA Contact: Branislav NĂ¡ter <bnater>
Severity: urgent Docs Contact:
Priority: medium    
Version: 6.0CC: angelotech, bnater, hripps, jkaluza, jwest, msvoboda, ovasik
Target Milestone: rcKeywords: Patch
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: tcsh-6.17-13.el6 Doc Type: Bug Fix
Doc Text:
If the tcsh shell redirected standard output to a child process using a pipe and this child process terminated, the shell tried to print a message to the already-closed pipe as a high-priority event that could never been finished. As a consequence, tcsh entered an infinite loop and consumed up to 100% of the CPU. To fix this issue, this error event is now removed from the event queue before the shell tries to write it to the broken pipe. As result, the parent process terminates ordinarily.
Story Points: ---
Clone Of:
: 690500 (view as bug list) Environment:
Last Closed: 2011-12-06 17:02:06 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 690500    
Attachments:
Description Flags
python script that reproduces the tcsh hang, consuming 100% CPU
none
patch for tcsh consumes 100% CPU when child killed none

Description Bob Arendt 2011-03-24 03:29:31 UTC
Created attachment 487182 [details]
python script that reproduces the tcsh hang, consuming 100% CPU

Description of problem:

If tcsh has stdout redirected to a pipe and the child is killed, tcsh goes into an infinite loop consuming 100% cpu.  (Simple patch & explanation attached to next comment).

Version-Release number of selected component (if applicable):
tcsh-6.17-8.el6

How reproducible:
Reproducer python script attached.  This is using "find /usr/share/doc" as the child process.  It could be xclock, or any sufficiently long-lived process that forked.

Steps to Reproduce:

1.  Monitor processes in one terminal using:
    watch ps --forest -u myusername -o pid,pcpu,args

2.  Cause the hang:
    python csh-hang.py
  
Actual results:

  PID %CPU COMMAND
 3501  0.0 sshd: rda@pts/0
 3502  0.0  \_ -tcsh
 4445  0.1      \_ watch ps --forest -u rda -o pid,pcpu,args
 5416  0.0          \_ ps --forest -u rda -o pid,pcpu,args
 4942 98.6 /bin/csh -f -c find /usr/share/doc


Expected results:

The final /bin/csh process (the parent) should have died.  Instead it's stuck in a fast infinite loop, trying to output errors to a dead pipe (which causes another error ...).

Additional info:

Similar bug report against RHEL5:
https://bugzilla.redhat.com/show_bug.cgi?id=618723
.. but this has gone down the patch of a major code re-write; We've got a one-line patch that fixes this issue.

Another similar bug report against Fedora 14:
https://bugzilla.redhat.com/show_bug.cgi?id=677830

This also exists in RHEL4 - it's been there a while.

Comment 2 RHEL Program Management 2011-03-24 03:47:33 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 3 Bob Arendt 2011-03-24 03:55:01 UTC
Created attachment 487184 [details]
patch for tcsh consumes 100% CPU when child killed

Here's the analysis and patch that my colleague Justin Irwin came up with.  This is happening in tcsh's process event reporting in sh.proc.c.  When the child is killed the event (sigchild?) results in a call to  pendjob().  But the output pipe is closed, so the xprintf() fails without clearing the event.  But the queued event is still the top of the queue, so it tries to process the same event again.  Forever.

Justin's patch zero's the pcurrjob pointer just before xprintf, removing the current error from the queue just prior to the xprintf.  If it fails - no output.  Output isn't possible in this situation anyway.  But the job event is off the queue and tcsh is able to complete and terminate.  We've tested it and it solves the problem.  Notice that pcurrjob is zero'd on exit anyway;  This is  a very safe patch.

This may alleviate the other tcsh bugs cited in comment #1.

Comment 4 Vojtech Vitek 2011-03-24 14:09:39 UTC
Strace of stuck tcsh, after running the reproducer (attachment 487182 [details]):
<snip>
lseek(16, 0, SEEK_END)                  = -1 ESPIPE (Illegal seek)
lseek(16, 0, SEEK_END)                  = -1 ESPIPE (Illegal seek)
write(17, "[1] 10393\n", 10)            = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
lseek(16, 0, SEEK_END)                  = -1 ESPIPE (Illegal seek)
lseek(16, 0, SEEK_END)                  = -1 ESPIPE (Illegal seek)
write(17, "[1] 10393\n", 10)            = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
</snip>

Gdb:
<snip>
Program received signal SIGPIPE, Broken pipe.
0x0000003c696d34f0 in __write_nocancel () at ../sysdeps/unix/syscall-template.S:82
82      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
</snip>

The patch seems ok.

Comment 5 Vojtech Vitek 2011-03-24 14:13:36 UTC
Thanks Bob & Justin, I am going to apply the patch in all supported Fedoras and I'll escalate this for inclusion to RHEL-6.2 and RHEL-5.7 tcsh617.

Comment 6 Vojtech Vitek 2011-03-24 14:17:01 UTC
The patch has been accepted by upstream as well:
http://mx.gw.com/pipermail/tcsh-bugs/2011-March/000727.html

Comment 7 Angelo Bonet 2011-04-15 17:09:40 UTC
For the record, the strace output from this reproducer looks somewhat different on RHEL5 tcsh-6.14-12.el5.

IMO, this looks an awful lot like the strace output from bug #618723 (https://bugzilla.redhat.com/show_bug.cgi?id=618723).

fstat64(247, 0xbfca35a4)                = -1 EBADF (Bad file descriptor)                                               
fstat64(248, 0xbfca35a4)                = -1 EBADF (Bad file descriptor)                                               
fstat64(249, 0xbfca35a4)                = -1 EBADF (Bad file descriptor)                                               
fstat64(250, 0xbfca35a4)                = -1 EBADF (Bad file descriptor)                                               
fstat64(251, 0xbfca35a4)                = -1 EBADF (Bad file descriptor)                                               
fstat64(252, 0xbfca35a4)                = -1 EBADF (Bad file descriptor)                                               
fstat64(253, 0xbfca35a4)                = -1 EBADF (Bad file descriptor)                                               
fstat64(254, 0xbfca35a4)                = -1 EBADF (Bad file descriptor)                                               
fstat64(255, 0xbfca35a4)                = -1 EBADF (Bad file descriptor)                                               
write(17, "[1] 18966\n", 10)            = -1 EPIPE (Broken pipe)                                                       
--- SIGPIPE (Broken pipe) @ 0 (0) ---                                                                                  
_llseek(16, 0, 0xbfca3510, SEEK_END)    = -1 ESPIPE (Illegal seek)                                                     
rt_sigprocmask(SIG_SETMASK, NULL, [], 8) = 0                                                                           
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0                                                                           
fstat64(0, 0xbfca35a4)                  = -1 EBADF (Bad file descriptor)                                               
fstat64(1, 0xbfca35a4)                  = -1 EBADF (Bad file descriptor)                                               
fstat64(2, 0xbfca35a4)                  = -1 EBADF (Bad file descriptor)                                               
fstat64(3, 0xbfca35a4)                  = -1 EBADF (Bad file descriptor)                                               
fstat64(4, 0xbfca35a4)                  = -1 EBADF (Bad file descriptor)                                               
fstat64(5, 0xbfca35a4)                  = -1 EBADF (Bad file descriptor)                                               
fstat64(6, 0xbfca35a4)                  = -1 EBADF (Bad file descriptor)

Comment 11 Miroslav Svoboda 2011-08-30 10:41:22 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
If the tcsh shell redirected standard output to a child process using a pipe and this child process terminated, the shell tried to print a message to the already-closed pipe as a high-priority event that could never been finished. As a consequence, tcsh entered an infinite loop and consumed up to 100% of the CPU. To fix this issue, this error event is now removed from the event queue before the shell tries to write it to the broken pipe. As result, the parent process terminates ordinarily.

Comment 13 errata-xmlrpc 2011-12-06 17:02:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1686.html