Created attachment 477700 [details] reproducer Description of problem: When a signal comes in during a normal tcsh program execution just after glibc locks its internal malloc structures (before releasing them) and when a signal handler calls a malloc() related routine, tcsh gets stuck. This problem is deep and related to all signals tcsh copes with internally. It mostly happens for SIGHUP signal (with attached reproducer), but I can see from the source code that it could happen also e.g. for pchild() - SIGCHLD handler. See 'Additional info'. Version-Release number of selected component (if applicable): tcsh-6.14-12.el5-x86_64 How reproducible: takes some time to reproduce Steps to Reproduce: 1. Preapre two machines (server1/client1) 2. By copying client1's public key to server1's /root/.ssh/authorized_keys, setup the way client1 can login to server1 via ssh without password/passphrase 3. On server1, create the following empty file in /tmp. # touch /tmp/testmgr.log 4. On client1, download "ssh_test" program attached to this BZ and do the following: # csh # setenv SENJUHOME /tmp # cd $SENJUHOME # mkdir log Test if everything works correctly (you should see the following, no errors): $ ~/ssh_test 2 2 root@server1 [2][2][root@server1] ======== sj_sshConnect START ========= stropt=[ssh] stropt=[-2] stropt=[-o] stropt=[PreferredAuthentications publickey] stropt=[root@server1] stropt=[] sj_sshConnect END rc=[0],fd=[3],child_pid=[12565] ======== sj_sshConnect END ========= ======== sj_sshExecCmd Loop START ========= result = 457656 -rw-r--r-- 1 root root 0 Oct 4 18:52 /tmp/testmgr.log sj_sshExecCmd[0] END rc=[0] exitcode=[0] result = 457656 -rw-r--r-- 1 root root 0 Oct 4 18:52 /tmp/testmgr.log sj_sshExecCmd[1] END rc=[0] exitcode=[0] result = 457656 -rw-r--r-- 1 root root 0 Oct 4 18:52 /tmp/testmgr.log sj_sshExecCmd[2] END rc=[0] exitcode=[0] result = 457656 -rw-r--r-- 1 root root 0 Oct 4 18:52 /tmp/testmgr.log sj_sshExecCmd[3] END rc=[0] exitcode=[0] result = 457656 -rw-r--r-- 1 root root 0 Oct 4 18:52 /tmp/testmgr.log sj_sshExecCmd[4] END rc=[0] exitcode=[0] ======== sj_sshExecCmd Loop END ========= ======== sj_sshDisconnect START ========= sj_sshDisconnect END rc=[0] child_pid=[12565] ======== sj_sshDisconnect END ========= ^C $ Then repeat the following approx 30 times: # ./ssh_test 2 2 root@server1 > & /dev/null & After some time you should see processes that have their parent id = 1 and they stay in this state for a long time (sometimes it needs 5/15/60 minutes or more for bug to show up), keep watching: # ps -fe | grep sh Actual results: 0 501 28870 1 15 0 70756 1904 184466 S ? 0:00 -sh 0 501 29112 1 16 0 70780 1924 184466 S ? 0:00 -sh 0 501 29114 1 15 0 70748 1896 184466 S ? 0:00 -sh 0 501 29182 1 15 0 70756 1904 184466 S ? 0:00 -sh 0 501 29498 1 16 0 70756 1900 184466 S ? 0:00 -sh ~~~~~~~~~~~~~~~~~~~~~~~~~~snip~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 0 501 32178 1 15 0 70772 1924 184466 S ? 0:00 -sh 0 501 32209 1 15 0 70784 1924 184466 S ? 0:00 -sh 0 501 32224 1 15 0 70780 1932 184466 S ? 0:00 -sh 0 501 32499 1 15 0 70784 1940 184466 S ? 0:00 -sh 0 501 32566 1 15 0 70772 1920 184466 S ? 0:00 -sh 0 501 32692 1 15 0 70748 1892 184466 S ? 0:00 -sh 0 501 32735 1 18 0 136360 1980 184466 Ss ? 0:00 -csh Expected results: No stuck processes. Additional info: This BZ is related to BZ #618723 - fixing a bug from that BZ using a patch from comment #18 helps to reveal the bug described in this BZ. It can happen that without fixing the bug from #618723 this bug cannot be reproduced. The bug is fixed since upstream 6.15.00. It is fixed by not calling any of signal handlers on signal arrival. It calls a special function (handle_pending_signals()) in main program loops and on a lot other (often called) places instead. The following are backtraces from stuck processes: (gdb) bt #0 0x00b3c402 in __kernel_vsyscall () #1 0x00c42783 in __lll_lock_wait_private () from /lib/libc.so.6 #2 0x00bd1c76 in _L_lock_5396 () at malloc.c:6195 #3 0x00bccf69 in _int_free (av=0xcb9140, p=0x9049ce0, have_lock=0) at malloc.c:4846 #4 0x00bcd9e9 in __libc_free (mem=0x9049ce8) at malloc.c:3670 #5 0x0808a1d6 in sfree (p=0x9049ce8) at tc.alloc.c:579 #6 0x0806173b in freelex (vp=0x9049bc8) at sh.lex.c:288 #7 0x08060029 in hfree (hp=0x9049bc8) at sh.hist.c:215 #8 0x0805fd27 in enthist (event=81, lp=0x80bd670, docopy=1, mflg=1) at sh.hist.c:134 #9 0x0805fba0 in savehist (sp=0x80bd670, mflg=1) at sh.hist.c:85 #10 0x0804da08 in process (catch=0) at sh.c:2176 #11 0x0804d0ad in srcunit (unit=6, onlyown=0, hflg=2, av=0xb7c004d0) at sh.c:1728 #12 0x0804c923 in srcfile (f=0xbfc75c20 "/home/buchticka/.history", onlyown=0, flag=2, av=0xb7c004d0) at sh.c:1498 #13 0x0804dd34 in dosource (t=0xb7c004d0, c=0x0) at sh.c:2282 #14 0x0806076a in loadhist (fname=0xb7c00468 L"/home/buchticka/.history", mflg=1) at sh.hist.c:482 #15 0x08060668 in rechist (fname=0xb7c00468 L"/home/buchticka/.history", ref=1) at sh.hist.c:450 #16 0x0804e481 in record () at sh.c:2588 #17 0x0804d328 in phup (snum=1) at sh.c:1834 #18 <signal handler called> #19 0x00bcd047 in _int_free (av=0xcb9140, p=0x904a468, have_lock=0) at malloc.c:4900 #20 0x00bcd9e9 in __libc_free (mem=0x904a470) at malloc.c:3670 #21 0x00b82624 in __gconv_release_cache (steps=0x904a4e8, nsteps=<value optimized out>) at gconv_cache.c:460 #22 0x00b7b94d in __gconv_close_transform (steps=0x904a470, nsteps=2) at gconv_db.c:797 #23 0x00b7b791 in __gconv_close (cd=0x9049e00) at gconv_close.c:64 #24 0x00b7ae0c in iconv_close (cd=0x904a4e8) at iconv_close.c:36 #25 0x0805d50c in nlsclose () at sh.func.c:2539 #26 0x0804e3d9 in xexit (i=1) at sh.c:2521 #27 0x080679b7 in flush () at sh.print.c:294 #28 0x08067788 in xputchar (c=10) at sh.print.c:188 #29 0x0808ea4c in doprnt (addchar=0x80675b1 <xputchar>, sfmt=0x80938e6 "exit\n", ap=0xbfc77224 "") at tc.printf.c:81 #30 0x0808f322 in xprintf (fmt=0x80938e6 "exit\n") at tc.printf.c:377 #31 0x0804c669 in main (argc=0, argv=0xbfc7b514) at sh.c:1382 (gdb) (gdb) bt #0 0x00bd4402 in __kernel_vsyscall () #1 0x009e95ee in __lll_mutex_lock_wait () from /lib/libc.so.6 #2 0x00978d78 in _L_lock_14118 () from /lib/libc.so.6 #3 0x00978024 in free () from /lib/libc.so.6 #4 0x0098e6c8 in tzset_internal () from /lib/libc.so.6 #5 0x0098efcd in tzset () from /lib/libc.so.6 #6 0x00993726 in strftime_l () from /lib/libc.so.6 #7 0x009d905f in __vsyslog_chk () from /lib/libc.so.6 #8 0x009d9547 in vsyslog () from /lib/libc.so.6 #9 0x0096cdd8 in __libc_message () from /lib/libc.so.6 #10 0x00974b16 in _int_free () from /lib/libc.so.6 #11 0x00978030 in free () from /lib/libc.so.6 #12 0x009236af in __gconv_close () from /lib/libc.so.6 #13 0x00922d5c in iconv_close () from /lib/libc.so.6 #14 0x0805d504 in nlsclose () at sh.func.c:2539 #15 0x0804e408 in xexit (i=1) at sh.c:2530 #16 0x0804d3fd in phup (snum=1) at sh.c:1892 #17 <signal handler called> #18 0x0092a550 in __gconv_release_cache () from /lib/libc.so.6 #19 0x0092389d in __gconv_close_transform () from /lib/libc.so.6 #20 0x009236e1 in __gconv_close () from /lib/libc.so.6 #21 0x00922d5c in iconv_close () from /lib/libc.so.6 #22 0x0805d504 in nlsclose () at sh.func.c:2539 #23 0x0804e408 in xexit (i=1) at sh.c:2530 #24 0x080679af in flush () at sh.print.c:294 #25 0x08067780 in xputchar (c=10) at sh.print.c:188 #26 0x0808ea54 in doprnt (addchar=0x80675a9 <xputchar>, sfmt=0x8093916 "exit\n", ap=0xbfb2dc24 "") at tc.printf.c:81 #27 0x0808f32a in xprintf (fmt=0x8093916 "exit\n") at tc.printf.c:377 #28 0x0804c5be in main (argc=0, argv=0xbfb31df4) at sh.c:1393 (gdb) (gdb) bt #0 0x008d2402 in __kernel_vsyscall () #1 0x005e3226 in kill () from /lib/libc.so.6 #2 0x0804a689 in myreset () at sh.c:209 #3 0x0804db9f in process (catch=0) at sh.c:2248 #4 0x0804d1ec in srcunit (unit=6, onlyown=0, hflg=2, av=0x88cf130) at sh.c:1784 #5 0x0804ca62 in srcfile (f=0xbfb8c880 "/home/buchticka/.history", onlyown=0, flag=2, av=0x88cf130) at sh.c:1554 #6 0x0804de66 in dosource (t=0x88cf130, c=0x0) at sh.c:2338 #7 0x08060812 in loadhist (fname=0x88c6cf0 L"/home/buchticka/.history", mflg=1) at sh.hist.c:482 #8 0x08060710 in rechist (fname=0x88c6cf0 L"/home/buchticka/.history", ref=1) at sh.hist.c:450 #9 0x0804e547 in record () at sh.c:2640 #10 0x0804d467 in phup (snum=1) at sh.c:1888 #11 <signal handler called> #12 0x008d2402 in __kernel_vsyscall () #13 0x005e332c in sigsuspend () from /lib/libc.so.6 #14 0x08068202 in pjwait (pp=0x88d46c8) at sh.proc.c:589 #15 0x080680da in pwait () at sh.proc.c:538 #16 0x0806afd5 in execute (t=0x88d6030, wanttty=25555, pipein=0x0, pipeout=0x0, do_glob=1) at sh.sem.c:631 #17 0x0806b3eb in execute (t=0x88d6ad0, wanttty=25555, pipein=0x0, pipeout=0x0, do_glob=1) at sh.sem.c:723 #18 0x0804dc82 in process (catch=1) at sh.c:2285 #19 0x0804c751 in main (argc=0, argv=0xbfba21b4) at sh.c:1422 (gdb) The number of changes required to fix/rewrite the signal handling is enormous. The following patch is only to point out the changes/functions related and required to fix the problem for HUP signal, e.g. see the fixed upstream version that it calls handle_pending_signals() quite more often (the patch only once - in the main program loop): C symbol: handle_pending_signals File Function Line 0 tc.sig.h <global> 162 extern void handle_pending_signals(void ); 1 sh.func.c xcatgets 2495 handle_pending_signals(); 2 sh.func.c nlsclose 2593 handle_pending_signals(); 3 sh.misc.c xclose 545 handle_pending_signals(); 4 sh.misc.c xclosedir 552 handle_pending_signals(); 5 sh.misc.c xcreat 561 handle_pending_signals(); 6 sh.misc.c xdup2 572 handle_pending_signals(); 7 sh.misc.c xgetgrgid 584 handle_pending_signals(); 8 sh.misc.c xgetpwnam 597 handle_pending_signals(); 9 sh.misc.c xgetpwuid 610 handle_pending_signals(); a sh.misc.c xopen 623 handle_pending_signals(); b sh.misc.c xopen 635 handle_pending_signals(); c sh.misc.c xread 648 handle_pending_signals(); d sh.misc.c xtcsetattr 661 handle_pending_signals(); e sh.misc.c xwrite 674 handle_pending_signals(); f sh.proc.c pchild 263 handle_pending_signals(); g sh.proc.c pjwait 504 handle_pending_signals(); h sh.proc.c dowait 597 handle_pending_signals(); i sh.proc.c dowait 599 handle_pending_signals(); j sh.proc.c pfork 1873 handle_pending_signals(); k tc.func.c auto_lock 689 handle_pending_signals(); l tc.func.c remotehost 2032 handle_pending_signals(); m tc.sig.c handle_pending_signals 65 handle_pending_signals(void ) n tc.sig.c disabled_cleanup 120 handle_pending_signals(); o tc.sig.c pintr_push_enable 138 handle_pending_signals(); p tw.init.c tw_grpname_next 690 handle_pending_signals(); diff -up tcsh-6.14.00/sh.c.orig tcsh-6.14.00/sh.c --- tcsh-6.14.00/sh.c.orig 2011-02-08 22:41:31.157316490 +0100 +++ tcsh-6.14.00/sh.c 2011-02-08 22:44:28.353816291 +0100 @@ -163,7 +163,7 @@ static int srcfile __P((const char *, #else int srcfile __P((const char *, int, int, Char **)); #endif /*WINNT_NATIVE*/ -static RETSIGTYPE phup __P((int)); +RETSIGTYPE phup __P((int)); static void srcunit __P((int, int, int, Char **)); static void mailchk __P((void)); #ifndef _PATH_DEFPATH @@ -1106,7 +1106,7 @@ main(argc, argv) * We also only setup the handlers for shells that are trully * interactive. */ - osig = signal(SIGHUP, phup); /* exit processing on HUP */ + osig = signal(SIGHUP, queue_phup); /* exit processing on HUP */ if (!loginsh && osig == SIG_IGN) (void) signal(SIGHUP, osig); #ifdef SIGXCPU @@ -1795,7 +1795,7 @@ exitstat() /* * in the event of a HUP we want to save the history */ -static RETSIGTYPE +RETSIGTYPE phup(snum) int snum; { @@ -1997,6 +1997,7 @@ process(catch) savet = NULL; getexit(osetexit); for (;;) { + handle_pending_signals(); pendjob(); diff -up tcsh-6.14.00/tc.sig.c.orig tcsh-6.14.00/tc.sig.c --- tcsh-6.14.00/tc.sig.c.orig 2011-02-08 22:45:04.618770856 +0100 +++ tcsh-6.14.00/tc.sig.c 2011-02-08 22:46:31.817629697 +0100 @@ -43,6 +43,23 @@ RCSID("$Id: tc.sig.c,v 3.29 2005/01/18 2 */ #define MAX_CHLD 50 +static volatile sig_atomic_t phup_pending; /* = 0; */ + +void +queue_phup(int sig) +{ + USE(sig); + phup_pending = 1; +} + +void +handle_pending_signals(void) +{ + if (phup_pending) { + phup(SIGHUP); + } +} + # ifdef UNRELSIGS static struct mysigstack { int s_w; /* wait report */ diff -up tcsh-6.14.00/tc.sig.h.orig tcsh-6.14.00/tc.sig.h --- tcsh-6.14.00/tc.sig.h.orig 2011-02-08 22:46:57.923579084 +0100 +++ tcsh-6.14.00/tc.sig.h 2011-02-08 22:47:24.416523981 +0100 @@ -225,4 +225,7 @@ extern RETSIGTYPE synch_handler(); (void) sigsetmask(sm)) # endif /* SAVESIGVEC */ +extern void handle_pending_signals(void); +extern void queue_phup(int); + #endif /* _h_tc_sig */
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: The tcsh shell used to call malloc()-related functions inside signal handlers. This could cause tcsh to become unresponsive due to calling just-interrupted internal glibc locking algorithms again. Operations related to malloc() were moved from signal handlers to common routines.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-1072.html