Bug 676136 - tcsh calls malloc() related functions inside signal handlers
Summary: tcsh calls malloc() related functions inside signal handlers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: tcsh617
Version: 5.2
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: rc
: ---
Assignee: Vojtech Vitek
QA Contact: BaseOS QE - Apps
URL:
Whiteboard:
Depends On: 688166
Blocks: 618723
TreeView+ depends on / blocked
 
Reported: 2011-02-08 22:17 UTC by Martin Osvald 🛹
Modified: 2018-11-14 19:01 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The tcsh shell used to call malloc()-related functions inside signal handlers. This could cause tcsh to become unresponsive due to calling just-interrupted internal glibc locking algorithms again. Operations related to malloc() were moved from signal handlers to common routines.
Clone Of:
Environment:
Last Closed: 2011-07-21 08:48:36 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:1072 0 normal SHIPPED_LIVE new package: tcsh617 2011-07-21 08:47:46 UTC

Description Martin Osvald 🛹 2011-02-08 22:17:58 UTC
Created attachment 477700 [details]
reproducer

Description of problem:

When a signal comes in during a normal tcsh program execution just after glibc locks its internal malloc structures (before releasing them) and when a signal handler calls a malloc() related routine, tcsh gets stuck.

This problem is deep and related to all signals tcsh copes with internally. It mostly happens for SIGHUP signal (with attached reproducer), but I can see from the source code that it could happen also e.g. for pchild() - SIGCHLD handler. See 'Additional info'.


Version-Release number of selected component (if applicable):

tcsh-6.14-12.el5-x86_64


How reproducible:

takes some time to reproduce


Steps to Reproduce:

1. Preapre two machines (server1/client1)

2. By copying client1's public key to server1's /root/.ssh/authorized_keys, setup the way client1 can login to server1 via ssh without password/passphrase

3. On server1, create the following empty file in /tmp.

 # touch /tmp/testmgr.log

4. On client1, download "ssh_test" program attached to this BZ and do the following:

 # csh
 # setenv SENJUHOME /tmp
 # cd $SENJUHOME
 # mkdir log
 
Test if everything works correctly (you should see the following, no errors):

$ ~/ssh_test 2 2 root@server1
[2][2][root@server1]
======== sj_sshConnect START =========
stropt=[ssh]
stropt=[-2]
stropt=[-o]
stropt=[PreferredAuthentications publickey]
stropt=[root@server1]
stropt=[]
sj_sshConnect END rc=[0],fd=[3],child_pid=[12565]
======== sj_sshConnect END =========
======== sj_sshExecCmd Loop START =========
result = 457656 -rw-r--r-- 1 root root 0 Oct  4 18:52 /tmp/testmgr.log

sj_sshExecCmd[0] END rc=[0] exitcode=[0]
result = 457656 -rw-r--r-- 1 root root 0 Oct  4 18:52 /tmp/testmgr.log

sj_sshExecCmd[1] END rc=[0] exitcode=[0]
result = 457656 -rw-r--r-- 1 root root 0 Oct  4 18:52 /tmp/testmgr.log

sj_sshExecCmd[2] END rc=[0] exitcode=[0]
result = 457656 -rw-r--r-- 1 root root 0 Oct  4 18:52 /tmp/testmgr.log

sj_sshExecCmd[3] END rc=[0] exitcode=[0]
result = 457656 -rw-r--r-- 1 root root 0 Oct  4 18:52 /tmp/testmgr.log

sj_sshExecCmd[4] END rc=[0] exitcode=[0]
======== sj_sshExecCmd Loop END =========
======== sj_sshDisconnect START =========
sj_sshDisconnect END rc=[0] child_pid=[12565]
======== sj_sshDisconnect END =========
^C
$

Then repeat the following approx 30 times:

 # ./ssh_test 2 2 root@server1 > & /dev/null &

After some time you should see processes that have their parent id = 1 and they stay in this state for a long time (sometimes it needs 5/15/60 minutes or more for bug to show up), keep watching:

 # ps -fe | grep sh


Actual results:

0   501 28870     1  15   0  70756  1904 184466 S    ?          0:00 -sh
0   501 29112     1  16   0  70780  1924 184466 S    ?          0:00 -sh
0   501 29114     1  15   0  70748  1896 184466 S    ?          0:00 -sh
0   501 29182     1  15   0  70756  1904 184466 S    ?          0:00 -sh
0   501 29498     1  16   0  70756  1900 184466 S    ?          0:00 -sh
~~~~~~~~~~~~~~~~~~~~~~~~~~snip~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0   501 32178     1  15   0  70772  1924 184466 S    ?          0:00 -sh
0   501 32209     1  15   0  70784  1924 184466 S    ?          0:00 -sh
0   501 32224     1  15   0  70780  1932 184466 S    ?          0:00 -sh
0   501 32499     1  15   0  70784  1940 184466 S    ?          0:00 -sh
0   501 32566     1  15   0  70772  1920 184466 S    ?          0:00 -sh
0   501 32692     1  15   0  70748  1892 184466 S    ?          0:00 -sh
0   501 32735     1  18   0 136360  1980 184466 Ss   ?          0:00 -csh


Expected results:

No stuck processes.


Additional info:

This BZ is related to BZ #618723 - fixing a bug from that BZ using a patch from comment #18 helps to reveal the bug described in this BZ. It can happen that without fixing the bug from #618723 this bug cannot be reproduced.

The bug is fixed since upstream 6.15.00. It is fixed by not calling any of signal handlers on signal arrival. It calls a special function (handle_pending_signals()) in main program loops and on a lot other (often called) places instead.

The following are backtraces from stuck processes:

(gdb) bt
#0  0x00b3c402 in __kernel_vsyscall ()
#1  0x00c42783 in __lll_lock_wait_private () from /lib/libc.so.6
#2  0x00bd1c76 in _L_lock_5396 () at malloc.c:6195
#3  0x00bccf69 in _int_free (av=0xcb9140, p=0x9049ce0, have_lock=0) at malloc.c:4846
#4  0x00bcd9e9 in __libc_free (mem=0x9049ce8) at malloc.c:3670
#5  0x0808a1d6 in sfree (p=0x9049ce8) at tc.alloc.c:579
#6  0x0806173b in freelex (vp=0x9049bc8) at sh.lex.c:288
#7  0x08060029 in hfree (hp=0x9049bc8) at sh.hist.c:215
#8  0x0805fd27 in enthist (event=81, lp=0x80bd670, docopy=1, mflg=1) at sh.hist.c:134
#9  0x0805fba0 in savehist (sp=0x80bd670, mflg=1) at sh.hist.c:85
#10 0x0804da08 in process (catch=0) at sh.c:2176
#11 0x0804d0ad in srcunit (unit=6, onlyown=0, hflg=2, av=0xb7c004d0) at sh.c:1728
#12 0x0804c923 in srcfile (f=0xbfc75c20 "/home/buchticka/.history", onlyown=0, flag=2, av=0xb7c004d0) at sh.c:1498
#13 0x0804dd34 in dosource (t=0xb7c004d0, c=0x0) at sh.c:2282
#14 0x0806076a in loadhist (fname=0xb7c00468 L"/home/buchticka/.history", mflg=1) at sh.hist.c:482
#15 0x08060668 in rechist (fname=0xb7c00468 L"/home/buchticka/.history", ref=1) at sh.hist.c:450
#16 0x0804e481 in record () at sh.c:2588
#17 0x0804d328 in phup (snum=1) at sh.c:1834
#18 <signal handler called>
#19 0x00bcd047 in _int_free (av=0xcb9140, p=0x904a468, have_lock=0) at malloc.c:4900
#20 0x00bcd9e9 in __libc_free (mem=0x904a470) at malloc.c:3670
#21 0x00b82624 in __gconv_release_cache (steps=0x904a4e8, nsteps=<value optimized out>) at gconv_cache.c:460
#22 0x00b7b94d in __gconv_close_transform (steps=0x904a470, nsteps=2) at gconv_db.c:797
#23 0x00b7b791 in __gconv_close (cd=0x9049e00) at gconv_close.c:64
#24 0x00b7ae0c in iconv_close (cd=0x904a4e8) at iconv_close.c:36
#25 0x0805d50c in nlsclose () at sh.func.c:2539
#26 0x0804e3d9 in xexit (i=1) at sh.c:2521
#27 0x080679b7 in flush () at sh.print.c:294
#28 0x08067788 in xputchar (c=10) at sh.print.c:188
#29 0x0808ea4c in doprnt (addchar=0x80675b1 <xputchar>, sfmt=0x80938e6 "exit\n", ap=0xbfc77224 "") at tc.printf.c:81
#30 0x0808f322 in xprintf (fmt=0x80938e6 "exit\n") at tc.printf.c:377
#31 0x0804c669 in main (argc=0, argv=0xbfc7b514) at sh.c:1382
(gdb)

(gdb) bt
#0  0x00bd4402 in __kernel_vsyscall ()
#1  0x009e95ee in __lll_mutex_lock_wait () from /lib/libc.so.6
#2  0x00978d78 in _L_lock_14118 () from /lib/libc.so.6
#3  0x00978024 in free () from /lib/libc.so.6
#4  0x0098e6c8 in tzset_internal () from /lib/libc.so.6
#5  0x0098efcd in tzset () from /lib/libc.so.6
#6  0x00993726 in strftime_l () from /lib/libc.so.6
#7  0x009d905f in __vsyslog_chk () from /lib/libc.so.6
#8  0x009d9547 in vsyslog () from /lib/libc.so.6
#9  0x0096cdd8 in __libc_message () from /lib/libc.so.6
#10 0x00974b16 in _int_free () from /lib/libc.so.6
#11 0x00978030 in free () from /lib/libc.so.6
#12 0x009236af in __gconv_close () from /lib/libc.so.6
#13 0x00922d5c in iconv_close () from /lib/libc.so.6
#14 0x0805d504 in nlsclose () at sh.func.c:2539
#15 0x0804e408 in xexit (i=1) at sh.c:2530
#16 0x0804d3fd in phup (snum=1) at sh.c:1892
#17 <signal handler called>
#18 0x0092a550 in __gconv_release_cache () from /lib/libc.so.6
#19 0x0092389d in __gconv_close_transform () from /lib/libc.so.6
#20 0x009236e1 in __gconv_close () from /lib/libc.so.6
#21 0x00922d5c in iconv_close () from /lib/libc.so.6
#22 0x0805d504 in nlsclose () at sh.func.c:2539
#23 0x0804e408 in xexit (i=1) at sh.c:2530
#24 0x080679af in flush () at sh.print.c:294
#25 0x08067780 in xputchar (c=10) at sh.print.c:188
#26 0x0808ea54 in doprnt (addchar=0x80675a9 <xputchar>, sfmt=0x8093916 "exit\n", ap=0xbfb2dc24 "") at tc.printf.c:81
#27 0x0808f32a in xprintf (fmt=0x8093916 "exit\n") at tc.printf.c:377
#28 0x0804c5be in main (argc=0, argv=0xbfb31df4) at sh.c:1393
(gdb)

(gdb) bt
#0  0x008d2402 in __kernel_vsyscall ()
#1  0x005e3226 in kill () from /lib/libc.so.6
#2  0x0804a689 in myreset () at sh.c:209
#3  0x0804db9f in process (catch=0) at sh.c:2248
#4  0x0804d1ec in srcunit (unit=6, onlyown=0, hflg=2, av=0x88cf130) at sh.c:1784
#5  0x0804ca62 in srcfile (f=0xbfb8c880 "/home/buchticka/.history", onlyown=0, flag=2, av=0x88cf130) at sh.c:1554
#6  0x0804de66 in dosource (t=0x88cf130, c=0x0) at sh.c:2338
#7  0x08060812 in loadhist (fname=0x88c6cf0 L"/home/buchticka/.history", mflg=1) at sh.hist.c:482
#8  0x08060710 in rechist (fname=0x88c6cf0 L"/home/buchticka/.history", ref=1) at sh.hist.c:450
#9  0x0804e547 in record () at sh.c:2640
#10 0x0804d467 in phup (snum=1) at sh.c:1888
#11 <signal handler called>
#12 0x008d2402 in __kernel_vsyscall ()
#13 0x005e332c in sigsuspend () from /lib/libc.so.6
#14 0x08068202 in pjwait (pp=0x88d46c8) at sh.proc.c:589
#15 0x080680da in pwait () at sh.proc.c:538
#16 0x0806afd5 in execute (t=0x88d6030, wanttty=25555, pipein=0x0, pipeout=0x0, do_glob=1) at sh.sem.c:631
#17 0x0806b3eb in execute (t=0x88d6ad0, wanttty=25555, pipein=0x0, pipeout=0x0, do_glob=1) at sh.sem.c:723
#18 0x0804dc82 in process (catch=1) at sh.c:2285
#19 0x0804c751 in main (argc=0, argv=0xbfba21b4) at sh.c:1422
(gdb)

The number of changes required to fix/rewrite the signal handling is enormous. The following patch is only to point out the changes/functions related and required to fix the problem for HUP signal, e.g. see the fixed upstream version that it calls handle_pending_signals() quite more often (the patch only once - in the main program loop):

C symbol: handle_pending_signals

  File      Function               Line
0 tc.sig.h  <global>                162 extern void handle_pending_signals(void );
1 sh.func.c xcatgets               2495 handle_pending_signals();
2 sh.func.c nlsclose               2593 handle_pending_signals();
3 sh.misc.c xclose                  545 handle_pending_signals();
4 sh.misc.c xclosedir               552 handle_pending_signals();
5 sh.misc.c xcreat                  561 handle_pending_signals();
6 sh.misc.c xdup2                   572 handle_pending_signals();
7 sh.misc.c xgetgrgid               584 handle_pending_signals();
8 sh.misc.c xgetpwnam               597 handle_pending_signals();
9 sh.misc.c xgetpwuid               610 handle_pending_signals();
a sh.misc.c xopen                   623 handle_pending_signals();
b sh.misc.c xopen                   635 handle_pending_signals();
c sh.misc.c xread                   648 handle_pending_signals();
d sh.misc.c xtcsetattr              661 handle_pending_signals();
e sh.misc.c xwrite                  674 handle_pending_signals();
f sh.proc.c pchild                  263 handle_pending_signals();
g sh.proc.c pjwait                  504 handle_pending_signals();
h sh.proc.c dowait                  597 handle_pending_signals();
i sh.proc.c dowait                  599 handle_pending_signals();
j sh.proc.c pfork                  1873 handle_pending_signals();
k tc.func.c auto_lock               689 handle_pending_signals();
l tc.func.c remotehost             2032 handle_pending_signals();
m tc.sig.c  handle_pending_signals   65 handle_pending_signals(void )
n tc.sig.c  disabled_cleanup        120 handle_pending_signals();
o tc.sig.c  pintr_push_enable       138 handle_pending_signals();
p tw.init.c tw_grpname_next         690 handle_pending_signals();


diff -up tcsh-6.14.00/sh.c.orig tcsh-6.14.00/sh.c
--- tcsh-6.14.00/sh.c.orig	2011-02-08 22:41:31.157316490 +0100
+++ tcsh-6.14.00/sh.c	2011-02-08 22:44:28.353816291 +0100
@@ -163,7 +163,7 @@ static	int		  srcfile	__P((const char *,
 #else
 int		  srcfile	__P((const char *, int, int, Char **));
 #endif /*WINNT_NATIVE*/
-static	RETSIGTYPE	  phup		__P((int));
+RETSIGTYPE	  phup		__P((int));
 static	void		  srcunit	__P((int, int, int, Char **));
 static	void		  mailchk	__P((void));
 #ifndef _PATH_DEFPATH
@@ -1106,7 +1106,7 @@ main(argc, argv)
 	 * We also only setup the handlers for shells that are trully
 	 * interactive.
 	 */
-	osig = signal(SIGHUP, phup);	/* exit processing on HUP */
+	osig = signal(SIGHUP, queue_phup); /* exit processing on HUP */
 	if (!loginsh && osig == SIG_IGN)
 	    (void) signal(SIGHUP, osig);
 #ifdef SIGXCPU
@@ -1795,7 +1795,7 @@ exitstat()
 /*
  * in the event of a HUP we want to save the history
  */
-static RETSIGTYPE
+RETSIGTYPE
 phup(snum)
 int snum;
 {
@@ -1997,6 +1997,7 @@ process(catch)
     savet = NULL;
     getexit(osetexit);
     for (;;) {
+	handle_pending_signals();
 
 	pendjob();
 
diff -up tcsh-6.14.00/tc.sig.c.orig tcsh-6.14.00/tc.sig.c
--- tcsh-6.14.00/tc.sig.c.orig	2011-02-08 22:45:04.618770856 +0100
+++ tcsh-6.14.00/tc.sig.c	2011-02-08 22:46:31.817629697 +0100
@@ -43,6 +43,23 @@ RCSID("$Id: tc.sig.c,v 3.29 2005/01/18 2
  */
 #define MAX_CHLD 50
 
+static volatile sig_atomic_t phup_pending; /* = 0; */
+
+void
+queue_phup(int sig)
+{
+	USE(sig);
+	phup_pending = 1;
+}
+
+void
+handle_pending_signals(void)
+{
+	if (phup_pending) {
+		phup(SIGHUP);
+	}
+}
+
 # ifdef UNRELSIGS
 static struct mysigstack {
     int     s_w;		/* wait report			 */
diff -up tcsh-6.14.00/tc.sig.h.orig tcsh-6.14.00/tc.sig.h
--- tcsh-6.14.00/tc.sig.h.orig	2011-02-08 22:46:57.923579084 +0100
+++ tcsh-6.14.00/tc.sig.h	2011-02-08 22:47:24.416523981 +0100
@@ -225,4 +225,7 @@ extern RETSIGTYPE synch_handler();
 	    (void) sigsetmask(sm))
 # endif /* SAVESIGVEC */
 
+extern void handle_pending_signals(void);
+extern void queue_phup(int);
+
 #endif /* _h_tc_sig */

Comment 23 Miroslav Svoboda 2011-07-01 21:33:57 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
The tcsh shell used to call malloc()-related functions inside signal handlers. This could cause tcsh to become unresponsive due to calling just-interrupted internal glibc locking algorithms again. Operations related to malloc() were moved from signal handlers to common routines.

Comment 24 errata-xmlrpc 2011-07-21 08:48:36 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1072.html

Comment 25 errata-xmlrpc 2011-07-21 12:09:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1072.html


Note You need to log in before you can comment on or make changes to this bug.