618723 – csh does not exit properly and uses 100% cpu

Summary: csh does not exit properly and uses 100% cpu

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	tcsh617
Sub Component:
Version:	5.2
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Vojtech Vitek
QA Contact:	BaseOS QE - Apps
Docs Contact:
URL:
Whiteboard:
Depends On:	676136
Blocks:	590060

Reported:	2010-07-27 15:31 UTC by Mark Wu
Modified:	2018-11-14 19:01 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	When tcsh did not exit properly, it could have entered an infinite loop, using 100% of the CPU, and become unresponsive. This was caused by a function interrupting the exit routine and then re-entering the code and thus causing it to loop infinitely.
Clone Of:
Environment:
Last Closed:	2011-07-21 08:48:01 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
strace_csh (30.02 KB, text/plain) 2010-07-27 15:45 UTC, Mark Wu
ltrace_csh (3.58 MB, application/octet-stream) 2010-07-27 15:50 UTC, Mark Wu

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:1072	0	normal	SHIPPED_LIVE	new package: tcsh617	2011-07-21 08:47:46 UTC

Description Mark Wu 2010-07-27 15:31:26 UTC

Description of problem:
While the customer is working in a csh environment, csh processes randomly start using 100% of a CPU and are left without a proper parent process (PPID); the customer has to kill those csh processes.


Version-Release number of selected component (if applicable):
- Red Hat Enterprise Linux 5.2
- telnet-server-0.17-39.el5-x86_64
- krb5-workstation-1.6.1-25.el5-x86_64
- tcsh-6.14-12.el5-x86_64

How reproducible:
Happened about 3-4 times in the past year.

Steps to Reproduce:
Can't be reproduced.
  
Actual results:
csh cannot exit properly when telnetd exits, and it consumes 100% CPU.

Expected results:
csh exits when the corresponding telnetd exits.

Additional info:
From the ps output we can see that the parent process (telnetd) had exited, but login.krb5 and csh still existed.

# egrep "(csh|login|COMMAND)" sos_commands/process/ps_alxwww
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
4     0  1682     1  15   0  67440  1896 wait   Ss   ?          0:00 login -h 172.18.81.234 -p
4   601  1688  1682  18   0  71504  2248 -      R    ?        12808:33 -csh
4     0  2721     1  15   0  67444  1892 wait   Ss   ?          0:00 login -h 172.18.81.234 -p
4   601  2722  2721  18   0  71316  2060 -      R    ?        78244:46 -csh
4     0  8808  8807  16   0  67444  1952 wait   Ss   pts/11     0:00 login -h 192.168.237.55 -p
4     0  9734  9733  16   0  67444  1956 wait   Ss   pts/12     0:00 login -h 192.168.237.55 -p
4     0 10906     1  15   0  67444  1892 wait   Ss   ?          0:00 login -h 172.18.81.234 -p
4   601 10907 10906  18   0  71524  2296 -      R    ?        81249:27 -csh
4     0 22319     1  15   0  68336  1892 wait   Ss   ?          0:00 login -h 172.18.101.230 -p
4   601 22320 22319  18   0  71520  2284 -      R    ?        175340:12 -csh
4     0 30285 30282  16   0  67444  1896 wait   Ss   pts/9      0:00 login -h 172.18.81.234 -p
4   601 30286 30285  15   0  71888  2652 -      S+   pts/9      0:00 -csh
4     0 31864     1  15   0  67444  1896 wait   Ss   ?          0:00 login -h vmgapp05 -p
4   601 31865 31864  18   0  71340  2140 -      R    ?        45486:23 -csh

# grep csh sos_commands/process/pstree
    |-5*[login.krb5---csh]
    |        `-telnetd---login.krb5---csh

Comment 1 Mark Wu 2010-07-27 15:36:07 UTC

Normally, when telnetd exits, whether as expected or abnormally, the kernel releases the tty device it was using as part of cleaning up the process. At that point the kernel sends SIGHUP to the session leader on that tty (here, login.krb5). Because sending SIGHUP is the kernel's responsibility and not telnetd's, telnetd should not be relevant to this issue.

On receipt of SIGHUP, login.krb5 passes it on to its child (here, csh). The following code demonstrates this behavior.

krb5-1.6.1/src/appl/bsd/login.c
<snip> 
   while (1) {
#ifdef HAVE_WAITPID
       pid = waitpid(child, 0, 0);
#elif defined(WAIT_USES_INT)
       pid = wait((int *)0);
#else
       pid = wait((union wait *)0);
#endif

       if (hungup) {
#ifdef HAVE_KILLPG
           killpg(child, SIGHUP);
#else
           kill(-child, SIGHUP);
#endif
       }

       if (pid == child)
           break;
   }
</snip>

Comment 2 Mark Wu 2010-07-27 15:45:23 UTC

Created attachment 434731 [details]
strace_csh

This file was collected while the problem was occurring: when the customer found csh using 100% CPU, they attached strace to the csh process. The file has been truncated because the original trace is very large and consists of the same pattern repeated, as seen in this attachment.

Comment 3 Mark Wu 2010-07-27 15:50:43 UTC

Created attachment 434733 [details]
ltrace_csh

Comment 4 Mark Wu 2010-07-27 15:56:57 UTC

The login.krb5 process has been inherited by init, which means the kernel had already finished cleaning up telnetd. So we can assume the kernel did send SIGHUP. In ps, we can see that login.krb5 was waiting for its child process (csh) to exit. That should be its state after receiving SIGHUP and forwarding it to csh.

From the strace of csh, we can see that it had SIGHUP blocked. So it is highly likely that login.krb5 got SIGHUP and handed it on to csh, but csh did not respond because the signal was blocked, so login has been waiting forever for csh to exit.

Next, let's work out what was going on inside csh when the problem happened, using the strace and ltrace output.


strace snippet 1
<snip>
fstat(0, 0x7fff3e146ac0)                = -1 EBADF (Bad file descriptor)
fstat(1, 0x7fff3e146ac0)                = -1 EBADF (Bad file descriptor)

...

fstat(13, 0x7fff3e146ac0)               = -1 EBADF (Bad file descriptor)
fstat(14, 0x7fff3e146ac0)               = -1 EBADF (Bad file descriptor)
fstat(20, 0x7fff3e146ac0)               = -1 EBADF (Bad file descriptor)
...
fstat(25, 0x7fff3e146ac0)               = -1 EBADF (Bad file descriptor)
...
fstat(255, 0x7fff3e146ac0)               = -1 EBADF (Bad file descriptor)
</snip>


code snippet 1
<snip> function closem() sh.misc.c  
#define NOFILE 256
#define FSHTTY  15              /* /dev/tty when manip pgrps */
#define FSHIN   16              /* Preferred desc for shell input */
#define FSHOUT  17              /* ... shell output */
#define FSHDIAG 18              /* ... shell diagnostics */
#define FOLDSTD 19              /* ... old std input */
   for (f = 0; f < NOFILE; f++)
       if (f != SHIN && f != SHOUT && f != SHDIAG && f != OLDSTD &&
           f != FSHTTY
#ifdef MALLOC_TRACE
           && f != 25
#endif /* MALLOC_TRACE */
#ifdef S_ISSOCK
           && fstat(f, &st) == 0 && !S_ISSOCK(st.st_mode)
#endif
           )
</snip>
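The run of fstat() calls failing with EBADF in strace snippet 1 is what a scan over descriptors looks like when they are all closed. A small sketch of that probe, assuming a POSIX system; fd_is_open and count_closed_fds are illustrative names, not tcsh functions:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return 1 if fd is open, 0 if fstat() fails with EBADF (fd is closed),
 * and -1 on any other error. */
int fd_is_open(int fd)
{
    struct stat st;

    if (fstat(fd, &st) == 0)
        return 1;
    return errno == EBADF ? 0 : -1;
}

/* Count the descriptors in [0, limit) that are closed, probing each one
 * the same way the closem() loop does. */
int count_closed_fds(int limit)
{
    int closed = 0;

    assert(limit >= 0);
    for (int fd = 0; fd < limit; fd++)
        if (fd_is_open(fd) == 0)
            closed++;
    return closed;
}
```

With NOFILE defined as 256, one pass over an already-closed descriptor table costs up to 256 failing system calls, so repeating it in a tight loop is expensive.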


strace snippet 2
<snip>
rt_sigprocmask(SIG_SETMASK, NULL, [HUP], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP], NULL, 8) = 0
</snip>

code snippet 2
<snip>     in the function process() in sh.c
   if (setintr)
#ifdef BSDSIGS
           (void) sigsetmask(sigblock((sigmask_t) 0) & ~sigmask(SIGINT));
#else
           (void) sigrelse(SIGINT);
#endif
</snip>



strace snippet 3
<snip>
lseek(16, 0, SEEK_END)                  = -1 ESPIPE (Illegal seek)
ioctl(15, TIOCSPGRP, [31416])           = -1 ENOTTY (Inappropriate ioctl for device)
</snip>

code snippet 3
<snip>  the end of the function stderror() in sh.err.c
   btoeof();                -->     (void) lseek(SHIN, (off_t) 0, L_XTND);
   set(STRstatus, Strsave(STR1), VAR_READWRITE);
#ifdef BSDJOBS
   if (tpgrp > 0)
       (void) tcsetpgrp(FSHTTY, tpgrp);
#endif
   reset();                -->      call longjmp to back to the position of setexit in the function process() in sh.c
</snip>


strace snippet 4
<snip>
write(17, "[vmgapp10:/app/vmg/util/vmg_File"..., 47) = -1 EIO (Input/output error)
</snip>

code snippet 4
<snip>
   if (haderr)
       unit = didfds ? 2 : SHDIAG;
   else
       unit = didfds ? 1 : SHOUT;        /* FSHOUT is defined as 17 */
   ...
   if (write(unit, linbuf, sz) == -1)
       switch (errno) {
#ifdef EIO
       /* We lost our tty */
       case EIO:
#endif
...
       default:
           stderror(ERR_SILENT);
           break;
       }
</snip>



strace snippet 5
<snip>
alarm(60000000)                         = 60000000      
</snip>

code snippet 5
<snip>
setalarm(1)                                   // in the function process() in sh.c
</snip>                              

strace snippet 6  (Please note that the signal SIGHUP was always in the signal mask of csh, so it would be blocked.)
<snip>
rt_sigprocmask(SIG_SETMASK, NULL, [HUP], 8) = 0     
rt_sigprocmask(SIG_SETMASK, [HUP], NULL, 8) = 0              // unblock SIGINT  
rt_sigprocmask(SIG_SETMASK, NULL, [HUP], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP INT], NULL, 8) = 0           // block SIGINT
rt_sigprocmask(SIG_SETMASK, NULL, [HUP INT], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP], NULL, 8) = 0               // unblock SIGINT
rt_sigprocmask(SIG_SETMASK, NULL, [HUP], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP INT], NULL, 8) = 0           // block SIGINT
rt_sigprocmask(SIG_SETMASK, NULL, [HUP INT], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP], NULL, 8) = 0               // unblock SIGINT
rt_sigprocmask(SIG_SETMASK, NULL, [HUP], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP INT], NULL, 8) = 0           // block SIGINT
rt_sigprocmask(SIG_SETMASK, NULL, [HUP INT], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP], NULL, 8) = 0               // unblock SIGINT
rt_sigprocmask(SIG_SETMASK, NULL, [HUP], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP INT], NULL, 8) = 0            // block SIGINT
rt_sigprocmask(SIG_SETMASK, NULL, [HUP INT], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP], NULL, 8) = 0                // unblock SIGINT
</snip>

code snippet 6
<snip>
       if (setintr)
#ifdef BSDSIGS
           (void) sigsetmask(sigblock((sigmask_t) 0) & ~sigmask(SIGINT));      // unblock SIGINT
#else
           (void) sigrelse(SIGINT);
#endif
   ...

           watch_login(0);                              // these four functions each block SIGINT first, then unblock it
#endif /* !HAVENOUTMP */
           sched_run(0);
           period_cmd();
           precmd();

</snip>


Based on the information above, we can tell what was going on in csh. When csh hit a non-fatal lexical error, it tried to flush the error message to the tty, which had been lost when telnetd exited. The "?" in the TTY column for csh in the ps output confirms this. csh then got the EIO error in flush() and jumped back to the function process() to begin a new loop iteration. There it found the variable "haderr" set and called closem() to close files. After that it repeated the whole procedure described above. Consequently it fell into an endless loop, which caused the high CPU usage.
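The control flow described above can be modelled in a few lines. This is a simplified sketch, not tcsh code: main_loop plays the role of the setexit() point in process(), flush_to_lost_tty() stands in for a flush() whose write always fails, and the max_reentries cap exists only so the example terminates (the loop described above has no such cap).

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf main_loop;        /* plays the role of the setexit() point */
static volatile int reentries;   /* times the loop was re-entered by longjmp */

/* Stand-in for flush() on a lost tty: the write always fails, so the
 * error path jumps back to the main loop instead of returning. */
static void flush_to_lost_tty(void)
{
    reentries++;
    longjmp(main_loop, 1);       /* like reset() at the end of stderror() */
}

/* Run the loop until it has been re-entered max_reentries times.
 * Returns the number of re-entries observed. */
int run_process_loop(int max_reentries)
{
    assert(max_reentries >= 0);
    reentries = 0;
    setjmp(main_loop);           /* every failed write resumes here */
    if (reentries >= max_reentries)
        return reentries;        /* artificial stop for the sketch */
    flush_to_lost_tty();         /* reporting the error fails again */
    return -1;                   /* not reached */
}
```

Nothing in the cycle ever blocks, so without the cap the process stays runnable and spins at full speed.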

Comment 5 Mark Wu 2010-07-27 15:58:56 UTC

In short, I think the source of this problem is that csh blocks SIGHUP. As a result, csh could not exit when telnetd exited, and it fell into an endless loop because the tty was lost.

  So why did csh block SIGHUP? At this point I am not sure. In csh I found only two situations in which it blocks its own SIGHUP. One is inside the SIGHUP handler: because csh exits on SIGHUP, the signal may be blocked while the handler runs. The other is in the function donohup(), but the built-in command nohup cannot be invoked in a login shell. So neither case explains this issue.

Comment 7 RHEL Program Management 2010-09-15 13:36:58 UTC

This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 11 Martin Osvald 🛹 2010-09-27 11:24:29 UTC

Hello, I am posting only my latest update from the issue tracker/rosetta here, to keep this BZ up to date and to let you know that I am still working on this one.
-Martin

=== <snip> ===
Hello,

So far I have made progress only on the problem where tcsh gets stuck on a futex inside glibc's malloc routines. It is caused by the fact that the malloc()-related routines are not async-signal-safe. On Linux you must not call them from a signal handler, either directly or through other functions that call them internally, but tcsh does so, and that is a bug.

I found out that this is already fixed in the version shipped with Fedora (tcsh-6.17). There, the arrival of a signal no longer runs the handler routine straight away. Instead, a global variable for that specific signal is set, and a special function, called regularly from several places, checks these variables and runs the appropriate routines synchronously. That way a malloc routine can never be interrupted just after locking its internal structures by a handler that then tries to lock the same structures again, which is what leads to the deadlock.

I am currently working on a patch that backports these changes to RHEL5, and afterwards I will create a BZ for it.

In the meantime I am still running the reproducer for the second bug (looping in process()), but after manually fixing the problems with signals and malloc I can no longer reproduce it on RHEL5; I still need to test on RHEL4. I also ran some tests over the weekend and found something strange. From the strace output together with the source code we can see that xexit() gets called repeatedly but never reaches the call to _exit():

void
#ifdef PROF
done(i)
#else
xexit(i)
#endif
    int     i;
{
...
    if (child == 0)
        nlsclose(); /* <<<---- */
#endif /* NLS_CATALOGS */
#ifdef WINNT_NATIVE
    nt_cleanup();
#endif /* WINNT_NATIVE */
    _exit(i);
}

I found out that program execution gets interrupted inside the glibc call iconv_close() (xexit()->nlsclose()->iconv_close()) and returns to the main loop in process(). This is strange, and I need to investigate it further.

I will try to post an update with the patch and Bugzilla entry for the first bug as soon as possible.

Best regards,
-Martin


__notes to myself:
http://www.gnu.org/software/libc/manual/pdf/libc.pdf

24.4.6 Signal Handling and Nonreentrant Functions
...
On most systems, malloc and free are not reentrant, because they use a static data
structure which records what memory blocks are free. As a result, no library functions
that allocate or free memory are reentrant. This includes functions that allocate space
to store a result.

The best way to avoid the need to allocate memory in a handler is to allocate in advance
space for signal handlers to use.

The best way to avoid freeing memory in a handler is to flag or record the objects to
be freed, and have the program check from time to time whether anything is waiting
to be freed. But this must be done with care, because placing an object on a chain is
not atomic, and if it is interrupted by another signal handler that does the same thing,
you could “lose” one of the objects.
=== </snip> ===
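The flag-and-check approach described in the update above can be sketched as follows. This is an illustration under assumptions, not the tcsh-6.17 code: queue_hup, install_deferred_hup and handle_pending_hup are hypothetical names, and the "real work" is reduced to a counter.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t hup_pending;  /* set in the handler only */
static int hup_handled;                    /* times the real work ran */

/* Signal handler: async-signal-safe, it only records the arrival. */
static void queue_hup(int sig)
{
    (void)sig;
    hup_pending = 1;
}

/* Install the flag-setting handler for SIGHUP. Returns 0 on success. */
int install_deferred_hup(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = queue_hup;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGHUP, &sa, NULL);
}

/* Called from the main loop, outside signal context, where calling
 * malloc()/free() is safe. Returns 1 if a SIGHUP was processed. */
int handle_pending_hup(void)
{
    if (!hup_pending)
        return 0;
    hup_pending = 0;
    hup_handled++;       /* real code would save history and exit here */
    assert(hup_handled > 0);
    return 1;
}
```

The handler touches nothing but a sig_atomic_t flag, so it cannot interrupt malloc() and then try to take the same lock again; the trade-off is that the signal is only acted on at the next check in the main loop.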

Comment 12 Martin Osvald 🛹 2010-09-28 07:38:19 UTC

Hello,

I have made progress on the problem where tcsh gets stuck in process(). While going through the code (upstream tcsh-6.15.00) to backport the changes that fix the second bug, where tcsh gets stuck in malloc()-related functions in signal handlers, I found out I was right. See the following code, and especially the comment (I had already gone through the related glibc code before, but I must have overlooked the longjmp() there):

void
nlsclose(void)
{
#ifdef NLS_CATALOGS
#if defined(HAVE_ICONV) && defined(HAVE_NL_LANGINFO)
    if (catgets_iconv != (iconv_t)-1) {
	iconv_close(catgets_iconv);
	catgets_iconv = (iconv_t)-1;
    }
#endif /* HAVE_ICONV && HAVE_NL_LANGINFO */
    if (catd != (nl_catd)-1) {
	/*
	 * catclose can call other functions which can call longjmp
	 * making us re-enter this code. Prevent infinite recursion
	 * by resetting catd. Problem reported and solved by:
	 * Gerhard Niklasch
	 */
	nl_catd oldcatd = catd;
	catd = (nl_catd)-1;
	while (catclose(oldcatd) == -1 && errno == EINTR)
	    handle_pending_signals();
    }
#endif /* NLS_CATALOGS */
}

According to my tests I had thought the problem was in iconv_close(); I still need to check this. I won't create an extra BZ for the second bug, because the code above (handle_pending_signals()) would need to be in the patch for both bugs.
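The guard used in the upstream nlsclose() above (invalidate the handle first, then close the saved copy) can be tried in isolation. The sketch below is a simplified model, not tcsh code: handle stands in for catd, and close_that_jumps() is a made-up close routine that performs a non-local jump instead of returning, as the upstream comment says catclose can.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf restart;            /* where the non-local jump lands */
static int handle = 42;            /* stand-in for catd; -1 means closed */
static volatile int close_calls;   /* how often the close routine ran */

/* Made-up close routine that jumps away instead of returning. */
static void close_that_jumps(int h)
{
    (void)h;
    close_calls++;
    longjmp(restart, 1);
}

/* Guarded cleanup: invalidate the handle *before* closing it, so a
 * re-entry caused by the jump finds nothing left to close. */
static void guarded_close(void)
{
    if (handle != -1) {
        int old = handle;

        handle = -1;
        close_that_jumps(old);
    }
}

/* Returns how many times the close routine ran. */
int run_guarded_exit(void)
{
    handle = 42;
    close_calls = 0;
    setjmp(restart);     /* after the jump, the exit path runs again */
    guarded_close();     /* second pass sees handle == -1 and returns */
    assert(handle == -1);
    return close_calls;
}
```

If the handle were reset only after the close call, the second pass would call the close routine again and jump again, which is the re-entry loop the upstream comment warns about.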

I will try to update asap..

Best regards,
-Martin

Comment 13 Martin Osvald 🛹 2010-09-29 09:37:34 UTC

Hello,

After spending almost the whole week working on the patch, I find that it is already 812 lines long and still not finished. The changes would be huge and invasive, even if they made the tcsh code cleaner and simpler, and I have no testing tool that would check that none of the tcsh functionality got broken.

So I decided to give it one last try: I simply removed the call to nlsclose():

=== <snip> ===
diff -up tcsh-6.14.00/sh.c.nlsclose tcsh-6.14.00/sh.c
--- tcsh-6.14.00/sh.c.nlsclose  2010-09-29 11:08:44.000000000 +0200
+++ tcsh-6.14.00/sh.c   2010-09-29 11:09:17.000000000 +0200
@@ -2460,15 +2460,6 @@ xexit(i)
        }
     }
     untty();
-#ifdef NLS_CATALOGS
-    /*
-     * We need to call catclose, because SVR4 leaves symlinks behind otherwise
-     * in the catalog directories. We cannot close on a vforked() child,
-     * because messages will stop working on the parent too.
-     */
-    if (child == 0)
-       nlsclose();
-#endif /* NLS_CATALOGS */
 #ifdef WINNT_NATIVE
     nt_cleanup();
 #endif /* WINNT_NATIVE */
=== </snip> ===

from xexit(), as both bugs originate from it. We really don't need to call the function just before tcsh ends, since the kernel does all the cleanup at process termination anyway, and what is required for SVR4 is not required for the glibc shipped with RHEL5 (looking at the catclose and iconv_close source code, I see no problem in not calling them; see the related glibc code below).

The reproducer is currently running, and so far it has not got stuck. I will run it for the whole day today and will inform you about the results. Let's wait for them.

Best regards,
-Martin

tcsh nlsclose() code:

tcsh-6.14.00/sh.func.c:
void
nlsclose()
{
#ifdef NLS_CATALOGS
#ifdef HAVE_ICONV
    if (catgets_iconv != (iconv_t)-1) {
        iconv_close(catgets_iconv);
        catgets_iconv = (iconv_t)-1;
    }
#endif /* HAVE_ICONV */
    catclose(catd);
#endif /* NLS_CATALOGS */
}

glibc related parts:

glibc-2.5-20061008T1257/iconv/iconv_close.c:
int
iconv_close (iconv_t cd)
{
  if (__builtin_expect (cd == (iconv_t *) -1L, 0))
    {
      __set_errno (EBADF);
      return -1;
    }

  return __gconv_close ((__gconv_t) cd) ? -1 : 0;
}

glibc-2.5-20061008T1257/catgets/catgets.c:
int
catclose (nl_catd catalog_desc)
{
  __nl_catd catalog;

  /* Be generous if catalog which failed to be open is used.  */
  if (catalog_desc == (nl_catd) -1)
    {
      __set_errno (EBADF);
      return -1;
    }

  catalog = (__nl_catd) catalog_desc;

#ifdef _POSIX_MAPPED_FILES
  if (catalog->status == mmapped)
    __munmap ((void *) catalog->file_ptr, catalog->file_size);
  else
#endif  /* _POSIX_MAPPED_FILES */
    if (catalog->status == malloced)
      free ((void *) catalog->file_ptr);
    else
      {
        __set_errno (EBADF);
        return -1;
      }

  free ((void *) catalog);

  return 0;
}

glibc-2.5-20061008T1257/sysdeps/mach/munmap.c:
int
__munmap (__ptr_t addr, size_t len)
{
  kern_return_t err;
  if (err = __vm_deallocate (__mach_task_self (),
                             (vm_address_t) addr, (vm_size_t) len))
    {
      errno = err;
      return -1;
    }
  return 0;
}

Comment 14 Martin Osvald 🛹 2010-09-29 09:50:19 UTC

let's make some of the comments public

Comment 15 Martin Osvald 🛹 2010-09-30 08:58:18 UTC

Hello,

An update on the current status: after running the reproducer overnight, the bug with looping in process() did not show up (great), but the problem of getting stuck on a futex inside malloc called from a signal handler persists. That is no surprise, since besides phup() other signal handlers also call many glibc functions that internally allocate or free memory through the non-reentrant malloc() routines:

(gdb) bt
#0  0x00c01402 in __kernel_vsyscall ()
#1  0x001ee783 in __lll_lock_wait_private () from /lib/libc.so.6
#2  0x0017dc76 in _L_lock_5396 () at malloc.c:6195
#3  0x00178f69 in _int_free (av=0x265140, p=0x9a33360, have_lock=0) at malloc.c:4846
#4  0x001799e9 in __libc_free (mem=0x9a33368) at malloc.c:3670
#5  0x0805f1c9 in freelex (vp=0x80b3450) at sh.lex.c:288
#6  0x0804a6e1 in process (catch=0) at sh.c:2012
#7  0x0804b588 in srcunit (unit=<value optimized out>, onlyown=<value optimized out>, hflg=2, av=0xb7c00540) at sh.c:1724
#8  0x0804b815 in srcfile (f=<value optimized out>, onlyown=0, flag=2, av=0xb7c00540) at sh.c:1494
#9  0x0804b8f5 in dosource (t=0xb7c00540, c=0x0) at sh.c:2233
#10 0x0805dee4 in rechist (fname=0x0, ref=1) at sh.hist.c:450
#11 0x0804bc09 in record () at sh.c:2535
#12 0x0804be8a in phup (snum=1) at sh.c:1827
#13 <signal handler called>
#14 _int_malloc (av=0x265140, bytes=16384) at malloc.c:4652
#15 0x0017be97 in __libc_malloc (bytes=16384) at malloc.c:3605
#16 0x08081c57 in scalloc (s=4096, n=16384) at tc.alloc.c:553
#17 0x0805f36b in balloc (buf=704) at sh.lex.c:1682
#18 0x0805fe66 in bgetc (wanteof=0) at sh.lex.c:1820
#19 readc (wanteof=0) at sh.lex.c:1562
#20 0x0806092f in getC1 (flag=3) at sh.lex.c:499
#21 0x0806247b in word (hp=0x80b3450) at sh.lex.c:429
#22 lex (hp=0x80b3450) at sh.lex.c:195
#23 0x0804a7f9 in process (catch=0) at sh.c:2097
#24 0x0804b588 in srcunit (unit=<value optimized out>, onlyown=<value optimized out>, hflg=1, av=0x98f8d18) at sh.c:1724
#25 0x0804b815 in srcfile (f=<value optimized out>, onlyown=0, flag=1, av=0x98f8d18) at sh.c:1494
#26 0x0804b8f5 in dosource (t=0x98f8d18, c=0x0) at sh.c:2233
#27 0x0804d8cc in main (argc=0, argv=0xbf8b83d4) at sh.c:1328
(gdb)

So I decided not to run the code in phup() as a signal handler and made another patch, which partly backports functionality of the newer tcsh-6.15.00 version: phup() is called as a regular function from the main loop in process() if SIGHUP was received. I will probably need to add more than one call site that checks the global variable indicating that SIGHUP was received. Doing another round of tests.

Best regards,
-Martin



diff -up tcsh-6.14.00/sh.c.nlsclose_sighup tcsh-6.14.00/sh.c
--- tcsh-6.14.00/sh.c.nlsclose_sighup	2010-09-30 08:54:04.425072301 +0200
+++ tcsh-6.14.00/sh.c	2010-09-30 09:40:16.486072005 +0200
@@ -163,7 +163,7 @@ static	int		  srcfile	__P((const char *,
 #else
 int		  srcfile	__P((const char *, int, int, Char **));
 #endif /*WINNT_NATIVE*/
-static	RETSIGTYPE	  phup		__P((int));
+RETSIGTYPE	  phup		__P((int));
 static	void		  srcunit	__P((int, int, int, Char **));
 static	void		  mailchk	__P((void));
 #ifndef _PATH_DEFPATH
@@ -1106,7 +1106,7 @@ main(argc, argv)
 	 * We also only setup the handlers for shells that are trully
 	 * interactive.
 	 */
-	osig = signal(SIGHUP, phup);	/* exit processing on HUP */
+	osig = signal(SIGHUP, queue_phup);	/* exit processing on HUP */
 	if (!loginsh && osig == SIG_IGN)
 	    (void) signal(SIGHUP, osig);
 #ifdef SIGXCPU
@@ -1795,7 +1795,7 @@ exitstat()
 /*
  * in the event of a HUP we want to save the history
  */
-static RETSIGTYPE
+RETSIGTYPE
 phup(snum)
 int snum;
 {
@@ -1998,6 +1998,8 @@ process(catch)
     getexit(osetexit);
     for (;;) {
 
+	handle_pending_signals();
+
 	pendjob();
 
 	/* This was leaking memory badly, particularly when sourcing
@@ -2460,15 +2462,6 @@ xexit(i)
 	}
     }
     untty();
-#ifdef NLS_CATALOGS
-    /*
-     * We need to call catclose, because SVR4 leaves symlinks behind otherwise
-     * in the catalog directories. We cannot close on a vforked() child,
-     * because messages will stop working on the parent too.
-     */
-    if (child == 0)
-	nlsclose();
-#endif /* NLS_CATALOGS */
 #ifdef WINNT_NATIVE
     nt_cleanup();
 #endif /* WINNT_NATIVE */
diff -up tcsh-6.14.00/tc.sig.c.nlsclose_sighup tcsh-6.14.00/tc.sig.c
--- tcsh-6.14.00/tc.sig.c.nlsclose_sighup	2010-09-30 09:03:47.552224377 +0200
+++ tcsh-6.14.00/tc.sig.c	2010-09-30 09:35:37.281099316 +0200
@@ -43,6 +43,23 @@ RCSID("$Id: tc.sig.c,v 3.29 2005/01/18 2
  */
 #define MAX_CHLD 50
 
+static volatile sig_atomic_t phup_pending; /* = 0; */
+
+void
+queue_phup(int sig)
+{
+    USE(sig);
+    phup_pending = 1;
+}
+
+void
+handle_pending_signals(void)
+{
+    if (phup_pending) {
+        phup(SIGHUP);
+    }
+}
+
 # ifdef UNRELSIGS
 static struct mysigstack {
     int     s_w;		/* wait report			 */
@@ -51,7 +68,6 @@ static struct mysigstack {
 }       stk[MAX_CHLD];
 static int stk_ptr = -1;
 
-
 /* queue child signals
  */
 static RETSIGTYPE
diff -up tcsh-6.14.00/tc.sig.h.nlsclose_sighup tcsh-6.14.00/tc.sig.h
--- tcsh-6.14.00/tc.sig.h.nlsclose_sighup	2010-09-30 09:07:11.330197320 +0200
+++ tcsh-6.14.00/tc.sig.h	2010-09-30 09:19:22.742107058 +0200
@@ -225,4 +225,7 @@ extern RETSIGTYPE synch_handler();
 	    (void) sigsetmask(sm))
 # endif /* SAVESIGVEC */
 
+extern void handle_pending_signals(void);
+extern void queue_phup(int);
+
 #endif /* _h_tc_sig */

Comment 27 Bob Arendt 2011-03-24 03:59:15 UTC

Some of these issues might be related to a tcsh problem that we've seen, and we have a simple patch.  See https://bugzilla.redhat.com/show_bug.cgi?id=690356

Comment 28 Vojtech Vitek 2011-03-24 10:41:00 UTC

(In reply to comment #27)
> Some of these issues might be related to a tcsh problem that we've seen, and we
> have a simple patch.  See https://bugzilla.redhat.com/show_bug.cgi?id=690356

They are not related, AFAIK. Check the strace_csh output (attachment #434731 [details]); it is a completely different issue from bug 690356.

Comment 31 Angelo Bonet 2011-04-15 17:13:14 UTC

(In reply to comment #28)
> They are not related, AFAIK. Check the strace_csh output (attachment #434731 [details]);
> it is a completely different issue from bug 690356.


I think you need to take another look at bug #690356, specifically this comment: 
https://bugzilla.redhat.com/show_bug.cgi?id=690356#c7

The strace trace output of this bug and that one, in fact, look very similar. 

Have you tried the patch from bug #690356 ?

Comment 32 Vojtech Vitek 2011-05-02 09:43:05 UTC

Angelo, sure I tried the patch, but I still insist the issues are not related, even though the strace outputs can look somewhat similar. 

Both bugs will be fixed in RHEL-5.7 tcsh617.

Comment 34 Miroslav Svoboda 2011-07-01 21:50:50 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
When tcsh did not exit properly, it could have entered an infinite loop, using 100% of the CPU, and become unresponsive. This was caused by a function interrupting the exit routine and then re-entering the code and thus causing it to loop infinitely.

Comment 35 errata-xmlrpc 2011-07-21 08:48:01 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1072.html

Comment 36 errata-xmlrpc 2011-07-21 12:09:04 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1072.html

Comment 37 Angelo Bonet 2011-11-16 19:32:43 UTC

This bug was originally reported against tcsh-6.14-12.el5.  Why was the Component quietly switched to tcsh617?

Does Red Hat plan to provide a fix for this problem in tcsh-6.14?  Or has support for tcsh-6.14 been officially abandoned?

If I'm running RHEL5/tcsh-6.14 and I'm being affected by this problem, am I supposed to now switch to tcsh617?  That's not as easy as it sounds as tcsh617 has non-backward compatible behaviors: see 
https://bugzilla.redhat.com/show_bug.cgi?id=638955.

Comment 38 Ondrej Vasik 2011-11-17 16:41:07 UTC

There were significant changes to the tcsh signal handlers between 6.14 and 6.15, and backporting these changes in a package as complex as tcsh is almost impossible. It would be a very risky and invasive change, and it is quite late in the RHEL-5 release cycle.

As there are incompatibilities in tcsh 6.17, new component tcsh617 was created.
Support for tcsh-6.14 was not abandoned - in the case of security issue, both packages (tcsh and tcsh617) will be fixed. Only tcsh617 is now the preferred one... 
If none of the tcsh-6.17 vs. tcsh-6.14 incompatibilities limits you, it is probably better to use tcsh617 package as this version is also part of RHEL-6.

If you are hit by this issue and using tcsh617 package is not an option for you, please contact product support... as bugzilla is not a support tool for Red Hat Enterprise Linux.
