Description of problem:
While the customer is using a csh environment, csh randomly uses 100% CPU without a proper PPID, and the customer has to kill those csh processes.

Version-Release number of selected component (if applicable):
- Red Hat Enterprise Linux 5.2
- telnet-server-0.17-39.el5-x86_64
- krb5-workstation-1.6.1-25.el5-x86_64
- tcsh-6.14-12.el5-x86_64

How reproducible:
Happened about 3-4 times in the past year.

Steps to Reproduce:
Can't be reproduced.

Actual results:
csh does not exit properly when telnetd exits, and it consumes 100% CPU.

Expected results:
csh exits when the corresponding telnetd exits.

Additional info:
From the output of ps, we can see that the parent process (telnetd) had exited, but login.krb5 and csh still existed.

# egrep "(csh|login|COMMAND)" sos_commands/process/ps_alxwww
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
4     0  1682     1  15   0  67440  1896 wait   Ss   ?          0:00 login -h 172.18.81.234 -p
4   601  1688  1682  18   0  71504  2248 -      R    ?      12808:33 -csh
4     0  2721     1  15   0  67444  1892 wait   Ss   ?          0:00 login -h 172.18.81.234 -p
4   601  2722  2721  18   0  71316  2060 -      R    ?      78244:46 -csh
4     0  8808  8807  16   0  67444  1952 wait   Ss   pts/11     0:00 login -h 192.168.237.55 -p
4     0  9734  9733  16   0  67444  1956 wait   Ss   pts/12     0:00 login -h 192.168.237.55 -p
4     0 10906     1  15   0  67444  1892 wait   Ss   ?          0:00 login -h 172.18.81.234 -p
4   601 10907 10906  18   0  71524  2296 -      R    ?      81249:27 -csh
4     0 22319     1  15   0  68336  1892 wait   Ss   ?          0:00 login -h 172.18.101.230 -p
4   601 22320 22319  18   0  71520  2284 -      R    ?     175340:12 -csh
4     0 30285 30282  16   0  67444  1896 wait   Ss   pts/9      0:00 login -h 172.18.81.234 -p
4   601 30286 30285  15   0  71888  2652 -      S+   pts/9      0:00 -csh
4     0 31864     1  15   0  67444  1896 wait   Ss   ?          0:00 login -h vmgapp05 -p
4   601 31865 31864  18   0  71340  2140 -      R    ?      45486:23 -csh

# grep csh sos_commands/process/pstree
|-5*[login.krb5---csh]
| `-telnetd---login.krb5---csh
Normally, when the telnetd process exits, whether as expected or abnormally, the tty device it used is released during the kernel's cleanup of that process. At this point, the kernel sends SIGHUP to the session leader on this tty (here, login.krb5). Because sending SIGHUP is the kernel's responsibility and not the business of telnetd itself, telnetd should not be relevant to this issue. On receipt of SIGHUP, login.krb5 passes it on to its child (here, csh). The following code demonstrates this behavior.

krb5-1.6.1/src/appl/bsd/login.c
<snip>
    while (1) {
#ifdef HAVE_WAITPID
        pid = waitpid(child, 0, 0);
#elif defined(WAIT_USES_INT)
        pid = wait((int *)0);
#else
        pid = wait((union wait *)0);
#endif
        if (hungup) {
#ifdef HAVE_KILLPG
            killpg(child, SIGHUP);
#else
            kill(-child, SIGHUP);
#endif
        }
        if (pid == child)
            break;
    }
</snip>
Created attachment 434731 [details] strace_csh This file was collected when the problem happened: when they found csh using 100% CPU, they attached strace to the csh process. The file has been truncated, because the original trace is too big and consists of the same pattern as in this attachment, repeated over and over.
Created attachment 434733 [details] ltrace_csh
The process login.krb5 has been inherited by init, which means that the kernel's cleanup of telnetd has already finished. So we can believe that the kernel sent SIGHUP. In ps, we can see that login.krb5 was waiting for the child process (csh) to exit. That should be after it received SIGHUP and sent it on to csh. And from the strace of csh, we can see that it had blocked SIGHUP. So it is highly likely that login.krb5 got SIGHUP and passed it on to csh, but csh did not respond to the signal because SIGHUP was blocked, so login has been waiting forever for csh to exit.

Let's try to find out what was going on inside csh when the problem happened, from the output of strace and ltrace.

strace snippet 1
<snip>
fstat(0, 0x7fff3e146ac0) = -1 EBADF (Bad file descriptor)
fstat(1, 0x7fff3e146ac0) = -1 EBADF (Bad file descriptor)
...
fstat(13, 0x7fff3e146ac0) = -1 EBADF (Bad file descriptor)
fstat(14, 0x7fff3e146ac0) = -1 EBADF (Bad file descriptor)
fstat(20, 0x7fff3e146ac0) = -1 EBADF (Bad file descriptor)
...
fstat(25, 0x7fff3e146ac0) = -1 EBADF (Bad file descriptor)
...
fstat(255, 0x7fff3e146ac0) = -1 EBADF (Bad file descriptor)
</snip>

code snippet 1
<snip>
function closem() in sh.misc.c

#define NOFILE  256
#define FSHTTY  15      /* /dev/tty when manip pgrps */
#define FSHIN   16      /* Preferred desc for shell input */
#define FSHOUT  17      /* ... shell output */
#define FSHDIAG 18      /* ... shell diagnostics */
#define FOLDSTD 19      /* ... old std input */

    for (f = 0; f < NOFILE; f++)
        if (f != SHIN && f != SHOUT && f != SHDIAG && f != OLDSTD &&
            f != FSHTTY
#ifdef MALLOC_TRACE
            && f != 25
#endif /* MALLOC_TRACE */
#ifdef S_ISSOCK
            && fstat(f, &st) == 0 && !S_ISSOCK(st.st_mode)
#endif
            )
</snip>

strace snippet 2
<snip>
rt_sigprocmask(SIG_SETMASK, NULL, [HUP], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP], NULL, 8) = 0
</snip>

code snippet 2
<snip>
in the function process() in sh.c

    if (setintr)
#ifdef BSDSIGS
        (void) sigsetmask(sigblock((sigmask_t) 0) & ~sigmask(SIGINT));
#else
        (void) sigrelse(SIGINT);
#endif
</snip>

strace snippet 3
<snip>
lseek(16, 0, SEEK_END) = -1 ESPIPE (Illegal seek)
ioctl(15, TIOCSPGRP, [31416]) = -1 ENOTTY (Inappropriate ioctl for device)
</snip>

code snippet 3
<snip>
the end of the function stderror() in sh.err.c

    btoeof();   --> (void) lseek(SHIN, (off_t) 0, L_XTND);
    set(STRstatus, Strsave(STR1), VAR_READWRITE);
#ifdef BSDJOBS
    if (tpgrp > 0)
        (void) tcsetpgrp(FSHTTY, tpgrp);
#endif
    reset();    --> calls longjmp to go back to the position of setexit in the function process() in sh.c
</snip>

strace snippet 4
<snip>
write(17, "[vmgapp10:/app/vmg/util/vmg_File"..., 47) = -1 EIO (Input/output error)
</snip>

code snippet 4
<snip>
    if (haderr)
        unit = didfds ? 2 : SHDIAG;
    else
        unit = didfds ? 1 : SHOUT;      /* FSHOUT is defined as 17 */
    ...
    if (write(unit, linbuf, sz) == -1)
        switch (errno) {
#ifdef EIO
        /* We lost our tty */
        case EIO:
#endif
        ...
        default:
            stderror(ERR_SILENT);
            break;
        }
</snip>

strace snippet 5
<snip>
alarm(60000000) = 60000000
</snip>

code snippet 5
<snip>
setalarm(1) // in the function process() in sh.c
</snip>

strace snippet 6 (Please note that SIGHUP was always in the signal mask of csh, so it would be blocked.)
<snip>
rt_sigprocmask(SIG_SETMASK, NULL, [HUP], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP], NULL, 8) = 0         // unblock SIGINT
rt_sigprocmask(SIG_SETMASK, NULL, [HUP], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP INT], NULL, 8) = 0     // block SIGINT
rt_sigprocmask(SIG_SETMASK, NULL, [HUP INT], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP], NULL, 8) = 0         // unblock SIGINT
rt_sigprocmask(SIG_SETMASK, NULL, [HUP], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP INT], NULL, 8) = 0     // block SIGINT
rt_sigprocmask(SIG_SETMASK, NULL, [HUP INT], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP], NULL, 8) = 0         // unblock SIGINT
rt_sigprocmask(SIG_SETMASK, NULL, [HUP], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP INT], NULL, 8) = 0     // block SIGINT
rt_sigprocmask(SIG_SETMASK, NULL, [HUP INT], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP], NULL, 8) = 0         // unblock SIGINT
rt_sigprocmask(SIG_SETMASK, NULL, [HUP], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP INT], NULL, 8) = 0     // block SIGINT
rt_sigprocmask(SIG_SETMASK, NULL, [HUP INT], 8) = 0
rt_sigprocmask(SIG_SETMASK, [HUP], NULL, 8) = 0         // unblock SIGINT
</snip>

code snippet 6
<snip>
    if (setintr)
#ifdef BSDSIGS
        (void) sigsetmask(sigblock((sigmask_t) 0) & ~sigmask(SIGINT));  // unblock SIGINT
#else
        (void) sigrelse(SIGINT);
#endif
    ...
        watch_login(0);     // these four functions all block SIGINT first, then unblock it
#endif /* !HAVENOUTMP */
    sched_run(0);
    period_cmd();
    precmd();
</snip>

Based on the information above, we can tell what was going on in csh. When csh hit a lexical error, which is not fatal, it tried to flush the error message to the tty, which had been lost because telnetd had exited. This is confirmed by the "?" in the "TTY" column for csh in the output of ps. csh then got the EIO error in flush() and jumped back to the function process() to begin a new loop iteration. There it found the variable "haderr" set and called closem() to close files. After this, it repeated the procedure described above. Consequently, it fell into an endless loop and caused high CPU usage.
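The loop described above can be imitated in a few lines of C. This is only a sketch under stated assumptions, not tcsh code: spin_on_lost_output() and max_iterations are hypothetical names, a closed descriptor (EBADF) stands in for the lost tty (EIO), and the iteration cap exists only so that the demo terminates. The shell in this report had no such cap.

```c
#include <assert.h>
#include <fcntl.h>
#include <setjmp.h>
#include <unistd.h>

static jmp_buf top_of_loop;     /* plays the role of setexit() in process() */

/* Hypothetical helper: returns how many times the "shell" went round the loop. */
int spin_on_lost_output(int max_iterations)
{
    volatile int iterations = 0;    /* volatile: changed between setjmp and longjmp */
    int fd = open("/dev/null", O_WRONLY);

    assert(max_iterations >= 0);
    if (fd == -1)
        return -1;
    close(fd);                      /* from now on every write(fd, ...) fails */

    setjmp(top_of_loop);            /* every failed write lands back here */
    if (iterations >= max_iterations)
        return iterations;          /* demo-only way out of the loop */
    iterations++;

    if (write(fd, "prompt> ", 8) == -1)
        longjmp(top_of_loop, 1);    /* like stderror() -> reset() */
    return 0;                       /* not reached while the write keeps failing */
}
```

Calling spin_on_lost_output(5) goes round the loop five times and returns 5. Without the cap, and with SIGHUP blocked so nothing can break in from outside, the process would spin at 100% CPU.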
In short, I think the source of this problem is that csh blocks SIGHUP. Because of that, csh could not exit when telnetd exited, and it fell into an endless loop after losing the tty. So why did csh block SIGHUP? At this point, I am not sure. In csh, I found only two situations in which csh blocks its own SIGHUP. One is inside the SIGHUP handler: because the handler exits on SIGHUP, it can block SIGHUP. The other is in the function donohup(), but the built-in command nohup can't be invoked in a login shell. So neither case seems to explain this issue.
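To make the effect of the blocked signal concrete, here is a minimal sketch (not tcsh code) showing that a SIGHUP sent while the signal is blocked only becomes pending and is delivered once the signal is unblocked. The names on_hup() and blocked_hup_stays_pending() are hypothetical, and the sketch assumes a single-threaded process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t hup_seen;

static void on_hup(int sig)
{
    (void)sig;
    hup_seen = 1;                   /* only records that the handler ran */
}

/* Returns 1 if SIGHUP stayed pending while blocked and ran after unblocking. */
int blocked_hup_stays_pending(void)
{
    struct sigaction sa, old_sa;
    sigset_t hup, old_mask, pending;
    int pending_while_blocked, delivered_after_unblock;

    sa.sa_handler = on_hup;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGHUP, &sa, &old_sa);

    sigemptyset(&hup);
    sigaddset(&hup, SIGHUP);
    sigprocmask(SIG_BLOCK, &hup, &old_mask);    /* same state as [HUP] in the strace */

    hup_seen = 0;
    raise(SIGHUP);                              /* what login.krb5 does via killpg() */
    sigpending(&pending);
    assert(sigismember(&pending, SIGHUP) != -1);
    pending_while_blocked = sigismember(&pending, SIGHUP) == 1 && hup_seen == 0;

    sigprocmask(SIG_UNBLOCK, &hup, NULL);       /* the pending SIGHUP is delivered now */
    delivered_after_unblock = hup_seen == 1;

    sigprocmask(SIG_SETMASK, &old_mask, NULL);  /* restore the caller's state */
    sigaction(SIGHUP, &old_sa, NULL);
    return pending_while_blocked && delivered_after_unblock;
}
```

In the traces above the mask never loses HUP, so the unblocking step never happens and the hangup is never acted upon.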
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
Hello, I am sending only my last update from the issue tracker/rosetta to keep this in sync with BZ and to inform you that I am still working on this one. -Martin

=== <snip> ===
Hello, so far I have made progress only on the problem where tcsh gets stuck on a futex inside glibc's malloc routines. It is caused by the fact that malloc()-related routines aren't signal safe. On Linux you mustn't call these routines from a signal handler, either directly or through other functions that could call them internally, but tcsh does that, and it is a bug. I found out that this is already fixed in the version shipped with Fedora (tcsh-6.17). There, when a signal arrives, the handler routine is not called straight away; a global variable for that specific signal is set instead. In several places a special function is called regularly; it checks these global variables and calls the appropriate routines synchronously. This way, a malloc routine can no longer be interrupted by a signal just after locking its internal structures and then deadlock because a routine called from the interrupting handler tries to lock the same structure again.

I am currently working on the patch that backports these changes to RHEL5, and after that I am going to create a BZ for it. In the meantime I am still running the reproducer for the second bug (looping in process()), but after manually fixing the problems with signals and malloc I am not able to reproduce it on RHEL5. I still need to run the tests on RHEL4 as well.

I also made some tests during the weekend and found strange things. From the strace output in conjunction with the source code we can see that the xexit() function gets called repeatedly, but it never gets as far as calling _exit():

void
#ifdef PROF
done(i)
#else
xexit(i)
#endif
    int i;
{
...
    if (child == 0)
        nlsclose();     <<<----
#endif /* NLS_CATALOGS */
#ifdef WINNT_NATIVE
    nt_cleanup();
#endif /* WINNT_NATIVE */
    _exit(i);
}

I found out that program execution gets interrupted inside the glibc call iconv_close() (xexit()->nlsclose()->iconv_close()) and returns to the main loop in process(). This is strange, and I need to investigate it further. I will try to update asap with the patch+bugzilla for the first bug.

Best regards,
-Martin

__notes to myself:
http://www.gnu.org/software/libc/manual/pdf/libc.pdf
24.4.6 Signal Handling and Nonreentrant Functions
...
On most systems, malloc and free are not reentrant, because they use a static data structure which records what memory blocks are free. As a result, no library functions that allocate or free memory are reentrant. This includes functions that allocate space to store a result. The best way to avoid the need to allocate memory in a handler is to allocate in advance space for signal handlers to use. The best way to avoid freeing memory in a handler is to flag or record the objects to be freed, and have the program check from time to time whether anything is waiting to be freed. But this must be done with care, because placing an object on a chain is not atomic, and if it is interrupted by another signal handler that does the same thing, you could “lose” one of the objects.
=== </snip> ===
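The "set a flag in the handler, do the work later" approach described in this comment can be sketched as follows. This is a generic illustration of deferred signal handling, not the tcsh-6.17 code: note_hup(), run_pending_hup() and deferred_hup_demo() are hypothetical names, and the malloc()/free() pair merely stands in for cleanup work that is not async-signal-safe.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdlib.h>

static volatile sig_atomic_t hup_pending;   /* set in the handler, read in the loop */
static int cleanup_runs;                    /* counts how often the cleanup ran */

/* Signal handler: touches nothing but a sig_atomic_t flag. */
static void note_hup(int sig)
{
    (void)sig;
    hup_pending = 1;
}

/* Called from ordinary (non-signal) context, e.g. at the top of a main loop. */
static void run_pending_hup(void)
{
    if (hup_pending) {
        void *scratch;

        hup_pending = 0;
        scratch = malloc(64);   /* safe here: we are not inside a signal handler */
        free(scratch);
        cleanup_runs++;
    }
}

/* Returns 1 if the cleanup ran exactly once, and only from the loop. */
int deferred_hup_demo(void)
{
    struct sigaction sa, old_sa;
    int runs_right_after_signal;

    sa.sa_handler = note_hup;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGHUP, &sa, &old_sa);

    cleanup_runs = 0;
    raise(SIGHUP);                      /* handler runs, but only sets the flag */
    runs_right_after_signal = cleanup_runs;
    assert(hup_pending == 1);

    run_pending_hup();                  /* the deferred work happens here */
    run_pending_hup();                  /* nothing pending any more: no-op */

    sigaction(SIGHUP, &old_sa, NULL);
    return runs_right_after_signal == 0 && cleanup_runs == 1;
}
```

Because the allocator is only ever entered from the main flow of control, a signal that arrives in the middle of malloc() can no longer re-enter it and deadlock on its lock.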
Hello, I have made progress with the problem where tcsh is getting stuck in process(). When I was going through the code (upstream tcsh-6.15.00) to backport the changes that fix the second bug, where tcsh gets stuck in malloc()-related functions in signal handlers, I found out I was right. See the following code and especially the comment (I had gone through the related glibc code before, but I must have overlooked the longjmp() there):

void
nlsclose(void)
{
#ifdef NLS_CATALOGS
#if defined(HAVE_ICONV) && defined(HAVE_NL_LANGINFO)
    if (catgets_iconv != (iconv_t)-1) {
        iconv_close(catgets_iconv);
        catgets_iconv = (iconv_t)-1;
    }
#endif /* HAVE_ICONV && HAVE_NL_LANGINFO */
    if (catd != (nl_catd)-1) {
        /*
         * catclose can call other functions which can call longjmp
         * making us re-enter this code. Prevent infinite recursion
         * by resetting catd. Problem reported and solved by:
         * Gerhard Niklasch
         */
        nl_catd oldcatd = catd;
        catd = (nl_catd)-1;
        while (catclose(oldcatd) == -1 && errno == EINTR)
            handle_pending_signals();
    }
#endif /* NLS_CATALOGS */
}

According to my tests, I thought the problem was in iconv_close(); I still need to check this. I won't create an extra BZ for the second bug, because the above code would need to be in the patch for both bugs (handle_pending_signals()). I will try to update asap.

Best regards,
-Martin
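A longjmp out of code that runs inside a signal handler would also be one plausible way to end up with SIGHUP permanently in the signal mask, as seen in the strace. The sketch below only demonstrates that general mechanism with the POSIX sigsetjmp()/siglongjmp() pair; it is not tcsh code, the names hup_handler() and hup_blocked_after_jump() are hypothetical, and whether setexit()/reset() in tcsh 6.14 preserve the signal mask in this way is an assumption this report does not confirm.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

static sigjmp_buf main_loop;    /* stands in for the setexit() point in process() */

/* While this handler runs, SIGHUP is blocked (no SA_NODEFER).  Jumping out
 * instead of returning means the kernel never restores the old mask. */
static void hup_handler(int sig)
{
    (void)sig;
    siglongjmp(main_loop, 1);
}

/* Returns 1 if SIGHUP is still blocked after jumping out of its own handler. */
int hup_blocked_after_jump(void)
{
    struct sigaction sa, old_sa;
    sigset_t hup, before, now;
    int still_blocked;

    sigemptyset(&hup);
    sigaddset(&hup, SIGHUP);
    sigprocmask(SIG_UNBLOCK, &hup, &before);    /* start with SIGHUP deliverable */

    sa.sa_handler = hup_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGHUP, &sa, &old_sa);

    /* Second argument 0: the signal mask is NOT saved, so siglongjmp()
     * will not restore it. */
    if (sigsetjmp(main_loop, 0) == 0)
        raise(SIGHUP);                          /* does not return normally */

    sigprocmask(SIG_SETMASK, NULL, &now);       /* query the current mask */
    still_blocked = sigismember(&now, SIGHUP);
    assert(still_blocked != -1);

    sigaction(SIGHUP, &old_sa, NULL);           /* restore the caller's state */
    sigprocmask(SIG_SETMASK, &before, NULL);
    return still_blocked;
}
```

After the jump the process is back in its main flow with SIGHUP still blocked, so any further hangup would stay pending indefinitely.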
Hello, after spending almost the whole week working on the patch, I found that the patch itself already has 812 lines and is still not finished. The changes would be huge and invasive, even if they made the tcsh code cleaner and simpler, and I don't have any testing tool that would check that none of the tcsh functionality got broken. So I decided to give it one last try: I simply removed the call to nlsclose():

=== <snip> ===
diff -up tcsh-6.14.00/sh.c.nlsclose tcsh-6.14.00/sh.c
--- tcsh-6.14.00/sh.c.nlsclose 2010-09-29 11:08:44.000000000 +0200
+++ tcsh-6.14.00/sh.c 2010-09-29 11:09:17.000000000 +0200
@@ -2460,15 +2460,6 @@ xexit(i)
         }
     }
     untty();
-#ifdef NLS_CATALOGS
-    /*
-     * We need to call catclose, because SVR4 leaves symlinks behind otherwise
-     * in the catalog directories. We cannot close on a vforked() child,
-     * because messages will stop working on the parent too.
-     */
-    if (child == 0)
-        nlsclose();
-#endif /* NLS_CATALOGS */
 #ifdef WINNT_NATIVE
     nt_cleanup();
 #endif /* WINNT_NATIVE */
=== </snip> ===

from xexit(), as both bugs come from it and we really don't need to call the function just before tcsh ends: the kernel does all of that at process termination anyway, and what is required for SVR4 is not required for the glibc shipped with RHEL5 (when I look at the catclose and iconv_close source code I see no problem in not calling them; see the related glibc code below). I am currently running the reproducer, and so far it hasn't got stuck. I will run it for the whole day today and will inform you about the results. Let's wait for them.
Best regards,
-Martin

tcsh nlsclose() code:

tcsh-6.14.00/sh.func.c:
void
nlsclose()
{
#ifdef NLS_CATALOGS
#ifdef HAVE_ICONV
    if (catgets_iconv != (iconv_t)-1) {
        iconv_close(catgets_iconv);
        catgets_iconv = (iconv_t)-1;
    }
#endif /* HAVE_ICONV */
    catclose(catd);
#endif /* NLS_CATALOGS */
}

glibc related parts:

glibc-2.5-20061008T1257/iconv/iconv_close.c:
int
iconv_close (iconv_t cd)
{
  if (__builtin_expect (cd == (iconv_t *) -1L, 0))
    {
      __set_errno (EBADF);
      return -1;
    }

  return __gconv_close ((__gconv_t) cd) ? -1 : 0;
}

glibc-2.5-20061008T1257/catgets/catgets.c:
int
catclose (nl_catd catalog_desc)
{
  __nl_catd catalog;

  /* Be generous if catalog which failed to be open is used. */
  if (catalog_desc == (nl_catd) -1)
    {
      __set_errno (EBADF);
      return -1;
    }

  catalog = (__nl_catd) catalog_desc;

#ifdef _POSIX_MAPPED_FILES
  if (catalog->status == mmapped)
    __munmap ((void *) catalog->file_ptr, catalog->file_size);
  else
#endif /* _POSIX_MAPPED_FILES */
    if (catalog->status == malloced)
      free ((void *) catalog->file_ptr);
    else
      {
        __set_errno (EBADF);
        return -1;
      }

  free ((void *) catalog);

  return 0;
}

glibc-2.5-20061008T1257/sysdeps/mach/munmap.c:
int
__munmap (__ptr_t addr, size_t len)
{
  kern_return_t err;
  if (err = __vm_deallocate (__mach_task_self (),
                             (vm_address_t) addr, (vm_size_t) len))
    {
      errno = err;
      return -1;
    }
  return 0;
}
let's make some of the comments public
Hello, updating the current status. After running the reproducer overnight, the bug with looping in process() didn't show up (great), but the problem with getting stuck on a futex inside malloc in a signal handler persists (no surprise, as besides phup() other signal handlers also call many glibc functions that internally allocate/deallocate memory using the non-reentrant malloc() routines):

(gdb) bt
#0  0x00c01402 in __kernel_vsyscall ()
#1  0x001ee783 in __lll_lock_wait_private () from /lib/libc.so.6
#2  0x0017dc76 in _L_lock_5396 () at malloc.c:6195
#3  0x00178f69 in _int_free (av=0x265140, p=0x9a33360, have_lock=0) at malloc.c:4846
#4  0x001799e9 in __libc_free (mem=0x9a33368) at malloc.c:3670
#5  0x0805f1c9 in freelex (vp=0x80b3450) at sh.lex.c:288
#6  0x0804a6e1 in process (catch=0) at sh.c:2012
#7  0x0804b588 in srcunit (unit=<value optimized out>, onlyown=<value optimized out>, hflg=2, av=0xb7c00540) at sh.c:1724
#8  0x0804b815 in srcfile (f=<value optimized out>, onlyown=0, flag=2, av=0xb7c00540) at sh.c:1494
#9  0x0804b8f5 in dosource (t=0xb7c00540, c=0x0) at sh.c:2233
#10 0x0805dee4 in rechist (fname=0x0, ref=1) at sh.hist.c:450
#11 0x0804bc09 in record () at sh.c:2535
#12 0x0804be8a in phup (snum=1) at sh.c:1827
#13 <signal handler called>
#14 _int_malloc (av=0x265140, bytes=16384) at malloc.c:4652
#15 0x0017be97 in __libc_malloc (bytes=16384) at malloc.c:3605
#16 0x08081c57 in scalloc (s=4096, n=16384) at tc.alloc.c:553
#17 0x0805f36b in balloc (buf=704) at sh.lex.c:1682
#18 0x0805fe66 in bgetc (wanteof=0) at sh.lex.c:1820
#19 readc (wanteof=0) at sh.lex.c:1562
#20 0x0806092f in getC1 (flag=3) at sh.lex.c:499
#21 0x0806247b in word (hp=0x80b3450) at sh.lex.c:429
#22 lex (hp=0x80b3450) at sh.lex.c:195
#23 0x0804a7f9 in process (catch=0) at sh.c:2097
#24 0x0804b588 in srcunit (unit=<value optimized out>, onlyown=<value optimized out>, hflg=1, av=0x98f8d18) at sh.c:1724
#25 0x0804b815 in srcfile (f=<value optimized out>, onlyown=0, flag=1, av=0x98f8d18) at sh.c:1494
#26 0x0804b8f5 in dosource (t=0x98f8d18, c=0x0) at sh.c:2233
#27 0x0804d8cc in main (argc=0, argv=0xbf8b83d4) at sh.c:1328
(gdb)

So I decided not to run the code in phup() as a signal handler and made another patch, which partly backports functionality from the newer tcsh-6.15.00 version: it calls phup() as a regular function in the main loop in process() once SIGHUP has been received. I will probably need to add more than one place that checks the global variable indicating that SIGHUP was received. Doing another round of tests..

Best regards,
-Martin

diff -up tcsh-6.14.00/sh.c.nlsclose_sighup tcsh-6.14.00/sh.c
--- tcsh-6.14.00/sh.c.nlsclose_sighup 2010-09-30 08:54:04.425072301 +0200
+++ tcsh-6.14.00/sh.c 2010-09-30 09:40:16.486072005 +0200
@@ -163,7 +163,7 @@ static int srcfile __P((const char *,
 #else
 int srcfile __P((const char *, int, int, Char **));
 #endif /*WINNT_NATIVE*/
-static RETSIGTYPE phup __P((int));
+RETSIGTYPE phup __P((int));
 static void srcunit __P((int, int, int, Char **));
 static void mailchk __P((void));
 #ifndef _PATH_DEFPATH
@@ -1106,7 +1106,7 @@ main(argc, argv)
      * We also only setup the handlers for shells that are trully
      * interactive.
      */
-    osig = signal(SIGHUP, phup); /* exit processing on HUP */
+    osig = signal(SIGHUP, queue_phup); /* exit processing on HUP */
     if (!loginsh && osig == SIG_IGN)
         (void) signal(SIGHUP, osig);
 #ifdef SIGXCPU
@@ -1795,7 +1795,7 @@ exitstat()
 /*
  * in the event of a HUP we want to save the history
  */
-static RETSIGTYPE
+RETSIGTYPE
 phup(snum)
 int snum;
 {
@@ -1998,6 +1998,8 @@ process(catch)
     getexit(osetexit);
 
     for (;;) {
+        handle_pending_signals();
+
         pendjob();
 
         /* This was leaking memory badly, particularly when sourcing
@@ -2460,15 +2462,6 @@ xexit(i)
         }
     }
     untty();
-#ifdef NLS_CATALOGS
-    /*
-     * We need to call catclose, because SVR4 leaves symlinks behind otherwise
-     * in the catalog directories. We cannot close on a vforked() child,
-     * because messages will stop working on the parent too.
-     */
-    if (child == 0)
-        nlsclose();
-#endif /* NLS_CATALOGS */
 #ifdef WINNT_NATIVE
     nt_cleanup();
 #endif /* WINNT_NATIVE */
diff -up tcsh-6.14.00/tc.sig.c.nlsclose_sighup tcsh-6.14.00/tc.sig.c
--- tcsh-6.14.00/tc.sig.c.nlsclose_sighup 2010-09-30 09:03:47.552224377 +0200
+++ tcsh-6.14.00/tc.sig.c 2010-09-30 09:35:37.281099316 +0200
@@ -43,6 +43,23 @@ RCSID("$Id: tc.sig.c,v 3.29 2005/01/18 2
  */
 #define MAX_CHLD 50
 
+static volatile sig_atomic_t phup_pending; /* = 0; */
+
+void
+queue_phup(int sig)
+{
+    USE(sig);
+    phup_pending = 1;
+}
+
+void
+handle_pending_signals(void)
+{
+    if (phup_pending) {
+        phup(SIGHUP);
+    }
+}
+
 # ifdef UNRELSIGS
 static struct mysigstack {
     int s_w; /* wait report */
@@ -51,7 +68,6 @@ static struct mysigstack {
 } stk[MAX_CHLD];
 static int stk_ptr = -1;
 
-
 /* queue child signals */
 static RETSIGTYPE
diff -up tcsh-6.14.00/tc.sig.h.nlsclose_sighup tcsh-6.14.00/tc.sig.h
--- tcsh-6.14.00/tc.sig.h.nlsclose_sighup 2010-09-30 09:07:11.330197320 +0200
+++ tcsh-6.14.00/tc.sig.h 2010-09-30 09:19:22.742107058 +0200
@@ -225,4 +225,7 @@ extern RETSIGTYPE synch_handler();
         (void) sigsetmask(sm))
 # endif /* SAVESIGVEC */
 
+extern void handle_pending_signals(void);
+extern void queue_phup(int);
+
 #endif /* _h_tc_sig */
Some of these issues might be related to a tcsh problem that we've seen, and we have a simple patch. See https://bugzilla.redhat.com/show_bug.cgi?id=690356
(In reply to comment #27)
> Some of these issues might be related to a tcsh problem that we've seen, and we
> have a simple patch. See https://bugzilla.redhat.com/show_bug.cgi?id=690356

They are not related, AFAIK. Check the strace_csh output (attachment #434731 [details]); it is a completely different issue from bug 690356.
(In reply to comment #28)
> They are not related, AFAIK. Check the strace_csh output (attachment #434731 [details]),
> it is completely different issue than the bug 690356.

I think you need to take another look at bug #690356, specifically this comment: https://bugzilla.redhat.com/show_bug.cgi?id=690356#c7

The strace outputs of this bug and that one do, in fact, look very similar. Have you tried the patch from bug #690356?
Angelo, sure I tried the patch, but I still insist the issues are not related, even though the strace outputs can look somewhat similar. Both bugs will be fixed in RHEL-5.7 tcsh617.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: When tcsh did not exit properly, it could have entered an infinite loop, using 100% of the CPU, and become unresponsive. This was caused by a function interrupting the exit routine and then re-entering the code and thus causing it to loop infinitely.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-1072.html
This bug was originally reported against tcsh-6.14-12.el5. Why was the Component quietly switched to tcsh617? Does Red Hat plan on providing a fix for tcsh-6.14 for this problem? Or has support for tcsh-6.14 been officially abandoned? If I'm running RHEL5/tcsh-6.14 and I'm being affected by this problem, am I now supposed to switch to tcsh617? That's not as easy as it sounds, as tcsh617 has non-backward-compatible behaviors: see https://bugzilla.redhat.com/show_bug.cgi?id=638955.
There were significant changes in the tcsh signal handlers between 6.14 and 6.15, and backporting these changes in such a complex package as tcsh is almost impossible. It would be a very risky and invasive change, and it's quite late in the RHEL-5 release cycle. As there are incompatibilities in tcsh 6.17, the new component tcsh617 was created. Support for tcsh-6.14 was not abandoned: in the case of a security issue, both packages (tcsh and tcsh617) will be fixed. It is only that tcsh617 is now the preferred one. If none of the tcsh-6.17 vs. tcsh-6.14 incompatibilities limits you, it is probably better to use the tcsh617 package, as this version is also part of RHEL-6. If you are hit by this issue and using the tcsh617 package is not an option for you, please contact product support, as bugzilla is not a support tool for Red Hat Enterprise Linux.