Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
Once again, I had trouble analyzing the core. The backtrace did yield some info though, from jobs.c:1653:

    while(jp)
    {
        if(jp->pid==pid)
            break;
        if(pid==0 && !jp->next)
            break;
        jpold = jp;
        jp = jp->next;
    }
    if(jp)

Keep in mind that in this instance pid==0. So it is possible that the jp data structure somehow got corrupted and that the tail was never terminated with jp->next == 0. The good news is that if the problem occurs again, we will know from the backtrace whether it is always this function that is hung, and we can get an idea of how (and if) the jp linked list is looping.

If the jp linked list turned circular, then it would be possible to change the while(jp) into

    struct jobsave *orig = jp;
    while(jp)
    {
        ...
        jp = jp->next;
        if(jp == orig)
            break;
    }

However, this would be a short-term workaround: since the data structure is actually corrupted, it might lead to other unexpected behavior. It also relies on the presumption that the linked list is actually circular and not just arbitrary data. Additionally, it would make it impossible to know when the problem occurred.

In the meantime, I will try to analyze the core by putting it on different machines, and get some engineering help to make sure I'm not doing anything completely braindead.
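For reference, a check against the starting node only catches a cycle that leads back to that node. A more general test is Floyd's two-pointer walk, sketched below. This is an illustration only and not code from ksh: the struct is a stripped-down stand-in for struct jobsave, and list_has_cycle is a hypothetical helper name.

```c
#include <stddef.h>

/* Stand-in for ksh's struct jobsave; only the fields used here. */
struct jobsave
{
    struct jobsave *next;
    int pid;
};

/*
 * Hypothetical helper: returns 1 if following ->next from head ever
 * revisits a node, 0 if the list ends in a null pointer.
 * Floyd's algorithm: slow advances one node per step, fast advances two;
 * they can only meet again if the list loops.
 */
static int list_has_cycle(const struct jobsave *head)
{
    const struct jobsave *slow = head, *fast = head;

    while (fast && fast->next)
    {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast)
            return 1;
    }
    return 0;
}
```

Like the workaround above, this only detects the corruption; it does not explain how the list got that way.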
From fhirtz in IT:

I have the core and required lib versions installed on ibm-x3650-1.gsslab.rdu.redhat.com (root:redhat):

<snip>
[root@ibm-x3650-1 ~]# gdb /bin/ksh core.3410
GNU gdb Red Hat Linux (6.5-37.el5_2.1rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/libthread_db.so.1".
<snip>
Core was generated by `/bin/ksh'.
#0 0x0000000000414359 in job_chksave (pid=0) at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1653
1653            if(jp->pid==pid)
(gdb) bt
#0 0x0000000000414359 in job_chksave (pid=0) at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1653
#1 0x000000000041451c in job_unpost (pwtop=<value optimized out>, notify=<value optimized out>) at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1523
#2 0x0000000000415a97 in job_wait (pid=0) at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1392
#3 0x0000000000434eef in sh_exec (t=0xacadc00, flags=4) at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1087
#4 0x00000000004368ea in sh_exec (t=0xacadd20, flags=<value optimized out>) at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1218
#5 0x0000000000435f2c in sh_exec (t=0xad1f5a0, flags=4) at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1313
#6 0x0000000000435dd9 in sh_exec (t=0xad1f5a0, flags=181532064) at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1596
#7 0x0000000000435db3 in sh_exec (t=0xad19060, flags=4) at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1332
#8 0x00000000004350ec in sh_exec (t=0xacad410, flags=36) at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1531
#9 0x00000000004076b3 in exfile ()
#10 0x0000000000406bbc in sh_main ()
#11 0x000000363281d8a4 in __libc_start_main (main=0x405f80 <main>, argc=2, ubp_av=0x7fff6eca28a8, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fff6eca2898) at libc-start.c:231
#12 0x0000000000405ec9 in _start ()
(gdb) l
1648    {
1649        register struct jobsave *jp = bck.list, *jpold=0;
1650        register int r= -1;
1651        while(jp)
1652        {
1653            if(jp->pid==pid)
1654                break;
1655            if(pid==0 && !jp->next)
1656                break;
1657            jpold = jp;
(gdb) p jp
$1 = (struct jobsave *) 0xad1f5a0
(gdb) p jp->next
$2 = (struct jobsave *) 0xad1f5a0
(gdb) p jpold
$3 = (struct jobsave *) 0xad1f5a0
(gdb) p jp->pid
$4 = 10156
(gdb) p pid
$5 = 0
</snip>

It indeed appears that we're stuck in a loop since "jp->next" points to itself (which was, of course, also used to set "jpold") here as you've noted. Agreed that we could hack in a check for that condition and break (and possibly output some useful data right then),
For the case where ksh is spinning, I found a possible cause:

(gdb) up
#1 0x000000000041451c in job_unpost (pwtop=<value optimized out>, notify=<value optimized out>) at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1523
1523                job_chksave(0);
(gdb) ls
Undefined command: "ls". Try "help".
(gdb) l
1518        if(pw->p_flag&P_EXITSAVE)
1519        {
1520            struct jobsave *jp;
1521            /* save status for future wait */
1522            if(bck.count++ > sh.lim.child_max)
1523                job_chksave(0);
1524            if(jp = jobsave_create(pw->p_pid))
1525            {
1526                jp->next = bck.list;
1527                bck.list = jp;
(gdb) p bck.count
$1 = 1001
(gdb) p sh.lim.child_max
$2 = 999
(gdb)
----
Now we may be able to more quickly reproduce if we set sh.lim.child_max to a smaller number. This could be unrelated, and it's possible that it's only forcing the check on the circular list, thus causing the spin, but it's also possible that going over the child_max causes the corruption in the linked list to begin with.

Also interesting here is that this appears not to have started spinning in the case that (bck.count==1000) > (sh.lim.child_max ==999)
Thank you for the analysis. I'll try to run some tests on my own. Hopefully, with the information you provided, I will be able to reproduce the bug.
OpenSolaris appears to be hitting a similar thing: http://bugs.opensolaris.org/view_bug.do?bug_id=6510946

Also, an explanation for comment #17, from jobs.c:1521:

    /* save status for future wait */
    if(bck.count++ > sh.lim.child_max)
        job_chksave(0);
    if(jp = jobsave_create(pw->p_pid))
    {
        jp->next = bck.list;
        bck.list = jp;
        jp->exitval = pw->p_exit;
        if(pw->p_flag&P_SIGNALLED)
            jp->exitval |= SH_EXITSIG;
    }
    pw->p_flag &= ~P_EXITSAVE;
----
I forgot about my post/pre increment operators. Ben, the NYC intern, pointed out that of course bck.count will be 2 greater than sh.lim.child_max the first time job_chksave(0) executes: when bck.count == 999, it is not executed; when bck.count == 1000, it is executed, but only after the increment, hence 1001.

The question is whether this is correct or not. My thinking is that they should have used preincrement here, since they eventually do the jobsave_create anyway. That way the code is more like "we're going to create 1000 procs, one greater than allowed, so clean one up before creating the new one." It could be that some other part of ksh is not allowing the 1000th job in due to ulimit, and that is causing the weird behavior. I'll ping the AT&T mailing list on this to make sure the list is actually accounted for in this fashion. In the meantime I will build a new package to see if this fixes the client's problem.
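To make the off-by-one concrete, here is a small standalone sketch of the two comparisons. It is not ksh code; first_trigger_post and first_trigger_pre are hypothetical names, and the counter simply starts at 0 and is bumped once per call, as bck.count is for each saved job.

```c
/*
 * Value of the counter as seen inside the branch the first time
 * `count++ > child_max` is true (the form used in jobs.c).
 * Assumes child_max is well below INT_MAX so the counter cannot overflow.
 */
static int first_trigger_post(int child_max)
{
    int count = 0;

    for (;;)
    {
        if (count++ > child_max)   /* compares the old value, then increments */
            return count;
    }
}

/* Same, for the proposed `++count > child_max`. */
static int first_trigger_pre(int child_max)
{
    int count = 0;

    for (;;)
    {
        if (++count > child_max)   /* increments first, then compares */
            return count;
    }
}
```

With child_max set to 999, the post-increment form first enters the branch with the counter at 1001, matching the value printed in gdb, while the pre-increment form enters it at 1000.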
From David Korn:

I think that ++bck.count would make more sense but since this is a linked list it won't matter. The standard says that the background process from the last sh.lim.child_max processes must be saved by the shell. The code was saving sh.lim.child_max+1 which is ok but not required. Thanks for the report.

David Korn dgk.com
I've reproduced a similar problem.

$ ulimit -u 10
$ for (( i = 0; i < 10; i++ ))
$ do
$ cat /dev/zero > /dev/null &
$ done

This script will hang ksh with the latest version (rawhide) we built and gave the customer for RHEL 5. The cause is line 2252 in sh/xec.c:

    while(_sh_fork(parent=fork(),flags,jobid) < 0)

fork() returns -1 every time, presumably because we hit the limit for processes. Perhaps Solaris does the same kind of accounting that Linux does, where maybe BSD or other UNIX doesn't. In any case it appears that ksh thinks it can allocate one more process than the system will allow. From the fork man page:

    EAGAIN It was not possible to create a new process because the caller's RLIMIT_NPROC resource limit was encountered. To exceed this limit, the process must have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE capability.
----
Running the same script in bash causes an exit:

$ ulimit -u 10
$ for (( i=0 ; i < 10; i++ ))
> do
> cat /dev/zero > /dev/null &
> done
-bash: fork: Resource temporarily unavailable

I will find where ksh accounting has gone wrong, and have a patch by the end of the day. To be fair, this isn't exactly the same problem that we saw earlier in the core, but it causes ksh to "spin" at 100% CPU, so it is possible they are seeing this as well. I would also like to note that the customer is using a relatively low value for max user processes, which would trigger this more easily.
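One way to avoid spinning when fork() keeps failing with EAGAIN is to bound the retries and sleep between them. The sketch below shows that pattern only; it is not the ksh fix and not how _sh_fork() works. fork_with_retry, the fork_fn type and the always_eagain stub (which simulates hitting RLIMIT_NPROC) are all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Any function with fork()'s signature; lets the stub below stand in for fork(). */
typedef pid_t (*fork_fn)(void);

/*
 * Hypothetical helper: call do_fork() until it succeeds, but give up
 * after max_tries attempts that fail with EAGAIN, sleeping delay_sec
 * between attempts instead of retrying in a tight loop.
 * Returns the value from do_fork() on success, or -1 with errno left
 * as set by the last failed attempt.
 */
static pid_t fork_with_retry(fork_fn do_fork, int max_tries, unsigned delay_sec)
{
    assert(max_tries > 0);

    for (int attempt = 1; ; attempt++)
    {
        pid_t pid = do_fork();

        if (pid >= 0)
            return pid;                 /* 0 in the child, >0 in the parent */
        if (errno != EAGAIN || attempt >= max_tries)
            return -1;                  /* other error, or retries exhausted */
        sleep(delay_sec);               /* back off before trying again */
    }
}

/* Illustration-only stub: behaves like fork() at the process limit. */
static int stub_calls;

static pid_t always_eagain(void)
{
    stub_calls++;
    errno = EAGAIN;
    return -1;
}
```

In real use the first argument would be fork itself, for example fork_with_retry(fork, 5, 1); passing the stub shows the give-up path without needing to lower ulimit -u.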
Looks like ksh is expecting waitpid in jobs.c to wait, as it should. Unfortunately, at this point there are no child procs (determined using 'ps auxf'), so waitpid returns immediately and sets errno to ECHILD. At this point it will loop wildly. After taking a second look, it appears that this is a rare special case: if ksh has any other child (background) processes, it will wait() on them.
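The general pattern for a wait loop that does not spin in this situation is to treat ECHILD as "nothing left to wait for" and stop. A minimal sketch follows, assuming a plain blocking waitpid(); reap_all_children is a hypothetical helper and is not taken from jobs.c.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: block until every child of this process has
 * been reaped. Returns the number of children reaped, or -1 on an
 * unexpected waitpid() error.
 */
static int reap_all_children(void)
{
    int reaped = 0;

    for (;;)
    {
        int status;
        pid_t pid = waitpid(-1, &status, 0);

        if (pid > 0)
        {
            reaped++;            /* one child collected, look for more */
            continue;
        }
        if (errno == EINTR)
            continue;            /* interrupted by a signal, retry */
        if (errno == ECHILD)
            return reaped;       /* no children left: stop instead of looping */
        return -1;
    }
}
```

Called from a process with no children, this returns 0 right away with errno set to ECHILD instead of retrying.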
Created attachment 310235 [details] Patch for the "ulimit script" problem

Here's my attempt to fix the problem with the "ulimit" script from comment #22. I didn't test it thoroughly, but it seems to work at first sight.
After some more testing, it looks like even with unpatched ksh the script does exit with "limit exceeded [Permission denied]" after about 30 seconds. However, not always, and I'm not able to determine under what conditions this happens.
In response to comment #27:

The question of whether it will ultimately exit after 30 seconds isn't the issue. ksh was designed to wait() on the process. If it wait()s 30 seconds I don't have a problem with it. The fact is that on my machine, the process gets into an infinite loop, thus consuming 100% CPU for those 30 seconds. This is wrong.
(In reply to comment #28)
> In response to comment #27:
>
> The question about whether it ultimately will exit after 30 seconds isn't the
> issue. ksh was designed to wait() on the process. If it wait()s 30 seconds I
> don't have a problem with it. The fact is that on my machine, the process gets
> into an infinite loop, thus consuming 100% CPU for those 30 seconds. This is wrong.

I'm not able to reproduce the problem again, but if I remember well, ksh either times out or consumes 100% CPU, i.e., when things go as expected the CPU consumption is "normal".
Created attachment 311037 [details] Backport from 2006-06-24

The thing I should have done first -- look into the latest upstream code. Unfortunately the 2008-06-24 version won't compile on i386, so I keep forgetting about it because it can't be included in Fedora until the errors get resolved upstream. However, the backported parts seem to solve the issue, and the new package passed all the tests I had at hand, including the ones shipped with the sources. I'm still not very happy about fast-tracking the ksh bugs though...
Created attachment 311902 [details] savelist removal from jobs.c Proposing this patch since the problem still exists on RHEL4. It's pretty clear from all core dumps that this bug is somewhere in the maintenance of the savelist in jobs.c. I haven't been able to track down exactly where the list gets mangled, but the list isn't really necessary in my opinion. The only thing it gets us is a few less malloc()s and free()s. Alternatively I could have defined NJOB_SAVELIST as 0.
I plan to push 5.3 errata with the 20080202 version including patch from comment #31. Any objections?
This bugzilla was reviewed by QE as a non-FasTrack request. It has since been proposed for FasTrack. The qa_ack has been reset. QE needs to re-review this bugzilla for FasTrack.
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: The ksh package has been upgraded to version 2008-02-02 that fixes many issues including job control problems and adds multibyte character handling. The new version preserves compatibility for the existing scripts.
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-The ksh package has been upgraded to version 2008-02-02 that fixes many issues including job control problems and adds multibyte character handling. The new version preserves compatibility for the existing scripts.
+ksh has been re-based to version 2008-02-02. This update adds multi-byte character handling, addresses many job control problems and applies several bug fixes from upstream. Note that this update to ksh preserves compatibility for existing scripts.
Created attachment 323354 [details] nosavelist source rpm Here ya go.
I'd like to point out that no savelist is _probably_ not going to help. I was initially hopeful when we were running for months without issue, but it's always been a bit of a pipe dream that this patch would solve the issue. I can't find the bit of code that links head->head->head->etc... and even though I decided to get rid of jobsave_create to make things simpler, there is nothing that I can find that is wrong with that particular function. The theory was that there was a data structure floating around that had some used values for next, and that somehow it got reintroduced to the list.
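To illustrate the theory (and only the theory; nothing here shows that ksh actually does this), the sketch below uses the same push-to-head pattern as job_unpost() and shows that pushing a node that is already the head leaves it pointing at itself, which is the jp->next == jp state seen in the core. The struct is a stripped-down stand-in for struct jobsave, and push_head and push_head_checked are hypothetical names.

```c
#include <stddef.h>

/* Stand-in for ksh's struct jobsave; only the fields used here. */
struct jobsave
{
    struct jobsave *next;
    int pid;
};

/* Same pattern as in job_unpost(): jp->next = bck.list; bck.list = jp; */
static void push_head(struct jobsave **list, struct jobsave *jp)
{
    jp->next = *list;
    *list = jp;
}

/*
 * Guarded variant: refuses to push a node that is already the head,
 * since that would make the node its own successor.
 * Returns 0 on success, -1 if the push was refused.
 * Note this only catches the head case, not a node deeper in the list.
 */
static int push_head_checked(struct jobsave **list, struct jobsave *jp)
{
    if (*list == jp)
        return -1;
    push_head(list, jp);
    return 0;
}
```

Pushing the same node twice with push_head() is enough to produce a one-element cycle, so any path that hands an already-listed node back to this code would explain the hang.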
~~ Attention ~~ We Need Testing Feedback Soon ~~ We're nearing the end of the Red Hat Enterprise Linux 5.3 Testing Phase and this bug has not yet been VERIFIED. This bug should be fixed in the latest RHEL53 Beta Snapshot. It is critical that we receive your feedback ASAP. Otherwise, this bug is at risk of being dropped from the release. If you encounter any new issues, CLONE this bug and describe the new issues you are facing. We are no longer accepting NEW bugs into the release, bar critical regressions and blocker issues. If you have VERIFIED this fix, add CustomerVerified to the Bugzilla Keywords, along with a description of the test results.
this bug was removed from errata so release notes were moved to 456652
Deleted Release Notes Contents. Old Contents: ksh has been re-based to version 2008-02-02. This update adds multi-byte character handling, addresses many job control problems and applies several bug fixes from upstream. Note that this update to ksh preserves compatibility for existing scripts.
Does this mean ksh-20080202-2 is not coming out with 5.3?
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2010-0234.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days