
Bug 435159

Summary: scripts failing under ksh
Product: Red Hat Enterprise Linux 5
Component: ksh
Version: 5.3
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Eric Sammons <esammons>
Assignee: Michal Hlavinka <mhlavink>
Target Milestone: rc
Target Release: ---
Keywords: OtherQA, Rebase, Reopened
Flags: cward: needinfo? (esammons)
CC: cward, ddomingo, dmair, duck, esammons, james.leddy, jim, riek, rlerch, rvokal, shantikatta, syeghiay, tao, tsmetana
Doc Type: Rebase: Bug Fixes and Enhancements
Last Closed: 2010-03-30 04:22:01 EDT
Bug Depends On: 429153
Bug Blocks: 499522, 454962, 455316, 541103
Attachments:
- Patch for the "ulimit script" problem
- Backport from 2008-06-24
- savelist removal from jobs.c
- new savelist removal
- nosavelist source rpm

Comment 9 RHEL Product and Program Management 2008-03-05 04:45:40 EST
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request. 
Comment 15 James M. Leddy 2008-06-16 10:27:46 EDT
Once again, I had trouble analyzing the core.  The backtrace did yield some
info though, from jobs.c:1653:

while(jp)
{
        if(jp->pid==pid)
                break;
        if(pid==0 && !jp->next)
                break;
        jpold = jp;
        jp = jp->next;
}
if(jp)

Keep in mind that in this instance pid==0.  So it is possible that the jp data
structure somehow got corrupted and that the list was never properly terminated,
so jp->next == 0 is never reached.  The good news is that if the problem occurs
again, we will know based on the backtrace whether it is always this function
that is hung up, and we can get an idea of how (and if) the jp linked list is
looping.

If the jp linked list turned circular, then it would be possible to change the

   while(jp)

into

   struct jobsave *orig = jp;
   while(jp)
   {
   ...
       jp=jp->next;
       if (jp == orig) break;
   }

However, this would be a short-term workaround; since the data structure is
actually corrupted, it might lead to other unexpected behavior.  Also, it relies
on the presumption that the linked list is actually circular and not just
arbitrary data.  Additionally, this would mean that it would be impossible to
know when the problem occurred.
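The head-pointer check sketched above only catches a cycle that loops back to the first node. A more general guard is Floyd's tortoise-and-hare detection, which catches any cycle, including a node whose next points to itself. A minimal sketch against a stand-in struct (the field names mirror the snippets in this report, not the full ksh jobsave definition):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for ksh's struct jobsave: just the fields these snippets use. */
struct jobsave {
    struct jobsave *next;
    int pid;
};

/* Floyd's cycle check: returns 1 if the list starting at head is circular. */
static int savelist_is_circular(struct jobsave *head)
{
    struct jobsave *slow = head, *fast = head;
    while (fast && fast->next) {
        slow = slow->next;          /* advances one node per step  */
        fast = fast->next->next;    /* advances two nodes per step */
        if (slow == fast)
            return 1;               /* the pointers met inside a cycle */
    }
    return 0;                       /* fast reached NULL: list terminates */
}
```

Running such a check over bck.list before walking it would let the shell bail out (and log something useful) instead of spinning, without assuming the cycle loops back to the head.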

In the meantime, I will try to analyze the core by putting it on different
machines, and getting some engineering help to make sure I'm not doing anything
completely braindead.
Comment 16 James M. Leddy 2008-06-16 10:28:28 EDT
From fhirtz in IT:

I have the core and required lib versions installed on
ibm-x3650-1.gsslab.rdu.redhat.com (root:redhat):

<snip>
[root@ibm-x3650-1 ~]# gdb /bin/ksh core.3410
GNU gdb Red Hat Linux (6.5-37.el5_2.1rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db
library "/lib64/libthread_db.so.1".
<snip>
Core was generated by `/bin/ksh'.
#0  0x0000000000414359 in job_chksave (pid=0)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1653
1653 if(jp->pid==pid)
(gdb) bt
#0  0x0000000000414359 in job_chksave (pid=0)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1653
#1  0x000000000041451c in job_unpost (pwtop=<value optimized out>,
   notify=<value optimized out>)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1523
#2  0x0000000000415a97 in job_wait (pid=0)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1392
#3  0x0000000000434eef in sh_exec (t=0xacadc00, flags=4)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1087
#4  0x00000000004368ea in sh_exec (t=0xacadd20, flags=<value optimized out>)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1218
#5  0x0000000000435f2c in sh_exec (t=0xad1f5a0, flags=4)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1313
#6  0x0000000000435dd9 in sh_exec (t=0xad1f5a0, flags=181532064)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1596
#7  0x0000000000435db3 in sh_exec (t=0xad19060, flags=4)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1332
#8  0x00000000004350ec in sh_exec (t=0xacad410, flags=36)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1531
#9  0x00000000004076b3 in exfile ()
#10 0x0000000000406bbc in sh_main ()
#11 0x000000363281d8a4 in __libc_start_main (main=0x405f80 <main>, argc=2,
   ubp_av=0x7fff6eca28a8, init=<value optimized out>,
   fini=<value optimized out>, rtld_fini=<value optimized out>,
   stack_end=0x7fff6eca2898) at libc-start.c:231
#12 0x0000000000405ec9 in _start ()
(gdb) l
1648 {
1649 register struct jobsave *jp = bck.list, *jpold=0;
1650 register int r= -1;
1651 while(jp)
1652 {
1653 if(jp->pid==pid)
1654 break;
1655 if(pid==0 && !jp->next)
1656 break;
1657 jpold = jp;
(gdb) p jp
$1 = (struct jobsave *) 0xad1f5a0
(gdb) p jp->next
$2 = (struct jobsave *) 0xad1f5a0
(gdb) p jpold
$3 = (struct jobsave *) 0xad1f5a0
(gdb) p jp->pid
$4 = 10156
(gdb) p pid
$5 = 0
</snip>

It indeed appears that we're stuck in a loop, since "jp->next" points to itself
(which was, of course, also used to set "jpold") here, as you've noted.  Agreed
that we could hack in a check for that condition and break (and possibly output
some useful data right then).
Comment 17 James M. Leddy 2008-06-16 14:45:58 EDT
For the case where ksh is spinning, I found a possible cause:

(gdb) up
#1  0x000000000041451c in job_unpost (pwtop=<value optimized out>, 
    notify=<value optimized out>)
    at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1523
1523                                    job_chksave(0);
(gdb) ls
Undefined command: "ls".  Try "help".
(gdb) l
1518                    if(pw->p_flag&P_EXITSAVE)
1519                    {
1520                            struct jobsave *jp;
1521                            /* save status for future wait */
1522                            if(bck.count++ > sh.lim.child_max)
1523                                    job_chksave(0);
1524                            if(jp = jobsave_create(pw->p_pid))
1525                            {
1526                                    jp->next = bck.list;
1527                                    bck.list = jp;
(gdb) p bck.count
$1 = 1001
(gdb) p sh.lim.child_max
$2 = 999
(gdb) 
----

Now we may be able to reproduce more quickly if we set sh.lim.child_max to a
smaller number.  This could be unrelated, and it's possible that it's only
forcing the check on the circular list, thus causing the spin, but it's also
possible that going over child_max causes the corruption in the linked list
to begin with.

Also interesting here is that this appears not to have started spinning in the
case where (bck.count == 1000) > (sh.lim.child_max == 999).
Comment 18 Tomas Smetana 2008-06-17 02:26:42 EDT
Thank you for the analysis.  I'll try to run some tests on my own.  Hopefully,
with the information you provided, I will be able to reproduce the bug.
Comment 19 James M. Leddy 2008-06-17 15:28:10 EDT
OpenSolaris appears to be hitting a similar thing:

http://bugs.opensolaris.org/view_bug.do?bug_id=6510946

also explanation to comment #17:

from jobs.c 1521:
			/* save status for future wait */
			if(bck.count++ > sh.lim.child_max)
				job_chksave(0);
			if(jp = jobsave_create(pw->p_pid))
			{
				jp->next = bck.list;
				bck.list = jp;
				jp->exitval = pw->p_exit;
				if(pw->p_flag&P_SIGNALLED)
					jp->exitval |= SH_EXITSIG;
			}
			pw->p_flag &= ~P_EXITSAVE;
----

I forgot about post/pre-increment semantics.  Ben, the NYC intern, pointed out
that of course bck.count will be 2 greater the first time job_chksave(0)
executes: when bck.count == 999, the branch is not taken; when bck.count ==
1000, it is taken, but only after the increment, hence 1001.

The question is whether this is correct or not.  My thinking is that they should
have used pre-increment here, since they eventually do the jobsave_create
anyway.  That way the code reads more like "we're going to create 1000 procs,
one greater than allowed, so clean one up before creating the new one."  It
could be that some other part of ksh is not allowing the 1000th job in due to
ulimit, and that is causing the weird behavior.
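The off-by-one is easy to check in isolation. A toy model of the save loop (the names here are illustrative, not ksh's) returns the value bck.count holds the first time the cleanup branch fires, under each increment style:

```c
#include <assert.h>

/* Model of `if(bck.count++ > sh.lim.child_max) job_chksave(0);` executed
 * once per saved job: return the value count holds the first time the
 * cleanup branch is taken, or -1 if it never fires within n saves. */
static int count_at_first_cleanup_post(int child_max, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (count++ > child_max)
            return count;            /* already incremented past the test */
    return -1;
}

/* Same loop with the pre-increment variant suggested above. */
static int count_at_first_cleanup_pre(int child_max, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (++count > child_max)
            return count;
    return -1;
}
```

With child_max == 999 the post-increment version first cleans up with count == 1001, matching the value seen in the gdb session, while pre-increment would fire at 1000.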

I'll ping the AT&T mailing list on this to make sure the list is actually
accounted for in this fashion.  In the meantime I will build a new package to
see if this fixes the client's problem.
Comment 20 James M. Leddy 2008-06-17 19:21:00 EDT
from David Korn:

I think that ++bck.count would make more sense but since this is a linked
list it won't matter.

The standard says that the background processes from the last sh.lim.child_max
processes must be saved by the shell.  The code was saving sh.lim.child_max+1
which is ok but not required.

Thanks for the report.

David Korn
dgk@research.att.com
Comment 22 James M. Leddy 2008-06-23 17:21:28 EDT
I've reproduced a similar problem.

$ ulimit -u 10
$ for (( i = 0; i < 10; i++ ))
$ do
$ cat /dev/zero > /dev/null &
$ done

This script will hang ksh with the latest version (rawhide) we built and gave
the customer for RHEL 5.  The cause is line 2252 in sh/xec.c:

while(_sh_fork(parent=fork(),flags,jobid) < 0)

fork() returns -1 every time, presumably because we hit the limit for processes.
Perhaps Solaris does the same kind of accounting that Linux does, where maybe
BSD or other UNIX doesn't.  In any case it appears that ksh thinks it can
allocate one more process than the system will allow.

From the fork man page:

       EAGAIN It was not possible to create a new process because the caller's
              RLIMIT_NPROC  resource  limit  was  encountered.  To exceed this
              limit, the process must have either  the  CAP_SYS_ADMIN  or  the
              CAP_SYS_RESOURCE capability.
----

Running the same script in bash causes an exit:

$ ulimit -u 10
$ for (( i=0 ; i < 10; i++ ))
> do
> cat /dev/zero > /dev/null &
> done
-bash: fork: Resource temporarily unavailable

I will find where ksh accounting has gone wrong, and have a patch by the end of
the day.

To be fair, this isn't exactly the same problem that we saw earlier in the core,
but it causes ksh to "spin" at 100% CPU, so it is possible they are seeing this
as well.  I would also like to note that the customer is using a relatively low
value for max user processes, which would trigger this more easily.

Comment 23 James M. Leddy 2008-06-23 20:06:19 EDT
Looks like ksh is expecting waitpid in jobs.c to wait, as it should.
Unfortunately, at this point there are no child procs (determined using 'ps
auxf'), causing waitpid to set errno to ECHILD.  At this point it will loop
wildly.  After taking a second look it appears that this is a special rare
case, and if ksh has any other child (background) processes, it will wait() on
them.
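The ECHILD behavior is easy to demonstrate: with no children left, waitpid() fails immediately rather than blocking, so a wait loop that ignores the error spins. A minimal sketch of the missing check (an illustration of the failure mode, not the ksh code path itself):

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Wait for any child, retrying only on signal interruption.  When there
 * are no children at all, waitpid() returns -1 with errno == ECHILD
 * immediately; treating that as terminal avoids the busy loop. */
static pid_t wait_any_checked(int *status)
{
    for (;;) {
        pid_t pid = waitpid(-1, status, 0);
        if (pid >= 0)
            return pid;              /* reaped a child */
        if (errno == EINTR)
            continue;                /* interrupted by a signal: retry */
        return -1;                   /* ECHILD (or other error): stop */
    }
}
```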
Comment 26 Tomas Smetana 2008-06-25 03:42:41 EDT
Created attachment 310235 [details]
Patch for the "ulimit script" problem

Here's my attempt to fix the problem with the "ulimit" script from comment
#22.  I didn't test it thoroughly, but it seems to work at first sight.
Comment 27 Tomas Smetana 2008-07-01 03:18:47 EDT
After some more testing it looks like even with unpatched ksh the script does
exit with "limit exceeded [Permission denied]" after about 30 seconds.  However,
not always, and I'm not able to determine under what conditions this happens.
Comment 28 James M. Leddy 2008-07-02 16:02:53 EDT
In response to comment #27:

The question about whether it ultimately will exit after 30 seconds isn't the
issue.  ksh was designed to wait() on the process.  If it wait()s 30 seconds I
don't have a problem with it.  The fact is that on my machine, the process gets
into an infinite loop, thus consuming 100% CPU for those 30 seconds.  This is wrong.
Comment 29 Tomas Smetana 2008-07-03 02:24:19 EDT
(In reply to comment #28)
> In response to comment #27:
> 
> The question about whether it ultimately will exit after 30 seconds isn't the
> issue.  ksh was designed to wait() on the process.  If it wait()s 30 seconds I
> don't have a problem with it.  The fact is that on my machine, the process gets
> into an infinite loop, thus consuming 100% CPU for those 30 seconds.  This is
> wrong.

I'm not able to reproduce the problem again, but if I remember well, ksh either
times out or consumes 100% CPU, i.e., when things go as expected the CPU
consumption is "normal".
Comment 31 Tomas Smetana 2008-07-04 08:40:32 EDT
Created attachment 311037 [details]
Backport from 2008-06-24

The thing I should have done first -- look into the latest upstream code.
Unfortunately the 2008-06-24 version won't compile on i386, so I keep forgetting
about it because it can't be included in Fedora until the errors get resolved
by upstream.  However, the backported parts seem to solve the issue and the
new package passed all the tests I had at hand, including the ones shipped with
the sources.  I'm still not entirely happy about fast-tracking the ksh bugs,
though...
Comment 32 James M. Leddy 2008-07-15 21:07:11 EDT
Created attachment 311902 [details]
savelist removal from jobs.c

Proposing this patch since the problem still exists on RHEL 4.  It's pretty
clear from all the core dumps that this bug is somewhere in the maintenance of
the savelist in jobs.c.  I haven't been able to track down exactly where the
list gets mangled, but the list isn't really necessary in my opinion.  The only
thing it gets us is a few fewer malloc()s and free()s.

Alternatively I could have defined NJOB_SAVELIST as 0.
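For context on what the savelist buys: it is a free-list cache, so removing it (or defining NJOB_SAVELIST as 0) just means every release becomes a real free(). A minimal illustrative stand-in for the pattern (not the ksh implementation; the names here are invented):

```c
#include <stdlib.h>

/* Minimal free-list cache in the spirit of ksh's NJOB_SAVELIST pool:
 * freed nodes are pushed onto a singly linked freelist and reused by
 * the next allocation, trading bookkeeping for fewer malloc() calls. */
struct node { struct node *next; int pid; };

static struct node *freelist;

static struct node *node_alloc(int pid)
{
    struct node *n = freelist;
    if (n)
        freelist = n->next;          /* reuse a cached node */
    else
        n = malloc(sizeof *n);       /* cache empty: fall back to malloc */
    if (n) { n->next = NULL; n->pid = pid; }
    return n;
}

static void node_free(struct node *n)
{
    n->next = freelist;              /* push back onto the cache */
    freelist = n;
}
```

The trade-off the patch makes is a few more malloc()/free() calls in exchange for removing one place where a stale next pointer could be carried back into the live list.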
Comment 35 Tomas Smetana 2008-08-06 08:51:39 EDT
I plan to push 5.3 errata with the 20080202 version including patch from comment #31.  Any objections?
Comment 51 RHEL Product and Program Management 2008-09-19 17:52:52 EDT
This bugzilla was reviewed by QE as a non-FasTrack request.
It has since been proposed for FasTrack. The qa_ack has 
been reset. QE needs to re-review this bugzilla for FasTrack.
Comment 69 Michal Hlavinka 2008-11-03 07:46:46 EST
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
The ksh package has been upgraded to version 2008-02-02 that fixes many issues including job control problems and adds multibyte character handling. The new version preserves compatibility for the existing scripts.
Comment 71 Don Domingo 2008-11-12 00:22:51 EST
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-The ksh package has been upgraded to version 2008-02-02 that fixes many issues including job control problems and adds multibyte character handling. The new version preserves compatibility for the existing scripts.
+ksh has been re-based to version 2008-02-02. This update adds multi-byte character handling, addresses many job control problems and applies several bug fixes from upstream. Note that this update to ksh preserves compatibility for existing scripts.
Comment 76 James M. Leddy 2008-11-12 12:07:04 EST
Created attachment 323354 [details]
nosavelist source rpm

Here ya go.
Comment 77 James M. Leddy 2008-11-12 12:14:21 EST
I'd like to point out that no savelist is _probably_ not going to help.  I was initially hopeful when we were running for months without issue, but it's always been a bit of a pipe dream that this patch would solve the issue.

I can't find the bit of code that links head->head->head->etc... and even though I decided to get rid of jobsave_create to make things simpler, there is nothing that I can find that is wrong with that particular function.  The theory was that there was a data structure floating around that had some used values for next, and that somehow that got reintroduced to the list.
Comment 81 Chris Ward 2008-11-28 02:01:35 EST
~~ Attention ~~ We Need Testing Feedback Soon ~~

We're nearing the end of the Red Hat Enterprise Linux 5.3 Testing Phase and this bug has not yet been VERIFIED. This bug should be fixed in the latest RHEL53 Beta Snapshot. It is critical that we receive your feedback ASAP. Otherwise, this bug is at risk of being dropped from the release. 

If you encounter any new issues, CLONE this bug and describe the new issues you are facing. We are no longer accepting NEW bugs into the release, bar critical regressions and blocker issues.

If you have VERIFIED this fix, add CustomerVerified to the Bugzilla Keywords, along with a description of the test results.
Comment 87 Michal Hlavinka 2008-12-03 07:07:17 EST
This bug was removed from the errata, so the release notes were moved to bug 456652.
Comment 88 Michal Hlavinka 2008-12-03 07:07:17 EST
Deleted Release Notes Contents.

Old Contents:
ksh has been re-based to version 2008-02-02. This update adds multi-byte character handling, addresses many job control problems and applies several bug fixes from upstream. Note that this update to ksh preserves compatibility for existing scripts.
Comment 89 Shanti Katta 2008-12-03 13:23:52 EST
Does this mean ksh-20080202-2 is not coming out with 5.3?
Comment 149 Chris Ward 2010-02-11 05:31:14 EST
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.
Comment 154 errata-xmlrpc 2010-03-30 04:22:01 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0234.html