156708 – Third spawned 32-bit child fails in thread processing.

Summary: Third spawned 32-bit child fails in thread processing.

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	ia64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Anil S Keshavamurthy
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-05-03 15:08 UTC by Jeff Lee (EMC)
Modified:	2007-11-30 22:07 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-03-24 21:10:38 UTC
Target Upstream Version:
Embargoed:

Attachments
Simple fork program that will cause the problem (1.90 KB, text/plain) 2005-05-03 15:14 UTC, Jeff Lee (EMC)
A Red Hat updated testcase... (2.16 KB, text/plain) 2006-01-06 17:50 UTC, Johnray Fuller
Detailed output of the program I ran (2.86 KB, text/plain) 2006-03-07 19:35 UTC, Anil S Keshavamurthy

Description Jeff Lee (EMC) 2005-05-03 15:08:51 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; EMC IS 55; .NET CLR 1.0.3705; .NET CLR 1.1.4322)

Description of problem:
We have a 32-bit application that we run on a 64-bit Red Hat kernel. The application forks to start a child process, which in turn forks to start its own child. All of them are 32-bit programs. The third child attempts a pthread_create and just goes away. No error is reported.

This can be reproduced by any simple fork application that is compiled on 32-bit and moved to a 64-bit system for testing.

Version-Release number of selected component (if applicable):
kernel-2.4.21-27.EL

How reproducible:
Always

Steps to Reproduce:
1. Compile the attached program on a 32-bit machine
2. Run it on a 64-bit machine using:
     ./doafork ./doafork /bin/ps -a
3. The fork of the third level will fail
4. Run it on the same machine as:
     ./doafork /bin/ps -a
5. The fork and processing will work

Actual Results:  The execvp call fails at the third level. If the program is changed so that the first two parameters are "/bin/sh" and "-c", with the command line passed as the remaining parameter, the third fork works, but any fork or thread creation that the third-level process then attempts fails.

Expected Results:  Threads and forked processes should run regardless of how many levels deep they are.
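
Since the report says no error is reported when the third-level child calls pthread_create, a check along the following lines could help narrow down where it goes wrong. This is only a sketch: try_create_thread and thread_body are hypothetical names, not taken from the attached doafork.cpp. If the process is killed before pthread_create returns, this check will not fire, and the parent's wait status is the place to look instead.

```cpp
#include <pthread.h>
#include <cstdio>
#include <cstring>

// Trivial thread body; it only needs to start and return.
static void* thread_body(void*) { return nullptr; }

// Hypothetical helper: tries to start and join one thread.
// Returns 0 on success, otherwise the error number that
// pthread_create or pthread_join returned.
int try_create_thread() {
    pthread_t tid;
    int err = pthread_create(&tid, nullptr, thread_body, nullptr);
    if (err != 0) {
        // pthread functions return the error code; they do not set errno.
        std::fprintf(stderr, "pthread_create failed: %s\n", std::strerror(err));
        return err;
    }
    err = pthread_join(tid, nullptr);
    if (err != 0) {
        std::fprintf(stderr, "pthread_join failed: %s\n", std::strerror(err));
    }
    return err;
}
```

Calling try_create_thread() at the start of the last program in the chain, and logging its return value, would show whether thread creation returns an error or the process dies inside the call.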

Additional info:

Comment 1 Jeff Lee (EMC) 2005-05-03 15:13:22 UTC

Note : This may be a duplicate of 149965, however we have applied that change 
and are still having the problem.

Comment 2 Jeff Lee (EMC) 2005-05-03 15:14:47 UTC

Created attachment 113972
Simple fork program that will cause the problem

Compile with "g++ doafork.cpp -o doafork" on a 32-bit machine.

Comment 3 Jeff Lee (EMC) 2005-05-03 16:41:03 UTC

One more item to note. Setting LD_ASSUME_KERNEL to 2.4.0 resolves the issue and 
allows all spawned levels to work. Setting it to 2.4.1 causes random 
segmentation violations.

Comment 4 Jeff Lee (EMC) 2005-08-11 15:54:06 UTC

Anyone looked at this ???

Comment 6 Ernie Petrides 2005-08-11 21:18:44 UTC

Reassigning to PeterM.  Jason no longer has time to be ia64 arch maintainer.

Also, reverting to ASSIGNED state.

Jeff Lee, I apologize that no one has yet had time to investigate this.

Comment 7 Jeff Lee (EMC) 2005-09-28 11:36:00 UTC

Any word on the status of this issue?

Comment 8 Jeff Lee (EMC) 2005-11-17 15:05:34 UTC

We really need to get this addressed as soon as possible. Please contact me if 
further information is required.

I cannot change the priority to high, even though I am the submitter.

Can someone respond to me ASAP?

Cheers,
Jeff

Comment 9 Jeff Lee (EMC) 2005-11-18 13:13:11 UTC

Hello again. We urgently need to get this resolved as it is impacting our 
product running on Linux 64. Is there any way to escalate this or assign it to 
someone who can look at it?

Comment 10 Geoff Gustafson 2005-12-13 21:55:24 UTC

I compiled the test app as posted. I ran it and didn't see any difference in the
output between an i386 machine with RHEL3 U2, and an ia64 machine with the
latest RHEL3 pre-U7 (with or w/o ia32el enabled).

Then I tried commenting out the #define USE_SHELL line. I compiled and ran the
program on the i386 RHEL3 U2 box. I see the EXECVP error you describe even on
i386, which indicates to me that this is how it is supposed to work. The exact
same thing happens on ia64 (with or w/o ia32el enabled). So there seems to me no
difference between native i386, hw-emulated i386 on ia64 and sw-emulated i386 on
ia64. The bug would seem to be in your code. (If I get a chance, I'll try to
figure out what that bug is.)

If you can identify a test case that works on native i386 but fails on ia64,
please post it.

- Geoff

Comment 11 Jeff Lee (EMC) 2005-12-14 17:58:22 UTC

I outlined in the text above the cases that make this fail (see the initial 
entry). Just running the program doesn't cause a problem. The problem is a 
fork-level problem. The test program only works for the first runs.

It fails on the third process (fork) EVERY TIME under ia64, but works fine on 
i386. Additionally, the program MUST be compiled on i386 and run on ia64.

We run our system where process A starts process B, which starts process C. 
Process C uses threads and may or may not start other processes. We have never 
had a problem on i386, nor have our many customers. However, on 64-bit this 
fails regularly.

The test program is designed merely to emulate process A, B and C. But as it 
takes any process from the command line and forks it, you have to string them 
as I said above.

Comment 12 Geoff Gustafson 2005-12-14 21:20:40 UTC

The results I reported above were from running your test as you specified. If
you compile and run it on a 32-bit machine, you will see the same results.

I believe I have tracked down the bug in your test program.

You are constructing the cmdline variable which contains the "rest" of the
command line arguments, separated by spaces. Then you're passing this to execvp
as one big argument. But that is not the interface to execvp. Execvp expects to
be called with each argument as a separate array element:

execvp( "/path/prog", { "/path/prog", "arg1", "arg2", "arg3", NULL } );

-not-

execvp( "/path/prog", { "/path/prog arg1 arg2 arg3", NULL } );

When you comment out USE_SHELL, and run ./doafork ./doafork ps -a, here's what
your execvp calls look like:

execvp( "./doafork", { "./doafork", "ps -a", NULL } );
execvp( "ps -a", { "ps -a", NULL } );

When execvp tries to run the program "ps -a" it fails, because there is no such
file as ps<space><dash>a. So it sets errno = 2 (No such file or directory) and
returns -1 for failure.

If you replace the execvp() call in your test program with this line:

execvp(arg[1], &arg[1]);

you will see the behavior you expect.

With that change to the program I have tested on i386 (RHEL3), x86_64 (RHEL3 and
RHEL4), and ia64 RHEL3 (w/ and w/o ia32el), just to be absolutely sure. I ran
this successfully on each platform:
./doafork ./doafork ./doafork ./doafork ./doafork ps -a

PS if you want an interface that can execute a whole command line as one string,
try int system(const char *string) in <stdlib.h>

- Geoff
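
The difference between the two execvp call shapes described in comment 12 can be sketched as follows. This is an illustration under assumptions, not the code from the attachment: run_argv is a hypothetical helper, and the exit code 127 for a failed exec is a convention chosen here.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: forks, runs args[0] with args as its argument vector,
// and returns the child's exit status (127 if execvp itself failed,
// -1 on fork/wait errors or abnormal termination).
int run_argv(const std::vector<std::string>& args) {
    assert(!args.empty());
    // execvp wants a NULL-terminated array with one element per argument.
    std::vector<char*> argv;
    for (const std::string& a : args) {
        argv.push_back(const_cast<char*>(a.c_str()));
    }
    argv.push_back(nullptr);

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        execvp(argv[0], argv.data());
        _exit(127);  // only reached when execvp fails, e.g. with ENOENT
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this helper, run_argv({"ps", "-a"}) passes two separate arguments, while run_argv({"ps -a"}) asks execvp to find a file literally named "ps -a" and so ends in the 127 path.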

Comment 13 Ernie Petrides 2005-12-14 23:37:32 UTC

Closing as NOTABUG based on last comment.

Comment 15 Jeff Lee (EMC) 2005-12-16 15:32:49 UTC

Actually, the full code we use is :

     char* argv[MAXARGS + 4];

     int   i = 0;

     argv[i++] = "/bin/sh";
     argv[i++] = "-c";
     argv[i++] = ( char *)strTempCommand.c_str() ;
     argv[i] = NULL; //signals end of arguments

     execvp(argv[0], argv);

My test program attempts to simulate this. strTempCommand is a string 
containing the command to be executed (e.g. "ps -ef"). This works on Unix 
(Solaris, HP, AIX), as well as SUSE Linux. I have also tried the test program 
on Debian Linux and Windows. All work.

However, when the above is compiled on a 32-bit Red Hat machine and run on a 
64-bit one, it fails. It works fine on 32-bit Red Hat machines.

If you're saying this is incorrect and cannot be processed in this fashion, 
then that indicates 64-bit RH Linux is a special case for which special 
considerations must be made. Making such a statement nicely leaves RH in 
a position where nothing is a bug, only improper user coding.

Please take another look.
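
For reference, the "/bin/sh -c" pattern from this comment can be wrapped so that the parent also sees how the child ended, which matters when a process "just goes away". A minimal sketch, assuming a hypothetical helper named run_shell_command and the shell convention of reporting a signal death as 128 plus the signal number:

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string>

// Hypothetical helper: runs `command` through /bin/sh -c in a forked child.
// Returns the exit status, 128 + signal number if the child was killed by a
// signal, 127 if the exec failed, or -1 on fork/wait errors.
int run_shell_command(const std::string& command) {
    // Three real arguments plus the terminating NULL, as in the comment above.
    char* const argv[] = {
        const_cast<char*>("/bin/sh"),
        const_cast<char*>("-c"),
        const_cast<char*>(command.c_str()),
        nullptr
    };

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);  // only reached when execvp fails
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    if (WIFEXITED(status)) return WEXITSTATUS(status);
    if (WIFSIGNALED(status)) return 128 + WTERMSIG(status);
    return -1;
}
```

Here the whole command line is a single argv element on purpose, because the shell is the program that splits it into words.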

Comment 18 Andrius Benokraitis 2006-01-05 20:26:56 UTC

Could the customer please provide a revised testcase and/or a strace/ltrace on
his systems displaying the failure? We'll be happy to look at the issue a second
time.

Comment 20 Johnray Fuller 2006-01-06 17:50:54 UTC

Created attachment 122883
A Red Hat updated testcase...

This is from James Antill:

I'm uploading the new testcase, complete with makefiles etc.

This uses the SHELL only ifdef path, which is equivalent to the code you posted
in December. This also includes a threaded application to test with.

I cannot get this to fail with any combination of 32 and 64 bit binaries that
I've tried. Please upload a working "failure" test case or at least a strace &
ltrace on the problem.

Comment 21 Johnray Fuller 2006-01-06 18:00:06 UTC

FYI, to compile the test run "make all32" on a 32bit machine and "make all64" on
a 64bit one ... and then run the binaries in whichever way works/doesn't.

Comment 22 Jeff Lee (EMC) 2006-01-10 18:35:50 UTC

Thanks for the test case. Applying it to our systems for testing. Will let you 
know the results.

Comment 23 Jeff Lee (EMC) 2006-01-27 18:54:42 UTC

We have been doing extensive testing with both test cases, though the Red Hat 
version isn't much different. In our initial testing, we used the following 
command:

>doafork doafork doafork ps -ef

This worked for both 32-bit and 64-bit, which confused us, as I thought this 
had failed before. I had intended to develop a simple test that Red Hat could 
use to investigate the problem. However, I can't recall ever testing with the 
"ps" command, and perhaps this has led to the confusion about the problem. I 
tried a number of ways to run it and even had another developer, who had not 
been on the original issue, try. He too got success.

As I said, this was very confusing. But it had been a while since we ran the 
tests, and some patches had been applied to the system, including a new 
compiler on the 32-bit system where we did the builds. So I turned to the 
original problem program. I rebuilt it with the new compiler and ran it on the 
64-bit system. It failed. What's the difference? All it does is start threads 
and processes, like doafork does.

The program execs a copy of itself (A starts B). The copy starts another 
program (B starts C). This secondary program fails to start processes or 
threads. Since the exec code used in doafork came from the problem program, 
this would be the same as "doafork doafork xyz". So why didn't "ps" fail?

Because it's a 64-bit program.

Our further testing has shown that if the third (and final) layer program is a 
32-bit program, that's when the failure occurs. So running all 32-bit as 
follows:

>doafork doafork some32bitprogram

will cause the problem. The problem is seen when "some32bitprogram" tries 
to start a thread or process. That's what we thought. As testing continued, we 
still had one item we could not explain. This works:

>doafork doafork doafork doafork doafork ps -ef

Testing again, we found that doafork simply does execs, in an attempt to 
recreate the original problem. It is not a multithreaded application. What 
fails is the last program in the chain, at any level from the third up, when 
it is a 32-bit multithreaded program.

I cannot supply the original programs directly, but am working with our 
management to see if we can pass them to you for testing. In the meantime I am 
also working on a new test case for you.

I can reproduce this problem at will, as can our customers, using programs 
compiled on 32-bit and run on 64-bit.

Also, for your reference. While I refer to "doafork", I have tested these 
cases with the original test case code I supplied and the RedHat test case 
code that John supplied. Both recreate the problem every time. Simply replace 
my references to "doafork" with John's version.

Cheers,
Jeff
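
Because comment 23 turns on whether the last program in the chain is a 32-bit or a 64-bit binary, it can help to check that directly rather than assume it. The following sketch reads the ELF identification bytes; elf_class_of is a hypothetical helper name, and it relies on the constants from the Linux <elf.h> header.

```cpp
#include <elf.h>
#include <cstdio>
#include <cstring>

// Hypothetical helper: returns 32 or 64 for an ELF file,
// or -1 if the file cannot be read or is not an ELF file.
int elf_class_of(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return -1;
    unsigned char ident[EI_NIDENT];
    std::size_t n = std::fread(ident, 1, sizeof ident, f);
    std::fclose(f);
    // Every ELF file starts with the magic bytes 0x7f 'E' 'L' 'F'.
    if (n != sizeof ident || std::memcmp(ident, ELFMAG, SELFMAG) != 0) return -1;
    // The byte at index EI_CLASS says whether the file is 32-bit or 64-bit.
    switch (ident[EI_CLASS]) {
        case ELFCLASS32: return 32;
        case ELFCLASS64: return 64;
        default: return -1;
    }
}
```

Running this against each program named on a doafork command line would show which links of the chain are 32-bit and which are native 64-bit.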

Comment 29 Eric Young (EMC) 2006-03-03 12:34:52 UTC

Is there any update on this issue? Has anyone been able to recreate the issue 
following the instructions and information Jeff posted on 2006-01-27?

We really need to get some type of status so we can inform our customers.

Comment 30 Andrius Benokraitis 2006-03-06 15:50:29 UTC

Eric, this bug has been recently re-assigned to Anil within Red Hat. He should
have an update to this issue shortly.

Comment 31 Anil S Keshavamurthy 2006-03-07 19:35:26 UTC

Created attachment 125768
Detailed output of the program I ran

I compiled the program attached here (comment #20)
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=122883
on RHEL3/i686 p3 box (Taroon Update 7 Beta) by running make all32.

Moved the compiled output onto IA64 Box running RHEL3(Taroon Update 7 Beta)
and ran the program as mentioned in this bugzilla. I am unable to recreate the
original problem. Every time the program ran it was *successful*.

FYI, I have attached a detail output of the system I compiled the 32 bit
program and the output of the program that ran on IA64 box.

Please let me know if I need to compile the 32 bit program on a
specific version/release of the kernel in order to recreate the bug.

Comment 33 Jeff Lee (EMC) 2006-03-09 10:50:30 UTC

Anil,

I see you compiled the "Doafork" program as prescribed. As discussed in 
comment #23 you will find just running it as is does not reproduce the problem.

It was a curious item that no one looking at the issue was able to reproduce 
it using the doafork program. I investigated it again here and found that a 
basic program such as doafork doesn't cause the problem. It does allow the 
simulation of the startup that we do which in turn causes the problem.

Again all the details are in comment #23, but basically you need to run the 
doafork in a couple of calls with the final program being a multi-threaded 
application.

Please feel free to contact me directly at lee_jeffrey for contact 
information to discuss this directly in more detail so we can get this 
resolved quickly.

Cheers,
Jeff

Comment 34 Anil S Keshavamurthy 2006-03-10 17:12:58 UTC

Jeff,
 I had several levels of doafork programs with the final program being a
multi-threaded application and I could not reproduce the bug. I tried all kinds
of permutations and combinations of 32-bit and 64-bit programs and all of them
are working for me.

Some examples that I tried running are
1)./doafork-new32 ./doafork-new64 ./doafork-new64 ./doafork-new32 ./tdns32 3
example_tdns_input
2)./doafork-new32 ./doafork-new64 ./doafork-new64 ./doafork-new32
./doafork-new64 ./tdns32 3 example_tdns_input
3)./doafork-new32 ./doafork-new64 ./doafork-new64 ./doafork-new32
./doafork-new64 ./tdns64 3 example_tdns_input

And of course several such other combinations all are working fine for me.

If you can still reproduce, I would like to know on what 32bit platform/OS
release version that you are compiling the program.

Thanks,
Anil

Comment 35 Jeff Lee (EMC) 2006-03-11 19:31:51 UTC

As this is Saturday, I will reproduce and send detailed OS information on 
Monday. This is not only reproducible by us, but by several of our customers. 
Of course they use our compiled code. But they run on various 64-bit levels.

I will also look into getting you an agent that will cause the problem or a 
smaller program I can get to produce the problem.

If there is anything else you need me to get let me know.

Cheers,
Jeff

Comment 37 Ruitao Duan 2006-03-24 20:45:05 UTC

The thread spawning problem was caused by our machine missing the ia32el 
package. After installing this package, the issue is resolved on our machine.

This ticket can be closed.

Thanks
Ruitao

Comment 38 Andrius Benokraitis 2006-03-24 21:10:38 UTC

Closing per EMC... glad the issue is resolved.
