From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; EMC IS 55; .NET CLR 1.0.3705; .NET CLR 1.1.4322)

Description of problem:
We have a 32-bit application that we run on a 64-bit Red Hat kernel. The application forks to start a child process, which in turn forks to start a child process of its own. All of them are 32-bit programs. The third child attempts a pthread_create and just goes away. No error is reported. This can be reproduced with any simple fork application that is compiled on a 32-bit system and moved to a 64-bit system for testing.

Version-Release number of selected component (if applicable):
kernel-2.4.21-27.EL

How reproducible:
Always

Steps to Reproduce:
1. Compile the attached program on a 32-bit machine.
2. Run it on a 64-bit machine using: ./doafork ./doafork /bin/ps -a
3. The fork of the third level will fail.
4. Run it on the same machine as: ./doafork /bin/ps -a
5. The fork and processing will work.

Actual Results:
The EXECVP fails at the third level. If the program is changed to make the first two parameters "/bin/sh" and "-c", passing the command line as the remaining parameter, the third fork will work, but if that process attempts to fork or create a thread, the attempt will fail.

Expected Results:
It shouldn't matter how many levels there are; threads and forked processes should run.

Additional info:
Note: This may be a duplicate of 149965; however, we have applied that change and are still having the problem.
Created attachment 113972: Simple fork program that will cause the problem.
Compile with "g++ doafork.cpp -o doafork" on a 32-bit machine.
One more item to note. Setting LD_ASSUME_KERNEL to 2.4.0 resolves the issue and allows all spawned levels to work. Setting it to 2.4.1 causes random segmentation violations.
Has anyone looked at this?
Reassigning to PeterM. Jason no longer has time to be ia64 arch maintainer. Also, reverting to ASSIGNED state. Jeff Lee, I apologize that no one has yet had time to investigate this.
Any word on the status of this issue?
We really need to get this addressed as soon as possible. Please contact me if further information is required. I cannot change the priority to high, even though I am the submitter. Can someone respond to me ASAP? Cheers, Jeff
Hello again. We urgently need to get this resolved, as it is impacting our product running on Linux 64. Is there any way to escalate this or assign it to someone who can look at it?
I compiled the test app as posted. I ran it and didn't see any difference in the output between an i386 machine with RHEL3 U2, and an ia64 machine with the latest RHEL3 pre-U7 (with or w/o ia32el enabled). Then I tried commenting out the #define USE_SHELL line. I compiled and ran the program on the i386 RHEL3 U2 box. I see the EXECVP error you describe even on i386, which indicates to me that this is how it is supposed to work. The exact same thing happens on ia64 (with or w/o ia32el enabled). So there seems to me no difference between native i386, hw-emulated i386 on ia64 and sw-emulated i386 on ia64. The bug would seem to be in your code. (If I get a chance, I'll try to figure out what that bug is.) If you can identify a test case that works on native i386 but fails on ia64, please post it. - Geoff
I outlined in the above text the cases that make this fail (see the initial entry). Just running the program doesn't cause a problem. The problem is a fork level problem. The test program only works for the first runs. It fails on the third process (fork) EVERY TIME under ia64, but works fine on i386. Additionally, the program MUST be compiled on i386 and run on ia64. We run our system such that process A starts process B, which starts process C. Process C uses threads and may or may not start other processes. We have never had a problem on i386, nor have our many customers. However, on 64-bit this fails regularly. The test program is designed merely to emulate processes A, B and C. But as it takes any process from the command line and forks it, you have to string them together as I said above.
The results I reported above were from running your test as you specified. If you compile and run it on a 32-bit machine, you will see the same results.

I believe I have tracked down the bug in your test program. You are constructing the cmdline variable, which contains the "rest" of the command line arguments, separated by spaces. Then you're passing this to execvp as one big argument. But that is not the interface to execvp. Execvp expects to be called with each argument as a separate array element:

    execvp( "/path/prog", { "/path/prog", "arg1", "arg2", "arg3", NULL } );

-not-

    execvp( "/path/prog", { "/path/prog arg1 arg2 arg3", NULL } );

When you comment out USE_SHELL and run ./doafork ./doafork ps -a, here's what your execvp calls look like:

    execvp( "./doafork", { "./doafork", "ps -a", NULL } );
    execvp( "ps -a", { "ps -a", NULL } );

When execvp tries to run the program "ps -a" it fails, because there is no such file as ps<space><dash>a. So it sets errno = 2 (No such file or directory) and returns -1 for failure.

If you replace the execvp() call in your test program with this line:

    execvp(arg[1], &arg[1]);

you will see the behavior you expect. With that change to the program I have tested on i386 (RHEL3), x86_64 (RHEL3 and RHEL4), and ia64 RHEL3 (w/ and w/o ia32el), just to be absolutely sure. I ran this successfully on each platform:

    ./doafork ./doafork ./doafork ./doafork ./doafork ps -a

PS: if you want an interface that can execute a whole command line as one string, try int system(const char *string) in <stdlib.h>.

- Geoff
Closing as NOTABUG based on last comment.
Actually, the full code we use is:

    char* argv[MAXARGS + 4];
    int i = 0;
    argv[i++] = "/bin/sh";
    argv[i++] = "-c";
    argv[i++] = ( char *)strTempCommand.c_str() ;
    argv[i] = NULL; //signals end of arguments
    execvp(argv[0], argv);

My test program attempts to simulate this. strTempCommand is a string containing the command to be executed (e.g. "ps -ef"). This works on Unix (Solaris, HP, AIX), as well as SUSE Linux. I have also tried the test program on Debian Linux and Windows. All work. However, when the above is compiled on a 32-bit Red Hat machine and run on 64-bit, it fails. It works fine on 32-bit Red Hat machines.

If you're saying this is incorrect and cannot be processed in this fashion, then that indicates 64-bit RH Linux is a special case for which special considerations must be made. Making such a statement nicely leaves RH in a position where nothing is a bug, only improper user coding. Please take another look.
Could the customer please provide a revised testcase and/or a strace/ltrace on his systems displaying the failure? We'll be happy to look at the issue a second time.
Created attachment 122883: A Red Hat updated testcase...
This is from James Antill: I'm uploading the new testcase, complete with makefiles etc. This uses the SHELL only ifdef path, which is equivalent to the code you posted in December. This also includes a threaded application to test with. I cannot get this to fail with any combination of 32 and 64 bit binaries that I've tried. Please upload a working "failure" test case or at least a strace & ltrace on the problem.
FYI, to compile the test run "make all32" on a 32bit machine and "make all64" on a 64bit one ... and then run the binaries in whichever way works/doesn't.
Thanks for the test case. Applying it to our systems for testing. Will let you know the results.
We have been doing extensive testing with both test cases, though the Red Hat version isn't much different. In our initial testing, we used the following command:

    >doafork doafork doafork ps -ef

This worked for both 32-bit and 64-bit, which confused us, as I thought this had failed before. I had intended to develop a simple test that could be used by Red Hat to investigate the problem. However, I can't recall ever testing with the "ps" command, and perhaps this has led to the confusion about the problem. I tried a number of ways to run it and even had another developer, who had not been on the original issue, try. He too got success. As I said, this was very confusing. But it had been a while since we ran the tests, and some patches had been applied to the system, including a new compiler on the 32-bit system where we did the builds.

So I turned to the original problem program. I rebuilt it with the new compiler and ran it on the 64-bit system. It failed. What's the difference? All it does is start threads and processes like doafork does. The program execs a copy of itself (A starts B). The copy starts another program (B starts C). This secondary program fails to start processes or threads. That (since the exec code used in doafork came from the problem program) would be the same as "doafork doafork xyz".

So why didn't "ps" fail? Because it's a 64-bit program. Our further testing has shown that the failure occurs when the third (and final) layer program is a 32-bit program. So running all 32-bit as follows:

    >doafork doafork some32bitprogram

will cause the problem. The problem is seen when "some32bitprogram" tries to start a thread or process. That's what we thought. Still testing, we had one item we could not explain. This works:

    >doafork doafork doafork doafork doafork ps -ef

Testing again, we found that doafork simply does execs in an attempt to recreate the original problem. It is not a multithreaded application. What fails is the last program, no matter what level (3rd and higher), when it is a 32-bit multithreaded program.

I cannot supply the original programs directly, but am working with our management to see if we can pass them to you for testing. In the meantime I am also working on a new test case for you. I can reproduce this problem at will, as can our customers, using programs compiled on 32-bit and run on 64-bit.

Also, for your reference: while I refer to "doafork", I have tested these cases with the original test case code I supplied and the Red Hat test case code that John supplied. Both recreate the problem every time. Simply replace my references to "doafork" with John's version.

Cheers, Jeff
Is there any update on this issue? Has anyone been able to recreate the issue following the instructions and information Jeff posted on 2006-01-27? We really need to get some type of status so we can inform our customers.
Eric, this bug has been recently re-assigned to Anil within Red Hat. He should have an update to this issue shortly.
Created attachment 125768: Detailed output of the program I ran.
I compiled the program attached here (comment #20) https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=122883 on a RHEL3/i686 p3 box (Taroon Update 7 Beta) by running make all32. I moved the compiled output onto an IA64 box running RHEL3 (Taroon Update 7 Beta) and ran the program as mentioned in this bugzilla. I am unable to recreate the original problem. Every time the program ran it was *successful*. FYI, I have attached detailed output from the system where I compiled the 32-bit program and the output of the program that ran on the IA64 box. Please let me know if I need to compile the 32-bit program on a specific version/release of the kernel in order to recreate the bug.
Anil, I see you compiled the "doafork" program as prescribed. As discussed in comment #23, you will find that just running it as is does not reproduce the problem. It was a curious item that no one looking at the issue was able to reproduce it using the doafork program. I investigated it again here and found that a basic program such as doafork doesn't cause the problem. It does allow the simulation of the startup that we do, which in turn causes the problem. Again, all the details are in comment #23, but basically you need to run doafork in a couple of calls with the final program being a multi-threaded application. Please feel free to contact me directly at lee_jeffrey for contact information to discuss this in more detail so we can get this resolved quickly. Cheers, Jeff
Jeff, I had several levels of doafork programs with the final program being a multi-threaded application, and I could not reproduce the bug. I tried all kinds of permutations and combinations of 32-bit and 64-bit programs, and all of them are working for me. Some examples that I tried running are:

1) ./doafork-new32 ./doafork-new64 ./doafork-new64 ./doafork-new32 ./tdns32 3 example_tdns_input
2) ./doafork-new32 ./doafork-new64 ./doafork-new64 ./doafork-new32 ./doafork-new64 ./tdns32 3 example_tdns_input
3) ./doafork-new32 ./doafork-new64 ./doafork-new64 ./doafork-new32 ./doafork-new64 ./tdns64 3 example_tdns_input

And of course several other such combinations are all working fine for me. If you can still reproduce, I would like to know on what 32-bit platform/OS release version you are compiling the program. Thanks, Anil
As this is Saturday, I will reproduce and send detailed OS information on Monday. This is not only reproducible by us, but by several of our customers. Of course they use our compiled code, but they run on various 64-bit levels. I will also look into getting you an agent that will cause the problem, or a smaller program that I can get to produce the problem. If there is anything else you need me to get, let me know. Cheers, Jeff
The thread-spawning problem was caused by our machine missing the ia32el package. After installing this package, the issue is resolved on our machine. This ticket can be closed. Thanks, Ruitao
Closing per EMC... glad the issue is resolved.