Hi, I have a fresh RH 6.2 installation on a dual-cpu P3.

simra@Chess:[simra] 3>rpm -q glibc
glibc-2.1.3-15
simra@Chess:[simra] 4>ls -l /lib/libpthread*
-rwxr-xr-x 1 root root 289906 Feb 29 16:58 /lib/libpthread-0.8.so*
lrwxrwxrwx 1 root root 17 Apr 17 17:46 /lib/libpthread.so.0 -> libpthread-0.8.so*
simra@Chess:[simra] 5>uname -a
Linux Chess.McRCIM.McGill.EDU 2.2.14-5.0smp #1 SMP Tue Mar 7 21:01:40 EST 2000 i686 unknown

The following short program works fine on a single processor machine, but aborts on a dual. I haven't tried it yet w/ open(2) and read(2), so I'm not sure if it's a libc problem or kernel problem. In any case it's a serious problem.

btw, if I guard the inside of the while loop with a mutex, the same problem occurs, so it seems to have something to do with the thread migrating between processors (if this does indeed occur), and also suggests that it's not a libc problem.

/************************************************************************
gcc -Wall -o threadtest -D_REENTRANT threadtest.c -lpthread
Usage: ./threadtest
***********************************************************************/
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <string.h>

#define maxthreads 4

void * runthread(void* args);

int main(int argc, char ** argv)
{
  int t;
  pthread_t threads[maxthreads];

  t=0;
  for (t=0; t<maxthreads; t++) {
    fprintf(stderr,"Spawning thread %d\n", t);
    pthread_create(&threads[t], NULL, runthread, argv[0]);
  }
  sleep(10000);
  exit(0);
}

void * runthread(void* args)
{
  char buffer[1024];

  while (1) {
    FILE* fp=fopen("threadtest.c","r");
    if (!fp) {
      perror((char*)args);
      exit(1);
    }
    if (!fgets(buffer, 1023, fp)) {
      perror("Failed fgets");
      exit(1);
    }
    if (strncmp(buffer, "/*",2)) {
      fprintf(stderr, "Failed on *%s*\n",buffer);
      abort();
    }
    else
      fprintf(stderr,"%ld Success..\n",(long)fp);
    fclose(fp);
  }
  pthread_exit(0);
}
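For anyone who wants to repeat the mutex experiment mentioned above, here is a rough sketch of what guarding the loop body might look like. It is not the code that was actually run: the names io_lock, guarded_job, guarded_loop and run_guarded are made up for illustration, the file path is passed in rather than hard-coded, and the loop is bounded and counts failures instead of calling abort(), so that it terminates.

```c
/* Sketch only: a mutex-guarded variant of the fopen/fgets check.
   All names here are hypothetical, not taken from the original program. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define GUARDED_THREADS 4   /* same thread count as the original test */

/* One global lock serialises the whole open/read/compare/close sequence. */
static pthread_mutex_t io_lock = PTHREAD_MUTEX_INITIALIZER;

struct guarded_job {
    const char *path;       /* file whose first line is checked */
    int iterations;         /* bounded here; the original loops forever */
    int failures;           /* bad or failed reads seen by this thread */
};

void *guarded_loop(void *arg)
{
    struct guarded_job *j = arg;
    char buffer[1024];
    int i;

    for (i = 0; i < j->iterations; i++) {
        int ok = 0;
        FILE *fp;

        pthread_mutex_lock(&io_lock);
        fp = fopen(j->path, "r");
        if (fp) {
            /* The first line is expected to start a C comment. */
            if (fgets(buffer, sizeof buffer, fp))
                ok = (strncmp(buffer, "/*", 2) == 0);
            fclose(fp);
        }
        pthread_mutex_unlock(&io_lock);

        if (!ok)
            j->failures++;
    }
    return NULL;
}

/* Start the threads, wait for them, and return the total failure count.
   A thread that could not be started is counted as one failure. */
int run_guarded(const char *path, int iterations)
{
    pthread_t tid[GUARDED_THREADS];
    struct guarded_job jobs[GUARDED_THREADS];
    int started = 0, total = 0, t;

    for (t = 0; t < GUARDED_THREADS; t++) {
        jobs[t].path = path;
        jobs[t].iterations = iterations;
        jobs[t].failures = 0;
        if (pthread_create(&tid[t], NULL, guarded_loop, &jobs[t]) != 0)
            break;
        started++;
    }
    for (t = 0; t < started; t++) {
        pthread_join(tid[t], NULL);
        total += jobs[t].failures;
    }
    return total + (GUARDED_THREADS - started);
}
```

Calling run_guarded("threadtest.c", n) from a small main() and printing the return value gives a failure count for n iterations per thread; with the lock held around every stdio call, no two threads are ever inside fopen/fgets/fclose at the same time.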
Note: I have since reimplemented the program using open(2) and read(2), and that version does not abort; therefore the problem is with fgets in libc. Also, I have seen this problem with the GNU extension 'getline'.
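The open(2)/read(2) version itself was not posted, so the following is only a sketch of what such a variant might look like. The names starts_with_comment, job, check_loop and run_checkers are hypothetical, the path is a parameter, and the loop is bounded and counts failures rather than aborting.

```c
/* Sketch only: an open(2)/read(2) variant of the check.
   This is not the original reimplementation; all names are hypothetical. */
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 4          /* same thread count as the fopen/fgets version */

struct job {
    const char *path;       /* file whose first bytes are checked */
    int iterations;         /* bounded here; the original loops forever */
    int failures;           /* bad or failed reads seen by this thread */
};

/* Return 1 if the file begins with a C comment opener, 0 if it does not,
   and -1 if the file cannot be opened or fewer than two bytes are read. */
int starts_with_comment(const char *path)
{
    char buf[2];
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    n = read(fd, buf, sizeof buf);   /* no stdio buffering involved */
    close(fd);
    if (n != (ssize_t)sizeof buf)
        return -1;
    return memcmp(buf, "/*", 2) == 0;
}

void *check_loop(void *arg)
{
    struct job *j = arg;
    int i;

    for (i = 0; i < j->iterations; i++)
        if (starts_with_comment(j->path) != 1)
            j->failures++;
    return NULL;
}

/* Run NTHREADS checkers concurrently and return the total failure count.
   A thread that could not be started is counted as one failure. */
int run_checkers(const char *path, int iterations)
{
    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];
    int started = 0, total = 0, t;

    for (t = 0; t < NTHREADS; t++) {
        jobs[t].path = path;
        jobs[t].iterations = iterations;
        jobs[t].failures = 0;
        if (pthread_create(&tid[t], NULL, check_loop, &jobs[t]) != 0)
            break;
        started++;
    }
    for (t = 0; t < started; t++) {
        pthread_join(tid[t], NULL);
        total += jobs[t].failures;
    }
    return total + (NTHREADS - started);
}
```

Because this path goes straight to the system calls and bypasses the stdio layer, comparing its failure count against the fopen/fgets version on the same machine is one way to see whether the stdio layer is involved in the failure.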
I'm posting some additional comments from Ulrich Drepper re my bug report to the libc people. A second individual was unable to reproduce the bug with libc-2.1.2 and his own custom-built 2.2.14 kernel. I'm beginning to wonder if it's a problem with the RH prebuilt SMP kernel or a HW problem. If it's a HW problem it's not specific to a single machine: I can reproduce the bug on 10 identical machines in our lab.

Date: 30 Apr 2000 12:18:50 -0700
Subject: Re: libc/1706: libpthread, multiprocessor linux and fgets/getline

Robert Sim <simra.ca> writes:

> >Description:
>
> Compile the program supplied below as per the comment line and execute on a
> multi-processor machine. It will eventually abort on the abort() instruction
> because fgets failed to read the expected bytes into the buffer (in spite of
> returning a success). The program executes fine on a single-processor
> machine, and also works fine if I replace fopen and fgets with the equivalent
> open(2) and read(2) calls. I have also observed this behaviour using the gnu
> getline extension.

I cannot reproduce this. The setup is almost identical. The only notable difference is that I'm using a 2.3.99pre6 kernel. I have the process running for more than an hour and everything works fine.
RESOLVED: by upgrading the kernel to 2.2.14-6.1.1.

It disturbs me that Red Hat is doing its own kernel patches: I can't report a bug like this to the linux-kernel people because I'm not using a standard 2.2.14 kernel but some specially patched version of 2.2.14.
And yes, there are folks at Red Hat who deal with both RH and standard tree bugs. We pretty much have to ship a non-default tree.