Bug 43742
| Summary: | the pthreads library appears to become confused |
|---|---|
| Product: | [Retired] Red Hat Linux |
| Component: | glibc |
| Version: | 7.1 |
| Hardware: | All |
| OS: | Linux |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | high |
| Reporter: | IBM Bug Proxy <bugproxy> |
| Assignee: | Jakub Jelinek <jakub> |
| QA Contact: | David Lawrence <dkl> |
| CC: | destefan, drepper, fweimer, teg |
| Doc Type: | Bug Fix |
| Last Closed: | 2001-09-13 11:52:19 UTC |

Description
IBM Bug Proxy
2001-06-06 22:57:42 UTC
Jakub sent me this test program for the thread library some time back. I will attach a slightly modified version shortly. It shows some extremely strange effects. I (and somebody else very familiar with the thread code) spent time looking at the program and couldn't find anything wrong with it.

Now the strange thing: the thread library itself does not seem to be involved in the problems. Running the same program on a quad Alpha doesn't show any problem either. I now think that this is a problem with the memory handling for x86 in the kernel.

Let me start by showing some output of the program:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
155288599: ex_ter_tid[55152].th already set to 155287561!
55152: cr_ter_tids(155287561) != ex_ter_tids(155288599)
55153: cr_ter_tids(155288599) != ex_ter_tids(0)
55154: cr_ter_tids(155296771) != ex_ter_tids(155299849)
55155: cr_ter_tids(155299849) != ex_ter_tids(0)
55156: cr_ter_tids(155308057) != ex_ter_tids(155311107)
55157: cr_ter_tids(155311107) != ex_ter_tids(0)
55158: cr_ter_tids(155319324) != ex_ter_tids(155322371)
55159: cr_ter_tids(155322371) != ex_ter_tids(0)
55160: cr_ter_tids(155328529) != ex_ter_tids(155333635)
55161: cr_ter_tids(155333635) != ex_ter_tids(0)
[...]
155432990: ex_ter_tid[255197].th already set to 155431965!
55178: cr_ter_tids(155429911) != ex_ter_tids(155439113)
55179: cr_ter_tids(155439113) != ex_ter_tids(0)
255198: cr_ter_tids(155432990) != ex_ter_tids(0)
55180: cr_ter_tids(155443224) != ex_ter_tids(155451401)
55181: cr_ter_tids(155451401) != ex_ter_tids(0)
Error joining tertiary thread with index 255200, thread ID 155446299: (3) No such process
[and so on...]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's start with the first line. As far as I can tell, something like this is always the first line. Maybe this is the indication of the original problem and the rest is just fallout.

The message

155288599: ex_ter_tid[55152].th already set to 155287561!

means that although entry 55152 of the array is being used for the first time, it already contains a value (the arrays are cleared at program start). You can look through the code: there is no other assignment. Unless the compiler generates different code somewhere, this can only happen if an assignment somewhere went wrong. Please note that the assignments to ex_ter_tid all happen in the tertiary threads, the ones which are created and terminated at a very high rate. My current guess is that the kernel at some point fails to set up the correct mapping and the assignment

ex_ter_tid_p = (arr_t *)arg;

in ter_thread_rtn() goes somewhere else.

This first failure has a ripple effect. The lines

55152: cr_ter_tids(155287561) != ex_ter_tids(155288599)
55153: cr_ter_tids(155288599) != ex_ter_tids(0)

mean that ex_ter_tids[55152] is wrong. Instead of the correct thread ID (155287561) it contains the ID 155288599. If you look at the first message this is not surprising, since the thread with the ID 155288599 assumes it has to write at index 55152. The index is derived from the value of a parameter which is passed via clone() to the new thread. This almost always works, just not in this case.

This pattern continues, though most of the time the assignment mismatch is not detected. That's not a big surprise since it is a race. The symptoms are almost the same: the second thread of the group (tertiary threads are always launched in a group of two) writes its thread ID into the wrong array slot, leaving its own entry zero and overwriting the other thread's ID. This is why you see these pairs.

Occasionally even the joining fails. pthread_join() only uses the cr_ter_tids array, which is normally not corrupted.

***

I really cannot explain this behavior. The thread library seems fine. At first we thought pthread_join() would return too early, but this can't be it since the ex_ter_tids array gets written to in the wrong place.

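For readers without the attachment, the check behind the two kinds of messages can be sketched roughly as follows. This is not the attached test program, only a minimal illustration under assumptions: the names ter_thread_rtn, ex_ter_tid_p, arr_t, ex_ter_tid and cr_ter_tids come from the description above, while PAIRS, SLOTS, the `set` flag, `already_set` and `run_slot_check()` are hypothetical and exist only for this sketch. Each tertiary thread receives a pointer to its own slot, verifies the slot is still unused, and stores its ID there; the creator then compares the ID it got from pthread_create() with the ID the thread recorded.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define PAIRS 64            /* hypothetical: number of two-thread groups */
#define SLOTS (2 * PAIRS)

/* One slot per tertiary thread. The 'set' flag is an assumption of this
   sketch so that no pthread_t has to be compared against zero. */
typedef struct {
    pthread_t th;
    int set;
} arr_t;

static pthread_t cr_ter_tids[SLOTS];  /* IDs as returned to the creator */
static arr_t ex_ter_tid[SLOTS];       /* IDs as recorded by the threads */
static int already_set;               /* number of "already set" hits */
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

static void *ter_thread_rtn(void *arg)
{
    /* The slot is derived solely from the parameter passed at creation. */
    arr_t *ex_ter_tid_p = (arr_t *)arg;

    if (ex_ter_tid_p->set) {
        /* Corresponds to the "already set to ..." message. */
        pthread_mutex_lock(&count_lock);
        already_set++;
        pthread_mutex_unlock(&count_lock);
    }
    ex_ter_tid_p->th = pthread_self();
    ex_ter_tid_p->set = 1;
    return NULL;
}

/* Creates tertiary threads in groups of two, joins them, and returns the
   number of inconsistencies found (0 when everything matches). */
int run_slot_check(void)
{
    int problems = 0;

    memset(ex_ter_tid, 0, sizeof ex_ter_tid);
    already_set = 0;

    for (int i = 0; i < SLOTS; i += 2) {
        for (int k = 0; k < 2; k++) {
            int rc = pthread_create(&cr_ter_tids[i + k], NULL,
                                    ter_thread_rtn, &ex_ter_tid[i + k]);
            assert(rc == 0);
        }
        for (int k = 0; k < 2; k++) {
            int rc = pthread_join(cr_ter_tids[i + k], NULL);
            assert(rc == 0);
        }
    }

    /* Corresponds to the "cr_ter_tids(...) != ex_ter_tids(...)" messages. */
    for (int i = 0; i < SLOTS; i++) {
        if (!ex_ter_tid[i].set ||
            !pthread_equal(cr_ter_tids[i], ex_ter_tid[i].th)) {
            printf("%d: creator and thread disagree about the thread ID\n", i);
            problems++;
        }
    }
    return problems + already_set;
}
```

Calling run_slot_check() from main() and exiting with a non-zero status when it returns anything other than 0 would be one way to use it. On a correctly working system every slot is written by exactly one thread, so the function is expected to return 0.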
The problem does not seem to have anything to do with the LDT usage. I've updated my machine to 2.4.8pre1, which should have the latest patch for LDT handling included. I think I've also tested the code with an i386 version of libpthread which does not use the LDT at all.

On UP machines, or SMP machines with all but one processor already busy, the problem does not show up that often. I even managed to get 6 out of 10 runs without any problem.

As mentioned above, I've tested on my quad Alpha without any problems. I don't have the resources available on the SMP IA-64 machines we have here, so I couldn't test it, and I don't know how to use the resources in Durham. I would appreciate it if you could run the program on SMP machines of other architectures. So far it seems only x86 is affected, which probably points at the kernel.

***

Some words about the program. I've modified it to add another test and output (in sec_thread_rtn after the pthread_join call). And I've introduced the arr_t type, which makes it possible to add padding between the thread descriptors.

Created attachment 25811
Modified test case to see the problem easier
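
The layout of arr_t is not reproduced in this report. As a rough illustration only, a padded slot type could look like the sketch below; the `.th` member appears in the program output above, while PAD_BYTES, the `pad` member, the `demo` array and `adjacent_th_gap()` are hypothetical names chosen for this sketch. The point is simply that the padding moves the `th` fields of neighbouring entries further apart in memory.

```c
#include <pthread.h>
#include <stddef.h>

#define PAD_BYTES 64   /* hypothetical padding size, not taken from the report */

/* A slot holding one thread ID plus padding, so that the IDs of adjacent
   array entries do not sit directly next to each other. */
typedef struct {
    pthread_t th;
    char pad[PAD_BYTES];
} arr_t;

static arr_t demo[4];  /* small demonstration array */

/* Distance in bytes between the 'th' fields of two adjacent entries. */
size_t adjacent_th_gap(void)
{
    return (size_t)((const char *)&demo[1].th - (const char *)&demo[0].th);
}
```

With this definition the gap equals sizeof(arr_t), which is at least sizeof(pthread_t) + PAD_BYTES; changing PAD_BYTES is the only knob needed to vary the spacing.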
My current theory: Without looking in the kernel sources, I would guess that a successful clone() call just creates the basic data structures for the new thread but does not allocate all the resources. The program and the thread library work such that there is never more than one clone() call in progress at a time. This is completely serialized. But there are many clone() calls. The initial thread creates 10 threads, which in turn each create 2 threads over and over again. I.e., if the newly created threads are not scheduled immediately, there can be many threads hanging around in limbo.

Now, how does the kernel create the initial contexts for the thread, and especially, how does it set the parameter? This is what seems to be going wrong. Two threads seem to get the same parameter even though the clone() syscalls specify different ones. And judging from the environment, this happens if the two threads are scheduled at the same time on different CPUs. Maybe this gives somebody a clue.

Setting architecture to i386 and category to glibc (for now). This should probably be kernel.

This should be fixed with http://sources.redhat.com/ml/libc-alpha/2001-09/msg00114.html; at least the testcase no longer fails for me.

Closing this bug as our testing shows this is no longer an issue in version 7.2. (See LTC128)