| Summary: | sem_post() and sem_timed_wait() block forever | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Paulo Andrade <pandrade> | ||||
| Component: | glibc | Assignee: | Carlos O'Donell <codonell> | ||||
| Status: | CLOSED NOTABUG | QA Contact: | qe-baseos-tools-bugs | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 7.3 | CC: | ashankar, fweimer, mnewsome, pandrade, pfrankli, triegel | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2016-11-30 18:10:32 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Attachments: |
|
||||||
I looked at why the old compat symbol was needed. The original sem_t for LinuxThreads looks like this:
typedef struct {
long int sem_status;
int sem_spinlock;
} old_sem_t;
So it is only 8 or 16 bytes large (for 32 and 64 bit architectures, respectively). Our current sem_t is 16 or 32 bytes large.
The current implementation needs 16 bytes on both kinds of architectures, so there is no easy way to fix this on 32-bit architectures.
This is either a bug in the old sem_post, which is entirely plausible, or a test case / user application issue, so we need to triage this to the point where we understand what's the problem with the use case. I expect this is a compat symbol implementation issue, we really need more test cases that exercise the compat symbols. I like the use of the asm to for the symver in the example, that's clever, and I know Florian has been pushing similar things upstream e.g. symbol_version_reference. We need a test just like this one upstream. (In reply to Carlos O'Donell from comment #3) > This is either a bug in the old sem_post, which is entirely plausible, or a > test case / user application issue, so we need to triage this to the point > where we understand what's the problem with the use case. I expect this is a > compat symbol implementation issue, we really need more test cases that > exercise the compat symbols. The test case attached to the bug suggests this is an incorrect assumption about compatibility of old and new semaphores. Apparently, the semaphore object is passed between parts of the system using different library versions. But the test case may not be representative of what actually goes on in the application (see my next comment). I think we can improve the situation at reasonable cost. Here are the current struct definitions with 64-bit atomics available: struct new_sem { uint64_t data; int private; int pad; }; Recall that on 64-bit architectures (with 8-byte alignment), the original definition already needed 16 bytes, so we do not need a different implementation for the compat symbol at all. For architectures without 64-bit atomics, we have: struct new_sem { unsigned int value; int private; int pad; unsigned int nwaiters; }; struct old_sem { unsigned int value; }; The original semaphore may only have been 8 bytes large (see the comment 1). So we can introduce a shared flag field only by reordering fields, like this: struct new_sem { unsigned int value; int flags; int private; unsigned int nwaiters; }; struct old_sem { unsigned int value; int flags; }; New-style sem_init sets flags to 0 and sem_open would set flags 0, the compat implementation of sem_init sets flags to 1. The compat implementation of sem_wait and sem_post call the non-compat implementation if flags is 0. This means that the compat implementations would be able to use semaphore objects created by the current implementation, but not vice versa. This could be sufficient to address the application issue. (In reply to Paulo Andrade from comment #0) > The attached test.c mimics the behaviour of the user > environment where the problem happens. > > Basically, it calls sem_timedwait on an "timedout" timer > on a loop, then, attempts to break the loop by doing a > sem_post, but it never finishes. > > The issue happens only if forcing it to use __old_sem_post. How accurate is the test case? What is puzzling about it is that sem_open became available only *after* we increased the size of struct sem_t and created the compat version for sem_post. This means that it is bizarre that the application does not end up calling the new version of sem_post. Does the real application call sem_init instead? Can you double-check that the shared object which calls sem_post is linked against libpthread? Another possible explanation is that it was not linked correctly, did not receive symbol version, and ld.so binds all its symbols against compat symbols as a result. In this case, the issue would be similar to what we experienced in this upstream bug: https://sourceware.org/bugzilla/show_bug.cgi?id=20489 The shared object is invalid in such a case and impossible to support. But we currently lack diagnostics for that, so it is rather difficult to enforce that. One possible way to check this is to examine the old binaries involved with “eu-readelf -s” and check if they reference any GLIBC_ versions newer than GLIBC_2.0. (In reply to Florian Weimer from comment #4) > The original semaphore may only have been 8 bytes large (see the comment 1). > So we can introduce a shared flag field only by reordering fields, like this: This does not work everywhere because the pad field is already used as a mutex on pre-v9 SPARC, but the struct definition was updated to reflect that. But the question remains: Can we squeeze a flag bit from somewhere? If there are any architectures with 64-bit atomics and only 4-byte alignment for longs, we have another problem (alignment of sem_t is insufficient for the current non-compat implementation). The libraries are very old and if I understood correctly, are dlopen'ed; they are part of mysql compiled stored procedures. Stepping with gdb and putting watchpoints showed the same behaviour in the test case. It calls mostly new_* functions, and the call to old_sem_post causes the infinite loop. The fact that binaries and libraries cannot be rebuilt (it might be mostly due to bureaucracy) is a possible indication of invalid/unsupported build. I will ask the user to check eu-readelf -s and libpthread linking. eu-readelf -s have shown newer than GLIBC_2.0 symbols. The user code did not call sem_init; user was unsure if mariadb somewhere could be calling it. The library was not linked to libpthread. User was instructed to relink the library with -lpthread, and now we are waiting user feedback. |
Created attachment 1223865 [details] test.c This is a regression after correction to rhbz#1027348 The issue happens on a 32 bit application linked to some significantly old 32 bit libraries. The attached test.c mimics the behaviour of the user environment where the problem happens. Basically, it calls sem_timedwait on an "timedout" timer on a loop, then, attempts to break the loop by doing a sem_post, but it never finishes. The issue happens only if forcing it to use __old_sem_post. The user cannot at the moment rebuild the legacy binaries, nor rebuild as 64 bit binaries. The test case should print: $ gcc -O0 -g3 -m32 test.c -o test -Wl,--wrap=sem_post -pthread -bash-4.2$ ./test . No error but instead, in rhel 7.3 (or fedora 24) it just prints nostop a '.'.