Bug 183881
Summary: | kernel/libc type mismatch on siginfo_t->si_band - breaks FAM on 64bit arches | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 3 | Reporter: | Bastien Nocera <bnocera> | ||||||
Component: | kernel | Assignee: | Ernie Petrides <petrides> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> | ||||||
Severity: | medium | Docs Contact: | |||||||
Priority: | medium | ||||||||
Version: | 3.0 | CC: | alexl, davej, drepper, jakub, jglauber, jparadis, jrb, lwang, miyer, peterm, petrides, roland, rstrode, sct, tao | ||||||
Target Milestone: | --- | ||||||||
Target Release: | --- | ||||||||
Hardware: | x86_64 | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | RHSA-2006-0437 | Doc Type: | Bug Fix | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2006-07-20 13:55:11 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 181405 | ||||||||
Attachments: |
|
Description
Bastien Nocera
2006-03-03 11:30:02 UTC
This seems to be a kernel problem. When the dnotify signals gets delivered to fam the value of si->si_fd is wrong. This seems to be a kernel 64bit issue with sigio in general. I wrote a testcase that doesn't use dnotify that just sets up a sigio handler on a unix domain handler. Running it on a x86-64 "Red Hat Enterprise Linux WS release 3 (Taroon Update 7)" I get: ./test_sigio fd we're waiting on sigio from: 4 received signal 29, si_signo=29, si_errno=0, si_code=1 signaled fd (si->si_fd) = -1787485370 The last line should end with "= 4". Created attachment 126387 [details]
Test case for sigio
sigio works as expected on RHEL3 i386 apparently: veillard:~ -> gcc -o sigio sigio.c veillard:~ -> ./sigio fd we're waiting on sigio from: 4 received signal 29, si_signo=29, si_errno=0, si_code=1 signaled fd (si->si_fd) = 4 veillard:~ -> uname -a Linux veillard.com 2.4.21-37.EL #1 Wed Sep 7 13:37:20 EDT 2005 i686 athlon i386 GNU/Linux veillard:~ -> This is a glibc <-> kernel mismatch of the size of si_band in siginfo_t. For details, see: http://uwsg.ucs.indiana.edu/hypermail/linux/kernel/0402.1/0456.html Alex, thanks for the reproducer in comment #11. I was able to verify the error on my x86_64 test system (on an interim U8 kernel over a U6 glibc base). But I'm confused about one major thing: Where is the type mismatch? Both _band and _fd in the siginfo_t definition in /usr/include/asm-x86_64/siginfo.h are int's, which match what the kernel is using. Also, I don't think changing the kernel data structure is an option, since that would break the ABI (whether or not the int-sized ABI is desirable). The mismatch is with the *other* user-space version of siginfo_t in /usr/include/bits/siginfo.h. Good grief. If you put the following lines before the include of <signal.h> in the test program: #include <sys/types.h> #include <asm/siginfo.h> #define __have_siginfo_t #define __have_sigval_t #define __have_sigevent_t Then the *correct* version of a siginfo_t is used and the test case works correctly. This shows that the kernel struct *cannot* be changed due to ABI compatibility concerns. The bottom line is that /usr/include/asm-x86_64/siginfo.h is correct in glibc-kernheaders, but that /usr/include/bits/siginfo.h is incorrect in glibc-headers. Thus, I'm reassigning this bug to the glibc component (since there's no glibc-headers component, despite there being a glibc-headers RPM). 2.6 kernel uses long on almost all architectures (except sparc64), only 2.4 and earlier kernels got it wrong. glibc can't change depending on what kernel it is running on. Plus, anything that includes asm/siginfo.h directly rather than using <signal.h> is broken. I have an idea for a kernel work-around, but I need to investigate further. Ray has confirmed that "fam" doesn't use the "si_band" field of the "siginfo" structure, so maybe the x86_64 kernel could put the "si_fd" value at both 4-byte and 8-byte offsets beyond where "si_band" resides (without any data structure changes). For reference, the "si_band" field ("_band" in "_sigpoll" part of union) was changed from type "int" to "long" (only on x86_64) upstream in 2.4.22, the release following the one on which the RHEL3 kernel is based. Oh dear, this is all *soooo* broken. Upstream changed a field --- which is an enum that can only hold 5 legal values anyway --- from int to long. But POSIX requires the long. And we added a compatibility option to let some archs keep it an int for legacy compatibility's sake, but x86_64 chose to break it instead. Trouble is, this really leaves us with no authoritative definition of what it *should* be on 2.4. I guess we really have to hold the "long" authoritative. I kindof like Ray's patch. It fixes existing users of si_fd. It won't fix users of si_band, though --- si_band will still contain the old si_fd in its upper bits, so user-space comparisons of (si_band == POLL_*) will still fail. But it fixes the problem at hand in a clean enough way, and it's probably the most compatible fix we can achieve right now. I have finished researching the 2.4/RHEL-3 aspects of this predicament. First and foremost, there has never been a RHEL-3 regression -- this problem has existed since the initial release of RHEL-3 (kernel version 2.4.21-4.EL). Secondly, I have verified that the si_band field of the user-visible siginfo_t in every single RHEL-3 Update (as well as GOLD) for every single architecture has been declared as type (long int). That field (which immediately precedes si_fd, the one most needed when handling a SIGPOLL/SIGIO) only holds a bitmask of 11 possible flags (in bits 0 - 10). Thirdly, the RHEL-3 kernel arches that use an inconsistently sized si_band field (with respect to the siginfo_t in /usr/include/bits/siginfo.h) are x86_64, ppc64, and s390x. (Specifically, ia64 is not broken.) Fourthly, I'm still against *changing* the kernel's siginfo_t declaration because that would break applications compiled with the original kernel's layout (also available to user-space in /usr/include/asm/siginfo.h). Finally, there's only one place in the whole kernel where the si_fd field is assigned (excluding 32-bit-compat support): send_sigio_to_task(). Thus, I feel the best way to resolve the issue is for the kernel to make a 2nd copy of the si_fd value 4 bytes beyond where it thinks it should be on x86_64, ppc64, and s390 *without* changing any of the siginfo_t declarations. This works because the si_fd field is last in the _sigpoll part of the union, and there is a pad of (128 - 5 * sizeof(int)) bytes in the siginfo_t beyond where si_fd is being stored on the 3 broken arches. The benefit is that applications using the 4-byte offset (between si_band and si_fd) will continue to work, those using the 8-byte offset will begin to work without recompilation, and those that either don't use si_band (such as "fam") or properly use it as a bitmask (only doing tests with the "&" operator and not with "==" or "!=") will continue to work. I'll attach an appropriate kernel patch following testing. (It will capture the same idea as Ray's patch in comment #28, but implemented differently.) Created attachment 126605 [details]
siginfo_t work-around patch
This patch has been posted to rhkernel-list for review.
Please note that ppc64 and s390x is big endian, so si_band in that case won't work at all. Rats. Okay, thanks for the note, Jakub. I still think we should proceed with the patch in comment #31, because it at least makes si_fd visible in both versions of the siginfo_t. Also, since our ppc64 release runs with a 32-bit user-space, applications shouldn't run into any problems there. Thanks for all the research and quick turn around Ernie. A fix for this problem has just been committed to the RHEL3 U8 patch pool this evening (in kernel version 2.4.21-40.5.EL). Will this be in RHEL3 U8? I need to know, because otherwise i need to fix fam. Alexander, it already is in U8, as is evident from comment #35 along with the fact that this bug is in MODIFIED state. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0437.html |