Bug 183881

Summary: kernel/libc type mismatch on siginfo_t->si_band - breaks FAM on 64bit arches
Product: Red Hat Enterprise Linux 3 Reporter: Bastien Nocera <bnocera>
Component: kernelAssignee: Ernie Petrides <petrides>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.0CC: alexl, davej, drepper, jakub, jglauber, jparadis, jrb, lwang, miyer, peterm, petrides, roland, rstrode, sct, tao
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2006-0437 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-07-20 13:55:11 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 181405    
Attachments:
Description Flags
Test case for sigio
none
siginfo_t work-around patch none

Description Bastien Nocera 2006-03-03 11:30:02 UTC
fam-2.6.8-15

There are a lot of problems with FAM on x86-64 systems:

- in NetConnection.c++, NetConnection::input doesn't handle short reads and will
disconnect clients on a whim
- Monitoring a directory doesn't give out any events about newly created files in it
- Monitoring a file doesn't send any events about it being deleted

This was tested with running fam on the command-line as root (fam -d -v), and
using the test client at:
http://oss.sgi.com/cgi-bin/cvsweb.cgi/fam/test/

I would advise replacing FAM and its libraries by the better working gamin, from
RHEL4.

Comment 10 Alexander Larsson 2006-03-21 11:22:42 UTC
This seems to be a kernel problem. When the dnotify signals gets delivered to
fam the value of si->si_fd is wrong.

This seems to be a kernel 64bit issue with sigio in general. I wrote a testcase
that doesn't use dnotify that just sets up a sigio handler on a unix domain
handler. Running it on a x86-64 "Red Hat Enterprise Linux WS release 3 (Taroon
Update 7)" I get:

./test_sigio 
fd we're waiting on sigio from: 4
received signal 29, si_signo=29, si_errno=0, si_code=1
signaled fd (si->si_fd) = -1787485370

The last line should end with "= 4".



Comment 11 Alexander Larsson 2006-03-21 11:23:37 UTC
Created attachment 126387 [details]
Test case for sigio

Comment 12 Daniel Veillard 2006-03-21 11:32:17 UTC
sigio works as expected on RHEL3 i386 apparently:

veillard:~ -> gcc -o sigio sigio.c
veillard:~ -> ./sigio
fd we're waiting on sigio from: 4
received signal 29, si_signo=29, si_errno=0, si_code=1
signaled fd (si->si_fd) = 4
veillard:~ -> uname -a
Linux veillard.com 2.4.21-37.EL #1 Wed Sep 7 13:37:20 EDT 2005 i686 athlon i386
GNU/Linux
veillard:~ ->

Comment 13 Alexander Larsson 2006-03-21 12:04:29 UTC
This is a glibc <-> kernel mismatch of the size of si_band in siginfo_t.
For details, see:
http://uwsg.ucs.indiana.edu/hypermail/linux/kernel/0402.1/0456.html


Comment 21 Ernie Petrides 2006-03-21 21:25:41 UTC
Alex, thanks for the reproducer in comment #11.  I was able to verify
the error on my x86_64 test system (on an interim U8 kernel over a U6
glibc base).  But I'm confused about one major thing:

Where is the type mismatch?  Both _band and _fd in the siginfo_t definition
in /usr/include/asm-x86_64/siginfo.h are int's, which match what the kernel
is using.

Also, I don't think changing the kernel data structure is an option, since
that would break the ABI (whether or not the int-sized ABI is desirable).


Comment 22 Ernie Petrides 2006-03-21 21:37:04 UTC
The mismatch is with the *other* user-space version of siginfo_t
in /usr/include/bits/siginfo.h.  Good grief.

Comment 23 Ernie Petrides 2006-03-21 21:57:12 UTC
If you put the following lines before the include of <signal.h> in
the test program:

  #include <sys/types.h>
  #include <asm/siginfo.h>
  #define __have_siginfo_t
  #define __have_sigval_t
  #define __have_sigevent_t

Then the *correct* version of a siginfo_t is used and the test case
works correctly.  This shows that the kernel struct *cannot* be changed
due to ABI compatibility concerns.

The bottom line is that /usr/include/asm-x86_64/siginfo.h is correct in
glibc-kernheaders, but that /usr/include/bits/siginfo.h is incorrect in
glibc-headers.

Thus, I'm reassigning this bug to the glibc component (since there's no
glibc-headers component, despite there being a glibc-headers RPM).


Comment 24 Jakub Jelinek 2006-03-21 22:07:04 UTC
2.6 kernel uses long on almost all architectures (except sparc64), only
2.4 and earlier kernels got it wrong.  glibc can't change depending on what
kernel it is running on.
Plus, anything that includes asm/siginfo.h directly rather than using <signal.h>
is broken.


Comment 26 Ernie Petrides 2006-03-23 03:06:15 UTC
I have an idea for a kernel work-around, but I need to investigate further.

Ray has confirmed that "fam" doesn't use the "si_band" field of the "siginfo"
structure, so maybe the x86_64 kernel could put the "si_fd" value at both
4-byte and 8-byte offsets beyond where "si_band" resides (without any data
structure changes).


Comment 27 Ernie Petrides 2006-03-23 03:38:23 UTC
For reference, the "si_band" field ("_band" in "_sigpoll" part of union) was
changed from type "int" to "long" (only on x86_64) upstream in 2.4.22, the
release following the one on which the RHEL3 kernel is based.

Comment 29 Stephen Tweedie 2006-03-23 16:24:57 UTC
Oh dear, this is all *soooo* broken.  Upstream changed a field --- which is an
enum that can only hold 5 legal values anyway --- from int to long.  But POSIX
requires the long.  And we added a compatibility option to let some archs keep
it an int for legacy compatibility's sake, but x86_64 chose to break it instead.

Trouble is, this really leaves us with no authoritative definition of what it
*should* be on 2.4.  I guess we really have to hold the "long" authoritative.  

I kindof like Ray's patch.  It fixes existing users of si_fd.  It won't fix
users of si_band, though --- si_band will still contain the old si_fd in its
upper bits, so user-space comparisons of (si_band == POLL_*) will still fail.  
 But it fixes the problem at hand in a clean enough way, and it's probably the
most compatible fix we can achieve right now.


Comment 30 Ernie Petrides 2006-03-24 01:49:47 UTC
I have finished researching the 2.4/RHEL-3 aspects of this predicament.

First and foremost, there has never been a RHEL-3 regression -- this problem
has existed since the initial release of RHEL-3 (kernel version 2.4.21-4.EL).

Secondly, I have verified that the si_band field of the user-visible siginfo_t
in every single RHEL-3 Update (as well as GOLD) for every single architecture
has been declared as type (long int).  That field (which immediately precedes
si_fd, the one most needed when handling a SIGPOLL/SIGIO) only holds a bitmask
of 11 possible flags (in bits 0 - 10).

Thirdly, the RHEL-3 kernel arches that use an inconsistently sized si_band
field (with respect to the siginfo_t in /usr/include/bits/siginfo.h) are
x86_64, ppc64, and s390x.  (Specifically, ia64 is not broken.)

Fourthly, I'm still against *changing* the kernel's siginfo_t declaration
because that would break applications compiled with the original kernel's
layout (also available to user-space in /usr/include/asm/siginfo.h).

Finally, there's only one place in the whole kernel where the si_fd field
is assigned (excluding 32-bit-compat support): send_sigio_to_task().

Thus, I feel the best way to resolve the issue is for the kernel to make a
2nd copy of the si_fd value 4 bytes beyond where it thinks it should be on
x86_64, ppc64, and s390 *without* changing any of the siginfo_t declarations.
This works because the si_fd field is last in the _sigpoll part of the union,
and there is a pad of (128 - 5 * sizeof(int)) bytes in the siginfo_t beyond
where si_fd is being stored on the 3 broken arches.

The benefit is that applications using the 4-byte offset (between si_band and
si_fd) will continue to work, those using the 8-byte offset will begin to work
without recompilation, and those that either don't use si_band (such as "fam")
or properly use it as a bitmask (only doing tests with the "&" operator and
not with "==" or "!=") will continue to work.

I'll attach an appropriate kernel patch following testing.  (It will capture
the same idea as Ray's patch in comment #28, but implemented differently.)


Comment 31 Ernie Petrides 2006-03-24 05:48:38 UTC
Created attachment 126605 [details]
siginfo_t work-around patch

This patch has been posted to rhkernel-list for review.

Comment 32 Jakub Jelinek 2006-03-24 07:06:32 UTC
Please note that ppc64 and s390x is big endian, so si_band in that case won't
work at all.

Comment 33 Ernie Petrides 2006-03-25 01:51:33 UTC
Rats.  Okay, thanks for the note, Jakub.

I still think we should proceed with the patch in comment #31, because
it at least makes si_fd visible in both versions of the siginfo_t.  Also,
since our ppc64 release runs with a 32-bit user-space, applications
shouldn't run into any problems there.


Comment 34 Ray Strode [halfline] 2006-03-25 05:51:50 UTC
Thanks for all the research and quick turn around Ernie.

Comment 35 Ernie Petrides 2006-03-28 01:23:34 UTC
A fix for this problem has just been committed to the RHEL3 U8
patch pool this evening (in kernel version 2.4.21-40.5.EL).


Comment 38 Alexander Larsson 2006-04-28 07:49:34 UTC
Will this be in RHEL3 U8? I need to know, because otherwise i need to fix fam.


Comment 39 Ernie Petrides 2006-04-28 18:15:13 UTC
Alexander, it already is in U8, as is evident from comment #35 along
with the fact that this bug is in MODIFIED state.

Comment 42 Red Hat Bugzilla 2006-07-20 13:55:11 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0437.html