Bug 162124 - 100% reproducible kernel death (no oops, everything just stops)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 4
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Dave Jones
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2005-06-30 06:00 UTC by James Antill
Modified: 2015-01-04 22:20 UTC (History)
CC List: 5 users

Fixed In Version: kernel
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-10-06 21:48:18 UTC
Type: ---
Embargoed:


Attachments
strace of process that kills kernel (server) (11.69 KB, text/plain)
2005-07-06 04:27 UTC, James Antill
Potential fix for socket filter hang (2.18 KB, patch)
2005-07-08 23:51 UTC, David Miller
Server, creates accept socket (prints bound port) and adds socket filter (3.56 KB, text/plain)
2005-07-20 00:07 UTC, James Antill
connects to port and sends data (2.20 KB, text/plain)
2005-07-20 00:08 UTC, James Antill

Description James Antill 2005-06-30 06:00:24 UTC
Description of problem:
 I type "make check" on my code, and kernel dies :)

Version-Release number of selected component (if applicable):
kernel(0:2.6.11-1.1369_FC4).i686

How reproducible:
100%

Steps to Reproduce:
1. download http://www.and.org/vstr/vstr-1.0.14james.tar.bz2
2. untar, and cd inside
3. mkdir j; cd j; ../scripts/b/def-tst.sh
  
Actual results:
 kernel death at start of third configuration of ex_httpd tests in example
directory.

Expected results:
happy, working kernel :)

Additional info:
 You'll need socket_poll-1.0.1 and timer_q-1.0.5 installed, or the httpd example
won't be compiled. I have rpms.

 It's not obvious to me whether it happens at the start of the third
configuration of tests or at the end of the second (the cleanup tells the
configuration to exit, but that just closes the listen sockets and waits for
the other connections to die).

 You can test any configuration manually by doing...

./ex_httpd -P0 --cntl-file=ex_httpd_cntl <printed options>

...and then in another window...

SRCDIR=../../examples ../../examples/tst/tst_httpd.pl ex_httpd_cntl setup vhosts
cleanup

...also note that the second configuration uses mmap(), so you'll want to 
add a "trunc" option before vhosts (or it'll do truncate tests on some of the
files and SEGV).

 Feel free to ask for any more info.

Comment 1 James Antill 2005-06-30 19:12:09 UTC
 I just tried 2.6.12-1.1385_FC4 from updates-testing; it dies in the same way.


Comment 2 James Antill 2005-07-06 04:24:50 UTC
 It's the third configuration that kills it. I assume it has something to do
with socket filters, because that's one of the main differences in that
configuration, and if I remove that option the kernel doesn't die.

 Here is the end of the strace to the screen from over ssh:

writev(1, [{"[", 1}, {"Jul  5 19:50:54", 15}, {"]: ", 3}, {"READY
[0.0.0.0@32781]: ex_httpd_"..., 38}], 4[Jul  5 19:50:54]: READY [0.0.0.0@32781]:
ex_httpd_root/
) = 57
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 2, 0) = 0
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN, revents=POLLIN}], 2, -1) = 1
gettimeofday({1120607465, 601546}, NULL) = 0
accept(4, {sa_family=AF_FILE, path=@}, [2]) = 5
gettimeofday({1120607465, 602636}, NULL) = 0
fcntl64(5, F_SETFD, FD_CLOEXEC)         = 0
fcntl64(5, F_GETFL)                     = 0x2 (flags O_RDWR)
fcntl64(5, F_SETFL, O_RDWR|O_NONBLOCK)  = 0
time(NULL)                              = 1120607465
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1267, ...}) = 0
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1267, ...}) = 0
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1267, ...}) = 0
writev(1, [{"[", 1}, {"Jul  5 19:51:05", 15}, {"]: ", 3}, {"CNTL CONNECT
from[local]\n", 25}], 4[Jul  5 19:51:05]: CNTL CONNECT from[local]
) = 44
accept(4, 0xbfddb9f4, [2])              = -1 EAGAIN (Resource temporarily
unavailable)
readv(5, 0xa0991f0, 6)                  = -1 EAGAIN (Resource temporarily
unavailable)
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}], 3,
0) = 0
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN,
revents=POLLIN}], 3, -1) = 1
gettimeofday({1120607465, 608434}, NULL) = 0
readv(5, [{"6:STATUS,should be std. in /etc/"..., 120}, {"     
zsync\nimage/x-icon        "..., 120}, {"text/plain MIME type\ntext/plain "...,
120}, {"d because\n# we don\'t default to "..., 120}, {"xt/plain  ./Makefile
./Makefile."..., 120}, {"tension, allow files to be filte"..., 120}], 6) = 9
getpid()                                = 2251
writev(5, [{"64:6:", 5}, {"STATUS", 6}, {",11:from[local],24:ctime[1120607"...,
62}, {"STATUS", 6}, {",19:from[0.0.0.0@32781],24:ctime"..., 71}], 5) = 150
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}], 3,
0) = 0
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN,
revents=POLLIN|POLLHUP}], 3, -1) = 1
gettimeofday({1120607465, 615069}, NULL) = 0
readv(5, [{"6:STATUS,should be std. in /etc/"..., 120},
{",19:from[0.0.0.0@32781],24:ctime"..., 120},
{",11:from[local],24:ctime[1120607"..., 120}, {"4294967264:6:94967295:on       
"..., 120}, {"xt/plain  ./Makefile ./Makefile."..., 120}, {"tension, allow files
to be filte"..., 120}], 6) = 0
close(5)                                = 0
time(NULL)                              = 1120607465
writev(1, [{"[", 1}, {"Jul  5 19:51:05", 15}, {"]: ", 3}, {"CNTL FREE
from[local] req_got[1:"..., 48}, {"recv[9B:9] send[150B:150]\n", 26}], 5[Jul  5
19:51:05]: CNTL FREE from[local] req_got[1:1] req_put[1:1] recv[9B:9] send[150B:150]
) = 93
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 2, 0) = 0
poll(Read from remote host sophia.and.org: Connection reset by peer


...I'm also uploading the full output of strace to disk (which ends much sooner).
 I've tried to create a small test case that uses the same socket filter, but
that doesn't fail.


Comment 3 James Antill 2005-07-06 04:27:19 UTC
Created attachment 116398
strace of process that kills kernel (server)

 I assume it's something on the server side; the readv() just before the
FILTER_ADD is the data for the filter. As I said before, all the code can be
downloaded, and this is 100% reproducible.

Comment 4 Dan Carpenter 2005-07-06 22:39:14 UTC
Is this related to bug 162388?

I think a couple of other people have reported iptables problems with
the 2.6.11 kernel, but I couldn't find the bug numbers right off the bat.


Comment 5 James Antill 2005-07-07 02:33:47 UTC
 One of the machines I've run this on (it's killed all fc4 machines I've tried)
has zero iptables rules, although the ip_tables module is loaded.
 I'll probably test again with the module unloaded tonight, but I'd hope it
wouldn't do that if you didn't have at least one rule.


Comment 6 Dave Jones 2005-07-08 22:56:44 UTC
Do you get a kernel oops dump at all when it crashes?


Comment 7 David Miller 2005-07-08 23:48:10 UTC
Bug 162388 is a different problem: when that is triggered, traffic
simply doesn't behave as one desires; it doesn't crash the box.

What is the exact socket filter being loaded?
It's probably just some logic error in the socket
filter code. Patrick McHardy found a bunch of
errors there recently; I will attach those patches
if I get a chance.


Comment 8 David Miller 2005-07-08 23:51:44 UTC
Created attachment 116553
Potential fix for socket filter hang

Comment 9 David Miller 2005-07-08 23:54:15 UTC
Can someone try out the test case with the patch attachment
in comment #8 applied and see if that fixes it?

Thanks.


Comment 10 James Antill 2005-07-10 23:27:38 UTC
 davej: There is no oops, I run the test and everything stops (mouse pointer
doesn't even move).

 davem:

 Well I haven't tried it, but I don't see how that can be the problem.

1. The only code in the socket filter is the two instructions: 

 load LEN into A, return A

...and as part of trying to make a simple test I tried loading the same socket
filter onto a listening socket ... but couldn't trigger it.

2. The absolute address would have to be in the range [INT_MAX-len+1..INT_MAX],
so even if the socket filter is compiled wrong (I don't think it is, but I had
to write my own compiler) hexdump shows no numbers in that range.
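
The two points above can be sketched in a few lines. This is a minimal
illustration in Python, not the compiler or filter from the actual test suite:
the opcode values are the classic BPF constants from <linux/filter.h>, and the
helper names (len_filter, encode, risky_offsets) are hypothetical. It encodes
the two-instruction "load LEN into A, return A" program and then checks that
none of its k operands falls in [INT_MAX-len+1..INT_MAX], which is exactly the
set of offsets for which k + len would exceed INT_MAX.

```python
import struct

# Classic BPF opcode pieces, values as defined in <linux/filter.h>.
BPF_LD, BPF_RET = 0x00, 0x06   # instruction classes
BPF_W = 0x00                   # word-sized load
BPF_LEN = 0x80                 # addressing mode: packet length
BPF_A = 0x10                   # return value source: accumulator A

INT_MAX = 2**31 - 1

# struct sock_filter { __u16 code; __u8 jt; __u8 jf; __u32 k; }
INSN = struct.Struct("=HBBI")


def len_filter():
    """Two instructions: A = packet length; return A (keep the whole packet)."""
    return [
        (BPF_LD | BPF_W | BPF_LEN, 0, 0, 0),
        (BPF_RET | BPF_A, 0, 0, 0),
    ]


def encode(insns):
    """Pack (code, jt, jf, k) tuples into the byte layout the kernel expects."""
    return b"".join(INSN.pack(*insn) for insn in insns)


def risky_offsets(insns, length):
    """Return the k operands that lie in [INT_MAX - length + 1, INT_MAX]."""
    lowest = INT_MAX - length + 1
    return [k for (_code, _jt, _jf, k) in insns if lowest <= k <= INT_MAX]


if __name__ == "__main__":
    program = len_filter()
    print(encode(program).hex())
    print(risky_offsets(program, 1500))
```

Whether the patch in comment #8 guards exactly this expression is an
assumption; the sketch only shows that a filter whose operands are all zero
cannot land in that range.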



Comment 11 David Miller 2005-07-11 01:54:49 UTC
Please try out what I've asked you to test, even if it may seem
unlikely that the fix I'm pointing out is relevant.


Comment 12 Dave Jones 2005-07-15 21:11:55 UTC
[This comment has been added as a mass update for all FC4 kernel bugs.
 If you have migrated this bug from an FC3 bug today, ignore this comment.]

Please retest your problem with today's 2.6.12-1.1398_FC4 update.

If your problem involved being unable to boot, or some hardware not being
detected correctly, please make sure your /etc/modprobe.conf is correct *BEFORE*
installing any kernel updates.
If in doubt, you can recreate this file using:

mv /etc/sysconfig/hwconf /etc/sysconfig/hwconf.bak
mv /etc/modprobe.conf /etc/modprobe.conf.bak
kudzu


Thank you.


Comment 13 James Antill 2005-07-19 22:55:48 UTC
kernel(0:2.6.12-1.1398_FC4).i686

 still dies 100% of the time.


Comment 14 James Antill 2005-07-20 00:02:50 UTC
 Ok, I've tracked it down and got a small reproducer, the problem happens if you:

 1) have a socket filter installed on an accept socket.
 2) connect and send large amounts of data.

...I'm uploading a client (in perl) and a server (in C) ... run the server, note
the port, then run the client with the port as the argument ... the kernel will
be dead; again no oops, just nothing works.

% wc -l fc4_kern_die_client.pl fc4_kern_die_serv.c
 110 fc4_kern_die_client.pl
 173 fc4_kern_die_serv.c
 283 total
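
For readers without the attachments, the two steps can be approximated as
below. This is a rough sketch in Python, not the attached C server or perl
client: all function names are hypothetical, SO_ATTACH_FILTER = 26 is the Linux
value on x86, the filter is assumed to be the two-instruction "load LEN,
return A" program from comment 10, and the sketch runs client and server in one
process on 127.0.0.1 instead of printing a port for a separate client.

```python
import ctypes
import socket
import struct
import threading

SO_ATTACH_FILTER = 26  # Linux-specific; value on x86 from <asm/socket.h>


def attach_len_filter(sock):
    """Attach the assumed filter: A = packet length; return A."""
    insns = struct.pack("=HBBI", 0x80, 0, 0, 0) + struct.pack("=HBBI", 0x16, 0, 0, 0)
    buf = ctypes.create_string_buffer(insns, len(insns))
    # struct sock_fprog { unsigned short len; struct sock_filter *filter; }
    fprog = struct.pack("HP", 2, ctypes.addressof(buf))
    sock.setsockopt(socket.SOL_SOCKET, SO_ATTACH_FILTER, fprog)


def make_listener(with_filter=True):
    """Step 1: an accept socket, optionally with the socket filter installed."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    if with_filter:
        attach_len_filter(srv)
    srv.listen(5)
    return srv


def drain(srv, received):
    """Accept one connection and count the bytes read until EOF."""
    conn, _addr = srv.accept()
    with conn:
        while True:
            data = conn.recv(65536)
            if not data:
                break
            received[0] += len(data)


def send_bulk(port, total, chunk=4096):
    """Step 2: connect and send at least `total` bytes in `chunk`-sized writes."""
    payload = b"x" * chunk
    sent = 0
    with socket.create_connection(("127.0.0.1", port)) as client:
        while sent < total:
            client.sendall(payload)
            sent += chunk
    return sent


def run_once(total=1 << 20, with_filter=True):
    """Run both steps once and return the number of bytes the server read."""
    srv = make_listener(with_filter)
    received = [0]
    reader = threading.Thread(target=drain, args=(srv, received))
    reader.start()
    send_bulk(srv.getsockname()[1], total)
    reader.join()
    srv.close()
    return received[0]


if __name__ == "__main__":
    print(run_once())
```

On a kernel without the bug this should simply report the byte count; the
amount of data needed to trigger the hang on an affected kernel is not stated
here, so the 1 MiB default is a guess.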


Comment 15 James Antill 2005-07-20 00:07:28 UTC
Created attachment 116954
Server, creates accept socket (prints bound port) and adds socket filter

 To compile use: gcc -Wall -W -o fc4_kern_die_serv fc4_kern_die_serv.c

Comment 16 James Antill 2005-07-20 00:08:12 UTC
Created attachment 116955
connects to port and sends data

 To run use: ./fc4_kern_die_client.pl <port>

Comment 17 James Antill 2005-07-20 00:31:16 UTC
 It's possibly worth noting that the patch in comment #8 isn't in
kernel(0:2.6.12-1.1398_FC4).i686, but given how I have to hack the spec file
just to install the .src.rpm, and the fact I don't have any test machines to
keep rebooting for fixes I'm 99% sure aren't the problem ... I'm unlikely to
test it in the near future.


Comment 18 Dave Jones 2005-08-26 08:38:40 UTC
That patch is in the latest errata kernel in updates-testing fwiw.


Comment 19 Dave Jones 2005-09-30 06:18:01 UTC
Mass update to all FC4 bugs:

An update has been released (2.6.13-1.1526_FC4) which rebases to a new upstream
kernel (2.6.13.2). As there were ~3500 changes upstream between this and the
previous kernel, it's possible your bug has been fixed already.

Please retest with this update, and update this bug if necessary.

Thanks.


Comment 20 James Antill 2005-10-06 21:48:18 UTC
2.6.13-1.1526_FC4 seems to fix it.


