Bug 43500 - kfree_skb passed an skb still on a list
Status: CLOSED CURRENTRELEASE
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.1
Hardware: i686 Linux
Priority: medium
Severity: high
Assigned To: Arjan van de Ven
QA Contact: Brock Organ
Reported: 2001-06-05 06:58 EDT by Thomas Bjorseth
Modified: 2008-08-01 12:22 EDT (History)
Doc Type: Bug Fix
Last Closed: 2004-09-30 11:39:01 EDT
Description Thomas Bjorseth 2001-06-05 06:58:43 EDT
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)

Description of problem:
After some time (never more than two weeks), one of our servers stops 
responding (freezes completely), and a reboot is needed to get the system 
back up and running.

The server in question is an IBM xSeries eServer 330 with dual PIII-800MHz 
and 1 GB ECC ram. Both CPUs have the same family, model and stepping. No 
bugs are reported in /proc/cpuinfo.

We are running a self-compiled 2.2.19 kernel, but with no special or 
experimental support compiled into the kernel.

How reproducible:
Always

Steps to Reproduce:
1. Boot server
2. Wait for anywhere between a few hours and 10-12 days
3. System hangs


Actual Results:  After a few days up and running, the server freezes. With 
100GB of disk, fsck takes quite some time, and our users get angry and 
frustrated waiting for the server to get back online.

Expected Results:  The server should keep on truck^H^H^H^H^Hrunning.

Additional info:

The error message is as follows (abbreviated - stacks and registers 
snipped):
warning: kfree_skb passed an skb still on a list (from c0267fbf)
Unable to handle kernel NULL pointer dereference at virtual address 00000004
current -> tss.cr3 = 00101000, %cr3=00101000
*pde = 00000000
Oops: 0002
CPU: 1
EIP: 0010[<80185481>]
EFLAGS: 00010047
<snipped registers and stack... available if wanted>
Code: 89 58 04 89 03 c7 02 00 00 00 00 c7 42 04 00 00 00 00 c7 42
Aiee, killing interrupt handler
Kernel panic: Attempted to kill the idle task!
In swapping task - not syncing
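
A note on reading the abbreviated oops above, offered as an interpretation rather than a confirmed diagnosis. The first bytes of the Code: line, "89 58 04 89 03", disassemble on i386 to "mov %ebx,0x4(%eax)" followed by "mov %eax,(%ebx)", which has the shape of a doubly linked list unlink (next->prev = prev; prev->next = next). If the "next" pointer were NULL, the first store would hit address 0 + 4, which would match the reported fault address 00000004. The sketch below only checks that arithmetic and decodes the "Oops: 0002" and EFLAGS values. The struct skb_head32 type and all function names are hypothetical, and the layout assumes that next and prev are the first two 32-bit pointer fields of the buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the start of a 2.2-era sk_buff on i386.
 * Pointers are modelled as uint32_t so the offsets match a 32-bit
 * kernel even when this is compiled on a 64-bit host. */
struct skb_head32 {
    uint32_t next;  /* offset 0 */
    uint32_t prev;  /* offset 4 */
    uint32_t list;  /* offset 8 */
};

/* i386 page-fault error code bits (the number printed after "Oops:"). */
#define PF_PROT   0x1u  /* 0 = page not present, 1 = protection fault */
#define PF_WRITE  0x2u  /* 0 = read access,      1 = write access     */
#define PF_USER   0x4u  /* 0 = kernel mode,      1 = user mode        */

/* EFLAGS interrupt-enable flag. */
#define EFLAGS_IF 0x200u

static int pf_page_not_present(unsigned err) { return (err & PF_PROT) == 0; }
static int pf_is_write(unsigned err)         { return (err & PF_WRITE) != 0; }
static int pf_is_kernel(unsigned err)        { return (err & PF_USER) == 0; }
static int irqs_enabled(uint32_t eflags)     { return (eflags & EFLAGS_IF) != 0; }

/* Address written by "next->prev = prev" when next holds next_ptr. */
static uint32_t unlink_store_address(uint32_t next_ptr)
{
    return next_ptr + (uint32_t)offsetof(struct skb_head32, prev);
}
```

With the values from this report, pf_page_not_present(0x0002), pf_is_write(0x0002) and pf_is_kernel(0x0002) all hold, unlink_store_address(0) yields 0x00000004, and irqs_enabled(0x00010047) is false. Read that way, the oops is a kernel-mode write through a NULL list pointer with interrupts disabled, which fits the preceding "skb still on a list" warning but does not by itself identify the caller.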
Comment 1 Arjan van de Ven 2001-06-05 08:02:00 EDT
A ksymoopsed version of the oops would be very welcome.

Dave: does this look familiar? (Feel free to assign the bug back if not.)
Comment 2 Arjan van de Ven 2001-06-05 08:18:47 EDT
"We are running a self-compiled 2.2.19 kernel, but with no special or 
experimental support compiled into the kernel."

Is that the RH 2.2.19 or an "upstream" 2.2.19?
And are you willing to try out a patch?
Comment 3 Thomas Bjorseth 2001-06-05 08:35:30 EDT
It's a vanilla "kernel.org" 2.2.19 kernel, not a RH kernel.

Before we try out any patches, it would be nice to know what the patch is 
supposed to do. The server is in full production, and we don't want to risk any 
instability in addition to the occasional halt described in this bugzilla 
report.

The server seems to be running OK with a reboot every night, BTW. Could this 
indicate a resource leak of some sort?
Comment 4 Arjan van de Ven 2001-06-05 08:44:40 EDT
That or some timeout which happens to be > 24 hours ;)

The patch I propose is a change that will also be in 2.2.20 whenever that comes
out.
Comment 5 Thomas Bjorseth 2001-06-05 09:30:16 EDT
Send me the patch, and we'll see what happens... Can't guarantee anything 
today, but maybe tomorrow or Thursday.
Comment 6 Thomas Bjorseth 2001-11-22 12:36:28 EST
Long time no update. Since the last update, we have tried several kernels:

- 2.2.16smp
- 2.2.16enterprise
- 2.2.16 self compiled
- 2.2.19 self compiled
- 2.4.12 self compiled
- 2.4.14 self compiled

The server is running 2.4.14 now, but the only way to get it 100% stable is to 
remove one CPU. With two CPUs it freezes after anything from 1 day to 1 month.

BTW, it's a NetFinity 5600 with dual PIII-800, 1 GB RAM and an Adaptec 3200S 
RAID controller (lost faith in IBM ServeRAIDs after a while).
Comment 7 Arjan van de Ven 2001-11-26 04:24:19 EST
What network driver are you using?
(As you're the only one seeing this problem, something must be different.)
Comment 8 Thomas Bjorseth 2001-11-26 04:59:47 EST
The network drivers are one "3c59x" (main) and one "dmfe" (crossover to webmail 
server). We are using TCP/IP only, with an NFS mount from one server to the 
other. No ipchains, iptables, netfilter or pppoe.

We are not alone with this bug, btw. I have seen one more report in a 
Norwegian usenet group (no.it.os.unix.linux.diverse - read by at least one 
RedHat employee), and the reporter got the error on both dual PIIIs and 
dual Athlon MPs using different 2.4 kernels. This guy was using the following 
libraries: libc-5.3.12-31, glibc-devel-2.2.4-18.7.0, glibc-2.2.4-18.7.0, 
glibc-common-2.2.4-18.7.0, compat-libstdc++-6.2-2.9.0.9 on a RedHat 7.0 system 
(which he has modified).

We are only running libraries supplied with the RedHat distributions we have 
tried, to keep it as simple as possible.
Comment 9 Bugzilla owner 2004-09-30 11:39:01 EDT
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/
