114209 – kernel livelock

Bug 114209 - kernel livelock

Summary: kernel livelock

Keywords:
Status:	CLOSED DUPLICATE of bug 114960
Alias:	None
Product:	Red Hat Enterprise Linux 2.1
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	2.1
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Dave Anderson
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2004-01-23 23:34 UTC by Need Real Name
Modified:	2007-11-30 22:06 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-02-21 19:00:55 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
gzip'ed output of show mem (621 bytes, application/x-tar) 2004-01-23 23:47 UTC, Need Real Name	no flags	Details
System.map file requested (459.49 KB, text/plain) 2004-02-06 16:10 UTC, Kambiz Aghaiepour	no flags	Details
View All

Description Need Real Name 2004-01-23 23:34:15 UTC

Description of problem:

Userspace is dead; the box pings, and sysrq can interact with it, but
that is all.

Version-Release number of selected component (if applicable):

Seen most recently under 2.4.9-e.36enterprise, but we have had similar
issues under e.34 and e.35; not tested with any prior kernels, nor
were we able to capture debugging info for e.34/e.35.

How reproducible:

We have had hangs of this nature every few days for 2-3 weeks now.

Steps to Reproduce:
Undetermined right now.

Additional info:

The machines in question are Dell 2650's, with the onboard dual tg3
interfaces plus an offboard quad e1000 interface.  Alan mentioned that
this is probably a bug in a network driver based on these messages:

Warning: kfree_skb on hard IRQ f8ac114a

I'll be attaching all the debugging information that I've been able to
collect so far.

Comment 5 Dave Anderson 2004-02-03 16:48:10 UTC

Has anybody got this vmcore-incomplete file?  I downloaded it from
the enterprise.redhat.com location referenced in the Issue Tracker,
but when I try to unzip it I get:

# bunzip2 vmcore-incomplete.bz2

bunzip2: Compressed file ends unexpectedly;
        perhaps it is corrupted?  *Possible* reason follows.
bunzip2: No such file or directory
        Input file = vmcore-incomplete.bz2, output file =
vmcore-incomplete

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

bunzip2: Deleting output file vmcore-incomplete, if it exists.

Comment 6 Frank Hirtz 2004-02-03 17:29:04 UTC

I hadn't done more to it than move it into place. When I checked, it
would appear that it is indeed corrupt. Does the local copy that you
have also show this, or is it specific to the file on the server?

Comment 7 Dave Anderson 2004-02-03 17:42:12 UTC

my local copy

Comment 8 Dave Anderson 2004-02-03 20:08:32 UTC

I can't speak to any of the network issues involved here -- and
whether they're red herrings or not -- but what's interesting to me
is that there are 26 processes out of the ~140 shown in the
Alt-Sysrq-T output that are blocked in kmap_high(), unable to get any
of the 512 pkmap entries.  Assuming that the ~140 task count is near
correct, it seems unusual that those 26 cannot get an unused pkmap
entry unless then there were also ~512 processes that have done a
kmap_high, and are doing something time-consuming with the temporarily
mapped page, and are blocked waiting for the I/O (?) to complete.

I suppose there's a possibility that pkmap entries could be leaked
(kmap w/o a follow-up kunmap), which would leave them in a "2" state,
and when they end up using all entries, the system would behave as if
it were in a livelock.  

A dump would tell us that, it's impossible to tell without it.

Comment 9 Dave Anderson 2004-02-04 15:48:26 UTC

We've pretty much reached a dead-end given the current amount of
debug data (i.e, without a useable netdump).

I initially had some concerns about the wait_on_irq() timeout that
generated the active-CPU traces that got interspersed with the
Alt-Sysrq-T traces, because they had the enormous amount of network
related functions listed.  But as it turns out, the "traces" that are
displayed when generated from the wait_on_irq() show all kernel return
addresses starting from the first byte above the task_struct, i.e.,
all return addresses that *ever* were left on the stack.  When traces
are generated from Alt-Sysrq-T, they start at the stack location where
the process last called schedule(), so you get a relatively realistic
trace (with potentially stale return addresses intermingled). So those
inordinately long traces are definitely a red herring.  And, BTW, the
wait_on_irq timeout was generated because global_cli() was being
called from a timer-related bottom half on CPU 1, and it got sick of
waiting for the Alt-Sysrq-T interrupt handling to complete on CPU 3.  

As far as the __kfree_skb() warnings, yes, it is unwise to allow it to
happen in interrupt context because it is possible to get blocked
forever in __free_pages_ok() (or perhaps elsewhere) because it can
take a spin lock that could conceivably already be held by the
interrupted context.  If that happened, that CPU would be permanently
blocked in spin_lock.  Howerver, the Alt-Sysrq-P output shows that all
4 CPUs responded at least once, so all of the suspect __kfree_skb()
calls that were made while in interrupt context did in fact make it
out alive.

So -- where do we go from here?

We obviously could use a netdump the next time it happens...

I want to suggest an Alt-Sysrq-W, which is like Alt-Sysrq-P, but sends
an IPI to each other CPU to print there stack traces.  But it runs the
risk of waiting forever if one of the other CPUs is blocked with
interrupts disabled -- and needing a resource that the interrupted
CPUs holds.  From that point on, Alt-Sysrq capability would be dead.  

But as I mentioned above, if several Alt-Sysrq-P calls manage to
interrupt each CPU, then there's no wedge on any of them.  The
Alt-Sysrq-P debug data that Andrew attached does in fact show that all
4 CPUs were capable of receiving the keyboard interrupt.  So before
killing the system with an Alt-Sysrq-C, please again take Alt-Sysrq-T
and several Alt-Sysrq-P's, at least enough to show that all CPUs are
responding.  The Alt-Sysrq-M showed plenty of memory available in this
case, but it's always worth taking at least one of them as well.

I'd like to persue the kmap_high issue I mentioned above in my
previous post.  I'd really like to know the state of the
pkmap_count[512] array when the system wedges.  But short of a
netdump, there's no way to tell what's there unless we add some debug
code to one of the Alt-Sysrq handlers to dump it out.  If they're all
stuck at "2", then we've got a problem to attack.  If they're all
stuck at "1", then the wakeup mechanism must have a hole in it.

In any case, I'll put together an .36 enterprise kernel that has this
additional data available, probably in the Alt-Sysrq-M output.  I'll
also show the number of processes, just in case the Alt-Sysrq-T didn't
display them all.

In the meantime, I guess all we can do is wait for it to happen again,
take the same Alt-Sysrq- m, p's, and t output, and then force a
netdump with Alt-Sysrq-c.  If you can afford to take the system down
in the meantime, I will give you the debug kernel that will give the
additional data.

Comment 10 Need Real Name 2004-02-04 16:46:32 UTC

Well, the bad news is that we had another lockup this morning.  The
good news is that I stumbled across sysrq-W before I saw this update
and have that output as well.

I'm waiting for the netdump to complete currently, and I'll compress
and upload that file when its done.  Hopefully this one will be usable.

When that kernel is available, we can put it onto these machines as
well in case netdumps continue to be unusable.

Comment 11 Jason Baron 2004-02-04 19:18:10 UTC

is nfs involved?

Comment 12 Dave Anderson 2004-02-04 19:33:24 UTC

Depends what you mean by involved...

Of the 26 kmap_high blockages seen in the traces, 9 of them are coming
from nfs routines; the others are from generic_file_read or from
journaling routines.

Comment 13 Dave Anderson 2004-02-04 20:24:29 UTC

Many thanks to Jason -- this looks to be the same kmap_high()
issue seen in Bugzillas #89453 and #58216, and Issue Tracker #33097.

Given that the Alt-Sysrq-T shows that there are two kmap_high()
blockers coming from nfs_read_rpcsetup(), it would appear that they
are taking all the pkmap entries they can get via kmap() -- but not
getting all they need -- so they end up blocking, waiting for someone
to free one, which never occurs:

static void
nfs_read_rpcsetup(struct list_head *head, struct nfs_read_data *data)
{
        struct nfs_page         *req;
        struct iovec            *iov;
        unsigned int            count;

        iov = data->args.iov;
        count = 0;
        while (!list_empty(head)) {
                struct nfs_page *req = nfs_list_entry(head->next);
                nfs_list_remove_request(req);
                nfs_list_add_request(req, &data->pages);
===>            iov->iov_base = kmap(req->wb_page) + req->wb_offset;
                iov->iov_len = req->wb_bytes;
                count += req->wb_bytes;
                iov++;
                data->args.nriov++;
        }

Steve Dickson is looking into putting a patch together to address
this.  I'd still like to see the netdump just to verify that all the
pkmap entries are in fact taken.

Comment 14 Dave Anderson 2004-02-04 20:27:02 UTC

Note from Steve:

> It turns out that the correct patch is in rhel3....
> I'll just back port enough of that patch to fix the
> the problem... in theory... ;-)'
>
> I should have something shortly....
>
> SteveD.

Comment 15 Kambiz Aghaiepour 2004-02-05 14:02:20 UTC

I tried to upload a netdump from the server but got the following:

ftp> cd outgoing/redhat-2182/incoming/
250 Directory successfully changed.
ftp> dir
227 Entering Passive Mode (172,16,48,193,52,157)
150 Here comes the directory listing.
226 Transfer done (but failed to open directory).
ftp> bin
200 Binary it is, then.
ftp> put 10.10.2.221-2004-02-04-08-33.tar.bz
local: 10.10.2.221-2004-02-04-08-33.tar.bz remote:
10.10.2.221-2004-02-04-08-33.tar.bz
227 Entering Passive Mode (172,16,48,193,50,159)
150 Go ahead make my day^W^W^Wsend me the data.
netout: Connection reset by peer
451 Failure writing to local file.
ftp> 221 Goodbye.

Comment 16 Frank Hirtz 2004-02-05 16:54:33 UTC

Hmm. I've not had any reports of issues with the server, nor are there
space problems at present (~7.0G free on the pertinent filesystem). I
see that it uploaded ~460M of data. Could you give it another try?

Comment 17 Dave Anderson 2004-02-05 20:54:52 UTC

We have a kernel that we would like you to test, which is located in:

 http://people.redhat.com/anderson/.bugzilla_114209

There are four files there, consisting of an enterprise kernel
binary rpm:

 kernel-enterprise-2.4.9-e.36.kmap3.i686.rpm

The kernel contains these 3 patches:
  
 linux-2.4.9-kmap_nonblock.patch  
 linux-2.4.9-nfs-kmapfix.patch
 kmap_high_debug.patch

FYI:

 1. The "kmap_nonblock" patch adds a new kmap_nonblock() interface.
 2. Steve D's "nfs-kmapfix" patch uses the new interface.
 3. The kmap_high_debug.patch adds 3 additional lines of output to
    Alt-Sysrq-m.  

The Alt-Sysrq-m additional output consists of:

 ...
 56 threads
 pkmap_count: 0: 113  1: 399  2: 0  2+: 0
 kmap_high_nonblock calls: 13157  failures: 0

The thread count is probably of no use, but the pkmap_count array data
shows the current state of the 512 pkmap entries:

  0:  unused, 
  1:  mapped/unused, 
  2:  mapped/used-by-one-process,
  2+: mapped/used-by-more-than-one-process)

And lastly, the count of kmap_high_nonblock() calls, and how many of
those calls resulted in a failure.

This debug kernel is only to be used in-house, and must not be made
available to any customer.

Comment 18 Kambiz Aghaiepour 2004-02-06 01:29:17 UTC

I'll see about getting the new kernels on the box(es). Meanwhile,

ftp://enteprise.redhat.com/outgoing/redhat-2182/incoming/10.10.2.221-2004-02-04-08-33.tar.bz

uploaded for your examination.

Kambiz

Comment 19 Matthew Galgoci 2004-02-06 01:48:50 UTC

Additionally, the md5sum of the uploaded file is the following:

ea1176e7249d44a32ef7685c93ac8e3e  10.10.2.221-2004-02-04-08-33.tar.bz

Comment 20 Dave Anderson 2004-02-06 15:59:36 UTC

Can you please make the System.map file associated with this last dump
available?  It's needed for AS2.1-type crash analysis.

The log file shows all the earmarks of the kmap_high() livelock that
the new debug kernel is trying to address.  18 of the 144 processes
are blocked in kmap_high(), with the likely pkmap entry hog being
this one:

rpciod        D 00000202  3564   948      1           975   949
(L-TLB)
Call Trace: [<c01445ad>] kmap_high [kernel] 0xad (0xf593de44)
[<f8b09bd1>] xprt_reserve [sunrpc] 0x41 (0xf593de5c)
[<f8b1fd7b>] nfs_read_rpcsetup [nfs] 0x9b (0xf593de94)
[<f8b1ff6d>] nfs_pagein_one [nfs] 0x5d (0xf593deb0)
[<f8b21858>] nfs_coalesce_requests [nfs] 0x48 (0xf593dec0)
[<f8b200e5>] nfs_pagein_list [nfs] 0x65 (0xf593dee4)
[<f8b2023d>] nfs_pagein_inode [nfs] 0x3d (0xf593df0c)
[<f8b275ce>] nfs_flushd [nfs] 0x9e (0xf593df30)
[<f8b0b169>] __rpc_execute [sunrpc] 0xc9 (0xf593df54)
[<c0119a65>] schedule [kernel] 0x385 (0xf593df84)
[<f8b0b5b8>] __rpc_schedule [sunrpc] 0x178 (0xf593dfa0)
[<f8b0bf29>] rpciod [sunrpc] 0xf9 (0xf593dfb8)
[<f8b1a63c>] rpciod_killer [sunrpc] 0x0 (0xf593dfbc)
[<c0105856>] arch_kernel_thread [kernel] 0x26 (0xf593dff0)
[<f8b1a63c>] rpciod_killer [sunrpc] 0x0 (0xf593dff4)
[<f8b0be30>] rpciod [sunrpc] 0x0 (0xf593dff8)

Comment 21 Kambiz Aghaiepour 2004-02-06 16:10:31 UTC

Created attachment 97516 [details]
System.map file requested

Per your request.

Comment 22 Dave Anderson 2004-02-06 16:24:42 UTC

Thanks!

As suspected, all 512 pkmaps are allocated:

crash> p pkmap_count
pkmap_count = $3 =
 {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2}
crash>

Keep us posted regarding the installation/status of the debug kernel.

Comment 23 Dave Anderson 2004-02-10 18:39:29 UTC

Is this test kernel actually running anywhere yet?

(Time's a-wasting as far as getting this into the next AS2.1
update...)

Comment 24 Jason Baron 2004-02-12 16:32:24 UTC

right-we are code freezing AS2.1 kernel in about 1 week, and i'd
really like to get some feedback on this issue.

Comment 25 Frank Hirtz 2004-02-12 18:41:29 UTC

I spoke to Kambiz and IS doesn't believe that they'll be able to get
the test set up in the near future, and even if they did the issue is
sporadic in nature so we wouldn't have much definitive other than a
failure. 

I have a client (MSDW) who is reporting failures which also match
this, so I'd like to push for inclusion on their behalf as well as IS.
Let me know if there's anything that I can do to expedite this.

Thanks,

Frank.

Comment 26 Jason Baron 2004-02-12 18:51:04 UTC

Frank - this patch is fairly invasive and completely untested. What we
really need is to either recreate the environment internally (which i
thought we had in this bugzilla), or have a customer who is hitting
this help work with us to get it tested. There are no lack of ppl
hitting this afaik. bz #89453, bz #58216 and it #33097 are all
possible places for us to find some guinea pigs. The nature of this
BUG is very intermitent as mentioned. Once again this is going to be a
tricky issue to fix, so we need all the help we can get.

thanks,

-jason

Comment 27 Dave Anderson 2004-02-12 23:04:42 UTC

I've removed the .bugzilla_114209 directory.

The beehive-built binary RPMS containing test kernels can be found in:

  http://people.redhat.com/anderson/.nfs_kmap_nonblock

Comment 28 Don Howard 2004-02-12 23:07:36 UTC

Do we want to close this ticket as a dup of 114960?

Comment 29 Dave Anderson 2004-02-13 13:11:34 UTC

If tracker bugs can be considered the leader, that's fine with me.

Comment 30 Jason Baron 2004-07-09 20:20:58 UTC


*** This bug has been marked as a duplicate of 114960 ***

Comment 31 Red Hat Bugzilla 2006-02-21 19:00:55 UTC

Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.

Note You need to log in before you can comment on or make changes to this bug.