Description of problem:
Userspace is dead; the box pings, and sysrq can interact with it, but that is all.

Version-Release number of selected component (if applicable):
Seen most recently under 2.4.9-e.36enterprise, but we have had similar issues under e.34 and e.35; not tested with any prior kernels, nor were we able to capture debugging info for e.34/e.35.

How reproducible:
We have had hangs of this nature every few days for 2-3 weeks now.

Steps to Reproduce:
Undetermined right now.

Additional info:
The machines in question are Dell 2650s, with the onboard dual tg3 interfaces plus an offboard quad e1000 interface. Alan mentioned that this is probably a bug in a network driver, based on these messages:

Warning: kfree_skb on hard IRQ f8ac114a

I'll be attaching all the debugging information that I've been able to collect so far.
Has anybody got this vmcore-incomplete file? I downloaded it from the enterprise.redhat.com location referenced in the Issue Tracker, but when I try to unzip it I get:

# bunzip2 vmcore-incomplete.bz2
bunzip2: Compressed file ends unexpectedly;
	perhaps it is corrupted?  *Possible* reason follows.
bunzip2: No such file or directory
	Input file = vmcore-incomplete.bz2, output file = vmcore-incomplete
It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
bunzip2: Deleting output file vmcore-incomplete, if it exists.
I hadn't done more to it than move it into place. When I checked, it would appear that it is indeed corrupt. Does the local copy that you have also show this, or is it specific to the file on the server?
my local copy
I can't speak to any of the network issues involved here -- and whether they're red herrings or not -- but what's interesting to me is that there are 26 processes out of the ~140 shown in the Alt-Sysrq-T output that are blocked in kmap_high(), unable to get any of the 512 pkmap entries. Assuming that the ~140 task count is roughly correct, it seems unusual that those 26 cannot get an unused pkmap entry unless there were also ~512 processes that have done a kmap_high(), are doing something time-consuming with the temporarily mapped page, and are blocked waiting for the I/O (?) to complete. I suppose there's a possibility that pkmap entries could be leaked (kmap without a follow-up kunmap), which would leave them in a "2" state; once all the entries end up that way, the system would behave as if it were in a livelock. A dump would tell us that; it's impossible to tell without one.
We've pretty much reached a dead end given the current amount of debug data (i.e., without a usable netdump).

I initially had some concerns about the wait_on_irq() timeout that generated the active-CPU traces interspersed with the Alt-Sysrq-T traces, because they listed an enormous number of network-related functions. But as it turns out, the "traces" displayed when generated from wait_on_irq() show all kernel return addresses starting from the first byte above the task_struct, i.e., all return addresses that *ever* were left on the stack. When traces are generated from Alt-Sysrq-T, they start at the stack location where the process last called schedule(), so you get a relatively realistic trace (with potentially stale return addresses intermingled). So those inordinately long traces are definitely a red herring. And, BTW, the wait_on_irq() timeout was generated because global_cli() was being called from a timer-related bottom half on CPU 1, and it got sick of waiting for the Alt-Sysrq-T interrupt handling to complete on CPU 3.

As for the __kfree_skb() warnings: yes, it is unwise to allow that call in interrupt context, because it can get blocked forever in __free_pages_ok() (or perhaps elsewhere) by taking a spinlock that could conceivably already be held by the interrupted context. If that happened, that CPU would be permanently blocked in spin_lock(). However, the Alt-Sysrq-P output shows that all 4 CPUs responded at least once, so all of the suspect __kfree_skb() calls made while in interrupt context did in fact make it out alive.

So -- where do we go from here? We obviously could use a netdump the next time it happens... I want to suggest an Alt-Sysrq-W, which is like Alt-Sysrq-P but sends an IPI to each other CPU to print its stack trace. But it runs the risk of waiting forever if one of the other CPUs is blocked with interrupts disabled -- and needing a resource that the interrupted CPU holds.
From that point on, Alt-Sysrq capability would be dead. But as I mentioned above, if several Alt-Sysrq-P calls manage to interrupt each CPU, then there's no wedge on any of them. The Alt-Sysrq-P debug data that Andrew attached does in fact show that all 4 CPUs were capable of receiving the keyboard interrupt. So before killing the system with an Alt-Sysrq-C, please again take an Alt-Sysrq-T and several Alt-Sysrq-P's, at least enough to show that all CPUs are responding. The Alt-Sysrq-M showed plenty of memory available in this case, but it's always worth taking at least one of those as well.

I'd like to pursue the kmap_high() issue I mentioned in my previous post. I'd really like to know the state of the pkmap_count[512] array when the system wedges. But short of a netdump, there's no way to tell what's there unless we add some debug code to one of the Alt-Sysrq handlers to dump it out. If they're all stuck at "2", then we've got a problem to attack. If they're all stuck at "1", then the wakeup mechanism must have a hole in it. In any case, I'll put together a .36 enterprise kernel that makes this additional data available, probably in the Alt-Sysrq-M output. I'll also show the number of processes, just in case the Alt-Sysrq-T didn't display them all.

In the meantime, I guess all we can do is wait for it to happen again, take the same Alt-Sysrq-M, -P, and -T output, and then force a netdump with Alt-Sysrq-C. If you can afford to take the system down in the meantime, I will give you the debug kernel that provides the additional data.
Well, the bad news is that we had another lockup this morning. The good news is that I stumbled across sysrq-W before I saw this update, and have that output as well. I'm waiting for the netdump to complete currently, and I'll compress and upload that file when it's done. Hopefully this one will be usable. When that kernel is available, we can put it onto these machines as well, in case netdumps continue to be unusable.
is nfs involved?
Depends what you mean by involved... Of the 26 kmap_high blockages seen in the traces, 9 of them are coming from nfs routines; the others are from generic_file_read or from journaling routines.
Many thanks to Jason -- this looks to be the same kmap_high() issue seen in Bugzillas #89453 and #58216, and Issue Tracker #33097. Given that the Alt-Sysrq-T shows that there are two kmap_high() blockers coming from nfs_read_rpcsetup(), it would appear that they are taking all the pkmap entries they can get via kmap() -- but not getting all they need -- so they end up blocking, waiting for someone to free one, which never occurs:

static void
nfs_read_rpcsetup(struct list_head *head, struct nfs_read_data *data)
{
	struct nfs_page	*req;
	struct iovec	*iov;
	unsigned int	count;

	iov = data->args.iov;
	count = 0;
	while (!list_empty(head)) {
		struct nfs_page *req = nfs_list_entry(head->next);
		nfs_list_remove_request(req);
		nfs_list_add_request(req, &data->pages);
===>		iov->iov_base = kmap(req->wb_page) + req->wb_offset;
		iov->iov_len = req->wb_bytes;
		count += req->wb_bytes;
		iov++;
		data->args.nriov++;
	}
	...

Steve Dickson is looking into putting a patch together to address this. I'd still like to see the netdump just to verify that all the pkmap entries are in fact taken.
Note from Steve:

> It turns out that the correct patch is in rhel3....
> I'll just back port enough of that patch to fix
> the problem... in theory... ;-)
>
> I should have something shortly....
>
> SteveD.
I tried to upload a netdump from the server but got the following:

ftp> cd outgoing/redhat-2182/incoming/
250 Directory successfully changed.
ftp> dir
227 Entering Passive Mode (172,16,48,193,52,157)
150 Here comes the directory listing.
226 Transfer done (but failed to open directory).
ftp> bin
200 Binary it is, then.
ftp> put 10.10.2.221-2004-02-04-08-33.tar.bz
local: 10.10.2.221-2004-02-04-08-33.tar.bz remote: 10.10.2.221-2004-02-04-08-33.tar.bz
227 Entering Passive Mode (172,16,48,193,50,159)
150 Go ahead make my day^W^W^Wsend me the data.
netout: Connection reset by peer
451 Failure writing to local file.
ftp> 221 Goodbye.
Hmm. I've not had any reports of issues with the server, nor are there space problems at present (~7.0G free on the pertinent filesystem). I see that it uploaded ~460M of data. Could you give it another try?
We have a kernel that we would like you to test, which is located in:

  http://people.redhat.com/anderson/.bugzilla_114209

There are four files there, consisting of an enterprise kernel binary rpm:

  kernel-enterprise-2.4.9-e.36.kmap3.i686.rpm

plus the 3 patches the kernel contains:

  linux-2.4.9-kmap_nonblock.patch
  linux-2.4.9-nfs-kmapfix.patch
  kmap_high_debug.patch

FYI:
1. The "kmap_nonblock" patch adds a new kmap_nonblock() interface.
2. Steve D's "nfs-kmapfix" patch uses the new interface.
3. The kmap_high_debug.patch adds 3 additional lines of output to Alt-Sysrq-m.

The additional Alt-Sysrq-m output consists of:

  ...
  56 threads
  pkmap_count: 0: 113  1: 399  2: 0  2+: 0
  kmap_high_nonblock calls: 13157  failures: 0

The thread count is probably of no use, but the pkmap_count array data shows the current state of the 512 pkmap entries (0: unused, 1: mapped/unused, 2: mapped/used-by-one-process, 2+: mapped/used-by-more-than-one-process). And lastly, there is the count of kmap_high_nonblock() calls, and how many of those calls resulted in a failure.

This debug kernel is only to be used in-house, and must not be made available to any customer.
I'll see about getting the new kernels on the box(es). Meanwhile, ftp://enteprise.redhat.com/outgoing/redhat-2182/incoming/10.10.2.221-2004-02-04-08-33.tar.bz has been uploaded for your examination. Kambiz
Additionally, the md5sum of the uploaded file is the following: ea1176e7249d44a32ef7685c93ac8e3e 10.10.2.221-2004-02-04-08-33.tar.bz
Can you please make the System.map file associated with this last dump available? It's needed for AS2.1-type crash analysis. The log file shows all the earmarks of the kmap_high() livelock that the new debug kernel is trying to address. 18 of the 144 processes are blocked in kmap_high(), with the likely pkmap-entry hog being this one:

rpciod        D 00000202  3564   948     1   975   949 (L-TLB)
Call Trace: [<c01445ad>] kmap_high [kernel] 0xad (0xf593de44)
[<f8b09bd1>] xprt_reserve [sunrpc] 0x41 (0xf593de5c)
[<f8b1fd7b>] nfs_read_rpcsetup [nfs] 0x9b (0xf593de94)
[<f8b1ff6d>] nfs_pagein_one [nfs] 0x5d (0xf593deb0)
[<f8b21858>] nfs_coalesce_requests [nfs] 0x48 (0xf593dec0)
[<f8b200e5>] nfs_pagein_list [nfs] 0x65 (0xf593dee4)
[<f8b2023d>] nfs_pagein_inode [nfs] 0x3d (0xf593df0c)
[<f8b275ce>] nfs_flushd [nfs] 0x9e (0xf593df30)
[<f8b0b169>] __rpc_execute [sunrpc] 0xc9 (0xf593df54)
[<c0119a65>] schedule [kernel] 0x385 (0xf593df84)
[<f8b0b5b8>] __rpc_schedule [sunrpc] 0x178 (0xf593dfa0)
[<f8b0bf29>] rpciod [sunrpc] 0xf9 (0xf593dfb8)
[<f8b1a63c>] rpciod_killer [sunrpc] 0x0 (0xf593dfbc)
[<c0105856>] arch_kernel_thread [kernel] 0x26 (0xf593dff0)
[<f8b1a63c>] rpciod_killer [sunrpc] 0x0 (0xf593dff4)
[<f8b0be30>] rpciod [sunrpc] 0x0 (0xf593dff8)
Created attachment 97516 [details] System.map file requested Per your request.
Thanks! As suspected, all 512 pkmap entries are allocated:

crash> p pkmap_count
pkmap_count = $3 = {2, 2, 2, ..., 2}   (all 512 entries are 2)
crash>

Keep us posted regarding the installation/status of the debug kernel.
Is this test kernel actually running anywhere yet? (Time's a-wasting as far as getting this into the next AS2.1 update...)
Right -- we are code-freezing the AS2.1 kernel in about 1 week, and I'd really like to get some feedback on this issue.
I spoke to Kambiz, and IS doesn't believe they'll be able to get the test set up in the near future; even if they did, the issue is sporadic in nature, so we wouldn't have anything definitive other than a failure. I have a client (MSDW) who is reporting failures which also match this, so I'd like to push for inclusion on their behalf as well as IS's. Let me know if there's anything I can do to expedite this. Thanks, Frank.
Frank -- this patch is fairly invasive and completely untested. What we really need is either to recreate the environment internally (which I thought we had in this bugzilla), or to have a customer who is hitting this work with us to get it tested. There is no lack of people hitting this, AFAIK: bz #89453, bz #58216, and IT #33097 are all possible places for us to find some guinea pigs. As mentioned, the nature of this bug is very intermittent. Once again, this is going to be a tricky issue to fix, so we need all the help we can get. thanks, -jason
I've removed the .bugzilla_114209 directory. The beehive-built binary RPMS containing test kernels can be found in: http://people.redhat.com/anderson/.nfs_kmap_nonblock
Do we want to close this ticket as a dup of 114960?
If tracker bugs can be considered the leader, that's fine with me.
*** This bug has been marked as a duplicate of 114960 ***
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.