Bug 248787
Summary: [RHEL4 U4] NFS server, rpciod was stuck in an infinite loop

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Jeff Burke <jburke> |
| Component: | kernel | Assignee: | Jeff Layton <jlayton> |
| Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
| Severity: | low | Priority: | low |
| Version: | 4.4 | CC: | coughlan, staubach, steved |
| Target Milestone: | rc | Keywords: | Regression |
| Target Release: | --- | Hardware: | All |
| OS: | Linux | Doc Type: | Bug Fix |
| Fixed In Version: | RHSA-2008-0665 | Last Closed: | 2008-07-24 19:14:19 UTC |
| URL: | http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=324671 | | |
**Description** (Jeff Burke, 2007-07-18 19:33:09 UTC)
The sysrq data shows this at some point prior to the crash:

```
rpciod        D F5CABE7C  2492  3479      1  3483  3478 (L-TLB)
f5cabeb4 00000046 fcc80a40 f5cabe7c c011cdb9 c1aa5830 f5cbec30 c1804de0
c1804de0 00000086 c1804de0 c1804de0 00000001 00000000 fcc80a40 001aaa9e
c1aa5830 f5cbe6b0 f5cbe81c f7e36028 c0120420 063881e0 f7e36000 f5cabed8
Call Trace:
 [<c011cdb9>] activate_task+0x88/0x95
 [<c0120420>] prepare_to_wait+0x12/0x4c
 [<c0130dff>] flush_cpu_workqueue+0x12f/0x153
 [<c01204f5>] autoremove_wake_function+0x0/0x2d
 [<c01204f5>] autoremove_wake_function+0x0/0x2d
 [<c0130e92>] flush_workqueue+0x6f/0x94
 [<f8aa45d0>] xprt_destroy+0x20/0x44 [sunrpc]
 [<f8aa1454>] rpc_destroy_client+0x7c/0xa4 [sunrpc]
 [<f8aa5672>] rpc_release_task+0x11a/0x135 [sunrpc]
 [<f8aa5168>] __rpc_execute+0x31b/0x326 [sunrpc]
 [<f8aa5211>] __rpc_schedule+0x3f/0x93 [sunrpc]
 [<f8aa5952>] rpciod+0xf7/0x251 [sunrpc]
 [<c01204f5>] autoremove_wake_function+0x0/0x2d
 [<c02d4622>] ret_from_fork+0x6/0x14
 [<c01204f5>] autoremove_wake_function+0x0/0x2d
 [<f8aa585b>] rpciod+0x0/0x251 [sunrpc]
 [<c01041f5>] kernel_thread_helper+0x5/0xb
```

Then, from the core:

```
crash> set 3479
    PID: 3479
COMMAND: "rpciod"
   TASK: f5cbe6b0  [THREAD_INFO: f5cab000]
    CPU: 1
  STATE: TASK_RUNNING (ACTIVE)
crash> bt
PID: 3479   TASK: f5cbe6b0  CPU: 1   COMMAND: "rpciod"
 #0 [f5cabd0c] smp_call_function_interrupt at c0116cab
 #1 [f5cabd14] call_function_interrupt at c02d511d
    EAX: d532ad08  EBX: 0000de26  ECX: c03506c0  EDX: 000000d3  EBP: 0000ee48
    DS:  007b      ESI: 0000005e  ES:  007b      EDI: d0785800
    CS:  0060      EIP: c02b3d3e  ERR: fffffffb  EFLAGS: 00000282
 #2 [f5cabd48] udp_v4_get_port at c02b3d3e
 #3 [f5cabd6c] inet_autobind at c02ba124
 #4 [f5cabd78] inet_sendmsg at c02ba940
 #5 [f5cabd8c] sock_sendmsg at c0277daa
 #6 [f5cabe74] kernel_sendmsg at c0277dea
 #7 [f5cabe80] xdr_sendpages at f8aab426
 #8 [f5cabf04] xprt_transmit at f8aa3a3e
 #9 [f5cabf34] call_transmit at f8aa1cf4
#10 [f5cabf3c] __rpc_execute at f8aa4ef1
#11 [f5cabf94] __rpc_schedule at f8aa520c
#12 [f5cabfa0] rpciod at f8aa594d
#13 [f5cabff0] kernel_thread_helper at c01041f3
```

So at the time of the crash, it looks like rpciod was trying to send a message on a (new?) unbound socket, and so was trying to bind it.

The load average was high (LOAD AVERAGE: 2.37, 2.51, 3.03), and Steve said the thread was eating 100% of a CPU.

I see this in the ring buffer too; I'm not sure if it's related. It may have something to do with netconsole:

```
e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
  Tx Queue             <0>
  TDH                  <be>
  TDT                  <e>
  next_to_use          <e>
  next_to_clean        <bc>
buffer_info[next_to_clean]
  time_stamp           <bf1ad67b>
  next_to_watch        <be>
  jiffies              <bf1af9de>
  next_to_watch.status <0>
NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
```

Before the kernel crashed (and before and after the Tx Unit Hang messages), there was also some stuff like this in the log:

```
nsm_mon_unmon: rpc failed, status=-5
lockd: cannot monitor 10.12.4.31
portmap: too small RPC reply size (0 bytes)
nsm_mon_unmon: rpc failed, status=-5
lockd: cannot monitor 10.12.4.31
portmap: too small RPC reply size (0 bytes)
nsm_mon_unmon: rpc failed, status=-5
lockd: cannot monitor 10.12.4.81
```

In hindsight, cranking up rpc_debug would have been helpful. The info we have here is somewhat sparse. I'm thinking we won't be able to tell much unless this happens again. If so, we might want to leave the box in that state and see if we can track down what it's actually doing. I suspect that the problem may have ultimately been caused by something lower than NFS (network card problems, maybe), but it's hard to say without more info.

---

Setting this to NEEDINFO for now. I don't have enough to go on here to determine the cause. If this occurs again, please try to turn up RPC debugging:

    # echo 32767 > /proc/sys/sunrpc/rpc_debug

and let it sit with that for a few minutes before forcing a core.

---

OK, will do. You were not able to get any data from the vmcore file?
Thanks,
Jeff

---

I got a bit; it's detailed in the above comments. Unfortunately, I can't tell much. rpciod doesn't seem to be locked up hard. The sysrq data shows it at a different place than the stack in the actual crash, which might indicate that it was in some sort of loop, or could indicate that it wasn't hung at all.

A second core from the same issue might be more helpful, especially if rpc_debug is turned up before crashing the box. Even better would be to let me know if you have a machine in this situation, so I can poke at it while it's still "alive".

---

Jeff, I found out that there was a second issue last week while I was on vacation. Steve D was looking at it. It looks like it may have panicked (but I am not sure). This may or may not be a different issue. Here is a link to the vmcore from when the system had an issue:

    ndnc-1.lab.boston.redhat.com:/var/crash/10.12.4.232-2007-08-15-11:36

Thanks,
Jeff

---

This issue looks different, but I suppose there's a chance that they ultimately have the same cause:

    kernel BUG at include/asm/spinlock.h:133!

That error means that the spinlock magic was corrupt:

```c
void svc_wake_up(struct svc_serv *serv)
{
	struct svc_rqst *rqstp;

	spin_lock_bh(&serv->sv_lock);
```

...so the serv->sv_lock was bad (and that likely means that the whole svc_serv struct was bad). The ring buffer in the core didn't have any info before the panic, so it's hard to tell what happened leading up to the crash.
Here's the stack:

```
PID: 3618   TASK: f63deeb0  CPU: 0   COMMAND: "rpciod"
 #0 [caeb6dc0] netpoll_start_netdump at f8a52570
 #1 [caeb6de0] die at c0106045
 #2 [caeb6e14] do_invalid_op at c0106420
 #3 [caeb6ec4] error_code (via invalid_op) at c02d51dd
    EAX: f8aa7c45  EBX: c1afd518  ECX: c02e6fa1  EDX: c02e6fa1  EBP: c28ea808
    DS:  007b      ESI: c1afd500  ES:  007b      EDI: ec2a4a80
    CS:  0060      EIP: c02d33c2  ERR: ffffffff  EFLAGS: 00010216
 #4 [caeb6f00] _spin_lock_bh at c02d33c2
 #5 [caeb6f08] svc_wake_up at f8aa7c40
 #6 [caeb6f14] nlmsvc_grant_callback at f8c41f1d
 #7 [caeb6f2c] __rpc_execute at f8aa50f7
 #8 [caeb6f84] __rpc_schedule at f8aa520c
 #9 [caeb6f90] rpciod_killall at f8aa5ad8
#10 [caeb6f9c] rpciod at f8aa5a52
#11 [caeb6ff0] kernel_thread_helper at c01041f3
```

rpciod_killall() is in the stack here, which indicates that rpciod was in the process of shutting down at the time. It looks like lockd was in the process of shutting down too:

```
crash> bt
PID: 3617   TASK: f63de930  CPU: 1   COMMAND: "lockd"
 #0 [f2c26f14] schedule at c02d25c9
 #1 [f2c26f78] wait_for_completion at c02d26f6
 #2 [f2c26fcc] rpciod_down at f8aa5cca
 #3 [f2c26fd4] lockd at f8c40f31
 #4 [f2c26ff0] kernel_thread_helper at c01041f3
```

I don't see any NFS filesystems in the mount table, so I'm guessing this happened after unmounting the last NFS filesystem on the box. Is there already a separate BZ for this?

---

Ahh, Jeff pointed out that this is actually an NFS server, so lockd should only be shutting down if NFS is going down.
Here's something odd, though: there are 2 lockd threads in the process list:

```
 3617      1   1  f63de930  UN   0.0       0      0  [lockd]
27857      1   1  c1b46e30  UN   0.0       0      0  [lockd]
```

The second thread looks like this:

```
crash> set 27857
    PID: 27857
COMMAND: "lockd"
   TASK: c1b46e30  [THREAD_INFO: f152b000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 27857  TASK: c1b46e30  CPU: 1   COMMAND: "lockd"
 #0 [f152bf20] schedule at c02d25c9
 #1 [f152bf84] __sched_text_start at c02d1bb4
 #2 [f152bfb8] __down_failed at c02d1d2b
 #3 [f152bfc8] .text.lock.sched (via rpciod_up) at f8aa5e92
 #4 [f152bfd4] lockd at f8c40d7f
 #5 [f152bff0] kernel_thread_helper at c01041f3
```

I think it's blocked on the nlmsvc_sema (which is probably being held by whoever is calling lockd_down). Possibility? lockd was taking an inordinate amount of time to come down, nfs was restarted or something, and a second lockd thread was started...

---

Err... actually, the second lockd thread is blocked on the rpciod_sema from rpciod_up. This makes sense, since rpciod is coming down and the other lockd is holding it in rpciod_down. My guess is that rpciod was taking a long time to come down, and the svc_serv struct that it crashed while accessing was freed and reused before it could get to it. Either that, or we have memory corruption of some sort...

---

I've cloned this as bug 253754 to track the second panic. I'm thinking the 2 problems are unrelated (though they could ultimately be due to the same issue).

---

Comment #2 caught my eye when I was searching for info on bug 398921. I thought I'd add this note because of possible similarities via the tx hang issue. I have a PC in bug 398921 that is an NFS server client. The client was loading 340 gig of data onto a USB 2.0 hard drive. I left it going overnight, hoping for the best. The tx hang makes for slow loading. In the morning the system was locked, i.e. the screen was dark and a tap of the mouse or shift key would not bring up the screen saver password dialog.
I thought it was an X issue, so I performed a Ctrl-Shift-Backspace key sequence to restart X. No go. I wish I would have used Alt-Shift-F1 to get to the console. There had been so many tx hang issues that the network service had stopped. An /etc/rc.d/init.d/network restart would have done the trick. I wound up rebooting the computer instead. The tx hang message was the last thing in the log files before the reboot messages. It might be worth a consideration.

---

Thanks, Greg. I suspect that this problem was ultimately caused by something lower in the network stack than NFS. I think I've tracked down the cause of bug 253754, which I spun off from this one. The cause of that oops seems to be a subtle use-after-free bug in lockd, but spotty network communication appears to be a contributing factor.

---

Another event today. The machine had apparently been up since Dec 11th. Jeff turned up rpc_debug when the problem occurred. It pretty much brought the box to its knees, and Jeff had to power cycle it without getting a core. We have a lot of logs via netconsole. For a long time before the problem occurred, the logs were full of messages like this:

    portmap: server 10.12.4.103 not responding, still trying

Just before things really got hosed, we see a number of messages like this:

```
portmap: too small RPC reply size (0 bytes)
lockd: cannot unmonitor 10.12.4.58
portmap: too small RPC reply size (0 bytes)
lockd: cannot unmonitor 10.12.4.58
portmap: too small RPC reply size (0 bytes)
lockd: cannot monitor 10.12.5.44
portmap: too small RPC reply size (0 bytes)
lockd: cannot monitor 10.12.5.44
```

A possible theory: a client went down and left an RPC call in such a state that the portmap call was being retried forever? Perhaps we can try to reproduce that somehow: get a server into a state where it's printing messages like this:

    portmap: server 10.12.4.103 not responding, still trying

...and then keep the client down and watch the box.
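One rough way to manufacture that state from the client side is sketched below. This is not from the bug report: the server address, export path, file name, and the use of flock(1) are all illustrative assumptions. With traffic to the client blocked, the server's grant callback and its portmap calls to the client can never complete, so the server should start logging "not responding, still trying". DRYRUN defaults to 1, so the script only prints the steps.

```shell
#!/bin/bash
# Sketch: provoke endless server-side portmap retries against this client.
# SERVER, /export, and /mnt/nfs/testfile are illustrative assumptions.
SERVER=${SERVER:-192.168.1.246}
DRYRUN=${DRYRUN:-1}

# Print each step instead of running it unless DRYRUN is cleared.
run() { if [ -n "$DRYRUN" ]; then echo "+ $*"; else "$@"; fi; }

# Mount the server over NFSv2/UDP and take a blocking lock on a file the
# server side already holds, so the server will owe us a grant callback.
run mount -o vers=2,proto=udp "$SERVER:/export" /mnt/nfs
run flock /mnt/nfs/testfile -c "sleep 3600"

# Cut off all traffic to and from the server, then release the lock on
# the server side and watch the server's logs.
run iptables -A OUTPUT -d "$SERVER" -j DROP
run iptables -A INPUT -s "$SERVER" -j DROP
```

Clearing DRYRUN (and running as root on a real client) would execute the steps for real; as shipped, the script is review-only.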
Perhaps there's a memory leak or RPC call leak in this codepath? Finally, if the portmap "not responding" messages are related, then the patch I've just backported for bug 204309 might be relevant here.

---

OK, finally got a chance to look at this more closely. I've downloaded the log from the netdump server and started going through it. After rpc_debug was enabled, we saw more or less the same sort of messages cycling over and over. Here's an example from task 13894; the others look similar, but there is some variation:

```
# bzcat 2007-12-11-log.bz2 | grep "RPC: 13894"
RPC: 13894 xmit complete
RPC: 13894 sleep_on(queue "xprt_pending" time -1718939168)
RPC: 13894 added to queue c3504a88 "xprt_pending"
RPC: 13894 setting alarm for 10000 ms
RPC: 13894 added to queue f8c2df20 "schedq"
RPC: 13894 removed from queue f8c2df20 "schedq"
RPC: 13894 rpc_execute flgs 4a9
RPC: 13894 call_status (status -110)
RPC: 13894 call_timeout (minor)
RPC: 13894 call_bind clnt 0xc32d2800 port 28416 (status 0)
RPC: 13894 call_connect xprt c3504a00 is connected
RPC: 13894 call_transmit (status 0)
RPC: 13894 xprt_prepare_transmit
RPC: 13894 xprt_cwnd_limited cong = 0 cwnd = 256
```

Almost all of the tasks show this set of RPC flags:

    RPC: 13894 rpc_execute flgs 4a9

That is:

```c
#define RPC_TASK_ASYNC		0x0001		/* is an async task */
#define RPC_TASK_CHILD		0x0008		/* is child of other task */
#define RPC_CALL_MAJORSEEN	0x0020		/* major timeout seen */
#define RPC_TASK_DYNAMIC	0x0080		/* task was kmalloc'ed */
#define RPC_TASK_NOINTR		0x0400		/* uninterruptible task */
```

...though some don't have RPC_CALL_MAJORSEEN set. The part that concerns me here is this:

    #define RPC_TASK_CHILD		0x0008		/* is child of other task */

...I think that means that this is a recursive RPC call: an RPC task spawned a new RPC task. That flag is only set in rpc_new_child(), and that function is only called from rpc_getport(), which is how portmap calls get done.
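For reference, a small helper (not from the bug report) that decodes a `flgs` value from the rpc_debug output against the RPC_TASK_* bits listed above; pass the value with a 0x prefix:

```shell
#!/bin/bash
# Decode an rpc_debug "flgs" value into the RHEL4-era RPC_TASK_* names
# quoted above. Argument must carry a 0x prefix, e.g. 0x4a9.
decode_rpc_flags() {
    local flags=$(( $1 )) out=""
    (( flags & 0x0001 )) && out+=" ASYNC"
    (( flags & 0x0008 )) && out+=" CHILD"
    (( flags & 0x0020 )) && out+=" MAJORSEEN"
    (( flags & 0x0080 )) && out+=" DYNAMIC"
    (( flags & 0x0400 )) && out+=" NOINTR"
    echo "${out# }"
}

decode_rpc_flags 0x4a9   # ASYNC CHILD MAJORSEEN DYNAMIC NOINTR
```

The 0481 flags seen on other tasks in the queue dumps decode to ASYNC DYNAMIC NOINTR, i.e. the same bits without CHILD.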
That would seem to confirm that the large number of calls shown here are portmapper calls. The question is: do these calls eventually time out and go away, or are they being retried more or less forever? Given what we know about bug 204309, I suspect the latter; these calls are being retried forever.

I may see if I can reproduce this, but we need some method of viewing the current RPC queue at various times. I had a patch that added a new sysrq handler for that, but I'll need to dig it up again. If this theory is confirmed, then I suspect that the patch for bug 204309 will probably fix this as well, since it makes portmap calls eventually error out.

---

Created attachment 292554 [details]
debug patch -- add sysrq handler to call rpc_show_tasks()
Quick and dirty debug patch to add a new sysrq handler to RHEL4. Doing a sysrq-q calls rpc_show_tasks, so we can get a snapshot of the RPC queue over time.
Patched the sunrpc module with the above patch. I then had a RHEL5 client mount my RHEL4 server using NFSv2/UDP. I locked a file on the server and then had the RHEL5 client do a blocking lock on the same file. I then set up iptables rules on the client to drop all traffic to and from the server, and then released the lock on the server. That should make the server start trying to do a grant callback to the client, but it can't communicate with it. I start seeing messages like this in the logs:

```
portmap: server 192.168.1.246 not responding, still trying
portmap: server 192.168.1.246 not responding, still trying
```

At first I saw just a few RPC tasks in queue:

```
SysRq : Dump RPC Queue
-pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- --exit--
00026 0010 0481 000000 dfe8fe00 100021 da1470ac 00060000 bindwait e08c1a7b e0915e70
00025 0003 04a9 000000 dffd2e00 100000 df917000 00010000 xprt_pending e08c1d16 e08c569d
00024 0010 0481 000000 dfe8fe00 100021 da147000 00060000 childq e08c1a7b e0915e70
```

After 20 minutes or so, the queue looked like this:

```
SysRq : Dump RPC Queue
-pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- --exit--
00065 0010 0481 000000 dfe8fe00 100021 da147158 00060000 bindwait e08c1a7b e0915e70
00064 0010 0481 000000 dfe8fe00 100021 da1470ac 00060000 bindwait e08c1a7b e0915e70
00063 0003 04a9 000000 df90ac00 100000 da4ed000 00010000 xprt_pending e08c1d16 e08c569d
00060 0003 04a9 000000 c1784200 100000 dc37b000 00020000 xprt_pending e08c1d16 e08c569d
00057 0003 04a9 000000 c16e3200 100000 dabce000 00020000 xprt_pending e08c1d16 e08c569d
00054 0003 04a9 000000 dffda400 100000 db27d000 00010000 xprt_pending e08c1d16 e08c569d
00051 0003 04a9 000000 dddba200 100000 dabcf000 00020000 xprt_pending e08c1d16 e08c569d
00048 0003 04a9 000000 df938a00 100000 dec55000 00020000 xprt_pending e08c1d16 e08c569d
00045 0003 04a9 000000 c1793c00 100000 db1e9000 00020000 xprt_pending e08c1d16 e08c569d
00042 0003 04a9 000000 dfedf600 100000 db263000 00005000 xprt_pending e08c1d16 e08c569d
00039 0003 04a9 000000 dffd6600 100000 dcbba000 00020000 xprt_pending e08c1d16 e08c569d
00036 0003 04a9 000000 c15fbc00 100000 da4e8000 00020000 xprt_pending e08c1d16 e08c569d
00033 0003 04a9 000000 c17cfc00 100000 dc538000 00010000 xprt_pending e08c1d16 e08c569d
00030 0003 04a9 000000 c1796400 100000 dc6f2000 00020000 xprt_pending e08c1d16 e08c569d
00027 0003 04a9 000000 c16fce00 100000 de6e1000 00020000 xprt_pending e08c1d16 e08c569d
00025 0003 04a9 000000 dffd2e00 100000 df917000 00010000 xprt_pending e08c1d16 e08c569d
00024 0010 0481 000000 dfe8fe00 100021 da147000 00060000 childq e08c1a7b e0915e70
```

Note that the tasks here never really seem to time out. I think this might be part of the problem. This consumes memory and other resources on the server; eventually I could see the server getting very bogged down retrying all of these requests.

---

Added a bit of code to do a dump_stack() whenever we call rpc_new_child. The first couple of child tasks come from a stack like this:

```
 [<e08c5700>] rpc_new_child+0x1b/0x45 [sunrpc]
 [<e08ca983>] rpc_getport+0x103/0x15a [sunrpc]
 [<e08c4f03>] __rpc_execute+0xa6/0x326 [sunrpc]
 [<e08c6457>] rpcauth_bindcred+0x7e/0xb8 [sunrpc]
 [<e08c15de>] rpc_call_sync+0x68/0x8c [sunrpc]
 [<e091755b>] nsm_mon_unmon+0xaf/0xd0 [lockd]
 [<e09175a9>] nsm_monitor+0x2d/0x56 [lockd]
 [<e09164a1>] nlmsvc_retrieve_args+0x47/0xaa [lockd]
 [<e091664b>] nlmsvc_proc_lock+0x5d/0xdd [lockd]
 [<e0917ea2>] nlmsvc_decode_lockargs+0x0/0x31a [lockd]
 [<e0914cbf>] nlmsvc_dispatch+0x7c/0x122 [lockd]
 [<e08c7635>] svc_process+0x444/0x6f3 [sunrpc]
 [<e0914ee8>] lockd+0x183/0x270 [lockd]
 [<e0914d65>] lockd+0x0/0x270 [lockd]
 [<c01041f5>] kernel_thread_helper+0x5/0xb
```

...the remaining ones look like this:

```
 [<e08c5700>] rpc_new_child+0x1b/0x45 [sunrpc]
 [<e08ca983>] rpc_getport+0x103/0x15a [sunrpc]
 [<e08c4f03>] __rpc_execute+0xa6/0x326 [sunrpc]
 [<e08c5221>] __rpc_schedule+0x3f/0x93 [sunrpc]
 [<e08c597d>] rpciod+0xf7/0x251 [sunrpc]
 [<c012052d>] autoremove_wake_function+0x0/0x2d
 [<c02d8586>] ret_from_fork+0x6/0x14
 [<c012052d>] autoremove_wake_function+0x0/0x2d
 [<e08c5886>] rpciod+0x0/0x251 [sunrpc]
 [<c01041f5>] kernel_thread_helper+0x5/0xb
```

I also added a bit of extra info to rpc_show_tasks (the calldata and the task address itself):

```
SysRq : Dump RPC Queue
-pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- --exit-- calldata taskaddr
00046 0010 0481 000000 c16fc200 100021 daf9a0ac 00060000 bindwait e08c1a7b e0915e70 df0e7408 dd8ff580
00045 0003 0489 000000 dfe82800 100000 da939000 00010000 xprt_pending e08c1d16 e08c569d dd8ffe80 dd8ff4c0
00042 0003 04a9 000000 ddfa5c00 100000 dbc6c000 00005000 xprt_pending e08c1d16 e08c569d dd8ffe80 dd8ff1c0
00039 0003 04a9 000000 c17d8c00 100000 dd0c5000 00020000 xprt_pending e08c1d16 e08c569d dd8ffe80 dd8ff280
00036 0003 04a9 000000 c178ce00 100000 db1b6000 00020000 xprt_pending e08c1d16 e08c569d dd8ffe80 dd8ff100
00033 0003 04a9 000000 c16dce00 100000 dd476000 00010000 xprt_pending e08c1d16 e08c569d dd8ffe80 dd8ff340
00030 0003 04a9 000000 c15ef200 100000 dccb5000 00020000 xprt_pending e08c1d16 e08c569d dd8ffe80 dd8ff040
00027 0003 04a9 000000 df905200 100000 dbf9b000 00020000 xprt_pending e08c1d16 e08c569d dd8ffe80 dd8ff400
00024 0003 04a9 000000 c1796200 100000 dbdd9000 00010000 xprt_pending e08c1d16 e08c569d dd8ffe80 dd8ffc40
00021 0003 04a9 000000 c16fbc00 100000 df3e0000 00005000 xprt_pending e08c1d16 e08c569d dd8ffe80 dd8ffdc0
00020 0010 0481 000000 c16fc200 100021 daf9a000 00060000 childq e08c1a7b e0915e70 df0e7408 dd8ffe80
```

...the interesting bit is that the calldata is the same for almost all of the tasks in queue here. They're all children of the same task. I think we probably need to change this behavior. I don't think it does any good to spawn more than one portmapper request at a time.
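That observation can be checked mechanically. A quick sketch (not part of the bug report) that tallies the calldata column, field 12 of the extended rpc_show_tasks format, over a captured dump; the sample lines here are a subset of the dump above:

```shell
#!/bin/bash
# Count how many queued RPC tasks share each calldata value (12th field
# of the extended rpc_show_tasks output shown above).
count_calldata() { awk '{ c[$12]++ } END { for (k in c) print k, c[k] }' | sort; }

count_calldata <<'EOF'
00046 0010 0481 000000 c16fc200 100021 daf9a0ac 00060000 bindwait e08c1a7b e0915e70 df0e7408 dd8ff580
00045 0003 0489 000000 dfe82800 100000 da939000 00010000 xprt_pending e08c1d16 e08c569d dd8ffe80 dd8ff4c0
00042 0003 04a9 000000 ddfa5c00 100000 dbc6c000 00005000 xprt_pending e08c1d16 e08c569d dd8ffe80 dd8ff1c0
00039 0003 04a9 000000 c17d8c00 100000 dd0c5000 00020000 xprt_pending e08c1d16 e08c569d dd8ffe80 dd8ff280
00020 0010 0481 000000 c16fc200 100021 daf9a000 00060000 childq e08c1a7b e0915e70 df0e7408 dd8ffe80
EOF
# Output:
# dd8ffe80 3
# df0e7408 2
```

On the full dump, the lopsided counts make the "all children of the same task" pattern obvious at a glance.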
Perhaps if there's already a portmapper request in progress, we can just try to wake it up and not spawn a new one. I also need to check upstream to see how it behaves (and whether it might have the same issue).

---

Heh, it occurs to me that a sysrq key isn't really necessary. Doing:

    # echo 0 > /proc/sys/sunrpc/rpc_debug

...does a similar thing.

---

Testing rawhide today; after ~10 minutes, the queue looks like this:

```
-pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- ---ops--
   42 0003 0281      0 df1f1690 100000 ca403000    60000 xprt_pending e0a5eb17 e0a6ef94
   28 0010 04a1    -11 dee93af0 100021 ca640000     5000 delayq e0a5ebbd e0a37fac
   30 0010 04a1    -11 dee93af0 100021 ca6400cc     5000 delayq e0a5ebbd e0a37fac
   31 0010 04a1    -11 dee93af0 100021 ca640198     5000 delayq e0a5ebbd e0a37fac
   32 0010 04a1    -11 dee93af0 100021 ca640264     5000 delayq e0a5ebbd e0a37fac
   33 0010 04a1    -11 dee93af0 100021 ca640330     5000 delayq e0a5ebbd e0a37fac
   36 0010 04a1    -11 dee93af0 100021 ca6403fc     5000 delayq e0a5ebbd e0a37fac
   37 0010 04a1    -11 dee93af0 100021 ca6404c8     5000 delayq e0a5ebbd e0a37fac
   38 0010 04a1    -11 dee93af0 100021 ca640594     5000 delayq e0a5ebbd e0a37fac
   39 0010 04a1    -11 dee93af0 100021 ca640660     5000 delayq e0a5ebbd e0a37fac
   40 0010 04a1    -11 dee93af0 100021 ca64072c     5000 delayq e0a5ebbd e0a37fac
   41 0010 04a1    -11 dee93af0 100021 ca6407f8     5000 delayq e0a5ebbd e0a37fac
   43 0010 04a1    -11 dee93af0 100021 ca6408c4     5000 delayq e0a5ebbd e0a37fac
   44 0010 04a1    -11 dee93af0 100021 ca640990     5000 delayq e0a5ebbd e0a37fac
   45 0010 0481    -11 dee93af0 100021 ca640a5c     5000 delayq e0a5ebbd e0a37fac
   46 0010 0481    -11 dee93af0 100021 ca640b28     5000 delayq e0a5ebbd e0a37fac
```

...so it looks like rawhide has a similar problem. I *think* a simple fix might be to just make all portmap children be soft tasks. In the event that one times out, the main task is still a hard task, and it should just reattempt the portmap call if it isn't bound.
The other alternative would be to make it so that the main task can only spawn a single portmap call at a time, but that sounds more invasive.

---

Actually... the problem on rawhide is different. In the rawhide case, most of the queued-up RPC calls are for program 100021 (nlockmgr), whereas on RHEL4 they are mostly portmap calls. One major difference is that portmap calls on rawhide are soft tasks. I think that's probably a change we should consider for RHEL as well, though it needs testing to make sure it's safe.

There's still an issue on rawhide, though. nlm_grant_blocked calls are hard tasks, but it does this:

```c
	/* If b_granted is true this means we've been here before.
	 * Just retry the grant callback, possibly refreshing the RPC
	 * binding */
	if (block->b_granted) {
		nlm_rebind_host(block->b_host);
		goto callback;
	}
```

...so on each pass through nlm_retry_blocked(), we end up kicking off a new RPC task, irrespective of the fact that we already have hard tasks for this retrying in queue. The easiest fix looks like it'll be to just skip reissuing this call if there's already a call in queue, but that may not take into account failures that don't allow the call to retry. Changing these to soft tasks might be the best method...

---

Looking back through the logs from the Jan 10th event... When Jeff turned up rpc_debug, we should have gotten a call to rpc_show_tasks. I tracked that down in the logs, and I see a bunch of these:

```
-pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- --exit--
36379 0003 04a9 -00110 f1dfd000 100000 ed16a000 00000000 schedq f8c10d19 f8c146a5
36378 0003 04a9 -00110 eecd4c00 100000 d88f0000 00000000 schedq f8c10d19 f8c146a5
36377 0003 04a9 -00110 e486f600 100000 d6870000 00000000 schedq f8c10d19 f8c146a5
```

These are portmap calls (prog is 100000). The last task in the list:

    03074 0010 0481 000000 f7e8e400 100021 d5545000 00060000 unknown f8c10a7b f8c5fe20

```
$ bzcat 2007-12-11-log.bz2 | grep 'schedq f8c10d19 f8c146a5' | wc -l
29308
```
So the netdump log shows that more than 29000 portmap calls were queued up. It may even be more, since netlog can drop messages at times. In any case, I think changing those to be soft RPC tasks is probably reasonable. I'll roll up a patch and see if we can start testing it. Unfortunately, we don't know for sure if that's what caused the problem, but it seems like something that should really be fixed, and it may help us catch the real culprit if this happens again. The problem I've seen on rawhide looks like a similar, but unrelated, issue.

---

Jeff, if this occurs again anytime soon, could you just do the following on the box?

    # echo 0 > /proc/sys/sunrpc/rpc_debug

...that should just make it do an rpc_show_tasks and dump the state of the RPC queue to the ring buffer. It won't enable the debugging, however, so hopefully it won't bring the box to its knees as badly.

---

I think I'm on track of the cause here, but it may be hard to tell, since I can't reproduce the problem exactly. I think this problem is actually a regression due to the patch in bug 172082. Prior to that patch going in, portmap calls were always soft RPC tasks, and I don't see any period of time where that was not the case upstream. That patch added this delta to rpc_getport:

```c
+	/* use the soft setting from the orignal clnt */
+	pmap_clnt->cl_softrtry = clnt->cl_softrtry;
+
```

...which makes all portmap children inherit the soft setting from the parent task. I think this delta is wrong and should be reverted. That should keep this problem from occurring. Steve, bug 172082 was yours... any thoughts?

---

Going to try to see if we can fix this for 4.7. I want to do a bit more testing before I call this a regression, but I believe that it is. As long as we have the patch for bug 204309, I think that the delta above is unnecessary, but I need to check it against the reproducer in bug 172082.

---

This bugzilla has Keywords: Regression.
Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.

---

Created attachment 292949 [details]
patch -- make sure that kernel portmap calls are always soft tasks

I think this patch will fix the issue, and it makes the behavior here match upstream. AFAICT, upstream has always had pmap calls be soft. Note that this patch alone may cause bug 172082 to pop up again, but this patch along with the patch for bug 204309 should make everything work OK.

---

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

---

Regarding bug 172082, Chuck's reproducer:

Steps to Reproduce:
1. Bring up an NFS server and mount it from a client
2. Kill the portmapper on the server
3. Attempt to run anything that tries to do NLM locking

...when I do this on a client with this patch plus the patch for bug 204309, it (properly) hangs for ~25 minutes and then errors out with -EIO. From an strace:

```
1201273564.642552 fcntl64(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1 EIO (Input/output error) <1619.831624>
```

This was just testing TCP. Obviously, we'll want to test that situation with UDP as well.

---

Committed in 68.17.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

---

Jeff L,
I installed the 2.6.9-68.17.EL kernel on our static RHEL4 NFS server. This system is the one we use to run the client connectathon test in RHTS. When I did that, all of the connectathon test machines failed.
Here is a link to the test results:
http://rhts.redhat.com/cgi-bin/rhts/job_matrix.cgi?method=job&jobs=18621&whiteboard_filter=&project_filter=&action=Generate

What I noticed on the server was this, in /var/log/messages:

```
Mar 28 06:44:29 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:44:33 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:44:33 dell-pe840-01 mountd[21233]: authenticated unmount request from g13.lab.boston.redhat.com:818 for /export/home (/export/home)
Mar 28 06:44:41 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:44:41 dell-pe840-01 mountd[21233]: authenticated unmount request from g16.lab.boston.redhat.com:666 for /export/home (/export/home)
Mar 28 06:44:44 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:44:44 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:44:51 dell-pe840-01 mountd[21233]: authenticated mount request from g16.lab.boston.redhat.com:774 for /export/home (/export/home)
Mar 28 06:44:51 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:44:51 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:45:02 dell-pe840-01 mountd[21233]: authenticated mount request from g16.lab.boston.redhat.com:904 for /export/home (/export/home)
Mar 28 06:45:02 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:45:10 dell-pe840-01 last message repeated 10 times
Mar 28 06:45:10 dell-pe840-01 mountd[21233]: authenticated unmount request from g16.lab.boston.redhat.com:902 for /export/home (/export/home)
Mar 28 06:45:20 dell-pe840-01 mountd[21233]: authenticated mount request from g16.lab.boston.redhat.com:1011 for /export/home (/export/home)
Mar 28 06:45:20 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:45:26 dell-pe840-01 last message repeated 7 times
Mar 28 06:45:38 dell-pe840-01 mountd[21233]: authenticated mount request from g16.lab.boston.redhat.com:780 for /export/home (/export/home)
Mar 28 06:45:38 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:45:38 dell-pe840-01 rpc.statd[20877]: Received erroneous SM_UNMON request from 127.0.0.1 for 192.168.77.215
Mar 28 06:45:51 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:45:51 dell-pe840-01 mountd[21233]: authenticated unmount request from g16.lab.boston.redhat.com:898 for /export/home (/export/home)
Mar 28 06:46:02 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:46:02 dell-pe840-01 kernel: nfsd: non-standard errno: -38
```

and in dmesg:

```
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
```

This kernel is causing a regression. I am going to put the server back to the Z stream kernel.

---

Interesting. I rather doubt that the patch that fixes this bug is causing this:

    #define ENOSYS 38 /* Function not implemented */

...I wonder if that problem might be related to bug 438723?

---

Jeff,
Let me step back one minute; I may have jumped the gun a bit. I didn't realize that it was a test kernel that failed in RHTS, since the kernel that was running the test was Vivek's 2.6.9-68.29.EL.vgoyal.test1. I have put the server back to the previous Z-stream kernel and kicked off a connectathon test using Vivek's test kernel. This will tell us for sure if it was client or server. Sorry for raising the alarm. Let me get more data for you.
Jeff

---

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2008-0665.html