Bug 248787

Summary:

[RHEL4 U4] NFS server, rpciod was stuck in a infinite loop,

Product:

Red Hat Enterprise Linux 4

Reporter:

Jeff Burke <jburke>

Component:

kernel

Assignee:

Jeff Layton <jlayton>

Status:

CLOSED ERRATA

QA Contact:

Martin Jenner <mjenner>

Severity:

low

Docs Contact:

Priority:

low

Version:

4.4

CC:

coughlan, staubach, steved

Target Milestone:

Keywords:

Regression

Target Release:

---

Hardware:

All

OS:

Linux

URL:

http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=324671

Whiteboard:

Fixed In Version:

RHSA-2008-0665

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2008-07-24 19:14:19 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
debug patch -- add sysrq handler to call rpc_show_tasks()	none
patch -- make sure that kernel portmap calls are always soft tasks	none

Description Jeff Burke 2007-07-18 19:33:09 UTC

Description of problem:
 While testing RHEL4 U6 beta kernel. The static server used for testing
connectathon got into a "bad" state. The client side tests were failing the
following connectathon tests:
 rhel4-nfs NFSv2 tcp lock
 rhel4-nfs NFSv2 udp lock
 rhel4-nfs NFSv3 udp lock
 rhel4-nfs NFSv3 tcp lock

 Since the issue was seen on all clients the issue was narrowed down to the a
server side issue.

Version-Release number of selected component (if applicable):
 2.6.9-42.ELsmp

How reproducible:
 Unknown

Actual results:
 See URL

Additional info:
 When the server was in the "bad" state Steve D manually crashed the kernel to
get a vmcore file. Here is a pointer to the vmcore.
ndnc-1.lab.boston.redhat.com:/var/crash/10.12.4.232-2007-06-11-09:22

Comment 1 Jeff Layton 2007-07-18 20:08:36 UTC

The sysrq data shows this at some point prior to the crash:

rpciod        D F5CABE7C  2492  3479      1          3483  3478 (L-TLB)
f5cabeb4 00000046 fcc80a40 f5cabe7c c011cdb9 c1aa5830 f5cbec30 c1804de0 
       c1804de0 00000086 c1804de0 c1804de0 00000001 00000000 fcc80a40 001aaa9e 
       c1aa5830 f5cbe6b0 f5cbe81c f7e36028 c0120420 063881e0 f7e36000 f5cabed8 
Call Trace:
 [<c011cdb9>] activate_task+0x88/0x95
 [<c0120420>] prepare_to_wait+0x12/0x4c
 [<c0130dff>] flush_cpu_workqueue+0x12f/0x153
 [<c01204f5>] autoremove_wake_function+0x0/0x2d
 [<c01204f5>] autoremove_wake_function+0x0/0x2d
 [<c0130e92>] flush_workqueue+0x6f/0x94
 [<f8aa45d0>] xprt_destroy+0x20/0x44 [sunrpc]
 [<f8aa1454>] rpc_destroy_client+0x7c/0xa4 [sunrpc]
 [<f8aa5672>] rpc_release_task+0x11a/0x135 [sunrpc]
 [<f8aa5168>] __rpc_execute+0x31b/0x326 [sunrpc]
 [<f8aa5211>] __rpc_schedule+0x3f/0x93 [sunrpc]
 [<f8aa5952>] rpciod+0xf7/0x251 [sunrpc]
 [<c01204f5>] autoremove_wake_function+0x0/0x2d
 [<c02d4622>] ret_from_fork+0x6/0x14
 [<c01204f5>] autoremove_wake_function+0x0/0x2d
 [<f8aa585b>] rpciod+0x0/0x251 [sunrpc]
 [<c01041f5>] kernel_thread_helper+0x5/0xb


Then, from the core:

crash> set 3479
    PID: 3479
COMMAND: "rpciod"
   TASK: f5cbe6b0  [THREAD_INFO: f5cab000]
    CPU: 1
  STATE: TASK_RUNNING (ACTIVE)
crash> bt
PID: 3479   TASK: f5cbe6b0  CPU: 1   COMMAND: "rpciod"
 #0 [f5cabd0c] smp_call_function_interrupt at c0116cab
 #1 [f5cabd14] call_function_interrupt at c02d511d
    EAX: d532ad08  EBX: 0000de26  ECX: c03506c0  EDX: 000000d3  EBP: 0000ee48 
    DS:  007b      ESI: 0000005e  ES:  007b      EDI: d0785800 
    CS:  0060      EIP: c02b3d3e  ERR: fffffffb  EFLAGS: 00000282 
 #2 [f5cabd48] udp_v4_get_port at c02b3d3e
 #3 [f5cabd6c] inet_autobind at c02ba124
 #4 [f5cabd78] inet_sendmsg at c02ba940
 #5 [f5cabd8c] sock_sendmsg at c0277daa
 #6 [f5cabe74] kernel_sendmsg at c0277dea
 #7 [f5cabe80] xdr_sendpages at f8aab426
 #8 [f5cabf04] xprt_transmit at f8aa3a3e
 #9 [f5cabf34] call_transmit at f8aa1cf4
#10 [f5cabf3c] __rpc_execute at f8aa4ef1
#11 [f5cabf94] __rpc_schedule at f8aa520c
#12 [f5cabfa0] rpciod at f8aa594d
#13 [f5cabff0] kernel_thread_helper at c01041f3

So at the time of the crash, it looks like rpciod was trying to send a message
on a (new?) unbound socket, and so was trying to bind it.

LOAD AVERAGE: 2.37, 2.51, 3.03

load avg was high, and steve said the thread was eating 100% of a CPU.

Comment 2 Jeff Layton 2007-07-18 20:10:17 UTC

See this in the ring buffer too -- not sure if it's related. May have something
to do with netconsole?

e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
  Tx Queue             <0>
  TDH                  <be>
  TDT                  <e>
  next_to_use          <e>
  next_to_clean        <bc>
buffer_info[next_to_clean]
  time_stamp           <bf1ad67b>
  next_to_watch        <be>
  jiffies              <bf1af9de>
  next_to_watch.status <0>
NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex

Comment 3 Jeff Layton 2007-07-19 14:51:32 UTC

Before the kernel crashed (and before and after the Tx Unit Hang messages),
there was also some stuff like this in the log:

nsm_mon_unmon: rpc failed, status=-5
lockd: cannot monitor 10.12.4.31
portmap: too small RPC reply size (0 bytes)
nsm_mon_unmon: rpc failed, status=-5
lockd: cannot monitor 10.12.4.31
portmap: too small RPC reply size (0 bytes)
nsm_mon_unmon: rpc failed, status=-5
lockd: cannot monitor 10.12.4.81

...in hindsight, cranking up rpc_debug would have been helpful. The info we have
here is somewhat sparse. I'm thinking we won't be able to tell much here unless
this happens again. If so, we might want to leave the box in that state and see
if we can track down what it's actually doing.

I'm suspecting that the problem may have ultimately been caused by something
lower than NFS (network card problems maybe), but it's hard to say without some
more info.

Comment 4 Jeff Layton 2007-08-14 11:10:14 UTC

Setting this to NEEDINFO for now. I don't have enough to go on here to determine
the cause. If this occurs again, please try to turn up RPC debugging:

# echo 32767 > /proc/sys/sunrpc/rpc_debug

and let it sit with that for a few minutes before forcing a core.

Comment 5 Jeff Burke 2007-08-20 17:33:01 UTC

OK will do. You were not able to get any data from the vmcore file?

Thanks,
Jeff

Comment 6 Jeff Layton 2007-08-20 18:32:39 UTC

I got a bit -- it's detailed in the above comments. Unfortunately, I can't tell
much. rpciod doesn't seem to be locked up hard. The sysrq data shows it at a
different place than the stack in the actual crash, which might indicate that it
was in some sort of loop, or could be indicative that it wasn't hung at all.

A second core from the same issue might be more helpful, especially if
rpc_debugging is turned up before crashing the box. Even better would be to let
me know if you have a machine in this situation so I can poke at it while it's
still "alive".

Comment 7 Jeff Burke 2007-08-20 23:21:26 UTC

Jeff,
  I found out that there was a second issue last week while I was on vacation.
Steve D was looking at it. Looks like it may have panic(But I am not sure) This
may/may not be a different issue.

 Here is a link to the vmcore from when the system had an issue.
ndnc-1.lab.boston.redhat.com:/var/crash/10.12.4.232-2007-08-15-11:36

Thanks,
Jeff

Comment 8 Jeff Layton 2007-08-21 01:35:29 UTC

This issue looks different, but I suppose there's a chance that they ultimately
have the same cause:

kernel BUG at include/asm/spinlock.h:133!

That error means that the spinlock magic was corrupt:

void
svc_wake_up(struct svc_serv *serv)
{
        struct svc_rqst *rqstp;

        spin_lock_bh(&serv->sv_lock);

...so the serv->sv_lock was bad (and that likely means that the whole svc_serv
struct was bad). The ring buffer in the core didn't have any info before the
panic, so it's hard to tell what what happened leading up to the crash. 

Here's the stack:

PID: 3618   TASK: f63deeb0  CPU: 0   COMMAND: "rpciod"
 #0 [caeb6dc0] netpoll_start_netdump at f8a52570
 #1 [caeb6de0] die at c0106045
 #2 [caeb6e14] do_invalid_op at c0106420
 #3 [caeb6ec4] error_code (via invalid_op) at c02d51dd
    EAX: f8aa7c45  EBX: c1afd518  ECX: c02e6fa1  EDX: c02e6fa1  EBP: c28ea808 
    DS:  007b      ESI: c1afd500  ES:  007b      EDI: ec2a4a80 
    CS:  0060      EIP: c02d33c2  ERR: ffffffff  EFLAGS: 00010216 
 #4 [caeb6f00] _spin_lock_bh at c02d33c2
 #5 [caeb6f08] svc_wake_up at f8aa7c40
 #6 [caeb6f14] nlmsvc_grant_callback at f8c41f1d
 #7 [caeb6f2c] __rpc_execute at f8aa50f7
 #8 [caeb6f84] __rpc_schedule at f8aa520c
 #9 [caeb6f90] rpciod_killall at f8aa5ad8
#10 [caeb6f9c] rpciod at f8aa5a52
#11 [caeb6ff0] kernel_thread_helper at c01041f3

rpciod_killall() is in the stack here, which indicates that rpciod was in the
process of shutting down at the time. Looks like lockd was in the process of
shutting down too:

crash> bt
PID: 3617   TASK: f63de930  CPU: 1   COMMAND: "lockd"
 #0 [f2c26f14] schedule at c02d25c9
 #1 [f2c26f78] wait_for_completion at c02d26f6
 #2 [f2c26fcc] rpciod_down at f8aa5cca
 #3 [f2c26fd4] lockd at f8c40f31
 #4 [f2c26ff0] kernel_thread_helper at c01041f3

I don't see any NFS filesystems in the mount table, so I'm guessing this
happened after unmounting the last NFS filesystem on the box.

Is there already a separate BZ for this?

Comment 9 Jeff Layton 2007-08-21 14:30:13 UTC

Ahh Jeff pointed out that this is actually a nfs server. So lockd should only be
shutting down if nfs is going down.

Here's something odd though -- there are 2 lockd threads in the process list:

   3617      1   1  f63de930  UN   0.0       0      0  [lockd]
  27857      1   1  c1b46e30  UN   0.0       0      0  [lockd]

the second thread looks like this:

crash> set 27857
    PID: 27857
COMMAND: "lockd"
   TASK: c1b46e30  [THREAD_INFO: f152b000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE 
crash> bt
PID: 27857  TASK: c1b46e30  CPU: 1   COMMAND: "lockd"
 #0 [f152bf20] schedule at c02d25c9
 #1 [f152bf84] __sched_text_start at c02d1bb4
 #2 [f152bfb8] __down_failed at c02d1d2b
 #3 [f152bfc8] .text.lock.sched (via rpciod_up) at f8aa5e92
 #4 [f152bfd4] lockd at f8c40d7f
 #5 [f152bff0] kernel_thread_helper at c01041f3

I think it's blocked on the nlmsvc_sema (which is probably being held by whoever
is calling lockd_down).

Possibility? lockd was taking an inordinate amount of time to come down, nfs was
restarted or something and a second lockd thread was started...

Comment 10 Jeff Layton 2007-08-21 16:14:44 UTC

err...actually the second lockd thread is blocked on the rpciod_sema from
rpciod_up. This makes sense since rpciod is coming down and the other lockd is
holding it in rpciod_down.

My guess is that rpciod was taking a long time to come down and the svc_serv
struct that it crashed while accessing was freed and reused before it could get
to it. Either that or we have memory corruption of some sort...

Comment 11 Jeff Layton 2007-08-21 18:15:37 UTC

I've cloned this as bug 253754 to track the second panic. I'm thinking the 2
problems are unrelated (though they could ultimately be due to the same issue).

Comment 12 Greg Morgan 2007-12-13 01:34:35 UTC

Comment #2 caught my eye when I was searching for info on bug 398921.  I thought
I'd add this note because of possible similarities via the tx hang issue.  I
have a PC in bug 398921 that is an NFS server client.  The client was loading
340 gig of data onto a usb 2.0 hard drive.  I left it going overnight hoping for
the best.  The tx hang makes for slow loading.  In the morning the system was
locked i.e. the screen was dark and a tap of the mouse or shift key would not
bring up the screen saver password dialog.  I thought it was an X issue so I
performed a control, shift, backspace key sequence to restart X.  No go.  I wish
I would have used an alt, shift, F1 to get to the console.  There had been so
many tx hang issues that the network service had stopped. An
/etc/rc.d/init.d/network restart would have done the trick.  I wound up
rebooting the computer instead.  The tx hang message was the last thing in the
log files before the reboot messages.   It might be worth a consideration.

Comment 13 Jeff Layton 2007-12-13 11:33:19 UTC

Thanks Greg. I suspect that this problem was ultimately caused by something
lower in the network stack than NFS. I think I've tracked down the cause of bug
253754, which I spun off from this one. The cause of that oops seems to be a
subtle use-after-free bug in lockd, but spotty network communication appears to
be a contributing factor.

Comment 14 Jeff Layton 2008-01-10 17:39:03 UTC

Another event today. The machine had apparently been up since Dec 11th. 

Jeff turned up rpc_debug when the problem occurred. It pretty much brought the
box to its knees and Jeff had to power cycle it without getting a core. We have
a lot of logs via netconsole.

For a long time before the problem occurred, the logs were full of messages like
this:

    portmap: server 10.12.4.103 not responding, still trying

...just before things really got hosed, we see a number of messages like this:

portmap: too small RPC reply size (0 bytes)
lockd: cannot unmonitor 10.12.4.58
portmap: too small RPC reply size (0 bytes)
lockd: cannot unmonitor 10.12.4.58
portmap: too small RPC reply size (0 bytes)
lockd: cannot monitor 10.12.5.44
portmap: too small RPC reply size (0 bytes)
lockd: cannot monitor 10.12.5.44

...possible theory:

A client went down and left an RPC call in such a way that the portmap call was
being retried forever? Perhaps we can try to reproduce that somehow:

Get a server into a state where it's printing messages like this:

    portmap: server 10.12.4.103 not responding, still trying

....and then keep the client down and watch the box. Perhaps there's a memory
leak or RPC call leak in this codepath?

Finally, if the portmap not responding messages are related, then the patch I've
just backported for bug 204309 might be relevant here.

Comment 15 Jeff Layton 2008-01-22 15:45:44 UTC

Ok, finally got a chance to look at this more closely. I've downloaded the log
from the netdump server and started going through it. After rpc_debug was
enabled, we saw more or less the same sort of messages cycling over and over.
Here's an example from task 13894 -- the others look similar but there is some
variation:

# bzcat 2007-12-11-log.bz2 | grep "RPC: 13894"
RPC: 13894 xmit complete
RPC: 13894 sleep_on(queue "xprt_pending" time -1718939168)
RPC: 13894 added to queue c3504a88 "xprt_pending"
RPC: 13894 setting alarm for 10000 ms
RPC: 13894 added to queue f8c2df20 "schedq"
RPC: 13894 removed from queue f8c2df20 "schedq"
RPC: 13894 rpc_execute flgs 4a9
RPC: 13894 call_status (status -110)
RPC: 13894 call_timeout (minor)
RPC: 13894 call_bind clnt 0xc32d2800 port 28416 (status 0)
RPC: 13894 call_connect xprt c3504a00 is connected
RPC: 13894 call_transmit (status 0)
RPC: 13894 xprt_prepare_transmit
RPC: 13894 xprt_cwnd_limited cong = 0 cwnd = 256


Almost all of the tasks show this set of RPC flags.

RPC: 13894 rpc_execute flgs 4a9

That is:

#define RPC_TASK_ASYNC          0x0001          /* is an async task */
#define RPC_TASK_CHILD          0x0008          /* is child of other task */
#define RPC_CALL_MAJORSEEN      0x0020          /* major timeout seen */
#define RPC_TASK_DYNAMIC        0x0080          /* task was kmalloc'ed */
#define RPC_TASK_NOINTR         0x0400          /* uninterruptible task */

...though some don't have RPC_CALL_MAJORSEEN set. The part that concerns me here
is this:

#define RPC_TASK_CHILD          0x0008          /* is child of other task */

...I think that means that this is a recursive RPC call. An RPC task spawned a
new RPC task. That flag is only set in rpc_new_child(), and that function is
only called from rpc_getport(), which is how portmap calls get done.

That would seem to confirm that the large number of calls shown here are
portmapper calls. The question is this -- do these calls eventually time out and
go away or are they being retried more or less forever? Given what we know about
bug 204309, I suspect the latter -- these calls are being retried forever.

I may see if I can reproduce this, but we need some method of viewing the
current RPC queue at various times. I had a patch that added a new sysrq handler
for that, but I'll need to dig it up again. If this theory is confirmed then I
suspect that the patch for bug 204309 will probably fix this as well since it
makes portmap calls eventually error out.

Comment 16 Jeff Layton 2008-01-22 19:59:10 UTC

Created attachment 292554 [details]
debug patch -- add sysrq handler to call rpc_show_tasks()

Quick and dirty debug patch to add a new sysrq handler to RHEL4. Doing a
sysrq-q calls rpc_show_tasks, so we can get a snapshot of the RPC queue over
time.

Comment 17 Jeff Layton 2008-01-22 20:05:30 UTC

Patched the sunrpc module with the above patch. I then had a rhel5 client mount
my rhel4 server, using NFSv2/UDP. I locked a file on the server and then had the
rhel5 client do a blocking lock on the same file. I then set up iptables rules
on the client to drop all traffic to and from the server. I then released the
lock on the server. That should make the server start trying to do a grant
callback to the client, but it can't communicate with it.

I start seeing messages like this in the logs:

portmap: server 192.168.1.246 not responding, still trying
portmap: server 192.168.1.246 not responding, still trying

...at first I saw just a few RPC tasks in queue:

SysRq : Dump RPC Queue
-pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- --exit--
00026 0010 0481 000000 dfe8fe00 100021 da1470ac 00060000 bindwait e08c1a7b e0915e70
00025 0003 04a9 000000 dffd2e00 100000 df917000 00010000 xprt_pending e08c1d16
e08c569d
00024 0010 0481 000000 dfe8fe00 100021 da147000 00060000   childq e08c1a7b e0915e70

...after 20 minutes or so, the queue looked like this:

SysRq : Dump RPC Queue
-pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- --exit--
00065 0010 0481 000000 dfe8fe00 100021 da147158 00060000 bindwait e08c1a7b e0915e70
00064 0010 0481 000000 dfe8fe00 100021 da1470ac 00060000 bindwait e08c1a7b e0915e70
00063 0003 04a9 000000 df90ac00 100000 da4ed000 00010000 xprt_pending e08c1d16
e08c569d
00060 0003 04a9 000000 c1784200 100000 dc37b000 00020000 xprt_pending e08c1d16
e08c569d
00057 0003 04a9 000000 c16e3200 100000 dabce000 00020000 xprt_pending e08c1d16
e08c569d
00054 0003 04a9 000000 dffda400 100000 db27d000 00010000 xprt_pending e08c1d16
e08c569d
00051 0003 04a9 000000 dddba200 100000 dabcf000 00020000 xprt_pending e08c1d16
e08c569d
00048 0003 04a9 000000 df938a00 100000 dec55000 00020000 xprt_pending e08c1d16
e08c569d
00045 0003 04a9 000000 c1793c00 100000 db1e9000 00020000 xprt_pending e08c1d16
e08c569d
00042 0003 04a9 000000 dfedf600 100000 db263000 00005000 xprt_pending e08c1d16
e08c569d
00039 0003 04a9 000000 dffd6600 100000 dcbba000 00020000 xprt_pending e08c1d16
e08c569d
00036 0003 04a9 000000 c15fbc00 100000 da4e8000 00020000 xprt_pending e08c1d16
e08c569d
00033 0003 04a9 000000 c17cfc00 100000 dc538000 00010000 xprt_pending e08c1d16
e08c569d
00030 0003 04a9 000000 c1796400 100000 dc6f2000 00020000 xprt_pending e08c1d16
e08c569d
00027 0003 04a9 000000 c16fce00 100000 de6e1000 00020000 xprt_pending e08c1d16
e08c569d
00025 0003 04a9 000000 dffd2e00 100000 df917000 00010000 xprt_pending e08c1d16
e08c569d
00024 0010 0481 000000 dfe8fe00 100021 da147000 00060000   childq e08c1a7b e0915e70

...note that the tasks here never really seem to time out. I think this might be
part of the problem. This consumes memory and other resources on the server,
eventually I could see the server getting very bogged down retrying all of these
requests.

Comment 18 Jeff Layton 2008-01-22 20:51:58 UTC

Added a bit of code to do a dump_stack() whenever we call rpc_new_child. The
first couple of child tasks come from a stack like this:

 [<e08c5700>] rpc_new_child+0x1b/0x45 [sunrpc]
 [<e08ca983>] rpc_getport+0x103/0x15a [sunrpc]
 [<e08c4f03>] __rpc_execute+0xa6/0x326 [sunrpc]
 [<e08c6457>] rpcauth_bindcred+0x7e/0xb8 [sunrpc]
 [<e08c15de>] rpc_call_sync+0x68/0x8c [sunrpc]
 [<e091755b>] nsm_mon_unmon+0xaf/0xd0 [lockd]
 [<e09175a9>] nsm_monitor+0x2d/0x56 [lockd]
 [<e09164a1>] nlmsvc_retrieve_args+0x47/0xaa [lockd]
 [<e091664b>] nlmsvc_proc_lock+0x5d/0xdd [lockd]
 [<e0917ea2>] nlmsvc_decode_lockargs+0x0/0x31a [lockd]
 [<e0914cbf>] nlmsvc_dispatch+0x7c/0x122 [lockd]
 [<e08c7635>] svc_process+0x444/0x6f3 [sunrpc]
 [<e0914ee8>] lockd+0x183/0x270 [lockd]
 [<e0914d65>] lockd+0x0/0x270 [lockd]
 [<c01041f5>] kernel_thread_helper+0x5/0xb

...the remaining ones look like this:

 [<e08c5700>] rpc_new_child+0x1b/0x45 [sunrpc]
 [<e08ca983>] rpc_getport+0x103/0x15a [sunrpc]
 [<e08c4f03>] __rpc_execute+0xa6/0x326 [sunrpc]
 [<e08c5221>] __rpc_schedule+0x3f/0x93 [sunrpc]
 [<e08c597d>] rpciod+0xf7/0x251 [sunrpc]
 [<c012052d>] autoremove_wake_function+0x0/0x2d
 [<c02d8586>] ret_from_fork+0x6/0x14
 [<c012052d>] autoremove_wake_function+0x0/0x2d
 [<e08c5886>] rpciod+0x0/0x251 [sunrpc]
 [<c01041f5>] kernel_thread_helper+0x5/0xb

...I also added a bit of extra info to rpc_show_tasks (calldata and the task
address itself):

SysRq : Dump RPC Queue
-pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action-
--exit-- calldata taskaddr
00046 0010 0481 000000 c16fc200 100021 daf9a0ac 00060000 bindwait e08c1a7b
e0915e70 df0e7408 dd8ff580
00045 0003 0489 000000 dfe82800 100000 da939000 00010000 xprt_pending e08c1d16
e08c569d dd8ffe80 dd8ff4c0
00042 0003 04a9 000000 ddfa5c00 100000 dbc6c000 00005000 xprt_pending e08c1d16
e08c569d dd8ffe80 dd8ff1c0
00039 0003 04a9 000000 c17d8c00 100000 dd0c5000 00020000 xprt_pending e08c1d16
e08c569d dd8ffe80 dd8ff280
00036 0003 04a9 000000 c178ce00 100000 db1b6000 00020000 xprt_pending e08c1d16
e08c569d dd8ffe80 dd8ff100
00033 0003 04a9 000000 c16dce00 100000 dd476000 00010000 xprt_pending e08c1d16
e08c569d dd8ffe80 dd8ff340
00030 0003 04a9 000000 c15ef200 100000 dccb5000 00020000 xprt_pending e08c1d16
e08c569d dd8ffe80 dd8ff040
00027 0003 04a9 000000 df905200 100000 dbf9b000 00020000 xprt_pending e08c1d16
e08c569d dd8ffe80 dd8ff400
00024 0003 04a9 000000 c1796200 100000 dbdd9000 00010000 xprt_pending e08c1d16
e08c569d dd8ffe80 dd8ffc40
00021 0003 04a9 000000 c16fbc00 100000 df3e0000 00005000 xprt_pending e08c1d16
e08c569d dd8ffe80 dd8ffdc0
00020 0010 0481 000000 c16fc200 100021 daf9a000 00060000   childq e08c1a7b
e0915e70 df0e7408 dd8ffe80

...the interesting bit is that the calldata is the same for almost all of the
tasks in queue here. They're all children of the same task. I think we probably
need to change this behavior. I don't think it does any good to spawn more than
one portmapper request at a time. Perhaps if there's already a portmapper
request in progress, we can just try to wake it up and not spawn a new one.

I also need to check upstream to see how it behaves (and whether it might have
the same issue).

Comment 19 Jeff Layton 2008-01-23 18:38:16 UTC

Heh, it occurs to me that a sysrq key isn't really necessary. Doing:

# echo 0 > /proc/sys/sunrpc/rpc_debug 

...does a similar thing. Testing rawhide today, after ~10 minutes, the queue
looks like this:

-pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- ---ops--
   42 0003 0281      0 df1f1690 100000 ca403000    60000 xprt_pending e0a5eb17
e0a6ef94
   28 0010 04a1    -11 dee93af0 100021 ca640000     5000   delayq e0a5ebbd e0a37fac
   30 0010 04a1    -11 dee93af0 100021 ca6400cc     5000   delayq e0a5ebbd e0a37fac
   31 0010 04a1    -11 dee93af0 100021 ca640198     5000   delayq e0a5ebbd e0a37fac
   32 0010 04a1    -11 dee93af0 100021 ca640264     5000   delayq e0a5ebbd e0a37fac
   33 0010 04a1    -11 dee93af0 100021 ca640330     5000   delayq e0a5ebbd e0a37fac
   36 0010 04a1    -11 dee93af0 100021 ca6403fc     5000   delayq e0a5ebbd e0a37fac
   37 0010 04a1    -11 dee93af0 100021 ca6404c8     5000   delayq e0a5ebbd e0a37fac
   38 0010 04a1    -11 dee93af0 100021 ca640594     5000   delayq e0a5ebbd e0a37fac
   39 0010 04a1    -11 dee93af0 100021 ca640660     5000   delayq e0a5ebbd e0a37fac
   40 0010 04a1    -11 dee93af0 100021 ca64072c     5000   delayq e0a5ebbd e0a37fac
   41 0010 04a1    -11 dee93af0 100021 ca6407f8     5000   delayq e0a5ebbd e0a37fac
   43 0010 04a1    -11 dee93af0 100021 ca6408c4     5000   delayq e0a5ebbd e0a37fac
   44 0010 04a1    -11 dee93af0 100021 ca640990     5000   delayq e0a5ebbd e0a37fac
   45 0010 0481    -11 dee93af0 100021 ca640a5c     5000   delayq e0a5ebbd e0a37fac
   46 0010 0481    -11 dee93af0 100021 ca640b28     5000   delayq e0a5ebbd e0a37fac


...so it looks like rawhide has a similar problem. I *think* a simple fix might
be to just make all portmap children be soft tasks. In the event that one times
out, the main task is still a hard task and it should just reattempt the portmap
call if it isn't bound.

The other alternative would be to make it so that the main task can only spawn a
single portmap call at a time, but that sounds more invasive.

Comment 20 Jeff Layton 2008-01-24 16:30:19 UTC

Actually...the problem on rawhide is different...

In the rawhide case, most of the queued up RPC calls are for program 100021
(nlockmgr), whereas on RHEL4 they are mostly portmap calls. One major difference
is that portmap calls on rawhide are soft tasks. I think that's probably a
change we should consider for RHEL as well, though it needs testing to make sure
it's safe.

There's still an issue on rawhide though. nlm_grant_blocked calls are hard
tasks, but it does this:

        /* If b_granted is true this means we've been here before.
         * Just retry the grant callback, possibly refreshing the RPC
         * binding */
        if (block->b_granted) {
                nlm_rebind_host(block->b_host);
                goto callback;
        }

...so on each pass through nlm_retry_blocked(), we end up kicking off a new RPC
task, irrespective of the fact that we already have hard tasks for this retrying
in queue.

The easiest fix looks like it'll be to just skip reissuing this call if there's
a call in queue, but that may not take into account failures that don't allow
the call to retry. Changing these to soft tasks might be the best method...

Comment 21 Jeff Layton 2008-01-24 17:01:46 UTC

Looking back through the logs from the Jan 10th event...

When Jeff turned up rpc_debug, we should have gotten a call to rpc_show_tasks. I
tracked that down in the logs and I see a bunch of these:

-pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- --exit--
36379 0003 04a9 -00110 f1dfd000 100000 ed16a000 00000000   schedq f8c10d19 f8c146a5
36378 0003 04a9 -00110 eecd4c00 100000 d88f0000 00000000   schedq f8c10d19 f8c146a5
36377 0003 04a9 -00110 e486f600 100000 d6870000 00000000   schedq f8c10d19 f8c146a5

These are portmap calls (prog is 100000). The last task in the list:

03074 0010 0481 000000 f7e8e400 100021 d5545000 00060000  unknown f8c10a7b f8c5fe20

$ bzcat 2007-12-11-log.bz2 | grep 'schedq f8c10d19 f8c146a5' | wc -l
29308

...

So the netdump log shows >29000 portmap calls were queued up. It may even be
more since netlog can drop messages at times. In any case, I think changing
those to be soft RPC tasks is probably reasonable. I'll roll up a patch and see
if we can start testing it.

Unfortunately, we don't know for sure if that's what caused the problem, but it
seems like its something that should really be fixed, and may help us catch the
real culprit if this happens again.

The problem I've seen on rawhide looks like a similar, but unrelated issue...

Comment 22 Jeff Layton 2008-01-24 17:28:38 UTC

Jeff, if this occurs again anytime soon, could you just do the following on the box?

# echo 0 > /proc/sys/sunrpc/rpc_debug

...that should just make it do an rpc_show_tasks and dump the state of the RPC
queue to the ring buffer. It won't enable the debugging, however, so hopefully
it won't bring the box to its knees as badly.

I think I'm on track of the cause here, but it may be hard to tell since I can't
reproduce the problem exactly.

Comment 23 Jeff Layton 2008-01-24 20:26:13 UTC

I think this problem is actually a regression due to the patch in bug 172082.
Prior to that patch going in, portmap calls were always soft RPC tasks, and I
don't see any period of time where that was not the case upstream.

That patch added this delta to rpc_getport:

+	/* use the soft setting from the orignal clnt */
+	pmap_clnt->cl_softrtry = clnt->cl_softrtry;
+

...which makes all portmap children inherit the soft setting from the parent
task. I think this delta is wrong and should be reverted. That should keep this
problem from occurring.

Steve, bug 172082 was yours...any thoughts?

Comment 24 Jeff Layton 2008-01-25 14:11:14 UTC

Going to try to see if we can fix this for 4.7. I want to do a bit more testing
before I call this a regression, but I believe that it is.

As long as we have the patch for bug 204309, I think that the delta above is
unnecessary, but I need to check it against the reproducer in bug 172082.

Comment 25 RHEL Program Management 2008-01-25 15:26:49 UTC

This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 26 Jeff Layton 2008-01-25 15:31:56 UTC

Created attachment 292949 [details]
patch -- make sure that kernel portmap calls are always soft tasks

I think this patch will fix this issue, and it makes it so that the behavior
here matches upstream behavior. AFAICT, upstream has always had pmap calls be
soft.

Note that this patch alone may cause bug 172082 to pop up again, but this patch
along with the patch for bug 204309 should make everything work ok.

Comment 27 RHEL Program Management 2008-01-25 15:36:35 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 28 Jeff Layton 2008-01-25 15:39:06 UTC

Regarding bug 172082:

Chuck's reproducer:

Steps to Reproduce:
1.  Bring up an NFS server and mount it from a client
2.  Kill the portmapper on the server
3.  Attempt to run anything that tries to do NLM locking

...when I do this on a client with this patch + the patch for bug 204309, it
(properly) hangs for ~25 minutes and then errors out with a -EIO. From an strace:

1201273564.642552 fcntl64(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0,
len=0}) = -1 EIO (Input/output error) <1619.831624>

This was just testing TCP. Obviously we'll want to test that situation with UDP
as well.

Comment 30 Vivek Goyal 2008-03-06 03:28:32 UTC

Committed in 68.17.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 31 Jeff Burke 2008-03-28 11:58:33 UTC

Jeff L,
    I installed the 2.6.9-68.17.EL kernel on our static RHEL4 NFS server. This
system is the one we use to run client connectathon test in RHTS.

    When I did that all of the connectathon test machines failed. Here is a link
to the test results:
http://rhts.redhat.com/cgi-bin/rhts/job_matrix.cgi?method=job&jobs=18621&whiteboard_filter=&project_filter=&action=Generate

   What I noticed on the server was this:
/var/log/messages
Mar 28 06:44:29 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:44:33 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:44:33 dell-pe840-01 mountd[21233]: authenticated unmount request from
g13.lab.boston.redhat.com:818 for /export/home (/export/home)
Mar 28 06:44:41 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:44:41 dell-pe840-01 mountd[21233]: authenticated unmount request from
g16.lab.boston.redhat.com:666 for /export/home (/export/home)
Mar 28 06:44:44 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:44:44 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:44:51 dell-pe840-01 mountd[21233]: authenticated mount request from
g16.lab.boston.redhat.com:774 for /export/home (/export/home)
Mar 28 06:44:51 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:44:51 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:45:02 dell-pe840-01 mountd[21233]: authenticated mount request from
g16.lab.boston.redhat.com:904 for /export/home (/export/home)
Mar 28 06:45:02 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:45:10 dell-pe840-01 last message repeated 10 times
Mar 28 06:45:10 dell-pe840-01 mountd[21233]: authenticated unmount request from
g16.lab.boston.redhat.com:902 for /export/home (/export/home)
Mar 28 06:45:20 dell-pe840-01 mountd[21233]: authenticated mount request from
g16.lab.boston.redhat.com:1011 for /export/home (/export/home)
Mar 28 06:45:20 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:45:26 dell-pe840-01 last message repeated 7 times
Mar 28 06:45:38 dell-pe840-01 mountd[21233]: authenticated mount request from
g16.lab.boston.redhat.com:780 for /export/home (/export/home)
Mar 28 06:45:38 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:45:38 dell-pe840-01 rpc.statd[20877]: Received erroneous SM_UNMON
request from 127.0.0.1 for 192.168.77.215
Mar 28 06:45:51 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:45:51 dell-pe840-01 mountd[21233]: authenticated unmount request from
g16.lab.boston.redhat.com:898 for /export/home (/export/home)
Mar 28 06:46:02 dell-pe840-01 kernel: nfsd: non-standard errno: -38
Mar 28 06:46:02 dell-pe840-01 kernel: nfsd: non-standard errno: -38

dmesg
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38
nfsd: non-standard errno: -38

This kernel is causing a regression. I am going to put the server back to the Z
stream kernel.

Comment 32 Jeff Layton 2008-03-28 12:07:49 UTC

Interesting. I rather doubt that the patch that fixes this bug is causing this:

#define ENOSYS          38      /* Function not implemented */

...I wonder if that problem might be related to bug 438723?

Comment 33 Jeff Burke 2008-03-28 12:15:20 UTC

Jeff

  Let me step back one min, I may have jumped the gun a bit. I didn't realize
that it was a test kernel that failed in RHTS.

  Since the kernel that was running the test was Vivek's
2.6.9-68.29.EL.vgoyal.test1. I have put the server back to the previous Z-Stream
kernel and kicked off a connectathon test using Vivek's test kernel.

  This will tell us for sure if it was client or server. Sorry for raising the
alarm. Let me get more data for you.

Jeff

Comment 37 errata-xmlrpc 2008-07-24 19:14:19 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html