672305 – Repeatable NFS mount hang

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 672305 - Repeatable NFS mount hang

Summary: Repeatable NFS mount hang

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	6.0
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Jeff Layton
QA Contact:	Filesystem QE
Docs Contact:
URL:
Whiteboard:
Duplicates (3):	675310 697029 698696 (view as bug list)
Depends On:	669204
Blocks:	692315
TreeView+	depends on / blocked

Reported:	2011-01-24 18:59 UTC by Neal Kim
Modified:	2018-11-26 18:33 UTC (History)
CC List:	34 users (show)
Fixed In Version:	kernel-2.6.32-130.el6
Doc Type:	Bug Fix
Doc Text:
Clone Of:	669204
Clones:	692315 (view as bug list)
Environment:
Last Closed:	2011-05-19 12:30:20 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
nfs.ko (598.76 KB, application/octet-stream) 2011-01-25 21:17 UTC, Neal Kim	no flags	Details
sunrpc.ko (403.37 KB, application/octet-stream) 2011-01-25 21:18 UTC, Neal Kim	no flags	Details
rpc_debug output from Fedora (37.75 KB, text/plain) 2011-02-04 20:58 UTC, Neal Kim	no flags	Details
RPC debug from RHEL 6 (272.67 KB, text/plain) 2011-03-17 17:52 UTC, Adam Drew	no flags	Details
SUNRPC: Deal with the lack of a SYN_SENT sk->sk_state_change callback... (1.97 KB, patch) 2011-03-19 00:36 UTC, Trond Myklebust	no flags	Details \| Diff
various rpc info relating to commit task in core1 (4.71 KB, text/plain) 2011-03-21 11:37 UTC, Jeff Layton	no flags	Details
various rpc info relating to commit task in core1 (12.86 KB, text/plain) 2011-03-21 11:49 UTC, Jeff Layton	no flags	Details
NFS: Fix a hang/infinite loop in nfs_wb_page() (2.53 KB, patch) 2011-03-21 19:38 UTC, Trond Myklebust	no flags	Details \| Diff
Show Obsolete (1) View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:0542	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 6.1 kernel security, bug fix and enhancement update	2011-05-19 11:58:07 UTC

Comment 5 Steve Dickson 2011-01-25 20:56:27 UTC

Neal,

When this hang happens again please do the following
  'echo 0 > /proc/sys/sunrpc/rpc_debug'

which should dump all the active RPC task into /var/log/messages. 

Also can you make the following .ko files available

/lib/modules/`uname -r`/kernel/fs/nfs/nfs.ko
/lib/modules/`uname -r`/kernel/net/sunrpc/sunrpc.ko

The ones that match that dump...

Comment 6 Neal Kim 2011-01-25 21:17:41 UTC

Created attachment 475280 [details]
nfs.ko

Comment 7 Neal Kim 2011-01-25 21:18:25 UTC

Created attachment 475281 [details]
sunrpc.ko

Comment 8 Neal Kim 2011-01-25 21:20:04 UTC

nfs.ko and sunrpc.ko attached (vanilla kernel-2.6.32-71.el6.x86_64).

I should have a dump of rpc_debug shortly.

Comment 15 RHEL Program Management 2011-02-01 06:10:05 UTC

This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 16 Ric Wheeler 2011-02-01 12:56:19 UTC

We will continue to work on this - if we can figure it out really soon, we can try to reflag it for 6.1.

As it looks now, more likely to land in 6.2.

Comment 17 Neal Kim 2011-02-04 20:58:10 UTC

Created attachment 477115 [details]
rpc_debug output from Fedora

Comment 19 Jeff Layton 2011-02-23 01:10:33 UTC

Looking over the rpc_debug info, what I mainly see is a ton of writes queued up. The one at the top of the stack however is this:

[581306.641656] 52282 0001      0 ffff8801fa933600 ffff880264064498   120000 ffffffffa0389700 nfsv3 WRITE a:call_connect_status q:xprt_pending

...the fact that it's sitting in call_connect_status implies that the transport was disconnected and (likely) in the process of being reconnected at the time that this was generated.

The follow-on info adds a bit more, unfortunately not all of the rpc_debug bits were set when this info was done, so we didn't get all of the possible rpc_debug messages (in the future, it might be better to echo in 65535 rather than 1 to rpc_debug). In any case...

[581312.289081] RPC: 52282 xprt_connect_status: retrying

This means that the connect attempt returned with a status of -EAGAIN. I think there are a couple of ways that this can occur, unfortunately I can't tell which since the debug info didn't contain the messages that would tell us this.

[581312.289084] RPC: 52282 xprt_prepare_transmit
[581312.289088] RPC: 52282 xprt_transmit(65676)
[581312.289092] RPC:       disconnected transport ffff88042aeb2000
[581312.289096] RPC: 52282 xprt_connect xprt ffff88042aeb2000 is not connected

...in any case, I suspect that the problem is that the client is attempting to reconnect the socket but is unable to do so for some reason. Beyond that, I can't tell much.

Comment 20 Jeff Layton 2011-02-24 00:26:32 UTC

I grabbed the rhel6 core and started poking at it. The superblock is here:

    0xffff88042c1c6400

The nfs_server is here:

    0xffff880427f54400

The rpc_clnt:

    0xffff88042b815a00

...hanging off the rpc_clnt *should* be a rpc_xprt, which contains the sockaddr for the server. That however seems to be at least corrupt and has probably been freed. So I think what we may be looking at is a rpc_xprt refcount leak somewhere in the code.

Comment 21 Jeff Layton 2011-02-24 00:56:26 UTC

Sorry, not a refcount leak but a refcount imbalance (too many puts/not enough gets).

Comment 22 Jeff Layton 2011-02-24 01:43:21 UTC

...more notes:

crash> struct rpc_clnt.cl_xprt 0xffff88042b815a00
  cl_xprt = 0xffffffffa0548d80, 

...but that address looks odd:

crash> kmem ffffffffa0548d80
ffffffffa0548d80 (D) nfs_rpcstat  

   VM_STRUCT                 ADDRESS RANGE               SIZE
ffff88042ad61540  ffffffffa0503000 - ffffffffa055a000  356352

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffea000e956758 42aafd000                0       6a  1 40000000000000

...so it's the same address as the nfs_rpcstat struct, and that looks valid. It seems unlikely that that address was handed out by a kzalloc, so how did we end up with it in the cl_xprt field? Maybe it's a more generic memory corruption in rpc_clnt?

Comment 23 Jeff Layton 2011-03-08 16:09:12 UTC

Got an email from Lars at Pixar the other day:

"Thanks, Steve--unfortunately, I came in this morning to find a patched
machine hanging just as they did before. Bummer! I verified that the patch
was applied correctly by looking at the sources in the debuginfo package."

...so apparently Trond's patch did not fix the issue. I think we need to
do the following:

When we get a RHEL6 machine into this state, it would be good to confirm
that...

The ideal thing would be to get a kernel with a backport of Trond's patch
for RHEL6 to them, and get it into the state where it's hung like this.
Once it's in that state, crank up rpc_debug for a few minutes:

    # rpcdebug -m rpc -s all

...wait about 5 minutes or so. Try not to do any activity on the box so that
we can focus on what's already queued and the RPC calls that are in progress. Then:

    # rpcdebug -m rpc -c all

...after that, save off the log to a file:

    # dmesg > /tmp/rpc-debug-log.txt

...and attach it here so we can see what the client is actually doing and
why it's not making progress.

Comment 24 Adam Drew 2011-03-17 17:52:46 UTC

Created attachment 486078 [details]
RPC debug from RHEL 6

Comment 25 Jeff Layton 2011-03-17 19:20:51 UTC

Not much info there. Stripping out the soft lockup messages gives this:

-pid- flgs status -client- --rqstp- -timeout ---ops--
39502 0001      0 ffff880c44937200 ffff880c46e690d8        0 ffffffffa0308e70 nfsv3 READ a:call_status q:none
38818 0001      0 ffff880c452e3000 ffff880c474e4930        0 ffffffffa0309000 nfsv3 COMMIT a:call_status q:none
38819 0001      0 ffff880c452e3000 ffff880c474e4310        0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none
38820 0001      0 ffff880c452e3000 ffff880c474e4620        0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none
39014 0001      0 ffff880c452e3000 ffff880c474e4498        0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none
39232 0001      0 ffff880c452e3000 ffff880c474e4ab8        0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none
39249 0001      0 ffff880c452e3000 ffff880c474e4c40        0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none
RPC:       xs_close xprt ffff880c475df800
RPC:       disconnected transport ffff880c475df800 
RPC:       wake_up_next(ffff880c475dfa58 "xprt_resend")
RPC:       wake_up_next(ffff880c475df990 "xprt_sending")
-pid- flgs status -client- --rqstp- -timeout ---ops--
39502 0001      0 ffff880c44937200 ffff880c46e690d8        0 ffffffffa0308e70 nfsv3 READ a:call_status q:none
38818 0001      0 ffff880c452e3000 ffff880c474e4930        0 ffffffffa0309000 nfsv3 COMMIT a:call_status q:none
38819 0001      0 ffff880c452e3000 ffff880c474e4310        0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none
38820 0001      0 ffff880c452e3000 ffff880c474e4620        0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none
39014 0001      0 ffff880c452e3000 ffff880c474e4498        0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none
39232 0001      0 ffff880c452e3000 ffff880c474e4ab8        0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none

I'll have to look more closely at the messages, but at first glance this looks quite different from the problem seen on F14 at Pixar.

Comment 26 Jeff Layton 2011-03-17 19:23:57 UTC

Adam, some details about how they collected this might be helpful. What specifically did they do to enable and disable debugging, and how long did they leave rpc_debug enabled?

Comment 27 Jeff Layton 2011-03-17 19:45:17 UTC

The fact that the task->tk_timeout is 0 in all of these seems quite odd. I'm thinking that should be set to some positive value but there may be a reason for it being that way.

Comment 30 Kevin Constantine 2011-03-17 20:00:34 UTC

Jeff- We collected the rpc_debug with your instructions in Comment 23:

    # rpcdebug -m rpc -s all

...wait about 5 minutes or so. Try not to do any activity on the box so that
we can focus on what's already queued and the RPC calls that are in progress.
Then:

    # rpcdebug -m rpc -c all

...after that, save off the log to a file:

    # dmesg > /tmp/rpc-debug-log.txt

Comment 34 Jeff Layton 2011-03-18 13:06:20 UTC

Looking at 20110317161658_1 first. Just some notes first for future analysis:

The maya.bin process is stuck waiting for a COMMIT to complete. Note that the stack dump that the soft lockup message does often has some "junk" in it. It's really stuck here, which is probably the wait_on_bit call in nfs_commit_inode:

 #6 [ffff88035085b908] prepare_to_wait_exclusive at ffffffff81091ff3
 #7 [ffff88035085b940] __wait_on_bit_lock at ffffffff814c9e68
 #8 [ffff88035085b990] out_of_line_wait_on_bit_lock at ffffffff814c9f58
 #9 [ffff88035085ba00] nfs_commit_inode at ffffffffa025dc9a
#10 [ffff88035085ba70] nfs_wb_page at ffffffffa025de29
#11 [ffff88035085bb10] nfs_flush_incompatible at ffffffffa025dec0

Scraping the stack for info gives:

filp: 0xffff880c4756dc00
dentry: 0xffff88097db793c0
superblock: 0xffff880c46420c00

This is the mount at /data/adai0101. Digging down into the NFS and RPC fields shows:

nfs_server: 0xffff880c3dd1dc00
rpc_clnt: 0xffff880c45934400
rpc_xprt: 0xffff880c43f93000

Comment 35 Jeff Layton 2011-03-18 14:04:23 UTC

To get the list of rpc_tasks hanging off this rpc_clnt:

crash> list rpc_task.tk_task -s rpc_task -H 0xffff880c45934418

The tk_action of all of them is call_status. That suggests that the calls have been transmitted to the server and the client is just waiting on the result.

tk_runstate of all of them is 0x5, which is (RPC_TASK_RUNNING|RPC_TASK_ACTIVE)

tk_flags on all of them are RPC_TASK_ASYNC. The tk_callback on all of them is xprt_timer, which I *think* means that it's waiting for the transport to settle.

Moving to the state of the socket:

crash> struct socket 0xffff880c16491940
struct socket {
  state = SS_CONNECTING, 

crash> struct sock.__sk_common 0xffff8801df177200 | grep skc_state
    skc_state = 0x1, 

...I believe the skc_state is TCPF_ESTABLISHED

Comment 36 Jeff Layton 2011-03-18 14:19:46 UTC

Correction...that should be TCP_ESTABLISHED

Comment 37 Jeff Layton 2011-03-18 15:39:02 UTC

Ok, socket states do seem a bit odd, but it may be normal (I don't understand the socket layer code that well). For now, I'll return focus back to what the processes are actually blocking on and why they're not making progress:

inode = 0xffff8801d7673700

...to get to nfs_inode, pretend the vfs_inode field is a list_head:

crash> struct nfs_inode.flags -l nfs_inode.vfs_inode 0xffff8801d7673700
  flags = 0x84, 

...which is NFS_INO_COMMIT|NFS_INO_FLUSHING. Looking at the disassembly of that function confirms that it's waiting for NFS_INO_COMMIT to clear. So we now know what maya.bin is waiting on.

Now to figure out why it's not happening...

Comment 38 Jeff Layton 2011-03-18 15:48:34 UTC

Actually...it looks like it's trying to set NFS_INO_COMMIT inside of nfs_commit_set_lock. So it seems likely that another task had kicked off a commit (possibly an async one) and that didn't happen for some reason. Regardless, maya.bin is still waiting on that bit to clear so the next step is to figure out why that's not happening. It should be happening when the commit response is received...

Comment 39 Jeff Layton 2011-03-18 16:30:33 UTC

There's only one commit operation running:

crash> struct rpc_task ffff88013dfba948

...confirmed by looking at the tk_ops:

struct rpc_call_ops {
  rpc_call_prepare = 0xffffffffa025ce00 <nfs_write_prepare>, 
  rpc_call_done = 0xffffffffa025c2d0 <nfs_commit_done>, 
  rpc_release = 0xffffffffa025e030 <nfs_commit_release>
}

  tk_owner = 29765, 

...which is the pid of maya.bin. Looking at the rpc_rqst:

  rq_reply_bytes_recvd = 0x98, 
  rq_connect_cookie = 0x0, 
  rq_bytes_sent = 0x0, 
  rq_xtime = {
    tv64 = 0x1f2a0710a7433
  }, 

...so bytes_sent is 0, but bytes recvd is 0x98. Looking back at rpc_xprt:

  state = 0x12, 
  connect_cookie = 0x0, 

...which is XPRT_CONNECTED|XPRT_BOUND. So the rpc_xprt is basically considered to be established.

The connect_cookie is 0 though, which seems very odd -- how do we get to a state with XPRT_CONNECTED|XPRT_BOUND without ever incrementing the connect_cookie? That value starts at 0 when the xprt is created and is incremented whenever the tcp state changes to TCP_FIN_WAIT1 or TCP_SYN_SENT.

How do we get a connected TCP socket without ever entering TCP_SYN_SENT?

Comment 40 Jeff Layton 2011-03-18 17:25:29 UTC

More anecdotal evidence that there is a problem with the socket handling. There are 21 jobs queued to rpciod. 17 are rpc_async_schedule jobs, and the other 4 are xprt_autoclose.

What's not clear to me is why rpciod doesn't seem to be picking them up...

Comment 44 J. Bruce Fields 2011-03-18 21:43:39 UTC

"bytes_sent is 0, but bytes recvd is 0x98"

I think that's right:

                /* If we've sent the entire packet, immediately
                 * reset the count of bytes sent. */
                req->rq_bytes_sent += status;
                req->rq_xmit_bytes_sent += status;
                if (likely(req->rq_bytes_sent >= req->rq_slen)) {
                        req->rq_bytes_sent = 0;
                        return 0;
                }

And 0x98 is the length of a commit response (starting from the top of the rpc header) with pre- and post- attributes.  So that much sounds normal.

Comment 45 J. Bruce Fields 2011-03-18 22:05:24 UTC

"how do we get to a
state with XPRT_CONNECTED|XPRT_BOUND without ever incrementing the
connect_cookie? That value starts at 0 when the xprt is created and is
incremented whenever the tcp state changes to TCP_FIN_WAIT1 or TCP_SYN_SENT."

Grepping for sk_state_change in net/ipv4/, it looks like we don't actually get a callback in the TCP_SYN_SENT case.  The TCP_SYN_SENT case in xs_tcp_state_change might even be completely dead code.

Comment 46 Trond Myklebust 2011-03-19 00:36:26 UTC

Created attachment 486341 [details]
SUNRPC: Deal with the lack of a SYN_SENT sk->sk_state_change callback...

Hi Bruce

Yes. I was sure I had seen a state_change callback for SYN_SENT, but
grepping the code, I can't find it any more.

This patch ought to fix the issue by relying on the return value from
kernel_connect() instead of the state_change callback.

Comment 47 Jeff Layton 2011-03-19 00:42:33 UTC

Ok, that helps. Thanks, Bruce...

So, it doesn't really look like anything is amiss with what I've found so far. Still though, it does at least seem like the COMMIT call is not progressing. The question at this point is why...

The rpc_debug info collected by Kevin seems to indicate that no real RPC activity is going on. Could there be an issue with the workqueue?

Comment 48 Trond Myklebust 2011-03-19 01:00:35 UTC

See your comment #27. If all the RPC calls have infinite timeouts, then
something from outside the usual call path has to wake them up.

The intention is that the socket callbacks will fill that role...

Comment 49 Jeff Layton 2011-03-19 04:20:56 UTC

Ok, yeah...I think I see now. All of the queued tasks are have this:

  tk_callback = 0xffffffffa01ced70 <xprt_timer>, 
  tk_action = 0xffffffffa01ccb60 <call_status>, 

...which suggests that they all went through xprt_transmit and went to sleep on xprt->pending.

Trond, I guess you're theory is that disconnect/reconnect occured, and since the SYN_SENT callback never happened those tasks were never awakened. Is that correct?

Comment 50 Jeff Layton 2011-03-19 14:35:03 UTC

Having a look at the second core to confirm that the problem there is similar...

dentry = 0xffff8801c48bfa40
inode = 0xffff880218969378
sb = 0xffff880c44aac800
nfs_server = 0xffff880c46447c00
rpc_clnt = 0xffff880c46e73400

...the problem does seem to be the same. There are fewer's RPC's queued up with this rpc_clnt, but they seem to be in the same state.

Comment 51 Trond Myklebust 2011-03-19 16:01:59 UTC

So this is what I don't understand:

The tk_callback == xprt_timer, which means they were put to sleep by xprt_transmit.

However, the tk_timeout == 0, which suggests that the call to xprt->ops->set_retrans_timeout() has somehow failed. That again means that task->tk_rqstp->rq_timeout must be zero.

Can you check the value of task->tk_client->cl_timeout, and see what it is pointing to?

Comment 52 Max Matveev 2011-03-21 06:10:09 UTC

FWIW, bug 674427 has report of soft lockup with the same backtrace.

Also, I've check a few cores - the first one from case 00434772
the server was "warmmilk32", all cores from 2011/03/17 are for
"hotspot": just wanted to eliminate the possibility of server
induced problems.

Comment 53 Jeff Layton 2011-03-21 11:37:19 UTC

Created attachment 486592 [details]
various rpc info relating to commit task in core1

Here's a collection of the following, which should help answer Trond's question. Let me know if you need more. The rpc_timeout looks normal to me (not zeroed out):

struct rpc_task ffff88013dfba948
struct rpc_rqst 0xffff8801deb1c498
struct rpc_clnt 0xffff880c45934400
struct rpc_timeout 0xffff880c45934540

Also as Max mentions, server behavior could be a factor here. Maybe it's spuriously disconnecting or something?

That shouldn't cause the client to hang like this though. There seems to clearly be a client-side bug here.

Comment 54 Jeff Layton 2011-03-21 11:49:47 UTC

Created attachment 486593 [details]
various rpc info relating to commit task in core1

Also, rq_timeout for the attached rpc_rqst is not 0, it's 60000. Here's an updated file that also contains the rpc_xprt.

Comment 55 Jeff Layton 2011-03-21 12:36:48 UTC

Ok, I see what Trond is saying, but I don't see how set_retrans_timeout can reasonably fail. I think that leaves us 2 possibilities:

1) the rq_timeout was set to 0 when xprt_transmit called set_retrans_timeout

...or...

2) the tk_timeout was set to a non-zero value at that time but was later zeroed by something else

The second possibility seems more likely to me. The task->tk_runstate is 0x5, which is:

#define RPC_TASK_RUNNING        0
#define RPC_TASK_ACTIVE         2

... RPC_TASK_QUEUED is not set, but it should be if it's still sleeping on the waitqueue.

Comment 56 Jeff Layton 2011-03-21 13:15:59 UTC

One thought... rpc_make_runnable does this:

        rpc_clear_queued(task);
        if (rpc_test_and_set_running(task))
                return;

Could we be hitting a condition where we're trying to wake up the task while QUEUED and RUNNABLE are both set? We set the QUEUED bit when the task goes on the waitqueue, but we don't clear the runnable bit until some time later in __rpc_execute. If the task was woken up between those events might they get stuck like this?

Comment 58 Trond Myklebust 2011-03-21 14:00:54 UTC

If both QUEUED and RUNNABLE are set, then that means we're still in the
__rpc_execute() loop. In that case, we rely on the section that is protected
by the queue->lock to avoid conflicts with rpc_make_runnable().

That raises an issue, though. What is the status of task->u.tk_work? Does
task->u.tk_work.data contain a valid set of WORK_STRUCT bits, and if so,
what are their values?

Comment 59 Trond Myklebust 2011-03-21 14:21:54 UTC

Oh... Another point: task sleep and wake up is designed to be soft-irq safe,
but not hard irq safe.  We might want to add in a defensive WARN_ON(in_irq())
to rpc_make_runnable() just in case we're hitting some poorly designed
network driver that is trying to call us back from the wrong kind of context.

Comment 60 Steve Dickson 2011-03-21 15:09:38 UTC

Will do... I'll get a kernel building with these three patches

1) the one liner from our Pixar trip
2) The patch from Comment 46
3) Adding the following WARN_ON():
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index cace604..c462fab 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -281,6 +281,8 @@ EXPORT_SYMBOL_GPL(__rpc_wait_for_completion_task);
  */
 static void rpc_make_runnable(struct rpc_task *task)
 {
+       WARN_ON(in_irq())
+
        rpc_clear_queued(task);
        if (rpc_test_and_set_running(task))
                return;

Comment 62 Jeff Layton 2011-03-21 15:32:09 UTC

Yeah, I've been wondering about the workqueue too. Here's tk_work...

    tk_work = {
      data = {
        counter = 0xffffe8ffffccce81
      }, 
      entry = {
        next = 0xffff880a1429c718, 
        prev = 0xffffe8ffffccce88
      }, 
      func = 0xffffffffa01d5760 <rpc_async_schedule>
    }, 


...how the "data" field gets set is quite tricky...working down through the definition of INIT_WORK:

     (_work)->data = (atomic_long_t) WORK_DATA_INIT();

#define WORK_DATA_INIT()        ATOMIC_LONG_INIT(WORK_STRUCT_NO_CPU)

        WORK_STRUCT_NO_CPU      = WORK_CPU_NONE << WORK_STRUCT_FLAG_BITS,

        WORK_CPU_NONE           = NR_CPUS + 1,

        CONFIG_NR_CPUS          = 4096

        WORK_STRUCT_FLAG_BITS   = WORK_STRUCT_COLOR_SHIFT +
                                  WORK_STRUCT_COLOR_BITS,

...I think CONFIG_DEBUG_OBJECTS_WORK is unset, so:

        WORK_STRUCT_COLOR_SHIFT = 4,    /* color for workqueue flushing */
        WORK_STRUCT_COLOR_BITS  = 4,

So the counter should be initialized by INIT_WORK to 0x100100. It's not that here, but I'm not sure whether it gets altered later or what. Trond is there something you wanted me to specifically check for? I might be able to get some help determining the state of the tk_work if so...

Comment 63 Jeff Layton 2011-03-21 17:05:26 UTC

Disregard that...I was looking at a mainline-ish kernel which has all of the CMWQ changes. In RHEL6, ->data is initialized to 0:

    #define WORK_DATA_INIT()        ATOMIC_LONG_INIT(0)

...I'll look at what it ends up pointing to as it progresses.

Comment 64 Jeff Layton 2011-03-21 17:31:01 UTC

Ok, the work_struct starts its life with ->data initialized to 0. It then has this happen when it's queued:

        if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {

WORK_STRUCT_PENDING is 0, and that bit seems to be set on tk_work task. So, this was queued since it was last initialized. It looks like run_workqueue calls work_clear_pending before calling ->func, so it looks like this job was queued to the workqueue but has not run.

Masking off 0x1 in the ->data gives us this:

crash> struct cpu_workqueue_struct 0xffffe8ffffccce80
struct cpu_workqueue_struct {
  lock = {
    raw_lock = {
      slock = 0x85cb85cb
    }
  }, 
  worklist = {
    next = 0xffff88013dfba9d8, 
    prev = 0xffff88021675f3d8
  }, 
  more_work = {
    lock = {
      raw_lock = {
        slock = 0xaca0aca0
      }
    }, 
    task_list = {
      next = 0xffffe8ffffcccea0, 
      prev = 0xffffe8ffffcccea0
    }
  }, 
  current_work = 0x0, 
  wq = 0xffff880c43c94380, 
  thread = 0xffff880c469a14e0
}

...sanity checking the wq field:

crash> struct workqueue_struct 0xffff880c43c94380
struct workqueue_struct {
  cpu_wq = 0x60ffd7a0ce80, 
  list = {
    next = 0xffff880c43d05188, 
    prev = 0xffff880c3c2aac88
  }, 
  name = 0xffffffffa01ea024 "rpciod", 
  singlethread = 0x0, 
  freezeable = 0x0, 
  rt = 0x0
}

Comment 65 Jeff Layton 2011-03-21 17:51:22 UTC

task 0xffff880c469a14e0 is rpciod/3:

crash> set 0xffff880c469a14e0
    PID: 1588
COMMAND: "rpciod/3"
   TASK: ffff880c469a14e0  [THREAD_INFO: ffff880c3dc8c000]
    CPU: 3
  STATE: TASK_RUNNING

crash> bt
PID: 1588   TASK: ffff880c469a14e0  CPU: 3   COMMAND: "rpciod/3"
 #0 [ffff880c3dc8dd70] schedule at ffffffff814c8f68
 #1 [ffff880c3dc8de38] worker_thread at ffffffff8108c7ac
 #2 [ffff880c3dc8dee8] kthread at ffffffff81091a86
 #3 [ffff880c3dc8df48] kernel_thread at ffffffff810141ca

...so it's TASK_RUNNING, and there are 22 jobs queued to the ->worklist, but it seems to still be sitting in schedule(). Disassembly confirms that it is sitting in the schedule() call in worker_thread.

Comment 66 Jeff Layton 2011-03-21 17:56:58 UTC

So it seems like it went to sleep and got waken up, but the scheduler hasn't given it any time?

Comment 67 Jeff Layton 2011-03-21 18:19:55 UTC

maya.bin is on CPU3 as well and that's where we're seeing the softlockup messages. Could it somehow be stuck continually trying to wait but can't, and so it can't relenquish the CPU to allow the workqueue jobs to run?

Comment 68 Trond Myklebust 2011-03-21 18:37:23 UTC

Is there a fatal signal pending on maya.bin? Otherwise it should be calling
schedule()?

Comment 69 Neal Kim 2011-03-21 18:50:37 UTC

Just a quick update.

Lars has been running the 2.6.32-118.el6.bz672305.x86_64 test kernel for the better part of a week with no hangs...so far. Although he did mention that they had a power outage early yesterday morning which resulted in rendering jobs being held off completely that day.

I'll keep everyone posted.

Comment 70 Jeff Layton 2011-03-21 19:00:39 UTC

task->stack->flags for maya.bin is 0x8c == 10001100. Which is:

TIF_SIGPENDING
TIF_NEED_RESCHED
TIF_SYSCALL_AUDIT

...so yes, there is a signal pending:

    signal = {
      sig = {0x100}
    }

...which I think is SIGKILL.

Comment 71 Trond Myklebust 2011-03-21 19:05:06 UTC

Oh crap... I see the problem...

If nfs_commit_set_lock() fails, then nfs_commit_inode() returns 0,
which nfs_wb_page() interprets as 'success!', and loops again...

Comment 72 Jeff Layton 2011-03-21 19:12:17 UTC

Yeah, I was just looking at the same thing. That's almost certainly the bug.

Comment 73 Jeff Layton 2011-03-21 19:20:27 UTC

(In reply to comment #69)
> Just a quick update.
> 
> Lars has been running the 2.6.32-118.el6.bz672305.x86_64 test kernel for the
> better part of a week with no hangs...so far. Although he did mention that they
> had a power outage early yesterday morning which resulted in rendering jobs
> being held off completely that day.
> 

That's good to know. Even though the symptoms are the same, it seems quite likely that Disney and Pixar are hitting different bugs.

Comment 74 Trond Myklebust 2011-03-21 19:38:10 UTC

Created attachment 486678 [details]
NFS: Fix a hang/infinite loop in nfs_wb_page()

When one of the two waits in nfs_commit_inode() is interrupted, it
returns a non-negative value, which causes nfs_wb_page() to think
that the operation was successful causing it to busy-loop rather
than exiting.
It also causes nfs_file_fsync() to incorrectly report the file as
being successfully committed to disk.

This patch fixes both problems by ensuring that we return an error
if the attempts to wait fail.

Comment 75 Jeff Layton 2011-03-21 20:07:36 UTC

Thanks, Trond...

Steve, could you add the patch in comment #74 to your test kernel? Once we have that, I think we should give it to Disney to test. Pixar is also welcome to do so too, but we don't know for sure that they actually hit this bug.

Comment 76 Steve Dickson 2011-03-21 20:18:32 UTC

(In reply to comment #74)
> Created attachment 486678 [details]
> NFS: Fix a hang/infinite loop in nfs_wb_page()
> 
> When one of the two waits in nfs_commit_inode() is interrupted, it
> returns a non-negative value, which causes nfs_wb_page() to think
> that the operation was successful causing it to busy-loop rather
> than exiting.
> It also causes nfs_file_fsync() to incorrectly report the file as
> being successfully committed to disk.
> 
> This patch fixes both problems by ensuring that we return an error
> if the attempts to wait fail.

Do you want *just* the patch in comment #74 in the test kernel or
should I just added it to the other two from comment #60?

Comment 77 Trond Myklebust 2011-03-21 20:25:14 UTC

In my opinion, we should test just the patch from comment #74 for now. It looks as if that is the correct fix for Disney's problem.

Comment 78 Jeff Layton 2011-03-21 20:27:58 UTC

That works for me. Eventually, we'll want to test the whole swath of fixes (including the sunrpc ones), but it might be nice to confirm that this patch is the fix, particularly if this is easily reproducible.

Comment 79 Jeff Layton 2011-03-21 20:33:12 UTC

Also to confirm, the second core shows the same situation as the first. Jobs there are queued up on rpciod/4 and a SIGKILL'ed maya.bin process is spinning on CPU4 in the same spot.

Comment 80 Steve Dickson 2011-03-21 21:18:39 UTC

(In reply to comment #77)
> In my opinion, we should test just the patch from comment #74 for now. It looks
> as if that is the correct fix for Disney's problem.

Cool... The kernel is in the oven... Thanks for all your help!!!

Comment 82 Jeff Layton 2011-03-22 11:34:23 UTC

Setting needinfo flag for reporter as we await test results.

Comment 86 Jim Rees 2011-03-24 01:36:24 UTC

I believe we're seeing the same problem. See the thread here:
http://marc.info/?l=linux-nfs&m=129685584924582&w=2

I have applied the patch from comment #74, and the preliminary report I have is that it still hangs, but now the iozone process is interruptible, where it wasn't before, and the call trace has changed:

INFO: task iozone:3824 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iozone          D 0000000104ba67e9     0  3824      1 0x00000080
ffff88023c23fc30 0000000000000086 0000000000000000 00000000ffffffff
0000000000013700 0000000000013700 0000000000013700 ffff88023d4044d0
0000000000013700 ffff88023c23ffd8 0000000000013700 0000000000013700
Call Trace:
[<ffffffff81136829>] ? lock_flocks+0x10/0x12
[<ffffffff81137713>] ? __posix_lock_file+0x60/0x441
[<ffffffff814377b6>] rwsem_down_failed_common+0xd3/0x105
[<ffffffff81137d07>] ? posix_lock_file+0x11/0x13
[<ffffffff8143780f>] rwsem_down_read_failed+0x12/0x14
[<ffffffff812072e4>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff81436f4b>] ? down_read+0x32/0x36
[<ffffffffa03cf842>] nfs4_proc_lock+0x21f/0x32d [nfs]
[<ffffffffa03b6682>] do_setlk+0x61/0xd5 [nfs]
[<ffffffffa03b6a94>] nfs_lock+0x16e/0x17d [nfs]
[<ffffffff8103bb05>] ? should_resched+0x9/0x29
[<ffffffff81137d2d>] vfs_lock_file+0x24/0x30
[<ffffffff8113848c>] fcntl_setlk+0x169/0x2fb
[<ffffffff811025b9>] ? fsnotify_access+0x5d/0x65
[<ffffffff8110f915>] sys_fcntl+0x31c/0x4c9
[<ffffffff814389a7>] ? do_device_not_available+0x9/0xb
[<ffffffff8100aa12>] system_call_fastpath+0x16/0x1b

Comment 87 Jim Rees 2011-03-24 01:38:01 UTC

Guess I should mention that this is on 2.6.38-rc5, and I had to adapt the patch to make it apply.

Comment 88 Jeff Layton 2011-03-24 02:31:59 UTC

Hmmm...that really looks like a different bug entirely. The hallmarks of this one are mainly that the machine is stuck looping while trying to issue a commit. That prevents the rpciod workqueue jobs scheduled on the same CPU from running.

The stack trace you just posted looks more like something in the locking code. Possibly a problem with locking the nfsi->rwsem? That might indicate a problem with lock recovery. Are there any *-manager threads running when you hit this? If so, what are they doing?

In any case, it might be best to continue discussion of that on linux-nfs since this bug is really about a RHEL6 problem, or maybe open a bug at the kernel.org bug tracker that we can all follow.

Comment 89 Lars Damerow 2011-03-24 17:57:41 UTC

Five renderfarm machines survived a night's rendering while using the patch from comment #74. That's a promising start, so I've added five more machines today and will continue to watch them.

If those ten are still doing well tomorrow morning, I'll add another ten and let them all cook for a couple of days.

Thanks for working hard on this! My fingers are crossed and I'll update again tomorrow. :)

Comment 90 Jeff Layton 2011-03-28 14:26:21 UTC

I'm not sure why it was posted as a private comment, but Disney has also reported that the patched kernel has been working for them.

Comment 91 RHEL Program Management 2011-03-28 14:39:58 UTC

This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 92 Kevin Constantine 2011-03-28 16:38:31 UTC

We hadn't seen a single machine hang since we received the patched kernel on either the patched machines, or the stock rhel6 machines until this weekend.   We had 21 machines crash over the weekend and all of them were running the unpatched stock rhel6 kernel.

Comment 93 Lars Damerow 2011-03-28 17:19:23 UTC

All of our patched machines made it through the weekend's rendering load with no hangs. I'll start installing the patch on some workstations for users who've been suffering with the bug.

Seems like this is it! Thanks.

Comment 94 Jeff Layton 2011-03-28 17:32:01 UTC

Excellent news. Thanks for the updates. I'll see whether we can get this into 6.1...

Comment 97 yanfu,wang 2011-03-29 06:04:15 UTC

hi Jeff,
Could you reproduce the problem? seems customer have the environment to verify the problem, so how to test by QE internal?

Comment 101 Jeff Layton 2011-03-29 11:52:06 UTC

(In reply to comment #97)
> hi Jeff,
> Could you reproduce the problem? seems customer have the environment to verify
> the problem, so how to test by QE internal?

I'm afraid not. It's really a matter of timing and workload. I don't know of any way to reproduce this reliably.

Comment 102 Lars Damerow 2011-03-29 17:24:40 UTC

Alas, I spoke too soon. We had three hangs on patched machines this morning.

These machines are running 2.6.35.9-64 with the patch from comment 74. I'm getting a group of machines to run 2.6.38.2-8, from the package found here:

  http://kojipkgs.fedoraproject.org/packages/kernel/2.6.38.2/8.fc15/

I'm hoping that some other changes between 2.6.35.9 and 2.6.38.2 are also required to fix the problem.

-lars

Comment 103 Steve Dickson 2011-03-29 19:28:59 UTC

(In reply to comment #102)
> Alas, I spoke too soon. We had three hangs on patched machines this morning.
They are all hung in the same place? Maybe these hangs are a different bug...

Comment 104 Lars Damerow 2011-03-29 19:55:14 UTC

The symptoms are identical to the situation you mentioned in comment #1: two processes waiting for the same lock. The traces look the same. In these three cases, one of the processes is the flush thread, and the other is a process belonging to the rendering pipeline (it isn't the same one each time).

thanks,
-lars

Comment 105 Jeff Layton 2011-03-29 21:13:05 UTC

Yeah, this is a bit confusing...

This bug was originally opened against RHEL6 under the assumption that the problem seen there was the same as the one that Pixar was seeing on F14 kernels. The problem is though, that we never had any conclusive proof of that.

In the meantime, the case from Disney was attached to this one again under the assumption that the problem was

The one RHEL6 core that we got from Pixar was at least partially corrupt. When I look at it now in light of the bug that the patch in comment #74 fixes, it shows none of the hallmarks of that problem. The CPUs are all idle aside from the one that is writing to /proc/sysrq-trigger, and the process that is stuck (dfa) doesn't have nfs_wb_page() in its stack.

At this point we have to conclude that they are different problems. I think that the right course of action is to use this bug to track the patch that Trond put in comment #74, and consider opening a new bug or cloning this one to track the problem that Pixar has seen.

If we do that, I'll plan to take a fresh look at the core that we do have from Pixar to see if I can learn more about the cause there.

Comment 106 Lars Damerow 2011-03-29 21:28:14 UTC

Much appreciated--thanks, Jeff.

-lars

Comment 107 Jeff Layton 2011-03-30 18:08:56 UTC

Since there is no cloned bug yet, I'll keep notes in this one for now. I'll plan to transfer these to that bug unless someone beats me to it. It looks like the core from Pixar might not be corrupt after all.

Somehow we ended up with the wrong debuginfo which threw off all of the struct alignment. I went back and un-cpio'ed the correct debuginfo and now things are looking more sane. These values are correct:

superblock: 0xffff88042c1c6400
nfs_server: 0xffff880427f54400
rpc_clnt: 0xffff88042b815a00
rpc_xprt: 0xffff8804060b5800

I'll have to go back now and work through the different data structures.

Comment 108 Jeff Layton 2011-03-30 18:21:20 UTC

There are a bunch of tasks in the cl_tasks queue. Most have tk_action set to call_reserveresult, but some have call_connect_status, and a couple are in call_status.

Interesting fields from the xprt:

state == 0x115 (XPRT_LOCKED|XPRT_CONNECTING|XPRT_BOUND|XPRT_CONNECTION_CLOSE)

inet->skc_state == 0x7 (TCP_CLOSE)

sock->state == SS_UNCONNECTED

Comment 109 Jeff Layton 2011-03-30 18:26:15 UTC

Also:

  connect_worker = {
    work = {
      data = {
        counter = 0xffffe8ffffc88bc1
      }, 
      entry = {
        next = 0xffff8804060b5e80, 
        prev = 0xffff8804060b5e80
      }, 
      func = 0xffffffffa0413210 <xs_tcp_connect_worker4>
    }, 
    timer = {
      entry = {
        next = 0xffff880427c4af50, 
        prev = 0xffff8804060b5b78
      }, 
      expires = 0x113c98b7c, 
      function = 0xffffffff8108cbf0 <delayed_work_timer_fn>, 
      data = 0xffff8804060b5e78, 
      base = 0xffff88042e224000, 
      start_site = 0x0, 
      start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
      start_pid = 0xffffffff
    }

...the fact that the 1 is set means that WORK_STRUCT_PENDING is set. The job is queued to a workqueue but has not yet run. jiffies at the time of the crash is 
0x113c979d4 so it's just waiting for the timer to pop.

Comment 110 Jeff Layton 2011-03-30 19:25:03 UTC

The tasks are all write operations.

crash> struct rpc_call_ops 0xffffffffa053e020
struct rpc_call_ops {
  rpc_call_prepare = 0xffffffffa0519e00 <nfs_write_prepare>, 
  rpc_call_done = 0xffffffffa0519d60 <nfs_writeback_done_full>, 
  rpc_release = 0xffffffffa051a560 <nfs_writeback_release_full>
}

...not too surprising since "dfa" is attempting to close a file and is writing everything out. So at this point, we have a bunch of tasks queued up, but the xprt isn't connected so they can't do anything.

Some more info from xprt:

    reestablish_timeout = 0x17700, 
    connect_cookie = 0x73c, 
    task_cleanup = {
      data = {
        counter = 0xffffe8ffffc88bc0
      }, 
      entry = {
        next = 0xffff8804060b5d00, 
        prev = 0xffff8804060b5d00
      }, 
      func = 0xffffffffa04114a0 <xprt_autoclose>
    }, 

It looks like the kernel is trying to reconnect the socket and can't, but right offhand I can't quite tell why. Is there any chance that Pixar could reproduce this again on RHEL6 and turn up rpcdebug? That may tell us whether this is similar to the problem we saw on Fedora when we were there...

Comment 111 Neal Kim 2011-03-31 01:42:36 UTC

Cloned to BZ#672305 for Pixar

Comment 112 Neal Kim 2011-03-31 01:47:16 UTC

Sorry, please ignore my last comment.

Cloned to BZ#692315 for Pixar

https://bugzilla.redhat.com/show_bug.cgi?id=692315

Comment 113 Trond Myklebust 2011-03-31 09:14:08 UTC

(In reply to comment #108)
> There are a bunch of tasks in the cl_tasks queue. Most have tk_action set to
> call_reserveresult, but some have call_connect_status, and a couple are in
> call_status.
> 
> Interesting fields from the xprt:
> 
> state == 0x115 (XPRT_LOCKED|XPRT_CONNECTING|XPRT_BOUND|XPRT_CONNECTION_CLOSE)
> 
> inet->skc_state == 0x7 (TCP_CLOSE)
> 
> sock->state == SS_UNCONNECTED

The fact that XPRT_CONNECTION_CLOSE is set means that we should be seeing an
instance of xprt->task_cleanup on the workqueue. It doesn't look as if that
is the case (data->counter does not have bit 0 set).

Basically it looks as if the connection is somehow deadlocked with the connect
code waiting for xprt_autoclose to kill the socket, and the latter failing to
run.

Comment 114 Aristeu Rozanski 2011-04-07 13:51:48 UTC

Patch(es) available on kernel-2.6.32-130.el6

Comment 117 David Mair 2011-04-07 20:20:52 UTC

Set wrong flag.  Setting hot-fix flag.

Comment 119 Jeff Layton 2011-04-28 16:24:23 UTC

*** Bug 697029 has been marked as a duplicate of this bug. ***

Comment 120 Jeff Layton 2011-05-02 18:39:54 UTC

*** Bug 698696 has been marked as a duplicate of this bug. ***

Comment 121 Jeff Layton 2011-05-02 18:46:15 UTC

*** Bug 675310 has been marked as a duplicate of this bug. ***

Comment 123 errata-xmlrpc 2011-05-19 12:30:20 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html

Comment 124 fcardoso 2015-02-23 11:27:31 UTC

Getting again issues about this on newer versions of kernel, ran into this issue with a "customer" that has a similar setup only it uses NFS over RDMA for an HPC Cluster setup.
The NFS Client nodes are running at very high load averages and the NFS eventually hangs while the server bumps  "svcrdma: Error fast registering memory for xprt ffff880c1ad31000".
Setup is in all ways similar to the ones described here "https://access.redhat.com/solutions/55816"
Could this have regressed into becoming a problem in newer versions of the kernel ?

Regards,

Francisco

Comment 125 Fabio Olive Leite 2015-02-23 14:25:36 UTC

Switching needinfo request to Sachin, as Jeff left Red Hat a while ago. I hope Sachin can provide insight here.

Please note I'm not related with the resolution of this bug, I'm just doing a quick favour to ensure the needinfo does not get lost. :)

Comment 126 Sachin Prabhu 2015-02-24 10:43:44 UTC

Francisco,

That seems to be a different problem. Can you please open a new case with Red Hat support. 

Sachin Prabhu

Note You need to log in before you can comment on or make changes to this bug.

adrew
anton
bergwolf
bfields
bhubbard
cmarthal
cward
cww
dmair
fleite
francisco.cardoso
francois
gansalmon
itamar
jcm
jlayton
jonathan
kernel-maint
kevin.constantine
kzhang
lars
madhu.chinakonda
makc
Marc.Jordan
nhorman
nkim
nobody+PNT0273897
prescott
rwheeler
sbeal
sprabhu
steved
trond.myklebust
yanwang