Bug 692315
| Summary: | Repeatable NFS mount hang | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Neal Kim <nkim> |
| Component: | kernel | Assignee: | Jeff Layton <jlayton> |
| Status: | CLOSED DUPLICATE | QA Contact: | Filesystem QE <fs-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 6.0 | CC: | adrew, bergwolf, bfields, bhubbard, francois, gansalmon, itamar, jcm, jiali, jlayton, jonathan, kernel-maint, kevin.constantine, kzhang, lars, madhu.chinakonda, makc, Marc.Jordan, nhorman, nkim, nobody+PNT0273897, rees, rwheeler, sbeal, sprabhu, steved, trond.myklebust, yanwang |
| Target Milestone: | rc | Keywords: | OtherQA |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 672305 | Environment: | |
| Last Closed: | 2011-06-14 14:30:31 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 669204, 672305 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Description
Neal Kim
2011-03-31 01:38:53 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains unresolved, it has been rejected, as it was not proposed as an exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

One of our RHEL6 render machines is now hanging, but the symptoms look a little different. I'm not sure what's going on in this case, but I'll post the RPC information here anyway. When I try to get ps output, it doesn't complete, hanging while trying to read one process's cmdline file. That process (a prman instance) is in the D state, reported as "disk sleep", and its wchan file says "call_rwsem_down_write_failed". I'm attaching the output from echoing 0 to /proc/sys/sunrpc/rpc_debug, as well as the sysrq-t trace. Note that there are tons of hung "mem-status" processes; these are jobs run out of cron that inspect all processes, and a lot of them piled up once the machine got into this state.

thanks,
-lars

Created attachment 489841 [details]
sysrq-t output for a RHEL6 machine in a bad state
In this instance the machine may not be suffering from the same bug we've been tracking down, but I'm including it here just in case.
Created attachment 489843 [details]
rpc_debug output for RHEL6 machine in bad state
In this instance the machine may not be suffering from the same bug we've been tracking down, but I'm including it here just in case.
We're also testing Fedora machines running 2.6.38.2, and one of them got into a bad state over the weekend. This one also seems different from the original issue but maybe it contains some clues. There are processes piling up on the machine too, but here they're sleeping in autofs4_wait. I'm leaving the machine alone in case there's any debugging you all would like me to do while it's broken.

Created attachment 489844 [details]
rpc_debug output for F11/2.6.38.2 machine in bad state
Created attachment 489845 [details]
sysrq-t output for F11/2.6.38.2 machine in a bad state
In the RHEL6 info, there are some calls that have tk_action set to call_connect_status, which indicates that they ended up reconnecting the transport:

Apr 4 10:49:12 r3822 kernel: 8221 0080 -11 ffff880429d21200 ffff880106e46620 0 ffffffffa0429b40 nfsv3 GETATTR a:call_connect_status q:xprt_sending

...but that info doesn't contain any actual RPC debug messages. Did none show up, or did you disable RPC debug soon after turning it back on?

The 2.6.38.2 info looks like it may be the same (or similar) problem that we saw when we visited your site:

[440182.760368] RPC: 11478 release request ffff88034665f5e0
[440182.798225] RPC: 11479 reserved req ffff88034665c000 xid e3315ef5
[440182.798233] RPC: 11479 xprt_connect xprt ffff88041d1d0800 is not connected
[440182.798509] RPC: 11479 xprt_connect_status: retrying
[440182.798514] RPC: 11479 xprt_prepare_transmit
[440182.798519] RPC: 11479 xprt_transmit(140)
[440182.798539] RPC: 11479 xmit complete
[440182.806304] RPC: 11479 xid e3315ef5 complete (112 bytes received)
[440182.806322] RPC: 11479 release request ffff88034665c000

...but in this case there's no info about the tasks in queue, and the log doesn't seem to span much time. In any case, at this point XPRT_CONNECTED was not set, but it looks like it ended up doing a transmit anyway and apparently succeeded.

I'll look over what we have so far and see if I can come up with a debugging approach.

On 2.6.38: that kernel contains some of the d_automount changes as well as all of the RCU dcache work that went into 2.6.38. It's possible that this is completely unrelated to any of the previous problems you've seen. It's difficult for me to tell from this info what was actually hung on that kernel and what it's waiting on.
This is a very difficult problem, and now we're trying to chase it down across several kernel versions. So far you have:
2.6.35-ish kernel
RHEL6 kernel
2.6.38.2
...and there's no guarantee that any of these problems are the same. In order to make any progress on this, I think we'll need to settle on a consistent kernel -- preferably RHEL6's since that's the one we officially support.
On the RHEL6 sysrq-t info, I don't see any evidence of an NFS issue. The processes are almost all stuck trying to lock a semaphore so they can read a /proc/<pid> file.
The rpc_debug output just shows a bunch of queued RPC jobs, and it looks like there might be a reconnect in progress. I see no other rpc_debug messages, though, which leads me to believe that it wasn't enabled for very long.
The main takeaway here is this -- if we suspect that this problem is related to the socket getting disconnected and reconnected, then we will likely need rpc_debug output that spans something on the order of 5-10 minutes of real time.
The reason is that there are a number of workqueue jobs that run on timers of that order. If rpc_debug isn't turned up when those jobs run, then we have no info about why they may have failed.
Also with this, we need to proceed from first principles. When the machine is having problems, start by identifying processes that are stuck and then following from there to determine why they're stuck.
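For what it's worth, here is a rough sketch of that kind of collection on a RHEL6 box -- not a prescribed procedure -- assuming the rpcdebug utility from nfs-utils is available, syslog writes to /var/log/messages, and sysrq is enabled; the debug mask, paths, and 10-minute window are assumptions:

    # Identify stuck (D-state) tasks and what they're blocked in. If ps itself
    # hangs on a /proc file, fall back to sysrq-t for full stack traces.
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'
    echo t > /proc/sysrq-trigger       # dumps every task's stack to the kernel log

    # Capture an rpc_debug window long enough for the timer-driven workqueue
    # jobs to fire (roughly 10 minutes), then turn the debugging back off.
    rpcdebug -m rpc -s all             # or: echo 32767 > /proc/sys/sunrpc/rpc_debug
    tail -f /var/log/messages > /tmp/rpc_debug.log &
    sleep 600
    rpcdebug -m rpc -c all             # writing to rpc_debug also dumps the queued RPC tasks
    kill $!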
Fair enough. I'll only send information for kernel 2.6.32-71.el6.x86_64 from now on. In order to really resolve the issue, though, I do need a patch that I can migrate to a Fedora kernel. I'm more than happy to do the port myself once I can see the patch or patches.

I'm attaching a ten-minute rpc debug log and full sysrq-t output from a machine that is clearly suffering from the bug in question. There's a flush thread in the D state, waiting in inode_wait, and a handful of user processes also in the D state.

If it would be helpful I can grant you ssh access to these machines. In the meantime, let me know if there are other logs I can get for you.

thanks,
-lars

Created attachment 492239 [details]
sysrq-t output for a RHEL6 machine having flush/inode_wait issue
Created attachment 492240 [details]
rpc debug output for a RHEL6 machine having flush/inode_wait issue
This contains ten minutes of runtime.
Created attachment 492383 [details]
RPC log snippet
Thanks, Lars. OK, at first glance it looks like this may be a similar problem to the one we saw when we were there a month or so ago. Here we're trying to connect the socket, but it doesn't immediately fail:
RPC: worker connecting xprt ffff880390253000 via tcp to 138.72.236.48 (port 2049)
RPC: ffff880390253000 connect status 115 connected 0 sock state 2
...115 is -EINPROGRESS. Basically that means that the connect has been queued to the TCP layer and the RPC code is just waiting on the result. It appears that this then succeeds:
RPC: xs_tcp_state_change client ffff880390253000...
RPC: state 1 conn 0 dead 0 zapped 1
...state 1 is TCP_ESTABLISHED. The RPC tasks get woken back up and then go to retransmit. The first attempt though fails:
...oops hit submit too quickly. In any case, first attempt fails with an EPIPE:

RPC: 53190 xprt_transmit(57736)
RPC: xs_tcp_send_request(57736) = -32
RPC: xs_tcp_state_change client ffff880390253000...
RPC: state 4 conn 1 dead 0 zapped 1
RPC: 53190 call_status (status -32)

...and now it's in TCP_FIN_WAIT1. It goes into TCP_FIN_WAIT2 and then TCP_CLOSE:

RPC: 53190 call_bind (status 0)
RPC: 53190 call_connect xprt ffff880390253000 is not connected
RPC: 53190 xprt_connect xprt ffff880390253000 is not connected
RPC: 53190 sleep_on(queue "xprt_pending" time 5504045478)
RPC: 53190 added to queue ffff880390253320 "xprt_pending"
RPC: 53190 setting alarm for 60000 ms
RPC: xs_tcp_state_change client ffff880390253000...
RPC: state 5 conn 0 dead 0 zapped 1
RPC: xs_tcp_state_change client ffff880390253000...
RPC: state 7 conn 0 dead 0 zapped 1
RPC: disconnected transport ffff880390253000

...and the transport is disconnected. More to come in a bit...

...a little more:

RPC: 53190 __rpc_wake_up_task (now 5504045478)
RPC: 53190 disabling timer
RPC: 53190 removed from queue ffff880390253320 "xprt_pending"
RPC: __rpc_wake_up_task done
RPC: xs_tcp_state_change client ffff880390253000...
RPC: state 7 conn 0 dead 0 zapped 1
RPC: disconnected transport ffff880390253000
RPC: xs_tcp_data_ready...
RPC: 53190 __rpc_execute flags=0x1

...so we got another state change, but the state didn't seem to actually change here since it was 7 before. But I think Trond mentioned that we don't get sk_state_change callbacks for all state changes, so it's possible that it flipped to another state and then back to TCP_CLOSE. Regardless, there's another interesting one: xs_tcp_data_ready also fired here. Unfortunately this doesn't print the xprt address, so it's possible it was for another xprt entirely. Oh well.

In any case we get some other attempts to send on the socket that fail again with EPIPE:

RPC: 53190 xprt_transmit(57736)
RPC: xs_tcp_send_request(57736) = -32
RPC: xs_tcp_state_change client ffff880390253000...
RPC: state 7 conn 0 dead 0 zapped 1
RPC: disconnected transport ffff880390253000
RPC: 53190 call_status (status -32)
RPC: 53190 call_bind (status 0)
RPC: 53190 call_connect xprt ffff880390253000 is not connected
RPC: 53190 xprt_connect xprt ffff880390253000 is not connected
RPC: 53190 sleep_on(queue "xprt_pending" time 5504045478)
RPC: 53190 added to queue ffff880390253320 "xprt_pending"
RPC: 53190 setting alarm for 60000 ms
RPC: xs_connect delayed xprt ffff880390253000 for 0 seconds
RPC: disconnecting xprt ffff880390253000 to reuse port
RPC: xs_error_report client ffff880390253000...
RPC: error 0
RPC: 53190 __rpc_wake_up_task (now 5504045478)
RPC: 53190 disabling timer
RPC: 53190 removed from queue ffff880390253320 "xprt_pending"
RPC: __rpc_wake_up_task done
RPC: disconnected transport ffff880390253000

...Finally, the connect_worker job fires again and we get this:

RPC: worker connecting xprt ffff880390253000 via tcp to 138.72.236.48 (port 2049)
RPC: ffff880390253000 connect status 99 connected 0 sock state 7
RPC: 53190 __rpc_execute flags=0x1
RPC: 53190 xprt_connect_status: retrying

...and here we see the infamous EADDRNOTAVAIL error (99).

So I guess the first question is, why did we get an EPIPE error on the first attempt to send on the socket after the initial successful reconnect? I didn't see any evidence that we shut down the socket after the connect.
I do see this in tcp_sendmsg:
err = -EPIPE;
if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
        goto out_err;
...could one of those cases have been true? A systemtap script that checks for that might be interesting.
We probably also ought to have Lars test a kernel with the patches that Trond cooked up when we were at Pixar and soon afterward, preferably something close to the 6.1 release kernel. I'll try to cobble the patches together here and add them to the BZ.
If that fails in a similar fashion, then you could collect the same sort of rpc_debug info, or I could maybe log in there and poke at it and we can see whether it has changed what the client is doing here.
sk->sk_shutdown is cleared in tcp_disconnect(), so it should be cleared when we exit from the xs_abort_connection() helper. That helper will also set sk->sk_err to ECONNRESET as far as I can see. Normally sk->sk_err should then be cleared in the call to tcp_connect() so whatever may be setting it would have to do so after that. So, yes, it might not be a bad idea to figure out what the value of sk_err is.

Created attachment 492457 [details]
patchset - fix a couple of issues in xprtsock.c
These two patches are not in RHEL6 yet. One of them is intended to fix the problem that we found while at Pixar. The other fixes another issue we found by inspecting the logs.
What would be good would be to test something close to a 6.1 kernel with these two patches on top. If they don't fix the issue, then turn up rpc_debug again for about 10 mins and collect the logs.
What might also be helpful is to roll up a systemtap script that prints sk->sk_err and sk->sk_shutdown when tcp_sendmsg returns -EPIPE.
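Something along these lines might do it -- only a rough, untested sketch, assuming the 2.6.32 tcp_sendmsg signature (iocb, sock, msg, size) and that matching kernel debuginfo is installed for systemtap:

    # Report sk->sk_err and sk->sk_shutdown whenever tcp_sendmsg returns -EPIPE.
    stap -e '
    global socks
    probe kernel.function("tcp_sendmsg") {
        # Save the struct sock pointer at entry, keyed by thread.
        socks[tid()] = $sock->sk
    }
    probe kernel.function("tcp_sendmsg").return {
        if ($return == -32 && tid() in socks) {     # -32 == -EPIPE
            printf("tcp_sendmsg -EPIPE: sk_err=%d sk_shutdown=%d\n",
                   @cast(socks[tid()], "sock", "kernel")->sk_err,
                   @cast(socks[tid()], "sock", "kernel")->sk_shutdown)
        }
        delete socks[tid()]
    }'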
Sound like a plan?
Just a quick update from Lars in regards to the kernel we gave him. They have had 33 renderfarm machines running 2.6.32-132.el6.jtltest.015.x86_64 for nine days, and so far none of them have shown the mount hang.

"All but five of them are running Fedora 11. This is good news overall, but a handful of the F11 machines have crashed outright, and so far I can't tell why. There's a backtrace on the screen, but as far as I can see the traceback is in code that writes a coredump (do_coredump and elf_core_dump are called). Does the RHEL kernel actually dump core automatically when it crashes? If it does, where does it go? These machines don't have a /var/crash directory."

I'm trying to get a little more clarification, because to me it sounds like 28 Fedora 11 machines running the 2.6.32-132.el6.jtltest.015.x86_64 RHEL kernel.

So I confirmed with Lars, 28 Fedora 11 machines running the 2.6.32-132.el6.jtltest.015.x86_64 RHEL kernel.

Updated machine count:
28 Fedora 11 machines running the patched kernel
5 RHEL6 machines running the patched kernel
5 RHEL6 machines running the stock, unpatched kernel

Hi all, I wanted to send an update after more than a month's testing. The mount hang does appear to be fixed with the RHEL kernel and Jeff's patches. Now that RHEL 6.1 is released, is there any ETA for an errata RHEL kernel that would contain these patches?

Thanks!
-lars

Thanks for testing them, Lars. Yes, I plan to propose these for 6.2.

Jeff,

Is there anything here that has not already been sent upstream?

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

(In reply to comment #25)
> Jeff,
>
> Is there anything here that has not already been sent upstream?

Yes. Looks like this patch that you attached to the earlier bug is not yet in mainline:

SUNRPC: Deal with the lack of a SYN_SENT sk->sk_state_change callback...

https://bugzilla.redhat.com/attachment.cgi?id=486341

...were you planning to merge that upstream? FWIW, the patchset that Lars tested is attached to this bug in comment #19.

(In reply to comment #27)
> Yes. Looks like this patch that you attached to the earlier bug is not yet in
> mainline:
>
> SUNRPC: Deal with the lack of a SYN_SENT sk->sk_state_change callback...
>
> https://bugzilla.redhat.com/attachment.cgi?id=486341
>
> ...were you planning to merge that upstream?

I'll queue it up and add a Cc: stable...

steved ended up rolling this patch into a larger set. So, closing this bug as a duplicate of that one...

*** This bug has been marked as a duplicate of bug 701788 ***