Bug 672305
| Summary: | Repeatable NFS mount hang | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Neal Kim <nkim> |
| Component: | kernel | Assignee: | Jeff Layton <jlayton> |
| Status: | CLOSED ERRATA | QA Contact: | Filesystem QE <fs-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 6.0 | CC: | adrew, anton, bergwolf, bfields, bhubbard, cmarthal, cward, cww, dmair, fleite, francisco.cardoso, francois, gansalmon, itamar, jcm, jlayton, jonathan, kernel-maint, kevin.constantine, kzhang, lars, madhu.chinakonda, makc, Marc.Jordan, nhorman, nkim, nobody+PNT0273897, prescott, rwheeler, sbeal, sprabhu, steved, trond.myklebust, yanwang |
| Target Milestone: | rc | Keywords: | OtherQA |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-2.6.32-130.el6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 669204 | | |
| : | 692315 (view as bug list) | Environment: | |
| Last Closed: | 2011-05-19 12:30:20 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 669204 | | |
| Bug Blocks: | 692315 | | |
| Attachments: | | | |
Comment 5
Steve Dickson
2011-01-25 20:56:27 UTC
Created attachment 475280 [details]
nfs.ko
Created attachment 475281 [details]
sunrpc.ko
nfs.ko and sunrpc.ko attached (vanilla kernel-2.6.32-71.el6.x86_64). I should have a dump of rpc_debug shortly.

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative.

We will continue to work on this - if we can figure it out really soon, we can try to reflag it for 6.1. As it looks now, more likely to land in 6.2.

Created attachment 477115 [details]
rpc_debug output from Fedora
Looking over the rpc_debug info, what I mainly see is a ton of writes queued up. The one at the top of the stack however is this:

[581306.641656] 52282 0001 0 ffff8801fa933600 ffff880264064498 120000 ffffffffa0389700 nfsv3 WRITE a:call_connect_status q:xprt_pending

...the fact that it's sitting in call_connect_status implies that the transport was disconnected and (likely) in the process of being reconnected at the time that this was generated.

The follow-on info adds a bit more; unfortunately, not all of the rpc_debug bits were set when this info was collected, so we didn't get all of the possible rpc_debug messages (in the future, it might be better to echo in 65535 rather than 1 to rpc_debug). In any case...

[581312.289081] RPC: 52282 xprt_connect_status: retrying

This means that the connect attempt returned with a status of -EAGAIN. I think there are a couple of ways that this can occur; unfortunately, I can't tell which since the debug info didn't contain the messages that would tell us this.

[581312.289084] RPC: 52282 xprt_prepare_transmit
[581312.289088] RPC: 52282 xprt_transmit(65676)
[581312.289092] RPC: disconnected transport ffff88042aeb2000
[581312.289096] RPC: 52282 xprt_connect xprt ffff88042aeb2000 is not connected

...in any case, I suspect that the problem is that the client is attempting to reconnect the socket but is unable to do so for some reason. Beyond that, I can't tell much.

I grabbed the rhel6 core and started poking at it. The superblock is here:
0xffff88042c1c6400
The nfs_server is here:
0xffff880427f54400
The rpc_clnt:
0xffff88042b815a00
...hanging off the rpc_clnt *should* be a rpc_xprt, which contains the sockaddr for the server. That however seems to be at least corrupt and has probably been freed. So I think what we may be looking at is a rpc_xprt refcount leak somewhere in the code.
Sorry, not a refcount leak but a refcount imbalance (too many puts / not enough gets).

...more notes:
crash> struct rpc_clnt.cl_xprt 0xffff88042b815a00
cl_xprt = 0xffffffffa0548d80,
...but that address looks odd:
crash> kmem ffffffffa0548d80
ffffffffa0548d80 (D) nfs_rpcstat
VM_STRUCT ADDRESS RANGE SIZE
ffff88042ad61540 ffffffffa0503000 - ffffffffa055a000 356352
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffea000e956758 42aafd000 0 6a 1 40000000000000
...so it's the same address as the nfs_rpcstat struct, and that looks valid. It seems unlikely that that address was handed out by a kzalloc, so how did we end up with it in the cl_xprt field? Maybe it's a more generic memory corruption in rpc_clnt?
Got an email from Lars at Pixar the other day:
"Thanks, Steve--unfortunately, I came in this morning to find a patched
machine hanging just as they did before. Bummer! I verified that the patch
was applied correctly by looking at the sources in the debuginfo package."
...so apparently Trond's patch did not fix the issue. I think we need to
do the following:
When we get a RHEL6 machine into this state, it would be good to confirm
that...
The ideal thing would be to get a kernel with a backport of Trond's patch
for RHEL6 to them, and get it into the state where it's hung like this.
Once it's in that state, crank up rpc_debug for a few minutes:
# rpcdebug -m rpc -s all
...wait about 5 minutes or so. Try not to do any activity on the box so that
we can focus on what's already queued and the RPC calls that are in progress. Then:
# rpcdebug -m rpc -c all
...after that, save off the log to a file:
# dmesg > /tmp/rpc-debug-log.txt
...and attach it here so we can see what the client is actually doing and
why it's not making progress.
Created attachment 486078 [details]
RPC debug from RHEL 6
Not much info there. Stripping out the soft lockup messages gives this:

-pid- flgs status -client- --rqstp- -timeout ---ops--
39502 0001 0 ffff880c44937200 ffff880c46e690d8 0 ffffffffa0308e70 nfsv3 READ a:call_status q:none
38818 0001 0 ffff880c452e3000 ffff880c474e4930 0 ffffffffa0309000 nfsv3 COMMIT a:call_status q:none
38819 0001 0 ffff880c452e3000 ffff880c474e4310 0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none
38820 0001 0 ffff880c452e3000 ffff880c474e4620 0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none
39014 0001 0 ffff880c452e3000 ffff880c474e4498 0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none
39232 0001 0 ffff880c452e3000 ffff880c474e4ab8 0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none
39249 0001 0 ffff880c452e3000 ffff880c474e4c40 0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none
RPC: xs_close xprt ffff880c475df800
RPC: disconnected transport ffff880c475df800
RPC: wake_up_next(ffff880c475dfa58 "xprt_resend")
RPC: wake_up_next(ffff880c475df990 "xprt_sending")
-pid- flgs status -client- --rqstp- -timeout ---ops--
39502 0001 0 ffff880c44937200 ffff880c46e690d8 0 ffffffffa0308e70 nfsv3 READ a:call_status q:none
38818 0001 0 ffff880c452e3000 ffff880c474e4930 0 ffffffffa0309000 nfsv3 COMMIT a:call_status q:none
38819 0001 0 ffff880c452e3000 ffff880c474e4310 0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none
38820 0001 0 ffff880c452e3000 ffff880c474e4620 0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none
39014 0001 0 ffff880c452e3000 ffff880c474e4498 0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none
39232 0001 0 ffff880c452e3000 ffff880c474e4ab8 0 ffffffffa0308fc0 nfsv3 WRITE a:call_status q:none

I'll have to look more closely at the messages, but at first glance this looks quite different from the problem seen on F14 at Pixar. Adam, some details about how they collected this might be helpful. What specifically did they do to enable and disable debugging, and how long did they leave rpc_debug enabled?

The fact that the task->tk_timeout is 0 in all of these seems quite odd. I'm thinking that should be set to some positive value, but there may be a reason for it being that way.

Jeff- We collected the rpc_debug with your instructions in Comment 23:

# rpcdebug -m rpc -s all
...wait about 5 minutes or so. Try not to do any activity on the box so that we can focus on what's already queued and the RPC calls that are in progress. Then:
# rpcdebug -m rpc -c all
...after that, save off the log to a file:
# dmesg > /tmp/rpc-debug-log.txt

Looking at 20110317161658_1 first. Just some notes first for future analysis:

The maya.bin process is stuck waiting for a COMMIT to complete. Note that the stack dump that the soft lockup message produces often has some "junk" in it. It's really stuck here, which is probably the wait_on_bit call in nfs_commit_inode:

#6 [ffff88035085b908] prepare_to_wait_exclusive at ffffffff81091ff3
#7 [ffff88035085b940] __wait_on_bit_lock at ffffffff814c9e68
#8 [ffff88035085b990] out_of_line_wait_on_bit_lock at ffffffff814c9f58
#9 [ffff88035085ba00] nfs_commit_inode at ffffffffa025dc9a
#10 [ffff88035085ba70] nfs_wb_page at ffffffffa025de29
#11 [ffff88035085bb10] nfs_flush_incompatible at ffffffffa025dec0

Scraping the stack for info gives:

filp: 0xffff880c4756dc00
dentry: 0xffff88097db793c0
superblock: 0xffff880c46420c00

This is the mount at /data/adai0101. Digging down into the NFS and RPC fields shows:

nfs_server: 0xffff880c3dd1dc00
rpc_clnt: 0xffff880c45934400
rpc_xprt: 0xffff880c43f93000

To get the list of rpc_tasks hanging off this rpc_clnt:
crash> list rpc_task.tk_task -s rpc_task -H 0xffff880c45934418
The tk_action of all of them is call_status. That suggests that the calls have been transmitted to the server and the client is just waiting on the result.
tk_runstate of all of them is 0x5, which is (RPC_TASK_RUNNING|RPC_TASK_ACTIVE)
tk_flags on all of them are RPC_TASK_ASYNC. The tk_callback on all of them is xprt_timer, which I *think* means that it's waiting for the transport to settle.
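For reference, tk_runstate is a plain bitmask; the bit numbers below match the values quoted later in this bug (RPC_TASK_RUNNING = 0, RPC_TASK_ACTIVE = 2) and are a sketch from the 2.6.32-era include/linux/sunrpc/sched.h, not verbatim source:

/* Sketch of the tk_runstate bit numbers (2.6.32-era sched.h; paraphrased). */
#define RPC_TASK_RUNNING   0   /* task is runnable / being executed */
#define RPC_TASK_QUEUED    1   /* task is sleeping on an rpc_wait_queue */
#define RPC_TASK_ACTIVE    2   /* task has not yet completed */

/* Decoding the value seen in the core:
 * tk_runstate == 0x5 == (1 << RPC_TASK_RUNNING) | (1 << RPC_TASK_ACTIVE),
 * i.e. RUNNING and ACTIVE are set, but QUEUED is not.
 */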
Moving to the state of the socket:
crash> struct socket 0xffff880c16491940
struct socket {
state = SS_CONNECTING,
crash> struct sock.__sk_common 0xffff8801df177200 | grep skc_state
skc_state = 0x1,
...I believe the skc_state is TCPF_ESTABLISHED
Correction...that should be TCP_ESTABLISHED

Ok, socket states do seem a bit odd, but it may be normal (I don't understand the socket layer code that well). For now, I'll return focus back to what the processes are actually blocking on and why they're not making progress:

inode = 0xffff8801d7673700

...to get to nfs_inode, pretend the vfs_inode field is a list_head:

crash> struct nfs_inode.flags -l nfs_inode.vfs_inode 0xffff8801d7673700
flags = 0x84,

...which is NFS_INO_COMMIT|NFS_INO_FLUSHING. Looking at the disassembly of that function confirms that it's waiting for NFS_INO_COMMIT to clear. So we now know what maya.bin is waiting on. Now to figure out why it's not happening...

Actually...it looks like it's trying to set NFS_INO_COMMIT inside of nfs_commit_set_lock. So it seems likely that another task had kicked off a commit (possibly an async one) and that didn't happen for some reason. Regardless, maya.bin is still waiting on that bit to clear, so the next step is to figure out why that's not happening. It should be happening when the commit response is received...

There's only one commit operation running:
crash> struct rpc_task ffff88013dfba948
...confirmed by looking at the tk_ops:
struct rpc_call_ops {
rpc_call_prepare = 0xffffffffa025ce00 <nfs_write_prepare>,
rpc_call_done = 0xffffffffa025c2d0 <nfs_commit_done>,
rpc_release = 0xffffffffa025e030 <nfs_commit_release>
}
tk_owner = 29765,
...which is the pid of maya.bin. Looking at the rpc_rqst:
rq_reply_bytes_recvd = 0x98,
rq_connect_cookie = 0x0,
rq_bytes_sent = 0x0,
rq_xtime = {
tv64 = 0x1f2a0710a7433
},
...so bytes_sent is 0, but bytes recvd is 0x98. Looking back at rpc_xprt:
state = 0x12,
connect_cookie = 0x0,
...which is XPRT_CONNECTED|XPRT_BOUND. So the rpc_xprt is basically considered to be established.
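For reference, the xprt->state bit numbers below are a paraphrase of the 2.6.32-era include/linux/sunrpc/xprt.h (not verbatim source); they match the decodes quoted in this bug:

/* Sketch of the rpc_xprt state bits (2.6.32-era xprt.h; paraphrased,
 * other bits omitted). */
#define XPRT_LOCKED            0
#define XPRT_CONNECTED         1
#define XPRT_CONNECTING        2
#define XPRT_BOUND             4
#define XPRT_CONNECTION_CLOSE  8

/* state == 0x12 == (1 << XPRT_CONNECTED) | (1 << XPRT_BOUND) */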
The connect_cookie is 0 though, which seems very odd -- how do we get to a state with XPRT_CONNECTED|XPRT_BOUND without ever incrementing the connect_cookie? That value starts at 0 when the xprt is created and is incremented whenever the tcp state changes to TCP_FIN_WAIT1 or TCP_SYN_SENT.
How do we get a connected TCP socket without ever entering TCP_SYN_SENT?
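For context, the pattern being described looks roughly like the following; this is a loose paraphrase of the xs_tcp_state_change() callback structure, not verbatim 2.6.32 source:

/* Rough paraphrase of the sk->sk_state_change callback for TCP transports
 * (xs_tcp_state_change); not verbatim kernel source. The point is that
 * connect_cookie is only bumped from this callback for certain TCP states,
 * so if the callback never fires for SYN_SENT, the cookie never moves. */
static void xs_tcp_state_change(struct sock *sk)
{
	struct rpc_xprt *xprt = sk->sk_user_data;

	switch (sk->sk_state) {
	case TCP_ESTABLISHED:
		/* mark the transport connected, wake up waiters */
		break;
	case TCP_FIN_WAIT1:
	case TCP_SYN_SENT:
		xprt->connect_cookie++;	/* connection torn down / being set up */
		break;
	/* ... remaining states handle close/reset ... */
	}
}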
More anecdotal evidence that there is a problem with the socket handling. There are 21 jobs queued to rpciod. 17 are rpc_async_schedule jobs, and the other 4 are xprt_autoclose. What's not clear to me is why rpciod doesn't seem to be picking them up...

"bytes_sent is 0, but bytes recvd is 0x98"
I think that's right:
/* If we've sent the entire packet, immediately
* reset the count of bytes sent. */
req->rq_bytes_sent += status;
req->rq_xmit_bytes_sent += status;
if (likely(req->rq_bytes_sent >= req->rq_slen)) {
req->rq_bytes_sent = 0;
return 0;
}
And 0x98 is the length of a commit response (starting from the top of the rpc header) with pre- and post- attributes. So that much sounds normal.
"how do we get to a state with XPRT_CONNECTED|XPRT_BOUND without ever incrementing the connect_cookie? That value starts at 0 when the xprt is created and is incremented whenever the tcp state changes to TCP_FIN_WAIT1 or TCP_SYN_SENT." Grepping for sk_state_change in net/ipv4/, it looks like we don't actually get a callback in the TCP_SYN_SENT case. The TCP_SYN_SENT case in xs_tcp_state_change might even be completely dead code. Created attachment 486341 [details]
SUNRPC: Deal with the lack of a SYN_SENT sk->sk_state_change callback...
Hi Bruce
Yes. I was sure I had seen a state_change callback for SYN_SENT, but
grepping the code, I can't find it any more.
This patch ought to fix the issue by relying on the return value from
kernel_connect() instead of the state_change callback.
Ok, that helps. Thanks, Bruce...

So, it doesn't really look like anything is amiss with what I've found so far. Still though, it does at least seem like the COMMIT call is not progressing. The question at this point is why... The rpc_debug info collected by Kevin seems to indicate that no real RPC activity is going on. Could there be an issue with the workqueue?

See your comment #27. If all the RPC calls have infinite timeouts, then something from outside the usual call path has to wake them up. The intention is that the socket callbacks will fill that role...

Ok, yeah...I think I see now. All of the queued tasks have this:

tk_callback = 0xffffffffa01ced70 <xprt_timer>,
tk_action = 0xffffffffa01ccb60 <call_status>,

...which suggests that they all went through xprt_transmit and went to sleep on xprt->pending. Trond, I guess your theory is that a disconnect/reconnect occurred, and since the SYN_SENT callback never happened those tasks were never awakened. Is that correct?

Having a look at the second core to confirm that the problem there is similar...

dentry = 0xffff8801c48bfa40
inode = 0xffff880218969378
sb = 0xffff880c44aac800
nfs_server = 0xffff880c46447c00
rpc_clnt = 0xffff880c46e73400

...the problem does seem to be the same. There are fewer RPCs queued up with this rpc_clnt, but they seem to be in the same state.

So this is what I don't understand: The tk_callback == xprt_timer, which means they were put to sleep by xprt_transmit. However, the tk_timeout == 0, which suggests that the call to xprt->ops->set_retrans_timeout() has somehow failed. That again means that task->tk_rqstp->rq_timeout must be zero. Can you check the value of task->tk_client->cl_timeout, and see what it is pointing to?

FWIW, bug 674427 has a report of a soft lockup with the same backtrace.

Also, I've checked a few cores - in the first one from case 00434772 the server was "warmmilk32"; all cores from 2011/03/17 are for "hotspot". Just wanted to eliminate the possibility of server-induced problems.

Created attachment 486592 [details]
various rpc info relating to commit task in core1
Here's a collection of the following, which should help answer Trond's question. Let me know if you need more. The rpc_timeout looks normal to me (not zeroed out):
struct rpc_task ffff88013dfba948
struct rpc_rqst 0xffff8801deb1c498
struct rpc_clnt 0xffff880c45934400
struct rpc_timeout 0xffff880c45934540
Also as Max mentions, server behavior could be a factor here. Maybe it's spuriously disconnecting or something?
That shouldn't cause the client to hang like this though. There seems to clearly be a client-side bug here.
Created attachment 486593 [details]
various rpc info relating to commit task in core1
Also, rq_timeout for the attached rpc_rqst is not 0, it's 60000. Here's an updated file that also contains the rpc_xprt.
Ok, I see what Trond is saying, but I don't see how set_retrans_timeout can reasonably fail. I think that leaves us 2 possibilities:

1) the rq_timeout was set to 0 when xprt_transmit called set_retrans_timeout

...or...

2) the tk_timeout was set to a non-zero value at that time but was later zeroed by something else

The second possibility seems more likely to me.

The task->tk_runstate is 0x5, which is:

#define RPC_TASK_RUNNING 0
#define RPC_TASK_ACTIVE 2

...RPC_TASK_QUEUED is not set, but it should be if it's still sleeping on the waitqueue.

One thought... rpc_make_runnable does this:
rpc_clear_queued(task);
if (rpc_test_and_set_running(task))
return;
Could we be hitting a condition where we're trying to wake up the task while QUEUED and RUNNABLE are both set? We set the QUEUED bit when the task goes on the waitqueue, but we don't clear the runnable bit until some time later in __rpc_execute. If the task was woken up between those events might they get stuck like this?
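(For reference, the QUEUED/RUNNING flags referenced here are plain atomic bit operations on tk_runstate; a paraphrase of the 2.6.32-era helpers, not verbatim source:)

/* Paraphrase of the helpers used by rpc_make_runnable() (not verbatim;
 * the real macros also include memory barriers). */
#define rpc_clear_queued(t) \
	clear_bit(RPC_TASK_QUEUED, &(t)->tk_runstate)
#define rpc_test_and_set_running(t) \
	test_and_set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)

/* The window being discussed: QUEUED is set when the task goes onto a wait
 * queue, but RUNNING is only cleared later in __rpc_execute(), so a wakeup
 * racing into that window sees RUNNING still set and returns early. */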
If both QUEUED and RUNNABLE are set, then that means we're still in the __rpc_execute() loop. In that case, we rely on the section that is protected by the queue->lock to avoid conflicts with rpc_make_runnable().

That raises an issue, though. What is the status of task->u.tk_work? Does task->u.tk_work.data contain a valid set of WORK_STRUCT bits, and if so, what are their values?

Oh... Another point: task sleep and wake up is designed to be soft-irq safe, but not hard irq safe. We might want to add in a defensive WARN_ON(in_irq()) to rpc_make_runnable() just in case we're hitting some poorly designed network driver that is trying to call us back from the wrong kind of context.

Will do... I'll get a kernel building with these three patches:

1) the one liner from our Pixar trip
2) the patch from Comment 46
3) adding the following WARN_ON():

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index cace604..c462fab 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -281,6 +281,8 @@ EXPORT_SYMBOL_GPL(__rpc_wait_for_completion_task);
  */
 static void rpc_make_runnable(struct rpc_task *task)
 {
+	WARN_ON(in_irq());
+
 	rpc_clear_queued(task);
 	if (rpc_test_and_set_running(task))
 		return;

Yeah, I've been wondering about the workqueue too. Here's tk_work...
tk_work = {
data = {
counter = 0xffffe8ffffccce81
},
entry = {
next = 0xffff880a1429c718,
prev = 0xffffe8ffffccce88
},
func = 0xffffffffa01d5760 <rpc_async_schedule>
},
...how the "data" field gets set is quite tricky...working down through the definition of INIT_WORK:
(_work)->data = (atomic_long_t) WORK_DATA_INIT();
#define WORK_DATA_INIT() ATOMIC_LONG_INIT(WORK_STRUCT_NO_CPU)
WORK_STRUCT_NO_CPU = WORK_CPU_NONE << WORK_STRUCT_FLAG_BITS,
WORK_CPU_NONE = NR_CPUS + 1,
CONFIG_NR_CPUS = 4096
WORK_STRUCT_FLAG_BITS = WORK_STRUCT_COLOR_SHIFT +
WORK_STRUCT_COLOR_BITS,
...I think CONFIG_DEBUG_OBJECTS_WORK is unset, so:
WORK_STRUCT_COLOR_SHIFT = 4, /* color for workqueue flushing */
WORK_STRUCT_COLOR_BITS = 4,
So the counter should be initialized by INIT_WORK to 0x100100. It's not that here, but I'm not sure whether it gets altered later or what. Trond, is there something you wanted me to specifically check for? I might be able to get some help determining the state of the tk_work if so...
Disregard that...I was looking at a mainline-ish kernel which has all of the CMWQ changes. In RHEL6, ->data is initialized to 0:
#define WORK_DATA_INIT() ATOMIC_LONG_INIT(0)
...I'll look at what it ends up pointing to as it progresses.
Ok, the work_struct starts its life with ->data initialized to 0. It then has this happen when it's queued:
if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
WORK_STRUCT_PENDING is bit 0, and that bit seems to be set on this tk_work. So, this was queued since it was last initialized. It looks like run_workqueue calls work_clear_pending before calling ->func, so it looks like this job was queued to the workqueue but has not run.
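In the pre-CMWQ workqueue code, the low bit of work->data is the pending flag and the remaining bits are a pointer to the per-CPU workqueue the job was queued on; roughly (paraphrased from the 2.6.32-era kernel/workqueue.c, not verbatim):

/* Paraphrase of how the pre-CMWQ workqueue packs work->data (not verbatim):
 * bit 0 is WORK_STRUCT_PENDING, and masking the flag bits off leaves the
 * cpu_workqueue_struct pointer that the work item was queued on. */
static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
{
	return (void *)(atomic_long_read(&work->data) & ~(unsigned long)1);
}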
Masking off 0x1 in the ->data gives us this:
crash> struct cpu_workqueue_struct 0xffffe8ffffccce80
struct cpu_workqueue_struct {
lock = {
raw_lock = {
slock = 0x85cb85cb
}
},
worklist = {
next = 0xffff88013dfba9d8,
prev = 0xffff88021675f3d8
},
more_work = {
lock = {
raw_lock = {
slock = 0xaca0aca0
}
},
task_list = {
next = 0xffffe8ffffcccea0,
prev = 0xffffe8ffffcccea0
}
},
current_work = 0x0,
wq = 0xffff880c43c94380,
thread = 0xffff880c469a14e0
}
...sanity checking the wq field:
crash> struct workqueue_struct 0xffff880c43c94380
struct workqueue_struct {
cpu_wq = 0x60ffd7a0ce80,
list = {
next = 0xffff880c43d05188,
prev = 0xffff880c3c2aac88
},
name = 0xffffffffa01ea024 "rpciod",
singlethread = 0x0,
freezeable = 0x0,
rt = 0x0
}
task 0xffff880c469a14e0 is rpciod/3:
crash> set 0xffff880c469a14e0
PID: 1588
COMMAND: "rpciod/3"
TASK: ffff880c469a14e0 [THREAD_INFO: ffff880c3dc8c000]
CPU: 3
STATE: TASK_RUNNING
crash> bt
PID: 1588 TASK: ffff880c469a14e0 CPU: 3 COMMAND: "rpciod/3"
#0 [ffff880c3dc8dd70] schedule at ffffffff814c8f68
#1 [ffff880c3dc8de38] worker_thread at ffffffff8108c7ac
#2 [ffff880c3dc8dee8] kthread at ffffffff81091a86
#3 [ffff880c3dc8df48] kernel_thread at ffffffff810141ca
...so it's TASK_RUNNING, and there are 22 jobs queued to the ->worklist, but it seems to still be sitting in schedule(). Disassembly confirms that it is sitting in the schedule() call in worker_thread.
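For context, the pre-CMWQ per-CPU worker loop looks roughly like this (paraphrased from memory of 2.6.32-era kernel/workqueue.c, not verbatim source). Note that even after the thread is woken (hence TASK_RUNNING), schedule() does not return until the scheduler actually puts it back on the CPU, so a task spinning on the same CPU can starve it:

/* Rough paraphrase of the 2.6.32-era worker loop (not verbatim source). */
static int worker_thread(void *__cwq)
{
	struct cpu_workqueue_struct *cwq = __cwq;
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
		if (!kthread_should_stop() && list_empty(&cwq->worklist))
			schedule();		/* <-- where rpciod/3 is sitting */
		finish_wait(&cwq->more_work, &wait);

		if (kthread_should_stop())
			break;

		run_workqueue(cwq);		/* drain cwq->worklist */
	}
	return 0;
}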
So it seems like it went to sleep and got woken up, but the scheduler hasn't given it any time? maya.bin is on CPU3 as well, and that's where we're seeing the softlockup messages. Could it somehow be stuck continually trying to wait but can't, and so it can't relinquish the CPU to allow the workqueue jobs to run?

Is there a fatal signal pending on maya.bin? Otherwise it should be calling schedule()?

Just a quick update. Lars has been running the 2.6.32-118.el6.bz672305.x86_64 test kernel for the better part of a week with no hangs...so far. Although he did mention that they had a power outage early yesterday morning which resulted in rendering jobs being held off completely that day. I'll keep everyone posted.

task->stack->flags for maya.bin is 0x8c == 10001100. Which is:
TIF_SIGPENDING
TIF_NEED_RESCHED
TIF_SYSCALL_AUDIT
...so yes, there is a signal pending:
signal = {
sig = {0x100}
}
...which I think is SIGKILL.
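Quick sanity check on that: the kernel stores pending signals as a bitmask with signal N in bit N-1, so 0x100 is bit 8, i.e. signal 9 (SIGKILL). A trivial standalone check:

/* Standalone check that a pending-signal mask of 0x100 corresponds to
 * SIGKILL (signal N occupies bit N-1 in the sigset). */
#include <signal.h>
#include <stdio.h>

int main(void)
{
	unsigned long mask = 0x100;

	printf("bit set: %d, SIGKILL: %d, match: %s\n",
	       __builtin_ctzl(mask) + 1, SIGKILL,
	       (mask == (1UL << (SIGKILL - 1))) ? "yes" : "no");
	return 0;
}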
Oh crap... I see the problem... If nfs_commit_set_lock() fails, then nfs_commit_inode() returns 0, which nfs_wb_page() interprets as 'success!', and loops again...

Yeah, I was just looking at the same thing. That's almost certainly the bug.

(In reply to comment #69)
> Just a quick update.
> Lars has been running the 2.6.32-118.el6.bz672305.x86_64 test kernel for the
> better part of a week with no hangs...so far. Although he did mention that they
> had a power outage early yesterday morning which resulted in rendering jobs
> being held off completely that day.

That's good to know. Even though the symptoms are the same, it seems quite likely that Disney and Pixar are hitting different bugs.

Created attachment 486678 [details]
NFS: Fix a hang/infinite loop in nfs_wb_page()
When one of the two waits in nfs_commit_inode() is interrupted, it
returns a non-negative value, which causes nfs_wb_page() to think
that the operation was successful causing it to busy-loop rather
than exiting.
It also causes nfs_file_fsync() to incorrectly report the file as
being successfully committed to disk.
This patch fixes both problems by ensuring that we return an error
if the attempts to wait fail.
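To make the failure mode concrete, here is a rough sketch of the pre-patch logic being described; it is a paraphrase with simplified structure, not the verbatim RHEL6/2.6.32 source:

/* Rough paraphrase of the pre-fix behaviour (not verbatim kernel source). */
static int nfs_commit_inode(struct inode *inode, int how)
{
	int may_wait = how & FLUSH_SYNC;
	int res;

	/* nfs_commit_set_lock() waits (killably) for NFS_INO_COMMIT.
	 * Before the fix, an interrupted wait surfaced here as 0... */
	res = nfs_commit_set_lock(NFS_I(inode), may_wait);
	if (res <= 0)
		return res;	/* ...so the caller just saw "nothing committed" */

	/* ...otherwise kick off the COMMIT and (optionally) wait for it. */
	return 0;
}

/*
 * nfs_wb_page() loops on "write the page, then commit it", treating any
 * non-negative return as progress. With SIGKILL pending, the killable wait
 * keeps failing, nfs_commit_inode() keeps returning 0, and the task spins
 * in that loop on one CPU, which also starves the rpciod work queued on
 * the same CPU. The fix makes the interrupted wait surface as a negative
 * error so the loop (and fsync) can bail out.
 */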
Thanks, Trond...

Steve, could you add the patch in comment #74 to your test kernel? Once we have that, I think we should give it to Disney to test. Pixar is also welcome to do so too, but we don't know for sure that they actually hit this bug.

(In reply to comment #74)
> Created attachment 486678 [details]
> NFS: Fix a hang/infinite loop in nfs_wb_page()
>
> When one of the two waits in nfs_commit_inode() is interrupted, it
> returns a non-negative value, which causes nfs_wb_page() to think
> that the operation was successful causing it to busy-loop rather
> than exiting.
> It also causes nfs_file_fsync() to incorrectly report the file as
> being successfully committed to disk.
>
> This patch fixes both problems by ensuring that we return an error
> if the attempts to wait fail.

Do you want *just* the patch in comment #74 in the test kernel, or should I just add it to the other two from comment #60?

In my opinion, we should test just the patch from comment #74 for now. It looks as if that is the correct fix for Disney's problem.

That works for me. Eventually, we'll want to test the whole swath of fixes (including the sunrpc ones), but it might be nice to confirm that this patch is the fix, particularly if this is easily reproducible.

Also to confirm, the second core shows the same situation as the first. Jobs there are queued up on rpciod/4 and a SIGKILL'ed maya.bin process is spinning on CPU4 in the same spot.

(In reply to comment #77)
> In my opinion, we should test just the patch from comment #74 for now. It looks
> as if that is the correct fix for Disney's problem.

Cool... The kernel is in the oven... Thanks for all your help!!!

Setting needinfo flag for reporter as we await test results.

I believe we're seeing the same problem. See the thread here:

http://marc.info/?l=linux-nfs&m=129685584924582&w=2

I have applied the patch from comment #74, and the preliminary report I have is that it still hangs, but now the iozone process is interruptible, where it wasn't before, and the call trace has changed:

INFO: task iozone:3824 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iozone D 0000000104ba67e9 0 3824 1 0x00000080
ffff88023c23fc30 0000000000000086 0000000000000000 00000000ffffffff
0000000000013700 0000000000013700 0000000000013700 ffff88023d4044d0
0000000000013700 ffff88023c23ffd8 0000000000013700 0000000000013700
Call Trace:
[<ffffffff81136829>] ? lock_flocks+0x10/0x12
[<ffffffff81137713>] ? __posix_lock_file+0x60/0x441
[<ffffffff814377b6>] rwsem_down_failed_common+0xd3/0x105
[<ffffffff81137d07>] ? posix_lock_file+0x11/0x13
[<ffffffff8143780f>] rwsem_down_read_failed+0x12/0x14
[<ffffffff812072e4>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff81436f4b>] ? down_read+0x32/0x36
[<ffffffffa03cf842>] nfs4_proc_lock+0x21f/0x32d [nfs]
[<ffffffffa03b6682>] do_setlk+0x61/0xd5 [nfs]
[<ffffffffa03b6a94>] nfs_lock+0x16e/0x17d [nfs]
[<ffffffff8103bb05>] ? should_resched+0x9/0x29
[<ffffffff81137d2d>] vfs_lock_file+0x24/0x30
[<ffffffff8113848c>] fcntl_setlk+0x169/0x2fb
[<ffffffff811025b9>] ? fsnotify_access+0x5d/0x65
[<ffffffff8110f915>] sys_fcntl+0x31c/0x4c9
[<ffffffff814389a7>] ? do_device_not_available+0x9/0xb
[<ffffffff8100aa12>] system_call_fastpath+0x16/0x1b

Guess I should mention that this is on 2.6.38-rc5, and I had to adapt the patch to make it apply.

Hmmm...that really looks like a different bug entirely. The hallmarks of this one are mainly that the machine is stuck looping while trying to issue a commit.
That prevents the rpciod workqueue jobs scheduled on the same CPU from running.

The stack trace you just posted looks more like something in the locking code. Possibly a problem with locking the nfsi->rwsem? That might indicate a problem with lock recovery. Are there any *-manager threads running when you hit this? If so, what are they doing?

In any case, it might be best to continue discussion of that on linux-nfs since this bug is really about a RHEL6 problem, or maybe open a bug at the kernel.org bug tracker that we can all follow.

Five renderfarm machines survived a night's rendering while using the patch from comment #74. That's a promising start, so I've added five more machines today and will continue to watch them. If those ten are still doing well tomorrow morning, I'll add another ten and let them all cook for a couple of days. Thanks for working hard on this! My fingers are crossed and I'll update again tomorrow. :)

I'm not sure why it was posted as a private comment, but Disney has also reported that the patched kernel has been working for them.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Until this weekend, we hadn't seen a single machine hang since we received the patched kernel, on either the patched machines or the stock rhel6 machines. We had 21 machines crash over the weekend, and all of them were running the unpatched stock rhel6 kernel. All of our patched machines made it through the weekend's rendering load with no hangs. I'll start installing the patch on some workstations for users who've been suffering with the bug. Seems like this is it! Thanks.

Excellent news. Thanks for the updates. I'll see whether we can get this into 6.1...

hi Jeff, could you reproduce the problem? It seems the customer has an environment that can verify the problem, so how should QE test this internally?

(In reply to comment #97)
> hi Jeff, could you reproduce the problem? It seems the customer has an
> environment that can verify the problem, so how should QE test this internally?

I'm afraid not. It's really a matter of timing and workload. I don't know of any way to reproduce this reliably.

Alas, I spoke too soon. We had three hangs on patched machines this morning. These machines are running 2.6.35.9-64 with the patch from comment 74. I'm getting a group of machines to run 2.6.38.2-8, from the package found here:

http://kojipkgs.fedoraproject.org/packages/kernel/2.6.38.2/8.fc15/

I'm hoping that some other changes between 2.6.35.9 and 2.6.38.2 are also required to fix the problem.

-lars

(In reply to comment #102)
> Alas, I spoke too soon. We had three hangs on patched machines this morning.

They are all hung in the same place? Maybe these hangs are a different bug...

The symptoms are identical to the situation you mentioned in comment #1: two processes waiting for the same lock. The traces look the same. In these three cases, one of the processes is the flush thread, and the other is a process belonging to the rendering pipeline (it isn't the same one each time).

thanks,
-lars

Yeah, this is a bit confusing... This bug was originally opened against RHEL6 under the assumption that the problem seen there was the same as the one that Pixar was seeing on F14 kernels.
The problem is, though, that we never had any conclusive proof of that. In the meantime, the case from Disney was attached to this one, again under the assumption that the problem was the same.

The one RHEL6 core that we got from Pixar was at least partially corrupt. When I look at it now in light of the bug that the patch in comment #74 fixes, it shows none of the hallmarks of that problem. The CPUs are all idle aside from the one that is writing to /proc/sysrq-trigger, and the process that is stuck (dfa) doesn't have nfs_wb_page() in its stack. At this point we have to conclude that they are different problems.

I think that the right course of action is to use this bug to track the patch that Trond put in comment #74, and consider opening a new bug or cloning this one to track the problem that Pixar has seen. If we do that, I'll plan to take a fresh look at the core that we do have from Pixar to see if I can learn more about the cause there.

Much appreciated--thanks, Jeff.

-lars

Since there is no cloned bug yet, I'll keep notes in this one for now. I'll plan to transfer these to that bug unless someone beats me to it.

It looks like the core from Pixar might not be corrupt after all. Somehow we ended up with the wrong debuginfo, which threw off all of the struct alignment. I went back and un-cpio'ed the correct debuginfo and now things are looking more sane. These values are correct:

superblock: 0xffff88042c1c6400
nfs_server: 0xffff880427f54400
rpc_clnt: 0xffff88042b815a00
rpc_xprt: 0xffff8804060b5800

I'll have to go back now and work through the different data structures.

There are a bunch of tasks in the cl_tasks queue. Most have tk_action set to call_reserveresult, but some have call_connect_status, and a couple are in call_status.

Interesting fields from the xprt:

state == 0x115 (XPRT_LOCKED|XPRT_CONNECTING|XPRT_BOUND|XPRT_CONNECTION_CLOSE)
inet->skc_state == 0x7 (TCP_CLOSE)
sock->state == SS_UNCONNECTED

Also:
connect_worker = {
work = {
data = {
counter = 0xffffe8ffffc88bc1
},
entry = {
next = 0xffff8804060b5e80,
prev = 0xffff8804060b5e80
},
func = 0xffffffffa0413210 <xs_tcp_connect_worker4>
},
timer = {
entry = {
next = 0xffff880427c4af50,
prev = 0xffff8804060b5b78
},
expires = 0x113c98b7c,
function = 0xffffffff8108cbf0 <delayed_work_timer_fn>,
data = 0xffff8804060b5e78,
base = 0xffff88042e224000,
start_site = 0x0,
start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
start_pid = 0xffffffff
}
...the fact that the 1 is set means that WORK_STRUCT_PENDING is set. The job is queued to a workqueue but has not yet run. jiffies at the time of the crash is
0x113c979d4, so it's just waiting for the timer to pop.
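As a quick cross-check of how far out that timer is, using the expires and jiffies values pulled from the core above (HZ=1000 is an assumption here, the usual RHEL6 x86_64 default):

/* Standalone arithmetic check on the timer values from the core. */
#include <stdio.h>

int main(void)
{
	unsigned long long expires = 0x113c98b7cULL;	/* from connect_worker.timer */
	unsigned long long jiffies = 0x113c979d4ULL;	/* jiffies at crash time */
	unsigned long long delta = expires - jiffies;	/* 0x11a8 = 4520 jiffies */

	printf("timer fires in %llu jiffies (~%.1f s at HZ=1000)\n",
	       delta, delta / 1000.0);
	return 0;
}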
The tasks are all write operations.
crash> struct rpc_call_ops 0xffffffffa053e020
struct rpc_call_ops {
rpc_call_prepare = 0xffffffffa0519e00 <nfs_write_prepare>,
rpc_call_done = 0xffffffffa0519d60 <nfs_writeback_done_full>,
rpc_release = 0xffffffffa051a560 <nfs_writeback_release_full>
}
...not too surprising since "dfa" is attempting to close a file and is writing everything out. So at this point, we have a bunch of tasks queued up, but the xprt isn't connected so they can't do anything.
Some more info from xprt:
reestablish_timeout = 0x17700,
connect_cookie = 0x73c,
task_cleanup = {
data = {
counter = 0xffffe8ffffc88bc0
},
entry = {
next = 0xffff8804060b5d00,
prev = 0xffff8804060b5d00
},
func = 0xffffffffa04114a0 <xprt_autoclose>
},
It looks like the kernel is trying to reconnect the socket and can't, but right offhand I can't quite tell why. Is there any chance that Pixar could reproduce this again on RHEL6 and turn up rpcdebug? That may tell us whether this is similar to the problem we saw on Fedora when we were there...

Cloned to BZ#672305 for Pixar

Sorry, please ignore my last comment. Cloned to BZ#692315 for Pixar:

https://bugzilla.redhat.com/show_bug.cgi?id=692315

(In reply to comment #108)
> There are a bunch of tasks in the cl_tasks queue. Most have tk_action set to
> call_reserveresult, but some have call_connect_status, and a couple are in
> call_status.
>
> Interesting fields from the xprt:
>
> state == 0x115 (XPRT_LOCKED|XPRT_CONNECTING|XPRT_BOUND|XPRT_CONNECTION_CLOSE)
>
> inet->skc_state == 0x7 (TCP_CLOSE)
>
> sock->state == SS_UNCONNECTED

The fact that XPRT_CONNECTION_CLOSE is set means that we should be seeing an instance of xprt->task_cleanup on the workqueue. It doesn't look as if that is the case (data->counter does not have bit 0 set). Basically it looks as if the connection is somehow deadlocked, with the connect code waiting for xprt_autoclose to kill the socket and the latter failing to run.

Patch(es) available on kernel-2.6.32-130.el6

Set wrong flag. Setting hot-fix flag.

*** Bug 697029 has been marked as a duplicate of this bug. ***

*** Bug 698696 has been marked as a duplicate of this bug. ***

*** Bug 675310 has been marked as a duplicate of this bug. ***

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html

We are seeing issues like this again on newer versions of the kernel. I ran into this with a "customer" that has a similar setup, only it uses NFS over RDMA for an HPC cluster. The NFS client nodes are running at very high load averages, and NFS eventually hangs while the server logs "svcrdma: Error fast registering memory for xprt ffff880c1ad31000". The setup is in all ways similar to the ones described here:

https://access.redhat.com/solutions/55816

Could this have regressed into becoming a problem again in newer versions of the kernel?

Regards,
Francisco

Switching needinfo request to Sachin, as Jeff left Red Hat a while ago. I hope Sachin can provide insight here. Please note I'm not involved in the resolution of this bug; I'm just doing a quick favour to ensure the needinfo does not get lost. :)

Francisco,

That seems to be a different problem. Can you please open a new case with Red Hat support.

Sachin Prabhu
Cloned to BZ#672305 for Pixar Sorry, please ignore my last comment. Cloned to BZ#692315 for Pixar https://bugzilla.redhat.com/show_bug.cgi?id=692315 (In reply to comment #108) > There are a bunch of tasks in the cl_tasks queue. Most have tk_action set to > call_reserveresult, but some have call_connect_status, and a couple are in > call_status. > > Interesting fields from the xprt: > > state == 0x115 (XPRT_LOCKED|XPRT_CONNECTING|XPRT_BOUND|XPRT_CONNECTION_CLOSE) > > inet->skc_state == 0x7 (TCP_CLOSE) > > sock->state == SS_UNCONNECTED The fact that XPRT_CONNECTION_CLOSE is set means that we should be seeing an instance of xprt->task_cleanup on the workqueue. It doesn't look as if that is the case (data->counter does not have bit 0 set). Basically it looks as if the connection is somehow deadlocked with the connect code waiting for xprt_autoclose to kill the socket, and the latter failing to run. Patch(es) available on kernel-2.6.32-130.el6 Set wrong flag. Setting hot-fix flag. *** Bug 697029 has been marked as a duplicate of this bug. *** *** Bug 698696 has been marked as a duplicate of this bug. *** *** Bug 675310 has been marked as a duplicate of this bug. *** An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0542.html Getting again issues about this on newer versions of kernel, ran into this issue with a "customer" that has a similar setup only it uses NFS over RDMA for an HPC Cluster setup. The NFS Client nodes are running at very high load averages and the NFS eventually hangs while the server bumps "svcrdma: Error fast registering memory for xprt ffff880c1ad31000". Setup is in all ways similar to the ones described here "https://access.redhat.com/solutions/55816" Could this have regressed into becoming a problem in newer versions of the kernel ? Regards, Francisco Switching needinfo request to Sachin, as Jeff left Red Hat a while ago. I hope Sachin can provide insight here. Please note I'm not related with the resolution of this bug, I'm just doing a quick favour to ensure the needinfo does not get lost. :) Francisco, That seems to be a different problem. Can you please open a new case with Red Hat support. Sachin Prabhu |