Bug 692315
| Summary: | Repeatable NFS mount hang | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Neal Kim <nkim> |
| Component: | kernel | Assignee: | Jeff Layton <jlayton> |
| Status: | CLOSED DUPLICATE | QA Contact: | Filesystem QE <fs-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 6.0 | CC: | adrew, bergwolf, bfields, bhubbard, francois, gansalmon, itamar, jcm, jiali, jlayton, jonathan, kernel-maint, kevin.constantine, kzhang, lars, madhu.chinakonda, makc, Marc.Jordan, nhorman, nkim, nobody+PNT0273897, rees, rwheeler, sbeal, sprabhu, steved, trond.myklebust, yanwang |
| Target Milestone: | rc | Keywords: | OtherQA |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 672305 | Environment: | |
| Last Closed: | 2011-06-14 14:30:31 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 669204, 672305 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Description
Neal Kim
2011-03-31 01:38:53 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains unresolved, it has been rejected, as it was not proposed as an exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

One of our RHEL6 render machines is now hanging, but the symptoms look a little different. I'm not sure what's going on in this case, but I'll post the RPC information here anyway. When I try to get ps output, it doesn't complete, hanging while trying to read one process's cmdline file. That process (a prman instance) is in the D state, reported as "disk sleep", and its wchan file says "call_rwsem_down_write_failed". I'm attaching the output from echoing 0 to /proc/sys/sunrpc/rpc_debug, as well as the sysrq-t trace. Note that there are tons of hung "mem-status" processes; these are jobs run out of cron that inspect all processes, and a lot of them piled up once the machine got into this state.

thanks,
-lars

Created attachment 489841 [details]
sysrq-t output for a RHEL6 machine in a bad state
In this instance the machine may not be suffering from the same bug we've been tracking down, but I'm including it here just in case.
Created attachment 489843 [details]
rpc_debug output for RHEL6 machine in bad state
In this instance the machine may not be suffering from the same bug we've been tracking down, but I'm including it here just in case.
We're also testing Fedora machines running 2.6.38.2, and one of them got into a bad state over the weekend. This one also seems different from the original issue but maybe it contains some clues. There are processes piling up on the machine too, but here they're sleeping in autofs4_wait. I'm leaving the machine alone in case there's any debugging you all would like me to do while it's broken.

Created attachment 489844 [details]
rpc_debug output for F11/2.6.38.2 machine in bad state
Created attachment 489845 [details]
sysrq-t output for F11/2.6.38.2 machine in a bad state
In the RHEL6 info, there are some calls that have tk_action set to call_connect_status, which indicates that they ended up reconnecting the transport:

Apr 4 10:49:12 r3822 kernel: 8221 0080 -11 ffff880429d21200 ffff880106e46620 0 ffffffffa0429b40 nfsv3 GETATTR a:call_connect_status q:xprt_sending

...but that info doesn't contain any actual RPC debug messages. Did none show up, or did you disable RPC debug soon after turning it back on?

The 2.6.38.2 info looks like it may be the same (or similar) problem that we saw when we visited your site:

[440182.760368] RPC: 11478 release request ffff88034665f5e0
[440182.798225] RPC: 11479 reserved req ffff88034665c000 xid e3315ef5
[440182.798233] RPC: 11479 xprt_connect xprt ffff88041d1d0800 is not connected
[440182.798509] RPC: 11479 xprt_connect_status: retrying
[440182.798514] RPC: 11479 xprt_prepare_transmit
[440182.798519] RPC: 11479 xprt_transmit(140)
[440182.798539] RPC: 11479 xmit complete
[440182.806304] RPC: 11479 xid e3315ef5 complete (112 bytes received)
[440182.806322] RPC: 11479 release request ffff88034665c000

...but in this case there's no info about the tasks in queue, and the log doesn't seem to span much time. In any case, at this point XPRT_CONNECTED was not set, but it looks like it ended up doing a transmit anyway and apparently succeeded.

I'll look over what we have so far and see if I can come up with a debugging approach.

On 2.6.38: that kernel contains some of the d_automount changes as well as all of the RCU dcache work that went into 2.6.38. It's possible that this is completely unrelated to any of the previous problems you've seen. It's difficult for me to tell from this info what was actually hung on that kernel and what it's waiting on.
This is a very difficult problem, and now we're trying to chase it down across several kernel versions. So far you have:
2.6.35-ish kernel
RHEL6 kernel
2.6.38.2
...and there's no guarantee that any of these problems are the same. In order to make any progress on this, I think we'll need to settle on a consistent kernel -- preferably RHEL6's since that's the one we officially support.
On the RHEL6 sysrq-t info, I don't see any evidence of an NFS issue. The processes are almost all stuck trying to lock a semaphore so they can read a /proc/<pid> file.
The rpc_debug output just shows a bunch of queued RPC jobs, and it looks like there might be a reconnect in progress. I see no other rpc_debug messages, though, which leads me to believe that it wasn't enabled for very long.
The main takeaway here is this -- if we suspect that this problem is related to the socket getting disconnected and reconnected, then we will likely need rpc_debug output that spans something on the order of 5-10 minutes of real time.
The reason is that there are a number of workqueue jobs that run on timers of that order. If rpc_debug isn't turned up when those jobs run, then we have no info about why they may have failed.
Also with this, we need to proceed from first principles. When the machine is having problems, start by identifying processes that are stuck and then following from there to determine why they're stuck.
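For what it's worth, here is a rough sketch of that kind of collection on a RHEL6 box -- not a prescribed procedure -- assuming the rpcdebug utility from nfs-utils is available, syslog writes to /var/log/messages, and sysrq is enabled; the debug mask, paths, and 10-minute window are assumptions:

    # Identify stuck (D-state) tasks and what they're blocked in. If ps itself
    # hangs on a /proc file, fall back to sysrq-t for full stack traces.
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'
    echo t > /proc/sysrq-trigger       # dumps every task's stack to the kernel log

    # Capture an rpc_debug window long enough for the timer-driven workqueue
    # jobs to fire (roughly 10 minutes), then turn the debugging back off.
    rpcdebug -m rpc -s all             # or: echo 32767 > /proc/sys/sunrpc/rpc_debug
    tail -f /var/log/messages > /tmp/rpc_debug.log &
    sleep 600
    rpcdebug -m rpc -c all             # writing to rpc_debug also dumps the queued RPC tasks
    kill $!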
Fair enough. I'll only send information for kernel 2.6.32-71.el6.x86_64 from now on. In order to really resolve the issue, though, I do need a patch that I can migrate to a Fedora kernel. I'm more than happy to do the port myself once I can see the patch or patches.

I'm attaching a ten-minute rpc debug log and full sysrq-t output from a machine that is clearly suffering from the bug in question. There's a flush thread in the D state, waiting in inode_wait, and a handful of user processes also in the D state.

If it would be helpful I can grant you ssh access to these machines. In the meantime, let me know if there are other logs I can get for you.

thanks,
-lars

Created attachment 492239 [details]
sysrq-t output for a RHEL6 machine having flush/inode_wait issue
Created attachment 492240 [details]
rpc debug output for a RHEL6 machine having flush/inode_wait issue
This contains ten minutes of runtime.
Created attachment 492383 [details]
RPC log snippet
Thanks, Lars. OK, at first glance it looks like this may be a similar problem to the one we saw when we were there a month or so ago. Here we're trying to connect the socket, but it doesn't immediately fail:
RPC: worker connecting xprt ffff880390253000 via tcp to 138.72.236.48 (port 2049)
RPC: ffff880390253000 connect status 115 connected 0 sock state 2
...115 is -EINPROGRESS. Basically that means that the connect has been queued to the TCP layer and the RPC code is just waiting on the result. It appears that this then succeeds:
RPC: xs_tcp_state_change client ffff880390253000...
RPC: state 1 conn 0 dead 0 zapped 1
...state 1 is TCP_ESTABLISHED. The RPC tasks get woken back up and then go to retransmit. The first attempt though fails:
...oops hit submit too quickly. In any case, first attempt fails with an EPIPE:

RPC: 53190 xprt_transmit(57736)
RPC: xs_tcp_send_request(57736) = -32
RPC: xs_tcp_state_change client ffff880390253000...
RPC: state 4 conn 1 dead 0 zapped 1
RPC: 53190 call_status (status -32)

...and now it's in TCP_FIN_WAIT1. It goes into TCP_FIN_WAIT2 and then TCP_CLOSE:

RPC: 53190 call_bind (status 0)
RPC: 53190 call_connect xprt ffff880390253000 is not connected
RPC: 53190 xprt_connect xprt ffff880390253000 is not connected
RPC: 53190 sleep_on(queue "xprt_pending" time 5504045478)
RPC: 53190 added to queue ffff880390253320 "xprt_pending"
RPC: 53190 setting alarm for 60000 ms
RPC: xs_tcp_state_change client ffff880390253000...
RPC: state 5 conn 0 dead 0 zapped 1
RPC: xs_tcp_state_change client ffff880390253000...
RPC: state 7 conn 0 dead 0 zapped 1
RPC: disconnected transport ffff880390253000

...and the transport is disconnected. More to come in a bit...

...a little more:

RPC: 53190 __rpc_wake_up_task (now 5504045478)
RPC: 53190 disabling timer
RPC: 53190 removed from queue ffff880390253320 "xprt_pending"
RPC: __rpc_wake_up_task done
RPC: xs_tcp_state_change client ffff880390253000...
RPC: state 7 conn 0 dead 0 zapped 1
RPC: disconnected transport ffff880390253000
RPC: xs_tcp_data_ready...
RPC: 53190 __rpc_execute flags=0x1

...so we got another state change, but the state didn't seem to actually change here since it was 7 before. But I think Trond mentioned that we don't get sk_state_change callbacks for all state changes, so it's possible that it flipped to another state and then back to TCP_CLOSE. Regardless, there's another interesting one: xs_tcp_data_ready also fired here. Unfortunately this doesn't print the xprt address, so it's possible it was for another xprt entirely. Oh well.

In any case we get some other attempts to send on the socket that fail again with EPIPE:

RPC: 53190 xprt_transmit(57736)
RPC: xs_tcp_send_request(57736) = -32
RPC: xs_tcp_state_change client ffff880390253000...
RPC: state 7 conn 0 dead 0 zapped 1
RPC: disconnected transport ffff880390253000
RPC: 53190 call_status (status -32)
RPC: 53190 call_bind (status 0)
RPC: 53190 call_connect xprt ffff880390253000 is not connected
RPC: 53190 xprt_connect xprt ffff880390253000 is not connected
RPC: 53190 sleep_on(queue "xprt_pending" time 5504045478)
RPC: 53190 added to queue ffff880390253320 "xprt_pending"
RPC: 53190 setting alarm for 60000 ms
RPC: xs_connect delayed xprt ffff880390253000 for 0 seconds
RPC: disconnecting xprt ffff880390253000 to reuse port
RPC: xs_error_report client ffff880390253000...
RPC: error 0
RPC: 53190 __rpc_wake_up_task (now 5504045478)
RPC: 53190 disabling timer
RPC: 53190 removed from queue ffff880390253320 "xprt_pending"
RPC: __rpc_wake_up_task done
RPC: disconnected transport ffff880390253000

...Finally, the connect_worker job fires again and we get this:

RPC: worker connecting xprt ffff880390253000 via tcp to 138.72.236.48 (port 2049)
RPC: ffff880390253000 connect status 99 connected 0 sock state 7
RPC: 53190 __rpc_execute flags=0x1
RPC: 53190 xprt_connect_status: retrying

...and here we see the infamous EADDRNOTAVAIL error (99).

So I guess the first question is, why did we get an EPIPE error on the first attempt to send on the socket after the initial successful reconnect? I didn't see any evidence that we shut down the socket after the connect.
I do see this in tcp_sendmsg:
err = -EPIPE;
if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
        goto out_err;
...could one of those cases have been true? A systemtap script that checks for that might be interesting.
We probably also ought to have Lars test a kernel with the patches that Trond cooked up when we were at Pixar and soon afterward, preferably something close to the 6.1 release kernel. I'll try to cobble the patches together here and add them to the BZ.
If that fails in a similar fashion, then you could collect the same sort of rpc_debug info, or I could maybe log in there and poke at it and we can see whether it has changed what the client is doing here.
sk->sk_shutdown is cleared in tcp_disconnect(), so it should be cleared when we exit from the xs_abort_connection() helper. That helper will also set sk->sk_err to ECONNRESET as far as I can see. Normally sk->sk_err should then be cleared in the call to tcp_connect() so whatever may be setting it would have to do so after that. So, yes, it might not be a bad idea to figure out what the value of sk_err is.

Created attachment 492457 [details]
patchset - fix a couple of issues in xprtsock.c
These two patches are not in RHEL6 yet. One of them is intended to fix the problem that we found while at Pixar. The other fixes another issue we found by inspecting the logs.
What would be good would be to test something close to a 6.1 kernel with these two patches on top. If they don't fix the issue, then turn up rpc_debug again for about 10 mins and collect the logs.
What might also be helpful is to roll up a systemtap script that prints sk->sk_err and sk->sk_shutdown when tcp_sendmsg returns -EPIPE.
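Something along these lines might do it -- only a rough, untested sketch, assuming the 2.6.32 tcp_sendmsg signature (iocb, sock, msg, size) and that matching kernel debuginfo is installed for systemtap:

    # Report sk->sk_err and sk->sk_shutdown whenever tcp_sendmsg returns -EPIPE.
    stap -e '
    global socks
    probe kernel.function("tcp_sendmsg") {
        # Save the struct sock pointer at entry, keyed by thread.
        socks[tid()] = $sock->sk
    }
    probe kernel.function("tcp_sendmsg").return {
        if ($return == -32 && tid() in socks) {     # -32 == -EPIPE
            printf("tcp_sendmsg -EPIPE: sk_err=%d sk_shutdown=%d\n",
                   @cast(socks[tid()], "sock", "kernel")->sk_err,
                   @cast(socks[tid()], "sock", "kernel")->sk_shutdown)
        }
        delete socks[tid()]
    }'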
Sound like a plan?
Just a quick update from Lars in regards to the kernel we gave him. They have had 33 renderfarm machines running 2.6.32-132.el6.jtltest.015.x86_64 for nine days, and so far none of them have shown the mount hang.

"All but five of them are running Fedora 11. This is good news overall, but a handful of the F11 machines have crashed outright, and so far I can't tell why. There's a backtrace on the screen, but as far as I can see the traceback is in code that writes a coredump (do_coredump and elf_core_dump are called). Does the RHEL kernel actually dump core automatically when it crashes? If it does, where does it go? These machines don't have a /var/crash directory."

I'm trying to get a little more clarification, because to me it sounds like 28 Fedora 11 machines running the 2.6.32-132.el6.jtltest.015.x86_64 RHEL kernel.

So I confirmed with Lars, 28 Fedora 11 machines running the 2.6.32-132.el6.jtltest.015.x86_64 RHEL kernel.

Updated machine count:
28 Fedora 11 machines running the patched kernel
5 RHEL6 machines running the patched kernel
5 RHEL6 machines running the stock, unpatched kernel

Hi all, I wanted to send an update after more than a month's testing. The mount hang does appear to be fixed with the RHEL kernel and Jeff's patches. Now that RHEL 6.1 is released, is there any ETA for an errata RHEL kernel that would contain these patches?

Thanks!
-lars

Thanks for testing them, Lars. Yes, I plan to propose these for 6.2.

Jeff,

Is there anything here that has not already been sent upstream?

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

(In reply to comment #25)
> Jeff,
>
> Is there anything here that has not already been sent upstream?

Yes. Looks like this patch that you attached to the earlier bug is not yet in mainline:

SUNRPC: Deal with the lack of a SYN_SENT sk->sk_state_change callback...

https://bugzilla.redhat.com/attachment.cgi?id=486341

...were you planning to merge that upstream? FWIW, the patchset that Lars tested is attached to this bug in comment #19.

(In reply to comment #27)
> Yes. Looks like this patch that you attached to the earlier bug is not yet in
> mainline:
>
> SUNRPC: Deal with the lack of a SYN_SENT sk->sk_state_change callback...
>
> https://bugzilla.redhat.com/attachment.cgi?id=486341
>
> ...were you planning to merge that upstream?

I'll queue it up and add a Cc: stable...

steved ended up rolling this patch into a larger set. So, closing this bug as a duplicate of that one...

*** This bug has been marked as a duplicate of bug 701788 ***