Bug 156148 (IT_71382)
| Summary: | automount/mount hang in __rpc_execute | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 3 | Reporter: | Wendy Cheng <nobody+wcheng> |
| Component: | kernel | Assignee: | Steve Dickson <steved> |
| Status: | CLOSED NOTABUG | QA Contact: | Brian Brock <bbrock> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.0 | CC: | davem, dduffey, kanderso, martin.bowers, mtonn, nhorman, petrides, staubach, tao |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | i386 | OS: | Linux |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2005-07-08 12:42:52 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Attachments:
- 113741: network sniffing trace
- 113749: memory barrier patch
- 113767: tcpdump from the customer
- 113768: log messages
- 113873: last tcpdump from the customer
- 113879: tcpdumps-3.tar.gz
- 113880: tcpdump (duplicate; see comments)
- 113881: solaris binary trace
- 113932: EMC packet dump, filtered to show the issue
Description
Wendy Cheng
2005-04-27 20:57:40 UTC
Created attachment 113741 [details]
network sniffing trace
Please ignore the "nfs bindresvport: Address already in use" error messages in
one of the zipped files. We have identified that they came from a workaround
used during debugging sessions in which autofs was removed. They are *not* a
major concern at this moment. The real issue is that mount points hang randomly
from time to time.
Also ignore the MVFS in the first vmcore. The problem shows up with and without ClearCase.

From User-Agent: XML-RPC

Latest update from customer: Our mount test succeeded 3 times in a row with no failures, so I am now setting up 9 systems in production to see if user jobs bring out the failure. Instead of failing during testing, mount would print:

mount: RPC: Timed out
mount: backgrounding "<mount_point>"

and then it would succeed in mounting the dir in the background. If that is where it would have otherwise gone into uninterruptible sleep, then this is encouraging. I will get the 9 systems online ASAP to see how it goes there.

Not to get ahead of myself, but can you begin preparing a U5 kernel for us so we have a solution until U5 officially comes out (or whenever you get this patch into a release):

- This patch is installed
- The patch from bug 124600 is installed
- It's not a debug kernel

We only need the x86 and x86_64 versions of U5 like this; the ppc version doesn't apply to us. If we do wind up at the stance that this fixes the issue, I don't want that to be the first time we ask for this; that way our downtime is decreased. I will update again later tonight with my progress.

=====

Assuming this test kernel continues to work, we'll need a Hotfix before U5. This is the formal request.

Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client
This event sent from IssueTracker by marty issue 71382

Created attachment 113749 [details]
memory barrier patch
The test patch the customer is currently running ...
Customer update by gordanz: Well, never mind on the kernel request; I spoke too soon. I uploaded core-4-534000.tar.gz, which is a dump where a csh went into uninterruptible sleep on your kernel, untainted and without autofs. The user typed "cd /proj/.ppc" without hitting Enter. He then typed Ctrl-D, which in csh will show you all the different ways you could finish the line, like an auto-complete. That's where it hung. I also had an automount go into uninterruptible sleep on one of the 9 production systems I deployed running this kernel. That core is currently uploading, called core-5-534000.tar.gz. This is an example of our original failure. This kernel is also tainted by ClearCase.

From User-Agent: XML-RPC

From core-2, also a twin of the first core: automount pid 1798 forked pid 20442, which got stuck in __rpc_execute; pid 20442 subsequently blocked a "tcsh" shell. And these two are the only two active threads in the whole system!

PID: 20442  TASK: d5d4c000  CPU: 1  COMMAND: "automount"
 #0 [d5d4dd2c] context_switch at c0126104
 #1 [d5d4dd44] schedule at c012413f
 #2 [d5d4dd88] __rpc_execute at f91073d2
 #3 [d5d4ddc8] rpc_call_sync at f9102598
 #4 [d5d4de84] nfs3_rpc_wrapper at f913814f
 #5 [d5d4deac] nfs3_proc_getattr at f913828e
 #6 [d5d4ded4] __nfs_revalidate_inode at f91305e0
 #7 [d5d4df6c] nfs_revalidate at f91302ed
 #8 [d5d4df84] sys_lstat64 at c016ebf4
 #9 [d5d4dfc0] system_call at c03f4068

PID: 2964  TASK: f71be000  CPU: 1  COMMAND: "tcsh"
 #0 [f71bfbbc] context_switch at c0126104
 #1 [f71bfbd4] schedule at c012413f
 #2 [f71bfc18] __rpc_execute at f91073d2
 #3 [f71bfc58] rpc_call_sync at f9102598
 #4 [f71bfd14] nfs3_rpc_wrapper at f913814f
 #5 [f71bfd3c] nfs3_proc_access at f91385fb
 #6 [f71bfdf8] nfs_permission at f912ce4d
 #7 [f71bfe20] permission at c0172c4c
 #8 [f71bfe30] link_path_walk at c0173111
 #9 [f71bfe70] __vfs_follow_link at c0177022
#10 [f71bfe90] nfs_follow_link at f9135211
#11 [f71bfea4] link_path_walk at c017335e
#12 [f71bfee4] __vfs_follow_link at c0177022
#13 [f71bff04] nfs_follow_link at f9135211
#14 [f71bff18] link_path_walk at c017335e
#15 [f71bff58] path_lookup at c0173aa4
#16 [f71bff68] __user_walk at c0173df4
#17 [f71bff84] sys_chdir at c0162d20
#18 [f71bffc0] system_call at c03f4068

This event sent from IssueTracker by wcheng issue 71382

Note that when the above vmcore was forced (via sysrq-c), all other threads (including the automount daemons) were working. So we definitely missed the wake-up. I am not sure how to dig the rpc_task structure out (that's two stack frames down). Looks to me like it is something in this loop:

	while (RPC_IS_SLEEPING(task)) {
		/* sync task: sleep here */
		dprintk("RPC: %4d sync task going to sleep\n", task->tk_pid);
		if (current->pid == rpciod_pid)
			printk(KERN_ERR "RPC: rpciod waiting on sync task!\n");
		__wait_event(task->tk_wait, !RPC_IS_SLEEPING(task));
		dprintk("RPC: %4d sync task resuming\n", task->tk_pid);

		/*
		 * When a sync task receives a signal, it exits with
		 * -ERESTARTSYS. In order to catch any callbacks that
		 * clean up after sleeping on some queue, we don't
		 * break the loop here, but go around once more.
		 */
		if (task->tk_client->cl_intr && signalled()) {
			dprintk("RPC: %4d got signal\n", task->tk_pid);
			task->tk_flags |= RPC_TASK_KILLED;
			rpc_exit(task, -ERESTARTSYS);
			rpc_wake_up_task(task);
		}
	}

So this problem seems to be triggered by three events:
* The NFS server is slow to respond.
* automount is in the middle of mounting (nfs_revalidate).
* A user accesses that particular mount point, but the process is signalled.
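If this loop is indeed the culprit, the failure mode would be a classic lost wakeup: the waker clears the task's SLEEPING flag and wakes tk_wait, but without memory barriers the flag update and the wait-queue inspection can be observed out of order by the sleeping CPU, so the wakeup slips through. A minimal sketch of the conventional fix on the waker side; the identifiers follow the upstream RPC code of that era and this is illustrative only, not the actual memory barrier patch from attachment 113749:

	/* Hypothetical sketch, not the attached patch. */
	static void rpc_make_runnable_sketch(struct rpc_task *task)
	{
		/* Mark the task runnable so RPC_IS_SLEEPING() turns false. */
		clear_bit(RPC_TASK_SLEEPING, &task->tk_runstate);
		/*
		 * Order the flag update before wake_up() inspects the wait
		 * queue. Without this barrier, the sleeper's condition check
		 * can still see the stale SLEEPING value while the waker sees
		 * an empty queue, and the task then sleeps forever.
		 */
		smp_mb__after_clear_bit();
		wake_up(&task->tk_wait);
	}

The sleeper side in the loop above is already careful, since __wait_event() re-tests !RPC_IS_SLEEPING(task) after queueing itself; the barrier closes the window from the waker's side.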
Customer update
---------------
Here is an strace of a shell (similar to the one that did the Ctrl-D) that exposed the D state, with commentary for your entertainment:

## Prior stuff removed to save space. We are somewhere around the shell prompt.
brk(0x8529800) = 0x8529800
close(0) = 0
munmap(0xb73d7000, 4096) = 0
alarm(0) = 0
ioctl(16, FIONREAD, [0]) = 0
ioctl(16, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(16, SNDCTL_TMR_STOP or TCSETSW, {B38400 opost isig -icanon -echo ...}) = 0
## Shell prompt delivered to user below:
write(17, "cde-tx32-lc093:1% ", 18) = 18
## Unsuspecting user starts typing: ls /proj/.ppc
read(16, "l", 1) = 1
write(17, "l", 1) = 1
read(16, "s", 1) = 1
write(17, "s", 1) = 1
read(16, " ", 1) = 1
write(17, " ", 1) = 1
read(16, "/", 1) = 1
write(17, "/", 1) = 1
read(16, "p", 1) = 1
write(17, "p", 1) = 1
read(16, "r", 1) = 1
write(17, "r", 1) = 1
read(16, "o", 1) = 1
write(17, "o", 1) = 1
read(16, "j", 1) = 1
write(17, "j", 1) = 1
read(16, "/", 1) = 1
write(17, "/", 1) = 1
read(16, ".", 1) = 1
write(17, ".", 1) = 1
read(16, "p", 1) = 1
write(17, "p", 1) = 1
read(16, "p", 1) = 1
write(17, "p", 1) = 1
read(16, "c", 1) = 1
write(17, "c", 1) = 1
## User gets tired and hits Ctrl-D, not knowing the tragic events that are about to unfold:
read(16, "\4", 1) = 1
write(17, "\r\n", 2) = 2
## Ctrl-D is captured and the beginning of the end starts:
rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], [INT], 8) = 0
open("/proj/", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 0
fstat64(0, {st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
fcntl64(0, F_SETFD, FD_CLOEXEC) = 0
lseek(0, 0, SEEK_SET) = 0
getdents64(0, /* 122 entries */, 4096) = 4096
lstat64("/proj/.ppc_apollo8x_ver3", {st_mode=S_IFDIR|0755, st_size=1024, ...}) = 0
rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], [INT], 8) = 0
rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], [INT], 8) = 0
## lstat64 domino effect starts:
lstat64("/proj/.ppc_25", {st_mode=S_IFDIR|S_ISGID|0770, st_size=1024, ...}) = 0
lstat64("/proj/.ppc_18", {st_mode=S_IFDIR|0750, st_size=1024, ...}) = 0
lstat64("/proj/.ppc_53", {st_mode=S_IFDIR|0755, st_size=1024, ...}) = 0
lstat64("/proj/.ppc_44", {st_mode=S_IFDIR|0755, st_size=1024, ...}) = 0
lstat64("/proj/.ppc_35", {st_mode=S_IFDIR|S_ISUID|S_ISGID|0770, st_size=1024, ...}) = 0
lstat64("/proj/.ppc_05", {st_mode=S_IFDIR|S_ISUID|S_ISGID|0750, st_size=1024, ...}) = 0
lstat64("/proj/.ppc_04", {st_mode=S_IFDIR|0775, st_size=1024, ...}) = 0
lstat64("/proj/.ppc_67", {st_mode=S_IFDIR|0755, st_size=1024, ...}) = 0
lstat64("/proj/.ppc_14", {st_mode=S_IFDIR|0770, st_size=2048, ...}) = 0
lstat64("/proj/.ppc_13", {st_mode=S_IFDIR|0770, st_size=1024, ...}) = 0
lstat64("/proj/.ppc_09", {st_mode=S_IFDIR|S_ISGID|0770, st_size=1024, ...}) = 0
lstat64("/proj/.ppc_draco8_integ_06", {st_mode=S_IFDIR|S_ISGID|0775, st_size=1024, ...}) = 0
rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], [INT], 8) = 0
lstat64("/proj/.ppc_scr1", {st_mode=S_IFDIR|S_ISGID|0775, st_size=512, ...}) = 0
lstat64("/proj/.ppc_02", {st_mode=S_IFDIR|S_ISGID|0770, st_size=1024, ...}) = 0
lstat64("/proj/.ppc_altair_unitver",
## This is where it hangs

Created attachment 113767 [details]
tcpdump from the customer
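For reference, the access pattern that csh's completion performs in the strace above (open the automounted directory, read its entries, then lstat64() each one) is easy to approximate. A minimal sketch in C; the /proj path comes from the trace, everything else is made up for illustration:

	/* Sketch: scan an automount map directory and lstat() every entry,
	 * which makes the automounter mount each indirect map entry in
	 * turn -- the same pattern the hung shell produced. */
	#include <stdio.h>
	#include <dirent.h>
	#include <sys/types.h>
	#include <sys/stat.h>

	int main(void)
	{
		const char *base = "/proj";   /* automount map dir from the trace */
		char path[4096];
		struct stat st;
		struct dirent *e;
		DIR *d = opendir(base);

		if (!d) {
			perror("opendir");
			return 1;
		}
		while ((e = readdir(d)) != NULL) {
			snprintf(path, sizeof(path), "%s/%s", base, e->d_name);
			/* Each lstat() can trigger a mount; the hang shows up
			 * here as the process entering uninterruptible (D) sleep. */
			if (lstat(path, &st) < 0)
				perror(path);
		}
		closedir(d);
		return 0;
	}

Run against a directory of automounted entries, this should reproduce the same burst of lstat64() calls seen in the strace.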
Customer update:
----------------
I was able to recreate the problem and had logging turned up as requested. The messages file is attached. Unfortunately the sysrq-c seems to have failed. It created a log, but the vmcore is incomplete (it's zero length; it doesn't look like it started). I attached the log. Please let me know if I need to try again or if what you have is sufficient.
Thanks, Marty
--------
I'm not sure if it has any relevance, but I want to be sure the backend engineers saw my observation at the beginning of this ticket, as it might provide a lead. When the processes go uninterruptible according to ps, vmstat doesn't show any blocked processes as it should (and has in the past when I have dealt with uninterruptible processes). Also, /proc/stat doesn't show any blocked processes either. Like I said, it may not be relevant, but that behavior is not the way Linux has been for me in the past. Just wanted to throw that out there.
-Marty

Created attachment 113768 [details]
log messages
Looking at these (from the log in Comment #16):

RPC: tcp_data_ready...
RPC: tcp_state_change client f703b000...
RPC: state 8 conn 1 dead 0 zapped 0
RPC: disconnected transport f703b000
RPC: tcp_data_ready...

it appears the TCP connection is dying (state 8 is TCP_CLOSE_WAIT, i.e. the peer has shut its side of the connection down). An ethereal trace of what (if anything) is going over the wire at this point would be good.

Looking at the vmcore that was given to Wendy, it appears that automount and a process named "sbatchd" are both hung reconnecting their sockets to the NFS server. The state of both sockets is SS_CONNECTING, which means "in the process of connecting". That could mean one of a few things:

Even though the sockets are in connecting mode, the connection request (i.e. the TCP SYN) may not actually be sent to the server. With an 'ethereal -w' trace on the client, we should be able to verify this one way or the other.

Assuming that the client is sending out the connect requests (which is probably a pretty safe assumption), the next step is to figure out why we are having such a hard time connecting to the server. This is where a tcpdump from the server's point of view would be handy. From that, we could tell whether or not the client's connect requests are making it to the server.

If the requests are not getting to the server, then the network (i.e. a network appliance on the network) is dropping packets. We will need to know the network topology to pursue this further.

If the connection requests are making it to the server, and the server is either dropping them (i.e. not sending any response, which is probably the case) or resetting them (i.e. sending a TCP RST), then we should work with the EMC people to see what the NFS client can do to resolve the problem. Either way, we should get in touch with the EMC people so we can work this problem from both ends... imho...

Here are the steps I took to look at the state of the RPC socket that is being used by the hung automounter process:

1) Start crash. If everything works, you should be sitting at a crash prompt like "crash> " (after a bunch of system info is displayed).

2) Find the process that is hung by doing a 'foreach bt', which gives a backtrace of all the running processes.
Look for a stack that looks like:

crash> foreach bt
PID: 20442  TASK: d5d4c000  CPU: 1  COMMAND: "automount"
 #0 [d5d4dd2c] context_switch at c0126104
 #1 [d5d4dd44] schedule at c012413f
 #2 [d5d4dd88] __rpc_execute at f91073d2
 #3 [d5d4ddc8] rpc_call_sync at f9102598
 #4 [d5d4de84] nfs3_rpc_wrapper at f913814f
 #5 [d5d4deac] nfs3_proc_getattr at f913828e
 #6 [d5d4ded4] __nfs_revalidate_inode at f91305e0
 #7 [d5d4df6c] nfs_revalidate at f91302ed
 #8 [d5d4df84] sys_lstat64 at c016ebf4
 #9 [d5d4dfc0] system_call at c03f4068
    EAX: 000000c4  EBX: bfff7c40  ECX: bfff6a9c  EDX: 00ac8898  DS: 002b
    ESI: bfff6a9c  ES: 002b  EDI: 00000003  SS: 002b  ESP: bfff6a90
    EBP: bfff6b08  CS: 0023  EIP: 00a62313  ERR: 000000c4  EFLAGS: 00000202

3) Now dump the stack of the hung process with the "bt -f pid" command. In my case it's "bt -f 20442", which shows:

crash> bt -f 20442
PID: 20442  TASK: d5d4c000  CPU: 1  COMMAND: "automount"
 #0 [d5d4dd2c] context_switch at c0126104
    [RA: c0124144  SP: d5d4dd2c  FP: d5d4dd44  SIZE: 28]
    d5d4dd2c: d5d4dd40 f9103922 c043d180 00000001
    d5d4dd3c: dbec0000 d5d4dd84 c0124144
 #1 [d5d4dd44] schedule at c012413f
    [RA: f91073d7  SP: d5d4dd48  FP: d5d4dd88  SIZE: 68]
    d5d4dd48: c043dc80 d5d4c000 dbec0000 f9104d90
    d5d4dd58: f9110262 c043e120 d5d4c000 c043dc80
    d5d4dd68: c043e11c c043e58c 00000000 00000001
    d5d4dd78: d5d4c000 d5d4dde4 d5d4de58 d5d4dd98
    d5d4dd88: f91073d7
 #2 [d5d4dd88] __rpc_execute at f91073d2
    [RA: f910259d  SP: d5d4dd8c  FP: d5d4ddc8  SIZE: 64]
    d5d4dd8c: d5d4dde4 c03a9100 d5d4dde4 00000000
    d5d4dd9c: d5d4c000 d5d4de5c d5d4de5c 00000000
    d5d4ddac: 00000000 00000000 d5d4dde4 00000000
    d5d4ddbc: f661a580 d5d4dde4 d5d4dddc f910259d
 #3 [d5d4ddc8] rpc_call_sync at f9102598
    [RA: f9138154  SP: d5d4ddcc  FP: d5d4de84  SIZE: 188]
    d5d4ddcc: d5d4dde4 d5d4debc 00000000 00000000
    d5d4dddc: fffffeff ffffffff f0b49c74 f4d8905c
    d5d4ddec: 00f00baa f66e088c f0b49c80 f661a580
    d5d4ddfc: f4d89128 00000000 f4d8905c 00000001
    d5d4de0c: eb73b2bc d5d4def0 ec1d5e80 f7e86800
    d5d4de1c: 00010202 f9104d90 00000000 f9102e60
    d5d4de2c: 00000000 00000000 00000000 eca021a0
    d5d4de3c: c051e560 021d9fbb d5d4dde4 f91066e0
    d5d4de4c: 00000000 00000000 c0386280 00000001
    d5d4de5c: d5d4dda0 d5d4dda0 0000ea60 00010020
    d5d4de6c: 00000001 000006b2 eb73b180 d5d4c000
    d5d4de7c: f661a580 00000000 f9138154
 #4 [d5d4de84] nfs3_rpc_wrapper at f913814f
    [RA: f9138293  SP: d5d4de88  FP: d5d4deac  SIZE: 40]
    d5d4de88: *f661a580* d5d4debc 00000000 00000000
    d5d4de98: 00000000 eb73b180 d5d4def0 d5d4def0
    d5d4dea8: bfff7c40 f9138293
[.......]
crash>

From the source, we know that the first argument to the rpc_call_sync() call is the RPC client handle structure, which is also the last thing pushed on the stack. So, looking at the routine that called rpc_call_sync(), in this case nfs3_rpc_wrapper(), we see that f661a580 is the last address pushed on the stack before the call to rpc_call_sync(). So f661a580 should be the RPC client handle structure.

4) Now dump out the RPC client handle structure (i.e. struct rpc_clnt) by typing "rpc_clnt addr". In my case it's "rpc_clnt f661a580", which shows:

crash> rpc_clnt f661a580
struct rpc_clnt {
  cl_users = {
    counter = 3
  },
  cl_xprt = 0xf4d89000,
  cl_procinfo = 0xf9143520,
  cl_maxproc = 22,
  cl_server = 0xe2d9f480 "tx32-fs01",
  cl_protname = 0xf913f851 "nfs",
  [.....]
crash>

Note the address of the 'cl_xprt' field.

5) Next dump out the RPC transport structure (i.e. struct rpc_xprt) by typing "rpc_xprt addr".
In my case it's "rpc_xprt 0xf4d89000", which shows:

crash> rpc_xprt 0xf4d89000
struct rpc_xprt {
  sock = 0xeda059b4,
  inet = 0xf2bd9680,
  timeout = {
    to_current = 700,
    to_initval = 700,
    to_maxval = 60000,
    to_increment = 3478626096,
    to_resrvval = 3222405346,
    to_retries = 3,
    to_exponential = 1 '\001'
  },
  [...]
crash>

Note the address of the 'sock' field.

6) Finally, dump out the socket structure (i.e. struct socket) by typing "socket addr". In my case it's "socket 0xeda059b4", which shows:

crash> socket 0xeda059b4
struct socket {
  state = SS_CONNECTING,
  flags = 0,
  ops = 0xc03ed920,
  inode = 0xeda05880,
  fasync_list = 0x0,
crash>

Note the "state" field: it is SS_CONNECTING, which means the socket is trying to connect to the server.

I hope this helps...

From User-Agent: XML-RPC

Summary of call:
1. Steve wanted to make sure that the Linux client was attempting to make the connection. If the client is sending the request out, where is the connection being dropped (network/server)? Trace an already-hung process: an 'ethereal -w' on the client and rpc -t to see what goes out.
2. Freescale will be sending a new vmcore without checkpoint turned on.
3. EMC will look into whether there is a way, server side, to tcpdump and filter for an already-hung client. EMC will also send a TCP dump.
4. Requested that EMC look at the vmcore file.

Client notes: the Linux clients have on average about 50 connections; some have up to 400 connections.

Joe Kachuck
This event sent from IssueTracker by jkachuck issue 71382

Two data points:
* When a NAS box is performing any form of backup (checkpoint, snapshot, etc.), the system capacity drops. Some implementations can even temporarily suspend access.
* The Solaris NFS client will automatically switch its protocol to NFS-UDP when it can't get through on an NFS-TCP connection, except for NFSv4 mounts or when the protocol is explicitly specified on the command line.

s/Solaris NFS client/Solaris NFS mount request/ in comment #32.

Created attachment 113873 [details]
This is the last tcpdump attached; the customer wants us to look at it.
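As a point of comparison with the Solaris UDP fallback noted above: on the Linux side, the transport can be pinned explicitly at mount time instead. A hypothetical example for this environment (the server name tx32-fs01 comes from the vmcore; the export and mount point paths are made up):

	mount -t nfs -o udp,nfsvers=3 tx32-fs01:/export/proj /mnt/test

Whether forcing UDP would be an acceptable workaround here is a separate question; it merely sidesteps the TCP reconnect path in which the hung sockets are stuck.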
The file tcpdumps-3.tar.gz is now uploaded. It has a dump from 3 different points: the Linux client (client.3), the EMC server (server.3), and the Cisco SPAN port (line.3). The client had a single uninterruptible automount process, the same as in the last dump. There is also a capture of Solaris traffic from a single host to the same EMC server we have been working with, called server_sun.1. Please let us know ASAP if you have any questions or need us to do anything else. Thx!

Created attachment 113879 [details]
tcpdumps-3.tar.gz
Created attachment 113880 [details]
tcpdump.
Ignore my attachment, as it is identical to what Michael posted. -narsi.

One thing to note: server_sun.1 was not taken at the same time, hence the .1 extension.

CRM update time from customer: 29-APR-2005 18:51:29

Created attachment 113881 [details]
solaris binary trace
Comparing the Solaris trace with either the server.3 or client.3 trace, the one interesting thing is the different types of file handles that the server seems to hand out. This could be just an ethereal anomaly, but the majority of the time ethereal says the file handle type is a "NetApp file handle", while now and then it reports the type as "unknown". See packets 35 and 38 in the Solaris trace and packets 3 and 24 in the client.3 trace. Now the notable thing is that whenever the Linux client uses one of these unknown file handle types, the EMC server drops the NFS requests on the floor, but when Solaris uses an unknown file handle type, the server answers with the appropriate response.

The previously discussed workaround (shortening the RPC-TCP timeout to force a retransmit) won't work for this problem, since that (theoretically impossible) RPC-TCP retransmit (600 seconds) has already been kicked off and the server keeps dropping that particular GETATTR packet.

This issue boils down to why the client's GETATTR request never gets a reply from the server when this hang occurs.

1. The GETATTR requests were retransmitted twice, and their arrival at the server side was captured in server.3 (from tcpdumps-3.tar.gz) as frames 3, 31, and 49. The server never replied.

2. The Linux client has no business interpreting the file handle. However, peeking around the successful GETATTR calls and the failed one, the difference lies in the lower 16 bytes of the file handle: the first 16 bytes and the lower 16 bytes were identical in the bad call but different in the good calls. So, on the surface, the FH was bad. Why didn't the Linux client get a stale-filehandle error reply? Was it dropped in the server's RPC layer?

On the three questionable requests, EMC acknowledged soft SCSI errors and issues with the CLARiiON backend storage. They confirmed two bad disks and requested immediate replacement. They will be rebooting the datamover and/or doing a failover so that memory may be more contiguous.

Why would the Solaris client work with the same EMC server? What is the effect of the big fluctuation in TCP window sizes on the EMC server? -narsi.

Created attachment 113932 [details]
EMC packet dump, filtered to show the issue
This is a section of a 3 GB packet dump taken on the EMC that has been filtered
to demonstrate the problem.
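For anyone producing this kind of filtered capture, tcpdump can cut a full dump down to just the NFS traffic between the server and one client. A sketch; the file names and the client address here are hypothetical:

	tcpdump -r full-capture.pcap -w filtered.pcap 'host 10.1.2.3 and port 2049'

Port 2049 is the standard NFS port; reading with -r and writing with -w keeps the selected packets in pcap form so ethereal can open the result.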