Bug 204309
| Summary: | kernel retries portmap query indefinitely when statd is down | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Alexey Roytman <alexey> |
| Component: | kernel | Assignee: | Jeff Layton <jlayton> |
| Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.4 | CC: | hansjoerg.maurer, jbaron, masanari_iida, staubach, steved |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | RHSA-2008-0665 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2008-07-24 19:11:47 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 430698, 2043102 | | |
| Attachments: | | | |
Description
Alexey Roytman
2006-08-28 14:15:17 UTC
Hi, we experienced a similar problem after an upgrade to RHEL4U4. After the update, a nastran run on the RHEL4U4 client fails to run the nastran analysis program. The reason is an I/O error: nastran complains that the data it reads are not reasonable. We ran strace on nastran in both cases and noticed that, after the upgrade, a read operation reads only zero bytes, whereas before the upgrade it reads 32768 bytes at the same point of the nastran run.

The interesting part after the upgrade is here:

```
_llseek(12, 131072, [131072], SEEK_SET) = 0
read(12, "", 32768) = 0
```

Before the upgrade it looks like:

```
_llseek(12, 131072, [131072], SEEK_SET) = 0
read(12, "\1\0\0\0\30\0\0\0\0\0\0\0\22\0\0\21\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
```

(Longer parts are attached.) As you can see, the read in the file returns zero bytes. If we mount the filesystem with -o sync or -o noac, the application starts working again. Can you try this client-side mount option to test whether it solves your problem? You can put it in an automount map...

```
cat /proc/mounts
rmcs33:/export /net/rmcs33/export nfs rw,sync,nosuid,nodev,v3,rsize=8192,wsize=8192,hard,intr,tcp,lock,addr=rmcs33 0 0
```

But this option slows down all NFS operations. There must have been a significant change in the NFS client caching behaviour that makes a write with a subsequent read fail in some cases. My understanding of POSIX (according to the German read manpage) is that a read operation after a write operation inside one program has to return the new (written) data.

Greetings,
Hansjörg

strace after upgrade:

```
5602 write(13, "\0\0\0\0\0\0\0\0\0\0\0\0\n\0\0\0\0 \0\0s\0\0\0\n\0\0\0"..., 32768) = 32768
5602 _llseek(14, 0, [0], SEEK_SET) = 0
5602 write(14, "\0\0\0\0\0\0\0\0\0\0\0\0\n\0\0\0\0 \0\0s\0\0\0\n\0\0\0"..., 32768) = 32768
5602 _llseek(15, 0, [0], SEEK_SET) = 0
5602 write(15, "\0\0\0\0\0\0\0\0\0\0\0\0\n\0\0\0\0 \0\0s\0\0\0\n\0\0\0"..., 32768) = 32768
5602 _llseek(16, 0, [0], SEEK_SET) = 0
5602 write(16, "\0\0\0\0\0\0\0\0\0\0\0\0\n\0\0\0\0 \0\0s\0\0\0\n\0\0\0"..., 32768) = 32768
5602 _llseek(12, 98304, [98304], SEEK_SET) = 0
5602 write(12, "\1\0\0\0\35\0\0\0\0\0\0\0\2\0\0\21PROJVERS\4\0\0001\f\0"..., 32768) = 32768
5602 write(12, "\1\0\0\0\30\0\0\0\0\0\0\0\22\0\0\21\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
5602 _llseek(12, 131072, [131072], SEEK_SET) = 0
5602 read(12, "", 32768) = 0

29552 write(13, "\0\0\0\0\0\0\0\0\0\0\0\0\n\0\0\0\0 \0\0s\0\0\0\n\0\0\0"..., 32768) = 32768
29552 _llseek(14, 0, [0], SEEK_SET) = 0
29552 write(14, "\0\0\0\0\0\0\0\0\0\0\0\0\n\0\0\0\0 \0\0s\0\0\0\n\0\0\0"..., 32768) = 32768
29552 _llseek(15, 0, [0], SEEK_SET) = 0
29552 write(15, "\0\0\0\0\0\0\0\0\0\0\0\0\n\0\0\0\0 \0\0s\0\0\0\n\0\0\0"..., 32768) = 32768
29552 _llseek(16, 0, [0], SEEK_SET) = 0
29552 write(16, "\0\0\0\0\0\0\0\0\0\0\0\0\n\0\0\0\0 \0\0s\0\0\0\n\0\0\0"..., 32768) = 32768
29552 _llseek(12, 98304, [98304], SEEK_SET) = 0
29552 write(12, "\1\0\0\0\35\0\0\0\0\0\0\0\2\0\0\21PROJVERS\4\0\0001\f\0"..., 32768) = 32768
29552 write(12, "\1\0\0\0\30\0\0\0\0\0\0\0\22\0\0\21\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
29552 _llseek(12, 131072, [131072], SEEK_SET) = 0
29552 read(12, "\1\0\0\0\30\0\0\0\0\0\0\0\22\0\0\21\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
29552 _llseek(12, 131072, [131072], SEEK_SET) = 0
29552 write(12, "\1\0\0\0\370\37\0\0\0\0\0\0\22\0\0\21\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
29552 _llseek(12, 131072, [131072], SEEK_SET) = 0
29552 read(12, "\1\0\0\0\370\37\0\0\0\0\0\0\22\0\0\21\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
29552 write(12, "\1\0\0\0\256\0\0\0\0\0\0\0\2\0\0\21DBSPACE \4\0\0001\244"..., 32768) = 32768
```
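[Editor's note: the failing pattern described above reduces to a write followed by a read of the same region within one process. A minimal sketch of that pattern, assuming a test file on the affected NFS mount; the path, offset, and sizes are illustrative, not the actual nastran workload:]

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char out[32768], in[32768];
    ssize_t n;

    /* assumed path on the NFS mount from the /proc/mounts line above */
    int fd = open("/net/rmcs33/export/testfile", O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    memset(out, 0xab, sizeof(out));

    /* write one 32 KiB block at the offset seen in the strace */
    if (pwrite(fd, out, sizeof(out), 131072) != (ssize_t)sizeof(out)) {
        perror("pwrite");
        return 1;
    }

    /* POSIX requires this read to see the just-written data */
    n = pread(fd, in, sizeof(in), 131072);
    printf("read back %zd bytes (expected %zu)\n", n, sizeof(out));

    close(fd);
    return 0;
}
```

Per the report, on the affected kernel the read returns 0 bytes as if the file were shorter, while with -o sync or -o noac it returns the full 32768 bytes.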
"\1\0\0\0\256\0\0\0\0\0\0\0\2\0\0\21DBSPACE \4\0\0001\244"..., 32768) = 32768 Would it be possible to get at bzip2-ed binar tethereal network trace of this? Something similar to: ethereal -w /tmp/data.pcap host <server> ; bzip2 /tmp/data.pcap Created attachment 144999 [details]
Output of "tethereal -i any -w <the-file>" during login to KDE on problematic kernel
The current configuration was updated:

```
$ rpm -qa | egrep 'glibc|kdebase|kdelibs|kernel' | sort
glibc-2.3.4-2.25
glibc-common-2.3.4-2.25
glibc-devel-2.3.4-2.25
glibc-headers-2.3.4-2.25
glibc-kernheaders-2.4-9.1.98.EL
kdebase-3.3.1-5.13
kdelibs-3.3.1-6.RHEL4
kernel-smp-2.6.9-34.0.1.EL
kernel-smp-2.6.9-42.0.2.EL
kernel-smp-2.6.9-42.0.3.EL
kernel-utils-2.4-13.1.83
```

Still, I've tested the issue on kernel-smp-2.6.9-42.0.2.EL, and I attach the output of `tethereal -i any -w "tethereal.-i_any.dump"` taken during the login to the KDE environment. For security, I've replaced internal info (like encrypted passwords and additional user and private names) with 'X'-es. The server's IP is 10.1.1.137; the local machine's IP is 10.1.1.124. Is this enough, or did you want something different?

Hi, we had a problem which seems to be related to this. Due to our business contract with Red Hat, they provided a test kernel, which solved the problem. The statement of tech support concerning the reason for the problem was: "This is an issue related with caching. This has been fixed in the test kernel which I have provided. I am reporting the same to Engineering Team. This patch should be released with RHEL 4 update 5. I would let you know when this would be released."

Changelog:

```
- fix NFS cache consistency when a file is mmap'ed (Jeff Layton) [186142]
```

Unfortunately, Bug 186142 is a private bug, not visible in Bugzilla. The test kernel was kernel-smp-2.6.9-42.22.EL.i686.rpm. Maybe Red Hat can provide a link to download this, or a newer version which includes the fix. Then you could test whether it is the same problem.

Greetings,
Hansjörg

Hansjörg, thanks for the information. The issue is not urgent for me; I use an older kernel. But I will keep in mind that I need either a kernel later than kernel-smp-2.6.9-42.22, or one that fixes Bug 186142. Thank you.

Jason, can you post a pointer to a newer kernel for these guys to test against?

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

This request was previously evaluated by Red Hat Product Management for inclusion in the current Red Hat Enterprise Linux release, but Red Hat was unable to resolve it in time. This request will be reviewed for a future Red Hat Enterprise Linux release.

From the comments above, it sounds like this was fixed in RHEL 4.5. Could you install kernel -55.EL or later and verify whether that is the case?

I've updated all packages and tried kernel-smp-2.6.9-55.0.6.EL, but it still has the problem.

I see a lot of these errors in the capture:

```
1398 147.740090 127.0.0.1 -> 127.0.0.1 Portmap V2 GETPORT Call STAT(100024) V:1 UDP
1399 147.740633 127.0.0.1 -> 127.0.0.1 Portmap V2 GETPORT Reply (Call In 1398) PROGRAM_NOT_AVAILABLE
```

Can you run the following on the client and post the output?

```
# ps -ef | grep rpc.statd
# rpcinfo -p
```

...it seems like a missing statd shouldn't make things hang, but it's the only NFS-related thing I see that's amiss... From an NFS standpoint, the last calls I see concern this filehandle:

```
919 30.276934 10.1.1.124 -> 10.1.1.137 NFS V3 LOOKUP Call, DH:0xde4723bf/.qtrc.lock
920 30.277132 10.1.1.137 -> 10.1.1.124 NFS V3 LOOKUP Reply (Call In 919), FH:0x37843785
...
936 30.973065 10.1.1.124 -> 10.1.1.137 NFS V3 GETATTR Call, FH:0x37843785
937 30.973255 10.1.1.137 -> 10.1.1.124 NFS V3 GETATTR Reply (Call In 936) Regular File mode:0600 uid:1001 gid:10
938 30.973335 10.1.1.124 -> 10.1.1.137 TCP mdbs_daemon > nfs [ACK] Seq=39888 Ack=127164 Win=18876 Len=0 TSV=42005 TSER=480933153
```

...after that, there are no more NFS calls. A minute and a half or so later, the sockets get closed:

```
1279 124.428741 10.1.1.124 -> 10.1.1.137 TCP 799 > nfs [FIN, ACK] Seq=0 Ack=0 Win=1460 Len=0 TSV=135470 TSER=480726606
1280 124.428840 10.1.1.137 -> 10.1.1.124 TCP nfs > 799 [FIN, ACK] Seq=0 Ack=1 Win=1448 Len=0 TSV=481026623 TSER=135470
1281 124.428895 10.1.1.124 -> 10.1.1.137 TCP 799 > nfs [ACK] Seq=1 Ack=1 Win=1460 Len=0 TSV=135470 TSER=481026623
1282 124.467721 10.1.1.124 -> 10.1.1.125 TCP 798 > nfs [FIN, ACK] Seq=0 Ack=0 Win=1460 Len=0 TSV=135509 TSER=34303508
1283 124.467904 10.1.1.125 -> 10.1.1.124 TCP nfs > 798 [ACK] Seq=0 Ack=1 Win=10136 Len=0 TSV=34333504 TSER=135509
1284 124.468150 10.1.1.125 -> 10.1.1.124 TCP nfs > 798 [FIN, ACK] Seq=0 Ack=1 Win=10136 Len=0 TSV=34333504 TSER=135509
1285 124.468199 10.1.1.124 -> 10.1.1.125 TCP 798 > nfs [ACK] Seq=1 Ack=1 Win=1460 Len=0 TSV=135509 TSER=34333504
```

I cannot reboot the machine now, so I've done it on the good kernel, kernel-smp-2.6.9-34.0.1.EL:

1. `ps -ef | grep rpc.statd | grep -v grep` gives empty output.
2. `rpcinfo -p` gives:

```
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100021    1   udp  32768  nlockmgr
    100021    3   udp  32768  nlockmgr
    100021    4   udp  32768  nlockmgr
    100021    1   tcp  32772  nlockmgr
    100021    3   tcp  32772  nlockmgr
    100021    4   tcp  32772  nlockmgr
    100007    2   udp    970  ypbind
    100007    1   udp    970  ypbind
    100007    2   tcp    973  ypbind
    100007    1   tcp    973  ypbind
```

If you strictly need the output from the problematic kernel, reply and I'll post the data later, after a reboot. Also, even on the good kernel, /var/log/messages has plenty of:

```
Sep 16 10:47:56 frog kernel: lockd: cannot monitor 10.1.1.137
Sep 16 10:47:56 frog kernel: lockd: failed to monitor 10.1.1.137
```

(10.1.1.137 is the NFS server.)

What version of nfs-utils are you using?

```
# rpm -q nfs-utils
```

There is no nfs-utils package installed.

```
$ rpm -qa | grep -i nfs
nfs-utils-lib-1.0.6-8
system-config-nfs-1.2.8-1
```

OK. The statd thing is probably a red herring anyway... From the capture, it looks like the server is doing everything it should, so any problem is likely on the client. Unfortunately, a KDE login is a pretty complex process. We'll need to narrow this down to a single process (or processes) that's actually hanging, and understand what it's waiting for... Once you do narrow it down to a single process, collect several snapshots of sysrq-t output from the kernel and get a stack trace of the hung processes. That should help us determine what they're doing (presuming they're in kernel space) and whether they're actually hung... If you need help doing this, I'd suggest opening a support case. Our support people should be able to help you narrow this down. If you do this, be sure to mention that you've opened this BZ as well.

Well, I've traced the KDE login and found that ksplash was using a lock file and hanging. Then I replaced the ksplash binary with a wrapper that runs it under strace, and found the problematic call:

```
open("/users/alexey/.qt/.qtrc.lock", O_RDWR|O_CREAT|O_LARGEFILE, 0600) = 4
fcntl64(4, F_SETLKW64, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0}
```

The latter call hangs.

I've written a small C program that confirms this, and it also hangs on the fcntl() call. On a "good" kernel it shows:

```
./a.out: pid=3600
testing /users/alexey/lockfile...
fcntl (may hang!)...
fcntl (did not hang, no bug) ==> -1 (37:No locks available)
```

On a "bad" kernel it hangs:

```
./a.out: pid=3600
testing /users/alexey/lockfile...
fcntl (may hang!)...
```

I've done Ctrl-ScrollLock and obtained the following information:

```
a.out D 00000000 1628 5383 5222 (NOTLB)
d53b0b40 00000082 00000003 00000000 d72b0280 00000247 00000000 e0b677eb
00000000 d7bb1180 00000246 c140fde0 00000001 0000b041 37cb1193 000001ec
dfe310b0 decfc1d0 decfc33c d72b0300 c0120458 d72b0280 d72b0300 d53b0b6c
Call Trace:
 [<e0b677eb>] rpc_run_child+0xd4/0xe2 [sunrpc]
 [<c0120458>] prepare_to_wait+0x12/0x4c
 [<e0b67005>] __rpc_execute+0x1a0/0x326 [sunrpc]
 [<c012052d>] autoremove_wake_function+0x0/0x2d
 [<c012052d>] autoremove_wake_function+0x0/0x2d
 [<c012052d>] autoremove_wake_function+0x0/0x2d
 [<e0b635de>] rpc_call_sync+0x68/0x8c [sunrpc]
 [<e0b3f507>] nsm_mon_unmon+0xaf/0xd0 [lockd]
 [<e0b3f555>] nsm_monitor+0x2d/0x56 [lockd]
 [<e0b3beaf>] nlmclnt_lock+0x22/0xc7 [lockd]
 [<e0b3b925>] nlmclnt_proc+0x217/0x2bf [lockd]
 [<e0bcc4f8>] do_setlk+0xc3/0x186 [nfs]
 [<e0bcc5bb>] nfs_lock+0x0/0xe2 [nfs]
 [<c016ec06>] fcntl_setlk+0x169/0x2ec
 [<c01fdc61>] tty_ldisc_deref+0x50/0x5f
 [<c01fea00>] tty_write+0x258/0x262
 [<c020322d>] write_chan+0x0/0x1d0
```

Hope this helps.

Created attachment 197201 [details]
fcntl() simulation that hangs on "bad" kernels
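[Editor's note: the attachment itself is not included in this scrape. A minimal reproducer consistent with the output quoted above might look like the following; the default path and message strings are assumptions inferred from that output, not the actual attachment:]

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* default path taken from the output shown in the comment above */
    const char *path = argc > 1 ? argv[1] : "/users/alexey/lockfile";
    struct flock fl;
    int fd, ret;

    printf("%s: pid=%d\ntesting %s...\n", argv[0], (int)getpid(), path);

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* whole-file read lock, matching the fcntl64 call seen in the
     * ksplash strace */
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;

    printf("fcntl (may hang!)...\n");
    ret = fcntl(fd, F_SETLKW, &fl);   /* blocks forever on a "bad" kernel */
    printf("fcntl (did not hang, no bug) ==> %d (%d:%s)\n",
           ret, errno, strerror(errno));

    close(fd);
    return 0;
}
```

On a fixed kernel with statd down, this returns -1 with errno 37 (ENOLCK, "No locks available"), as in the "good" output above.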
Thanks for the reproducer. I'll give it a go and let you know what I find.

I'm seeing a similar hang, but the stack trace looks a bit different. That's probably attributable to the architecture change (x86_64 in my case):

```
lock-nfs-bug D 0000000000000000 0 11168 11069 (NOTLB)
ffffff8003c17918 0000000000000286 ffffff801f5777f0 0000007300000000
ffffff801ecde7f0 0000000000002062 00011bae13161699 ffffffff80322140
ffffff801ecdea88 0000000000000000
Call Trace:
 <ffffffffa00a4718>{:sunrpc:__rpc_atrun+0}
 <ffffffffa00a4718>{:sunrpc:__rpc_atrun+0}
 <ffffffffa00a421b>{:sunrpc:rpc_sleep_on+50}
 <ffffffffa00a4db6>{:sunrpc:__rpc_execute+462}
 <ffffffff8012dadc>{autoremove_wake_function+0}
 <ffffffff8012dadc>{autoremove_wake_function+0}
 <ffffffffa00a079c>{:sunrpc:rpc_call_sync+114}
 <ffffffffa013ee13>{:lockd:nsm_mon_unmon+299}
 <ffffffffa00a0224>{:sunrpc:rpc_create_client+548}
 <ffffffffa013eec3>{:lockd:nsm_monitor+50}
 <ffffffffa013afb6>{:lockd:nlmclnt_proc+900}
 <ffffffffa015307c>{:nfs:nfs_lock+980}
 <ffffffff8018a066>{fcntl_setlk+305}
 <ffffffff8029454a>{thread_return+0}
 <ffffffff802945a2>{thread_return+88}
 <ffffffff801863cb>{sys_fcntl+736}
 <ffffffff8010d636>{system_call+134}
 <ffffffff8010d5b0>{system_call+0}
```

The stack trace seems to indicate that the client is getting hung up while trying to set up statd monitoring. Starting statd seems to work around it (i.e., install the nfs-utils package and do a `service nfslock start`). I don't think statd is considered to be required, but it's highly recommended; without it, locking won't be reliable if the server reboots. Regardless, though, the fcntl call shouldn't hang indefinitely like this.

I might suggest that statd is required for proper locking support. It is a little more than just recommended. Applications use locking for consistency purposes, and invisibly losing a lock, because statd was not there to recover the lock, would be considered a very bad thing, tantamount to data corruption.

Good point. Perhaps then we should just shoot for having a missing statd handled in a more graceful way. Maybe when statd isn't up we should have locking calls return an error of some sort? I'm also checking this against rawhide to see what the behavior is like there...

Quoting myself: "fcntl (did not hang, no bug) ==> -1 (37:No locks available)"

I think the issue is centered around some changes to pmap_getport_done(). In the -34-ish timeframe, when the client wasn't able to get a port, the call would be aborted and an error returned. With the introduction of that patch, the portmap call retries indefinitely. I'm still tracking down the BZ that made that change.

I think the patch that introduced this is the one for bug #172082. The main issue is this part in pmap_getport_done:

```
	} else if (clnt->cl_port == 0) {
		/* Program not registered */
		task->tk_status = -EACCES;
-		task->tk_action = NULL;
```

...taking that tk_action = NULL out seems to be what causes the nsm RPC call to retry talking to the portmapper indefinitely. I think this part of that patch may not actually be needed to fix the originally reported problem in that BZ... When I replaced both of those lines with:

```
	rpc_exit(task, EACCES);
```

...the test program errored out with ENOLCK immediately when statd was down. After shutting down the portmapper, it hung (which seems to be what we want to happen, according to bug #172082).

Created attachment 198961 [details]
proposed patch

This patch seems to fix it, but I'm not clear on whether it might cause the earlier BZ to regress somehow. It doesn't seem to, but it needs a bit more testing before I'm comfortable with it. In any case, I've built some test kernels with this patch and put them on my people page: http://people.redhat.com/jlayton ...please test them and let me know whether they fix the issue for you. Note that, as Peter pointed out, statd is really not an optional component here. So while this may make lock requests error out when statd isn't running, you most certainly don't want to run your hosts without it...

The patch fixes the problem I've reported.

```
$ uname -a
Linux frog 2.6.9-59.EL.jtltest.18smp #1 SMP Tue Sep 18 16:16:25 EDT 2007 i686 i686 i386 GNU/Linux
```

Steve had a question about why/how this works... When setting up the statd client, the code does a call_allocate, which calls call_bind. call_bind does this:

```
	task->tk_action = call_connect;
	if (!clnt->cl_port) {
		task->tk_action = call_bind_status;
		task->tk_timeout = RPC_CONNECT_TIMEOUT;
		rpc_getport(task, clnt);
	}
```

If cl_port is non-zero (indicating a successful portmap query), then the next call is call_connect. If cl_port is 0, then we fall into the if statement here. rpc_getport is done, and then rpciod calls call_bind_status to handle the result. For most error conditions (including -EACCES), call_bind_status does a "goto retry_bind;". That does:

```
retry_bind:
	task->tk_status = 0;
	task->tk_action = call_bind;
	return;
```

...and so we fall back into call_bind again. So in this situation, where the portmapper is responding but statd isn't up, we'll continuously loop with:

```
call_bind
rpc_getport
pmap_getport_done
call_bind_status
```

(...there are some other calls in between, but you get the idea...) until statd comes up. This patch breaks that cycle by having rpc_getport_done set task->tk_action = NULL, which makes rpciod return -EACCES back to the caller.
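[Editor's note: to make the loop above concrete, here is a toy model of the rpciod state machine. This is a hypothetical simplification for illustration only, not the real sunrpc code; the actual loop lives in __rpc_execute():]

```c
/* Toy model of the rpciod state-machine loop described above. Each
 * handler sets tk_action to the next state; a NULL tk_action ends
 * the task, and tk_status becomes the caller's result. */
#include <errno.h>
#include <stdio.h>

struct toy_task {
    void (*tk_action)(struct toy_task *);
    int tk_status;
    int statd_registered;   /* what the portmapper would answer */
};

static void toy_call_bind(struct toy_task *t);

/* Stands in for pmap_getport_done() + call_bind_status(). */
static void toy_bind_status(struct toy_task *t)
{
    if (!t->statd_registered) {
        t->tk_status = -EACCES;
        t->tk_action = NULL;            /* the fix: break the retry cycle */
        /* pre-fix behavior: t->tk_action = toy_call_bind; => loops forever */
    } else {
        t->tk_action = NULL;            /* would proceed to call_connect */
    }
}

static void toy_call_bind(struct toy_task *t)
{
    /* real code: rpc_getport() runs, then call_bind_status handles it */
    t->tk_action = toy_bind_status;
}

int main(void)
{
    struct toy_task task = { toy_call_bind, 0, 0 };

    while (task.tk_action)              /* the __rpc_execute loop */
        task.tk_action(&task);
    printf("task finished, tk_status=%d\n", task.tk_status);
    return 0;
}
```

With the pre-fix code, the "program not registered" handler left the task pointing back into the bind path, so the loop never terminated while statd was unregistered; restoring tk_action = NULL is what lets the execute loop fall out and hand -EACCES back to the caller.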
> This patch breaks that cycle by having rpc_getport_done
> set task->tk_action=NULL which makes rpciod return -EACCES back to the caller.
Ok, and why is this a good thing? I would think as long as the call is
interruptible (when using the -o intr) the bind loop should continue.
What happens if the portmapper is just slow to respond
(because it's single-threaded and it's in the middle of a
mount storm)?
> Ok, and why is this a good thing? I would think as long as the call is
> interruptible (when using the -o intr) the bind loop should continue.

If the situation were one that was likely to resolve itself in time, then I'd agree with you. But I'd submit that statd not being registered with the portmapper is not a situation that's likely to be resolved by just waiting long enough. It most likely means that statd simply wasn't started. Returning an error in that situation seems OK; it's what upstream does, and it should cue the user that something is wrong with locking.

> What happens if the portmapper is just slow to respond
> (because it's single-threaded and it's in the middle of a
> mount storm)?

That situation won't be affected by this. This only matters when the portmapper has responded and said "no such service". If the portmapper isn't responding, the client will continue to loop as it does today.

I've rescinded this patch, as it could cause lock reclamation to fail (à la RHEL5 bug 240976). I'm currently looking at upstream behavior; perhaps we can turn these situations into an RPC timeout event and thereby make the behavior contingent on soft/hard settings.

Created attachment 291005 [details]
patch -- clean up soft task error handling
Backported upstream patch to clean up soft task error handling.
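[Editor's note: for intuition, this approach makes a missing statd registration behave like any other RPC timeout, so soft tasks give up after the major timeout while hard tasks keep retrying. A toy sketch of that decision logic follows; it is a hypothetical simplification, not the actual patch, and the names and retry counts are illustrative:]

```c
/* Toy sketch: treat "statd not registered" as a timeout event so the
 * usual soft/hard retry semantics decide the outcome. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_task {
    bool soft;        /* soft task: give up at the major timeout */
    int retries;      /* minor timeouts seen so far */
    int max_retries;  /* minor timeouts before the major timeout */
};

/* Called after each failed statd query; returns 0 to retry. */
static int toy_timeout(struct toy_task *t)
{
    if (++t->retries < t->max_retries)
        return 0;               /* minor timeout: query again */
    if (t->soft)
        return -ETIMEDOUT;      /* major timeout: soft task errors out */
    t->retries = 0;             /* hard task: start over, keep trying */
    return 0;
}

int main(void)
{
    struct toy_task task = { .soft = true, .retries = 0, .max_retries = 5 };
    int rc = 0, calls = 0;

    while (rc == 0) {           /* ~30 seconds in the real code */
        rc = toy_timeout(&task);
        calls++;
    }
    printf("gave up after %d tries, rc=%d\n", calls, rc);
    return 0;
}
```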
The patch seems to make RHEL4's behavior consistent with how RHEL5 now behaves with the patch for bug 240976. The lock attempt times out after ~30 seconds of trying to query statd. I want to do a bit more testing, but it looks good so far...

Committed in 68.13. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

*** Bug 448479 has been marked as a duplicate of this bug. ***

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html