Bug 140385
Summary: | lockd: cannot (un)monitor x.x.x.x | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 3 | Reporter: | Rex Dieter <rdieter> |
Component: | nfs-utils | Assignee: | Steve Dickson <steved> |
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 3.0 | CC: | buckh, ee-cap-admin-dl, erich, george.liu, herrold, hyclak, martin.donnelly, martinez, riel, support, tao |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHBA-2005-697 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2005-09-28 18:51:34 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 156321 |
Description
Rex Dieter
2004-11-22 18:47:21 UTC
I'll upgrade to kernel-2.4.21-4.EL to see if that helps.

OK, I mistyped, I was using kernel-2.4.21-20.EL originally.

OK, the new(est) kernel, kernel-smp-2.4.21-27.0.2.EL: different log messages, same result (server grinding to a halt). Lots and lots of:

Feb 7 15:32:30 server rpc.statd[1990]: Received erroneous SM_UNMON request from server for client1
...
Feb 7 14:28:10 server rpc.statd[1990]: Received erroneous SM_UNMON request from server for client2

Oh, and client machines are a mixture of rhel3, rh90, fc2, fc3.

Server hung up again, this time seeing a bunch of:

nsm_mon_unmon: rpc failed, status=-110
lockd: cannot monitor x.x.x.x
nsm_mon_unmon: rpc failed, status=-110
lockd: cannot monitor y.y.y.y
nsm_mon_unmon: rpc failed, status=-110
lockd: cannot monitor z.z.z.z

(a "lockd: cannot monitor" message for almost *every* NFS client in use)

I'm adding this here because I don't know if the bug is with EL3 or EL4. Previously I was running an EL3 NFS server and EL3 clients. We have upgraded the clients to EL4 and left the server on EL3 for the time being. We are now getting the unmonitor problem. Do you want me to open this as an EL4 bug or leave it here? Either way, it is causing a very serious problem for us. It was not happening with the EL4 beta 2, so we were very surprised to see it now. Paul

Happened again this morning after I had reverted to kernel-smp-2.4.21-20.EL (from kernel-smp-2.4.21-27.0.2.EL). Oh well, back to kernel-smp-2.4.21-27.0.2.EL.

The "lockd: cannot monitor..." messages appear because the kernel cannot talk to the local rpc.statd process, which could mean that statd has gone down. Please check the status of rpc.statd (on both the server and the client) during this condition. The "lockd: unauthenticated request from..." message is a server-side message saying the client trying to take the lock is not in the export list. Which may or may not be true...

Checking (and then, out of desperation, restarting) rpc.statd on the server was one of the *first* things I tried on each occasion.
It has never helped. The fact that restarting the server makes all clients happy leads me to think that all client-side rpc.statd's were pretty much OK. That is, unless there is something that could simultaneously affect the stability of rpc.statd on 30+ clients. (-:

Happened again today; I had just (re)booted the server this morning, running kernel-smp-2.4.21-27.0.2.ELsmp. In the appended log, you can see that I restarted rpc.statd (with /sbin/service nfslock restart), but it didn't help. For the record, the NFS server failed even to provide file locking with *itself* at 192.168.181.2. /var/log/messages contained (names and IP addresses changed to protect the innocent):

Feb 21 15:12:07 nfsserver rpc.statd[1985]: Received erroneous SM_UNMON request from nfsserver.unl.edu for 192.168.180.21
Feb 21 15:15:59 nfsserver kernel: lockd: cannot monitor 192.168.181.47
Feb 21 15:16:24 nfsserver kernel: lockd: cannot unmonitor 192.168.181.72
Feb 21 15:16:49 nfsserver kernel: lockd: cannot unmonitor 192.168.181.12
Feb 21 15:17:14 nfsserver kernel: lockd: cannot monitor 192.168.181.47
Feb 21 15:17:39 nfsserver kernel: lockd: cannot monitor 192.168.181.6
Feb 21 15:18:04 nfsserver kernel: lockd: cannot monitor 192.168.181.47
Feb 21 15:18:29 nfsserver kernel: lockd: cannot unmonitor 192.168.181.24
Feb 21 15:18:54 nfsserver kernel: lockd: cannot unmonitor 192.168.181.28
Feb 21 15:19:19 nfsserver kernel: lockd: cannot monitor 192.168.181.6
Feb 21 15:19:44 nfsserver kernel: lockd: cannot monitor 192.168.181.47
Feb 21 15:20:09 nfsserver kernel: lockd: cannot monitor 192.168.181.6
Feb 21 15:20:19 nfsserver rpc.statd[1985]: Caught signal 15, un-registering and exiting.
Feb 21 15:20:19 nfsserver nfslock: rpc.statd startup succeeded
Feb 21 15:20:34 nfsserver kernel: lockd: cannot monitor 192.168.181.47
Feb 21 15:20:49 nfsserver rpc.statd[31263]: Can't notify 192.168.181.64, giving up.
Feb 21 15:20:49 nfsserver rpc.statd[31263]: Can't notify 192.168.181.133, giving up.
Feb 21 15:20:59 nfsserver kernel: lockd: cannot monitor 192.168.181.6

Restarting rpc.statd on affected clients yielded interesting logs from the server as well:

Feb 21 15:35:15 mathstat rpc.statd[31263]: SM_NOTIFY from client1 while not monitoring any hosts.
Feb 21 15:35:34 mathstat kernel: lockd: cannot monitor 192.168.181.6
Feb 21 15:35:52 mathstat rpc.statd[31263]: SM_NOTIFY from client2 while not monitoring any hosts.
Feb 21 15:35:59 mathstat kernel: lockd: cannot monitor 192.168.181.28
Feb 21 15:36:15 mathstat rpc.statd[31263]: SM_NOTIFY from client3 while not monitoring any hosts.

client1 = 192.168.181.6
client2 = 192.168.181.28

OK, I was wrong in my comment about locking with the NFS server itself at .181.2. Ignore that part.

I also get clobbered with these when statd was not running; however, I will test further.

For the record (*crosses fingers*), we've been stable now for 10+ days after changing /etc/exports from using a mixture of IP addresses and hostnames to using IP addresses exclusively. I made the change on a hunch that the mixture was at least one contributing factor to the problem(s), since some of the log entries referred to hostnames (the rpc.statd "SM_NOTIFY..." bits), whereas others referred only to IP addresses (the "lockd: cannot monitor..." bits).

Died again today... oh well, lasted almost 9 days this time.

Died again. A whopping 13 days uptime.

Every time it dies, is rpc.statd dead as well?

No, rpc.statd is not dead. If it were, how would it be sending all those entries to syslog? Further, as reported earlier here, I've tried repeatedly to restart rpc.statd on both server and clients, to no avail. The only anomalous thing I ever see is that rpc.lockd (on the NFS server) is in the "D" state as reported by ps.
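Not part of the original report, but when triaging logs like the excerpts above, a small helper that tallies the "lockd: cannot (un)monitor" messages per client address makes it easy to see which clients are affected. The function name is hypothetical, and the field handling assumes the address is the last word on the line, as in the excerpts quoted here.

```shell
# tally_lockd_failures: count "lockd: cannot (un)monitor" events per address.
# Hypothetical helper; assumes the syslog line ends with the client address,
# as in the excerpts quoted in this report.
tally_lockd_failures() {
    grep -E 'lockd: cannot (un)?monitor ' |
        awk '{print $NF}' |
        sort | uniq -c | sort -rn
}

# Typical use on a live server:
#   tally_lockd_failures < /var/log/messages
```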
Normally (when functioning happily) its state is "SW".

I've got the same problem: my server randomly crashes. Before seeing this thread I called HP (my server is an ML350) and we exchanged it for a new ML350; after that we suspected our SAN to be the buggy thing, but it doesn't seem to be at all. Now I'm very confused; I've applied all updates. I don't know if this error makes my server hang, but in every case I've seen this error before the server hangs up.

Our box in question is a P4 with Hyperthreading. Since the last crash, I decided to try running a UP kernel instead. Interestingly, I started seeing the statd errors/warnings in the syslog again, but this time rpc.lockd didn't go into la-la land (i.e., it's still "SW", not "D"), and it *appears* to still be functioning OK(*).

(*) Previously, once the weirdness began to occur, *all* NFS clients would be unable to obtain file locks over NFS. Now, at least, all (most? some?) NFS clients are still able to function.

Well, the "kernel: lockd: cannot monitor" messages are due to the fact that rpc.statd is dead, and in bz 151828 rpc.statd seems to be seg faulting, so I was trying to see if these are related... When statd is hung in the "D" state, would it be possible to get an AltSysRq-t system trace (echo t > /proc/sysrq-trigger should do it)? Also, is there any reliable way to reproduce this? I'm trying but I can't seem to find one....

NFS (locking) froze up again today. Upon reflection: server death and rpc.lockd going into the "D" state don't occur until we start seeing "kernel: lockd: cannot monitor x.x.x.x" syslog entries (summary changed accordingly). Interestingly, during the shutdown, as processes were being killed off, I happened to run 'ps' again and noticed that lockd had gone back to the "SW" state (which it normally is in).

What nfs-utils version are you using?

The box in question is a fully up-to-date RHEL3, so: nfs-utils-1.0.6-33EL.

What kind of load is running that is causing all this locking to occur?
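A small sketch (mine, not from the thread) of the state check described above: lockd stuck in uninterruptible sleep ("D") versus its normal "SW" state on these 2.4 kernels. The function reads `ps axo stat,comm` output on stdin so it can be tried offline; `lockd_state` is a hypothetical name.

```shell
# lockd_state: print the process state of lockd from `ps axo stat,comm`
# output ("D" = uninterruptible sleep, the anomaly reported above;
# "SW" = the normal state reported in this thread).
lockd_state() {
    awk '$2 == "lockd" { print $1; found = 1 } END { exit !found }'
}

# Typical use on a live server:
#   ps axo stat,comm | lockd_state
```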
Again, I'm trying to reproduce this problem and I'm not having much luck... Also, in the last failure, were the "nsm_mon_unmon: rpc failed, status=-110" messages there again? Finally, you said you added the patch from bug 118839, so what are you setting nfsd3_acl_max_entries to? (Which may or may not make a difference.)

Re: load. The box is an NFS/Samba server for ~100 mixed RedHat9, RHEL3, WindowsXP clients (about 50%/50% Windows/Linux).

Didn't see any nsm_mon_unmon messages this time (or since that first time I reported seeing it, actually). FYI, the patch from bug 118839 is already in kernel-2.4.21-27.0.2.EL, and I have in /etc/modules.conf:

# Per http://bugzilla.redhat.com/bugzilla/118839
options nfsd nfsd3_acl_max_entries=128
options nfs nfs3_acl_max_entries=128

May be coincidence, but I've been running a RedHat 9 NFS server for 18 months with no trouble until about a week ago, when I started Samba up on it. I am now seeing messages like this in the syslog:

nsm_mon_unmon: rpc failed, status=-13
lockd: cannot monitor 192.168.1.202

Chris

This thread describes what we've seen on one system today. RHEL3U4, 2.4.21-27.0.2.EL kernel. In our case it is both an NFS client and server. The first thing we noticed is that the NFS client side locks up; that is, exports from other NFS servers mounted on this system hang in device wait. The same kernel messages:

rpc.statd Received erroneous SM_UNMON request from <dnsname.of.this.box> for <IP of an NFS client>

show up for days and days, every 3-8 minutes. Then, just when NFS locks up, these messages show up:

kernel: nsm_mon_unmon: rpc failed, status=-110
kernel: lockd: cannot monitor 198.48.89.107

We don't run Samba on this server. The NFS servers and most clients are IBM AIX.

Eric

In http://people.redhat.com/steved/bz15182 is an nfs-utils that fixes a problem which causes rpc.statd to crash. Now, the "lockd: cannot monitor" messages are caused by the kernel not being able to communicate with rpc.statd.
So I'm wondering if this issue is a combination of a couple of issues. Could you please upgrade your nfs-utils to nfs-utils-1.0.6-55.src.rpm and let me know what happens?

I get "Error: 404" trying http://people.redhat.com/steved/bz15182

My bad... it's http://people.redhat.com/steved/bz151828

nfs-utils-1.0.6-55 doesn't build on RHEL3, because of BuildRequires: krb5-devel > 1.3.1. Naively removing that doesn't help... it fails in a couple of places later in the build:

...
Making dep in nfsidmap
gcc -O2 -pipe -march=i386 -mcpu=i686 -M libnfsidmap.c nss.c umich_ldap.c cfg.c strlcpy.c cfg.h nfsidmap_internal.h queue.h > .depend
gcc: compilation of header file requested
gcc: compilation of header file requested
gcc: compilation of header file requested

...and...

Making all in src
...
gcc -DPACKAGE_NAME=\"librpcsecgss\" -DPACKAGE_TARNAME=\"librpcsecgss\" -DPACKAGE_VERSION=\"0.1\" "-DPACKAGE_STRING=\"librpcsecgss 0.1\"" -DPACKAGE_BUGREPORT=\"nfsv4-wg.edu\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DSTDC_HEADERS=1 -DHAVE_STDDEF_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_UNISTD_H=1 -DHAVE_STDLIB_H=1 -DHAVE_MALLOC=1 -DHAVE_STDLIB_H=1 -DHAVE_REALLOC=1 -DHAVE_MEMSET=1 -DHAVE_STRERROR=1 -DHAVE_KRB5=1 -DKRB5_VERSION=127 -DUSE_PRIVATE_KRB5_FUNCTIONS=1 -DHAVE_GSS_KRB5_CCACHE_NAME=1 -I. -I. -I../include -I/usr/kerberos/include -O2 -pipe -march=i386 -mcpu=i686 -c auth_gss.c -fPIC -DPIC -o .libs/auth_gss.o
auth_gss.c: In function `authgss_create_default':
auth_gss.c:224: `GSS_C_NT_HOSTBASED_SERVICE' undeclared (first use in this function)
auth_gss.c:224: (Each undeclared identifier is reported only once

Ahh.... I forgot about the NFSv4 stuff in the RHEL4 nfs-utils... never mind the upgrade :-[

Boy, I wish I could reproduce this...
I realize this happens intermittently, but if possible, having a (bzip2-ed) binary ethereal trace (i.e. tethereal -w /tmp/trace.pcap) of the problem could help determine what is going on... Also, a system trace of the hung system (i.e. AltSysRq-t on the console, or echo t > /proc/sysrq-trigger) would help...

I rebuilt nfs-utils starting with nfs-utils-1.0.6-33EL, including the additional (mostly statd-related) patches from nfs-utils-1.0.6-55:

Patch56: nfs-utils-1.0.6-fd-sig-cleanup.patch
Patch59: nfs-utils-1.0.6-rquotad-overflow.patch
Patch60: nfs-utils-1.0.6-statd-notify-hostname.patch

Let's see if it helps any.

As to comment 34 ("boy I wish I could reproduce this"): I had 'cannot unmonitor' happen when a RO NFS export server dropped off the LAN for an extended interval (a couple of hours), but a companion RW server at a different mountpoint was fine and reachable. Is it possible that the local lockd is not ignoring RO filesystems properly?

There is a rumor afoot that a possible workaround is to downgrade to nfs-utils-1.0.6-31.EL. I'm not sure how that is possible, since not too much has changed.... But to see if this is truly the case, I put an x86 rpm and srpm of nfs-utils-1.0.6-31.EL in http://people.redhat.com/steved/bz140385/ Please download it and see if it helps with the issue...

How is this one different from the nfs-utils-1.0.6-31.EL that's on the RHN website?

I have two different Dell servers that I recently installed. One's running RHEL3 Update 3 and the other one Update 4. That's pretty much it in terms of difference, so the package versions are different (including kernel and nfs-utils). Anyway, the problem is only showing up on the Update 4 server.

(In reply to comment #38)
> There is a rumor afoot that a possible workaround is to downgrade to
> nfs-utils-1.0.6-31.EL. I'm not sure how that is possible, since not
> too much has changed....
>
> But to see if this is truly the case, I put an x86 rpm and srpm
> of nfs-utils-1.0.6-31.EL in http://people.redhat.com/steved/bz140385/
> Please download it and see if it helps with the issue...

The nfs-utils on my people page and on RHN are the same... I just figured it would be easier to find if I put it on my people page.

FYI, I've discovered something interesting. The last few times I've experienced this problem, the "server" in question had also been acting as an NFS client to another server of ours. I mention this only because we just had another EL3 box freeze up(*), but this one was acting only as an NFS client. All EL3 boxes in question are using a modified nfs-utils-1.0.6-33 (see comment #35). I guess we'll try downgrading to 1.0.6-31 to see if that helps.

(*) Only in the sense that it could no longer obtain file locks over NFS.

Just curious.... are all these servers multi-homed? Meaning, do they have network interfaces on more than one subnet?

No multihoming. One (physical) subnet.

My misbehaving RO NFS server from comment 36 _is_ multi-homed; it is, however, also in a fully populated DNS client and server set for each segment.

Since dropping the nfs-utils-1.0.6-sgi-statd-fixes.patch and removing the --enable-secure-statd configure option (effectively dropping back to the nfs-utils-1.0.6-31 rpm), lo and behold... *0* problems for 2 solid weeks. That's at least a week longer uptime than we've had for quite a while. I think we may have something.

I doubt nfs-utils-1.0.6-sgi-statd-fixes.patch has anything to do with it, since it just cleans up some printfs, ignores SIGPIPE so TCP connections don't kill statd, and cleans up some gid setting in the drop-privileges code... but... setting secure-statd could be the issue... The security checks this enables are supposed to ensure that only the local lockd will be able to monitor locks... maybe it's a bit too restrictive....

BTW, thank you very much for this work... it's much appreciated!
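The workaround discussed above, rebuilding nfs-utils without --enable-secure-statd, can be sketched roughly as follows. This is an assumption-laden sketch, not the thread's exact procedure: the sed edit presumes the spec file passes the flag to configure inline, and the paths follow RHEL3-era rpmbuild conventions; the package file names are from the thread.

```shell
# strip_secure_statd: remove the --enable-secure-statd flag from a
# configure invocation. Assumes the flag appears inline in the spec
# file (an assumption, not confirmed by the thread).
strip_secure_statd() {
    sed 's/ --enable-secure-statd//g'
}

# Rough RHEL3-era rebuild (sketch, run as a build user):
#   rpm -ivh nfs-utils-1.0.6-33EL.src.rpm
#   strip_secure_statd < /usr/src/redhat/SPECS/nfs-utils.spec > nfs-utils-nosecure.spec
#   rpmbuild -ba nfs-utils-nosecure.spec
```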
In http://people.redhat.com/steved/bz151828/ is an nfs-utils rpm built without the --enable-secure-statd configure option set. Please give it a try to see if it solves the problem.

Cut/paste error: http://people.redhat.com/steved/bz140385/ is the correct place...

Got similar issues on a Dell PE 6450 using nfs-utils-1.0.6-33EL with the 2.4.21-32.ELsmp kernel. Is it a true statement that nfs-utils-1.0.6-36EL fixes this problem?

Well, I wouldn't say it's completely fixed, but at least it's happening a *lot* less now. Since trying nfs-utils-1.0.6-36EL (effectively since May 5), I've experienced only 1 NFS client lockup. Before that, we were seeing it every few (2-7) days.

Rex, did that one lockup have the same footprint as the others?

Yes. Any attempt at NFS locking on the client yielded hung processes.

What I meant was: were there the same error messages when the hang occurred, and has the hang happened since?

Yes, this last NFS client lock-freeze had all the same symptoms and error messages.

So it's the same hang, but they don't happen as often. How often do they now occur? In comment #46 it was stated the hang had not happened in two weeks; is that about the range between hangs?

See comment #56: "Since trying nfs-utils-1.0.6-36EL (effectively since May 5), I've experienced only 1 nfs client lockup since. Before that, we were seeing it every few(2-7) days." So, only 1 lockup between May 5 and now (June 16).

By chance, is there some prevalent application running, like ClearCase or some type of database, when the hangs occur, or is it just normal user traffic like people logging in and out.... Also, is the entire machine locked up? Would it be possible to get a system trace as described in comment #34?

Just running compiles (rpm building, actually). The buildroot is local, but the hangs occur on 1. reading my ~/.rpmrc and ~/.rpmmacros (~ is on NFS), and 2.
Running make in an NFS directory with rpm specfiles.

When/if it happens again, I can try to get the system/ethereal traces.

Just to be clear: the server is currently running kernel-smp-2.4.20-20.EL (or something close), and is the client running the same kernel or something different?

We've seen it occur against 2 servers:
1. RHEL3 box: kernel-smp/kernel-2.4.21-32.0.1.EL (and previously 2.4.21-27.0.4.EL). Going to the UP kernel didn't (seem to) help. Varying UP/SMP kernels on the client didn't (seem to) help either.
2. (Old) rh90 box: kernel-smp-2.4.20-43.9.legacy (using nfs-utils from rpmbuild --rebuild nfs-utils-1.0.6-36EL)

What was the kernel version on the clients that used these servers? And I'm not sure I understand what you mean by "Varying UP/SMP kernels on the client didn't (seem to) help either".

Clients were using the same kernel(s) as the servers.

Since I cannot reproduce this, and it seemingly has gone away with the removal of the secure-statd compile option, I'm going to put this bug in a state that will send it to our QA group.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-697.html

Matching CRM is closed, erratum released, closing this.

Internal Status set to 'Resolved'
Status set to: Closed by Tech

This event sent from IssueTracker by pdemauro
issue 65578