Bug 140385
Summary: | lockd: cannot (un)monitor x.x.x.x | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 3 | Reporter: | Rex Dieter <rdieter> |
Component: | nfs-utils | Assignee: | Steve Dickson <steved> |
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 3.0 | CC: | buckh, ee-cap-admin-dl, erich, george.liu, herrold, hyclak, martin.donnelly, martinez, riel, support, tao |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHBA-2005-697 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2005-09-28 18:51:34 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 156321 |
Description
Rex Dieter
2004-11-22 18:47:21 UTC
I'll upgrade to kernel-2.4.21-4.EL to see if that helps.

OK, I mistyped, I was using kernel-2.4.21-20.EL originally.

OK, the new(est) kernel, kernel-smp-2.4.21-27.0.2.EL: different log messages, same result (server grinding to a halt). Lots and lots of:

Feb 7 15:32:30 server rpc.statd[1990]: Received erroneous SM_UNMON request from server for client1
...
Feb 7 14:28:10 server rpc.statd[1990]: Received erroneous SM_UNMON request from server for client2

Oh, and client machines are a mixture of rhel3, rh90, fc2, fc3.

Server hung up again, this time seeing a bunch of:

nsm_mon_unmon: rpc failed, status=-110
lockd: cannot monitor x.x.x.x
nsm_mon_unmon: rpc failed, status=-110
lockd: cannot monitor y.y.y.y
nsm_mon_unmon: rpc failed, status=-110
lockd: cannot monitor z.z.z.z

(a "lockd: cannot monitor" message for almost *every* NFS client in use)

I'm adding this here because I don't know if the bug is with EL3 or EL4. Previously I was running an EL3 NFS server and EL3 clients. We have upgraded the clients to EL4 and left the server on EL3 for the time being. We are now getting the unmonitor problem. Do you want me to open this as an EL4 bug or leave it here? Either way, it is causing a very serious problem for us. It was not happening with the EL4 beta 2, so we were very surprised to see it now. Paul

Happened again this morning after I had reverted to kernel-smp-2.4.21-20.EL (from kernel-smp-2.4.21-27.0.2.EL). Oh well, back to kernel-smp-2.4.21-27.0.2.EL.

The "lockd: cannot monitor..." messages appear because the kernel cannot talk to the local rpc.statd process, which could mean that statd has gone down. Please check the status of rpc.statd (on both the server and the client) during this condition. The "lockd: unauthenticated request from..." message is a server-side message saying the client trying to take the lock is not in the export list. Which may or may not be true...

Checking (and then, out of desperation, restarting) rpc.statd on the server was one of the *first* things I tried on each occasion.
It has never helped. The fact that restarting the server makes all clients happy leads me to think that all client-side rpc.statd's were pretty much OK. That is, unless there is something that could simultaneously affect the stability of rpc.statd on 30+ clients. (-:

Happened again today; I had just (re)booted the server this morning, running kernel-smp-2.4.21-27.0.2.ELsmp. In the appended log, you can see that I restarted rpc.statd (with /sbin/service nfslock restart), but it didn't help. For the record, the NFS server failed even to provide file locking with *itself* at 192.168.181.2. /var/log/messages contained (names and IP addresses changed to protect the innocent):

Feb 21 15:12:07 nfsserver rpc.statd[1985]: Received erroneous SM_UNMON request from nfsserver.unl.edu for 192.168.180.21
Feb 21 15:15:59 nfsserver kernel: lockd: cannot monitor 192.168.181.47
Feb 21 15:16:24 nfsserver kernel: lockd: cannot unmonitor 192.168.181.72
Feb 21 15:16:49 nfsserver kernel: lockd: cannot unmonitor 192.168.181.12
Feb 21 15:17:14 nfsserver kernel: lockd: cannot monitor 192.168.181.47
Feb 21 15:17:39 nfsserver kernel: lockd: cannot monitor 192.168.181.6
Feb 21 15:18:04 nfsserver kernel: lockd: cannot monitor 192.168.181.47
Feb 21 15:18:29 nfsserver kernel: lockd: cannot unmonitor 192.168.181.24
Feb 21 15:18:54 nfsserver kernel: lockd: cannot unmonitor 192.168.181.28
Feb 21 15:19:19 nfsserver kernel: lockd: cannot monitor 192.168.181.6
Feb 21 15:19:44 nfsserver kernel: lockd: cannot monitor 192.168.181.47
Feb 21 15:20:09 nfsserver kernel: lockd: cannot monitor 192.168.181.6
Feb 21 15:20:19 nfsserver rpc.statd[1985]: Caught signal 15, un-registering and exiting.
Feb 21 15:20:19 nfsserver nfslock: rpc.statd startup succeeded
Feb 21 15:20:34 nfsserver kernel: lockd: cannot monitor 192.168.181.47
Feb 21 15:20:49 nfsserver rpc.statd[31263]: Can't notify 192.168.181.64, giving up.
Feb 21 15:20:49 nfsserver rpc.statd[31263]: Can't notify 192.168.181.133, giving up.
Feb 21 15:20:59 nfsserver kernel: lockd: cannot monitor 192.168.181.6

Restarting rpc.statd on affected clients yielded interesting logs from the server as well:

Feb 21 15:35:15 mathstat rpc.statd[31263]: SM_NOTIFY from client1 while not monitoring any hosts.
Feb 21 15:35:34 mathstat kernel: lockd: cannot monitor 192.168.181.6
Feb 21 15:35:52 mathstat rpc.statd[31263]: SM_NOTIFY from client2 while not monitoring any hosts.
Feb 21 15:35:59 mathstat kernel: lockd: cannot monitor 192.168.181.28
Feb 21 15:36:15 mathstat rpc.statd[31263]: SM_NOTIFY from client3 while not monitoring any hosts.

client1 = 192.168.181.6
client2 = 192.168.181.28

OK, I was wrong in my comment about locking with the NFS server itself at .181.2. Ignore that part.

I also get clobbered with these when statd was not running; however, I will test further.

For the record (*crosses fingers*), we've been stable now for 10+ days after changing /etc/exports from using a mixture of IP addresses and hostnames to using IP addresses exclusively. I made the change on a hunch that the mixture was at least one contributing factor to the problem(s), since some of the log entries referred to hostnames (the rpc.statd "SM_NOTIFY..." bits), whereas others referred only to IP addresses (the "lockd: cannot monitor..." bits).

Died again today... oh well, lasted almost 9 days this time.

Died again. A whopping 13 days uptime.

Every time it dies, is rpc.statd dead as well?

No, rpc.statd is not dead. If it were, how would it be sending all those entries to syslog? Further, as reported earlier here, I've tried repeatedly to restart rpc.statd on both server and clients, to no avail. The only anomalous thing I ever see is that rpc.lockd (on the NFS server) is in the "D" state as reported by ps.
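Not part of the original report, but when triaging logs like the excerpts above, a small helper that tallies the "lockd: cannot (un)monitor" messages per client address makes it easy to see which clients are affected. The function name is hypothetical, and the field handling assumes the address is the last word on the line, as in the excerpts quoted here.

```shell
# tally_lockd_failures: count "lockd: cannot (un)monitor" events per address.
# Hypothetical helper; assumes the syslog line ends with the client address,
# as in the excerpts quoted in this report.
tally_lockd_failures() {
    grep -E 'lockd: cannot (un)?monitor ' |
        awk '{print $NF}' |
        sort | uniq -c | sort -rn
}

# Typical use on a live server:
#   tally_lockd_failures < /var/log/messages
```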
Normally (when functioning happily) its state is "SW".

I've got the same problem: my server randomly crashes. Before seeing this thread I called HP (my server is an ML350) and we exchanged it for a new ML350; after that we suspected our SAN to be the buggy thing, but it doesn't seem to be at all. Now I'm very confused; I've applied all updates. I don't know if this error makes my server hang, but in every case I've seen this error before the server hangs up.

Our box in question is a P4 with Hyperthreading. Since the last crash, I decided to try running a UP kernel instead. Interestingly, I started seeing the statd errors/warnings in the syslog again, but this time rpc.lockd didn't go into la-la land (i.e., it's still "SW", not "D"), and it *appears* to still be functioning OK(*).

(*) Previously, once the weirdness began to occur, *all* NFS clients would be unable to obtain file locks over NFS. Now, at least, all (most? some?) NFS clients are still able to function.

Well, the "kernel: lockd: cannot monitor" messages are due to the fact that rpc.statd is dead, and in bz 151828 rpc.statd seems to be seg faulting, so I was trying to see if these are related... When statd is hung in the "D" state, would it be possible to get an AltSysRq-t system trace (echo t > /proc/sysrq-trigger should do it)? Also, is there any reliable way to reproduce this? I'm trying but I can't seem to find one....

NFS (locking) froze up again today. Upon reflection: server death and rpc.lockd going into the "D" state don't occur until we start seeing "kernel: lockd: cannot monitor x.x.x.x" syslog entries (summary changed accordingly). Interestingly, during the shutdown, as processes were being killed off, I happened to run 'ps' again and noticed that lockd had gone back to the "SW" state (which it normally is in).

What nfs-utils version are you using?

The box in question is a fully up-to-date RHEL3, so: nfs-utils-1.0.6-33EL.

What kind of load is running that is causing all this locking to occur?
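A small sketch (mine, not from the thread) of the state check described above: lockd stuck in uninterruptible sleep ("D") versus its normal "SW" state on these 2.4 kernels. The function reads `ps axo stat,comm` output on stdin so it can be tried offline; `lockd_state` is a hypothetical name.

```shell
# lockd_state: print the process state of lockd from `ps axo stat,comm`
# output ("D" = uninterruptible sleep, the anomaly reported above;
# "SW" = the normal state reported in this thread).
lockd_state() {
    awk '$2 == "lockd" { print $1; found = 1 } END { exit !found }'
}

# Typical use on a live server:
#   ps axo stat,comm | lockd_state
```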
Again, I'm trying to reproduce this problem and I'm not having much luck... Also, in the last failure, were the "nsm_mon_unmon: rpc failed, status=-110" messages there again? Finally, you said you added the patch from bug 118839, so what are you setting nfsd3_acl_max_entries to? (Which may or may not make a difference.)

Re: load. The box is an NFS/Samba server for ~100 mixed RedHat9, RHEL3, WindowsXP clients (about 50%/50% Windows/Linux).

Didn't see any nsm_mon_unmon messages this time (or since that first time I reported seeing it, actually). FYI, the patch from bug 118839 is already in kernel-2.4.21-27.0.2.EL, and I have in /etc/modules.conf:

# Per http://bugzilla.redhat.com/bugzilla/118839
options nfsd nfsd3_acl_max_entries=128
options nfs nfs3_acl_max_entries=128

May be coincidence, but I've been running a RedHat 9 NFS server for 18 months with no trouble until about a week ago, when I started Samba up on it. I am now seeing messages like this in the syslog:

nsm_mon_unmon: rpc failed, status=-13
lockd: cannot monitor 192.168.1.202

Chris

This thread describes what we've seen on one system today. RHEL3U4, 2.4.21-27.0.2.EL kernel. In our case it is both an NFS client and server. The first thing we noticed is that the NFS client side locks up; that is, exports from other NFS servers mounted on this system hang in device wait. The same kernel messages:

rpc.statd Received erroneous SM_UNMON request from <dnsname.of.this.box> for <IP of an NFS client>

show up for days and days, every 3-8 minutes. Then, just when NFS locks up, these messages show up:

kernel: nsm_mon_unmon: rpc failed, status=-110
kernel: lockd: cannot monitor 198.48.89.107

We don't run Samba on this server. The NFS servers and most clients are IBM AIX.

Eric

In http://people.redhat.com/steved/bz15182 is an nfs-utils that fixes a problem which causes rpc.statd to crash. Now, the "lockd: cannot monitor" messages are caused by the kernel not being able to communicate with rpc.statd.
So I'm wondering if this issue is a combination of a couple of issues. Could you please upgrade your nfs-utils to nfs-utils-1.0.6-55.src.rpm and let me know what happens?

I get "Error: 404" trying http://people.redhat.com/steved/bz15182

My bad... it's http://people.redhat.com/steved/bz151828

nfs-utils-1.0.6-55 doesn't build on RHEL3, because of BuildRequires: krb5-devel > 1.3.1. Naively removing that doesn't help... it fails in a couple of places later in the build:

...
Making dep in nfsidmap
gcc -O2 -pipe -march=i386 -mcpu=i686 -M libnfsidmap.c nss.c umich_ldap.c cfg.c strlcpy.c cfg.h nfsidmap_internal.h queue.h > .depend
gcc: compilation of header file requested
gcc: compilation of header file requested
gcc: compilation of header file requested

...and...

Making all in src
...
gcc -DPACKAGE_NAME=\"librpcsecgss\" -DPACKAGE_TARNAME=\"librpcsecgss\" -DPACKAGE_VERSION=\"0.1\" "-DPACKAGE_STRING=\"librpcsecgss 0.1\"" -DPACKAGE_BUGREPORT=\"nfsv4-wg.edu\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DSTDC_HEADERS=1 -DHAVE_STDDEF_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_UNISTD_H=1 -DHAVE_STDLIB_H=1 -DHAVE_MALLOC=1 -DHAVE_STDLIB_H=1 -DHAVE_REALLOC=1 -DHAVE_MEMSET=1 -DHAVE_STRERROR=1 -DHAVE_KRB5=1 -DKRB5_VERSION=127 -DUSE_PRIVATE_KRB5_FUNCTIONS=1 -DHAVE_GSS_KRB5_CCACHE_NAME=1 -I. -I. -I../include -I/usr/kerberos/include -O2 -pipe -march=i386 -mcpu=i686 -c auth_gss.c -fPIC -DPIC -o .libs/auth_gss.o
auth_gss.c: In function `authgss_create_default':
auth_gss.c:224: `GSS_C_NT_HOSTBASED_SERVICE' undeclared (first use in this function)
auth_gss.c:224: (Each undeclared identifier is reported only once

Ahh.... I forgot about the NFSv4 stuff in the RHEL4 nfs-utils... never mind the upgrade :-[

Boy, I wish I could reproduce this...
I realize this happens intermittently, but if possible, having a (bzip2-ed) binary ethereal trace (i.e. tethereal -w /tmp/trace.pcap) of the problem could help determine what is going on... Also, a system trace of the hung system (i.e. AltSysRq-t on the console, or echo t > /proc/sysrq-trigger) would help...

I rebuilt nfs-utils starting with nfs-utils-1.0.6-33EL, including the additional (mostly statd-related) patches from nfs-utils-1.0.6-55:

Patch56: nfs-utils-1.0.6-fd-sig-cleanup.patch
Patch59: nfs-utils-1.0.6-rquotad-overflow.patch
Patch60: nfs-utils-1.0.6-statd-notify-hostname.patch

Let's see if it helps any.

As to comment 34 ("boy I wish I could reproduce this"): I had 'cannot unmonitor' happen when a RO NFS export server dropped off the LAN for an extended interval (a couple of hours), but a companion RW server at a different mountpoint was fine and reachable. Is it possible that the local lockd is not ignoring RO filesystems properly?

There is a rumor afoot that a possible workaround is to downgrade to nfs-utils-1.0.6-31.EL. I'm not sure how that is possible, since not too much has changed.... But to see if this is truly the case, I put an x86 rpm and srpm of nfs-utils-1.0.6-31.EL in http://people.redhat.com/steved/bz140385/ Please download it and see if it helps with the issue...

How is this one different from the nfs-utils-1.0.6-31.EL that's on the RHN website?

I have two different Dell servers that I recently installed. One's running RHEL3 Update 3 and the other one Update 4. That's pretty much it in terms of difference, so the package versions are different (including kernel and nfs-utils). Anyway, the problem is only showing up on the Update 4 server.

(In reply to comment #38)
> There is a rumor afoot that a possible workaround is to downgrade to
> nfs-utils-1.0.6-31.EL. I'm not sure how that is possible, since not
> too much has changed....
>
> But to see if this is truly the case, I put an x86 rpm and srpm
> of nfs-utils-1.0.6-31.EL in http://people.redhat.com/steved/bz140385/
> Please download it and see if it helps with the issue...

The nfs-utils on my people page and on RHN are the same... I just figured it would be easier to find if I put it on my people page.

FYI, I've discovered something interesting. The last few times I've experienced this problem, the "server" in question had also been acting as an NFS client to another server of ours. I mention this only because we just had another EL3 box freeze up(*), but this one was acting only as an NFS client. All EL3 boxes in question are using a modified nfs-utils-1.0.6-33 (see comment #35). I guess we'll try downgrading to 1.0.6-31 to see if that helps.

(*) Only in the sense that it could no longer obtain file locks over NFS.

Just curious.... are all these servers multi-homed? Meaning, do they have network interfaces on more than one subnet?

No multihoming. One (physical) subnet.

My misbehaving RO NFS server from comment 36 _is_ multi-homed; it is, however, also in a fully populated DNS client and server set for each segment.

Since dropping the nfs-utils-1.0.6-sgi-statd-fixes.patch and removing the --enable-secure-statd configure option (effectively dropping back to the nfs-utils-1.0.6-31 rpm), lo and behold... *0* problems for 2 solid weeks. That's at least a week longer uptime than we've had for quite a while. I think we may have something.

I doubt nfs-utils-1.0.6-sgi-statd-fixes.patch has anything to do with it, since it just cleans up some printfs, ignores SIGPIPE so TCP connections don't kill statd, and cleans up some gid setting in the drop-privileges code... but... setting secure-statd could be the issue... The security checks this enables are supposed to ensure that only the local lockd will be able to monitor locks... maybe it's a bit too restrictive....

BTW, thank you very much for this work... it's much appreciated!
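The workaround discussed above, rebuilding nfs-utils without --enable-secure-statd, can be sketched roughly as follows. This is an assumption-laden sketch, not the thread's exact procedure: the sed edit presumes the spec file passes the flag to configure inline, and the paths follow RHEL3-era rpmbuild conventions; the package file names are from the thread.

```shell
# strip_secure_statd: remove the --enable-secure-statd flag from a
# configure invocation. Assumes the flag appears inline in the spec
# file (an assumption, not confirmed by the thread).
strip_secure_statd() {
    sed 's/ --enable-secure-statd//g'
}

# Rough RHEL3-era rebuild (sketch, run as a build user):
#   rpm -ivh nfs-utils-1.0.6-33EL.src.rpm
#   strip_secure_statd < /usr/src/redhat/SPECS/nfs-utils.spec > nfs-utils-nosecure.spec
#   rpmbuild -ba nfs-utils-nosecure.spec
```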
In http://people.redhat.com/steved/bz151828/ is an nfs-utils rpm built without the --enable-secure-statd configure option set. Please give it a try to see if it solves the problem.

Cut/paste error: http://people.redhat.com/steved/bz140385/ is the correct place...

Got similar issues on a Dell PE 6450 using nfs-utils-1.0.6-33EL with the 2.4.21-32.ELsmp kernel. Is it a true statement that nfs-utils-1.0.6-36EL fixes this problem?

Well, I wouldn't say it's completely fixed, but at least it's happening a *lot* less now. Since trying nfs-utils-1.0.6-36EL (effectively since May 5), I've experienced only 1 NFS client lockup. Before that, we were seeing it every few (2-7) days.

Rex, did that one lockup have the same footprint as the others?

Yes. Any attempt at NFS locking on the client yielded hung processes.

What I meant was: were there the same error messages when the hang occurred, and has the hang happened since?

Yes, this last NFS client lock-freeze had all the same symptoms and error messages.

So it's the same hang, but they don't happen as often. How often do they now occur? In comment #46 it was stated the hang had not happened in two weeks; is that about the range between hangs?

See comment #56: "Since trying nfs-utils-1.0.6-36EL (effectively since May 5), I've experienced only 1 nfs client lockup since. Before that, we were seeing it every few(2-7) days." So, only 1 lockup between May 5 and now (June 16).

By chance, is there some prevalent application running, like ClearCase or some type of database, when the hangs occur, or is it just normal user traffic like people logging in and out.... Also, is the entire machine locked up? Would it be possible to get a system trace as described in comment #34?

Just running compiles (rpm building, actually). The buildroot is local, but the hangs occur on 1. reading my ~/.rpmrc and ~/.rpmmacros (~ is on NFS), and 2.
Running make in an NFS directory with rpm specfiles.

When/if it happens again, I can try to get the system/ethereal traces.

Just to be clear: the server is currently running kernel-smp-2.4.20-20.EL (or something close), and is the client running the same kernel or something different?

We've seen it occur against 2 servers:
1. RHEL3 box: kernel-smp/kernel-2.4.21-32.0.1.EL (and previously 2.4.21-27.0.4.EL). Going to the UP kernel didn't (seem to) help. Varying UP/SMP kernels on the client didn't (seem to) help either.
2. (Old) rh90 box: kernel-smp-2.4.20-43.9.legacy (using nfs-utils from rpmbuild --rebuild nfs-utils-1.0.6-36EL)

What was the kernel version on the clients that used these servers? And I'm not sure I understand what you mean by "Varying UP/SMP kernels on the client didn't (seem to) help either".

Clients were using the same kernel(s) as the servers.

Since I cannot reproduce this, and it seemingly has gone away with the removal of the secure-statd compile option, I'm going to put this bug in a state that will send it to our QA group.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-697.html

Matching CRM is closed, erratum released, closing this.

Internal Status set to 'Resolved'
Status set to: Closed by Tech

This event sent from IssueTracker by pdemauro
issue 65578