Bug 448244

Summary:	yp_all error on kernel 2.6.18-92.el5
Product:	Red Hat Enterprise Linux 5	Reporter:	Darren <d-gitelman>
Component:	kernel	Assignee:	Jeff Layton <jlayton>
Status:	CLOSED INSUFFICIENT_DATA	QA Contact:	Martin Jenner <mjenner>
Severity:	medium	Docs Contact:
Priority:	low
Version:	5.2	CC:	staubach, steved
Target Milestone:	rc
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2008-11-25 12:32:22 UTC	Type:	---

Description Darren 2008-05-24 22:01:28 UTC

Description of problem:
Upgraded to kernel 2.6.18-92.el5. The reboot hangs on loading NFS quota and
shows repeated messages: yp_all: clnt_call: RPC: Timed out
The machine has to be rebooted to get past this point.
 
Using an interactive boot I see that ypbind loads fine. I disabled loading NFS
and the machine starts. From the command line ypwhich and ypcat passwd show the
appropriate server and password file information. Rebooting using kernel
2.6.18-53.1.21.el5 does not give the yp_all error. This is happening on the NIS
clients.
 
The NIS server is able to boot kernel 2.6.18-92.el5, but it is not automounting
NFS shares. That problem is also solved by reverting to kernel 2.6.18-53.1.21.el5.

Version-Release number of selected component (if applicable):
2.6.18-92.el5.

How reproducible:
every time

Steps to Reproduce:
1. Set up NIS and NFS automounts.
2. From an NIS client, reboot into kernel 2.6.18-92.el5.
3. Booting hangs at NFS quota.
  
Actual results:
Upon reaching NFS quota, boot hangs with repeated messages:
yp_all: clnt_call: RPC: Timed out

Expected results:
There should be no errors since the boot proceeds normally for kernel
2.6.18-53.1.21.el5 with no changes to any conf files

Additional info:

Comment 1 Jeff Layton 2008-07-24 21:22:15 UTC

Some questions...

When you say "set up NIS" do you mean set this machine up as a server or client?
(I presume client, but I'd like to be sure)

What maps are you serving?

What does /etc/nsswitch.conf look like on this host?

By "automounts" here, you mean to set up NFS mounts in /etc/fstab, correct?

When you say "hangs at NFS quota", do you mean that it hangs after printing out
this message?

    Starting NFS quotas: 

Are you able to do "chkconfig nfs off" and then boot to the newer kernel?

If so, does running "service nfs start" hang when starting rquotad?

Comment 2 Darren 2008-07-25 04:42:48 UTC

Machine is NIS client. 

Serving maps: passwd, shadow, group, hosts, rpc, services, netid, protocols, mail

nsswitch.conf
[root@chinook etc]# more /etc/nsswitch.conf
#
# /etc/nsswitch.conf
#
# An example Name Service Switch config file. This file should be
# sorted with the most-used services at the beginning.
#
# The entry '[NOTFOUND=return]' means that the search for an
# entry should stop if the search in the previous entry turned
# up nothing. Note that if the search failed due to some other reason
# (like no NIS server responding) then the search continues with the
# next entry.
#
# Legal entries are:
#
#       nis or yp               Use NIS (NIS version 2), also called YP
#       dns                     Use DNS (Domain Name Service)
#       files                   Use the local files
#       db                      Use the local database (.db) files
#       compat                  Use NIS on compat mode
#       hesiod                  Use Hesiod for user lookups
#       ldap                    Use LDAP (only if nss_ldap is installed)
#       nisplus or nis+         Use NIS+ (NIS version 3), unsupported
#       [NOTFOUND=return]       Stop searching if not found so far
#

# To use db, put the "db" in front of "files" for entries you want to be
# looked up first in the databases
#
# Example:
#passwd:    db files ldap nis
#shadow:    db files ldap nis
#group:     db files ldap nis

passwd:     files nis
shadow:     files
group:      files nis

#hosts:     db files ldap nis dns
hosts:      files nis dns

# Example - obey only what ldap tells us...
#services:  ldap [NOTFOUND=return] files
#networks:  ldap [NOTFOUND=return] files
#protocols: ldap [NOTFOUND=return] files
#rpc:       ldap [NOTFOUND=return] files
#ethers:    ldap [NOTFOUND=return] files

bootparams: files nis
ethers:     files nis
netmasks:   files nis
networks:   files nis
protocols:  files nis
rpc:        files nis
services:   files nis
netgroup:   files nis
publickey:  files nis
automount:  files
aliases:    files nis
#############################################

Automounts refers to using the automounter (autofs) (exports and auto.*)

Yes that's where it hangs.

Yes I can do chkconfig nfs off and it will boot. Running service nfs start
produces the same error : yp_all: clnt_call: RPC: Timed out

Comment 3 Darren 2008-07-25 04:45:12 UTC

Please additionally note the following 

[root@chinook init.d]# service nfs start
Starting NFS services:                                     [  OK  ]
Starting NFS quotas: yp_all: clnt_call: RPC: Timed out
yp_all: clnt_call: RPC: Timed out
yp_all: clnt_call: RPC: Timed out
yp_all: clnt_call: RPC: Timed out
yp_all: clnt_call: RPC: Timed out
yp_all: clnt_call: RPC: Timed out
yp_all: clnt_call: RPC: Timed out
yp_all: clnt_call: RPC: Timed out
                                                           [  OK  ]
Starting NFS daemon:                                       [  OK  ]
Starting NFS mountd: yp_all: clnt_call: RPC: Timed out
yp_all: clnt_call: RPC: Timed out
yp_all: clnt_call: RPC: Timed out
yp_all: clnt_call: RPC: Timed out
yp_all: clnt_call: RPC: Timed out
yp_all: clnt_call: RPC: Timed out
                                                           [  OK  ]

 
This occurs when the system is booting or if I stop and then start the service. 
 
No messages relevant to this error appear in /var/log/messages 
 
No matter which kernel starts yp is appropriately bound to the NIS server
ypwhich returns the name of the server and ypcat passwd returns the passwords.
 
RPC appears to be running correctly
 
[root@chinook ~]# rpcinfo -p localhost
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100007    2   udp    966  ypbind
    100007    1   udp    966  ypbind
    100007    2   tcp    969  ypbind
    100007    1   tcp    969  ypbind
    100024    1   udp    644  status
    100024    1   tcp    652  status
    100021    1   udp  32784  nlockmgr
    100021    3   udp  32784  nlockmgr
    100021    4   udp  32784  nlockmgr
    100021    1   tcp  43375  nlockmgr
    100021    3   tcp  43375  nlockmgr
    100021    4   tcp  43375  nlockmgr
    100011    1   udp    766  rquotad
    100011    2   udp    766  rquotad
    100011    1   tcp    790  rquotad
    100011    2   tcp    790  rquotad
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100005    1   udp    804  mountd
 
[root@chinook ~]# rpcinfo -u localhost ypbind
program 100007 version 1 ready and waiting
program 100007 version 2 ready and waiting

ypdomainname, nisdomainname, hostname all return appropriate values, which are
not different between kernels.
 
Searching Google returned a suggestion to change the open-file limit
(ulimit -n 256), but this does not resolve the error.
 
I am running the following versions of programs (of course these do not change
when switching kernels)
ypbind-1.19-8.el5
yp-tools-2.9-0.1
nfs-utils-1.0.9-33.el5
nfs-utils-lib-1.0.8-7.2.z2
nfs4-acl-tools-0.3.1-1.el5.1

Comment 4 Darren 2008-07-25 04:54:44 UTC

One last note: the error also occurs for kernel: 2.6.18-92.1.6.el5

Comment 5 Jeff Layton 2008-07-25 12:00:39 UTC

I gave a shot at reproducing this, and haven't been able to.

The error messages you're seeing mean that the yp client is trying to query the
yp server but it isn't getting responses. I have to wonder whether this symptom
is just indicative of some sort of generic network connectivity problem between
the yp client and server.

Once you boot this machine to the new kernel, are you able to do:

# ypcat hosts

Can you also ping the yp server?

Comment 6 Darren 2008-07-25 12:10:59 UTC

Please see comment #3:
  "No matter which kernel starts yp is appropriately bound to the NIS server
ypwhich returns the name of the server and ypcat passwd returns the passwords."

So the answer is yes. 

This is what is so puzzling. If I boot to kernel 2.6.18-92 or 2.6.18-92.1.6 I
get the error. If I boot to 2.6.18-53.1.21 I do not get the error. If I disable
the NFS service there is no error. Both kernels show normal yp behavior and
return the proper results for ypcat hosts, ypcat passwd, ypwhich, etc. Of
course the user's home directory doesn't mount with NFS turned off but the
system recognizes the user's login.

Comment 7 Jeff Layton 2008-07-25 13:49:34 UTC

Ok, then I'm stumped. At this point you're going to need to do some
troubleshooting to narrow down the cause. I recommend opening an RH support case
and working with the folks there to narrow down the reason for this. You should
be able to refer to this BZ. If it turns out that this problem is due to a bug
of some sort then we can transition this BZ to address it.

Comment 8 Darren 2008-07-25 14:00:34 UTC

That's too bad. I tried submitting this to RH, but they won't open a support
case since I only have academic support which just covers RHN proxy. So I guess
this dies here. Are there any suggested forums where I could post this, or do
you have suggestions for troubleshooting whose results I could then submit to
a forum?

Darren

Comment 9 Jeff Layton 2008-07-25 14:15:39 UTC

Ahh, that is too bad. You'll need to do some legwork to track this down yourself
then. You might also check the CentOS forums.

It looks like the problem is confined to mountd and rquotad. Since mountd and
rquotad both work with the export table, I suspect that the problem is related
to something in /etc/exports. You may want to try narrowing down your exports
table to see if you can determine if there's one or two that cause the problem.
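
One way to narrow the table down is to test one entry at a time. The following is only a minimal sketch: the function name split_exports and the exports.NN file names are hypothetical, and it assumes every entry in /etc/exports sits on a single line (no backslash continuations).

```shell
# Hypothetical helper: write each entry of an exports file to its own
# candidate file (exports.01, exports.02, ...) so entries can be tried singly.
# Assumes one entry per line, with no backslash line continuations.
split_exports() {
    src=$1
    outdir=$2
    mkdir -p "$outdir"
    # Drop comment lines and blank lines, then emit one file per remaining line.
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$src" |
    awk -v dir="$outdir" '{
        f = sprintf("%s/exports.%02d", dir, NR)
        print > f
        close(f)
    }'
}
```

After backing up the original table, each candidate file could be copied over /etc/exports in turn, followed by "service nfs restart", to see which entry or entries bring the yp_all timeouts back.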

Since it's related to YP, you could also try running with or without nscd and
see if it makes a difference.

You could also try running mountd or rquotad by hand and seeing if you can
replicate the problem with one of them. The goal here is to determine what's
changed at the system call level (since most likely, something has). If you
strace mountd on both kernels and compare them, that might also give you a hint.

Maybe something like this:

# strace -f -o /tmp/mountd.strace -tt -T rpc.mountd -F
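
Comparing two such logs by eye is tedious, so a rough sketch of how they might be summarized follows. The function names summarize and slow_calls are hypothetical, and the field positions assume the log was written with the -f, -o, -tt and -T options shown above (PID first, timestamp second, the call third, the elapsed time in angle brackets last).

```shell
# Hypothetical helpers for comparing strace logs captured on the two kernels.

# Print "syscall count" lines, sorted by syscall name.
summarize() {
    awk '$3 ~ /^[a-z_0-9]+\(/ {
             sub(/\(.*/, "", $3)      # keep only the syscall name
             n[$3]++
         }
         END { for (s in n) print s, n[s] }' "$1" | sort
}

# Print every call whose elapsed time (the trailing <seconds> that -T adds)
# is at least MIN seconds; long waits are candidates for the RPC timeouts.
slow_calls() {
    awk -v min="$2" '{
             d = $NF
             gsub(/[<>]/, "", d)
             if (d + 0 >= min) print
         }' "$1"
}
```

For example, with one log saved per kernel under names of your choosing, "summarize old.strace > old.txt; summarize new.strace > new.txt; diff old.txt new.txt" shows which calls differ in count, and "slow_calls new.strace 1" lists the calls that blocked for a second or more.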

If it turns out to be a bug and you have a description of the problem from which
I can work, or (even better) a way for me to reproduce this here, I'll be happy
to look further.

Comment 10 Jeff Layton 2008-11-25 12:32:22 UTC

No word in quite some time. Closing case. Please reopen if you're able to provide more info...