Bug 569944

Summary:	lockd calls statd with source address of first interface instead of localhost
Product:	[Fedora] Fedora	Reporter:	Ron Wail <ron>
Component:	nfs-utils	Assignee:	Steve Dickson <steved>
Status:	CLOSED WONTFIX	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	medium	Docs Contact:
Priority:	low
Version:	12	CC:	jlayton, steved
Target Milestone:	---	Keywords:	Reopened
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2010-12-03 22:07:41 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Ron Wail 2010-03-02 18:57:31 UTC

Description of problem:

When kernel lockd calls rpc.statd, the source address of the RPC call is not the loopback address. rpc.statd rejects the call and the lock attempt fails.

Version-Release number of selected component (if applicable):

1:1.2.1-4.fc12

How reproducible:
All attempts to ask for a lock on the server fail.

Steps to Reproduce:
1. Mount an NFS export
2. Request a lock on a file in the export (the log below shows NLM version 4)
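
For anyone trying to reproduce step 2, a minimal sketch of a lock request in C follows. The function name try_write_lock and the idea of passing a path on the NFS mount are illustrative assumptions, not part of the original report. On an NFS client, a server-side NLM_DENIED_NOLOCKS would typically surface as ENOLCK from fcntl(), though that mapping is stated here as an expectation rather than something verified in this bug.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Try to take an exclusive, non-blocking POSIX lock on the whole file.
 * Returns 0 on success, or the errno value from open()/fcntl() on failure.
 * The path is supplied by the caller; nothing here is specific to NFS.
 */
int try_write_lock(const char *path)
{
    struct flock fl;
    int fd, err = 0;

    fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return errno;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;      /* exclusive lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;             /* 0 means "to end of file" */

    if (fcntl(fd, F_SETLK, &fl) < 0) {
        err = errno;
        fprintf(stderr, "lock on %s failed: %s\n", path, strerror(err));
    }

    close(fd);                /* closing the descriptor also drops the lock */
    return err;
}
```

Calling try_write_lock() with a path on the mounted export, from a small main(), should return 0 on a healthy setup and a non-zero errno value when the server refuses the lock.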

  
Actual results:

lock fails with NLM_DENIED_NOLOCKS

Expected results:

lock succeeds

Additional info:

In /var/log/messages, where:

<client IP>: The IP address of the client attempting to get the lock
<client hostname>: The FQDN of the client based on the <client IP>
<server interface IP>: One of the server's interfaces' IP address
<server hostname>: The FQDN of the server

rpcbind: connect from <client IP> to getport/addr(nlockmgr)
rpcbind: connect from <server interface IP> to getport/addr(status)
kernel: lockd: request from <client IP>, port=861
kernel: lockd: LOCK          called
kernel: lockd: nlmsvc_lookup_host(host='<client hostname>', vers=4, proto=tcp)
kernel: lockd: host garbage collection
kernel: lockd: nlmsvc_mark_resources
kernel: lockd: delete host <client hostname>
kernel: lockd: destroyed nsm_handle for <client hostname> (<client IP>)
kernel: lockd: created nsm_handle for <client hostname> (<client IP>)
kernel: lockd: nlm_lookup_host created host <client hostname>
kernel: lockd: nsm_monitor(<client hostname>)
rpc.statd[10970]: Call to statd from non-local host <server interface IP>
rpc.statd[10970]: STAT_FAIL to <server hostname> for SM_MON of <client IP>
kernel: lockd: xdr_dec_stat_res status 1 state -1
kernel: lockd: cannot monitor <client hostname>
kernel: lockd: release host <client hostname>

Comment 1 Steve Dickson 2010-10-18 13:20:03 UTC

> rpc.statd[10970]: Call to statd from non-local host <server interface IP>
> rpc.statd[10970]: STAT_FAIL to <server hostname> for SM_MON of <client IP>

This is by design... statd will reject any monitor calls that do
not come from 127.0.0.1
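
As an illustration of the kind of check described here, the sketch below accepts a caller only when its IPv4 source address is exactly 127.0.0.1. It is not taken from the nfs-utils source; the function name caller_is_loopback is hypothetical and the real statd logic may differ.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>

/*
 * Return true only when the caller's IPv4 source address is exactly
 * 127.0.0.1 (INADDR_LOOPBACK). Illustrative only: this is not the
 * nfs-utils source.
 */
bool caller_is_loopback(const struct sockaddr_in *caller)
{
    return caller->sin_family == AF_INET &&
           caller->sin_addr.s_addr == htonl(INADDR_LOOPBACK);
}
```

With a check of this shape, a request whose source address has been rewritten to one of the server's interface addresses is refused even though it originated on the same machine.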

Comment 2 Ron Wail 2010-10-18 15:16:51 UTC

(In reply to comment #1)
> > rpc.statd[10970]: Call to statd from non-local host <server interface IP>
> > rpc.statd[10970]: STAT_FAIL to <server hostname> for SM_MON of <client IP>
> 
> This is by design... statd will reject any monitor calls that do
> not come from 127.0.0.1

If you read the original report, rpcbind or lockd is not sending the call to statd via the loopback interface. It is sending it via one of its other interfaces, so the source address is not 127.0.0.1. There is no way to make lockd use localhost/loopback as its source address.

So this is a bug in rpcbind/lockd for not sending from the loopback address, or it's a bug in statd for not accepting monitor calls from one of localhost's interfaces.

Comment 3 Steve Dickson 2010-10-19 18:47:46 UTC

I see... thank you for pointing this out. It must be lockd, since
rpcbind does not send SM_MON messages... hmm...

What kernel version are you using?

Comment 4 Ron Wail 2010-10-20 00:30:31 UTC

# uname -a

Linux <hostname> 2.6.31.12-174.2.22.fc12.i686.PAE #1 SMP Fri Feb 19 19:10:04 UTC 2010 i686 i686 i386 GNU/Linux

# cat /etc/exports
/srv/nas                       *(rw,sync,no_root_squash,insecure_locks)

The NFS mount is from VMs running on the host over the virbr0/vnet interfaces.

Unfortunately, it's a production box and I don't have a test machine to look at upgrades etc at the moment, but may have soon.

One complexity is that this box has the following network interfaces:

lo
eth1 -> Internet
eth0 -> 2 x vlan: vlan1 -> 4 multi-home IPs, vlan2 -> single IP
tun1: OpenVPN
virbr0: Bridge interface for 8 x VMs
vnet[0-8]: host visible vnet interfaces of the VMs

Comment 5 Jeff Layton 2010-10-20 12:00:07 UTC

It sounds like something is wrong with the routing here. From nsm_create():

        struct sockaddr_in sin = {
                .sin_family             = AF_INET,
                .sin_addr.s_addr        = htonl(INADDR_LOOPBACK),
        };
        struct rpc_create_args args = {
                .protocol               = XPRT_TRANSPORT_UDP,
                .address                = (struct sockaddr *)&sin,
                .addrsize               = sizeof(sin),
                .servername             = "rpc.statd",
                .program                = &nsm_program,
                .version                = NSM_VERSION,
                .authflavor             = RPC_AUTH_NULL,
                .flags                  = RPC_CLNT_CREATE_NOPING,
        };

...the destination address for NSM calls is hardcoded to INADDR_LOOPBACK. If the source address of those packets is something other than INADDR_LOOPBACK then the routing table is screwy indeed.

I suppose it's also possible that statd has a bug and is rejecting calls from INADDR_LOOPBACK when it shouldn't.

It might be worthwhile to sniff traffic on 'lo' and see if you can see these requests and what the source address actually is.
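
As a complement to a packet capture, a small user-space probe can report the source address a loopback listener actually sees. The sketch below sends one UDP datagram to 127.0.0.1 from an unbound socket and returns the sender address reported by recvfrom(). The function name observed_loopback_source and the 2-second timeout are assumptions made for illustration, and whether a user-space probe is affected in exactly the same way as lockd's in-kernel RPC client has not been verified here.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Send one UDP datagram to a listener on 127.0.0.1 and store the source
 * address the listener observes in *seen (network byte order).
 * Returns 0 on success, -1 on any socket error or timeout.
 */
int observed_loopback_source(struct in_addr *seen)
{
    struct sockaddr_in dst, src;
    socklen_t len = sizeof(dst);
    struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
    char byte = 'x';
    int rx, tx, rc = -1;

    rx = socket(AF_INET, SOCK_DGRAM, 0);
    tx = socket(AF_INET, SOCK_DGRAM, 0);
    if (rx < 0 || tx < 0)
        goto out;

    /* Listen on 127.0.0.1 with a kernel-chosen port. */
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    dst.sin_port = 0;
    if (bind(rx, (struct sockaddr *)&dst, sizeof(dst)) < 0 ||
        getsockname(rx, (struct sockaddr *)&dst, &len) < 0)
        goto out;

    /* Do not block forever if the datagram is dropped. */
    setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    /* The sender is left unbound, so the kernel picks its source address. */
    if (sendto(tx, &byte, 1, 0, (struct sockaddr *)&dst, sizeof(dst)) != 1)
        goto out;

    len = sizeof(src);
    if (recvfrom(rx, &byte, 1, 0, (struct sockaddr *)&src, &len) != 1)
        goto out;

    *seen = src.sin_addr;
    rc = 0;
out:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return rc;
}
```

If the address stored in *seen is anything other than 127.0.0.1, something between the sender and the receiver (routing or packet filtering) is rewriting the source of loopback traffic. The result can be printed with inet_ntoa() from a small main().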

Comment 6 Steve Dickson 2010-10-20 14:02:49 UTC

How often does this happen? Would it be possible to get a network
trace of lo without the trace growing to an unmanageable size?

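The trace command could look something like the following (the capture
is unfiltered; an rpc display filter can be applied afterwards when the
file is opened in wireshark):
     yum install wireshark
     tshark -i lo -w /tmp/data.pcap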
     bzip2 /tmp/data.pcap

Comment 7 Bug Zapper 2010-11-03 20:56:43 UTC

This message is a reminder that Fedora 12 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 12.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '12'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 12's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 12 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 8 Bug Zapper 2010-12-03 22:07:41 UTC

Fedora 12 changed to end-of-life (EOL) status on 2010-12-02. Fedora 12 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 9 Ron Wail 2010-12-04 02:34:34 UTC

The fault turned out to be an interaction between our iptables NAT rules and lockd.

In our NAT tables, we had:

-A POSTROUTING -j MASQUERADE

after a number of SNAT and DNAT entries.

Changing that entry to

-A POSTROUTING ! -o lo -j MASQUERADE

corrected the problem.

Sorry for not replying until now, but thanks for the assistance along the way.