Bug 500653

Summary: NFS: problems with virtual IP and locking
Product: Red Hat Enterprise Linux 2.1 Reporter: Sachin Prabhu <sprabhu>
Component: kernelAssignee: Jeff Layton <jlayton>
Status: CLOSED ERRATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: high    
Version: 2.1CC: cward, donald.harper, dzickus, jiali, lscalabr, rmitchel, rvandolson, rwheeler, steved, tao
Target Milestone: rcKeywords: Reopened
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-03-30 07:18:00 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 533192, 526775    
Attachments:
Description Flags
first draft -- track src address of request and use for callback
none
first draft -- track src address of request and use for callback
none
2nd draft -- track src address of request and use for callback
none
patch 3 -- don't bind unless necessary none

Description Sachin Prabhu 2009-05-13 15:06:37 UTC
There have been a few cases reporting problems with locking. In these cases, the NFS server was using a floating ip for clustering purposes. The clients mounted shares from the server using this virtual ip.

There is a problem seen when using blocking locks in this sort of setup. For example, consider this transaction

NFS client: 10.33.8.75
NFS Server:
Primary IP : 10.33.8.71
Floating IP:  10.33.8.77

$ tshark -r block-virtual.pcap -R 'nlm'
19   2.487622   10.33.8.75 -> 10.33.8.77   NLM V4 LOCK Call FH:0x6176411a svid:4 pos:0-0
22   2.487760   10.33.8.77 -> 10.33.8.75   NLM V4 LOCK Reply (Call In 19) NLM_BLOCKED
33   2.489518   10.33.8.71 -> 10.33.8.75   NLM V4 GRANTED_MSG Call FH:0x6176411a svid:4 pos:0-0
36   2.489635   10.33.8.75 -> 10.33.8.71   NLM V4 GRANTED_MSG Reply (Call In 33)
46   2.489977   10.33.8.75 -> 10.33.8.71   NLM V4 GRANTED_RES Call NLM_DENIED
49   2.490096   10.33.8.71 -> 10.33.8.75   NLM V4 GRANTED_RES Reply (Call In 46)

19 - A lock request is sent from the client to the floating ip.
22 - A NLM_BLOCKED request is sent back by the Floating ip to the client.
33 - Server Primary IP address returns a NLM_GRANTED.
36 - Ack for GRANTED_MSG in 33.
47 - Client returns a NLM_DENIED to the SERVER. This is done since it doesn't match the locks requested.
49 - Ack for GRANTED_RES in 46.

In this case, the lock is eventually obtained when the client polls for the lock after 30 seconds. This is unacceptable for clients. Similar problems are also seen with GFS where the filesystem uses deferred locks.

The problem is caused in the function nlmclnt_grant(). In here, we go through the list of blocked locks nlm_blocked. It matches each lock against the following criteria.
a) The blocking range.
b) The 'pid/svid' of the locks which is being granted.
c) The ip address of the machine making the grant request to the ip address stored in the blocked list.
d) The filehandle.

Since the ip addresses do not match, the client doesn't match any lock from the blocked list and a NLM_DENIED is returned back to the server.

Comment 1 Sachin Prabhu 2009-05-13 15:08:37 UTC
The problem also exists upstream and any patch to fix this issue will first have to be discussed and accepted upstream.

Comment 2 Sachin Prabhu 2009-05-13 15:14:36 UTC
A possible fix for this issue is to remove the check for the ip address and instead check the owner field.


struct nlm_lock {
..
        struct xdr_netobj       oh;
..
}; 

From RFC1813.
The oh field is an opaque object that identifies the host or process that is holding the lock.

NFS illustrated also says that clients put in its ip address or hostname in this netobj. Early solaris implementations encoded a timestamp into the oh field so that it could be used to detect retransmitted requests. The encoded timestamp created a problem with some servers that rejected unlock requests because the oh field didn't match that of the lock request. 

On Linux clients, the nlm_lock.oh is set in
static void nlmclnt_setlockargs(struct nlm_rqst *req, struct file_lock *fl)
{
..
        lock->oh.data = req->a_owner;
        lock->oh.len  = snprintf(req->a_owner, sizeof(req->a_owner), "%u@%s",
                                (unsigned int)fl->fl_u.nfs_fl.owner->pid,
                                system_utsname.nodename);
        lock->svid = fl->fl_u.nfs_fl.owner->pid;
..
} 

The tcpdump shows that the lock structure is returned untouched by the linux nfs server to the client. The only concern is whether this is also true for other nfs servers.

Comment 3 Sachin Prabhu 2009-05-13 15:37:50 UTC
http://www.opengroup.org/onlinepubs/9629799/NLM_GRANTED_MSG.htm

Call Arguments

struct nlm_testargs { 
    netobj cookie; 
    bool exclusive; 
    struct nlm_lock alock; 
};
 
..
"exclusive" and "alock" will be the values in the original NLM_LOCK_MSG procedure.
..

So a compliant NFS server should return the same nlm_lock structure alock which was sent to it by the client. This should allow us to use the nlm_struct.oh element to verify the lock in nlmclnt_grant().

Comment 4 Sachin Prabhu 2009-05-14 09:28:01 UTC
It looks like older versions of RHEL 4 kernel(U5 and before) did not have this problem. 

u32
nlmclnt_grant(struct nlm_lock *lock)
{
..
        for (block = nlm_blocked; block; block = block->b_next) {
                if (nlm_compare_locks(block->b_lock, &lock->fl))
                        break;
        }
..
}

nlm_compare_locks(struct file_lock *fl1, struct file_lock *fl2)
{
        return  fl1->fl_pid   == fl2->fl_pid
             && fl1->fl_start == fl2->fl_start
             && fl1->fl_end   == fl2->fl_end
             &&(fl1->fl_type  == fl2->fl_type || fl2->fl_type == F_UNLCK);
}

It simply checked the pid, range and type. This however caused problems. The following upstream patch was backported to RHEL 4 U6 in bz 220897 to fix this problem. However this ended up breaking callbacks when using virtual ips.

commit 5ac5f9d1ce8492163dbde5d357dc5d03becf7e36
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Tue Feb 14 13:53:04 2006 -0800

    [PATCH] NLM: Fix the NLM_GRANTED callback checks

    If 2 threads attached to the same process are blocking on different locks
    on different files (maybe even on different servers) but have the same lock
    arguments (i.e.  same offset+length - actually quite common, since most
    processes try to lock the entire file) then the first GRANTED call that
    wakes one up will also wake the other.

    Currently when the NLM_GRANTED callback comes in, lockd walks the list of
    blocked locks in search of a match to the lock that the NLM server has
    granted.  Although it checks the lock pid, start and end, it fails to check
    the filehandle and the server address.

    By checking the filehandle and server IP address, we ensure that this only
    happens if the locks truly are referencing the same file.

    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

Comment 5 Sachin Prabhu 2009-05-15 15:06:20 UTC
Upstream post:

http://thread.gmane.org/gmane.linux.nfs/26689

Comment 6 Sachin Prabhu 2009-05-18 13:37:58 UTC
This issue is apparently fixed by having the NFS server bind to the address on which the initial command was sent.

http://thread.gmane.org/gmane.linux.nfs/14379
[PATCH 0/3] NLM: fix source address in server callbacks

http://thread.gmane.org/gmane.linux.nfs/14380
[PATCH 1/3] SUNRPC server: record the destination address of a request
Commit a97476926ec061f90b77da478620ea6dc71a3237

http://thread.gmane.org/gmane.linux.nfs/14381
[PATCH 2/3] SUNRPC client: add interface for binding to a local address
Commit d3bc9a1deb8964d774af8535814cb91bf8f6def0

http://thread.gmane.org/gmane.linux.nfs/14382
[PATCH 3/3] NLM: fix source address of callback to client
commit c98451bdb2f3e6d6cc1e03adad641e9497512b49

However this still leaves the question of should the client be fixed to allow for callbacks from different ip addresses.

Comment 7 Sachin Prabhu 2009-05-18 14:53:27 UTC
Upstream is of the opinion that the current behaviour where the client rejects callbacks from unknown ip addresses is correct.

It looks like the fix for the problem is on the server side.

Comment 12 RHEL Program Management 2009-07-01 14:49:25 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.

Comment 13 Jeff Layton 2009-07-02 12:36:24 UTC
Created attachment 350268 [details]
first draft -- track src address of request and use for callback

Here's a first draft at a patch to fix this. It still has some warts. Most notably I've done some hackery to work around kabi, but it's really not sufficient for a release patch. It should be OK for proof-of-concept testing though.

Everything seems to work ok on my (single-homed) test rig. rpc_debug shows the socket being bound to the expected srcaddr for the callback to the client.

Unfortunately, I don't have easy access to a multihomed rig in order to test that aspect of it. Sachin, can you test this out on your setup?

Comment 14 Jeff Layton 2009-07-02 12:37:47 UTC
Created attachment 350269 [details]
first draft -- track src address of request and use for callback

Oops, wrong patch. This should be the right one.

Comment 15 Sachin Prabhu 2009-07-02 13:15:32 UTC
Quick check using kernel kernel-2.6.18-156.el5.jtltest.77.

Server Primary address: 192.168.122.22
Server Floating IP address: 192.168.122.81

Client IP address: 192.168.122.21

$ tshark -r 500653.test1 -R 'nlm'
 10   0.006548 192.168.122.21 -> 192.168.122.81 NLM V4 LOCK Call FH:0x194a83e6 svid:2 pos:0-0
 11   0.007198 192.168.122.81 -> 192.168.122.21 NLM V4 LOCK Reply (Call In 10) NLM_BLOCKED
 14   4.519680 192.168.122.81 -> 192.168.122.21 NLM V4 GRANTED_MSG Call FH:0x194a83e6 svid:2 pos:0-0
 15   4.519829 192.168.122.21 -> 192.168.122.81 NLM V4 GRANTED_RES Call
 16   4.521272 192.168.122.21 -> 192.168.122.81 NLM V4 GRANTED_MSG Reply (Call In 14)
 18   4.523264 192.168.122.81 -> 192.168.122.21 NLM V4 GRANTED_RES Reply (Call In 15)
 31   5.797287 192.168.122.21 -> 192.168.122.81 NLM V4 UNLOCK Call FH:0x194a83e6 svid:2 pos:0-0
 34   5.798244 192.168.122.81 -> 192.168.122.21 NLM V4 UNLOCK Reply (Call In 31)

10 - A lock request is sent from the client to the floating ip.
11 - A NLM_BLOCKED request is sent back by the Floating ip to the client.
14 - The _Floating_IP_ IP address returns a NLM_GRANTED.
15 - successful GRANTED_RES
16 - Ack for GRANTED_MSG in 14.
18 - Ack for GRANTED_RES in 16.
31 & 34, the Unlock calls from the client.


Frame 14 shows the server contacts the client using the floating ip address. The test kernel was _successful_ in fixing the problem.

Comment 18 Jeff Layton 2009-07-02 18:41:20 UTC
Created attachment 350326 [details]
2nd draft -- track src address of request and use for callback

Second draft. This should take care of the kabi issues with the rpc_xprt struct.

I had to add a new tcp_flag to indicate the presence of the new field. Ugly, but it should work.

I'm not convinced I need to do anything to account for the addition of the field to the nlm_host struct. As best I can tell, nothing outside of lockd deals with nlm_host structs, and I don't see that it's ever used in an array or embedded in another struct anywhere.

I'm not sure why the kabi checker complained about that one, but I think it's wrong.

Anyway...Sachin when you get a chance, please test this patchset as well.

Comment 19 Peter Staubach 2009-07-02 19:17:08 UTC
I don't think that the kABI checker is wrong.  A struct which is part of
the kABI changed sizes and elements.  This constitutes a kABI violation.

The struct is part of the arguments or is a return value from functions
which are part of the kABI.  For better or worse.

That said, I don't see a problem in the existing code with just adding
the __GENKSYMS__ trick.  The nlm_lock struct seems pretty well self
contained and manipulated by lockd specific code.

Comment 21 Jeff Layton 2009-07-02 19:58:52 UTC
I'll gladly take that guidance. The log from the build failure is here:

https://brewweb.devel.redhat.com/getfile?taskID=1871014&name=build.log

Here's the relevant part:

*** ERROR - ABI BREAKAGE WAS DETECTED ***
The following symbols have been changed (this will cause an ABI breakage):
nlmsvc_ops
nlmclnt_proc

...adding the __GENKSYMS__ trick around the new nlm_host->h_saddr made the problem go away. Like Peter said, I'm pretty sure nlm_hosts are never referenced outside of lockd code, so I'm not at all clear why it complained.

Comment 22 Don Zickus 2009-07-02 21:31:23 UTC
Well I found the problem for nlmclnt_proc:

int
nlmclnt_proc(struct inode *inode, int cmd, struct file_lock *fl)
|
|-->struct file_lock {
    <snip>
    ...
    union {
                struct nfs_lock_info    nfs_fl;
                struct nfs4_lock_info   nfs4_fl;
        } fl_u;
    ...
    <snip>
    |
    |--> struct nfs_lock_info {
                u32             state;
                struct nlm_lockowner *owner;
                struct list_head list;
         };
         |
         |--> struct nlm_lockowner {
              <snip>
              ...
              struct nlm_host *host;
              ...
              <snip>

Now normally nlm_lockowner is defined as opaque for nfs_lock_info but because fs/lockd/clntproc.c is including lockd.h, I guess it gets defined and causes the problems.  You can argue the whole pointer shouldn't count theory but I guess it depends who/where that memory is allocated (most likely in the kernel).

Knowing the struct file_lock causes the issue, I think nlmsvc_ops has a similar problem.

Hope that helps.

Cheers,
Don

Comment 23 Jeff Layton 2009-07-06 11:08:58 UTC
Thanks for the explanation, Don. That makes sense.

The important bit is that I'm not actually changing the fl structure or anything that's touched outside of lockd code. I think it's OK to just use the __GENKSYMS__ trick on the nlm_host struct here.

I think the patch I have so far is pretty close, but I may do one more respin to try to limit the changes to only affect the lockd calbacks. I need to look over the code a little more closely though.

Comment 24 Jeff Layton 2009-07-06 13:33:03 UTC
Created attachment 350603 [details]
patch 3 -- don't bind unless necessary

3rd draft. Change the code so that it only binds the socket if a reserved port is needed or if the source address is set.

It may be harmless to bind sockets to 0.0.0.0:0 but it's still a behavior change. I'd like to avoid adding those unless they're necessary. This means adding a bit of if logic to the beginning of xs_bind.

The rest of the patches should be the same (more or less).

Comment 25 Jeff Layton 2009-07-06 18:29:06 UTC
New test kernel is up on my people.redhat.com page with the new set (jtltest.78):

http://people.redhat.com/jlayton/

...please test it out and let me know how it goes.

Comment 26 Sachin Prabhu 2009-07-07 09:45:17 UTC
Test performed using kernel kernel-2.6.18-156.el5.jtltest.78.x86_64

Server Primary address: 192.168.122.22
Server Floating IP address: 192.168.122.82

Client IP address: 192.168.122.23

 28   2.166274 192.168.122.23 -> 192.168.122.82 NLM V4 LOCK Call FH:0x194a83e6 svid:1 pos:0-0
 29   2.166309 192.168.122.82 -> 192.168.122.23 NLM V4 LOCK Reply (Call In 28) NLM_BLOCKED
 38   4.100686 192.168.122.82 -> 192.168.122.23 NLM V4 GRANTED_MSG Call FH:0x194a83e6 svid:1 pos:0-0
 39   4.100884 192.168.122.23 -> 192.168.122.82 NLM V4 GRANTED_RES Call
 40   4.100895 192.168.122.23 -> 192.168.122.82 NLM V4 GRANTED_MSG Reply (Call In 38)
 42   4.100947 192.168.122.82 -> 192.168.122.23 NLM V4 GRANTED_RES Reply (Call In 39)
 49   5.436184 192.168.122.23 -> 192.168.122.82 NLM V4 UNLOCK Call FH:0x194a83e6 svid:1 pos:0-0
 50   5.436226 192.168.122.82 -> 192.168.122.23 NLM V4 UNLOCK Reply (Call In 49)
 
28 - A lock request is sent from the client to the floating ip.
29 - A NLM_BLOCKED request is sent back by the Floating ip to the client.
38 - The _Floating_IP_ IP address returns a NLM_GRANTED.
39 - successful GRANTED_RES
40 - Ack for GRANTED_MSG in 38.
42 - Ack for GRANTED_RES in 39.
49 & 50, the Unlock calls from the client.


Frame 38 shows the server contacts the client using the floating ip address.

The test kernel was successful in fixing the problem.

Comment 29 Don Zickus 2009-10-06 19:37:26 UTC
in kernel-2.6.18-168.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 31 Jeff Layton 2010-01-23 14:44:49 UTC
*** Bug 557911 has been marked as a duplicate of this bug. ***

Comment 32 Ray Van Dolson 2010-01-25 16:24:33 UTC
Support is requesting I try one of the test kernels mentioned in this bug to see if it resolves the issue.  The link above looks a bit old, would the kernels found here be appropriate to try?

  http://people.redhat.com/jwilson/el5/185.el5/

Thanks.

Comment 33 Jeff Layton 2010-01-25 16:39:58 UTC
Yes. In general, higher kernel versions contain the same fixes as lower versions.

Comment 34 Ray Van Dolson 2010-01-26 03:44:43 UTC
I can confirm this fixes the issue for us.  Thanks.

Comment 35 Chris Ward 2010-02-11 10:13:07 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 43 errata-xmlrpc 2010-03-30 07:18:00 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html