Bug 195874

Summary:	lockd ignores requests
Product:	[Fedora] Fedora	Reporter:	Garth Mollett <gmollett>
Component:	kernel	Assignee:	Steve Dickson <steved>
Status:	CLOSED WONTFIX	QA Contact:	Brian Brock <bbrock>
Severity:	high	Docs Contact:
Priority:	medium
Version:	5	CC:	davej, triage, wtogami
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:	OldNeedsRetesting bzcl34nup
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2008-05-06 16:01:13 UTC	Type:	---

Description Garth Mollett 2006-06-19 04:13:27 UTC

Description of problem:
Lockd appears to be ignoring requests from hosts on remote network segments.
[root@entei ~]# uname -a
Linux entei 2.6.16-1.2096_FC4smp #1 SMP Wed Apr 19 15:51:25 EDT 2006 i686 i686
i386 GNU/Linux
[root@entei ~]# rpcinfo -p
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100011    1   udp    990  rquotad
    100011    2   udp    990  rquotad
    100011    1   tcp    993  rquotad
    100011    2   tcp    993  rquotad
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100021    1   udp  32770  nlockmgr
    100021    3   udp  32770  nlockmgr
    100021    4   udp  32770  nlockmgr
    100021    1   tcp  36795  nlockmgr
    100021    3   tcp  36795  nlockmgr
    100021    4   tcp  36795  nlockmgr
    100005    1   udp   2050  mountd
    100005    1   tcp   2050  mountd
    100005    3   udp   2050  mountd
    100005    3   tcp   2050  mountd
    100024    1   udp  32771  status
    100024    1   tcp  38113  status

tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 9000 bytes
14:07:59.347750 IP (tos 0x0, ttl  63, id 0, offset 0, flags [DF], proto 17, length: 268) 192.168.0.20.1022 > 192.168.129.4.32770: [udp sum ok] UDP, length 240
14:08:19.349864 IP (tos 0x0, ttl  63, id 0, offset 0, flags [DF], proto 17, length: 284) 192.168.0.20.1022 > 192.168.129.4.32770: [udp sum ok] UDP, length 256
14:08:29.348184 IP (tos 0x0, ttl  63, id 0, offset 0, flags [DF], proto 17, length: 268) 192.168.0.20.1022 > 192.168.129.4.32770: [udp sum ok] UDP, length 240
14:08:39.347252 IP (tos 0x0, ttl  63, id 0, offset 0, flags [DF], proto 17, length: 284) 192.168.0.20.1022 > 192.168.129.4.32770: [udp sum ok] UDP, length 256

(Note that no response is ever sent.) entei == NFS server == 192.168.129.4

Lockd is running and will service requests for machines on the same subnet.
hosts.allow, exports, etc. all have the correct permissions for 192.168.0.0/24, and
everything _was_ working but stopped (seemingly at random). Nothing has been changed
on the machine; however, NFS has been extremely flaky (occasional panics on SMP
machines due to null pointer dereferences, not reliably reproducible, and I
don't have time to investigate further).
Mounts and unmounts all work fine, as does everything else other than locking (the
client hangs waiting for a response that is never sent).

Kernel stack trace for lockd:
lockd         S F4C0E864  2364  2525      1          2530  2523 (L-TLB)
f68bff48 00000002 f68bff34 f4c0e864 00000000 f8dd9b5c 28201d00 003d0dcd 
       c0392700 000000d0 00000044 00000000 f4c32f18 f4c32df0 f7e00af0 c24265e0 
       285d2600 003d0dcd 00000002 f68bf000 f6545940 f6023080 f6545940 c02c44d7 
Call Trace:
 [<f8dd9b5c>] svc_sendto+0x7c/0x263 [sunrpc]
 [<c02c44d7>] skb_recv_datagram+0x120/0x229
 [<c0327c46>] schedule_timeout+0xa7/0xd7
 [<c03287ea>] _spin_unlock_bh+0x5/0xa
 [<f8ddae8c>] svc_recv+0x36d/0x4dd [sunrpc]
 [<f8dd984f>] svc_process+0x3d4/0x665 [sunrpc]
 [<c011fc1d>] default_wake_function+0x0/0xc
 [<f8dfa5fc>] lockd+0xfd/0x253 [lockd]
 [<f8dfa4ff>] lockd+0x0/0x253 [lockd]
 [<c0102005>] kernel_thread_helper+0x5/0xb

I can't read kmem with gdb on these machines, so I can't get any more info.

I have also enabled nfs(d), rpc and nlm debugging, but nothing is logged other than
the fact that an authenticated mount request occurred.


Version-Release number of selected component (if applicable):


How reproducible:
Happens all the time.

Steps to Reproduce:
1. Run knfs/lockd on 2.6.
2. Call fcntl(fd, F_SETLK, ...) on a client on a different segment.
3.
  
Actual results:
Lockd thread on server does not send any response.

Expected results:
Lockd should respond and lock files.
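
Step 2 of the reproduction steps can be sketched from the client side as follows. This is a minimal sketch, not the reporter's actual test program: the helper name try_lock is hypothetical, and the path passed to it is assumed to be a file on the NFS mount under test. Python's fcntl.lockf() with LOCK_NB issues a fcntl(F_SETLK) call, which on an NFS mount should be forwarded to the server's lockd.

```python
import fcntl


def try_lock(path):
    """Take and release an exclusive POSIX record lock on `path`.

    fcntl.lockf() with LOCK_EX | LOCK_NB maps to fcntl(fd, F_SETLK, ...).
    The name `try_lock` is hypothetical and used only for illustration.
    """
    # "a+" opens the file read/write (creating it if needed); an exclusive
    # lock requires a descriptor that is open for writing.
    with open(path, "a+") as f:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)  # acquire the lock
        fcntl.lockf(f, fcntl.LOCK_UN)                  # release it again
    return True
```

On a local filesystem the call returns immediately. On the affected setup described here, the expectation is that it stalls until the client-side lockd call times out.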

Additional info:

Comment 1 Garth Mollett 2006-06-19 04:31:34 UTC

Sorry, that kernel stack trace is incorrect. This is the correct one:
lockd         S 00000495  2400  2554      1          2559  2552 (L-TLB)
f5a78f40 00000002 f5a78f2c 00000495 c039b700 000000d0 e5e96e00 003d08ea 
       00000206 00000002 00000044 00000000 f696acd8 f696abb0 f7e00af0 c24265e0 
       e5e96e00 003d08ea 00000002 f5a78000 f4c26c40 f4c27d00 f4c26c40 c02cbad7 
Call Trace:
 [<c02cbad7>] skb_recv_datagram+0x120/0x229
 [<c03300b4>] schedule_timeout+0xa7/0xd7
 [<c0330e0e>] _spin_lock_irqsave+0x9/0xd
 [<c013891b>] add_wait_queue+0x13/0x94
 [<f8ddccb3>] svc_sock_release+0xd1/0x168 [sunrpc]
 [<f8ddd203>] svc_recv+0x384/0x586 [sunrpc]
 [<f8ddbb0f>] svc_process+0x3d4/0x665 [sunrpc]
 [<c011fe6d>] default_wake_function+0x0/0xc
 [<f8dc06dc>] lockd+0xfd/0x253 [lockd]
 [<f8dc05df>] lockd+0x0/0x253 [lockd]
 [<c0102005>] kernel_thread_helper+0x5/0xb

And the kernel is now:
Linux entei 2.6.16-1.2115_FC4smp #1 SMP Mon Jun 5 15:01:58 EDT 2006 i686 i686
i386 GNU/Linux

Comment 2 Garth Mollett 2006-06-19 04:49:21 UTC

And as this appears to be getting "stuck" in skb_recv_datagram, it might be
a network card related issue (maybe, heh), so for the record the card is an
Intel e1000.

Creating a server on a random UDP port with netcat and communicating with
it using frames of various sizes from 192.168.0.20 works fine, as do other
UDP-based services (NFS, DNS). The problem only seems to occur with lockd and
only from hosts not in the same segment.
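
The netcat check described above can be approximated with a small script. This is a sketch under assumptions: the report only says netcat was used as the listener, whereas this version assumes an echoing peer so the sender can confirm delivery. The function names udp_echo and udp_probe are hypothetical.

```python
import socket


def udp_echo(sock, count):
    """Echo `count` datagrams back to their senders (stand-in for the listener)."""
    for _ in range(count):
        data, addr = sock.recvfrom(65535)
        sock.sendto(data, addr)


def udp_probe(host, port, sizes, timeout=2.0):
    """Send one datagram per size; map each size to True if it was echoed intact."""
    results = {}
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        for size in sizes:
            payload = bytes(size)  # `size` zero bytes
            s.sendto(payload, (host, port))
            try:
                data, _ = s.recvfrom(65535)
                results[size] = (data == payload)
            except socket.timeout:
                # No reply within the timeout: treat the datagram as lost.
                results[size] = False
    return results
```

Running udp_echo on the server and udp_probe from 192.168.0.20 with sizes on both sides of 8000 bytes would show whether plain UDP traffic of those sizes crosses the segment boundary.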

Comment 3 Dave Jones 2006-06-26 14:51:14 UTC

does this still happen on the 2.6.17 based update out last week ?

Comment 4 Garth Mollett 2006-06-27 00:44:21 UTC

(In reply to comment #3)
> does this still happen on the 2.6.17 based update out last week ?

Yes, it does.
Another interesting note: it only seems to happen with an MTU above 8000.
The network usually runs at 9000, but setting the MTU to <= 8000 on this and all
other nodes on this segment seems to work as a workaround.

Note that all other communication is fine with the MTU > 8000, including other
RPC-based services; only lockd appears to have issues.

Also the packets that get "ignored" by lockd are usually no bigger than 256
or so bytes (as can be seen from the tcpdump).
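
The sizes in the tcpdump output can be checked with a little arithmetic. The sketch below assumes a 20-byte IPv4 header (no IP options) and the standard 8-byte UDP header; the function name is hypothetical.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8    # bytes


def ip_total_length(udp_payload):
    """IP total length for a UDP datagram carrying `udp_payload` bytes."""
    return IPV4_HEADER + UDP_HEADER + udp_payload


for payload in (240, 256):
    total = ip_total_length(payload)
    # Compare against the workaround MTU of 8000 and the usual MTU of 9000.
    print(payload, total, total <= 8000, total <= 9000)
```

The totals come out to 268 and 284 bytes, matching the "length:" values in the capture, so the ignored requests are far smaller than either MTU.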

Hope that helps.

Comment 5 Dave Jones 2006-06-27 03:10:21 UTC

Out of curiosity, does it go back to normal if you do ..

echo 0 > /proc/sys/net/ipv4/tcp_window_scaling

(as root)

Comment 6 Garth Mollett 2006-06-27 03:33:54 UTC

I'm not sure how TCP window scaling could be related when we're talking UDP
here?

I can try it out for you if you want, but these are production machines so
I will have to schedule a time to do so.

I very much doubt this is a network issue (i.e. packet loss due to inconsistent
MTUs or anything of that nature).

Comment 7 Steve Dickson 2006-07-24 18:27:27 UTC

Is there any output when NLM debugging is turned on
by doing the following: "echo 2 > /proc/sys/sunrpc/nlm_debug"?

Comment 8 Garth Mollett 2006-08-14 09:02:12 UTC

(In reply to comment #5)
> out of curiousity, does it go back to normal if you do ..
> 
> echo 0 > /proc/sys/net/ipv4/tcp_window_scaling
> 
> (as root)

As expected, no change.

Comment 9 Garth Mollett 2006-08-14 09:14:36 UTC

(In reply to comment #7)
> Is there any output when an NLM debugging is turned on when
> the following is done "echo 2 > /proc/sys/sunrpc/nlm_debug"

Yes, but nothing really useful or unexpected.

Without the workaround, calling fcntl() from the client produces the following
in the client logs:

Aug 14 19:10:34 client kernel: lockd: call procedure 2 on server_ip
Aug 14 19:11:44 client kernel: lockd: server server_ip not responding, timed out
Aug 14 19:11:44 client kernel: lockd: rpc_call returned error 5
Aug 14 19:11:44 client kernel: lockd: clnt proc returns -5

And nothing on the server (although the packets can be seen in tcpdump).
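
The numeric codes in the client log can be decoded with the standard errno table. This sketch assumes the values follow the usual kernel convention of returning a negated errno, which the log itself does not state; the helper name is hypothetical.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value, accepting negated codes."""
    n = abs(code)
    return errno.errorcode.get(n, "UNKNOWN"), os.strerror(n)


name, message = describe_errno(-5)
print(name, message)
```

On Linux, 5 maps to EIO, a generic I/O error, which would be consistent with the preceding "not responding, timed out" line.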

With the workaround enabled (MTU dropped to 8000), fcntl() completes and
we see the following:

Aug 14 19:13:44 client kernel: lockd: call procedure 2 on server_ip
Aug 14 19:13:44 client kernel: lockd: server returns status 0
Aug 14 19:13:44 client kernel: lockd: clnt proc returns 0

Aug 14 19:13:44 server kernel: lockd: LOCK          called
Aug 14 19:13:44 server kernel: lockd: LOCK          status 0
Aug 14 19:13:46 server kernel: lockd: UNLOCK        called
Aug 14 19:13:46 server kernel: lockd: UNLOCK        status 0

Comment 10 Dave Jones 2006-09-17 01:45:21 UTC

[This comment added as part of a mass-update to all open FC4 kernel bugs]

FC4 has now transitioned to the Fedora legacy project, which will continue to
release security related updates for the kernel.  As this bug is not security
related, it is unlikely to be fixed in an update for FC4, and has been migrated
to FC5.

Please retest with Fedora Core 5.

Thank you.

Comment 11 Dave Jones 2006-10-16 20:29:40 UTC

A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 12 Garth Mollett 2006-10-30 06:21:54 UTC

Please don't close this bug. I will try to get a test environment set up
(unless someone else can?) so I can retest. Obviously I can't just upgrade a
bunch of production servers to FC5; plans are in place to do the upgrade, but we
have no strict timeline until further testing is done.

It's very unlikely that we will be able to have the test environment up and
running and everything tested within two weeks, though.

Thanks.


Comment 13 Jon Stanley 2008-03-31 18:31:26 UTC

Removing NeedsRetesting from whiteboard so we can repurpose it.

Comment 14 Bug Zapper 2008-04-04 03:07:11 UTC

Fedora apologizes that these issues have not been resolved yet. We're
sorry it's taken so long for your bug to be properly triaged and acted
on. We appreciate the time you took to report this issue and want to
make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6,
please note that Fedora no longer maintains these releases. We strongly
encourage you to upgrade to a current Fedora release. In order to
refocus our efforts as a project we are flagging all of the open bugs
for releases which are no longer maintained and closing them.
http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6, thirty days
from now, it will be closed 'WONTFIX'. If you can reproduce this bug in
the latest Fedora version, please change to the respective version. If
you are unable to do this, please add a comment to this bug requesting
the change.

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we are following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

And if you'd like to join the bug triage team to help make things
better, check out http://fedoraproject.org/wiki/BugZappers

Comment 15 Bug Zapper 2008-05-06 16:01:11 UTC

This bug is open for a Fedora version that is no longer maintained and
will not be fixed by Fedora. Therefore we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.