Bug 136423 - ls -l hangs on NFS mounted partitions
Status: CLOSED DUPLICATE of bug 118839
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686 Linux
Priority: medium   Severity: high
Assigned To: Steve Dickson
Reported: 2004-10-19 18:19 EDT by Carl Mason
Modified: 2007-11-30 17:07 EST
8 users

Doc Type: Bug Fix
Last Closed: 2005-03-14 16:24:10 EST


Attachments
alt-sysreq-t output (65.28 KB, text/plain), 2004-11-18 23:34 EST, viawest
soradev sysrq-t/m output (35.71 KB, text/plain), 2004-11-19 09:46 EST, Chris Sorisio

Description Carl Mason 2004-10-19 18:19:13 EDT
Description of problem:

On NFS partitions mounted either manually or via autofs, ls -l hangs
and produces log messages of this type:
kernel: RPC: buffer allocation failed for task f4a79ca8

ls without the -l flag works fine.


Version-Release number of selected component (if applicable):
RHEL version 3
 2.4.21-20.ELsmp #1 SMP Wed Aug 18 20:46:40 EDT 2004 i686 i686 i386
GNU/Linux

How reproducible: completely, although the behavior is new.




Steps to Reproduce:
1. Either cd to a directory in an automounted partition, OR NFS-mount a
partition and cd to that.
2. ls -l
Actual results:

process hangs ; log file grows

Expected results:

file listing

Additional info:

The most recent change to the system was adding an LVM logical drive;
I believe the problem started after this. It's version 1.0.8-2, but I am
not sure of this. This is probably related to bug 133246.
Comment 2 Steve Dickson 2004-10-20 11:26:13 EDT
Reassigning to our VM guy....
Comment 3 Carl Mason 2004-11-10 11:14:21 EST
Uh... this problem went away after a reboot.  It has not returned in 3
weeks.  Perhaps this is a false alarm; sorry.
Comment 4 Larry Woodman 2004-11-10 11:24:21 EST
OK, if this ever happens again please get me an "AltSysrq-T" output
so I can see exactly what the "ls -l" process is stuck on.

Larry Woodman
Comment 5 Agustin Gomez Roca 2004-11-15 10:49:18 EST
I've been having the same problem since last Friday on any NFS-mounted
filesystem.
Is there a patch?
Comment 6 Chris Sorisio 2004-11-17 12:26:09 EST
I can create this error almost on demand.  We are in the middle of an
Oracle upgrade project, and we mount our stage files (roughly 18GB of
data) from a Linux NFS share on another server.  Partway through the
installation process, we experience excessive NFS slowdowns on the
RHEL3 client side.  The only way to get things moving again is to set
inactive_clean_percent to 100.

However, even with that set, 'ls -l' on the mounted share hangs, and
dmesg shows the reported RPC buffer error.
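For reference, a sketch of the inactive_clean_percent workaround described
above; the /proc path for this RHEL 3 VM tunable is an assumption here, so
verify it exists on your kernel before writing to it:

```shell
# Hypothetical sketch of the workaround above.  The /proc/sys/vm path
# is assumed; check that it exists on your kernel first.
if [ -f /proc/sys/vm/inactive_clean_percent ]; then
    cat /proc/sys/vm/inactive_clean_percent        # current value
    echo 100 > /proc/sys/vm/inactive_clean_percent # force aggressive cleaning
fi
```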
Comment 7 Chris Sorisio 2004-11-17 12:40:55 EST
sysrq-t output for a frozen 'ls -l':

ls            S 00000000     0  1346   2278                     (NOTLB)
Call Trace:   [<c015878a>] __alloc_pages [kernel] 0x27a (0xd23dfb84)
[<c0123e24>] schedule [kernel] 0x2f4 (0xd23dfb98)
[<c0134e75>] schedule_timeout [kernel] 0x65 (0xd23dfbdc)
[<c0134e00>] process_timeout [kernel] 0x0 (0xd23dfbfc)
[<f8b827f2>] rpc_allocate_Rsmp_0cd1c989 [sunrpc] 0x82 (0xd23dfc14)
[<f8b7da27>] call_allocate [sunrpc] 0x67 (0xd23dfc34)
[<f8b822e9>] __rpc_execute [sunrpc] 0x109 (0xd23dfc4c)
[<f8b840f1>] rpcauth_bindcred_Rsmp_fa665c09 [sunrpc] 0x31 (0xd23dfc68)
[<f8b7d59d>] rpc_call_sync_Rsmp_c357b490 [sunrpc] 0xbd (0xd23dfc8c)
[<f8b918d8>] all_tasks [sunrpc] 0x0 (0xd23dfcb8)
[<f8b7dad0>] call_encode [sunrpc] 0x0 (0xd23dfcec)
[<f8b816e0>] rpc_run_timer [sunrpc] 0x0 (0xd23dfd0c)
[<f8bb30b4>] nfs3_rpc_wrapper [nfs] 0x44 (0xd23dfd48)
[<f8bb4bfc>] nfs3_proc_getacl [nfs] 0x10c (0xd23dfd70)
[<c015f184>] __pte_chain_free [kernel] 0x24 (0xd23dfdec)
[<c0142fa8>] do_no_page [kernel] 0x368 (0xd23dfdf8)
[<f8ba6641>] nfs_lookup_revalidate [nfs] 0x51 (0xd23dfe18)
[<f8bb8bd5>] nfs_getxattr [nfs] 0x155 (0xd23dfe3c)
[<c0185795>] getxattr [kernel] 0x105 (0xd23dfe60)
[<f8b83fcc>] rpcauth_lookup_credcache [sunrpc] 0x7c (0xd23dfec0)
[<f8b8421e>] put_rpccred_Rsmp_89a5cf35 [sunrpc] 0x1e (0xd23dfed0)
[<f8b91960>] rpc_credcache_lock [sunrpc] 0x0 (0xd23dfed8)
[<f8ba7f46>] nfs_permission [nfs] 0x1a6 (0xd23dfee0)
[<c017c890>] dput [kernel] 0x30 (0xd23dff04)
[<c0172266>] link_path_walk [kernel] 0x656 (0xd23dff18)
[<c017291c>] __user_walk [kernel] 0x5c (0xd23dff68)
[<c0185897>] sys_getxattr [kernel] 0x57 (0xd23dff84)

RPC: buffer allocation failed for task d23dfca8
Comment 8 Larry Woodman 2004-11-17 13:27:09 EST
Chris: First, does this only happen on NFS mounts?  Second, can you
get AltSysrq-M and "cat /proc/slabinfo" outputs using the latest
RHEL3-U4 kernel?

BTW, this process appears to be stuck in rpc_allocate():
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
        do {
                if ((buffer = (u32 *) kmalloc(size, gfp)) != NULL) {
                        dprintk("RPC:      allocated buffer %p\n", buffer);
                        return buffer;
                }
                if ((flags & RPC_TASK_SWAPPER) && size <= sizeof(swap_buffer)
                    && rpc_lock_swapbuf()) {
                        dprintk("RPC:      used last-ditch swap buffer\n");
                        return swap_buffer;
                }
                if (flags & RPC_TASK_ASYNC)
                        return NULL;
                set_current_state(TASK_INTERRUPTIBLE);
                schedule_timeout(HZ/50);
        } while (!signalled());
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Larry Woodman 

Comment 9 Chris Sorisio 2004-11-17 14:19:28 EST
Larry,

1.  I've only tested against NFS and local filesystems.  The behavior
does not occur when 'ls -l' is run on a local filesystem.

2.

uname -a output:  

Linux soradev.peaktechnical.com 2.4.21-20.ELsmp #1 SMP Wed Aug 18
20:46:40 EDT 2004 i686 i686 i386 GNU/Linux

alt-sysrq-m output:

SysRq : Show Memory
Mem-info:
Zone:DMA freepages:  2883 min:     0 low:     0 high:     0
Zone:Normal freepages: 27763 min:  1279 low:  4544 high:  6304
Zone:HighMem freepages: 23513 min:   255 low:  8702 high: 13053
Free pages:       54159 ( 23513 HighMem)
( Active: 144502/228825, inactive_laundry: 192099, inactive_clean:
36794, free: 54159 )
  aa:0 ac:0 id:0 il:0 ic:0 fr:2883
  aa:3480 ac:68556 id:0 il:18 ic:0 fr:27763
  aa:11139 ac:61327 id:228825 il:192081 ic:36794 fr:23513
1*4kB 1*8kB 2*16kB 3*32kB 2*64kB 0*128kB 2*256kB 1*512kB 0*1024kB
1*2048kB 2*4096kB = 11532kB)
22541*4kB 2471*8kB 70*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB
0*1024kB 0*2048kB 0*4096kB = 111052kB)
15727*4kB 2733*8kB 480*16kB 20*32kB 1*64kB 1*128kB 1*256kB 1*512kB
0*1024kB 0*2048kB 0*4096kB = 94052kB)
Swap cache: add 3691, delete 3689, find 223/394, race 0+0
107628 pages of slabcache
114 pages of kernel stacks
0 lowmem pagetables, 543 highmem pagetables
Free swap:       4076048kB
786395 pages of RAM
557019 pages of HIGHMEM
15808 reserved pages
243527 pages shared
2 pages swap cached

cat /proc/slabinfo output:

slabinfo - version: 1.1 (SMP)
kmem_cache           128    128    244    8    8    1 : 1008  252
nfs_write_data         0      0    384    0    0    1 :  496  124
nfs_read_data          0      0    384    0    0    1 :  496  124
nfs_page               0      0    128    0    0    1 : 1008  252
ip_fib_hash           10    224     32    2    2    1 : 1008  252
ip_conntrack          80     80    384    8    8    1 :  496  124
ext3_xattr             0      0     44    0    0    1 : 1008  252
journal_head        2004  21483     48   71  279    1 : 1008  252
revoke_table           2    500     12    2    2    1 : 1008  252
revoke_record        448    448     32    4    4    1 : 1008  252
clip_arp_cache         0      0    256    0    0    1 : 1008  252
ip_mrt_cache           0      0    128    0    0    1 : 1008  252
tcp_tw_bucket         90     90    128    3    3    1 : 1008  252
tcp_bind_bucket      672    672     32    6    6    1 : 1008  252
tcp_open_request      60     60    128    2    2    1 : 1008  252
inet_peer_cache       58     58     64    1    1    1 : 1008  252
secpath_cache          0      0    128    0    0    1 : 1008  252
xfrm_dst_cache         0      0    256    0    0    1 : 1008  252
ip_dst_cache        2733   2985    256  192  199    1 : 1008  252
arp_cache             45     45    256    3    3    1 : 1008  252
flow_cache             0      0    128    0    0    1 : 1008  252
blkdev_requests     5120   6030    128  201  201    1 : 1008  252
kioctx                 0      0    128    0    0    1 : 1008  252
kiocb                  0      0    128    0    0    1 : 1008  252
dnotify_cache         11    498     20    3    3    1 : 1008  252
file_lock_cache      160    160     96    4    4    1 : 1008  252
async_poll_table       0      0    140    0    0    1 : 1008  252
fasync_cache         200    200     16    1    1    1 : 1008  252
uid_cache            560    560     32    5    5    1 : 1008  252
skbuff_head_cache   2282   3542    168  139  154    1 : 1008  252
sock                 510    570   1408  113  114    2 :  240   60
sigqueue            1903   2407    132   79   83    1 : 1008  252
kiobuf                 0      0    128    0    0    1 : 1008  252
cdev_cache           846   2610     64   20   45    1 : 1008  252
bdev_cache           812    812     64   14   14    1 : 1008  252
mnt_cache            116    116     64    2    2    1 : 1008  252
inode_cache       519033 521801    512 74543 74543    1 :  496  124
dentry_cache      401544 439050    128 14635 14635    1 : 1008  252
dquot                  0      0    128    0    0    1 : 1008  252
filp                1890   1890    128   63   63    1 : 1008  252
names_cache           60     60   4096   60   60    1 :  240   60
buffer_head       230053 400680    108 11325 11448    1 : 1008  252
mm_struct            460    460    384   46   46    1 :  496  124
vm_area_struct      4424   5936     68  106  106    1 : 1008  252
fs_cache             812    812     64   14   14    1 : 1008  252
files_cache          455    455    512   65   65    1 :  496  124
signal_cache         638    638     64   11   11    1 : 1008  252
sighand_cache        405    405   1408   81   81    2 :  240   60
pte_chain           5520  20640    128  394  688    1 : 1008  252
pae_pgd              812    812     64   14   14    1 : 1008  252
size-131072(DMA)       0      0 131072    0    0   32 :    0    0
size-131072            0      0 131072    0    0   32 :    0    0
size-65536(DMA)        0      0  65536    0    0   16 :    0    0
size-65536             1      1  65536    1    1   16 :    0    0
size-32768(DMA)        0      0  32768    0    0    8 :    0    0
size-32768             0      0  32768    0    0    8 :    0    0
size-16384(DMA)        0      0  16384    0    0    4 :    0    0
size-16384            20     21  16384   20   21    4 :    0    0
size-8192(DMA)         0      0   8192    0    0    2 :    0    0
size-8192              7     15   8192    7   15    2 :    0    0
size-4096(DMA)         0      0   4096    0    0    1 :  240   60
size-4096            883   1184   4096  883 1184    1 :  240   60
size-2048(DMA)         0      0   2048    0    0    1 :  240   60
size-2048            348    468   2048  186  234    1 :  240   60
size-1024(DMA)         0      0   1024    0    0    1 :  496  124
size-1024            744    744   1024  186  186    1 :  496  124
size-512(DMA)          0      0    512    0    0    1 :  496  124
size-512            1064   1560    512  146  195    1 :  496  124
size-256(DMA)          0      0    256    0    0    1 : 1008  252
size-256            1956   2460    256  164  164    1 : 1008  252
size-128(DMA)          0      0    128    0    0    1 : 1008  252
size-128            4488   4740    128  158  158    1 : 1008  252
size-64(DMA)           0      0    128    0    0    1 : 1008  252
size-64            32478  35250    128 1134 1175    1 : 1008  252
size-32(DMA)           0      0     64    0    0    1 : 1008  252
size-32            57424  63220     64 1063 1090    1 : 1008  252
Comment 11 Chris Sorisio 2004-11-18 14:26:24 EST
I performed another test today.  First, I ran 'ls -l' on the
NFS-mounted share to confirm the freeze would still occur.  It did.  I
then unmounted the share and remounted it as a Samba share.  (3.0.6 on
the server, 3.0.7 on the client.)  Same server, same client, just
using Samba instead of NFS to mount the share.

'ls -l' ran properly.  There was no freeze.  I'm going to try another
Oracle install via the smb-mounted share and see if it develops the
same performance issues experienced under NFS.


Comment 12 Agustin Gomez Roca 2004-11-18 14:33:04 EST
I can't do further analysis because my server crashed and needs to be
reinstalled; perhaps later.
Comment 14 viawest 2004-11-18 20:26:09 EST
We are seeing the same issue.  On an nfs mounted directory, "ls" works fine.  When 
performing an "ls -la" the command hangs with no output.

Here's the troubleshooting data.  We forgot the slabinfo before rebooting, but I'll grab that 
if we see the problem again.  After rebooting the problem does not immediately return for 
us.

-------------------
uname -a / uptime
-------------------
Linux blah.foo.net 2.4.21-20.ELsmp #1 SMP Wed Aug 18 20:46:40 EDT 2004 i686 i686 
i386 GNU/Linux
 17:48:36  up 8 days, 22:01,  1 user,  load average: 10.91, 9.69, 7.48

----------------------------
related automount configs:
----------------------------
[root@blah root]# cat /etc/auto.master
--snip--
/export/home            /etc/auto.home  --timeout=60

[root@blah root]# cat /etc/auto.home 
*       -rw,nolock,tcp  nfsserver.foo.net:/export/home/&


------------------------------------------
alt-sysrq-m (hung ls -la /export/home)
------------------------------------------
Nov 18 17:46:22 blah kernel: SysRq : Show Memory
Nov 18 17:46:22 blah kernel: Mem-info:
Nov 18 17:46:22 blah kernel: Zone:DMA freepages:  2572 min:     0 low:     0 high:     0
Nov 18 17:46:22 blah kernel: Zone:Normal freepages: 29990 min:   766 low:  4031 high:  
5791
Nov 18 17:46:22 blah kernel: Zone:HighMem freepages:   496 min:   255 low:   510 high:   
765
Nov 18 17:46:22 blah kernel: Free pages:       33058 (   496 HighMem)
Nov 18 17:46:22 blah kernel: ( Active: 143951/5240, inactive_laundry: 1350, 
inactive_clean: 230, free: 33058 )
Nov 18 17:46:22 blah kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:2572
Nov 18 17:46:22 blah kernel:   aa:1201 ac:118623 id:52 il:0 ic:0 fr:29990
Nov 18 17:46:22 blah kernel:   aa:2146 ac:21981 id:5188 il:1350 ic:230 fr:496
Nov 18 17:46:22 blah kernel: 0*4kB 0*8kB 3*16kB 2*32kB 1*64kB 1*128kB 1*256kB 
1*512kB 1*1024kB 0*2048kB 2*4096kB = 10288kB)
Nov 18 17:46:22 blah kernel: 21276*4kB 3981*8kB 168*16kB 10*32kB 0*64kB 0*128kB 
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 119960kB)
Nov 18 17:46:22 blah kernel: 32*4kB 102*8kB 3*16kB 1*32kB 1*64kB 1*128kB 1*256kB 
1*512kB 0*1024kB 0*2048kB 0*4096kB = 1984kB)
Nov 18 17:46:22 blah kernel: Swap cache: add 178303, delete 176808, find 260684/
268875, race 0+0
Nov 18 17:46:22 blah kernel: 68811 pages of slabcache
Nov 18 17:46:22 blah kernel: 150 pages of kernel stacks
Nov 18 17:46:22 blah kernel: 0 lowmem pagetables, 642 highmem pagetables
Nov 18 17:46:22 blah kernel: Free swap:       2023936kB
Nov 18 17:46:22 blah kernel: 262016 pages of RAM
Nov 18 17:46:22 blah kernel: 32624 pages of HIGHMEM
Nov 18 17:46:22 blah kernel: 5793 reserved pages
Nov 18 17:46:22 blah kernel: 90010 pages shared
Nov 18 17:46:22 blah kernel: 1495 pages swap cached

------------------------------------------
alt-sysrq-t (hung ls -la /export/home)
------------------------------------------
Nov 18 17:46:36 blah kernel: ls            D 00000001     0 24309  24209                     
(NOTLB)
Nov 18 17:46:36 blah kernel: Call Trace:   [<c0123e24>] schedule [kernel] 0x2f4 
(0xc425ddd8)
Nov 18 17:46:36 blah kernel: [<c015f184>] __pte_chain_free [kernel] 0x24 (0xc425ddec)
Nov 18 17:46:36 blah kernel: [<c010adb3>] __down [kernel] 0x73 (0xc425de1c)
Nov 18 17:46:36 blah kernel: [<f995e9a3>] ext3_getxattr [ext3] 0x53 (0xc425de34)
Nov 18 17:46:36 blah kernel: [<c010af5c>] __down_failed [kernel] 0x8 (0xc425de50)
Nov 18 17:46:36 blah kernel: [<f9b7b960>] nfs_dir_inode_operations [nfs] 0x0 
(0xc425de58)
Nov 18 17:46:36 blah kernel: [<c0185e12>] .text.lock.xattr [kernel] 0x29 (0xc425de60)
Nov 18 17:46:36 blah kernel: [<c01435d1>] handle_mm_fault [kernel] 0xd1 (0xc425dec0)
Nov 18 17:46:36 blah kernel: [<c017fd9b>] update_atime [kernel] 0x6b (0xc425ded4)
Nov 18 17:46:36 blah kernel: [<f9b2a132>] autofs4_revalidate [autofs4] 0xf2 
(0xc425dee4)
Nov 18 17:46:36 blah kernel: [<c017c890>] dput [kernel] 0x30 (0xc425df04)
Nov 18 17:46:36 blah kernel: [<c0172266>] link_path_walk [kernel] 0x656 (0xc425df18)
Nov 18 17:46:36 blah kernel: [<c017291c>] __user_walk [kernel] 0x5c (0xc425df68)
Nov 18 17:46:36 blah kernel: [<c0185897>] sys_getxattr [kernel] 0x57 (0xc425df84)


thanks,

cody and paul
Comment 15 Larry Woodman 2004-11-18 22:36:35 EST
This looks different from the previous problem; here
nfs_dir_inode_operations() is downing a semaphore and never
continuing.  Can you reproduce this and get a complete AltSysrq-T so
I can see who is holding the semaphore?

Larry Woodman
Comment 16 viawest 2004-11-18 23:34:19 EST
Created attachment 107028 [details]
alt-sysreq-t output

Here's the entire output of the alt-sysreq-t from the previous issue.  I might
have grabbed the wrong ls command last time; there were already several
hanging around in various states.
Comment 17 Chris Sorisio 2004-11-19 09:46:37 EST
Created attachment 107056 [details]
soradev sysrq-t/m output

I've added a file containing the full sysrq-t and m data while 'ls -l' was
blocked on an NFS share.
Comment 18 Larry Woodman 2004-11-19 10:40:01 EST
OK, this is the exact same problem.  There is plenty of free memory
available, but nothing larger than 32KB (8 pages) of contiguous memory
is available.
 
20178*4kB 5990*8kB 151*16kB 1*32kB 0*64kB


Larry
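Larry's reading of that line can be reproduced mechanically: each field of
the AltSysrq-M buddy free list is free-block-count times block-size, and the
largest size with a nonzero count bounds the biggest contiguous allocation
that can succeed. A small illustrative sketch, not part of the original
report:

```shell
# Find the largest free block size (in kB) with a nonzero count in a
# buddy-allocator free-list line, as printed by AltSysrq-M.
line='20178*4kB 5990*8kB 151*16kB 1*32kB 0*64kB'
echo "$line" | tr ' ' '\n' \
    | awk -F'[*kB]' '$1 + 0 > 0 && $2 + 0 > max { max = $2 + 0 } END { print max "kB" }'
# prints: 32kB
```

Any kmalloc larger than that value must sleep and retry, which is exactly
the loop the "ls -l" process is stuck in.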
Comment 22 Steve Dickson 2004-12-21 20:11:02 EST
If the filesystem is mounted with the
"-o rsize=16320,wsize=16320" mount
option, does the hang still occur?
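On the command line that suggestion takes roughly the following shape; the
server name and paths below are placeholders, not details from this report:

```shell
# Placeholder example of the suggested rsize/wsize mount options;
# hostname and export path are hypothetical.
mount -t nfs -o rsize=16320,wsize=16320 nfsserver.example.com:/export/home /mnt/home
```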
Comment 23 Chris Sorisio 2004-12-22 07:42:51 EST
The server has been rebuilt with RedHat AS 2.1.  I had experienced the
hang under 3.0 with the rsize set higher, and the share was read-only;
do you need me to test it with those specific settings?  If so, I can
rebuild the server and test.
Comment 24 Steve Dickson 2004-12-22 11:51:32 EST
Try adding the following line to /etc/modules.conf
    options nfs nfs3_acl_max_entries=256
on the client and then reboot.

After the reboot, running "sysctl fs.nfs.nfs3_acl_max_entries" as root
on the client should report 256
(instead of the default of 1024).

Let me know if this helps.
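As a back-of-the-envelope check on why this helps (the per-entry size below
is an assumption, not a figure from this report): if each NFSv3 ACL entry
XDR-encodes to roughly 12 bytes and a getacl reply buffer must hold both the
access and default ACLs, then kmalloc, which rounds requests up to the next
power-of-two slab, needs a 32 kB slab (8 contiguous pages) for 1024 entries,
matching the allocation size comment 18 shows failing, but only an 8 kB slab
for 256:

```shell
# Rough arithmetic only; the 12-byte per-entry figure is an assumption.
# Compute the power-of-two kmalloc slab (in kB) needed for a getacl
# reply buffer holding two ACLs (access + default) of a given maximum
# entry count.
acl_slab_kb() {
    raw=$((2 * $1 * 12))    # assumed bytes: 2 ACLs x entries x 12 B/entry
    kb=1
    while [ $((kb * 1024)) -lt "$raw" ]; do
        kb=$((kb * 2))      # kmalloc rounds up to the next power of two
    done
    echo "$kb"
}

for n in 1024 256; do
    echo "nfs3_acl_max_entries=$n -> $(acl_slab_kb "$n") kB slab"
done
# prints:
# nfs3_acl_max_entries=1024 -> 32 kB slab
# nfs3_acl_max_entries=256 -> 8 kB slab
```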
Comment 25 Steve Dickson 2005-01-07 10:00:47 EST
Just wondering if you got a chance to make the
changes in comment #24... that should take care
of the problem...
Comment 26 Chris Sorisio 2005-01-07 10:28:48 EST
As per comment #23, we had to rebuild the server with AS 2.1.  They needed to
proceed with the project before a fix was available.  I will try to make a test
system available so I can verify the recommended fix.
Comment 28 Steve Dickson 2005-03-14 09:37:30 EST
Decreasing nfs3_acl_max_entries to 256 (comment #24)
should take care of the issue. Setting status to Modified
until proven otherwise.
Comment 29 Ernie Petrides 2005-03-14 16:24:10 EST
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html



*** This bug has been marked as a duplicate of 118839 ***
