Description of problem:
On NFS partitions mounted either manually or via autofs, "ls -l" hangs and produces log messages of this type:

kernel: RPC: buffer allocation failed for task f4a79ca8

"ls" without the flag works fine.

Version-Release number of selected component (if applicable):
RHEL 3, 2.4.21-20.ELsmp #1 SMP Wed Aug 18 20:46:40 EDT 2004 i686 i686 i386 GNU/Linux

How reproducible:
Completely, although the behavior is new.

Steps to Reproduce:
1. Either cd to a directory in an automounted partition, OR mount an NFS partition and cd into it
2. Run "ls -l"

Actual results:
The process hangs; the log file grows.

Expected results:
A file listing.

Additional info:
The most recent change to the system was adding an LVM logical drive; I believe the problem started after this. It is version 1.0.8-2, but I am not sure of this. This is probably related to bug 133246.
Reassigning to our VM guy....
Uh... this problem went away after a reboot and has not returned in 3 weeks. Perhaps this was a false alarm. Sorry.
OK, if this ever happens again, please get me an "AltSysrq-T" output so I can see exactly what the "ls -l" process is stuck on.

Larry Woodman
I've been having the same problem since last Friday on any NFS-mounted filesystem. Is there a patch?
I can create this error almost on demand. We are in the middle of an Oracle upgrade project, and we mount our stage files (roughly 18GB of data) from a Linux NFS share on another server. Partway through the installation process, we experience excessive NFS slowdowns on the RHEL3 client side. The only way to get things moving again is to set inactive_clean_percent to 100. However, even with that set, 'ls -l' on the mounted share hangs, and dmesg shows the reported RPC buffer error.
sysrq-t output for a frozen 'ls -l':

ls            S 00000000     0  1346   2278          (NOTLB)
Call Trace:   [<c015878a>] __alloc_pages [kernel] 0x27a (0xd23dfb84)
[<c0123e24>] schedule [kernel] 0x2f4 (0xd23dfb98)
[<c0134e75>] schedule_timeout [kernel] 0x65 (0xd23dfbdc)
[<c0134e00>] process_timeout [kernel] 0x0 (0xd23dfbfc)
[<f8b827f2>] rpc_allocate_Rsmp_0cd1c989 [sunrpc] 0x82 (0xd23dfc14)
[<f8b7da27>] call_allocate [sunrpc] 0x67 (0xd23dfc34)
[<f8b822e9>] __rpc_execute [sunrpc] 0x109 (0xd23dfc4c)
[<f8b840f1>] rpcauth_bindcred_Rsmp_fa665c09 [sunrpc] 0x31 (0xd23dfc68)
[<f8b7d59d>] rpc_call_sync_Rsmp_c357b490 [sunrpc] 0xbd (0xd23dfc8c)
[<f8b918d8>] all_tasks [sunrpc] 0x0 (0xd23dfcb8)
[<f8b7dad0>] call_encode [sunrpc] 0x0 (0xd23dfcec)
[<f8b816e0>] rpc_run_timer [sunrpc] 0x0 (0xd23dfd0c)
[<f8bb30b4>] nfs3_rpc_wrapper [nfs] 0x44 (0xd23dfd48)
[<f8bb4bfc>] nfs3_proc_getacl [nfs] 0x10c (0xd23dfd70)
[<c015f184>] __pte_chain_free [kernel] 0x24 (0xd23dfdec)
[<c0142fa8>] do_no_page [kernel] 0x368 (0xd23dfdf8)
[<f8ba6641>] nfs_lookup_revalidate [nfs] 0x51 (0xd23dfe18)
[<f8bb8bd5>] nfs_getxattr [nfs] 0x155 (0xd23dfe3c)
[<c0185795>] getxattr [kernel] 0x105 (0xd23dfe60)
[<f8b83fcc>] rpcauth_lookup_credcache [sunrpc] 0x7c (0xd23dfec0)
[<f8b8421e>] put_rpccred_Rsmp_89a5cf35 [sunrpc] 0x1e (0xd23dfed0)
[<f8b91960>] rpc_credcache_lock [sunrpc] 0x0 (0xd23dfed8)
[<f8ba7f46>] nfs_permission [nfs] 0x1a6 (0xd23dfee0)
[<c017c890>] dput [kernel] 0x30 (0xd23dff04)
[<c0172266>] link_path_walk [kernel] 0x656 (0xd23dff18)
[<c017291c>] __user_walk [kernel] 0x5c (0xd23dff68)
[<c0185897>] sys_getxattr [kernel] 0x57 (0xd23dff84)

RPC: buffer allocation failed for task d23dfca8
Chris;

First, does this only happen on NFS mounts? Second, can you get AltSysrq-M and "cat /proc/slabinfo" outputs using the latest RHEL3-U4 kernel?

BTW, this process appears to be stuck in rpc_allocate():
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
	do {
		if ((buffer = (u32 *) kmalloc(size, gfp)) != NULL) {
			dprintk("RPC: allocated buffer %p\n", buffer);
			return buffer;
		}
		if ((flags & RPC_TASK_SWAPPER) && size <= sizeof(swap_buffer)
		    && rpc_lock_swapbuf()) {
			dprintk("RPC: used last-ditch swap buffer\n");
			return swap_buffer;
		}
		if (flags & RPC_TASK_ASYNC)
			return NULL;
		set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout(HZ/50);
	} while (!signalled());
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Larry Woodman
Larry,

1. I've only tested against NFS and local filesystems. The behavior does not occur when 'ls -l' is run on a local filesystem.

2. uname -a output:
Linux soradev.peaktechnical.com 2.4.21-20.ELsmp #1 SMP Wed Aug 18 20:46:40 EDT 2004 i686 i686 i386 GNU/Linux

alt-sysrq-m output:
SysRq : Show Memory
Mem-info:
Zone:DMA freepages:  2883 min:     0 low:     0 high:     0
Zone:Normal freepages: 27763 min:  1279 low:  4544 high:  6304
Zone:HighMem freepages: 23513 min:   255 low:  8702 high: 13053
Free pages:       54159 ( 23513 HighMem)
( Active: 144502/228825, inactive_laundry: 192099, inactive_clean: 36794, free: 54159 )
aa:0 ac:0 id:0 il:0 ic:0 fr:2883
aa:3480 ac:68556 id:0 il:18 ic:0 fr:27763
aa:11139 ac:61327 id:228825 il:192081 ic:36794 fr:23513
1*4kB 1*8kB 2*16kB 3*32kB 2*64kB 0*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11532kB)
22541*4kB 2471*8kB 70*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 111052kB)
15727*4kB 2733*8kB 480*16kB 20*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 94052kB)
Swap cache: add 3691, delete 3689, find 223/394, race 0+0
107628 pages of slabcache
114 pages of kernel stacks
0 lowmem pagetables, 543 highmem pagetables
Free swap:       4076048kB
786395 pages of RAM
557019 pages of HIGHMEM
15808 reserved pages
243527 pages shared
2 pages swap cached

cat /proc/slabinfo output:
slabinfo - version: 1.1 (SMP)
kmem_cache               128    128    244     8     8    1 : 1008  252
nfs_write_data             0      0    384     0     0    1 :  496  124
nfs_read_data              0      0    384     0     0    1 :  496  124
nfs_page                   0      0    128     0     0    1 : 1008  252
ip_fib_hash               10    224     32     2     2    1 : 1008  252
ip_conntrack              80     80    384     8     8    1 :  496  124
ext3_xattr                 0      0     44     0     0    1 : 1008  252
journal_head            2004  21483     48    71   279    1 : 1008  252
revoke_table               2    500     12     2     2    1 : 1008  252
revoke_record            448    448     32     4     4    1 : 1008  252
clip_arp_cache             0      0    256     0     0    1 : 1008  252
ip_mrt_cache               0      0    128     0     0    1 : 1008  252
tcp_tw_bucket             90     90    128     3     3    1 : 1008  252
tcp_bind_bucket          672    672     32     6     6    1 : 1008  252
tcp_open_request          60     60    128     2     2    1 : 1008  252
inet_peer_cache           58     58     64     1     1    1 : 1008  252
secpath_cache              0      0    128     0     0    1 : 1008  252
xfrm_dst_cache             0      0    256     0     0    1 : 1008  252
ip_dst_cache            2733   2985    256   192   199    1 : 1008  252
arp_cache                 45     45    256     3     3    1 : 1008  252
flow_cache                 0      0    128     0     0    1 : 1008  252
blkdev_requests         5120   6030    128   201   201    1 : 1008  252
kioctx                     0      0    128     0     0    1 : 1008  252
kiocb                      0      0    128     0     0    1 : 1008  252
dnotify_cache             11    498     20     3     3    1 : 1008  252
file_lock_cache          160    160     96     4     4    1 : 1008  252
async_poll_table           0      0    140     0     0    1 : 1008  252
fasync_cache             200    200     16     1     1    1 : 1008  252
uid_cache                560    560     32     5     5    1 : 1008  252
skbuff_head_cache       2282   3542    168   139   154    1 : 1008  252
sock                     510    570   1408   113   114    2 :  240   60
sigqueue                1903   2407    132    79    83    1 : 1008  252
kiobuf                     0      0    128     0     0    1 : 1008  252
cdev_cache               846   2610     64    20    45    1 : 1008  252
bdev_cache               812    812     64    14    14    1 : 1008  252
mnt_cache                116    116     64     2     2    1 : 1008  252
inode_cache           519033 521801    512 74543 74543    1 :  496  124
dentry_cache          401544 439050    128 14635 14635    1 : 1008  252
dquot                      0      0    128     0     0    1 : 1008  252
filp                    1890   1890    128    63    63    1 : 1008  252
names_cache               60     60   4096    60    60    1 :  240   60
buffer_head           230053 400680    108 11325 11448    1 : 1008  252
mm_struct                460    460    384    46    46    1 :  496  124
vm_area_struct          4424   5936     68   106   106    1 : 1008  252
fs_cache                 812    812     64    14    14    1 : 1008  252
files_cache              455    455    512    65    65    1 :  496  124
signal_cache             638    638     64    11    11    1 : 1008  252
sighand_cache            405    405   1408    81    81    2 :  240   60
pte_chain               5520  20640    128   394   688    1 : 1008  252
pae_pgd                  812    812     64    14    14    1 : 1008  252
size-131072(DMA)           0      0 131072     0     0   32 :    0    0
size-131072                0      0 131072     0     0   32 :    0    0
size-65536(DMA)            0      0  65536     0     0   16 :    0    0
size-65536                 1      1  65536     1     1   16 :    0    0
size-32768(DMA)            0      0  32768     0     0    8 :    0    0
size-32768                 0      0  32768     0     0    8 :    0    0
size-16384(DMA)            0      0  16384     0     0    4 :    0    0
size-16384                20     21  16384    20    21    4 :    0    0
size-8192(DMA)             0      0   8192     0     0    2 :    0    0
size-8192                  7     15   8192     7    15    2 :    0    0
size-4096(DMA)             0      0   4096     0     0    1 :  240   60
size-4096                883   1184   4096   883  1184    1 :  240   60
size-2048(DMA)             0      0   2048     0     0    1 :  240   60
size-2048                348    468   2048   186   234    1 :  240   60
size-1024(DMA)             0      0   1024     0     0    1 :  496  124
size-1024                744    744   1024   186   186    1 :  496  124
size-512(DMA)              0      0    512     0     0    1 :  496  124
size-512                1064   1560    512   146   195    1 :  496  124
size-256(DMA)              0      0    256     0     0    1 : 1008  252
size-256                1956   2460    256   164   164    1 : 1008  252
size-128(DMA)              0      0    128     0     0    1 : 1008  252
size-128                4488   4740    128   158   158    1 : 1008  252
size-64(DMA)               0      0    128     0     0    1 : 1008  252
size-64                32478  35250    128  1134  1175    1 : 1008  252
size-32(DMA)               0      0     64     0     0    1 : 1008  252
size-32                57424  63220     64  1063  1090    1 : 1008  252
I performed another test today. First, I ran 'ls -l' on the NFS-mounted share to confirm the freeze would still occur. It did. I then unmounted the share and remounted it as a Samba share. (3.0.6 on the server, 3.0.7 on the client.) Same server, same client, just using Samba instead of NFS to mount the share. 'ls -l' ran properly. There was no freeze. I'm going to try another Oracle install via the smb-mounted share and see if it develops the same performance issues experienced under NFS.
I can't do further analysis because my server crashed and needs to be reinstalled; perhaps later.
We are seeing the same issue. On an NFS-mounted directory, "ls" works fine. When performing an "ls -la" the command hangs with no output. Here's the troubleshooting data. We forgot the slabinfo before rebooting, but I'll grab that if we see the problem again. After rebooting the problem does not immediately return for us.

-------------------
uname -a / uptime
-------------------
Linux blah.foo.net 2.4.21-20.ELsmp #1 SMP Wed Aug 18 20:46:40 EDT 2004 i686 i686 i386 GNU/Linux
17:48:36 up 8 days, 22:01, 1 user, load average: 10.91, 9.69, 7.48

----------------------------
related automount configs:
----------------------------
[root@blah root]# cat /etc/auto.master
--snip--
/export/home /etc/auto.home --timeout=60
[root@blah root]# cat /etc/auto.home
* -rw,nolock,tcp nfsserver.foo.net:/export/home/&

------------------------------------------
alt-sysrq-m (hung ls -la /export/home)
------------------------------------------
Nov 18 17:46:22 blah kernel: SysRq : Show Memory
Nov 18 17:46:22 blah kernel: Mem-info:
Nov 18 17:46:22 blah kernel: Zone:DMA freepages:  2572 min:     0 low:     0 high:     0
Nov 18 17:46:22 blah kernel: Zone:Normal freepages: 29990 min:   766 low:  4031 high:  5791
Nov 18 17:46:22 blah kernel: Zone:HighMem freepages:   496 min:   255 low:   510 high:   765
Nov 18 17:46:22 blah kernel: Free pages:       33058 (   496 HighMem)
Nov 18 17:46:22 blah kernel: ( Active: 143951/5240, inactive_laundry: 1350, inactive_clean: 230, free: 33058 )
Nov 18 17:46:22 blah kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:2572
Nov 18 17:46:22 blah kernel: aa:1201 ac:118623 id:52 il:0 ic:0 fr:29990
Nov 18 17:46:22 blah kernel: aa:2146 ac:21981 id:5188 il:1350 ic:230 fr:496
Nov 18 17:46:22 blah kernel: 0*4kB 0*8kB 3*16kB 2*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 2*4096kB = 10288kB)
Nov 18 17:46:22 blah kernel: 21276*4kB 3981*8kB 168*16kB 10*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 119960kB)
Nov 18 17:46:22 blah kernel: 32*4kB 102*8kB 3*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1984kB)
Nov 18 17:46:22 blah kernel: Swap cache: add 178303, delete 176808, find 260684/268875, race 0+0
Nov 18 17:46:22 blah kernel: 68811 pages of slabcache
Nov 18 17:46:22 blah kernel: 150 pages of kernel stacks
Nov 18 17:46:22 blah kernel: 0 lowmem pagetables, 642 highmem pagetables
Nov 18 17:46:22 blah kernel: Free swap:       2023936kB
Nov 18 17:46:22 blah kernel: 262016 pages of RAM
Nov 18 17:46:22 blah kernel: 32624 pages of HIGHMEM
Nov 18 17:46:22 blah kernel: 5793 reserved pages
Nov 18 17:46:22 blah kernel: 90010 pages shared
Nov 18 17:46:22 blah kernel: 1495 pages swap cached

------------------------------------------
alt-sysrq-t (hung ls -la /export/home)
------------------------------------------
Nov 18 17:46:36 blah kernel: ls            D 00000001     0 24309  24209          (NOTLB)
Nov 18 17:46:36 blah kernel: Call Trace:   [<c0123e24>] schedule [kernel] 0x2f4 (0xc425ddd8)
Nov 18 17:46:36 blah kernel: [<c015f184>] __pte_chain_free [kernel] 0x24 (0xc425ddec)
Nov 18 17:46:36 blah kernel: [<c010adb3>] __down [kernel] 0x73 (0xc425de1c)
Nov 18 17:46:36 blah kernel: [<f995e9a3>] ext3_getxattr [ext3] 0x53 (0xc425de34)
Nov 18 17:46:36 blah kernel: [<c010af5c>] __down_failed [kernel] 0x8 (0xc425de50)
Nov 18 17:46:36 blah kernel: [<f9b7b960>] nfs_dir_inode_operations [nfs] 0x0 (0xc425de58)
Nov 18 17:46:36 blah kernel: [<c0185e12>] .text.lock.xattr [kernel] 0x29 (0xc425de60)
Nov 18 17:46:36 blah kernel: [<c01435d1>] handle_mm_fault [kernel] 0xd1 (0xc425dec0)
Nov 18 17:46:36 blah kernel: [<c017fd9b>] update_atime [kernel] 0x6b (0xc425ded4)
Nov 18 17:46:36 blah kernel: [<f9b2a132>] autofs4_revalidate [autofs4] 0xf2 (0xc425dee4)
Nov 18 17:46:36 blah kernel: [<c017c890>] dput [kernel] 0x30 (0xc425df04)
Nov 18 17:46:36 blah kernel: [<c0172266>] link_path_walk [kernel] 0x656 (0xc425df18)
Nov 18 17:46:36 blah kernel: [<c017291c>] __user_walk [kernel] 0x5c (0xc425df68)
Nov 18 17:46:36 blah kernel: [<c0185897>] sys_getxattr [kernel] 0x57 (0xc425df84)

thanks,
cody and paul
This looks different from the previous problem: here nfs_dir_inode_operations is downing a semaphore and never continuing. Can you reproduce this and get a complete AltSysrq-T so I can see who is holding the semaphore?

Larry Woodman
Created attachment 107028 [details]
alt-sysreq-t output

Here's the entire output of the alt-sysreq-t from the previous issue. I might have grabbed the wrong ls command last time; there were already several hanging around in various states.
Created attachment 107056 [details]
soradev sysrq-t/m output

I've added a file containing the full sysrq-t and sysrq-m data captured while 'ls -l' was blocked on an NFS share.
OK, this is the exact same problem. There is plenty of free memory available, but nothing larger than 32KB (8 pages) of contiguous memory:

20178*4kB 5990*8kB 151*16kB 1*32kB 0*64kB

Larry
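Larry's diagnosis can be checked mechanically against any of the sysrq-m dumps above. The following is an editorial sketch (not part of the original report): a small parser for a buddy-allocator free-list line, reporting total free memory versus the largest contiguous block still available. The function name is illustrative.

```python
# Sketch: interpret a buddy-allocator free-list line from sysrq-m output.
# Each "N*SIZEkB" field means N free blocks of SIZE kB. The largest SIZE
# with a nonzero count bounds the biggest contiguous allocation that can
# succeed, regardless of how much memory is free in total.

def parse_buddy_line(line):
    """Return (total_free_kb, largest_contiguous_kb) for a free-list line."""
    total = 0
    largest = 0
    for field in line.split():
        if "*" not in field:
            continue  # skip "=", totals, stray punctuation
        count_str, size_str = field.split("*")
        count = int(count_str)
        size_kb = int(size_str.rstrip("kB"))
        total += count * size_kb
        if count > 0:
            largest = max(largest, size_kb)
    return total, largest

# The fragment Larry quotes from the Normal zone:
total, largest = parse_buddy_line("20178*4kB 5990*8kB 151*16kB 1*32kB 0*64kB")
print(total, largest)  # -> 131080 32: ~128MB free, nothing bigger than 32kB
```

This is exactly the failure mode in rpc_allocate(): kmalloc needs physically contiguous memory, so a request above the largest available block loops forever no matter how much total memory is free.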
If the filesystem is mounted with the "-o rsize=16320,wsize=16320" mount option, does the hang still occur?
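For reference, a minimal sketch of what that mount would look like as an /etc/fstab entry; the server name and mount point below are placeholders, not taken from this report:

```
nfsserver.foo.net:/export/stage  /mnt/stage  nfs  rw,rsize=16320,wsize=16320  0 0
```

The idea is that smaller rsize/wsize values shrink the contiguous buffer each NFS RPC must kmalloc, which matters when low memory is fragmented as shown in the sysrq-m output above.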
The server has been rebuilt with RedHat AS 2.1. I had experienced the hang under 3.0 with the rsize set higher, and the share was read-only; do you need me to test it with those specific settings? If so, I can rebuild the server and test.
Try adding the following line to /etc/modules.conf on the client and then reboot:

options nfs nfs3_acl_max_entries=256

After the reboot (as root on the client), the sysctl fs.nfs.nfs3_acl_max_entries should say 256 (instead of 1024, which is the default). Let me know if this helps.
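A rough editorial sketch of why lowering this limit could help, under stated assumptions: the per-entry size of 12 bytes (three 32-bit XDR words) and the factor of two (access plus default ACL) are assumptions for illustration, not figures from this report. The point is that the getacl reply buffer must come from a single contiguous kmalloc bucket.

```python
# Back-of-the-envelope sketch (assumed sizes, for illustration only):
# estimate the contiguous kmalloc bucket the NFSv3 getacl reply buffer
# needs as a function of nfs3_acl_max_entries.

def acl_buffer_order(max_entries, bytes_per_entry=12, acls=2, page_size=4096):
    """Return (bytes_needed, kmalloc_bucket_kb) for a given ACL entry limit."""
    needed = max_entries * bytes_per_entry * acls
    # kmalloc rounds up to the next power-of-two general-purpose cache.
    bucket = page_size
    while bucket < needed:
        bucket *= 2
    return needed, bucket // 1024

print(acl_buffer_order(1024))  # -> (24576, 32): a 32kB contiguous chunk
print(acl_buffer_order(256))   # -> (6144, 8):   fits in an 8kB chunk
```

Under these assumptions, the default of 1024 entries lands in the 32kB kmalloc bucket, which the fragmented free lists in the sysrq-m dumps above could rarely satisfy, while 256 entries fits comfortably in a small bucket.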
Just wondering if you got a chance to make the changes in Comment #24... that should take care of the problem.
As per comment #23, we had to rebuild the server with AS 2.1. They needed to proceed with the project before a fix was available. I will try to make a test system available so I can verify the recommended fix.
Decreasing nfs3_acl_max_entries to 256 (comment #24) should take care of the issue. Setting status to Modified until proven otherwise.
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html

*** This bug has been marked as a duplicate of 118839 ***