| Summary: | RHELS6_64 nfsd bug causes system hang | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Andre ten Bohmer <andre.tenbohmer> |
| Component: | kernel | Assignee: | J. Bruce Fields <bfields> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Filesystem QE <fs-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.0 | CC: | bfields, dchinner, dhowells, jlayton, kzhang, mschmidt, pasteur, rwheeler, sprabhu, steved, syeghiay, yanwang |
| Target Milestone: | rc | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-02-17 22:35:29 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | attachment 481720 [details]: XFS quota enabled causes server crash | | |
What is the kernel version? See the "uname -a" part.

]# uname -a
Linux scomp1110 2.6.32-71.14.1.el6.x86_64 #1 SMP Wed Jan 5 17:01:01 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

Is there any more to that console message? Normally I'd expect a backtrace to follow.

Created attachment 481720 [details]
XFS quota enabled causes server crash

XFS seems to be buggy on this system. After enabling quota, this shows up on the remote console.
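The report does not say how quota was enabled on this filesystem. For reference only, a minimal sketch of the usual way to turn on XFS quota accounting on RHEL 6, assuming user and group quotas; the device name and mount point below are hypothetical, not taken from this bug:

```sh
# /etc/fstab entry enabling XFS user and group quota accounting
# (device and mount point are made-up examples)
/dev/mapper/vg_data-lv_export  /export  xfs  defaults,uquota,gquota  0 0

# XFS quota accounting is switched on at mount time, so remount cleanly
umount /export
mount /export

# confirm that quota accounting/enforcement is active
xfs_quota -x -c state /export
```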
Kernel ring message on the 'sister' server which did not cause a system hang:

nfsd: page allocation failure. order:4, mode:0x20
Pid: 4187, comm: nfsd Not tainted 2.6.32-71.14.1.el6.x86_64 #1
Call Trace:
 [<ffffffff8111ea06>] __alloc_pages_nodemask+0x706/0x850
 [<ffffffff811560e2>] kmem_getpages+0x62/0x170
 [<ffffffff81156cfa>] fallback_alloc+0x1ba/0x270
 [<ffffffff8115674f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81156a79>] ____cache_alloc_node+0x99/0x160
 [<ffffffff814063df>] ? pskb_expand_head+0x5f/0x1e0
 [<ffffffff81157809>] __kmalloc+0x189/0x220
 [<ffffffff814063df>] pskb_expand_head+0x5f/0x1e0
 [<ffffffff8140881a>] __pskb_pull_tail+0x2aa/0x360
 [<ffffffffa028a6ce>] bnx2x_start_xmit+0x19e/0xf50 [bnx2x]
 [<ffffffffa028a8cf>] ? bnx2x_start_xmit+0x39f/0xf50 [bnx2x]
 [<ffffffffa03b55f0>] ? xfs_iomap_eof_want_preallocate+0xd0/0x150 [xfs]
 [<ffffffff81410da8>] dev_hard_start_xmit+0x2b8/0x370
 [<ffffffff814291ba>] sch_direct_xmit+0x15a/0x1c0
 [<ffffffff81414338>] dev_queue_xmit+0x378/0x4a0
 [<ffffffffa02bb0b5>] ? ipt_do_table+0x295/0x678 [ip_tables]
 [<ffffffffa0510615>] bond_dev_queue_xmit+0x45/0x1b0 [bonding]
 [<ffffffffa0510b32>] bond_start_xmit+0x3b2/0x4a0 [bonding]
 [<ffffffff81410da8>] dev_hard_start_xmit+0x2b8/0x370
 [<ffffffff81414386>] dev_queue_xmit+0x3c6/0x4a0
 [<ffffffffa05939d4>] vlan_dev_hwaccel_hard_start_xmit+0x84/0xb0 [8021q]
 [<ffffffff81410da8>] dev_hard_start_xmit+0x2b8/0x370
 [<ffffffff81414386>] dev_queue_xmit+0x3c6/0x4a0
 [<ffffffff8144758c>] ip_finish_output+0x13c/0x310
 [<ffffffff81447818>] ip_output+0xb8/0xc0
 [<ffffffff8144676f>] ? __ip_local_out+0x9f/0xb0
 [<ffffffff814467a5>] ip_local_out+0x25/0x30
 [<ffffffff81446ff0>] ip_queue_xmit+0x190/0x420
 [<ffffffff81447818>] ? ip_output+0xb8/0xc0
 [<ffffffff8144676f>] ? __ip_local_out+0x9f/0xb0
 [<ffffffff814467a5>] ? ip_local_out+0x25/0x30
 [<ffffffff8145bca1>] tcp_transmit_skb+0x3f1/0x790
 [<ffffffff8145e017>] tcp_write_xmit+0x1e7/0x9e0
 [<ffffffff8145e9a0>] __tcp_push_pending_frames+0x30/0xe0
 [<ffffffff81456613>] tcp_data_snd_check+0x33/0x100
 [<ffffffff8145a310>] tcp_rcv_established+0x5c0/0x820
 [<ffffffffa03b016c>] ? xfs_iext_bno_to_ext+0x8c/0x170 [xfs]
 [<ffffffff814620d3>] tcp_v4_do_rcv+0x2e3/0x430
 [<ffffffff81264885>] ? memmove+0x45/0x50
 [<ffffffff814019a5>] release_sock+0x65/0xd0
 [<ffffffff814512c1>] tcp_recvmsg+0x821/0xe80
 [<ffffffffa0392e5d>] ? xfs_bmapi+0x1bd/0x11a0 [xfs]
CE: hpet increasing min_delta_ns to 15000 nsec
CE: hpet increasing min_delta_ns to 22500 nsec
 [<ffffffff8145dea6>] ? tcp_write_xmit+0x76/0x9e0
 [<ffffffff81401069>] sock_common_recvmsg+0x39/0x50
 [<ffffffff813fea53>] sock_recvmsg+0x133/0x160
 [<ffffffff810566d0>] ? __dequeue_entity+0x30/0x50
 [<ffffffff8105c7f6>] ? update_curr+0xe6/0x1e0
 [<ffffffff81091de0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81061c11>] ? dequeue_entity+0x1a1/0x1e0
 [<ffffffff810116e0>] ? __switch_to+0xd0/0x320
 [<ffffffff81059db2>] ? finish_task_switch+0x42/0xd0
 [<ffffffff814c8d96>] ? thread_return+0x4e/0x778
 [<ffffffff8105c434>] ? try_to_wake_up+0x284/0x380
 [<ffffffff814cb656>] ? _spin_lock_bh+0x16/0x40
 [<ffffffff813feac4>] kernel_recvmsg+0x44/0x60
 [<ffffffffa053b885>] svc_recvfrom+0x65/0xa0 [sunrpc]
 [<ffffffff81472430>] ? inet_ioctl+0x30/0xa0
 [<ffffffffa053c412>] svc_tcp_recvfrom+0x192/0x660 [sunrpc]
 [<ffffffffa0548acb>] svc_recv+0x7fb/0x830 [sunrpc]
 [<ffffffff8105c530>] ? default_wake_function+0x0/0x20
 [<ffffffffa0599b45>] nfsd+0xa5/0x160 [nfsd]
 [<ffffffffa0599aa0>] ? nfsd+0x0/0x160 [nfsd]
 [<ffffffff81091a76>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff810919e0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
CPU 2: hi: 0, btch: 1 usd: 0
CPU 3: hi: 0, btch: 1 usd: 0
CPU 4: hi: 0, btch: 1 usd: 0
CPU 5: hi: 0, btch: 1 usd: 0
CPU 6: hi: 0, btch: 1 usd: 0
CPU 7: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 182
CPU 1: hi: 186, btch: 31 usd: 170
CPU 2: hi: 186, btch: 31 usd: 30
CPU 3: hi: 186, btch: 31 usd: 114
CPU 4: hi: 186, btch: 31 usd: 43
CPU 5: hi: 186, btch: 31 usd: 29
CPU 6: hi: 186, btch: 31 usd: 15
CPU 7: hi: 186, btch: 31 usd: 110
Node 0 Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 157
CPU 1: hi: 186, btch: 31 usd: 182
CPU 2: hi: 186, btch: 31 usd: 74
CPU 3: hi: 186, btch: 31 usd: 37
CPU 4: hi: 186, btch: 31 usd: 36
CPU 5: hi: 186, btch: 31 usd: 67
CPU 6: hi: 186, btch: 31 usd: 49
CPU 7: hi: 186, btch: 31 usd: 131
active_anon:8275 inactive_anon:2222 isolated_anon:0
 active_file:648839 inactive_file:2098774 isolated_file:0
 unevictable:1475 dirty:16782 writeback:45649 unstable:0
 free:48629 slab_reclaimable:65582 slab_unreclaimable:146436
 mapped:5001 shmem:223 pagetables:1652 bounce:0
Node 0 DMA free:15700kB min:80kB low:100kB high:120kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15308kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 3502 12087 12087
Node 0 DMA32 free:82484kB min:19556kB low:24444kB high:29332kB active_anon:80kB inactive_anon:2448kB active_file:758724kB inactive_file:2384696kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3586464kB mlocked:0kB dirty:17772kB writeback:2064kB mapped:336kB shmem:0kB slab_reclaimable:121300kB slab_unreclaimable:28720kB kernel_stack:24kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 8584 8584
Node 0 Normal free:96332kB min:47940kB low:59924kB high:71908kB active_anon:33020kB inactive_anon:6440kB active_file:1836632kB inactive_file:6010400kB unevictable:5900kB isolated(anon):0kB isolated(file):0kB present:8791036kB mlocked:5900kB dirty:49356kB writeback:180532kB mapped:19668kB shmem:892kB slab_reclaimable:141028kB slab_unreclaimable:557024kB kernel_stack:3272kB pagetables:6608kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 2*8kB 2*16kB 1*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15700kB
Node 0 DMA32: 11572*4kB 2347*8kB 373*16kB 176*32kB 69*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 82616kB
Node 0 Normal: 14879*4kB 1103*8kB 707*16kB 297*32kB 62*64kB 1*128kB 3*256kB 1*512kB 2*1024kB 0*2048kB 0*4096kB = 96580kB
2748831 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 8388600kB
Total swap = 8388600kB
3145726 pages RAM
66706 pages reserved
1230916 pages shared
1749857 pages non-shared

Last Friday, this server crashed multiple times under heavy IO stress via nfs. We decided to rebuild the server with RedHat 5.6 x64 and it's stable so far.

(In reply to comment #6)
> nfsd: page allocation failure. order:4, mode:0x20
> Pid: 4187, comm: nfsd Not tainted 2.6.32-71.14.1.el6.x86_64 #1
> [...]

So bnx2x was attempting an order 4 (64 KiB) atomic allocation.
It must be the skb_linearize() call which bnx2x does when it is asked to transmit an skb with more frags than the card can handle. What are the offload settings for the card (ethtool -k ethX)? From the stack trace I can see that VLANs and bonding are involved. I'll try to reproduce the allocation failures.

I don't yet see how the allocation failure would relate to the BUGs from the original description though:
> BUG:scheduling while atomic: nfsd/3538/0xffffffff
> BUG: unable to handle kernel paging request at 000000038ab00bc0

Agreed. I think there are likely several bugs here that may or may not be related. We'll likely need a stack trace from the scheduling while atomic message in order to know what that is.

On a 'sister' server running RHEL6 (stable):

]# ethtool -k eth14
Offload parameters for eth14:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off

On the 'problem' server, but now running RHEL5:

]# ethtool -k eth0
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
generic-receive-offload: on

I'm sorry I can't do any further research on the 'problem' server. It's running production, and RHEL6 on this system was too unstable. It has exactly the same setup as the 'sister' server, which in fact is running RHEL6 very stably. The only big difference is that the problem server boots from SAN and the sister server boots from local disks.

The RHEL 5.6 installation is more stable, but last weekend it also crashed several times. Managed to set up a serial console monitor via the iLO2 board and am now testing with heavy I/O loads, hoping the server crashes again so I can provide you with a dump.

Looks like we're waiting on more information from the reporter?

I am going to close this specific BZ pending an update from the reporter. Please reopen it if this issue happens again, or open a new one to track the other issues you mention in comment https://bugzilla.redhat.com/show_bug.cgi?id=676022#c13. Thanks!
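As background to the offload question above: offload features can be toggled individually with ethtool. The following is only a hedged diagnostic sketch, not something requested in this bug; the interface name is an example, and the idea is simply that with TSO/GSO disabled the stack hands bnx2x MTU-sized skbs with few fragments, so the driver should no longer need the order-4 linearize allocation seen in the trace (at the cost of extra CPU):

```sh
# show the current offload settings for the NIC (interface name is an example)
ethtool -k eth14

# diagnostic only: disable TCP segmentation offload and generic segmentation
# offload so the driver is never asked to transmit a 64 KiB multi-fragment skb
ethtool -K eth14 tso off gso off

# re-run the NFS write load and watch for further allocation failures
dmesg | grep -i "page allocation failure"
```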
Hello,

Today finally managed to catch a console dump:

Red Hat Enterprise Linux Server release 5.7 (Tikanga)
Kernel 2.6.18-274.7.1.el5 on an x86_64

serevr login: INFO: task xfsdatad/2:3426 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
xfsdatad/2 D ffffffff80154db9 0 3426 71 3427 3425 (L-TLB)
 ffff81011b1f1dc0 0000000000000046 0000000000000000 0000000000000000
 0000000000000100 000000000000000a ffff81011d0d77a0 ffff81011ff24080
 000000f44d72caa3 000000000000071c ffff81011d0d7988 0000000200000000
Call Trace:
 [<ffffffff885d1d16>] :xfs:xfs_end_bio_delalloc+0x0/0x12
 [<ffffffff800645e3>] __down_write_nested+0x7a/0x92
 [<ffffffff885d1ca4>] :xfs:xfs_setfilesize+0x2d/0x8d
 [<ffffffff885d1d1f>] :xfs:xfs_end_bio_delalloc+0x9/0x12
 [<ffffffff8004d32e>] run_workqueue+0x9e/0xfb
 [<ffffffff80049b3d>] worker_thread+0x0/0x122
 [<ffffffff800a2c39>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80049c2d>] worker_thread+0xf0/0x122
 [<ffffffff8008e87f>] default_wake_function+0x0/0xe
 [<ffffffff800a2c39>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003270f>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a2c39>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032611>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
INFO: task nfsd:5298 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd D ffffffff80154db9 0 5298 1 5299 5297 (L-TLB)
 ffff8100d9a799d0 0000000000000046 ffff8100c8ad3a9c ffff8100d9fc4440
 ffffffffffffff5d 000000000000000a ffff8100d9957080 ffff81011fe39100
 000000f44b13bfd6 0000000000006a2e ffff8100d9957268 000000078003f92f
Call Trace:
 [<ffffffff800ef670>] inode_wait+0x0/0xd
 [<ffffffff800ef679>] inode_wait+0x9/0xd
 [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e
 [<ffffffff800ef670>] inode_wait+0x0/0xd
 [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a2e7f>] wake_bit_function+0x0/0x23
 [<ffffffff80031ab5>] sock_common_recvmsg+0x2d/0x43
 [<ffffffff8003d988>] ifind_fast+0x6e/0x83
 [<ffffffff8002355d>] iget_locked+0x59/0x149
 [<ffffffff885b6bd9>] :xfs:xfs_iget+0x4f/0x17a
 [<ffffffff885d519c>] :xfs:xfs_fs_get_dentry+0x3e/0xae
 [<ffffffff887f536d>] :exportfs:find_exported_dentry+0x43/0x486
 [<ffffffff88802739>] :nfsd:nfsd_acceptable+0x0/0xdc
 [<ffffffff8880680b>] :nfsd:exp_get_by_name+0x5b/0x71
 [<ffffffff88806dfa>] :nfsd:exp_find_key+0x89/0x9c
 [<ffffffff8008cca4>] __wake_up_common+0x3e/0x68
 [<ffffffff88802739>] :nfsd:nfsd_acceptable+0x0/0xdc
 [<ffffffff885d5042>] :xfs:xfs_fs_decode_fh+0xce/0xd8
 [<ffffffff88802ab1>] :nfsd:fh_verify+0x29c/0x4cf
 [<ffffffff88803d1f>] :nfsd:nfsd_open+0x2c/0x196
 [<ffffffff88804051>] :nfsd:nfsd_write+0x89/0xd5
 [<ffffffff8880abae>] :nfsd:nfsd3_proc_write+0xea/0x109
 [<ffffffff888001db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
 [<ffffffff8877f80d>] :sunrpc:svc_process+0x44c/0x713
 [<ffffffff80064614>] __down_read+0x12/0x92
 [<ffffffff88800580>] :nfsd:nfsd+0x0/0x2c8
 [<ffffffff88800725>] :nfsd:nfsd+0x1a5/0x2c8
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff88800580>] :nfsd:nfsd+0x0/0x2c8
 [<ffffffff88800580>] :nfsd:nfsd+0x0/0x2c8
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Hi,

Found this via google:
http://comments.gmane.org/gmane.comp.file-systems.xfs.general/32747
Seems to be a bug solved in newer kernels as of 2.6.34? Will RedHat reverse engineer this in 2.6.18 (RH5.7) and 2.6.32 (RH6.1)?

(In reply to comment #18)
> Found this via google:
> http://comments.gmane.org/gmane.comp.file-systems.xfs.general/32747

That does look similar to what you report in Comment #17, though I'm less certain it's related to the bug originally reported here--probably this deserves a new bug if it's not already fixed. I don't see any xfs people on the cc, so adding Dave Chinner to see what he thinks.
(In reply to comment #18)
> Hi,
> Found this via google:
> http://comments.gmane.org/gmane.comp.file-systems.xfs.general/32747
> Seems to be a bug solved in newer kernels as of 2.6.34?

It's a different problem and completely irrelevant. xfstests 104 is testing online filesystem growing functionality, which used to deadlock in the allocator code under extreme stress.

> Will RedHat reverse engineer this in 2.6.18 (RH5.7)

Very unlikely, because it's an extremely rare problem in production systems and the fix is very intrusive. And in most cases, growing a filesystem is done during scheduled downtime, so it's not likely to be a serious problem even if the deadlock is tripped.

> and 2.6.32 (RH6.1)?

RHEL6.0 already has this fixed.

(In reply to comment #20)
> It's a different problem and completely irrelevant. xfstests 104 is testing
> online filesystem growing functionality, which used to deadlock in the
> allocator code under extreme stress.

That's the nature of intensive HPC jobs: extreme stress.

> Very unlikely, because it's an extremely rare problem in production systems and
> the fix is very intrusive. And in most cases, growing a filesystem is done
> during scheduled downtime, so it's not likely to be a serious problem even if
> the deadlock is tripped.

So it's not advisable to grow file systems online?

> RHEL6.0 already has this fixed.

Since when? Otherwise I'll rebuild this system with RH 6.2. Thanks for your time.

(In reply to comment #17)
> Hello,
> Today finally managed to catch a console dump:
> [...]

This implies that something else is holding the XFS inode ilock and not letting it go. That's one step closer to the potential root cause of the problem - "echo w > /proc/sysrq-trigger" should dump the traces of all the currently blocked processes and will probably tell us who is holding the ilock and why they are not letting it go. Then we'll know whether this is caused by or the cause of your OOM problem.
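A minimal way to capture that information, assuming the serial console via the iLO2 board mentioned earlier is already logging kernel output (the exact steps below are a sketch, not taken from this bug):

```sh
# make sure the SysRq interface is enabled
echo 1 > /proc/sys/kernel/sysrq

# when the hang occurs, dump the stacks of all blocked (D-state) tasks
echo w > /proc/sysrq-trigger

# the traces land in the kernel ring buffer and on the serial console
dmesg | tail -n 200
```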
(In reply to comment #21)
> So it's not advisable to grow file systems online?

It's never advisable to do filesystem/storage administration tasks while your system is under extreme stress. Best practice implies that you make config changes when there is the least chance of something going wrong, regardless of whether or not you need to take something offline to make the change. Online filesystem growing simply reduces the downtime needed for the operation; it doesn't remove the need to schedule or perform that operation in a safe manner...

Indeed, this problem, before it was fixed in January 2010, had been present in XFS for more than 10 years, and I don't recall ever seeing a bug report about the problem from anything other than test 104.... Anyway, the fact that we have a test in a regression test suite that can trip a bug doesn't mean everyone who does that operation will trip that very bug. That's because we devise stress tests specifically to trip over such known issues. There isn't a workload on the planet that looks like the load that test 104 is generating - how many workloads do you know that repeatedly grow the filesystem while doing hundreds of concurrent operations known to be specifically problematic for the grow operation?

> Since when? Otherwise I'll rebuild this system with RH 6.2.

The fixes were in the original RHEL 6.0 release.

We'll see, now rebuilding this server with RH 6.2. Thanks for your time.

(In reply to comment #24)
> We'll see, now rebuilding this server with RH 6.2. Thanks for your time.

Just to be clear - the fixes for the completely unrelated XFS growfs deadlock problem you pointed to are in rhel6.x. The problem you actually reported is still completely unknown at this point....

Ok thanks, sorry for mixing things up, but the main issue was an unstable RH5/6 system under heavy I/O stress on an NFS export. The first issue was indeed XFS related, I guess (becoming unresponsive/crashing when running the xfs defrag tool); later on it also seemed related to a problem in the kernel which rears its ugly head when there is a lot of I/O stress on the system. Should I file a new bug report for this one to clear things up?

(In reply to comment #26)
> Ok thanks, sorry for mixing things up, but the main issue was an unstable RH5/6
> system under heavy I/O stress on an NFS export. The first issue was indeed XFS
> related, I guess (becoming unresponsive/crashing when running the xfs defrag tool);
> later on it also seemed related to a problem in the kernel which rears its ugly
> head when there is a lot of I/O stress on the system. Should I file a new bug
> report for this one to clear things up?

Yes, if you haven't already done that, please do. I've lost track of what exactly the problem is here. Assuming this one should be closed for now.
Description of problem:
RHELS6_64 server running as an NFS server with about 10 clients.

Console message:
BUG:scheduling while atomic: nfsd/3538/0xffffffff
BUG: unable to handle kernel paging request at 000000038ab00bc0
IP: [<ffffffff81056fd>] task_rq_lock+0x4d/0xa0
PGD 0
Oops: 0000 {#1} SMP

Version-Release number of selected component (if applicable):
nfs-utils-1.2.2-7.el6.x86_64
nfs-utils-lib-1.1.5-1.el6.x86_64

How reproducible:
I've seen the "atomic" word once before regarding an xfs defrag on this server (46TB xfs file system), which also caused a system hang. Maybe if I run a defrag again, the system will hang.

Steps to Reproduce:
1.
2.
3.

Actual results:
System hang (no IP ping, no keyboard response from the system console)

Expected results:
No system hang

Additional info:
]# lsb_release -a
LSB Version: :core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 6.0 (Santiago)
Release: 6.0
Codename: Santiago

]# uname -a
Linux scomp1110 2.6.32-71.14.1.el6.x86_64 #1 SMP Wed Jan 5 17:01:01 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

Manufacturer: HP
Product Name: ProLiant BL460c G6

NFS export on a 46T LVM striped volume group with an XFS file system.
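Since the report suggests the hang might be reproducible by re-running the XFS defragmenter, a sketch of that reproduction attempt; the device path, mount point, and run-time limit are illustrative only, not taken from this bug:

```sh
# check how fragmented the 46T filesystem actually is (read-only query;
# device path is a made-up example)
xfs_db -r -c frag /dev/mapper/vg_export-lv_export

# run the online defragmenter against the mounted filesystem for at most
# two hours (-t seconds) while watching the serial console for the
# "scheduling while atomic" BUG
xfs_fsr -t 7200 -v /export
```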