Description of problem:

Thanks to Dave Teigland for the analysis of this deadlock.

----

On RHEL5 there is a possibility of a deadlock in dlm_recv/dlm_send when they are processing a lock for a lockspace whose allocations use GFP_KERNEL. In such a case, if GFS is in use and the dlm allocation triggers the kernel to reclaim memory from filesystems, GFS will try to free memory, which in turn issues new dlm requests; GFS therefore calls back into the dlm, and dlm_recv/dlm_send deadlock. When this deadlock occurs, all dlm operations on the node lock up. Because dlm_send/dlm_recv can no longer communicate, there is the possibility of GFS hanging completely cluster wide.

GFS dlm requests are done with ls_allocation set to GFP_NOFS, which prevents memory reclaim from re-entering filesystem drivers. However, all other dlm applications, namely clvmd and rgmanager, use GFP_KERNEL. Therefore, if clvmd and rgmanager are in use in the cluster and the system is under memory pressure, it is possible to hit this deadlock (a minimal sketch of the two allocation flags follows at the end of this description).

There was a real occurrence of this deadlock. The sequence of events that triggered the problem is:
- dlm_recv was processing an rgmanager lock; rgmanager lockspaces use ls_allocation = GFP_KERNEL.
- dlm_recv then did a memory allocation.
- Due to the memory pressure on the system, the allocation triggered the kernel to free memory from GFS2.
- GFS2 then called into the dlm to do a lock and waited for it.
- This means the dlm_recv thread deadlocked.
- There is only one dlm_recv thread, and it services all lockspaces.

As we can see, clurgmgrd is blocked in a dlm:search_rsb_list request:

Oct 20 19:32:56 node1 kernel: clurgmgrd D ffff880001095460 0 5767 20817 5768 5766 (NOTLB)
Oct 20 19:32:56 node1 kernel: ffff880036075d58 0000000000000282 0000000000000000 0000000000000000
Oct 20 19:32:56 node1 kernel: 0000000000000007 ffff88007f7717e0 ffff8800789897e0 00000000000164fc
Oct 20 19:32:56 node1 kernel: ffff88007f7719c8 0000000000000000
Oct 20 19:32:56 node1 kernel: Call Trace:
Oct 20 19:32:56 node1 kernel: [<ffffffff88549b7b>] :dlm:search_rsb_list+0x3f/0x85
Oct 20 19:32:56 node1 kernel: [<ffffffff88549c12>] :dlm:_search_rsb+0x51/0x1b1
Oct 20 19:32:56 node1 kernel: [<ffffffff80262b40>] __mutex_lock_slowpath+0x60/0x9b
Oct 20 19:32:56 node1 kernel: [<ffffffff8038562e>] extract_entropy+0x47/0x90
Oct 20 19:32:56 node1 kernel: [<ffffffff80262b8a>] .text.lock.mutex+0xf/0x14
Oct 20 19:32:56 node1 kernel: [<ffffffff8854a483>] :dlm:request_lock+0x52/0xa0
Oct 20 19:32:56 node1 kernel: [<ffffffff8854d4f9>] :dlm:dlm_user_request+0xed/0x174
Oct 20 19:32:56 node1 kernel: [<ffffffff885544b7>] :dlm:device_write+0x2f5/0x5e5
Oct 20 19:32:56 node1 kernel: [<ffffffff80216d8b>] vfs_write+0xce/0x174
Oct 20 19:32:56 node1 kernel: [<ffffffff802175d8>] sys_write+0x45/0x6e
Oct 20 19:32:56 node1 kernel: [<ffffffff8025f2f9>] tracesys+0xab/0xb6
Oct 20 19:32:56 node1 kernel:

dlm_recv is also blocked on a dlm request during a shrink_dcache:

Oct 20 19:32:55 node1 kernel: dlm_recv D ffff880001041460 0 12321 99 12322 12320 (L-TLB)
Oct 20 19:32:55 node1 kernel: ffff88007e8e36e0 0000000000000246 0000000000000000 ffff88006aec1000
Oct 20 19:32:55 node1 kernel: 000000000000000a ffff88006e74e820 ffff88007f6290c0 0000000000060610
Oct 20 19:32:55 node1 kernel: ffff88006e74ea08 0000000000000000
Oct 20 19:32:55 node1 kernel: Call Trace:
Oct 20 19:32:55 node1 kernel: [<ffffffff8854d784>] :dlm:dlm_put_lockspace+0x10/0x1f
Oct 20 19:32:55 node1 kernel: [<ffffffff8854be2b>] :dlm:dlm_lock+0x117/0x129
Oct 20 19:32:55 node1 kernel: [<ffffffff885ef556>] :lock_dlm:gdlm_ast+0x0/0x311
Oct 20 19:32:55 node1 kernel: [<ffffffff885ef2c1>] :lock_dlm:gdlm_bast+0x0/0x8d
Oct 20 19:32:55 node1 kernel: [<ffffffff885750f8>] :gfs2:just_schedule+0x0/0xe
Oct 20 19:32:55 node1 kernel: [<ffffffff88575101>] :gfs2:just_schedule+0x9/0xe
Oct 20 19:32:55 node1 kernel: [<ffffffff802628e7>] __wait_on_bit+0x40/0x6e
Oct 20 19:32:55 node1 kernel: [<ffffffff885750f8>] :gfs2:just_schedule+0x0/0xe
Oct 20 19:32:55 node1 kernel: [<ffffffff80262981>] out_of_line_wait_on_bit+0x6c/0x78
Oct 20 19:32:55 node1 kernel: [<ffffffff8029a018>] wake_bit_function+0x0/0x23
Oct 20 19:32:55 node1 kernel: [<ffffffff885750f3>] :gfs2:gfs2_glock_wait+0x2b/0x30
Oct 20 19:32:55 node1 kernel: [<ffffffff88584b22>] :gfs2:gfs2_delete_inode+0x4e/0x191
Oct 20 19:32:55 node1 kernel: [<ffffffff88584b1a>] :gfs2:gfs2_delete_inode+0x46/0x191
Oct 20 19:32:55 node1 kernel: [<ffffffff88584ad4>] :gfs2:gfs2_delete_inode+0x0/0x191
Oct 20 19:32:55 node1 kernel: [<ffffffff80230465>] generic_delete_inode+0xc6/0x143
Oct 20 19:32:55 node1 kernel: [<ffffffff802d96ef>] prune_one_dentry+0x4d/0x76
Oct 20 19:32:55 node1 kernel: [<ffffffff8022f925>] prune_dcache+0x10f/0x149
Oct 20 19:32:55 node1 kernel: [<ffffffff802d972f>] shrink_dcache_memory+0x17/0x30
Oct 20 19:32:55 node1 kernel: [<ffffffff802409dc>] shrink_slab+0xdc/0x154
Oct 20 19:32:55 node1 kernel: [<ffffffff802c00ac>] try_to_free_pages+0x1c8/0x2c2
Oct 20 19:32:55 node1 kernel: [<ffffffff8020f5c6>] __alloc_pages+0x1cb/0x2ce
Oct 20 19:32:55 node1 kernel: [<ffffffff8855055c>] :dlm:dlm_lowcomms_get_buffer+0x13f/0x1c5
Oct 20 19:32:55 node1 kernel: [<ffffffff88547333>] :dlm:_create_message+0x30/0x8b
Oct 20 19:32:55 node1 kernel: [<ffffffff88548854>] :dlm:add_lkb+0x15/0x15e
Oct 20 19:32:55 node1 kernel: [<ffffffff885483b7>] :dlm:send_common_reply+0x25/0x59
Oct 20 19:32:55 node1 kernel: [<ffffffff88548be7>] :dlm:do_request+0x34/0x93
Oct 20 19:32:55 node1 kernel: [<ffffffff8854a9e5>] :dlm:receive_request+0xf8/0x17a
Oct 20 19:32:55 node1 kernel: [<ffffffff8854b4cb>] :dlm:_receive_message+0x87c/0xb09
Oct 20 19:32:55 node1 kernel: [<ffffffff80263a0d>] _spin_lock_irq+0x9/0x14
Oct 20 19:32:55 node1 kernel: [<ffffffff802629d6>] mutex_lock+0xd/0x1d
Oct 20 19:32:55 node1 kernel: [<ffffffff8854b858>] :dlm:dlm_receive_buffer+0xfb/0x12e
Oct 20 19:32:55 node1 kernel: [<ffffffff80273963>] xen_send_IPI_mask+0xa5/0xaa
Oct 20 19:32:55 node1 kernel: [<ffffffff8854ef08>] :dlm:dlm_process_incoming_buffer+0xf8/0x134
Oct 20 19:32:55 node1 kernel: [<ffffffff8020f4e1>] __alloc_pages+0xe6/0x2ce
Oct 20 19:32:55 node1 kernel: [<ffffffff88550fd5>] :dlm:receive_from_sock+0x5b0/0x6dc
Oct 20 19:32:55 node1 kernel: [<ffffffff8022b9b1>] local_bh_enable+0x9/0xa5
Oct 20 19:32:55 node1 kernel: [<ffffffff802342d2>] lock_sock+0xa7/0xb2
Oct 20 19:32:55 node1 kernel: [<ffffffff8022b9b1>] local_bh_enable+0x9/0xa5
Oct 20 19:32:55 node1 kernel: [<ffffffff802342d2>] lock_sock+0xa7/0xb2
Oct 20 19:32:55 node1 kernel: [<ffffffff80231c7c>] release_sock+0x13/0xaa
Oct 20 19:32:55 node1 kernel: [<ffffffff8033add5>] __next_cpu+0x19/0x28
Oct 20 19:32:55 node1 kernel: [<ffffffff8025f82b>] error_exit+0x0/0x6e
Oct 20 19:32:55 node1 kernel: [<ffffffff80285ab9>] find_busiest_group+0x1db/0x44a
Oct 20 19:32:55 node1 kernel: [<ffffffff8855003f>] :dlm:process_recv_sockets+0x0/0x16
Oct 20 19:32:55 node1 kernel: [<ffffffff8855004f>] :dlm:process_recv_sockets+0x10/0x16
Oct 20 19:32:55 node1 kernel: [<ffffffff8024ef13>] run_workqueue+0x94/0xe4
Oct 20 19:32:55 node1 kernel: [<ffffffff8024b81e>] worker_thread+0x0/0x122
Oct 20 19:32:55 node1 kernel: [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:55 node1 kernel: [<ffffffff8024b90e>] worker_thread+0xf0/0x122
Oct 20 19:32:55 node1 kernel: [<ffffffff80286d89>] default_wake_function+0x0/0xe
Oct 20 19:32:55 node1 kernel: [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:55 node1 kernel: [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:55 node1 kernel: [<ffffffff80233575>] kthread+0xfe/0x132
Oct 20 19:32:55 node1 kernel: [<ffffffff8025fb2c>] child_rip+0xa/0x12

The pdflush and kswapd processes are also blocked:

Oct 20 19:32:54 node1 kernel: pdflush D 0000b22d004b09b1 0 577 99 578 357 (L-TLB)
Oct 20 19:32:54 node1 kernel: ffff88007eb1fbd0 0000000000000246 0000000300000000 ffff8800768286c0
Oct 20 19:32:54 node1 kernel: 000000000000000a ffff88007f629820 ffffffff804e0a80 00000000000076f8
Oct 20 19:32:54 node1 kernel: ffff88007f629a08 ffffffff80234070
Oct 20 19:32:54 node1 kernel: Call Trace:
Oct 20 19:32:54 node1 kernel: [<ffffffff80234070>] submit_bio+0xcd/0xd4
Oct 20 19:32:54 node1 kernel: [<ffffffff885750f8>] :gfs2:just_schedule+0x0/0xe
Oct 20 19:32:54 node1 kernel: [<ffffffff88575101>] :gfs2:just_schedule+0x9/0xe
Oct 20 19:32:54 node1 kernel: [<ffffffff802628e7>] __wait_on_bit+0x40/0x6e
Oct 20 19:32:54 node1 kernel: [<ffffffff885750f8>] :gfs2:just_schedule+0x0/0xe
Oct 20 19:32:54 node1 kernel: [<ffffffff80262981>] out_of_line_wait_on_bit+0x6c/0x78
Oct 20 19:32:54 node1 kernel: [<ffffffff8029a018>] wake_bit_function+0x0/0x23
Oct 20 19:32:54 node1 kernel: [<ffffffff885750f3>] :gfs2:gfs2_glock_wait+0x2b/0x30
Oct 20 19:32:54 node1 kernel: [<ffffffff88584d5c>] :gfs2:gfs2_write_inode+0x5f/0x157
Oct 20 19:32:54 node1 kernel: [<ffffffff88584d54>] :gfs2:gfs2_write_inode+0x57/0x157
Oct 20 19:32:54 node1 kernel: [<ffffffff80230da1>] __writeback_single_inode+0x1e9/0x328
Oct 20 19:32:54 node1 kernel: [<ffffffff881f2d4d>] :dm_mod:dm_any_congested+0x38/0x3f
Oct 20 19:32:54 node1 kernel: [<ffffffff881f4ab4>] :dm_mod:dm_table_any_congested+0x46/0x62
Oct 20 19:32:54 node1 kernel: [<ffffffff80221280>] sync_sb_inodes+0x1a9/0x267
Oct 20 19:32:54 node1 kernel: [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:54 node1 kernel: [<ffffffff80252d60>] writeback_inodes+0x82/0xd8
Oct 20 19:32:54 node1 kernel: [<ffffffff802be1c9>] wb_kupdate+0x9e/0x112
Oct 20 19:32:54 node1 kernel: [<ffffffff80258027>] pdflush+0x0/0x207
Oct 20 19:32:54 node1 kernel: [<ffffffff80258180>] pdflush+0x159/0x207
Oct 20 19:32:54 node1 kernel: [<ffffffff802be12b>] wb_kupdate+0x0/0x112
Oct 20 19:32:54 node1 kernel: [<ffffffff80233575>] kthread+0xfe/0x132
Oct 20 19:32:54 node1 kernel: [<ffffffff8025fb2c>] child_rip+0xa/0x12
Oct 20 19:32:54 node1 kernel: [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:54 node1 kernel: [<ffffffff80233477>] kthread+0x0/0x132
Oct 20 19:32:54 node1 kernel: [<ffffffff8025fb22>] child_rip+0x0/0x12
Oct 20 19:32:54 node1 kernel: pdflush D ffff880001041460 0 578 99 579 577 (L-TLB)
Oct 20 19:32:54 node1 kernel: ffff88007eb21bd0 0000000000000246 ffffffff8020622a ffffffffff578000
Oct 20 19:32:54 node1 kernel: 000000000000000a ffff88007f6290c0 ffff88007f629820 00000000001a1b6a
Oct 20 19:32:54 node1 kernel: ffff88007f6292a8 ffffffff881f2d4d
Oct 20 19:32:54 node1 kernel: Call Trace:
Oct 20 19:32:54 node1 kernel: [<ffffffff8020622a>] hypercall_page+0x22a/0x1000
Oct 20 19:32:54 node1 kernel: [<ffffffff881f2d4d>] :dm_mod:dm_any_congested+0x38/0x3f
Oct 20 19:32:54 node1 kernel: [<ffffffff881f4ab4>] :dm_mod:dm_table_any_congested+0x46/0x62
Oct 20 19:32:54 node1 kernel: [<ffffffff881f2d4d>] :dm_mod:dm_any_congested+0x38/0x3f
Oct 20 19:32:54 node1 kernel: [<ffffffff885750f8>] :gfs2:just_schedule+0x0/0xe
Oct 20 19:32:54 node1 kernel: [<ffffffff88575101>] :gfs2:just_schedule+0x9/0xe
Oct 20 19:32:54 node1 kernel: [<ffffffff802628e7>] __wait_on_bit+0x40/0x6e
Oct 20 19:32:54 node1 kernel: [<ffffffff885750f8>] :gfs2:just_schedule+0x0/0xe
Oct 20 19:32:54 node1 kernel: [<ffffffff80262981>] out_of_line_wait_on_bit+0x6c/0x78
Oct 20 19:32:54 node1 kernel: [<ffffffff8029a018>] wake_bit_function+0x0/0x23
Oct 20 19:32:54 node1 kernel: [<ffffffff885750f3>] :gfs2:gfs2_glock_wait+0x2b/0x30
Oct 20 19:32:54 node1 kernel: [<ffffffff88584d5c>] :gfs2:gfs2_write_inode+0x5f/0x157
Oct 20 19:32:54 node1 kernel: [<ffffffff88584d54>] :gfs2:gfs2_write_inode+0x57/0x157
Oct 20 19:32:54 node1 kernel: [<ffffffff80230da1>] __writeback_single_inode+0x1e9/0x328
Oct 20 19:32:54 node1 kernel: [<ffffffff881f2d4d>] :dm_mod:dm_any_congested+0x38/0x3f
Oct 20 19:32:54 node1 kernel: [<ffffffff881f4ab4>] :dm_mod:dm_table_any_congested+0x46/0x62
Oct 20 19:32:54 node1 kernel: [<ffffffff80221280>] sync_sb_inodes+0x1a9/0x267
Oct 20 19:32:54 node1 kernel: [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:54 node1 kernel: [<ffffffff80252d60>] writeback_inodes+0x82/0xd8
Oct 20 19:32:54 node1 kernel: [<ffffffff802be0c4>] background_writeout+0x82/0xb5
Oct 20 19:32:54 node1 kernel: [<ffffffff80258027>] pdflush+0x0/0x207
Oct 20 19:32:54 node1 kernel: [<ffffffff80258180>] pdflush+0x159/0x207
Oct 20 19:32:54 node1 kernel: [<ffffffff802be042>] background_writeout+0x0/0xb5
Oct 20 19:32:54 node1 kernel: [<ffffffff80233575>] kthread+0xfe/0x132
Oct 20 19:32:54 node1 kernel: [<ffffffff8025fb2c>] child_rip+0xa/0x12
Oct 20 19:32:54 node1 kernel: [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:54 node1 kernel: [<ffffffff80233477>] kthread+0x0/0x132
Oct 20 19:32:54 node1 kernel: [<ffffffff8025fb22>] child_rip+0x0/0x12
Oct 20 19:32:54 node1 kernel:
Oct 20 19:32:54 node1 kernel: kswapd0 D ffff8800010cd460 0 579 99 580 578 (L-TLB)
Oct 20 19:32:54 node1 kernel: ffff88007eb23be0 0000000000000246 000000000000000a ffff88007f62c860
Oct 20 19:32:54 node1 kernel: 000000000000000a ffff88007f62c860 ffff88005e8957e0 0000000000003cca
Oct 20 19:32:54 node1 kernel: ffff88007f62ca48 ffffffffffffffff
Oct 20 19:32:54 node1 kernel: Call Trace:
Oct 20 19:32:54 node1 kernel: [<ffffffff885ef556>] :lock_dlm:gdlm_ast+0x0/0x311
Oct 20 19:32:54 node1 kernel: [<ffffffff885750f8>] :gfs2:just_schedule+0x0/0xe
Oct 20 19:32:54 node1 kernel: [<ffffffff88575101>] :gfs2:just_schedule+0x9/0xe
Oct 20 19:32:54 node1 kernel: [<ffffffff802628e7>] __wait_on_bit+0x40/0x6e
Oct 20 19:32:54 node1 kernel: [<ffffffff885750f8>] :gfs2:just_schedule+0x0/0xe
Oct 20 19:32:54 node1 kernel: [<ffffffff80262981>] out_of_line_wait_on_bit+0x6c/0x78
Oct 20 19:32:54 node1 kernel: [<ffffffff8029a018>] wake_bit_function+0x0/0x23
Oct 20 19:32:54 node1 kernel: [<ffffffff88584b43>] :gfs2:gfs2_delete_inode+0x6f/0x191
Oct 20 19:32:54 node1 kernel: [<ffffffff88584b1a>] :gfs2:gfs2_delete_inode+0x46/0x191
Oct 20 19:32:54 node1 kernel: [<ffffffff88584ad4>] :gfs2:gfs2_delete_inode+0x0/0x191
Oct 20 19:32:54 node1 kernel: [<ffffffff80230465>] generic_delete_inode+0xc6/0x143
Oct 20 19:32:54 node1 kernel: [<ffffffff802d96ef>] prune_one_dentry+0x4d/0x76
Oct 20 19:32:54 node1 kernel: [<ffffffff8022f925>] prune_dcache+0x10f/0x149
Oct 20 19:32:54 node1 kernel: [<ffffffff802d972f>] shrink_dcache_memory+0x17/0x30
Oct 20 19:32:54 node1 kernel: [<ffffffff802409dc>] shrink_slab+0xdc/0x154
Oct 20 19:32:54 node1 kernel: [<ffffffff80259a63>] kswapd+0x347/0x447
Oct 20 19:32:54 node1 kernel: [<ffffffff8026defe>] monotonic_clock+0x35/0x7b
Oct 20 19:32:54 node1 kernel: [<ffffffff80299fea>] autoremove_wake_function+0x0/0x2e
Oct 20 19:32:54 node1 kernel: [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:54 node1 kernel: [<ffffffff8025971c>] kswapd+0x0/0x447
Oct 20 19:32:54 node1 kernel: [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:54 node1 kernel: [<ffffffff80233575>] kthread+0xfe/0x132
Oct 20 19:32:54 node1 kernel: [<ffffffff8025fb2c>] child_rip+0xa/0x12
Oct 20 19:32:54 node1 kernel: [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:54 node1 kernel: [<ffffffff80233477>] kthread+0x0/0x132
Oct 20 19:32:54 node1 kernel: [<ffffffff8025fb22>] child_rip+0x0/0x12

Version-Release number of selected component (if applicable):
Linux node1 2.6.18-128.4.1.el5xen #1 SMP Thu Jul 23 20:15:43 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Not reproducible.

Steps to Reproduce:
N/A

Actual results:
dlm deadlock.

Expected results:
The dlm does not deadlock.
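To make the allocation-flag distinction above concrete, here is a minimal, hypothetical sketch. dlm_alloc_msgbuf() is an invented helper name for illustration only; it is not the actual dlm code path (dlm_lowcomms_get_buffer in the trace) and not the patch that shipped in the fixed kernel. The only point is the gfp flag semantics: GFP_KERNEL allows direct reclaim to call back into filesystem shrinkers, GFP_NOFS does not.

/*
 * Hypothetical sketch -- not the shipped RHEL5 patch.
 *
 *   GFP_KERNEL: direct reclaim may run the dcache/inode shrinkers,
 *               which for GFS2 means gfs2_delete_inode -> dlm_lock,
 *               i.e. a new dlm request from the very dlm_recv thread
 *               that is supposed to service dlm traffic -> deadlock.
 *   GFP_NOFS:   reclaim is still allowed, but it must not re-enter
 *               filesystem code, so the cycle cannot form.
 */
#include <linux/gfp.h>
#include <linux/slab.h>

static void *dlm_alloc_msgbuf(size_t len)
{
	/* GFP_NOFS keeps memory reclaim from recursing into GFS/GFS2. */
	return kmalloc(len, GFP_NOFS);
}

This mirrors what ls_allocation = GFP_NOFS already does for GFS lockspaces; how the released kernel below actually resolves the GFP_KERNEL lockspaces (clvmd, rgmanager) is documented in the errata, not in this sketch.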
The fix is included in kernel-2.6.18-173.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here by March 3rd, 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug for each request and escalate it through your support representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html
Hi, I actually have the same problem with CentOS 5.5. At the beginning I thought the problem was old libraries, but I updated to the latest packages available with "yum update" and the results are the same. The software I have installed is:

cman.x86_64 2.0.115-34.el5
drbd83.x86_64 8.3.2-6.el5_3
gfs2-utils.x86_64 0.1.62-20.el5
kernel.x86_64 2.6.18-194.3.1.el5
kernel-headers.x86_64 2.6.18-194.3.1.el5
kmod-drbd83.x86_64 8.3.2-6.el5_3
openais.x86_64 0.80.6-16.el5_5.1

I have two machines with exactly the same software and hardware configuration, running DRBD as "primary/primary" plus GFS2 to share the file system between the nodes. At first, with little access to the GFS2 file system (and few files per directory), everything worked fine. But when I increased the access to the file system (to a medium rate) or the number of files per directory (let's say more than 1500), everything changed a lot: the kernel crashes happen within as little as a few minutes or a few hours, and the log entries I get are:

Jun 10 11:46:47 correo-1 kernel: block drbd0: [drbd0_worker/2369] sock_sendmsg time expired, ko = 4294967295
Jun 10 11:46:53 correo-1 kernel: block drbd0: [drbd0_worker/2369] sock_sendmsg time expired, ko = 4294967294
Jun 10 11:48:17 correo-1 kernel: INFO: task httpd:15786 blocked for more than 120 seconds.
Jun 10 11:48:17 correo-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 10 11:48:17 correo-1 kernel: httpd D ffff810001015120 0 15786 2961 18242 12245 (NOTLB)
Jun 10 11:48:17 correo-1 kernel: ffff81026105dd68 0000000000000086 ffff81026eed9048 ffff810208a91cc8
Jun 10 11:48:17 correo-1 kernel: ffff81026eed9048 000000000000000a ffff81027927a7a0 ffff8101097a0080
Jun 10 11:48:17 correo-1 kernel: 000050f6ce60a2cc 00000000012b8094 ffff81027927a988 0000000200000001
Jun 10 11:48:17 correo-1 kernel: Call Trace:
Jun 10 11:48:17 correo-1 kernel: [<ffffffff88541ee7>] :gfs2:just_schedule+0x0/0xe
Jun 10 11:48:17 correo-1 kernel: [<ffffffff88541ef0>] :gfs2:just_schedule+0x9/0xe
Jun 10 11:48:17 correo-1 kernel: [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
Jun 10 11:48:17 correo-1 kernel: [<ffffffff88541ee7>] :gfs2:just_schedule+0x0/0xe
Jun 10 11:48:17 correo-1 kernel: [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
Jun 10 11:48:17 correo-1 kernel: [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
Jun 10 11:48:17 correo-1 kernel: [<ffffffff88541ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30
Jun 10 11:48:17 correo-1 kernel: [<ffffffff8854ca37>] :gfs2:gfs2_flock+0x171/0x1ec
Jun 10 11:48:17 correo-1 kernel: [<ffffffff8001e995>] __dentry_open+0x101/0x1dc
Jun 10 11:48:17 correo-1 kernel: [<ffffffff800274b2>] do_filp_open+0x2a/0x38
Jun 10 11:48:17 correo-1 kernel: [<ffffffff800b76a6>] audit_syscall_entry+0x180/0x1b3
Jun 10 11:48:17 correo-1 kernel: [<ffffffff800eae55>] sys_flock+0x11a/0x153
Jun 10 11:48:17 correo-1 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jun 10 12:39:05 correo-1 kernel: INFO: task pdflush:306 blocked for more than 120 seconds.
Jun 10 12:39:05 correo-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 10 12:39:05 correo-1 kernel: pdflush D ffff81000100caa0 0 306 105 307 305 (L-TLB)
Jun 10 12:39:05 correo-1 kernel: ffff8102afbdfbd0 0000000000000046 0000000000000001 ffff8102723ec9a8
Jun 10 12:39:05 correo-1 kernel: ffff8102afbdfc40 000000000000000a ffff8102afa647a0 ffff810109791100
Jun 10 12:39:05 correo-1 kernel: 00000041b598bfb8 000000000001c56a ffff8102afa64988 0000000171b65190
Jun 10 12:39:05 correo-1 kernel: Call Trace:
Jun 10 12:39:05 correo-1 kernel: [<ffffffff8001a927>] submit_bh+0x10a/0x111
Jun 10 12:39:05 correo-1 kernel: [<ffffffff88549ee7>] :gfs2:just_schedule+0x0/0xe
Jun 10 12:39:05 correo-1 kernel: [<ffffffff88549ef0>] :gfs2:just_schedule+0x9/0xe
Jun 10 12:39:05 correo-1 kernel: [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
Jun 10 12:39:05 correo-1 kernel: [<ffffffff88549ee7>] :gfs2:just_schedule+0x0/0xe
Jun 10 12:39:05 correo-1 kernel: [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
Jun 10 12:39:05 correo-1 kernel: [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
Jun 10 12:39:05 correo-1 kernel: [<ffffffff88549ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30
Jun 10 12:39:05 correo-1 kernel: [<ffffffff8855a269>] :gfs2:gfs2_write_inode+0x5f/0x152
Jun 10 12:39:05 correo-1 kernel: [<ffffffff8855a261>] :gfs2:gfs2_write_inode+0x57/0x152
Jun 10 12:39:05 correo-1 kernel: [<ffffffff8002fbf8>] __writeback_single_inode+0x1e9/0x328
Jun 10 12:39:05 correo-1 kernel: [<ffffffff8002e1c9>] __wake_up+0x38/0x4f
Jun 10 12:39:05 correo-1 kernel: [<ffffffff80020ec9>] sync_sb_inodes+0x1b5/0x26f
Jun 10 12:39:05 correo-1 kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 10 12:39:05 correo-1 kernel: [<ffffffff8005123a>] writeback_inodes+0x82/0xd8
Jun 10 12:39:05 correo-1 kernel: [<ffffffff800c97b5>] wb_kupdate+0xd4/0x14e
Jun 10 12:39:05 correo-1 kernel: [<ffffffff80056879>] pdflush+0x0/0x1fb
Jun 10 12:39:05 correo-1 kernel: [<ffffffff800569ca>] pdflush+0x151/0x1fb
Jun 10 12:39:05 correo-1 kernel: [<ffffffff800c96e1>] wb_kupdate+0x0/0x14e
Jun 10 12:39:05 correo-1 kernel: [<ffffffff80032894>] kthread+0xfe/0x132
Jun 10 12:39:05 correo-1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 10 12:39:05 correo-1 kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 10 12:39:05 correo-1 kernel: [<ffffffff80032796>] kthread+0x0/0x132
Jun 10 12:39:05 correo-1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

This combination of conditions is really bad for me, because those nodes are my mail servers (with postfix-2.7.0 and dovecot-1.2.11) and the common situation is constant access to directories with many files (more than 2000). So I had to migrate all the mail access (smtp, imap, pop, webmail, etc.) to one node and leave the other one alone, but even in that case, when I have a high mail flow, the crashes happen again.
CentOS is not a Red Hat product, but we welcome bug reports on Red Hat products here in our public Bugzilla database. Also, if you would like technical support, please log in at support.redhat.com or visit www.redhat.com (or call us!) for information on subscription offerings to suit your needs. In addition, DRBD is not presently supported by Red Hat since it does not ship with the RHEL5 kernel. I believe there are commercial vendors of DRBD that may be able to assist you with DRBD-specific issues. Thanks.