Bug 530537 - dlm_recv deadlock under memory pressure while processing GFP_KERNEL locks.
Summary: dlm_recv deadlock under memory pressure while processing GFP_KERNEL locks.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: David Teigland
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 526947 533859
 
Reported: 2009-10-23 10:54 UTC by Eduardo Damato
Modified: 2018-10-27 15:41 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 06:59:06 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0178 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update 2010-03-29 12:18:21 UTC

Description Eduardo Damato 2009-10-23 10:54:01 UTC
Description of problem:

Thanks to Dave Teigland for the analysis on this deadlock.

----

On RHEL5 there is a possibility of deadlock in dlm_recv/dlm_send when dlm_recv/dlm_send are processing a lock in a lockspace that uses GFP_KERNEL allocations.

In such a case, if GFS is in use and the dlm's memory allocation triggers the kernel to free memory from filesystems, GFS will try to free memory, which in turn issues dlm requests. GFS may therefore call back into the dlm, causing dlm_recv/dlm_send to deadlock.

When this deadlock occurs, all dlm operations on the node lock up. Because dlm_send/dlm_recv can no longer communicate, there is the possibility of GFS hanging completely cluster wide.

GFS dlm requests are made with ls_allocation set to GFP_NOFS, which prevents the allocator from reclaiming memory through filesystem drivers. However, all other dlm applications, namely clvmd and rgmanager, use GFP_KERNEL; therefore, if clvmd or rgmanager is in use in the cluster and the system is under memory pressure, it is possible to hit this deadlock.
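
To make the distinction concrete, here is a minimal, hypothetical illustration (the helper name and context are invented; only the flag semantics matter). With GFP_KERNEL the allocator may enter direct reclaim through the filesystem shrinkers; with GFP_NOFS the dcache/inode shrinkers are skipped, so the allocation cannot call back into GFS2 or the dlm:

/*
 * Hypothetical illustration only -- not the actual dlm source.
 * ls_allocation is the per-lockspace allocation mode described above.
 */
#include <linux/slab.h>

static void *dlm_alloc_example(size_t len, gfp_t ls_allocation)
{
	/*
	 * ls_allocation == GFP_KERNEL: this call may recurse via
	 * try_to_free_pages() -> shrink_slab() -> prune_dcache() ->
	 * gfs2_delete_inode() -> dlm_lock(), re-entering the dlm from
	 * the very thread (dlm_recv) that must service the request.
	 *
	 * ls_allocation == GFP_NOFS: __GFP_FS is clear, so reclaim
	 * skips the filesystem shrinkers and no callback can occur.
	 */
	return kmalloc(len, ls_allocation);
}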


There was a real occurrence of this deadlock. The sequence of events that triggered the problem (a sketch of the fix direction follows the list) is:

- dlm_recv was processing an rgmanager lock.
- rgmanager lockspaces use ls_allocation = GFP_KERNEL.
- dlm_recv then made a memory allocation.
- Due to the memory pressure on the system, the allocation triggered the kernel to reclaim memory from GFS2.
- GFS2 then called into the dlm to take a lock and waited for it.
- This means the dlm_recv thread deadlocked; there is only one dlm_recv thread, and it services all lockspaces.
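
For context, the upstream dlm later moved to making these allocations with GFP_NOFS unconditionally rather than honouring a per-lockspace GFP_KERNEL. Whether the RHEL5 erratum takes exactly this shape is not shown in this bug, so the helper below is only a hypothetical sketch of that direction:

/*
 * Hypothetical sketch only -- not the actual RHEL5 patch. struct dlm_ls
 * is the per-lockspace state; ls_allocation was its allocation mode.
 */
#include <linux/gfp.h>

struct dlm_ls;

static inline gfp_t dlm_allocation(const struct dlm_ls *ls)
{
	/*
	 * Before: return ls->ls_allocation;
	 *   (GFP_KERNEL for user lockspaces such as clvmd and rgmanager,
	 *    GFP_NOFS for GFS/GFS2 lockspaces)
	 *
	 * After: GFP_NOFS unconditionally, so an allocation made while
	 * servicing lock traffic (e.g. in dlm_lowcomms_get_buffer() or
	 * _create_message(), both visible in the trace below) can never
	 * enter filesystem reclaim and call back into GFS2 -> dlm.
	 */
	return GFP_NOFS;
}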

As we can see, clurgmgrd is blocked in a dlm:search_rsb_list request:

Oct 20 19:32:56 node1 kernel: clurgmgrd     D ffff880001095460     0  5767  20817          5768  5766 (NOTLB)
Oct 20 19:32:56 node1 kernel:  ffff880036075d58  0000000000000282  0000000000000000  0000000000000000
Oct 20 19:32:56 node1 kernel:  0000000000000007  ffff88007f7717e0  ffff8800789897e0  00000000000164fc
Oct 20 19:32:56 node1 kernel:  ffff88007f7719c8  0000000000000000
Oct 20 19:32:56 node1 kernel: Call Trace:
Oct 20 19:32:56 node1 kernel:  [<ffffffff88549b7b>] :dlm:search_rsb_list+0x3f/0x85
Oct 20 19:32:56 node1 kernel:  [<ffffffff88549c12>] :dlm:_search_rsb+0x51/0x1b1
Oct 20 19:32:56 node1 kernel:  [<ffffffff80262b40>] __mutex_lock_slowpath+0x60/0x9b
Oct 20 19:32:56 node1 kernel:  [<ffffffff8038562e>] extract_entropy+0x47/0x90
Oct 20 19:32:56 node1 kernel:  [<ffffffff80262b8a>] .text.lock.mutex+0xf/0x14
Oct 20 19:32:56 node1 kernel:  [<ffffffff8854a483>] :dlm:request_lock+0x52/0xa0
Oct 20 19:32:56 node1 kernel:  [<ffffffff8854d4f9>] :dlm:dlm_user_request+0xed/0x174
Oct 20 19:32:56 node1 kernel:  [<ffffffff885544b7>] :dlm:device_write+0x2f5/0x5e5
Oct 20 19:32:56 node1 kernel:  [<ffffffff80216d8b>] vfs_write+0xce/0x174
Oct 20 19:32:56 node1 kernel:  [<ffffffff802175d8>] sys_write+0x45/0x6e
Oct 20 19:32:56 node1 kernel:  [<ffffffff8025f2f9>] tracesys+0xab/0xb6
Oct 20 19:32:56 node1 kernel:

dlm_recv is also blocked on a dlm request made during dcache shrinking:


Oct 20 19:32:55 node1 kernel: dlm_recv      D ffff880001041460     0 12321     99         12322 12320 (L-TLB)
Oct 20 19:32:55 node1 kernel:  ffff88007e8e36e0  0000000000000246  0000000000000000  ffff88006aec1000
Oct 20 19:32:55 node1 kernel:  000000000000000a  ffff88006e74e820  ffff88007f6290c0  0000000000060610
Oct 20 19:32:55 node1 kernel:  ffff88006e74ea08  0000000000000000
Oct 20 19:32:55 node1 kernel: Call Trace:
Oct 20 19:32:55 node1 kernel:  [<ffffffff8854d784>] :dlm:dlm_put_lockspace+0x10/0x1f
Oct 20 19:32:55 node1 kernel:  [<ffffffff8854be2b>] :dlm:dlm_lock+0x117/0x129
Oct 20 19:32:55 node1 kernel:  [<ffffffff885ef556>] :lock_dlm:gdlm_ast+0x0/0x311
Oct 20 19:32:55 node1 kernel:  [<ffffffff885ef2c1>] :lock_dlm:gdlm_bast+0x0/0x8d
Oct 20 19:32:55 node1 kernel:  [<ffffffff885750f8>] :gfs2:just_schedule+0x0/0xe
Oct 20 19:32:55 node1 kernel:  [<ffffffff88575101>] :gfs2:just_schedule+0x9/0xe
Oct 20 19:32:55 node1 kernel:  [<ffffffff802628e7>] __wait_on_bit+0x40/0x6e
Oct 20 19:32:55 node1 kernel:  [<ffffffff885750f8>] :gfs2:just_schedule+0x0/0xe
Oct 20 19:32:55 node1 kernel:  [<ffffffff80262981>] out_of_line_wait_on_bit+0x6c/0x78
Oct 20 19:32:55 node1 kernel:  [<ffffffff8029a018>] wake_bit_function+0x0/0x23
Oct 20 19:32:55 node1 kernel:  [<ffffffff885750f3>] :gfs2:gfs2_glock_wait+0x2b/0x30
Oct 20 19:32:55 node1 kernel:  [<ffffffff88584b22>] :gfs2:gfs2_delete_inode+0x4e/0x191
Oct 20 19:32:55 node1 kernel:  [<ffffffff88584b1a>] :gfs2:gfs2_delete_inode+0x46/0x191
Oct 20 19:32:55 node1 kernel:  [<ffffffff88584ad4>] :gfs2:gfs2_delete_inode+0x0/0x191
Oct 20 19:32:55 node1 kernel:  [<ffffffff80230465>] generic_delete_inode+0xc6/0x143
Oct 20 19:32:55 node1 kernel:  [<ffffffff802d96ef>] prune_one_dentry+0x4d/0x76
Oct 20 19:32:55 node1 kernel:  [<ffffffff8022f925>] prune_dcache+0x10f/0x149
Oct 20 19:32:55 node1 kernel:  [<ffffffff802d972f>] shrink_dcache_memory+0x17/0x30
Oct 20 19:32:55 node1 kernel:  [<ffffffff802409dc>] shrink_slab+0xdc/0x154
Oct 20 19:32:55 node1 kernel:  [<ffffffff802c00ac>] try_to_free_pages+0x1c8/0x2c2
Oct 20 19:32:55 node1 kernel:  [<ffffffff8020f5c6>] __alloc_pages+0x1cb/0x2ce
Oct 20 19:32:55 node1 kernel:  [<ffffffff8855055c>] :dlm:dlm_lowcomms_get_buffer+0x13f/0x1c5
Oct 20 19:32:55 node1 kernel:  [<ffffffff88547333>] :dlm:_create_message+0x30/0x8b
Oct 20 19:32:55 node1 kernel:  [<ffffffff88548854>] :dlm:add_lkb+0x15/0x15e
Oct 20 19:32:55 node1 kernel:  [<ffffffff885483b7>] :dlm:send_common_reply+0x25/0x59
Oct 20 19:32:55 node1 kernel:  [<ffffffff88548be7>] :dlm:do_request+0x34/0x93
Oct 20 19:32:55 node1 kernel:  [<ffffffff8854a9e5>] :dlm:receive_request+0xf8/0x17a
Oct 20 19:32:55 node1 kernel:  [<ffffffff8854b4cb>] :dlm:_receive_message+0x87c/0xb09
Oct 20 19:32:55 node1 kernel:  [<ffffffff80263a0d>] _spin_lock_irq+0x9/0x14
Oct 20 19:32:55 node1 kernel:  [<ffffffff802629d6>] mutex_lock+0xd/0x1d
Oct 20 19:32:55 node1 kernel:  [<ffffffff8854b858>] :dlm:dlm_receive_buffer+0xfb/0x12e
Oct 20 19:32:55 node1 kernel:  [<ffffffff80273963>] xen_send_IPI_mask+0xa5/0xaa
Oct 20 19:32:55 node1 kernel:  [<ffffffff8854ef08>] :dlm:dlm_process_incoming_buffer+0xf8/0x134
Oct 20 19:32:55 node1 kernel:  [<ffffffff8020f4e1>] __alloc_pages+0xe6/0x2ce
Oct 20 19:32:55 node1 kernel:  [<ffffffff88550fd5>] :dlm:receive_from_sock+0x5b0/0x6dc
Oct 20 19:32:55 node1 kernel:  [<ffffffff8022b9b1>] local_bh_enable+0x9/0xa5
Oct 20 19:32:55 node1 kernel:  [<ffffffff802342d2>] lock_sock+0xa7/0xb2
Oct 20 19:32:55 node1 kernel:  [<ffffffff8022b9b1>] local_bh_enable+0x9/0xa5
Oct 20 19:32:55 node1 kernel:  [<ffffffff802342d2>] lock_sock+0xa7/0xb2
Oct 20 19:32:55 node1 kernel:  [<ffffffff80231c7c>] release_sock+0x13/0xaa
Oct 20 19:32:55 node1 kernel:  [<ffffffff8033add5>] __next_cpu+0x19/0x28
Oct 20 19:32:55 node1 kernel:  [<ffffffff8025f82b>] error_exit+0x0/0x6e
Oct 20 19:32:55 node1 kernel:  [<ffffffff80285ab9>] find_busiest_group+0x1db/0x44a
Oct 20 19:32:55 node1 kernel:  [<ffffffff8855003f>] :dlm:process_recv_sockets+0x0/0x16
Oct 20 19:32:55 node1 kernel:  [<ffffffff8855004f>] :dlm:process_recv_sockets+0x10/0x16
Oct 20 19:32:55 node1 kernel:  [<ffffffff8024ef13>] run_workqueue+0x94/0xe4
Oct 20 19:32:55 node1 kernel:  [<ffffffff8024b81e>] worker_thread+0x0/0x122
Oct 20 19:32:55 node1 kernel:  [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:55 node1 kernel:  [<ffffffff8024b90e>] worker_thread+0xf0/0x122
Oct 20 19:32:55 node1 kernel:  [<ffffffff80286d89>] default_wake_function+0x0/0xe
Oct 20 19:32:55 node1 kernel:  [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:55 node1 kernel:  [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:55 node1 kernel:  [<ffffffff80233575>] kthread+0xfe/0x132
Oct 20 19:32:55 node1 kernel:  [<ffffffff8025fb2c>] child_rip+0xa/0x12

The pdflush and kswapd processes are also blocked:


Oct 20 19:32:54 node1 kernel: pdflush       D 0000b22d004b09b1     0   577     99           578   357 (L-TLB)
Oct 20 19:32:54 node1 kernel:  ffff88007eb1fbd0  0000000000000246  0000000300000000  ffff8800768286c0
Oct 20 19:32:54 node1 kernel:  000000000000000a  ffff88007f629820  ffffffff804e0a80  00000000000076f8
Oct 20 19:32:54 node1 kernel:  ffff88007f629a08  ffffffff80234070
Oct 20 19:32:54 node1 kernel: Call Trace:
Oct 20 19:32:54 node1 kernel:  [<ffffffff80234070>] submit_bio+0xcd/0xd4
Oct 20 19:32:54 node1 kernel:  [<ffffffff885750f8>] :gfs2:just_schedule+0x0/0xe
Oct 20 19:32:54 node1 kernel:  [<ffffffff88575101>] :gfs2:just_schedule+0x9/0xe
Oct 20 19:32:54 node1 kernel:  [<ffffffff802628e7>] __wait_on_bit+0x40/0x6e
Oct 20 19:32:54 node1 kernel:  [<ffffffff885750f8>] :gfs2:just_schedule+0x0/0xe
Oct 20 19:32:54 node1 kernel:  [<ffffffff80262981>] out_of_line_wait_on_bit+0x6c/0x78
Oct 20 19:32:54 node1 kernel:  [<ffffffff8029a018>] wake_bit_function+0x0/0x23
Oct 20 19:32:54 node1 kernel:  [<ffffffff885750f3>] :gfs2:gfs2_glock_wait+0x2b/0x30
Oct 20 19:32:54 node1 kernel:  [<ffffffff88584d5c>] :gfs2:gfs2_write_inode+0x5f/0x157
Oct 20 19:32:54 node1 kernel:  [<ffffffff88584d54>] :gfs2:gfs2_write_inode+0x57/0x157
Oct 20 19:32:54 node1 kernel:  [<ffffffff80230da1>] __writeback_single_inode+0x1e9/0x328
Oct 20 19:32:54 node1 kernel:  [<ffffffff881f2d4d>] :dm_mod:dm_any_congested+0x38/0x3f
Oct 20 19:32:54 node1 kernel:  [<ffffffff881f4ab4>] :dm_mod:dm_table_any_congested+0x46/0x62
Oct 20 19:32:54 node1 kernel:  [<ffffffff80221280>] sync_sb_inodes+0x1a9/0x267
Oct 20 19:32:54 node1 kernel:  [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:54 node1 kernel:  [<ffffffff80252d60>] writeback_inodes+0x82/0xd8
Oct 20 19:32:54 node1 kernel:  [<ffffffff802be1c9>] wb_kupdate+0x9e/0x112
Oct 20 19:32:54 node1 kernel:  [<ffffffff80258027>] pdflush+0x0/0x207
Oct 20 19:32:54 node1 kernel:  [<ffffffff80258180>] pdflush+0x159/0x207
Oct 20 19:32:54 node1 kernel:  [<ffffffff802be12b>] wb_kupdate+0x0/0x112
Oct 20 19:32:54 node1 kernel:  [<ffffffff80233575>] kthread+0xfe/0x132
Oct 20 19:32:54 node1 kernel:  [<ffffffff8025fb2c>] child_rip+0xa/0x12
Oct 20 19:32:54 node1 kernel:  [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:54 node1 kernel:  [<ffffffff80233477>] kthread+0x0/0x132
Oct 20 19:32:54 node1 kernel:  [<ffffffff8025fb22>] child_rip+0x0/0x12

Oct 20 19:32:54 node1 kernel: pdflush       D ffff880001041460     0   578     99           579   577 (L-TLB)
Oct 20 19:32:54 node1 kernel:  ffff88007eb21bd0  0000000000000246  ffffffff8020622a  ffffffffff578000
Oct 20 19:32:54 node1 kernel:  000000000000000a  ffff88007f6290c0  ffff88007f629820  00000000001a1b6a
Oct 20 19:32:54 node1 kernel:  ffff88007f6292a8  ffffffff881f2d4d
Oct 20 19:32:54 node1 kernel: Call Trace:
Oct 20 19:32:54 node1 kernel:  [<ffffffff8020622a>] hypercall_page+0x22a/0x1000
Oct 20 19:32:54 node1 kernel:  [<ffffffff881f2d4d>] :dm_mod:dm_any_congested+0x38/0x3f
Oct 20 19:32:54 node1 kernel:  [<ffffffff881f4ab4>] :dm_mod:dm_table_any_congested+0x46/0x62
Oct 20 19:32:54 node1 kernel:  [<ffffffff881f2d4d>] :dm_mod:dm_any_congested+0x38/0x3f
Oct 20 19:32:54 node1 kernel:  [<ffffffff885750f8>] :gfs2:just_schedule+0x0/0xe
Oct 20 19:32:54 node1 kernel:  [<ffffffff88575101>] :gfs2:just_schedule+0x9/0xe
Oct 20 19:32:54 node1 kernel:  [<ffffffff802628e7>] __wait_on_bit+0x40/0x6e
Oct 20 19:32:54 node1 kernel:  [<ffffffff885750f8>] :gfs2:just_schedule+0x0/0xe
Oct 20 19:32:54 node1 kernel:  [<ffffffff80262981>] out_of_line_wait_on_bit+0x6c/0x78
Oct 20 19:32:54 node1 kernel:  [<ffffffff8029a018>] wake_bit_function+0x0/0x23
Oct 20 19:32:54 node1 kernel:  [<ffffffff885750f3>] :gfs2:gfs2_glock_wait+0x2b/0x30
Oct 20 19:32:54 node1 kernel:  [<ffffffff88584d5c>] :gfs2:gfs2_write_inode+0x5f/0x157
Oct 20 19:32:54 node1 kernel:  [<ffffffff88584d54>] :gfs2:gfs2_write_inode+0x57/0x157
Oct 20 19:32:54 node1 kernel:  [<ffffffff80230da1>] __writeback_single_inode+0x1e9/0x328
Oct 20 19:32:54 node1 kernel:  [<ffffffff881f2d4d>] :dm_mod:dm_any_congested+0x38/0x3f
Oct 20 19:32:54 node1 kernel:  [<ffffffff881f4ab4>] :dm_mod:dm_table_any_congested+0x46/0x62
Oct 20 19:32:54 node1 kernel:  [<ffffffff80221280>] sync_sb_inodes+0x1a9/0x267
Oct 20 19:32:54 node1 kernel:  [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:54 node1 kernel:  [<ffffffff80252d60>] writeback_inodes+0x82/0xd8
Oct 20 19:32:54 node1 kernel:  [<ffffffff802be0c4>] background_writeout+0x82/0xb5
Oct 20 19:32:54 node1 kernel:  [<ffffffff80258027>] pdflush+0x0/0x207
Oct 20 19:32:54 node1 kernel:  [<ffffffff80258180>] pdflush+0x159/0x207
Oct 20 19:32:54 node1 kernel:  [<ffffffff802be042>] background_writeout+0x0/0xb5
Oct 20 19:32:54 node1 kernel:  [<ffffffff80233575>] kthread+0xfe/0x132
Oct 20 19:32:54 node1 kernel:  [<ffffffff8025fb2c>] child_rip+0xa/0x12
Oct 20 19:32:54 node1 kernel:  [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:54 node1 kernel:  [<ffffffff80233477>] kthread+0x0/0x132
Oct 20 19:32:54 node1 kernel:  [<ffffffff8025fb22>] child_rip+0x0/0x12
Oct 20 19:32:54 node1 kernel:

Oct 20 19:32:54 node1 kernel: kswapd0       D ffff8800010cd460     0   579     99           580   578 (L-TLB)
Oct 20 19:32:54 node1 kernel:  ffff88007eb23be0  0000000000000246  000000000000000a  ffff88007f62c860
Oct 20 19:32:54 node1 kernel:  000000000000000a  ffff88007f62c860  ffff88005e8957e0  0000000000003cca
Oct 20 19:32:54 node1 kernel:  ffff88007f62ca48  ffffffffffffffff
Oct 20 19:32:54 node1 kernel: Call Trace:
Oct 20 19:32:54 node1 kernel:  [<ffffffff885ef556>] :lock_dlm:gdlm_ast+0x0/0x311
Oct 20 19:32:54 node1 kernel:  [<ffffffff885750f8>] :gfs2:just_schedule+0x0/0xe
Oct 20 19:32:54 node1 kernel:  [<ffffffff88575101>] :gfs2:just_schedule+0x9/0xe
Oct 20 19:32:54 node1 kernel:  [<ffffffff802628e7>] __wait_on_bit+0x40/0x6e
Oct 20 19:32:54 node1 kernel:  [<ffffffff885750f8>] :gfs2:just_schedule+0x0/0xe
Oct 20 19:32:54 node1 kernel:  [<ffffffff80262981>] out_of_line_wait_on_bit+0x6c/0x78
Oct 20 19:32:54 node1 kernel:  [<ffffffff8029a018>] wake_bit_function+0x0/0x23
Oct 20 19:32:54 node1 kernel:  [<ffffffff88584b43>] :gfs2:gfs2_delete_inode+0x6f/0x191
Oct 20 19:32:54 node1 kernel:  [<ffffffff88584b1a>] :gfs2:gfs2_delete_inode+0x46/0x191
Oct 20 19:32:54 node1 kernel:  [<ffffffff88584ad4>] :gfs2:gfs2_delete_inode+0x0/0x191
Oct 20 19:32:54 node1 kernel:  [<ffffffff80230465>] generic_delete_inode+0xc6/0x143
Oct 20 19:32:54 node1 kernel:  [<ffffffff802d96ef>] prune_one_dentry+0x4d/0x76
Oct 20 19:32:54 node1 kernel:  [<ffffffff8022f925>] prune_dcache+0x10f/0x149
Oct 20 19:32:54 node1 kernel:  [<ffffffff802d972f>] shrink_dcache_memory+0x17/0x30
Oct 20 19:32:54 node1 kernel:  [<ffffffff802409dc>] shrink_slab+0xdc/0x154
Oct 20 19:32:54 node1 kernel:  [<ffffffff80259a63>] kswapd+0x347/0x447
Oct 20 19:32:54 node1 kernel:  [<ffffffff8026defe>] monotonic_clock+0x35/0x7b
Oct 20 19:32:54 node1 kernel:  [<ffffffff80299fea>] autoremove_wake_function+0x0/0x2e
Oct 20 19:32:54 node1 kernel:  [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:54 node1 kernel:  [<ffffffff8025971c>] kswapd+0x0/0x447
Oct 20 19:32:54 node1 kernel:  [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:54 node1 kernel:  [<ffffffff80233575>] kthread+0xfe/0x132
Oct 20 19:32:54 node1 kernel:  [<ffffffff8025fb2c>] child_rip+0xa/0x12
Oct 20 19:32:54 node1 kernel:  [<ffffffff80299dd2>] keventd_create_kthread+0x0/0xc4
Oct 20 19:32:54 node1 kernel:  [<ffffffff80233477>] kthread+0x0/0x132
Oct 20 19:32:54 node1 kernel:  [<ffffffff8025fb22>] child_rip+0x0/0x12


Version-Release number of selected component (if applicable):

Linux node1 2.6.18-128.4.1.el5xen #1 SMP Thu Jul 23 20:15:43 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:

Not reproducible.

Steps to Reproduce:

N/A
  
Actual results:

dlm deadlock.

Expected results:

dlm not to deadlock.

Comment 9 Don Zickus 2009-11-10 16:51:38 UTC
in kernel-2.6.18-173.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so. However, feel free
to provide a comment indicating that this fix has been verified.

Comment 12 Chris Ward 2010-02-11 10:28:22 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results
here by March 3rd, 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 15 errata-xmlrpc 2010-03-30 06:59:06 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

Comment 17 Asdrubal 2010-06-10 22:13:45 UTC
Hi, I actually have the same problem with CentOS 5.5. At the beginning I thought the problem was the old libraries, but I updated to the latest packages available with "yum update" and the results are the same. The software I have installed is:


cman.x86_64                                2.0.115-34.el5
drbd83.x86_64                              8.3.2-6.el5_3
gfs2-utils.x86_64                          0.1.62-20.el5
kernel.x86_64                              2.6.18-194.3.1.el5
kernel-headers.x86_64                      2.6.18-194.3.1.el5
kmod-drbd83.x86_64                         8.3.2-6.el5_3
openais.x86_64                             0.80.6-16.el5_5.1


I have two machines with exactly the same software and hardware configuration, running DRBD as "primary/primary" plus GFS2 to share the file system between the nodes. At first, with little access to the GFS2 file system (and few files per directory), everything worked fine. But when access to the file system increased (to a medium rate) or the number of files per directory grew (let's say more than 1500), everything changed a lot: the kernel crashes happen within as little as a few minutes or a few hours, and the logs I get show:

Jun 10 11:46:47 correo-1 kernel: block drbd0: [drbd0_worker/2369] sock_sendmsg time expired, ko = 4294967295
Jun 10 11:46:53 correo-1 kernel: block drbd0: [drbd0_worker/2369] sock_sendmsg time expired, ko = 4294967294
Jun 10 11:48:17 correo-1 kernel: INFO: task httpd:15786 blocked for more than 120 seconds.
Jun 10 11:48:17 correo-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 10 11:48:17 correo-1 kernel: httpd         D ffff810001015120     0 15786   2961         18242 12245 (NOTLB)
Jun 10 11:48:17 correo-1 kernel:  ffff81026105dd68 0000000000000086 ffff81026eed9048 ffff810208a91cc8
Jun 10 11:48:17 correo-1 kernel:  ffff81026eed9048 000000000000000a ffff81027927a7a0 ffff8101097a0080
Jun 10 11:48:17 correo-1 kernel:  000050f6ce60a2cc 00000000012b8094 ffff81027927a988 0000000200000001
Jun 10 11:48:17 correo-1 kernel: Call Trace:
Jun 10 11:48:17 correo-1 kernel:  [<ffffffff88541ee7>] :gfs2:just_schedule+0x0/0xe
Jun 10 11:48:17 correo-1 kernel:  [<ffffffff88541ef0>] :gfs2:just_schedule+0x9/0xe
Jun 10 11:48:17 correo-1 kernel:  [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
Jun 10 11:48:17 correo-1 kernel:  [<ffffffff88541ee7>] :gfs2:just_schedule+0x0/0xe
Jun 10 11:48:17 correo-1 kernel:  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
Jun 10 11:48:17 correo-1 kernel:  [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
Jun 10 11:48:17 correo-1 kernel:  [<ffffffff88541ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30
Jun 10 11:48:17 correo-1 kernel:  [<ffffffff8854ca37>] :gfs2:gfs2_flock+0x171/0x1ec
Jun 10 11:48:17 correo-1 kernel:  [<ffffffff8001e995>] __dentry_open+0x101/0x1dc
Jun 10 11:48:17 correo-1 kernel:  [<ffffffff800274b2>] do_filp_open+0x2a/0x38
Jun 10 11:48:17 correo-1 kernel:  [<ffffffff800b76a6>] audit_syscall_entry+0x180/0x1b3
Jun 10 11:48:17 correo-1 kernel:  [<ffffffff800eae55>] sys_flock+0x11a/0x153
Jun 10 11:48:17 correo-1 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0




Jun 10 12:39:05 correo-1 kernel: INFO: task pdflush:306 blocked for more than 120 seconds.
Jun 10 12:39:05 correo-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 10 12:39:05 correo-1 kernel: pdflush       D ffff81000100caa0     0   306    105           307   305 (L-TLB)
Jun 10 12:39:05 correo-1 kernel:  ffff8102afbdfbd0 0000000000000046 0000000000000001 ffff8102723ec9a8
Jun 10 12:39:05 correo-1 kernel:  ffff8102afbdfc40 000000000000000a ffff8102afa647a0 ffff810109791100
Jun 10 12:39:05 correo-1 kernel:  00000041b598bfb8 000000000001c56a ffff8102afa64988 0000000171b65190
Jun 10 12:39:05 correo-1 kernel: Call Trace:
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff8001a927>] submit_bh+0x10a/0x111
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff88549ee7>] :gfs2:just_schedule+0x0/0xe
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff88549ef0>] :gfs2:just_schedule+0x9/0xe
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff88549ee7>] :gfs2:just_schedule+0x0/0xe
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff88549ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff8855a269>] :gfs2:gfs2_write_inode+0x5f/0x152
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff8855a261>] :gfs2:gfs2_write_inode+0x57/0x152
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff8002fbf8>] __writeback_single_inode+0x1e9/0x328
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff8002e1c9>] __wake_up+0x38/0x4f
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff80020ec9>] sync_sb_inodes+0x1b5/0x26f
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff8005123a>] writeback_inodes+0x82/0xd8
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff800c97b5>] wb_kupdate+0xd4/0x14e
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff80056879>] pdflush+0x0/0x1fb
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff800569ca>] pdflush+0x151/0x1fb
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff800c96e1>] wb_kupdate+0x0/0x14e
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff80032894>] kthread+0xfe/0x132
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff80032796>] kthread+0x0/0x132
Jun 10 12:39:05 correo-1 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11


The combination of these conditions is really bad for me, because those nodes are my mail servers (with postfix-2.7.0 and dovecot-1.2.11), and constant access to directories with many files (more than 2000) is the common case. So I had to migrate all the mail access (smtp, imap, pop, webmail, etc.) to one node and leave the other alone, but even in that case, when I have a high mail flow, the crashes happen again.

Comment 18 Perry Myers 2010-06-11 15:26:49 UTC
CentOS is not a Red Hat product, but we welcome bug reports on Red Hat products here in our public bugzilla database. Also, if you would like technical support, please log in at support.redhat.com or visit www.redhat.com (or call us!) for information on subscription offerings to suit your needs.

In addition, DRBD is not presently supported by Red Hat since it does not ship with the RHEL5 kernel.  I believe there are commercial vendors of DRBD that may be able to assist you with DRBD specific issues.

Thanks.

