Bug 530537
Summary: | dlm_recv deadlock under memory pressure while processing GFP_KERNEL locks. | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Eduardo Damato <edamato> |
Component: | kernel | Assignee: | David Teigland <teigland> |
Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 5.4 | CC: | asados, bmr, cward, dhoward, djansa, dzickus, jarod, jpirko, jtluka, liko, mgahagan, tao |
Target Milestone: | rc | Keywords: | ZStream |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2010-03-30 06:59:06 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 526947, 533859 |
Description
Eduardo Damato
2009-10-23 10:54:01 UTC
in kernel-2.6.18-173.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.

~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

Hi, I actually have the same problem with CentOS 5.5. At the beginning I thought that the problem was the old libraries, but I updated to the latest packages available with "yum update" and the result is the same.
The software that I've installed is:

cman.x86_64 2.0.115-34.el5
drbd83.x86_64 8.3.2-6.el5_3
gfs2-utils.x86_64 0.1.62-20.el5
kernel.x86_64 2.6.18-194.3.1.el5
kernel-headers.x86_64 2.6.18-194.3.1.el5
kmod-drbd83.x86_64 8.3.2-6.el5_3
openais.x86_64 0.80.6-16.el5_5.1

I have two machines with exactly the same software and hardware configuration, running DRBD as "primary/primary" plus GFS2 to share the file system between the nodes. At first, with light access to the GFS2 file system (and few files per directory), everything works fine. But when access to the file system rises to a medium rate, or the number of files per directory grows (let's say more than 1500), everything changes a lot: the kernel crashes happen within a few minutes to a few hours, and the logs I get are:

Jun 10 11:46:47 correo-1 kernel: block drbd0: [drbd0_worker/2369] sock_sendmsg time expired, ko = 4294967295
Jun 10 11:46:53 correo-1 kernel: block drbd0: [drbd0_worker/2369] sock_sendmsg time expired, ko = 4294967294
Jun 10 11:48:17 correo-1 kernel: INFO: task httpd:15786 blocked for more than 120 seconds.
Jun 10 11:48:17 correo-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 10 11:48:17 correo-1 kernel: httpd D ffff810001015120 0 15786 2961 18242 12245 (NOTLB)
Jun 10 11:48:17 correo-1 kernel: ffff81026105dd68 0000000000000086 ffff81026eed9048 ffff810208a91cc8
Jun 10 11:48:17 correo-1 kernel: ffff81026eed9048 000000000000000a ffff81027927a7a0 ffff8101097a0080
Jun 10 11:48:17 correo-1 kernel: 000050f6ce60a2cc 00000000012b8094 ffff81027927a988 0000000200000001
Jun 10 11:48:17 correo-1 kernel: Call Trace:
Jun 10 11:48:17 correo-1 kernel: [<ffffffff88541ee7>] :gfs2:just_schedule+0x0/0xe
Jun 10 11:48:17 correo-1 kernel: [<ffffffff88541ef0>] :gfs2:just_schedule+0x9/0xe
Jun 10 11:48:17 correo-1 kernel: [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
Jun 10 11:48:17 correo-1 kernel: [<ffffffff88541ee7>] :gfs2:just_schedule+0x0/0xe
Jun 10 11:48:17 correo-1 kernel: [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
Jun 10 11:48:17 correo-1 kernel: [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
Jun 10 11:48:17 correo-1 kernel: [<ffffffff88541ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30
Jun 10 11:48:17 correo-1 kernel: [<ffffffff8854ca37>] :gfs2:gfs2_flock+0x171/0x1ec
Jun 10 11:48:17 correo-1 kernel: [<ffffffff8001e995>] __dentry_open+0x101/0x1dc
Jun 10 11:48:17 correo-1 kernel: [<ffffffff800274b2>] do_filp_open+0x2a/0x38
Jun 10 11:48:17 correo-1 kernel: [<ffffffff800b76a6>] audit_syscall_entry+0x180/0x1b3
Jun 10 11:48:17 correo-1 kernel: [<ffffffff800eae55>] sys_flock+0x11a/0x153
Jun 10 11:48:17 correo-1 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jun 10 12:39:05 correo-1 kernel: INFO: task pdflush:306 blocked for more than 120 seconds.
Jun 10 12:39:05 correo-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 10 12:39:05 correo-1 kernel: pdflush D ffff81000100caa0 0 306 105 307 305 (L-TLB)
Jun 10 12:39:05 correo-1 kernel: ffff8102afbdfbd0 0000000000000046 0000000000000001 ffff8102723ec9a8
Jun 10 12:39:05 correo-1 kernel: ffff8102afbdfc40 000000000000000a ffff8102afa647a0 ffff810109791100
Jun 10 12:39:05 correo-1 kernel: 00000041b598bfb8 000000000001c56a ffff8102afa64988 0000000171b65190
Jun 10 12:39:05 correo-1 kernel: Call Trace:
Jun 10 12:39:05 correo-1 kernel: [<ffffffff8001a927>] submit_bh+0x10a/0x111
Jun 10 12:39:05 correo-1 kernel: [<ffffffff88549ee7>] :gfs2:just_schedule+0x0/0xe
Jun 10 12:39:05 correo-1 kernel: [<ffffffff88549ef0>] :gfs2:just_schedule+0x9/0xe
Jun 10 12:39:05 correo-1 kernel: [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
Jun 10 12:39:05 correo-1 kernel: [<ffffffff88549ee7>] :gfs2:just_schedule+0x0/0xe
Jun 10 12:39:05 correo-1 kernel: [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
Jun 10 12:39:05 correo-1 kernel: [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
Jun 10 12:39:05 correo-1 kernel: [<ffffffff88549ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30
Jun 10 12:39:05 correo-1 kernel: [<ffffffff8855a269>] :gfs2:gfs2_write_inode+0x5f/0x152
Jun 10 12:39:05 correo-1 kernel: [<ffffffff8855a261>] :gfs2:gfs2_write_inode+0x57/0x152
Jun 10 12:39:05 correo-1 kernel: [<ffffffff8002fbf8>] __writeback_single_inode+0x1e9/0x328
Jun 10 12:39:05 correo-1 kernel: [<ffffffff8002e1c9>] __wake_up+0x38/0x4f
Jun 10 12:39:05 correo-1 kernel: [<ffffffff80020ec9>] sync_sb_inodes+0x1b5/0x26f
Jun 10 12:39:05 correo-1 kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 10 12:39:05 correo-1 kernel: [<ffffffff8005123a>] writeback_inodes+0x82/0xd8
Jun 10 12:39:05 correo-1 kernel: [<ffffffff800c97b5>] wb_kupdate+0xd4/0x14e
Jun 10 12:39:05 correo-1 kernel: [<ffffffff80056879>] pdflush+0x0/0x1fb
Jun 10 12:39:05 correo-1 kernel: [<ffffffff800569ca>] pdflush+0x151/0x1fb
Jun 10 12:39:05 correo-1 kernel: [<ffffffff800c96e1>] wb_kupdate+0x0/0x14e
Jun 10 12:39:05 correo-1 kernel: [<ffffffff80032894>] kthread+0xfe/0x132
Jun 10 12:39:05 correo-1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 10 12:39:05 correo-1 kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 10 12:39:05 correo-1 kernel: [<ffffffff80032796>] kthread+0x0/0x132
Jun 10 12:39:05 correo-1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

This combination of conditions is really bad for me, because those nodes are my mail servers (with postfix-2.7.0 and dovecot-1.2.11), and the common condition is constant access to directories with many files (more than 2000). So I had to migrate all the mail access (smtp, imap, pop, webmail, etc.) to one node and leave the other one alone, but even in that case, when I have a high mail flow, the crashes happen again.

CentOS is not a Red Hat product, but we welcome bug reports on Red Hat products here in our public bugzilla database. Also, if you would like technical support, please log in at support.redhat.com or visit www.redhat.com (or call us!) for information on subscription offerings to suit your needs.

In addition, DRBD is not presently supported by Red Hat since it does not ship with the RHEL5 kernel. I believe there are commercial vendors of DRBD that may be able to assist you with DRBD-specific issues. Thanks.