Bug 460343 - GFS2 filesystem accesses hang in D state
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: i686 Linux
Priority: medium
Severity: high
Target Milestone: rc
Assigned To: Steve Whitehouse
QA Contact: Cluster QE
Reported: 2008-08-27 13:20 EDT by Ross Vandegrift
Modified: 2009-05-27 23:36 EDT (History)
CC: 5 users

Doc Type: Bug Fix
Last Closed: 2008-09-09 09:24:50 EDT

Attachments
Glock dumps from the working node (1.35 MB, text/plain)
2008-08-27 13:22 EDT, Ross Vandegrift
Glock dumps from the broken node (530.45 KB, text/plain)
2008-08-27 13:23 EDT, Ross Vandegrift

Description Ross Vandegrift 2008-08-27 13:20:53 EDT
We've recently run into a problem with GFS2 on our cluster-suite installation.  Processes on one node that access our GFS2 storage hang in the D state and never return.  The node is not fenced, and CLVM access works fine; only GFS2-layer operations are affected.
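For anyone reproducing this, a quick sketch of how we spot the stuck processes (not part of the original report; the SysRq step assumes root and a kernel with SysRq enabled):

```shell
# List processes stuck in uninterruptible sleep (D state).
ps -eo state,pid,comm | awk '$1 == "D" { print $2, $3 }'

# Dump all task stacks to the kernel log; the traces below were
# collected this way. (Commented out: requires root on the affected box.)
# echo t > /proc/sysrq-trigger
```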

Processes that are hung due to I/O on the GFS2 filesystem all have
similar call stacks:

 =======================
ls            D C981824A  2928 11112  11000 (NOTLB)
       e3065e48 00200086 f8cc4b9d c981824a 00008773 f5126a80 00000008 f5876550
       c20e7550 c990b3d9 00008773 000f318f 00000001 f587665c c20049e0 00000044
       f8c4b10b f5327ac0 f8c4b83e ffffffff 00000000 00000000 e3065e74 00000000
Call Trace:
 [<f8cc4b9d>] put_rsb+0x27/0x36 [dlm]
 [<f8c4b10b>] gdlm_ast+0x0/0x2 [lock_dlm]
 [<f8c4b83e>] gdlm_bast+0x0/0x76 [lock_dlm]
 [<f8d21c99>] just_schedule+0x5/0x8 [gfs2]
 [<c0604d68>] __wait_on_bit+0x33/0x58
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<c0604def>] out_of_line_wait_on_bit+0x62/0x6a
 [<c0436076>] wake_bit_function+0x0/0x3c
 [<f8d21c90>] wait_on_holder+0x27/0x2b [gfs2]
 [<f8d22e32>] glock_wait_internal+0xdb/0x1ec [gfs2]
 [<f8d230b1>] gfs2_glock_nq+0x16e/0x18e [gfs2]
 [<f8d24177>] gfs2_glock_nq_atime+0x164/0x2de [gfs2]
 [<f8d2b7dd>] gfs2_readdir+0x47/0x8b [gfs2]
 [<c047f754>] filldir64+0x0/0xc5
 [<f8d2416f>] gfs2_glock_nq_atime+0x15c/0x2de [gfs2]
 [<c047f935>] vfs_readdir+0x63/0x8d
 [<c047f754>] filldir64+0x0/0xc5
 [<c047f9c2>] sys_getdents64+0x63/0xa5
 [<c0404eff>] syscall_call+0x7/0xb
 =======================
python        D 2A0CCE0D  1676 10551  10175 (NOTLB)
       f468fd7c 00000082 00000096 2a0cce0d 00000828 00000001 00000009 f5023550
       c20e7550 2a0ce9ae 00000828 00001ba1 00000001 f502365c c20049e0 f40af2c4
       f8cc4b9d 00000000 f40af2c0 ffffffff 00000000 00000000 f468fda8 00000000
Call Trace:
 [<f8cc4b9d>] put_rsb+0x27/0x36 [dlm]
 [<f8d21c99>] just_schedule+0x5/0x8 [gfs2]
 [<c0604d68>] __wait_on_bit+0x33/0x58
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<c0604def>] out_of_line_wait_on_bit+0x62/0x6a
 [<c0436076>] wake_bit_function+0x0/0x3c
 [<f8d21c90>] wait_on_holder+0x27/0x2b [gfs2]
 [<f8d22e32>] glock_wait_internal+0xdb/0x1ec [gfs2]
 [<f8d230b1>] gfs2_glock_nq+0x16e/0x18e [gfs2]
 [<f8d2e911>] gfs2_permission+0x69/0xb4 [gfs2]
 [<f8d2e90a>] gfs2_permission+0x62/0xb4 [gfs2]
 [<f8d2e8a8>] gfs2_permission+0x0/0xb4 [gfs2]
 [<c047b557>] permission+0x78/0xb5
 [<c047c9c0>] __link_path_walk+0x141/0xd33
 [<f8d23322>] gfs2_glock_dq+0x9e/0xb2 [gfs2]
 [<c048d67a>] __mark_inode_dirty+0x13d/0x14f
 [<c047d5fb>] link_path_walk+0x49/0xbd
 [<c044ae04>] audit_syscall_entry+0x11c/0x14e
 [<c047d9c8>] do_path_lookup+0x20e/0x25e
 [<c047ded5>] sys_mkdirat+0x36/0xb6
 [<c044ae04>] audit_syscall_entry+0x11c/0x14e
 [<c047df64>] sys_mkdir+0xf/0x13
 [<c0404eff>] syscall_call+0x7/0xb
 =======================


I will attach dumps of the glocks from the working and broken node.
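For reference, a hypothetical sketch of how such dumps can be captured for comparison. The filesystem ID `mycluster:gfs2vol` is a placeholder; the debugfs `glocks` file is assumed to exist on this kernel (on kernels without it, `gfs2_tool lockdump` serves the same purpose):

```shell
# Save the GFS2 glock table for one filesystem so a healthy node
# and a hung node can be diffed. Assumes debugfs is mounted at
# /sys/kernel/debug and the fs ID is "<clustername>:<fsname>".
FSID="${1:-mycluster:gfs2vol}"
GLOCKS="/sys/kernel/debug/gfs2/$FSID/glocks"

if [ -r "$GLOCKS" ]; then
    # Timestamped copy per node, e.g. glocks-node1-20080827132200.txt
    cp "$GLOCKS" "glocks-$(hostname)-$(date +%Y%m%d%H%M%S).txt"
else
    echo "no glock dump at $GLOCKS (is debugfs mounted?)" >&2
fi
```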
Comment 1 Ross Vandegrift 2008-08-27 13:22:02 EDT
Created attachment 315122 [details]
Glock dumps from the working node
Comment 2 Ross Vandegrift 2008-08-27 13:23:26 EDT
Created attachment 315123 [details]
Glock dumps from the broken node
Comment 3 Steve Whitehouse 2008-09-01 08:19:55 EDT
Ah, I guess I should have asked earlier which version you were using.... Sorry, but 5.1 is too old and it's best avoided. Can you upgrade to the latest 5.2?

That will almost certainly solve your problem.
Comment 4 Ross Vandegrift 2008-09-05 14:24:39 EDT
We've completed upgrading all of the cluster machines to 5.2.  I'll observe over the weekend and evaluate where things stand on Monday.
Comment 5 Ross Vandegrift 2008-09-09 09:16:05 EDT
So far, the upgrade to 5.2 seems to have helped.  Thanks for the suggestion Steve!
Comment 6 Nate Straz 2008-09-09 09:24:50 EDT
We'll close this bug since 5.2 is working better.  If you find anything else that breaks, file a new bug.
