Bug 460343

Summary: GFS2 filesystem accesses hang in D state
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.1
Hardware: i686
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: medium
Target Milestone: rc
Reporter: Ross Vandegrift <ross>
Assignee: Steve Whitehouse <swhiteho>
QA Contact: Cluster QE <mspqa-list>
CC: adam.greenfield, cluster-maint, edamato, nstraz, ross
Doc Type: Bug Fix
Last Closed: 2008-09-09 13:24:50 UTC
Attachments:
  Glock dumps from the working node
  Glock dumps from the broken node

Description Ross Vandegrift 2008-08-27 17:20:53 UTC
We've recently run into a problem with GFS2 on our cluster-suite installation.  Processes on one node that access our GFS2 storage hang in the D state and never return.  The node is not fenced and CLVM access works fine; only operations at the GFS2 layer are affected.

Processes that are hung on I/O to the GFS2 filesystem all have a similar call stack:

 =======================
ls            D C981824A  2928 11112  11000 (NOTLB)
       e3065e48 00200086 f8cc4b9d c981824a 00008773 f5126a80 00000008 f5876550
       c20e7550 c990b3d9 00008773 000f318f 00000001 f587665c c20049e0 00000044
       f8c4b10b f5327ac0 f8c4b83e ffffffff 00000000 00000000 e3065e74 00000000
Call Trace:
 [<f8cc4b9d>] put_rsb+0x27/0x36 [dlm]
 [<f8c4b10b>] gdlm_ast+0x0/0x2 [lock_dlm]
 [<f8c4b83e>] gdlm_bast+0x0/0x76 [lock_dlm]
 [<f8d21c99>] just_schedule+0x5/0x8 [gfs2]
 [<c0604d68>] __wait_on_bit+0x33/0x58
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<c0604def>] out_of_line_wait_on_bit+0x62/0x6a
 [<c0436076>] wake_bit_function+0x0/0x3c
 [<f8d21c90>] wait_on_holder+0x27/0x2b [gfs2]
 [<f8d22e32>] glock_wait_internal+0xdb/0x1ec [gfs2]
 [<f8d230b1>] gfs2_glock_nq+0x16e/0x18e [gfs2]
 [<f8d24177>] gfs2_glock_nq_atime+0x164/0x2de [gfs2]
 [<f8d2b7dd>] gfs2_readdir+0x47/0x8b [gfs2]
 [<c047f754>] filldir64+0x0/0xc5
 [<f8d2416f>] gfs2_glock_nq_atime+0x15c/0x2de [gfs2]
 [<c047f935>] vfs_readdir+0x63/0x8d
 [<c047f754>] filldir64+0x0/0xc5
 [<c047f9c2>] sys_getdents64+0x63/0xa5
 [<c0404eff>] syscall_call+0x7/0xb
 =======================
python        D 2A0CCE0D  1676 10551  10175 (NOTLB)
       f468fd7c 00000082 00000096 2a0cce0d 00000828 00000001 00000009 f5023550
       c20e7550 2a0ce9ae 00000828 00001ba1 00000001 f502365c c20049e0 f40af2c4
       f8cc4b9d 00000000 f40af2c0 ffffffff 00000000 00000000 f468fda8 00000000
Call Trace:
 [<f8cc4b9d>] put_rsb+0x27/0x36 [dlm]
 [<f8d21c99>] just_schedule+0x5/0x8 [gfs2]
 [<c0604d68>] __wait_on_bit+0x33/0x58
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<c0604def>] out_of_line_wait_on_bit+0x62/0x6a
 [<c0436076>] wake_bit_function+0x0/0x3c
 [<f8d21c90>] wait_on_holder+0x27/0x2b [gfs2]
 [<f8d22e32>] glock_wait_internal+0xdb/0x1ec [gfs2]
 [<f8d230b1>] gfs2_glock_nq+0x16e/0x18e [gfs2]
 [<f8d2e911>] gfs2_permission+0x69/0xb4 [gfs2]
 [<f8d2e90a>] gfs2_permission+0x62/0xb4 [gfs2]
 [<f8d2e8a8>] gfs2_permission+0x0/0xb4 [gfs2]
 [<c047b557>] permission+0x78/0xb5
 [<c047c9c0>] __link_path_walk+0x141/0xd33
 [<f8d23322>] gfs2_glock_dq+0x9e/0xb2 [gfs2]
 [<c048d67a>] __mark_inode_dirty+0x13d/0x14f
 [<c047d5fb>] link_path_walk+0x49/0xbd
 [<c044ae04>] audit_syscall_entry+0x11c/0x14e
 [<c047d9c8>] do_path_lookup+0x20e/0x25e
 [<c047ded5>] sys_mkdirat+0x36/0xb6
 [<c044ae04>] audit_syscall_entry+0x11c/0x14e
 [<c047df64>] sys_mkdir+0xf/0x13
 [<c0404eff>] syscall_call+0x7/0xb
 =======================


I will attach dumps of the glocks from the working node and the broken node.
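
Tasks stuck like this sit in uninterruptible sleep, so they can be spotted by scanning /proc. A minimal sketch, assuming the standard /proc/<pid>/status and /proc/<pid>/wchan files are available on this kernel (the script itself is only an illustration):

#!/usr/bin/env python
# Minimal sketch: list tasks in uninterruptible sleep (D state) and the
# kernel symbol they are currently blocked in, via the /proc interface.
import os

for pid in sorted((p for p in os.listdir('/proc') if p.isdigit()), key=int):
    try:
        status = open('/proc/%s/status' % pid).read()
        if '\nState:\tD' not in status:
            continue
        # First line of status is "Name:\t<comm>"
        name = status.splitlines()[0].split('\t', 1)[1]
        # wchan holds the kernel function the task is sleeping in
        wchan = open('/proc/%s/wchan' % pid).read().strip()
        print('%6s  %-16s blocked in %s' % (pid, name, wchan))
    except (IOError, OSError):
        continue  # task exited while we were reading it

For the hung processes above, such a scan would point at the GFS2 wait path (e.g. just_schedule), matching the call traces.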

Comment 1 Ross Vandegrift 2008-08-27 17:22:02 UTC
Created attachment 315122 [details]
Glock dumps from the working node

Comment 2 Ross Vandegrift 2008-08-27 17:23:26 UTC
Created attachment 315123 [details]
Glock dumps from the broken node

Comment 3 Steve Whitehouse 2008-09-01 12:19:55 UTC
Ah, I guess I should have asked earlier which version you were using... sorry, but 5.1 is too old and it's best avoided. Can you upgrade to the latest 5.2?

That will almost certainly solve your problem.

Comment 4 Ross Vandegrift 2008-09-05 18:24:39 UTC
We've completed upgrading all of the cluster machines to 5.2.  I'll observe over the weekend and evaluate where things stand on Monday.

Comment 5 Ross Vandegrift 2008-09-09 13:16:05 UTC
So far, the upgrade to 5.2 seems to have helped.  Thanks for the suggestion, Steve!

Comment 6 Nate Straz 2008-09-09 13:24:50 UTC
We'll close this bug since 5.2 is working better.  If you find anything else that breaks, file a new bug.