Bug 498976 - GFS2 - probably lost glock call back
Summary: GFS2 - probably lost glock call back
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
high
high
Target Milestone: rc
: ---
Assignee: Steve Whitehouse
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 499522 525215 526947 533192 546279
TreeView+ depends on / blocked
 
Reported: 2009-05-04 16:18 UTC by Eduardo Damato
Modified: 2023-09-14 01:16 UTC (History)
17 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 546279 (view as bug list)
Environment:
Last Closed: 2010-03-30 07:27:34 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Instrumentation patch (7.43 KB, patch)
2009-05-14 21:58 UTC, Robert Peterson
no flags Details | Diff
Faster Patch (7.44 KB, patch)
2009-05-20 18:32 UTC, Robert Peterson
no flags Details | Diff
Potential fix (964 bytes, patch)
2009-09-11 10:26 UTC, Steve Whitehouse
no flags Details | Diff
Potential fix (RHEL5 port) (1.00 KB, patch)
2009-09-11 11:30 UTC, Steve Whitehouse
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0178 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update 2010-03-29 12:18:21 UTC

Description Eduardo Damato 2009-05-04 16:18:24 UTC
Description of problem:

GFS2 becomes lock starved when having multiple nodes writting to the same base directory. Multiple nodes write unique filenamed files to a base directory on GFS2, after some time writting, the node that has the lock for the base directory continues to be able to write new files to the base directory, but the other nodes become lock starved, and all the processes that try to obtain  locks on the base directory become blocked.

Even when the application is no longer holding any locks, the locks do not get released. The node that holds the lock in its cache is able to use the basedir, however all the processes from the other nodes become blocked.

In the following example node1 and node2 have blocked processes and node3 doesn't.

[root@node1 ~]# ps -p 9999999$(fuser -m /mountpoint 2>/dev/null| sed 's/ /,/g') -o user,pid,ppid,%cpu,%mem,vsz,rss,wchan=WIDE-WCHAN-COLUMN,stat,start,cputime,comm 
USER       PID  PPID %CPU %MEM    VSZ   RSS WIDE-WCHAN-COLUMN STAT  STARTED     TIME COMMAND 
user     23279 23278  0.0  0.3  70124 66188 just_schedule     D      May 03 00:00:04 app-interpreter 
user     23332 23331  0.0  0.3  70124 66136 just_schedule     D      May 03 00:00:04 app-interpreter 
user     23444 23443  0.0  0.4  70756 66872 pipe_wait         S      May 03 00:00:04 app-interpreter 
user     23530 23529  0.0  0.3  68680 64704 just_schedule     D      May 03 00:00:03 app-interpreter 
user     23609 23608  0.0  0.4  73652 69692 just_schedule     D      May 03 00:00:04 app-interpreter 
user     23743 23742  0.0  0.3  67756 63960 pipe_wait         S      May 03 00:00:03 app-interpreter 
user     32654 23743  0.0  0.0   4480  1024 utrace_quiescent  T      May 03 00:00:00 sh 
user     32655 32654  0.0  0.0   4208   652 just_schedule     D      May 03 00:00:00 mv 
user     32659 23444  0.0  0.0   4480  1024 utrace_quiescent  T      May 03 00:00:00 sh 
user     32660 32659  0.0  0.0   4208   652 just_schedule     D      May 03 00:00:00 mv 
############## 
[root@node2 ~]# ps -p 9999999$(fuser -m /mountpoint 2>/dev/null| sed 's/ /,/g') -o user,pid,ppid,%cpu,%mem,vsz,rss,wchan=WIDE-WCHAN-COLUMN,stat,start,cputime,comm 
USER       PID  PPID %CPU %MEM    VSZ   RSS WIDE-WCHAN-COLUMN STAT  STARTED     TIME COMMAND 
user     10873 10871  0.0  0.0  11140  7600 just_schedule     D      May 03 00:00:00 app-interpreter 
user     10875 10874  0.0  0.3  68664 64800 just_schedule     D      May 03 00:00:03 app-interpreter 
user     10879 10878  0.0  0.3  67588 63548 just_schedule     D      May 03 00:00:03 app-interpreter 
user     10881 10880  0.0  0.3  68176 64156 just_schedule     D      May 03 00:00:03 app-interpreter 
user     10921 10919  0.0  0.3  68688 64824 just_schedule     D      May 03 00:00:03 app-interpreter 
user     10922 10918  0.0  0.3  68136 64132 just_schedule     D      May 03 00:00:02 app-interpreter 
user     10923 10920  0.0  0.4  70616 66676 just_schedule     D      May 03 00:00:03 app-interpreter 
############# 
[root@node3 ~]# ps -p 9999999$(fuser -m /mountpoint 2>/dev/null| sed 's/ /@/g' |  sed 's/@@/,/g' | sed 's/@/,/g') -o user,pid,ppid,%cpu,%mem,vsz,rss,wchan=WIDE-WCHAN-COLUMN,stat,start,cputime,comm 
USER       PID  PPID %CPU %MEM    VSZ   RSS WIDE-WCHAN-COLUMN STAT  STARTED     TIME COMMAND 
user      3116  3115  0.0  0.2  51108 47360 -                 S    08:55:12 00:00:08 app-interpreter 
user      6732  6731  0.0  0.3  67108 63144 -                 S    04:45:34 00:00:20 app-interpreter 
user     11981 11980  0.0  0.3  67096 63148 -                 S    09:30:39 00:00:09 app-interpreter 
user     17279 17278  0.1  0.4  70740 66856 -                 S    09:52:20 00:00:08 app-interpreter 
user     21635 21634  0.1  0.3  67208 63236 -                 S    10:10:08 00:00:07 app-interpreter 
user     22312 22311  0.5  0.3  67236 63404 -                 S    12:01:01 00:00:02 app-interpreter 
user     30743 30742  0.0  0.3  68136 64096 -                 S    02:01:38 00:00:29 app-interpreter 

When one kills all the processes on node3, and umounts the /mountpoint, the processes start to unblock:

(just after umount on node3:)

############# 
[root@node1 tmp]# ps -p 9999999$(fuser -m /mountpoint 2>/dev/null| sed 's/ /@/g' |  sed 's/@@/,/g' | sed 's/@/,/g') -o user,pid,ppid,%cpu,%mem,vsz,rss,wchan=WIDE-WCHAN-COLUMN,stat,start,cputime,comm && date 
USER       PID  PPID %CPU %MEM    VSZ   RSS WIDE-WCHAN-COLUMN STAT  STARTED     TIME COMMAND 
user     23279 23278  0.0  0.3  70120 66184 -                 S      May 03 00:00:04 app-interpreter 
user     23332 23331  0.0  0.3  70120 66132 -                 S      May 03 00:00:04 app-interpreter 
user     23530 23529  0.0  0.3  68672 64696 just_schedule     D      May 03 00:00:03 app-interpreter 
user     23609 23608  0.0  0.4  73648 69688 -                 S      May 03 00:00:04 app-interpreter 
user     25346 25345  8.9  0.3  68144 64140 -                 S    16:31:44 00:00:02 app-interpreter 
user     25348 25347  8.3  0.3  68300 64448 -                 S    16:31:44 00:00:01 app-interpreter 
Mon May  4 16:32:07 CEST 2009 
############# 
[root@node2 tmp]# ps -p 9999999$(fuser -m /mountpoint 2>/dev/null| sed 's/ /@/g' |  sed 's/@@/,/g' | sed 's/@/,/g') -o user,pid,ppid,%cpu,%mem,vsz,rss,wchan=WIDE-WCHAN-COLUMN,stat,start,cputime,comm && date 
USER       PID  PPID %CPU %MEM    VSZ   RSS WIDE-WCHAN-COLUMN STAT  STARTED     TIME COMMAND 
user      8730 10873  0.0  0.0   4480  1024 utrace_quiescent  T    16:32:05 00:00:00 sh 
user      8731  8730  0.0  0.0   4208   692 utrace_quiescent  T    16:32:05 00:00:00 mv 
user     10873 10871  0.0  0.0  11140  7596 pipe_wait         S      May 03 00:00:00 app-interpreter 
user     10875 10874  0.0  0.3  68660 64796 -                 S      May 03 00:00:03 app-interpreter 
user     10879 10878  0.0  0.3  67580 63540 -                 S      May 03 00:00:03 app-interpreter 
user     10881 10880  0.0  0.3  68172 64152 -                 S      May 03 00:00:03 app-interpreter 
user     10921 10919  0.0  0.3  68680 64816 -                 S      May 03 00:00:03 app-interpreter 
user     10922 10918  0.0  0.3  68128 64124 -                 S      May 03 00:00:02 app-interpreter 
user     10923 10920  0.0  0.4  70608 66668 -                 S      May 03 00:00:03 app-interpreter 
Mon May  4 16:32:05 CEST 2009 

(2 minutes after umount on node3):

[root@node1 tmp]# ps -p 9999999$(fuser -m /mountpoint 2>/dev/null| sed 's/ /@/g' |  sed 's/@@/,/g' | sed 's/@/,/g') -o user,pid,ppid,%cpu,%mem,vsz,rss,wchan=WIDE-WCHAN-COLUMN,stat,start,cputime,comm && date 
USER       PID  PPID %CPU %MEM    VSZ   RSS WIDE-WCHAN-COLUMN STAT  STARTED     TIME COMMAND 
user     23279 23278  0.0  0.3  70116 66180 -                 S      May 03 00:00:04 app-interpreter 
user     23332 23331  0.0  0.3  70120 66132 -                 S      May 03 00:00:04 app-interpreter 
user     23530 23529  0.0  0.3  68676 64700 -                 S      May 03 00:00:03 app-interpreter 
user     23609 23608  0.0  0.4  73648 69688 -                 S      May 03 00:00:04 app-interpreter 
user     25346 25345  1.3  0.3  68140 64136 -                 S    16:31:44 00:00:02 app-interpreter 
user     25348 25347  1.2  0.3  68300 64448 -                 S    16:31:44 00:00:02 app-interpreter 
Mon May  4 16:34:25 CEST 2009 
########### 
[root@node2 tmp]# ps -p 9999999$(fuser -m /mountpoint 2>/dev/null| sed 's/ /@/g' |  sed 's/@@/,/g' | sed 's/@/,/g') -o user,pid,ppid,%cpu,%mem,vsz,rss,wchan=WIDE-WCHAN-COLUMN,stat,start,cputime,comm && date 
USER       PID  PPID %CPU %MEM    VSZ   RSS WIDE-WCHAN-COLUMN STAT  STARTED     TIME COMMAND 
user     10873 10871  0.0  0.0  11136  7596 -                 S      May 03 00:00:00 app-interpreter 
user     10875 10874  0.0  0.3  68660 64796 -                 S      May 03 00:00:03 app-interpreter 
user     10879 10878  0.0  0.3  67584 63544 -                 S      May 03 00:00:03 app-interpreter 
user     10881 10880  0.0  0.3  68172 64152 -                 S      May 03 00:00:03 app-interpreter 
user     10921 10919  0.0  0.3  68680 64816 -                 S      May 03 00:00:03 app-interpreter 
user     10922 10918  0.0  0.3  68132 64128 -                 S      May 03 00:00:03 app-interpreter 
user     10923 10920  0.0  0.4  70612 66672 -                 S      May 03 00:00:03 app-interpreter 
Mon May  4 16:35:00 CEST 2009 

From the lockdump, it became clear that the process granted all locks was:

root      3988    35  0.0  0.0      0     0 worker_thread     S<     Apr 30 00:00:00 glock_workqueue


Version-Release number of selected component (if applicable):


How reproducible:

Not every time, it is unpredictable, the application is not available for testing.

Steps to Reproduce:
1.have 3 nodes writting to the same GFS2 directory
2.run the app for a number of hours
3.watch the processes go 'D'
  
Actual results:

processes go into 'D' state or other types of wait state.

Expected results:

processes do not lockup.

Comment 14 Steve Whitehouse 2009-05-05 14:22:52 UTC
Looks like its probably a missing call back to me - need to investigate further to be certain.

Comment 15 Steve Whitehouse 2009-05-07 11:04:49 UTC
The attached data doesn't include the glock or the dlm lock dumps. That information (just the raw dumps from the debugfs files) is required in order to try and work out what is going on here.

Since the glock is missing from the hangalyser trace, it suggests that gfs2 and the dlm have confused each other, so the dlm dumps are a vital piece of the jigsaw to see what is going on.

Comment 20 Steve Whitehouse 2009-05-08 09:59:08 UTC
Here is the dlm master copy of the lock in question:

Resource f7b672c0 Name (len=24) "       2         18b36cb"  
Master Copy
Granted Queue
0337017d EX Remote:   3 02de000a
Conversion Queue
Waiting Queue
03ea0198 -- (PR)
02330164 -- (PR) Remote:   2 036e0051


This indicates that node 1 should have sent a call back to node 3. On node 3 we have:

G:  s:EX n:2/18b36cb f:y t:EX d:EX/0 l:0 a:0 r:3
 I: n:5103274/25900747 t:4 f:0x00000010

i.e. a glock which claims to have received no callback.

So somewhere along the line, we've lost the call back. The question being why and where? There are two possible causes (a) the dlm never sent a callback or (b) gfs2 ignored it for some reason.

Comment 21 Steve Whitehouse 2009-05-08 10:21:22 UTC
Actually there is also (c) that the message got lost or reordered in the lock_dlm module.

Comment 22 Steve Whitehouse 2009-05-08 10:31:10 UTC
What I'm going to suggest at this stage is that if its possible to reproduce without too much difficulty, that you should collect tcpdumps of the dlm traffic at each node.

In the end, we are only interested in the last few dlm packets which relate to the lock in question (which is the directory inode lock) on the node where the lock is being cached (i.e. node 3 in the above traces). You'll have to capture all the traffic on all the nodes though to be sure to catch it though. I'd suggest setting the snaplen to 1500 in case it chops off the end of any packets.

Chrissie and/or Dave should be able to help with setting up tcpdump to capture the packets. There is also a tool which can analyse the dumps and turn them into something meaningful, but I forget its name. Again Chrissie and/or Dave should be able to help.

If we can prove whether the callback is making it across the wire to the stuck node, that cuts down the possibilities. If it is making it across the wire, then it will be much harder to trace what is going on, unfortunately. I'll give that issue some thought in the mean time.

Comment 23 Steve Whitehouse 2009-05-08 12:12:21 UTC
wireshark is the network protocol analyser I was thinking of in comment #22

Comment 28 Robert Peterson 2009-05-14 21:58:20 UTC
Created attachment 344053 [details]
Instrumentation patch

This patch to the 2.6.18-128 kernel adds instrumentation to lock_dlm
so that it will tell us more about what happened in the last 50 lock
"events", whether the locks were sent to the dlm, whether they came
back, etc.  This should hopefully tell us whether gfs2, dlm or
lock_dlm is at fault.

Once the system is hung, have them do a command like this:

cat /sys/fs/gfs2/bobs_roth\:roth2/lock_module/lock_hist

where "bobs_roth\:roth2" should be replaced by the lock table as
returned by this command: "gfs2_tool sb /dev/their/device table"

We might want to capture this file from all nodes.

Comment 29 Robert Peterson 2009-05-20 18:32:42 UTC
Created attachment 344859 [details]
Faster Patch

This patch removes the spin_lock from the lock history instrumentation.
This will hopefully bring the speed/performance back to normal and
maybe we'll be able to recreate the problem easier this way.
Unfortunately, that comes at the expense of possible corruption of the
history information we're trying to gather.  So we still might not be
able to tell what's going on, but it's worth a try to see what we get.

Comment 30 Robert Peterson 2009-06-02 15:03:43 UTC
I'm waiting to get the results from the previous patch.
Until then, I'm putting this bug into NEEDINFO.

Comment 34 Robert Peterson 2009-06-05 15:14:08 UTC
Unfortunately, the gfs2_hangalyzer output is zero length, so I had
to do the analysis by hand.  There is only one glock being waited
on in this case, 2/ac811, and here are the references to it:

from 3:
G:  s:UN n:2/ac811 f:l t:SH d:EX/0 l:0 a:0 r:12
 H: s:SH f:aW e:0 p:13123 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13124 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13125 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13126 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13127 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13129 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13131 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:18092 [python] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:18066 [python] gfs2_permission+0x69/0xb4 [gfs2]

from 2:
G:  s:EX n:2/ac811 f:y t:EX d:EX/0 l:0 a:0 r:3
 I: n:12563622/706577 t:4 f:0x00000010

Unfortunately, the glock trace information does not contain this
glock, so it doesn't tell us any more than we already knew.
Can we get a recreation where the glock trace more closely matches
the hang (i.e. taken soon after the hang)?

Comment 41 Steve Whitehouse 2009-09-11 10:26:54 UTC
Created attachment 360660 [details]
Potential fix

I have no way to prove whether this patch fixes this bug or not. On the other hand the code looks wrong and there is almost certainly some kind of issue here. The patch is against upstream, but as usual RHEL is almost identical.

I'm intending to put this patch in upstream anyway, subject to it not causing any issues during testing, if we can prove that it fixes this bug, then we can put it in RHEL too.

I've given it some brief testing in upstream, but it needs more exposure before I can be confident about it.

Comment 42 Steve Whitehouse 2009-09-11 11:30:31 UTC
Created attachment 360673 [details]
Potential fix (RHEL5 port)

This is the RHEL5 version of the previously attached upstream patch.

Comment 46 Don Zickus 2009-11-23 15:32:18 UTC
in kernel-2.6.18-175.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 55 Chris Ward 2010-02-11 10:23:10 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 58 errata-xmlrpc 2010-03-30 07:27:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

Comment 60 Red Hat Bugzilla 2023-09-14 01:16:12 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days


Note You need to log in before you can comment on or make changes to this bug.