Description of problem:

GFS2 becomes lock starved when multiple nodes write to the same base directory. Multiple nodes write uniquely named files to a base directory on GFS2. After some time writing, the node that holds the lock for the base directory can still write new files to it, but the other nodes become lock starved, and all processes that try to obtain locks on the base directory block. Even when the application is no longer holding any locks, the locks do not get released. The node that holds the lock in its cache is able to use the base directory, but all the processes on the other nodes become blocked.

In the following example node1 and node2 have blocked processes and node3 doesn't.

[root@node1 ~]# ps -p 9999999$(fuser -m /mountpoint 2>/dev/null| sed 's/ /,/g') -o user,pid,ppid,%cpu,%mem,vsz,rss,wchan=WIDE-WCHAN-COLUMN,stat,start,cputime,comm
USER PID PPID %CPU %MEM VSZ RSS WIDE-WCHAN-COLUMN STAT STARTED TIME COMMAND
user 23279 23278 0.0 0.3 70124 66188 just_schedule D May 03 00:00:04 app-interpreter
user 23332 23331 0.0 0.3 70124 66136 just_schedule D May 03 00:00:04 app-interpreter
user 23444 23443 0.0 0.4 70756 66872 pipe_wait S May 03 00:00:04 app-interpreter
user 23530 23529 0.0 0.3 68680 64704 just_schedule D May 03 00:00:03 app-interpreter
user 23609 23608 0.0 0.4 73652 69692 just_schedule D May 03 00:00:04 app-interpreter
user 23743 23742 0.0 0.3 67756 63960 pipe_wait S May 03 00:00:03 app-interpreter
user 32654 23743 0.0 0.0 4480 1024 utrace_quiescent T May 03 00:00:00 sh
user 32655 32654 0.0 0.0 4208 652 just_schedule D May 03 00:00:00 mv
user 32659 23444 0.0 0.0 4480 1024 utrace_quiescent T May 03 00:00:00 sh
user 32660 32659 0.0 0.0 4208 652 just_schedule D May 03 00:00:00 mv

##############

[root@node2 ~]# ps -p 9999999$(fuser -m /mountpoint 2>/dev/null| sed 's/ /,/g') -o user,pid,ppid,%cpu,%mem,vsz,rss,wchan=WIDE-WCHAN-COLUMN,stat,start,cputime,comm
USER PID PPID %CPU %MEM VSZ RSS WIDE-WCHAN-COLUMN STAT STARTED TIME COMMAND
user 10873 10871 0.0 0.0 11140 7600 just_schedule D May 03 00:00:00 app-interpreter
user 10875 10874 0.0 0.3 68664 64800 just_schedule D May 03 00:00:03 app-interpreter
user 10879 10878 0.0 0.3 67588 63548 just_schedule D May 03 00:00:03 app-interpreter
user 10881 10880 0.0 0.3 68176 64156 just_schedule D May 03 00:00:03 app-interpreter
user 10921 10919 0.0 0.3 68688 64824 just_schedule D May 03 00:00:03 app-interpreter
user 10922 10918 0.0 0.3 68136 64132 just_schedule D May 03 00:00:02 app-interpreter
user 10923 10920 0.0 0.4 70616 66676 just_schedule D May 03 00:00:03 app-interpreter

#############

[root@node3 ~]# ps -p 9999999$(fuser -m /mountpoint 2>/dev/null| sed 's/ /@/g' | sed 's/@@/,/g' | sed 's/@/,/g') -o user,pid,ppid,%cpu,%mem,vsz,rss,wchan=WIDE-WCHAN-COLUMN,stat,start,cputime,comm
USER PID PPID %CPU %MEM VSZ RSS WIDE-WCHAN-COLUMN STAT STARTED TIME COMMAND
user 3116 3115 0.0 0.2 51108 47360 - S 08:55:12 00:00:08 app-interpreter
user 6732 6731 0.0 0.3 67108 63144 - S 04:45:34 00:00:20 app-interpreter
user 11981 11980 0.0 0.3 67096 63148 - S 09:30:39 00:00:09 app-interpreter
user 17279 17278 0.1 0.4 70740 66856 - S 09:52:20 00:00:08 app-interpreter
user 21635 21634 0.1 0.3 67208 63236 - S 10:10:08 00:00:07 app-interpreter
user 22312 22311 0.5 0.3 67236 63404 - S 12:01:01 00:00:02 app-interpreter
user 30743 30742 0.0 0.3 68136 64096 - S 02:01:38 00:00:29 app-interpreter

When one kills all the processes on node3 and umounts the /mountpoint, the processes start to unblock:

(just after umount on node3:)

#############

[root@node1 tmp]# ps -p 9999999$(fuser -m /mountpoint 2>/dev/null| sed 's/ /@/g' | sed 's/@@/,/g' | sed 's/@/,/g') -o user,pid,ppid,%cpu,%mem,vsz,rss,wchan=WIDE-WCHAN-COLUMN,stat,start,cputime,comm && date
USER PID PPID %CPU %MEM VSZ RSS WIDE-WCHAN-COLUMN STAT STARTED TIME COMMAND
user 23279 23278 0.0 0.3 70120 66184 - S May 03 00:00:04 app-interpreter
user 23332 23331 0.0 0.3 70120 66132 - S May 03 00:00:04 app-interpreter
user 23530 23529 0.0 0.3 68672 64696 just_schedule D May 03 00:00:03 app-interpreter
user 23609 23608 0.0 0.4 73648 69688 - S May 03 00:00:04 app-interpreter
user 25346 25345 8.9 0.3 68144 64140 - S 16:31:44 00:00:02 app-interpreter
user 25348 25347 8.3 0.3 68300 64448 - S 16:31:44 00:00:01 app-interpreter
Mon May 4 16:32:07 CEST 2009

#############

[root@node2 tmp]# ps -p 9999999$(fuser -m /mountpoint 2>/dev/null| sed 's/ /@/g' | sed 's/@@/,/g' | sed 's/@/,/g') -o user,pid,ppid,%cpu,%mem,vsz,rss,wchan=WIDE-WCHAN-COLUMN,stat,start,cputime,comm && date
USER PID PPID %CPU %MEM VSZ RSS WIDE-WCHAN-COLUMN STAT STARTED TIME COMMAND
user 8730 10873 0.0 0.0 4480 1024 utrace_quiescent T 16:32:05 00:00:00 sh
user 8731 8730 0.0 0.0 4208 692 utrace_quiescent T 16:32:05 00:00:00 mv
user 10873 10871 0.0 0.0 11140 7596 pipe_wait S May 03 00:00:00 app-interpreter
user 10875 10874 0.0 0.3 68660 64796 - S May 03 00:00:03 app-interpreter
user 10879 10878 0.0 0.3 67580 63540 - S May 03 00:00:03 app-interpreter
user 10881 10880 0.0 0.3 68172 64152 - S May 03 00:00:03 app-interpreter
user 10921 10919 0.0 0.3 68680 64816 - S May 03 00:00:03 app-interpreter
user 10922 10918 0.0 0.3 68128 64124 - S May 03 00:00:02 app-interpreter
user 10923 10920 0.0 0.4 70608 66668 - S May 03 00:00:03 app-interpreter
Mon May 4 16:32:05 CEST 2009

(2 minutes after umount on node3):

[root@node1 tmp]# ps -p 9999999$(fuser -m /mountpoint 2>/dev/null| sed 's/ /@/g' | sed 's/@@/,/g' | sed 's/@/,/g') -o user,pid,ppid,%cpu,%mem,vsz,rss,wchan=WIDE-WCHAN-COLUMN,stat,start,cputime,comm && date
USER PID PPID %CPU %MEM VSZ RSS WIDE-WCHAN-COLUMN STAT STARTED TIME COMMAND
user 23279 23278 0.0 0.3 70116 66180 - S May 03 00:00:04 app-interpreter
user 23332 23331 0.0 0.3 70120 66132 - S May 03 00:00:04 app-interpreter
user 23530 23529 0.0 0.3 68676 64700 - S May 03 00:00:03 app-interpreter
user 23609 23608 0.0 0.4 73648 69688 - S May 03 00:00:04 app-interpreter
user 25346 25345 1.3 0.3 68140 64136 - S 16:31:44 00:00:02 app-interpreter
user 25348 25347 1.2 0.3 68300 64448 - S 16:31:44 00:00:02 app-interpreter
Mon May 4 16:34:25 CEST 2009

###########

[root@node2 tmp]# ps -p 9999999$(fuser -m /mountpoint 2>/dev/null| sed 's/ /@/g' | sed 's/@@/,/g' | sed 's/@/,/g') -o user,pid,ppid,%cpu,%mem,vsz,rss,wchan=WIDE-WCHAN-COLUMN,stat,start,cputime,comm && date
USER PID PPID %CPU %MEM VSZ RSS WIDE-WCHAN-COLUMN STAT STARTED TIME COMMAND
user 10873 10871 0.0 0.0 11136 7596 - S May 03 00:00:00 app-interpreter
user 10875 10874 0.0 0.3 68660 64796 - S May 03 00:00:03 app-interpreter
user 10879 10878 0.0 0.3 67584 63544 - S May 03 00:00:03 app-interpreter
user 10881 10880 0.0 0.3 68172 64152 - S May 03 00:00:03 app-interpreter
user 10921 10919 0.0 0.3 68680 64816 - S May 03 00:00:03 app-interpreter
user 10922 10918 0.0 0.3 68132 64128 - S May 03 00:00:03 app-interpreter
user 10923 10920 0.0 0.4 70612 66672 - S May 03 00:00:03 app-interpreter
Mon May 4 16:35:00 CEST 2009

From the lockdump, it became clear that the process granted all locks was:

root 3988 35 0.0 0.0 0 0 worker_thread S< Apr 30 00:00:00 glock_workqueue

Version-Release number of selected component (if applicable):

How reproducible:
Not every time; it is unpredictable, and the application is not available for testing.

Steps to Reproduce:
1. Have 3 nodes writing to the same GFS2 directory.
2. Run the app for a number of hours.
3. Watch the processes go 'D'.

Actual results:
Processes go into 'D' state or other types of wait state.

Expected results:
Processes do not lock up.
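The ps invocations above lean on a small trick worth spelling out. A minimal sketch (the PID list below is an illustrative stand-in for real fuser output): "fuser -m /mountpoint" prints the mountpoint name to stderr (discarded here) and a space-separated PID list to stdout, sed rewrites the spaces as commas for "ps -p", and the dummy PID 9999999 absorbs the stray comma left by fuser's leading space.

```shell
# Sketch of the PID-list construction used in the commands above.
# The PIDs are illustrative stand-ins for "fuser -m /mountpoint" output.
raw=' 23279 23332 23444'                 # fuser prints a leading space
pids=$(printf '%s' "$raw" | sed 's/ /,/g')
# Leading comma is harmless because 9999999 is prepended as a dummy PID:
echo "ps -p 9999999${pids} -o user,pid,stat,wchan,comm"
```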
Looks like it's probably a missing callback to me; I need to investigate further to be certain.
The attached data doesn't include the glock or the dlm lock dumps. That information (just the raw dumps from the debugfs files) is required in order to try and work out what is going on here. Since the glock is missing from the hangalyser trace, it suggests that gfs2 and the dlm have confused each other, so the dlm dumps are a vital piece of the jigsaw to see what is going on.
Here is the dlm master copy of the lock in question:

Resource f7b672c0 Name (len=24) " 2 18b36cb"
Master Copy
Granted Queue
0337017d EX Remote: 3 02de000a
Conversion Queue
Waiting Queue
03ea0198 -- (PR)
02330164 -- (PR) Remote: 2 036e0051

This indicates that node 1 should have sent a callback to node 3. On node 3 we have:

G: s:EX n:2/18b36cb f:y t:EX d:EX/0 l:0 a:0 r:3
 I: n:5103274/25900747 t:4 f:0x00000010

i.e. a glock which claims to have received no callback. So somewhere along the line we've lost the callback. The question is why and where? There are two possible causes: (a) the dlm never sent a callback, or (b) gfs2 ignored it for some reason.
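For anyone wanting to check a specific glock on each node without a full dump analysis, the entry can be pulled straight out of the debugfs glock file. A minimal sketch, run here against a saved sample; on a live node the file would be /sys/kernel/debug/gfs2/<lock_table>/glocks (the exact path depends on the mounted filesystem and is an assumption to verify):

```shell
# Sketch: extract one glock (plus the line(s) attached to it) from a
# glock dump. A saved sample stands in for the live debugfs file.
cat > /tmp/glocks.sample <<'EOF'
G: s:EX n:2/18b36cb f:y t:EX d:EX/0 l:0 a:0 r:3
 I: n:5103274/25900747 t:4 f:0x00000010
G: s:SH n:2/99999 f: t:SH d:EX/0 l:0 a:0 r:2
EOF
# -A 1 keeps the inode/holder line after the match; widen it for glocks
# with many H: holder lines.
grep -A 1 'n:2/18b36cb' /tmp/glocks.sample
```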
Actually there is also (c) that the message got lost or reordered in the lock_dlm module.
What I'm going to suggest at this stage is that, if it's possible to reproduce without too much difficulty, you should collect tcpdumps of the dlm traffic at each node. In the end we are only interested in the last few dlm packets relating to the lock in question (the directory inode lock) on the node where the lock is being cached (i.e. node 3 in the above traces). You'll have to capture all the traffic on all the nodes, though, to be sure of catching it. I'd suggest setting the snaplen to 1500 in case it chops off the end of any packets. Chrissie and/or Dave should be able to help with setting up tcpdump to capture the packets. There is also a tool which can analyse the dumps and turn them into something meaningful, but I forget its name. Again Chrissie and/or Dave should be able to help. If we can prove whether the callback is making it across the wire to the stuck node, that cuts down the possibilities. If it is making it across the wire, then it will be much harder to trace what is going on, unfortunately. I'll give that issue some thought in the meantime.
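A capture along those lines might look like the sketch below. The interface name is a placeholder and 21064/tcp is the dlm's default port; both are assumptions to check against the actual cluster configuration. The -s 1500 snaplen keeps dlm messages from being truncated.

```shell
# Sketch of a per-node dlm capture; IFACE and DLM_PORT are assumptions.
IFACE=eth0
DLM_PORT=21064
CAPTURE="tcpdump -i $IFACE -s 1500 -w /tmp/dlm-capture.pcap tcp port $DLM_PORT"
# Run this command (as root) on every node until the hang recurs:
echo "$CAPTURE"
```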
Wireshark is the network protocol analyser I was thinking of in comment #22.
Created attachment 344053 [details] Instrumentation patch This patch to the 2.6.18-128 kernel adds instrumentation to lock_dlm so that it will tell us more about what happened in the last 50 lock "events", whether the locks were sent to the dlm, whether they came back, etc. This should hopefully tell us whether gfs2, dlm or lock_dlm is at fault. Once the system is hung, have them do a command like this: cat /sys/fs/gfs2/bobs_roth\:roth2/lock_module/lock_hist where "bobs_roth\:roth2" should be replaced by the lock table as returned by this command: "gfs2_tool sb /dev/their/device table" We might want to capture this file from all nodes.
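Gathering the file from every node could be scripted; a sketch that prints the collection commands, assuming placeholder node names and the example lock table from above (substitute the value that "gfs2_tool sb /dev/their/device table" reports):

```shell
# Sketch: print per-node collection commands for the lock_hist file.
# TABLE and the node names are placeholders from the example above.
TABLE='bobs_roth:roth2'
HIST="/sys/fs/gfs2/${TABLE}/lock_module/lock_hist"
for node in node1 node2 node3; do
    echo "ssh root@${node} cat ${HIST} > lock_hist.${node}"
done
```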
Created attachment 344859 [details] Faster Patch This patch removes the spin_lock from the lock history instrumentation. This will hopefully bring the speed/performance back to normal, and maybe we'll be able to recreate the problem more easily this way. Unfortunately, that comes at the expense of possible corruption of the history information we're trying to gather. So we still might not be able to tell what's going on, but it's worth a try to see what we get.
I'm waiting to get the results from the previous patch. Until then, I'm putting this bug into NEEDINFO.
Unfortunately, the gfs2_hangalyzer output is zero length, so I had to do the analysis by hand. There is only one glock being waited on in this case, 2/ac811, and here are the references to it:

from 3:

G: s:UN n:2/ac811 f:l t:SH d:EX/0 l:0 a:0 r:12
 H: s:SH f:aW e:0 p:13123 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13124 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13125 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13126 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13127 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13129 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13131 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:18092 [python] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:18066 [python] gfs2_permission+0x69/0xb4 [gfs2]

from 2:

G: s:EX n:2/ac811 f:y t:EX d:EX/0 l:0 a:0 r:3
 I: n:12563622/706577 t:4 f:0x00000010

Unfortunately, the glock trace information does not contain this glock, so it doesn't tell us any more than we already knew. Can we get a recreation where the glock trace more closely matches the hang (i.e. taken soon after the hang)?
Created attachment 360660 [details] Potential fix I have no way to prove whether this patch fixes this bug or not. On the other hand, the code looks wrong and there is almost certainly some kind of issue here. The patch is against upstream, but as usual RHEL is almost identical. I'm intending to put this patch in upstream anyway, subject to it not causing any issues during testing. If we can prove that it fixes this bug, then we can put it in RHEL too. I've given it some brief testing in upstream, but it needs more exposure before I can be confident about it.
Created attachment 360673 [details] Potential fix (RHEL5 port) This is the RHEL5 version of the previously attached upstream patch.
in kernel-2.6.18-175.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days