Bug 199673
| Field | Value |
| --- | --- |
| Summary | lock_dlm assertion failing under heavy (hard)linking or unlinking |
| Product | [Retired] Red Hat Cluster Suite |
| Component | gfs |
| Version | 4 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | Stephen Willey <stephen.willey> |
| Assignee | David Teigland <teigland> |
| QA Contact | GFS Bugs <gfs-bugs> |
| CC | alexander.laamanen, Chris.Verhoef, daire.byrne, jbacik, lenny, michael.hagmann, nobody+wcheng, riaanvn |
| Doc Type | Bug Fix |
| Fixed In Version | RHBA-2006-0702 |
| Last Closed | 2006-10-11 16:45:16 UTC |
| Bug Blocks | 207487 |
Description (Stephen Willey, 2006-07-21 10:17:24 UTC)
---

The dlm is returning EINVAL, but there are a lot of possible reasons for that. We've added some messages to pinpoint which error condition is causing this, but those changes won't be included until the 4.5 release. This is the first case I've seen where this problem is so repeatable; perhaps we could create some special rpms for you to try that would give us some more data to go on. Here's another recent report where we needed these extra printk's to get some idea of what was going wrong: https://www.redhat.com/archives/linux-cluster/2006-June/msg00132.html

---

Sounds good. We need to get to the bottom of this ASAP, so if you can send us some patches/SRPMs, that'd be great.

Thanks, Stephen

---

http://redhat.com/~teigland/dlm-fs1.ko
http://redhat.com/~teigland/dlm-fs1.patch

dlm-fs1.ko is a dlm kernel module with some extra printk's that should help narrow down what's wrong, built for x86_64 2.6.9-34.ELsmp. dlm-fs1.patch is the corresponding patch to cluster/dlm-kernel/src/locking.c in case you'd like to build your own module from source.

The next time your machine hits this assertion, please send the additional kernel messages beginning "dlm:"; they should appear before the "lock_dlm: ..." messages you've already sent.

There's also a tuning option to increase the lock caching gfs is permitted to do; the default is probably too low if you have vast numbers of locks and gigabytes of main memory. It can often improve performance, but it may also delay the onset of the problems you're having:

```sh
echo "500000" >> /proc/cluster/lock_dlm/drop_count
```

---

(In reply to comment #3)
> The next time your machine hits this assertion, please send the
> additional kernel messages beginning "dlm:"; they should appear
> before the "lock_dlm: ..." messages you've already sent.

A bit of delay due to a RAID problem that we've only just fixed. We observed crashes last night for an "rm" task and an "rsync" task. I'm attaching both netdump-recorded stack traces for your consideration.

---

Created attachment 134625 [details]
dlm "rm" unlinking
---

Created attachment 134626 [details]
dlm rsync "linking"
---

Created attachment 134703 [details]
dlm "rm" unlinking
Another example netdump trace from last night. These really are occurring nightly now. What does the error "convert unknown lkid" mean exactly?
---

This looks quite similar to bz 172944, where an unlock is happening on a lock that no longer exists. Here a convert is being done on a lock that no longer exists (or can't be found): gfs is holding a lock, it goes to unlock or convert it, and the dlm can't find any record of the lock that gfs believes it holds. So the dlm returns an error and gfs panics (it doesn't want to corrupt anything given that it might be holding an invalid lock). The main question we're looking into is why these locks seem to be disappearing in the dlm.

---

Thanks for trying the special kernel modules; it focuses our investigation quite a bit. We've never seen this problem in our own testing and want to try to emulate what you're doing to see if we can trigger it. We've not used dirvish either; clearly this stresses gfs in ways that our normal stress tests don't.

Could you briefly describe your gfs cluster (how many nodes, how much memory each has), what each of the nodes is doing, how many gfs file systems, how many original fs's are being backed up, the approximate number/size of the files, and any other details about how dirvish is configured? Thanks

---

(In reply to comment #9)
> Could you briefly describe your gfs cluster (how many nodes,
> how much memory each has), what each of the nodes is doing, how
> many gfs file systems, how many original fs's are being backed
> up, approx number/size of the files, and any other details about
> how dirvish is configured? Thanks

I suspect this will be difficult for you to replicate! The GFS filesystem is connected to 7 x 6.4TB RAIDs and 2 x HP DL385 servers with 4GB of RAM each. It is formatted as one big 48TB filesystem. We back up and expire probably around 30 "blocks" of storage each night. The size of the original filesystems ranges from 30GB to 11TB!

Every night we use dirvish/rsync to hardlink and copy against the previous night's backup: if a file is unchanged, rsync makes a hardlink, but if the file is changed/new it gets copied to GFS. We keep 10 days' worth of these "incrementals" online. Every evening we expire the backup images older than 10 days by doing an rm -rf of the whole directory. This is all handled by the "dirvish" scripts but can easily be done by hand as well. To maximise throughput we run 4 rsync backups on each server simultaneously, and for the expires we run 2 "rm" commands simultaneously on each server.

If you want to try to reproduce this, maybe you can create a two-node cluster with ~6TB of data on it (we have lots of 13MB image files). Then make a few hard-linked copies of the directories of data using "cp -lR". Then run a few rm -rf instances on each node simultaneously and hope one of them crashes!

If there is any more debugging info we can provide, don't hesitate to contact us. We are obviously quite keen to avoid having one of our nodes crash every other night and are willing to run beta code in order to fix it.

---

Looking back, I worked on this same problem quite a while ago with sunjw onewaveinc com -- bug 172944 references his report. (I believe he worked around the problem by not calling 'df' so frequently; I don't know why or how that helped his situation.)
He was using some dlm debugging patches, and one interesting thing we found was that prior to at least one of the assertions, create_lkb() was having difficulty creating a new lkb with a unique lkid:

```
Apr 30 00:00:27 nd10 kernel: create_lkb retry 1026b
Apr 30 00:00:28 nd10 kernel: create_lkb retry 2026b
Apr 30 00:00:28 nd10 kernel: create_lkb retry 2008d
Apr 30 00:00:28 nd10 kernel: create_lkb retry 3008d
Apr 30 00:00:28 nd10 kernel: create_lkb retry 501a5
Apr 30 00:00:28 nd10 kernel: create_lkb retry 601a5
Apr 30 00:00:28 nd10 kernel: create_lkb retry 10250
Apr 30 00:00:29 nd10 kernel: create_lkb retry 10210
Apr 30 00:00:29 nd10 kernel: find_lock_by_id 1d103f8 b 1016 c 470 roll 5 u_last 1d003f8 u_high feff03f8
Apr 30 00:00:29 nd10 kernel: UNLOCK 1D103F8 NO ID
```

---

Here's a script I've whipped up and am running right now that I'm hoping will come close to loading gfs in a similar way. Any suggestions on how I might improve this?

The "set1" and "set2" directories that I'm copying are flat and each have about 13,000 10MB files. I'll increase the sizes of those directory trees and add some depth. I have a 1.5TB fs (pretty puny, I know).

```sh
#!/bin/sh
EXPR=/usr/bin/expr
HOSTNAME=/bin/hostname
I=0
MAX=10
H=`${HOSTNAME}`
echo ${H}
while [ ${I} -le ${MAX} ]
do
    echo "${I}: begin"
    date
    time cp -lR set1/ bak-${I}-${H}-set1-1/ &
    time cp -lR set1/ bak-${I}-${H}-set1-2/ &
    time cp -lR set2/ bak-${I}-${H}-set2-1/ &
    time cp -lR set2/ bak-${I}-${H}-set2-2/ &
    LAST=`${EXPR} ${I} - 1`
    time rm -rf bak-${LAST}-${H}-set1-1/ &
    time rm -rf bak-${LAST}-${H}-set1-2/ &
    time rm -rf bak-${LAST}-${H}-set2-1/ &
    time rm -rf bak-${LAST}-${H}-set2-2/ &
    wait
    echo "${I}: end"
    date
    I=`${EXPR} ${I} + 1`
done
```

---

(In reply to comment #11)
> Looking back, I worked on this same problem quite a while ago
> with sunjw onewaveinc com -- bug 172944 references his report.
> (I believe he worked around the problem by not calling 'df' so
> frequently, don't know why or how that helped his situation.)

That bug report seems to end up pointing the finger at packet loss. I'm pretty sure we aren't getting packet loss, as I've reproduced the bug on two different switches and we aren't doing any bonding.

We do run a cron job that does a "df" every 10 minutes; we use this to update a status webpage. I'll try turning the frequency down to every 2 hours and see if it affects the frequency of crashes.

> He was using some dlm debugging patches and one interesting
> thing we found was that prior to at least one of the assertions,
> create_lkb() was having difficulty creating a new lkb with a
> unique lkid

Should we try similar debugging patches?

---

(In reply to comment #12)
> Here's a script I've whipped up and am running right now that I'm
> hoping will come close to loading gfs in a similar way. Any
> suggestions on how I might improve this?

I'm sure this will stress test it quite well. But if you want the script to do what our scripts do, then you need to have each host and each instance of rm and cp working on separate image trees.
So maybe have 4 "sets" and run this script simultaneously on each node:

```sh
#!/bin/sh
EXPR=/usr/bin/expr
HOSTNAME=/bin/hostname
I=0
MAX=10
H=`${HOSTNAME}`
echo ${H}
while [ ${I} -le ${MAX} ]
do
    echo "${I}: begin"
    date
    time cp -lR set1-${H} bak-${I}-${H}-set1 &
    time cp -lR set2-${H} bak-${I}-${H}-set2 &
    LAST=`${EXPR} ${I} - 1`
    time rm -rf bak-${LAST}-${H}-set1 &
    time rm -rf bak-${LAST}-${H}-set2 &
    wait
    echo "${I}: end"
    date
    I=`${EXPR} ${I} + 1`
done
```

---

I've been running this test (modified per comment 14) for 24 hours so far and will let it continue running through the weekend.

- The test is up to the 27th iteration on both nodes.
- Each rm takes about 33 minutes.
- Each cp takes about 48 minutes.
- I had a loop on one node running df every 10 minutes for about 10 hours and have now stopped that.
- My 1.5TB fs is only about 40% full; I will add more data to each "set" next week.
- Each "set" now contains about 13,000 10MB files in a flat dir plus an rsynced copy of the root fs with about 200,000 files.

I'm running with a new dlm patch that I'll attach. It adds some debugging statements to point out errors early in the suspected areas of the code. The patch also adds a small "cache" of 100 recently deleted locks. Since the central problem in this bz is that an lkb seems to disappear, the idea is to be able to look through the "trash" of recently deleted lkbs for the missing one if we hit the problem (a minimal sketch of this idea follows the attachment below).

---

Created attachment 134957 [details]
dlm debugging patch
Could you apply and use this patch, or should I create a dlm binary?
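As a reader's aid, here is a minimal, self-contained C sketch of the "trash" idea described above: a fixed-size ring of recently freed lkids that the "unknown lkid" error path can consult. The names (trash_add, trash_contains, TRASH_SIZE) are illustrative, not the actual symbols in attachment 134957:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TRASH_SIZE 100   /* the patch keeps the last 100 deleted locks */

static uint32_t trash[TRASH_SIZE];  /* recently freed lkids; 0 = empty slot */
static unsigned int trash_next;     /* next slot to overwrite */

/* Record an lkid whenever its lkb is released. */
static void trash_add(uint32_t lkid)
{
    trash[trash_next] = lkid;
    trash_next = (trash_next + 1) % TRASH_SIZE;
}

/* Called from the "unknown lkid" error path: was this lock deleted recently? */
static bool trash_contains(uint32_t lkid)
{
    for (unsigned int i = 0; i < TRASH_SIZE; i++)
        if (trash[i] == lkid)
            return true;
    return false;
}

int main(void)
{
    trash_add(0xff87007d);  /* the lkid later reported missing in this bz */
    printf("recently deleted: %d\n", trash_contains(0xff87007d)); /* 1 */
    printf("recently deleted: %d\n", trash_contains(0x00010001)); /* 0 */
    return 0;
}
```

When the bug was eventually hit (see the "no lkid ... in trash" message below), this lookup came back empty, suggesting the lkb had never been deleted at all, which is consistent with the zero-lkid root cause found later.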
---

I reproduced the original bug after two days and 54 iterations of the test program. For 52 minutes prior to the assertion, create_lkb() was warning constantly about having to retry when looking for a unique lockid. There was no inter-machine locking going on, because processes on the two machines were working on their own sets of files. The missing lkb was not found in the "trash".

---

(In reply to comment #17)
> I reproduced the original bug after two days and 54 iterations of
> the test program. For 52 minutes prior to the assertion, create_lkb()
> was warning constantly about having to retry when looking for a
> unique lockid. There was no inter-machine locking going on because
> processes on the two machines were working on their own sets of files.
> The missing lkb was not found in the "trash".

Glad to hear it's reproducible; a pity it takes so long to trigger. We probably hit it a bit more frequently because we deal with so many more files every night. I've applied the patch (id=134957), but we won't get any data out of it for a day or two as we are running a gfs_fsck at the moment (48TB takes around 3 days!). A RAID threw a mental and knocked out some blocks.

---

Don't bother applying the patch I attached; I don't think it will give us any new info at this point. I've reproduced this again in a much shorter time with a different patch that adds more debugging. The create_lkb() warnings appeared 5 minutes before the bug. The new debug info dumps the rsb against which the expected lock is held. The rsb shows a granted PR lkb with a 0 lkid, which probably corresponds to the lkid that's not found. I believe this new info will get us much closer to solving this.

```
Aug 29 16:36:56 bull-05 kernel: dlm: x: dlm_lock: convert unknown lkid ff87007d " 2 108bbf7f"
Aug 29 16:36:56 bull-05 kernel: no lkid ff87007d in trash 15935394
Aug 29 16:36:56 bull-05 kernel: dlm: rsb
Aug 29 16:36:56 bull-05 kernel: name " 2 108bbf7f"
Aug 29 16:36:56 bull-05 kernel: nodeid 0
Aug 29 16:36:56 bull-05 kernel: flags 2
Aug 29 16:36:56 bull-05 kernel: ref 2
Aug 29 16:36:56 bull-05 kernel: grantqueue
Aug 29 16:36:56 bull-05 kernel: dlm: lkb
Aug 29 16:36:56 bull-05 kernel: id 0
Aug 29 16:36:56 bull-05 kernel: remid 0
Aug 29 16:36:56 bull-05 kernel: flags 0
Aug 29 16:36:56 bull-05 kernel: status 2
Aug 29 16:36:56 bull-05 kernel: rqmode -1
Aug 29 16:36:56 bull-05 kernel: grmode 3
Aug 29 16:36:56 bull-05 kernel: nodeid 0
Aug 29 16:36:56 bull-05 kernel: lqstate 0
Aug 29 16:36:56 bull-05 kernel: lqflags 10001
Aug 29 16:36:56 bull-05 kernel: convertqueue
Aug 29 16:36:57 bull-05 kernel: waitqueue
Aug 29 16:36:57 bull-05 kernel: x: num=2,108bbf7f err=-22 cur=3 req=5 lkf=44 id=ff87007d
```

Dumping /proc/cluster/dlm_locks on bull-05 also shows the lkb having a 0 id. No other lkbs had a 0 id.

```
Resource 00000100704bf598 (parent 0000000000000000). Name (len=24) " 2 108bbf7f"
Master Copy
Granted Queue
00000000 PR
Conversion Queue
Waiting Queue
```

---

When the lkid counter for bucket 0 in the lkb hash table rolls over (it's a uint16), it results in a lkid of zero being assigned to the lkb. I've verified with the test above that this happens. The zero lkid confuses lock_dlm, which uses lkid != 0 to determine whether the lock is being converted or not. I still need to verify that this is the root cause of all the problems for this bug, but I suspect it is. I'll attach a patch to fix the zero lkid problem.

---

Created attachment 135286 [details]
patch to skip zero lkids
Please try this patch; it definitely fixes a bug, probably the cause of the trouble in this bz.
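To make the root cause concrete, here is a small standalone C sketch of the failure and of the fix. It is not the dlm-kernel source: it assumes the lkid layout implied by the logs in this bz (a 16-bit per-bucket counter in the upper half and the hash bucket in the lower half, e.g. "create_lkb retry 1026b" = counter 0x1, bucket 0x26b), and the bucket count is illustrative:

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_BUCKETS 256                  /* illustrative; bucket < NUM_BUCKETS */

static uint16_t counter[NUM_BUCKETS];    /* 16-bit counter per hash bucket */

/* Buggy behaviour: lkid = (counter << 16) | bucket; the counter wraps,
 * so in bucket 0 the wrap produces an lkid of exactly zero. */
static uint32_t make_lkid(unsigned int bucket)
{
    return ((uint32_t)++counter[bucket] << 16) | bucket;
}

/* The fix sketched by attachment 135286, per the commit message below:
 * never hand out an lkid of 0 (only bucket 0 can produce one). */
static uint32_t make_lkid_fixed(unsigned int bucket)
{
    uint32_t lkid;
    do {
        lkid = ((uint32_t)++counter[bucket] << 16) | bucket;
    } while (lkid == 0);
    return lkid;
}

int main(void)
{
    /* Same trick as the verification test below: start the counter
     * near the wrap point so the rollover happens immediately. */
    counter[0] = 0xfffe;
    printf("%08x\n", make_lkid(0));        /* ffff0000 */
    printf("%08x\n", make_lkid(0));        /* 00000000: the fatal zero lkid */

    counter[0] = 0xfffe;
    printf("%08x\n", make_lkid_fixed(0));  /* ffff0000 */
    printf("%08x\n", make_lkid_fixed(0));  /* 00010000: zero skipped */
    return 0;
}
```

Under this layout, only bucket 0 can ever produce a zero lkid, and only after its 16-bit counter wraps, which is consistent with both the near-rollover lkid ff87007d in the log above and the very long heavy link/unlink runs needed to trigger the assertion.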
---

(In reply to comment #22)
> Created an attachment (id=135286)
> Please try this patch; it definitely fixes a bug, probably the cause
> of the trouble in this bz.

Patch applied; if it's the fix then you probably won't hear back from us for a week or so. I've left the debug code from dlm-fs1.patch and attachment id=134957 in too, in case this isn't the fix and more debug data is required. Thanks for your work on this bug so far.

---

I've confirmed with my own test (initializing the counter to FF00 so we hit the zero lkid more quickly) that the zero lkid is the cause of the original panic. So the patch from comment #22 will very likely be the final and complete solution for this bz. I will run the test with only the fix applied for a couple of days to verify further.

---

(In reply to comment #23)

Dave,

> Patch applied; if it's the fix then you probably won't hear back from us
> for a week or so. I've left the debug code from dlm-fs1.patch and
> attachment id=134957 in too, in case this isn't the fix and more debug
> data is required.

We have had no assertions since adding the "zero lkid" patch last week. It looks like the fix. I've removed the extra debugging code and will now just run with the patch from comment #22. Thanks for your time and effort; we are finally getting all our backups through every night without any manual intervention.

---

My test has also been running for about a week now without any problem.

---

```
Module name:  cluster
Branch:       RHEL4
Changes by:   teigland 2006-08-31 15:22:41

Modified files:
    dlm-kernel/src: lkb.c

Log message:
    Don't create lkids of 0. When the lkid counter for bucket 0 of the
    hash table rolls over (it's uint16), we'd get a lkid of zero and
    assign it to the new lkb. lock_dlm would be confused by a zero lkid
    and use some other, incorrect value as the lkid, which would lead to
    a dlm error, which would lead to a lock_dlm/gfs panic.
    This should fix bz 199673; it may also fix bz 172944.
```

---

*** Bug 172944 has been marked as a duplicate of this bug. ***

*** Bug 203829 has been marked as a duplicate of this bug. ***

---

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0702.html

---

*** Bug 209582 has been marked as a duplicate of this bug. ***