Bug 199673 - lock_dlm assertion failing under heavy (hard)linking or unlinking
lock_dlm assertion failing under heavy (hard)linking or unlinking
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs (Show other bugs)
4
x86_64 Linux
medium Severity medium
: ---
: ---
Assigned To: David Teigland
GFS Bugs
:
: 172944 203829 209582 (view as bug list)
Depends On:
Blocks: 207487
  Show dependency treegraph
 
Reported: 2006-07-21 06:17 EDT by Stephen Willey
Modified: 2010-01-11 22:12 EST (History)
8 users (show)

See Also:
Fixed In Version: RHBA-2006-0702
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2006-10-11 12:45:16 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
dlm "rm" unlinking (2.00 KB, text/plain)
2006-08-22 05:45 EDT, Stephen Willey
no flags Details
dlm rsync "linking" (2.37 KB, text/plain)
2006-08-22 05:46 EDT, Stephen Willey
no flags Details
dlm "rm" unlinking (3.90 KB, text/plain)
2006-08-23 07:15 EDT, Stephen Willey
no flags Details
dlm debugging patch (3.74 KB, patch)
2006-08-25 15:59 EDT, David Teigland
no flags Details | Diff
patch to skip zero lkid's (518 bytes, patch)
2006-08-31 09:36 EDT, David Teigland
no flags Details | Diff

  None (edit)
Description Stephen Willey 2006-07-21 06:17:24 EDT
Description of problem:  We use dirvish to do incremental backups of a large
amount of data onto a 45Tb GFS.  This creates a large number of hard links (made
with rsync) and each night, a lot of these are unlinked by rm as they expire. 
Regularly (every 48 hours or so), one of our two servers attached to the GFS
will crash while doing a lot of un/linking.


Version-Release number of selected component (if applicable):
gfs: 6.1.5
kernel: 2.6.9-34smp
dlm: 1.0.0-5

How reproducible:

As mentioned above, it recurs approximately every 48 hours.


Steps to Reproduce:
1. rsync hardlink a vast number of files (several simultaneous rsyncs on each host).
2. rm -rf large numbers of hardlinks (simultaneously).
3.
  
Actual results:

Crashes and gets STONITHed by the other host.  Here are a couple of stack traces
from netdump:

lock_dlm:  Assertion failed on line 428 of file
/usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 4435245392
gfs1: num=2,8c8d370d err=-22 cur=3 req=5 lkf=44

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:428
invalid operand: 0000 [1]

 nfsd parport_pc i2c_core cman pcmcia_core dm_mirror battery tg3 sd_mod
Pid: 1888, comm: rsync Not tainted 2.6.9-34smp
<ffffffffa0208956>{:lock_dlm:do_dlm_lock+365}RSP: 0018:000001000d317b88  EFLAGS:
00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 000000000003b690
RDX: 0000000000000246 RSI: 0000000000000246 RDI: ffffffff8038e160
RBP: 000001006f70b980 R08: 0000000000000004 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff80209c6f R12: 00000100f4d1fc00
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS:  0000002a955863a0(0000) GS:ffffffff804b6040(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a985a1008 CR3: 0000000000101000 CR4: 00000000000006e0
Process rsync (pid: 1888, threadinfo 000001000d316000, task 000001005d6ee030)
Stack:


<ffffffff8011d5cc>{flat_send_IPI_mask+0}
       <ffffffffa01d8595>{:gfs:gfs_lm_lock+50}
<ffffffffa01cf683>{:gfs:gfs_glock_xmote_th+357}
       <ffffffffa01cd985>{:gfs:run_queue+668}
<ffffffffa01ce977>{:gfs:gfs_glock_nq+938}
      
<ffffffffa01ceb6f>{:gfs:gfs_glock_nq_m+416}<ffffffffa01e507d>{:gfs:gfs_link+149}
<ffffffff801813ac>{vfs_link+308} <ffffffff801814a1>{sys_link+158}
       <ffffffff801101be>{system_call+126}

Code: 0f 0b d8 ce 20 a0 ff ff ff ff ac 01 48 c7 c7 dd ce 20 a0 31
RIP <ffffffffa0208956>{:lock_dlm:do_dlm_lock+365} RSP <000001000d317b88>

Modules linked in: nfs nfsd exportfs lockd nfs_acl parport_pc lp parport
netconsole netdump i2c_dev i2c_core lock_dlm gfs lock_harness dlm cman ipv6
sunrpc ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod button battery
ac ohci_hcd hw_random shpchp tg3 qla2400 qla2xxx scsi_transport_fc cciss sd_mod
scsi_mod
Pid: 1888, comm: rsync Not tainted 2.6.9-34smp
RIP: 0010:[<ffffffffa0208956>] <ffffffffa0208956>{:lock_dlm:do_dlm_lock+365}
RSP: 0018:000001000d317b88  EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 000000000003b690
RDX: 0000000000000246 RSI: 0000000000000246 RDI: ffffffff8038e160
RBP: 000001006f70b980 R08: 0000000000000004 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff80209c6f R12: 00000100f4d1fc00
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS:  0000002a955863a0(0000) GS:ffffffff804b6040(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a985a1008 CR3: 0000000000101000 CR4: 00000000000006e0

Call Trace:<ffffffffa0208956>{:lock_dlm:do_dlm_lock+365}
<ffffffffa0208a92>{:lock_dlm:lm_dlm_lock+214}
       <ffffffff8011d5cc>{flat_send_IPI_mask+0}
<ffffffffa01d8595>{:gfs:gfs_lm_lock+50}
       <ffffffffa01cf683>{:gfs:gfs_glock_xmote_th+357}
<ffffffffa01cd985>{:gfs:run_queue+668}
       <ffffffffa01ce977>{:gfs:gfs_glock_nq+938}
<ffffffffa01ceb6f>{:gfs:gfs_glock_nq_m+416}
       <ffffffffa01e507d>{:gfs:gfs_link+149} <ffffffff801813ac>{vfs_link+308}
       <ffffffff801814a1>{sys_link+158} <ffffffff801101be>{system_call+126}

...
...



lock_dlm:  Assertion failed on line 428 of file
/usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 4438575995
gfs1: num=2,23fc019f err=-22 cur=3 req=5 lkf=44

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:428
invalid operand: 0000 [1]

 parport_pc i2c_dev i2c_core cman pcmcia_core dm_mirror battery tg3 sd_mod
Pid: 5463, comm: rm Not tainted 2.6.9-34smp
<ffffffffa0208956>{:lock_dlm:do_dlm_lock+365}
RSP: 0018:000001001a6dfc18  EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000006f46
RBP: 000001009cdb34c0 R08: 0000000000000004 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff80209c6f R12: 0000010037c61200
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS:  0000002a95585b00(0000) GS:ffffffff804b60c0(0000) knlGS:00000000f7fce6c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a9759afe8 CR3: 00000000f5798000 CR4: 00000000000006e0
Process rm (pid: 5463, threadinfo 000001001a6de000, task 00000100d45b4030)
Stack: 2020202020202020
0000000000000003
       0000000000000001 0000000000000000
Call Trace:<ffffffffa0208a92>{:lock_dlm:lm_dlm_lock+214}
<ffffffffa01d8595>{:gfs:gfs_lm_lock+50}
       <ffffffffa01cf683>{:gfs:gfs_glock_xmote_th+357}
<ffffffffa01cd985>{:gfs:run_queue+668}
       <ffffffffa01ce977>{:gfs:gfs_glock_nq+938}
<ffffffffa01ceb6f>{:gfs:gfs_glock_nq_m+416}
       <ffffffffa01e53d5>{:gfs:gfs_unlink+133} <ffffffff80180ec0>{vfs_unlink+439}
       <ffffffff80180fc5>{sys_unlink+185} <ffffffff80183707>{sys_getdents64+166}
       <ffffffff801101be>{system_call+126}

Code: 0f 0b d8 ce 20 a0 ff ff ff ff ac 01 48 c7 c7 dd ce 20 a0 31
RIP <ffffffffa0208956>{:lock_dlm:do_dlm_lock+365} RSP <000001001a6dfc18>

Modules linked in: nfs nfsd exportfs lockd nfs_acl parport_pc lp parport
netconsole netdump i2c_dev i2c_core lock_dlm gfs lock_harness dlm cman ipv6
sunrpc ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod button battery
ac ohci_hcd hw_random shpchp tg3 qla2400 qla2xxx scsi_transport_fc cciss sd_mod
scsi_mod
Pid: 5463, comm: rm Not tainted 2.6.9-34smp
RIP: 0010:[<ffffffffa0208956>] <ffffffffa0208956>{:lock_dlm:do_dlm_lock+365}
RSP: 0018:000001001a6dfc18  EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000006f46
RDX: 0000000000000246 RSI: 0000000000000246 RDI: ffffffff8038e160
RBP: 000001009cdb34c0 R08: 0000000000000004 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff80209c6f R12: 0000010037c61200
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS:  0000002a95585b00(0000) GS:ffffffff804b60c0(0000) knlGS:00000000f7fce6c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a9759afe8 CR3: 00000000f5798000 CR4: 00000000000006e0

Call Trace:<ffffffffa0208956>{:lock_dlm:do_dlm_lock+365}
<ffffffffa0208a92>{:lock_dlm:lm_dlm_lock+214}
       <ffffffffa01d8595>{:gfs:gfs_lm_lock+50}
<ffffffffa01cf683>{:gfs:gfs_glock_xmote_th+357}
       <ffffffffa01cd985>{:gfs:run_queue+668}
<ffffffffa01ce977>{:gfs:gfs_glock_nq+938}
       <ffffffffa01ceb6f>{:gfs:gfs_glock_nq_m+416}
<ffffffffa01e53d5>{:gfs:gfs_unlink+133}
       <ffffffff80180ec0>{vfs_unlink+439} <ffffffff80180fc5>{sys_unlink+185}
       <ffffffff80183707>{sys_getdents64+166} <ffffffff801101be>{system_call+126}

Expected results:

None.

Additional info:
Comment 1 David Teigland 2006-07-21 10:02:31 EDT
The dlm is returning EINVAL, but there are a lot of possible
reasons for that.  We've added some messages to pinpoint which
error condition is causing this, but those changes won't be
included until the 4.5 release.

This is the first case I've seen where this problem is so
repeatable, perhaps we could create some special rpm's for you
to try that would give us some more data to go on.

Here's another recent report where we needed these extra printk's
do get some idea what was going wrong:
https://www.redhat.com/archives/linux-cluster/2006-June/msg00132.html
Comment 2 Stephen Willey 2006-07-21 10:11:52 EDT
Sounds good.  We need to get to the bottom of this ASAP so if you can send us
some patches/SRPMs, that'd be great.

Thanks,

Stephen
Comment 3 David Teigland 2006-07-21 10:51:19 EDT
http://redhat.com/~teigland/dlm-fs1.ko
http://redhat.com/~teigland/dlm-fs1.patch

dlm-fs1.ko is a dlm kernel module with some extra printk's that
should help narrow down what's wrong, built for x86_64 2.6.9-34.ELsmp.

dlm-fs1.patch is the corresponding patch to
cluster/dlm-kernel/src/locking.c in case you'd like to build your own
module from source.

The next time your machine hits this assertion, please send the
additional kernel messages beginning "dlm:", they should appear
before the "lock_dlm: ..." messages you've already sent.

There's also a tuning option to increase the lock caching gfs is
permitted to do; the default is probably too low if you have vast
numbers of locks and gigabytes of main memory.  It can often improve
performance, but it may also delay the onset of the problems you're
having:
  echo "500000" >> /proc/cluster/lock_dlm/drop_count
Comment 4 Stephen Willey 2006-08-22 05:43:07 EDT
(In reply to comment #3)

> The next time your machine hits this assertion, please send the
> additional kernel messages beginning "dlm:", they should appear
> before the "lock_dlm: ..." messages you've already sent.

A bit of delay due to a RAID problem that we've only just fixed. We observed
crashes last night for an "rm" task and a "rsync" task. I'm attaching both
netdump recorded stack traces for your consideration.
Comment 5 Stephen Willey 2006-08-22 05:45:15 EDT
Created attachment 134625 [details]
dlm "rm" unlinking
Comment 6 Stephen Willey 2006-08-22 05:46:44 EDT
Created attachment 134626 [details]
dlm rsync "linking"
Comment 7 Stephen Willey 2006-08-23 07:15:55 EDT
Created attachment 134703 [details]
dlm "rm" unlinking

Another example netdump trace from last night. These really are occurring
nightly now. What does the error "convert unknown lkid" mean exactly?
Comment 8 David Teigland 2006-08-23 10:07:36 EDT
This looks quite similar to bz 172944 where an unlock is happening
on a lock that no longer exists.  Here a convert is being done on
a lock that no longer exists (or can't be found).  e.g. gfs is
holding a lock, it goes to unlock or convert it and the dlm can't
find any record of the lock that gfs believes it holds.  So, the
dlm returns an error and gfs panics (it doesn't want to corrupt
anything based on the fact that it might be holding an invalid lock).

The main question that we're looking into is why these locks seem to
be disappearing in the dlm.  Thanks for trying the special kernel
modules, it focuses our investigation quite a bit.
Comment 9 David Teigland 2006-08-23 10:21:16 EDT
We've never seen this problem in our own testing and want to
try to emulate what you're doing to see if we can trigger it.
We've not used dirvish either.  Clearly this stresses gfs in
ways that our normal stress tests don't.

Could you briefly describe your gfs cluster (how many and nodes,
how much memory each has), what each of the nodes is doing, how
many gfs file systems, how many original fs's are being backed
up, approx number/size of the files, and any other details about
how dirvish is configured?  Thanks
Comment 10 Stephen Willey 2006-08-23 10:44:00 EDT
(In reply to comment #9)

> Could you briefly describe your gfs cluster (how many and nodes,
> how much memory each has), what each of the nodes is doing, how
> many gfs file systems, how many original fs's are being backed
> up, approx number/size of the files, and any other details about
> how dirvish is configured?  Thanks

I suspect this will be difficult for you to replicate! The GFS filsystem is
connected to 7 x 6.4Tb RAIDs and 2 X HP DL385 servers with 4gig ram each. It is
formatted as one big 48Tb filesystem. We backup and expire probably around 30
"blocks" of storage each night. The size of the original filesystems range from
30 Gigs to 11Tb! Every night we use dirvish/rsync to hardlink and copy against
the previous night's backup. So if the file is unchaged rsync makes a hardlink
but if the file is changed/new it gets copied to GFS. We keep 10 days worth of
these "incrementals" online. Every evening we expire the backup images older
than 10 days by doing an rm -rf of the whole directory. This is all handled by
the "dirvish" scripts but can be easily done by hand as well.

Now in order to maximise throughput we run 4 rsync backups on each server
simultaneously and for the expires we run 2 "rm" commands simultaneously on each
server.

If you want to try and reproduce this maybe you can create a two node cluster
with ~6Tb of data on it (we have lots of 13Mb image files). Then make a few 
hard linked copies of the directories of data using "cp -lR". Then run a few rm
-rf instances on each node simultaneously and hope one of them crashes!

If there is any more debugging info we can provide don't hesitate to contact us.
We are obviously quite keen to avoid having one of our nodes crash every other
night and are willing to run beta code in order to fix it.
Comment 11 David Teigland 2006-08-23 16:43:04 EDT
Looking back, I worked on this same problem quite a while ago
with sunjw onewaveinc com -- bug 172944 references his report.
(I believe he worked around the problem by not calling 'df' so
frequently, don't know why or how that helped his situation.)

He was using some dlm debugging patches and one interesting
thing we found was that prior to at least one of the assertions,
create_lkb() was having difficulty creating a new lkb with a
unique lkid:

Apr 30 00:00:27 nd10 kernel: create_lkb retry 1026b
Apr 30 00:00:28 nd10 kernel: create_lkb retry 2026b
Apr 30 00:00:28 nd10 kernel: create_lkb retry 2008d
Apr 30 00:00:28 nd10 kernel: create_lkb retry 3008d
Apr 30 00:00:28 nd10 kernel: create_lkb retry 501a5
Apr 30 00:00:28 nd10 kernel: create_lkb retry 601a5
Apr 30 00:00:28 nd10 kernel: create_lkb retry 10250
Apr 30 00:00:29 nd10 kernel: create_lkb retry 10210
Apr 30 00:00:29 nd10 kernel: find_lock_by_id 1d103f8 b 1016 c 470 roll 5 u_last 
1d003f8 u_high feff03f8
Apr 30 00:00:29 nd10 kernel: UNLOCK 1D103F8 NO ID
Comment 12 David Teigland 2006-08-23 17:35:08 EDT
Here's a script I've whipped up and am running right now that I'm
hoping will come close to loading gfs in a similar way.  Any
suggestions on how I might improve this?

The "set1" and "set2" directories that I'm copying are flat and
each have about 13,000 10MB files.  I'll increase the sizes of
those directory trees and add some depth.  I have a 1.5 TB fs
(pretty puny I know).

#!/bin/sh

EXPR=/usr/bin/expr
HOSTNAME=/bin/hostname
I=0
MAX=10

H=`${HOSTNAME}`
echo ${H}

while [ ${I} -le ${MAX} ]
do
  echo "${I}: begin"
  date
  time cp -lR set1/ bak-${I}-${H}-set1-1/ &
  time cp -lR set1/ bak-${I}-${H}-set1-2/ &
  time cp -lR set2/ bak-${I}-${H}-set2-1/ &
  time cp -lR set2/ bak-${I}-${H}-set2-2/ &

  LAST=`${EXPR} ${I} - 1`

  time rm -rf bak-${LAST}-${H}-set1-1/ &
  time rm -rf bak-${LAST}-${H}-set1-2/ &
  time rm -rf bak-${LAST}-${H}-set2-1/ &
  time rm -rf bak-${LAST}-${H}-set2-2/ &

  wait
  echo "${I}: end"
  date
  I=`${EXPR} ${I} + 1`
done
Comment 13 Stephen Willey 2006-08-24 05:54:37 EDT
(In reply to comment #11)
> Looking back, I worked on this same problem quite a while ago
> with sunjw onewaveinc com -- bug 172944 references his report.
> (I believe he worked around the problem by not calling 'df' so
> frequently, don't know why or how that helped his situation.)

That bug report seems to end up pointing the finger at packet loss. I'm pretty
sure we arn't getting packet loss as I've reproduced the bug on two different
switches and we arn't doing any bonding.

We do run a cron job that does a "df" every 10 minutes. We use this to update a
status webpage. I'll try turning the frequency down to every 2 hours and see if
it effects the frequency of crashes.

> He was using some dlm debugging patches and one interesting
> thing we found was that prior to at least one of the assertions,
> create_lkb() was having difficulty creating a new lkb with a
> unique lkid:

Should we try similar debugging patches?
Comment 14 Stephen Willey 2006-08-24 06:06:34 EDT
(In reply to comment #12)
> Here's a script I've whipped up and am running right now that I'm
> hoping will come close to loading gfs in a similar way.  Any
> suggestions on how I might improve this?
> 
> The "set1" and "set2" directories that I'm copying are flat and
> each have about 13,000 10MB files.  I'll increase the sizes of
> those directory trees and add some depth.  I have a 1.5 TB fs
> (pretty puny I know).

I'm sure this will stress test it quite well. But if you want the script to do
what our scripts do then need to have each host and instance of rm and cp
working on seperate images trees. So maybe have 4 "sets" and run this script
simultaneously on each node:

#!/bin/sh

EXPR=/usr/bin/expr
HOSTNAME=/bin/hostname
I=0
MAX=10

H=`${HOSTNAME}`
echo ${H}

while [ ${I} -le ${MAX} ]
do
  echo "${I}: begin"
  date
  time cp -lR set1-${H} bak-${I}-${H}-set1 &
  time cp -lR set2-${H} bak-${I}-${H}-set2 &

  LAST=`${EXPR} ${I} - 1`

  time rm -rf bak-${LAST}-${H}-set1 &
  time rm -rf bak-${LAST}-${H}-set2 &

  wait
  echo "${I}: end"
  date
  I=`${EXPR} ${I} + 1`
done
Comment 15 David Teigland 2006-08-25 15:57:01 EDT
I've been running this test (modified per comment 14) for 24 hours so
far and will let it continue running through the weekend.

- The test is up to the 27'th iteration on both nodes.
- Each rm takes about 33 minutes.
- Each cp takes about 48 minutes.
- I had a loop on one node running df every 10 minutes for about
  10 hours and have now stopped that.
- My 1.5TB fs is only about 40% full, will add more data to each
  "set" next week.
- each "set" now contains about 13,000 10MB files in a flat dir plus
  an rsynced copy of the root fs with about 200,000 files

I'm running with a new dlm patch that I'll attach.  It adds some
debugging statements to point out errors early in the suspected
areas of the code.  The patch also add's a small "cache" of 100
recently deleted lock's.  Since the central problem in this bz is
that an lkb seems to disappear, the idea is to be able to look
through the "trash" of recently deleted lkb's for the missing one
if we hit the problem.
Comment 16 David Teigland 2006-08-25 15:59:34 EDT
Created attachment 134957 [details]
dlm debugging patch

Could you apply and use this patch, or should I create a dlm binary?
Comment 17 David Teigland 2006-08-28 14:46:33 EDT
I reproduced the original bug after two days and 54 iterations of
the test program.  For 52 minutes prior to the assertion, create_lkb()
was warning constantly about having to retry when looking for a
unique lockid.  There was no inter-machine locking going on because
processes on the two machines were working on their own sets of files.
The missing lkb was not found in the "trash".
Comment 18 Stephen Willey 2006-08-29 11:13:13 EDT
(In reply to comment #17)
> I reproduced the original bug after two days and 54 iterations of
> the test program.  For 52 minutes prior to the assertion, create_lkb()
> was warning constantly about having to retry when looking for a
> unique lockid.  There was no inter-machine locking going on because
> processes on the two machines were working on their own sets of files.
> The missing lkb was not found in the "trash".

Glad to hear its reproducible - pity it takes so long to trigger. We probably
hit it a bit more frequently cos we deal with so many more files every night.
I've applied the patch (id=134957) but we won;t get any data out of it for a day
or two as we are running a gfs_fsck atm (48Tb takes around 3 days!). A RAID
threw a mental and knocked out some blocks.

Comment 19 David Teigland 2006-08-30 12:21:34 EDT
Don't bother applying the patch I attached, I don't think it will
give us any new info at this point.

I've reproduced this again in a much shorter time with a different
patch that adds more debugging.  The create_lkb() warnings appeared
5 minutes before the bug.

The new debug info dumps the rsb against which the expected lock
is held.  The rsb shows a granted PR lkb with a 0 lkid which probably
corresponds to the lkid that's not found.  I believe this new info
will get us much closer to solving this.

Aug 29 16:36:56 bull-05 kernel: dlm: x: dlm_lock: convert unknown lkid ff87007d
"       2        108bbf7f"
Aug 29 16:36:56 bull-05 kernel: no lkid ff87007d in trash 15935394
Aug 29 16:36:56 bull-05 kernel: dlm: rsb
Aug 29 16:36:56 bull-05 kernel: name "       2        108bbf7f"
Aug 29 16:36:56 bull-05 kernel: nodeid 0
Aug 29 16:36:56 bull-05 kernel: flags 2
Aug 29 16:36:56 bull-05 kernel: ref 2
Aug 29 16:36:56 bull-05 kernel: grantqueue
Aug 29 16:36:56 bull-05 kernel: dlm: lkb
Aug 29 16:36:56 bull-05 kernel: id 0
Aug 29 16:36:56 bull-05 kernel: remid 0
Aug 29 16:36:56 bull-05 kernel: flags 0
Aug 29 16:36:56 bull-05 kernel: status 2
Aug 29 16:36:56 bull-05 kernel: rqmode -1
Aug 29 16:36:56 bull-05 kernel: grmode 3
Aug 29 16:36:56 bull-05 kernel: nodeid 0
Aug 29 16:36:56 bull-05 kernel: lqstate 0
Aug 29 16:36:56 bull-05 kernel: lqflags 10001
Aug 29 16:36:56 bull-05 kernel: convertqueue
Aug 29 16:36:57 bull-05 kernel: waitqueue
Aug 29 16:36:57 bull-05 kernel: x: num=2,108bbf7f err=-22 cur=3 req=5 lkf=44
id=ff87007d
Comment 20 David Teigland 2006-08-30 13:09:06 EDT
Dumping /proc/cluster/dlm_locks on bull-05 also shows the lkb
having a 0 id.  No other lkb's had a 0 id.

Resource 00000100704bf598 (parent 0000000000000000). Name (len=24) "       2    
    108bbf7f"  
Master Copy
Granted Queue
00000000 PR
Conversion Queue
Waiting Queue

Comment 21 David Teigland 2006-08-31 09:32:26 EDT
When the lkid counter for bucket 0 in the lkb hash table rolls over
(it's a uint16), it results in a lkid of zero being assigned to the
lkb.  I've verified with the test above that this happens.  The zero
lkid confuses lock_dlm which uses lkid != 0 to determine if the lock
is being converted or not.  I still need to verify that this is the
root cause of all the problems for this bug, I suspect it is.  I'll
attach a patch to fix the zero lkid problem.
Comment 22 David Teigland 2006-08-31 09:36:57 EDT
Created attachment 135286 [details]
patch to skip zero lkid's

Please try this patch, it definately fixes a bug, probably the cause
of the trouble in this bz.
Comment 23 Stephen Willey 2006-08-31 10:04:55 EDT
(In reply to comment #22)
> Created an attachment (id=135286) [edit]
> Please try this patch, it definately fixes a bug, probably the cause
> of the trouble in this bz.

Patch applied - if its the fix then you probably won't hear back from us for a
week or so. I've left in the debug code in from dlm-fs1.patch and attachment
id=134957 too in case this isn't the fix and more debug data will be required.

Thanks for your work on this bug so far.
Comment 24 David Teigland 2006-08-31 10:30:57 EDT
I've confirmed with my own test (initializing the counter to FF00 so
we hit the zero lkid more quickly) that the zero lkid is the cause
of the original panic.  So, the patch from comment #22 will very
likely be the final and complete solution for this bz.  I will run
the test with only the fix applied for a couple days to verify
further.
Comment 25 Stephen Willey 2006-09-07 05:51:25 EDT
(In reply to comment #23)

Dave,

> Patch applied - if its the fix then you probably won't hear back from us for a
> week or so. I've left in the debug code in from dlm-fs1.patch and attachment
> id=134957 too in case this isn't the fix and more debug data will be required.

We have had no assertions since adding the "zero lkid" patch last week. It looks
like the fix. I've removed the extra debugging code and will now just run with
the patch from comment #22. 

Thanks for your time and effort - we are finally getting all our backups through
every night without any manual intervention.
Comment 26 David Teigland 2006-09-07 09:49:47 EDT
My test has also been running for about a week now without any problem.

Module name:    cluster
Branch:         RHEL4
Changes by:     teigland@sourceware.org 2006-08-31 15:22:41

Modified files:
        dlm-kernel/src : lkb.c

Log message:
        Don't create lkids of 0.  When the lkid counter for bucket 0 of the
        hash table rolls over (it's uint16), we'd get a lkid of zero and
        assign it to the new lkb.  lock_dlm would be confused by a zero lkid
        and use some other, incorrect value as the lkid which would lead to
        a dlm error which would lead to a lock_dlm/gfs panic.
        this should fix bz 199673
        it may also fix bz 172944
Comment 27 David Teigland 2006-09-22 15:11:24 EDT
*** Bug 172944 has been marked as a duplicate of this bug. ***
Comment 29 Lon Hohberger 2006-10-05 16:05:58 EDT
*** Bug 203829 has been marked as a duplicate of this bug. ***
Comment 31 Red Hat Bugzilla 2006-10-11 12:45:18 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0702.html
Comment 33 David Teigland 2006-11-29 17:25:40 EST
*** Bug 209582 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.