Bug 537010 - GFS2 OOPS
Summary: GFS2 OOPS
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 12
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: David Teigland
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 525739
Depends On:
Blocks: 562917
 
Reported: 2009-11-12 08:10 UTC by Fabio Massimo Di Nitto
Modified: 2010-08-01 23:20 UTC
CC: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-07-30 10:38:38 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
Logs from all the 6 nodes (590.26 KB, application/x-bzip2)
2009-11-12 08:10 UTC, Fabio Massimo Di Nitto
new logs from all the 6 nodes (1.45 MB, application/x-bzip2)
2009-11-13 14:02 UTC, Fabio Massimo Di Nitto
Fabio's latest set of test results (210.00 KB, application/x-tar)
2010-02-01 15:15 UTC, Steve Whitehouse
Fabio's latest results (17.07 KB, application/x-bzip2)
2010-02-02 10:28 UTC, Steve Whitehouse
Debugging patch (1.35 KB, patch)
2010-02-09 10:02 UTC, Steve Whitehouse
glock trace (428.15 KB, text/plain)
2010-02-12 16:41 UTC, Steve Whitehouse
patch to test (6.72 KB, text/plain)
2010-02-17 23:06 UTC, David Teigland
updated patch (7.30 KB, text/plain)
2010-02-19 20:27 UTC, David Teigland
patch to test (11.28 KB, text/plain)
2010-02-23 20:45 UTC, David Teigland

Description Fabio Massimo Di Nitto 2009-11-12 08:10:54 UTC
Created attachment 369152 [details]
Logs from all the 6 nodes

Description of problem:

GFS2 OOPS during mount/umount operations

Version-Release number of selected component (if applicable):

2.6.31.5-122.fc12.i686.PAE

How reproducible:

Always

Steps to Reproduce:
1. 6 nodes cluster, 3 x86, 3 x86_64
2. start the test suite (2 nodes mount/umount, 2 nodes write tests, 1 node clvmd test and 1 node rgmanager service test)
3. wait...

Actual results:

during one of the umount operations, GFS2 will oops (see node1 logs attached)

activity in the cluster was not blocked. node3/node5 were able to continue read/write operations on the gfs2 partitions.

node1/node4 mount/umount operations were blocked.

Comment 1 Steve Whitehouse 2009-11-12 09:24:40 UTC
The interesting bit is this:

G:  s:UN n:2/264793 f:I t:UN d:EX/0 a:0 r:0
------------[ cut here ]------------
kernel BUG at fs/gfs2/glock.c:173!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/kernel/dlm/gfs2/control
Modules linked in: gfs2 dlm configfs nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 virtio_balloon virtio_net i2c_piix4 i2c_core joydev virtio_blk virtio_pci floppy [last unloaded: scsi_wait_scan]

Pid: 1698, comm: dlm_astd Not tainted (2.6.31.5-122.fc12.i686.PAE #1) 
EIP: 0060:[<f8972ad0>] EFLAGS: 00010246 CPU: 0
EIP is at gfs2_glock_hold+0x18/0x22 [gfs2]
EAX: 00000000 EBX: c1513430 ECX: c0a38e50 EDX: 00000000
ESI: f8989634 EDI: 003b9410 EBP: f55d1f7c ESP: f55d1f7c
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process dlm_astd (pid: 1698, ti=f55d0000 task=f59a6600 task.ti=f55d0000)
Stack:
 f55d1f98 f8973a82 00000000 00000246 f697e6c0 f8989634 00000002 f55d1fa0
<0> f8989662 f55d1fb8 f84e41d8 05000000 f5537ed0 00000000 f84e410e f55d1fe0
<0> c0450aab 00000000 00000000 00000000 f55d1fcc f55d1fcc c0450a3b 00000000
Call Trace:
 [<f8973a82>] ? gfs2_glock_cb+0x1e/0x7e [gfs2]
 [<f8989634>] ? gdlm_bast+0x0/0x41 [gfs2]
 [<f8989662>] ? gdlm_bast+0x2e/0x41 [gfs2]
 [<f84e41d8>] ? dlm_astd+0xca/0x112 [dlm]
 [<f84e410e>] ? dlm_astd+0x0/0x112 [dlm]
 [<c0450aab>] ? kthread+0x70/0x75
 [<c0450a3b>] ? kthread+0x0/0x75
 [<c0409bc7>] ? kernel_thread_helper+0x7/0x10
Code: c2 31 c0 e8 be fc ff ff 0f 0b eb fe e8 c4 f9 ff ff 5d c3 55 89 e5 0f 1f 44 00 00 8b 50 18 85 d2 75 0d 89 c2 31 c0 e8 9b fc ff ff <0f> 0b eb fe 3e ff 40 18 5d c3 55 89 e5 57 56 53 83 ec 0c 0f 1f 
EIP: [<f8972ad0>] gfs2_glock_hold+0x18/0x22 [gfs2] SS:ESP 0068:f55d1f7c
---[ end trace 3e320fe83a62faf1 ]---


It looks like a glock ref count issue. The ref count should be greater than 0 at all times when there is an outstanding request relating to the glock.

Comment 2 Steve Whitehouse 2009-11-12 09:29:39 UTC
Actually, there is something stranger here... the lock is unlocked, yet the dlm is requesting that the lock be demoted. The reason that the ref count is zero is that the lock is not locked in the first place, so something strange is going on here.

Comment 3 David Teigland 2009-11-12 16:42:37 UTC
Isn't this the very issue you emailed about today?
"Subject: dlm lockspace shutdown"

gfs2 is sending off a bunch of unlocks and then disappearing before it collects the callbacks for them.

Comment 4 Steve Whitehouse 2009-11-12 17:17:00 UTC
No, this is a different issue. This morning's was caused by the umount procedure, where we do:

1. send unlock requests for all dlm locks
2. unmount the lock space
3. not all unlock requests receive callbacks leaving some glocks still allocated
4. module unload tries to free the kmem cache containing the glocks which calls BUG() as glocks are still allocated

This bug is:
1. there is a glock holding a DLM null lock which is otherwise idle
2. it receives a "bast" call back requesting that it demote to some mode.
3. the "glock hold" code takes exception to being asked to hold a glock which has a 0 ref count.

Since we always take the glock ref count while a glock is locked, it should never be possible for the DLM callback to occur after the ref count is at zero. We also take a ref count on the glock while any locking operations are in progress, so that isn't the case here either (and if it were the flags in the glock would give it away as well).

Comment 5 David Teigland 2009-11-12 19:14:34 UTC
Hm, so you actually think the dlm is doing something wrong here and not gfs?
That is possible, of course, but seems rather unlikely.  I'd suggest adding a check for this in gfs2 so you can do a printk instead of an oops, and then dumping the dlm lock state.

Comment 6 David Teigland 2009-11-12 22:16:44 UTC
Be sure to collect all the info which includes the bast and queue times:
  dlm_tool -s -v -w lockdebug <name>

Comment 7 Steve Whitehouse 2009-11-13 10:57:04 UTC
I don't think the oops makes any difference as to whether the dlm lock state can be collected or not. Fabio, can you get that information?

Comment 8 Fabio Massimo Di Nitto 2009-11-13 11:50:30 UTC
I'll try to reproduce the issue again and get a lock dump.

Comment 9 Fabio Massimo Di Nitto 2009-11-13 14:02:58 UTC
Created attachment 369434 [details]
new logs from all the 6 nodes

Comment 10 Fabio Massimo Di Nitto 2009-11-13 14:04:27 UTC
David: note that the OOPS is still happening on node1, but there is no lockspace for gfs2 on that node at the time of the crash.

I still attached the output you requested from all the other nodes.

The filesystem is being umounted while the crash happens and probably the lockspace is already gone at that time.

Comment 11 David Teigland 2009-11-13 16:03:26 UTC
Fabio, that's fine, it means comment 3 was probably correct after all.  It seems gfs2 has various issues related to shutting down the lockspace.

Comment 12 Steve Whitehouse 2009-11-13 17:19:47 UTC
*** Bug 525739 has been marked as a duplicate of this bug. ***

Comment 13 Steve Whitehouse 2010-02-01 15:15:12 UTC
Created attachment 388066 [details]
Fabio's latest set of test results

Comment 14 Steve Whitehouse 2010-02-02 10:28:25 UTC
Created attachment 388251 [details]
Fabio's latest results

Comment 15 Steve Whitehouse 2010-02-09 10:02:31 UTC
Created attachment 389709 [details]
Debugging patch

The intent of this patch is to narrow down the issue. Currently we do not know whether the callback is received before or after the demote from NL -> unlocked request is sent. The newly added 'X' flag will tell us that.

Also, we do not know what lock mode is being requested in the problematic callback. The patch will tell us that too.

Gathering a trace from the glock tracepoints should tell us a bit more as well.

Comment 16 Steve Whitehouse 2010-02-12 16:36:55 UTC
Reproduced with the debugging patch:

Feb 12 10:20:25 exxon-01 kernel: dlm: connecting to 2
Feb 12 10:20:26 exxon-01 kernel: gfs2_glock_cb: state=0
Feb 12 10:20:26 exxon-01 kernel: G:  s:UN n:2/22 f:IX t:UN d:EX/0 a:0 r:0
Feb 12 10:20:26 exxon-01 kernel: ------------[ cut here ]------------
Feb 12 10:20:26 exxon-01 kernel: kernel BUG at fs/gfs2/glock.c:173!
Feb 12 10:20:26 exxon-01 kernel: invalid opcode: 0000 [#1] SMP 
Feb 12 10:20:26 exxon-01 kernel: last sysfs file: /sys/fs/gfs2/plurality:myfs/lock_module/block
Feb 12 10:20:26 exxon-01 kernel: CPU 3 
Feb 12 10:20:26 exxon-01 kernel: Pid: 2586, comm: dlm_astd Not tainted 2.6.33-rc6 #2 0YK962/PowerEdge SC1435
Feb 12 10:20:26 exxon-01 kernel: RIP: 0010:[<ffffffff81236663>]  [<ffffffff81236663>] gfs2_glock_hold+0x1a/0x24
Feb 12 10:20:26 exxon-01 kernel: RSP: 0018:ffff8801176ede10  EFLAGS: 00010286
Feb 12 10:20:26 exxon-01 kernel: RAX: 0000000000000000 RBX: ffff8801175c22b8 RCX: 0000000000000034
Feb 12 10:20:26 exxon-01 kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000246
Feb 12 10:20:26 exxon-01 kernel: RBP: ffff8801176ede10 R08: 0000000000072eb7 R09: 0000000000000000
Feb 12 10:20:26 exxon-01 kernel: R10: ffffffff81c4d598 R11: 0000000000000000 R12: 0000000000000000
Feb 12 10:20:26 exxon-01 kernel: R13: ffff8801143d0000 R14: 00000001001d446f R15: 0000000000000002
Feb 12 10:20:26 exxon-01 kernel: FS:  00007fc39db52700(0000) GS:ffff880123c80000(0000) knlGS:0000000000000000
Feb 12 10:20:26 exxon-01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 12 10:20:26 exxon-01 kernel: CR2: 00000038e9478f90 CR3: 0000000216c98000 CR4: 00000000000006e0
Feb 12 10:20:26 exxon-01 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 12 10:20:26 exxon-01 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb 12 10:20:26 exxon-01 kernel: Process dlm_astd (pid: 2586, threadinfo ffff8801176ec000, task ffff88011aa48000)
Feb 12 10:20:26 exxon-01 kernel: Stack:
Feb 12 10:20:26 exxon-01 kernel: ffff8801176ede40 ffffffff812378e7 ffff88011aa48000 ffff88011aa48000
Feb 12 10:20:26 exxon-01 kernel: <0> ffff8801143d0000 ffffffff8125051c ffff8801176ede50 ffffffff8125054b
Feb 12 10:20:26 exxon-01 kernel: <0> ffff8801176edeb0 ffffffff8116ceb8 ffff880212d7da78 0500000000000000
Feb 12 10:20:26 exxon-01 kernel: Call Trace:
Feb 12 10:20:26 exxon-01 kernel: [<ffffffff812378e7>] gfs2_glock_cb+0x38/0xb1
Feb 12 10:20:26 exxon-01 kernel: [<ffffffff8125051c>] ? gdlm_bast+0x0/0x43
Feb 12 10:20:26 exxon-01 kernel: [<ffffffff8125054b>] gdlm_bast+0x2f/0x43
Feb 12 10:20:26 exxon-01 kernel: [<ffffffff8116ceb8>] dlm_astd+0x10a/0x176
Feb 12 10:20:26 exxon-01 kernel: [<ffffffff8116cdae>] ? dlm_astd+0x0/0x176
Feb 12 10:20:26 exxon-01 kernel: [<ffffffff81068e74>] kthread+0x8e/0x96
Feb 12 10:20:26 exxon-01 kernel: [<ffffffff8100aa64>] kernel_thread_helper+0x4/0x10
Feb 12 10:20:26 exxon-01 kernel: [<ffffffff816af210>] ? restore_args+0x0/0x30
Feb 12 10:20:26 exxon-01 kernel: [<ffffffff81068de6>] ? kthread+0x0/0x96
Feb 12 10:20:26 exxon-01 kernel: [<ffffffff8100aa60>] ? kernel_thread_helper+0x0/0x10
Feb 12 10:20:26 exxon-01 kernel: Code: ff e8 ec fc ff ff 0f 0b eb fe e8 e6 f7 ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 8b 47 28 48 89 fe 85 c0 75 0b 31 ff e8 c7 fc ff ff <0f> 0b eb fe f0 ff 47 28 c9 c3 55 48 89 e5 41 57 41 56 41 55 41 
Feb 12 10:20:26 exxon-01 kernel: RIP  [<ffffffff81236663>] gfs2_glock_hold+0x1a/0x24
Feb 12 10:20:26 exxon-01 kernel: RSP <ffff8801176ede10>
Feb 12 10:20:26 exxon-01 kernel: ---[ end trace b8ee515d247fe0b4 ]---

The interesting stuff is in the first three lines of the above report. This is towards the end of the glock lifetime. What is supposed to happen is this:

1. Something demotes the glock to unlocked. That results in a request to the dlm to demote to NL
2. When the reply from the demote to NL is received, the glock ref count is dropped by one (we always hold a glock ref count for SH/DF/EX)
3. When the processing of the lock reply is done, assuming that no other holders exist, that will result in the ref count hitting zero
4. The glock is removed from the glock hash table and a dlm NL -> unlocked request is sent
5. When the reply from the dlm unlock request is received, the glock is freed

Now back to the info from above:
Feb 12 10:20:25 exxon-01 kernel: dlm: connecting to 2
Feb 12 10:20:26 exxon-01 kernel: gfs2_glock_cb: state=0
                                                      ^ We've received an unlock request

Feb 12 10:20:26 exxon-01 kernel: G:  s:UN n:2/22 f:IX t:UN d:EX/0 a:0 r:0

s:UN means that the glock is unlocked (so the dlm mode will be either NL or unlocked)
The 'I' flag means that the glock has been locked at some point during its lifetime
The 'X' flag means that we've sent the request to unlock (i.e. step 4 above)
r:0 means that the ref count is 0 (which we can infer from the 'X' flag)

So the unlock request has arrived while we hold, at the very most, an NL lock, and possibly it has arrived after the final unlock (the current debug patch doesn't tell us that, unfortunately). Either way, whether we are holding an NL lock or are unlocked, we shouldn't be receiving callbacks from other nodes requesting that we drop our lock.

In the end I managed to reproduce this by running postmark on one of four nodes, and having two nodes run mount/umount. The other node had the fs mounted but was idle. Having two nodes running postmark made the system too slow and I didn't see it reproduce.

Comment 17 Steve Whitehouse 2010-02-12 16:39:45 UTC
I had a trace running on the node too. This is the last set of data for the 2/22 glock:

          umount-2607  [002]  2216.921633: gfs2_demote_rq: 8,32 glock 2:22 demote PR to NL flags:DI
 glock_workqueue-64    [003]  2216.921674: gfs2_glock_state_change: 8,32 glock 2:22 state PR to NL tgt:NL dmt:NL flags:lDpI
 glock_workqueue-64    [003]  2216.921677: gfs2_glock_put: 8,32 glock 2:22 state NL => IV flags:I

So we tried to demote it, due to umount, and it was originally in PR and we demoted it to NL and then from NL to IV (unlocked) as per the above sequence. I'm not sure that the trace is of much further use, but I'll attach the whole file in a mo, anyway.

Comment 18 Steve Whitehouse 2010-02-12 16:41:55 UTC
Created attachment 390529 [details]
glock trace

The full trace from the node which oopsed.

Comment 20 David Teigland 2010-02-12 18:55:56 UTC
*** Bug 562917 has been marked as a duplicate of this bug. ***

Comment 21 David Teigland 2010-02-12 19:24:56 UTC
I can't tell from the glock trace whether the bast in question was delivered after the ast completion for the unlock. The dlm shouldn't be sending basts on locks for which it has sent a completion ast for an unlock -- that would be a dlm bug, and that's the first thing I'd like to clarify. Does the current glock trace give us that info, or can it?

Beyond that, there isn't much of a guarantee that you won't receive a blocking ast at any point in time.  If it's important for some specific sequence of operations to know a bast won't be delivered, then we could try to spell out some guarantees for that sequence.

The dlm used to try to weed out blocking asts that appeared to be unnecessary, but it's a bit of second-guessing, and was removed here:
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=03339696314fffb95dafb349b84243358e945ce6

Similar to what I mentioned above, we could go back and try to identify some specific sequence of operations that produce an unnecessary or redundant bast and then try to add some code to the dlm to skip those particular basts.

Comment 22 Steve Whitehouse 2010-02-13 19:55:39 UTC
I'm not sure whether the bast was before or after the unlock cast either. Unfortunately my instrumentation doesn't tell me. All I know is that it was after the sending of the NL -> unlock request. We can probably alter the instrumentation though (just add an extra flag) and find that out.

I'm 100% certain that we've had the cast from the demote to NL long before the bast arrives, though. I'd assume that holding an NL lock should be the same as being unlocked wrt not receiving basts.

The GFS2 code assumes that if we've received a cast for a higher to lower mode transition, then we will not receive a bast requesting a demote from the higher mode after that point. I think that has to be true - we don't have enough info in the bast message when it reaches GFS2 to work out the sequencing wrt casts, so we have to rely on the DLM to send them in the correct sequence.

The patch you point to is trying to fix the right problem but, I think, in the wrong way. The lock master should be ensuring the correct sequencing of the messages. Once it has generated them, the only requirement is that the relative ordering of the casts and basts be preserved through until delivery to the fs.

Comment 23 Steve Whitehouse 2010-02-15 15:20:20 UTC
I've adjusted the test patch today, and rerun it. At least in this "one off" case, it looks like the "unlock callback" arrived before the completion reply to the unlock request. It takes a fairly long time for me to reproduce - this run went for about half an hour before hitting it. I'm not sure it's worth doing further runs at the moment, though, as we have pretty much all the info we can get from it.

Comment 24 David Teigland 2010-02-17 00:06:41 UTC
I'm working on a dlm debug patch that will hopefully report whenever an "unnecessary" bast is being sent, so we can figure out exactly what's happening here, and correlate it with this gfs oops, before trying to fix anything.

Comment 25 David Teigland 2010-02-17 23:06:15 UTC
Created attachment 394844 [details]
patch to test

Steve, please do the same test with the same debugging using this dlm patch and see if it changes anything.

Include <dlm log_debug="1"/> in cluster.conf so we can see the new debug messages.

Comment 26 Steve Whitehouse 2010-02-18 09:58:14 UTC
This isn't going to work I suspect. If we have the following sequence:

bast
cast
bast

Then the first bast will be set as the "first" item, and if the second bast arrives before the cast is delivered, it will be reordered with respect to the cast
(i.e. delivered in the order bast, bast, cast).

I still believe that the correct solution is to prevent the generation of incorrect basts at the lock master and to prevent reordering of the basts vs casts during transmission from the lock master to the client.

Comment 27 David Teigland 2010-02-18 20:15:16 UTC
Hm, ok, when queuing a bast I'll add a check to see if there's already one queued and complain loudly if there is.  (I think that's unlikely to be happening in your test, though.)

As for the second point, I didn't consider that we might be generating incorrect basts in the first place.  It's a good thing to check, and I'll try to come up with something to check for it happening.  It's definitely something to fix if that's the case.

Comment 28 Steve Whitehouse 2010-02-19 12:23:20 UTC
Yes, I agree that it's unlikely to be happening in this particular test, but it might happen in other circumstances. I've not yet tried running the flock test, with which we appear to be able to hit the same issue.

The way I look at it, is that "casts" should act as barriers so far as "basts" are concerned. In other words, if there is a change from lock mode A to lock mode B due to some lock request, then I only expect to see basts related to lock mode A before the cast arrives with the successful A -> B transition reply. After that cast has arrived, I only expect to see basts related to lock mode B.

It is possible (although, now I've spent more time looking at the code, unlikely) that the basts are being generated incorrectly. It is also possible that the basts are being reordered with respect to the casts on their journey from the lock master to the client. I'm open to other suggestions, but those seem to be the most likely explanations given the symptoms.

The patch I sent to you which removed dlm_astd was partly a test to see if that was causing reordering. That wasn't the case, but I still feel that is a useful thing to do anyway since it does speed things up a fair bit.

I've not yet looked into the sending side of the queuing system, nor all of the receive path.

If we could compare the ordering of casts and basts being generated at the master node with the results as they are delivered at the client node, then we could at least figure out which of the two possible causes is the culprit. It might be possible to cut down on the data which is generated if we could always reproduce the issue with a specific lock. I think that is possible since, in the tests I've run, it's always thrown up the same lock (2/22) and, judging from the inode number, it's one of the system files that is involved.

It might also be a worthwhile exercise to create some dlm tracepoints as that is a more scalable and easier interface to work with than printk when there are large amounts of data being generated.

Comment 31 David Teigland 2010-02-19 16:22:56 UTC
So you didn't try your test with the patch in comment 25?

Does "kernel BUG at fs/gfs2/glock.c:173!" always mean specifically that an "unnecessary" bast has been delivered?

Comment 32 Steve Whitehouse 2010-02-19 16:44:32 UTC
No, I've not tried that yet.

The message kind of means that... the reason the message appears is that there is a bug trap which triggers if gfs2_glock_hold() is called on a glock with a ref count of 0. That is invalid, since you can only take a hold on a glock to which there is already a reference.

The glock ref count only hits zero after the glock has been demoted to NL, since it's the "cast" from the demotion to NL which drops the glock ref count that we hold all the time we have a dlm lock mode greater than NL. In fact we also have a second ref count, held by the glock workqueue, and it is only when this is also dropped, at the end of processing the cast, that the final ref count will be removed.

That then triggers the NL -> unlocked dlm request. So the short answer is that this particular message will not occur in general for basts which are delivered out of order or incorrectly, but only in this one specific case where the glock is at the end of its lifetime.

If other basts were delivered out of order, the most likely event is that they would either be ignored (if the glock was already in a compatible mode), or acted upon resulting in a demotion. So it would be a performance hit and not result in a kernel bug like we've seen here.

The more interesting bit of that message is actually the line above the kernel bug. The code which hits this is a GLOCK_BUG(), which is a wrapper that prints out the glock state before calling BUG(). The flags in that line give us more information, such as the bast not arriving until after we've sent a NL -> unlocked request. With my later tests, I was also able to discover that the cast for the NL -> unlocked request had not arrived before the bast.

We also know that the inode in question is the root dir, so that tells us that there are no "funny" locking modes involved (such as CW) but just EX, PR and NL.

Also my debugging revealed that the bast was requesting a demote to "compatible with EX".

Comment 33 David Teigland 2010-02-19 17:07:17 UTC
OK, so you hold NL, then get a bast requesting EX, and I think it's safe to say the dlm should not do that.

It could easily be explained by dlm_astd reversing the order of bast and cast, which was trivial to fix in the patch above.  So I suspect this may be fixed by the patch, we just need to run a test that reliably produced that gfs2 BUG and see if it still happens.

Comment 34 David Teigland 2010-02-19 20:27:33 UTC
Created attachment 395174 [details]
updated patch

The previous patch wasn't quite correct in deciding when to skip unnecessary basts.  We need to compare the bast mode against the cast mode prior to it, which is not necessarily the current queued cast mode (which may follow the bast).

I realized something that may be a major factor in the odd sequencing of basts and casts.  When a lock mastered remotely is demoted, the cast is queued immediately on the local node after sending the convert; no reply is involved for down conversions.  This means that a cast could easily be queued for a down convert, followed by a bast that has been sent by the master before the master received and processed the down conversion.

Comment 35 David Teigland 2010-02-23 20:45:33 UTC
Created attachment 395815 [details]
patch to test

I found another dlm bug using ocfs2 where the dlm sends a bast for a newly granted conversion before sending the reply indicating success/completion for that conversion.  This would cause the bast for the new lock mode to be delivered before the cast.  This updated patch includes the fix for that as well as the previous bast/cast ordering issue, and now covers all the issues I'm aware of relating to dlm callbacks.

Comment 36 Steve Whitehouse 2010-05-21 16:06:23 UTC
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=89d799d008710e048ee14df4f4e5441e9f4d5d50
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7fe2b3190b8b299409f13cf3a6f85c2bd371f8bb

I think we can close this now.... the patches are upstream or do we want to wait until they appear in a fedora kernel?

Comment 38 Chuck Ebbert 2010-08-01 23:20:58 UTC
(In reply to comment #36)
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=89d799d008710e048ee14df4f4e5441e9f4d5d50
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7fe2b3190b8b299409f13cf3a6f85c2bd371f8bb
> 
> I think we can close this now.... the patches are upstream or do we want to
> wait until they appear in a fedora kernel?    

These fixes will never make it into Fedora 12 unless you send them for -stable or have someone commit them in F-12.

