Bug 447748

Summary: GFS2: lock_dlm is not always delivering callbacks in the right order
Product: Red Hat Enterprise Linux 5 Reporter: Steve Whitehouse <swhiteho>
Component: kernelAssignee: Don Zickus <dzickus>
Status: CLOSED ERRATA QA Contact: GFS Bugs <gfs-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 5.2CC: adas, edamato, lwang, nstraz, rpeterso
Target Milestone: rcKeywords: TestBlocker
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-01-20 19:51:37 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 432057    
Bug Blocks: 451816, 452884    
Attachments:
Description Flags
Initial patch
none
First of the upstream patches
none
Second of the upstream patches
none
Third of the upstream patches
none
Fourth upstream patch
none
Fifth, and final upstream patch (the big one)
none
RHEL5 patch - Try #1
none
The missing file
none
Patch posted to rhkernel-list on 07 Jul 2008 none

Comment 2 Robert Peterson 2008-05-30 15:57:42 UTC
Created attachment 307210 [details]
Initial patch

This is my RHEL port of the initial patch Steve did for this problem 
while we were debugging bug 432057.  It seemed to work well, but the
complete solution relies upon a change to DLM that Dave Teigland has
waiting in the wings (i.e. no dlm bugzilla assigned yet).

Comment 3 Steve Whitehouse 2008-06-03 07:38:49 UTC
*** Bug 449459 has been marked as a duplicate of this bug. ***

Comment 4 Steve Whitehouse 2008-06-13 16:28:12 UTC
So the plan is to unshare the lock modules between GFS1 and GFS2. I'm just about
to attach the upstream patches to do this to this bz. The final patch might need
a bit of tweeking before sending upstream. Bearing in mind the likely timing of
the next merge window, I suspect that these patches will be queued for the
following merge window at this stage.


Comment 5 Steve Whitehouse 2008-06-13 16:41:15 UTC
Created attachment 309243 [details]
First of the upstream patches

Comment 6 Steve Whitehouse 2008-06-13 16:41:48 UTC
Created attachment 309244 [details]
Second of the upstream patches

Comment 7 Steve Whitehouse 2008-06-13 16:42:15 UTC
Created attachment 309246 [details]
Third of the upstream patches

Comment 8 Steve Whitehouse 2008-06-13 16:42:43 UTC
Created attachment 309248 [details]
Fourth upstream patch

Comment 9 Steve Whitehouse 2008-06-13 16:58:59 UTC
Created attachment 309253 [details]
Fifth, and final upstream patch (the big one)

So this patch has most of the guts of the thing in it. I tested it so far as
mounting a lock_nolock fs, but it seems my drac card has doing strange things
and I don't know how many more times I'll be able to reboot it. This patch is
probably pretty close to the final one, but with the following issues:

 o Needs testing with a lock_dlm mount
 o Needs the /sysfs (particularly the lock_module subdir) testing for _both_
lock_nolock and lock_dlm
 o Needs recovery testing to ensure that nothing has broken there
 o Is currently missing code to make the fs wait on withdraw for the userland
tools to signal to it that its ok to leave the lock space. That needs putting
back and also checking whether its ok to do the same thing here for both
lock_nolock and lock_dlm. Its pretty minor as it only effects the withdraw
path, but does need fixing.
 o Needs build testing to ensure that the Kconfig is correct

So the other bug I discovered while looking at this was related to the
BLOCKLOCKS code. Thats been replaced so that we no longer queue locks in that
case. Instead we have a system of ignoring replies (easier since replies do not
have the ordering requirement which lock requests have) until some future time.
There is a new glock flag which indicated an ignored reply, so that now all our
existing debugging via /debugfs can see this extra state as well. The plan is
that when BLOCKLOCKS is cleared, we scan all the glocks and run the workqueue
for all replies received, but not yet acted upon.

Aside from that it should be fairly self-explanatory I think.... lots of stuff
is gone and we no longer have separate structures for glock's and lock_dlm's
locks. Also the struct gdlm_ls is merged into the sd_lockstruct. This saves a
lot of space with lock_dlm compared with the previous version of the code.

Another change relates to LVBs which used to have separate hold/unhold code.
That isn't needed provided we always ensure that we only demote to NL and never
to IV unless we are disposing of the glock. Since we already do this, all thats
needed is for the higher layers of code to hold a ref count to the glock, and
since it will have been locked at least once at that time, we can be sure that
the lock will be NL for all the time required.

There are probably many more optimisations that we can make in the future and
one in particular that I have in mind is to try and optimise handling of the
gl_strname. For now though, it should be ok as it is.

Comment 10 Robert Peterson 2008-06-16 22:21:04 UTC
Created attachment 309545 [details]
RHEL5 patch - Try #1

This is a RHEL5 port of all six upstream patches.  It is completely
untested so far, so it should be considered unsafe for now.  Also,
I haven't addressed any of Steve's issues yet from comment #9.
I'll likely have to work through some problems.

Also, this relies upon Dave Teigland's 18-part DLM patch that has not
made it to an "official" RHEL5 kernel yet.

Reassigning to myself while Steve is on holiday.

Comment 11 Robert Peterson 2008-06-16 23:10:45 UTC
We also need a userland patch to the cman init script so that it
doesn't try to load the lock_dlm module.  That probably needs to be
a separate bugzilla record.

[root@exxon-01 /boot]# service cman start
Starting cluster: 
   Loading modules... failed
FATAL: Error inserting lock_dlm
(/lib/modules/2.6.18-prep/kernel/fs/gfs2/locking/dlm/lock_dlm.ko): Unknown
symbol in module, or unknown parameter (see dmesg)
                                                           [FAILED]
lock_dlm: Unknown symbol gfs2_register_lockproto
lock_dlm: Unknown symbol gfs2_unregister_lockproto


Comment 12 Robert Peterson 2008-06-17 16:39:47 UTC
I opened up bug #451816 for the change to the cman init script.

So the big picture is this:  We now have a bunch of co-requisite patches
that all need to go in together to get the lock protocol separated out
between GFS and GFS2:

1. Dave's changes to dlm                - 450138 dlm plock code (posted)
2. Dave's gfs_controld changes          - 450169 (emailed to cluster-list)
3. Dave's gfs-kernel changes            - 450209 (emailed to cluster-list)
4. Steve & Bob's changes to gfs2        - 447748 (attached)
5. Dave's locking/* changes             - 447748 (emailed to cluster-list)
6. cman init script can't load lock_dlm - 451816

I'll probably prepare RHEL patches for each bz and attach them to their
respective bugzilla records.

I'm assuming #4 and #5 will be combined into a unified patch that will
be posted to rhkernel-list when we're ready.  The "right way" to fix
the locking protocol for UPSTREAM is to completely eliminate the
locking/dlm/* and locking/nolock/* from the kernel.  However, Kevin
indicated that it would be hard to get a new module (i.e. lock_dlm)
into RHEL for 5.3.  Too many packaging headaches.  Therefore, for
RHEL only, we'll keep it in the source tree for gfs2, but it will
reference gfs (1) entry points and structures.  Since these just pass
pointers and don't reference actual gfs data structures, this still
compiles, but it gives warning messages about undefined entry points
gfs_register_lockproto and gfs_unregister_lockproto.


Comment 13 Robert Peterson 2008-06-18 14:33:28 UTC
I think Steve may have forgotten to post a piece.  His set of patches
removes a great deal of code regarding lock_dlm, but nowhere does he
put it back.  For example, I'd expect to see some translation between
the DLM and the LM_ST modes referenced throughout GFS2's locking code.
I looked for the translation between LM_ST_EXCLUSIVE and DLM_LOCK_EX,
and there is none.  So there is clearly some big missing piece.
I suspect that Steve moved locking/lock_dlm/lock.c (corrected ala the
first patch I posted to this bugzilla record) to gfs2's main directory
and fixed it up as appropriate, but perhaps he forgot to do git-add on
the new part and therefore was ignored in his patch generation.

I'll see if I can pull this together.  Hopefully his hasn't changed
much from the original.


Comment 14 Robert Peterson 2008-06-18 21:24:26 UTC
I did a bunch of research and some porting work here, but this is
turning out to be a very big effort.  I'm beginning to think I'm wasting
my time on this, redoing work Steve probably already did, and then we'll
have a version that isn't as good as Steve's.  My version isn't likely
to be done before Steve's return anyway.


Comment 15 Steve Whitehouse 2008-06-24 10:46:52 UTC
I think that judging by comment #12, something has not gone to plan with the
port to RHEL. I had not intended that we should need to change either gfs1 or
the scripts. I'd expected that we'd leave the lock harness stuff in gfs2 in
RHEL, but that only gfs1 would use it. That way we minimise changes. I'm not
against the plan to move it into gfs1, but I don't think its actually a
requirement for this bz.

Also, wrt comment #13, are you missing the new lock_dlm.c file? It should be
obvious if the Makefile points to a file which does not exist...


Comment 16 Robert Peterson 2008-06-24 12:24:10 UTC
I never got the lock_dlm.c file.  My latest RHEL5 port has lock_dlm
still in place under GFS2, but it has Dave's patch that changes the
gfs2_* entry points to gfs_* entry points so they will only be used
for GFS1.


Comment 17 Steve Whitehouse 2008-06-24 12:34:24 UTC
Created attachment 310134 [details]
The missing file

The missing lock_dlm.c file. Sorry about that and as you say it looks like I
forgot to git-add it.

Comment 18 Robert Peterson 2008-06-25 16:16:45 UTC
The decision was made to split this effort into two stages:
Stage 1, which will use this bugzilla, will just use the single-
threaded lock_dlm we did for bug #432057.  That version was not
shipped to minimize impact to our beta customer.

I cloned this bug record to bug #452884 for Stage 2, with which we
can implement the upstream patches post below, in RHEL.


Comment 19 Robert Peterson 2008-06-26 17:55:24 UTC
I went back to the single-threaded version of lock_dlm we had before,
and I started testing with that.  Unfortunately, the dd_io test failed
without much information.  Talking to Dean and Nate, it sounded like
the same problem as this comment:
https://bugzilla.redhat.com/show_bug.cgi?id=428751#c8
I analyzed my latest code and what we did to fix the SIGBUS.  What I
found was a missing line of code from ops_file.c that we had in the
428751 incarnation of this fix.  Steve's subsequent comment about it
being a one-liner was correct, but he didn't mention what it was.
So just for the record, here is the line got added to fix the SIGBUS
problem, that subsequently got dropped out of a later patch:

--- ops_file.c.~1~      2008-06-13 14:04:43.000000000 -0500
+++ ops_file.c  2008-06-26 11:52:35.000000000 -0500
@@ -400,6 +400,6 @@ static int gfs2_page_mkwrite(struct vm_a
 
        if (page->index > last_index)
                goto out_unlock_page;
+       ret = 0;
        if (!PageUptodate(page) || page->mapping != ip->i_inode.i_mapping)
                goto out_unlock_page;
        if (gfs2_is_stuffed(ip)) {

I added that line of code back in, but unfortunately, the dd_io test
still failed.  At least this time the symptom changed.  Now I get a
"true" cluster coherency issue:

*** CHECKSUM ERROR mmap1 ***
Expected checksum of 0x41920efd, got 0xc78257d3
*** DATA COMPARISON ERROR mmap1 ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 371236864
    1st 32 expected bytes:  write*W:2846:camel:mmwrite*W:284
    1st 32 actual bytes:    ................................


----- xior ----
magic: 0xfeed10
type: 4 (read)
path: mmap1
syscall: read
oflags: 2 (O_RDWR)
offset: 370422637
count: 6930968
pattern: W:2846:camel:mmwrite*
chksum: 0x41920efd
(parent) pid 2865 exited non-zero

So now I have to hunt that problem down.  At one point, we had the
cluster coherency tests running successfully.  So we have to analyze
what happened to the code since the test was passing.


Comment 20 Robert Peterson 2008-06-26 18:49:15 UTC
I asked Abhi to build a version of the gfs2-kmod that included
everything but the page_mkwrite patch for bug 315191.  That build
runs through dd_io perfectly.  So apparently something in the
page_mkwrite patch broke coherency.  That narrows it down considerably.


Comment 21 Robert Peterson 2008-07-07 13:35:40 UTC
After removing the page_mkwrite patch, I still had some problems with
the coherency and dd_io tests.  These were all resolved by the patch
associated with this comment:

https://bugzilla.redhat.com/show_bug.cgi?id=315191#c34

For this bugzilla record, I think we should ship the following:
1. The "initial patch" (i.e. the first attachment to this bugzilla).
2. The fix shown in comment #19
3. The attachment to 315191 comment 34.

Then bug #452884 may be used for "part 2", the design changes Steve
did for the glock code.


Comment 22 Robert Peterson 2008-07-07 14:28:13 UTC
On second thought, that won't work.  The fix shown in comment #19 is
only to the page_mkwrite code, so it doesn't apply to this fix.
So that one is best done with bug 315191.  So perhaps we should ship
just 1 and 3.  I will discuss this with Steve and if he approves, I'll
post it to rhkernel-list.


Comment 23 Steve Whitehouse 2008-07-07 14:33:27 UTC
sounds good to me.


Comment 24 Robert Peterson 2008-07-07 15:05:14 UTC
Patch for #1 and #3 was posted to rhkernel-list.  Reassigning to
Don Zickus and changing status to POST.



Comment 25 Robert Peterson 2008-07-07 15:06:47 UTC
Created attachment 311171 [details]
Patch posted to rhkernel-list on 07 Jul 2008

Here is the replacement for the "initial patch".

Comment 26 Don Zickus 2008-07-18 20:07:36 UTC
in kernel-2.6.18-98.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 30 errata-xmlrpc 2009-01-20 19:51:37 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html