Bug 228540

Summary: Need to add gfs2_tool lockdump support to gfs2
Product: Red Hat Enterprise Linux 5 Reporter: Robert Peterson <rpeterso>
Component: kernelAssignee: Robert Peterson <rpeterso>
Status: CLOSED ERRATA QA Contact: GFS Bugs <gfs-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.1CC: dzickus, jbacik, kanderso, lwang, rkenna, swhiteho
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2007-0959 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-11-07 19:40:17 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 224480, 235349    
Bug Blocks: 221300    
Attachments:
Description Flags
First go at a fix
none
Second go at a fix
none
third go at a fix
none
Patch for gfs2_tool to use the new kernel feature
none
Additional patch to use pid number rather than process pointer
none
The kallsyms patch for RHEL5 2.6.18-8.EL5 kernel
none
Patch to glock.c to use new sprint_symbol
none
Master patch against 2.6.18-15.el5 none

Description Robert Peterson 2007-02-13 17:40:03 UTC
Description of problem:
gfs2_tool lockdump command is not implemented by the gfs2 kernel
code, nor by the userland tool.  This would be very useful for 
debugging glock issues as it is for gfs1 today.

Version-Release number of selected component (if applicable):
RHEL5.1

How reproducible:
Always

Steps to Reproduce:
gfs_tool lockdump /mnt/gfs1
  
Actual results:
gfs2_tool: lockdump not implemented

Expected results:
Should dump the gfs glocks as gfs1 does.

Additional info:
This function was taken out of gfs2 and gfs2_tool because in GFS1,
it's primarily done with ioctls, which are now considered bad form
in eyes of the upstream community.  We should add the function back
in and use the sysfs or debugfs to do this.

Comment 1 Robert Peterson 2007-02-14 23:17:36 UTC
Created attachment 148091 [details]
First go at a fix

This patch adds the functionality back using debugfs.  To get the
glock dump, do something like this:

mkdir /debug
mount -t debugfs none /debug
mount -tgfs2 /dev/trin_vg/trin_lv /mnt/gfs2
cat /debug/gfs2/bob_cluster2\:bobs_gfs

We can easily make gfs2_tool do a cat of the appropriate file to
emulate its previous behavior.

Steve Whitehouse: Please review the patch and see if you have any
suggestions or improvements.  I have three points of concern:

1. There are changes in the patch that aren't in the git tree I
   pulled.  This might be due to the fact that I pulled gfs2 from
   Linus' git tree earlier and ported it back to your git tree
   because I had problems with the qla2xxx tree in the Linus version.
   I need to know what to keep and what to throw out.
2. I introduced a somewhat ugly kludge around the problem of not having
   a vseq_printf function.  I used an intermediate vsprintf to a string.
   Ugly, I'll admit, but how will it be perceived by the upstream
   community?  I deemed this less ugly than duplicating code between the
   printk and seq_printf versions of the functions.
   Perhaps you can suggest an alternative.
3. The dump_holder function has a print_symbol function that I don't
   understand its purpose.  Perhaps you can let me know.  It doesn't
   seem to affect the output any.  My fear is that we might need to
   implement a special seq_printf version.

Comment 2 Robert Peterson 2007-02-14 23:19:16 UTC
Added Steve Whitehouse to the cc list.  Should have done that before
I added comment #3, which I wanted him to see.


Comment 3 Steve Whitehouse 2007-02-23 15:48:39 UTC
To try and address each point:

1. I'm not quite sure that I follow what you did, but possibly you pulled some
changes without doing a checkout -f to ensure that your copy of the source was
clean before you started. Either way it doesn't look to tricky to cut out the
unwanted bits by hand. So far as I can see, there are no overlaps in the patches.

2. Yes, its ugly :-) How do you know 256 bytes is (a) enough and (b) ok for the
stack?

3. print_symbol prints the value of a kernel symbol, it takes a printf style
argument with a single %s which will be replaced by the symbol. If the value
given doesn't match any entry in the kernel symbol table, its value alone is
printed.

I'm not yet completely convinced of the merits of this approach. The format of
the various fields of the glock and holder structures was really designed to be
human readable. Here the aim seems to be to pass it to the gfs2_tool, so I think
we could perhaps invent something a bit more succinct?

Perhaps one line per glock or holder with each glock starting with its number
and each holder starting with 'H' or something in the first field? It would
result in a fair amount of memory saved for large dumps, as well as being easier
to parse.



Comment 4 Robert Peterson 2007-02-26 16:33:53 UTC
1. I'll resync with the latest nwm tree today.
2. I'll rewrite gfs2_print_dbg so that it doesn't rely on the stack;
   good catch.
3. The current refs to print_symbol are all used for printing an inode
   pointer, as in gl->gl_ip or gh->gh_ip.  Can that really be a kernel
   symbol or is it alright to just print it as a decimal or hex value?

Here's the thing: I wanted to have a separate debugfs.c, but the code
need to reference symbols that are internal to glock.c, so I had to put
the code there.  In the current glock.c, there is already a bunch of
code to dump a human-readable version of the glocks (dump_glock).  It is
called from gfs2_dump_lockstate, which is triggered if a deadlock is 
detected in unmount.  Today it dumps human-readable glocks using
printk.  This is exactly the same format that gfs_tool lockdump uses
in GFS1.  The patch merely transforms the existing kernel code so that 
it can print either to the kernel messages (printk) or to a debugfs 
sequence file.  Plus there's some extra code needed either way for 
managing debugfs.  (I believe using debugfs is the way to go, rather than
trying to reintroduce ioctl calls into the kernel and push them upstream.)

I can restructure the whole thing so that it dumps the glock information
to less friendly output that can then be reinterpreted by gfs2_tool and
translated back into the same output, but that involves (1) Making a
second set of dump routines (leaving alone the unmount deadlock code 
described above) and creating a second (redundant) set of routines to
dump the glock data in a more succinct format. (2) Changing gfs2_tool to
interpret the succinct data and print it out in the same format that
gfs2 would, should a unmount deadlock occur.  Seems pointless when we
can have one function that serves both purposes.

Of course, a succinct format would be more practical if we were planning
to write a glock monitoring tool that would download the data several
times a second.


Comment 5 Steve Whitehouse 2007-02-26 16:53:01 UTC
Ok, that sounds like a reasonable plan. The symbols that print_symbool is
printing are not inode pointers, but instruction pointers. It shows the function
which actually locked the glock in question, so is rather useful. Given a kernel
and a hex address, its possible to match the two up, but its rather easier to do
automatically than manually after the fact.

If you can reliably open the System.map which matches the running kernel at the
time of the dump, then you should be able to do the matching in userspace, if
that helps.


Comment 6 Robert Peterson 2007-02-27 18:00:14 UTC
Created attachment 148880 [details]
Second go at a fix

I'm not sure if your "reasonable plan" was referring to the code that
doesn't have redundant functions, or the one that does.
At any rate, this attachment is my latest and greatest.  I synched it 
with the latest gfs2 nmw tree, so there shouldn't be anything extraneous.  

I agree that the print_symbol information is very useful when it
comes to glocks, and it would be nice to have it in the debugfs.  
Unfortunately, I only see three choices:

(1) Leave it a hex value in the debugfs version, which affects the
least amount of code, but it's also not very useful.

(2) Copy a great deal of code from kallsyms.c into glock.c, which
I don't like, and neither will the upstream community, or

(3) Change the kernel so that it exports the function in kallsyms 
needed to do the symbol lookup, which is what I did in this patch.
That may also be frowned upon by the upstream community.  However, 
that's the most functional and has minimal impact on the underlying
kernel, since it just adds "extern" to existing declarations and such.

I got around the stack issue by keeping the temporary string inside
the debugfs iterator.  However, having done that, I now see that the
print_symbol function in kallsyms.c does what I was doing before: 
it uses a temp string on the stack.  At any rate, since the variable 
is used on a per-line basis, I would think 256 bytes should be ample.  
I increased it to 1K just to be safe.

Waiting for some feedback on this before I try to post something like
it on rhkernel-list or lkml.

Comment 7 Robert Peterson 2007-02-27 18:13:14 UTC
*** Bug 221300 has been marked as a duplicate of this bug. ***

Comment 8 Robert Peterson 2007-03-01 17:33:49 UTC
Created attachment 149026 [details]
third go at a fix

I had some IRC discussions with Steve Whitehouse and Dave Teigland.
Together, we decided to separate the print_symbol changes from the
basic functionality.  Hopefully, this will make the whole process more
palatable to lkml.  This third version of the patch is the basic
functionality, without any changes to print_symbol.  If the code
is called from debugfs, the symbol will be printed as a hex number.
If the code is called from an unmount lockup, the symbol will be
printed with print_symbol.  Later, I'll post a kernel patch to lkml
to introduce a more versatile version of print_symbol, and then
incorporate it into glock.c.  This version also fixes some issues
raised by Steve concerning the kmalloc and a "} else {" statement.

Comment 9 Robert Peterson 2007-03-12 19:02:51 UTC
Progress report: Looks like patch 3 was integrated into Steve Whitehouse's
upstream git tree.  I've made great progress is getting a patch into the
upstream kernel that allows the exporting of a new kernel function,
sprint_symbol, in kernel/kallsyms.c, that does a kernel symbol lookup.  
I patched the gfs2 kernel so that it uses the new function so we get the 
symbol in the debugfs log instead of a hex number, as in comment #8.
The kallsyms patch was incorporated into Andrew Morton's latest (-mm2)
patch set after some revisions based on lkml feedback.

I tested 2.6.21-rc3-mm2 and the new function worked perfectly with gfs2.
I ran into two problems with mm2 that were unrelated:
(1) The qla2xxx device driver was hosed.  I got around that by using
the previous version of the driver, which works.  (2) GFS2 makes a call 
to filemap_nopage which is gone for some reason in mm2.  I got around 
that by patching around it.  

Having tested the mm2 patch, I asked Andrew Morton to pull the fix 
into Linus' upstream git tree.  As of this morning, it's not there 
and I haven't gotten a response from him.

Meanwhile, I made a patch to the userland gfs2_tool so that it
simulates old behavior by using the debugfs file system.  I imagine
this bugzilla record will be used for POSTing the RHEL5.1 kernel
changes and that the userland changes will require another bz.
I'll post the userland patch here anyway.


Comment 10 Robert Peterson 2007-03-12 19:08:27 UTC
Created attachment 149854 [details]
Patch for gfs2_tool to use the new kernel feature

This patch adds back the functionality of "gfs2_tool lockdump", provided
the kernel patch for debugfs is in place.  I'm not sure if it's really
necessary to do this, but if necessary, I'll probably need a
separate bugzilla record for the userland portion, and this bug record 
will be for the kernel patch to be POSTed for the RHEL5.1 kernel.

Comment 11 Robert Peterson 2007-03-26 17:09:10 UTC
Created attachment 150918 [details]
Additional patch to use pid number rather than process pointer

In Testing the previously posted and accepted patch 
I uncovered some gfs2 badness.	It turns out that the current
gfs2 code saves off a process pointer when glocks is taken
in both the glock and glock holder structures.	Those
structures will persist in memory long after the process has
ended; pointers to poisoned memory.

This problem isn't caused by the 228540 fix; the new capability
introduced by the fix just uncovered the problem.

I wrote this patch that avoids saving process pointers
and instead saves off the process pid.	Rather than
referencing the bad pointers, it now does process lookups.
There is special code that makes the output nicer for
printing holder information for processes that have ended.

This patch also adds a stub for the new "sprint_symbol"
function that exists in Andrew Morton's -mm patch set, but
won't go into the base kernel until 2.6.22, since it adds
functionality but doesn't fix a bug.

The patch was tested on system trin-10.
The patch is against 23 March 2007 nwm git tree.
It was pulled into the tree by Steve Whitehouse on 26 March 2007.

Comment 12 Robert Peterson 2007-03-26 18:13:44 UTC
Created attachment 150925 [details]
The kallsyms patch for RHEL5 2.6.18-8.EL5 kernel

This base kernel patch adds the ability for kernel symbols to be 
printed out to a buffer rather than just to dmesg.

This enables gfs2's debugfs new lockdump implementation to print the
kernel symbols for "initialized at" values of a glock (very important
to glock debugging).  Without it, the value will be printed as a hex
number, which is one step above useless.

This patch was reviewed and accepted upstream in lkml.	It now appears
in Andrew Morton's "mm" (i.e. 2.6.21-rcX-mmX) patch set for the latest
kernel.  However, since this is considered a kernel enhancement and
not a bug, I was told the fix won't go into the base upstream kernel 
until 2.6.22.

However, Steve and I both want this to go into the base RHEL5.1 kernel
for gfs2 glock debugging purposes.

Comment 13 Robert Peterson 2007-03-26 18:24:49 UTC
I discussed the gameplan for this patch with Steve Whitehouse this
morning, and this is the status and current plan as I understand it:

1. The "Third go at a fix" and "additional patch to use pid number"
   patches are in Steve's nwm git tree for 2.6.21 as of today.
2. Steve is going to try to request a bunch of patches be pulled into
   the RHEL5 kernel for 5.1 by Don Zickus which will include them both.
3. The additional change, that allows the kernel symbols to print
   properly has been attached as the "kallsyms patch".  Steve asked me
   to post it here so that Don Z can pull it all in one place rather
   than trying to keep track of multiples.
4. Once those patches are in place, the stub for glock.c may be
   patched to call the new sprint_symbol function, as described in the
   comments of glock.c, function gfs2_print_symbol.  This also should 
   go into the RHEL5.1 kernel otherwise the kallsyms patch will be 
   pointless.


Comment 14 Robert Peterson 2007-03-26 19:05:10 UTC
Created attachment 150936 [details]
Patch to glock.c to use new sprint_symbol

For the sake of completeness, this is the patch needed so that gfs2
can take advantage of the previous kallsyms patch, function sprint_symbol.
This corresponds to comment #13, point #4.  The patch is against the 
2.6.18-8.EL5 kernel.

Comment 15 Robert Peterson 2007-04-19 02:48:33 UTC
Created attachment 152978 [details]
Master patch against 2.6.18-15.el5

This patch is against the 2.6.18-15.el5 kernel.  It consolidates
the following fixes that all relate to gfs2_tool lockdump:

1. The original fix for this bug, 228540.
2. The fix to use pid number and do pid lookups so we don't reference
   poisoned memory.  See comment #11.
3. The kallsyms patch that was accepted by the upstream kernel for
   2.6.22, currently appearing in Andrew Morton's -mm patchset.
4. The fix for 236008, where lockdump was causing a kernel gpf.
5. The fix I sent to cluster-devel today regarding the fact that
   the debugfs files were not deleted upon failed mount attempts.
6. The implementation of Dave Teigland's suggestion to move the
   debugfs glock file into a separate directory (renamed) leaving
   room for future debugfs enhancements.
7. Reintroduce glock demote stuff (GLF_DEMOTE constant, 
   gl_demote_state and so on) that were apparently dropped off the
   previous patch.  The gfs2 module wouldn't compile without it.

Comment 16 Steve Whitehouse 2007-04-19 08:07:16 UTC
Can you leave out step 7 in your comment #15 above? This is introduced by a
later patch and the idea here was to try and keep the same patch ordering in
RHEL as in upstream so that we'd need only minimal changes to get the patches to
apply. You ought to be able to just leave out anything refering to the
GLF_DEMOTE flag since it shouldn't be defined or used until that later patch is
applied.

Otherwise it looks good I think.


Comment 17 Robert Peterson 2007-04-19 15:11:12 UTC
Revised patch (without #7 from comment #15) sent to rhkernel-list.
Changing status of the bz to POST and adding Don Zickus to the cc list.


Comment 18 Robert Peterson 2007-04-19 15:15:15 UTC
Changing the component to kernel and putting lwang on the cc list.
BTW, This is now in the upstream nmw git tree.  It was tested on
RHEL5 on trin-10 and found to be correct.


Comment 19 Steve Whitehouse 2007-04-20 10:48:11 UTC
*** Bug 231369 has been marked as a duplicate of this bug. ***

Comment 20 Robert Peterson 2007-04-27 13:02:14 UTC
*** Bug 236008 has been marked as a duplicate of this bug. ***

Comment 21 Don Zickus 2007-05-01 18:08:24 UTC
in 2.6.18-17.el5

Comment 24 errata-xmlrpc 2007-11-07 19:40:17 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html