458684 – GFS2: glock deadlock in page fault path

Bug 458684 - GFS2: glock deadlock in page fault path

Summary: GFS2: glock deadlock in page fault path

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.3
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	beta
Target Release:	---
Assignee:	Robert Peterson
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	455940 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2008-08-11 15:49 UTC by Nate Straz
Modified:	2009-01-20 20:19 UTC (History)
CC List:	8 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-01-20 20:19:21 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
glock dump and sysrq-t of tank-0[1-4] (270.96 KB, application/x-gzip) 2008-08-12 02:13 UTC, Nate Straz	no flags	Details
smoke locks tarball (8.65 KB, application/octet-stream) 2008-08-19 15:01 UTC, Abhijith Das	no flags	Details
Crude start of a hang analyzer tool (10.34 KB, text/plain) 2008-08-21 04:40 UTC, Robert Peterson	no flags	Details
Glock analysis file (4.48 KB, text/plain) 2008-08-21 19:54 UTC, Robert Peterson	no flags	Details
Glock analysis file (8.02 KB, text/plain) 2008-08-21 22:36 UTC, Robert Peterson	no flags	Details
Latest hang analyzer (13.01 KB, text/plain) 2008-08-21 22:42 UTC, Robert Peterson	no flags	Details
Latest incarnation of glock_history (15.34 KB, patch) 2008-08-21 22:45 UTC, Robert Peterson	no flags	Details \| Diff
Latest hang data (20.67 KB, text/plain) 2008-08-26 12:33 UTC, Robert Peterson	no flags	Details
Latest hang data (150.39 KB, text/plain) 2008-08-27 22:39 UTC, Robert Peterson	no flags	Details
Bob's rwsem patch (6.73 KB, patch) 2008-08-29 20:25 UTC, Robert Peterson	no flags	Details \| Diff
Latest hang analyzer (22.52 KB, text/plain) 2008-08-29 20:31 UTC, Robert Peterson	no flags	Details
Proposed patch (30.11 KB, patch) 2008-09-03 16:12 UTC, Robert Peterson	no flags	Details \| Diff
Short-read recreator (1.04 KB, text/plain) 2008-09-09 18:28 UTC, Robert Peterson	no flags	Details
Upstream patch for the short read problem (477 bytes, patch) 2008-09-11 19:51 UTC, Robert Peterson	no flags	Details \| Diff
Patch that was POSTed 11 Sep 2008 (30.21 KB, patch) 2008-09-12 14:20 UTC, Robert Peterson	no flags	Details \| Diff
Patch that was POSTed 15 Sep 2008 (30.92 KB, patch) 2008-09-16 02:32 UTC, Robert Peterson	no flags	Details \| Diff
Show Obsolete (6) View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2009:0225	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update	2009-01-20 16:06:24 UTC

Description Nate Straz 2008-08-11 15:49:02 UTC

Description of problem:

We added a new memory mapped read test case to d_iogen/d_doio which works like so:
1. tmpfd = mkstemp()
2. mptr = mmap(0, nbytes, PROT_READ, MAP_SHARED, fd, offset)
3. write(tmpfd, mptr, nbytes)

When this new method ran as part of our cluster coherency test we hit a deadlock.  Here's the analysis from Bob:

Looks like Nate might have found a gfs2 bug using your suggestions
for testing.  I didn't dig too deeply, but here is what I found:
The deadlock occurred on the morph cluster.  Apparently, morph-01
is the culprit.  Looks to me like a glock deadlock:

pid 13810 is holding inode 2/5b17fb2, waiting for rgrp   3/42fbed
pid 13809 is holding rgrp   3/42fbed, waiting for inode 2/5b17fb2

Here are those pids' call traces:

Aug  8 15:56:53 morph-01 kernel: d_doio        D 00006305  1832 13809  13808         13810       (NOTLB)
Aug  8 15:56:53 morph-01 kernel:        f365dc20 00000082 dadc4671 00006305 f656cd18 c300278c 00000001 00000009
Aug  8 15:56:53 morph-01 kernel:        ebeb1aa0 c06713c0 dadc6532 00006305 00001ec1 00000000 ebeb1bac c30128e0
Aug  8 15:56:53 morph-01 kernel:        00000000 c3002788 f365dc34 00000001 f365dc28 ffffffff 00000000 00000000
Aug  8 15:56:53 morph-01 kernel: Call Trace:
Aug  8 15:56:53 morph-01 kernel:  [<f8de0509>] just_schedule+0x5/0x8 [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<c060875b>] __wait_on_bit+0x33/0x58
Aug  8 15:56:53 morph-01 kernel:  [<f8de0504>] just_schedule+0x0/0x8 [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<f8de0504>] just_schedule+0x0/0x8 [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<c06087e2>] out_of_line_wait_on_bit+0x62/0x6a
Aug  8 15:56:53 morph-01 kernel:  [<c04360c8>] wake_bit_function+0x0/0x3c
Aug  8 15:56:53 morph-01 kernel:  [<f8de04fd>] gfs2_glock_wait+0x27/0x2e [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<f8dee02a>] gfs2_sharewrite_nopage+0x4a/0x28c [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<f8df306e>] gfs2_rgrp_bh_get+0x1cc/0x242 [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<f8de1a5c>] gfs2_glock_nq+0x217/0x23c [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<f8dee023>] gfs2_sharewrite_nopage+0x43/0x28c [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<f8de179e>] run_queue+0x9e/0x145 [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<c0460f74>] __handle_mm_fault+0x178/0x8e6
Aug  8 15:56:53 morph-01 kernel:  [<c04e2a24>] __next_cpu+0x12/0x21
Aug  8 15:56:53 morph-01 kernel:  [<f8de0615>] do_promote+0xd8/0xfb [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<c060a6d3>] do_page_fault+0x20a/0x4b8
Aug  8 15:56:53 morph-01 kernel:  [<c060a4c9>] do_page_fault+0x0/0x4b8
Aug  8 15:56:53 morph-01 kernel:  [<c0405a89>] error_code+0x39/0x40
Aug  8 15:56:53 morph-01 kernel:  [<f8df007b>] gfs2_quota_init+0x8b/0x3a4 [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<c0456bc8>] generic_file_buffered_write+0x12c/0x5b7
Aug  8 15:56:53 morph-01 kernel:  [<c042a1b5>] current_fs_time+0x4a/0x55
Aug  8 15:56:53 morph-01 kernel:  [<c04574f9>] __generic_file_aio_write_nolock+0x4a6/0x52a
Aug  8 15:56:53 morph-01 kernel:  [<c0457757>] generic_file_write+0x0/0x94
Aug  8 15:56:53 morph-01 kernel:  [<c04576ad>] __generic_file_write_nolock+0x86/0x9a
Aug  8 15:56:53 morph-01 kernel:  [<c043609b>] autoremove_wake_function+0x0/0x2d
Aug  8 15:56:53 morph-01 kernel:  [<c045e240>] vma_prio_tree_insert+0x17/0x2a
Aug  8 15:56:53 morph-01 kernel:  [<c0463c4c>] vma_link+0x9c/0xb8
Aug  8 15:56:53 morph-01 kernel:  [<c0608836>] mutex_lock+0xb/0x19
Aug  8 15:56:53 morph-01 kernel:  [<c0457791>] generic_file_write+0x3a/0x94
Aug  8 15:56:53 morph-01 kernel:  [<c0457757>] generic_file_write+0x0/0x94
Aug  8 15:56:53 morph-01 kernel:  [<c04718c7>] vfs_write+0xa1/0x143
Aug  8 15:56:53 morph-01 kernel:  [<c0471eb9>] sys_write+0x3c/0x63
Aug  8 15:56:53 morph-01 kernel:  [<c0404f17>] syscall_call+0x7/0xb
Aug  8 15:56:53 morph-01 kernel:  =======================

Aug  8 15:56:53 morph-01 kernel: d_doio        D 00006305  1828 13810  13808               13809 (NOTLB)
Aug  8 15:56:53 morph-01 kernel:        f656ccb0 00000082 daec17e3 00006305 f8d2b97b fffffff5 f8d2e3d3 00000009
Aug  8 15:56:53 morph-01 kernel:        f666c000 c310c550 daec284d 00006305 0000106a 00000001 f666c10c c30196c4
Aug  8 15:56:53 morph-01 kernel:        de5d2570 00000000 cfe2ca5c 00000018 de5d2550 ffffffff 00000000 00000000
Aug  8 15:56:53 morph-01 kernel: Call Trace:
Aug  8 15:56:53 morph-01 kernel:  [<f8d2b97b>] __put_lkb+0x28/0xd5 [dlm]
Aug  8 15:56:53 morph-01 kernel:  [<f8d2e3d3>] dlm_put_lockspace+0xd/0x1a [dlm]
Aug  8 15:56:53 morph-01 kernel:  [<f8de0509>] just_schedule+0x5/0x8 [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<c060875b>] __wait_on_bit+0x33/0x58
Aug  8 15:56:53 morph-01 kernel:  [<f8de0504>] just_schedule+0x0/0x8 [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<f8de0504>] just_schedule+0x0/0x8 [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<c06087e2>] out_of_line_wait_on_bit+0x62/0x6a
Aug  8 15:56:53 morph-01 kernel:  [<c04360c8>] wake_bit_function+0x0/0x3c
Aug  8 15:56:53 morph-01 kernel:  [<f8de04fd>] gfs2_glock_wait+0x27/0x2e [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<f8df2418>] gfs2_inplace_reserve_i+0x25f/0x61a [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<c04360c8>] wake_bit_function+0x0/0x3c
Aug  8 15:56:53 morph-01 kernel:  [<f8de04fd>] gfs2_glock_wait+0x27/0x2e [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<f8de9125>] gfs2_prepare_write+0x172/0x325 [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<c0456cd7>] generic_file_buffered_write+0x23b/0x5b7
Aug  8 15:56:53 morph-01 kernel:  [<c042a1b5>] current_fs_time+0x4a/0x55
Aug  8 15:56:53 morph-01 kernel:  [<c04574f9>] __generic_file_aio_write_nolock+0x4a6/0x52a
Aug  8 15:56:53 morph-01 kernel:  [<c0457757>] generic_file_write+0x0/0x94
Aug  8 15:56:53 morph-01 kernel:  [<c04576ad>] __generic_file_write_nolock+0x86/0x9a
Aug  8 15:56:53 morph-01 kernel:  [<c043609b>] autoremove_wake_function+0x0/0x2d
Aug  8 15:56:53 morph-01 kernel:  [<c0471d29>] remote_llseek+0xb1/0xbb
Aug  8 15:56:53 morph-01 kernel:  [<f8dea806>] gfs2_llseek+0x81/0x93 [gfs2]
Aug  8 15:56:53 morph-01 kernel:  [<c0608836>] mutex_lock+0xb/0x19
Aug  8 15:56:53 morph-01 kernel:  [<c0457791>] generic_file_write+0x3a/0x94
Aug  8 15:56:53 morph-01 kernel:  [<c0457757>] generic_file_write+0x0/0x94
Aug  8 15:56:53 morph-01 kernel:  [<c04718c7>] vfs_write+0xa1/0x143
Aug  8 15:56:53 morph-01 kernel:  [<c0471eb9>] sys_write+0x3c/0x63
Aug  8 15:56:53 morph-01 kernel:  [<c0404f17>] syscall_call+0x7/0xb
Aug  8 15:56:53 morph-01 kernel:  =======================


Version-Release number of selected component (if applicable):
kernel-2.6.18-101.el5

How reproducible:
unknown

Steps to Reproduce:
A.
 1. run coherency from the latest sts-rhel5.3 package
B.
 1. d_iogen -s write -v mmindirect -F 1g:testfile -I 12345
 2. d_doio -I 12345

Comment 1 Nate Straz 2008-08-12 02:13:50 UTC

Created attachment 314028 [details]
glock dump and sysrq-t of tank-0[1-4]

I was able to reproduce this on another cluster.  This time I ran with the latest kernel, -103.  Here are the glock dumps and stack traces for development to figure out what is deadlocked.

Comment 3 Abhijith Das 2008-08-19 15:01:21 UTC

Created attachment 314549 [details]
smoke locks tarball

I'm seeing a hang running the coherency test on the smoke cluster as well. It's in the test case sync-writev-mmindirect.

The kernel I'm running is the rhel -104 kernel with lockdep enabled and Steve's lockdep patch posted to cluster-devel.

Comment 4 Steve Whitehouse 2008-08-20 09:43:39 UTC

RE: comment #3

This isn't the same as the original problem reported under this bz. The lock that the other nodes are waiting for is held by winston:

G:  s:EX n:2/2053c f:lDdpyi t:UN d:UN/33554395000 l:0 a:0 r:7
 H: s:EX f:cW e:0 p:4654 [gfs2_quotad] gfs2_statfs_sync+0x37/0x1f0 [gfs2]
 I: n:22/132412 t:8 f:0x00000010

So it looks like winston is trying to drop the EX lock, but has got stuck somewhere while trying to drop the lock. Either way this is nothing like the original deadlock at all.

Comment 5 Robert Peterson 2008-08-21 04:40:24 UTC

Created attachment 314680 [details]
Crude start of a hang analyzer tool

This is a crude gfs2 hang analyzer tool I whipped up.  I haven't really
taken any time to make it pretty or work the bugs out, but it's a
rough beginning.  I call it gfs2_hangalyzer.  The command line is
something like this (-v gives verbose output):

[bob@ganesha /home.f8/bob/tools]$ ./gfs2_hangalyzer -q -v -n roth-03
exec: /usr/bin/qarsh qarsh -l root roth-03 "/bin/grep 'clusternode name' /etc/cluster/cluster.conf"
1:nodeid 1: roth-01
2:nodeid 3: roth-03
node_count = 2
exec: /usr/bin/qarsh qarsh -l root roth-01 "/bin/grep debugfs /proc/mounts"
exec: /usr/bin/qarsh qarsh -l root roth-03 "/bin/grep debugfs /proc/mounts"
exec: /usr/bin/qarsh qarsh -l root roth-01 "/bin/ls /sys/kernel/debug/gfs2/"
exec: /usr/bin/qarsh qarsh -l root roth-03 "/bin/ls /sys/kernel/debug/gfs2/"
fetching glocks for roth-01: bobs_roth:test_gfs
exec: /usr/bin/qarsh qarsh -l root roth-01 "/bin/cat /sys/kernel/debug/gfs2/bobs_roth:test_gfs/glocks"
fetching glocks for roth-03: bobs_roth:test_gfs
exec: /usr/bin/qarsh qarsh -l root roth-03 "/bin/cat /sys/kernel/debug/gfs2/bobs_roth:test_gfs/glocks"
Glock waited on: 
G:  s:EX n:2/1ce62d f:lDpr t:UN d:UN/231000 l:0 a:0 r:7
 H: s:EX f:W e:0 p:6514 [d_doio] gfs2_sharewrite_nopage+0x5b/0x2bf [gfs2]
Dumping all occurrences of glock 2/1ce62d
roth-01: -------------------------
G:  s:EX n:2/1ce62d f:lDpr t:UN d:UN/231000 l:0 a:0 r:7
 H: s:EX f:W e:0 p:6514 [d_doio] gfs2_sharewrite_nopage+0x5b/0x2bf [gfs2]
 H: s:SH f:aW e:0 p:6515 [d_doio] gfs2_mmap+0x2e/0x83 [gfs2]
 I: n:40/1893933 t:8 f:0x00000011
roth-03: -------------------------
G:  s:UN n:2/1ce62d f:l t:EX d:EX/0 l:0 a:0 r:5
 H: s:EX f:W e:0 p:7436 [d_doio] gfs2_sharewrite_nopage+0x5b/0x2bf [gfs2]
 H: s:SH f:aW e:0 p:7437 [d_doio] gfs2_mmap+0x2e/0x83 [gfs2]

The tool uses qarsh (-q) or ssh (assuming ssh keys set up) to any
node in your cluster.  From there, it analyzes your cluster.conf to
figure out all the nodes.  Then it goes out to all the nodes and
figures out where debugfs is mounted, then dumps the glocks to a
temp file for all gfs2 mount points it finds.  The glock dump file
is then analyzed for anything waiting on the Holders queue, then
it dumps the glock values for that glock across all nodes.

It's still kind of primitive, but it's a place to start.  Hopefully
it can be improved upon as we get time.  I whipped this up in a big
hurry, so there are bound to be bugs.

I've noticed that the cat command on the glock file sometimes hangs
and I haven't analyzed why yet.

Enjoy.  --Bob

Comment 6 Robert Peterson 2008-08-21 19:54:40 UTC

Created attachment 314745 [details]
Glock analysis file

I resurrected the "glock history" patch and recreated the coherency
hang problem on my roth cluster.  Then I ran my latest and greatest
glock analysis tool.  This attachment contains the output.

In this case, the glock that's hung happens to be for the root inode.
That sounds similar to the problem Abhi is looking at with the bind
mount hang, but this kernel does not contain the bind mount patch.
Rather, this is kernel-2.6.18-104.gfs2abhi.001 plus the glock history
patch.  I believe the output clearly shows how the glock got into this
state.

The call to "handle_callbk2" means that handle_callback was called
from the bottom of function blocking_cb.  It appears that the
GLF_DEMOTE bit got set on because delay was == 0, so the minimum hold
time was already met.

I need to study this further.

Comment 7 Robert Peterson 2008-08-21 22:36:23 UTC

Created attachment 314760 [details]
Glock analysis file

This analysis dump was run with a revised glock history patch that
shows more details about the problem.  In this case, do_xmote was
called from the bottom of run_queue.  Next, do_xmote decided to
call do_error because:

	if ((target == LM_ST_UNLOCKED || target == LM_ST_DEFERRED) &&
		glops->go_inval) { ...
	}

And do_error is likely what dequeued the holder from the glock, thus
leaving it in an invalid state.

Comment 8 Robert Peterson 2008-08-21 22:42:28 UTC

Created attachment 314761 [details]
Latest hang analyzer

This version of the hang analyzer is a bit more sophisticated.
I would still like to make it chase down glock dependencies and
give a report of them.  Something like this (ref: the previous report):

roth-01: pid 3544 is waiting for glock 3/5ba0b76, which is held by pid 3545
roth-01: pid 3545 is waiting for glock 2/5ba2344, nobody's holding it!

It should be a relatively easy thing to write.  Maybe tomorrow.

Comment 9 Robert Peterson 2008-08-21 22:45:40 UTC

Created attachment 314762 [details]
Latest incarnation of glock_history

Here is the latest incarnation of my "glock history" patch.
This is the one that produced the glock output that made its way into
the previously attached glock analysis file.

Comment 10 Steve Whitehouse 2008-08-22 09:52:21 UTC

I don't see what is invalid about the state of the glock. It looks just like the previous reports where we are stuck on the page lock, for some reason.

Comment 11 Robert Peterson 2008-08-26 12:33:15 UTC

Created attachment 314981 [details]
Latest hang data

This file contains some information from output of my latest hang
analyzer tool, plus glock history.  It also contains some of my notes
and the call traces for the two processes that are waiting for each
other.  This version of glock history actually remembers and prints
the all gfs2_holder structs on the glock's holders queue (up to
three), so there is a reasonable holder history as well.

The latest hang analysis tool also dumps the dlm's view of the same
glocks, including a human translation of the State.  For example,
"grnt" is printed for an lkb that's on the granted queue of dlm.
It also backtracks any waiting holders it finds.  In this case, it
says that:

roth-01, pid 5989 is waiting for glock 2/5c7989d, which is held by pid 5988
(my note: the file /buffered-write-mmindirect) but
roth-01, pid 5988 is waiting for glock 3/5c70b69, which is held by pid 5989
(my note: an rgrp)

There are a few things I don't understand about this data, especially
regarding the call trace for 5989, which I believe is the culprit.

Comment 12 Robert Peterson 2008-08-27 22:39:01 UTC

Created attachment 315157 [details]
Latest hang data

Here are my latest findings on this hang problem:

I've established through instrumentation that we're calling
gfs2_sharewrite_nopage with the sd_log_flush_lock rwsem held,
and I've also proven that the rwsem was previously taken due
to a call to gfs2_write_lock_start, of course from function
gfs2_prepare_write.

So as far as I can tell, the calling sequence looks something
like this:

generic_file_buffered_write
   prepare_write
      gfs2_write_lock_start
         unlock page
         takes an EX lock on glock ip->i_gl
         gfs2_trans_begin
            gfs2_log_reserve
               down_read(&sdp->sd_log_flush_lock);
      return AOP_TRUNCATED_PAGE
   generic_file_buffered_write loops back around due to the error
   ...
   filemap_copy_from_user->copy_from_user
      do_page_fault
         gfs2_sharewrite_nopage
            takes an EX lock on glock ip->i_gl

So at the point of failure, the "correct" process has locked the
ip->i_gl glock and is waiting for the rwsem due to a log flush.
The "bad" process has locked the rwsem and is waiting for the
ip->i_gl glock.

What I haven't quite worked out yet is why the "bad" process
doesn't still hold the i_gl glock exclusively when it gets around
to calling nopage.  We need to figure out how the transaction can
be left open while the i_gl becomes unlocked.


Excerpt from the call trace of the "bad" process:
 [<ffffffff884d25a5>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff884e3aaa>] :gfs2:gfs2_sharewrite_nopage+0x60/0x123
 [<ffffffff884e3aa2>] :gfs2:gfs2_sharewrite_nopage+0x58/0x123
 [<ffffffff8000884b>] __handle_mm_fault+0x1f8/0xe5c
 [<ffffffff8002d253>] wake_up_bit+0x11/0x22
 [<ffffffff8006686f>] do_page_fault+0x4cb/0x830
 [<ffffffff884d5f6c>] :gfs2:gfs2_glock_nq+0x248/0x273
 [<ffffffff8002d253>] wake_up_bit+0x11/0x22
 [<ffffffff800625df>] __sched_text_start+0x17f/0xaeb
 [<ffffffff884d27e4>] :gfs2:do_promote+0x1f7/0x26f
 [<ffffffff8005dde9>] error_exit+0x0/0x84
 [<ffffffff8000fb02>] generic_file_buffered_write+0x14d/0x6d3

Also, I compared the filemap.c in RHEL to the upstream, and discovered that
Nick Piggin added a new call to pagefault_disable and pagefault_enable
to protect the call to filemap_copy_from_user.  This is documented
here:

http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=08291429cfa6258c4cd95d8833beb40f828b194e

Comment 13 Robert Peterson 2008-08-29 13:10:55 UTC

Well, it's not the Nick Piggin thing I mentioned.  Josef submitted a
RHEL5 kernel patch to implement Nick's change, and I am still able
to recreate the problem.  So it just must be lock ordering.  I still
believe, however, that I'm dealing with two or more separate hangs
because I'm seeing two different symptoms.  In the first hang, the
gfs2_log_flush_lock rwsem is being waited on.  In the second hang,
it doesn't seem to be, unless there is some kind of memory corruption.
I added instrumentation to catch any corruption involving the rwsem.
I'll find out more when that particular scenario hits again.  In the
meantime, the first hang seems more common, so I'll focus on it.

Comment 14 Robert Peterson 2008-08-29 20:25:40 UTC

Created attachment 315399 [details]
Bob's rwsem patch

This patch is for the first problem described in comment #13, where
the processes all get blocked behind the gfs2_log_flush_lock rwsem.

Although I don't have absolute proof, what seems to be happening is
this:  There are two processes contending for the same inode and page.
The vfs function generic_file_buffered_write calls prepare_write,
which starts a transaction, which calls gfs2_log_reserve, which takes
the gfs2_log_flush_lock rwsem.  The gfs2_prepare_write function
unlocks the page, locks the inode's i_gl glock, and returns
AOP_TRUNCATED_PAGE to vfs like it's supposed to.  However, something
happens to that page, similar to read cases we've seen in the past.
That causes generic_file_buffered_write to exit its loop early so
the transaction is left open and the rwsem is still held.  The process
may run for a while longer, but eventually everything backs up on the
rwsem and stops.

My approach to solving the problem with this patch is to make the
prepare_write path NOT call gfs2_log_reserve until it has been
called the second time, thus ensuring that the transaction will
complete.  In theory, this should also boost multi-writer performance
too, because processes shouldn't be blocked on the rwsem for as long
as they were prior to the patch.

This seems to have a positive effect on the code.  The tests run
much longer and I haven't seen any hangs on the rwsem since the fix.
However, I'm still seeing two problems:

First, some of the tests are failing with data comparison errors.
For example:

[direct-writev-pread] [d_doio-roth-01] corrupt bytes starting at file offset 65536
[direct-writev-pread] [d_doio-roth-01]     1st 32 expected bytes:  v*W:3689:roth-01:writev*W:3689:r
[direct-writev-pread] [d_doio-roth-01]     1st 32 actual bytes:    W:3689:roth-01:writev*W:3689:rot

I don't see how this patch could possible cause data corruption,
but the fact remains that before the patch, I never saw these errors.
I tend to believe that the rwsem was protecting the page accidentally
and that it was masking an additional problem in the code.

Second, there is still a hang out there, but this one is not blocked
on the rwsem at all.  Instead, it's a simple lock ordering problem.
In this case one process locks the ip->i_gl glock followed by an
rgrp glock, and a second process locks the same rgrp glock followed
by the ip->i_gl glock.  This second process is causing the problem
because it's always supposed to grab the inode glock first, then the
rgrp glock.

I'm guessing it's the exact same code path that caused the rwsem
deadlock.  I guessing that not holding the rwsem in that code path
caused it to be happier, but it still had the rgrp glock locked
at the time of the error, so the next time it looped around and
called generic_file_buffered_write, that tried to grab the ip->i_gl
with the rgrp already held from the previously failed attempt.
That's just a guess on my part.  I need to do more research on it.

Comment 15 Robert Peterson 2008-08-29 20:28:23 UTC

Here is the call trace from one instance of the rgrp/i_gl glock ordering
problem.

d_doio        D ffff810037eee7e0     0  3703   3702          3704       (NOTLB)
 ffff81005b23f7e8 0000000000000082 000000000000003f ffff81005b23e000
 0000000000000000 0000000000000007 ffff810037eee7e0 ffff81007bcd1100
 0000011a2365917f 0000000000005399 ffff810037eee9c8 0000000100000000
Call Trace:
 [<ffffffff8009e0e7>] autoremove_wake_function+0x9/0x2e
 [<ffffffff8008944d>] __wake_up_common+0x3e/0x68
 [<ffffffff884d20e8>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff884d20f1>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff800639be>] __wait_on_bit+0x40/0x6e
 [<ffffffff884d20e8>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063a58>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009e10c>] wake_bit_function+0x0/0x23
 [<ffffffff884d20e3>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff884e0afb>] :gfs2:gfs2_sharewrite_nopage+0x63/0x2ba
 [<ffffffff80062f4b>] thread_return+0x0/0xdf
 [<ffffffff884e0af3>] :gfs2:gfs2_sharewrite_nopage+0x5b/0x2ba
 [<ffffffff8000884b>] __handle_mm_fault+0x1f8/0xe5c
 [<ffffffff8006686f>] do_page_fault+0x4cb/0x830
 [<ffffffff8009e10c>] wake_bit_function+0x0/0x23
 [<ffffffff884e4286>] :gfs2:try_rgrp_fit+0x29/0x55
 [<ffffffff800625df>] __sched_text_start+0x17f/0xaeb
 [<ffffffff8002d3da>] wake_up_bit+0x11/0x22
 [<ffffffff8005dde9>] error_exit+0x0/0x84
 [<ffffffff8000fb61>] generic_file_buffered_write+0x1a7/0x855
 [<ffffffff8000fe41>] generic_file_buffered_write+0x487/0x855
 [<ffffffff884d9ba2>] :gfs2:gfs2_getbuf+0x106/0x115
 [<ffffffff884d93d3>] :gfs2:buf_lo_add+0x71/0x106
 [<ffffffff8000dff0>] current_fs_time+0x3b/0x40
 [<ffffffff8001624e>] __generic_file_aio_write_nolock+0x36c/0x3b8
 [<ffffffff884d36a8>] :gfs2:run_queue+0xad/0x173
 [<ffffffff884d39b6>] :gfs2:gfs2_glock_nq+0x248/0x273
 [<ffffffff800c06ad>] __generic_file_write_nolock+0x8f/0xa8
 [<ffffffff8009e0de>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8001c972>] vma_link+0xd0/0xfd
 [<ffffffff80063aad>] mutex_lock+0xd/0x1d
 [<ffffffff8000de4a>] do_mmap_pgoff+0x66c/0x7d7
 [<ffffffff800451b8>] generic_file_write+0x49/0xa7
 [<ffffffff80016656>] vfs_write+0xce/0x174
 [<ffffffff80016f23>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

And here is what my hang analyzer has to say about it:

[bob@ganesha ../sata1/home.f8/bob/tools]$ ./gfs2_hangalyzer -n roth-01
roth-01   : test_gfs: G:  s:EX n:3/5c70b69 f:y t:EX d:EX/0 l:0 a:0 r:5/test_gfs_locks"                                  
roth-01   : test_gfs:  H: s:EX f:tH e:0 p:3703 [d_doio] gfs2_inplace_reserve_i+0x254/0x691 [gfs2]
roth-01   : test_gfs:  H: s:EX f:tW e:0 p:3704 [d_doio] gfs2_inplace_reserve_i+0x254/0x691 [gfs2]
roth-01   : test_gfs:  R: n:96930665
                        lkb_id N RemoteID  pid exflg lkbflgs stat gr rq    waiting n ln             resource name
roth-01   : test_gfs:  2620007 0        0 3666     1       0 grnt  5 -1          0 0 24 "       3         5c70b69"
roth-03   : test_gfs: G:  s:UN n:3/5c70b69 f: t:UN d:EX/0 l:0 a:0 r:2



roth-01   : test_gfs: G:  s:EX n:2/5c75425 f:Dy t:EX d:UN/420339000 l:0 a:0 r:5
roth-01   :                         (demote, dirty)
roth-01   : test_gfs:  H: s:EX f:aH e:0 p:3704 [d_doio] gfs2_prepare_write+0x11e/0x3e7 [gfs2]
roth-01   : test_gfs:  H: s:EX f:W e:0 p:3703 [d_doio] gfs2_sharewrite_nopage+0x5b/0x2ba [gfs2]
roth-01   : test_gfs:  I: n:1049585/96949285 t:8 f:0x0000001c
                        lkb_id N RemoteID  pid exflg lkbflgs stat gr rq    waiting n ln             resource name
roth-01   : test_gfs:   17000b 0        0 3054     0       0 grnt  5 -1          0 0 24 "       2         5c75425"
roth-01   : test_gfs:   84000a 3  2f10006 3168     0   10000 wait -1  5     456892 0 24 "       2         5c75425"
roth-03   : test_gfs: G:  s:UN n:2/5c75425 f:l t:EX d:EX/0 l:0 a:0 r:4
roth-03   : test_gfs:  H: s:EX f:aW e:0 p:3416 [d_doio] gfs2_prepare_write+0x11e/0x3e7 [gfs2]
                        lkb_id N RemoteID  pid exflg lkbflgs stat gr rq    waiting n ln             resource name
roth-03   : test_gfs:  2f10006 1   84000a 3168     0       0 wait -1  5          0 1 24 "       2         5c75425"



There are 2 glocks with waiters.
roth-01, pid 3704 is waiting for glock 3/5c70b69, which is held by pid 3703
         The dlm has granted lkb "       3         5c70b69" to pid 3666

roth-01, pid 3703 is waiting for glock 2/5c75425, which is held by pid 3704
roth-03, pid 3416 is waiting for glock 2/5c75425, which is held by pid 3704
         The dlm has granted lkb "       2         5c75425" to pid 3054

Comment 16 Robert Peterson 2008-08-29 20:31:27 UTC

Created attachment 315400 [details]
Latest hang analyzer

This is my latest hang analyzer.

Comment 17 Steve Whitehouse 2008-09-02 09:10:45 UTC

I think I can see where the problem lies now. It looks like the current RHEL VFS code touches the pages in the wrong place, and what we really need is to be able to use the ordering of operations from the current upstream code.

It should be possible to do this by copying it into ops_address.c and giving it a different name (e.g. gfs2_file_write() or similar) although I've not checked all the details of this. We'd thus be able to use the new VFS code, but only for
GFS2 to avoid changes to any other fs.

Anyway, it looks like the source page (i.e. the mmaped page) is getting pushed out between the initial call to fault_in_pages_readable() and the actual copy from those pages, thus resulting in a page fault at a point where there should be none.

Its no good trying to move the second call to fault_in_pages_readable() since that just swaps one deadlock for another. The upstream code gets around this by being prepared to copy 0 bytes and by looping to try again if that happens. It only ever touches the pages outside of the write_begin/write_end pairing, and by never taking a page fault between them, as per comments #12, #13.

The real question is do we want to do this now, or is it better to release with a caveat that the particular circumstances which led to this hang should be avoided. If the source file is not on a gfs2 filesystem, then it should be safe, for example. This needs some thought.

Comment 18 Robert Peterson 2008-09-03 13:11:45 UTC

I spent a good chunk of yesterday back-porting upstream vfs code into
the RHEL5 GFS2 (so we don't impact the entire kernel).  It was a lot
of porting and a lot of work.  Then I started a coherency test, the
same test that was hanging, and let it run overnight.  The good news
is that the test did not hang.  The bad news is that eight of the tests
failed.  The eight tests that failed are:

direct-write-read, direct-write-readv, direct-write-pread,
direct-writev-read, direct-writev-readv, direct-pwrite-read,
direct-pwrite-readv, direct-pwrite-pread.

I've got to figure out what's going on there.
Here is the printout from the test run:

----------------  pan2 summary ------------------
Testcase                                 Result    
--------                                 ------    
buffered-write-read                      PASS      
buffered-write-readv                     PASS      
buffered-write-pread                     PASS      
buffered-write-mmread                    PASS      
buffered-write-mmindirect                PASS      
buffered-writev-read                     PASS      
buffered-writev-readv                    PASS      
buffered-writev-pread                    PASS      
buffered-writev-mmread                   PASS      
buffered-writev-mmindirect               PASS      
buffered-pwrite-read                     PASS      
buffered-pwrite-readv                    PASS      
buffered-pwrite-pread                    PASS      
buffered-pwrite-mmread                   PASS      
buffered-pwrite-mmindirect               PASS      
buffered-mmwrite-read                    PASS      
buffered-mmwrite-readv                   PASS      
buffered-mmwrite-pread                   PASS      
buffered-mmwrite-mmread                  PASS      
buffered-mmwrite-mmindirect              PASS      
sync-write-read                          PASS      
sync-write-readv                         PASS      
sync-write-pread                         PASS      
sync-write-mmread                        PASS      
sync-write-mmindirect                    PASS      
sync-writev-read                         PASS      
sync-writev-readv                        PASS      
sync-writev-pread                        PASS      
sync-writev-mmread                       PASS      
sync-writev-mmindirect                   PASS      
sync-pwrite-read                         PASS      
sync-pwrite-readv                        PASS      
sync-pwrite-pread                        PASS      
sync-pwrite-mmread                       PASS      
sync-pwrite-mmindirect                   PASS      
sync-mmwrite-read                        PASS      
sync-mmwrite-readv                       PASS      
sync-mmwrite-pread                       PASS      
sync-mmwrite-mmread                      PASS      
sync-mmwrite-mmindirect                  PASS      
direct-write-read                        FAIL      
direct-write-readv                       FAIL      
direct-write-pread                       FAIL      
direct-write-mmread                      PASS      
direct-write-mmindirect                  PASS      
direct-writev-read                       FAIL      
direct-writev-readv                      FAIL      
direct-writev-pread                      PASS      
direct-writev-mmread                     PASS      
direct-writev-mmindirect                 PASS      
direct-pwrite-read                       FAIL      
direct-pwrite-readv                      FAIL      
direct-pwrite-pread                      FAIL      
direct-pwrite-mmread                     PASS      
direct-pwrite-mmindirect                 PASS      
direct-mmwrite-read                      PASS      
direct-mmwrite-readv                     PASS      
direct-mmwrite-pread                     PASS      
direct-mmwrite-mmread                    PASS      
direct-mmwrite-mmindirect                PASS      
=================================================
Total Tests Run: 60
Total PASS:      52
Total FAIL:      8
Total TIMEOUT:   0
Total KILLED:    0
Total STOPPED:   0

Comment 19 Robert Peterson 2008-09-03 14:03:05 UTC

All of the failures were short reads, reading zero bytes where it
should have read more.  Here is a summary of all eight failures:

[direct-write-read] [d_doio-roth-01] Short read(), read 0 of 1024 bytes at 18826752 on direct-write-read
[direct-write-pread] [d_doio-roth-03] Short pread(), read 0 of 512 bytes at 13644288 on direct-write-pread
[direct-pwrite-read] [d_doio-roth-01] Short read(), read 0 of 512 bytes at 9320960 on direct-pwrite-read
[direct-writev-readv] [d_doio-roth-03] Short readv(), read 0 of 2048 bytes at 7055360 on direct-writev-readv
[direct-write-read] [d_doio-roth-01] Short read(), read 0 of 1024 bytes at 18826752 on direct-write-read
[direct-pwrite-readv] [d_doio-roth-01] Short readv(), read 0 of 512 bytes at 3598336 on direct-pwrite-readv
[direct-pwrite-pread] [d_doio-roth-03] Short pread(), read 0 of 512 bytes at 1940992 on direct-pwrite-pread
[direct-write-readv] [d_doio-roth-03] Short readv(), read 0 of 2048 bytes at 4564992 on direct-write-readv

I have verified that the fix for bug 456453 was on the system.  At
least as nearly as I can tell.  I think I'll download a newer kernel,
apply my patch and try again.  That will take a while.

Comment 20 Robert Peterson 2008-09-03 16:12:31 UTC

Created attachment 315660 [details]
Proposed patch

This patch implements Steve Whitehouse's idea of comment #17.
In order to not affect the entire vfs, I've ported the necessary bits
of the upstream vfs into GFS2 to get the lock ordering correct.  The
hang does not occur, but now that the test is allowed to go further, I
hit the previously mentioned short read problem.

Comment 21 Robert Peterson 2008-09-04 15:32:47 UTC

I tried to remedy the "short read" problem by adding Ben M's
obsolete patch from bug #456453.  This one here:

https://bugzilla.redhat.com/attachment.cgi?id=312687

My goal was to see if forcing the call to generic_file_direct_IO
(read) would solve the problem.  It did not.  I still got the short
read problem, still on the direct IO write tests.  I did some investigating
in the upstream kernel, but the vfs code looks similar.  So now I have to
figure out what's going on with the short reads.  I don't have a cluster
running an upstream kernel, but if I did, it would be good to see if the
short read problem recreates there.

Comment 22 Robert Peterson 2008-09-05 17:06:28 UTC

I verified that the zero-length read problem displayed by the coherency
test IS INDEED a problem in the latest and greatest -nmw git tree GFS2
code and its vfs.  That somewhat exonerates my patch.  But it's still
a bug either with vfs or gfs2.  We might consider treating it as a new
bug, since my patch seems to solve the hang.

Comment 23 Robert Peterson 2008-09-09 18:28:50 UTC

Created attachment 316226 [details]
Short-read recreator

This is one file (bob4.h2) I use to recreate the short-read problem.
To recreate it, mount the gfs2 file system on two nodes with the sts
test suite installed, and run this command:

collie -e -i 1 -O /tmp/bob/bobtest -f /tmp/bob/bob4.h2 -a /tmp/bob/active.h2 -l /tmp/bob/bob.log -t

It doesn't always recreate, so you may have to run it more than once.

Comment 24 Robert Peterson 2008-09-09 19:21:17 UTC

It's time for an update, since I haven't done that in a while.

My patch that ports the upstream write_begin code solves the hang;
I still haven't seen a hang since I implemented it, and I've run
coherency variations hundreds of times.

There is still a short read problem, but it seems like two
problems.

First, it seems like Ben's obsolete patch from bug #456453
(mentioned in comment #21) is still needed or you can hit the
short read.  The other upstream port from mm/filemap.c does not
seem to solve it completely for me.  Ben's patch cuts the
probability of hitting it maybe in half.

The second short-read problem seems to be in the gfs2_readpage
function, which I restructured several times last May for the
infamous GFS2 bug #432057.  I've identified what seems to be some
more problems with the logic and I've revised it several more
times for this bug.

For one thing, my reworked gfs2_readpage function never returns
AOP_TRUNCATED_PAGE to the caller, do_generic_mapping_read.
Instead, if it can't find the page it needs, it goes through
some logic similar to what do_generic_mapping_read does when
it gets AOP_TRUNCATED_PAGE back.  I suspect I may some day
revert this patch, but it does seem to make the problem harder
to hit.

The reason I removed the AOP_TRUNCATED_PAGE return is because
of this code snippet from filemap.c, do_generic_mapping_read:

		error = mapping->a_ops->readpage(filp, page);

		if (unlikely(error)) {
			if (error == AOP_TRUNCATED_PAGE) {
				page_cache_release(page);

So if we return AOP_TRUNCATED_PAGE, vfs is going to do a
page_cache_release on the page, but since we unlocked the
page, another process may have tampered with it, so the
"page" that vfs had a pointer to may be re-used for something
completely different, in which case, calling page_cache_release
could mess them up.

My latest-and-greatest revision only gets the short read about
once every seven or eight runs of the command in comment #23,
which is a whole lot better.  It used to recreate about half the
time.  

There is still something strange going on here, so I'm still
investigating.  I can tell you this much:  If you're running
with (1) The 2.6.18-107 kernel that contains the fix for bug
#456453 plus (2) The obsolete patch from bug #456453, plus (3)
My latest version of gfs2_readpage, then the only time it seems
to give a short read is when gfs2_readpage's call to
find_lock_page returns zero for the new page (which would have
returned AOP_TRUNCATED_PAGE, but I've rearranged the logic).

Comment 25 Robert Peterson 2008-09-10 22:01:00 UTC

After extensive instrumentation and testing, I've determined that my
latest version of gfs2_readpage is doing the right thing.  The read
request specifies some bytes to be read at offset X, and the file
size is only X, so it's properly reporting the end of file back.
So the error is definitely on the write side of things.  The write
gives a successful return code, but the inode on disk does not reflect
the new size at all.  So for example:

1. Request to roth-03 to "writev" at offset 893952 for 3072 bytes.
2. roth-03 responds with an ACK
3. Request to roth-01 to "readv" at offset 893952 for 3072 bytes.
4. At this point the file is only 893952 bytes long, even after the
   file system is quiesced, synced and unmounted happily.  Therefore
   the read fails and reports a short read of 0 bytes.

So now that I know the read side of things (including filemap.c) pretty
well, I'm going to have to backtrack the write side of things.  Hopefully
that won't be as painful because I'm already somewhat familiar with the
write side.

Once I get the write problem figured out, I have to figure out how much
of my changes to gfs2_readpage are needed, if any.

Comment 26 Robert Peterson 2008-09-11 19:51:13 UTC

Created attachment 316480 [details]
Upstream patch for the short read problem

The short-read-at-end-of-file problem turned out to be a failure of
the direct_IO code to fall back to buffered IO when writing at the end
of file.  This patch fixes the problem, and I've sent it to cluster-devel
for possible inclusion into the GFS2 -nmw git tree.  With this patch, I
was able to strip away all the previous crud, including (1) the obsolete
patch of Ben's, (2) my rework of gfs2_readpage, and (3) all other
instrumentation.  I've run the recreator mentioned above 60 times in
a row successfully using (1) this patch and (2) the lockup patch.
I still suspect that gfs2_readpage has some issues though.  I need to
discuss that with Steve Whitehouse when he returns from holiday.

Comment 27 Robert Peterson 2008-09-11 21:34:44 UTC

This patch passed the very long cluster "coherency" test on my
roth-0{1,3} cluster.  I posted it to rhkernel-list, so I'm changing
the status to POST.

Total Tests Run: 60
Total PASS:      60
Total FAIL:      0

Comment 28 Robert Peterson 2008-09-12 14:20:13 UTC

Created attachment 316579 [details]
Patch that was POSTed 11 Sep 2008

Here is the patch I POSTed to rhkernel-list on 11 September 2008.
This passed the full coherency test.

Comment 29 Robert Peterson 2008-09-16 02:32:23 UTC

Created attachment 316802 [details]
Patch that was POSTed 15 Sep 2008

Replacement patch that fixes the i386 arch dependency problem, but
adds a new symbol.

Comment 30 Robert Peterson 2008-09-16 02:34:48 UTC

POSTed a replacement patch to rhkernel-list.  Re-tested complete
coherency test on the roth-0{1,3} cluster.  Changing status back to POST.

Comment 31 Kiersten (Kerri) Anderson 2008-09-16 17:58:26 UTC

Am setting the blocker flag to be a beta blocker.  The last set of patch modifications ended up changing a core kernel header file and file.  Because this fix is no longer self contained in the gfs2 kernel module, we need to get those changes into the beta kernel.  This will allow us to spin the gfs2-kmod if we need to during the beta testing cycle.   The core kernel change is possibly beneficial to ext4 effort, but that has not been verified.

So, net result, think we need this patch in the beta kernel.

Comment 32 Steve Whitehouse 2008-09-17 14:57:44 UTC

*** Bug 455940 has been marked as a duplicate of this bug. ***

Comment 33 Don Zickus 2008-09-19 15:39:21 UTC

in kernel-2.6.18-116.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 36 errata-xmlrpc 2009-01-20 20:19:21 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

Note You need to log in before you can comment on or make changes to this bug.