Description of problem:

While running d_iogen/d_doio on a 4-node ia64 cluster, during the writev-readv test case (write data via writev(), verify the data with readv()), one of the d_doio processes stalled. I will include the complete sysrq-t from that node as an attachment. Here is the stack from the hung d_doio:

d_doio        D a00000010062f8c0     0 32128  32127                (NOTLB)

Call Trace:
 [<a00000010062e470>] schedule+0x1dd0/0x20c0  sp=e00000407d25f9b0 bsp=e00000407d259790
 [<a00000010062f8c0>] io_schedule+0x80/0xe0  sp=e00000407d25fac0 bsp=e00000407d259770
 [<a0000001000ff920>] sync_page+0x100/0x120  sp=e00000407d25fac0 bsp=e00000407d259750
 [<a00000010062fc60>] __wait_on_bit_lock+0xa0/0x180  sp=e00000407d25fac0 bsp=e00000407d259708
 [<a0000001000ff620>] __lock_page+0xc0/0xe0  sp=e00000407d25fac0 bsp=e00000407d2596e8
 [<a0000001001be8a0>] mpage_writepages+0x2c0/0x8a0  sp=e00000407d25fb00 bsp=e00000407d259670
 [<a000000100110130>] do_writepages+0xd0/0x120  sp=e00000407d25fba0 bsp=e00000407d259648
 [<a0000001001017b0>] __filemap_fdatawrite_range+0xb0/0xe0  sp=e00000407d25fba0 bsp=e00000407d259610
 [<a000000100101b80>] filemap_fdatawrite+0x40/0x60  sp=e00000407d25fbe0 bsp=e00000407d2595e8
 [<a0000002014adcb0>] inode_go_sync+0xd0/0x320 [gfs2]  sp=e00000407d25fbe0 bsp=e00000407d2595c0
 [<a0000002014adf70>] inode_go_drop_th+0x70/0xa0 [gfs2]  sp=e00000407d25fbe0 bsp=e00000407d2595a0
 [<a0000002014a9290>] gfs2_glock_drop_th+0x70/0x360 [gfs2]  sp=e00000407d25fbe0 bsp=e00000407d259570
 [<a0000002014aa4d0>] run_queue+0x330/0x920 [gfs2]  sp=e00000407d25fbe0 bsp=e00000407d259528
 [<a0000002014ab730>] gfs2_glock_nq+0x550/0x640 [gfs2]  sp=e00000407d25fbe0 bsp=e00000407d2594d8
 [<a0000002014af360>] gfs2_glock_nq_atime+0x260/0x6e0 [gfs2]  sp=e00000407d25fbe0 bsp=e00000407d259448
 [<a0000002014c15b0>] gfs2_readpage+0xd0/0x280 [gfs2]  sp=e00000407d25fc10 bsp=e00000407d259408
 [<a000000100100dd0>] do_generic_mapping_read+0x470/0x880  sp=e00000407d25fc50 bsp=e00000407d259350
 [<a000000100102360>] __generic_file_aio_read+0x280/0x3c0  sp=e00000407d25fca0 bsp=e00000407d2592f8
 [<a0000001001025a0>] generic_file_readv+0x100/0x140  sp=e00000407d25fcc0 bsp=e00000407d2592c0
 [<a000000100163d90>] do_readv_writev+0x350/0x620  sp=e00000407d25fda0 bsp=e00000407d259258
 [<a000000100164350>] vfs_readv+0xd0/0x100  sp=e00000407d25fe20 bsp=e00000407d259220
 [<a0000001001653f0>] sys_readv+0x70/0x120  sp=e00000407d25fe20 bsp=e00000407d2591a0
 [<a00000010000bdb0>] __ia64_trace_syscall+0xd0/0x110  sp=e00000407d25fe30 bsp=e00000407d2591a0
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400  sp=e00000407d260000 bsp=e00000407d2591a0

Version-Release number of selected component (if applicable):
kernel: 2.6.18-76.el5

How reproducible:
Unknown

Steps to Reproduce:
1. d_iogen -I 12321 -i 1200s -s writev -v readv -F 1m:writev-readv
2. d_doio -P fore -I 12321 -a -m 1000 -r 100 -n 1   # On each node in the cluster
Created attachment 294383 [details] sys-req-t of node with hung d_doio
Hit this again, 2.6.18-81.el5:

d_doio        D a00000010062e600     0  9181   5410                (NOTLB)

Call Trace:
 [<a00000010062d1b0>] schedule+0x1db0/0x20a0  sp=e00000406fbcfaa0 bsp=e00000406fbc9718
 [<a00000010062e600>] io_schedule+0x80/0xe0  sp=e00000406fbcfb30 bsp=e00000406fbc96f8
 [<a0000001000ff700>] sync_page+0x100/0x120  sp=e00000406fbcfb30 bsp=e00000406fbc96d8
 [<a00000010062e9a0>] __wait_on_bit_lock+0xa0/0x180  sp=e00000406fbcfb30 bsp=e00000406fbc9690
 [<a0000001000ff400>] __lock_page+0xc0/0xe0  sp=e00000406fbcfb30 bsp=e00000406fbc9668
 [<a0000001001be500>] mpage_writepages+0x2c0/0x8a0  sp=e00000406fbcfb70 bsp=e00000406fbc95e8
 [<a00000010010fe90>] do_writepages+0xd0/0x120  sp=e00000406fbcfc10 bsp=e00000406fbc95c0
 [<a0000001001015a0>] __filemap_fdatawrite_range+0xc0/0xe0  sp=e00000406fbcfc10 bsp=e00000406fbc9580
 [<a000000100101960>] filemap_fdatawrite+0x40/0x60  sp=e00000406fbcfc50 bsp=e00000406fbc9560
 [<a0000002013dda50>] inode_go_sync+0xd0/0x320 [gfs2]  sp=e00000406fbcfc50 bsp=e00000406fbc9538
 [<a0000002013ddd10>] inode_go_drop_th+0x70/0xa0 [gfs2]  sp=e00000406fbcfc50 bsp=e00000406fbc9518
 [<a0000002013d8d10>] gfs2_glock_drop_th+0x70/0x340 [gfs2]  sp=e00000406fbcfc50 bsp=e00000406fbc94e8
 [<a0000002013da230>] run_queue+0x2f0/0x900 [gfs2]  sp=e00000406fbcfc50 bsp=e00000406fbc94a0
 [<a0000002013db490>] gfs2_glock_nq+0x550/0x640 [gfs2]  sp=e00000406fbcfc50 bsp=e00000406fbc9450
 [<a0000002013df1e0>] gfs2_glock_nq_atime+0x260/0x6e0 [gfs2]  sp=e00000406fbcfc50 bsp=e00000406fbc93b8
 [<a0000002013f0dd0>] gfs2_readpage+0xd0/0x280 [gfs2]  sp=e00000406fbcfc80 bsp=e00000406fbc9378
 [<a000000100100bb0>] do_generic_mapping_read+0x470/0x880  sp=e00000406fbcfcc0 bsp=e00000406fbc92c8
 [<a000000100102140>] __generic_file_aio_read+0x280/0x3c0  sp=e00000406fbcfd10 bsp=e00000406fbc9270
 [<a0000001001049c0>] generic_file_read+0x120/0x160  sp=e00000406fbcfd30 bsp=e00000406fbc9230
 [<a0000001001646c0>] vfs_read+0x200/0x3a0  sp=e00000406fbcfe20 bsp=e00000406fbc91d8
 [<a000000100164d90>] sys_read+0x70/0xe0  sp=e00000406fbcfe20 bsp=e00000406fbc9160
 [<a00000010000bdb0>] __ia64_trace_syscall+0xd0/0x110  sp=e00000406fbcfe30 bsp=e00000406fbc9160
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400  sp=e00000406fbd0000 bsp=e00000406fbc9160
This trace shows the problem quite nicely.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 295614 [details]
Patch to fix bug for upstream kernels

I don't like having to add a memory allocation here. We'll probably get away with it simply because we have readpages and thus readpage isn't called very often these days. The place where it might matter is with stuffed files where readpages currently always delegates its duties to readpage. If we need to fix that then we should be able to do it by making readpages handle stuffed files itself. My quick test with postmark suggests that the difference isn't measurable though. Also I'll be discussing this and similar problems with those at the filesystems workshop next week.
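To make the approach concrete, the overall shape is something like the sketch below. This is illustrative only, not the attached patch: helper names such as __gfs2_readpage(), and the exact flags and error handling, are assumptions on my part. The point is that the kmalloc'd holder is what carries the glock across the AOP_TRUNCATED_PAGE round trip, since we can't block on the glock while the VFS holds the page lock:

static int gfs2_readpage(struct file *file, struct page *page)
{
	struct gfs2_inode *ip = GFS2_I(page->mapping->host);
	struct gfs2_holder *gh;
	int error;

	/* Did an earlier pass through here already queue a holder for us? */
	gh = gfs2_glock_is_locked_by_me(ip->i_gl);
	if (gh == NULL) {
		gh = kmalloc(sizeof(struct gfs2_holder), GFP_NOFS);
		if (gh == NULL) {
			unlock_page(page);
			return -ENOMEM;
		}
		/* Drop the page lock first; now blocking on the glock is safe */
		unlock_page(page);
		gfs2_holder_init(ip->i_gl, LM_ST_SHARED, GL_ATIME, gh);
		error = gfs2_glock_nq_atime(gh);
		if (error) {
			gfs2_holder_uninit(gh);
			kfree(gh);
			return error;
		}
		/* Ask the VFS to re-lock the page and call us back */
		return AOP_TRUNCATED_PAGE;
	}

	/* Second pass: glock already held, do the real read */
	error = __gfs2_readpage(file, page);

	gfs2_glock_dq(gh);
	gfs2_holder_uninit(gh);
	kfree(gh);
	return error;
}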
Created attachment 296569 [details] The RHEL version of the patch
Abhi, can you include the patch from comment #7 in your next build so that we can test it?
Am I the only one who crashes with this patch? I get panics running kmod-gfs2-xen-1.84-1.2.i686. If I remove this patch, everything is fine. It looks to me like I'm crashing in add_to_queue() because someone is messing with the gl_holders list while I'm traversing it, but I can't find anywhere in the patch where the holders are modified without holding the gl_spin lock. Maybe I'm missing something.
GAH!!!!! It's sizeof(struct gfs2_holder), not sizeof(struct gfs2_holder *). I hate it when the simple answer is staring me in the face for days and I keep overlooking it for something complicated. Only the RHEL patch has the typo. I'm verifying that everything works now, but I'm pretty sure that this is the whole of the problem.
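To spell the typo out: the two expressions differ by the size of a pointer versus the size of the whole structure, so the bad allocation hands back only 4 or 8 bytes, and every field written into the holder tramples the neighbouring heap object - which matches the gl_holders list corruption seen in comment #9:

gh = kmalloc(sizeof(struct gfs2_holder *), GFP_NOFS); /* wrong: allocates pointer-sized */
gh = kmalloc(sizeof(struct gfs2_holder), GFP_NOFS);   /* right: allocates the holder    */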
Unfortunately, there is still a problem. I'm running into a bug where add_to_queue() calls find_holder_by_owner() to see if the current process already has a holder on the glock's holder list, and it finds one. The original holder is from gfs2_readpage(). I added a GL_NEED_TO_RETRY bit to the gh_flags to mark when a holder was created in gfs2_readpage(), and then I clear it when gfs2_readpage() is called again and finds the holder. When I see this error, the original holder doesn't have this bit cleared, so it seems pretty obvious that in some cases, when you return AOP_TRUNCATED_PAGE, you don't get called again. Every time I've seen this bug, the original holder was created by a readpage call from do_generic_mapping_read(). I found this running kmod-gfs2-1.84-1.4.el5 with the kmalloc size fix, compiled for the 2.6.18-83.el5 kernel, since the -84 kernel has the sctp problems. But unless do_generic_mapping_read() works differently in the -84 kernel, this bug should be there too.
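For reference, the check in question looks roughly like this (sketched from the 2.6.18-era glock.c; the owner field's exact type may differ between versions):

/* Walk a holder list looking for one queued by the given task.
 * Callers hold gl->gl_spin. */
static struct gfs2_holder *find_holder_by_owner(struct list_head *head,
						struct task_struct *owner)
{
	struct gfs2_holder *gh;

	list_for_each_entry(gh, head, gh_list) {
		if (gh->gh_owner == owner)
			return gh;
	}
	return NULL;
}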
I suspect there is a race. Just before the call to readpage there is this:

	/* Did somebody else fill it already? */
	if (PageUptodate(page)) {
		unlock_page(page);
		goto page_ok;
	}

readpage:
	/* Start the actual read. The read will unlock the page. */
	error = mapping->a_ops->readpage(filp, page);

and I suspect that after we've been around this loop for the second time, another process has raced, brought the page uptodate, and thus we've skipped the second readpage; we only find out when the same process later tries to bring the same page uptodate again and hits the original lock.

The odd thing though is why you see this in find_holder_by_owner, since what should be happening is that the lock would be found on entry to readpage and thus it shouldn't be trying to lock it again.

Sorry about the typo. It was probably a cut & paste error when I copied the patch into RHEL.

Also I guess it could be worth dropping Mark a line and talking to him about how OCFS2 deals with this. We are using the same technique as Mark does, and I think he also has the same bug wrt not calling back into readpage if there has been a race in bringing the page uptodate.
The reason you don't find the original holder until find_holder_by_owner is that when you hit this, the process isn't doing another gfs2_readpage() call. I have seen it happen with gfs2_mmap() and gfs2_readpages().
Created attachment 297894 [details]
readpage patch with malloc fix, and some checks to free forgotten holders

This patch has a possible partial fix. There are definitely some calls to readpage that don't go through gfs2_aio_file_read(), gfs2_file_readv() and gfs2_file_read(), so I still see this bug, but I can get much further. Unfortunately, I keep running into sctp panics, so I can't see whether it's possible to check for forgotten holders for all the calls to readpage(). But if it is possible to check after all the readpage calls, then this seems like a reasonable workaround for RHEL.
Created attachment 298191 [details]
another version of my patch that handles more readpage callers

Here is another version of my patch. Even if this isn't the way we go, this may be useful for QA, so that they can test bugs that rely on this fix, like 428751, until the real fix is done. This version of the patch handles all the readpage callers except the cachefiles interface, but I highly doubt that QA checks that cachefiles work correctly on gfs2. Unfortunately, I never got time to verify that it correctly handles all the other readpage calls. I've only verified that it patches gfs2-kmod-1.85-1.2.el5 and compiles correctly.
Unless another idea has been proposed since I looked at this last, the current proposed solution is to make sure that we don't do a demote from a try lock request. Instead, the demote will be queued up and handled by the work queue. This means that try lock requests will behave as expected and not do blocking work if they can't directly acquire the lock. I'll look into changing this.
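The caller-visible effect should be that a try lock which cannot be granted without blocking work comes straight back with GLR_TRYFAILED instead of stalling. Sketched from the readpage side (illustrative only; the flag names come from the existing glock API, but this is not a quote from any patch):

/* Queue a try lock; if the glock can't be granted without first doing
 * blocking work (e.g. a demote), fail immediately and let the VFS
 * retry the page while the workqueue performs the demote. */
gfs2_holder_init(ip->i_gl, LM_ST_SHARED, GL_ATIME | LM_FLAG_TRY_1CB, &gh);
error = gfs2_glock_nq_atime(&gh);
if (error == GLR_TRYFAILED) {
	unlock_page(page);
	yield();		/* give the workqueue a chance to run */
	error = AOP_TRUNCATED_PAGE;
}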
Yes, that's the plan. I have a patch, not yet fully debugged, to try and do that.
Created attachment 302891 [details]
Test patch for fixing this bug and cluster coherency bug

This is a test patch (vs. upstream) which is aimed at fixing this bug by making "try locks" not do demotions where that might lead to blocking. It's still experimental - it needs lots of testing; it currently passes single node mount/postmark/umount tests. We need some more demanding (lots of glocks) tests as well as tests of the cluster behaviour. I'd be surprised if there was not the odd bug still lurking in the code somewhere, bearing in mind the scale of the change. On a single node, though, it seems to be running nicely.

Known bugs:

o There is a problem with dumping glocks via debugfs which sometimes results in a lock up. It seems to be related to finding a glock which has been removed from the lists during glock dumping. When I hit it I get an oops which points to a poisoned list pointer (the ->next pointer to be specific). I can hit it from time to time by dumping glocks while running postmark.

o I might get rid of gfs2_glockd

o The glock dump format is new and experimental. It's much faster than the old format to dump out. It will need some tweaking, but I'd like to try and keep each line to 80 cols and one line per object if at all possible. Let me know what you think on this issue.

Other than that, let's test it as much as we can and find/fix as many bugs as possible in the meantime.
The good news: This patch passes my "hell" tests.

The bad news: I was able to crash it by doing this (single node):

time /usr/tests/sts-rhel5.1/gfs/bin/d_io -m /mnt/gfs2 -n roth-01 -S REG -l /usr/tests/sts-rhel5.1/ -r /usr/tests/sts-rhel5.1/

Here is the call trace:

BUG: unable to handle kernel paging request at ffff81026cb29a66
IP: [<ffffffff8840d2a9>] :gfs2:gfs2_block_map+0x7ed/0xab0
PGD 8063 PUD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: lock_dlm gfs2 dlm configfs autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_multipath sbs sbshc battery acpi_memhotplug ac parport_pc lp parport joydev sg i3000_edac i2c_i801 i2c_core edac_core ide_cd_mod button tg3 cdrom pcspkr serio_raw dcdbas shpchp dm_snapshot dm_zero dm_mirror dm_mod qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: gfs2]
Pid: 5898, comm: xdoio Not tainted 2.6.25-rc7 #1
RIP: 0010:[<ffffffff8840d2a9>]  [<ffffffff8840d2a9>] :gfs2:gfs2_block_map+0x7ed/0xab0
RSP: 0018:ffff81006cb29968  EFLAGS: 00010297
RAX: 00000000ffffffff RBX: 0000000000000002 RCX: 0000000000000000
RDX: ffff81001d212d88 RSI: 0000000000000000 RDI: ffff81002de2c4d8
RBP: ffffffffffffffff R08: 0000000000000000 R09: ffff81007f41b368
R10: ffff81005f130f30 R11: 00000000000000d0 R12: ffff81007f41b368
R13: 0000000000000000 R14: ffff81002de2c470 R15: 00000000000183fd
FS:  00007fdc02af36f0(0000) GS:ffff81007fba7840(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff81026cb29a66 CR3: 000000007e4a1000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process xdoio (pid: 5898, threadinfo ffff81006cb28000, task ffff81005dbd94a0)
Stack:  ffff81007c143e40 0000000100010123 0000000000001387 ffff81001655e150
 0000000000010123 000000018841b0cd 0000000000000018 ffff81001d3d6810
 ffff81002de2c408 ffff81001d212d88 ffff81006216c000 ffff81001655e000
Call Trace:
 [<ffffffff802ac4c7>] ? __block_prepare_write+0x19d/0x36c
 [<ffffffff8840cabc>] ? :gfs2:gfs2_block_map+0x0/0xab0
 [<ffffffff8840dac6>] ? :gfs2:gfs2_unstuff_dinode+0x379/0x38b
 [<ffffffff802ac6b0>] ? block_prepare_write+0x1a/0x25
 [<ffffffff8841d324>] ? :gfs2:gfs2_write_begin+0x230/0x2c0
 [<ffffffff80266b38>] ? generic_file_buffered_write+0x14a/0x62c
 [<ffffffff80238c77>] ? current_fs_time+0x1e/0x24
 [<ffffffff80267356>] ? __generic_file_aio_write_nolock+0x33c/0x3a6
 [<ffffffff884150c8>] ? :gfs2:gfs2_glock_dq+0x1e/0x112
 [<ffffffff80245e68>] ? wake_up_bit+0x11/0x22
 [<ffffffff80267421>] ? generic_file_aio_write+0x61/0xc1
 [<ffffffff8028d2a7>] ? do_sync_write+0xc9/0x10c
 [<ffffffff8841e2f2>] ? :gfs2:gfs2_open+0x0/0x13c
 [<ffffffff80245e79>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffff80474a46>] ? lock_kernel+0x1b/0x37
 [<ffffffff8028dd93>] ? remote_llseek+0x73/0x7b
 [<ffffffff8028da09>] ? vfs_write+0xad/0x136
 [<ffffffff8028df46>] ? sys_write+0x45/0x6e
 [<ffffffff8020bfc9>] ? tracesys+0xdc/0xe1

BTW, this was done on the upstream kernel (from today's nmw tree).
I recreated the problem from comment #20 on the -nmw tree that DOES NOT include the patch, so the problem exists in the current upstream kernel GFS2. So it's not this patch causing the problem. It's something prior. I suspected as much, given the symptoms seem unrelated to the glock management.
The problem from comment #20 seems to go back to function gfs2_bmap_alloc which isn't in RHEL. It appears that function zero_metapath_length is returning 0. Coming into ALLOC_GROW_HEIGHT of the state machine, we have i==1. Coming into ALLOC_GROW_DEPTH, i==0. A call to function gfs2_indirect_init is made which tries to pass mp->mp_list[i-1], but since i==0, this i - 1 is -1, and that's what is blowing up. The problem seems to be when we have a stuffed file and then we go to write a bunch of data at a fairly big offset. Dean suggested this command to test it in addition to the command I gave above: xiogen -z -m random -s write -i 10 -t 100b -T 1000b -F 10000b:foofile | xdoio -v -p none I spoke with Steve about it and tried a few experiments, such as using "i = zmpl + 1" rather than "i = zmpl" but that broke other things. I also tried adding if clauses such as "i > 0 &&" but I don't know the intent of this function well enough. This is going to have to be well thought-out. There's also a concern with it potentially doing a brelse on the dinode bh. Since Steve wrote this function and the whole "zmpl" thing, and since it's not in RHEL, I didn't want to spend too much time debugging it. I'll let Steve do the heavy lifting on this. Instead, I'm focusing my efforts on porting his patch to RHEL and getting it working / tested. That's been a big challenge because there are a lot of changes between RHEL and upstream, such as pid number management and references to go_drop_th. It's not a straightforward port as I was hoping. I've now got a version that compiles but I haven't begun testing it to see what breaks. My fear, of course, is that there might be assumptions in upstream that make this port unusable (unless we port a ton of other upstream changes). I guess I'll find out soon.
Created attachment 303203 [details]
RHEL5 port of readv patch

Here is my RHEL5 port of Steve's readv patch from comment #19. Due to the number of patch rejects caused by differences between RHEL5 and upstream, I won't guarantee this is a good patch. Again, I've got good news and bad news:

The good news: This patch passes all my "hell" tests. It also does not crash when I do the command from comment #20.

The bad news: The d_io test from comment #20 ran a long time, but it hung up permanently while performing the "laio_rwsdirect" test. It's not a hard lock, but the xdoio processes are hung. There seem to be pairs of xdoio processes, one of which is stuck in the "D" state. So I need to take a sysrq-t and debug this. This could be a problem with my port or, as I said before, due to assumptions that are no longer true because of differences between upstream and RHEL.
Created attachment 303208 [details]
sysrq-t at the hang

This only shows two xdoio processes, both stuck in glock_wait.
Created attachment 303209 [details]
New glock dump at the time of the hang

The glock dump didn't hang or kill the system, and that's good. The new format is going to take some getting used to.
I tried some experimental changes to the patch but the problem is still there (I recreated it again). There are a couple of things that concern me about this patch. It might be smaller, but it is perhaps more confusing to the untrained eye. For example:

1. It bothers me that finish_xmote can call do_xmote AND do_xmote can also call finish_xmote. That just seems wrong to me. Not intuitive at any rate.

2. It also seems wrong that run_queue can call do_xmote twice in the demote case. It's my belief (completely unproven) that this is why the gl_ref count continually goes up after the failure occurs.

3. It bothers me that run_queue always calls do_xmote, even if it does not find any waiters on the queue. This makes me wonder if the patch impacts performance.

Perhaps I just need to study the logic of it some more.
To answer some of these points....

1. This was true of the previous code as well; it's just that now it's obvious that this happens. The recursion is strictly limited, ending when we obtain the lock mode that we requested, or when we receive an error. The worst case looks like this:

do_xmote:     Try to promote from shared to exclusive
finish_xmote: try lock fails due to conflicting holders
do_xmote:     called to demote to unlocked
finish_xmote: now unlocked
do_xmote:     need to send a request for exclusive lock
finish_xmote: exclusive lock granted

i.e. a maximum depth of three calls, which is not excessive. In fact, in practice it's less than that, since finish_xmote is run in the context of the dlm's callback thread, and we never experience this particular case when running with lock_nolock since the conversion will always succeed.

2. run_queue has to call do_xmote twice if there is a queued demote request, since it has to do that demote request first and then process the pending promote to avoid deadlock. This was also true of the previous code.

3. run_queue doesn't call do_xmote unless there is either a pending demote request, or there are waiters on the queue which can be processed only if there is a change of the lock mode (i.e. do_promote returns non-zero).
Created attachment 303500 [details] RHEL5 port of readv patch (revised) This RHEL5 port of Steve's patch passes the "hell" tests and the d_io test that was failing in previous comments. It also has a few minor whitespace fixes I grabbed from upstream. I'm a little concerned that I may have grabbed too much, in my effort to solve some of these issues. We need to do some cluster tests; perhaps distributed d_io (dd_io) and revolver with locks.
Found some problems during single-node testing. Steve made a couple of corrections to the patch, particularly in function may_grant(). This seemed to fix the single-node problems; I've been running specsfs on Barry's machine for nearly two hours now, and everything seems happy. It's on run 6 out of 14. I've also been running benchp13 on kool and it's happy too, on iteration 222. There does seem to be a bit of a performance increase, at least with benchp13. However, Abhi found some problems running distributed d_io (dd_io) on the smoke cluster. All of the d_doio processes on system "salem" are stuck waiting for an unlocked glock. The important part of the glock dump looks like this:

G: s:UN n:2/19660907 f:l t:EX d:EX/0 l:0 a:0 r:9
 H: s:EX f:W e:0 p:6189 [d_doio] gfs2_sharewrite_nopage+0x5b/0x2bf [gfs2]
 H: s:EX f:aW e:0 p:6192 [d_doio] gfs2_prepare_write+0xc7/0x360 [gfs2]
 H: s:EX f:W e:0 p:6190 [d_doio] gfs2_sharewrite_nopage+0x5b/0x2bf [gfs2]
 H: s:EX f:W e:0 p:6191 [d_doio] gfs2_sharewrite_nopage+0x5b/0x2bf [gfs2]
 H: s:SH f:aW e:0 p:6193 [d_doio] gfs2_permission+0x7b/0xd4 [gfs2]
 H: s:SH f:aW e:0 p:6275 [ls] gfs2_getattr+0x7d/0xc4 [gfs2]
Created attachment 303542 [details]
Call trace from salem

This is a gzip of Salem's call-trace at the hang.
Okay, salem, winston and merit are all apparently waiting for a glock that's held by camel. Here are the respective excerpts from the glock dumps:

salem:
G: s:UN n:2/19660907 f:l t:EX d:EX/0 l:0 a:0 r:9
 H: s:EX f:W e:0 p:6189 [d_doio] gfs2_sharewrite_nopage+0x5b/0x2bf [gfs2]
 H: s:EX f:aW e:0 p:6192 [d_doio] gfs2_prepare_write+0xc7/0x360 [gfs2]
 H: s:EX f:W e:0 p:6190 [d_doio] gfs2_sharewrite_nopage+0x5b/0x2bf [gfs2]
 H: s:EX f:W e:0 p:6191 [d_doio] gfs2_sharewrite_nopage+0x5b/0x2bf [gfs2]
 H: s:SH f:aW e:0 p:6193 [d_doio] gfs2_permission+0x7b/0xd4 [gfs2]
 H: s:SH f:aW e:0 p:6275 [ls] gfs2_getattr+0x7d/0xc4 [gfs2]

winston:
cat /sys/kernel/debug/gfs2/Smoke4\:gfs20/glocks > /tmp/glocks.winston
G: s:UN n:2/19660907 f:l t:EX d:EX/0 l:0 a:0 r:5
 H: s:EX f:aW e:0 p:2859 [d_doio] gfs2_prepare_write+0xa8/0x325 [gfs2]
 H: s:EX f:W e:0 p:2856 [d_doio] gfs2_sharewrite_nopage+0x43/0x28c [gfs2]

merit:
cat /sys/kernel/debug/gfs2/Smoke4\:gfs20/glocks > /tmp/glocks.merit
G: s:UN n:2/19660907 f:l t:EX d:EX/0 l:0 a:0 r:8
 H: s:EX f:aW e:0 p:2861 [d_doio] gfs2_prepare_write+0xa8/0x325 [gfs2]
 H: s:EX f:W e:0 p:2863 [d_doio] gfs2_sharewrite_nopage+0x43/0x28c [gfs2]
 H: s:SH f:TaW e:0 p:2859 [d_doio] gfs2_readpages+0x4f/0xcc [gfs2]
 H: s:SH f:aW e:0 p:2860 [d_doio] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:2862 [d_doio] gfs2_permission+0x69/0xb4 [gfs2]

camel:
cat /sys/kernel/debug/gfs2/Smoke4\:gfs20/glocks > /tmp/glocks.camel
G: s:EX n:2/19660907 f:lD t:UN d:UN/10443042000 l:0 a:0 r:5
 H: s:SH f:TaW e:0 p:2840 [d_doio] gfs2_readpage+0x4f/0xea [gfs2]
 H: s:EX f:aW e:0 p:2839 [d_doio] gfs2_prepare_write+0xa8/0x325 [gfs2]
 I: n:1048602/19660907 t:8 f:0x00000001

Camel's pid 2840 has this call trace:

d_doio        D 0000006F  1988  2839   2837          2840  2838 (NOTLB)
 cf7e3c34 00000082 006b1354 0000006f 00000000 00000003 0002407b 00000007
 df728aa0 c06713c0 006b7a4e 0000006f 000066fa 00000000 df728bac c13f4ee0
 c04d94a8 d3a225c0 e01dc840 c042cb1a dc71ce1c ffffffff 00000000 00000000
Call Trace:
 [<c04d94a8>] generic_unplug_device+0x15/0x22
 [<e01dc840>] dm_table_unplug_all+0x22/0x2e [dm_mod]
 [<c042cb1a>] getnstimeofday+0x30/0xb6
 [<c06083dc>] io_schedule+0x36/0x59
 [<c045520c>] sync_page+0x0/0x3b
 [<c0455244>] sync_page+0x38/0x3b
 [<c06084ee>] __wait_on_bit_lock+0x2a/0x52
 [<c0455187>] __lock_page+0x52/0x59
 [<c0435f90>] wake_bit_function+0x0/0x3c
 [<c045c0e8>] truncate_inode_pages_range+0x201/0x256
 [<c045c146>] truncate_inode_pages+0x9/0xf
 [<e0592f47>] inode_go_inval+0x40/0x4b [gfs2]
 [<e0592774>] do_xmote+0x66/0xfe [gfs2]
 [<e0592859>] run_queue+0x4d/0x9c [gfs2]
 [<e0592a01>] gfs2_glock_nq+0x159/0x17a [gfs2]
 [<e059387f>] gfs2_glock_nq_atime+0xfa/0x2c2 [gfs2]
 [<e0599ea6>] gfs2_prepare_write+0xb8/0x325 [gfs2]
 [<c045661f>] generic_file_buffered_write+0x226/0x5a2
 [<c0456e41>] __generic_file_aio_write_nolock+0x4a6/0x52a
 [<c0407406>] do_IRQ+0xa5/0xae
 [<c045972f>] get_page_from_freelist+0x2ba/0x333
 [<c045709f>] generic_file_write+0x0/0x94
 [<c0456ff5>] __generic_file_write_nolock+0x86/0x9a
 [<c0435f63>] autoremove_wake_function+0x0/0x2d
 [<c060868e>] mutex_lock+0xb/0x19
 [<c04570d9>] generic_file_write+0x3a/0x94
 [<c045709f>] generic_file_write+0x0/0x94
 [<c0471033>] vfs_write+0xa1/0x143
 [<c0471625>] sys_write+0x3c/0x63
 [<c0404eff>] syscall_call+0x7/0xb

Camel's pid 2839 call trace:

d_doio        D 0000006E  2448  2838   2837          2839       (NOTLB)
 cf7d1ecc 00000082 78b3b5c3 0000006e ce384000 00000046 c11c7060 00000006
 df4da550 c06713c0 7997d7cd 0000006e 00e4220a 00000000 df4da65c c13f4ee0
 c067df28 000280d2 c11c7c40 00000000 c067ba00 ffffffff 00000000 00000000
Call Trace:
 [<c06087b3>] __mutex_lock_slowpath+0x4d/0x7c
 [<c06087f1>] .text.lock.mutex+0xf/0x14
 [<c0457031>] generic_file_writev+0x28/0x96
 [<c0457009>] generic_file_writev+0x0/0x96
 [<c045709f>] generic_file_write+0x0/0x94
 [<c0470dfe>] do_readv_writev+0x166/0x277
 [<c0460e76>] __handle_mm_fault+0x896/0x8b3
 [<c044b0d5>] audit_syscall_entry+0x14b/0x17d
 [<c0470f46>] vfs_writev+0x37/0x43
 [<c047138a>] sys_writev+0x3c/0x96
 [<c0404eff>] syscall_call+0x7/0xb
Benchp13 is still running happily. It's on iteration 748. Unfortunately, specsfs ended abnormally on iteration 12 of 14, after several hours of running successfully. There were no messages in dmesg, and no processes were in the D state. Unfortunately, there are 57 GFS2 file systems mounted, which makes it difficult to narrow down what's happening. I'll see if I can dump them all and spot anything wrong, but it's a needle in a haystack, since there aren't any hung processes and there are huge numbers of files open. However, the timing of this problem makes me believe it might have something to do with vfs trimming the inodes in cache.
Thanks for the info in comment #31. In fact you've given me the backtraces for the wrong processes, but it doesn't matter as I think I've figured out what's going on there anyway. So this is my theory: readpage has queued a "try lock" but, due to the pending demote request, it exited run_queue at the test:

	if (gh || nonblock)
		return;

so that it was then stuck until the EX request was queued from prepare_write. That request did actually try to do the demote, but it got stuck since readpage is still holding a lock on one of the pages; hence deadlock. Now what should have happened is that in the case where "nonblock" is true, we should have returned GLR_TRYFAILED to the try locks and then scheduled the work queue to do the demote in the meantime (it would be a good plan to have a yield() in the GLR_TRYFAILED path in readpage to let it run). This should mean that a further request from readpage would then be successful. We thus need to insert something like:

	if (nonblock)
		fail_trylocks_and_reschedule(gl);

just before the "if (gh || nonblock)" in run_queue(), and I'll try and write a suitable function to do that in the meantime.
I suspect that we need something like this for the actual function. This is what I'm about to start testing:

static void fail_trylocks_and_reschedule(struct gfs2_glock *gl)
{
	do_error(gl, 0);
	gfs2_glock_hold(gl);
	if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
		gfs2_glock_put(gl);
}
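Combined with the placement described in the previous comment, the relevant part of run_queue() would then read something like this (a sketch, not the final patch):

	/* ...after selecting any pending demote/promote work... */
	if (nonblock)
		fail_trylocks_and_reschedule(gl); /* fail try locks, defer the demote */
	if (gh || nonblock)
		return;
	/* blocking callers fall through and do the state change directly */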
Created attachment 303631 [details]
Respin of the RHEL5 patch

This version of the patch contains the latest changes Steve mentioned.
Created attachment 303637 [details]
Respin of respin

I caved in to peer pressure and removed the white-space changes to inode.c from the patch. Otherwise it's the same.
Another hang on the smoke cluster:

salem:
G: s:EX n:2/132425 f:lD t:UN d:UN/1744247000 l:0 a:0 r:5
 H: s:SH f:TaW e:0 p:9381 [d_doio] gfs2_readpage+0x60/0xf6 [gfs2]
 H: s:EX f:aW e:0 p:9373 [d_doio] gfs2_prepare_write+0xc7/0x360 [gfs2]
 I: n:25/132425 t:8 f:0x00000001

camel:
G: s:UN n:2/132425 f:l t:EX d:EX/0 l:0 a:0 r:6
 H: s:EX f:aW e:0 p:6318 [d_doio] gfs2_prepare_write+0xa8/0x325 [gfs2]
 H: s:SH f:TaW e:0 p:6329 [d_doio] gfs2_readpage+0x4f/0xea [gfs2]
 H: s:SH f:TaW e:0 p:6325 [d_doio] gfs2_readpage+0x4f/0xea [gfs2]

merit:
G: s:UN n:2/132425 f:l t:SH d:EX/0 l:0 a:0 r:8
 H: s:SH f:aW e:0 p:6347 [d_doio] gfs2_open+0x9f/0x112 [gfs2]
 H: s:SH f:aW e:0 p:6339 [d_doio] gfs2_open+0x9f/0x112 [gfs2]
 H: s:SH f:aW e:0 p:6346 [d_doio] gfs2_open+0x9f/0x112 [gfs2]
 H: s:SH f:TaW e:0 p:6336 [d_doio] gfs2_readpage+0x4f/0xea [gfs2]
 H: s:SH f:TaW e:0 p:6333 [d_doio] gfs2_readpage+0x4f/0xea [gfs2]

winston:
G: s:UN n:2/132425 f:l t:SH d:EX/0 l:0 a:0 r:8
 H: s:SH f:aW e:0 p:6336 [d_doio] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:6343 [d_doio] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:6346 [d_doio] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:TaW e:0 p:6347 [d_doio] gfs2_readpage+0x4f/0xea [gfs2]
 H: s:SH f:TaW e:0 p:6348 [d_doio] gfs2_readpage+0x4f/0xea [gfs2]
Created attachment 303646 [details] New call trace from salem
I added my "glock history" patch (a modified version of it anyway) to the latest and greatest as a test patch, then I rereated the hang. Here is what the "hung" glock looks like, with history going back 25 items, from all three nodes' perspectives. Starting with roth-03, who apparently held it last: roth-03 G: s:EX n:2/99322 f:lD t:UN d:UN/1678975000 l:0 a:0 r:6 H: s:EX f:aW e:0 p:3053 [d_doio] gfs2_prepare_write+0xc7/0x360 [gfs2] H: s:SH f:TaW e:0 p:3050 [d_doio] gfs2_readpage+0x60/0xf6 [gfs2] I: n:2097173/99322 t:8 f:0x00000001 779: gfs2_do_promote: 3050, LM_ST_EXCLUSIVE 780: gfs2_glock_dq: 3053, LM_ST_EXCLUSIVE 781: gfs2_do_promote: 3053, LM_ST_EXCLUSIVE 782: gfs2_glock_dq: 3050, LM_ST_EXCLUSIVE 783: gfs2_do_promote: 3051, LM_ST_EXCLUSIVE 784: gfs2_glock_dq: 3053, LM_ST_EXCLUSIVE 785: do_xmote: 2830, LM_ST_EXCLUSIVE 786: search_bucket: 2952, LM_ST_EXCLUSIVE 787: finish_xmote: 2952, LM_ST_EXCLUSIVE 788: do_xmote: 3050, LM_ST_UNLOCKED 789: search_bucket: 2952, LM_ST_UNLOCKED 790: search_bucket: 2951, LM_ST_UNLOCKED 791: finish_xmote: 2952, LM_ST_UNLOCKED 792: handle_callback: 2951, LM_ST_SHARED 793: gfs2_do_promote: 2952, LM_ST_SHARED 794: gfs2_do_promote: 2952, LM_ST_SHARED 795: gfs2_glock_dq: 3051, LM_ST_SHARED 796: gfs2_glock_dq: 3052, LM_ST_SHARED 797: do_xmote: 3052, LM_ST_SHARED 798: search_bucket: 2951, LM_ST_SHARED 799: handle_callback: 2951, LM_ST_SHARED 800: search_bucket: 2951, LM_ST_SHARED 801: finish_xmote: 2951, LM_ST_SHARED 802: gfs2_do_promote: 2951, LM_ST_EXCLUSIVE 803: gfs2_glock_dq: 3053, LM_ST_EXCLUSIVE 804: gfs2_do_promote: 3053, LM_ST_EXCLUSIVE 805: gfs2_do_promote: 3053, LM_ST_EXCLUSIVE 806: gfs2_do_promote: 3053, LM_ST_EXCLUSIVE 807: gfs2_glock_dq: 3051, LM_ST_EXCLUSIVE 808: gfs2_glock_dq: 3050, LM_ST_EXCLUSIVE 809: gfs2_glock_dq: 3052, LM_ST_EXCLUSIVE 810: do_xmote: 2829, LM_ST_EXCLUSIVE roth-02 G: s:UN n:2/99322 f:l t:EX d:EX/0 l:0 a:0 r:6 H: s:EX f:aW e:0 p:3077 [d_doio] gfs2_prepare_write+0xc7/0x360 [gfs2] H: s:SH f:TaW e:0 p:3079 [d_doio] gfs2_readpage+0x60/0xf6 [gfs2] H: s:SH f:TaW e:0 p:3078 [d_doio] gfs2_readpage+0x60/0xf6 [gfs2] 929: gfs2_glock_dq: 3078, LM_ST_EXCLUSIVE 930: gfs2_glock_dq: 3079, LM_ST_EXCLUSIVE 931: gfs2_do_promote: 3079, LM_ST_EXCLUSIVE 932: gfs2_glock_dq: 3077, LM_ST_EXCLUSIVE 933: gfs2_do_promote: 3077, LM_ST_EXCLUSIVE 934: gfs2_do_promote: 3077, LM_ST_EXCLUSIVE 935: gfs2_glock_dq: 3078, LM_ST_EXCLUSIVE 936: gfs2_glock_dq: 3079, LM_ST_EXCLUSIVE 937: gfs2_do_promote: 3079, LM_ST_EXCLUSIVE 938: gfs2_glock_dq: 3077, LM_ST_EXCLUSIVE 939: gfs2_do_promote: 3077, LM_ST_EXCLUSIVE 940: gfs2_do_promote: 3077, LM_ST_EXCLUSIVE 941: gfs2_glock_dq: 3078, LM_ST_EXCLUSIVE 942: gfs2_glock_dq: 3079, LM_ST_EXCLUSIVE 943: gfs2_do_promote: 3079, LM_ST_EXCLUSIVE 944: gfs2_glock_dq: 3077, LM_ST_EXCLUSIVE 945: do_xmote: 2852, LM_ST_EXCLUSIVE 946: search_bucket: 2975, LM_ST_EXCLUSIVE 947: finish_xmote: 2975, LM_ST_EXCLUSIVE 948: do_xmote: 3078, LM_ST_UNLOCKED 949: search_bucket: 2975, LM_ST_UNLOCKED 950: handle_callback: 2975, LM_ST_UNLOCKED 951: search_bucket: 2975, LM_ST_UNLOCKED 952: finish_xmote: 2975, LM_ST_UNLOCKED 953: gfs2_do_promote: 2975, LM_ST_EXCLUSIVE 954: search_bucket: 2975, LM_ST_EXCLUSIVE 955: handle_callback: 2975, LM_ST_EXCLUSIVE 956: gfs2_glock_dq: 3077, LM_ST_EXCLUSIVE 957: do_xmote: 2852, LM_ST_EXCLUSIVE 958: search_bucket: 2975, LM_ST_EXCLUSIVE 959: finish_xmote: 2975, LM_ST_EXCLUSIVE 960: do_xmote: 3079, LM_ST_UNLOCKED roth-01 G: s:UN n:2/99322 f:l t:EX d:EX/0 l:0 a:0 r:6 H: s:EX f:aW e:0 p:3020 [d_doio] gfs2_prepare_write+0xc7/0x360 
[gfs2] H: s:SH f:TaW e:0 p:3024 [d_doio] gfs2_readpage+0x60/0xf6 [gfs2] H: s:SH f:TaW e:0 p:3022 [d_doio] gfs2_readpage+0x60/0xf6 [gfs2] 231: gfs2_do_promote: 3020, LM_ST_EXCLUSIVE 232: gfs2_glock_dq: 3022, LM_ST_EXCLUSIVE 233: gfs2_do_promote: 3022, LM_ST_EXCLUSIVE 234: gfs2_glock_dq: 3020, LM_ST_EXCLUSIVE 235: gfs2_do_promote: 3020, LM_ST_EXCLUSIVE 236: gfs2_glock_dq: 3022, LM_ST_EXCLUSIVE 237: gfs2_do_promote: 3022, LM_ST_EXCLUSIVE 238: gfs2_glock_dq: 3020, LM_ST_EXCLUSIVE 239: gfs2_do_promote: 3020, LM_ST_EXCLUSIVE 240: gfs2_glock_dq: 3022, LM_ST_EXCLUSIVE 241: gfs2_do_promote: 3022, LM_ST_EXCLUSIVE 242: gfs2_glock_dq: 3020, LM_ST_EXCLUSIVE 243: gfs2_do_promote: 3020, LM_ST_EXCLUSIVE 244: gfs2_glock_dq: 3022, LM_ST_EXCLUSIVE 245: gfs2_do_promote: 3022, LM_ST_EXCLUSIVE 246: gfs2_glock_dq: 3020, LM_ST_EXCLUSIVE 247: gfs2_do_promote: 3020, LM_ST_EXCLUSIVE 248: gfs2_glock_dq: 3022, LM_ST_EXCLUSIVE 249: do_xmote: 2797, LM_ST_EXCLUSIVE 250: search_bucket: 2920, LM_ST_EXCLUSIVE 251: finish_xmote: 2920, LM_ST_EXCLUSIVE 252: do_xmote: 3022, LM_ST_UNLOCKED 253: search_bucket: 2920, LM_ST_UNLOCKED 254: handle_callback: 2920, LM_ST_UNLOCKED 255: search_bucket: 2921, LM_ST_UNLOCKED 256: finish_xmote: 2921, LM_ST_UNLOCKED 257: gfs2_do_promote: 2921, LM_ST_EXCLUSIVE 258: gfs2_glock_dq: 3020, LM_ST_EXCLUSIVE 259: do_xmote: 2797, LM_ST_EXCLUSIVE 260: search_bucket: 2921, LM_ST_EXCLUSIVE 261: finish_xmote: 2921, LM_ST_EXCLUSIVE 262: do_xmote: 3024, LM_ST_UNLOCKED
Another hang with better history:

Note: do_xmote3 is finish_xmote getting Unlocked due to conversion deadlock.
      do_xmote4 is finish_xmote getting LM_ST_SHARED or LM_ST_DEFERRED

roth-02:
G: s:DF n:2/12583013 f:l t:SH d:EX/0 l:0 a:0 r:6
 H: s:SH f:aW e:0 p:3037 [d_doio] gfs2_lookup+0x50/0xa7 [gfs2]
 I: n:1048597/12583013 t:8 f:0x00000001
26841216: finish_xmote: 2973, LM_ST_DEFERRED
26841217: do_xmote3: 2973, LM_ST_UNLOCKED
26841218: search_bucket: 2974, LM_ST_UNLOCKED
26841219: finish_xmote: 2974, LM_ST_UNLOCKED
26841220: do_xmote4: 2974, LM_ST_DEFERRED
26841221: search_bucket: 2973, LM_ST_DEFERRED
26841222: finish_xmote: 2973, LM_ST_DEFERRED
26841223: do_xmote3: 2973, LM_ST_UNLOCKED
26841224: search_bucket: 2974, LM_ST_UNLOCKED
26841225: finish_xmote: 2974, LM_ST_UNLOCKED
26841226: do_xmote4: 2974, LM_ST_DEFERRED
26841227: search_bucket: 2973, LM_ST_DEFERRED
26841228: finish_xmote: 2973, LM_ST_DEFERRED
26841229: do_xmote3: 2973, LM_ST_UNLOCKED
26841230: search_bucket: 2974, LM_ST_UNLOCKED
26841231: finish_xmote: 2974, LM_ST_UNLOCKED
26841232: do_xmote4: 2974, LM_ST_DEFERRED
26841233: search_bucket: 2973, LM_ST_DEFERRED
26841234: finish_xmote: 2973, LM_ST_DEFERRED
26841235: do_xmote3: 2973, LM_ST_UNLOCKED
26841236: search_bucket: 2974, LM_ST_UNLOCKED
26841237: finish_xmote: 2974, LM_ST_UNLOCKED
26841238: do_xmote4: 2974, LM_ST_DEFERRED
26841239: search_bucket: 2973, LM_ST_DEFERRED
26841240: finish_xmote: 2973, LM_ST_DEFERRED
26841241: do_xmote3: 2973, LM_ST_UNLOCKED
26841242: search_bucket: 2974, LM_ST_UNLOCKED
26841243: finish_xmote: 2974, LM_ST_UNLOCKED
26841244: do_xmote4: 2974, LM_ST_DEFERRED
26841245: search_bucket: 2973, LM_ST_DEFERRED
26841246: finish_xmote: 2973, LM_ST_DEFERRED
26841247: do_xmote3: 2973, LM_ST_UNLOCKED
26841248: search_bucket: 2974, LM_ST_UNLOCKED
26841249: finish_xmote: 2974, LM_ST_UNLOCKED
26841250: do_xmote4: 2974, LM_ST_DEFERRED
26841251: search_bucket: 2973, LM_ST_DEFERRED
26841252: finish_xmote: 2973, LM_ST_DEFERRED
26841253: do_xmote3: 2973, LM_ST_UNLOCKED
26841254: search_bucket: 2974, LM_ST_UNLOCKED
26841255: finish_xmote: 2974, LM_ST_UNLOCKED
26841256: do_xmote4: 2974, LM_ST_DEFERRED
26841257: search_bucket: 2973, LM_ST_DEFERRED
26841258: finish_xmote: 2973, LM_ST_DEFERRED
26841259: do_xmote3: 2973, LM_ST_UNLOCKED
26841260: search_bucket: 2974, LM_ST_UNLOCKED
26841261: finish_xmote: 2974, LM_ST_UNLOCKED
26841262: do_xmote4: 2974, LM_ST_DEFERRED
26841263: search_bucket: 2973, LM_ST_DEFERRED

roth-03:
G: s:UN n:2/12583013 f:l t:SH d:EX/0 l:0 a:0 r:5
 H: s:SH f:aW e:0 p:3019 [d_doio] gfs2_lookup+0x50/0xa7 [gfs2]
26818610: search_bucket: 2955, LM_ST_UNLOCKED
26818611: finish_xmote: 2955, LM_ST_UNLOCKED
26818612: do_xmote4: 2955, LM_ST_DEFERRED
26818613: search_bucket: 2956, LM_ST_DEFERRED
26818614: finish_xmote: 2956, LM_ST_DEFERRED
26818615: do_xmote3: 2956, LM_ST_UNLOCKED
26818616: search_bucket: 2955, LM_ST_UNLOCKED
26818617: finish_xmote: 2955, LM_ST_UNLOCKED
26818618: do_xmote4: 2955, LM_ST_DEFERRED
26818619: search_bucket: 2956, LM_ST_DEFERRED
26818620: finish_xmote: 2956, LM_ST_DEFERRED
26818621: do_xmote3: 2956, LM_ST_UNLOCKED
26818622: search_bucket: 2955, LM_ST_UNLOCKED
26818623: finish_xmote: 2955, LM_ST_UNLOCKED
26818624: do_xmote4: 2955, LM_ST_DEFERRED
26818625: search_bucket: 2956, LM_ST_DEFERRED
26818626: finish_xmote: 2956, LM_ST_DEFERRED
26818627: do_xmote3: 2956, LM_ST_UNLOCKED
26818628: search_bucket: 2955, LM_ST_UNLOCKED
26818629: finish_xmote: 2955, LM_ST_UNLOCKED
26818630: do_xmote4: 2955, LM_ST_DEFERRED
26818631: search_bucket: 2956, LM_ST_DEFERRED
26818632: finish_xmote: 2956, LM_ST_DEFERRED
26818633: do_xmote3: 2956, LM_ST_UNLOCKED
26818634: search_bucket: 2955, LM_ST_UNLOCKED
26818635: finish_xmote: 2955, LM_ST_UNLOCKED
26818636: do_xmote4: 2955, LM_ST_DEFERRED
26818637: search_bucket: 2956, LM_ST_DEFERRED
26818638: finish_xmote: 2956, LM_ST_DEFERRED
26818639: do_xmote3: 2956, LM_ST_UNLOCKED
26818640: search_bucket: 2955, LM_ST_UNLOCKED
26818641: finish_xmote: 2955, LM_ST_UNLOCKED
26818642: do_xmote4: 2955, LM_ST_DEFERRED
26818643: search_bucket: 2956, LM_ST_DEFERRED
26818644: finish_xmote: 2956, LM_ST_DEFERRED
26818645: do_xmote3: 2956, LM_ST_UNLOCKED
26818646: search_bucket: 2955, LM_ST_UNLOCKED
26818647: finish_xmote: 2955, LM_ST_UNLOCKED
26818648: do_xmote4: 2955, LM_ST_DEFERRED
26818649: search_bucket: 2956, LM_ST_DEFERRED
26818650: finish_xmote: 2956, LM_ST_DEFERRED
26818651: do_xmote3: 2956, LM_ST_UNLOCKED
26818652: search_bucket: 2955, LM_ST_UNLOCKED
26818653: finish_xmote: 2955, LM_ST_UNLOCKED
26818654: do_xmote4: 2955, LM_ST_DEFERRED
26818655: search_bucket: 2956, LM_ST_DEFERRED
26818656: finish_xmote: 2956, LM_ST_DEFERRED
26818657: do_xmote3: 2956, LM_ST_UNLOCKED

roth-01:
G: s:DF n:2/12583013 f: t:DF d:EX/0 l:0 a:0 r:3
 I: n:1048597/12583013 t:8 f:0x00000000
Created attachment 304026 [details]
RHEL5 patch as of 2008 Apr 28

Here is the latest RHEL5 patch. Steve and I (mainly Steve) worked through several more issues and bugs this morning. This version still isn't quite right; d_doio can still get hangs, although much later in the test.
The hang in this case seemed to be caused by a lost demote request. My suspicions fall on the test at the top of finish_xmote which was the cause of the original lock bouncing bug that we saw. I thought then that I'd have to add an extra flag to ensure that we only updated the target state in specific circumstances. I suspect that this bug isn't caused directly by that, but by the test in the "success" path. I've added a GLF_DEMOTE_IN_PROGRESS flag and updated both tests so that they take account of this flag. I'm doing a few quick sanity checks on it atm. If it passes I'll attach the new upstream patch to this bz.
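The shape of the change is roughly as follows - a sketch pieced together from the description above, so treat the details as assumptions and grep the attached patch for GLF_DEMOTE_IN_PROGRESS for the real hunks:

/* run_queue(): mark a demote while it is being dispatched */
set_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags);
gl->gl_target = gl->gl_demote_state;
do_xmote(gl, NULL, gl->gl_target);

/* finish_xmote(): the tests that may rewrite gl_target now skip a
 * glock whose demote is still in flight, so the demote request can
 * no longer be silently overwritten by a competing promotion: */
if (state != gl->gl_target &&
    !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags))
	gl->gl_target = state;

/* ...and the flag is cleared once the requested state is reached: */
if (state == gl->gl_target)
	clear_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags);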
Created attachment 304098 [details]
Latest upstream patch

This is the latest version. It passes my very basic single node tests. Let's hope this is the change that fixes the remaining hang. To see the differences, grep for GLF_DEMOTE_IN_PROGRESS, as the only change is the addition of this flag in a few places.
Created attachment 304112 [details]
Another updated version

I found another corner case. This one should be better.
Created attachment 304161 [details]
Glock history for latest BUG() problem

This glock history shows more detail, including the glock flags, gl_target and gl_demote_state at each point. It also includes the dlm status from all three nodes for the glock in question. Note that do_xmote1 means do_xmote was called from run_queue when it saw the GLF_DEMOTE bit in gl_flags.
Created attachment 304398 [details]
Combined specsfs/readv patch for aol

This patch contains the latest RHEL5 version of the GFS2 code. The patch is against the 2.6.18-92 kernel. This is what I suggest we build and give to aol as a kmod. We've run a number of tests on it, including benchp13 (single fs) and all my "hell" tests. I want to do one or two more tests, but I wanted to post this ASAP.
I also ran a quick "d_io" test that ran for about 20 minutes. It was also successful. I think we're good to go. Still need to resolve dd_io, but I don't think it's likely that they'd hit that bug. Steve has another patch for me to try in that regard, which is not included in the aol patch from comment #46.
Created attachment 304409 [details]
Combined specsfs/better readv patch for aol

This version, which is Steve's latest, often passes the dd_io test. I can sometimes get dd_io to hang, but this one runs it much better. However, at the time of this posting, it has not undergone the testing we did for its predecessor.
Created attachment 304572 [details]
lockdumps when dd_io processes were hung on the smoke cluster

I was running dd_io with the latest patch in comment #48 and I managed to recreate the hang pretty quickly. Attached is the tarball of all the lockdumps from the smoke nodes.
Btw, the attachment of comment #49 seems to have been compressed twice.
Abhi, can you provide the dlm lock dump too? I can't work out from the dumps in comment #49 what is going on. It appears camel, merit and winston are all waiting for a glock on 2/20593, but salem appears not to have any reference to that glock. So the only way to find out what is going on is to compare with the dlm's idea of the current state.
Created attachment 304636 [details]
dlm-lockdumps corresponding to the gfs2 glock-dumps in comment #49
Created attachment 304637 [details]
Latest upstream patch including debugfs fix

Just about to re-test this after some clean up.
Created attachment 304749 [details]
Upstream patch with latest changes

It seems that there was still a window where invalidates were causing us problems. I think this really is the last such case now. The previous patch did all the demote cases, but missed SH -> DF, which is a promotion, but one which causes an invalidation of data pages. This patch should fix that case as well. You can find the new bits easily by looking for GLF_INVALIDATE_IN_PROGRESS. There are only a few changed lines compared with the previous patch, probably about half a dozen or so.
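In do_xmote() terms, the change presumably looks something like this (a sketch based on the description above; the DIO_METADATA detail is an assumption from how go_inval is called elsewhere):

/* Flag any transition that will invalidate pages - which now includes
 * the SH -> DF promotion - before the request is sent, and fail the
 * queued try locks up front rather than blocking them: */
if ((target == LM_ST_UNLOCKED || target == LM_ST_DEFERRED) &&
    glops->go_inval) {
	set_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags);
	do_error(gl, 0); /* fail queued try locks */
}

/* ...later, once it is safe to do so, perform the invalidation: */
if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags))
	glops->go_inval(gl, target == LM_ST_DEFERRED ? 0 : DIO_METADATA);
clear_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags);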
Created attachment 304755 [details]
Another upstream update

A few changes in add_to_queue() relating to lock cancellation, to ensure that we always call ->lm_cancel() in all cases where it's required.
Created attachment 304811 [details]
gfs2 and dlm lockdumps from latest dd_io hang

Steve, I tried to port your latest upstream patch to rhel5, but I got an error because of seq_open_private() not being available in rhel5. I extracted the debugfs fix from your patch and ported the rest. I managed to recreate the dd_io hang after about 12 minutes of runtime. Now, I could've made a mistake in porting the patch, so I'm going to compile your git-tree with your upstream patch on the smoke cluster and give that a try. I'll also post the rhel5 ports I made of your patch here.
Created attachment 304837 [details]
contains 2 patches that partially port swhiteho's latest upstream patch

Steve, could you take a look at these patches to see if I missed something? I tried to leave out the debugfs bits of your patch because they wouldn't compile on rhel5 (no seq_open_private).
The patches in comment #57 look ok. The seq_open_private function can just be copied from the upstream kernel and added as a static function. It's only small and I think it's pretty much standalone. The same goes for the release function, in case that is missing too.
Now it seems that merit is holding the lock for 2/ca9436 in DF mode, and that there are queued requests for that lock in both DF and SH modes. The only way that makes sense is if the SH request is at the front of the queue and a demote has been lost somewhere along the line. If the other DF modes were at the front of the queue, then they should have been granted. We saw something like this with some previous tests, but it's not something that has been seen recently. Also, please be careful with mime types when adding attachments to bugs. The patches in comment #57 are gzipped and tarred, but just appear as "data" in my browser. The data in comment #56 is downloaded as a .tar.gz file but is in fact gzipped twice over.
Created attachment 304876 [details]
Updated upstream patch

This has the start of a change to the lock_dlm threads to try to fix a potential ordering problem which I believe _might_ (and I stress that, as I can't be sure!) be responsible for the "lost" demote request that we've been seeing recently. One useful further test would be to separate the two threads in a different way, so that one thread only does lock submissions and the other thread does the blocking/completion/drop callbacks. The patch is currently untested, and there is still a potential issue relating to ordering which we might be able to solve by adding a sequence number to the messages as received from the dlm module (which I presume ensures such reordering does not occur). There is a lot of potential future cleanup that could be done in this area too. We do not need to release the spinlock before calling blocking callbacks in gfs2, since such callbacks don't block. Once we've sorted the completion callbacks so that the same is true of them, we can do the same there as well.
Created attachment 304898 [details]
Hang info: glock dump, dlm locks, call traces

I ran dd_io several times successfully on the exxon cluster. I was finally able to create a hang on about the twelfth run or so. Abhi says he can create a hang on the smoke cluster easily. This tgz contains the info for the hang.
Looking at the traces from comment #61, it looks like the demote request was sent to the wrong node.... at least node 02 seems to be waiting for an EX lock (currently unlocked) but with a queued demote request (!), whereas 01 holds the lock in shared state with no demote pending.
Created attachment 304946 [details]
Lockdumps from the latest hang on the smoke cluster

For some reason, my previous .tar.gz attachments were double-zipped. The tarballs were correct when I created them. Maybe it's because I let bugzilla auto-detect the attachment type. This time, I've set the attachment type to binary file (application/octet-stream). Hopefully it won't do it again.
Created attachment 304955 [details]
Comment 60 patch with glock history debug

For debugging.
Created attachment 304976 [details]
Debug version with Dave's patch

This patch contains:
1. All of the "good" patches up to comment #60
2. The "glock history" debug code
3. Dave's lock_dlm patch that processes "cast" then "bast" requests (in that order) if both are pending.
Created attachment 304984 [details] Lockdumps & sysrq-t from the latest hang on the smoke cluster
Created attachment 305126 [details]
Another upstream patch

This one has changes in lock_dlm. I'm not at all confident that the patch relating to lock_dlm which Dave did last week actually closed all the holes. I noticed the following in lock_dlm:

1. The DLM calls both basts and casts in process context
2. The DLM ensures the correct ordering
3. lock_dlm was adding these requests to two different queues and then dequeuing them with two different threads, one of which did not deal with basts for some reason.

So my plan is to simply get rid of the queues and send the DLM's callbacks directly to the fs. That's what this patch does. I then noticed some other things which should be fixed in a later patch:

1. The lock_dlm thread(s), when in a recovery state, remove entries from the delay list and then add them back again for no obvious reason. I presume that this is why the schedule() has been added, as otherwise this will soak up large amounts of cpu.
2. Most of what is left in lock_dlm is the logic for dealing with conversions from SH -> EX, which is now dealt with in glock.c anyway, so it could be removed in upstream.
3. The callbacks (gdlm_cast and gdlm_bast) were marked inline despite the only access to them being via function pointers, so that's been removed since it's pointless.

There are probably other possible cleanups too, but those are the ones I've found so far. The patch also saves 32 bytes per glock due to the loss of the two queues (on x86_64), so it should generally help boost scalability and performance, not to mention removing the overhead of needlessly adding messages to a queue. I'm still working on this, but this represents where I've got to at the moment.
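With the queues gone, the blocking callback can be dispatched straight from dlm context. A simplified sketch of what gdlm_bast ends up looking like (the mode-to-callback mapping follows the existing lock_dlm API; details may differ from the attached patch):

static void gdlm_bast(void *astarg, int mode)
{
	struct gdlm_lock *lp = astarg;
	struct gdlm_ls *ls = lp->ls;

	if (!mode) {
		printk(KERN_INFO "lock_dlm: bast mode zero %x,%llx\n",
		       lp->lockname.ln_type,
		       (unsigned long long)lp->lockname.ln_number);
		return;
	}

	/* No queue, no second thread: hand the demote request to the
	 * fs immediately, in the order the dlm delivered it. */
	switch (mode) {
	case DLM_LOCK_EX:
		ls->fscb(ls->sdp, LM_CB_NEED_E, &lp->lockname);
		break;
	case DLM_LOCK_CW:
		ls->fscb(ls->sdp, LM_CB_NEED_D, &lp->lockname);
		break;
	case DLM_LOCK_PR:
		ls->fscb(ls->sdp, LM_CB_NEED_S, &lp->lockname);
		break;
	default:
		printk(KERN_INFO "lock_dlm: unknown bast mode %d\n", mode);
	}
}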
Created attachment 305185 [details]
Wireshark capture of problem

This is a bzip2 of a wireshark capture I took during one failure. The dd_io tests are sometimes failing due to the processes taking too long to reply to the test program, or so it seems. I don't see anything wrong with the code, and I don't see anything obvious in the wireshark trace either. But according to Dean, the tests are taking a lot longer with the previous patch.
Just to clarify: The test now sometimes passes all 25 dd_io tests, although sometimes it apparently times out and runs a lot slower in some scenarios.
I'm not sure that does clarify things :-) When you say it "sometimes passes" do you mean that the failures are all related to the test taking too long, or that they are related to the original problem of hangs? Also did you try adjusting the minimum hold time? So far as I can tell from the traces, the problem is just down to bouncing the locks too much.
Created attachment 305237 [details]
Debug version of the comment67 patch plus fixes

This contains the comment #67 patch, plus some fixes (e.g. the additional goto in do_promote), plus glock history debug code. This is patched against the -92 kernel (so you'll find extra code for my specsfs fix in there).
Created attachment 305266 [details]
Fix the glock history panic

Same as previous, but the glock dump panic is (hopefully) gone.
I tried a few things today. I tried going back to the lock_dlm from comment #60, then making it single-threaded. This is similar to what Steve did, but the callbacks are not done by the dlm thread, they're done from the normal lock_dlm thread. This had the exact same symptom, in other words, the tests "failed" because they were running too slowly. So the performance impact seems to be due to the missing second thread, not due to the way Steve's latest code does the callback. The second thing I tried was reverting back to the comment #60 lock_dlm but keeping my new do_promote goto, and the dump_glock fixes. This ran successfully on the exxon cluster. I ran the failing mmap tests multiple times. Everything was back to "fast" again (the mmap tests were taking from 35-39 seconds.) Then I ran that same code on the smoke cluster. Many of the tests ran to completion in a timely manner, but once again, it got stuck in d_rwrandirectlarge. I took glock dumps on all the nodes (in /tmp/lockdump.{camel|merit|winston|salem}). Once again, these showed glocks sitting in the DF state without any holders/waiters on any node.
Created attachment 305377 [details]
Latest upstream patch

Changes from the last upstream patch:

- Bug fixed where we forgot to run the workqueue with lock_nolock (I've verified that this does not affect lock_dlm) after the lock module returns a status.
- Makes the processing of glock completions async to the extent that we now only submit I/O but we don't wait for it. When the I/O completes, we schedule the workqueue to run the remaining bits of the glops.

So to find the changes quickly, look for the new GLF_IO_DONE flag, the ->go_xmote_bh callbacks (new prototype since it returns an int now) and the change in do_xmote to schedule the workqueue in the lock_nolock case.
Created attachment 305387 [details]
RHEL version of the comment #74 patch w/debug

Here is the rhel version of the "comment 74" patch. I had to do a bit of massaging due to differences between upstream and rhel. It's passed some of the tests on my exxon cluster, but I wanted to get something out there asap for the smoke cluster.
Created attachment 305404 [details] lockdumps and sysrq-t from the latest hang on smoke with Bob's rhel5 patch in comment 75
This latest version gave me a CPU lockup when it ran the d_mmap3 test. Here is the call trace:

BUG: soft lockup - CPU#0 stuck for 10s! [d_doio:3582]
CPU 0:
Modules linked in: sctp lock_dlm(U) dlm configfs gfs2(U) autofs4 hidp rfcomm l2d
Pid: 3582, comm: d_doio Tainted: G      2.6.18-91.el5 #1
RIP: 0010:[<ffffffff800621d1>]  [<ffffffff800621d1>] __write_lock_failed+0x9/0x0
RSP: 0018:ffff81010d63bb58  EFLAGS: 00000287
RAX: 0000000000000000 RBX: ffff81010dba8000 RCX: 00000000ffff8fc7
RDX: ffff810001004400 RSI: 0000000000000000 RDI: ffffffff88573da8
RBP: ffff81010d613080 R08: ffff810000013600 R09: ffff81010d63bb88
R10: 0000000000000000 R11: ffff81010d026bd8 R12: 0000000000000001
R13: ffff81010cd28c60 R14: ffffffff8002d0d4 R15: 0000000000000018
FS:  00002ae5fa5e4020(0000) GS:ffffffff8039e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002ae5fb1f2000 CR3: 000000010d633000 CR4: 00000000000006e0
Call Trace:
 [<ffffffff800649f5>] _write_lock+0xe/0xf
 [<ffffffff884fec4d>] :gfs2:gfs2_glock_put+0x26/0xeb
 [<ffffffff884ff57d>] :gfs2:gfs2_holder_uninit+0xd/0x1f
 [<ffffffff88508790>] :gfs2:gfs2_readpage+0xe5/0xf6
 [<ffffffff8850870b>] :gfs2:gfs2_readpage+0x60/0xf6
 [<ffffffff8000bf84>] do_generic_mapping_read+0x22b/0x3f8
 [<ffffffff8000cd69>] file_read_actor+0x0/0x154
 [<ffffffff8000c29d>] __generic_file_aio_read+0x14c/0x190
 [<ffffffff800c0070>] generic_file_readv+0x8f/0xa8
 [<ffffffff8009dde2>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80021f44>] __up_read+0x19/0x7f
 [<ffffffff800d814f>] do_readv_writev+0x176/0x295
 [<ffffffff800c0089>] generic_file_read+0x0/0xc5
 [<ffffffff80028e79>] do_brk+0x1cf/0x2b2
 [<ffffffff800b3fd8>] audit_syscall_entry+0x16e/0x1a1
 [<ffffffff800d838b>] sys_readv+0x45/0x93
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

I'm chasing down a theory now. My theory - based on evidence from the glock history - is that finish_xmote is getting -EAGAIN from go_xmote_bh and it returns without clearing the GLF_LOCK bit. From then on, subsequent run_queues are ignoring the glock because the bit is set. I'm going to test this theory with the comment #60 (Dave's) version of lock_dlm.
Well, I didn't exactly do what I said I would in comment #77. Maybe it's just me, but it still seems as if performance is really bad with the latest code. I added more debugging points to the glock history, but between the performance and the soft lockup, I decided to set that version aside temporarily and focus on slightly older code that was fast. That code passed the dd_io test (all 25 variants) on my exxon cluster. However, I recreated a hang on the smoke cluster again. I collected glock dumps, with my improved history code. Looks like merit is holding a glock in DF mode with no holder again. Here is the history:

G: s:DF n:2/c999d9 f: t:DF d:EX/0 l:0 a:0 r:3
 I: n:2097178/13212121 t:8 f:0x00000010
170141: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170142: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170143: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170144: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170145: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170146: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170147: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170148: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170149: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170150: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170151: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170152: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170153: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170154: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170155: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170156: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170157: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170158: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170159: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170160: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170161: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170162: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170163: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170164: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170165: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170166: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170167: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170168: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170169: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170170: do_xmote1 3003, f: lDpy tgt: DF, dmt: DF, St: EX
170171: search_bucket 3372, f: lDp tgt: DF, dmt: DF, St: EX
170172: finish_xmote 3372, f: lDp tgt: DF, dmt: DF, St: EX -> DF

I'm beginning to wonder if we should keep Dave's version of lock_dlm that had two threads, but find a better way to make sure we process the requests in the correct order.
The analysis in comment #77 doesn't appear to match the stack trace. The trace in question points to a delay in getting the rwlock in order to release a ref to a glock. So it would be a reasonable conclusion that some other thread is keeping hold of that rwlock somewhere. Since we've not changed any of the ref counting code internally, I can't immediately see why this is happening, but most likely it's either scheduling while holding a rwlock or a lock ordering problem. I'll have a look and see if I can spot it anyway.
I have a feeling that the problem in comment #77 is down to a one-liner bug. Here is my proposed solution: change the end-read-I/O function as follows:

void gfs2_glock_end_read_io(struct buffer_head *bh, int uptodate)
{
	struct gfs2_bufdata *bd = bh->b_private;
	struct gfs2_glock *gl = bd->bd_gl;

	end_buffer_read_sync(bh, uptodate);
	set_bit(GLF_IO_DONE, &gl->gl_flags);
	if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
		atomic_dec(&gl->gl_ref);
}

This can be run from irq context, and to save us from having to use irq-safe locks for the glock hash table, we can just use an atomic_dec, since we know this only runs while the ref count is held high by the state change on the glock, so it can never hit zero as a result of this particular decrement.

If the problem persists, use lockdep to debug it. Also, it would be very helpful to have stack traces of all processes, not just the one that caused the lockup, since in this case it was down to another process holding the lock.
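For contrast, here is a rough sketch of why the ordinary put path needs the rwlock that the soft lockup above was spinning on. The bucket-lock helper shown is an assumption about the hash-table layout, not a quote of the real code:

static void glock_put_sketch(struct gfs2_glock *gl)
{
	/* The normal put may drop the count to zero, in which case the
	 * glock must be unhashed, so the hash-bucket rwlock is needed.
	 * That rwlock is not irq safe, which is why the end-I/O path
	 * above gets away with a bare atomic_dec() instead. */
	write_lock(gl_lock_addr(gl->gl_hash));	/* assumed bucket-lock helper */
	if (atomic_dec_and_test(&gl->gl_ref))
		hlist_del(&gl->gl_list);
	write_unlock(gl_lock_addr(gl->gl_hash));
}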
Created attachment 305506 [details] Latest upstream patch

So this patch moves all of the finish_xmote processing into the workqueue for _both_ lock_nolock and lock_dlm, so at last both of these share the same code path. The ref count bug seems to be gone too. I've removed ->go_xmote_bh for inodes, since ->go_lock does the work anyway and there didn't seem to be a need to do it twice. For the future, it would be nice to make the remaining I/O in ->go_lock async so that it doesn't block the workqueue while it's in progress. On the other hand, it now means that there are no circumstances under which the dlm's ast thread will block. So it should prove whether that was the problem which has been preventing the mmap tests from passing.

Also I've cleaned up the error handling a bit and introduced a glock_bug function and a GLOCK_BUG_ON macro. Note that both should be called only with gl_spin held, or in places where there cannot possibly be a race.
Created attachment 305513 [details] Comment 81 patch for RHEL with debug / glock history This is the RHEL port of Steve's patch from comment #81. I changed some of the logging code because the new GLOCK_BUG code was not identifying where the error was; it was only saying where the BUG() statement was. I'm not sure I like it that way, but at least it works. I'm testing it on my exxon cluster now.
The new version also fails d_mmap3 due to performance lags. :-(
After way too much prep work, I used systemtap to compare how much time was being spent in the "most popular" functions of glock.c. The script tracks the cumulative time spent inside each function and the number of calls made to it. Obviously, the new code contains functions that don't exist in the original (do_xmote, etc.), and vice versa. That said, here are the very telling results:

With comment #81 patch:
do_xmote 532969 in 192 calls
do_promote 535454 in 319 calls
finish_xmote 535796 in 191 calls
wait_on_holder 4565742 in 224 calls
gfs2_glock_wait 4567063 in 224 calls
gfs2_glock_nq 4569446 in 224 calls
---------------------------------------------------
With glock.c from stock -92 kernel:
gfs2_glock_nq 2089951 in 605548 calls
xmote_bh 765822 in 61 calls
gfs2_glock_cb 767264 in 89 calls

Note that these results are "typical" values; the counters are reset and reported every five seconds. So the old code was spending almost no time in gfs2_glock_wait, whereas the new code is spending a lot of time there.
So it seems from comment #83 that we've now proved that whatever is causing the performance problem, it's not down to running finish_xmote in a different way. I still think the most likely cause is the min hold time code, on the basis that we've not managed to prove that it is working correctly so far.
Some time back we looked into the possibility that we were demoting too soon because the workqueue was being scheduled early. That was discounted at the time, since the change we made (altering one point at which the workqueue was scheduled) appeared not to make any difference. Now that we have to run the workqueue to deal with both completions and callback events, it occurred to me that we should do more in the function itself to prevent us from shortening the hold time. As a result I've come up with the following change to the workqueue function:

static void glock_work_func(struct work_struct *work)
{
	unsigned long delay = 0;
	struct gfs2_glock *gl = container_of(work, struct gfs2_glock, gl_work.work);

	if (test_and_clear_bit(GLF_REPLY_PENDING, &gl->gl_flags))
		finish_xmote(gl, gl->gl_reply);
	spin_lock(&gl->gl_spin);
	if (test_and_clear_bit(GLF_PENDING_DEMOTE, &gl->gl_flags)) {
		unsigned long holdtime, now = jiffies;
		holdtime = gl->gl_tchange + gl->gl_ops->go_min_hold_time;
		if (time_before(now, holdtime))
			delay = holdtime - now;
		set_bit(delay ? GLF_PENDING_DEMOTE : GLF_DEMOTE, &gl->gl_flags);
	}
	run_queue(gl, 0);
	spin_unlock(&gl->gl_spin);
	if (!delay ||
	    queue_delayed_work(glock_workqueue, &gl->gl_work, delay) == 0)
		gfs2_glock_put(gl);
}

Also, I've been looking at the GLOCK_BUG_ON macro and have changed it slightly to avoid the problems reported in comment #82, so I hope that is better now.

From the figures in comment #84 it seems clear that in the -92 case we are doing far more locks than we could reasonably expect if we had to call out remotely each time. That means that most of the time we must be using cached locks. The fact that in the comment #81 patch case a lot of time is spent waiting suggests that the cause is waiting for something to release the lock. It seems more plausible that this is due to remote locking activity than local activity, so I guess that's the thing to test first (hence this patch).
Created attachment 305678 [details] Latest upstream patch As per earlier comments.
Created attachment 305702 [details] RHEL version of the comment #87 patch This is the RHEL version of the patch in comment #87. It seems fast and passes the dd_io test on my exxon cluster. It also has the seq file changes from upstream. There is no debug code in this one, so we should be able to do performance testing on it.
The dd_io tests on my exxon cluster were all pretty fast. Unfortunately, I tried running the benchp13 test with this version on "kool" and it showed a significant performance problem. My last "bitfit" version of gfs2 had been taking 17-19 seconds per iteration. This version takes 200 seconds per iteration on average, so more than ten times as long. I tried three things:

1. I re-ran my bitfit version to make sure it wasn't hardware (i.e. silent disk retries when a disk goes bad). Not that.
2. I tried running it in a "one-node cluster" using lock_dlm and got the same 200s performance. So it's not specific to lock_nolock.
3. I tried switching from the 5.1 kernel (2.6.18-53) to a newer (2.6.18-90) kernel. Wasn't that either, so it's definitely gfs2.

So I've eliminated the "easy" things to check. I also took a glock dump while benchp13 was running. It showed 145597 glocks in memory in various conditions. I need to analyze this some more.
Created attachment 305727 [details] smoke dumps from today's hang. camel's logs are absent because the console was unresponsive
I did some more experiments. First, I tried reducing the minimum hold time from HZ/5 back to its normal HZ/10. The performance was about the same. I also did a systemtap profile of the code and saw this:

gfs2_glock_wait 3518607 in 68856 calls
do_xmote 12311 in 4225 calls
do_promote 314372 in 77306 calls
gfs2_glock_nq 4454741 in 68856 calls

Which really doesn't tell us much. I've been poring over code, but haven't reached any conclusions yet.
It seems that the slowdown might be related to writeback. I notice that if I run postmark several times, the second and later runs are always slow compared with the first one. Unmounting and rerunning postmark returns the results to the original (faster) speed. Also, I notice that umount is often very slow. So I suspect that it's something to do with writeback and/or unlocking glocks.

Changing the minimum hold time should not make any difference for lock_nolock, since there is no code path that calls that code. The same goes for lock_dlm on a single node, since it will not send callbacks to itself as the result of a lock request. Although I've seen some regression in performance with the latest patch, I've not seen anything like the 10x reported in comment #89, so perhaps it's worth comparing the create phase of postmark with benchp13 to see if the difference there might explain the difference in performance.
For the record, the benchp13 performance issue was solved with this suggestion by Steve:

--- glock.c.~1~	2008-05-16 10:51:35.000000000 -0500
+++ glock.c	2008-05-19 08:26:05.000000000 -0500
@@ -565,8 +565,7 @@ static void do_xmote(struct gfs2_glock *
 	ret = gfs2_lm_lock(sdp, gl->gl_lock, gl->gl_state, target, lck_flags);
 
 	if (!(ret & LM_OUT_ASYNC)) {
-		gl->gl_reply = ret;
-		set_bit(GLF_REPLY_PENDING, &gl->gl_flags);
+		finish_xmote(gl, ret);
 		gfs2_glock_hold(gl);
 		if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
 			gfs2_glock_put(gl);

We've still got Abhi's hang on the smoke cluster to consider.
Created attachment 306084 [details] glock dumps and call traces I tried some variations on the code mentioned in the previous comment. One of the things I tried was introducing a new bit that indicated whether or not the glock was in the process of acquiring the page lock after having acquired the EX lock during the write_lock_start. Then I had the read try-locks behave differently when the bit is on, such as failing the lock right away. It behaved pretty much the same as the version mentioned in the previous comment: try locks were occasionally bursting up to 100% cpu utilization for short periods of time, then eventually resolving themselves when the lock is acquired. Since it behaved the same way, I reverted it. This latest patch ran through the dd_io test twice successfully, but it hung on my third attempt. This attachment contains a tar/zip file with a directory containing glock dumps and call traces for all the smoke nodes.
So it looks like we are back to seeing missing demote requests again. The waiting nodes all have the "l" flag, so they all sent their requests to the dlm; it looks like merit has just dropped the demote request somehow.
Created attachment 306111 [details] The patch I was using last night
Created attachment 306191 [details] Dave's patch to dlm

This morning, we discovered an apparent bug in DLM while testing. Dave T. posted the attached patch for the fix with this comment:

The fix in commit 3650925893469ccb03dbcc6a440c5d363350f591 was addressing the case of a granted PR lock with waiting PR and CW locks. It's a special case that requires forcing a CW bast. However, that forced CW bast was incorrectly applying to a second condition where the granted lock was CW. So, the holder of a CW lock could receive an extraneous CW bast instead of a PR bast. This fix narrows the original special case to what was intended.

Further testing got us back into a hang in which all our reads were apparently spinning trying to acquire try locks, eating 100% of the cpu. I tried another fix I thought up, but I got this error which we've never seen before:

DLM: Assertion failed on line 242 of file /home/bob/20May2008.dlm/lock.c
DLM: assertion: "r->res_nodeid >= 0"
DLM: time = 10546352
rsb: nodeid -1 flags 0 first 0 rlc 0 name 2 cde57c
------------[ cut here ]------------
kernel BUG at /home/bob/20May2008.dlm/lock.c:242!
invalid opcode: 0000 [#1]

Unfortunately, I don't have more because apparently the node was fenced. I'm going to assume this is an unrelated problem and continue testing my newest patch, which I'll post later tonight.
Created attachment 306207 [details] glock dumps, dlm dumps, call traces The dlm messages mentioned in my previous comment were likely caused by a problem with my experimental patch, which I've since corrected. So forget about that for now. I devised a new patch with changes to function add_to_queue in glock.c. Unfortunately, after three 15-minute runs, I was able to recreate a hang on this version as well. This tar contains glock dumps, dlm dumps and call traces from that hang on the smoke cluster. I'll post the patch itself shortly.
Created attachment 306208 [details] My patch from today

This is the latest RHEL patch I have. It contains:

1. Steve's latest and greatest changes (including the new lock_dlm).
2. Dave's new dlm patch from this morning.
3. Some debug code from this morning that will eventually need to be removed, but it may be useful for debugging, so I left it in.
4. Some experimental new code I added this evening (sketched below).

The new changes basically add a new GLF_TRYFAILED bit to the glock that flags glocks for which we've failed a trylock. My idea here is that once we fail a trylock, we might want to give the next trylock attempt a higher priority, queuing it like an LM_FLAG_PRIORITY lock at the earliest possible insertion point. That's to keep the try locks from always getting shuffled to the end of the queue and never getting done. I also rearranged the logic somewhat to simplify it and to avoid a case I was concerned about with the previous patch, where we might queue a priority glock at the front of the queue and then cancel it right away.

The code seemed to work well for two full 15-minute runs of dd_io, but again, on the third attempt, I recreated a hang. I still saw some bursts of 100% cpu utilization from d_doio, but they were usually short-lived. See the previous comment and associated attachment for details on the hang. I have not analyzed the data from the hang; I'm too tired this evening. I'm hoping Steve can get a look at it in the morning.
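To make item 4 concrete, here is a minimal sketch of the queueing idea. GLF_TRYFAILED is the new bit from this patch; the list handling is deliberately simplified (the real add_to_queue does considerably more checking):

/* Caller holds gl->gl_spin. */
static void add_to_queue_sketch(struct gfs2_holder *gh)
{
	struct gfs2_glock *gl = gh->gh_gl;

	/* If a trylock already failed once on this glock, queue the
	 * retry the way an LM_FLAG_PRIORITY holder would be queued,
	 * at the front, so try locks stop being shuffled to the back
	 * of the queue indefinitely. */
	if ((gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)) &&
	    test_and_clear_bit(GLF_TRYFAILED, &gl->gl_flags))
		list_add(&gh->gh_list, &gl->gl_holders);
	else
		list_add_tail(&gh->gh_list, &gl->gl_holders);
}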
Created attachment 306275 [details] Proposed final RHEL patch This is my proposed "final" patch that we can support in RHEL5.2.z. It has not undergone any testing in its final form, but of course, the various pieces have undergone lots of testing in prior incarnations.
The version posted in comment #100 has successfully passed all my "Hell" tests plus a complete REG run of d_io. Also, benchp13 is running happily on kool, although it seems a tiny bit slower than the version with all the lock_dlm changes.
Created attachment 306287 [details] Proposed final RHEL patch (revised) Found and removed some debug code that was in the previous patch.
I posted the patch to rhkernel-list, so I'm setting this one to POST and reassigning to Don Zickus.
This issue is not quite resolved and one more patch is expected.
We found some problems during testing, documented in bug #447920. I wrote a patch to solve these issues, which may be found here: https://bugzilla.redhat.com/show_bug.cgi?id=447920#c12 Unfortunately, this patch segfaulted during one of the dd_io tests, which seems to be a direct result of the code change I did in the patch. I'm investigating now. My plan is to leave 447920 open until we can get through the jdata tests and be sure that issue is fixed too.
Created attachment 306832 [details] Proposed final RHEL patch (revised again) This patch is a revision to the one I posted to rhkernel-list about a week ago. This fixes the problems I was reporting in bug #447920. This version has passed one run of dd_io and one run of d_io on lock_nolock, on my exxon cluster. We still need to run the "hell" tests, dd_io on the smoke cluster, and Dean's new test.
Apparently, my call to the deprecated function find_trylock_page has caused a dependency problem:

Error: Missing Dependency: ksym(find_trylock_page) = edd3ec35 is needed by package kmod-gfs2

find_trylock_page is only about seven lines long, so we could include a copy of it. However, the kernel documentation says the function is deprecated and should be replaced with functions like find_get_page or find_lock_page. What we want is something that doesn't block. I'm hoping Steve has a good idea of the "right way" to do this.
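One possible shape for a replacement that never sleeps, sketched under the assumption of a kernel of this vintage (the helper name is invented; TestSetPageLocked() is the pre-trylock_page spelling). Note that unlike find_trylock_page, this version takes a page reference that the caller must drop:

static struct page *find_page_nonblock(struct address_space *mapping,
				       pgoff_t index)
{
	struct page *page = find_get_page(mapping, index); /* takes a ref */

	if (!page)
		return NULL;
	if (TestSetPageLocked(page)) {
		/* Page lock is held elsewhere; back off rather than
		 * sleeping, preserving the find_trylock_page semantic. */
		page_cache_release(page);
		return NULL;
	}
	return page;	/* locked and referenced */
}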
Steve posted a new upstream patch for this problem in bug 447920. Unfortunately, the RHEL code is more complex because RHEL does not have the upstream code that Abhi is working on. At any rate, in rewriting the logic to avoid the dependency, I discovered a problem with the error path of my old code (and in the upstream patch as well). Basically: if acquiring the glock returns an error, and possibly in some other error cases, we could return the error to VFS without re-locking the page. VFS always tries to unlock the page unless we return AOP_TRUNCATED_PAGE. So in cases where we return another error, our page counts can get hosed. I attempted to correct that with my latest patch, which I'm testing now. This patch is based on Steve's patch, which still uses a blocking page lock, but it shouldn't matter that it blocks, because the locks are now taken in the correct order. (Thanks, Steve.)

If my testing goes well, I'll post the master patch to this bz. In the meantime, here is the retooled gfs2_readpage for people to scrutinize. I apologize for the ugly logic, but there are lots of different conditions to handle: (1) whether we got the glock, (2) whether we got the same page back as the original, (3) whether we got any page back at all, (4) whether we're dealing with the internal file sentinel, and (5) whether we've encountered an error. It makes for messy logic. I'm open to all suggestions on how to make it less messy. Here it is:

static int gfs2_readpage(struct file *file, struct page *page)
{
	struct address_space *mapping = page->mapping;
	struct gfs2_inode *ip = GFS2_I(mapping->host);
	struct gfs2_sbd *sdp = GFS2_SB(mapping->host);
	struct gfs2_file *gf = NULL;
	struct gfs2_holder gh;
	int do_unlock = 0;
	pgoff_t index = page->index;
	struct page *page2 = page;
	int error = 0;

	if (likely(file != &gfs2_internal_file_sentinel)) {
		if (file) {
			gf = file->private_data;
			if (test_bit(GFF_EXLOCK, &gf->f_flags))
				/* gfs2_sharewrite_nopage has grabbed the ip->i_gl already */
				goto skip_lock;
		}
		unlock_page(page);
		gfs2_holder_init(ip->i_gl, LM_ST_SHARED, GL_ATIME, &gh);
		error = gfs2_glock_nq_atime(&gh);
		if (unlikely(error)) {
			lock_page(page);
			return error;
		}
		do_unlock = 1;
		page2 = find_lock_page(mapping, index);
		if (page != page2)
			error = AOP_TRUNCATED_PAGE;
	}
skip_lock:
	if (!error && !PageUptodate(page2)) {
		if (gfs2_is_stuffed(ip)) {
			error = stuffed_readpage(ip, page2);
			unlock_page(page2);
		} else
			error = mpage_readpage(page2, gfs2_block_map);
		if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
			error = -EIO;
	}
	if (do_unlock) {
		if (page2)
			page_cache_release(page2);
		gfs2_glock_dq_uninit(&gh);
	}
	if (error) {
		if (error == AOP_TRUNCATED_PAGE)
			yield();
		else
			lock_page(page2);
	}
	return error;
}
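One suggestion, offered only as an untested sketch (the helper name is invented): confine conditions (4) and (5) to a helper that assumes the glock and page are already held, so the re-lookup dance stays in one place in gfs2_readpage itself:

/* Read one page, with the glock and page lock already held by the
 * caller; mirrors the skip_lock section of the function above. */
static int gfs2_readpage_locked(struct gfs2_inode *ip, struct page *page)
{
	struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host);
	int error;

	if (gfs2_is_stuffed(ip)) {
		error = stuffed_readpage(ip, page);
		unlock_page(page);
	} else {
		error = mpage_readpage(page, gfs2_block_map);
	}
	if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
		error = -EIO;
	return error;
}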
Created attachment 306972 [details] glock dumps and call traces

This is a tar/zip with a glock dump and a call trace from exxon-01. I've had lots of d_io problems with the new patch. Most of the time, all the IO just stops and everything waits. There is nothing vital in the glocks, and all the processes related to the test are waiting for page locks. I modified the patch to print a message whenever it returns an error. It never does. Still, everything waits.

This latest recreation is a bit different: it shows processes waiting for glocks, and glocks held by processes waiting for page locks. I'm going to change it back to a non-blocking page lock rather than find_lock_page and see if it behaves better again. That's because yesterday's version was working perfectly, whereas these problems only started happening after I introduced find_lock_page today.
Latest debug version of gfs2_readpage:

static int gfs2_readpage(struct file *file, struct page *page)
{
	struct address_space *mapping = page->mapping;
	struct gfs2_inode *ip = GFS2_I(mapping->host);
	struct gfs2_sbd *sdp = GFS2_SB(mapping->host);
	struct gfs2_file *gf = NULL;
	struct gfs2_holder gh;
	int do_unlock = 0;
	pgoff_t index = page->index;
	int error = 0;
	struct page *remember_page = page;

	if (likely(file != &gfs2_internal_file_sentinel)) {
		if (file) {
			gf = file->private_data;
			if (test_bit(GFF_EXLOCK, &gf->f_flags))
				/* gfs2_sharewrite_nopage has grabbed the ip->i_gl already */
				goto skip_lock;
		}
		unlock_page(page);
		gfs2_holder_init(ip->i_gl, LM_ST_SHARED, GL_ATIME, &gh);
		error = gfs2_glock_nq_atime(&gh);
		if (unlikely(error)) {
			printk("returning %d", error);
			lock_page(page);
			printk("\n");
			return error;
		}
		do_unlock = 1;
		page = find_lock_page(mapping, index);
		if (page->mapping != mapping)
			error = AOP_TRUNCATED_PAGE;
	}
skip_lock:
	if (!error && !PageUptodate(page)) {
		if (gfs2_is_stuffed(ip)) {
			error = stuffed_readpage(ip, page);
			unlock_page(page);
		} else
			error = mpage_readpage(page, gfs2_block_map);
		if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
			error = -EIO;
	}
	if (do_unlock && page)
		page_cache_release(page);
	if (error) {
		if (error == AOP_TRUNCATED_PAGE) {
			if (page && remember_page != page)
				printk("We swapped pages.\n");
			yield();
		} else if (do_unlock) {
			printk("Returning %d", error);
			lock_page(page);
			printk("\n");
		}
	}
	if (do_unlock)
		gfs2_glock_dq_uninit(&gh);
	return error;
}

Note that when the "lock3" hang happens, we're not getting ANY of the debug printk statements shown.
I expanded the scope of my search for this bug to everywhere we lock and unlock pages. After an extensive search, I've still found nothing that can account for this hang. I'll resume the search in the morning.
Created attachment 307077 [details] 29 May 2008 patch This version passes single-node d_io, and Dean's multiple-lock3 test. It still needs to be smoke tested with dd_io, plus the "hell" tests.
Created attachment 307096 [details] 29 May 2008 patch (revised) This hopefully fixes a NULL pointer dereference that Abhi found on the smoke cluster.
This version passes my "hell" tests and single-node d_io on exxon-01 and roth-01. The dd_io tests failed, most likely because this patch does not include the lock_dlm patch we did earlier. That will need to be shipped with RHEL5.3 as well at some point, but we decided to send this out without it for the sake of certain customers who will be using GFS2 as single-node.
Forgot to mention, the benchp13 test is going well too, on kool.
The good news: d_io, benchp13, and all of Nate's tests on jdata ran okay. The bad news: Abhi got this during testing:

GFS2: wanted 0 got 1
G: s:EX n:2/2053c f:lDp t:UN d:UN/1000 l:0 a:0 r:4
 I: n:22/132412 t:8 f:0x00000011
------------[ cut here ]------------
kernel BUG at /builddir/build/BUILD/gfs2-kmod-1.92/_kmod_build_/glock.c:487!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /kernel/dlm/gfs20/control
Modules linked in: sctp lock_dlm(U) gfs2(U) dlm configfs autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath video sbs backlight i2c_ec button battery asus_acpi ac lp sg floppy i2c_i801 i2c_i810 i2c_algo_bit parport_pc parport pcspkr i2c_core e1000 ide_cd cdrom dm_snapshot dm_zero dm_mirror dm_mod qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU: 0
EIP: 0060:[<e058f632>] Tainted: G VLI
EFLAGS: 00010282 (2.6.18-92.el5 #1)
EIP is at finish_xmote+0x188/0x250 [gfs2]
eax: 00000000 ebx: d87d0f38 ecx: 00000046 edx: 00000000
esi: 00000000 edi: 00000001 ebp: 00000001 esp: c38aaf40
ds: 007b es: 007b ss: 0068
Process glock_workqueue (pid: 6656, ti=c38aa000 task=df7d1000 task.ti=c38aa000)
Stack: e05a5234 00000000 00000001 e05a3ca0 d87d0f38 d87d0f40 d58f49c0 00000286
       e058fd9d d87d0fbc d87d0fc0 d58f49c0 c04331de e058fd7f d87d0f38 d58f49d4
       d58f49c0 d58f49cc 00000000 c0433a92 00000001 00000000 c13f4ee0 00010000
Call Trace:
 [<e058fd9d>] glock_work_func+0x1e/0x92 [gfs2]
 [<c04331de>] run_workqueue+0x78/0xb5
 [<e058fd7f>] glock_work_func+0x0/0x92 [gfs2]
 [<c0433a92>] worker_thread+0xd9/0x10b
 [<c042027b>] default_wake_function+0x0/0xc
 [<c04339b9>] worker_thread+0x0/0x10b
 [<c0435ea1>] kthread+0xc0/0xeb
 [<c0435de1>] kthread+0x0/0xeb
 [<c0405c3b>] kernel_thread_helper+0x7/0x10

Abhi was running with dlm, even though there should not have been lock contention to my knowledge. I know we saw this earlier, and I think it went back to either the lock_dlm or dlm modules. Unfortunately, the node was fenced and so we couldn't get dlm lock dumps.
I was just seconds away from pressing the "Send" button to rhkernel-list on this patch. Oh well. The strange thing is that the DEMOTE_IN_PROGRESS bit is on, and yet there are no waiters or holders on the queue.

Here's my theory: at the beginning of glock_work_func it does:

	if (test_and_clear_bit(GLF_REPLY_PENDING, &gl->gl_flags))
		finish_xmote(gl, gl->gl_reply);
	spin_lock(&gl->gl_spin);

The first thing finish_xmote does is take the gl_spin spinlock. I'm thinking there might be a timing window during which another process can jump in and deal with the reply from dlm. That might give the symptom Abhi saw. This is tricky code, so I'm not 100% sure.

If my theory is right, we might be able to fix it by changing finish_xmote so that it assumes gl_spin is already held on entry. It can also exit with the spinlock held (although it may need to release it and reacquire it internally). In other words:

	spin_lock(&gl->gl_spin);
	if (test_and_clear_bit(GLF_REPLY_PENDING, &gl->gl_flags))
		finish_xmote(gl, gl->gl_reply);

The other caller, do_xmote(), would have to have its spin_lock usage adjusted accordingly as well. Steve might have a simpler explanation for what happened and how to fix it. I know it took Abhi several hours of running before this failure occurred, so even if we try my fix, it will be hard to tell whether it's really fixed. Maybe after dinner I'll try to whip up a patch and see how it behaves. We need to determine whether this can happen with lock_nolock, and if not, whether it's okay to ship this to the customer.
Created attachment 307158 [details] 29 May 2008 patch (revised again) This version merely rearranges the spin_locks as noted in comment #119. It's passed all the "hell" tests. Doing d_io now. Waiting for Steve's comments and/or revision before posting to rhkernel-list. It'd be nice if Abhi could try to recreate this afternoon's failure with this version, but that also depends on what Steve can make out of the information posted. Unfortunately, the root hard drive died on exxon-02, rendering that cluster useless, as a three-node cluster anyway. I'm using the roth cluster until Chris can get it fixed. I'll trim the exxons down to a two-node cluster if need be, but for now the roths suffice.
The patch from comment #120 passes a single-node d_io test (all 49 tests successful). I've set it to run continuously overnight. I also put the new patch on kool for some overnight benchp13 runs.
If you are using dlm and you've not got the three fixes in the kernel, then it's not at all surprising that you've hit this. Provided that was the case, I don't think it's a problem we need to worry about. The message means that we asked to unlock but received an exclusive lock instead. That can't happen with lock_nolock, since every lock request always succeeds, returning exactly the requested mode.
The patch from comment #120 ran d_io for seven hours successfully, but my SAN apparently died on the roth cluster, which caused a kernel panic. Meanwhile, benchp13 is still running happily on kool.
Comment on attachment 307096 [details] 29 May 2008 patch (revised) This is the most tested version, and shouldn't have problems single-node, so this is what we plan to ship. Therefore, I'm unmarking it obsolete.
In light of comment #122, we've made the decision to ship the version from comment #115, which has had the most testing and should work fine for single-node configurations. The changes I did last night should not be necessary for single-node configurations. I've posted that patch to rhkernel-list so I'm changing this back to POST and reassigning to Don Z again.
*** Bug 447920 has been marked as a duplicate of this bug. ***
*** Bug 432108 has been marked as a duplicate of this bug. ***
*** Bug 437893 has been marked as a duplicate of this bug. ***
This is fixed in kernel-2.6.18-98.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
*** Bug 428751 has been marked as a duplicate of this bug. ***
Verified. We haven't seen this in quite a while.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html