Bug 635041 - GFS2: inode glock stuck without holder
Summary: GFS2: inode glock stuck without holder
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Duplicates: 635762
Depends On: 682951
Blocks: 710642
 
Reported: 2010-09-17 16:26 UTC by Nate Straz
Modified: 2016-04-26 13:42 UTC (History)
CC List: 11 users

Fixed In Version: kernel-2.6.32-128.el6
Doc Type: Bug Fix
Doc Text:
When the Distributed Lock Manager (DLM) queued three callbacks for a lock in the following sequence: blocking - completion - blocking, it would consider the final blocking callback redundant and skip it. Because the callback was skipped, GFS would not release the lock, causing processes on other nodes to wait indefinitely for it. With this update, the DLM does not skip the necessary blocking callback.
Clone Of:
Environment:
Last Closed: 2011-05-23 20:52:36 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
The hung lock, hung process backtraces, trace output (11.09 MB, application/x-gzip)
2010-09-17 16:32 UTC, Nate Straz
no flags Details
The interesting bits of the trace (18.04 MB, application/octet-stream)
2010-09-21 11:17 UTC, Steve Whitehouse
no flags Details
debugging patch (568 bytes, text/plain)
2011-02-09 23:21 UTC, David Teigland
no flags Details
patch (27.31 KB, text/plain)
2011-02-21 21:30 UTC, David Teigland
no flags Details
updated patch (28.19 KB, text/plain)
2011-03-02 20:17 UTC, David Teigland
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0542 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 6.1 kernel security, bug fix and enhancement update 2011-05-19 11:58:07 UTC

Description Nate Straz 2010-09-17 16:26:11 UTC
Description of problem:

While running a shared load on a single file (part of dd_io) I get into a state where d_doio processes are waiting on the inode lock, but the node it is granted on doesn't have any holders.

In this case buzz-04 has the glock but no holders.

2 57d3e3
buzz-01
GFS2 Glock (2, 57d3e3)
G:  s:UN n:2/57d3e3 f:lI t:DF d:UN/0 a:0 r:5
 H: s:DF f:W e:0 p:2756 [d_doio] gfs2_direct_IO+0x5c/0x110 [gfs2]
 H: s:EX f:W e:0 p:30116 [flush-253:5] gfs2_write_inode+0x7a/0x190 [gfs2]

buzz-02
GFS2 Glock (2, 57d3e3)
G:  s:UN n:2/57d3e3 f:lI t:SH d:EX/0 a:0 r:5
 H: s:SH f:AW e:0 p:13901 [d_doio] gfs2_getxattr+0x4e/0xa0 [gfs2]
 H: s:EX f:W e:0 p:11662 [flush-253:3] gfs2_write_inode+0x7a/0x190 [gfs2]

buzz-03
GFS2 Glock (2, 57d3e3)
G:  s:UN n:2/57d3e3 f:lI t:DF d:UN/0 a:0 r:6
 H: s:DF f:W e:0 p:32700 [d_doio] gfs2_direct_IO+0x5c/0x110 [gfs2]
 H: s:DF f:W e:0 p:32702 [d_doio] gfs2_direct_IO+0x5c/0x110 [gfs2]
 H: s:EX f:W e:0 p:30461 [flush-253:3] gfs2_write_inode+0x7a/0x190 [gfs2]

buzz-04
GFS2 Glock (2, 57d3e3)
G:  s:EX n:2/57d3e3 f:yI t:EX d:EX/0 a:0 r:3
 I: n:44/5755875 t:8 f:0x10 d:0x00000000 s:2152914944/2152914944

buzz-05
GFS2 Glock (2, 57d3e3)
G:  s:UN n:2/57d3e3 f:lI t:SH d:EX/0 a:0 r:5
 H: s:SH f:AW e:0 p:32387 [d_doio] gfs2_getxattr+0x4e/0xa0 [gfs2]
 H: s:EX f:W e:0 p:30144 [flush-253:3] gfs2_write_inode+0x7a/0x190 [gfs2]

dlm
Resource "       2          57d3e3" master buzz-05 1G/4C/0W
LVB: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Granted Queue
016b0473 EX Remote:   4 02cb047a
Conversion Queue
03e80490 NL (CW) Remote:   3 029203f8
02e80488 NL (CW) Remote:   1 00e10426
01a904c1 NL (PR)
00b60463 NL (PR) Remote:   2 00410496
None Waiting



Version-Release number of selected component (if applicable):
kernel-2.6.32-71.el6.x86_64

How reproducible:
Takes a while to hit

Steps to Reproduce:
1. Run brawl
2.
3.
  
Actual results:
Workload is hung

Expected results:
Workload runs to completion

Additional info:

This is a new cluster with 24G RAM per node.

Comment 1 Nate Straz 2010-09-17 16:32:59 UTC
Created attachment 448056 [details]
The hung lock, hung process backtraces, trace output

Here is a tarball with the following output:
 - glocker output showing the hung lock state
 - sysrq-w output showing backtraces of the hung processes on each node
 - tracepoint output from each node

Comment 2 Nate Straz 2010-09-17 18:50:43 UTC
fyi, using the demote_rq interface does free up the lock and allows the workload to complete.

[root@buzz-04 buzzez:brawl0]# echo "2:5755875 EX" > demote_rq
GFS2: fsid=buzzez:brawl0.3: demote interface used

then the load finished.

Comment 3 Nate Straz 2010-09-20 21:09:48 UTC
I hit this on my dash cluster (3 node x86_64) while running coherency.

[nstraz@try ~]$ python glocker.py dash-01:/mnt/brawl
Getting node list from dash-01
Getting GFS2 locks for /mnt/brawl from dash-01
Getting GFS2 locks for /mnt/brawl from dash-02
Getting DLM locks for brawl0 from dash-02
Getting DLM locks for brawl0 from dash-01
Getting GFS2 locks for /mnt/brawl from dash-03
Dumping all glocks which have waiters
================================================================================
2 23dd4f
dash-01
GFS2 Glock (2, 23dd4f)
G:  s:EX n:2/23dd4f f:yfI t:EX d:UN/0 a:0 r:3
 I: n:3/2350415 t:8 f:0x18 d:0x00000000 s:10737418240/10737418240

dash-02
GFS2 Glock (2, 23dd4f)
G:  s:UN n:2/23dd4f f:lfI t:EX d:UN/0 a:0 r:8
 H: s:EX f:W e:0 p:23925 [d_doio] gfs2_page_mkwrite+0x6a/0x3a0 [gfs2]
 H: s:SH f:W e:0 p:23924 [d_doio] gfs2_readpages+0x58/0xd0 [gfs2]
 H: s:EX f:W e:0 p:23922 [d_doio] gfs2_mmap+0x72/0xb0 [gfs2]
 H: s:EX f:W e:0 p:23923 [d_doio] gfs2_page_mkwrite+0x6a/0x3a0 [gfs2]
 H: s:EX f:W e:0 p:23707 [flush-253:3] gfs2_write_inode+0x7a/0x190 [gfs2]

dash-03
GFS2 Glock (2, 23dd4f)
G:  s:UN n:2/23dd4f f:lI t:EX d:UN/0 a:0 r:7
 H: s:EX f:W e:0 p:18905 [d_doio] gfs2_page_mkwrite+0x6a/0x3a0 [gfs2]
 H: s:SH f:W e:0 p:18907 [d_doio] gfs2_readpages+0x58/0xd0 [gfs2]
 H: s:EX f:W e:0 p:18906 [d_doio] gfs2_mmap+0x72/0xb0 [gfs2]
 H: s:EX f:W e:0 p:18904 [d_doio] gfs2_mmap+0x72/0xb0 [gfs2]

dlm
Resource "       2          23dd4f" master dash-01 1G/2C/0W
LVB: f0 40 e6 a9 00 88 ff ff e0 e2 31 a0 ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Granted Queue
037d00fb EX
Conversion Queue
01390109 NL (EX) Remote:   3 032900ed
03c700ee NL (EX) Remote:   2 01aa00da
None Waiting

================================================================================

Comment 4 Steve Whitehouse 2010-09-21 11:17:00 UTC
Created attachment 448676 [details]
The interesting bits of the trace

This is the buzz-04 trace log from the earlier attachment, but with everything except the entries relating to inode 5755875 filtered out.

There is a fair amount of activity relating to this inode/glock. The most recent demote request received is:

dlm_astd-2846  [011] 157802.656247: gfs2_demote_rq: 253,3 glock 2:5755875 demote NL to PR flags:ldfrI

Since there are waiters for both SH and DF listed (in comment #1), I'd have expected an EX demote request at some stage. It appears that there was in fact a race which generated this particular request. GFS2 was holding an EX and demoted to NL at about the same time that a demote to PR request was sent. I suspect this crossed on the wire, resulting in the reception of this demote to PR request even though the demote to NL had been successfully completed.

Next buzz-04 requests an EX lock, but it looks like some confusion has been caused by the earlier demote request since there is a demotion to PR shortly after that, followed by another promotion to EX for which there is no obvious reason in the trace. I guess it is possible that it was an earlier request that had been queued for some time.

Comment 5 Steve Whitehouse 2010-09-21 13:25:06 UTC

Here is a summary of the last few operations before things go wrong:

glock_workqueue-2908  [011] 157801.169141: gfs2_glock_state_change: 253,3 glock 2:5755875 state PR to CW tgt:CW dmt:NL flags:ldI

Successful convert of PR to CW lock

 glock_workqueue-2908  [011] 157801.169142: gfs2_promote: 253,3 glock 2:5755875 promote first CW

Grant local CW lock as a result of the above

        dlm_astd-2846  [011] 157801.170000: gfs2_demote_rq: 253,3 glock 2:5755875 demote CW to NL flags:ldrI

Demote request to NL from DLM (note: sets pending demote flag)

 glock_workqueue-2908  [011] 157801.368652: gfs2_glock_state_change: 253,3 glock 2:5755875 state CW to EX tgt:EX dmt:NL flags:ldI

Successful convert from CW to EX (pending demote flag still set)

 glock_workqueue-2908  [011] 157801.368654: gfs2_promote: 253,3 glock 2:5755875 promote first EX

Granted local EX lock request

 glock_workqueue-2908  [011] 157801.592644: gfs2_glock_state_change: 253,3 glock 2:5755875 state EX to NL tgt:NL dmt:NL flags:lDpfI

The pending demote has become an actual demote now (note that this is approx 200ms after the demote request was received, as per expectations) and the demote is successful.

        dlm_astd-2846  [011] 157802.656247: gfs2_demote_rq: 253,3 glock 2:5755875 demote NL to PR flags:ldfrI

We are sitting now in NL, but along comes a demote request demanding that we drop into PR. Maybe it's a race with our demote request, but I thought we'd fixed the ordering issue there. Not sure why we are getting a demote request requiring a higher mode than the one we are in already....

 glock_workqueue-2908  [011] 157802.656250: gfs2_glock_state_change: 253,3 glock 2:5755875 state NL to EX tgt:EX dmt:PR flags:ldfI

Here we promote to EX, but still carrying the demote request (note it is delayed) with us.

 glock_workqueue-2908  [011] 157802.656582: gfs2_promote: 253,3 glock 2:5755875 promote first EX

Promotion of an EX waiter, as expected.

 glock_workqueue-2908  [011] 157802.982305: gfs2_glock_state_change: 253,3 glock 2:5755875 state EX to PR tgt:PR dmt:PR flags:lDpfI

The demote has now got to the end of its delayed period. Slightly over 200ms this time, but still within the expected time. So it has demoted to PR.

 glock_workqueue-2908  [011] 157802.982672: gfs2_glock_state_change: 253,3 glock 2:5755875 state PR to EX tgt:EX dmt:EX flags:lfI

.... and back to EX again

 glock_workqueue-2908  [011] 157802.982673: gfs2_promote: 253,3 glock 2:5755875 promote first EX

And another glock is granted in EX mode

          d_doio-17377 [007] 157805.276698: gfs2_bmap: 253,3 bmap 5755875 map 514181/4096 to 0 flags:00000000 nocreate 1

The first of a number of bmaps now occurs. The remainder of the trace appears to be buzz-04 writing to the file and then a log flush occurs. There is no more activity to the inode in question after that.




So there are really two things to investigate. Firstly, why the final demote request was a "demote to PR" when the lock was already in the NL state. Secondly, why there was not a further demote request received when the waiters (comment #1) were a mix of PR and CW; I'd expect to see a "demote to NL" demote request generated by the dlm.

Comment 6 Steve Whitehouse 2010-09-21 13:26:11 UTC
Looks like this one might be dlm related.

Comment 7 David Teigland 2010-09-21 21:25:19 UTC
If gfs gets a completion callback for its conversion to NL, the dlm should not then send a blocking callback for mode PR.  That should have been fixed by upstream commit 7fe2b3190b8b299409f13cf3a6f85c2bd371f8bb.

There could be other similar problems left to debug and fix.  A starting point would be to collect the full dlm lock state using:
dlm_tool lockdebug -svw <name>
This will show us the mode in the last bast and the timestamp of the last bast
which can be compared to the time when locks were queued.

A blocking callback should never indicate a blocking mode of NL, I don't think that's what you really meant.  If you hold an EX lock and there are PR and CW locks waiting, the mode in the blocking callback is the mode that is being requested, i.e. PR or CW.  It's up to gfs to decide which mode it wants to down-convert to, e.g. NL or PR or CW.

If the full lockdebug still points to more dlm callback problems I'll try to get this reproducing on my cluster for deeper debugging.

Comment 8 Steve Whitehouse 2010-09-22 09:25:05 UTC
> A blocking callback should never indicate a blocking mode of NL, I don't think
> that's what you really meant.

No, I meant that it was in NL and had received a callback for PR which, as you mentioned earlier in comment #7, should have been fixed by the earlier patch. That is why I was rather surprised to see something like this appearing again.

Since Nate can reproduce this, it should be possible to get the extra lockdebug info as well.

Comment 9 David Teigland 2010-09-22 16:17:39 UTC
I actually did see the full lockdebug for the original case that you described in comment 5, just not for the second case in comment 3.  I believe it showed that a bast(PR) was sent to gfs with a timestamp later than the timestamp at which the EX lock was placed on the granted queue (when an EX completion callback is queued).  I'm not sure if this contradicts your suggestion that the bast(PR) was delivered following a NL completion callback.

Comment 10 Steve Whitehouse 2010-09-23 08:57:08 UTC
Ah, I see. In which case I suspect that is correct. The state change tracepoint is called when the local copy of the dlm state gets updated. That is in the context of the glock workqueue which is scheduled when the completion arrives from the dlm.

As a result, it is possible for that to occasionally show a different ordering (bast arriving first), which is ok because the code can cope with that. We cache the bast until the state change has completed. In fact that is what has happened here... the bast to demote to PR has been kept until the promote to EX has completed. It then drops to PR and goes back to EX again after that (comment #5), so that might all be correct in that case. The question remains, though, why no further bast requests have arrived from the dlm.

Bearing in mind that the waiters (opening comment) are a mix of PR and CW, the PR waiters must have been queued after the final promote to EX; otherwise they should have been granted already. The CW waiters are first in the list, which tends to support that theory. So maybe it is just a missing bast in this case, rather than one which has been reordered?

Comment 11 David Teigland 2010-09-23 19:48:56 UTC
I'm going to try to reproduce this on my nodes using the following patch.

diff --git a/fs/dlm/ast.c b/fs/dlm/ast.c
index 4314f0d..90508dd 100644
--- a/fs/dlm/ast.c
+++ b/fs/dlm/ast.c
@@ -107,6 +107,14 @@ repeat:
                                last_castmode = lkb->lkb_castmode_done;
                        if (dlm_modes_compat(bastmode, last_castmode))
                                do_bast = 0;
+
+                       if (!do_bast)
+                               log_debug(ls, "skip b %d c %d lastc %d type %d "
+                                         "first %d bdone %d cdone %d %x %s",
+                                         bastmode, castmode, last_castmode,
+                                         type, first, lkb->lkb_bastmode_done,
+                                         lkb->lkb_castmode_done, lkb->lkb_id,
+                                         r->res_name);
                }
 
                if (first == AST_COMP) {

Comment 12 David Teigland 2010-09-23 20:02:18 UTC
I don't have the time to install and configure the sts framework on my cluster right now, so I'm going to have to put off testing this.

Comment 13 Steve Whitehouse 2010-10-21 10:49:36 UTC
Dave, any more ideas on this one? I had a look again recently and I still can't spot where the issue is.

Comment 14 David Teigland 2010-10-21 15:27:45 UTC
I don't expect we'll make any progress on this until we can reproduce it on some devel machines with extra debugging.  My general approach is to let something like this sit until I can batch it up with another couple of RHEL5 dlm kernel issues to work on together.  This amortizes the overhead of setting up a test environment and getting into the correct frame of reference.

Comment 15 Steve Whitehouse 2010-12-15 11:31:27 UTC
This bug has come up on the agenda for the Wednesday cluster GSS meetings and GSS are intending to help reproduce the issue based on the info in this bz.

Comment 16 Ben Turner 2010-12-15 16:42:44 UTC
I just got my new hardware set up and configured with a RHEL 6 cluster.  I see the reproducer steps here are listed as run brawl.  @Nate, is there any way I can get a copy of this test app to run on my HW?  I have a locking app that Adam Drew wrote that I can run here if needs be.

Comment 17 RHEL Program Management 2011-01-07 03:50:56 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 18 David Teigland 2011-02-02 21:12:48 UTC
It took two days (as I feared it would), but I finally got four machines set up and sts set up, and the tests started.  They have run fine so far, but I'll keep rerunning them...  if we can hit something, then we can try to add some debugging instrumentation and reproduce.

Comment 19 David Teigland 2011-02-04 16:18:34 UTC
I've had this running for a couple days now.  The tests did get stuck at one point when node 4's network died, giving an endless stream of these messages to the console:

> Feb  4 11:05:19 gfs-a16c-04 kernel: bnx2: eth0 DEBUG: intr_sem[0]
> Feb  4 11:05:19 gfs-a16c-04 kernel: bnx2: eth0 DEBUG: EMAC_TX_STATUS[ffffffff] RPM_MGMT_PKT_CTRL[ffffffff]
> Feb  4 11:05:19 gfs-a16c-04 kernel: bnx2: eth0 DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
> Feb  4 11:05:19 gfs-a16c-04 kernel: bnx2: eth0 DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
> Feb  4 11:05:19 gfs-a16c-04 kernel: bnx2: eth0 DEBUG: PBA[ffffffff]
> Feb  4 11:05:20 gfs-a16c-04 kernel: bnx2: Chip not in correct endian mode
> Feb  4 11:05:20 gfs-a16c-04 kernel: bnx2: eth0: BUG! Tx ring full when queue awake!
> Feb  4 11:05:20 gfs-a16c-04 kernel: bnx2: eth0: BUG! Tx ring full when queue awake!

Comment 20 David Teigland 2011-02-04 23:05:20 UTC
Running all day and still no problem.  Will let it go through the weekend, and if nothing I'll probably close this.

Comment 21 David Teigland 2011-02-07 16:53:20 UTC
Same network problems occurred on node 4 on Friday evening.  I will try to reconfigure the test to only run on nodes 1,2,3.

Comment 23 David Teigland 2011-02-08 19:30:54 UTC
I have updated kernel and kernel-firmware rpm's to 2.6.32-113 (from latest 6.1) and restarted the tests.

Comment 24 David Teigland 2011-02-09 16:42:37 UTC
Test ran until it hit bug 675270.  Will restart test when a kernel build is ready with that fix.

Comment 25 David Teigland 2011-02-09 22:57:40 UTC
Nate reproduced this on buzz nodes:

01

>Resource len 24  "       2          56238f"
>Master
>LVB len 32 seq 22
>00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>Granted
>017d0007 EX      Remote:   5 01420003          time 0023204040444611 flags 0000080d 00010000 bast 3 23204040445371
>Convert
>01ef0009 NL (CW)                               time 0023203768163396 flags 0000000c 00000000 bast 5 23203368392239
>03ed0006 NL (CW) Remote:   3 03b60006          time 0023203768668911 flags 0000000c 00010000 bast 5 23203368392423
>01e20009 NL (CW) Remote:   4 00870001          time 0023203768853514 flags 0000000c 00010000 bast 5 23203368394774
>02d10005 NL (PR) Remote:   2 00000004          time 0023203788279121 flags 0000000c 00010000 bast 5 23203568922830

02

>Resource len 24  "       2          56238f"
>Local 1
>Convert
>00000004 NL (PR) Master:   1 02d10005          time 0027620832273663 flags 0000000c 00000000 bast 5 27620613043881

03

>Resource len 24  "       2          56238f"
>Local 1
>Convert
>03b60006 NL (CW) Master:   1 03ed0006          time 0027444978769352 flags 0000000c 00000000 bast 5 27444578453150

04

>Resource len 24  "       2          56238f"
>Local 1
>Convert
>00870001 NL (CW) Master:   1 01e20009          time 0027055296764648 flags 0000000c 00000000 bast 5 27054896247848

05

>Resource len 24  "       2          56238f"
>Local 1
>Granted
>01420003 EX      Master:   1 017d0007          time 0026657784676664 flags 0000080d 00000000 bast 3 26657784677978

Comment 26 David Teigland 2011-02-09 23:09:16 UTC
This is consistent with the data we saw before -- everything looks ok in the dlm, with the bast time being later than the queue time for the EX lock, i.e. the dlm queued the completion callback before the blocking callback.  The next step will be to apply a patch like comment 11 and repeat the failure, to check if the order of the two callbacks is being swapped around at the very end, or if the blocking callback is being suppressed.

Comment 27 David Teigland 2011-02-09 23:21:51 UTC
Created attachment 477926 [details]
debugging patch

I'd like to add this patch to the kernel, enable <dlm log_debug="1"/>, reproduce the failure, and see if the new debug message sheds any light on the particular lock being waited on.

This patch just tells us about basts that are suppressed.  I don't yet know how to highlight possible ast ordering problems.

Comment 30 David Teigland 2011-02-11 16:52:20 UTC
Nate reproduced this on buzz 01-05 using my patch from comment 27.
The extra debug output appears to confirm that this problem is caused by a suppressed bast.  I have not yet traced through the suspect logic using the new data, but am recording the data here now for safekeeping.

(Given Nate's success at reproducing this I don't expect I'll need the gfs machines or the new kernel for them.)

01

>Resource len 24  "       2          3fed16"
>Master           
>LVB len 32 seq 104
>20 13 64 1c 06 88 ff ff 20 13 64 1c 06 88 ff ff 30 13 64 1c 06 88 ff ff 30 13 64 1c 06 88 ff ff
>Granted
>002b0004 EX      Remote:   3 00da0006          time 0014401792462721 flags 0000080d 00010000 bast 3 14401792464052
>Convert
>03480008 NL (CW)                               time 0014401487151680 flags 0000000c 00000000 bast 5 14401087444418
>02690004 NL (CW) Remote:   2 02460001          time 0014401487705358 flags 0000000c 00010000 bast 5 14401087447876
>01f50007 NL (CW) Remote:   5 023a0002          time 0014401488038375 flags 0000000c 00010000 bast 5 14401087444075
>02130002 NL (PR) Remote:   4 027e0001          time 0014401519545486 flags 0000000c 00010000 bast 5 14401288057564

02

>Resource len 24  "       2          3fed16"
>Local 1          
>Convert
>02460001 NL (CW) Master:   1 02690004          time 0014401330677914 flags 0000000c 00000000 bast 5 14400930444076

03

>Resource len 24  "       2          3fed16"
>Local 1          
>Granted
>00da0006 EX      Master:   1 002b0004          time 0014400816621362 flags 0000080d 00000000 bast 3 14400816622078

04

>Resource len 24  "       2          3fed16"
>Local 1          
>Convert
>027e0001 NL (PR) Master:   1 02130002          time 0014399685640502 flags 0000000c 00000000 bast 5 14399454220891


03

>dlm: brawl0: skip b 5 c 0 lastc 0 type 2 first 2 bdone 3 cdone 0 da0006        2          3fed16
>dlm: brawl0: skip b 2 c 0 lastc 0 type 3 first 1 bdone 2 cdone 3 da0006        2          3fed16
>dlm: brawl0: skip b 3 c 5 lastc 3 type 3 first 2 bdone 3 cdone 3 da0006        2          3fed16

Comment 31 David Teigland 2011-02-11 19:05:57 UTC
The timestamps seem to indicate the following sequence of events on this lock

master:
add granted EX at 14401792462721
queue_bast PR at 14401792464052 (1331 after EX on granted queue)

local:
add granted EX at 14400816621362
queue_bast PR at 14400816622078 (716 after EX on granted queue)

which means we should expect to deliver a cast EX followed by a bast PR.
That is not what seems to actually happen.

The ast variables show the reverse sequence:
1. a bast PR was queued (b 3)
2. a cast EX was queued (c 5)

Because the previous cast was for PR (lastc 3), the unnecessary bast PR in step 1 is suppressed (the mode is compatible with the last granted mode), and this debug line is printed:

> dlm: brawl0: skip b 3 c 5 lastc 3 type 3 first 2 bdone 3 cdone 3 da0006        2          3fed16


All of this state is a reduction of multiple callback events into a couple variables, so none of it allows us to see exactly what happened.  For example, one possible explanation of the state would be three callbacks all queued but undelivered:

  lkb_ast_first = 0
  lkb_ast_type = 0
dlm_add_ast(BAST, PR)
  lkb_ast_first = BAST
  lkb_ast_type = BAST
  lkb_bastmode = PR
dlm_add_ast(CAST, EX)
  lkb_ast_first = BAST
  lkb_ast_type = BAST | CAST
  lkb_castmode = EX
dlm_add_ast(BAST, PR)
  lkb_ast_first = BAST
  lkb_ast_type = BAST | CAST | BAST
  lkb_bastmode = PR

I don't know if there's an actual sequence of lock requests that would produce this sequence of callbacks, but if there were, the correct order of the callbacks would be reversed, as we seem to observe.

The current code for handling these problems is relatively new and was done for bug 568102.  It apparently didn't entirely solve the problem.  My plan for fixing this again is to stop condensing multiple callbacks into a small number of variables and instead save each callback individually and in full.  It will not be a small or trivial change, but will hopefully finally solve this properly.

Anyone wondering if this bug could explain the problems they are seeing should look for lock state that matches what is shown in comment 25 and comment 30.  If a matching pattern is not found, then the problem is something else.
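
To make the failure mode above easier to follow, here is a minimal stand-alone C sketch of the condensed-callback bookkeeping.  This is not the kernel code: the struct, helper names and the simplified compatibility matrix are illustrative stand-ins, and only the mode numbering (NL=0, CR=1, CW=2, PR=3, PW=4, EX=5) and the bast-suppression test are taken from the debug output and patch above.

#include <stdbool.h>
#include <stdio.h>

/* DLM lock modes; the numeric values line up with the "b"/"c"/"lastc"
 * numbers in the debug line above. */
enum { NL, CR, CW, PR, PW, EX };

/* Rough stand-in for the mode compatibility matrix: two modes are
 * compatible if a holder in one would not block a holder in the other. */
static bool modes_compat(int a, int b)
{
        static const bool compat[6][6] = {
                /*        NL CR CW PR PW EX */
                /* NL */ { 1, 1, 1, 1, 1, 1 },
                /* CR */ { 1, 1, 1, 1, 1, 0 },
                /* CW */ { 1, 1, 1, 0, 0, 0 },
                /* PR */ { 1, 1, 0, 1, 0, 0 },
                /* PW */ { 1, 1, 0, 0, 0, 0 },
                /* EX */ { 1, 0, 0, 0, 0, 0 },
        };
        return compat[a][b];
}

/* Condensed per-lock callback state, modelling the pre-fix approach of
 * reducing any number of queued callbacks to a few variables rather than
 * keeping each one in full.  Field names are illustrative only. */
struct lkb_model {
        int  castmode;       /* mode of the pending completion callback */
        int  bastmode;       /* mode of the pending blocking callback   */
        int  castmode_done;  /* mode of the last delivered completion   */
        bool cast_pending;
        bool bast_pending;
};

static void queue_cast(struct lkb_model *lkb, int mode)
{
        lkb->castmode = mode;           /* overwrites any earlier value */
        lkb->cast_pending = true;
}

static void queue_bast(struct lkb_model *lkb, int mode)
{
        lkb->bastmode = mode;           /* a second bast overwrites the first */
        lkb->bast_pending = true;
}

/* Delivery pass.  The suppression test mirrors the one in the debugging
 * patch: a pending bast is dropped if its mode is compatible with the last
 * *delivered* grant ("lastc"), even when a newer EX grant is being
 * delivered in the same pass. */
static void deliver(struct lkb_model *lkb)
{
        int last_castmode = lkb->castmode_done;

        if (lkb->bast_pending && modes_compat(lkb->bastmode, last_castmode)) {
                printf("skip b %d c %d lastc %d\n",
                       lkb->bastmode, lkb->castmode, last_castmode);
                lkb->bast_pending = false;
        }
        if (lkb->cast_pending) {
                printf("deliver cast %d\n", lkb->castmode);
                lkb->castmode_done = lkb->castmode;
                lkb->cast_pending = false;
        }
        if (lkb->bast_pending) {
                printf("deliver bast %d\n", lkb->bastmode);
                lkb->bast_pending = false;
        }
}

int main(void)
{
        /* The lock was last granted in PR before this sequence began. */
        struct lkb_model lkb = { .castmode_done = PR };

        /* bast(PR) - cast(EX) - bast(PR), as in the speculated sequence
         * above; the second bast collapses onto the first, and the one
         * surviving bast is judged redundant against the old PR grant. */
        queue_bast(&lkb, PR);
        queue_cast(&lkb, EX);
        queue_bast(&lkb, PR);
        deliver(&lkb);
        return 0;
}

Run on its own, this prints "skip b 3 c 5 lastc 3" followed by "deliver cast 5": the EX grant is delivered, but the blocking callback behind it is dropped, which is the hang pattern seen in the lock dumps.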

Comment 32 David Teigland 2011-02-21 21:30:34 UTC
Created attachment 479998 [details]
patch

This patch reworks how the dlm handles callbacks.  I hope it will finally fix this persistent problem completely where previous attempts didn't.  It's a big change to a sensitive area, so the regression risk is high, and I'm trying to go to extra lengths in testing this.

Comment 33 Steve Whitehouse 2011-02-22 14:09:46 UTC
I'm a bit confused by the patch. Upstream a patch was recently sent, and the following discussion ensued:

http://groups.google.com/group/linux.kernel/browse_thread/thread/8cf9f954ef1caf91

I thought at the time that this change was aimed at fixing this bug, but RHEL6 still has a multithreaded dlm_recv. So I wonder whether that has been left out, or whether it only applies to upstream for some reason or other? Surely if it is a bug upstream it should be fixed in RHEL6, and I don't see any other DLM RHEL6 bugs which might cover it.

I also agree with Chrissie's email comments, wrt the array. Also, what is the actual bug which is being fixed? Are callbacks really being delivered in an incorrect order, and if so, is that because they are arriving in the wrong order or is it the queuing which is causing the ordering problem?

The code appears to imply that the problem is the queuing, and I'm left wondering why there is a need to rewrite the queuing code rather than simply remove it, which would seem to be the simpler option.

Comment 34 David Teigland 2011-02-22 15:08:39 UTC
Steve, you've really gotten confused; let me try to straighten this out.

You're mixing up two completely different, separate, unrelated issues:
1. This bugzilla
2. send/recv workqueue threading

1. The subject of this bugzilla is the hang that Nate can reproduce when running brawl.  This is a very specific hang, characterized by a very specific lock state that you can observe in a lock dump.  We have now identified the cause of this hang: it is a dlm bug where the dlm suppresses a blocking callback because it thinks it is unnecessary due to the preceding completion callback.  The following output confirms this analysis:

> dlm: brawl0: skip b 3 c 5 lastc 3 type 3 first 2 bdone 3 cdone 3 da0006        2          3fed16

So the reason for the hang is that the dlm is not sending a blocking callback to gfs when it should, so gfs is never releasing its EX lock, causing everyone else to wait indefinitely for it.  So the mistake the dlm is making is pretty clear.  You can read in comment 31 why I think the current dlm code gets confused into making this mistake.

My hope is that this patch will finally fix this problem (we've tried before unsuccessfully, see bug 568102.)  I appreciate Chrissie's feedback about lists vs arrays and replied to that separately.  Note that we have not yet verified that this patch does fix the problem, so we may not be done yet.

I think you can see from that description that this bug has nothing to do with send or recv workqueues.


2. You recently sent a patch upstream that made changes to the dlm send/recv workqueues, to make use of the new cmwq features.  Your patch, as far as I know, was not intended to fix any known bug or problem, but was meant to take advantage of the new efficiencies of the on-demand workqueue threads.  I think that this is a good idea, so I accepted your patch.

However, once I began doing some tests (parallel finds on the fs), I discovered a new regression that was quickly reproducible.  The problem was that the receive routines (at least two different ones) would call find_lkb and an existing but incorrect lkb would be returned.  The dlm code would notice this error, print an error message, and abort processing the message it had received.

After debugging this for a while, I discovered that your workqueue patch was the cause of the regression.  In particular it was the multi-threaded nature of your patch that caused the regression -- when I changed the send/recv workqueues to be single threaded the problem went away.  I considered this minor change a better option than reverting your entire patch.

I realize that previously the recv workqueues were multithreaded (who could not notice the excessive dlm_recv/1 - dlm_recv/32 on 32 core machines), and I cannot explain why this same problem did not appear before.  My suspicion is that the previous workqueue code did not introduce the same degree of threaded concurrency as the new cmwq code, but I do not know.  My suspicion about why concurrent dlm_recv threads may exhibit the problems I saw is the shared "stub" structures in the lockspace struct.  All of this is speculation.  I'd like to get multithreaded workqueues to work for send/recv, but it's further down my list than bugs like this one.

Comment 35 Steve Whitehouse 2011-02-22 15:42:55 UTC
Ok, that is fair enough, but if it is only the recv workqueue that was causing a problem, why did the send workqueue also get reverted? And that still doesn't answer my question about RHEL6. Are you intending to backport the workqueue change to RHEL6, and if not, why not?

Also, it doesn't address the point: why try to rewrite the code rather than adopting my patch to remove it, and thus prevent any possible reordering during the queuing process?

Comment 36 David Teigland 2011-02-22 16:50:11 UTC
> but why if it is only the recv workqueue that was causing a problem did the send workqueue also get reverted?

Because I didn't take the time to test send and recv separately.

> Are you intending to back port the workqueue change to RHEL6 and if not, why not?

Yes, my current plan is to make RHEL6 match upstream as closely as possible.

> Also, it doesn't address the point of why try to rewrite the code rather
> adopting my patch to remove it, and thus prevent any possible reordering during
> the queuing process?

- Because such a change is far too big, invasive, disruptive, risky, etc.

- The dlm is designed in innumerable unknown and subtle ways around the assumption of a dedicated callback thread.

- It's a fundamental change in the stable, long standing behavior of the dlm.

- I do not make broad, sweeping changes to "fix" bugs.  I fix bugs in as narrow, limited and targeted a way as possible.

Removing the callback thread is simply out of the question, as I've said before.  In the future I would be open to introducing a new API with newly defined callback semantics and arguments, based on its own merits and measurable benefits.  At that time we can consider where it may be possible to call functions directly without an intermediate callback thread.

Comment 39 David Teigland 2011-02-23 16:33:34 UTC
Steve, I've already spent a great deal of time explaining this.  I stand by what I've said and don't have anything more to add.

Comment 40 David Teigland 2011-02-23 16:50:43 UTC
Getting back to reality, here's a kernel build with the fix which Nate asked for:
https://brewweb.devel.redhat.com/taskinfo?taskID=3132323

It looks like Nate will be on vacation for the next couple weeks, so we'll have to wait a bit for him to try the original test case that demonstrated the problem.

My own tests over the last two days have gone very well and shown no problems.

Comment 41 Steve Whitehouse 2011-02-24 11:21:50 UTC
Some test results using an upstream kernel, 2.6.38-rc1+, with varying combinations of patches. The tests were conducted using my dlmtest module, which is designed to test the raw lock conversion speed (200k conversions, 100k up and 100k down) of the DLM on a single-node basis. The module was compiled once and then used for all tests. The results are in mean microseconds per lock conversion based on an average of 8 runs of the test. The individual run times were fairly stable overall.

Baseline (2.6.38-rc1+)                15.89 us/lock    0%
With wq patch from upstream           16.36 us/lock   +3%
With wq patch and comment #32 patch   16.38 us/lock   +3%
With wq patch and my "no astd" patch   1.93 us/lock  -87%
With just my "no astd" patch           1.93 us/lock  -87%

The percentages given are relative to the baseline.

Note that the wq patch is this one from upstream: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=6b155c8fd4d239f7d883d455bbad1be47724bbfc which was only added after 2.6.38-rc1+.

Note that the above results are not comparable with a previous set I posted a while ago because those were gathered on a different test machine and with a different baseline kernel.

Since the test is both single threaded and local, I didn't really expect to see any differences relating to the wq patch, and I was surprised to see the 3% difference. I don't have a multi-threaded test for dlm, but I'd expect to see greater differences in the multi-threaded case.

I have some more tests to do, but the above are the easy ones to do, so I did them first.
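
For reference, here is a rough sketch of the timing arithmetic described above (one run of 200k conversions reported as mean microseconds per conversion). It is not the dlmtest module itself, which is a kernel module; do_conversion() is a placeholder for the real DLM up/down conversion calls and their completion callbacks, and the figures in the table were additionally averaged over 8 such runs.

#include <stdio.h>
#include <time.h>

#define UP_DOWN_PAIRS 100000L   /* 100k up + 100k down = 200k conversions */

/* Placeholder for one lock conversion (e.g. up or down).  In the real test
 * this is where the DLM conversion request and the wait for its completion
 * callback would go. */
static void do_conversion(void)
{
        /* ... */
}

int main(void)
{
        struct timespec start, end;
        long conversions = 2L * UP_DOWN_PAIRS;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (long i = 0; i < UP_DOWN_PAIRS; i++) {
                do_conversion();        /* convert up   */
                do_conversion();        /* convert down */
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        double us = (end.tv_sec - start.tv_sec) * 1e6 +
                    (end.tv_nsec - start.tv_nsec) / 1e3;
        printf("%.2f us/lock over %ld conversions\n",
               us / conversions, conversions);
        return 0;
}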

Comment 42 David Teigland 2011-03-02 20:17:28 UTC
Created attachment 481937 [details]
updated patch

Minor updates to previous patch.

New build
https://brewweb.devel.redhat.com/taskinfo?taskID=3151955

Comment 43 Nate Straz 2011-03-14 14:35:25 UTC
I built and tested a -119.el6 based kernel with patches from 635041 and 682951.  I ran on two clusters for about 4 days and was not able to hit either issue.  Both patches have passed QE testing.

Comment 45 Steve Whitehouse 2011-03-16 13:02:16 UTC
Dave, did you intend to set devel ack?

Comment 46 Ric Wheeler 2011-03-17 19:11:37 UTC
*** Bug 635762 has been marked as a duplicate of this bug. ***

Comment 47 RHEL Program Management 2011-03-18 15:29:57 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 48 Nate Straz 2011-03-25 19:24:36 UTC
Bumping up the priority since I've hit this twice while testing the -250 kernel.  This hangs our GFS2 regression tests, thus taking us longer to run through testing.

Comment 49 Aristeu Rozanski 2011-03-30 14:23:15 UTC
Patch(es) available on kernel-2.6.32-128.el6

Comment 53 Nate Straz 2011-04-04 14:05:47 UTC
Made it through a regression run without hitting this bug with the -128.el6 kernel.

Comment 58 errata-xmlrpc 2011-05-23 20:52:36 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html

Comment 60 Martin Prpič 2011-08-18 14:42:19 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
When the Distributed Lock Manager (DLM) queued three callbacks for a lock
in the following sequence: blocking - completion - blocking, it would
consider the final blocking callback redundant and skip it. Because the
callback was skipped, GFS would not release the lock, causing processes on
other nodes to wait indefinitely for it. With this update, the DLM does not
skip the necessary blocking callback.

