Bug 659798

Summary: GFS2 lock requests not being passed back to DLM
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.7
Reporter: Adam Drew <adrew>
Assignee: Steve Whitehouse <swhiteho>
QA Contact: Cluster QE <mspqa-list>
Status: CLOSED DUPLICATE
Severity: high
Priority: urgent
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
CC: adas, ahecox, bmarzins, cluster-maint, fhirtz, liko, rpeterso, sbradley, swhiteho, teigland
Doc Type: Bug Fix
Last Closed: 2011-01-18 17:14:34 UTC
Attachments:
- DLM, glock, messages, and cluster data.
- 12.02 issue data
- 12.01 issue data

Description Adam Drew 2010-12-03 17:32:25 UTC
Description of problem:
We are running an FTP server against a GFS2 filesystem that is mounted on 2 nodes at the same time. This is active/active FTP, with each node serving the data over FTP to clients that may connect to either node.

After some interval of time, one or more of the FTP processes "hangs" and falls into D state. The process remains in that state until one or more nodes are rebooted; the cluster has been left in this state for upwards of 24 hours during one test.

Systems are running kernel 2.6.18-223, as provided as a hotfix for BZ643165/BZ595397. After observing this new issue, we obtained a different hotfix with a smaller change delta between the supported kernel and the fix, and ran on 2.6.18-192.28.1. The issue also occurs on that kernel. On the supported 5.5 kernel this issue does not occur, but BZ595397 does.

glocks, DLM, and traces during a hang:
node1:
G:  s:UN n:3/69ffac f:lrq t:EX d:EX/0 l:0 a:0 r:7
 H: s:EX f:tW e:0 p:21944 [ftp] gfs2_inplace_reserve_i+0x269/0x6b0 [gfs2]
 H: s:EX f:tW e:0 p:17435 [pmdtm] gfs2_inplace_reserve_i+0x269/0x6b0 [gfs2]
 H: s:EX f:tW e:0 p:5787 [pmdtm] gfs2_inplace_reserve_i+0x269/0x6b0 [gfs2]

<3          69ffac not found in DLM on node 1>

node2:
G:  s:EX n:3/69ffac f:yq t:EX d:EX/0 l:0 a:0 r:3
 R: n:6946732 f:04 b:9/9 i:4
 
Resource ffff810a0fa4ddc0 Name (len=24) "       3          69ffac"  
Master Copy
Granted Queue
020d41fb EX
Conversion Queue
Waiting Queue

Dec  1 11:27:01 podetlsapp01 kernel: ftp           D ffffffff80153676     0 21944  21941                     (NOTLB)
Dec  1 11:27:01 podetlsapp01 kernel:  ffff8106fd78b9e8 0000000000000086 0000000000000000 ffff8106776ed800
Dec  1 11:27:01 podetlsapp01 kernel:  0000000000000296 0000000000000007 ffff810bc1a427e0 ffff810695335820
Dec  1 11:27:01 podetlsapp01 kernel:  0003e5a5e6982299 0000000000003da7 ffff810bc1a429c8 00000009886e2e5f
Dec  1 11:27:01 podetlsapp01 kernel: Call Trace:
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff8870cf54>] :gfs2:just_schedule+0x0/0xe
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff8870cf5d>] :gfs2:just_schedule+0x9/0xe
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff800639f6>] __wait_on_bit+0x40/0x6e
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff8870cf54>] :gfs2:just_schedule+0x0/0xe
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff80063a90>] out_of_line_wait_on_bit+0x6c/0x78
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff800a28c4>] wake_bit_function+0x0/0x23
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff8870cf4f>] :gfs2:gfs2_glock_wait+0x2b/0x30
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff88721b40>] :gfs2:gfs2_inplace_reserve_i+0x271/0x6b0
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff80022242>] __up_read+0x19/0x7f
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff8870cdfc>] :gfs2:do_promote+0xfa/0x188
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff8012db96>] selinux_capable+0x15/0x2a
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff88716549>] :gfs2:gfs2_write_begin+0x171/0x343
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff88717c4a>] :gfs2:gfs2_file_buffered_write+0xf3/0x26c
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff88718070>] :gfs2:__gfs2_file_aio_write_nolock+0x2ad/0x2e4
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff88718212>] :gfs2:gfs2_file_write_nolock+0xaa/0x10f
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff800a2896>] autoremove_wake_function+0x0/0x2e
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff8012d9be>] file_has_perm+0x94/0xa3
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff88718362>] :gfs2:gfs2_file_write+0x49/0xa7
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff80016af0>] vfs_write+0xce/0x174
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff800173a8>] sys_write+0x45/0x6e
Dec  1 11:27:01 podetlsapp01 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Version-Release number of selected component (if applicable):
2.6.18-223
2.6.18-194.28.1

How reproducible:
Intermittent
Seemingly random
Cannot be reproduced in staging / lab environment


Steps to Reproduce:
1. Run active/active FTP over GFS2
2. Observe hang
  
Actual results:
Client connections hang and FTP goes into D state. Locks and DLM appear to hint at a callback request to DLM either not going out or not being received.

Expected results:
Waiters are granted if there are no holders blocking. 
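
For reference, a minimal sketch of that grant rule, assuming placeholder LM_* names rather than the actual gfs2/dlm lock-mode definitions:

/* Sketch of the expected grant rule; the LM_* names and the
 * compatibility table are illustrative, not the RHEL5 gfs2/dlm source. */
enum lock_mode { LM_UN, LM_SH, LM_EX };  /* unlocked, shared, exclusive */

/* Shared locks coexist; an exclusive lock conflicts with any other. */
static int modes_compatible(enum lock_mode held, enum lock_mode wanted)
{
        if (held == LM_UN || wanted == LM_UN)
                return 1;
        return held != LM_EX && wanted != LM_EX;
}

/* A waiter should be granted when no current holder blocks it. */
static int waiter_grantable(const enum lock_mode *held, int nheld,
                            enum lock_mode wanted)
{
        int i;

        for (i = 0; i < nheld; i++)
                if (!modes_compatible(held[i], wanted))
                        return 0;
        return 1;
}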

Additional info:
I wonder if this could be a different manifestation of BZ656032. Could both of these observed symptoms point to a similar underlying cause but present differently due to differences in workload and environment? I wonder simply because they both look like possible DLM/glock communication problems and they both appeared after the 2.6.18-223 kernel was provided to relieve the symptoms of BZ595397.

Comment 1 Adam Drew 2010-12-03 17:33:16 UTC
Created attachment 464621 [details]
DLM, glock, messages, and cluster data.

Comment 2 Adam Drew 2010-12-03 17:34:29 UTC
The messages logs were cut off without glock_workqueue traces. I am getting newer copies of the logs that should have those in them.

Comment 3 Adam Drew 2010-12-03 17:50:21 UTC
Got new logs from the customer with the full sysrq -t. In common with BZ656032
there are no glock_workqueue traces from the time of the hang. A sysrq -t
gathered earlier that day does contain glock_workqueue traces but the sysrq -t
gathered during the issue does not. This is the same situation as BZ656032.

Comment 8 Steve Whitehouse 2010-12-07 10:22:35 UTC
try_rgrp_fit() returns 1 if the alloc fits, and 0 if it does not. I can't see anything wrong with that.
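
To make that convention concrete, an illustrative stub of the return contract (the struct fields are placeholders, not the actual gfs2_rgrpd or gfs2_alloc definitions):

/* Illustrative stub of try_rgrp_fit()'s return convention; placeholder
 * fields, not the RHEL5 gfs2 structures. */
struct rgrp_stub  { unsigned int free_blocks; };
struct alloc_stub { unsigned int requested; };

static int try_rgrp_fit_stub(const struct rgrp_stub *rgd,
                             const struct alloc_stub *al)
{
        /* 1: the allocation fits in this resource group;
         * 0: the caller should move on to the next resource group. */
        return al->requested <= rgd->free_blocks;
}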

The real question is why (see the opening comment) there appears to be no reference to the dlm lock on node1, when there are obviously queued holders. node2 is caching that lock, but with no sign of any demote request having been received.

Bearing in mind that in RHEL5 it is possible for DLM locks to be reclaimed if they are not being held by the glock layer, the absence of any dlm lock in the dump may simply mean that it was reclaimed in the meantime.

The flags on the glock say l (locked), r (reply pending) and q (queued). Of these the most interesting is r (reply pending) since that should mean that the dlm has responded to our request for an EX lock (see target state) and that all we are waiting for is the glock_workqueue to process the reply.
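
As a reading aid for the f: field in the dumps above, those letters map to per-glock state bits along these lines (names modeled on gfs2's GLF_* flags; the exact set here is an assumption, not the RHEL5 source):

/* Reading aid for the glock dump f: field; modeled on gfs2's GLF_* bits,
 * names and values illustrative only. */
enum glock_flag_bits {
        GLF_LOCK,           /* 'l': glock state machine is in progress  */
        GLF_REPLY_PENDING,  /* 'r': dlm reply waiting for the workqueue */
        GLF_QUEUED,         /* 'q': one or more holders are queued      */
};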

This is at odds with the dlm though, since if we've requested an EX lock and a reply has been sent, then there ought to be some local reference to the dlm lock, even if the reply is an error notification for some reason.

Also we need to figure out whether the real issue is that the glock_workqueue is stuck somehow, and that this is the problem which is preventing the glock from being granted.

Comment 9 Steve Whitehouse 2010-12-07 10:57:45 UTC
All glock_workqueue threads seem to be in 'S' state, indicating that they are not stuck, but are sitting idle.

Comment 15 Steve Whitehouse 2010-12-08 16:23:23 UTC
Looking at __generic_block_fiemap() I'm suspicious that in the big do loop, there seems to be nothing updating the "len" variable. Surely the len ought to get shorter as progress is made....
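
As a self-contained toy of why that matters (placeholder names, not the __generic_block_fiemap source): if the remaining length is never reduced as extents are mapped, the loop's exit condition can never be reached.

#include <stdio.h>

/* Toy model of the suspected loop: 'len' must shrink as extents are
 * mapped, or 'while (len > 0)' never becomes false. */
int main(void)
{
        unsigned long long start = 0, len = 1 << 20, extent = 4096;

        do {
                start += extent;  /* progress through the file */
                len -= extent;    /* the update that appears to be missing;
                                     without it this loop never exits */
        } while (len > 0);

        printf("terminated at start=%llu\n", start);
        return 0;
}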

Comment 25 Frank Hirtz 2010-12-15 17:36:19 UTC
Created attachment 468910 [details]
12.02 issue data

Comment 26 Frank Hirtz 2010-12-15 17:39:36 UTC
Created attachment 468913 [details]
12.01 issue data

Comment 27 Frank Hirtz 2010-12-15 17:44:31 UTC
Hi Steve,

There really haven't been new datasets as the client fell back to the old, pre-hotfix kernel, which doesn't exhibit the issue. They sent over a dataset from that setup on Monday, but it was just the original problem, not this one. I've attached the previous 2 datasets to this ticket so you can have a look.

Also, the client is quite interested in providing a webex session so we can see/interact with a cluster that's reproducing the issue. Is there any utility in that to you? We've been working on an internal reproduction case (which has been going slowly), but aside from the dataset, I don't have anything that I would need to look for on a live host.

As mentioned, the reproducer work here has been going slowly and the workload is fairly complex as you can see from the previous notes, so if there's anything that we could do or get in parallel which might help, please let me know and I'll see about getting it.

Thanks again,

Frank.

Comment 32 Robert Peterson 2010-12-22 16:36:32 UTC
I recommend, as a test, that we bring them up to a newer,
pre-release kernel 2.6.18-232 or newer, to pick up the recent
dlm performance patch for bug #604139.  Then let me know which
kernel they're running, and I'll build a test version of my
latest and greatest patch for bug #650494.  In the best case,
their performance should improve because these are both
performance-related patches.  However, if it's not a performance
issue and they still get stuck, then we can get some lock dumps
and call traces to debug the problem.

Comment 37 Steve Whitehouse 2011-01-15 11:20:51 UTC
Comment #36 shows that two glocks are being held by processes creating inodes and that the creation process appears to be stuck. The answer will probably be found in the stack traces of the processes concerned. I'll take a look shortly; however, this is worth comparing with #656032 since they look similar in some respects.

Comment 38 Steve Whitehouse 2011-01-15 16:00:59 UTC
Looking at the comment #36 info, I'm assuming that it is from node 2, since pid 8966 is not a pmdtm process on node 1. From the node 2 messages file:

Jan 11 18:27:47 poq6etlsapp03 kernel: INFO: task pmdtm:8966 blocked for more than 120 seconds.
Jan 11 18:27:47 poq6etlsapp03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 11 18:27:47 poq6etlsapp03 kernel: pmdtm         D ffffffff80153676     0  8966  12285          8967  8965 (NOTLB)
Jan 11 18:27:47 poq6etlsapp03 kernel:  ffff810629f0da58 0000000000000082 ffff81063ffe0340 0000000000010243
Jan 11 18:27:47 poq6etlsapp03 kernel:  0000000000000050 0000000000000007 ffff81062bee3860 ffff810680060820
Jan 11 18:27:47 poq6etlsapp03 kernel:  00000ff79a6f2c25 00000000000068c3 ffff81062bee3a48 0000000100000000
Jan 11 18:27:47 poq6etlsapp03 kernel: Call Trace:
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff88575da7>] :gfs2:gfs2_getbuf+0x106/0x115
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff885714d3>] :gfs2:iget_test+0x0/0x1c
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff800512de>] __wait_on_freeing_inode+0x82/0xa0
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff800a28c4>] wake_bit_function+0x0/0x23
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff80025c55>] find_or_create_page+0x22/0x72
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff8004fb37>] find_inode+0x3d/0x6c
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff885714d3>] :gfs2:iget_test+0x0/0x1c
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff885714d3>] :gfs2:iget_test+0x0/0x1c
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff80059d3c>] ifind+0x34/0x96
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff800eeda2>] iget5_locked+0x6c/0x183
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff885714ef>] :gfs2:iget_set+0x0/0x1d
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff88571257>] :gfs2:gfs2_inode_lookup+0x40/0x21b
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff88572205>] :gfs2:gfs2_createi+0x9d6/0xd2c
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff801395e9>] security_compute_sid+0x307/0x329
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff8857d08c>] :gfs2:gfs2_create+0x65/0x143
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff88571892>] :gfs2:gfs2_createi+0x63/0xd2c
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff8856f756>] :gfs2:gfs2_glock_nq_num+0x3b/0x68
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff8003a6d6>] vfs_create+0xe6/0x158
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff8001b229>] open_namei+0x19d/0x712
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff80027527>] do_filp_open+0x1c/0x38
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff80019fb5>] do_sys_open+0x44/0xbe
Jan 11 18:27:47 poq6etlsapp03 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0

So it looks like it is stuck waiting for an inode to be freed, which is another indicator that this could be a dup of bug #656032.

Comment 39 Steve Whitehouse 2011-01-18 10:23:46 UTC
Looks increasingly like a dup of #656032

Comment 40 Steve Whitehouse 2011-01-18 17:14:34 UTC

*** This bug has been marked as a duplicate of bug 656032 ***