Description of problem:

$ ls /var/opt/nokia/oss/global/tktcif/work/import

hangs in uninterruptible sleep when run on the node petcs1. Listing other directories on the same file system works fine, and listing the same file system from other nodes also works fine. It takes about a week for this to occur; the sysrq-t data and vmcores show the ls commands hanging.

Version-Release number of selected component (if applicable):
version: RHEL 4u6
kernel: 2.6.9-78.0.8.ELsmp/largesmp
GFS version: GFS-6.1.15-3-x86_64

How reproducible:
This has happened twice so far; both occurrences came within a week of load testing. They are using a simulator to test the load of their environment when it goes into production.

Steps to Reproduce:
1. Mount the GFS file system and run the load testing (for a week or so)
2. $ ls /var/opt/nokia/oss/global/tktcif/work/import

Actual results:
The ls hangs, and the node must be rebooted to recover.

Expected results:
An "ls" of a directory, or any other operation on a GFS directory, should not hang the machine.

Additional info:
Vmcores have been uploaded and processed; details to follow. rpeterson is already looking at this issue.
Lots of processes are waiting for this glock, which I assume is a directory dinode:

Glock (2, 6941334)
  gl_flags = 1
  gl_count = 52
  gl_state = 1
  req_gh = yes
  req_bh = yes
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 0
  ail_bufs = no
  Request
    owner = -1
    gh_state = 0
    gh_flags = 0
    error = 0
    gh_iflags = 2 4 5
  Waiter2
    owner = -1
    gh_state = 0
    gh_flags = 0
    error = 0
    gh_iflags = 2 4 5
  Waiter3
    owner = 18321
    gh_state = 3
    gh_flags = 3
    error = 0
    gh_iflags = 1
  Waiter3
    owner = 8381
    gh_state = 1
    gh_flags = 5
    error = 0
    gh_iflags = 1
  Waiter3
    owner = 8378
    gh_state = 3
    gh_flags = 3
    error = 0
    gh_iflags = 1
  (lots more)

According to dlm, the lock is being held in PR mode by two processes, one on petcs1 and another on petds1.

petcs1.279813-810215/dlm_locks_vgglobal-lvol3:

Resource 000001040cbb63b8 (parent 0000000000000000). Name (len=24) " 2 69ea96"
Local Copy, Master is node 4
Granted Queue
0c5900a3 PR 27847 Master: fe640289
Conversion Queue
Waiting Queue

petcs2:

Resource 000001038787a5d8 (parent 0000000000000000). Name (len=24) " 2 69ea96"
Master Copy
LVB: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Granted Queue
0d6a01e1 PR 27491 Remote: 2 398b02b8
fe640289 PR 27847 Remote: 5 0c5900a3
Conversion Queue
Waiting Queue

Resource 0000010ed8a41598 (parent 0000000000000000). Name (len=24) " 2 69ea96"
Local Copy, Master is node 4
Granted Queue
398b02b8 PR 27491 Master: 0d6a01e1
Conversion Queue
Waiting Queue
So it looks like the node whose waiting processes appear in the glock dump of comment #4 is trying to demote the glock. Strangely, there are two requests for this: one actively being processed and one sitting in the queue. Currently it thinks it has the lock in EX, but I presume that at least some of the demote must already have happened, based on the dlm lock dump, and then it got stuck. I'd like to see a stack trace from the process doing the demote of the glock in comment #4, as I suspect there might be a lock-ordering issue somewhere. There was something with similar results in GFS2 recently, where the transaction lock was being taken out of order during demotion. Can we confirm that once a stuck node is rebooted it then works OK again with no after-effects? That would be useful to know.
Indeed, after rebooting the one node where the directory is stuck (petcs1) everything works fine with no after-effects.
Created attachment 339788 [details]
some structures from crash

The information in the petcs1 glock dump is not what the (2, 6941334) glock looked like at the time of the crash dump. Attached are some of the structures pulled from the crash dump, with a few notes.
From what I see, petcs1 is hanging because it is waiting for a callback from the dlm which never comes. Every indication is that it did send off a demote request to the dlm, and the dlm should send it a callback when that demote request completes. All the lock_dlm1 and lock_dlm2 threads look fine, so they should have no problem processing the callback when it arrives.

The one thing that looks really wrong is that GFS thinks the lock is in the exclusive state and the active demote request is to the unlocked state, but according to the dlm lock dumps, the node has the lock in the shared state.

If there is already a queued demote request and handle_callback() gets a request for a different state, it simply changes the requested state of the queued demote to unlocked. There could be a window where, if there is a queued demote to shared and you change it to unlocked, something goes wrong. This would explain the difference in states. I haven't seen anything like that, but I haven't looked very hard yet. I'll take another look tomorrow.
Looking at the superblock,

sd_lm_outstanding = {
  counter = 1
},

there is one outstanding dlm call. I can't be sure that this is for the demote request, but it looks pretty likely. I couldn't find any race in handle_callback() changing the demote state that could have caused this.
Created attachment 339953 [details]
dlm lockspace and resource structure

This is the dlm lockspace and resource structure for the (2, 6941334) glock on petcs1 at the time of the crash dump. The important bit of information is that the Grant Queue is empty for the lock, which means that at the time of the crash dump, the dlm did not think the lock was in the shared state on petcs1. It thought that the lock was unlocked.
Created attachment 340054 [details]
gl->gl_lock structure for (2, 6941334) on petcs1
Key entries from /var/log/messages on the stuck node:

Mar 25 19:04:07 petcs1 kernel: dlm: vgglobal-lvol3: process_lockqueue_reply id b9b6013d state 0
Mar 25 22:00:06 petcs1 kernel: dlm: vgglobal-lvol3: process_lockqueue_reply id d9c3015d state 0
Mar 25 22:00:07 petcs1 kernel: dlm: vgglobal-lvol3: cancel reply ret 0
Mar 25 22:00:07 petcs1 kernel: lock_dlm: unlock sb_status 0 2,69ea96 flags 0
Mar 25 22:00:07 petcs1 kernel: dlm: vgglobal-lvol3: process_lockqueue_reply id d542037d state 0
It looks like the unlock reply is being mixed up with a separate message/reply, much like bug 349001. The dlm sends an unlock to the master, then a message/reply comes back for this lock which the sender assumes is the unlock reply; when it sees the result of 0 instead of the expected -EUNLOCK it prints:

dlm: vgglobal-lvol3: cancel reply ret 0

The bad result is passed back to lock_dlm, which complains and just drops the callback (leaving gfs waiting forever):

lock_dlm: unlock sb_status 0 2,69ea96 flags 0

Then the actual unlock reply comes back, and since the dlm is no longer waiting for a reply, it prints:

dlm: vgglobal-lvol3: process_lockqueue_reply id d542037d state 0
Studying bug 349001 and the patch for it, there seems to be a small window left for the same problem to occur. The symptoms are consistent and it's the only possibility I can come up with. The problem that commit c92628dcc39e03a4e9eccc4fa76257c871e5ba00 aims to fix is a grant message followed by a convert reply message for the same lock. I think the following sequence of events could still cause that to happen, even though the patch closes the window most of the way (the numbers give the global order of events):

dlm_recv thread:
1. receive convert -- process_cluster_request/GDLM_REMCMD_CONVREQUEST
2. lkb->lkb_request = freq
3. dlm_convert_stage2()
4. lkb can't be granted immediately, so it's put on the convert queue
6. if (lkb->lkb_request != NULL)
9. send convert reply

other thread:
5. unlocks another lkb, finds the lkb above can be granted, calls remote_grant()
7. lkb->lkb_request = NULL
8. send grant message

remote_grant() is supposed to prevent the convert reply from being sent at all by setting lkb_request = NULL. But, given the right race, that doesn't work, and both the grant and convert reply messages are sent (and sent in the bad order). We'd need to add info to the printk messages to confirm this theory. I haven't yet come up with any ideas on how to fix this.
Created attachment 340446 [details]
patch to try

This patch may work around the problem, if we've identified it correctly. I think it's unlikely that the patch would cause a regression to anything that currently works. If we can run a test that shows the new printk and doesn't cause a failure, that will mean we have correctly identified and fixed the problem. Whether we (or they) can produce this condition to verify remains in doubt. I can't say whether we'd be willing to commit the patch to rhel without verifying that it fixes the problem. We should at least do some testing with the patch to check that it does not cause a regression. I'll look into getting a rhel4 cluster set up to do a smoke test with it.
Created attachment 343629 [details]
patch to test

This is the same work-around fix as the previous patch in comment 16, but this version adds some more information to the "process_lockqueue_reply" printk. This additional info may help us understand the reasons for the process_lockqueue_reply messages, which aren't generally harmful but would be nice to understand more fully, and may give us more confidence in the fix.
I do have two rhel4 nodes (xen vm's) with which I'm now running sanity tests of these patches (building test code from source, not rpms).
With the updated patch, right after reboot, the following message turned up on one of the nodes:

May 15 10:13:29 petcs2 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id 300ee state 0 0 0 0

This event sent from IssueTracker by djuran
issue 279813
The new process_lockqueue_reply details don't tell us why we're getting the unexpected message, but they do show that it is not an issue of confusing one reply for another. Messages with non-zero numbers after state may be interesting to see. I'm comfortable enough with this patch to queue it up for a RHEL4 release, at which point our QA will also be able to run with it.
pushed to RHEL4 branch http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=e4a4744a154a6acaf41579f99c0b04be447b8182
Checking the log files of the cluster where the patch is tested, I found the following messages in syslog:

May 27 13:07:01 petds1 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id b9f03c7 state 0 2 0 0
May 30 11:34:53 petds1 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id 2b9a0319 state 0 2 0 0
Jun 2 14:03:26 petds1 kernel: dlm: vgglobal-lvol1: ignore zero unlock reply from 1 4e67019e 2 25563

And on another node of the cluster:

May 27 22:21:45 petcs2 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id 6b6d000f state 0 2 0 0
May 28 17:32:10 petcs2 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id 88c00318 state 0 3 0 0
May 28 21:20:39 petcs2 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id 8eb702f6 state 0 3 0 0
May 29 13:06:14 petcs2 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id a3f10271 state 0 2 0 0
May 30 10:17:15 petcs2 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id c41902bc state 0 3 0 0

And yet another node:

May 25 08:53:02 petcs3 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id 20602ae state 0 1 0 0
May 28 21:53:11 petcs3 kernel: dlm: vgglobal-lvol2: process_lockqueue_reply id 34b02e3 state 0 1 0 0

Is this something to worry about?

One noteworthy thing I found is that the network seems somewhat flaky; the interfaces on petds1 and petds2 are reporting a considerable number of dropped packets. Could this be related? sosreports attached.

This event sent from IssueTracker by djuran
issue 279813
it_file 225911
Thanks, the "ignore zero unlock reply" confirms that the patch is fixing the bug they found. The other process_lockqueue_reply's are harmless as usual; the extra info is somewhat interesting and may help in fixing other related bugs that may appear in the future. Yes, the flaky network could be related and may be the reason they are seeing problems that haven't appeared before.
Moved to POST since there is no built package.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, due to a rare race condition in the DLM's reply handling, the "ls" command run on a Global File System (GFS) directory could hang in uninterruptible sleep. With this update, the race has been closed and the "ls" command no longer hangs.
Verified the patch is included in dlm-kernel-2.6.9-59.4.el4. Nothing was hit during regression runs, and the customer tested the patch.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0269.html