Description of problem:

$ ls /var/opt/nokia/oss/global/tktcif/work/import

hangs in uninterruptible sleep when run on the node petcs1. Listing other directories on the same file system works fine, and listing the same file system from other nodes also works fine. It takes about a week for this to occur; the sysrq-t data and vmcores show the ls commands hanging.

Version-Release number of selected component (if applicable):
version: RHEL 4u6
kernel: 2.6.9-78.0.8.ELsmp/largesmp
GFS version: GFS-6.1.15-3-x86_64

How reproducible:
This has happened twice so far; both occurrences came within a week of load testing. They are using a simulator to test the load of their environment when it goes into production.

Steps to Reproduce:
1. Mount the GFS file system and run the load testing (for a week or so)
2. $ ls /var/opt/nokia/oss/global/tktcif/work/import

Actual results:
The ls hangs, and the node must be rebooted to recover.

Expected results:
An "ls" of a directory, or any other operation on a GFS directory, should not hang the machine.

Additional info:
Vmcores have been uploaded and processed; details to follow. rpeterson is already looking at this issue.
Lots of processes are waiting for this glock, which I assume is a directory dinode:

Glock (2, 6941334)
  gl_flags = 1
  gl_count = 52
  gl_state = 1
  req_gh = yes
  req_bh = yes
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 0
  ail_bufs = no
  Request
    owner = -1
    gh_state = 0
    gh_flags = 0
    error = 0
    gh_iflags = 2 4 5
  Waiter2
    owner = -1
    gh_state = 0
    gh_flags = 0
    error = 0
    gh_iflags = 2 4 5
  Waiter3
    owner = 18321
    gh_state = 3
    gh_flags = 3
    error = 0
    gh_iflags = 1
  Waiter3
    owner = 8381
    gh_state = 1
    gh_flags = 5
    error = 0
    gh_iflags = 1
  Waiter3
    owner = 8378
    gh_state = 3
    gh_flags = 3
    error = 0
    gh_iflags = 1
  (lots more)

According to dlm, the lock is being held in PR mode by two processes, one on petcs1 and another on petds1.

petcs1.279813-810215/dlm_locks_vgglobal-lvol3:

Resource 000001040cbb63b8 (parent 0000000000000000). Name (len=24) " 2 69ea96"
Local Copy, Master is node 4
Granted Queue
0c5900a3 PR 27847 Master: fe640289
Conversion Queue
Waiting Queue

petcs2:

Resource 000001038787a5d8 (parent 0000000000000000). Name (len=24) " 2 69ea96"
Master Copy
LVB: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Granted Queue
0d6a01e1 PR 27491 Remote: 2 398b02b8
fe640289 PR 27847 Remote: 5 0c5900a3
Conversion Queue
Waiting Queue

Resource 0000010ed8a41598 (parent 0000000000000000). Name (len=24) " 2 69ea96"
Local Copy, Master is node 4
Granted Queue
398b02b8 PR 27491 Master: 0d6a01e1
Conversion Queue
Waiting Queue
So it looks like the node whose waiting processes appear in the glock dump of comment #4 is trying to demote the glock. Strangely, there are two requests for this: one actively being processed and one sitting in the queue. Currently it thinks it has the lock in EX, but I presume that at least some of the demote must already have happened, based on the dlm lock dump, and then it got stuck. I'd like to see a stack trace from the process doing the demote of the glock in comment #4, as I suspect there might be a lock-ordering issue somewhere. There was something with similar results in GFS2 recently, where the transaction lock was being taken out of order during demotion. Can we confirm that once a stuck node is rebooted it then works OK again with no after-effects? That would be useful to know.
Indeed, after rebooting the one node where the directory is stuck (petcs1) everything works fine with no after-effects.
Created attachment 339788 [details]
some structures from crash

The information in the petcs1 glock dump is not what the (2, 6941334) glock looked like at the time of the crash dump. Attached are some of the structures pulled from the crash dump, with a few notes.
From what I see, petcs1 is hanging because it is waiting for a callback from the dlm which never comes. Every indication is that it did send off a demote request to the dlm, and the dlm should send it a callback when that demote request completes. All the lock_dlm1 and lock_dlm2 threads look fine, so they should have no problem processing the callback when it arrives.

The one thing that looks really wrong is that GFS thinks the lock is in the exclusive state and the active demote request is to the unlocked state, but according to the dlm lock dumps, the node has the lock in the shared state.

If there is already a queued demote request and handle_callback() gets a request for a different state, it simply changes the requested state of the queued demote to unlocked. There could be a window where, if there is a queued demote to shared and you change it to unlocked, something goes wrong. This would explain the difference in states. I haven't seen anything like that, but I haven't looked very hard yet. I'll take another look tomorrow.
Looking at the superblock,

sd_lm_outstanding = {
  counter = 1
},

there is one outstanding dlm call. I can't be sure that this is for the demote request, but it looks pretty likely. I couldn't find any race in handle_callback() changing the demote state that could have caused this.
Created attachment 339953 [details]
dlm lockspace and resource structure

This is the dlm lockspace and resource structure for the (2, 6941334) glock on petcs1 at the time of the crash dump. The important bit of information is that the Grant Queue is empty for the lock, which means that at the time of the crash dump, the dlm did not think the lock was in the shared state on petcs1. It thought that the lock was unlocked.
Created attachment 340054 [details]
gl->gl_lock structure for (2, 6941334) on petcs1
Key entries from /var/log/messages on the stuck node:

Mar 25 19:04:07 petcs1 kernel: dlm: vgglobal-lvol3: process_lockqueue_reply id b9b6013d state 0
Mar 25 22:00:06 petcs1 kernel: dlm: vgglobal-lvol3: process_lockqueue_reply id d9c3015d state 0
Mar 25 22:00:07 petcs1 kernel: dlm: vgglobal-lvol3: cancel reply ret 0
Mar 25 22:00:07 petcs1 kernel: lock_dlm: unlock sb_status 0 2,69ea96 flags 0
Mar 25 22:00:07 petcs1 kernel: dlm: vgglobal-lvol3: process_lockqueue_reply id d542037d state 0
It looks like the unlock reply is being mixed up with a separate message/reply, much like bug 349001. The dlm sends an unlock to the master, then a message/reply comes back for this lock which the sender assumes is the unlock reply; when it sees the result of 0 instead of the expected -EUNLOCK it prints:

dlm: vgglobal-lvol3: cancel reply ret 0

The bad result is passed back to lock_dlm, which complains and just drops the callback (leaving gfs waiting forever):

lock_dlm: unlock sb_status 0 2,69ea96 flags 0

Then the actual unlock reply comes back, and since the dlm is no longer waiting for a reply, it prints:

dlm: vgglobal-lvol3: process_lockqueue_reply id d542037d state 0
Studying bug 349001 and the patch for it, there seems to be a small window left for the same problem to occur. The symptoms are consistent and it's the only possibility I can come up with. The problem that commit c92628dcc39e03a4e9eccc4fa76257c871e5ba00 aims to fix is a grant message followed by a convert reply message for the same lock. I think the following sequence of events could still cause that to happen, even though the patch closes the window most of the way (the numbers give the global order of events):

dlm_recv thread:
1. receive convert -- process_cluster_request/GDLM_REMCMD_CONVREQUEST
2. lkb->lkb_request = freq
3. dlm_convert_stage2()
4. lkb can't be granted immediately, so it's put on the convert queue
6. if (lkb->lkb_request != NULL)
9. send convert reply

other thread:
5. unlocks another lkb, finds the lkb above can be granted, calls remote_grant()
7. lkb->lkb_request = NULL
8. send grant message

remote_grant() is supposed to prevent the convert reply from being sent at all by setting lkb_request = NULL. But, given the right race, that doesn't work, and both the grant and convert reply messages are sent (and sent in the bad order). We'd need to add info to the printk messages to confirm this theory. I haven't yet come up with any ideas on how to fix this.
Created attachment 340446 [details]
patch to try

This patch may work around the problem, if we've identified it correctly. I think it's unlikely that the patch would cause a regression to anything that currently works. If we can run a test that shows the new printk and doesn't cause a failure, that will mean we have correctly identified and fixed the problem. Whether we (or they) can produce this condition to verify remains in doubt. I can't say whether we'd be willing to commit the patch to rhel without verifying that it fixes the problem. We should at least do some testing with the patch to check that it does not cause a regression. I'll look into getting a rhel4 cluster set up to do a smoke test with it.
Created attachment 343629 [details]
patch to test

This is the same work-around fix as the previous patch in comment 16, but this version adds some more information to the "process_lockqueue_reply" printk. This additional info may help us understand the reasons for the process_lockqueue_reply messages, which aren't generally harmful but would be nice to understand more fully, and may give us more confidence in the fix.
I do have two rhel4 nodes (xen vm's) with which I'm now running sanity tests of these patches (building test code from source, not rpms).
With the updated patch, right after reboot, the following message turned up on one of the nodes:

May 15 10:13:29 petcs2 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id 300ee state 0 0 0 0

This event sent from IssueTracker by djuran
issue 279813
The new process_lockqueue_reply details don't tell us why we're getting the unexpected message, but they do show that it is not an issue of confusing one reply for another. Messages with non-zero numbers after state may be interesting to see. I'm comfortable enough with this patch to queue it up for a RHEL4 release, at which point our QA will also be able to run with it.
pushed to RHEL4 branch http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=e4a4744a154a6acaf41579f99c0b04be447b8182
Checking the log files of the cluster where the patch is tested, I found the following messages in syslog:

May 27 13:07:01 petds1 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id b9f03c7 state 0 2 0 0
May 30 11:34:53 petds1 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id 2b9a0319 state 0 2 0 0
Jun 2 14:03:26 petds1 kernel: dlm: vgglobal-lvol1: ignore zero unlock reply from 1 4e67019e 2 25563

And on another node of the cluster:

May 27 22:21:45 petcs2 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id 6b6d000f state 0 2 0 0
May 28 17:32:10 petcs2 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id 88c00318 state 0 3 0 0
May 28 21:20:39 petcs2 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id 8eb702f6 state 0 3 0 0
May 29 13:06:14 petcs2 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id a3f10271 state 0 2 0 0
May 30 10:17:15 petcs2 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id c41902bc state 0 3 0 0

And yet another node:

May 25 08:53:02 petcs3 kernel: dlm: vgglobal-lvol1: process_lockqueue_reply id 20602ae state 0 1 0 0
May 28 21:53:11 petcs3 kernel: dlm: vgglobal-lvol2: process_lockqueue_reply id 34b02e3 state 0 1 0 0

Is this something to worry about?

One noteworthy thing I found is that the network seems somewhat flaky; the interfaces on petds1 and petds2 are reporting a considerable number of dropped packets. Could this be related? sosreports attached.

This event sent from IssueTracker by djuran
issue 279813
it_file 225911
Thanks, the "ignore zero unlock reply" confirms that the patch is fixing the bug they found. The other process_lockqueue_reply's are harmless as usual; the extra info is somewhat interesting and may help in fixing other related bugs that may appear in the future. Yes, the flaky network could be related and may be the reason they are seeing problems that haven't appeared before.
Moved to POST since there is no built package.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, due to a rare race condition in the DLM's reply handling, the "ls" command run on a Global File System (GFS) directory could hang in uninterruptible sleep. With this update, the race has been closed and the "ls" command no longer hangs.
Verified the patch is included in dlm-kernel-2.6.9-59.4.el4. Nothing was hit during regression runs, and the customer tested the patch.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0269.html