Description of problem:
While running distributed I/O tests, the test d_rwrevdirectlarge hung on one node in a RHCS 4.6 cluster.

Version-Release number of selected component (if applicable):
kernel-hugemem-2.6.9-62.EL
GFS-kernel-hugemem-2.6.9-75.6
cman-1.0.17-0
GFS-6.1.15-1

How reproducible:
unknown

Steps to Reproduce:
1.
2.
3.

Actual results:
Stack traces of the hung processes on morph-04:

d_doio        D 000000D0  1948 17022  17021  17023       (NOTLB)
4effadd8 00000002 00000286 000000d0 7baae924 8071d268 02155fe2 0000793b
03fce3a0 7fe4f080 00000000 04018de0 00000000 00000000 abce4c40 000f4eb9
0231fa80 74c510b0 74c5121c 00000000 00000002 4effaf2c 4effaf28 4effadf4
Call Trace:
 [<02155fe2>] rw_vm+0x28e/0x29c
 [<022ce47a>] wait_for_completion+0x94/0xcb
 [<0211e8b0>] default_wake_function+0x0/0xc
 [<02143074>] buffered_rmqueue+0x17d/0x1a5
 [<0211e8b0>] default_wake_function+0x0/0xc
 [<02143150>] __alloc_pages+0xb4/0x2a6
 [<82c99d4e>] glock_wait_internal+0xaa/0x1ef [gfs]
 [<82c9a1f1>] gfs_glock_nq+0xe3/0x116 [gfs]
 [<82cae17e>] do_write_direct+0xf3/0x3cc [gfs]
 [<022ce348>] schedule+0x850/0x8ee
 [<022ce378>] schedule+0x880/0x8ee
 [<0214de89>] handle_mm_fault+0xdc/0x193
 [<82cad8ec>] walk_vm+0xd7/0x100 [gfs]
 [<82caeaa9>] __gfs_write+0xa1/0xbb [gfs]
 [<82cae08b>] do_write_direct+0x0/0x3cc [gfs]
 [<82caeace>] gfs_write+0xb/0xe [gfs]
 [<0215af9e>] vfs_write+0xb6/0xe2
 [<0215b068>] sys_write+0x3c/0x62

d_doio        D 00000020  1948 17023  17021  17024 17022 (NOTLB)
79320e20 00000002 00000026 00000020 1c19ece0 7d5208b0 00000019 04018de0
7d5208b0 00000000 04021740 04020de0 00000001 000f4240 ac57a080 000f4eb9
7d5208b0 04791430 0479159c 00000001 000000d2 ffff0001 2d255b78 2d255af0
Call Trace:
 [<022cee3e>] rwsem_down_read_failed+0x143/0x162
 [<82caf813>] .text.lock.ops_file+0x51/0xca [gfs]
 [<82cad8ec>] walk_vm+0xd7/0x100 [gfs]
 [<02155fe2>] rw_vm+0x28e/0x29c
 [<82cadd34>] __gfs_read+0x43/0x4a [gfs]
 [<82cad9e4>] do_read_direct+0x0/0x1fb [gfs]
 [<82cadd46>] gfs_read+0xb/0xe [gfs]
 [<0215b30a>] do_readv_writev+0x19d/0x1f7
 [<82cadd3b>] gfs_read+0x0/0xe [gfs]
 [<0214f513>] do_mmap_pgoff+0x3ef/0x688
 [<0214f6ae>] do_mmap_pgoff+0x58a/0x688
 [<0215b39f>] vfs_readv+0x3b/0x40
 [<0215b423>] sys_readv+0x3c/0x62

d_doio        D 040181C0  1904 17024  17021  17025 17023 (NOTLB)
5304ff1c 00000002 00000021 040181c0 3a59b4f8 2b416330 00000019 04020de0
2b416330 00000000 04019740 04018de0 00000000 000f4240 ac1a9780 000f4eb9
2b416330 7ff54230 7ff5439c 00000002 5304ff6c 2d255b60 00000246 2d255b68
Call Trace:
 [<022cd925>] __down+0x81/0xdb
 [<0211e8b0>] default_wake_function+0x0/0xc
 [<022cda9c>] __down_failed+0x8/0xc
 [<82caf853>] .text.lock.ops_file+0x91/0xca [gfs]
 [<82caeace>] gfs_write+0xb/0xe [gfs]
 [<0215af9e>] vfs_write+0xb6/0xe2
 [<0215b068>] sys_write+0x3c/0x62

d_doio        D 040181C0  1948 17025  17021  17026 17024 (NOTLB)
3fec5f1c 00000002 0989a000 040181c0 48a434f8 00000246 0989a000 0989c000
63c01268 040181c0 0214c136 04018de0 00000000 000f4240 ac29d9c0 000f4eb9
0231fa80 2b416330 2b41649c 00000000 00000002 2d255b60 00000246 2d255b68
Call Trace:
 [<0214c136>] zap_pmd_range+0x59/0x7c
 [<022cd925>] __down+0x81/0xdb
 [<0211e8b0>] default_wake_function+0x0/0xc
 [<022cda9c>] __down_failed+0x8/0xc
 [<82caf853>] .text.lock.ops_file+0x91/0xca [gfs]
 [<82caeace>] gfs_write+0xb/0xe [gfs]
 [<0215af9e>] vfs_write+0xb6/0xe2
 [<0215b068>] sys_write+0x3c/0x62

d_doio        D 00000000  1948 17026  17021         17025 (NOTLB)
6fd31e80 00000002 00000000 00000000 0370d3a0 00000000 59d93067 00000000
3a69da40 0214d9c8 5bd93000 04020de0 00000001 000f4240 ac66e2c0 000f4eb9
81eb10b0 7d5208b0 7d520a1c f6f48000 00000001 fffe0001 2d255b78 2d255af0
Call Trace:
 [<0214d9c8>] do_anonymous_page+0x1c9/0x1db
 [<022cee3e>] rwsem_down_read_failed+0x143/0x162
 [<82caf813>] .text.lock.ops_file+0x51/0xca [gfs]
 [<82cad8ec>] walk_vm+0xd7/0x100 [gfs]
 [<0214f037>] vma_merge+0x156/0x165
 [<0214f513>] do_mmap_pgoff+0x3ef/0x688
 [<82cadd34>] __gfs_read+0x43/0x4a [gfs]
 [<82cad9e4>] do_read_direct+0x0/0x1fb [gfs]
 [<82cadd46>] gfs_read+0xb/0xe [gfs]
 [<0215adf1>] vfs_read+0xb6/0xe2
 [<0215b006>] sys_read+0x3c/0x62

glock we are stuck on from morph-04:

Glock (2, 36574638)
  gl_flags = 1
  gl_count = 6
  gl_state = 1
  req_gh = yes
  req_bh = yes
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 3
  ail_bufs = no
  Request
    owner = -1
    gh_state = 0
    gh_flags = 0
    error = 0
    gh_iflags = 2 4 5
  Waiter2
    owner = -1
    gh_state = 0
    gh_flags = 0
    error = 0
    gh_iflags = 2 4 5
  Waiter3
    owner = 17022
    gh_state = 2
    gh_flags =
    error = 0
    gh_iflags = 1
  Inode: busy

DLM lock on master (morph-02, node 3):

Resource 60e7bb8c (parent 00000000). Name (len=24) " 2 22e15ae"
Master Copy
LVB: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Granted Queue
Conversion Queue
Waiting Queue

glock on morph-02:

Glock (2, 36574638)
  gl_flags =
  gl_count = 2
  gl_state = 0
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 0
  ail_bufs = no
  Inode:
    num = 36574638/36574638
    type = 1
    i_count = 1
    i_flags =
    vnode = yes

Expected results:

Additional info:
The above processes on morph-04 are waiting for a LM_ST_DEFERRED lock on an inode glock. The glock seems to be in an EXCLUSIVE state, but there is apparently no holder on the Request queue, which is odd. Since it's on Waiters3, the lock is waiting to be promoted, but since it's already in exclusive, isn't that a demote? I wonder if the glock was held in a lower mode (LM_ST_SHARED) and then a requester got promoted to exclusive, leaving this promote request behind or something.

It makes me think back on Josef's fix to gfs2 where he made this comment: "If a glock is in the exclusive state and a request for demote to deferred has been received, then further requests for demote to shared are being ignored. This patch fixes that by ensuring that we demote to unlocked in that case." Maybe the same can be said not only for shared, but for deferred as well. Of course, the code is all different in gfs1, but I might as well add him to the cc list too in case he has thoughts on it. Maybe his fix needs to be ported to gfs1. I'll check that out tomorrow.

There was a weird syslog message on morph-02 (the dlm master for the glock in question). It said:

dlm: unlock ungranted 3

I don't know what that means. Does it mean that someone tried to unlock said glock and was denied?

There were also weird syslog messages on morph-04, the node that's hung:

Oct 22 14:40:09 morph-04 qarshd[17019]: Running cmdline: cd /mnt/vedder && /usr/tests/sts-rhel4.6/bin/d_doio -P try -I 19942 -a -m 1000 -r 100 -n 5
Oct 22 14:40:18 morph-04 kernel: dlm: vedder: cancel reply ret 0
Oct 22 14:40:18 morph-04 kernel: lock_dlm: unlock sb_status 0 2,22e15ae flags 0
Oct 22 14:40:18 morph-04 kernel: dlm: vedder: process_lockqueue_reply id ff03cf state 0

I'd like to see what Dave T. can tell me about these messages, so I'm adding him to the cc list.
According to Dave, the "dlm: unlock ungranted 3" means: "...a simple case of unlocking a lock that's not been granted which should never happen, of course...The problem is that gfs is unlocking a lock that it presumes is granted, but the dlm thinks the lock is being converted when it gets the unlock from gfs."
Created attachment 240911 [details]
lockdump with glock history showing failure

I couldn't recreate the problem on my trin-{09,10,11} cluster, so I put debug code on the morphs. With the debug code on, it took several runs throughout the weekend before it finally recreated. But it did. This tgz contains "gfs_tool lockdump" output from all the morphs at the time of failure.

The glock.c code has a "glock history" feature I added, which I've used before to analyze problems like this in gfs2. It remembers the last 100 significant events for every glock. Although I haven't gone through the glock history yet, I can tell this: the system showing the problem is morph-04 and the glock having the problem is (2, 66860). The problem will hopefully be apparent in the history of that glock; I just need to go through it.
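For anyone unfamiliar with the debug patch, the "glock history" is essentially a per-glock ring buffer of recent events. Below is a minimal, self-contained C sketch of that idea; the structure and function names are invented for illustration and are not the actual patch.

#include <stdio.h>

#define GLOCK_HIST_LEN 100          /* "last 100 significant events" */

struct glock_hist_entry {
        unsigned long seq;          /* monotonically increasing event number */
        char label[32];             /* e.g. "gfs_glock_dq(start)" */
        int gl_state;               /* LM_ST_* at the time of the event */
        int ref;                    /* reference count at the time of the event */
        unsigned int flags;         /* waiter-queue summary flags */
};

struct glock_hist {
        unsigned long next_seq;
        unsigned int head;                           /* next slot to overwrite */
        struct glock_hist_entry ent[GLOCK_HIST_LEN];
};

/* Record one event, overwriting the oldest entry once the buffer is full. */
static void glock_hist_add(struct glock_hist *h, const char *label,
                           int gl_state, int ref, unsigned int flags)
{
        struct glock_hist_entry *e = &h->ent[h->head];

        e->seq = h->next_seq++;
        snprintf(e->label, sizeof(e->label), "%s", label);
        e->gl_state = gl_state;
        e->ref = ref;
        e->flags = flags;
        h->head = (h->head + 1) % GLOCK_HIST_LEN;
}

/* Dump the history oldest-first, the way the annotated traces below read. */
static void glock_hist_dump(const struct glock_hist *h)
{
        for (unsigned int i = 0; i < GLOCK_HIST_LEN; i++) {
                const struct glock_hist_entry *e =
                        &h->ent[(h->head + i) % GLOCK_HIST_LEN];
                if (e->label[0])
                        printf("%lu: %s: state=%d, ref=%d, flags=%x\n",
                               e->seq, e->label, e->gl_state, e->ref, e->flags);
        }
}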
I'm testing a fix now. With debug code, my fix has gone through sixteen iterations of dd_io successfully. Now I'm testing it without the debug code, so just the bare minimum fix. So far it's gone three iterations successfully. (Each iteration takes roughly 15 minutes).
Created attachment 243361 [details]
Annotated glock history of the problem--first version

This is a condensed version of the glock dump from one failure scenario on morph-04. It contains only the glock in question, plus some of my notes. I'm now running a newer version that should give me more detail in the glock history about how the glock got into that state, provided I can recreate the problem again.
Regarding comment #4: I ran with my patch for five iterations and finally hit the problem again. Without the patch, it was usually failing on the first iteration. So the patch definitely closes a timing window, but the problem remains (I had a suspicion that was the case). Now I'm running with my improved glock history patch again, but like before, it's not recreating. Yesterday it ran for 35 iterations without failure. This morning the test stopped around 6:00 a.m., after running overnight, but I'm not sure whether it was a recreation or because my VPN dropped. I took some information and restarted it.
As per Dave's suggestion, I tried running with the 4.5 gfs. So far it has gone eight iterations without a problem. Before doing this exercise, I narrowed the problem down to function gfs_glock_dq(), which is where my patch from comment #4 is located. It is also the location of the only significant change to gfs between 4.5 and today. That change was done for bug 191222. I believe that fix may have introduced a regression. I'm adding Abhi to the cc list to get him in the loop.
I'm confused. It now looks like glock.c is identical between RHEL4.5 and today. As a matter of fact, it looks like there are no significant differences in the gfs kernel code AT ALL except for the gfs_reset_readahead code, which should just lie dormant until activated. Still, I suspect that function is the one with the problem, and I suspect that code change is what did it. The 191222 patch seems reasonable at first glance. My patch from comment #4 has to do with what I'm seeing in the glock history, which is this: while one process is inside that function, gfs_glock_dq(), another process comes in and messes with the glock. Then the assumptions made at the beginning of gfs_glock_dq() are not necessarily true at the end.
With the latest (almost 4.6) code I ran 97 iterations of JUST d_rwrevdirectlarge last night without any failure. When I ran the entire dd_io, it recreated on the 9th attempt. So apparently it's not enough to run just d_rwrevdirectlarge to get the problem to recreate. Now I'm running a new patch to see if it will fix the problem. Only time will tell. The problem is that each run lasts about 17 minutes. If I were able to run just d_rwrevdirectlarge, it would take only somewhere between 30 seconds and 5 minutes.
Just a quick update to let everyone know what's happening here. For some reason, the problem seems to be getting harder to recreate. The hang is ALWAYS accompanied by the dlm message: "dlm: unlock ungranted". I've used lots of instrumentation in gfs to determine that gfs SEEMS to be doing the right thing. Here is my latest annotated, instrumented glock history showing what's happening:

Glock (2, 145829642)
  gl_flags = GLF_LOCK
  gl_count = 9
  gl_state = 1 (LM_ST_EXCLUSIVE)
  req_gh = yes
  req_bh = yes
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 9
  ail_bufs = no
  Request
    owner = -1
    gh_state = 0
    gh_flags = 0
    error = 0
    gh_iflags = 2 4 5
  Waiter2
    owner = -1
    gh_state = 0
    gh_flags = 0
    error = 0
    gh_iflags = 2 4 5
  Waiter3
    owner = 9120
    gh_state = 2
    gh_flags =
    error = 0
    gh_iflags = 1
  Waiter3
    owner = 9118
    gh_state = 2
    gh_flags =
    error = 0
    gh_iflags = 1
  Waiter3
    owner = 9121
    gh_state = 3
    gh_flags = 3
    error = 0
    gh_iflags = 1
  Waiter3
    owner = 9117
    gh_state = 2
    gh_flags =
    error = 0
    gh_iflags = 1
  Inode: busy

Glock history:
1055365: RQ_PROMOTE4: LM_ST_DEFERRED, ref = 4, flags=1
1055366: gfs_glock_dq(start): LM_ST_DEFERRED, ref = 4, flags=1
1055367: gfs_holder_init: LM_ST_DEFERRED, ref = 5, flags=1
1055368: gfs_glock_dq(b4runq): LM_ST_DEFERRED, ref = 4, flags=0
1055369: gfs_glock_dq(end): LM_ST_DEFERRED, ref = 4, flags=0
1055370: gfs_holder_init: LM_ST_DEFERRED, ref = 4, flags=0
1055371: GFS_GLOCK_NQ: LM_ST_DEFERRED, ref = 4, flags=0
1055372: RQ_PROMOTE4: LM_ST_DEFERRED, ref = 4, flags=1
1055373: gfs_glock_dq(start): LM_ST_DEFERRED, ref = 4, flags=1
1055374: gfs_holder_init: LM_ST_DEFERRED, ref = 5, flags=1
1055375: gfs_glock_dq(b4runq): LM_ST_DEFERRED, ref = 4, flags=0
1055376: gfs_glock_dq(end): LM_ST_DEFERRED, ref = 4, flags=0
1055377: gfs_holder_init: LM_ST_DEFERRED, ref = 4, flags=0
1055378: GFS_GLOCK_NQ: LM_ST_DEFERRED, ref = 4, flags=0
1055379: RQ_PROMOTE4: LM_ST_DEFERRED, ref = 4, flags=1
1055380: gfs_glock_dq(start): LM_ST_DEFERRED, ref = 4, flags=1
1055381: gfs_holder_init: LM_ST_DEFERRED, ref = 5, flags=1
1055382: gfs_glock_dq(b4runq): LM_ST_DEFERRED, ref = 4, flags=0
1055383: gfs_glock_dq(end): LM_ST_DEFERRED, ref = 4, flags=0
1055384: gfs_holder_init: LM_ST_DEFERRED, ref = 4, flags=0
1055385: GFS_GLOCK_NQ: LM_ST_DEFERRED, ref = 4, flags=0
1055386: RQ_PROMOTE4: LM_ST_DEFERRED, ref = 4, flags=1
1055387: gfs_glock_dq(start): LM_ST_DEFERRED, ref = 4, flags=1
1055388: gfs_holder_init: LM_ST_DEFERRED, ref = 5, flags=1
1055389: gfs_glock_dq(b4runq): LM_ST_DEFERRED, ref = 4, flags=0
1055390: gfs_glock_dq(end): LM_ST_DEFERRED, ref = 4, flags=0
1055391: gfs_holder_init: LM_ST_DEFERRED, ref = 4, flags=0
1055392: GFS_GLOCK_NQ: LM_ST_DEFERRED, ref = 4, flags=0
1055393: RQ_PROMOTE4: LM_ST_DEFERRED, ref = 4, flags=1
1055394: gfs_glock_dq(start): LM_ST_DEFERRED, ref = 4, flags=1
1055395: gfs_holder_init: LM_ST_DEFERRED, ref = 5, flags=1
1055396: gfs_glock_dq(b4runq): LM_ST_DEFERRED, ref = 4, flags=0
1055397: gfs_glock_dq(end): LM_ST_DEFERRED, ref = 4, flags=0
1055398: gfs_holder_init: LM_ST_DEFERRED, ref = 4, flags=0
1055399: GFS_GLOCK_NQ: LM_ST_DEFERRED, ref = 4, flags=0
1055400: RQ_PROMOTE4: LM_ST_DEFERRED, ref = 4, flags=1
1055401: gfs_holder_init: LM_ST_DEFERRED, ref = 5, flags=1
1055402: GFS_GLOCK_NQ: LM_ST_DEFERRED, ref = 5, flags=1
1055403: RQ_PROMOTE4: LM_ST_DEFERRED, ref = 5, flags=1
1055404: gfs_holder_init: LM_ST_DEFERRED, ref = 6, flags=1
1055405: GFS_GLOCK_NQ: LM_ST_DEFERRED, ref = 6, flags=1
1055406: RQ_PROMOTE4: LM_ST_DEFERRED, ref = 6, flags=1
1055407: gfs_glock_dq(start): LM_ST_DEFERRED, ref = 6, flags=1
1055408: gfs_holder_init: LM_ST_DEFERRED, ref = 7, flags=1
1055409: gfs_glock_dq(b4runq): LM_ST_DEFERRED, ref = 6, flags=1
1055410: gfs_glock_dq(end): LM_ST_DEFERRED, ref = 6, flags=1
1055411: gfs_glock_dq(start): LM_ST_DEFERRED, ref = 5, flags=1
1055412: gfs_holder_init: LM_ST_DEFERRED, ref = 6, flags=1
1055413: gfs_glock_dq(b4runq): LM_ST_DEFERRED, ref = 5, flags=1
1055414: gfs_glock_dq(end): LM_ST_DEFERRED, ref = 5, flags=1
1055415: gfs_holder_init: LM_ST_DEFERRED, ref = 5, flags=1
1055416: GFS_GLOCK_NQ: LM_ST_DEFERRED, ref = 5, flags=1
1055417: RQ_PROMOTE4: LM_ST_DEFERRED, ref = 5, flags=1
1055418: gfs_glock_dq(start): LM_ST_DEFERRED, ref = 5, flags=1
1055419: gfs_holder_init: LM_ST_DEFERRED, ref = 6, flags=1
1055420: gfs_glock_dq(b4runq): LM_ST_DEFERRED, ref = 5, flags=1
1055421: gfs_glock_dq(end): LM_ST_DEFERRED, ref = 5, flags=1
1055422: gfs_holder_init: LM_ST_DEFERRED, ref = 5, flags=1
1055423: GFS_GLOCK_NQ: LM_ST_DEFERRED, ref = 5, flags=1
1055424: RQ_PROMOTE4: LM_ST_DEFERRED, ref = 5, flags=1
1055425: gfs_glock_dq(start): LM_ST_DEFERRED, ref = 5, flags=1
1055426: gfs_holder_init: LM_ST_DEFERRED, ref = 6, flags=1
1055427: gfs_glock_dq(b4runq): LM_ST_DEFERRED, ref = 5, flags=1
1055428: gfs_glock_dq(end): LM_ST_DEFERRED, ref = 5, flags=1
1055429: gfs_glock_dq(start): LM_ST_DEFERRED, ref = 4, flags=1
1055430: gfs_holder_init: LM_ST_DEFERRED, ref = 5, flags=1
1055431: gfs_glock_dq(b4runq): LM_ST_DEFERRED, ref = 4, flags=0
1055432: gfs_glock_dq(end): LM_ST_DEFERRED, ref = 4, flags=0
1055433: gfs_holder_init: LM_ST_DEFERRED, ref = 4, flags=0
1055434: GFS_GLOCK_NQ: LM_ST_DEFERRED, ref = 4, flags=0
1055435: RQ_PROMOTE4: LM_ST_DEFERRED, ref = 4, flags=1
1055436: gfs_glock_dq(start): LM_ST_DEFERRED, ref = 4, flags=1
1055437: gfs_holder_init: LM_ST_DEFERRED, ref = 5, flags=1
1055438: gfs_glock_dq(b4runq): LM_ST_DEFERRED, ref = 4, flags=0
1055439: gfs_glock_dq(end): LM_ST_DEFERRED, ref = 4, flags=0
1055440: gfs_holder_init: LM_ST_DEFERRED, ref = 4, flags=0
1055441: GFS_GLOCK_NQ: LM_ST_DEFERRED, ref = 4, flags=0
1055442: gfs_glock_xmote_th: LM_ST_DEFERRED, ref = 5, flags=1000
1055443: RQ_PROMOTE: LM_ST_DEFERRED, ref = 5, flags=1000
1055444: search_bucket: LM_ST_DEFERRED, ref = 6, flags=1000
1055445: state_change: LM_ST_EXCLUSIVE, ref = 6, flags=1000
1055446: LM_CB_ASYNC: LM_ST_EXCLUSIVE, ref = 4, flags=1
1055447: search_bucket: LM_ST_EXCLUSIVE, ref = 5, flags=1
1055448: HANDLE_CALLBACK(strt): LM_ST_EXCLUSIVE, ref = 5, flags=1
1055449: gfs_holder_init: LM_ST_EXCLUSIVE, ref = 6, flags=1
1055450: HANDLE_CALLBACK(end): LM_ST_EXCLUSIVE, ref = 6, flags=101
1055451: gfs_glock_dq(start): LM_ST_EXCLUSIVE, ref = 5, flags=101
1055452: gfs_holder_init: LM_ST_EXCLUSIVE, ref = 6, flags=101
1055453: gfs_glock_dq(b4runq): LM_ST_EXCLUSIVE, ref = 5, flags=100
1055454: RQ_DEMOTE3: LM_ST_EXCLUSIVE, ref = 5, flags=100
1055455: gfs_glock_drop_th: LM_ST_EXCLUSIVE, ref = 6, flags=100
1055456: gfs_glock_dq(end): LM_ST_EXCLUSIVE, ref = 6, flags=100
1055457: gfs_holder_init: LM_ST_EXCLUSIVE, ref = 6, flags=100
1055458: GFS_GLOCK_NQ: LM_ST_EXCLUSIVE, ref = 6, flags=100
1055459: gfs_holder_init: LM_ST_EXCLUSIVE, ref = 7, flags=1100
1055460: GFS_GLOCK_NQ: LM_ST_EXCLUSIVE, ref = 7, flags=1100
1055461: gfs_holder_init: LM_ST_EXCLUSIVE, ref = 8, flags=1100
1055462: GFS_GLOCK_NQ: LM_ST_EXCLUSIVE, ref = 8, flags=1100
1055463: gfs_holder_init: LM_ST_EXCLUSIVE, ref = 9, flags=1100
1055464: GFS_GLOCK_NQ: LM_ST_EXCLUSIVE, ref = 9, flags=1100

On the line marked 1055446, we see the async callback from the lock manager, giving us the lock in question. It should be ours to promote, demote or unlock as we see fit. On the line marked 1055455, we see the call to gfs_glock_drop_th, which called the lock module to unlock that glock. However, all indications are that it got a non-zero return code from the lock module and therefore bypassed calling drop_bh. Likewise, we never get any kind of callback indicating the unlock was done. Instead, we get errors from dlm. Here are the dlm messages I got on my latest recreation:

dlm: vedder: process_lockqueue_reply id 150224 state 0
dlm: vedder: process_lockqueue_reply id 1c02ef state 0
dlm: vedder: process_lockqueue_reply id a0038e state 0
dlm: vedder: process_lockqueue_reply id 8f0151 state 0
dlm: vedder: process_lockqueue_reply id aa023d state 0
dlm: unlock ungranted 3
 [<f8bd1598>] dlm_unlock_stage2+0x148/0x210 [dlm]
 [<f8bd3c81>] process_cluster_request+0x81f/0x8e4 [dlm]
 [<02146b29>] cache_grow+0x116/0x138
 [<f8bd6f61>] midcomms_process_incoming_buffer+0x1ce/0x1f8 [dlm]
 [<02143074>] buffered_rmqueue+0x17d/0x1a5
 [<02120511>] autoremove_wake_function+0x0/0x2d
 [<02143150>] __alloc_pages+0xb4/0x2a6
 [<f8bd518c>] receive_from_sock+0x181/0x254 [dlm]
 [<f8bd607b>] dlm_recvd+0x0/0x95 [dlm]
 [<f8bd5f25>] process_sockets+0x56/0x91 [dlm]
 [<f8bd6100>] dlm_recvd+0x85/0x95 [dlm]
 [<02133271>] kthread+0x73/0x9b
 [<021331fe>] kthread+0x0/0x9b
 [<021041f5>] kernel_thread_helper+0x5/0xb

On the other node:

dlm: vedder: process_lockqueue_reply id 190116 state 0
dlm: vedder: convert from 3 ungranted
dlm: request rh_cmd 4 rh_lkid f008f remlkid 1c02ba flags 44 status 0 rqmode 2
dlm: lkb id 1c02ba remid f008f flags 40000 status 3 rqmode 5 grmode 2 nodeid 3 lqstate 0 lqflags 44
dlm: rsb name " 2 8b12f0a" nodeid 0 flags 2 ref 5
...
dlm: vedder: process_lockqueue_reply id 330330 state 0
dlm: vedder: process_lockqueue_reply id 370264 state 0
dlm: vedder: process_lockqueue_reply id 35004f state 0
dlm: vedder: process_lockqueue_reply id 31019b state 0
dlm: vedder: process_lockqueue_reply id 3102f8 state 0
dlm: vedder: process_lockqueue_reply id 390236 state 0
dlm: vedder: process_lockqueue_reply id 3f0282 state 0
dlm: vedder: process_lockqueue_reply id 420102 state 0
dlm: vedder: process_lockqueue_reply id 310290 state 0
dlm: vedder: process_lockqueue_reply id 4102b8 state 0
dlm: vedder: process_lockqueue_reply id 4303e2 state 0
dlm: vedder: process_lockqueue_reply id 38016e state 0
dlm: vedder: process_lockqueue_reply id 4a006d state 0
dlm: vedder: cancel reply ret 0
lock_dlm: unlock sb_status 0 2,8b12f0a flags 0
dlm: vedder: process_lockqueue_reply id 4902aa state 0

I'm now in the process of trying to recreate the failure with some instrumentation recommended by Dave Teigland to dump the lkb and rsb when we get any of the three errors shown above. I'm also working on new instrumentation in the gfs kernel to show whether it's sending the proper things to the lock module.
Here are the messages from the instrumented DLM at the time of failure:

dlm: vedder: process_lockqueue_reply id 1103c5 state 0
dlm: vedder: process_lockqueue_reply id 1100d7 state 0
dlm: unlock ungranted 3
dlm: lkb id 14037a remid f03ed flags 440000 status 0 rqmode 5 grmode 0 nodeid 4 lqstate 0 lqflags 44
dlm: rsb name " 2 74" nodeid 0 flags 2 ref 5
 [<82bd1597>] dlm_unlock_stage2+0x147/0x21e [dlm]
 [<82bd3ca6>] process_cluster_request+0x81f/0x8e4 [dlm]
 [<02146b29>] cache_grow+0x116/0x138
 [<02117472>] smp_apic_timer_interrupt+0x9a/0x9c
 [<82bd6f85>] midcomms_process_incoming_buffer+0x1ce/0x1f8 [dlm]
 [<02143074>] buffered_rmqueue+0x17d/0x1a5
 [<02120511>] autoremove_wake_function+0x0/0x2d
 [<02143150>] __alloc_pages+0xb4/0x2a6
 [<82bd51b0>] receive_from_sock+0x181/0x254 [dlm]
 [<82bd609f>] dlm_recvd+0x0/0x95 [dlm]
 [<82bd5f49>] process_sockets+0x56/0x91 [dlm]
 [<82bd6124>] dlm_recvd+0x85/0x95 [dlm]
 [<02133271>] kthread+0x73/0x9b
 [<021331fe>] kthread+0x0/0x9b
 [<021041f5>] kernel_thread_helper+0x5/0xb

--------
Other node:

dlm: vedder: process_lockqueue_reply id 17036a state 0
dlm: vedder: process_lockqueue_reply id 150034 state 0
dlm: vedder: cancel reply ret 0
dlm: lkb id f03ed remid 14037a flags 10080000 status 0 rqmode -1 grmode 2 nodeid 3 lqstate 0 lqflags 0
dlm: rsb name " 2 74" nodeid 3 flags 0 ref 1
dlm: reply rh_cmd 5 rh_lkid f03ed lockstate 3 nodeid 34892585 status 0 lkid 14037a
lock_dlm: unlock sb_status 0 2,74 flags 0
dlm: vedder: process_lockqueue_reply id f03ed state 0
Okay, just to clarify: The test is hung because it's waiting for a glock. The glock is stuck because it's got the GLF_LOCK gl_flag set. Therefore run_queue won't do anything to it. It's in that state because it never got its callback from a drop request to the lock manager. Here are the last few entries in the sequence of events from my glock history, with my comments AFTER the entries they describe.

This history starts with a holder, in LM_ST_DEFERRED state. The glock is dequeued:

201826: gfs_glock_dq(start): LM_ST_DEFERRED, LM_ST_DEFERRED, rc=1, flags=1
201827: gfs_glock_dq(b4runq): LM_ST_DEFERRED, LM_ST_DEFERRED, rc=1, flags=0
201828: gfs_glock_dq(end): LM_ST_DEFERRED, LM_ST_DEFERRED, rc=1, flags=0

When glock_dq is done, the holder has been taken off the queue, thus flags=0 (no holders on any of the waiters queues).

201829: gfs_glock_xmote_th: LM_ST_DEFERRED, LM_ST_EXCLUSIVE, rc=777, flags=1000
201830: gfs_glock_xmote_th: LM_ST_DEFERRED, LM_ST_EXCLUSIVE, rc=128, flags=1000

A new request came in to promote the lock from deferred to exclusive. The "777" is before the call to the lock manager, the "128" is right after. The "128" indicates the async bit is on as well. The flags=1000 indicates that the request went onto Waiters3.

201831: RQ_PROMOTE: LM_ST_DEFERRED, LM_ST_EXCLUSIVE, rc=1, flags=1000

The lock is being promoted to exclusive. This log entry implies that !relaxed_state_ok(). This is what called the previous gfs_glock_xmote_th.

201832: LM_CB_ASYNC: LM_ST_DEFERRED, LM_ST_DEFERRED, rc=1, flags=1000

Here we have the async callback from the promote. All is well.

201833: state_change: LM_ST_DEFERRED, LM_ST_EXCLUSIVE, rc=0, flags=1000

The state is therefore changed to exclusive. After this, the holder is moved from Waiters3 to gl_holders.

201834: gfs_glock_dq(start): LM_ST_EXCLUSIVE, LM_ST_EXCLUSIVE, rc=1, flags=101

A new request came in to unlock the glock. The rc=1 means one sole holder.

201835: gfs_glock_dq(b4runq): LM_ST_EXCLUSIVE, LM_ST_EXCLUSIVE, rc=1, flags=100

This is right before the run_queue at the bottom of gfs_glock_dq. The flags=100 indicates the previous unlock request was put on Waiters2, so it's now waiting to demote the lock.

201836: RQ_DEMOTE3: LM_ST_EXCLUSIVE, LM_ST_UNLOCKED, rc=1, flags=100

Since we're changing state, and since it's an unlock request, rq_demote calls go_drop_th, which calls gfs_glock_drop_th().

201837: gfs_glock_drop_th: LM_ST_EXCLUSIVE, LM_ST_EXCLUSIVE, rc=777, flags=100
201838: gfs_glock_drop_th: LM_ST_EXCLUSIVE, LM_ST_EXCLUSIVE, rc=128, flags=100

gfs_glock_drop_th called the lock module to do the drop. Since the async bit is still on, this request doesn't cause drop_bh to be called. Instead we wait for the callback.

201839: gfs_glock_dq(end): LM_ST_EXCLUSIVE, LM_ST_EXCLUSIVE, rc=1, flags=100

The callback never came. The debug version of dlm gave these messages:

dlm: vedder: cancel reply ret 0
dlm: lkb id 1f0196 remid 8102e8 flags 480000 status 0 rqmode -1 grmode 5 nodeid 5 lqstate 0 lqflags 0
dlm: rsb name " 2 22bae75" nodeid 5 flags 0 ref 1
dlm: reply rh_cmd 5 rh_lkid 1f0196 lockstate 3 nodeid 582 status 0 lkid 8102e8
lock_dlm: unlock sb_status 0 2,22bae75 flags 0
dlm: vedder: process_lockqueue_reply id 1f0196 state 0

So again, it seems like the cancel never caused the callback. I'm running the test again now with some more debug messages suggested by Dave Teigland.
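To make the hang mechanism easier to picture, here is a rough C model of the async-drop flow described above. This is not the real glock.c code; the names are simplified and heavily reduced, but it shows why a missing completion callback leaves GLF_LOCK set and run_queue permanently idle for that glock.

#include <stdio.h>

enum { LM_ST_UNLOCKED, LM_ST_SHARED, LM_ST_DEFERRED, LM_ST_EXCLUSIVE };

struct glock_model {
        int gl_state;
        int glf_lock;                /* models the GLF_LOCK gl_flag */
        int waiters;                 /* holders queued on the waiter lists */
};

/* rq_demote -> go_drop_th -> gfs_glock_drop_th: hand the unlock to the lock
 * module.  The request completes asynchronously, so GLF_LOCK stays set and
 * drop_bh is NOT called here; we wait for the lock module's callback. */
static void model_drop_th(struct glock_model *gl)
{
        gl->glf_lock = 1;
}

/* The completion callback is the only thing that clears GLF_LOCK and lets
 * run_queue service the waiters again. */
static void model_drop_bh(struct glock_model *gl)
{
        gl->gl_state = LM_ST_UNLOCKED;
        gl->glf_lock = 0;
}

static void model_run_queue(struct glock_model *gl)
{
        if (gl->glf_lock)
                return;              /* GLF_LOCK set: leave waiters queued */
        gl->waiters = 0;             /* ...promote/demote waiting holders... */
}

int main(void)
{
        struct glock_model gl = { LM_ST_EXCLUSIVE, 0, 4 };

        model_drop_th(&gl);          /* unlock handed to the lock module    */
        /* model_drop_bh(&gl); */    /* callback never arrives (the bug)    */
        model_run_queue(&gl);        /* does nothing: GLF_LOCK still set    */

        printf("state=%d glf_lock=%d waiters=%d -> hung\n",
               gl.gl_state, gl.glf_lock, gl.waiters);
        return 0;
}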
Output from Dave's patch:

dlm: vedder: cancel reply ret 0
dlm: lkb id 1c010d remid 61020a flags 80000 status 0 rqmode -1 grmode 5 nodeid 5 lqstate 0 lqflags 0
dlm: rsb name " 2 8b12f09" nodeid 5 flags 0 ref 1
dlm: reply rh_cmd 5 rh_lkid 1c010d lockstate 3 nodeid 5 status 0 lkid 61020a
dlm: queue_ast 1c010d status 0
dlm: lkb id 1c010d remid 61020a flags 80000 status 0 rqmode -1 grmode 5 nodeid 5 lqstate 0 lqflags 0
dlm: rsb name " 2 8b12f09" nodeid 5 flags 0 ref 1
lock_dlm: unlock sb_status 0 2,8b12f09 flags 0
dlm: vedder: process_lockqueue_reply id 1c010d state 0
Here is the output from Dave's latest patch:

From morph-01:
dlm: vedder: grant lock 1502dc on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 1502dc state 0
dlm: vedder: grant lock 140270 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 140270 state 0
dlm: vedder: grant lock 1f025d on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 1f025d state 0
dlm: vedder: grant lock 220172 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 220172 state 0
dlm: vedder: grant lock 3b00f0 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 3b00f0 state 0
dlm: vedder: grant lock 2a0195 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 2a0195 state 0
dlm: vedder: grant lock 2c0224 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 2c0224 state 0
dlm: vedder: grant lock 3101f6 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 3101f6 state 0
dlm: vedder: grant lock 2600b8 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 2600b8 state 0
dlm: vedder: grant lock 3c03cc on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 3c03cc state 0
dlm: vedder: grant lock 49016a on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 49016a state 0
dlm: vedder: grant lock 470086 on lockqueue 2 44 3 3 2 0
dlm: vedder: cancel reply ret 0
dlm: lkb id 470086 remid 2b024b flags 80000 status 0 rqmode -1 grmode 5 nodeid 4 lqstate 0 lqflags 0
dlm: rsb name " 2 68585d6" nodeid 4 flags 0 ref 1
dlm: reply rh_cmd 5 rh_lkid 470086 lockstate 3 nodeid 518 status 0 lkid 2b024b
dlm: queue_ast 470086 status 0
dlm: lkb id 470086 remid 2b024b flags 80000 status 0 rqmode -1 grmode 5 nodeid 4 lqstate 0 lqflags 0
dlm: rsb name " 2 68585d6" nodeid 4 flags 0 ref 1
Reproduced: offset=0x48
lock_dlm: unlock sb_status 0 2,68585d6 flags 0
00000000 01000100 05004800 86004700 09000001 [......H...G.....]
00000010 03000000 06020000 00000000 4B022B00 [............K.+.]
00000020 00000020 00000003 0040F678 9000EE81 [... .....@.x....]
00000030 06020000 46020000 00000000 80D1EE81 [....F...........]
00000040 D0000000 0010F478 00000000 00000000 [.......x........]
00000050 00000000 00000000 00000000 00000000 [................]
From morph-02:
dlm: vedder: grant lock e01f2 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id e01f2 state 0
dlm: vedder: grant lock 8029c on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 8029c state 0
dlm: vedder: grant lock 19022f on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 19022f state 0
dlm: unlock ungranted 3
dlm: lkb id 2b024b remid 470086 flags 440000 status 0 rqmode 5 grmode 0 nodeid 2 lqstate 0 lqflags 44
dlm: rsb name " 2 68585d6" nodeid 0 flags 2 ref 5
 [<82bd15fd>] dlm_unlock_stage2+0x147/0x21e [dlm]
 [<82bd3e2e>] process_cluster_request+0x8df/0x9af [dlm]
 [<82bd71c5>] midcomms_process_incoming_buffer+0x1ce/0x22a [dlm]
 [<02143074>] buffered_rmqueue+0x17d/0x1a5
 [<02120511>] autoremove_wake_function+0x0/0x2d
 [<02143150>] __alloc_pages+0xb4/0x2a6
 [<82bd5344>] receive_from_sock+0x181/0x254 [dlm]
 [<82bd6233>] dlm_recvd+0x0/0x95 [dlm]
 [<82bd60dd>] process_sockets+0x56/0x91 [dlm]
 [<82bd62b8>] dlm_recvd+0x85/0x95 [dlm]
 [<02133271>] kthread+0x73/0x9b
 [<021331fe>] kthread+0x0/0x9b
 [<021041f5>] kernel_thread_helper+0x5/0xb

From morph-03:
dlm: vedder: grant lock 600b3 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 600b3 state 0
dlm: vedder: grant lock f0028 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id f0028 state 0
dlm: vedder: grant lock 11021c on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 11021c state 0
dlm: vedder: grant lock 1102a5 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 1102a5 state 0
dlm: vedder: grant lock 1602d7 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 1602d7 state 0
dlm: vedder: grant lock 1100b1 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 1100b1 state 0
dlm: vedder: grant lock 1d03ce on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 1d03ce state 0
dlm: vedder: grant lock f02c5 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id f02c5 state 0
dlm: vedder: grant lock 150062 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 150062 state 0
dlm: vedder: grant lock 180091 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 180091 state 0
dlm: vedder: grant lock 11023b on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 11023b state 0

From morph-04:
dlm: vedder: grant lock 40205 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 40205 state 0
dlm: vedder: grant lock 40324 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 40324 state 0
dlm: vedder: grant lock 400ab on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 400ab state 0
dlm: vedder: grant lock 60318 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 60318 state 0
dlm: vedder: grant lock 802ee on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 802ee state 0
dlm: vedder: grant lock 601cf on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 601cf state 0
dlm: vedder: grant lock 80212 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 80212 state 0
dlm: vedder: grant lock e01f0 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id e01f0 state 0
dlm: vedder: grant lock 1003be on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 1003be state 0
dlm: vedder: grant lock 800a2 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 800a2 state 0
dlm: vedder: grant lock b01a6 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id b01a6 state 0
dlm: vedder: grant lock 1503dc on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 1503dc state 0
dlm: vedder: grant lock 1302d7 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 1302d7 state 0
dlm: vedder: grant lock 1303e9 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 1303e9 state 0
dlm: vedder: grant lock 11035e on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 11035e state 0
dlm: vedder: grant lock 170317 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 170317 state 0
dlm: vedder: grant lock 14032b on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 14032b state 0

From morph-05:
dlm: vedder: grant lock 3031b on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id 3031b state 0
dlm: vedder: grant lock f00fd on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id f00fd state 0
dlm: vedder: grant lock b00cb on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id b00cb state 0
dlm: vedder: grant lock f0152 on lockqueue 2 44 3 3 2 0
dlm: vedder: process_lockqueue_reply id f0152 state 0
With my new (only slightly) improved version of the test, I was able to recreate this hang without the code from bug 191222, so there are no more worries about that being defective. I was also able to recreate it with RHEL4.5's gfs, so apparently there are no worries about it being a regression from 4.5 to 4.6 at all. I might try going back to the RHEL4.4 version as well. However, it's looking more and more like a dlm issue right now. It might be more worthwhile to try the RHEL4.5 dlm instead.
For the entire life of the dlm, there's been an annoying issue that we've worked around and not "fixed" directly. It's the source of all these messages:

process_lockqueue_reply id 2c0224 state 0

The problem is that a lock master sends an async "granted" message for a convert request *before* actually sending the reply for the original convert. The work-around is that the requesting node just takes the granted message as an implicit reply to the conversion and ignores the convert reply when it arrives later (the message above is printed when it gets the out-of-order reply for its convert). Apart from the annoying messages, it's never been a problem. Now, I believe we've found a case where it's a real problem:

1. nodeA: send convert PR->CW to nodeB
   nodeB: send granted message to nodeA
   nodeB: send convert reply to nodeA
2. nodeA: receive granted message for conversion, complete request, sending ast to gfs
3. nodeA: send convert CW->EX to nodeB
4. nodeA: receive reply for convert in step 1, which we ordinarily ignore, but since another convert has been sent, we mistake this message as the reply for the convert in step 3, and complete the convert request which is *not* really completed yet
5. nodeA: send unlock to nodeB
   nodeB: complains about an unlock during a conversion
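Here is a small stand-alone simulation of the five steps above, written only to make the ordering bug easier to follow. It is not dlm code; the message names and matching logic are illustrative assumptions. The point is that once a stale convert reply can be matched against a newer outstanding convert, the requester completes the new convert too early and its follow-on unlock reaches the master while the conversion is still pending, which is exactly the "unlock ungranted" complaint.

#include <stdio.h>

/* The two messages that matter for the race described above. */
enum msg { GRANTED_AST, CONVERT_REPLY };

struct req_node {
        int convert_outstanding;   /* a convert has been sent, reply pending */
        int ignore_next_reply;     /* grant already consumed as implicit reply */
};

/* Original work-around: a granted message is taken as the implicit reply to
 * the outstanding convert; the real reply is simply discarded when it shows
 * up later... unless another convert has been sent in the meantime. */
static void receive(struct req_node *n, enum msg m, int *completed_early)
{
        switch (m) {
        case GRANTED_AST:
                n->convert_outstanding = 0;
                n->ignore_next_reply = 1;     /* expect a late reply to drop */
                break;
        case CONVERT_REPLY:
                if (n->ignore_next_reply && !n->convert_outstanding) {
                        n->ignore_next_reply = 0;    /* harmless, as designed */
                } else if (n->convert_outstanding) {
                        /* Stale reply mistaken for the reply to the NEW
                         * convert: the requester thinks that conversion is
                         * done and may go on to unlock an ungranted lock. */
                        n->convert_outstanding = 0;
                        *completed_early = 1;
                }
                break;
        }
}

int main(void)
{
        struct req_node nodeA = { 0, 0 };
        int completed_early = 0;

        nodeA.convert_outstanding = 1;                     /* 1. convert PR->CW sent */
        receive(&nodeA, GRANTED_AST, &completed_early);    /* 2. grant arrives first */
        nodeA.convert_outstanding = 1;                     /* 3. convert CW->EX sent */
        receive(&nodeA, CONVERT_REPLY, &completed_early);  /* 4. stale reply arrives */

        if (completed_early)                               /* 5. premature unlock     */
                printf("unlock sent while convert still pending -> "
                       "\"dlm: unlock ungranted\"\n");
        return 0;
}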
The first attempt to fix the problem described in comment 16 is to have nodeB not send a convert reply if it has already sent a granted message. (We already do this for cases where the conversion is granted when first processing it, but we don't in cases where the grant is done after processing the convert.)
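A minimal sketch of that master-side guard, under the same illustrative assumptions as the simulation above (the grant_sent_before_reply flag and the function names are invented for the example, not taken from the dlm source):

/* Master-side convert handling, simplified: suppress the explicit convert
 * reply whenever a granted message has already gone out for this conversion,
 * so the requester never sees an out-of-order stale reply. */
struct convert_state {
        int grant_sent_before_reply;   /* hypothetical bookkeeping flag */
};

void master_grant_conversion(struct convert_state *cs,
                             void (*send_granted)(void))
{
        send_granted();
        cs->grant_sent_before_reply = 1;
}

void master_finish_convert_request(struct convert_state *cs,
                                   void (*send_reply)(void))
{
        if (cs->grant_sent_before_reply) {
                cs->grant_sent_before_reply = 0;
                return;                 /* grant already acted as the reply */
        }
        send_reply();
}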
The patch Dave asked me to try seems to have worked. It ran the shortened (10 min) scenario overnight without a problem, 110 iterations so far. Based on that, I'm reassigning to Dave, but I'll help any way I can.

Here is how I was recreating the problem, in grueling detail:

1. Reboot all nodes:
   cssh -l root morph-0{1,2,3,4,5} &
   /sbin/reboot -fin

2. When they come back, restart the cluster software:
   cssh -l root morph-0{1,2,3,4,5} &
   On every node:
   service ccsd start
   service cman start
   service fenced start
   service clvmd start

3. On morph-01, reformat the file system:
   mkfs.gfs -O -p "lock_dlm" -t "morph-cluster:vedder" -j 5 /dev/linear_1/linear_10

4. On all nodes, mount the file system to /mnt/vedder and chmod it:
   mount -tgfs /dev/linear_1/linear_10 /mnt/vedder
   chmod 777 /mnt/vedder
   (necessary because the test doesn't run as root)

5. Log into "try" and su to nstraz (I was too lazy to change all the permissions on the files):
   ssh try
   su
   su nstraz

6. Go to the test case directory in tmp and start the test for 100 iterations:
   cd /tmp/vedder.REG.lock_dlm.morph-cluster.200710121519/3/06.gfs_vs/stripe_2_4/2.dd_io
   PATH=$PATH:/usr/tests/sts-rhel4.6/bin/
   for i in `seq 1 100` ; do echo $i ; rm -fR bob ; ./restart.sh bob ; done
   (The "bob" here is a directory that contains the output of the test.)

If you get a failure (hang), you should do this (in this order):
- Stop the test on "try" by holding down <ctrl-c> until it stops
- Reset all the nodes
- On "try" type "d_close -a"
- Restart from step 2 above

If you don't follow that order, the for loop will try to restart the test and leave programs running that will interfere with the next run.
The patch committed to CVS worked properly for me as well. I ran more than 30 iterations (several hours) of the dd_io test case yesterday before I stopped it to reboot the system I was running it from. Later, when I went to restart it, someone had deleted the /tmp/ directory I had been using to run it. Hopefully this fix won't impact performance, but I never tested that. Performance issues aside, this fix gets my blessing, and I think we need to get the QE team running their official dd_io test from scratch. I think you can change the status to MODIFIED, no?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0995.html
*** Bug 228916 has been marked as a duplicate of this bug. ***