Red Hat Bugzilla – Bug 432108
GFS2: Two gfs2 nodes in four node cluster hung in ls.
Last modified: 2009-05-27 23:39:21 EDT
Description of problem:
Four-node ia64 cluster. After starting a test run I noted that two of the nodes were not making any progress. I attempted to run ls in the working directory to see if the test files were created, and two of the four nodes hung in the ls.

Version-Release number of selected component (if applicable):
link-13,14,15 running 2.6.18-79.el5
link-16 running 2.6.18-76.el5

How reproducible:
Unsure

Steps to Reproduce:
1. mount fs
2. create file
3. ls on all nodes

Actual results:
Two nodes (each running -79) hung in ls.

link-13:
ls            D a0000002010170c0     0  6133   6084 (NOTLB)
Call Trace:
 [<a00000010062d170>] schedule+0x1db0/0x20a0
                        sp=e0000000168bfc20 bsp=e0000000168b9378
 [<a0000002010170c0>] just_schedule+0x20/0x40 [gfs2]
                        sp=e0000000168bfcb0 bsp=e0000000168b9360
 [<a00000010062ebf0>] __wait_on_bit+0xd0/0x180
                        sp=e0000000168bfcb0 bsp=e0000000168b9310
 [<a00000010062ed60>] out_of_line_wait_on_bit+0xc0/0xe0
                        sp=e0000000168bfcb0 bsp=e0000000168b92c8
 [<a000000201017060>] wait_on_holder+0x60/0xa0 [gfs2]
                        sp=e0000000168bfcf0 bsp=e0000000168b92a8
 [<a00000020101ac30>] glock_wait_internal+0x350/0x700 [gfs2]
                        sp=e0000000168bfcf0 bsp=e0000000168b9260
 [<a00000020101b580>] gfs2_glock_nq+0x5a0/0x640 [gfs2]
                        sp=e0000000168bfcf0 bsp=e0000000168b9210
 [<a0000002010399a0>] gfs2_getattr+0x160/0x240 [gfs2]
                        sp=e0000000168bfcf0 bsp=e0000000168b91d0
 [<a00000010017bfe0>] vfs_getattr+0x100/0x220
                        sp=e0000000168bfd30 bsp=e0000000168b9198
 [<a00000010017c170>] vfs_lstat_fd+0x70/0xa0
                        sp=e0000000168bfd30 bsp=e0000000168b9168
 [<a00000010017c6f0>] sys_newlstat+0x30/0x80
                        sp=e0000000168bfdc0 bsp=e0000000168b9108
 [<a00000010000bdb0>] __ia64_trace_syscall+0xd0/0x110
                        sp=e0000000168bfe30 bsp=e0000000168b9108
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
                        sp=e0000000168c0000 bsp=e0000000168b9108

link-14:
ls            D a0000002010170c0     0  5949   5902 (NOTLB)
Call Trace:
 [<a00000010062d170>] schedule+0x1db0/0x20a0
                        sp=e00000002caefc20 bsp=e00000002cae9378
 [<a0000002010170c0>] just_schedule+0x20/0x40 [gfs2]
                        sp=e00000002caefcb0 bsp=e00000002cae9360
 [<a00000010062ebf0>] __wait_on_bit+0xd0/0x180
                        sp=e00000002caefcb0 bsp=e00000002cae9310
 [<a00000010062ed60>] out_of_line_wait_on_bit+0xc0/0xe0
                        sp=e00000002caefcb0 bsp=e00000002cae92c8
 [<a000000201017060>] wait_on_holder+0x60/0xa0 [gfs2]
                        sp=e00000002caefcf0 bsp=e00000002cae92a8
 [<a00000020101ac30>] glock_wait_internal+0x350/0x700 [gfs2]
                        sp=e00000002caefcf0 bsp=e00000002cae9260
 [<a00000020101b580>] gfs2_glock_nq+0x5a0/0x640 [gfs2]
                        sp=e00000002caefcf0 bsp=e00000002cae9210
 [<a0000002010399a0>] gfs2_getattr+0x160/0x240 [gfs2]
                        sp=e00000002caefcf0 bsp=e00000002cae91d0
 [<a00000010017bfe0>] vfs_getattr+0x100/0x220
                        sp=e00000002caefd30 bsp=e00000002cae9198
 [<a00000010017c170>] vfs_lstat_fd+0x70/0xa0
                        sp=e00000002caefd30 bsp=e00000002cae9168
 [<a00000010017c6f0>] sys_newlstat+0x30/0x80
                        sp=e00000002caefdc0 bsp=e00000002cae9108
 [<a00000010000bdb0>] __ia64_trace_syscall+0xd0/0x110
                        sp=e00000002caefe30 bsp=e00000002cae9108
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
                        sp=e00000002caf0000 bsp=e00000002cae9108
Here is what I'm seeing: link-13 and link-14 look like they're both waiting for glocks that they already hold. They both have "trylock" holders for the file /mnt/gfs2/writev-read. Here are the pertinent lockdump sections from all nodes:

link-13:------------------------------------------------------------
Glock 0xe000000002585b48 (2, 0x220efe3)
  gl_flags = 1
  gl_ref = 6
  gl_state = 0
  gl_owner = pid 6112 (d_doio)
  gl_ip = 11529215054675295824
  req_gh = yes
  lvb_count = 0
  object = yes
  le = no
  reclaim = no
  aspace = 0xe000000003dc8180
  nrpages = 0
  ail = 0
  Request
    owner = 6112 (d_doio)
    gh_state = 3
    gh_flags = 1 9
    error = 0
    gh_iflags = 1 10
    initialized at: gfs2_readpage+0xc0/0x280 [gfs2]
  Waiter3
    owner = 6112 (d_doio)
    gh_state = 3
    gh_flags = 1 9
    error = 0
    gh_iflags = 1 10
    initialized at: gfs2_readpage+0xc0/0x280 [gfs2]
  Waiter3
    owner = 6133 (ls)
    gh_state = 3
    gh_flags = 3
    error = 0
    gh_iflags = 1 10
    initialized at: gfs2_getattr+0x150/0x240 [gfs2]
  Inode: busy

link-14:------------------------------------------------------------
Glock 0xe00000002ca17b98 (2, 0x220efe3)
  gl_flags = 1
  gl_ref = 6
  gl_state = 0
  gl_owner = pid 5929 (d_doio)
  gl_ip = 11529215054675295824
  req_gh = yes
  lvb_count = 0
  object = yes
  le = no
  reclaim = no
  aspace = 0xe00000002ca26da0
  nrpages = 0
  ail = 0
  Request
    owner = 5929 (d_doio)
    gh_state = 3
    gh_flags = 1 9
    error = 0
    gh_iflags = 1 10
    initialized at: gfs2_readpage+0xc0/0x280 [gfs2]
  Waiter3
    owner = 5929 (d_doio)
    gh_state = 3
    gh_flags = 1 9
    error = 0
    gh_iflags = 1 10
    initialized at: gfs2_readpage+0xc0/0x280 [gfs2]
  Waiter3
    owner = 5949 (ls)
    gh_state = 3
    gh_flags = 3
    error = 0
    gh_iflags = 1 10
    initialized at: gfs2_getattr+0x150/0x240 [gfs2]
  Inode: busy

link-15:------------------------------------------------------------
Glock 0xe000000028ffd858 (2, 0x220efe3)
  gl_flags = (unlocked)
  gl_ref = 3
  gl_state = 3
  gl_owner = 5840 (ended)
  gl_ip = 11529215054675263056
  req_gh = no
  lvb_count = 0
  object = yes
  le = no
  reclaim = no
  aspace = 0xe00000002814c430
  nrpages = 1
  ail = 0
  Inode:
    num = 25/35712995
    type = 8
    i_flags =

link-16:------------------------------------------------------------
Glock 0xe0000040657e19d0 (2, 0x220efe3)
  gl_flags = (unlocked)
  gl_ref = 3
  gl_state = 3
  gl_owner = 7416 (ended)
  gl_ip = 11529215054678753552
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  le = no
  reclaim = no
  aspace = 0xe000004063480448
  nrpages = 1
  ail = 0
  Inode:
    num = 25/35712995
    type = 8
    i_flags =
Okay, Steve, here is my theory. Does this make sense? The glock is in the unlocked state, but it has a holder, so it was once locked and now isn't. However, the holder didn't get taken off the owner's list (gl_req_gh). This is hard to explain, but my theory is that this sequence of events might be occurring:

1. The glock is locked, so the holder is put on the gl_req_gh list.
2. Function drop_bh is called to drop the holder.
3. drop_bh initializes "gh = gl->gl_req_gh;" without the gl_spin lock held.
4. drop_bh calls state_change.
5. drop_bh calls go_inval.
6. In the meantime, a holder gets put onto the gl_req_gh list.
7. go_inval finishes and returns to drop_bh.
8. Now we've got a holder, but gh is still NULL, so drop_bh never deletes the holder from the list.

In other words, shouldn't function drop_bh() protect its setting of gh = gl->gl_req_gh with gl_spin held, and perhaps only do it after the return from go_inval()? Perhaps drop_bh should look more like this:

	struct gfs2_holder *gh;
	...
	state_change(gl, LM_ST_UNLOCKED);

	if (glops->go_inval)
		glops->go_inval(gl, DIO_METADATA);

	spin_lock(&gl->gl_spin);
	gh = gl->gl_req_gh;
	if (gh) {
		list_del_init(&gh->gh_list);
		gh->gh_error = 0;
	}
	spin_unlock(&gl->gl_spin);

Since there's a spin_lock immediately after, we can eliminate the unlock/relock from the code too. I don't know if this theory holds any water...
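For contrast, here is a condensed sketch of the ordering the theory assumes the current code has. This is illustrative only, reconstructed from steps 1-8 above rather than quoted from the source tree:

	/* Sketch of the suspected ordering, reconstructed from the theory
	 * above; not the verbatim drop_bh() source. */
	struct gfs2_holder *gh = gl->gl_req_gh;	/* step 3: sampled without gl_spin */
	...
	state_change(gl, LM_ST_UNLOCKED);	/* step 4 */

	if (glops->go_inval)
		glops->go_inval(gl, DIO_METADATA);	/* step 5: while this runs, another
							 * thread can set gl_req_gh (step 6) */

	if (gh) {	/* step 8: gh still holds the stale NULL sampled in step 3,
			 * so the newly queued holder is never removed and the
			 * process waiting on it (ls) sits in wait_on_holder(),
			 * as in the backtraces above */
		spin_lock(&gl->gl_spin);
		list_del_init(&gh->gh_list);
		gh->gh_error = 0;
		spin_unlock(&gl->gl_spin);
	}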
Function xmote_bh does the same kind of thing.
gl_req_gh isn't a list; it's a pointer to the lock request which caused the current state change, and it is only valid during a demote or promote which involved some remote operation.
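Roughly (for illustration only, not the exact struct gfs2_glock definition from the header):

	struct gfs2_glock {
		...
		spinlock_t gl_spin;
		/* Not a list head: a single pointer to the holder whose request
		 * triggered the in-flight promote/demote; only meaningful while
		 * a remote lock operation is outstanding. */
		struct gfs2_holder *gl_req_gh;
		...
	};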
These are potentially duplicates of bz 432370. Marking NEEDINFO to have QE try to recreate.
This looks exactly like bug #432057, so unless this can still be recreated, I'll mark it as a duplicate of that bug.
*** This bug has been marked as a duplicate of 432057 ***