Description of problem:

While running d_iogen/d_doio on a 4-node ia64 cluster, during the writev-readv test case (write data via writev(), verify the data with readv()), one of the d_doio processes stalled. I will include the complete sysrq-t from that node as an attachment. Here is the stack from the hung d_doio:

d_doio        D a00000010062f8c0     0 32128  32127                (NOTLB)

Call Trace:
 [<a00000010062e470>] schedule+0x1dd0/0x20c0  sp=e00000407d25f9b0 bsp=e00000407d259790
 [<a00000010062f8c0>] io_schedule+0x80/0xe0  sp=e00000407d25fac0 bsp=e00000407d259770
 [<a0000001000ff920>] sync_page+0x100/0x120  sp=e00000407d25fac0 bsp=e00000407d259750
 [<a00000010062fc60>] __wait_on_bit_lock+0xa0/0x180  sp=e00000407d25fac0 bsp=e00000407d259708
 [<a0000001000ff620>] __lock_page+0xc0/0xe0  sp=e00000407d25fac0 bsp=e00000407d2596e8
 [<a0000001001be8a0>] mpage_writepages+0x2c0/0x8a0  sp=e00000407d25fb00 bsp=e00000407d259670
 [<a000000100110130>] do_writepages+0xd0/0x120  sp=e00000407d25fba0 bsp=e00000407d259648
 [<a0000001001017b0>] __filemap_fdatawrite_range+0xb0/0xe0  sp=e00000407d25fba0 bsp=e00000407d259610
 [<a000000100101b80>] filemap_fdatawrite+0x40/0x60  sp=e00000407d25fbe0 bsp=e00000407d2595e8
 [<a0000002014adcb0>] inode_go_sync+0xd0/0x320 [gfs2]  sp=e00000407d25fbe0 bsp=e00000407d2595c0
 [<a0000002014adf70>] inode_go_drop_th+0x70/0xa0 [gfs2]  sp=e00000407d25fbe0 bsp=e00000407d2595a0
 [<a0000002014a9290>] gfs2_glock_drop_th+0x70/0x360 [gfs2]  sp=e00000407d25fbe0 bsp=e00000407d259570
 [<a0000002014aa4d0>] run_queue+0x330/0x920 [gfs2]  sp=e00000407d25fbe0 bsp=e00000407d259528
 [<a0000002014ab730>] gfs2_glock_nq+0x550/0x640 [gfs2]  sp=e00000407d25fbe0 bsp=e00000407d2594d8
 [<a0000002014af360>] gfs2_glock_nq_atime+0x260/0x6e0 [gfs2]  sp=e00000407d25fbe0 bsp=e00000407d259448
 [<a0000002014c15b0>] gfs2_readpage+0xd0/0x280 [gfs2]  sp=e00000407d25fc10 bsp=e00000407d259408
 [<a000000100100dd0>] do_generic_mapping_read+0x470/0x880  sp=e00000407d25fc50 bsp=e00000407d259350
 [<a000000100102360>] __generic_file_aio_read+0x280/0x3c0  sp=e00000407d25fca0 bsp=e00000407d2592f8
 [<a0000001001025a0>] generic_file_readv+0x100/0x140  sp=e00000407d25fcc0 bsp=e00000407d2592c0
 [<a000000100163d90>] do_readv_writev+0x350/0x620  sp=e00000407d25fda0 bsp=e00000407d259258
 [<a000000100164350>] vfs_readv+0xd0/0x100  sp=e00000407d25fe20 bsp=e00000407d259220
 [<a0000001001653f0>] sys_readv+0x70/0x120  sp=e00000407d25fe20 bsp=e00000407d2591a0
 [<a00000010000bdb0>] __ia64_trace_syscall+0xd0/0x110  sp=e00000407d25fe30 bsp=e00000407d2591a0
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400  sp=e00000407d260000 bsp=e00000407d2591a0

Version-Release number of selected component (if applicable):
kernel: 2.6.18-76.el5

How reproducible:
Unknown

Steps to Reproduce:
1. d_iogen -I 12321 -i 1200s -s writev -v readv -F 1m:writev-readv
2. d_doio -P fore -I 12321 -a -m 1000 -r 100 -n 1   # On each node in the cluster
Created attachment 294383 [details] sys-req-t of node with hung d_doio
Hit this again, 2.6.18-81.el5:

d_doio        D a00000010062e600     0  9181   5410                (NOTLB)

Call Trace:
 [<a00000010062d1b0>] schedule+0x1db0/0x20a0  sp=e00000406fbcfaa0 bsp=e00000406fbc9718
 [<a00000010062e600>] io_schedule+0x80/0xe0  sp=e00000406fbcfb30 bsp=e00000406fbc96f8
 [<a0000001000ff700>] sync_page+0x100/0x120  sp=e00000406fbcfb30 bsp=e00000406fbc96d8
 [<a00000010062e9a0>] __wait_on_bit_lock+0xa0/0x180  sp=e00000406fbcfb30 bsp=e00000406fbc9690
 [<a0000001000ff400>] __lock_page+0xc0/0xe0  sp=e00000406fbcfb30 bsp=e00000406fbc9668
 [<a0000001001be500>] mpage_writepages+0x2c0/0x8a0  sp=e00000406fbcfb70 bsp=e00000406fbc95e8
 [<a00000010010fe90>] do_writepages+0xd0/0x120  sp=e00000406fbcfc10 bsp=e00000406fbc95c0
 [<a0000001001015a0>] __filemap_fdatawrite_range+0xc0/0xe0  sp=e00000406fbcfc10 bsp=e00000406fbc9580
 [<a000000100101960>] filemap_fdatawrite+0x40/0x60  sp=e00000406fbcfc50 bsp=e00000406fbc9560
 [<a0000002013dda50>] inode_go_sync+0xd0/0x320 [gfs2]  sp=e00000406fbcfc50 bsp=e00000406fbc9538
 [<a0000002013ddd10>] inode_go_drop_th+0x70/0xa0 [gfs2]  sp=e00000406fbcfc50 bsp=e00000406fbc9518
 [<a0000002013d8d10>] gfs2_glock_drop_th+0x70/0x340 [gfs2]  sp=e00000406fbcfc50 bsp=e00000406fbc94e8
 [<a0000002013da230>] run_queue+0x2f0/0x900 [gfs2]  sp=e00000406fbcfc50 bsp=e00000406fbc94a0
 [<a0000002013db490>] gfs2_glock_nq+0x550/0x640 [gfs2]  sp=e00000406fbcfc50 bsp=e00000406fbc9450
 [<a0000002013df1e0>] gfs2_glock_nq_atime+0x260/0x6e0 [gfs2]  sp=e00000406fbcfc50 bsp=e00000406fbc93b8
 [<a0000002013f0dd0>] gfs2_readpage+0xd0/0x280 [gfs2]  sp=e00000406fbcfc80 bsp=e00000406fbc9378
 [<a000000100100bb0>] do_generic_mapping_read+0x470/0x880  sp=e00000406fbcfcc0 bsp=e00000406fbc92c8
 [<a000000100102140>] __generic_file_aio_read+0x280/0x3c0  sp=e00000406fbcfd10 bsp=e00000406fbc9270
 [<a0000001001049c0>] generic_file_read+0x120/0x160  sp=e00000406fbcfd30 bsp=e00000406fbc9230
 [<a0000001001646c0>] vfs_read+0x200/0x3a0  sp=e00000406fbcfe20 bsp=e00000406fbc91d8
 [<a000000100164d90>] sys_read+0x70/0xe0  sp=e00000406fbcfe20 bsp=e00000406fbc9160
 [<a00000010000bdb0>] __ia64_trace_syscall+0xd0/0x110  sp=e00000406fbcfe30 bsp=e00000406fbc9160
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400  sp=e00000406fbd0000 bsp=e00000406fbc9160
This trace shows the problem quite nicely.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 295614 [details]
Patch to fix bug for upstream kernels

I don't like having to add a memory allocation here. We'll probably get away with it simply because we have readpages and thus readpage isn't called very often these days. The place where it might matter is with stuffed files where readpages currently always delegates its duties to readpage. If we need to fix that then we should be able to do it by making readpages handle stuffed files itself. My quick test with postmark suggests that the difference isn't measurable though. Also I'll be discussing this and similar problems with those at the filesystems workshop next week.
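To make the approach concrete, the overall shape is something like the sketch below. This is illustrative only, not the attached patch: helper names such as __gfs2_readpage(), and the exact flags and error handling, are assumptions on my part. The point is that the kmalloc'd holder is what carries the glock across the AOP_TRUNCATED_PAGE round trip, since we can't block on the glock while the VFS holds the page lock:

static int gfs2_readpage(struct file *file, struct page *page)
{
	struct gfs2_inode *ip = GFS2_I(page->mapping->host);
	struct gfs2_holder *gh;
	int error;

	/* Did an earlier pass through here already queue a holder for us? */
	gh = gfs2_glock_is_locked_by_me(ip->i_gl);
	if (gh == NULL) {
		gh = kmalloc(sizeof(struct gfs2_holder), GFP_NOFS);
		if (gh == NULL) {
			unlock_page(page);
			return -ENOMEM;
		}
		/* Drop the page lock first; now blocking on the glock is safe */
		unlock_page(page);
		gfs2_holder_init(ip->i_gl, LM_ST_SHARED, GL_ATIME, gh);
		error = gfs2_glock_nq_atime(gh);
		if (error) {
			gfs2_holder_uninit(gh);
			kfree(gh);
			return error;
		}
		/* Ask the VFS to re-lock the page and call us back */
		return AOP_TRUNCATED_PAGE;
	}

	/* Second pass: glock already held, do the real read */
	error = __gfs2_readpage(file, page);

	gfs2_glock_dq(gh);
	gfs2_holder_uninit(gh);
	kfree(gh);
	return error;
}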
Created attachment 296569 [details] The RHEL version of the patch
Abhi, can you include the patch from comment #7 in your next build so that we can test it?
Am I the only one who crashes with this patch? I get panics running kmod-gfs2-xen-1.84-1.2.i686. If I remove this patch, everything is fine. It looks to me like I'm crashing in add_to_queue() because someone is messing with the gl_holders list while I'm traversing it, but I can't find anywhere in the patch where the holders are modified without holding the gl_spin lock. Maybe I'm missing something.
GAH!!!!! It's sizeof(struct gfs2_holder), not sizeof(struct gfs2_holder *). I hate it when the simple answer is staring me in the face for days and I keep overlooking it for something complicated. Only the RHEL patch has the typo. I'm verifying that everything works now, but I'm pretty sure that this is the whole of the problem.
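To spell the typo out: the two expressions differ by the size of a pointer versus the size of the whole structure, so the bad allocation hands back only 4 or 8 bytes, and every field written into the holder tramples the neighbouring heap object - which matches the gl_holders list corruption seen in comment #9:

gh = kmalloc(sizeof(struct gfs2_holder *), GFP_NOFS); /* wrong: allocates pointer-sized */
gh = kmalloc(sizeof(struct gfs2_holder), GFP_NOFS);   /* right: allocates the holder    */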
Unfortunately, there is still a problem. I'm running into a bug where add_to_queue() calls find_holder_by_owner() to see if the current process already has a holder on the glock's holder list, and it finds one. The original holder is from gfs2_readpage(). I added a GL_NEED_TO_RETRY bit to the gh_flags to mark when a holder was created in gfs2_readpage(), and then I clear it when gfs2_readpage() is called again and finds the holder. When I see this error, the original holder doesn't have this bit cleared, so it seems pretty obvious that in some cases, when you return AOP_TRUNCATED_PAGE, you don't get called again. Every time I've seen this bug, the original holder was created by a readpage call from do_generic_mapping_read(). I found this running kmod-gfs2-1.84-1.4.el5 with the kmalloc size fix, compiled for the 2.6.18-83.el5 kernel, since the -84 kernel has the sctp problems. But unless do_generic_mapping_read() works differently in the -84 kernel, this bug should be there too.
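For reference, the check in question looks roughly like this (sketched from the 2.6.18-era glock.c; the owner field's exact type may differ between versions):

/* Walk a holder list looking for one queued by the given task.
 * Callers hold gl->gl_spin. */
static struct gfs2_holder *find_holder_by_owner(struct list_head *head,
						struct task_struct *owner)
{
	struct gfs2_holder *gh;

	list_for_each_entry(gh, head, gh_list) {
		if (gh->gh_owner == owner)
			return gh;
	}
	return NULL;
}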
I suspect there is a race. Just before the call to readpage there is this:

	/* Did somebody else fill it already? */
	if (PageUptodate(page)) {
		unlock_page(page);
		goto page_ok;
	}

readpage:
	/* Start the actual read. The read will unlock the page. */
	error = mapping->a_ops->readpage(filp, page);

and I suspect that after we've been around this loop for the second time, another process has raced, brought the page uptodate, and thus we've skipped the second readpage; we only find out when the same process later tries to bring the same page uptodate again and hits the original lock.

The odd thing though is why you see this in find_holder_by_owner, since what should be happening is that the lock would be found on entry to readpage and thus it shouldn't be trying to lock it again.

Sorry about the typo. It was probably a cut & paste error when I copied the patch into RHEL.

Also I guess it could be worth dropping Mark a line and talking to him about how OCFS2 deals with this. We are using the same technique as Mark does, and I think he also has the same bug wrt not calling back into readpage if there has been a race in bringing the page uptodate.
The reason you don't find the original holder until find_holder_by_owner is that when you hit this, the process isn't doing another gfs2_readpage() call. I have seen it happen with gfs2_mmap() and gfs2_readpages().
Created attachment 297894 [details]
readpage patch with malloc fix, and some checks to free forgotten holders

This patch has a possible partial fix. There are definitely some calls to readpage that don't go through gfs2_aio_file_read(), gfs2_file_readv() and gfs2_file_read(), so I still see this bug, but I can get much further. Unfortunately, I keep running into sctp panics, so I can't see whether it's possible to check for forgotten holders for all the calls to readpage(). But if it is possible to check after all the readpage calls, then this seems like a reasonable workaround for RHEL.
Created attachment 298191 [details]
another version of my patch that handles more readpage callers

Here is another version of my patch. Even if this isn't the way we go, this may be useful for QA, so that they can test bugs that rely on this fix, like 428751, until the real fix is done. This version of the patch handles all the readpage callers except the cachefiles interface, but I highly doubt that QA checks that cachefiles work correctly on gfs2. Unfortunately, I never got time to verify that it correctly handles all the other readpage calls. I've only verified that it patches gfs2-kmod-1.85-1.2.el5 and compiles correctly.
Unless another idea has been proposed since I looked at this last, the current proposed solution is to make sure that we don't do a demote from a try lock request. Instead, the demote will be queued up and handled by the work queue. This means that try lock requests will behave as expected and not do blocking work if they can't directly acquire the lock. I'll look into changing this.
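The caller-visible effect should be that a try lock which cannot be granted without blocking work comes straight back with GLR_TRYFAILED instead of stalling. Sketched from the readpage side (illustrative only; the flag names come from the existing glock API, but this is not a quote from any patch):

/* Queue a try lock; if the glock can't be granted without first doing
 * blocking work (e.g. a demote), fail immediately and let the VFS
 * retry the page while the workqueue performs the demote. */
gfs2_holder_init(ip->i_gl, LM_ST_SHARED, GL_ATIME | LM_FLAG_TRY_1CB, &gh);
error = gfs2_glock_nq_atime(&gh);
if (error == GLR_TRYFAILED) {
	unlock_page(page);
	yield();		/* give the workqueue a chance to run */
	error = AOP_TRUNCATED_PAGE;
}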
Yes, that's the plan. I have a patch, not yet fully debugged, to try and do that.
Created attachment 302891 [details]
Test patch for fixing this bug and cluster coherency bug

This is a test patch (vs. upstream) which is aimed at fixing this bug by making "try locks" not do demotions where that might lead to blocking. It's still experimental - it needs lots of testing; it currently passes single node mount/postmark/umount tests. We need some more demanding (lots of glocks) tests as well as tests of the cluster behaviour. I'd be surprised if there was not the odd bug still lurking in the code somewhere, bearing in mind the scale of the change. On a single node, though, it seems to be running nicely.

Known bugs:

o There is a problem with dumping glocks via debugfs which sometimes results in a lock up. It seems to be related to finding a glock which has been removed from the lists during glock dumping. When I hit it I get an oops which points to a poisoned list pointer (the ->next pointer to be specific). I can hit it from time to time by dumping glocks while running postmark.

o I might get rid of gfs2_glockd

o The glock dump format is new and experimental. It's much faster than the old format to dump out. It will need some tweaking, but I'd like to try and keep each line to 80 cols and one line per object if at all possible. Let me know what you think on this issue.

Other than that, let's test it as much as we can and find/fix as many bugs as possible in the meantime.
The good news: This patch passes my "hell" tests.

The bad news: I was able to crash it by doing this (single node):

time /usr/tests/sts-rhel5.1/gfs/bin/d_io -m /mnt/gfs2 -n roth-01 -S REG -l /usr/tests/sts-rhel5.1/ -r /usr/tests/sts-rhel5.1/

Here is the call trace:

BUG: unable to handle kernel paging request at ffff81026cb29a66
IP: [<ffffffff8840d2a9>] :gfs2:gfs2_block_map+0x7ed/0xab0
PGD 8063 PUD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: lock_dlm gfs2 dlm configfs autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_multipath sbs sbshc battery acpi_memhotplug ac parport_pc lp parport joydev sg i3000_edac i2c_i801 i2c_core edac_core ide_cd_mod button tg3 cdrom pcspkr serio_raw dcdbas shpchp dm_snapshot dm_zero dm_mirror dm_mod qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: gfs2]
Pid: 5898, comm: xdoio Not tainted 2.6.25-rc7 #1
RIP: 0010:[<ffffffff8840d2a9>]  [<ffffffff8840d2a9>] :gfs2:gfs2_block_map+0x7ed/0xab0
RSP: 0018:ffff81006cb29968  EFLAGS: 00010297
RAX: 00000000ffffffff RBX: 0000000000000002 RCX: 0000000000000000
RDX: ffff81001d212d88 RSI: 0000000000000000 RDI: ffff81002de2c4d8
RBP: ffffffffffffffff R08: 0000000000000000 R09: ffff81007f41b368
R10: ffff81005f130f30 R11: 00000000000000d0 R12: ffff81007f41b368
R13: 0000000000000000 R14: ffff81002de2c470 R15: 00000000000183fd
FS:  00007fdc02af36f0(0000) GS:ffff81007fba7840(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff81026cb29a66 CR3: 000000007e4a1000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process xdoio (pid: 5898, threadinfo ffff81006cb28000, task ffff81005dbd94a0)
Stack:  ffff81007c143e40 0000000100010123 0000000000001387 ffff81001655e150
 0000000000010123 000000018841b0cd 0000000000000018 ffff81001d3d6810
 ffff81002de2c408 ffff81001d212d88 ffff81006216c000 ffff81001655e000
Call Trace:
 [<ffffffff802ac4c7>] ? __block_prepare_write+0x19d/0x36c
 [<ffffffff8840cabc>] ? :gfs2:gfs2_block_map+0x0/0xab0
 [<ffffffff8840dac6>] ? :gfs2:gfs2_unstuff_dinode+0x379/0x38b
 [<ffffffff802ac6b0>] ? block_prepare_write+0x1a/0x25
 [<ffffffff8841d324>] ? :gfs2:gfs2_write_begin+0x230/0x2c0
 [<ffffffff80266b38>] ? generic_file_buffered_write+0x14a/0x62c
 [<ffffffff80238c77>] ? current_fs_time+0x1e/0x24
 [<ffffffff80267356>] ? __generic_file_aio_write_nolock+0x33c/0x3a6
 [<ffffffff884150c8>] ? :gfs2:gfs2_glock_dq+0x1e/0x112
 [<ffffffff80245e68>] ? wake_up_bit+0x11/0x22
 [<ffffffff80267421>] ? generic_file_aio_write+0x61/0xc1
 [<ffffffff8028d2a7>] ? do_sync_write+0xc9/0x10c
 [<ffffffff8841e2f2>] ? :gfs2:gfs2_open+0x0/0x13c
 [<ffffffff80245e79>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffff80474a46>] ? lock_kernel+0x1b/0x37
 [<ffffffff8028dd93>] ? remote_llseek+0x73/0x7b
 [<ffffffff8028da09>] ? vfs_write+0xad/0x136
 [<ffffffff8028df46>] ? sys_write+0x45/0x6e
 [<ffffffff8020bfc9>] ? tracesys+0xdc/0xe1

BTW, this was done on the upstream kernel (from today's nmw tree).
I recreated the problem from comment #20 on the -nmw tree that DOES NOT include the patch, so the problem exists in the current upstream kernel GFS2. So it's not this patch causing the problem. It's something prior. I suspected as much, given the symptoms seem unrelated to the glock management.
The problem from comment #20 seems to go back to function gfs2_bmap_alloc which isn't in RHEL. It appears that function zero_metapath_length is returning 0. Coming into ALLOC_GROW_HEIGHT of the state machine, we have i==1. Coming into ALLOC_GROW_DEPTH, i==0. A call to function gfs2_indirect_init is made which tries to pass mp->mp_list[i-1], but since i==0, this i - 1 is -1, and that's what is blowing up. The problem seems to be when we have a stuffed file and then we go to write a bunch of data at a fairly big offset. Dean suggested this command to test it in addition to the command I gave above: xiogen -z -m random -s write -i 10 -t 100b -T 1000b -F 10000b:foofile | xdoio -v -p none I spoke with Steve about it and tried a few experiments, such as using "i = zmpl + 1" rather than "i = zmpl" but that broke other things. I also tried adding if clauses such as "i > 0 &&" but I don't know the intent of this function well enough. This is going to have to be well thought-out. There's also a concern with it potentially doing a brelse on the dinode bh. Since Steve wrote this function and the whole "zmpl" thing, and since it's not in RHEL, I didn't want to spend too much time debugging it. I'll let Steve do the heavy lifting on this. Instead, I'm focusing my efforts on porting his patch to RHEL and getting it working / tested. That's been a big challenge because there are a lot of changes between RHEL and upstream, such as pid number management and references to go_drop_th. It's not a straightforward port as I was hoping. I've now got a version that compiles but I haven't begun testing it to see what breaks. My fear, of course, is that there might be assumptions in upstream that make this port unusable (unless we port a ton of other upstream changes). I guess I'll find out soon.
Created attachment 303203 [details]
RHEL5 port of readv patch

Here is my RHEL5 port of Steve's readv patch from comment #19. Due to the number of patch rejects caused by differences between RHEL5 and upstream, I won't guarantee this is a good patch. Again, I've got good news and bad news:

The good news: This patch passes all my "hell" tests. It also does not crash when I do the command from comment #20.

The bad news: The d_io test from comment #20 ran a long time, but it hung up permanently while performing the "laio_rwsdirect" test. It's not a hard lock, but the xdoio processes are hung. There seem to be pairs of xdoio processes, one of which is stuck in the "D" state. So I need to take a sysrq-t and debug this. This could be a problem with my port or, as I said before, due to assumptions that are no longer true because of differences between upstream and RHEL.
Created attachment 303208 [details]
sysrq-t at the hang

This only shows two xdoio processes, both stuck in glock_wait.
Created attachment 303209 [details]
New glock dump at the time of the hang

The glock dump didn't hang or kill the system, and that's good. The new format is going to take some getting used to.
I tried some experimental changes to the patch but the problem is still there (I recreated it again). There are a couple of things that concern me about this patch. It might be smaller, but it is perhaps more confusing to the untrained eye. For example:

1. It bothers me that finish_xmote can call do_xmote AND do_xmote can also call finish_xmote. That just seems wrong to me. Not intuitive at any rate.

2. It also seems wrong that run_queue can call do_xmote twice in the demote case. It's my belief (completely unproven) that this is why the gl_ref count continually goes up after the failure occurs.

3. It bothers me that run_queue always calls do_xmote, even if it does not find any waiters on the queue. This makes me wonder if the patch impacts performance.

Perhaps I just need to study the logic of it some more.
To answer some of these points....

1. This was true of the previous code as well; it's just that now it's obvious that this happens. The recursion is strictly limited, ending when we obtain the lock mode that we requested, or when we receive an error. The worst case looks like this:

do_xmote:     Try to promote from shared to exclusive
finish_xmote: try lock fails due to conflicting holders
do_xmote:     called to demote to unlocked
finish_xmote: now unlocked
do_xmote:     need to send a request for exclusive lock
finish_xmote: exclusive lock granted

i.e. a maximum depth of three calls, which is not excessive. In fact, in practice it's less than that, since finish_xmote is run in the context of the dlm's callback thread, and we never experience this particular case when running with lock_nolock since the conversion will always succeed.

2. run_queue has to call do_xmote twice if there is a queued demote request, since it has to do that demote request first and then process the pending promote to avoid deadlock. This was also true of the previous code.

3. run_queue doesn't call do_xmote unless there is either a pending demote request, or there are waiters on the queue which can be processed only if there is a change of the lock mode (i.e. do_promote returns non-zero).
Created attachment 303500 [details] RHEL5 port of readv patch (revised) This RHEL5 port of Steve's patch passes the "hell" tests and the d_io test that was failing in previous comments. It also has a few minor whitespace fixes I grabbed from upstream. I'm a little concerned that I may have grabbed too much, in my effort to solve some of these issues. We need to do some cluster tests; perhaps distributed d_io (dd_io) and revolver with locks.
Found some problems during single-node testing. Steve made a couple of corrections to the patch, particularly in function may_grant(). This seemed to fix the single-node problems; I've been running specsfs on Barry's machine for nearly two hours now, and everything seems happy. It's on run 6 out of 14. I've also been running benchp13 on kool and it's happy too, on iteration 222. There does seem to be a bit of a performance increase, at least with benchp13. However, Abhi found some problems running distributed d_io (dd_io) on the smoke cluster. All of the d_doio processes on system "salem" are stuck waiting for an unlocked glock. The important part of the glock dump looks like this:

G: s:UN n:2/19660907 f:l t:EX d:EX/0 l:0 a:0 r:9
 H: s:EX f:W e:0 p:6189 [d_doio] gfs2_sharewrite_nopage+0x5b/0x2bf [gfs2]
 H: s:EX f:aW e:0 p:6192 [d_doio] gfs2_prepare_write+0xc7/0x360 [gfs2]
 H: s:EX f:W e:0 p:6190 [d_doio] gfs2_sharewrite_nopage+0x5b/0x2bf [gfs2]
 H: s:EX f:W e:0 p:6191 [d_doio] gfs2_sharewrite_nopage+0x5b/0x2bf [gfs2]
 H: s:SH f:aW e:0 p:6193 [d_doio] gfs2_permission+0x7b/0xd4 [gfs2]
 H: s:SH f:aW e:0 p:6275 [ls] gfs2_getattr+0x7d/0xc4 [gfs2]
Created attachment 303542 [details]
Call trace from salem

This is a gzip of Salem's call-trace at the hang.
Okay, salem, winston and merit are all apparently waiting for a glock that's held by camel. Here are the respective excerpts from the glock dumps:

salem:
G: s:UN n:2/19660907 f:l t:EX d:EX/0 l:0 a:0 r:9
 H: s:EX f:W e:0 p:6189 [d_doio] gfs2_sharewrite_nopage+0x5b/0x2bf [gfs2]
 H: s:EX f:aW e:0 p:6192 [d_doio] gfs2_prepare_write+0xc7/0x360 [gfs2]
 H: s:EX f:W e:0 p:6190 [d_doio] gfs2_sharewrite_nopage+0x5b/0x2bf [gfs2]
 H: s:EX f:W e:0 p:6191 [d_doio] gfs2_sharewrite_nopage+0x5b/0x2bf [gfs2]
 H: s:SH f:aW e:0 p:6193 [d_doio] gfs2_permission+0x7b/0xd4 [gfs2]
 H: s:SH f:aW e:0 p:6275 [ls] gfs2_getattr+0x7d/0xc4 [gfs2]

winston:
cat /sys/kernel/debug/gfs2/Smoke4\:gfs20/glocks > /tmp/glocks.winston
G: s:UN n:2/19660907 f:l t:EX d:EX/0 l:0 a:0 r:5
 H: s:EX f:aW e:0 p:2859 [d_doio] gfs2_prepare_write+0xa8/0x325 [gfs2]
 H: s:EX f:W e:0 p:2856 [d_doio] gfs2_sharewrite_nopage+0x43/0x28c [gfs2]

merit:
cat /sys/kernel/debug/gfs2/Smoke4\:gfs20/glocks > /tmp/glocks.merit
G: s:UN n:2/19660907 f:l t:EX d:EX/0 l:0 a:0 r:8
 H: s:EX f:aW e:0 p:2861 [d_doio] gfs2_prepare_write+0xa8/0x325 [gfs2]
 H: s:EX f:W e:0 p:2863 [d_doio] gfs2_sharewrite_nopage+0x43/0x28c [gfs2]
 H: s:SH f:TaW e:0 p:2859 [d_doio] gfs2_readpages+0x4f/0xcc [gfs2]
 H: s:SH f:aW e:0 p:2860 [d_doio] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:2862 [d_doio] gfs2_permission+0x69/0xb4 [gfs2]

camel:
cat /sys/kernel/debug/gfs2/Smoke4\:gfs20/glocks > /tmp/glocks.camel
G: s:EX n:2/19660907 f:lD t:UN d:UN/10443042000 l:0 a:0 r:5
 H: s:SH f:TaW e:0 p:2840 [d_doio] gfs2_readpage+0x4f/0xea [gfs2]
 H: s:EX f:aW e:0 p:2839 [d_doio] gfs2_prepare_write+0xa8/0x325 [gfs2]
 I: n:1048602/19660907 t:8 f:0x00000001

Camel's pid 2840 has this call trace:

d_doio        D 0000006F  1988  2839   2837          2840  2838 (NOTLB)
 cf7e3c34 00000082 006b1354 0000006f 00000000 00000003 0002407b 00000007
 df728aa0 c06713c0 006b7a4e 0000006f 000066fa 00000000 df728bac c13f4ee0
 c04d94a8 d3a225c0 e01dc840 c042cb1a dc71ce1c ffffffff 00000000 00000000
Call Trace:
 [<c04d94a8>] generic_unplug_device+0x15/0x22
 [<e01dc840>] dm_table_unplug_all+0x22/0x2e [dm_mod]
 [<c042cb1a>] getnstimeofday+0x30/0xb6
 [<c06083dc>] io_schedule+0x36/0x59
 [<c045520c>] sync_page+0x0/0x3b
 [<c0455244>] sync_page+0x38/0x3b
 [<c06084ee>] __wait_on_bit_lock+0x2a/0x52
 [<c0455187>] __lock_page+0x52/0x59
 [<c0435f90>] wake_bit_function+0x0/0x3c
 [<c045c0e8>] truncate_inode_pages_range+0x201/0x256
 [<c045c146>] truncate_inode_pages+0x9/0xf
 [<e0592f47>] inode_go_inval+0x40/0x4b [gfs2]
 [<e0592774>] do_xmote+0x66/0xfe [gfs2]
 [<e0592859>] run_queue+0x4d/0x9c [gfs2]
 [<e0592a01>] gfs2_glock_nq+0x159/0x17a [gfs2]
 [<e059387f>] gfs2_glock_nq_atime+0xfa/0x2c2 [gfs2]
 [<e0599ea6>] gfs2_prepare_write+0xb8/0x325 [gfs2]
 [<c045661f>] generic_file_buffered_write+0x226/0x5a2
 [<c0456e41>] __generic_file_aio_write_nolock+0x4a6/0x52a
 [<c0407406>] do_IRQ+0xa5/0xae
 [<c045972f>] get_page_from_freelist+0x2ba/0x333
 [<c045709f>] generic_file_write+0x0/0x94
 [<c0456ff5>] __generic_file_write_nolock+0x86/0x9a
 [<c0435f63>] autoremove_wake_function+0x0/0x2d
 [<c060868e>] mutex_lock+0xb/0x19
 [<c04570d9>] generic_file_write+0x3a/0x94
 [<c045709f>] generic_file_write+0x0/0x94
 [<c0471033>] vfs_write+0xa1/0x143
 [<c0471625>] sys_write+0x3c/0x63
 [<c0404eff>] syscall_call+0x7/0xb

Camel's pid 2839 call trace:

d_doio        D 0000006E  2448  2838   2837          2839       (NOTLB)
 cf7d1ecc 00000082 78b3b5c3 0000006e ce384000 00000046 c11c7060 00000006
 df4da550 c06713c0 7997d7cd 0000006e 00e4220a 00000000 df4da65c c13f4ee0
 c067df28 000280d2 c11c7c40 00000000 c067ba00 ffffffff 00000000 00000000
Call Trace:
 [<c06087b3>] __mutex_lock_slowpath+0x4d/0x7c
 [<c06087f1>] .text.lock.mutex+0xf/0x14
 [<c0457031>] generic_file_writev+0x28/0x96
 [<c0457009>] generic_file_writev+0x0/0x96
 [<c045709f>] generic_file_write+0x0/0x94
 [<c0470dfe>] do_readv_writev+0x166/0x277
 [<c0460e76>] __handle_mm_fault+0x896/0x8b3
 [<c044b0d5>] audit_syscall_entry+0x14b/0x17d
 [<c0470f46>] vfs_writev+0x37/0x43
 [<c047138a>] sys_writev+0x3c/0x96
 [<c0404eff>] syscall_call+0x7/0xb
Benchp13 is still running happily. It's on iteration 748. Unfortunately, specsfs ended abnormally on iteration 12 of 14, after several hours of running successfully. There were no messages in dmesg, and no processes were in the D state. Unfortunately, there are 57 GFS2 file systems mounted, which makes it difficult to narrow down what's happening. I'll see if I can dump them all and spot anything wrong, but it's a needle in a haystack, since there aren't any hung processes and there are huge numbers of files open. However, the timing of this problem makes me believe it might have something to do with vfs trimming the inodes in cache.
Thanks for the info in comment #31. In fact you've given me the backtraces for the wrong processes, but it doesn't matter as I think I've figured out what's going on there anyway. So this is my theory: readpage has queued a "try lock" but, due to the pending demote request, it exited run_queue at the test:

	if (gh || nonblock)
		return;

so that it was then stuck until the EX request was queued from prepare_write. That request did actually try to do the demote, but it got stuck since readpage is still holding a lock on one of the pages; hence deadlock. Now what should have happened is that in the case where "nonblock" is true, we should have returned GLR_TRYFAILED to the try locks and then scheduled the work queue to do the demote in the meantime (it would be a good plan to have a yield() in the GLR_TRYFAILED path in readpage to let it run). This should mean that a further request from readpage would then be successful. We thus need to insert something like:

	if (nonblock)
		fail_trylocks_and_reschedule(gl);

just before the "if (gh || nonblock)" in run_queue(), and I'll try and write a suitable function to do that in the meantime.
I suspect that we need something like this for the actual function. This is what I'm about to start testing:

static void fail_trylocks_and_reschedule(struct gfs2_glock *gl)
{
	do_error(gl, 0);
	gfs2_glock_hold(gl);
	if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
		gfs2_glock_put(gl);
}
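Combined with the placement described in the previous comment, the relevant part of run_queue() would then read something like this (a sketch, not the final patch):

	/* ...after selecting any pending demote/promote work... */
	if (nonblock)
		fail_trylocks_and_reschedule(gl); /* fail try locks, defer the demote */
	if (gh || nonblock)
		return;
	/* blocking callers fall through and do the state change directly */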
Created attachment 303631 [details]
Respin of the RHEL5 patch

This version of the patch contains the latest changes Steve mentioned.
Created attachment 303637 [details]
Respin of respin

I caved in to peer pressure and removed the white-space changes to inode.c from the patch. Otherwise it's the same.
Another hang on the smoke cluster:

salem:
G: s:EX n:2/132425 f:lD t:UN d:UN/1744247000 l:0 a:0 r:5
 H: s:SH f:TaW e:0 p:9381 [d_doio] gfs2_readpage+0x60/0xf6 [gfs2]
 H: s:EX f:aW e:0 p:9373 [d_doio] gfs2_prepare_write+0xc7/0x360 [gfs2]
 I: n:25/132425 t:8 f:0x00000001

camel:
G: s:UN n:2/132425 f:l t:EX d:EX/0 l:0 a:0 r:6
 H: s:EX f:aW e:0 p:6318 [d_doio] gfs2_prepare_write+0xa8/0x325 [gfs2]
 H: s:SH f:TaW e:0 p:6329 [d_doio] gfs2_readpage+0x4f/0xea [gfs2]
 H: s:SH f:TaW e:0 p:6325 [d_doio] gfs2_readpage+0x4f/0xea [gfs2]

merit:
G: s:UN n:2/132425 f:l t:SH d:EX/0 l:0 a:0 r:8
 H: s:SH f:aW e:0 p:6347 [d_doio] gfs2_open+0x9f/0x112 [gfs2]
 H: s:SH f:aW e:0 p:6339 [d_doio] gfs2_open+0x9f/0x112 [gfs2]
 H: s:SH f:aW e:0 p:6346 [d_doio] gfs2_open+0x9f/0x112 [gfs2]
 H: s:SH f:TaW e:0 p:6336 [d_doio] gfs2_readpage+0x4f/0xea [gfs2]
 H: s:SH f:TaW e:0 p:6333 [d_doio] gfs2_readpage+0x4f/0xea [gfs2]

winston:
G: s:UN n:2/132425 f:l t:SH d:EX/0 l:0 a:0 r:8
 H: s:SH f:aW e:0 p:6336 [d_doio] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:6343 [d_doio] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:6346 [d_doio] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:TaW e:0 p:6347 [d_doio] gfs2_readpage+0x4f/0xea [gfs2]
 H: s:SH f:TaW e:0 p:6348 [d_doio] gfs2_readpage+0x4f/0xea [gfs2]
Created attachment 303646 [details] New call trace from salem
I added my "glock history" patch (a modified version of it anyway) to the latest and greatest as a test patch, then I rereated the hang. Here is what the "hung" glock looks like, with history going back 25 items, from all three nodes' perspectives. Starting with roth-03, who apparently held it last: roth-03 G: s:EX n:2/99322 f:lD t:UN d:UN/1678975000 l:0 a:0 r:6 H: s:EX f:aW e:0 p:3053 [d_doio] gfs2_prepare_write+0xc7/0x360 [gfs2] H: s:SH f:TaW e:0 p:3050 [d_doio] gfs2_readpage+0x60/0xf6 [gfs2] I: n:2097173/99322 t:8 f:0x00000001 779: gfs2_do_promote: 3050, LM_ST_EXCLUSIVE 780: gfs2_glock_dq: 3053, LM_ST_EXCLUSIVE 781: gfs2_do_promote: 3053, LM_ST_EXCLUSIVE 782: gfs2_glock_dq: 3050, LM_ST_EXCLUSIVE 783: gfs2_do_promote: 3051, LM_ST_EXCLUSIVE 784: gfs2_glock_dq: 3053, LM_ST_EXCLUSIVE 785: do_xmote: 2830, LM_ST_EXCLUSIVE 786: search_bucket: 2952, LM_ST_EXCLUSIVE 787: finish_xmote: 2952, LM_ST_EXCLUSIVE 788: do_xmote: 3050, LM_ST_UNLOCKED 789: search_bucket: 2952, LM_ST_UNLOCKED 790: search_bucket: 2951, LM_ST_UNLOCKED 791: finish_xmote: 2952, LM_ST_UNLOCKED 792: handle_callback: 2951, LM_ST_SHARED 793: gfs2_do_promote: 2952, LM_ST_SHARED 794: gfs2_do_promote: 2952, LM_ST_SHARED 795: gfs2_glock_dq: 3051, LM_ST_SHARED 796: gfs2_glock_dq: 3052, LM_ST_SHARED 797: do_xmote: 3052, LM_ST_SHARED 798: search_bucket: 2951, LM_ST_SHARED 799: handle_callback: 2951, LM_ST_SHARED 800: search_bucket: 2951, LM_ST_SHARED 801: finish_xmote: 2951, LM_ST_SHARED 802: gfs2_do_promote: 2951, LM_ST_EXCLUSIVE 803: gfs2_glock_dq: 3053, LM_ST_EXCLUSIVE 804: gfs2_do_promote: 3053, LM_ST_EXCLUSIVE 805: gfs2_do_promote: 3053, LM_ST_EXCLUSIVE 806: gfs2_do_promote: 3053, LM_ST_EXCLUSIVE 807: gfs2_glock_dq: 3051, LM_ST_EXCLUSIVE 808: gfs2_glock_dq: 3050, LM_ST_EXCLUSIVE 809: gfs2_glock_dq: 3052, LM_ST_EXCLUSIVE 810: do_xmote: 2829, LM_ST_EXCLUSIVE roth-02 G: s:UN n:2/99322 f:l t:EX d:EX/0 l:0 a:0 r:6 H: s:EX f:aW e:0 p:3077 [d_doio] gfs2_prepare_write+0xc7/0x360 [gfs2] H: s:SH f:TaW e:0 p:3079 [d_doio] gfs2_readpage+0x60/0xf6 [gfs2] H: s:SH f:TaW e:0 p:3078 [d_doio] gfs2_readpage+0x60/0xf6 [gfs2] 929: gfs2_glock_dq: 3078, LM_ST_EXCLUSIVE 930: gfs2_glock_dq: 3079, LM_ST_EXCLUSIVE 931: gfs2_do_promote: 3079, LM_ST_EXCLUSIVE 932: gfs2_glock_dq: 3077, LM_ST_EXCLUSIVE 933: gfs2_do_promote: 3077, LM_ST_EXCLUSIVE 934: gfs2_do_promote: 3077, LM_ST_EXCLUSIVE 935: gfs2_glock_dq: 3078, LM_ST_EXCLUSIVE 936: gfs2_glock_dq: 3079, LM_ST_EXCLUSIVE 937: gfs2_do_promote: 3079, LM_ST_EXCLUSIVE 938: gfs2_glock_dq: 3077, LM_ST_EXCLUSIVE 939: gfs2_do_promote: 3077, LM_ST_EXCLUSIVE 940: gfs2_do_promote: 3077, LM_ST_EXCLUSIVE 941: gfs2_glock_dq: 3078, LM_ST_EXCLUSIVE 942: gfs2_glock_dq: 3079, LM_ST_EXCLUSIVE 943: gfs2_do_promote: 3079, LM_ST_EXCLUSIVE 944: gfs2_glock_dq: 3077, LM_ST_EXCLUSIVE 945: do_xmote: 2852, LM_ST_EXCLUSIVE 946: search_bucket: 2975, LM_ST_EXCLUSIVE 947: finish_xmote: 2975, LM_ST_EXCLUSIVE 948: do_xmote: 3078, LM_ST_UNLOCKED 949: search_bucket: 2975, LM_ST_UNLOCKED 950: handle_callback: 2975, LM_ST_UNLOCKED 951: search_bucket: 2975, LM_ST_UNLOCKED 952: finish_xmote: 2975, LM_ST_UNLOCKED 953: gfs2_do_promote: 2975, LM_ST_EXCLUSIVE 954: search_bucket: 2975, LM_ST_EXCLUSIVE 955: handle_callback: 2975, LM_ST_EXCLUSIVE 956: gfs2_glock_dq: 3077, LM_ST_EXCLUSIVE 957: do_xmote: 2852, LM_ST_EXCLUSIVE 958: search_bucket: 2975, LM_ST_EXCLUSIVE 959: finish_xmote: 2975, LM_ST_EXCLUSIVE 960: do_xmote: 3079, LM_ST_UNLOCKED roth-01 G: s:UN n:2/99322 f:l t:EX d:EX/0 l:0 a:0 r:6 H: s:EX f:aW e:0 p:3020 [d_doio] gfs2_prepare_write+0xc7/0x360 
[gfs2] H: s:SH f:TaW e:0 p:3024 [d_doio] gfs2_readpage+0x60/0xf6 [gfs2] H: s:SH f:TaW e:0 p:3022 [d_doio] gfs2_readpage+0x60/0xf6 [gfs2] 231: gfs2_do_promote: 3020, LM_ST_EXCLUSIVE 232: gfs2_glock_dq: 3022, LM_ST_EXCLUSIVE 233: gfs2_do_promote: 3022, LM_ST_EXCLUSIVE 234: gfs2_glock_dq: 3020, LM_ST_EXCLUSIVE 235: gfs2_do_promote: 3020, LM_ST_EXCLUSIVE 236: gfs2_glock_dq: 3022, LM_ST_EXCLUSIVE 237: gfs2_do_promote: 3022, LM_ST_EXCLUSIVE 238: gfs2_glock_dq: 3020, LM_ST_EXCLUSIVE 239: gfs2_do_promote: 3020, LM_ST_EXCLUSIVE 240: gfs2_glock_dq: 3022, LM_ST_EXCLUSIVE 241: gfs2_do_promote: 3022, LM_ST_EXCLUSIVE 242: gfs2_glock_dq: 3020, LM_ST_EXCLUSIVE 243: gfs2_do_promote: 3020, LM_ST_EXCLUSIVE 244: gfs2_glock_dq: 3022, LM_ST_EXCLUSIVE 245: gfs2_do_promote: 3022, LM_ST_EXCLUSIVE 246: gfs2_glock_dq: 3020, LM_ST_EXCLUSIVE 247: gfs2_do_promote: 3020, LM_ST_EXCLUSIVE 248: gfs2_glock_dq: 3022, LM_ST_EXCLUSIVE 249: do_xmote: 2797, LM_ST_EXCLUSIVE 250: search_bucket: 2920, LM_ST_EXCLUSIVE 251: finish_xmote: 2920, LM_ST_EXCLUSIVE 252: do_xmote: 3022, LM_ST_UNLOCKED 253: search_bucket: 2920, LM_ST_UNLOCKED 254: handle_callback: 2920, LM_ST_UNLOCKED 255: search_bucket: 2921, LM_ST_UNLOCKED 256: finish_xmote: 2921, LM_ST_UNLOCKED 257: gfs2_do_promote: 2921, LM_ST_EXCLUSIVE 258: gfs2_glock_dq: 3020, LM_ST_EXCLUSIVE 259: do_xmote: 2797, LM_ST_EXCLUSIVE 260: search_bucket: 2921, LM_ST_EXCLUSIVE 261: finish_xmote: 2921, LM_ST_EXCLUSIVE 262: do_xmote: 3024, LM_ST_UNLOCKED
Another hang with better history:

Note: do_xmote3 is finish_xmote getting Unlocked due to conversion deadlock.
      do_xmote4 is finish_xmote getting LM_ST_SHARED or LM_ST_DEFERRED

roth-02:
G: s:DF n:2/12583013 f:l t:SH d:EX/0 l:0 a:0 r:6
 H: s:SH f:aW e:0 p:3037 [d_doio] gfs2_lookup+0x50/0xa7 [gfs2]
 I: n:1048597/12583013 t:8 f:0x00000001
26841216: finish_xmote: 2973, LM_ST_DEFERRED
26841217: do_xmote3: 2973, LM_ST_UNLOCKED
26841218: search_bucket: 2974, LM_ST_UNLOCKED
26841219: finish_xmote: 2974, LM_ST_UNLOCKED
26841220: do_xmote4: 2974, LM_ST_DEFERRED
26841221: search_bucket: 2973, LM_ST_DEFERRED
26841222: finish_xmote: 2973, LM_ST_DEFERRED
26841223: do_xmote3: 2973, LM_ST_UNLOCKED
26841224: search_bucket: 2974, LM_ST_UNLOCKED
26841225: finish_xmote: 2974, LM_ST_UNLOCKED
26841226: do_xmote4: 2974, LM_ST_DEFERRED
26841227: search_bucket: 2973, LM_ST_DEFERRED
26841228: finish_xmote: 2973, LM_ST_DEFERRED
26841229: do_xmote3: 2973, LM_ST_UNLOCKED
26841230: search_bucket: 2974, LM_ST_UNLOCKED
26841231: finish_xmote: 2974, LM_ST_UNLOCKED
26841232: do_xmote4: 2974, LM_ST_DEFERRED
26841233: search_bucket: 2973, LM_ST_DEFERRED
26841234: finish_xmote: 2973, LM_ST_DEFERRED
26841235: do_xmote3: 2973, LM_ST_UNLOCKED
26841236: search_bucket: 2974, LM_ST_UNLOCKED
26841237: finish_xmote: 2974, LM_ST_UNLOCKED
26841238: do_xmote4: 2974, LM_ST_DEFERRED
26841239: search_bucket: 2973, LM_ST_DEFERRED
26841240: finish_xmote: 2973, LM_ST_DEFERRED
26841241: do_xmote3: 2973, LM_ST_UNLOCKED
26841242: search_bucket: 2974, LM_ST_UNLOCKED
26841243: finish_xmote: 2974, LM_ST_UNLOCKED
26841244: do_xmote4: 2974, LM_ST_DEFERRED
26841245: search_bucket: 2973, LM_ST_DEFERRED
26841246: finish_xmote: 2973, LM_ST_DEFERRED
26841247: do_xmote3: 2973, LM_ST_UNLOCKED
26841248: search_bucket: 2974, LM_ST_UNLOCKED
26841249: finish_xmote: 2974, LM_ST_UNLOCKED
26841250: do_xmote4: 2974, LM_ST_DEFERRED
26841251: search_bucket: 2973, LM_ST_DEFERRED
26841252: finish_xmote: 2973, LM_ST_DEFERRED
26841253: do_xmote3: 2973, LM_ST_UNLOCKED
26841254: search_bucket: 2974, LM_ST_UNLOCKED
26841255: finish_xmote: 2974, LM_ST_UNLOCKED
26841256: do_xmote4: 2974, LM_ST_DEFERRED
26841257: search_bucket: 2973, LM_ST_DEFERRED
26841258: finish_xmote: 2973, LM_ST_DEFERRED
26841259: do_xmote3: 2973, LM_ST_UNLOCKED
26841260: search_bucket: 2974, LM_ST_UNLOCKED
26841261: finish_xmote: 2974, LM_ST_UNLOCKED
26841262: do_xmote4: 2974, LM_ST_DEFERRED
26841263: search_bucket: 2973, LM_ST_DEFERRED

roth-03:
G: s:UN n:2/12583013 f:l t:SH d:EX/0 l:0 a:0 r:5
 H: s:SH f:aW e:0 p:3019 [d_doio] gfs2_lookup+0x50/0xa7 [gfs2]
26818610: search_bucket: 2955, LM_ST_UNLOCKED
26818611: finish_xmote: 2955, LM_ST_UNLOCKED
26818612: do_xmote4: 2955, LM_ST_DEFERRED
26818613: search_bucket: 2956, LM_ST_DEFERRED
26818614: finish_xmote: 2956, LM_ST_DEFERRED
26818615: do_xmote3: 2956, LM_ST_UNLOCKED
26818616: search_bucket: 2955, LM_ST_UNLOCKED
26818617: finish_xmote: 2955, LM_ST_UNLOCKED
26818618: do_xmote4: 2955, LM_ST_DEFERRED
26818619: search_bucket: 2956, LM_ST_DEFERRED
26818620: finish_xmote: 2956, LM_ST_DEFERRED
26818621: do_xmote3: 2956, LM_ST_UNLOCKED
26818622: search_bucket: 2955, LM_ST_UNLOCKED
26818623: finish_xmote: 2955, LM_ST_UNLOCKED
26818624: do_xmote4: 2955, LM_ST_DEFERRED
26818625: search_bucket: 2956, LM_ST_DEFERRED
26818626: finish_xmote: 2956, LM_ST_DEFERRED
26818627: do_xmote3: 2956, LM_ST_UNLOCKED
26818628: search_bucket: 2955, LM_ST_UNLOCKED
26818629: finish_xmote: 2955, LM_ST_UNLOCKED
26818630: do_xmote4: 2955, LM_ST_DEFERRED
26818631: search_bucket: 2956, LM_ST_DEFERRED
26818632: finish_xmote: 2956, LM_ST_DEFERRED
26818633: do_xmote3: 2956, LM_ST_UNLOCKED
26818634: search_bucket: 2955, LM_ST_UNLOCKED
26818635: finish_xmote: 2955, LM_ST_UNLOCKED
26818636: do_xmote4: 2955, LM_ST_DEFERRED
26818637: search_bucket: 2956, LM_ST_DEFERRED
26818638: finish_xmote: 2956, LM_ST_DEFERRED
26818639: do_xmote3: 2956, LM_ST_UNLOCKED
26818640: search_bucket: 2955, LM_ST_UNLOCKED
26818641: finish_xmote: 2955, LM_ST_UNLOCKED
26818642: do_xmote4: 2955, LM_ST_DEFERRED
26818643: search_bucket: 2956, LM_ST_DEFERRED
26818644: finish_xmote: 2956, LM_ST_DEFERRED
26818645: do_xmote3: 2956, LM_ST_UNLOCKED
26818646: search_bucket: 2955, LM_ST_UNLOCKED
26818647: finish_xmote: 2955, LM_ST_UNLOCKED
26818648: do_xmote4: 2955, LM_ST_DEFERRED
26818649: search_bucket: 2956, LM_ST_DEFERRED
26818650: finish_xmote: 2956, LM_ST_DEFERRED
26818651: do_xmote3: 2956, LM_ST_UNLOCKED
26818652: search_bucket: 2955, LM_ST_UNLOCKED
26818653: finish_xmote: 2955, LM_ST_UNLOCKED
26818654: do_xmote4: 2955, LM_ST_DEFERRED
26818655: search_bucket: 2956, LM_ST_DEFERRED
26818656: finish_xmote: 2956, LM_ST_DEFERRED
26818657: do_xmote3: 2956, LM_ST_UNLOCKED

roth-01:
G: s:DF n:2/12583013 f: t:DF d:EX/0 l:0 a:0 r:3
 I: n:1048597/12583013 t:8 f:0x00000000
Created attachment 304026 [details]
RHEL5 patch as of 2008 Apr 28

Here is the latest RHEL5 patch. Steve and I (mainly Steve) worked through several more issues and bugs this morning. This version still isn't quite right; d_doio can still get hangs, although much later in the test.
The hang in this case seemed to be caused by a lost demote request. My suspicions fall on the test at the top of finish_xmote which was the cause of the original lock bouncing bug that we saw. I thought then that I'd have to add an extra flag to ensure that we only updated the target state in specific circumstances. I suspect that this bug isn't caused directly by that, but by the test in the "success" path. I've added a GLF_DEMOTE_IN_PROGRESS flag and updated both tests so that they take account of this flag. I'm doing a few quick sanity checks on it atm. If it passes I'll attach the new upstream patch to this bz.
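The shape of the change is roughly as follows - a sketch pieced together from the description above, so treat the details as assumptions and grep the attached patch for GLF_DEMOTE_IN_PROGRESS for the real hunks:

/* run_queue(): mark a demote while it is being dispatched */
set_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags);
gl->gl_target = gl->gl_demote_state;
do_xmote(gl, NULL, gl->gl_target);

/* finish_xmote(): the tests that may rewrite gl_target now skip a
 * glock whose demote is still in flight, so the demote request can
 * no longer be silently overwritten by a competing promotion: */
if (state != gl->gl_target &&
    !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags))
	gl->gl_target = state;

/* ...and the flag is cleared once the requested state is reached: */
if (state == gl->gl_target)
	clear_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags);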
Created attachment 304098 [details]
Latest upstream patch

This is the latest version. It passes my very basic single node tests. Let's hope this is the change that fixes the remaining hang. To see the differences, grep for GLF_DEMOTE_IN_PROGRESS, as the only change is the addition of this flag in a few places.
Created attachment 304112 [details]
Another updated version

I found another corner case. This one should be better.
Created attachment 304161 [details]
Glock history for latest BUG() problem

This glock history shows more detail, including the glock flags, gl_target and gl_demote_state at each point. It also includes the dlm status from all three nodes for the glock in question. Note that do_xmote1 means do_xmote was called from run_queue when it saw the GLF_DEMOTE bit in gl_flags.
Created attachment 304398 [details]
Combined specsfs/readv patch for aol

This patch contains the latest RHEL5 version of the GFS2 code. The patch is against the 2.6.18-92 kernel. This is what I suggest we build and give to aol as a kmod. We've run a number of tests on it, including benchp13 (single fs) and all my "hell" tests. I want to do one or two more tests, but I wanted to post this ASAP.
I also ran a quick "d_io" test that ran for about 20 minutes. It was also successful. I think we're good to go. Still need to resolve dd_io, but I don't think it's likely that they'd hit that bug. Steve has another patch for me to try in that regard, which is not included in the aol patch from comment #46.
Created attachment 304409 [details]
Combined specsfs/better readv patch for aol

This version, which is Steve's latest, often passes the dd_io test. I can sometimes get dd_io to hang, but this one runs it much better. However, at the time of this posting, it has not undergone the testing we did for its predecessor.
Created attachment 304572 [details]
lockdumps when dd_io processes were hung on the smoke cluster

I was running dd_io with the latest patch in comment #48 and I managed to recreate the hang pretty quickly. Attached is the tarball of all the lockdumps from the smoke nodes.
Btw, the attachment of comment #49 seems to have been compressed twice.
Abhi, can you provide the dlm lock dump too? I can't work out from the dumps in comment #49 what is going on. It appears camel, merit and winston are all waiting for a glock on 2/20593, but salem appears not to have any reference to that glock. So the only way to find out what is going on is to compare with the dlm's idea of the current state.
Created attachment 304636 [details]
dlm-lockdumps corresponding to the gfs2 glock-dumps in comment #49
Created attachment 304637 [details]
Latest upstream patch including debugfs fix

Just about to re-test this after some clean up.
Created attachment 304749 [details]
Upstream patch with latest changes

It seems that there was still a window where invalidates were causing us problems. I think this really is the last such case now. The previous patch did all the demote cases, but missed SH -> DF, which is a promotion, but one which causes an invalidation of data pages. This patch should fix that case as well. You can find the new bits easily by looking for GLF_INVALIDATE_IN_PROGRESS. There are only a few changed lines compared with the previous patch, probably about half a dozen or so.
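In do_xmote() terms, the change presumably looks something like this (a sketch based on the description above; the DIO_METADATA detail is an assumption from how go_inval is called elsewhere):

/* Flag any transition that will invalidate pages - which now includes
 * the SH -> DF promotion - before the request is sent, and fail the
 * queued try locks up front rather than blocking them: */
if ((target == LM_ST_UNLOCKED || target == LM_ST_DEFERRED) &&
    glops->go_inval) {
	set_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags);
	do_error(gl, 0); /* fail queued try locks */
}

/* ...later, once it is safe to do so, perform the invalidation: */
if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags))
	glops->go_inval(gl, target == LM_ST_DEFERRED ? 0 : DIO_METADATA);
clear_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags);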
Created attachment 304755 [details]
Another upstream update

A few changes in add_to_queue() relating to lock cancellation, to ensure that we always call ->lm_cancel() in all cases where it's required.
Created attachment 304811 [details]
gfs2 and dlm lockdumps from latest dd_io hang

Steve, I tried to port your latest upstream patch to rhel5, but I got an error because of seq_open_private() not being available in rhel5. I extracted the debugfs fix from your patch and ported the rest. I managed to recreate the dd_io hang after about 12 minutes of runtime. Now, I could've made a mistake in porting the patch, so I'm going to compile your git-tree with your upstream patch on the smoke cluster and give that a try. I'll also post the rhel5 ports I made of your patch here.
Created attachment 304837 [details]
contains 2 patches that partially port swhiteho's latest upstream patch

Steve, could you take a look at these patches to see if I missed something? I tried to leave out the debugfs bits of your patch because they wouldn't compile on rhel5 (no seq_open_private).
The patches in comment #57 look ok. The seq_open_private function can just be copied from the upstream kernel and added as a static function. It's only small and I think it's pretty much standalone. The same goes for the release function, in case that is missing too.
Now it seems that merit is holding the lock for 2/ca9436 in DF mode, and that there are queued requests for that lock in both DF and SH modes. The only way that makes sense is if the SH request is at the front of the queue and a demote has been lost somewhere along the line. If the other DF modes were at the front of the queue, then they should have been granted. We saw something like this with some previous tests, but it's not something that has been seen recently. Also, please be careful with mime types when adding attachments to bugs. The patches in comment #57 are gzipped and tarred, but just appear as "data" in my browser. The data in comment #56 is downloaded as a .tar.gz file but is in fact gzipped twice over.
Created attachment 304876 [details]
Updated upstream patch

This has the start of a change to the lock_dlm threads to try to fix a potential ordering problem which I believe _might_ (and I stress that, as I can't be sure!) be responsible for the "lost" demote request that we've been seeing recently. One useful further test would be to separate the two threads in a different way, so that one thread only does lock submissions and the other thread does the blocking/completion/drop callbacks. The patch is currently untested, and there is still a potential issue relating to ordering which we might be able to solve by adding a sequence number to the messages as received from the dlm module (which I presume ensures such reordering does not occur). There is a lot of potential future cleanup that could be done in this area too. We do not need to release the spinlock before calling blocking callbacks in gfs2, since such callbacks don't block. Once we've sorted the completion callbacks so that the same is true of them, we can do the same there as well.
Created attachment 304898 [details]
Hang info: glock dump, dlm locks, call traces

I ran dd_io several times successfully on the exxon cluster. I was finally able to create a hang on about the twelfth run or so. Abhi says he can create a hang on the smoke cluster easily. This tgz contains the info for the hang.
Looking at the traces from comment #61, it looks like the demote request was sent to the wrong node.... at least node 02 seems to be waiting for an EX lock (currently unlocked) but with a queued demote request (!), whereas 01 holds the lock in shared state with no demote pending.
Created attachment 304946 [details]
Lockdumps from the latest hang on the smoke cluster

For some reason, my previous .tar.gz attachments were double-zipped. The tarballs were correct when I created them. Maybe it's because I let bugzilla auto-detect the attachment type. This time, I've set the attachment type to binary file (application/octet-stream). Hopefully it won't do it again.
Created attachment 304955 [details]
Comment 60 patch with glock history debug

For debugging.
Created attachment 304976 [details]
Debug version with Dave's patch

This patch contains:
1. All of the "good" patches up to comment #60
2. The "glock history" debug code
3. Dave's lock_dlm patch that processes "cast" then "bast" requests (in that order) if both are pending.
Created attachment 304984 [details] Lockdumps & sysrq-t from the latest hang on the smoke cluster
Created attachment 305126 [details]
Another upstream patch

This one has changes in lock_dlm. I'm not at all confident that the patch relating to lock_dlm which Dave did last week actually closed all the holes. I noticed the following in lock_dlm:

1. The DLM calls both basts and casts in process context
2. The DLM ensures the correct ordering
3. lock_dlm was adding these requests to two different queues and then dequeuing them with two different threads, one of which did not deal with basts for some reason.

So my plan is to simply get rid of the queues and send the DLM's callbacks directly to the fs. That's what this patch does. I then noticed some other things which should be fixed in a later patch:

1. The lock_dlm thread(s), when in a recovery state, remove entries from the delay list and then add them back again for no obvious reason. I presume that this is why the schedule() has been added, as otherwise this will soak up large amounts of cpu.
2. Most of what is left in lock_dlm is the logic for dealing with conversions from SH -> EX, which is now dealt with in glock.c anyway, so it could be removed in upstream.
3. The callbacks (gdlm_cast and gdlm_bast) were marked inline despite the only access to them being via function pointers, so that's been removed since it's pointless.

There are probably other possible cleanups too, but those are the ones I've found so far. The patch also saves 32 bytes per glock due to the loss of the two queues (on x86_64), so it should generally help boost scalability and performance, not to mention removing the overhead of needlessly adding messages to a queue. I'm still working on this, but this represents where I've got to at the moment.
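With the queues gone, the blocking callback can be dispatched straight from dlm context. A simplified sketch of what gdlm_bast ends up looking like (the mode-to-callback mapping follows the existing lock_dlm API; details may differ from the attached patch):

static void gdlm_bast(void *astarg, int mode)
{
	struct gdlm_lock *lp = astarg;
	struct gdlm_ls *ls = lp->ls;

	if (!mode) {
		printk(KERN_INFO "lock_dlm: bast mode zero %x,%llx\n",
		       lp->lockname.ln_type,
		       (unsigned long long)lp->lockname.ln_number);
		return;
	}

	/* No queue, no second thread: hand the demote request to the
	 * fs immediately, in the order the dlm delivered it. */
	switch (mode) {
	case DLM_LOCK_EX:
		ls->fscb(ls->sdp, LM_CB_NEED_E, &lp->lockname);
		break;
	case DLM_LOCK_CW:
		ls->fscb(ls->sdp, LM_CB_NEED_D, &lp->lockname);
		break;
	case DLM_LOCK_PR:
		ls->fscb(ls->sdp, LM_CB_NEED_S, &lp->lockname);
		break;
	default:
		printk(KERN_INFO "lock_dlm: unknown bast mode %d\n", mode);
	}
}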
Created attachment 305185 [details]
Wireshark capture of problem

This is a bzip2 of a wireshark capture I took during one failure. The dd_io tests are sometimes failing due to the processes taking too long to reply to the test program, or so it seems. I don't see anything wrong with the code, and I don't see anything obvious in the wireshark trace either. But according to Dean, the tests are taking a lot longer with the previous patch.
Just to clarify: The test now sometimes passes all 25 dd_io tests, although sometimes it apparently times out and runs a lot slower in some scenarios.
I'm not sure that does clarify things :-) When you say it "sometimes passes" do you mean that the failures are all related to the test taking too long, or that they are related to the original problem of hangs? Also did you try adjusting the minimum hold time? So far as I can tell from the traces, the problem is just down to bouncing the locks too much.
Created attachment 305237 [details]
Debug version of the comment67 patch plus fixes

This contains the comment #67 patch, plus some fixes (e.g. the additional goto in do_promote), plus glock history debug code. This is patched against the -92 kernel (so you'll find extra code for my specsfs fix in there).
Created attachment 305266 [details]
Fix the glock history panic

Same as previous, but the glock dump panic is (hopefully) gone.
I tried a few things today. I tried going back to the lock_dlm from comment #60, then making it single-threaded. This is similar to what Steve did, but the callbacks are not done by the dlm thread, they're done from the normal lock_dlm thread. This had the exact same symptom, in other words, the tests "failed" because they were running too slowly. So the performance impact seems to be due to the missing second thread, not due to the way Steve's latest code does the callback. The second thing I tried was reverting back to the comment #60 lock_dlm but keeping my new do_promote goto, and the dump_glock fixes. This ran successfully on the exxon cluster. I ran the failing mmap tests multiple times. Everything was back to "fast" again (the mmap tests were taking from 35-39 seconds.) Then I ran that same code on the smoke cluster. Many of the tests ran to completion in a timely manner, but once again, it got stuck in d_rwrandirectlarge. I took glock dumps on all the nodes (in /tmp/lockdump.{camel|merit|winston|salem}). Once again, these showed glocks sitting in the DF state without any holders/waiters on any node.
Created attachment 305377 [details]
Latest upstream patch

Changes from the last upstream patch:

- Bug fixed where we forgot to run the workqueue with lock_nolock (I've verified that this does not affect lock_dlm) after the lock module returns a status.
- Makes the processing of glock completions async to the extent that we now only submit I/O but we don't wait for it. When the I/O completes, we schedule the workqueue to run the remaining bits of the glops.

So to find the changes quickly, look for the new GLF_IO_DONE flag, the ->go_xmote_bh callbacks (new prototype since it returns an int now) and the change in do_xmote to schedule the workqueue in the lock_nolock case.
Created attachment 305387 [details]
RHEL version of the comment #74 patch w/debug

Here is the rhel version of the "comment 74" patch. I had to do a bit of massaging due to differences between upstream and rhel. It's passed some of the tests on my exxon cluster, but I wanted to get something out there asap for the smoke cluster.
Created attachment 305404 [details] lockdumps and sysrq-t from the latest hang on smoke with Bob's rhel5 patch in comment 75
This latest version gave me a CPU lockup when it ran the d_mmap3 test. Here is the call trace:

BUG: soft lockup - CPU#0 stuck for 10s! [d_doio:3582]
CPU 0:
Modules linked in: sctp lock_dlm(U) dlm configfs gfs2(U) autofs4 hidp rfcomm l2d
Pid: 3582, comm: d_doio Tainted: G      2.6.18-91.el5 #1
RIP: 0010:[<ffffffff800621d1>]  [<ffffffff800621d1>] __write_lock_failed+0x9/0x0
RSP: 0018:ffff81010d63bb58  EFLAGS: 00000287
RAX: 0000000000000000 RBX: ffff81010dba8000 RCX: 00000000ffff8fc7
RDX: ffff810001004400 RSI: 0000000000000000 RDI: ffffffff88573da8
RBP: ffff81010d613080 R08: ffff810000013600 R09: ffff81010d63bb88
R10: 0000000000000000 R11: ffff81010d026bd8 R12: 0000000000000001
R13: ffff81010cd28c60 R14: ffffffff8002d0d4 R15: 0000000000000018
FS:  00002ae5fa5e4020(0000) GS:ffffffff8039e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002ae5fb1f2000 CR3: 000000010d633000 CR4: 00000000000006e0
Call Trace:
 [<ffffffff800649f5>] _write_lock+0xe/0xf
 [<ffffffff884fec4d>] :gfs2:gfs2_glock_put+0x26/0xeb
 [<ffffffff884ff57d>] :gfs2:gfs2_holder_uninit+0xd/0x1f
 [<ffffffff88508790>] :gfs2:gfs2_readpage+0xe5/0xf6
 [<ffffffff8850870b>] :gfs2:gfs2_readpage+0x60/0xf6
 [<ffffffff8000bf84>] do_generic_mapping_read+0x22b/0x3f8
 [<ffffffff8000cd69>] file_read_actor+0x0/0x154
 [<ffffffff8000c29d>] __generic_file_aio_read+0x14c/0x190
 [<ffffffff800c0070>] generic_file_readv+0x8f/0xa8
 [<ffffffff8009dde2>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80021f44>] __up_read+0x19/0x7f
 [<ffffffff800d814f>] do_readv_writev+0x176/0x295
 [<ffffffff800c0089>] generic_file_read+0x0/0xc5
 [<ffffffff80028e79>] do_brk+0x1cf/0x2b2
 [<ffffffff800b3fd8>] audit_syscall_entry+0x16e/0x1a1
 [<ffffffff800d838b>] sys_readv+0x45/0x93
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

I'm chasing down a theory now. My theory - based on evidence from the glock history - is that finish_xmote is getting -EAGAIN from go_xmote_bh and it returns without clearing the GLF_LOCK bit. From then on, subsequent run_queues are ignoring the glock because the bit is set. I'm going to test this theory with the comment #60 (Dave's) version of lock_dlm.
Well, I didn't exactly do what I said I would in comment #77. Maybe it's just me, but it still seems as if performance is really bad with the latest code. I added more debugging points to the glock history, but between the performance and the soft lockup, I decided to set that version aside temporarily and focus on slightly older code that was fast. That code passed the dd_io test (all 25 variants) on my exxon cluster. However, I recreated a hang on the smoke cluster again. I collected glock dumps, with my improved history code. Looks like merit is holding a glock in DF mode with no holder again. Here is the history:

G: s:DF n:2/c999d9 f: t:DF d:EX/0 l:0 a:0 r:3
 I: n:2097178/13212121 t:8 f:0x00000010
170141: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170142: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170143: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170144: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170145: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170146: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170147: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170148: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170149: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170150: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170151: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170152: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170153: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170154: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170155: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170156: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170157: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170158: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170159: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170160: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170161: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170162: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170163: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170164: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170165: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170166: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170167: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170168: do_promote 3547, f: ldy tgt: EX, dmt: DF, St: EX
170169: gfs2_glock_dq 3547, f: dy tgt: EX, dmt: DF, St: EX
170170: do_xmote1 3003, f: lDpy tgt: DF, dmt: DF, St: EX
170171: search_bucket 3372, f: lDp tgt: DF, dmt: DF, St: EX
170172: finish_xmote 3372, f: lDp tgt: DF, dmt: DF, St: EX -> DF

I'm beginning to wonder if we should keep Dave's version of lock_dlm that had two threads, but find a better way to make sure we process the requests in the correct order.
The analysis in comment #77 doesn't appear to match the stack trace. The trace in question points to a delay in getting the rwlock in order to release a ref to a glock. So it would be a reasonable conclusion that some other thread is keeping hold of that rwlock somewhere. Since we've not changed any of the ref counting code internally, I can't immediately see why this is happening, but most likely it's either scheduling while holding a rwlock or a lock ordering problem. I'll have a look and see if I can spot it anyway.
I have a feeling that the problem in comment #77 is down to a one-liner bug. Here is my proposed solution: change the end-read-I/O function as follows:

void gfs2_glock_end_read_io(struct buffer_head *bh, int uptodate)
{
	struct gfs2_bufdata *bd = bh->b_private;
	struct gfs2_glock *gl = bd->bd_gl;

	end_buffer_read_sync(bh, uptodate);
	set_bit(GLF_IO_DONE, &gl->gl_flags);
	if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
		atomic_dec(&gl->gl_ref);
}

This can be run from irq context, and to save us from having to use irq-safe locks for the glock hash table, we can just use an atomic_dec, since we know this only runs while the ref count is held high by the state change on the glock, so it can never hit zero as a result of this particular decrement.

If the problem persists, use lockdep to debug it. Also, it would be very helpful to have stack traces of all processes, not just the one that caused the lockup, since in this case it was down to another process holding the lock.
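For contrast, here is a rough sketch of why the ordinary put path needs the rwlock that the soft lockup above was spinning on. The bucket-lock helper shown is an assumption about the hash-table layout, not a quote of the real code:

static void glock_put_sketch(struct gfs2_glock *gl)
{
	/* The normal put may drop the count to zero, in which case the
	 * glock must be unhashed, so the hash-bucket rwlock is needed.
	 * That rwlock is not irq safe, which is why the end-I/O path
	 * above gets away with a bare atomic_dec() instead. */
	write_lock(gl_lock_addr(gl->gl_hash));	/* assumed bucket-lock helper */
	if (atomic_dec_and_test(&gl->gl_ref))
		hlist_del(&gl->gl_list);
	write_unlock(gl_lock_addr(gl->gl_hash));
}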
Created attachment 305506 [details] Latest upstream patch

So this patch moves all of the finish_xmote processing into the workqueue for _both_ lock_nolock and lock_dlm, so at last both of these share the same code path. The ref count bug seems to be gone too. I've removed ->go_xmote_bh for inodes, since ->go_lock does the work anyway and there didn't seem to be a need to do it twice. For the future, it would be nice to make the remaining I/O in ->go_lock async so that it doesn't block the workqueue while it's in progress. On the other hand, it now means that there are no circumstances under which the dlm's ast thread will block. So it should prove whether that was the problem which has been preventing the mmap tests from passing.

Also I've cleaned up the error handling a bit and introduced a glock_bug function and a GLOCK_BUG_ON macro. Note that both should be called only with gl_spin held, or in places where there cannot possibly be a race.
Created attachment 305513 [details] Comment 81 patch for RHEL with debug / glock history This is the RHEL port of Steve's patch from comment #81. I changed some of the logging code because the new GLOCK_BUG code was not identifying where the error was; it was only saying where the BUG() statement was. I'm not sure I like it that way, but at least it works. I'm testing it on my exxon cluster now.
The new version also fails d_mmap3 due to performance lags. :-(
After way too much prep work, I used systemtap to compare how much time was being spent in the "most popular" functions of glock.c. The script tracks the cumulative time spent inside each function and the number of calls made to it. Obviously, the new code contains functions that don't exist in the original (do_xmote, etc.), and vice versa. That said, here are the very telling results:

With comment #81 patch:
do_xmote 532969 in 192 calls
do_promote 535454 in 319 calls
finish_xmote 535796 in 191 calls
wait_on_holder 4565742 in 224 calls
gfs2_glock_wait 4567063 in 224 calls
gfs2_glock_nq 4569446 in 224 calls
---------------------------------------------------
With glock.c from stock -92 kernel:
gfs2_glock_nq 2089951 in 605548 calls
xmote_bh 765822 in 61 calls
gfs2_glock_cb 767264 in 89 calls

Note that these results are "typical" values; the counters are reset and reported every five seconds. So the old code was spending almost no time in gfs2_glock_wait, whereas the new code is spending a lot of time there.
So it seems from comment #83 that we've now proved that whatever is causing the performance problem, it's not down to running finish_xmote in a different way. I still think the most likely cause is the min hold time code, on the basis that we've not managed to prove that it is working correctly so far.
Some time back we looked into the possibility that we were demoting too soon because the workqueue was being scheduled early. That was discounted at the time, since the change we made (altering one point at which the workqueue was scheduled) appeared not to make any difference. Now that we have to run the workqueue to deal with both completions and callback events, it occurred to me that we should do more in the function itself to prevent us from shortening the hold time. As a result I've come up with the following change to the workqueue function:

static void glock_work_func(struct work_struct *work)
{
	unsigned long delay = 0;
	struct gfs2_glock *gl = container_of(work, struct gfs2_glock, gl_work.work);

	if (test_and_clear_bit(GLF_REPLY_PENDING, &gl->gl_flags))
		finish_xmote(gl, gl->gl_reply);
	spin_lock(&gl->gl_spin);
	if (test_and_clear_bit(GLF_PENDING_DEMOTE, &gl->gl_flags)) {
		unsigned long holdtime, now = jiffies;
		holdtime = gl->gl_tchange + gl->gl_ops->go_min_hold_time;
		if (time_before(now, holdtime))
			delay = holdtime - now;
		set_bit(delay ? GLF_PENDING_DEMOTE : GLF_DEMOTE, &gl->gl_flags);
	}
	run_queue(gl, 0);
	spin_unlock(&gl->gl_spin);
	if (!delay ||
	    queue_delayed_work(glock_workqueue, &gl->gl_work, delay) == 0)
		gfs2_glock_put(gl);
}

Also, I've been looking at the GLOCK_BUG_ON macro and have changed it slightly to avoid the problems reported in comment #82, so I hope that is better now.

From the figures in comment #84 it seems clear that in the -92 case we are doing far more locks than we could reasonably expect if we had to call out remotely each time. That means that most of the time we must be using cached locks. The fact that in the comment #81 patch case a lot of time is spent waiting suggests that the cause is waiting for something to release the lock. It seems more plausible that this is due to remote locking activity than local activity, so I guess that's the thing to test first (hence this patch).
Created attachment 305678 [details] Latest upstream patch As per earlier comments.
Created attachment 305702 [details] RHEL version of the comment #87 patch This is the RHEL version of the patch in comment #87. It seems fast and passes the dd_io test on my exxon cluster. It also has the seq file changes from upstream. There is no debug code in this one, so we should be able to do performance testing on it.
The dd_io tests on my exxon cluster were all pretty fast. Unfortunately, I tried running the benchp13 test with this version on "kool" and it showed a significant performance problem. My last "bitfit" version of gfs2 had been taking 17-19 seconds per iteration. This version takes 200 seconds per iteration on average, so more than ten times as long. I tried three things:

1. I re-ran my bitfit version to make sure it wasn't hardware (i.e. silent disk retries when a disk goes bad). Not that.
2. I tried running it in a "one-node cluster" using lock_dlm and got the same 200s performance. So it's not specific to lock_nolock.
3. I tried switching from the 5.1 kernel (2.6.18-53) to a newer (2.6.18-90) kernel. Wasn't that either, so it's definitely gfs2.

So I've eliminated the "easy" things to check. I also took a glock dump while benchp13 was running. It showed 145597 glocks in memory in various conditions. I need to analyze this some more.
Created attachment 305727 [details] smoke dumps from today's hang. camel's logs are absent because the console was unresponsive
I did some more experiments. First, I tried reducing the minimum hold time from HZ/5 back to its normal HZ/10. The performance was about the same. I also did a systemtap profile of the code and saw this:

gfs2_glock_wait 3518607 in 68856 calls
do_xmote 12311 in 4225 calls
do_promote 314372 in 77306 calls
gfs2_glock_nq 4454741 in 68856 calls

Which really doesn't tell us much. I've been poring over code, but haven't reached any conclusions yet.
It seems that the slowdown might be related to writeback. I notice that if I run postmark several times, the second and later runs are always slow compared with the first one. Unmounting and rerunning postmark returns the results to the original (faster) speed. Also, I notice that umount is often very slow. So I suspect that it's something to do with writeback and/or unlocking glocks.

Changing the minimum hold time should not make any difference for lock_nolock, since there is no code path that calls that code. The same goes for lock_dlm on a single node, since it will not send callbacks to itself as the result of a lock request. Although I've seen some regression in performance with the latest patch, I've not seen anything like the 10x reported in comment #89, so perhaps it's worth comparing the create phase of postmark with benchp13 to see if the difference there might explain the difference in performance.
For the record, the benchp13 performance issue was solved with this suggestion by Steve:

--- glock.c.~1~	2008-05-16 10:51:35.000000000 -0500
+++ glock.c	2008-05-19 08:26:05.000000000 -0500
@@ -565,8 +565,7 @@ static void do_xmote(struct gfs2_glock *
 	ret = gfs2_lm_lock(sdp, gl->gl_lock, gl->gl_state, target, lck_flags);
 
 	if (!(ret & LM_OUT_ASYNC)) {
-		gl->gl_reply = ret;
-		set_bit(GLF_REPLY_PENDING, &gl->gl_flags);
+		finish_xmote(gl, ret);
 		gfs2_glock_hold(gl);
 		if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
 			gfs2_glock_put(gl);

We've still got Abhi's hang on the smoke cluster to consider.
Created attachment 306084 [details] glock dumps and call traces I tried some variations on the code mentioned in the previous comment. One of the things I tried was introducing a new bit that indicated whether or not the glock was in the process of acquiring the page lock after having acquired the EX lock during the write_lock_start. Then I had the read try-locks behave differently when the bit is on, such as failing the lock right away. It behaved pretty much the same as the version mentioned in the previous comment: try locks were occasionally bursting up to 100% cpu utilization for short periods of time, then eventually resolving themselves when the lock is acquired. Since it behaved the same way, I reverted it. This latest patch ran through the dd_io test twice successfully, but it hung on my third attempt. This attachment contains a tar/zip file with a directory containing glock dumps and call traces for all the smoke nodes.
So it looks like we are back to seeing missing demote requests again. The waiting nodes all have the "l" flag, so they all sent their requests to the dlm; it looks like merit has just dropped the demote request somehow.
Created attachment 306111 [details] The patch I was using last night
Created attachment 306191 [details] Dave's patch to dlm

This morning, we discovered an apparent bug in DLM while testing. Dave T. posted the attached patch for the fix with this comment:

The fix in commit 3650925893469ccb03dbcc6a440c5d363350f591 was addressing the case of a granted PR lock with waiting PR and CW locks. It's a special case that requires forcing a CW bast. However, that forced CW bast was incorrectly applying to a second condition where the granted lock was CW. So, the holder of a CW lock could receive an extraneous CW bast instead of a PR bast. This fix narrows the original special case to what was intended.

Further testing got us back into a hang in which all our reads were apparently spinning trying to acquire try locks, eating 100% of the cpu. I tried another fix I thought up, but I got this error which we've never seen before:

DLM: Assertion failed on line 242 of file /home/bob/20May2008.dlm/lock.c
DLM: assertion: "r->res_nodeid >= 0"
DLM: time = 10546352
rsb: nodeid -1 flags 0 first 0 rlc 0 name 2 cde57c
------------[ cut here ]------------
kernel BUG at /home/bob/20May2008.dlm/lock.c:242!
invalid opcode: 0000 [#1]

Unfortunately, I don't have more because apparently the node was fenced. I'm going to assume this is an unrelated problem and continue testing my newest patch, which I'll post later tonight.
Created attachment 306207 [details] glock dumps, dlm dumps, call traces The dlm messages mentioned in my previous comment were likely caused by a problem with my experimental patch, which I've since corrected. So forget about that for now. I devised a new patch with changes to function add_to_queue in glock.c. Unfortunately, after three 15-minute runs, I was able to recreate a hang on this version as well. This tar contains glock dumps, dlm dumps and call traces from that hang on the smoke cluster. I'll post the patch itself shortly.
Created attachment 306208 [details] My patch from today

This is the latest RHEL patch I have. It contains:

1. Steve's latest and greatest changes (including the new lock_dlm).
2. Dave's new dlm patch from this morning.
3. Some debug code from this morning that will eventually need to be removed, but it may be useful for debugging, so I left it in.
4. Some experimental new code I added this evening (sketched below).

The new changes basically add a new GLF_TRYFAILED bit to the glock that flags glocks for which we've failed a trylock. My idea here is that once we fail a trylock, we might want to give the next trylock attempt a higher priority, queuing it like an LM_FLAG_PRIORITY lock at the earliest possible insertion point. That's to keep the try locks from always getting shuffled to the end of the queue and never getting done. I also rearranged the logic somewhat to simplify it and to avoid a case I was concerned about with the previous patch, where we might queue a priority glock at the front of the queue and then cancel it right away.

The code seemed to work well for two full 15-minute runs of dd_io, but again, on the third attempt, I recreated a hang. I still saw some bursts of 100% cpu utilization from d_doio, but they were usually short-lived. See the previous comment and associated attachment for details on the hang. I have not analyzed the data from the hang; I'm too tired this evening. I'm hoping Steve can get a look at it in the morning.
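To make item 4 concrete, here is a minimal sketch of the queueing idea. GLF_TRYFAILED is the new bit from this patch; the list handling is deliberately simplified (the real add_to_queue does considerably more checking):

/* Caller holds gl->gl_spin. */
static void add_to_queue_sketch(struct gfs2_holder *gh)
{
	struct gfs2_glock *gl = gh->gh_gl;

	/* If a trylock already failed once on this glock, queue the
	 * retry the way an LM_FLAG_PRIORITY holder would be queued,
	 * at the front, so try locks stop being shuffled to the back
	 * of the queue indefinitely. */
	if ((gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)) &&
	    test_and_clear_bit(GLF_TRYFAILED, &gl->gl_flags))
		list_add(&gh->gh_list, &gl->gl_holders);
	else
		list_add_tail(&gh->gh_list, &gl->gl_holders);
}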
Created attachment 306275 [details] Proposed final RHEL patch This is my proposed "final" patch that we can support in RHEL5.2.z. It has not undergone any testing in its final form, but of course, the various pieces have undergone lots of testing in prior incarnations.
The version posted in comment #100 has successfully passed all my "Hell" tests plus a complete REG run of d_io. Also, benchp13 is running happily on kool, although it seems a tiny bit slower than the version with all the lock_dlm changes.
Created attachment 306287 [details] Proposed final RHEL patch (revised) Found and removed some debug code that was in the previous patch.
I posted the patch to rhkernel-list, so I'm setting this one to POST and reassigning to Don Zickus.
This issue is not quite resolved and one more patch is expected.
We found some problems during testing, documented in bug #447920. I wrote a patch to solve these issues, which may be found here: https://bugzilla.redhat.com/show_bug.cgi?id=447920#c12 Unfortunately, this patch segfaulted during one of the dd_io tests, which seems to be a direct result of the code change I did in the patch. I'm investigating now. My plan is to leave 447920 open until we can get through the jdata tests and be sure that issue is fixed too.
Created attachment 306832 [details] Proposed final RHEL patch (revised again) This patch is a revision to the one I posted to rhkernel-list about a week ago. This fixes the problems I was reporting in bug #447920. This version has passed one run of dd_io and one run of d_io on lock_nolock, on my exxon cluster. We still need to run the "hell" tests, dd_io on the smoke cluster, and Dean's new test.
Apparently, my call to the deprecated function find_trylock_page has caused a dependency problem:

Error: Missing Dependency: ksym(find_trylock_page) = edd3ec35 is needed by package kmod-gfs2

find_trylock_page is only about seven lines long, so we could include a copy of it. However, the kernel documentation says the function is deprecated and should be replaced with functions like find_get_page or find_lock_page. What we want is something that doesn't block. I'm hoping Steve has a good idea of the "right way" to do this.
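One possible shape for a replacement that never sleeps, sketched under the assumption of a kernel of this vintage (the helper name is invented; TestSetPageLocked() is the pre-trylock_page spelling). Note that unlike find_trylock_page, this version takes a page reference that the caller must drop:

static struct page *find_page_nonblock(struct address_space *mapping,
				       pgoff_t index)
{
	struct page *page = find_get_page(mapping, index); /* takes a ref */

	if (!page)
		return NULL;
	if (TestSetPageLocked(page)) {
		/* Page lock is held elsewhere; back off rather than
		 * sleeping, preserving the find_trylock_page semantic. */
		page_cache_release(page);
		return NULL;
	}
	return page;	/* locked and referenced */
}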
Steve posted a new upstream patch for this problem in bug 447920. Unfortunately, the RHEL code is more complex because RHEL does not have the upstream code that Abhi is working on. At any rate, in rewriting the logic to avoid the dependency, I discovered a problem with the error path of my old code (and in the upstream patch as well). Basically: if acquiring the glock returns an error, and possibly in some other error cases, we could return the error to VFS without re-locking the page. VFS always tries to unlock the page unless we return AOP_TRUNCATED_PAGE. So in cases where we return another error, our page counts can get hosed. I attempted to correct that with my latest patch, which I'm testing now. This patch is based on Steve's patch, which still uses a blocking page lock, but it shouldn't matter that it blocks, because the locks are now taken in the correct order. (Thanks, Steve.)

If my testing goes well, I'll post the master patch to this bz. In the meantime, here is the retooled gfs2_readpage for people to scrutinize. I apologize for the ugly logic, but there are lots of different conditions to handle: (1) whether we got the glock, (2) whether we got the same page back as the original, (3) whether we got any page back at all, (4) whether we're dealing with the internal file sentinel, and (5) whether we've encountered an error. It makes for messy logic. I'm open to all suggestions on how to make it less messy. Here it is:

static int gfs2_readpage(struct file *file, struct page *page)
{
	struct address_space *mapping = page->mapping;
	struct gfs2_inode *ip = GFS2_I(mapping->host);
	struct gfs2_sbd *sdp = GFS2_SB(mapping->host);
	struct gfs2_file *gf = NULL;
	struct gfs2_holder gh;
	int do_unlock = 0;
	pgoff_t index = page->index;
	struct page *page2 = page;
	int error = 0;

	if (likely(file != &gfs2_internal_file_sentinel)) {
		if (file) {
			gf = file->private_data;
			if (test_bit(GFF_EXLOCK, &gf->f_flags))
				/* gfs2_sharewrite_nopage has grabbed the ip->i_gl already */
				goto skip_lock;
		}
		unlock_page(page);
		gfs2_holder_init(ip->i_gl, LM_ST_SHARED, GL_ATIME, &gh);
		error = gfs2_glock_nq_atime(&gh);
		if (unlikely(error)) {
			lock_page(page);
			return error;
		}
		do_unlock = 1;
		page2 = find_lock_page(mapping, index);
		if (page != page2)
			error = AOP_TRUNCATED_PAGE;
	}
skip_lock:
	if (!error && !PageUptodate(page2)) {
		if (gfs2_is_stuffed(ip)) {
			error = stuffed_readpage(ip, page2);
			unlock_page(page2);
		} else
			error = mpage_readpage(page2, gfs2_block_map);
		if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
			error = -EIO;
	}
	if (do_unlock) {
		if (page2)
			page_cache_release(page2);
		gfs2_glock_dq_uninit(&gh);
	}
	if (error) {
		if (error == AOP_TRUNCATED_PAGE)
			yield();
		else
			lock_page(page2);
	}
	return error;
}
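One suggestion, offered only as an untested sketch (the helper name is invented): confine conditions (4) and (5) to a helper that assumes the glock and page are already held, so the re-lookup dance stays in one place in gfs2_readpage itself:

/* Read one page, with the glock and page lock already held by the
 * caller; mirrors the skip_lock section of the function above. */
static int gfs2_readpage_locked(struct gfs2_inode *ip, struct page *page)
{
	struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host);
	int error;

	if (gfs2_is_stuffed(ip)) {
		error = stuffed_readpage(ip, page);
		unlock_page(page);
	} else {
		error = mpage_readpage(page, gfs2_block_map);
	}
	if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
		error = -EIO;
	return error;
}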
Created attachment 306972 [details] glock dumps and call traces

This is a tar/zip with a glock dump and a call trace from exxon-01. I've had lots of d_io problems with the new patch. Most of the time, all the IO just stops and everything waits. There is nothing vital in the glocks, and all the processes related to the test are waiting for page locks. I modified the patch to print a message whenever it returns an error. It never does. Still, everything waits.

This latest recreation is a bit different: it shows processes waiting for glocks, and glocks held by processes waiting for page locks. I'm going to change it back to a non-blocking page lock rather than find_lock_page and see if it behaves better again. That's because yesterday's version was working perfectly, whereas these problems only started happening after I introduced find_lock_page today.
Latest debug version of gfs2_readpage:

static int gfs2_readpage(struct file *file, struct page *page)
{
	struct address_space *mapping = page->mapping;
	struct gfs2_inode *ip = GFS2_I(mapping->host);
	struct gfs2_sbd *sdp = GFS2_SB(mapping->host);
	struct gfs2_file *gf = NULL;
	struct gfs2_holder gh;
	int do_unlock = 0;
	pgoff_t index = page->index;
	int error = 0;
	struct page *remember_page = page;

	if (likely(file != &gfs2_internal_file_sentinel)) {
		if (file) {
			gf = file->private_data;
			if (test_bit(GFF_EXLOCK, &gf->f_flags))
				/* gfs2_sharewrite_nopage has grabbed the ip->i_gl already */
				goto skip_lock;
		}
		unlock_page(page);
		gfs2_holder_init(ip->i_gl, LM_ST_SHARED, GL_ATIME, &gh);
		error = gfs2_glock_nq_atime(&gh);
		if (unlikely(error)) {
			printk("returning %d", error);
			lock_page(page);
			printk("\n");
			return error;
		}
		do_unlock = 1;
		page = find_lock_page(mapping, index);
		if (page->mapping != mapping)
			error = AOP_TRUNCATED_PAGE;
	}
skip_lock:
	if (!error && !PageUptodate(page)) {
		if (gfs2_is_stuffed(ip)) {
			error = stuffed_readpage(ip, page);
			unlock_page(page);
		} else
			error = mpage_readpage(page, gfs2_block_map);
		if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
			error = -EIO;
	}
	if (do_unlock && page)
		page_cache_release(page);
	if (error) {
		if (error == AOP_TRUNCATED_PAGE) {
			if (page && remember_page != page)
				printk("We swapped pages.\n");
			yield();
		} else if (do_unlock) {
			printk("Returning %d", error);
			lock_page(page);
			printk("\n");
		}
	}
	if (do_unlock)
		gfs2_glock_dq_uninit(&gh);
	return error;
}

Note that when the "lock3" hang happens, we're not getting ANY of the debug printk statements shown.
I expanded the scope of my search for this bug to everywhere we lock and unlock pages. After an extensive search, I've still found nothing that can account for this hang. I'll resume the search in the morning.
Created attachment 307077 [details] 29 May 2008 patch This version passes single-node d_io, and Dean's multiple-lock3 test. It still needs to be smoke tested with dd_io, plus the "hell" tests.
Created attachment 307096 [details] 29 May 2008 patch (revised) This hopefully fixes a NULL pointer dereference that Abhi found on the smoke cluster.
This version passes my "hell" tests and single-node d_io on exxon-01 and roth-01. The dd_io tests failed, most likely because this patch does not include the lock_dlm patch we did earlier. That will need to be shipped with RHEL5.3 as well at some point, but we decided to send this out without it for the sake of certain customers who will be using GFS2 as single-node.
Forgot to mention, the benchp13 test is going well too, on kool.
The good news: d_io, benchp13, and all of Nate's tests on jdata ran okay. The bad news: Abhi got this during testing:

GFS2: wanted 0 got 1
G: s:EX n:2/2053c f:lDp t:UN d:UN/1000 l:0 a:0 r:4
 I: n:22/132412 t:8 f:0x00000011
------------[ cut here ]------------
kernel BUG at /builddir/build/BUILD/gfs2-kmod-1.92/_kmod_build_/glock.c:487!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /kernel/dlm/gfs20/control
Modules linked in: sctp lock_dlm(U) gfs2(U) dlm configfs autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath video sbs backlight i2c_ec button battery asus_acpi ac lp sg floppy i2c_i801 i2c_i810 i2c_algo_bit parport_pc parport pcspkr i2c_core e1000 ide_cd cdrom dm_snapshot dm_zero dm_mirror dm_mod qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU: 0
EIP: 0060:[<e058f632>] Tainted: G VLI
EFLAGS: 00010282 (2.6.18-92.el5 #1)
EIP is at finish_xmote+0x188/0x250 [gfs2]
eax: 00000000 ebx: d87d0f38 ecx: 00000046 edx: 00000000
esi: 00000000 edi: 00000001 ebp: 00000001 esp: c38aaf40
ds: 007b es: 007b ss: 0068
Process glock_workqueue (pid: 6656, ti=c38aa000 task=df7d1000 task.ti=c38aa000)
Stack: e05a5234 00000000 00000001 e05a3ca0 d87d0f38 d87d0f40 d58f49c0 00000286
       e058fd9d d87d0fbc d87d0fc0 d58f49c0 c04331de e058fd7f d87d0f38 d58f49d4
       d58f49c0 d58f49cc 00000000 c0433a92 00000001 00000000 c13f4ee0 00010000
Call Trace:
 [<e058fd9d>] glock_work_func+0x1e/0x92 [gfs2]
 [<c04331de>] run_workqueue+0x78/0xb5
 [<e058fd7f>] glock_work_func+0x0/0x92 [gfs2]
 [<c0433a92>] worker_thread+0xd9/0x10b
 [<c042027b>] default_wake_function+0x0/0xc
 [<c04339b9>] worker_thread+0x0/0x10b
 [<c0435ea1>] kthread+0xc0/0xeb
 [<c0435de1>] kthread+0x0/0xeb
 [<c0405c3b>] kernel_thread_helper+0x7/0x10

Abhi was running with dlm, even though there should not have been lock contention to my knowledge. I know we saw this earlier, and I think it went back to either the lock_dlm or dlm modules. Unfortunately, the node was fenced and so we couldn't get dlm lock dumps.
I was just seconds away from pressing the "Send" button to rhkernel-list on this patch. Oh well. The strange thing is that the DEMOTE_IN_PROGRESS bit is on, and yet there are no waiters or holders on the queue.

Here's my theory: at the beginning of glock_work_func it does:

	if (test_and_clear_bit(GLF_REPLY_PENDING, &gl->gl_flags))
		finish_xmote(gl, gl->gl_reply);
	spin_lock(&gl->gl_spin);

The first thing finish_xmote does is take the gl_spin spinlock. I'm thinking there might be a timing window during which another process can jump in and deal with the reply from dlm. That might give the symptom Abhi saw. This is tricky code, so I'm not 100% sure.

If my theory is right, we might be able to fix it by changing finish_xmote so that it assumes gl_spin is already held on entry. It can also exit with the spinlock held (although it may need to release it and reacquire it internally). In other words:

	spin_lock(&gl->gl_spin);
	if (test_and_clear_bit(GLF_REPLY_PENDING, &gl->gl_flags))
		finish_xmote(gl, gl->gl_reply);

The other caller, do_xmote(), would have to have its spin_lock usage adjusted accordingly as well. Steve might have a simpler explanation for what happened and how to fix it. I know it took Abhi several hours of running before this failure occurred, so even if we try my fix, it will be hard to tell whether it's really fixed. Maybe after dinner I'll try to whip up a patch and see how it behaves. We need to determine whether this can happen with lock_nolock, and if not, whether it's okay to ship this to the customer.
Created attachment 307158 [details] 29 May 2008 patch (revised again) This version merely rearranges the spin_locks as noted in comment #119. It's passed all the "hell" tests. Doing d_io now. Waiting for Steve's comments and/or revision before posting to rhkernel-list. It'd be nice if Abhi could try to recreate this afternoon's failure with this version, but that also depends on what Steve can make out of the information posted. Unfortunately, the root hard drive died on exxon-02, rendering that cluster useless, as a three-node cluster anyway. I'm using the roth cluster until Chris can get it fixed. I'll trim the exxons down to a two-node cluster if need be, but for now the roths suffice.
The patch from comment #120 passes a single-node d_io test (all 49 tests successful). I've set it to run continuously overnight. I also put the new patch on kool for some overnight benchp13 runs.
If you are using dlm and you've not got the three fixes in the kernel, then it's not at all surprising that you've hit this. Provided that was the case, I don't think it's a problem we need to worry about. The message means that we asked to unlock but received an exclusive lock instead. That can't happen with lock_nolock, since every lock request always succeeds, returning exactly the requested mode.
The patch from comment #120 ran d_io for seven hours successfully, but my SAN apparently died on the roth cluster, which caused a kernel panic. Meanwhile, benchp13 is still running happily on kool.
Comment on attachment 307096 [details] 29 May 2008 patch (revised) This is the most tested version, and shouldn't have problems single-node, so this is what we plan to ship. Therefore, I'm unmarking it obsolete.
In light of comment #122, we've made the decision to ship the version from comment #115, which has had the most testing and should work fine for single-node configurations. The changes I did last night should not be necessary for single-node configurations. I've posted that patch to rhkernel-list so I'm changing this back to POST and reassigning to Don Z again.
*** Bug 447920 has been marked as a duplicate of this bug. ***
*** Bug 432108 has been marked as a duplicate of this bug. ***
*** Bug 437893 has been marked as a duplicate of this bug. ***
This is fixed in kernel-2.6.18-98.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
*** Bug 428751 has been marked as a duplicate of this bug. ***
Verified. We haven't seen this in quite a while.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html