Escalated to Bugzilla from IssueTracker
Event posted on 2010-05-25 11:34 PDT by woodard

Description of problem:
The problem manifests itself when nr_unused ends up being negative. This causes an oops when the dcache is reclaimed. This problem has been really hard to track down. First we were able to find that the problem was evident when we dropped the caches between HPC jobs. The customer wrote the attached stap script, which allowed us to verify when the problem occurs. We were finally able to catch the kind of workload that causes the original problem, and that led us to a reproducer. The reproducer seems to be the same one that was causing bug 178773. This problem seems to be with RHEL, but we haven't eliminated the possibility that it is some sort of local patch.

How reproducible:
Seemingly always.

Steps to Reproduce:
Very difficult outside of LLNL's environment; we are working on that now. We think: populate the dcache with a lot of negative entries that exist on an NFS server. Wait about 15 minutes for these entries to time out. Then revalidate them. If the end

This event sent from IssueTracker by woodard [LLNL (HPC)] issue 937133
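The suspected reproducer in "Steps to Reproduce" could be sketched roughly as the shell script below. Everything here is illustrative: NFS_DIR, COUNT, and WAIT are knobs invented for the sketch, a throwaway local directory is substituted by default so it runs anywhere, and actually tickling the bug would need a real NFS mount plus the full ~15-minute attribute-cache timeout described above.

```shell
#!/bin/bash
# Hedged sketch of the suspected reproducer. NFS_DIR/COUNT/WAIT are
# illustrative knobs, not values from the ticket; a real attempt needs
# NFS_DIR pointed at an NFS mount and a much longer WAIT.
NFS_DIR="${NFS_DIR:-$(mktemp -d)}"
COUNT="${COUNT:-1000}"

# 1. Populate the dcache with negative entries: looking up names that
#    do not exist leaves negative dentries behind.
for i in $(seq 1 "$COUNT"); do
    stat "$NFS_DIR/no-such-file-$i" >/dev/null 2>&1
done
echo "performed $COUNT negative lookups under $NFS_DIR"

# 2. Wait for the entries to time out (~15 minutes in the report;
#    shortened here so the sketch terminates).
sleep "${WAIT:-1}"

# 3. Revalidate the same, still-nonexistent, names.
for i in $(seq 1 "$COUNT"); do
    stat "$NFS_DIR/no-such-file-$i" >/dev/null 2>&1
done
echo "revalidated $COUNT entries"
```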
Event posted on 2010-05-25 11:34 PDT by woodard Issue Registered (Severity: 3)
Event posted on 2010-05-25 11:34 PDT by woodard File uploaded: dcache.stp (it_file 702543)
Event posted on 2010-05-26 14:42 PDT by woodard <grondo> unable to recreate dcache issue (so far) with the RHEL5.4 patch linux-2.6-fs-private-dentry-list-to-avoid-dcache_lock-contention.patch <grondo> removed <neb> just to confirm that I understand what you said. <neb> You removed: linux-2.6-fs-private-dentry-list-to-avoid-dcache_lock-contention.patch and so far you haven't seen the problem reappear. <grondo> yes, sorry <grondo> I expected the old problem to reappear <grondo> but if I remember right that was a difficult one to reproduce <neb> weird I don't know Lachlan McIlroy <neb> I do know jeff layton and josef <neb> let me go and talk to them <grondo> for someone familiar with the code, and given a description of the problem, I have a feeling they could identify the fix needed in the patched code pretty quickly <grondo> it seems to be a case where dentry_stat.nr_unused is decremented but the dentry doesn't actually get removed from the list <grondo> and it seems to be somewhere in that patch <grondo> though a similar bug did exist before that patch, which would cause softlockups with the same reproducer <neb> I sort of feel like I can almost but not quite see the problem having looked at the patch <neb> it is that tip of the tongue feeling. <grondo> I felt similar, but I think I would have to spend half a day at least walking through the code to fully understand it first * neb chatting with josef <neb> <josef> hmm well shit <grondo> :-o <neb> hey do you have one of the oopses that occurs anywhere handy <neb> I know that we've been trying to get far away from that <neb> it has been happening on atlas right? <grondo> an oops? <grondo> what oops exactly? <neb> <josef> neb: i can give you a patch that makes the nr_unused stuff work right <neb> <josef> but who knows if that's the actual problem <neb> <josef> can you test the patch and see if it fixes the problem?
<neb> <neb> sure <grondo> Send it over and I will test <neb> The oops when nr_unused and the length of the unused list get way out of sync and the thing finally blows up. <grondo> doesn't oops <grondo> gets stuck indefinitely in prune_dcache or thereabouts <grondo> since there are tests in the code that kind of assume nr_unused >= 0 <grondo> ;-) <grondo> there aren't any actual checks <neb> just forwarded you Josef's patch. <neb> Can you try it out? <grondo> yep <grondo> shouldn't take me too long <neb> rough expectation on time. <neb> couple of hours? <grondo> or less yeah <grondo> that's amazing he saw a fix so quick <neb> <josef> neb: actually wait <neb> <josef> that patch is crap <grondo> heh <neb> just came now <neb> rush job -- bad job <neb> he needs more time. <grondo> no prob <grondo> no rush
Event posted on 2010-05-26 14:43 PDT by woodard <neb> josef, jlayton you guys acked a patch linux-2.6-fs-private-dentry-list-to-avoid-dcache_lock-contention.patch <neb> for BZ526612 and with that patch applied we are seeing problems with the dcache on large cpu count boxes <neb> when we remove that patch we can't reproduce the problem anymore. <neb> I worry that there might be a subtle race in it. <neb> after about 1 month of work and a lot of systemtap scripts we found that at the root of the problem the number of items on the dentry_unused list doesn't match nr_unused. <neb> The numbers slowly drift out of agreement and eventually you end up with an oops and one of the odd things before the oops is that nr_unused is negative. <bob> I gotta go take my dog to the vet. <josef> neb: ok you still there? <neb> yes <josef> neb: in select_parent <josef> hmm well shit <neb> Unfortunately it is hard to reproduce but it is taking out 50-100 machines a day here at LLNL. <josef> neb: whats the oops? <neb> it happens hours after the problem first appears <neb> but hold on I can get it for you. <neb> it seems to lose a few at a time and then sometime when it's 3-9000 off things go poof. * neb looking for one of the oopses <josef> neb: i can give you a patch that makes the nr_unused stuff work right <josef> but who knows if that's the actual problem <josef> can you test the patch and see if it fixes the problem? <neb> sure <neb> josef, BTW we were also bit by BZ526612 but it didn't hit as frequently 2-3x/week <josef> neb: whats your email <neb> woodard <neb> oh yeah that's why I can't find the oops grepping through the logs.
<neb> It doesn't oops; the machine basically locks up, stuck indefinitely in prune_dcache or thereabouts <neb> this is due to all the checks in the code that kind of assume nr_unused >= 0 <josef> neb: actually wait <josef> that patch is crap <neb> OK won't test that one. <josef> neb: i've got to run and do birthing classes <josef> neb: if you can send me all the debugging you have i'll take a look tomorrow <josef> and give you something a little better thought out :) <neb> We've been working on this for >1month and we thought it was a local problem. <neb> All the debug info is WAY too much but I'll make sure that I get you a good BZ <neb> OK?
Event posted on 2010-05-26 14:56 PDT by woodard May 05 11:04:38 <grondo> argggh. More hits of that dcache_lock bug May 05 11:04:54 <grondo> really have to focus on that one and get to the bottom of it now May 05 11:08:20 <neb> any more information this time? May 05 11:08:45 <neb> I think we kind of abandoned it before due to lack of info while we figured that it might be a lustre bug. May 05 11:09:18 <grondo> No more info yet. Have yet to look into it today May 05 11:09:45 <neb> still thinking it is a lustre bug? May 05 11:10:15 <grondo> I guess not. Had Chris Morrone look into it and I guess he didn't find anything lustre related. May 05 11:10:56 <neb> Can you remind me of the facts again? May 05 11:12:03 <grondo> So far all I know is that a process gets hung when trying to do 'echo 2 > /proc/sys/vm/drop_caches' May 05 11:12:20 <grondo> it is hung trying to lock the dcache_lock May 05 11:13:19 <grondo> no other process seems to have the dcache_lock afaics May 05 11:14:24 <neb> https://bugzilla.redhat.com/show_bug.cgi?id=526612 May 05 11:14:31 <neb> I think we have that fix already May 05 11:15:03 <neb> https://bugzilla.redhat.com/show_bug.cgi?id=554574 May 05 11:15:08 <neb> this one is new May 05 11:52:39 <grondo> strangely we are not getting softlockups with this problem May 05 11:52:47 <grondo> so nothing is hogging the dcache_lock May 05 11:53:08 <neb> is it lock juggling? May 05 11:53:21 <grondo> I don't think so May 05 11:54:07 <grondo> there is just one process spinning on the dcache_lock May 05 11:54:12 <grondo> it is strange May 05 11:54:14 <grondo> I don't really know where to start looking May 05 11:54:40 <neb> the problem is triggered when you try to drop caches. May 05 11:55:06 <neb> how often does it happen? is it intermittent or fairly frequent?
May 05 11:55:08 <grondo> [<ffffffff800654e6>] __sched_text_start+0xf6/0xbe3
May 05 11:55:08 <grondo> [<ffffffff8002f607>] prune_dcache+0x37/0x138
May 05 11:55:08 <grondo> [<ffffffff800f4a7f>] shrink_dcache_memory+0x19/0x32
May 05 11:55:08 <grondo> [<ffffffff80040a15>] shrink_slab+0xe0/0x15c
May 05 11:55:08 <grondo> [<ffffffff800fe952>] drop_slab+0x36/0x5f
May 05 11:55:11 <grondo> [<ffffffff800fee30>] drop_caches_sysctl_handler+0x162/0x1ae
May 05 11:55:13 <grondo> [<ffffffff8009c5e4>] do_rw_proc+0xcb/0x120
May 05 11:55:16 <grondo> [<ffffffff8009c653>] proc_writesys+0x1a/0x1c
May 05 11:55:19 <grondo> [<ffffffff80017012>] vfs_write+0xcf/0x175
May 05 11:55:21 <grondo> [<ffffffff80017902>] sys_write+0x47/0x70
May 05 11:55:23 <grondo> [<ffffffff80060116>] system_call+0x7e/0x83
May 05 11:55:26 <grondo> seems to be very rare
May 05 11:55:29 <grondo> that is the backtrace of the process trying to drop caches
May 05 11:55:34 <grondo> (calling drop_slab)
May 05 11:59:37 <grondo> hm, maybe I should actually look at the code in prune_dcache and see what could be going on
May 05 11:59:53 <grondo> I just assumed it is the dcache_lock, but I can't remember why now
May 05 12:02:37 <grondo> the process is listed as running:
May 05 12:02:42 <grondo> prolog R running task 0 17916 17915 (NOTLB)
May 05 12:02:47 <grondo> Call Trace: [<ffffffff80066031>] thread_return+0x5e/0xf6 [<ffffffff8002f5f6>] prune_dcache+0x26/0x138
May 05 12:03:20 <neb> what's at 138?
May 05 12:03:33 <neb> that's pretty early in the function
May 05 12:05:00 <grondo> spin_lock(&dcache_lock);
May 05 12:05:05 <grondo> guess it is the lock
May 05 12:11:57 <neb> is there any commonality in the jobs that happen to be running on the nodes. Maybe it is some unusual usage pattern such as zillions of small files.
May 05 12:12:09 <neb> thus the dcache is much bigger than for other jobs.
May 05 12:12:54 <neb> or maybe it looks for a lot of files that don't exist and there are a ton of negative dcache entries.
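For reference, the top of that backtrace is the prolog's cache-drop step. A minimal sketch of it follows, assuming the standard drop_caches sysctl semantics (writing 2 asks the kernel to reclaim slab objects such as dentries and inodes); the write needs root, so the sketch falls back to printing the command when the file isn't writable.

```shell
#!/bin/bash
# Sketch of the cache-drop step that produces the backtrace above
# (sys_write -> drop_caches_sysctl_handler -> drop_slab -> shrink_slab
# -> shrink_dcache_memory -> prune_dcache). Writing "2" reclaims slab
# objects (dentries and inodes). Requires root, so only perform the
# write when it is actually possible.
if [ -w /proc/sys/vm/drop_caches ]; then
    sync                                # flush dirty pages first
    echo 2 > /proc/sys/vm/drop_caches
    msg="dropped dentries/inodes via /proc/sys/vm/drop_caches"
else
    msg="would run: echo 2 > /proc/sys/vm/drop_caches"
fi
echo "$msg"
```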
May 05 12:13:22 <grondo> this happens before the job starts, in the job prolog May 05 12:13:33 <grondo> so I guess it would have to be the previous job May 05 12:13:38 <neb> but what did the job before it do May 05 12:13:39 <neb> yeah May 05 12:14:35 <neb> what if you changed the drop_caches to the postscript rather than the prolog? May 05 12:14:46 <neb> might point the finger a bit better. May 05 12:15:12 <grondo> yeah, I'm not sure why it is done in the prolog. May 05 12:17:53 <grondo> I should be able to tell which jobs ran previously May 05 12:18:13 <grondo> however that won't help me figure out what took the lock and didn't drop it! May 05 12:18:40 <neb> the previous job might give some clues as to how to make a reproducer though. May 05 12:18:52 <neb> if we can reproduce, then we can study much better. May 05 12:19:20 <neb> maybe have them run their job and then before the next job starts look at the state of the system May 05 12:19:25 <grondo> yes that is true May 05 14:32:16 <neb> any more information on the dcache_lock problem? May 05 14:32:31 <neb> anything actionable yet? May 05 14:33:09 <neb> Should I open up a ticket just to store notes until we can rule out it being a RHEL bug? May 05 14:35:24 <grondo> no, let's wait a bit for that May 05 14:35:34 <grondo> still thinking it must be self-inflicted May 05 14:36:26 <neb> OK I'll stand by May 05 16:02:03 <grondo> wish there was a runtime way to enable DEBUG_SPINLOCK May 05 16:03:11 <neb> The way that they turn off and on tracepoints in the kernel might be able to be adapted to do something like that but it is one of those things that is "really clever" May 05 16:18:50 <grondo> I wonder if I can easily write a systemtap or kprobes mousetrap for this issue May 05 16:18:59 <grondo> I'm not really sure how else to proceed at this point May 05 16:19:08 <grondo> looking for the common job but so far don't see much May 05 16:19:51 <neb> So what would you be looking for?
May 05 16:20:14 <grondo> all I need to collect was the last locker of dcache_lock May 05 16:20:23 <grondo> (backtrace) May 05 16:20:40 <neb> that should be doable. May 05 16:20:41 <grondo> that will be a significant hint I'd imagine May 05 16:21:02 <grondo> but I'll want to run the probe everywhere, so it should be low impact May 05 16:28:22 * neb looking for a bit of code to snatch May 05 16:29:56 <neb> how many places is the dcache_lock taken? May 05 16:30:10 <grondo> I don't know. About a zillion? May 05 16:30:13 <grondo> ;-) May 05 16:32:49 <neb> [root@snog linux-2.6.33.3]# find . -name "*.c" | xargs grep spin_lock.*dcache_lock | wc -l May 05 16:32:49 <neb> 80 May 05 16:32:53 <neb> hmm May 05 16:33:23 <grondo> you know grep has a -c option ;-) May 05 16:33:33 <grondo> and a -R ;-) May 05 16:33:34 <grondo> heh May 05 16:34:03 <neb> pbbt May 05 16:34:13 <neb> TIMTOWTSAC May 05 16:34:38 <grondo> I don't know that particular acronym :-) May 05 16:34:46 <neb> there is more than one way to skin a cat May 05 16:36:34 <grondo> Hm, this patch might help: https://patchwork.kernel.org/patch/45557/ May 05 16:36:47 <grondo> (heh. joking in case you didn't get it) May 05 16:37:43 <neb> I was wondering if there was some way that you could modify the ticketed spinlock code to give you the clue you want. May 05 16:38:17 <grondo> well, if I'm going to be modifying code, which would require a new kernel build, I would probably just build with debug spinlocks May 05 16:38:30 <grondo> which track the current holder of the lock I think May 05 16:40:08 <grondo> I want to do something that doesn't require taking the machine down May 05 16:44:32 <neb> However, I don't know how to probe 80 probe points. May 05 16:45:22 <neb> if it was fewer or more local May 05 16:47:34 <grondo> I wonder if I could (safely) write a kprobe for spin_lock/unlock that replaces the spinlock implementation with the debug version at runtime May 05 16:48:15 <neb> Honestly, I doubt it. 
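As an aside, grondo's -R/-c suggestion collapses the find|xargs pipeline above into a single grep invocation. A sketch, demonstrated on a synthetic two-file tree since a kernel source tree isn't assumed to be on hand:

```shell
#!/bin/bash
# grondo's suggestion: one recursive grep replaces find|xargs|wc.
# Demonstrated on a throwaway tree; on a real kernel tree you would
# run the same grep from the source root.
tree=$(mktemp -d)
mkdir -p "$tree/fs"
cat > "$tree/fs/a.c" <<'EOF'
	spin_lock(&dcache_lock);
	spin_unlock(&dcache_lock);
EOF
cat > "$tree/fs/b.c" <<'EOF'
	spin_lock(&dcache_lock);
EOF
# equivalent of: find . -name "*.c" | xargs grep spin_lock.*dcache_lock | wc -l
hits=$(grep -R --include='*.c' -c 'spin_lock.*dcache_lock' "$tree" \
       | awk -F: '{s += $2} END {print s}')
echo "lock-site matches: $hits"
rm -rf "$tree"
```

Note that `-c` prints a per-file count, so the awk sum reproduces the single total that `wc -l` gave.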
May 05 16:48:17 <grondo> hm, maybe not because you can not really do an early return May 05 16:49:07 <neb> The reason being you would need a spinlock to insert the code May 05 16:49:11 <neb> <neb> customer has a problem with a missing unlock of dcache_lock, can anyone think of a way to figure out who is the last person to have the lock? Unfortunately, there are about 80 places which take the dcache_lock May 05 16:49:11 <neb> <mjg59> neb: Well, if the kernel has lockdep, you can do alt+sysreq+d May 05 16:49:11 <neb> <roland> run a kernel with lockdep enabled, but i don't suppose we have rhel kernels like that May 05 16:49:19 <grondo> I wonder if it would be too slow to essentially do if (lock == dcache_lock) { /* store some debug */ } May 05 16:49:22 <neb> plus isn't spin_lock inlined code. May 05 16:49:44 <grondo> probably May 05 16:49:56 <neb> <mjg59> Yeah, for some reason it's on in our kdump kernels but not our debug kernels May 05 16:49:56 <neb> <mjg59> neb: So probably needs a custom kernel build May 05 16:50:37 <grondo> how much overhead is lockdep? Could it be run in production for awhile, or is this really something where we need a reproducer first, then we run the lockdep kernel? May 05 16:50:49 <neb> <roland> and if you're actually talking about rhel5 then there probably isn't an option like that, is there? May 05 16:50:49 <neb> <neb> yeah this is still RHEL5 May 05 16:52:02 <neb> <roland> neb: well, stap on all those 80 sites, i guess May 05 16:52:02 <neb> <neb> yeah that might be the best May 05 16:52:02 <neb> <neb> sigh May 05 16:52:34 <grondo> not to mention the sites in lustre... May 05 16:52:42 <neb> heh May 05 16:52:50 <grondo> doesn't the core spinlock code detect some of these issues?
May 05 16:53:07 <neb> not that I know of May 05 16:53:19 <grondo> ok May 05 16:54:37 <neb> I think you could automate writing the script a bit. May 05 17:00:12 <grondo> heh, this post has a good testcase May 05 17:00:15 <grondo> http://lkml.org/lkml/2009/1/16/503 May 05 17:01:38 <neb> I worry that if you do all the stap stuff it will prevent the race.
Event posted on 2010-05-26 15:11 PDT by woodard
May 13 10:24:42 <grondo> hm, more occurrences of the dcache bug
May 13 10:24:57 <grondo> # atlas866 /root > cat /proc/sys/fs/dentry-state
May 13 10:24:57 <grondo> 2826 -3260655 45 0 0 0
May 13 10:25:16 <neb> humph
May 13 10:25:34 <neb> so have you been monitoring it?
May 13 10:25:35 <grondo> the new logging shows that the node enters the job epilog in this state
May 13 10:25:39 <grondo> 2010-05-12-04:36:22: 52545 prolog-start 5327 1692
May 13 10:25:39 <grondo> 2010-05-12-04:36:26: 52545 prolog-end 3615 200
May 13 10:25:39 <grondo> 2010-05-12-05:06:37: 52545 epilog-start 4831 -3258728
May 13 10:25:39 <grondo> 2010-05-12-05:06:45: 52545 epilog-end 5261 -3258304
May 13 10:25:39 <grondo> 2010-05-12-05:07:42: 52551 prolog-start 5260 -3258305
May 13 10:25:59 <grondo> wait, no
May 13 10:26:14 <grondo> oh, yeah, I was right
May 13 10:26:20 <grondo> so something the job is doing
May 13 10:27:36 <neb> any records of what the job is doing?
May 13 10:27:44 <grondo> not really
May 13 10:27:47 <grondo> I will have to email the user
May 13 10:27:56 <grondo> checking the user's other jobs on the system
May 13 10:28:14 <grondo> no other nodes have this symptom
May 13 10:28:15 <neb> is it a fairly contained set of users.
May 13 10:30:36 <neb> does it happen to all the nodes participating in the job?
May 13 10:30:41 <grondo> no
May 13 10:30:55 <grondo> only seems to have happened on one node of an 8 node job
May 13 10:30:59 <grondo> still gathering info
May 13 10:32:56 <grondo> the job timed out, I wonder if that is a clue
May 13 10:35:57 <neb> hmm I wonder if it had a ton of files open or something like that when it timed out
May 13 11:07:27 <grondo> two other nodes in this state
May 13 11:07:30 <grondo> different users
May 13 11:08:12 <neb> any clues on the code yet, did they time out as well?
May 13 11:08:23 <neb> was it the same 1 node out of quite a few?
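For readers decoding the /proc/sys/fs/dentry-state lines quoted above: in kernels of this era the six fields are the members of struct dentry_stat_t in fs/dcache.c (nr_dentry, nr_unused, age_limit in seconds, want_pages, and two unused dummies). A small sketch that labels the atlas866 sample:

```shell
#!/bin/bash
# Decode the six fields of /proc/sys/fs/dentry-state
# (struct dentry_stat_t: nr_dentry, nr_unused, age_limit,
# want_pages, plus two dummies). The sample line is the one
# captured from atlas866 above; a live system would instead do:
#   read ... < /proc/sys/fs/dentry-state
sample="2826 -3260655 45 0 0 0"
read -r nr_dentry nr_unused age_limit want_pages d1 d2 <<<"$sample"
echo "nr_dentry=$nr_dentry nr_unused=$nr_unused age_limit=${age_limit}s want_pages=$want_pages"
if [ "$nr_unused" -lt 0 ]; then
    echo "BUG signature: nr_unused is negative"
fi
```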
May 13 11:08:38 <grondo> no the other one completed
May 13 11:25:48 <grondo> hey, whaddyaknow, sar logs dentry unused stats
May 13 11:26:08 <grondo> 00:00:01 dentunusd file-sz inode-sz super-sz %super-sz dquot-sz %dquot-sz rtsig-sz %rtsig-sz
May 13 11:26:17 <grondo> I don't know what the other stuff is
May 13 11:26:46 <grondo> but this allows me to put a more precise timestamp on the event
May 13 11:26:54 <grondo> 04:30:01 1312 1020 4689 0 0.00 0 0.00 0 0.00
May 13 11:26:57 <grondo> 04:40:01 4294714788 1020 3924 0 0.00 0 0.00 0 0.00
May 13 11:27:23 <grondo> so somewhere between 0430 and 0440 the dcache went haywire
May 13 11:27:51 <grondo> the job started at 04:36
May 13 11:29:25 <grondo> file-sz is the number of used file handles
May 13 11:29:46 <grondo> inode-sz is the "Number of used inode handlers"[sic]
May 13 11:32:09 <grondo> neither of those look suspicious
May 13 11:35:38 <grondo> other thing going on at that time is a large-ish amount of pagefaults:
May 13 11:35:47 <grondo> 04:30:01 0.00 0.00 10999.17 0.01
May 13 11:35:47 <grondo> 04:40:01 0.00 0.00 14179.37 1.86
May 13 11:35:57 <neb> that may not be weird.
May 13 11:36:02 <grondo> however, that probably matches with the fact that the job was being launched at that time
May 13 11:36:20 <neb> exactly a whole lot of pages might need to be loaded to get an app up and running
May 13 11:36:49 <grondo> I will check on other nodes to see if the problem seemed to have occurred during job launch
May 13 11:36:55 <grondo> that will be an interesting datapoint
May 13 11:36:57 <neb> I wonder if it was doing a huge amount searching for shared libs and stuff causing the negative dcache to get giant.
May 13 11:37:50 <grondo> nothing else of interest in the sar logs
May 13 11:38:40 <neb> e.g.
ldd /usr/bin/evolution | wc -l = 129 May 13 11:39:01 <grondo> yeah that is nothing May 13 11:41:27 <neb> but if your ld.so.conf is big and then you have an app that has zillions of shared libs then the number of negative dcache entries might be gobs May 13 11:43:41 <grondo> we happen to have a test app that simulates that May 13 11:43:43 <grondo> pynamic May 13 11:43:52 <neb> hmm May 13 11:44:09 <neb> try pynamic in a tight loop? May 13 11:46:46 <grondo> the actual binary that caused the problem only has 19 .so's May 13 11:46:55 <neb> bummer. May 13 11:46:56 <grondo> (the binary in one case anyway) May 13 11:47:08 <neb> does it go fishing for config files though? May 13 11:47:14 <neb> or restart files? May 13 11:47:51 <grondo> maybe the failures, which all look to have occurred in the wee hours of 5/12, happened because something else was running at the same time May 13 11:50:15 <grondo> e.g. dist May 13 11:51:01 <grondo> or fixperms May 13 11:53:35 <neb> hmm what are the times? May 13 11:54:13 <grondo> fixperms matches up exactly with the event on one node May 13 11:54:16 <grondo> still looking into others May 13 11:56:47 <grondo> god damn they run fixperms a lot May 13 11:56:57 <grondo> May 12 04:36:02 atlas866 logger: fixperms: done running at 05/12/10 04:36:02 ... May 13 11:57:00 <grondo> another time match May 13 11:57:04 <grondo> :-O May 13 11:57:19 <neb> very interesting. May 13 11:57:26 <neb> what does fixperms do again? May 13 12:39:58 <grondo> turns out fixperms running at the same time as the job startup wasn't quite right May 13 12:40:15 <grondo> it just looked like the same time because fixperms runs in the epilog, just before the next job starts May 13 12:41:13 <neb> oh well. I thought we were onto something. May 13 12:41:50 <neb> this one is one of the more interesting investigation kind of problems. We usually have a better idea of the cause of something repeatable.
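A side note on the sar output quoted earlier: sar renders dentunusd as an unsigned 32-bit value, so the huge 4294714788 at 04:40 is just a negative nr_unused that has wrapped around. Converting it back:

```shell
#!/bin/bash
# sar prints the dentunusd counter as unsigned 32-bit; reinterpret
# the wrapped value from the 04:40 sample as a signed number.
raw=4294714788
signed=$(( raw >= 2147483648 ? raw - 4294967296 : raw ))
echo "dentunusd $raw is really $signed"
```

So by 04:40 the counter had already drifted about a quarter million below zero.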
May 13 12:47:31 <grondo> yeah, I'm kind of at a loss May 13 12:49:46 <neb> so we know it is something that the program is doing. May 13 12:50:31 <neb> I think we should track the program and talk to the person running the program and maybe do a test run or two to sort of profile it. May 13 12:51:02 <grondo> uh, there isn't really a common program between all the failures May 13 12:51:35 <neb> It very well might be that they are all doing somewhat the same thing or maybe even using a particular library function. May 13 12:52:19 <neb> If we could watch the sequence of things and then stop the program when things seem to start going off the rails. it might give us some idea of what the program is doing at that moment. May 13 12:52:25 <neb> or what it has just done. May 13 12:54:09 <grondo> well, as I said I think the problem occurs perhaps before even main() May 13 12:54:30 <neb> when did you say that and why do you think that? May 13 12:56:49 <grondo> based on the timestamps in sar logs May 13 12:57:10 <grondo> which does only give a 10 minute window May 13 13:03:05 <grondo> however, it is always within that first 10 minutes May 13 13:03:37 <grondo> in one case it was within the same minute May 13 13:03:56 <neb> reading init files? May 13 13:04:00 <grondo> so it could be something each job (by a different user, different app) does the same at startup May 13 13:04:17 <neb> esp if it is a lib function May 13 13:04:17 <grondo> or the more likely case is that it happens during launch itself May 13 13:04:24 <grondo> because that is the same across all jobs May 13 13:04:53 <neb> I'm not sure that I agree with you on the "more likely" bit May 13 13:05:12 <neb> it might be that they all search for initfiles or something May 13 13:05:27 <neb> that isn't launch but it might be something that many apps do. May 13 13:07:18 <grondo> initfiles? 
May 13 13:07:35 <grondo> I think most apps have an input file and a restart dump May 13 13:07:49 <grondo> I _highly_ doubt that they search some kind of path for either of those May 13 13:08:11 <grondo> however, it still could be something they do at startup May 13 13:08:31 <grondo> I just find it more likely that it is the process of launching, just because that is definitely the same May 13 13:11:29 <grondo> even jobs using different MPI libraries are involved May 13 13:11:37 <grondo> thought maybe the MPI had something to do with it May 13 13:12:21 <grondo> one launches out of lustre, another out of NFS homedir May 13 13:12:33 <grondo> no luck with leads here May 13 13:15:25 <neb> yes abstractly everybody has init files. However, there still may be something about the specific way that they find and read the init files that might cause the problem May 13 13:16:27 <neb> and everybody launches too so the same argument you make about init files applies to that and maybe even more so. May 13 13:17:05 <neb> e.g. how many open() or access() calls does it do? May 13 13:19:44 <neb> is it always just one node / job? May 13 13:19:57 <neb> could that be MPI rank(0) node? May 17 11:22:58 <grondo> ugh. lots more hits on the dcache problem May 17 11:25:24 <grondo> I installed a systemtap script that did a backtrace from the function in dcache.c that first notices nr_unused < 0 May 17 11:25:34 <grondo> however, there doesn't seem to be any extra clues there May 17 11:25:48 <grondo> one app hit the issue while doing an open May 17 11:25:59 <grondo> another hit while doing a stat May 17 11:26:44 <grondo> I think I will somehow have to understand the dcache accounting better so I can develop more assertions other than just nr_unused < 0 May 17 11:27:39 <grondo> something seems to be disrupting the accounting before, then at some point later the value goes negative as a result.
May 17 11:29:29 <grondo> and the accounting just goes wildly out of whack May 17 11:30:31 <neb> data corruption? May 17 11:30:49 <neb> maybe it isn't a dcache accounting problem after all. May 17 11:30:50 <grondo> could be, but wouldn't that tend to be more random May 17 11:31:07 <neb> isn't the trend pretty darn random? May 17 11:31:21 <neb> that seems to be THE thing about this problem May 17 11:31:56 <grondo> no, always happens at or near job startup, always affects the dcache dentry_stat.nr_unused value (or I guess that is the symptom) May 17 11:32:32 <grondo> wouldn't memory corruption tend to cause random effects, a node crash here, invalid value there, etc... May 17 11:32:45 <neb> not necessarily May 17 11:33:11 <neb> if it is a constant offset into the structure May 17 11:34:37 <neb> http://rhkernel.org/RHEL5+2.6.18-194.el5/fs/dcache.c#L60 May 17 11:35:43 <neb> would you be able to try a test kernel where the dentry_stat is relocated before the other variables in the file. May 17 11:35:59 <neb> like put it up near line 60 and then have all the other variables below it. May 17 11:36:47 <grondo> does that really affect the layout of memory? May 17 11:37:29 <neb> and maybe reorganize the dentry_stat_t type so that it has nr_unused down farther in the structure. May 17 11:38:07 <neb> There are some cases where it might not and you are right that this case might be one of those. May 17 11:39:11 <grondo> I don't know the rules for that May 17 11:39:17 <neb> the compiler might want to put the statics in a bss segment and the dentry_stat_t in a static data section May 17 11:39:38 <neb> don't worry mere mortals aren't supposed to care.
;-)
May 17 11:39:41 <grondo> https://eris.llnl.gov/svn/kernel-exp/vanilla/trunk/patches/rh/linux-2.6-fs-private-dentry-list-to-avoid-dcache_lock-contention.patch
May 17 11:39:53 <grondo> that patch is new with this release
May 17 11:40:23 <grondo> I think it fixed another bug we were having, but I wonder if I should try backing it out. We have a w/a for the other bug
May 17 11:40:48 <grondo> note this hunk
May 17 11:40:51 <grondo> @@ -770,8 +769,7 @@ resume:
May 17 11:40:51 <grondo> * of the unused list for prune_dcache
May 17 11:40:51 <grondo> */
May 17 11:40:51 <grondo> if (!atomic_read(&dentry->d_count)) {
May 17 11:40:51 <grondo> - list_add_tail(&dentry->d_lru, &dentry_unused);
May 17 11:40:53 <neb> grrr fucking firewalls
May 17 11:40:53 <grondo> - dentry_stat.nr_unused++;
May 17 11:40:56 <grondo> + list_add_tail(&dentry->d_lru, dispose_list);
May 17 11:40:58 <grondo> found++;
May 17 11:41:01 <grondo> }
May 17 11:41:04 <grondo>
May 17 11:41:17 <grondo> I'm sure it is right, but the patch does definitely modify the dentry_stat accounting
May 17 11:41:28 <grondo> maybe it could have introduced a very rare leak?
May 17 11:41:59 <grondo> it would be nice if I understood the invariants of the accounting, which should be pretty simple, I could write a systemtap script that verified them on every function call
May 17 11:42:32 <grondo> for now, I think I'll capture the last N values from dentry_stat and spit them out on failure...
in case something jumps out May 17 15:33:40 <grondo> trying to do something like a circular buffer log in systemtap May 17 15:33:53 <grondo> can't find any examples online, and the obvious things don't work May 17 15:34:31 <grondo> like head = 0; log [head++] = string; if (head > max) head = 0; May 17 15:34:58 <grondo> if I do that I get an error that the array has exceeded max entries or something May 17 15:35:07 <neb> an array index % array_size May 17 15:38:49 <grondo> yeah, basically May 17 15:38:51 <grondo> that doesn't work May 17 15:44:27 <grondo> (for some reason I always forget about mod) May 17 15:44:35 <grondo> anyhow it should be the same May 18 08:25:16 * Disconnected (Connection timed out). May 18 08:53:59 <neb> 940000 May 18 10:15:52 <grondo> nice May 18 10:26:51 <grondo> if you have any experts on stap, can you ask the best way to use stap to create a circular log with minimal impact? May 18 10:27:38 <grondo> I'm having trouble with exceeding MAXMAPENTRIES, even though I use essentially array [index % 100] May 18 10:27:50 <grondo> (i.e. limit to 100 entries) May 18 10:28:02 <grondo> I also tried doing per-cpu logs with something like May 18 10:28:19 <grondo> array [cpu(), index % 100] May 18 10:28:24 <grondo> without much success May 18 10:28:45 <neb> can you send me your whole script. May 18 10:29:12 <grondo> I'd like to basically keep a log of the last N things that happened, then dump the log on a trigger May 18 10:30:01 <grondo> hmm, original script isn't really around anymore because of all the experimentation May 18 10:33:26 <grondo> let me rewrite it how I think it should work May 18 10:33:26 <grondo> the most naive version May 18 13:19:00 <neb> back working on your stap script May 18 13:24:46 <neb> try not using the increment and decrement operators. May 18 13:25:07 <neb> https://bugzilla.redhat.com/show_bug.cgi?id=591407 May 18 13:49:40 <neb> try s / global array[100] / global array / May 18 13:50:31 <neb> this is atlas right? 
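The circular-log idea grondo describes (array[index % N], dump on a trigger) is language-independent; here it is sketched in bash rather than SystemTap just to show the indexing, with N and the event names invented for the demo:

```shell
#!/bin/bash
# Ring buffer of the last N entries, same indexing grondo describes
# for the stap script: write to array[index % N], dump in order on
# a trigger. N and the ev* names are demo values.
N=5
i=0
declare -a ring
for msg in ev1 ev2 ev3 ev4 ev5 ev6 ev7 ev8; do
    ring[i % N]=$msg        # overwrite the oldest slot
    i=$((i + 1))
done
# on the "trigger", dump the last N entries in chronological order
start=$(( i > N ? i - N : 0 ))
dump=""
for ((j = start; j < i; j++)); do
    dump="$dump ${ring[j % N]}"
done
echo "last $N events:$dump"
```

The key detail, and the thing the MAXMAPENTRIES errors suggest was going wrong in the old stap version, is that the write index must be reduced modulo N before it is used as the array key, so only N distinct keys ever exist.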
May 18 13:51:26 <neb> ooh 017007 May 18 13:52:25 <neb> Also you need to update to the newer version of stap. May 18 13:53:03 <neb> systemtap-1.1-3.el5 May 18 14:06:41 <neb> grondo tossed the new version of systemtap into worpi:/tmp May 18 16:33:58 <neb> did you get the new version of systemtap? May 18 16:34:16 <neb> I changed location too thinking that was the problem but it wasn't May 18 16:35:07 <grondo> yes, not sure if it made it into the image that Trent is rebooting onto hype now May 18 16:35:24 <grondo> That version is the RHEL5.5 version, so hopefully no dependency gotchas May 18 16:35:30 <grondo> What exactly was the fix in there? May 18 16:35:41 <grondo> I think I'm going to order a new bike soon May 18 16:36:04 <neb> the stap developers didn't say but they did say that the error message didn't look right for the script. May 18 16:36:22 <neb> and because of the text of the error message they could see that you were running an old version of stap May 18 16:36:49 <neb> they think that there were array indexing issues in that version on the edge of the arrays. May 18 16:37:30 <neb> So they thought that you might be hitting array[99] and it was telling you that it was an overflow when it wasn't, because of a bug in stap May 18 16:37:42 <neb> they said that it was best to not declare the size of arrays. May 18 16:38:27 <grondo> ok, I got that error even if I didn't declare the size of the array May 18 16:38:40 <grondo> will try the new stap once hype is back up May 18 16:39:01 <grondo> in theory what I'm doing should work though?
May 18 16:39:33 <grondo> the other question I have is how to probe multiple functions with the same "probe" line May 18 16:39:50 <grondo> the "*@fs/dcache.c" gets too much May 18 16:39:59 <grondo> it even probes stuff _not_ in dcache.c strangely May 18 16:40:36 <grondo> oh, interesting May 18 16:40:42 <grondo> I can't recreate that problem now May 18 16:41:20 <grondo> ah, Trent _did_ get the new systemtap May 18 16:42:53 <grondo> cool, seems to work better now (so far so good) May 18 16:43:00 <grondo> too bad I can't use it on the production systems May 18 16:43:02 <grondo> :-( May 18 16:43:57 <grondo> cool that the script works now, tho May 18 16:44:09 <neb> OK so it appears to be the stap script. May 18 16:44:38 <neb> Can't you just install the new stap packages into the writable May 18 16:44:42 <neb> err I meant: May 18 16:44:52 <neb> OK so it appears to be the stap binary not the script May 18 16:44:57 <neb> that makes more sense. May 18 16:45:27 <neb> keep in mind with stap that it is a new and evolving tool. It may not always be your skill; it can also be the binary. May 18 18:00:28 <neb> so does it look like your stap script will work? May 18 18:00:50 <neb> other than the multiprobe thing, are there any additional things you need me to track down for you regarding that? May 18 18:01:08 <grondo> not at the moment May 20 11:12:53 <grondo> On the dcache problem --- I implemented checks for dentry_stat.nr_unused < 0 in the SLURM prolog and epilog May 20 11:13:00 <grondo> to catch the problem before the next job May 20 11:13:17 <grondo> A couple nodes triggered last night May 20 11:13:44 <grondo> however, one of them didn't have the signature negative value in /proc/sys/fs/dentry-state May 20 11:13:48 <neb> They seem to implicitly see it as a runtime environment and the notion that it is used as a production development environment for some larger customers seems to have escaped them.
May 20 11:13:51 <grondo> # atlas449 /root > cat /proc/sys/fs/dentry-state May 20 11:13:51 <grondo> 12040 4530 45 0 0 0 May 20 11:14:40 <grondo> however, right when I started running drop_caches, it went negative! May 20 11:14:44 <grondo> grondo@atlas449 ~ > cat /proc/sys/fs/dentry-state May 20 11:14:44 <grondo> 2841 -3402 45 0 0 0 May 20 11:15:23 <grondo> I'm not sure what to make of that May 20 11:16:10 <neb> so how did the slurm check catch it if dentry_stat.nr_unused was still positive? May 20 11:16:25 <grondo> yes, that is the first question I had May 20 11:16:26 <neb> re: dev env yeah May 20 11:16:37 <grondo> it must have been negative at the time of the check May 20 11:16:40 <neb> could it be intermittent? May 20 11:16:41 <neb> yeah May 20 11:16:50 <grondo> then went back positive? May 20 11:16:59 <grondo> I don't really understand how that could May 20 11:17:00 <grondo> be May 20 11:17:32 <grondo> I really wish I could install the new systemtap on atlas so I could run my script that captures the history of the dentry_stat values May 20 11:18:06 <grondo> I really have no idea what is going on here May 20 11:18:19 <grondo> I think my only recourse right now is to try to reproduce on hype May 20 11:22:53 <neb> have you come up with any clues on how to reproduce the problem? May 20 11:23:08 <neb> have you tried removing that one patch which you thought might be implicated?
May 20 11:25:13 <grondo> no, and no, because I can't recreate it May 20 16:54:06 <grondo> stuck on a systemtap thing again May 20 16:54:26 <grondo> I want to resolve the address of a list to pass it to a function May 20 16:54:38 <grondo> I'm basically doing l = $dentry_unused May 20 16:54:42 <grondo> but I'm getting: May 20 16:54:52 <grondo> semantic error: struct/union 'list_head' is being accessed instead of a member of the struct/union: identifier '$dentry_unused' at /g/g0/grondo/proj/stap/dcache-count.stp:19:11 May 20 16:57:22 * neb thinkin May 20 16:57:26 <neb> err thinking May 20 16:57:38 <grondo> if instead do $dentry_unused->next May 20 16:57:46 <grondo> then it gets further, but the error is confusing May 20 16:57:56 <grondo> Running make -C "/lib/modules/2.6.18-90chaos/build" M="/var/tmp/root/stapAHG5Ew" modules >/dev/null May 20 16:57:59 <grondo> /var/tmp/root/stapAHG5Ew/stap_23834.c: In function 'function_dentry_unused_count': May 20 16:58:01 <grondo> /var/tmp/root/stapAHG5Ew/stap_23834.c:336: error: 'struct list_head' has no member named 'd_list' May 20 16:58:04 <grondo> /var/tmp/root/stapAHG5Ew/stap_23834.c:345: error: 'struct list_head' has no member named '__retvalue' May 20 16:59:46 <neb> looks like some type polymorphism problems. May 20 17:00:52 <grondo> hm, the generated code is not as expected May 20 18:15:51 <grondo> aw, shit May 20 18:16:17 <grondo> so I have a stap script to compare the number of dentries in the dentry_unused list to dentry_stat.nr_unused May 20 18:16:24 <grondo> and prints an error if they don't match May 20 18:16:37 <grondo> many nodes on atlas don't match May 20 18:16:54 <grondo> eatlas16: 3184 members of dentry_unused list. nr_dentry=6865 nr_unused=3182 May 20 18:16:55 <grondo> eatlas7: 3233 members of dentry_unused list. nr_dentry=6942 nr_unused=3232 May 20 18:16:55 <grondo> eatlas10: 3138 members of dentry_unused list. nr_dentry=6823 nr_unused=3137 May 20 18:16:55 <grondo> eatlas8: 3250 members of dentry_unused list. 
nr_dentry=6941 nr_unused=3246 May 20 18:16:57 <grondo> eatlas28: 3176 members of dentry_unused list. nr_dentry=6824 nr_unused=3175 May 20 18:16:59 <grondo> eatlas2: 3195 members of dentry_unused list. nr_dentry=6887 nr_unused=3194 May 20 18:17:02 <grondo> is just a sample May 20 18:17:29 <grondo> I can "fix" it with the script as well, but the problem is worse than I thought May 20 18:17:37 <grondo> well, have to run May 20 18:18:03 <neb> Hmm... so it might be that the problem lies dormant for a while but then goes critical May 20 18:18:32 <grondo> yes, those nodes all have the chance for nr_unused < 0 I believe May 20 18:19:05 <neb> that is good, it seems to push the problem back. May 20 18:19:15 <grondo> unfortunately to count the dentry_unused list I have to walk the entire thing May 20 18:19:27 <neb> maybe the reason you couldn't find the cause is that the cause is long gone before the problem shows up. May 20 18:19:34 <grondo> yes, could be May 20 18:19:58 <neb> do you have any 4.2 systems that you can compare against? May 20 18:21:57 <neb> did the stap thing work? May 21 10:19:49 <grondo> so what I'm thinking of doing with the stap script I have is to run it in the epilog after every job and drain the node if the size of dentry_unused != dentry_stat.nr_unused. That =should= mean that the event which caused the problem happened during that job May 21 10:20:34 <neb> I would agree. May 21 10:21:15 <grondo> egad May 21 10:21:28 <grondo> another bug seems to have hit us on a *bunch* of nodes May 21 10:21:47 <neb> with the proviso that it isn't necessarily the cause but is strongly correlated to the problem occurring. May 21 10:23:25 <neb> you know me, I really try very hard to not mistake coincidence & correlation with causality This event sent from IssueTracker by woodard [LLNL (HPC)] issue 937133
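The epilog check grondo describes (walk the dentry_unused list and compare its actual length to dentry_stat.nr_unused) boils down to detecting drift between a list and a counter that is maintained separately. A toy Python model of that check (not kernel code; the names just mirror the kernel's):

```python
# Toy model of the epilog check: an LRU list plus a separate counter
# that is supposed to track its length.  The check walks the list and
# flags any drift between the two, as the stap script did for
# dentry_unused vs. dentry_stat.nr_unused.
class Cache:
    def __init__(self):
        self.unused = []     # stands in for the dentry_unused list
        self.nr_unused = 0   # stands in for dentry_stat.nr_unused

    def add_unused(self, d):
        self.unused.append(d)
        self.nr_unused += 1

    def remove_unused(self, d, buggy=False):
        # A buggy removal path decrements the counter without actually
        # taking the entry off the list, which is exactly the kind of
        # mismatch the script was hunting for.
        if not buggy:
            self.unused.remove(d)
        self.nr_unused -= 1

    def check(self):
        # Returns (actual_length, claimed_length); drift means a bug.
        return len(self.unused), self.nr_unused

c = Cache()
for d in ("a", "b", "c"):
    c.add_unused(d)
c.remove_unused("a", buggy=True)   # counter drops, entry stays
actual, claimed = c.check()
print("members=%d nr_unused=%d" % (actual, claimed))  # 3 vs 2, like the atlas nodes
```

In the kernel the walk requires traversing the whole dentry_unused list under dcache_lock, which is why running the check only in the epilog keeps the cost off the job itself.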
Event posted on 2010-05-26 15:13 PDT by woodard May 25 10:22:19 <grondo> hmm, it might be that the jobs causing the errors in dentry_stat.nr_unused are the same ones that caused our dcache softlockups in nfs May 25 10:22:35 <grondo> I hope I still have that reproducer May 25 10:26:58 <neb> interesting May 25 10:27:28 <neb> are these the ones that search a bunch of places for a table of strings? May 25 10:27:38 <grondo> the reason I think this is that the job that hit the early detector had been canceled by timeout, and that was the case that reproduced the old bug May 25 10:27:39 <neb> do you think it is NFS or might it be lustre? May 25 10:27:58 <grondo> I believe it is NFS May 25 10:28:09 <grondo> but I don't think that is the bug May 25 10:28:25 <grondo> This is totally guesswork right now May 25 10:28:32 <neb> did we actually have a repro for that bug? May 25 10:28:59 <grondo> a patch went into 5.4 to fix the problem we saw before, where a special case of negative dentry validation caused a corner case to hit in the dcache and resulted in softlockups May 25 10:29:10 <neb> right I remember that one.
May 25 10:29:26 <grondo> the problem occurred fairly rarely in production, but we could reproduce it by populating the dcache and running a certain user job then canceling it May 25 10:29:45 <grondo> the problem seemed to be triggered by all 8 tasks of the job getting sigterm at the same time May 25 10:30:04 <grondo> a signal handler would search the msg catalog path for non-existent files, triggering the revalidation May 25 10:30:27 <grondo> it also seemed like we were *much* more likely to reproduce when we had the job container module May 25 10:30:46 <grondo> in fact, I don't think any other RH customers saw this problem May 25 10:31:21 <neb> https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=178773 May 25 10:31:34 <grondo> so now there is a fix for some similar kind of softlockup in the dcache, which moves some stuff to a private list May 25 10:31:51 <grondo> my guess is that in that codepath, when we hit the same rare case we hit before, the accounting for nr_unused isn't quite right May 25 10:32:14 <grondo> https://chaos.llnl.gov/bugzilla/show_bug.cgi?id=285 May 25 10:32:14 <grondo> is our local case May 25 10:32:22 <grondo> sadly I didn't document how to run the reproducer May 25 10:32:31 <grondo> I vaguely remember, but I have to find the user code May 25 10:32:54 <grondo> it also seemed like any code compiled by the intel compiler was vulnerable May 25 10:33:10 * neb still can't get to chaos.llnl.gov due to silly firewall rules and crazy network people. May 25 10:33:24 <grondo> oh geez May 25 10:33:42 <grondo> ok, found the code May 25 10:34:22 <neb> is it something kent can run along with your stap script so that we can see if it happens on stock rhel kernels? May 25 10:34:54 <grondo> I don't think so May 25 10:34:54 <grondo> for some reason I could never make a standalone reproducer May 25 10:35:57 <grondo> the theory was that the code searched for the message catalogs once, thus creating the negative dcache entries.
some time later, after the entries have expired, it gets a signal, looks for them again and triggers revalidation May 25 10:36:24 <grondo> some kind of error comes back from the server, or maybe not, but this triggers shrink_dcache_parent with some kind of huge number May 25 10:36:24 <grondo> that is what I remember May 25 10:36:24 <grondo> but I tried to create a reproducer based on that and failed May 25 10:37:29 <grondo> it also seemed important May 25 10:37:29 <grondo> to have the job run in the sgi job container module May 25 10:37:56 <neb> so now you aren't running with the sgi container modules? May 25 10:38:08 <grondo> we are May 25 10:38:24 <grondo> I'm just saying that I couldn't make my own reproducer before May 25 10:38:38 <grondo> even though it should have been easy based on the theory of how the bug was triggered May 25 10:40:30 <grondo> However, the necessity of the job container module explains why nobody else seems to see the problem May 25 10:41:24 <grondo> dammit why didn't I document how to run the reproducer. stupid! May 25 10:41:33 <neb> why do you think that the job module matters? May 25 10:42:24 <grondo> we supposed before that it was because the job container allows all tasks of the job to be signaled virtually at the same time May 25 10:42:24 <grondo> slurm doesn't signal the individual tasks, it signals the job container as a whole May 25 10:43:30 <neb> Oh yeah. I remember coming to that now. May 25 10:43:47 <neb> I did a lot of playing with signals to try to make that work when I was trying to come up with a repro May 25 10:43:54 <grondo> yes May 25 10:44:12 <grondo> we just couldn't make it work May 25 10:44:21 <grondo> hold on, I'm going to search email for the steps to reproduce May 25 13:14:22 <neb> for an issue like this one, it is weird, I keep on saying "we" but it feels wrong because you did most of the work, yet I feel like we have been consulting and brainstorming all along.
May 25 13:16:36 <grondo> yes, we have May 25 13:17:37 <grondo> doesn't bother me if you say we May 25 13:23:03 <neb> is it even possible for you to test to see if the problem occurs without the job module? May 25 13:27:46 <neb> Some questions I have, I wonder if you could reproduce it on a UP box or if it is a race between two procs manipulating the dcache list May 25 13:28:04 <neb> is this the basic idea: May 25 13:28:19 <neb> 1) make a lot of negative dcache entries May 25 13:28:23 <neb> 2) wait 15 min May 25 13:28:49 <neb> 3) exercise those negative dcache entries. May 25 13:29:56 <grondo> all the dcache lists should be protected by dcache_lock May 25 13:30:05 <grondo> I think you just have to make a lot of unused dentries May 25 13:30:08 <grondo> e.g. find / May 25 13:30:18 <grondo> although that doesn't even seem to be all that necessary May 25 13:30:27 <neb> my idea was May 25 13:30:29 <grondo> then make a _few_ negative dentries May 25 13:30:35 <grondo> wait for them to expire May 25 13:30:35 <neb> only a few May 25 13:30:43 <grondo> then hit them again from all cpus May 25 13:30:50 <grondo> that is my guess as to how it works May 25 13:30:56 <grondo> but I swear we tried that before May 25 13:31:07 <neb> I was going to try: May 25 13:33:55 <neb> for d in `find /nfsmnt/ -type d`; do find /usr/src/linux -type f | while read line; do name=`basename $line`; if [ -e $name ]; then :; fi; done; done; sleep 900 May 25 13:34:14 <neb> then for all the CPUs do the same thing. May 25 13:34:26 <grondo> try it May 25 13:34:31 <neb> working on it now. This event sent from IssueTracker by woodard [LLNL (HPC)] issue 937133
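The three-step trigger sequence sketched above can also be written out as a rough script. This is only a model of the suspected sequence: the directory, file names, and shortened wait are all stand-ins (the real case used an NFSv3 mount, a roughly 15-minute dentry timeout, and the simultaneous lookups came from all tasks of a job catching SIGTERM at once):

```python
# Rough model of the suspected trigger, with stand-in paths and timing.
import os
import time
from concurrent.futures import ThreadPoolExecutor

NFS_DIR = "/tmp"   # stand-in; the real case used an NFSv3 mount
NAMES = ["no-such-catalog-%d.cat" % i for i in range(50)]

def lookup_all(_arg=None):
    # Each failed lookup leaves a negative dentry behind in the dcache.
    found = 0
    for name in NAMES:
        try:
            os.stat(os.path.join(NFS_DIR, name))
            found += 1
        except FileNotFoundError:
            pass
    return found

lookup_all()            # step 1: populate negative dentries
time.sleep(1)           # step 2: the real wait was ~15 minutes for expiry
with ThreadPoolExecutor(max_workers=4) as ex:
    # step 3: revalidate from several CPUs at once, standing in for all
    # tasks of a job hitting the NLSPATH from their signal handlers
    results = list(ex.map(lookup_all, range(4)))
print(results)   # every lookup still fails; the point is the revalidation
```

Whether this captures the trigger completely is unclear; as the discussion notes, earlier attempts at a standalone reproducer along these lines failed, and the job container module seemed to matter.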
Event posted on 2010-05-27 04:43 PDT by woodard Problem history:

This problem initially manifested as a hang during our job prolog script, run before each job to prepare the node for work. This script does the equivalent of echo 2 > /proc/sys/vm/drop_caches so that nodes handed over to customers start out with a similar memory profile. On some nodes we would see the script hang forever at this point, chewing up CPU time. After some investigation, it was found that dentry_stat.nr_unused had somehow gone negative, and this was causing an endless cycle under shrink_dcache_memory. How this value got to be negative was a mystery, however.

We installed some hooks into our job prolog and epilog to track dentry_stat values, and to notify us whenever nr_unused < 0. This was happening at a rate of a few nodes per day on a 1152-node machine, but there did not appear to be any commonality between the jobs.

The next thing we tried was a systemtap script that compared the actual length of the dentry_unused list to the value of nr_unused in dentry_stat, on the theory that perhaps the count was off well before it actually went negative (and thus the affected job wasn't actually the cause). We ran this stap script after every job, and we found that the jobs for which the count was off all had these things in common:

* Were canceled by a signal (timed out or canceled), thus all tasks in the job were killed by a signal
* Used the Intel compiler

This bore a strong resemblance to *another* dcache bug we had hit. I tried the reproducer from that other bug (178773) and found that I could reproduce the problem (length(dentry_unused) != dentry_stat.nr_unused) at will.
The key features of the reproducer, I believe, are the following:

- The reproducer, or perhaps some previous job, adds some negative dentries to the cache when looking for the Intel message catalogs along the default NLSPATH
- While the job is running, those negative dentries expire
- When the job is signaled due to timeout or cancellation, *all* tasks get SIGTERM at virtually the same moment. This triggers a handler installed by the Intel compiler which searches the NLSPATH again, hitting the negative dentries and requiring some kind of revalidation. (We are running on top of NFSv3)

This seems to be the trigger for the bug, although I earlier failed to make a reproducer for 178773 based on this information, so it might not be complete. However, a proven workaround for this problem is to prepend the path to the fortran message catalogs to the NLSPATH.

I have also shown that by removing the RHEL patch linux-2.6-fs-private-dentry-list-to-avoid-dcache_lock-contention.patch from the kernel, I can no longer reproduce this bug. However, that patch is thought to be the fix for 178773, so I'd like to keep it if at all possible.

Version changed from '5.4' to '4.6' This event sent from IssueTracker by woodard issue 937133
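The prolog/epilog hook described earlier (notify whenever nr_unused < 0) amounts to a small check on /proc/sys/fs/dentry-state. A sketch in Python; the field order assumed here (nr_dentry, nr_unused, age_limit, want_pages, then two unused fields) matches fs/dcache.c of that era, and the two sample lines are the ones captured on atlas449 in the chat above:

```python
# Sketch of the early-detection check: read /proc/sys/fs/dentry-state
# and flag the node (e.g. drain it in SLURM) if nr_unused went negative.

def parse_dentry_state(line):
    # Assumed field order: nr_dentry, nr_unused, age_limit, want_pages,
    # followed by two unused fields.
    fields = [int(x) for x in line.split()]
    return {"nr_dentry": fields[0], "nr_unused": fields[1],
            "age_limit": fields[2], "want_pages": fields[3]}

def node_is_sick(line):
    # The signature of this bug: the unused-dentry counter went negative,
    # which later makes shrink_dcache_memory spin forever.
    return parse_dentry_state(line)["nr_unused"] < 0

# The healthy and sick samples captured on atlas449:
print(node_is_sick("12040 4530 45 0 0 0"))   # False
print(node_is_sick("2841 -3402 45 0 0 0"))   # True
```

On a live node the line would come from open("/proc/sys/fs/dentry-state").read() in the prolog or epilog; as the chat shows, the counter can also be positive at check time and only go negative once drop_caches starts pruning, so the list-length comparison is the stronger test.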
Created attachment 417328 [details] Possible fix

OK, so based on what you guys are telling me, you get hung up in prune_dcache, and this comes from shrink_dcache_memory, from shrink_slab. What happens is shrink_slab does this

    nr = shrink_dcache_memory(0, mask);

where nr ends up being some negative number, because we use nr_unused to calculate it. Then we call back with shrink_dcache_memory(nr, mask), and in prune_dcache we do

    for (; nr; nr--) {
        /* shrink dcache */
    }

which will go on _forever_ if nr is negative.

So what is happening is we have a bunch of places where we do

    if (!list_empty(&dentry->d_lru)) {
        nr_unused--;
        list_del_init(&dentry->d_lru);
    }

but if the dentry is on our own private dispose_list, we have already taken care of the nr_unused counter, so it gets decremented again. So instead of leaving nr_unused alone when we add to and remove from the dispose_list, treat dispose_list just like dentry_unused: do nr_unused++ when we add to the dispose_list and nr_unused-- when we take away from it. This will keep everything happy and hopefully stop your hangs.

Thanks, Josef
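Josef's two observations, the double decrement via the private dispose_list and prune_dcache's loop never terminating once nr is negative, can be modeled in a few lines of Python (a toy simulation, not the kernel code; the loop is capped so the "forever" case is observable):

```python
# Toy model of the analysis above.  Two things interact: (1) entries
# moved to a private dispose_list have nr_unused decremented twice,
# driving it negative; (2) a loop like "for (; nr; nr--)" only stops
# when nr reaches exactly zero, so a negative nr never terminates.

nr_unused = 0
dentry_unused = []
dispose_list = []

def add_unused(d):
    global nr_unused
    dentry_unused.append(d)
    nr_unused += 1

def move_to_dispose(d):
    # Buggy path: the counter is decremented once here...
    global nr_unused
    dentry_unused.remove(d)
    dispose_list.append(d)
    nr_unused -= 1

def free_disposed():
    # ...and again when the dispose_list is drained, because the generic
    # "remove from LRU" check can't tell the two lists apart.
    global nr_unused
    while dispose_list:
        dispose_list.pop()
        nr_unused -= 1

for d in range(5):
    add_unused(d)
for d in range(5):
    move_to_dispose(d)
free_disposed()
print("nr_unused =", nr_unused)   # -5: each entry was counted down twice

def prune(nr, max_iters=10):
    # Model of "for (; nr; nr--)": with nr < 0 the condition stays
    # truthy, so without the cap this would spin forever.
    iters = 0
    while nr and iters < max_iters:
        nr -= 1
        iters += 1
    return iters

print(prune(nr_unused))   # hits the cap: the count never reaches zero
```

The proposed fix makes both decrements legitimate by doing nr_unused++ on the way onto the dispose_list, so the bookkeeping stays balanced.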
I'm still waiting on confirmation that the patch fixes the problem. As soon as I have confirmation that it makes the problem go away I will post it.
*** Bug 625749 has been marked as a duplicate of this bug. ***
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Posted 8/23.
in kernel-2.6.18-214.el5 You can download this test kernel from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html