Bug 596548
| Summary: | dcache unused accounting problem |
|---|---|
| Product: | Red Hat Enterprise Linux 5 |
| Component: | kernel |
| Version: | 5.4 |
| Hardware: | All |
| OS: | Linux |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| Reporter: | Issue Tracker <tao> |
| Assignee: | Josef Bacik <jbacik> |
| QA Contact: | Igor Zhang <yugzhang> |
| CC: | bturner, cward, cww, dhoward, mmahut, rmitchel, sprabhu, tao, yugzhang |
| Keywords: | Regression, ZStream |
| Target Milestone: | rc |
| Target Release: | --- |
| Doc Type: | Bug Fix |
| Last Closed: | 2011-01-13 21:34:05 UTC |
| Bug Blocks: | 658857 |
Description
Issue Tracker
2010-05-26 22:19:39 UTC
Event posted on 2010-05-25 11:34 PDT by woodard

Description of problem:
The problem manifests itself when nr_unused ends up being negative. This causes an oops when the dcache is reclaimed. This problem has been really hard to track down. First we were able to find that the problem was evident when we dropped the caches between HPC jobs. The customer wrote the attached stap script which allowed us to verify when the problem occurs. We were finally able to catch the kind of workload that causes the original problem and that led us to a reproducer. The reproducer seems to be the same one that was causing 178773. This problem seems to be with RHEL, but we haven't eliminated the possibility that it is some sort of local patch.

How reproducible:
Seemingly always.

Steps to Reproduce:
Very difficult outside of LLNL's environment. We are working on that now. We think: populate the dcache with a lot of negative entries that exist on an NFS server. Wait about 15 minutes for these entries to time out. Then revalidate them. If the end

This event sent from IssueTracker by woodard [LLNL (HPC)] issue 937133

Event posted on 2010-05-25 11:34 PDT by woodard
Issue Registered (Severity: 3)
This event sent from IssueTracker by woodard [LLNL (HPC)] issue 937133

Event posted on 2010-05-25 11:34 PDT by woodard
File uploaded: dcache.stp
This event sent from IssueTracker by woodard [LLNL (HPC)] issue 937133 it_file 702543

Event posted on 2010-05-26 14:42 PDT by woodard
<grondo> unable to recreate dcache issue (so far) with the RHEL5.4 patch
linux-2.6-fs-private-dentry-list-to-avoid-dcache_lock-contention.patch
<grondo> removed
<neb> just to confirm that I understand what you said.
<neb> You removed:
linux-2.6-fs-private-dentry-list-to-avoid-dcache_lock-contention.patch and
so far you haven't seen the problem reappear.
<grondo> yes, sorry
<grondo> I expected the old problem to reappear
<grondo> but if I remember right that was a difficult one to reproduce
<neb> weird, I don't know Lachlan McIlroy
<neb> I do know jeff layton and josef
<neb> let me go and talk to them
<grondo> for someone familiar with the code, and given a description of
the problem, I have a feeling they could identify the fix needed in the
patched code pretty quickly
<grondo> it seems to be a case where dentry_stat.nr_unused is decremented
but the dentry doesn't actually get removed from the list
<grondo> and it seems to be somewhere in that patch
<grondo> though a similar bug did exist before that patch, which would
cause softlockups with the same reproducer
<neb> I sort of feel like I can almost but not quite see the problem
having looked at the patch
<neb> it is that tip of the tongue feeling.
<grondo> I felt similar, but I think I would have to spend half a day at
least walking through the code to fully understand it first
* neb chatting with josef
<neb> <josef> hmm well shit
<grondo> :-o
<neb> hey do you have one of the oopses that occurs anywhere handy
<neb> I know that we've been trying to get far away from that
<neb> it has been happening on atlas right?
<grondo> an oops?
<grondo> what oops exactly?
<neb> <josef> neb: i can give you a patch that makes the nr_unused stuff
work right
<neb> <josef> but who knows if thats the actual problem
<neb> <josef> can you test the patch and see if it fixes the problem?
<neb> <neb> sure
<grondo> Send it over and I will test
<neb> The oops happens when nr_unused and the length of the unused list get way
out of sync and the thing finally blows up.
<grondo> doesn't oops
<grondo> gets stuck indefinitely in prune_dcache or thereabouts
<grondo> since there are tests in the code that kind of assume nr_unused
>= 0
<grondo> ;-)
<grondo> there aren't any actual checks
<neb> just forwarded you Josef's patch.
<neb> Can you try it out?
<grondo> yep
<grondo> shouldn't take me too long
<neb> rough expectation on time.
<neb> couple of hours?
<grondo> or less yeah
<grondo> that's amazing he saw a fix so quick
<neb> <josef> neb: actually wait
<neb> <josef> that patch is crap
<grondo> heh
<neb> just came now
<neb> rush job -- bad job
<neb> he needs more time.
<grondo> no prob
<grondo> no rush
This event sent from IssueTracker by woodard [LLNL (HPC)]
issue 937133
Event posted on 2010-05-26 14:43 PDT by woodard
<neb> josef, jlayton you guys acked a patch linux-2.6-fs-private-dentry-list-to-avoid-dcache_lock-contention.patch
<neb> for BZ526612 and with that patch applied we are seeing problems with the dcache on large cpu count boxes
<neb> when we remove that patch we can't reproduce the problem anymore.
<neb> I worry that there might be a subtle race in it.
<neb> after about 1 month of work and a lot of systemtap scripts we found that at the root of the problem the number of items on the dentry_unused list doesn't match nr_unused.
<neb> The numbers slowly drift out of agreement and eventually you end up with an oops and one of the odd things before the oops is that nr_unused is negative.
<bob> I gotta go take my dog to the vet.
* bob has quit (Quit: Send a sign, raise the sail, wave a last goodbye, destiny is calling, immortality be mine.)
* bmarson is now known as bmarson_gone
<josef> neb: ok you still there?
<neb> yes
<josef> neb: in select_parent
<josef> hmm well shit
<neb> Unfortunately it is hard to reproduce but it is taking out 50-100 machines a day here at LLNL.
<josef> neb: whats the oops?
<neb> it happens hours after the problem first appears
<neb> but hold on I can get it for you.
<neb> it seems to lose a few at a time and then sometime when it is 3-9000 off things go poof.
* neb looking for one of the oopses
* vaurora (~val.prod.ext.phx2.redhat.com) has joined #fs
<josef> neb: i can give you a patch that makes the nr_unused stuff work right
<josef> but who knows if thats the actual problem
<josef> can you test the patch and see if it fixes the problem?
<neb> sure
* hobbit is now known as hobbit_gone
<neb> josef, BTW we were also bit by BZ526612 but it didn't hit as frequently 2-3x/week
<josef> neb: whats your email
* kzak_away has quit (Ping timeout: 604 seconds)
<neb> woodard
<neb> oh yeah that's why I can't find the oops grepping through the logs.
<neb> It doesn't oops; the machine basically locks up, stuck indefinitely in prune_dcache or thereabouts
<neb> this is due to all the checks in the code that kind of assume nr_unused >= 0
<josef> neb: actually wait
<josef> that patch is crap
<neb> OK won't test that one.
<josef> neb: i've got to run and do birthing classes
<josef> neb: if you can send me all the debugging you have i'll take a look tomorrow
<josef> and give you something a little better thought out :)
<neb> We've been working on this for >1 month and we thought it was a local problem.
<neb> All the debug info is WAY too much but I'll make sure that I get you a good BZ
<neb> OK?
This event sent from IssueTracker by woodard [LLNL (HPC)]
issue 937133
Event posted on 2010-05-26 14:56 PDT by woodard
May 05 11:04:38 <grondo> argggh. More hits of that dcache_lock bug
May 05 11:04:54 <grondo> really have to focus on that one and get to the bottom of it now
May 05 11:08:20 <neb> any more information this time?
May 05 11:08:45 <neb> I think we kind of abandoned it before due to lack of info while we figured that it might be a lustre bug.
May 05 11:09:18 <grondo> No more info yet. Have yet to look into it today
May 05 11:09:45 <neb> still thinking it is a lustre bug?
May 05 11:10:15 <grondo> I guess not. Had Chris Morrone look into it and I guess he didn't find anything lustre related.
May 05 11:10:56 <neb> Can you remind me of the facts again?
May 05 11:12:03 <grondo> So far all I know is that a process gets hung when trying to do 'echo 2 > /proc/sys/vm/drop_caches'
May 05 11:12:20 <grondo> it is hung trying to lock the dcache_lock
May 05 11:13:19 <grondo> no other process seems to have the dcache_lock afaics
May 05 11:14:24 <neb> https://bugzilla.redhat.com/show_bug.cgi?id=526612
May 05 11:14:31 <neb> I think we have that fix already
May 05 11:15:03 <neb> https://bugzilla.redhat.com/show_bug.cgi?id=554574
May 05 11:15:08 <neb> this one is new
May 05 11:52:39 <grondo> strangely we are not getting softlockups with this problem
May 05 11:52:47 <grondo> so nothing is hogging the dcache_lock
May 05 11:53:08 <neb> is it lock juggling?
May 05 11:53:21 <grondo> I don't think so
May 05 11:54:07 <grondo> there is just one process spinning on the dcache_lock
May 05 11:54:12 <grondo> it is strange
May 05 11:54:14 <grondo> I don't really know where to start looking
May 05 11:54:40 <neb> the problem is triggered when you try to drop caches.
May 05 11:55:06 <neb> how often does it happen? is it intermittent or fairly frequent?
May 05 11:55:08 <grondo> [<ffffffff800654e6>] __sched_text_start+0xf6/0xbe3
May 05 11:55:08 <grondo> [<ffffffff8002f607>] prune_dcache+0x37/0x138
May 05 11:55:08 <grondo> [<ffffffff800f4a7f>] shrink_dcache_memory+0x19/0x32
May 05 11:55:08 <grondo> [<ffffffff80040a15>] shrink_slab+0xe0/0x15c
May 05 11:55:08 <grondo> [<ffffffff800fe952>] drop_slab+0x36/0x5f
May 05 11:55:11 <grondo> [<ffffffff800fee30>] drop_caches_sysctl_handler+0x162/0x1ae
May 05 11:55:13 <grondo> [<ffffffff8009c5e4>] do_rw_proc+0xcb/0x120
May 05 11:55:16 <grondo> [<ffffffff8009c653>] proc_writesys+0x1a/0x1c
May 05 11:55:19 <grondo> [<ffffffff80017012>] vfs_write+0xcf/0x175
May 05 11:55:21 <grondo> [<ffffffff80017902>] sys_write+0x47/0x70
May 05 11:55:23 <grondo> [<ffffffff80060116>] system_call+0x7e/0x83
May 05 11:55:26 <grondo> seems to be very rare
May 05 11:55:29 <grondo> that is the backtrace of the process trying to drop caches
May 05 11:55:34 <grondo> (calling drop_slab)
May 05 11:59:37 <grondo> hm, maybe I should actually look at the code in prune_dcache and see what could be going on
May 05 11:59:53 <grondo> I just assumed it is the dcache_lock, but I can't remember why now
May 05 12:02:37 <grondo> the process is listed as running:
May 05 12:02:42 <grondo> prolog R running task 0 17916 17915 (NOTLB)
May 05 12:02:47 <grondo> Call Trace: [<ffffffff80066031>] thread_return+0x5e/0xf6 [<ffffffff8002f5f6>] prune_dcache+0x26/0x138
May 05 12:03:20 <neb> what's at 138?
May 05 12:03:33 <neb> thats pretty early in the function
May 05 12:05:00 <grondo> spin_lock(&dcache_lock);
May 05 12:05:05 <grondo> guess it is the lock
May 05 12:11:57 <neb> is there any commonality in the jobs that happen to be running on the nodes. Maybe it is some unusual usage pattern such as zillions of small files.
May 05 12:12:09 <neb> thus the dcache is much bigger than for other jobs.
May 05 12:12:54 <neb> or maybe it looks for a lot of files that don't exist and there are a ton of negative dcache entries.
May 05 12:13:22 <grondo> this happens before the job starts, in the job prolog
May 05 12:13:33 <grondo> so I guess it would have to be the previous job
May 05 12:13:38 <neb> but what did the job before it do
May 05 12:13:39 <neb> yeah
May 05 12:14:35 <neb> what if you changed the drop_caches to the postscript rather than the prolog?
May 05 12:14:46 <neb> might point the finger a bit better.
May 05 12:15:12 <grondo> yeah, I'm not sure why it is done in the prolog.
May 05 12:17:53 <grondo> I should be able to tell which jobs ran previously
May 05 12:18:13 <grondo> however that won't help me figure out what took the lock and didn't drop it!
May 05 12:18:40 <neb> the previous job might give some clues as to how to make a reproducer though.
May 05 12:18:52 <neb> if we can reproduce, then we can study much better.
May 05 12:19:20 <neb> maybe have them run their job and then before the next job starts look at the state of the system
May 05 12:19:25 <grondo> yes that is true
May 05 14:32:16 <neb> any more information on the dcache_lock problem?
May 05 14:32:31 <neb> anything actionable yet?
May 05 14:33:09 <neb> Should I open up a ticket just to store notes until we can rule out it being a RHEL bug?
May 05 14:35:24 <grondo> no, let's wait a bit for that
May 05 14:35:34 <grondo> still thinking it must be self-inflicted
May 05 14:36:26 <neb> OK I'll stand by
May 05 16:02:03 <grondo> wish there was a runtime way to enable DEBUG_SPINLOCK
May 05 16:03:11 <neb> The way that they turn off and on tracepoints in the kernel might be able to be adapted to do something like that but it is one of those things that is "really clever"
May 05 16:18:50 <grondo> I wonder if I can easily write a systemtap or kprobes mousetrap for this issue
May 05 16:18:59 <grondo> I'm not really sure how else to proceed at this point
May 05 16:19:08 <grondo> looking for the common job but so far don't see much
May 05 16:19:51 <neb> So what would you be looking for?
May 05 16:20:14 <grondo> all I need to collect is the last locker of dcache_lock
May 05 16:20:23 <grondo> (backtrace)
May 05 16:20:40 <neb> that should be doable.
May 05 16:20:41 <grondo> that will be a significant hint I'd imagine
May 05 16:21:02 <grondo> but I'll want to run the probe everywhere, so it should be low impact
May 05 16:28:22 * neb looking for a bit of code to snatch
May 05 16:29:56 <neb> how many places is the dcache_lock taken?
May 05 16:30:10 <grondo> I don't know. About a zillion?
May 05 16:30:13 <grondo> ;-)
May 05 16:32:49 <neb> [root@snog linux-2.6.33.3]# find . -name "*.c" | xargs grep spin_lock.*dcache_lock | wc -l
May 05 16:32:49 <neb> 80
May 05 16:32:53 <neb> hmm
May 05 16:33:23 <grondo> you know grep has a -c option ;-)
May 05 16:33:33 <grondo> and a -R ;-)
May 05 16:33:34 <grondo> heh
May 05 16:34:03 <neb> pbbt
May 05 16:34:13 <neb> TIMTOWTSAC
May 05 16:34:38 <grondo> I don't know that particular acronym :-)
May 05 16:34:46 <neb> there is more than one way to skin a cat
May 05 16:36:34 <grondo> Hm, this patch might help: https://patchwork.kernel.org/patch/45557/
May 05 16:36:47 <grondo> (heh. joking in case you didn't get it)
May 05 16:37:43 <neb> I was wondering if there was some way that you could modify the ticketed spinlock code to give you the clue you want.
May 05 16:38:17 <grondo> well, if I'm going to be modifying code, which would require a new kernel build, I would probably just build with debug spinlocks
May 05 16:38:30 <grondo> which track the current holder of the lock I think
May 05 16:40:08 <grondo> I want to do something that doesn't require taking the machine down
May 05 16:44:32 <neb> However, I don't know how to probe 80 probe points.
May 05 16:45:22 <neb> if it was fewer or more local
May 05 16:47:34 <grondo> I wonder if I could (safely) write a kprobe for spin_lock/unlock that replaces the spinlock implementation with the debug version at runtime
May 05 16:48:15 <neb> Honestly, I doubt it.
May 05 16:48:17 <grondo> hm, maybe not because you can not really do an early return
May 05 16:49:07 <neb> The reason being you would need a spinlock to insert the code
May 05 16:49:11 <neb> <neb> customer has a problem with a missing unlock of dcache_lock, can anyone think of a way to figure out who is the last person to have the lock? Unfortunately, there are about 80 places which take the dcache_lock
May 05 16:49:11 <neb> <mjg59> neb: Well, if the kernel has lockdep, you can do alt+sysreq+d
May 05 16:49:11 <neb> <roland> run a kernel with lockdep enabled, but i don't suppose we have rhel kernels like that
May 05 16:49:11 <neb> * tscott_gone is now known as tscott
May 05 16:49:19 <grondo> I wonder if it would be too slow to essentially do if (lock == dcache_lock) { /* store some debug */ }
May 05 16:49:22 <neb> plus isn't spin_lock inlined code.
May 05 16:49:44 <grondo> probably
May 05 16:49:56 <neb> <mjg59> Yeah, for some reason it's on in our kdump kernels but not our debug kernels
May 05 16:49:56 <neb> * clalance has quit (Quit: Leaving)
May 05 16:49:56 <neb> <mjg59> neb: So probably needs a custom kernel build
May 05 16:50:37 <grondo> how much overhead is lockdep? Could it be run in production for awhile, or is this really something where we need a reproducer first, then we run the lockdep kernel?
May 05 16:50:49 <neb> <roland> and if you're actually talking about rhel5 then there probably isn't an option like that, is there?
May 05 16:50:49 <neb> * rwheeler has quit (Quit: Leaving)
May 05 16:50:49 <neb> <neb> yeah this is still RHEL5
May 05 16:52:02 <neb> <roland> neb: well, stap on all those 80 sites, i guess
May 05 16:52:02 <neb> * denise has quit (Ping timeout: 622 seconds)
May 05 16:52:02 <neb> <neb> yeah that might be the best
May 05 16:52:02 <neb> <neb> sigh
May 05 16:52:34 <grondo> not to mention the sites in lustre...
May 05 16:52:42 <neb> heh
May 05 16:52:50 <grondo> doesn't the core spinlock code detect some of these issues?
May 05 16:53:07 <neb> not that I know of
May 05 16:53:19 <grondo> ok
May 05 16:54:37 <neb> I think you could automate writing the script a bit.
May 05 17:00:12 <grondo> heh, this post has a good testcase
May 05 17:00:15 <grondo> http://lkml.org/lkml/2009/1/16/503
May 05 17:01:38 <neb> I worry that if you do do all the stap stuff it will prevent the race.
This event sent from IssueTracker by woodard [LLNL (HPC)]
issue 937133
Event posted on 2010-05-26 15:11 PDT by woodard
May 13 10:24:42 <grondo> hm, more occurences of the dcache bug
May 13 10:24:57 <grondo> # atlas866 /root > cat
/proc/sys/fs/dentry-state
May 13 10:24:57 <grondo> 2826 -3260655 45 0 0
0
May 13 10:25:16 <neb> humph
May 13 10:25:34 <neb> so have you been monitoring it?
May 13 10:25:35 <grondo> the new logging shows that the node enters
the job epilog in this state
May 13 10:25:39 <grondo> 2010-05-12-04:36:22: 52545
prolog-start 5327 1692
May 13 10:25:39 <grondo> 2010-05-12-04:36:26: 52545
prolog-end 3615 200
May 13 10:25:39 <grondo> 2010-05-12-05:06:37: 52545
epilog-start 4831 -3258728
May 13 10:25:39 <grondo> 2010-05-12-05:06:45: 52545
epilog-end 5261 -3258304
May 13 10:25:39 <grondo> 2010-05-12-05:07:42: 52551
prolog-start 5260 -3258305
May 13 10:25:59 <grondo> wait, no
May 13 10:26:14 <grondo> oh, yeah, I was right
May 13 10:26:20 <grondo> so something the job is doing
May 13 10:27:36 <neb> any records of what the job is doing?
May 13 10:27:44 <grondo> not really
May 13 10:27:47 <grondo> I will have to email the user
May 13 10:27:56 <grondo> checking the user's other jobs on the
system
May 13 10:28:14 <grondo> no other nodes have this symptom
May 13 10:28:15 <neb> is it a fairly contained set of users.
May 13 10:30:36 <neb> does it happen to all the nodes participating in
the job?
May 13 10:30:41 <grondo> no
May 13 10:30:55 <grondo> only seems to have happened on one node of
an 8 node job
May 13 10:30:59 <grondo> still gathering info
May 13 10:32:56 <grondo> the job timed out, I wonder if that is a
clue
May 13 10:35:57 <neb> hmm I wonder if it had a ton of files open or
something like that when it timed out
May 13 11:07:27 <grondo> two other nodes in this state
May 13 11:07:30 <grondo> different users
May 13 11:08:12 <neb> any clues on the code yet, did they time out as
well?
May 13 11:08:23 <neb> was it the same 1 node out of quite a few?
May 13 11:08:38 <grondo> no the other one completed
May 13 11:25:48 <grondo> hey, whaddyaknow, sar logs dentry unused
stats
May 13 11:26:08 <grondo> 00:00:01 dentunusd file-sz inode-sz
super-sz %super-sz dquot-sz %dquot-sz rtsig-sz %rtsig-sz
May 13 11:26:17 <grondo> I don't know what the other stuff is
May 13 11:26:46 <grondo> but this allows me to put a more precise
timestamp on the event
May 13 11:26:54 <grondo> 04:30:01 1312 1020 4689
0 0.00 0 0.00 0 0.00
May 13 11:26:57 <grondo> 04:40:01 4294714788 1020 3924
0 0.00 0 0.00 0 0.00
May 13 11:27:23 <grondo> so somewhere between 0430 and 0440 the
dcache went haywire
May 13 11:27:51 <grondo> the job started at 04:36
May 13 11:29:25 <grondo> file-sz is the number of used file
handles
May 13 11:29:46 <grondo> inode-sz is the "Number of used inode
handlers"[sic]
May 13 11:32:09 <grondo> neither of those look suspicious
May 13 11:35:38 <grondo> other thing going on at that time is a
large-ish amount of pagefaults:
May 13 11:35:47 <grondo> 04:30:01 0.00 0.00 10999.17
0.01
May 13 11:35:47 <grondo> 04:40:01 0.00 0.00 14179.37
1.86
May 13 11:35:57 <neb> that may not be weird.
May 13 11:36:02 <grondo> however, that probably matches with the
fact that the job was being launched at that time
May 13 11:36:20 <neb> exactly a whole lot of pages might need to be
loaded to get an app up and running
May 13 11:36:49 <grondo> I will check on other nodes to see if the
problem seemed to have occurred during job launch
May 13 11:36:55 <grondo> that will be an interesting datapoint
May 13 11:36:57 <neb> I wonder if it was doing a huge amount of searching
for shared libs and stuff causing the negative dcache to get giant.
May 13 11:37:50 <grondo> nothing else of interest in the sar logs
May 13 11:38:40 <neb> e.g. ldd /usr/bin/evolution | wc -l = 129
May 13 11:39:01 <grondo> yeah that is nothing
May 13 11:41:27 <neb> but if your ld.so.conf is big and then you have an
app that has zillions of shared libs then the number of negative dcache
entries might be gobs
May 13 11:43:41 <grondo> we happen to have a test app that
simulates that
May 13 11:43:43 <grondo> pynamic
May 13 11:43:52 <neb> hmm
May 13 11:44:09 <neb> try pynamic in a tight loop?
May 13 11:46:46 <grondo> the actual binary that caused the problem
only has 19 .so's
May 13 11:46:55 <neb> bummer.
May 13 11:46:56 <grondo> (the binary in one case anyway)
May 13 11:47:08 <neb> does it go fishing for config files though?
May 13 11:47:14 <neb> or restart files?
May 13 11:47:51 <grondo> maybe the failures, which all look to have
occurred in the wee hours of 5/12, happened because something else was
running at the same time
May 13 11:50:15 <grondo> e.g. dist
May 13 11:51:01 <grondo> or fixperms
May 13 11:53:35 <neb> hmm what are the times?
May 13 11:54:13 <grondo> fixperms matches up exactly with the event
on one node
May 13 11:54:16 <grondo> still looking into others
May 13 11:56:47 <grondo> god damn they run fixperms a lot
May 13 11:56:57 <grondo> May 12 04:36:02 atlas866 logger: fixperms:
done running at 05/12/10 04:36:02 ...
May 13 11:57:00 <grondo> another time match
May 13 11:57:04 <grondo> :-O
May 13 11:57:19 <neb> very interesting.
May 13 11:57:26 <neb> what does fixperms do again?
May 13 12:39:58 <grondo> turns out fixperms running at the same
time as the job startup wasn't quite right
May 13 12:40:15 <grondo> it just looked like the same time because
fixperms runs in the epilog, just before the next job starts
May 13 12:41:13 <neb> oh well. I thought we were onto something.
May 13 12:41:50 <neb> this one is one of the more interesting
investigation kind of problems. We usually have a better idea of the cause
of something repeatable.
May 13 12:47:31 <grondo> yeah, I'm kind of at a loss
May 13 12:49:46 <neb> so we know it is something that the program is
doing.
May 13 12:50:31 <neb> I think we should track the program and talk to
the person running the program and maybe do a test run or two to sort of
profile it.
May 13 12:51:02 <grondo> uh, there isn't really a common program
between all the failures
May 13 12:51:35 <neb> It very well might be that they are all doing
somewhat the same thing or maybe even using a particular library
function.
May 13 12:52:19 <neb> If we could watch the sequence of things and then
stop the program when things seem to start going off the rails. it might
give us some idea of what the program is doing at that moment.
May 13 12:52:25 <neb> or what it has just done.
May 13 12:54:09 <grondo> well, as I said I think the problem occurs
perhaps before even main()
May 13 12:54:30 <neb> when did you say that and why do you think that?
May 13 12:56:49 <grondo> based on the timestamps in sar logs
May 13 12:57:10 <grondo> which does only give a 10 minute window
May 13 13:03:05 <grondo> however, it is always within that first 10
minutes
May 13 13:03:37 <grondo> in one case it was within the same minute
May 13 13:03:56 <neb> reading init files?
May 13 13:04:00 <grondo> so it could be something each job (by a
different user, different app) does the same at startup
May 13 13:04:17 <neb> esp if it is a lib function
May 13 13:04:17 <grondo> or the more likely case is that it happens
during launch itself
May 13 13:04:24 <grondo> because that is the same across all jobs
May 13 13:04:53 <neb> I'm not sure that I agree with you on the "more
likely" bit
May 13 13:05:12 <neb> it might be that they all search for initfiles or
something
May 13 13:05:27 <neb> that isn't launch but it might be something that
many apps do.
May 13 13:07:18 <grondo> initfiles?
May 13 13:07:35 <grondo> I think most apps have an input file and a
restart dump
May 13 13:07:49 <grondo> I _highly_ doubt that they search some
kind of path for either of those
May 13 13:08:11 <grondo> however, it still could be something they
do at startup
May 13 13:08:31 <grondo> I just find it more likely that it is the
process of launching, just because that is definitely the same
May 13 13:11:29 <grondo> even jobs using different MPI libraries
are involved
May 13 13:11:37 <grondo> thought maybe the MPI had something to do
with it
May 13 13:12:21 <grondo> one launches out of lustre, another out of
NFS homedir
May 13 13:12:33 <grondo> no luck with leads here
May 13 13:15:25 <neb> yes abstractly everybody has init files. However,
there still may be something about the specific way that they find and
read the init files that might cause the problem
May 13 13:16:27 <neb> and everybody launches too so the same argument
you make about init files applies to that and maybe even more so.
May 13 13:17:05 <neb> e.g. how many open() or access() calls does it
do?
May 13 13:19:44 <neb> is it always just one node / job?
May 13 13:19:57 <neb> could that be MPI rank(0) node?
May 17 11:22:58 <grondo> ugh. lots more hits on the dcache problem
May 17 11:25:24 <grondo> I installed a systemtap script that did a
backtrace from the function in dcache.c that first notices nr_unused < 0
May 17 11:25:34 <grondo> however, there doesn't seem to be any
extra clues there
May 17 11:25:48 <grondo> one app hit the issue while doing an open
May 17 11:25:59 <grondo> another hit while doing a stat
May 17 11:26:44 <grondo> I think I will somehow have to understand
the dcache accounting better so I can develop more assertions other than
just nr_unused < 0
May 17 11:27:39 <grondo> something seems to be disrupting the
accounting before, then at some point later the value goes negative as a
result.
May 17 11:29:29 <grondo> and the accounting just goes wildly out of
whack
May 17 11:30:31 <neb> data corruption?
May 17 11:30:49 <neb> maybe it isn't a dcache accounting problem after
all.
May 17 11:30:50 <grondo> could be, but wouldn't that tend to be
more random
May 17 11:31:07 <neb> isn't the trend pretty darn random?
May 17 11:31:21 <neb> that seems to be THE thing about this problem
May 17 11:31:56 <grondo> no, always happens at or near job startup,
always affects the dcache dentry_stat.nr_unused value (or I guess that is
the symptom)
May 17 11:32:32 <grondo> wouldn't memory corruption tend to cause
random effects, a node crash here, invalid value there, etc...
May 17 11:32:45 <neb> not necessarily
May 17 11:33:11 <neb> if it is a constant offset into the structure
May 17 11:34:37 <neb>
http://rhkernel.org/RHEL5+2.6.18-194.el5/fs/dcache.c#L60
May 17 11:35:43 <neb> would you be able to try a test kernel where the
dentry_stat is relocated before the other variables in the file.
May 17 11:35:59 <neb> like put it up near line 60 and then have all the
other variables below it.
May 17 11:36:47 <grondo> does that really affect the layout of
memory?
May 17 11:37:29 <neb> and maybe reorganize the dentry_stat_t type so
that it has nr_unused down farther in the structure.
May 17 11:38:07 <neb> There are some cases where it might not and you
are right that this case might be one of those.
May 17 11:39:11 <grondo> I don't know the rules for that
May 17 11:39:17 <neb> the compiler might want to put the statics in a
bss segment and the dentry_stat_t in a static data section
May 17 11:39:38 <neb> don't worry mere mortals aren't supposed to
care. ;-)
May 17 11:39:41 <grondo>
https://eris.llnl.gov/svn/kernel-exp/vanilla/trunk/patches/rh/linux-2.6-fs-private-dentry-list-to-avoid-dcache_lock-contention.patch
May 17 11:39:53 <grondo> that patch is new with this release
May 17 11:40:23 <grondo> I think it fixed another bug we were
having, but I wonder if I should try backing it out. We have a w/a for the
other bug
May 17 11:40:48 <grondo> note this hunk
May 17 11:40:51 <grondo> @@ -770,8 +769,7 @@ resume:
May 17 11:40:51 <grondo>  * of the unused list for prune_dcache
May 17 11:40:51 <grondo>  */
May 17 11:40:51 <grondo>  if (!atomic_read(&dentry->d_count)) {
May 17 11:40:51 <grondo> -        list_add_tail(&dentry->d_lru, &dentry_unused);
May 17 11:40:53 <neb> grrr fucking firewalls
May 17 11:40:53 <grondo> -        dentry_stat.nr_unused++;
May 17 11:40:56 <grondo> +        list_add_tail(&dentry->d_lru, dispose_list);
May 17 11:40:58 <grondo>          found++;
May 17 11:41:01 <grondo>  }
May 17 11:41:04 <grondo>
May 17 11:41:17 <grondo> I'm sure it is right, but the patch does
definitely modify the dentry_stat accounting
May 17 11:41:28 <grondo> maybe it could have introduced a very rare
leak?
May 17 11:41:59 <grondo> it would be nice if I understood the
invariants of the accounting, which should be pretty simple, I could write
a systemtap script that verified them on every function call
May 17 11:42:32 <grondo> for now, I think I'll capture the last N
values from dentry_stat and spit them out on failure... in case something
jumps out
May 17 15:33:40 <grondo> trying to do something like a circular
buffer log in systemtap
May 17 15:33:53 <grondo> can't find any examples online, and the
obvious things don't work
May 17 15:34:31 <grondo> like head = 0; log [head++] = string; if
(head > max) head = 0;
May 17 15:34:58 <grondo> if I do that I get an error that the array
has exceeded max entries or something
May 17 15:35:07 <neb> an array index % array_size
May 17 15:38:49 <grondo> yeah, basically
May 17 15:38:51 <grondo> that doesn't work
May 17 15:44:27 <grondo> (for some reason I always forget about
mod)
May 17 15:44:35 <grondo> anyhow it should be the same
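The circular log being sketched in this exchange is just a fixed-size ring buffer indexed modulo its size. The script under discussion is SystemTap (and the real obstacle turns out to be a stap bug, discussed below), but the underlying pattern, shown here as a plain C sketch with illustrative names, is:

    #include <stdio.h>
    #include <string.h>

    #define RING_SIZE 100             /* keep only the last 100 entries */

    static char ring[RING_SIZE][128]; /* the log entries themselves */
    static unsigned long head;        /* total number of entries ever logged */

    /* Record one entry, silently overwriting the oldest once full. */
    static void ring_log(const char *msg)
    {
        snprintf(ring[head % RING_SIZE], sizeof(ring[0]), "%s", msg);
        head++;
    }

    /* Dump the surviving entries in the order they were logged. */
    static void ring_dump(void)
    {
        unsigned long start = head > RING_SIZE ? head - RING_SIZE : 0;
        for (unsigned long i = start; i < head; i++)
            printf("%lu: %s\n", i, ring[i % RING_SIZE]);
    }

    int main(void)
    {
        char buf[64];
        for (int i = 0; i < 250; i++) {
            snprintf(buf, sizeof(buf), "dentry_stat sample %d", i);
            ring_log(buf);
        }
        ring_dump();                  /* prints samples 150..249 */
        return 0;
    }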
May 18 08:25:16 * Disconnected (Connection timed out).
May 18 08:53:59 <neb> 940000
May 18 10:15:52 <grondo> nice
May 18 10:26:51 <grondo> if you have any experts on stap, can you
ask the best way to use stap to create a circular log with minimal
impact?
May 18 10:27:38 <grondo> I'm having trouble with exceeding
MAXMAPENTRIES, even though I use essentially array [index % 100]
May 18 10:27:50 <grondo> (i.e. limit to 100 entries)
May 18 10:28:02 <grondo> I also tried doing per-cpu logs with
something like
May 18 10:28:19 <grondo> array [cpu(), index % 100]
May 18 10:28:24 <grondo> without much success
May 18 10:28:45 <neb> can you send me your whole script.
May 18 10:29:12 <grondo> I'd like to basically keep a log of the
last N things that happened, then dump the log on a trigger
May 18 10:30:01 <grondo> hmm, original script isn't really around
anymore because of all the experimentation
May 18 10:33:26 <grondo> let me rewrite it how I think it should
work
May 18 10:33:26 <grondo> the most naive version
May 18 13:19:00 <neb> back working on your stap script
May 18 13:24:46 <neb> try not using the increment and decrement
operators.
May 18 13:25:07 <neb>
https://bugzilla.redhat.com/show_bug.cgi?id=591407
May 18 13:49:40 <neb> try s / global array[100] / global array /
May 18 13:50:31 <neb> this is atlas right?
May 18 13:51:26 <neb> ooh 017007
May 18 13:52:25 <neb> Also you need to update to the newer version of
stap.
May 18 13:53:03 <neb> systemtap-1.1-3.el5
May 18 14:06:41 <neb> grondo tossed the new version of systemtap into
worpi:/tmp
May 18 16:33:58 <neb> did you get the new version of systemtap?
May 18 16:34:16 <neb> I changed location too thinking that was the
problem but it wasn't
May 18 16:35:07 <grondo> yes, not sure if it made it into the image
that Trent is rebooting onto hype now
May 18 16:35:24 <grondo> That version is the RHEL5.5 version, so
hopefully no dependencies gotchas
May 18 16:35:30 <grondo> What exactly was the fix in there?
May 18 16:35:41 <grondo> I think I'm going to order a new bike
soon
May 18 16:36:04 <neb> the stap developers didn't say but they did say
that the error message didn't look right for the script.
May 18 16:36:22 <neb> and because of the text of the error message they
could see that you were running an old version of stap
May 18 16:36:49 <neb> they think that there were array indexing issues in
that version on the edge of the arrays.
May 18 16:37:30 <neb> So they thought that you might be hitting
array[99] and it was telling you that it was an overflow when it was
reasonable, because of a bug in stap
May 18 16:37:42 <neb> they said that it was best to not declare the size
of arrays.
May 18 16:38:27 <grondo> ok, I got that error even if I didn't
declare the size of the array
May 18 16:38:40 <grondo> will try the new stap once hype is back
up
May 18 16:39:01 <grondo> in theory what I'm doing should work
though?
May 18 16:39:33 <grondo> the other question I have is how to probe
multiple functions with the same "probe" line
May 18 16:39:50 <grondo> the "*@fs/dcache.c" gets too much
May 18 16:39:59 <grondo> it even probes stuff _not_ in dcache.c
strangely
May 18 16:40:36 <grondo> oh, interesting
May 18 16:40:42 <grondo> I can't recreate that problem now
May 18 16:41:20 <grondo> ah, Trent _did_ get the new systemtap
May 18 16:42:53 <grondo> cool, seems to work better now (so far so
good)
May 18 16:43:00 <grondo> too bad I can't use it on the production
systems
May 18 16:43:02 <grondo> :-(
May 18 16:43:57 <grondo> cool that the script works now, tho
May 18 16:44:09 <neb> OK so it appears to be the stap script.
May 18 16:44:38 <neb> Can't you just install the new stap packages into
the writable
May 18 16:44:42 <neb> err I meant:
May 18 16:44:52 <neb> OK so it appears to be the stap binary not the
script
May 18 16:44:57 <neb> that makes more sense.
May 18 16:45:27 <neb> keep in mind with stap that it is a new and
evolving tool. It may not always be your skill it can also be the binary.
May 18 18:00:28 <neb> so does it look like your stap script will work.
May 18 18:00:50 <neb> other than the multiprobe thing are there any
additional things you need me to track down for you regarding that?
May 18 18:01:08 <grondo> not at the moment
May 20 11:12:53 <grondo> On the dcache problem --- I implemented
checks for dentry_stat.nr_unused < 0 in the SLURM prolog and epilog
May 20 11:13:00 <grondo> to catch the problem before the next job
May 20 11:13:17 <grondo> A couple nodes triggered last night
May 20 11:13:44 <grondo> however, one of them didn't have the
signature negative value in /proc/sys/fs/dentry-state
May 20 11:13:48 <neb> They seem to implicitly see it as a runtime
environment and the notion that it is used as a production development
environment for some larger customers seems to have escaped them.
May 20 11:13:51 <grondo> # atlas449 /root > cat
/proc/sys/fs/dentry-state
May 20 11:13:51 <grondo> 12040 4530 45 0 0 0
May 20 11:14:40 <grondo> however, right when I started running
drop_caches, it went negative!
May 20 11:14:44 <grondo> grondo@atlas449 ~ > cat
/proc/sys/fs/dentry-state
May 20 11:14:44 <grondo> 2841 -3402 45 0 0 0
May 20 11:15:23 <grondo> I'm not sure what to make of that
May 20 11:16:10 <neb> so how did the slurm check catch it if
dentry_stat.nr_unused was still positive?
May 20 11:16:25 <grondo> yes, that is the first question I had
May 20 11:16:26 <neb> re: dev env yeah
May 20 11:16:37 <grondo> it must have been negative at the time of
the check
May 20 11:16:40 <neb> could it be intermittent?
May 20 11:16:41 <neb> yeah
May 20 11:16:50 <grondo> then went back positive?
May 20 11:16:59 <grondo> I don't really understand how that could
May 20 11:17:00 <grondo> be
May 20 11:17:32 <grondo> I really wish I could install the new
systemtap on atlas so I could run my script that captures the history of
the dentry_stat values
May 20 11:18:06 <grondo> I really have no idea what is going on
here
May 20 11:18:19 <grondo> I think my only recourse right now is to
try to reproduce on hype
May 20 11:22:53 <neb> have you come up with any clues on how to
reproduce the problem?
May 20 11:23:08 <neb> have you tried removing that one patch which you
thought might be implicated?
May 20 11:25:13 <grondo> no, and no, because I can't recreate it
May 20 16:54:06 <grondo> stuck on a systemtap thing again
May 20 16:54:26 <grondo> I want to resolve the address of a list to
pass it to a function
May 20 16:54:38 <grondo> I'm basically doing l = $dentry_unused
May 20 16:54:42 <grondo> but I'm getting:
May 20 16:54:52 <grondo> semantic error: struct/union 'list_head'
is being accessed instead of a member of the struct/union: identifier
'$dentry_unused' at /g/g0/grondo/proj/stap/dcache-count.stp:19:11
May 20 16:57:22 * neb thinkin
May 20 16:57:26 <neb> err thinking
May 20 16:57:38 <grondo> if instead do $dentry_unused->next
May 20 16:57:46 <grondo> then it gets further, but the error is
confusing
May 20 16:57:56 <grondo> Running make -C
"/lib/modules/2.6.18-90chaos/build" M="/var/tmp/root/stapAHG5Ew"
modules >/dev/null
May 20 16:57:59 <grondo> /var/tmp/root/stapAHG5Ew/stap_23834.c: In
function 'function_dentry_unused_count':
May 20 16:58:01 <grondo> /var/tmp/root/stapAHG5Ew/stap_23834.c:336:
error: 'struct list_head' has no member named 'd_list'
May 20 16:58:04 <grondo> /var/tmp/root/stapAHG5Ew/stap_23834.c:345:
error: 'struct list_head' has no member named '__retvalue'
May 20 16:59:46 <neb> looks like some type polymorphism problems.
May 20 17:00:52 <grondo> hm, the generated code is not as expected
May 20 18:15:51 <grondo> aw, shit
May 20 18:16:17 <grondo> so I have a stap script to compare the
number of dentries in the dentry_unused list to dentry_stat.nr_unused
May 20 18:16:24 <grondo> and prints an error if they don't match
May 20 18:16:37 <grondo> many nodes on atlas don't match
May 20 18:16:54 <grondo> eatlas16: 3184 members of dentry_unused
list. nr_dentry=6865 nr_unused=3182
May 20 18:16:55 <grondo> eatlas7: 3233 members of dentry_unused
list. nr_dentry=6942 nr_unused=3232
May 20 18:16:55 <grondo> eatlas10: 3138 members of dentry_unused
list. nr_dentry=6823 nr_unused=3137
May 20 18:16:55 <grondo> eatlas8: 3250 members of dentry_unused
list. nr_dentry=6941 nr_unused=3246
May 20 18:16:57 <grondo> eatlas28: 3176 members of dentry_unused
list. nr_dentry=6824 nr_unused=3175
May 20 18:16:59 <grondo> eatlas2: 3195 members of dentry_unused
list. nr_dentry=6887 nr_unused=3194
May 20 18:17:02 <grondo> is just a sample
May 20 18:17:29 <grondo> I can "fix" it with the script as well,
but the problem is worse than I thought
May 20 18:17:37 <grondo> well have to run
May 20 18:18:03 <neb> Hmm... so it might be that the problem lies
dormant for a while but then goes critical
May 20 18:18:32 <grondo> yes, those nodes all have the chance for
nr_unused < 0 I believe
May 20 18:19:05 <neb> that is good it seems to push the problem back.
May 20 18:19:15 <grondo> unfortunately to count the dentry_unused
list I have to walk the entire thing
May 20 18:19:27 <neb> maybe the reason you couldn't find the cause is
the cause is long gone before the problem shows up.
May 20 18:19:34 <grondo> yes, could be
May 20 18:19:58 <neb> do you have any 4.2 systems that you can compare
against?
May 20 18:21:57 <neb> did the stap thing work?
May 21 10:19:49 <grondo> so what I'm thinking of doing with the
stap script I have is to run it in the epilog after every job and drain
the node if the size of dentry_unused != dentry_stat.nr_unused. That
=should= mean that the event which caused the problem happened during that
job
May 21 10:20:34 <neb> I would agree.
May 21 10:21:15 <grondo> egad
May 21 10:21:28 <grondo> another bug seems to have hit us on a
*bunch* of nodes
May 21 10:21:47 <neb> with the proviso that it isn't necessarily the
cause but is strongly correlated to the problem occurring.
May 21 10:23:25 <neb> you know me, I really try very hard to not mistake
coincidence & correlation with causality
This event sent from IssueTracker by woodard [LLNL (HPC)]
issue 937133
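The prolog/epilog checks and monitoring described above all come down to reading /proc/sys/fs/dentry-state, whose first two fields are dentry_stat.nr_dentry and dentry_stat.nr_unused; the symptom chased in this bug is the second field going negative. A minimal C sketch of that kind of check (not the actual SLURM hook, and error handling kept to a minimum):

    #include <stdio.h>

    /*
     * /proc/sys/fs/dentry-state exposes struct dentry_stat_t:
     *   nr_dentry  nr_unused  age_limit  want_pages  (plus two unused fields)
     */
    int main(void)
    {
        long nr_dentry, nr_unused;
        FILE *f = fopen("/proc/sys/fs/dentry-state", "r");

        if (!f) {
            perror("dentry-state");
            return 2;
        }
        if (fscanf(f, "%ld %ld", &nr_dentry, &nr_unused) != 2) {
            fprintf(stderr, "unexpected dentry-state format\n");
            fclose(f);
            return 2;
        }
        fclose(f);

        printf("nr_dentry=%ld nr_unused=%ld\n", nr_dentry, nr_unused);
        if (nr_unused < 0) {
            fprintf(stderr, "dcache accounting is off: nr_unused < 0\n");
            return 1;   /* a prolog/epilog wrapper could drain the node here */
        }
        return 0;
    }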
Event posted on 2010-05-26 15:13 PDT by woodard May 25 10:22:19 <grondo> hmm, it might be that the jobs causing the errors in dentry_stat.nr_unused are the same ones that caused our dcache softlockups in nfs May 25 10:22:35 <grondo> I hope I still have that reproducer May 25 10:26:58 <neb> interesting May 25 10:27:28 <neb> are these the one that search a bunch of places for a table of strings. May 25 10:27:38 <grondo> the reason I think this is that the job that hit the early detector had been canceled by timeout, and that was the case that reproduced the old bug May 25 10:27:39 <neb> do you think it is NFS or might it be lustre? May 25 10:27:58 <grondo> I believe it is NFS May 25 10:28:09 <grondo> but I don't think that is the bug May 25 10:28:25 <grondo> THis is totally guesswork right now May 25 10:28:32 <neb> did we actually have a repro for that bug/ May 25 10:28:59 <grondo> a patch went into 5.4 to fix the problem we saw before, where a special case of negative dentry validation caused a corner case to hit in the dccache and resulted in softlockups May 25 10:29:10 <neb> right I remember that one. May 25 10:29:26 <grondo> the problem occurred fairly rarely in production, but we could reproduce it by populating the dcache and running a certain user job then canceling it May 25 10:29:45 <grondo> the problem seemed to be triggered by all 8 tasks of the job getting sigterm at the same time May 25 10:30:04 <grondo> a signal handler would search the msg catalog path for non-existent files, triggering the revalidation May 25 10:30:27 <grondo> it also seemed like we were *much* more likely to reproduce when we had the job container module May 25 10:30:46 <grondo> in fact, I don't think any other RH customers saw this problem May 25 10:31:21 <neb> https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=178773 May 25 10:31:34 <grondo> so now there is a fix for some similar kind of softlockup in the dcache, which moves some stuff to a private list May 25 10:31:51 <grondo> my guess is that in that codepath, when we hit the same rare case we hit before, the accounting for nr_unused isn't quite right May 25 10:32:14 <grondo> https://chaos.llnl.gov/bugzilla/show_bug.cgi?id=285 May 25 10:32:14 <grondo> is our local casxe May 25 10:32:22 <grondo> sadly I didn't document how to run the reproducer May 25 10:32:31 <grondo> I vaguely remmeber, but I have to find the user code May 25 10:32:54 <grondo> it also seemed like any code compiled by the intel compiler was vulnerable May 25 10:33:10 * neb still can't get to chaos.llnl.gov due to silly firewall rules and crazy network people. May 25 10:33:24 <grondo> oh geez May 25 10:33:42 <grondo> ok, found the code May 25 10:34:22 <neb> is it something kent can run along with your stap script so that we can see if it happens on stock rhel kernels. May 25 10:34:54 <grondo> I don't think so May 25 10:34:54 <grondo> for some reason I could never make a standalone reproducer May 25 10:35:57 <grondo> the theory was that the code searched for the message catalogs once, thus creating the negative dcache entries. 
some time later, after the entries have expired, it gets a signal, looks for them again and triggers revalidation
May 25 10:36:24 <grondo> some kind of error comes back from the server, or maybe not, but this triggers shrink_dcache_parent with some kind of huge number
May 25 10:36:24 <grondo> that is what I remember
May 25 10:36:24 <grondo> but I tried to create a reproducer based on that and failed
May 25 10:37:29 <grondo> it also seemed important
May 25 10:37:29 <grondo> to have the job run in the sgi job container module
May 25 10:37:56 <neb> so now you aren't running with the sgi container modules?
May 25 10:38:08 <grondo> we are
May 25 10:38:24 <grondo> I'm just saying that I couldn't make my own reproducer before
May 25 10:38:38 <grondo> even though it should have been easy based on the theory of how the bug was triggered
May 25 10:40:30 <grondo> However, the necessity of the job container module explains why nobody else seems to see the problem
May 25 10:41:24 <grondo> dammit why didn't I document how to run the reproducer. stupid!
May 25 10:41:33 <neb> why do you think that the job module matters
May 25 10:42:24 <grondo> we supposed before that it was because the job container allows all tasks of the job to be signaled virtually at the same time
May 25 10:42:24 <grondo> slurm doesn't signal the individual tasks, it signals the job container as a whole
May 25 10:43:30 <neb> Oh yeah. I remember coming to that now.
May 25 10:43:47 <neb> I did a lot of playing with signals to try to make that work when I was trying to come up with a repro
May 25 10:43:54 <grondo> yes
May 25 10:44:12 <grondo> we just couldn't make it work
May 25 10:44:21 <grondo> hold on, I'm going to search email for the steps to reproduce
May 25 13:14:22 <neb> for an issue like this one, it is weird I keep on saying "we" but it feels wrong because you did most of the work yet I feel like we have been consulting and brainstorming all along.
May 25 13:16:36 <grondo> yes, we have
May 25 13:17:37 <grondo> doesn't bother me if you say we
May 25 13:23:03 <neb> is it even possible for you to test to see if the problem occurs without the job module?
May 25 13:27:46 <neb> Some questions I have, I wonder if you could reproduce it on a UP box or if it is a race between two procs manipulating the dcache list
May 25 13:28:04 <neb> is this the basic idea:
May 25 13:28:19 <neb> 1) make a lot of negative dcache entries
May 25 13:28:23 <neb> 2) wait 15 min
May 25 13:28:49 <neb> 3) exercise those negative dcache entries.
May 25 13:29:56 <grondo> all the dcache lists should be protected by dcache_lock
May 25 13:30:05 <grondo> I think you just have to make a lot of unused dentries
May 25 13:30:08 <grondo> e.g. find /
May 25 13:30:18 <grondo> although that doesn't even seem to be all that necessary
May 25 13:30:27 <neb> my idea was
May 25 13:30:29 <grondo> then make a _few_ negative dentries
May 25 13:30:35 <grondo> wait for them to expire
May 25 13:30:35 <neb> only a few
May 25 13:30:43 <grondo> then hit them again from all cpus
May 25 13:30:50 <grondo> that is my guess as to how it works
May 25 13:30:56 <grondo> but I swear we tried that before
May 25 13:31:07 <neb> I was going to try:
May 25 13:33:55 <neb> find /nfsmnt/ -type d; do find /usr/src/linux -type f | while read line;do name=`basename $line`;if [ -e $name ]; then ;; fi; done; done; sleep 900
May 25 13:34:14 <neb> then for all the CPUs do the same thing.
May 25 13:34:26 <grondo> try it
May 25 13:34:31 <neb> working on it now.
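A rough userspace sketch of the reproducer idea outlined just above: create negative dentries on an NFS mount, let them expire, then hit them again from a SIGTERM handler, roughly the way the Intel runtime's NLSPATH search does when every task is signalled at once. The directory, counts, and sleep time are placeholders; this is only an approximation of the workload described, not the actual LLNL reproducer:

    #include <signal.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define NFS_DIR "/nfsmnt/negdent"   /* placeholder: any NFSv3-backed directory */
    #define N_NAMES 1000

    static char paths[N_NAMES][64];

    /* stat() a batch of names that do not exist; each failed lookup
     * leaves a negative dentry in the dcache. */
    static void lookup_all(void)
    {
        struct stat st;
        for (int i = 0; i < N_NAMES; i++)
            stat(paths[i], &st);        /* expected to fail with ENOENT */
    }

    /* On SIGTERM, re-walk the same names, forcing revalidation of the
     * (by now expired) negative dentries -- roughly what the Intel
     * runtime's message-catalog search did on cancelation. */
    static void on_sigterm(int sig)
    {
        (void)sig;
        lookup_all();
        _exit(0);
    }

    int main(void)
    {
        for (int i = 0; i < N_NAMES; i++)
            snprintf(paths[i], sizeof(paths[i]), NFS_DIR "/no-such-file.%d", i);

        signal(SIGTERM, on_sigterm);
        lookup_all();                   /* 1: populate negative dentries      */
        sleep(900);                     /* 2: let them expire                 */
        pause();                        /* 3: cancel the job so every task    */
                                        /*    gets SIGTERM at the same moment */
        return 0;
    }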
This event sent from IssueTracker by woodard [LLNL (HPC)]
issue 937133

Event posted on 2010-05-27 04:43 PDT by woodard

Problem history:

This problem initially manifested as a hang during our job prolog script, run before each job to prepare the node for work. This script does the equivalent of echo 2 > /proc/sys/vm/drop_caches so that nodes handed over to customers start out with a similar memory profile. On some nodes we would see the script hang forever at this point, chewing up CPU time. After some investigation, it was found that dentry_stat.nr_unused had somehow gone negative, and this was causing an endless cycle under shrink_dcache_memory. How this value got to be negative was a mystery, however.

We installed some hooks into our job prolog and epilog to track dentry_stat values, and to notify us whenever nr_unused < 0. This was happening at a rate of a few nodes per day on a 1152 node machine, but there did not appear to be any commonality between the jobs.

The next thing we tried was a systemtap script that compared the actual length of the dentry_unused list to the value of nr_unused in dentry_stat, on the theory that perhaps the count was off way before it actually went negative (and thus the job affected wasn't actually the cause). We ran this stap script after every job, and we found that the jobs for which the count was off all had these things in common:

* Were canceled by a signal (timed out or canceled), so all tasks in the job were killed by a signal
* Used the Intel compiler

This bore a strong resemblance to *another* dcache bug we had hit. I tried the reproducer from that other bug (178773) and found that I could reproduce the problem (length(dentry_unused) != dentry_stat.nr_unused) at will. The key features of the reproducer, I believe, are the following:

- The reproducer, or perhaps some previous job, adds some negative dentries to the cache when looking for the intel message catalogs along the default NLSPATH
- while the job is running, those negative dentries expire
- when the job is signaled due to timeout or cancellation, *all* tasks get SIGTERM at virtually the same moment. This triggers a handler installed by the intel compiler which searches the NLSPATH again, hitting the negative dentries and requiring some kind of revalidation. (We are running on top of NFSv3)

This seems to be the trigger for the bug, although I earlier failed to make a reproducer for 178773 based on this information, so it might not be complete. However, a proven workaround for this problem is to prepend the path to the fortran message catalogs to the NLSPATH.

I have also shown that by removing the RHEL patch linux-2.6-fs-private-dentry-list-to-avoid-dcache_lock-contention.patch from the kernel, I can no longer reproduce this bug. However, that patch is thought to be the fix for 178773 so I'd like to keep it if at all possible.

Version changed from '5.4' to '4.6'

This event sent from IssueTracker by woodard issue 937133

Created attachment 417328 [details]
Possible fix
Ok, so based on what you guys are telling me, you get hung up in prune_dcache, and this comes from shrink_dcache_memory, from shrink_slab. What happens is shrink_slab does this

    nr = shrink_dcache_memory(0, mask);

where nr ends up being some negative number because we use nr_unused to calculate this number, then we call back with

    shrink_dcache_memory(nr, mask)

and in prune_dcache we do

    for (; nr; nr--) {
            shrink dcache
    }

which will go on _forever_ if nr is negative. So what is happening is we have a bunch of places where we do

    if (!list_empty(&dentry->d_lru)) {
            nr_unused--;
            list_del_init(&dentry->d_lru);
    }

but if the dentry is on our own private dispose_list, we have already taken care of the nr_unused counter, so nr_unused gets decremented again. So instead of not touching nr_unused when we add to and take away from the dispose_list, treat dispose_list just like we do dentry_unused: do nr_unused++ when we add to the dispose_list and nr_unused-- when we take away from the dispose_list. This will keep everything consistent and hopefully stop your hangs. Thanks,
Josef
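To make the double decrement concrete, here is a small userspace simulation of the accounting Josef describes (hypothetical names, not kernel code and not the actual patch): entries moved to a private dispose list have already had nr_unused dropped once, so a later "if it is still on a list, decrement" cleanup path subtracts them a second time, the counter drifts negative, and a negative count fed into prune_dcache's for (; nr; nr--) loop never terminates.

    #include <stdio.h>

    #define NDENTRIES 5

    struct fake_dentry {
        int on_a_list;               /* stand-in for !list_empty(&dentry->d_lru) */
    };

    static struct fake_dentry unused[NDENTRIES];
    static int nr_unused;

    /* select_parent()-style pass: move everything to a private dispose list.
     * The buggy variant drops nr_unused here even though the entry still
     * looks like it is on a list afterwards. */
    static void move_to_dispose_list(int fixed)
    {
        for (int i = 0; i < NDENTRIES; i++) {
            nr_unused--;             /* taken off dentry_unused               */
            if (fixed)
                nr_unused++;         /* fixed variant: entries on the dispose */
                                     /* list stay counted in nr_unused        */
            unused[i].on_a_list = 1; /* entry now sits on the dispose list    */
        }
    }

    /* Generic cleanup path: "if the dentry is still on a list, take it off
     * and drop the counter" -- the second decrement in the buggy variant. */
    static void final_cleanup(void)
    {
        for (int i = 0; i < NDENTRIES; i++) {
            if (unused[i].on_a_list) {
                nr_unused--;
                unused[i].on_a_list = 0;
            }
        }
    }

    static void run(int fixed)
    {
        nr_unused = NDENTRIES;
        for (int i = 0; i < NDENTRIES; i++)
            unused[i].on_a_list = 1;
        move_to_dispose_list(fixed);
        final_cleanup();
        printf("%s accounting: nr_unused ends up at %d (list is empty)\n",
               fixed ? "fixed" : "buggy", nr_unused);
    }

    int main(void)
    {
        run(0);   /* buggy: ends at -5; a negative count handed to the   */
        run(1);   /* prune loop is what makes the node spin forever      */
        return 0;
    }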
I'm still waiting on confirmation that the patch fixes the problem. As soon as I have confirmation that it makes the problem go away I will post it.

*** Bug 625749 has been marked as a duplicate of this bug. ***

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Posted 8/23.

In kernel-2.6.18-214.el5. You can download this test kernel from http://people.redhat.com/jwilson/el5. Detailed testing feedback is always welcomed.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html