Bug 596548
| Summary: | dcache unused accounting problem |
|---|---|
| Product: | Red Hat Enterprise Linux 5 |
| Component: | kernel |
| Version: | 5.4 |
| Hardware: | All |
| OS: | Linux |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| Reporter: | Issue Tracker <tao> |
| Assignee: | Josef Bacik <jbacik> |
| QA Contact: | Igor Zhang <yugzhang> |
| CC: | bturner, cward, cww, dhoward, mmahut, rmitchel, sprabhu, tao, yugzhang |
| Keywords: | Regression, ZStream |
| Target Milestone: | rc |
| Target Release: | --- |
| Doc Type: | Bug Fix |
| Last Closed: | 2011-01-13 21:34:05 UTC |
| Bug Blocks: | 658857 |
Description
Issue Tracker
2010-05-26 22:19:39 UTC
Event posted on 2010-05-25 11:34 PDT by woodard

Description of problem:
The problem manifests itself when nr_unused ends up being negative. This causes an oops when the dcache is reclaimed. This problem has been really hard to track down. First we were able to find that the problem was evident when we dropped the caches between HPC jobs. The customer wrote the attached stap script which allowed us to verify when the problem occurs. We were finally able to catch the kind of workload that causes the original problem and that led us to a reproducer. The reproducer seems to be the same one that was causing 178773. This problem seems to be with RHEL, but we haven't eliminated the possibility that it is some sort of local patch.

How reproducible:
Seemingly always.

Steps to Reproduce:
Very difficult outside of LLNL's environment. We are working on that now. We think: populate the dcache with a lot of negative entries that exist on an NFS server. Wait about 15 minutes for these entries to time out. Then revalidate them. If the end

This event sent from IssueTracker by woodard [LLNL (HPC)] issue 937133

Event posted on 2010-05-25 11:34 PDT by woodard
Issue Registered (Severity: 3)
This event sent from IssueTracker by woodard [LLNL (HPC)] issue 937133

Event posted on 2010-05-25 11:34 PDT by woodard
File uploaded: dcache.stp
This event sent from IssueTracker by woodard [LLNL (HPC)] issue 937133 it_file 702543

Event posted on 2010-05-26 14:42 PDT by woodard
<grondo> unable to recreate dcache issue (so far) with the RHEL5.4 patch
linux-2.6-fs-private-dentry-list-to-avoid-dcache_lock-contention.patch
<grondo> removed
<neb> just to confirm that I understand what you said.
<neb> You removed:
linux-2.6-fs-private-dentry-list-to-avoid-dcache_lock-contention.patch and
so far you haven't seen the problem reappear.
<grondo> yes, sorry
<grondo> I expected the old problem to reappear
<grondo> but if I remember right that was a difficult one to reproduce
<neb> weird, I don't know Lachlan McIlroy
<neb> I do know jeff layton and josef
<neb> let me go and talk to them
<grondo> for someone familiar with the code, and given a description of
the problem, I have a feeling they could identify the fix needed in the
patched code pretty quickly
<grondo> it seems to be a case where dentry_stat.nr_unused is decremented
but the dentry doesn't actually get removed from the list
<grondo> and it seems to be somewhere in that patch
<grondo> though a similar bug did exist before that patch, which would
cause softlockups with the same reproducer
<neb> I sort of feel like I can almost but not quite see the problem
having looked at the patch
<neb> it is that tip of the tongue feeling.
<grondo> I felt similar, but I think I would have to spend half a day at
least walking through the code to fully understand it first
* neb chatting with josef
<neb> <josef> hmm well shit
<grondo> :-o
<neb> hey do you have one of the oopses that occurs anywhere handy
<neb> I know that we've been trying to get far away from that
<neb> it has been happening on atlas right?
<grondo> an oops?
<grondo> what oops exactly?
<neb> <josef> neb: i can give you a patch that makes the nr_unused stuff
work right
<neb> <josef> but who knows if thats the actual problem
<neb> <josef> can you test the patch and see if it fixes the problem?
<neb> <neb> sure
<grondo> Send it over and I will test
<neb> The oops happens when nr_unused and the length of the unused list get way
out of sync and the thing finally blows up.
<grondo> doesn't oops
<grondo> gets stuck indefinitely in prune_dcache or thereabouts
<grondo> since there are tests in the code that kind of assume nr_unused
>= 0
<grondo> ;-)
<grondo> there aren't any actual checks
<neb> just forwarded you Josef's patch.
<neb> Can you try it out?
<grondo> yep
<grondo> shouldn't take me too long
<neb> rough expectation on time.
<neb> couple of hours?
<grondo> or less yeah
<grondo> that's amazing he saw a fix so quick
<neb> <josef> neb: actually wait
<neb> <josef> that patch is crap
<grondo> heh
<neb> just came now
<neb> rush job -- bad job
<neb> he needs more time.
<grondo> no prob
<grondo> no rush
This event sent from IssueTracker by woodard [LLNL (HPC)]
issue 937133
Event posted on 2010-05-26 14:43 PDT by woodard
<neb> josef, jlayton you guys acked a patch linux-2.6-fs-private-dentry-list-to-avoid-dcache_lock-contention.patch
<neb> for BZ526612 and with that patch applied we are seeing problems with the dcache on large cpu count boxes
<neb> when we remove that patch we can't reproduce the problem anymore.
<neb> I worry that there might be a subtle race in it.
<neb> after about 1 month of work and a lot of systemtap scripts we found that at the root of the problem the number of items on the dentry_unused list doesn't match nr_unused.
<neb> The numbers slowly drift out of agreement and eventually you end up with an oops and one of the odd things before the oops is that nr_unused is negative.
<bob> I gotta go take my dog to the vet.
* bob has quit (Quit: Send a sign, raise the sail, wave a last goodbye, destiny is calling, immortality be mine.)
* bmarson is now known as bmarson_gone
<josef> neb: ok you still there?
<neb> yes
<josef> neb: in select_parent
<josef> hmm well shit
<neb> Unfortunately it is hard to reproduce but it is taking out 50-100 machines a day here at LLNL.
<josef> neb: whats the oops?
<neb> it happens hours after the problem first appears
<neb> but hold on I can get it for you.
<neb> it seems to lose a few at a time and then sometime when it is 3-9000 off things go poof.
* neb looking for one of the oopses
* vaurora (~val.prod.ext.phx2.redhat.com) has joined #fs
<josef> neb: i can give you a patch that makes the nr_unused stuff work right
<josef> but who knows if thats the actual problem
<josef> can you test the patch and see if it fixes the problem?
<neb> sure
* hobbit is now known as hobbit_gone
<neb> josef, BTW we were also bit by BZ526612 but it didn't hit as frequently 2-3x/week
<josef> neb: whats your email
* kzak_away has quit (Ping timeout: 604 seconds)
<neb> woodard
<neb> oh yeah that's why I can't find the oops grepping through the logs.
<neb> It doesn't oops; the machine basically locks up, stuck indefinitely in prune_dcache or thereabouts
<neb> this is due to all the checks in the code that kind of assume nr_unused >= 0
<josef> neb: actually wait
<josef> that patch is crap
<neb> OK won't test that one.
<josef> neb: i've got to run and do birthing classes
<josef> neb: if you can send me all the debugging you have i'll take a look tomorrow
<josef> and give you something a little better thought out :)
<neb> We've been working on this for >1 month and we thought it was a local problem.
<neb> All the debug info is WAY too much but I'll make sure that I get you a good BZ
<neb> OK?
This event sent from IssueTracker by woodard [LLNL (HPC)]
issue 937133
Event posted on 2010-05-26 14:56 PDT by woodard
May 05 11:04:38 <grondo> argggh. More hits of that dcache_lock bug
May 05 11:04:54 <grondo> really have to focus on that one and get to the bottom of it now
May 05 11:08:20 <neb> any more information this time?
May 05 11:08:45 <neb> I think we kind of abandoned it before due to lack of info while we figured that it might be a lustre bug.
May 05 11:09:18 <grondo> No more info yet. Have yet to look into it today
May 05 11:09:45 <neb> still thinking it is a lustre bug?
May 05 11:10:15 <grondo> I guess not. Had Chris Morrone look into it and I guess he didn't find anything lustre related.
May 05 11:10:56 <neb> Can you remind me of the facts again?
May 05 11:12:03 <grondo> So far all I know is that a process gets hung when trying to do 'echo 2 > /proc/sys/vm/drop_caches'
May 05 11:12:20 <grondo> it is hung trying to lock the dcache_lock
May 05 11:13:19 <grondo> no other process seems to have the dcache_lock afaics
May 05 11:14:24 <neb> https://bugzilla.redhat.com/show_bug.cgi?id=526612
May 05 11:14:31 <neb> I think we have that fix already
May 05 11:15:03 <neb> https://bugzilla.redhat.com/show_bug.cgi?id=554574
May 05 11:15:08 <neb> this one is new
May 05 11:52:39 <grondo> strangely we are not getting softlockups with this problem
May 05 11:52:47 <grondo> so nothing is hogging the dcache_lock
May 05 11:53:08 <neb> is it lock juggling?
May 05 11:53:21 <grondo> I don't think so
May 05 11:54:07 <grondo> there is just one process spinning on the dcache_lock
May 05 11:54:12 <grondo> it is strange
May 05 11:54:14 <grondo> I don't really know where to start looking
May 05 11:54:40 <neb> the problem is triggered when you try to drop caches.
May 05 11:55:06 <neb> how often does it happen? is it intermittent or fairly frequent?
May 05 11:55:08 <grondo> [<ffffffff800654e6>] __sched_text_start+0xf6/0xbe3
May 05 11:55:08 <grondo> [<ffffffff8002f607>] prune_dcache+0x37/0x138
May 05 11:55:08 <grondo> [<ffffffff800f4a7f>] shrink_dcache_memory+0x19/0x32
May 05 11:55:08 <grondo> [<ffffffff80040a15>] shrink_slab+0xe0/0x15c
May 05 11:55:08 <grondo> [<ffffffff800fe952>] drop_slab+0x36/0x5f
May 05 11:55:11 <grondo> [<ffffffff800fee30>] drop_caches_sysctl_handler+0x162/0x1ae
May 05 11:55:13 <grondo> [<ffffffff8009c5e4>] do_rw_proc+0xcb/0x120
May 05 11:55:16 <grondo> [<ffffffff8009c653>] proc_writesys+0x1a/0x1c
May 05 11:55:19 <grondo> [<ffffffff80017012>] vfs_write+0xcf/0x175
May 05 11:55:21 <grondo> [<ffffffff80017902>] sys_write+0x47/0x70
May 05 11:55:23 <grondo> [<ffffffff80060116>] system_call+0x7e/0x83
May 05 11:55:26 <grondo> seems to be very rare
May 05 11:55:29 <grondo> that is the backtrace of the process trying to drop caches
May 05 11:55:34 <grondo> (calling drop_slab)
May 05 11:59:37 <grondo> hm, maybe I should actually look at the code in prune_dcache and see what could be going on
May 05 11:59:53 <grondo> I just assumed it is the dcache_lock, but I can't remember why now
May 05 12:02:37 <grondo> the process is listed as running:
May 05 12:02:42 <grondo> prolog R running task 0 17916 17915 (NOTLB)
May 05 12:02:47 <grondo> Call Trace: [<ffffffff80066031>] thread_return+0x5e/0xf6 [<ffffffff8002f5f6>] prune_dcache+0x26/0x138
May 05 12:03:20 <neb> what's at 138?
May 05 12:03:33 <neb> thats pretty early in the function
May 05 12:05:00 <grondo> spin_lock(&dcache_lock);
May 05 12:05:05 <grondo> guess it is the lock
May 05 12:11:57 <neb> is there any commonality in the jobs that happen to be running on the nodes. Maybe it is some unusual usage pattern such as zillions of small files.
May 05 12:12:09 <neb> thus the dcache is much bigger than for other jobs.
May 05 12:12:54 <neb> or maybe it looks for a lot of files that don't exist and there are a ton of negative dcache entries.
May 05 12:13:22 <grondo> this happens before the job starts, in the job prolog
May 05 12:13:33 <grondo> so I guess it would have to be the previous job
May 05 12:13:38 <neb> but what did the job before it do
May 05 12:13:39 <neb> yeah
May 05 12:14:35 <neb> what if you changed the drop_caches to the postscript rather than the prolog?
May 05 12:14:46 <neb> might point the finger a bit better.
May 05 12:15:12 <grondo> yeah, I'm not sure why it is done in the prolog.
May 05 12:17:53 <grondo> I should be able to tell which jobs ran previously
May 05 12:18:13 <grondo> however that won't help me figure out what took the lock and didn't drop it!
May 05 12:18:40 <neb> the previous job might give some clues as to how to make a reproducer though.
May 05 12:18:52 <neb> if we can reproduce, then we can study much better.
May 05 12:19:20 <neb> maybe have them run their job and then before the next job starts look at the state of the system
May 05 12:19:25 <grondo> yes that is true
May 05 14:32:16 <neb> any more information on the dcache_lock problem?
May 05 14:32:31 <neb> anything actionable yet?
May 05 14:33:09 <neb> Should I open up a ticket just to store notes until we can rule out it being a RHEL bug?
May 05 14:35:24 <grondo> no, let's wait a bit for that
May 05 14:35:34 <grondo> still thinking it must be self-inflicted
May 05 14:36:26 <neb> OK I'll stand by
May 05 16:02:03 <grondo> wish there was a runtime way to enable DEBUG_SPINLOCK
May 05 16:03:11 <neb> The way that they turn off and on tracepoints in the kernel might be able to be adapted to do something like that but it is one of those things that is "really clever"
May 05 16:18:50 <grondo> I wonder if I can easily write a systemtap or kprobes mousetrap for this issue
May 05 16:18:59 <grondo> I'm not really sure how else to proceed at this point
May 05 16:19:08 <grondo> looking for the common job but so far don't see much
May 05 16:19:51 <neb> So what would you be looking for?
May 05 16:20:14 <grondo> all I need to collect is the last locker of dcache_lock
May 05 16:20:23 <grondo> (backtrace)
May 05 16:20:40 <neb> that should be doable.
May 05 16:20:41 <grondo> that will be a significant hint I'd imagine
May 05 16:21:02 <grondo> but I'll want to run the probe everywhere, so it should be low impact
May 05 16:28:22 * neb looking for a bit of code to snatch
May 05 16:29:56 <neb> how many places is the dcache_lock taken?
May 05 16:30:10 <grondo> I don't know. About a zillion?
May 05 16:30:13 <grondo> ;-)
May 05 16:32:49 <neb> [root@snog linux-2.6.33.3]# find . -name "*.c" | xargs grep spin_lock.*dcache_lock | wc -l
May 05 16:32:49 <neb> 80
May 05 16:32:53 <neb> hmm
May 05 16:33:23 <grondo> you know grep has a -c option ;-)
May 05 16:33:33 <grondo> and a -R ;-)
May 05 16:33:34 <grondo> heh
May 05 16:34:03 <neb> pbbt
May 05 16:34:13 <neb> TIMTOWTSAC
May 05 16:34:38 <grondo> I don't know that particular acronym :-)
May 05 16:34:46 <neb> there is more than one way to skin a cat
May 05 16:36:34 <grondo> Hm, this patch might help: https://patchwork.kernel.org/patch/45557/
May 05 16:36:47 <grondo> (heh. joking in case you didn't get it)
May 05 16:37:43 <neb> I was wondering if there was some way that you could modify the ticketed spinlock code to give you the clue you want.
May 05 16:38:17 <grondo> well, if I'm going to be modifying code, which would require a new kernel build, I would probably just build with debug spinlocks
May 05 16:38:30 <grondo> which track the current holder of the lock I think
May 05 16:40:08 <grondo> I want to do something that doesn't require taking the machine down
May 05 16:44:32 <neb> However, I don't know how to probe 80 probe points.
May 05 16:45:22 <neb> if it was fewer or more local
May 05 16:47:34 <grondo> I wonder if I could (safely) write a kprobe for spin_lock/unlock that replaces the spinlock implementation with the debug version at runtime
May 05 16:48:15 <neb> Honestly, I doubt it.
May 05 16:48:17 <grondo> hm, maybe not because you can not really do an early return
May 05 16:49:07 <neb> The reason being you would need a spinlock to insert the code
May 05 16:49:11 <neb> <neb> customer has a problem with a missing unlock of dcache_lock, can anyone think of a way to figure out who is the last person to have the lock? Unfortunately, there are about 80 places which take the dcache_lock
May 05 16:49:11 <neb> <mjg59> neb: Well, if the kernel has lockdep, you can do alt+sysreq+d
May 05 16:49:11 <neb> <roland> run a kernel with lockdep enabled, but i don't suppose we have rhel kernels like that
May 05 16:49:11 <neb> * tscott_gone is now known as tscott
May 05 16:49:19 <grondo> I wonder if it would be too slow to essentially do if (lock == dcache_lock) { /* store some debug */ }
May 05 16:49:22 <neb> plus isn't spin_lock inlined code.
May 05 16:49:44 <grondo> probably
May 05 16:49:56 <neb> <mjg59> Yeah, for some reason it's on in our kdump kernels but not our debug kernels
May 05 16:49:56 <neb> * clalance has quit (Quit: Leaving)
May 05 16:49:56 <neb> <mjg59> neb: So probably needs a custom kernel build
May 05 16:50:37 <grondo> how much overhead is lockdep? Could it be run in production for awhile, or is this really something where we need a reproducer first, then we run the lockdep kernel?
May 05 16:50:49 <neb> <roland> and if you're actually talking about rhel5 then there probably isn't an option like that, is there?
May 05 16:50:49 <neb> * rwheeler has quit (Quit: Leaving)
May 05 16:50:49 <neb> <neb> yeah this is still RHEL5
May 05 16:52:02 <neb> <roland> neb: well, stap on all those 80 sites, i guess
May 05 16:52:02 <neb> * denise has quit (Ping timeout: 622 seconds)
May 05 16:52:02 <neb> <neb> yeah that might be the best
May 05 16:52:02 <neb> <neb> sigh
May 05 16:52:34 <grondo> not to mention the sites in lustre...
May 05 16:52:42 <neb> heh
May 05 16:52:50 <grondo> doesn't the core spinlock code detect some of these issues?
May 05 16:53:07 <neb> not that I know of
May 05 16:53:19 <grondo> ok
May 05 16:54:37 <neb> I think you could automate writing the script a bit.
May 05 17:00:12 <grondo> heh, this post has a good testcase
May 05 17:00:15 <grondo> http://lkml.org/lkml/2009/1/16/503
May 05 17:01:38 <neb> I worry that if you do do all the stap stuff it will prevent the race.
This event sent from IssueTracker by woodard [LLNL (HPC)]
issue 937133
Event posted on 2010-05-26 15:11 PDT by woodard
May 13 10:24:42 <grondo> hm, more occurences of the dcache bug
May 13 10:24:57 <grondo> # atlas866 /root > cat
/proc/sys/fs/dentry-state
May 13 10:24:57 <grondo> 2826 -3260655 45 0 0
0
May 13 10:25:16 <neb> humph
May 13 10:25:34 <neb> so have you been monitoring it?
May 13 10:25:35 <grondo> the new logging shows that the node enters
the job epilog in this state
May 13 10:25:39 <grondo> 2010-05-12-04:36:22: 52545
prolog-start 5327 1692
May 13 10:25:39 <grondo> 2010-05-12-04:36:26: 52545
prolog-end 3615 200
May 13 10:25:39 <grondo> 2010-05-12-05:06:37: 52545
epilog-start 4831 -3258728
May 13 10:25:39 <grondo> 2010-05-12-05:06:45: 52545
epilog-end 5261 -3258304
May 13 10:25:39 <grondo> 2010-05-12-05:07:42: 52551
prolog-start 5260 -3258305
May 13 10:25:59 <grondo> wait, no
May 13 10:26:14 <grondo> oh, yeah, I was right
May 13 10:26:20 <grondo> so something the job is doing
May 13 10:27:36 <neb> any records of what the job is doing?
May 13 10:27:44 <grondo> not really
May 13 10:27:47 <grondo> I will have to email the user
May 13 10:27:56 <grondo> checking the user's other jobs on the
system
May 13 10:28:14 <grondo> no other nodes have this symptom
May 13 10:28:15 <neb> is it a fairly contained set of users.
May 13 10:30:36 <neb> does it happen to all the nodes participating in
the job?
May 13 10:30:41 <grondo> no
May 13 10:30:55 <grondo> only seems to have happened on one node of
an 8 node job
May 13 10:30:59 <grondo> still gathering info
May 13 10:32:56 <grondo> the job timed out, I wonder if that is a
clue
May 13 10:35:57 <neb> hmm I wonder if it had a ton of files open or
something like that when it timed out
May 13 11:07:27 <grondo> two other nodes in this state
May 13 11:07:30 <grondo> different users
May 13 11:08:12 <neb> any clues on the code yet, did they time out as
well?
May 13 11:08:23 <neb> was it the same 1 node out of quite a few?
May 13 11:08:38 <grondo> no the other one completed
May 13 11:25:48 <grondo> hey, whaddyaknow, sar logs dentry unused
stats
May 13 11:26:08 <grondo> 00:00:01 dentunusd file-sz inode-sz
super-sz %super-sz dquot-sz %dquot-sz rtsig-sz %rtsig-sz
May 13 11:26:17 <grondo> I don't know what the other stuff is
May 13 11:26:46 <grondo> but this allows me to put a more precise
timestamp on the event
May 13 11:26:54 <grondo> 04:30:01 1312 1020 4689
0 0.00 0 0.00 0 0.00
May 13 11:26:57 <grondo> 04:40:01 4294714788 1020 3924
0 0.00 0 0.00 0 0.00
May 13 11:27:23 <grondo> so somewhere between 0430 and 0440 the
dcache went haywire
May 13 11:27:51 <grondo> the job started at 04:36
May 13 11:29:25 <grondo> file-sz is the number of used file
handles
May 13 11:29:46 <grondo> inode-sz is the "Number of used inode
handlers"[sic]
May 13 11:32:09 <grondo> neither of those look suspicious
May 13 11:35:38 <grondo> other thing going on at that time is a
large-ish amount of pagefaults:
May 13 11:35:47 <grondo> 04:30:01 0.00 0.00 10999.17
0.01
May 13 11:35:47 <grondo> 04:40:01 0.00 0.00 14179.37
1.86
May 13 11:35:57 <neb> that may not be weird.
May 13 11:36:02 <grondo> however, that probably matches with the
fact that the job was being launched at that time
May 13 11:36:20 <neb> exactly a whole lot of pages might need to be
loaded to get an app up and running
May 13 11:36:49 <grondo> I will check on other nodes to see if the
problem seemed to have occurred during job launch
May 13 11:36:55 <grondo> that will be an interesting datapoint
May 13 11:36:57 <neb> I wonder if it was doing a huge amount of searching
for shared libs and stuff causing the negative dcache to get giant.
May 13 11:37:50 <grondo> nothing else of interest in the sar logs
May 13 11:38:40 <neb> e.g. ldd /usr/bin/evolution | wc -l = 129
May 13 11:39:01 <grondo> yeah that is nothing
May 13 11:41:27 <neb> but if your ld.so.conf is big and then you have an
app that has zillions of shared libs then the number of negative dcache
entries might be gobs
May 13 11:43:41 <grondo> we happen to have a test app that
simulates that
May 13 11:43:43 <grondo> pynamic
May 13 11:43:52 <neb> hmm
May 13 11:44:09 <neb> try pynamic in a tight loop?
May 13 11:46:46 <grondo> the actual binary that caused the problem
only has 19 .so's
May 13 11:46:55 <neb> bummer.
May 13 11:46:56 <grondo> (the binary in one case anyway)
May 13 11:47:08 <neb> does it go fishing for config files though?
May 13 11:47:14 <neb> or restart files?
May 13 11:47:51 <grondo> maybe the failures, which all look to have
occurred in the wee hours of 5/12, happened because something else was
running at the same time
May 13 11:50:15 <grondo> e.g. dist
May 13 11:51:01 <grondo> or fixperms
May 13 11:53:35 <neb> hmm what are the times?
May 13 11:54:13 <grondo> fixperms matches up exactly with the event
on one node
May 13 11:54:16 <grondo> still looking into others
May 13 11:56:47 <grondo> god damn they run fixperms a lot
May 13 11:56:57 <grondo> May 12 04:36:02 atlas866 logger: fixperms:
done running at 05/12/10 04:36:02 ...
May 13 11:57:00 <grondo> another time match
May 13 11:57:04 <grondo> :-O
May 13 11:57:19 <neb> very interesting.
May 13 11:57:26 <neb> what does fixperms do again?
May 13 12:39:58 <grondo> turns out fixperms running at the same
time as the job startup wasn't quite right
May 13 12:40:15 <grondo> it just looked like the same time because
fixperms runs in the epilog, just before the next job starts
May 13 12:41:13 <neb> oh well. I thought we were onto something.
May 13 12:41:50 <neb> this one is one of the more interesting
investigation kind of problems. We usually have a better idea of the cause
of something repeatable.
May 13 12:47:31 <grondo> yeah, I'm kind of at a loss
May 13 12:49:46 <neb> so we know it is something that the program is
doing.
May 13 12:50:31 <neb> I think we should track the program and talk to
the person running the program and maybe do a test run or two to sort of
profile it.
May 13 12:51:02 <grondo> uh, there isn't really a common program
between all the failures
May 13 12:51:35 <neb> It very well might be that they are all doing
somewhat the same thing or maybe even using a particular library
function.
May 13 12:52:19 <neb> If we could watch the sequence of things and then
stop the program when things seem to start going off the rails. it might
give us some idea of what the program is doing at that moment.
May 13 12:52:25 <neb> or what it has just done.
May 13 12:54:09 <grondo> well, as I said I think the problem occurs
perhaps before even main()
May 13 12:54:30 <neb> when did you say that and why do you think that?
May 13 12:56:49 <grondo> based on the timestamps in sar logs
May 13 12:57:10 <grondo> which does only give a 10 minute window
May 13 13:03:05 <grondo> however, it is always within that first 10
minutes
May 13 13:03:37 <grondo> in one case it was within the same minute
May 13 13:03:56 <neb> reading init files?
May 13 13:04:00 <grondo> so it could be something each job (by a
different user, different app) does the same at startup
May 13 13:04:17 <neb> esp if it is a lib function
May 13 13:04:17 <grondo> or the more likely case is that it happens
during launch itself
May 13 13:04:24 <grondo> because that is the same across all jobs
May 13 13:04:53 <neb> I'm not sure that I agree with you on the "more
likely" bit
May 13 13:05:12 <neb> it might be that they all search for initfiles or
something
May 13 13:05:27 <neb> that isn't launch but it might be something that
many apps do.
May 13 13:07:18 <grondo> initfiles?
May 13 13:07:35 <grondo> I think most apps have an input file and a
restart dump
May 13 13:07:49 <grondo> I _highly_ doubt that they search some
kind of path for either of those
May 13 13:08:11 <grondo> however, it still could be something they
do at startup
May 13 13:08:31 <grondo> I just find it more likely that it is the
process of launching, just because that is definitely the same
May 13 13:11:29 <grondo> even jobs using different MPI libraries
are involved
May 13 13:11:37 <grondo> thought maybe the MPI had something to do
with it
May 13 13:12:21 <grondo> one launches out of lustre, another out of
NFS homedir
May 13 13:12:33 <grondo> no luck with leads here
May 13 13:15:25 <neb> yes abstractly everybody has init files. However,
there still may be something about the specific way that they find and
read the init files that might cause the problem
May 13 13:16:27 <neb> and everybody launches too so the same argument
you make about init files applies to that and maybe even more so.
May 13 13:17:05 <neb> e.g. how many open() or access() calls does it
do?
May 13 13:19:44 <neb> is it always just one node / job?
May 13 13:19:57 <neb> could that be MPI rank(0) node?
May 17 11:22:58 <grondo> ugh. lots more hits on the dcache problem
May 17 11:25:24 <grondo> I installed a systemtap script that did a
backtrace from the function in dcache.c that first notices nr_unused < 0
May 17 11:25:34 <grondo> however, there doesn't seem to be any
extra clues there
May 17 11:25:48 <grondo> one app hit the issue while doing an open
May 17 11:25:59 <grondo> another hit while doing a stat
May 17 11:26:44 <grondo> I think I will somehow have to understand
the dcache accounting better so I can develop more assertions other than
just nr_unused < 0
May 17 11:27:39 <grondo> something seems to be disrupting the
accounting before, then at some point later the value goes negative as a
result.
May 17 11:29:29 <grondo> and the accounting just goes wildly out of
whack
May 17 11:30:31 <neb> data corruption?
May 17 11:30:49 <neb> maybe it isn't a dcache accounting problem after
all.
May 17 11:30:50 <grondo> could be, but wouldn't that tend to be
more random
May 17 11:31:07 <neb> isn't the trend pretty darn random?
May 17 11:31:21 <neb> that seems to be THE thing about this problem
May 17 11:31:56 <grondo> no, always happens at or near job startup,
always affects the dcache dentry_stat.nr_unused value (or I guess that is
the symptom)
May 17 11:32:32 <grondo> wouldn't memory corruption tend to cause
random effects, a node crash here, invalid value there, etc...
May 17 11:32:45 <neb> not necessarily
May 17 11:33:11 <neb> if it is a constant offset into the structure
May 17 11:34:37 <neb>
http://rhkernel.org/RHEL5+2.6.18-194.el5/fs/dcache.c#L60
May 17 11:35:43 <neb> would you be able to try a test kernel where the
dentry_stat is relocated before the other variables in the file.
May 17 11:35:59 <neb> like put it up near line 60 and then have all the
other variables below it.
May 17 11:36:47 <grondo> does that really affect the layout of
memory?
May 17 11:37:29 <neb> and maybe reorganize the dentry_stat_t type so
that it has nr_unused down farther in the structure.
May 17 11:38:07 <neb> There are some cases where it might not and you
are right that this case might be one of those.
May 17 11:39:11 <grondo> I don't know the rules for that
May 17 11:39:17 <neb> the compiler might want to put the statics in a
bss segment and the dentry_stat_t in a static data section
May 17 11:39:38 <neb> don't worry mere mortals aren't supposed to
care. ;-)
May 17 11:39:41 <grondo>
https://eris.llnl.gov/svn/kernel-exp/vanilla/trunk/patches/rh/linux-2.6-fs-private-dentry-list-to-avoid-dcache_lock-contention.patch
May 17 11:39:53 <grondo> that patch is new with this release
May 17 11:40:23 <grondo> I think it fixed another bug we were
having, but I wonder if I should try backing it out. We have a w/a for the
other bug
May 17 11:40:48 <grondo> note this hunk
May 17 11:40:51 <grondo> @@ -770,8 +769,7 @@ resume:
May 17 11:40:51 <grondo>  * of the unused list for prune_dcache
May 17 11:40:51 <grondo>  */
May 17 11:40:51 <grondo>  if (!atomic_read(&dentry->d_count)) {
May 17 11:40:51 <grondo> -        list_add_tail(&dentry->d_lru, &dentry_unused);
May 17 11:40:53 <neb> grrr fucking firewalls
May 17 11:40:53 <grondo> -        dentry_stat.nr_unused++;
May 17 11:40:56 <grondo> +        list_add_tail(&dentry->d_lru, dispose_list);
May 17 11:40:58 <grondo>          found++;
May 17 11:41:01 <grondo>  }
May 17 11:41:04 <grondo>
May 17 11:41:17 <grondo> I'm sure it is right, but the patch does
definitely modify the dentry_stat accounting
May 17 11:41:28 <grondo> maybe it could have introduced a very rare
leak?
May 17 11:41:59 <grondo> it would be nice if I understood the
invariants of the accounting, which should be pretty simple, I could write
a systemtap script that verified them on every function call
May 17 11:42:32 <grondo> for now, I think I'll capture the last N
values from dentry_stat and spit them out on failure... in case something
jumps out
May 17 15:33:40 <grondo> trying to do something like a circular
buffer log in systemtap
May 17 15:33:53 <grondo> can't find any examples online, and the
obvious things don't work
May 17 15:34:31 <grondo> like head = 0; log [head++] = string; if
(head > max) head = 0;
May 17 15:34:58 <grondo> if I do that I get an error that the array
has exceeded max entries or something
May 17 15:35:07 <neb> an array index % array_size
May 17 15:38:49 <grondo> yeah, basically
May 17 15:38:51 <grondo> that doesn't work
May 17 15:44:27 <grondo> (for some reason I always forget about
mod)
May 17 15:44:35 <grondo> anyhow it should be the same
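The circular log being sketched in this exchange is just a fixed-size ring buffer indexed modulo its size. The script under discussion is SystemTap (and the real obstacle turns out to be a stap bug, discussed below), but the underlying pattern, shown here as a plain C sketch with illustrative names, is:

    #include <stdio.h>
    #include <string.h>

    #define RING_SIZE 100             /* keep only the last 100 entries */

    static char ring[RING_SIZE][128]; /* the log entries themselves */
    static unsigned long head;        /* total number of entries ever logged */

    /* Record one entry, silently overwriting the oldest once full. */
    static void ring_log(const char *msg)
    {
        snprintf(ring[head % RING_SIZE], sizeof(ring[0]), "%s", msg);
        head++;
    }

    /* Dump the surviving entries in the order they were logged. */
    static void ring_dump(void)
    {
        unsigned long start = head > RING_SIZE ? head - RING_SIZE : 0;
        for (unsigned long i = start; i < head; i++)
            printf("%lu: %s\n", i, ring[i % RING_SIZE]);
    }

    int main(void)
    {
        char buf[64];
        for (int i = 0; i < 250; i++) {
            snprintf(buf, sizeof(buf), "dentry_stat sample %d", i);
            ring_log(buf);
        }
        ring_dump();                  /* prints samples 150..249 */
        return 0;
    }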
May 18 08:25:16 * Disconnected (Connection timed out).
May 18 08:53:59 <neb> 940000
May 18 10:15:52 <grondo> nice
May 18 10:26:51 <grondo> if you have any experts on stap, can you
ask the best way to use stap to create a circular log with minimal
impact?
May 18 10:27:38 <grondo> I'm having trouble with exceeding
MAXMAPENTRIES, even though I use essentially array [index % 100]
May 18 10:27:50 <grondo> (i.e. limit to 100 entries)
May 18 10:28:02 <grondo> I also tried doing per-cpu logs with
something like
May 18 10:28:19 <grondo> array [cpu(), index % 100]
May 18 10:28:24 <grondo> without much success
May 18 10:28:45 <neb> can you send me your whole script.
May 18 10:29:12 <grondo> I'd like to basically keep a log of the
last N things that happened, then dump the log on a trigger
May 18 10:30:01 <grondo> hmm, original script isn't really around
anymore because of all the experimentation
May 18 10:33:26 <grondo> let me rewrite it how I think it should
work
May 18 10:33:26 <grondo> the most naive version
May 18 13:19:00 <neb> back working on your stap script
May 18 13:24:46 <neb> try not using the increment and decrement
operators.
May 18 13:25:07 <neb>
https://bugzilla.redhat.com/show_bug.cgi?id=591407
May 18 13:49:40 <neb> try s / global array[100] / global array /
May 18 13:50:31 <neb> this is atlas right?
May 18 13:51:26 <neb> ooh 017007
May 18 13:52:25 <neb> Also you need to update to the newer version of
stap.
May 18 13:53:03 <neb> systemtap-1.1-3.el5
May 18 14:06:41 <neb> grondo tossed the new version of systemtap into
worpi:/tmp
May 18 16:33:58 <neb> did you get the new version of systemtap?
May 18 16:34:16 <neb> I changed location too thinking that was the
problem but it wasn't
May 18 16:35:07 <grondo> yes, not sure if it made it into the image
that Trent is rebooting onto hype now
May 18 16:35:24 <grondo> That version is the RHEL5.5 version, so
hopefully no dependencies gotchas
May 18 16:35:30 <grondo> What exactly was the fix in there?
May 18 16:35:41 <grondo> I think I'm going to order a new bike
soon
May 18 16:36:04 <neb> the stap developers didn't say but they did say
that the error message didn't look right for the script.
May 18 16:36:22 <neb> and because of the text of the error message they
could see that you were running an old version of stap
May 18 16:36:49 <neb> they think that there were array indexing issues in
that version on the edge of the arrays.
May 18 16:37:30 <neb> So they thought that you might be hitting
array[99] and it was telling you that it was an overflow when it was
reasonable, because of a bug in stap
May 18 16:37:42 <neb> they said that it was best to not declare the size
of arrays.
May 18 16:38:27 <grondo> ok, I got that error even if I didn't
declare the size of the array
May 18 16:38:40 <grondo> will try the new stap once hype is back
up
May 18 16:39:01 <grondo> in theory what I'm doing should work
though?
May 18 16:39:33 <grondo> the other question I have is how to probe
multiple functions with the same "probe" line
May 18 16:39:50 <grondo> the "*@fs/dcache.c" gets too much
May 18 16:39:59 <grondo> it even probes stuff _not_ in dcache.c
strangely
May 18 16:40:36 <grondo> oh, interesting
May 18 16:40:42 <grondo> I can't recreate that problem now
May 18 16:41:20 <grondo> ah, Trent _did_ get the new systemtap
May 18 16:42:53 <grondo> cool, seems to work better now (so far so
good)
May 18 16:43:00 <grondo> too bad I can't use it on the production
systems
May 18 16:43:02 <grondo> :-(
May 18 16:43:57 <grondo> cool that the script works now, tho
May 18 16:44:09 <neb> OK so it appears to be the stap script.
May 18 16:44:38 <neb> Can't you just install the new stap packages into
the writable
May 18 16:44:42 <neb> err I meant:
May 18 16:44:52 <neb> OK so it appears to be the stap binary not the
script
May 18 16:44:57 <neb> that makes more sense.
May 18 16:45:27 <neb> keep in mind with stap that it is a new and
evolving tool. It may not always be your skill it can also be the binary.
May 18 18:00:28 <neb> so does it look like your stap script will work.
May 18 18:00:50 <neb> other than the multiprobe thing are there any
additional things you need me to track down for you regarding that?
May 18 18:01:08 <grondo> not at the moment
May 20 11:12:53 <grondo> On the dcache problem --- I implemented
checks for dentry_stat.nr_unused < 0 in the SLURM prolog and epilog
May 20 11:13:00 <grondo> to catch the problem before the next job
May 20 11:13:17 <grondo> A couple nodes triggered last night
May 20 11:13:44 <grondo> however, one of them didn't have the
signature negative value in /proc/sys/fs/dentry-state
May 20 11:13:48 <neb> They seem to implicitly see it as a runtime
environment and the notion that it is used as a production development
environment for some larger customers seems to have escaped them.
May 20 11:13:51 <grondo> # atlas449 /root > cat
/proc/sys/fs/dentry-state
May 20 11:13:51 <grondo> 12040 4530 45 0 0 0
May 20 11:14:40 <grondo> however, right when I started running
drop_caches, it went negative!
May 20 11:14:44 <grondo> grondo@atlas449 ~ > cat
/proc/sys/fs/dentry-state
May 20 11:14:44 <grondo> 2841 -3402 45 0 0 0
May 20 11:15:23 <grondo> I'm not sure what to make of that
May 20 11:16:10 <neb> so how did the slurm check catch it if
dentry_stat.nr_unused was still positive?
May 20 11:16:25 <grondo> yes, that is the first question I had
May 20 11:16:26 <neb> re: dev env yeah
May 20 11:16:37 <grondo> it must have been negative at the time of
the check
May 20 11:16:40 <neb> could it be intermittent?
May 20 11:16:41 <neb> yeah
May 20 11:16:50 <grondo> then went back positive?
May 20 11:16:59 <grondo> I don't really understand how that could
May 20 11:17:00 <grondo> be
May 20 11:17:32 <grondo> I really wish I could install the new
systemtap on atlas so I could run my script that captures the history of
the dentry_stat values
May 20 11:18:06 <grondo> I really have no idea what is going on
here
May 20 11:18:19 <grondo> I think my only recourse right now is to
try to reproduce on hype
May 20 11:22:53 <neb> have you come up with any clues on how to
reproduce the problem?
May 20 11:23:08 <neb> have you tried removing that one patch which you
thought might be implicated?
May 20 11:25:13 <grondo> no, and no, because I can't recreate it
May 20 16:54:06 <grondo> stuck on a systemtap thing again
May 20 16:54:26 <grondo> I want to resolve the address of a list to
pass it to a function
May 20 16:54:38 <grondo> I'm basically doing l = $dentry_unused
May 20 16:54:42 <grondo> but I'm getting:
May 20 16:54:52 <grondo> semantic error: struct/union 'list_head'
is being accessed instead of a member of the struct/union: identifier
'$dentry_unused' at /g/g0/grondo/proj/stap/dcache-count.stp:19:11
May 20 16:57:22 * neb thinkin
May 20 16:57:26 <neb> err thinking
May 20 16:57:38 <grondo> if instead do $dentry_unused->next
May 20 16:57:46 <grondo> then it gets further, but the error is
confusing
May 20 16:57:56 <grondo> Running make -C
"/lib/modules/2.6.18-90chaos/build" M="/var/tmp/root/stapAHG5Ew"
modules >/dev/null
May 20 16:57:59 <grondo> /var/tmp/root/stapAHG5Ew/stap_23834.c: In
function 'function_dentry_unused_count':
May 20 16:58:01 <grondo> /var/tmp/root/stapAHG5Ew/stap_23834.c:336:
error: 'struct list_head' has no member named 'd_list'
May 20 16:58:04 <grondo> /var/tmp/root/stapAHG5Ew/stap_23834.c:345:
error: 'struct list_head' has no member named '__retvalue'
May 20 16:59:46 <neb> looks like some type polymorphism problems.
May 20 17:00:52 <grondo> hm, the generated code is not as expected
May 20 18:15:51 <grondo> aw, shit
May 20 18:16:17 <grondo> so I have a stap script to compare the
number of dentries in the dentry_unused list to dentry_stat.nr_unused
May 20 18:16:24 <grondo> and prints an error if they don't match
May 20 18:16:37 <grondo> many nodes on atlas don't match
May 20 18:16:54 <grondo> eatlas16: 3184 members of dentry_unused
list. nr_dentry=6865 nr_unused=3182
May 20 18:16:55 <grondo> eatlas7: 3233 members of dentry_unused
list. nr_dentry=6942 nr_unused=3232
May 20 18:16:55 <grondo> eatlas10: 3138 members of dentry_unused
list. nr_dentry=6823 nr_unused=3137
May 20 18:16:55 <grondo> eatlas8: 3250 members of dentry_unused
list. nr_dentry=6941 nr_unused=3246
May 20 18:16:57 <grondo> eatlas28: 3176 members of dentry_unused
list. nr_dentry=6824 nr_unused=3175
May 20 18:16:59 <grondo> eatlas2: 3195 members of dentry_unused
list. nr_dentry=6887 nr_unused=3194
May 20 18:17:02 <grondo> is just a sample
May 20 18:17:29 <grondo> I can "fix" it with the script as well,
but the problem is worse than I thought
May 20 18:17:37 <grondo> well have to run
May 20 18:18:03 <neb> Hmm... so it might be that the problem lies
dormant for a while but then goes critical
May 20 18:18:32 <grondo> yes, those nodes all have the chance for
nr_unused < 0 I believe
May 20 18:19:05 <neb> that is good it seems to push the problem back.
May 20 18:19:15 <grondo> unfortunately to count the dentry_unused
list I have to walk the entire thing
May 20 18:19:27 <neb> maybe the reason you couldn't find the cause is
the cause is long gone before the problem shows up.
May 20 18:19:34 <grondo> yes, could be
May 20 18:19:58 <neb> do you have any 4.2 systems that you can compare
against?
May 20 18:21:57 <neb> did the stap thing work?
May 21 10:19:49 <grondo> so what I'm thinking of doing with the
stap script I have is to run it in the epilog after every job and drain
the node if the size of dentry_unused != dentry_stat.nr_unused. That
=should= mean that the event which caused the problem happened during that
job
May 21 10:20:34 <neb> I would agree.
May 21 10:21:15 <grondo> egad
May 21 10:21:28 <grondo> another bug seems to have hit us on a
*bunch* of nodes
May 21 10:21:47 <neb> with the proviso that it isn't necessarily the
cause but is strongly correlated to the problem occurring.
May 21 10:23:25 <neb> you know me, I really try very hard to not mistake
coincidence & correlation with causality
This event sent from IssueTracker by woodard [LLNL (HPC)]
issue 937133
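The prolog/epilog checks and monitoring described above all come down to reading /proc/sys/fs/dentry-state, whose first two fields are dentry_stat.nr_dentry and dentry_stat.nr_unused; the symptom chased in this bug is the second field going negative. A minimal C sketch of that kind of check (not the actual SLURM hook, and error handling kept to a minimum):

    #include <stdio.h>

    /*
     * /proc/sys/fs/dentry-state exposes struct dentry_stat_t:
     *   nr_dentry  nr_unused  age_limit  want_pages  (plus two unused fields)
     */
    int main(void)
    {
        long nr_dentry, nr_unused;
        FILE *f = fopen("/proc/sys/fs/dentry-state", "r");

        if (!f) {
            perror("dentry-state");
            return 2;
        }
        if (fscanf(f, "%ld %ld", &nr_dentry, &nr_unused) != 2) {
            fprintf(stderr, "unexpected dentry-state format\n");
            fclose(f);
            return 2;
        }
        fclose(f);

        printf("nr_dentry=%ld nr_unused=%ld\n", nr_dentry, nr_unused);
        if (nr_unused < 0) {
            fprintf(stderr, "dcache accounting is off: nr_unused < 0\n");
            return 1;   /* a prolog/epilog wrapper could drain the node here */
        }
        return 0;
    }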
Event posted on 2010-05-26 15:13 PDT by woodard May 25 10:22:19 <grondo> hmm, it might be that the jobs causing the errors in dentry_stat.nr_unused are the same ones that caused our dcache softlockups in nfs May 25 10:22:35 <grondo> I hope I still have that reproducer May 25 10:26:58 <neb> interesting May 25 10:27:28 <neb> are these the one that search a bunch of places for a table of strings. May 25 10:27:38 <grondo> the reason I think this is that the job that hit the early detector had been canceled by timeout, and that was the case that reproduced the old bug May 25 10:27:39 <neb> do you think it is NFS or might it be lustre? May 25 10:27:58 <grondo> I believe it is NFS May 25 10:28:09 <grondo> but I don't think that is the bug May 25 10:28:25 <grondo> THis is totally guesswork right now May 25 10:28:32 <neb> did we actually have a repro for that bug/ May 25 10:28:59 <grondo> a patch went into 5.4 to fix the problem we saw before, where a special case of negative dentry validation caused a corner case to hit in the dccache and resulted in softlockups May 25 10:29:10 <neb> right I remember that one. May 25 10:29:26 <grondo> the problem occurred fairly rarely in production, but we could reproduce it by populating the dcache and running a certain user job then canceling it May 25 10:29:45 <grondo> the problem seemed to be triggered by all 8 tasks of the job getting sigterm at the same time May 25 10:30:04 <grondo> a signal handler would search the msg catalog path for non-existent files, triggering the revalidation May 25 10:30:27 <grondo> it also seemed like we were *much* more likely to reproduce when we had the job container module May 25 10:30:46 <grondo> in fact, I don't think any other RH customers saw this problem May 25 10:31:21 <neb> https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=178773 May 25 10:31:34 <grondo> so now there is a fix for some similar kind of softlockup in the dcache, which moves some stuff to a private list May 25 10:31:51 <grondo> my guess is that in that codepath, when we hit the same rare case we hit before, the accounting for nr_unused isn't quite right May 25 10:32:14 <grondo> https://chaos.llnl.gov/bugzilla/show_bug.cgi?id=285 May 25 10:32:14 <grondo> is our local casxe May 25 10:32:22 <grondo> sadly I didn't document how to run the reproducer May 25 10:32:31 <grondo> I vaguely remmeber, but I have to find the user code May 25 10:32:54 <grondo> it also seemed like any code compiled by the intel compiler was vulnerable May 25 10:33:10 * neb still can't get to chaos.llnl.gov due to silly firewall rules and crazy network people. May 25 10:33:24 <grondo> oh geez May 25 10:33:42 <grondo> ok, found the code May 25 10:34:22 <neb> is it something kent can run along with your stap script so that we can see if it happens on stock rhel kernels. May 25 10:34:54 <grondo> I don't think so May 25 10:34:54 <grondo> for some reason I could never make a standalone reproducer May 25 10:35:57 <grondo> the theory was that the code searched for the message catalogs once, thus creating the negative dcache entries. 
some time later, after the entries have expired, it gets a signal, looks for them again and triggers revalidation
May 25 10:36:24 <grondo> some kind of error comes back from the server, or maybe not, but this triggers shrink_dcache_parent with some kind of huge number
May 25 10:36:24 <grondo> that is what I remember
May 25 10:36:24 <grondo> but I tried to create a reproducer based on that and failed
May 25 10:37:29 <grondo> it also seemed important
May 25 10:37:29 <grondo> to have the job run in the sgi job container module
May 25 10:37:56 <neb> so now you aren't running with the sgi container modules?
May 25 10:38:08 <grondo> we are
May 25 10:38:24 <grondo> I'm just saying that I couldn't make my own reproducer before
May 25 10:38:38 <grondo> even though it should have been easy based on the theory of how the bug was triggered
May 25 10:40:30 <grondo> However, the necessity of the job container module explains why nobody else seems to see the problem
May 25 10:41:24 <grondo> dammit why didn't I document how to run the reproducer. stupid!
May 25 10:41:33 <neb> why do you think that the job module matters
May 25 10:42:24 <grondo> we supposed before that it was because the job container allows all tasks of the job to be signaled virtually at the same time
May 25 10:42:24 <grondo> slurm doesn't signal the individual tasks, it signals the job container as a whole
May 25 10:43:30 <neb> Oh yeah. I remember coming to that now.
May 25 10:43:47 <neb> I did a lot of playing with signals to try to make that work when I was trying to come up with a repro
May 25 10:43:54 <grondo> yes
May 25 10:44:12 <grondo> we just couldn't make it work
May 25 10:44:21 <grondo> hold on, I'm going to search email for the steps to reproduce
May 25 13:14:22 <neb> for an issue like this one, it is weird I keep on saying "we" but it feels wrong because you did most of the work yet I feel like we have been consulting and brainstorming all along.
May 25 13:16:36 <grondo> yes, we have
May 25 13:17:37 <grondo> doesn't bother me if you say we
May 25 13:23:03 <neb> is it even possible for you to test to see if the problem occurs without the job module?
May 25 13:27:46 <neb> Some questions I have, I wonder if you could reproduce it on a UP box or if it is a race between two procs manipulating the dcache list
May 25 13:28:04 <neb> is this the basic idea:
May 25 13:28:19 <neb> 1) make a lot of negative dcache entries
May 25 13:28:23 <neb> 2) wait 15 min
May 25 13:28:49 <neb> 3) exercise those negative dcache entries.
May 25 13:29:56 <grondo> all the dcache lists should be protected by dcache_lock
May 25 13:30:05 <grondo> I think you just have to make a lot of unused dentries
May 25 13:30:08 <grondo> e.g. find /
May 25 13:30:18 <grondo> although that doesn't even seem to be all that necessary
May 25 13:30:27 <neb> my idea was
May 25 13:30:29 <grondo> then make a _few_ negative dentries
May 25 13:30:35 <grondo> wait for them to expire
May 25 13:30:35 <neb> only a few
May 25 13:30:43 <grondo> then hit them again from all cpus
May 25 13:30:50 <grondo> that is my guess as to how it works
May 25 13:30:56 <grondo> but I swear we tried that before
May 25 13:31:07 <neb> I was going to try:
May 25 13:33:55 <neb> find /nfsmnt/ -type d; do find /usr/src/linux -type f | while read line;do name=`basename $line`;if [ -e $name ]; then ;; fi; done; done; sleep 900
May 25 13:34:14 <neb> then for all the CPUs do the same thing.
May 25 13:34:26 <grondo> try it
May 25 13:34:31 <neb> working on it now.
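A rough userspace sketch of the reproducer idea outlined just above: create negative dentries on an NFS mount, let them expire, then hit them again from a SIGTERM handler, roughly the way the Intel runtime's NLSPATH search does when every task is signalled at once. The directory, counts, and sleep time are placeholders; this is only an approximation of the workload described, not the actual LLNL reproducer:

    #include <signal.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define NFS_DIR "/nfsmnt/negdent"   /* placeholder: any NFSv3-backed directory */
    #define N_NAMES 1000

    static char paths[N_NAMES][64];

    /* stat() a batch of names that do not exist; each failed lookup
     * leaves a negative dentry in the dcache. */
    static void lookup_all(void)
    {
        struct stat st;
        for (int i = 0; i < N_NAMES; i++)
            stat(paths[i], &st);        /* expected to fail with ENOENT */
    }

    /* On SIGTERM, re-walk the same names, forcing revalidation of the
     * (by now expired) negative dentries -- roughly what the Intel
     * runtime's message-catalog search did on cancelation. */
    static void on_sigterm(int sig)
    {
        (void)sig;
        lookup_all();
        _exit(0);
    }

    int main(void)
    {
        for (int i = 0; i < N_NAMES; i++)
            snprintf(paths[i], sizeof(paths[i]), NFS_DIR "/no-such-file.%d", i);

        signal(SIGTERM, on_sigterm);
        lookup_all();                   /* 1: populate negative dentries      */
        sleep(900);                     /* 2: let them expire                 */
        pause();                        /* 3: cancel the job so every task    */
                                        /*    gets SIGTERM at the same moment */
        return 0;
    }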
This event sent from IssueTracker by woodard [LLNL (HPC)]
issue 937133

Event posted on 2010-05-27 04:43 PDT by woodard

Problem history:

This problem initially manifested as a hang during our job prolog script, run before each job to prepare the node for work. This script does the equivalent of echo 2 > /proc/sys/vm/drop_caches so that nodes handed over to customers start out with a similar memory profile. On some nodes we would see the script hang forever at this point, chewing up CPU time. After some investigation, it was found that dentry_stat.nr_unused had somehow gone negative, and this was causing an endless cycle under shrink_dcache_memory. How this value got to be negative was a mystery, however.

We installed some hooks into our job prolog and epilog to track dentry_stat values, and to notify us whenever nr_unused < 0. This was happening at a rate of a few nodes per day on a 1152 node machine, but there did not appear to be any commonality between the jobs.

The next thing we tried was a systemtap script that compared the actual length of the dentry_unused list to the value of nr_unused in dentry_stat, on the theory that perhaps the count was off way before it actually went negative (and thus the job affected wasn't actually the cause). We ran this stap script after every job, and we found that the jobs for which the count was off all had these things in common:

* Were canceled by a signal (timed out or canceled), so all tasks in the job were killed by a signal
* Used the Intel compiler

This bore a strong resemblance to *another* dcache bug we had hit. I tried the reproducer from that other bug (178773) and found that I could reproduce the problem (length(dentry_unused) != dentry_stat.nr_unused) at will. The key features of the reproducer, I believe, are the following:

- The reproducer, or perhaps some previous job, adds some negative dentries to the cache when looking for the intel message catalogs along the default NLSPATH
- while the job is running, those negative dentries expire
- when the job is signaled due to timeout or cancellation, *all* tasks get SIGTERM at virtually the same moment. This triggers a handler installed by the intel compiler which searches the NLSPATH again, hitting the negative dentries and requiring some kind of revalidation. (We are running on top of NFSv3)

This seems to be the trigger for the bug, although I earlier failed to make a reproducer for 178773 based on this information, so it might not be complete. However, a proven workaround for this problem is to prepend the path to the fortran message catalogs to the NLSPATH.

I have also shown that by removing the RHEL patch linux-2.6-fs-private-dentry-list-to-avoid-dcache_lock-contention.patch from the kernel, I can no longer reproduce this bug. However, that patch is thought to be the fix for 178773 so I'd like to keep it if at all possible.

Version changed from '5.4' to '4.6'

This event sent from IssueTracker by woodard issue 937133

Created attachment 417328 [details]
Possible fix
Ok, so based on what you guys are telling me, you get hung up in prune_dcache, and this comes from shrink_dcache_memory, from shrink_slab. What happens is shrink_slab does this

    nr = shrink_dcache_memory(0, mask);

where nr ends up being some negative number because we use nr_unused to calculate this number, then we call back with

    shrink_dcache_memory(nr, mask)

and in prune_dcache we do

    for (; nr; nr--) {
            shrink dcache
    }

which will go on _forever_ if nr is negative. So what is happening is we have a bunch of places where we do

    if (!list_empty(&dentry->d_lru)) {
            nr_unused--;
            list_del_init(&dentry->d_lru);
    }

but if the dentry is on our own private dispose_list, we have already taken care of the nr_unused counter, so nr_unused gets decremented again. So instead of not touching nr_unused when we add to and take away from the dispose_list, treat dispose_list just like we do dentry_unused: do nr_unused++ when we add to the dispose_list and nr_unused-- when we take away from the dispose_list. This will keep everything consistent and hopefully stop your hangs. Thanks,
Josef
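To make the double decrement concrete, here is a small userspace simulation of the accounting Josef describes (hypothetical names, not kernel code and not the actual patch): entries moved to a private dispose list have already had nr_unused dropped once, so a later "if it is still on a list, decrement" cleanup path subtracts them a second time, the counter drifts negative, and a negative count fed into prune_dcache's for (; nr; nr--) loop never terminates.

    #include <stdio.h>

    #define NDENTRIES 5

    struct fake_dentry {
        int on_a_list;               /* stand-in for !list_empty(&dentry->d_lru) */
    };

    static struct fake_dentry unused[NDENTRIES];
    static int nr_unused;

    /* select_parent()-style pass: move everything to a private dispose list.
     * The buggy variant drops nr_unused here even though the entry still
     * looks like it is on a list afterwards. */
    static void move_to_dispose_list(int fixed)
    {
        for (int i = 0; i < NDENTRIES; i++) {
            nr_unused--;             /* taken off dentry_unused               */
            if (fixed)
                nr_unused++;         /* fixed variant: entries on the dispose */
                                     /* list stay counted in nr_unused        */
            unused[i].on_a_list = 1; /* entry now sits on the dispose list    */
        }
    }

    /* Generic cleanup path: "if the dentry is still on a list, take it off
     * and drop the counter" -- the second decrement in the buggy variant. */
    static void final_cleanup(void)
    {
        for (int i = 0; i < NDENTRIES; i++) {
            if (unused[i].on_a_list) {
                nr_unused--;
                unused[i].on_a_list = 0;
            }
        }
    }

    static void run(int fixed)
    {
        nr_unused = NDENTRIES;
        for (int i = 0; i < NDENTRIES; i++)
            unused[i].on_a_list = 1;
        move_to_dispose_list(fixed);
        final_cleanup();
        printf("%s accounting: nr_unused ends up at %d (list is empty)\n",
               fixed ? "fixed" : "buggy", nr_unused);
    }

    int main(void)
    {
        run(0);   /* buggy: ends at -5; a negative count handed to the   */
        run(1);   /* prune loop is what makes the node spin forever      */
        return 0;
    }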
I'm still waiting on confirmation that the patch fixes the problem. As soon as I have confirmation that it makes the problem go away I will post it.

*** Bug 625749 has been marked as a duplicate of this bug. ***

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Posted 8/23.

In kernel-2.6.18-214.el5. You can download this test kernel from http://people.redhat.com/jwilson/el5. Detailed testing feedback is always welcomed.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html