Bug 1471875 - soft lockups during unmount when dentry cache is very large
Summary: soft lockups during unmount when dentry cache is very large
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: rc
Assignee: Miklos Szeredi
QA Contact: Murphy Zhou
URL:
Whiteboard:
Depends On:
Blocks: 1420851 1469559
 
Reported: 2017-07-17 14:54 UTC by Frank Sorenson
Modified: 2020-12-14 09:07 UTC
12 users

Fixed In Version: kernel-3.10.0-796.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-10 21:22:27 UTC
Target Upstream Version:


Attachments
test patch to limit cond_resched calls to every 10 seconds (874 bytes, patch)
2017-07-17 14:54 UTC, Frank Sorenson
alternative test patch (738 bytes, patch)
2017-10-09 09:58 UTC, Miklos Szeredi
Program to cause the allocation of negative dentries (3.54 KB, text/x-csrc)
2018-02-06 04:51 UTC, Frank Sorenson


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3268671 0 None None None 2018-01-07 00:04:43 UTC
Red Hat Product Errata RHSA-2018:1062 0 None None None 2018-04-10 21:23:42 UTC

Description Frank Sorenson 2017-07-17 14:54:00 UTC
Created attachment 1299886 [details]
test patch to limit cond_resched calls to every 10 seconds

Description of problem:

When system memory is very large and there are millions of objects in the dentry cache, unmounting can spend a very long time in shrink_dcache_for_umount_subtree, blocking other tasks from running on the cpu.



Version-Release number of selected component (if applicable):

RHEL 7.3, kernel 3.10.0-514
kernel 3.10.0-327


How reproducible:

At least two customers report they can trigger this condition with ease.
Reproduction at Red Hat has replicated only the long umount times, not the soft lockups.


Steps to Reproduce:

One customer creates 700,000 directories with 50 million files inside, then unmounts.

Another customer sees this problem after 2-3 days of ordinary (but heavy) system use.  The vmcore indicates approximately 1 billion objects in the dentry slab cache (200 GiB).  The soft lockups occur during umount as well.


Actual results:

The umount runs for a very long time in shrink_dcache_for_umount_subtree, triggering the soft lockups.


Expected results:

no soft lockups; umount gives up the cpu periodically if necessary
also expected: umount completes in a reasonable time


Additional info:

The customer vmcores indicate that it is the migration task which hits the soft lockup while attempting to migrate a process to the cpu currently occupied by the umount process, with the migration task in multi_cpu_stop.  This seems like a separate bug; choosing a busy cpu as the target seems like a poor choice.  Still, shrink_dcache_for_umount_subtree should play more nicely.


The upstream kernel differs from the RHEL 7 code and contains a cond_resched() which should prevent these soft lockups.  RHEL 7 also has the function containing a cond_resched() (shrink_dcache_parent); however, it is not called in the RHEL 7 implementation of the directory-tree walk.


I tried adding a cond_resched inside the loop in shrink_dcache_for_umount_subtree for my own test of 'create 50 million files, then unmount'.  I did not see the soft lockup; however, because the cond_resched was called each time through the loop, the umount got interrupted too frequently when my system had other processes to run.  As a result, the umount took 4-5 hours on a busy system (as opposed to taking only 1-2 minutes, at the risk of soft lockups due to umount monopolizing the cpu).


I think a compromise in this case may be to cond_resched() in the loop, but only periodically (after a certain number of objects have been discarded, or if the loop has been running for longer than a certain amount of time).  I have a concept patch that does this by limiting the cond_resched to only every 10 seconds (or N*HZ jiffies; 10 was just an initial value for the test patch).


vmcore information to be included in a follow-up comment.

Comment 4 Vivek Goyal 2017-07-21 19:03:39 UTC
(In reply to Frank Sorenson from comment #0)
> I tried adding a cond_resched inside the loop in
> shrink_dcache_for_umount_subtree for my own test of 'create 50 million
> files, then unmount'.  I did not see the soft lockup, however because the
> cond_resched was called each time through the loop, when my system had other
> processes to run, the umount got interrupted too frequently.  As a result,
> the umount took 4-5 hours on a busy system (as opposed to umount only taking
> 1-2 minutes, but risking the possibility of soft lockups due to umount
> monopolizing the cpu.
>

Hi Frank,

I have a query about this.  I am reading the upstream code, and IIUC it also interrupts the walk as soon as it has found even one dentry to dispose.

 
select_collect() {
     ...
     ...
        /*
         * We can return to the caller if we have found some (this
         * ensures forward progress). We'll be coming back to find
         * the rest.
         */
        if (!list_empty(&data->dispose))
                ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;
}


Does that mean even upstream will take 4-5 hours to unmount, or is something else going on?  I am wondering why your attempt at putting cond_resched() in the loop did not work.  Especially in the inner loop, where we try to clean up a dentry up to the parent (and stop when we find a sibling).  That loop should ensure that we have cleaned at least one dentry, and that would seem close to what upstream is doing.

So either I understood the code wrong, or upstream also takes a long time to unmount on such systems, or something else is going on.

Comment 5 Frank Sorenson 2017-07-27 04:54:20 UTC
The reproducer needs to be better defined, and I will need to do more testing to see exactly how upstream will behave.

I'm currently on PTO, returning 2017-08-02, so I won't get back to this until that time.


Here are the approximate steps that were reproducing this for me (very much a work in progress, though):


mount_path=/mnt/test

# on a system with 2 numa nodes, pick a cpu on the second node
NCPUS=16
CPU=15

# create directories and files while pinned to that cpu
# (taskset cannot take a parenthesized subshell directly, so wrap the
#  loop in bash -c; "$0" inside the quoted script is the mount path)
taskset -c $CPU bash -c '
    for i in $(seq 0 999) ; do
        for j in $(seq 0 499) ; do
            mkdir -p "$0/$i/$j"
            for k in $(seq 0 99) ; do
                touch "$0/$i/$j/$k"
            done
        done
    done
' "$mount_path"

# start some cpu-intensive processes
# at least as many as cpus in the system
# (may also need to pin the processes to cpus
#   and switch them periodically between cpus)
pids=
for i in $(seq 0 $NCPUS) ; do
  CPU_INTENSIVE_TASK &
  pids="$pids $!"
done

time umount $mount_path

kill $pids


I'll work on this further when I return.

Comment 6 Frank Sorenson 2017-08-14 15:33:52 UTC
I'm still working to solidify the reproducer, and I'm still testing upstream, but I have managed to reproduce the very long umount times quite reliably on the latest RHEL 7.4 kernel.

The issue seems to occur when 'umount' itself is pinned to a single cpu and other processes are using sched_setaffinity() to switch on and off the cpu to which the umount is pinned.  The addition of the cond_resched() in the loop does prevent the soft lockups, but has resulted in the process getting rescheduled frequently.

We have seen this issue in particular in environments such as a docker/openshift shutdown/filesystem unmount (but the unmount is taking hours).

The current reproducer is:
    create or traverse a directory tree on the filesystem
        (I am currently creating 500000 directories with 28000000 files: /home/test/dir_[0-1000]/dir_[0-500]/file_[0-56]; it's what fits on the /home filesystem of the box that I found in beaker.  After creation, simply traversing the tree should populate the dentry and inode (here, xfs_inode) slab caches, and several others.  I can provide the program to do this once I confirm the process on upstream.)
    start some processes which rapidly switch affinity (I am confirming whether 'stress-ng --affinity 0' is sufficient)
    start umount pinned to a cpu (double-checking that '/usr/bin/time -v taskset -c 15 umount /home' is sufficient)

Comment 8 Frank Sorenson 2017-08-18 21:54:02 UTC
(In reply to Vivek Goyal from comment #4)
...
> Does that mean even upstream will take 4-5 hours to unmount or something

Correct.

> else is going on. I am wondering why your attempt of putting cond_resched()
> in loop did not work. Especially, in the inner loop where we try to cleanup

It's not that cond_resched() doesn't work; there are no more soft lockups.  However, it reschedules too frequently.

> a dentry till the parent (and stop till we find sibling). That loop should
> ensure that we have cleaned atleast one dentry and that would seem close to
> what upstream is doing. 

> So either I understood the code wrong, or upstream also takes long time to
> unmount for such systems or something else is going on.

Upstream does indeed take a very long time to unmount.


I grabbed a system with 16 cpus and 32 GiB from beaker, installed 7.4, then compiled and installed 4.12
I created a directory tree of 28 million 0-byte files in 500 thousand directories: $TEST_DIR/dir_[0-1000]/dir_[0-500]/file_[0-56]
(after the directory tree is created, the next test will only require 'find' to populate the dentry and inode caches)

I then started a program that forked 15 children, pinning each to one of the cpus (0-14).  Each then repeatedly calls sched_setaffinity() to switch to cpu 15, and then back to its original cpu.
I then started the 'umount' to run on just cpu 15


prior to starting 'umount', /proc/meminfo showed 31 GiB in slab:
Slab:           31509000 kB


with the following slab caches containing a large number of objects (slab merging bites us here):
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
xfs_inode         24954724 24954998    960   34    8 : tunables    0    0    0 : slabdata 1845977 1845977      0
dentry            17176383 17176824    192   42    2 : tunables    0    0    0 : slabdata 408972 408972      0
Acpi-Namespace    24981215 24983166     40  102    1 : tunables    0    0    0 : slabdata 244933 244933      0
kmalloc-64        49940332 49940800     64   64    1 : tunables    0    0    0 : slabdata 780325 780325      0


# date ; uname -a ; ps -p 7062 -o pid,lstart,etime,args
Fri Aug 18 16:41:38 EDT 2017
Linux sun-x4600-1.gsslab.rdu2.redhat.com 4.12.0 #2 SMP Fri Aug 18 01:21:10 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
  PID                  STARTED     ELAPSED COMMAND
 7062 Fri Aug 18 13:58:11 2017    02:43:26 umount /home

After running nearly 3 hours:
Slab:           31442180 kB

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
xfs_inode         24913338 24919842    960   34    8 : tunables    0    0    0 : slabdata 1844943 1844943      0
dentry            17017171 17039190    192   42    2 : tunables    0    0    0 : slabdata 405695 405695      0
Acpi-Namespace    24940773 24956952     40  102    1 : tunables    0    0    0 : slabdata 244676 244676      0
kmalloc-64        49857493 49894848     64   64    1 : tunables    0    0    0 : slabdata 779607 779607      0

Very little progress has been made.

Comment 10 Miklos Szeredi 2017-10-06 15:49:52 UTC
Digging a bit into the history of this thing...  and adding dhowells as one who was part of this history.

1)

shrink_dcache_sb() says it's used to free the dcache before umount.  Well not in any recent (i.e. going back to 2002) history.

2)
 
v2.6.19 dhowells: c636ebdb186b [PATCH] VFS: Destroy the dentries contributed by a superblock on unmounting

This added shrink_dcache_for_umount().  The apparent advantage over shrink_dcache_parent() is being lockless, though with uncontended spinlocks, where the cacheline needs to be pulled in anyway, the advantage is dubious.

3)

v2.6.38 npiggin: 2fd6b7f50797 fs: dcache scale subdirs

This removed the cond_resched_lock() from shrink_dcache_for_umount().  Not sure if it was intentional.

4)

v3.13 viro: 42c326082d8a switch shrink_dcache_for_umount() to use of d_walk()

Re-adds the cond_resched() as well as the locking.

5)

v3.15 viro: 9c8c10e262e0 more graceful recovery in umount_collect()

Back to shrink_dcache_parent().  Latest version.

Not sure what the best route is for upstream and for RHEL.  Maybe something like the 1024 limit in the latest shrink_dcache_sb() version.

The cond_resched() in the walking part does not even seem sufficient, since we could end up with a very long dispose list and shrink_dentry_list() hogging the CPU.

Comment 11 Miklos Szeredi 2017-10-09 09:58:19 UTC
Created attachment 1336281 [details]
alternative test patch

This one limits cond_resched() calls to one per 1024 dentries scanned.

Comment 12 Steve Whitehouse 2017-10-17 18:23:22 UTC
Miklos, are you waiting for feedback on your comment #11 patch? If so then please set needinfo on this so that support are aware. If not then is there something we can do to move this forward at this stage?

This is on the RPL for 7.5, and we need to figure out whether it is likely to make it into 7.5 based on current progress.

Comment 13 Miklos Szeredi 2017-10-18 07:46:59 UTC
Steve, yes, seems I forgot to set needinfo.

Frank, can you please test, or ask customer to test?  Here are brew builds for 7.5 and 7.3 respectively:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14289356

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14289367

Comment 14 Frank Sorenson 2017-11-03 13:59:53 UTC
The patch appears to strike a good balance between calling cond_resched() to prevent the soft lockups and limiting the frequency so that the unmount does not take forever.  It looks good to me.

Testing requested from the customers, but have not heard back.

Comment 16 Miklos Szeredi 2017-11-06 07:54:36 UTC
Frank, thanks for the feedback.  I'm pretty comfortable with that patch, so for my part this can go into 7.5.

It's still missing qa and pm acks.

Comment 19 Rafael Aquini 2017-11-21 23:35:23 UTC
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Comment 21 Rafael Aquini 2017-11-22 15:02:52 UTC
Patch(es) available on kernel-3.10.0-796.el7

Comment 32 Frank Sorenson 2018-02-06 04:51:37 UTC
Created attachment 1391796 [details]
Program to cause the allocation of negative dentries

The fastest way to fill the dentry cache is with negative dentries, by calling stat() on files that don't exist.  This program does that.

(It does not cause allocation of inodes, etc.)

Comment 34 errata-xmlrpc 2018-04-10 21:22:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1062

