Bug 1903387

Summary: fsfreeze of xfs filesystem can be significantly delayed in xfs_wait_buftarg if a process continues to grab and release buffers
Product: Red Hat Enterprise Linux 8
Component: kernel
Kernel sub component: XFS
Version: 8.3
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: Frank Sorenson <fsorenso>
Assignee: Brian Foster <bfoster>
QA Contact: Zorro Lang <zlang>
CC: bfoster, dchinner, gsierohu, jbreitwe, jshivers, peter.vreman, xzhou, zlang
Keywords: Reproducer, Triaged
Target Milestone: rc   
Target Release: 8.0   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: kernel-4.18.0-283.el8
Type: Bug
Last Closed: 2021-05-18 14:23:09 UTC
Attachments: program to create directory/file tree

Description Frank Sorenson 2020-12-01 23:52:31 UTC
Created attachment 1735427 [details]
program to create directory/file tree

Description of problem:

If a process is rapidly grabbing and releasing buffers, an fsfreeze can experience a lengthy delay while waiting in xfs_wait_buftarg() for the buffer LRU list to become empty.


Version-Release number of selected component (if applicable):

  RHEL 8 kernels - 4.18.0-240.5.1.el8_3.x86_64


How reproducible:

  easy, see below


Steps to Reproduce:

  create a large directory tree
  start 'find'
  attempt to 'fsfreeze'

  particular steps:
    truncate -s 30G /var/tmp/test.img
    mkfs.xfs /var/tmp/test.img
    
    fspath=/mnt/tmp
    mkdir -p $fspath
    mount /var/tmp/test.img $fspath

    gcc create_tree.c -o create_tree
    ./create_tree $fspath

	(creates a 4-deep tree of 1 million leaf directories, each containing 10 files; a hedged sketch of such a program follows these steps)

    sysctl vm.drop_caches=3


    find $fspath >/dev/null 2>&1 &
    time fsfreeze -f $fspath

(cleanup)
    killall find
    fsfreeze -u $fspath
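
The attached create_tree.c (attachment 1735427) is not reproduced here. For reference, a minimal hypothetical sketch of a program that builds a similar tree follows; the DEPTH, FANOUT, and FILES_PER_LEAF constants and the make_level() helper are assumptions chosen only to approximate the described 4-deep tree of roughly 1 million leaf directories with 10 files each, and the real attachment may differ.

    /* Hypothetical sketch, not the actual attachment. */
    #include <errno.h>
    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define DEPTH          4
    #define FANOUT         32      /* 32^4 = 1,048,576 leaf directories */
    #define FILES_PER_LEAF 10

    static void make_level(const char *base, int depth)
    {
            char path[PATH_MAX];
            int i;

            if (depth == 0) {
                    /* leaf directory: populate it with empty files */
                    for (i = 0; i < FILES_PER_LEAF; i++) {
                            int fd;

                            snprintf(path, sizeof(path), "%s/file%d", base, i);
                            fd = open(path, O_CREAT | O_WRONLY, 0644);
                            if (fd >= 0)
                                    close(fd);
                    }
                    return;
            }

            for (i = 0; i < FANOUT; i++) {
                    snprintf(path, sizeof(path), "%s/d%d", base, i);
                    if (mkdir(path, 0755) && errno != EEXIST) {
                            perror(path);
                            exit(1);
                    }
                    make_level(path, depth - 1);
            }
    }

    int main(int argc, char **argv)
    {
            if (argc != 2) {
                    fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
                    return 1;
            }
            make_level(argv[1], DEPTH);
            return 0;
    }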


Actual results:

	# find $fspath >/dev/null 2>&1 & time fsfreeze -f $fspath ; killall find ; fsfreeze -u $fspath ; wait
	[14] 356561

	real	3m23.074s
	user	0m0.000s
	sys	0m1.775s
	[14]   Terminated              find $fspath > /dev/null 2>&1


Expected results:

fsfreeze completes quickly



Additional info:

  if the process is writing to the same filesystem, the freeze will happen quickly:

    # find $fspath >$fspath/find.out 2>&1 & time fsfreeze -f $fspath ; killall find ; fsfreeze -u $fspath ; wait
	[13] 357664

	real	0m0.122s
	user	0m0.002s
	sys	0m0.004s
	[13]   Terminated              find $fspath > $fspath/find.out 2>&1

Comment 1 Frank Sorenson 2020-12-01 23:57:34 UTC
Gonzalo noted the following change between RHEL 7 and RHEL 8:

RHEL 7
xfs_wait_buftarg(
        struct xfs_buftarg      *btp)
{
        struct xfs_buf          *bp;
        LIST_HEAD(dispose);
...
        flush_workqueue(btp->bt_mount->m_buf_workqueue);

restart:
        spin_lock(&btp->bt_lru_lock);              <<--------- This prevents new xfs_bufs from going onto the LRU
        while (!list_empty(&btp->bt_lru)) {
                bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
...
        spin_unlock(&btp->bt_lru_lock);



In RHEL 8 we're not protecting the list, so new bufs come in over and over, and this is what prevents the wait from finishing:

xfs_wait_buftarg(
        struct xfs_buftarg      *btp)
{
        LIST_HEAD(dispose);
        int loop = 0;

        /*
         * First wait on the buftarg I/O count for all in-flight buffers to be
         * released. This is critical as new buffers do not make the LRU until
         * they are released.
         *
         * Next, flush the buffer workqueue to ensure all completion processing
         * has finished. Just waiting on buffer locks is not sufficient for
         * async IO as the reference count held over IO is not released until
         * after the buffer lock is dropped. Hence we need to ensure here that
         * all reference counts have been dropped before we start walking the
         * LRU list.
         */
        while (percpu_counter_sum(&btp->bt_io_count))
                delay(100);
        flush_workqueue(btp->bt_mount->m_buf_workqueue);

        /* loop until there is nothing left on the lru list. */
        while (list_lru_count(&btp->bt_lru)) {
                list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
                              &dispose, LONG_MAX);

Comment 2 Gonzalo Siero 2020-12-02 11:07:54 UTC
I believe the problem is not with the protection of the list, but with the fact that we are now dealing with the dispose list *inside* the loop that walks the LRU.

In cases where there is I/O in progress (reads), we could end up with lots of bufs to deal with in the dispose list.
By doing this, we're giving other tasks the opportunity to get more bufs onto the LRU, and we end up waiting for a very long time...

In our test case, we are adding xfs_bufs here:

 0xffffffff81c5c6c0 : list_lru_add+0x0/0x140 [kernel]
 0xffffffffc02e1de7 : xfs_buf_rele+0x147/0x360 [xfs]
 0xffffffffc02ccbd7 : xfs_iread+0x167/0x1f0 [xfs]
 0xffffffffc02ed471 : xfs_iget+0x211/0x950 [xfs]
 0xffffffffc02f838f : xfs_lookup+0xdf/0x120 [xfs]
 0xffffffffc02f4650 : xfs_vn_lookup+0x70/0xa0 [xfs]
 0xffffffff81ce6f37 : __lookup_slow+0x97/0x150 [kernel]
 0xffffffff81ce7025 : lookup_slow+0x35/0x50 [kernel]
 0xffffffff81ce71ff : walk_component+0x1bf/0x2f0 [kernel]
 0xffffffff81ce7fb5 : path_lookupat.isra.48+0x75/0x200 [kernel]
 0xffffffff81cec4f0 : filename_lookup.part.64+0xa0/0x170 [kernel]
 0xffffffff81cdf954 : vfs_statx+0x74/0xd0 [kernel]
 0xffffffff81ce0019 : __do_sys_newlstat+0x39/0x70 [kernel]
 0xffffffff81a0419b : do_syscall_64+0x5b/0x1a0 [kernel]
 0xffffffff824000ad : entry_SYSCALL_64_after_hwframe+0x65/0xca [kernel]
 0xffffffff824000ad : entry_SYSCALL_64_after_hwframe+0x65/0xca [kernel]

xfs_buftarg->bt_lru: ffff94aa3139c480 129335

 0xffffffff81c5c6c0 : list_lru_add+0x0/0x140 [kernel]
 0xffffffffc02e1de7 : xfs_buf_rele+0x147/0x360 [xfs]
 0xffffffffc02ba85c : xfs_dir2_block_lookup+0xbc/0x130 [xfs]
 0xffffffffc02b92be : xfs_dir_lookup+0x1ae/0x1d0 [xfs]
 0xffffffffc02f8308 : xfs_lookup+0x58/0x120 [xfs]
 0xffffffffc02f4650 : xfs_vn_lookup+0x70/0xa0 [xfs]
 0xffffffff81ce6f37 : __lookup_slow+0x97/0x150 [kernel]
 0xffffffff81ce7025 : lookup_slow+0x35/0x50 [kernel]
 0xffffffff81ce71ff : walk_component+0x1bf/0x2f0 [kernel]
 0xffffffff81ce7fb5 : path_lookupat.isra.48+0x75/0x200 [kernel]
 0xffffffff81cec4f0 : filename_lookup.part.64+0xa0/0x170 [kernel]
 0xffffffff81cdf954 : vfs_statx+0x74/0xd0 [kernel]
 0xffffffff81ce0019 : __do_sys_newlstat+0x39/0x70 [kernel]
 0xffffffff81a0419b : do_syscall_64+0x5b/0x1a0 [kernel]
 0xffffffff824000ad : entry_SYSCALL_64_after_hwframe+0x65/0xca [kernel]
 0xffffffff824000ad : entry_SYSCALL_64_after_hwframe+0x65/0xca [kernel]

xfs_buftarg->bt_lru: ffff94aa3139c480 4071

Comment 4 Brian Foster 2020-12-02 13:55:25 UTC
(In reply to Frank Sorenson from comment #0)
...
> Actual results:
> 
> 	# find $fspath >/dev/null 2>&1 & time fsfreeze -f $fspath ; killall find ;
> fsfreeze -u $fspath ; wait
> 	[14] 356561
> 
> 	real	3m23.074s
> 	user	0m0.000s
> 	sys	0m1.775s
> 	[14]   Terminated              find $fspath > /dev/null 2>&1
> 

Ah, I see. We're only doing reads, so the freeze ends up blocked behind completion of the associated task. While I can see how the change in xfs_wait_buftarg() can introduce this behavior, I don't think the old behavior was necessarily correct. The intent of xfs_wait_buftarg() is to reclaim all buffers and drain the LRU; we just generally expect to be in a quiescent state when that occurs. Freeze only blocks writes, so a continuous read workload would never fall into the blocking freeze protection paths, and thus we may end up having to wait for such a workload to fully complete.

> 
...
>   if the process is writing to the same filesystem, the freeze will happen
> quickly:
> 
>     # find $fspath >$fspath/find.out 2>&1 & time fsfreeze -f $fspath ;
> killall find ; fsfreeze -u $fspath ; wait
> 	[13] 357664
> 
> 	real	0m0.122s
> 	user	0m0.002s
> 	sys	0m0.004s
> 	[13]   Terminated              find $fspath > $fspath/find.out 2>&1

I suspect this occurs because the write path blocks on the freeze (i.e. sb_start_write() and friends) and thus fairly quickly blocks the task, which then remains blocked until the filesystem is unfrozen. The read-only variant is fully capable of running to completion on a frozen fs, so really there's no need to quiesce it in the first place. I think we've wondered in the past whether it's really necessary to drain the LRU (reclaiming all buffers) on freeze. I'm not sure that it is, so that might be the right approach to fix this particular performance issue.
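
To illustrate the asymmetry described above, here is a simplified sketch of the general freeze-protection pattern in a kernel write path. This is not actual XFS code: hypothetical_write() and do_the_write() are made-up placeholders. The point is that write entry points bracket their work with sb_start_write()/sb_end_write(), so a writer sleeps as soon as the freeze begins, while a read-only lookup path takes no such protection and keeps repopulating the buffer LRU.

    #include <linux/fs.h>

    /* do_the_write() stands in for the filesystem's real write implementation */
    static ssize_t do_the_write(struct file *file, const char __user *buf,
                                size_t count);

    ssize_t
    hypothetical_write(struct file *file, const char __user *buf, size_t count)
    {
            struct super_block      *sb = file_inode(file)->i_sb;
            ssize_t                 ret;

            sb_start_write(sb);     /* sleeps here once fsfreeze has started */
            ret = do_the_write(file, buf, count);
            sb_end_write(sb);       /* freeze proceeds once writers drain */
            return ret;
    }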

Comment 5 Dave Chinner 2020-12-08 03:14:07 UTC
(In reply to Gonzalo Siero from comment #2)
> I believe that the problem is not with the protection of the list, but with
> the fact that we are now dealing with dispose list *inside* the loop that is
> dealing with LRU.

No, that's not the problem at all. The problem is that we are trying to dispose of the buffer cache on an active filesystem that can still be filling the cache at the same time, i.e. any change to the LRU code is trying to patch the symptom, not fix the root cause.

When we are freezing the filesystem all we need to do is wait for all in-flight buffer IO to complete - as Brian indicates, we do not need to empty the cache because the filesystem is still active and mounted. We only need to empty the cache after log recovery and at unmount time, so we should be able to avoid this problem just by avoiding the cache emptying in the freeze path.

So, really, I think the fix is to split xfs_wait_buftarg() into two parts - one to wait for in-flight IO to complete and flush the endio work queues, the other to empty the cache - and then to have the freeze path only wait for IO to complete....

-Dave.
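
A hedged sketch of the split described here, modeled on the RHEL 8 code quoted in comment 1, follows. The xfs_buftarg_wait()/xfs_buftarg_drain() names are taken from the upstream commit titles listed in comment 9, but the actual upstream implementation may differ in detail: one helper only waits for in-flight I/O and flushes end-I/O processing, while the full drain used at unmount and log-recovery time calls it before emptying the LRU.

    /* Wait for in-flight buffer I/O and flush completion processing only. */
    void
    xfs_buftarg_wait(
            struct xfs_buftarg      *btp)
    {
            while (percpu_counter_sum(&btp->bt_io_count))
                    delay(100);
            flush_workqueue(btp->bt_mount->m_buf_workqueue);
    }

    /* Full drain for unmount/log recovery: wait for I/O, then empty the LRU. */
    void
    xfs_buftarg_drain(
            struct xfs_buftarg      *btp)
    {
            LIST_HEAD(dispose);

            xfs_buftarg_wait(btp);

            /* loop until there is nothing left on the lru list. */
            while (list_lru_count(&btp->bt_lru)) {
                    list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
                                  &dispose, LONG_MAX);
                    /* ... dispose of the collected buffers here ... */
            }
    }

The freeze (and read-only remount) path would then call only the wait helper, leaving the LRU intact.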

Comment 6 Brian Foster 2020-12-08 09:13:06 UTC
(In reply to Dave Chinner from comment #5)
> (In reply to Gonzalo Siero from comment #2)
...
> 
> So, really, I think the fix is to split xfs_wait_buftarg() into two parts -
> one to wait for in-flight IO to complete and flush the endio work queues,
> the other to empty the cache - and then to have the freeze path only wait
> for IO to complete....
> 

That's pretty much what I've been playing with locally to address this problem. I'll get patches on the list once I've done more testing.

Comment 7 Gonzalo Siero 2020-12-09 08:27:05 UTC
Thanks a lot Brian and Dave. 

>No, that's not the problem at all.

My comment was more along the lines of what has changed that could trigger the problem now but not in RHEL 7.

Will you need the customer to test a fix, or is that not needed since we have a reproducer?
Let me know.

Regards,
Gonzalo.

Comment 8 Brian Foster 2020-12-09 12:39:48 UTC
(In reply to Gonzalo Siero from comment #7)
...
> Will you need the customer to test a fix, or is that not needed since we
> have a reproducer?
> Let me know.
> 

I can reproduce based on your test case so I don't think that will be necessary. Thank you for distilling the customer issue down into something easily identifiable/reproducible.

Comment 9 Brian Foster 2021-01-26 15:19:33 UTC
Fixes have landed upstream as the following commits:

10fb9ac1251f xfs: rename xfs_wait_buftarg() to xfs_buftarg_drain()
8321ddb2fa29 xfs: don't drain buffer lru on freeze and read-only remount

Comment 14 Jan Stancek 2021-02-02 05:19:06 UTC
Patch(es) available on kernel-4.18.0-280.el8.dt2

Comment 15 Zorro Lang 2021-02-04 06:25:55 UTC
Reproduced on kernel 278.el8, test passed on kernel 280.el8.dt3. I'll write a test case for this bug later; for now, here is how I tested it:
# ./fs_mark -k -S 0 -D 25 -n 400000 -t 25 -s 0 `for i in $(seq 1 25); do echo -n " -d /mnt/scratch/${i}"; done`
# echo 3 >/proc/sys/vm/drop_caches
# find /mnt/scratch >/dev/null

On another console:
[root@hp-dl380pg8-01 ~]# time xfs_freeze -f /mnt/scratch; xfs_freeze -u /mnt/scratch

real    0m1.220s
user    0m0.012s
sys     0m0.125s
[root@hp-dl380pg8-01 ~]# time xfs_freeze -f /mnt/scratch; xfs_freeze -u /mnt/scratch

real    0m1.314s
user    0m0.012s
sys     0m0.126s
[root@hp-dl380pg8-01 ~]# time xfs_freeze -f /mnt/scratch; xfs_freeze -u /mnt/scratch

real    0m0.915s
user    0m0.010s
sys     0m0.129s
[root@hp-dl380pg8-01 ~]# time xfs_freeze -f /mnt/scratch; xfs_freeze -u /mnt/scratch

real    0m1.231s
user    0m0.010s
sys     0m0.144s
[root@hp-dl380pg8-01 ~]# time xfs_freeze -f /mnt/scratch; xfs_freeze -u /mnt/scratch

real    0m0.187s
user    0m0.014s
sys     0m0.121s
....
....

Comment 23 errata-xmlrpc 2021-05-18 14:23:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1578