Bug 1903387 - fsfreeze of xfs filesystem can be significantly delayed in xfs_wait_buftarg if a process continues to grab and release buffers
Summary: fsfreeze of xfs filesystem can be significantly delayed in xfs_wait_buftarg if a process continues to grab and release buffers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel
Version: 8.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.0
Assignee: Brian Foster
QA Contact: Zorro Lang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-01 23:52 UTC by Frank Sorenson
Modified: 2021-05-18 14:24 UTC
CC List: 8 users

Fixed In Version: kernel-4.18.0-283.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-18 14:23:09 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
program to create directory/file tree (4.30 KB, text/plain), attached 2020-12-01 23:52 UTC by Frank Sorenson, no flags


Links
Red Hat Knowledge Base (Solution) 5628671, last updated 2020-12-10 15:18:14 UTC

Description Frank Sorenson 2020-12-01 23:52:31 UTC
Created attachment 1735427 [details]
program to create directory/file tree

Description of problem:

If a process keeps quickly grabbing and releasing buffers, a freeze can be significantly delayed while xfs_wait_buftarg() waits for the buffer LRU list to become empty.


Version-Release number of selected component (if applicable):

  RHEL 8 kernels - 4.18.0-240.5.1.el8_3.x86_64


How reproducible:

  easy, see below


Steps to Reproduce:

  create a large directory tree
  start 'find'
  attempt to 'fsfreeze'

  particular steps:
    truncate -s 30G /var/tmp/test.img
    mkfs.xfs /var/tmp/test.img
    
    fspath=/mnt/tmp
    mkdir -p $fspath
    mount /var/tmp/test.img $fspath

    gcc create_tree.c -o create_tree
    ./create_tree $fspath

	(creates a 4-deep tree of 1 million leaf directories, each containing 10 files; see the illustrative sketch after these steps)

    sysctl vm.drop_caches=3


    find $fspath >/dev/null 2>&1 &
    time fsfreeze -f $fspath

(cleanup)
    killall find
    fsfreeze -u $fspath
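
The attached create_tree.c (attachment 1735427) is not reproduced in this report; purely as an illustration, a minimal program with the same effect might look like the sketch below. The per-level fanout of 32 (32^4 is roughly 1 million leaf directories) and every identifier in it are assumptions, not the contents of the actual attachment.

#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define FANOUT	32	/* 32^4 = 1,048,576 leaf directories (assumed) */
#define DEPTH	4	/* directory levels below the mount point */
#define NFILES	10	/* empty files per leaf directory */

static void populate(const char *path, int depth)
{
	char child[PATH_MAX];
	int i;

	if (depth == DEPTH) {
		/* leaf directory: drop in a handful of empty files */
		for (i = 0; i < NFILES; i++) {
			int fd;

			snprintf(child, sizeof(child), "%s/file%d", path, i);
			fd = open(child, O_CREAT | O_WRONLY, 0644);
			if (fd >= 0)
				close(fd);
		}
		return;
	}

	/* interior level: create FANOUT subdirectories and recurse */
	for (i = 0; i < FANOUT; i++) {
		snprintf(child, sizeof(child), "%s/d%d", path, i);
		if (mkdir(child, 0755) && errno != EEXIST) {
			perror("mkdir");
			exit(1);
		}
		populate(child, depth + 1);
	}
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}
	populate(argv[1], 0);
	return 0;
}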


Actual results:

	# find $fspath >/dev/null 2>&1 & time fsfreeze -f $fspath ; killall find ; fsfreeze -u $fspath ; wait
	[14] 356561

	real	3m23.074s
	user	0m0.000s
	sys	0m1.775s
	[14]   Terminated              find $fspath > /dev/null 2>&1


Expected results:

fsfreeze completes quickly



Additional info:

  if the process is writing to the same filesystem, the freeze will happen quickly:

    # find $fspath >$fspath/find.out 2>&1 & time fsfreeze -f $fspath ; killall find ; fsfreeze -u $fspath ; wait
	[13] 357664

	real	0m0.122s
	user	0m0.002s
	sys	0m0.004s
	[13]   Terminated              find $fspath > $fspath/find.out 2>&1

Comment 1 Frank Sorenson 2020-12-01 23:57:34 UTC
Gonzalo noted the following change between RHEL 7 and RHEL 8:

RHEL 7
xfs_wait_buftarg(
        struct xfs_buftarg      *btp)
{
        struct xfs_buf          *bp;
        LIST_HEAD(dispose);
...
        flush_workqueue(btp->bt_mount->m_buf_workqueue);

restart:
        spin_lock(&btp->bt_lru_lock);              <<--------- This prevents new xfs_bufs to go to LRU
        while (!list_empty(&btp->bt_lru)) {
                bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
...
        spin_unlock(&btp->bt_lru_lock);



And in RHEL 8 we're not protecting the list, so new bufs come in over and over, which is what prevents this from finishing:

xfs_wait_buftarg(
        struct xfs_buftarg      *btp)
{
        LIST_HEAD(dispose);
        int loop = 0;

        /*
         * First wait on the buftarg I/O count for all in-flight buffers to be
         * released. This is critical as new buffers do not make the LRU until
         * they are released.
         *
         * Next, flush the buffer workqueue to ensure all completion processing
         * has finished. Just waiting on buffer locks is not sufficient for
         * async IO as the reference count held over IO is not released until
         * after the buffer lock is dropped. Hence we need to ensure here that
         * all reference counts have been dropped before we start walking the
         * LRU list.
         */
        while (percpu_counter_sum(&btp->bt_io_count))
                delay(100);
        flush_workqueue(btp->bt_mount->m_buf_workqueue);

        /* loop until there is nothing left on the lru list. */
        while (list_lru_count(&btp->bt_lru)) {
                list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
                              &dispose, LONG_MAX);

Comment 2 Gonzalo Siero 2020-12-02 11:07:54 UTC
I believe that the problem is not with the protection of the list, but with the fact that we are now dealing with the dispose list *inside* the loop that walks the LRU.

In cases where there is I/O in progress (reads), we could end up with lots of bufs to deal with on the dispose list.
By doing this, we're giving the workload the opportunity to put more bufs onto the LRU, and we end up waiting for a very long time...

In our test case, we are adding xfs_bufs here:

 0xffffffff81c5c6c0 : list_lru_add+0x0/0x140 [kernel]
 0xffffffffc02e1de7 : xfs_buf_rele+0x147/0x360 [xfs]
 0xffffffffc02ccbd7 : xfs_iread+0x167/0x1f0 [xfs]
 0xffffffffc02ed471 : xfs_iget+0x211/0x950 [xfs]
 0xffffffffc02f838f : xfs_lookup+0xdf/0x120 [xfs]
 0xffffffffc02f4650 : xfs_vn_lookup+0x70/0xa0 [xfs]
 0xffffffff81ce6f37 : __lookup_slow+0x97/0x150 [kernel]
 0xffffffff81ce7025 : lookup_slow+0x35/0x50 [kernel]
 0xffffffff81ce71ff : walk_component+0x1bf/0x2f0 [kernel]
 0xffffffff81ce7fb5 : path_lookupat.isra.48+0x75/0x200 [kernel]
 0xffffffff81cec4f0 : filename_lookup.part.64+0xa0/0x170 [kernel]
 0xffffffff81cdf954 : vfs_statx+0x74/0xd0 [kernel]
 0xffffffff81ce0019 : __do_sys_newlstat+0x39/0x70 [kernel]
 0xffffffff81a0419b : do_syscall_64+0x5b/0x1a0 [kernel]
 0xffffffff824000ad : entry_SYSCALL_64_after_hwframe+0x65/0xca [kernel]
 0xffffffff824000ad : entry_SYSCALL_64_after_hwframe+0x65/0xca [kernel]

xfs_buftarg->bt_lru: ffff94aa3139c480 129335

 0xffffffff81c5c6c0 : list_lru_add+0x0/0x140 [kernel]
 0xffffffffc02e1de7 : xfs_buf_rele+0x147/0x360 [xfs]
 0xffffffffc02ba85c : xfs_dir2_block_lookup+0xbc/0x130 [xfs]
 0xffffffffc02b92be : xfs_dir_lookup+0x1ae/0x1d0 [xfs]
 0xffffffffc02f8308 : xfs_lookup+0x58/0x120 [xfs]
 0xffffffffc02f4650 : xfs_vn_lookup+0x70/0xa0 [xfs]
 0xffffffff81ce6f37 : __lookup_slow+0x97/0x150 [kernel]
 0xffffffff81ce7025 : lookup_slow+0x35/0x50 [kernel]
 0xffffffff81ce71ff : walk_component+0x1bf/0x2f0 [kernel]
 0xffffffff81ce7fb5 : path_lookupat.isra.48+0x75/0x200 [kernel]
 0xffffffff81cec4f0 : filename_lookup.part.64+0xa0/0x170 [kernel]
 0xffffffff81cdf954 : vfs_statx+0x74/0xd0 [kernel]
 0xffffffff81ce0019 : __do_sys_newlstat+0x39/0x70 [kernel]
 0xffffffff81a0419b : do_syscall_64+0x5b/0x1a0 [kernel]
 0xffffffff824000ad : entry_SYSCALL_64_after_hwframe+0x65/0xca [kernel]
 0xffffffff824000ad : entry_SYSCALL_64_after_hwframe+0x65/0xca [kernel]

xfs_buftarg->bt_lru: ffff94aa3139c480 4071

Comment 4 Brian Foster 2020-12-02 13:55:25 UTC
(In reply to Frank Sorenson from comment #0)
...
> Actual results:
> 
> 	# find $fspath >/dev/null 2>&1 & time fsfreeze -f $fspath ; killall find ;
> fsfreeze -u $fspath ; wait
> 	[14] 356561
> 
> 	real	3m23.074s
> 	user	0m0.000s
> 	sys	0m1.775s
> 	[14]   Terminated              find $fspath > /dev/null 2>&1
> 

Ah, I see. We're only doing reads, so the freeze ends up blocked behind completion of the associated task. While I can see how the change in xfs_wait_buftarg() can introduce this behavior, I don't think the old behavior was necessarily correct. The intent of xfs_wait_buftarg() is to reclaim all buffers and drain the LRU; we just generally expect to be in a quiescent state when that occurs. freeze only blocks writes, so a continuous read workload would never fall into the blocking freeze protection paths, and thus we may end up having to wait for such a workload to fully complete.

> 
...
>   if the process is writing to the same filesystem, the freeze will happen
> quickly:
> 
>     # find $fspath >$fspath/find.out 2>&1 & time fsfreeze -f $fspath ;
> killall find ; fsfreeze -u $fspath ; wait
> 	[13] 357664
> 
> 	real	0m0.122s
> 	user	0m0.002s
> 	sys	0m0.004s
> 	[13]   Terminated              find $fspath > $fspath/find.out 2>&1

I suspect this occurs because the write path blocks on the freeze (i.e. sb_start_write() and friends) and thus fairly quickly blocks the task, which then remains blocked until the filesystem is unfrozen. The read-only variant is fully capable of running to completion on a frozen fs, so really there's no need to quiesce it in the first place. I think we've wondered in the past whether it's really necessary to drain the LRU (reclaiming all buffers) on freeze. I'm not sure that it is, so that might be the right approach to fix this particular performance issue.
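
To illustrate the point above (a hypothetical fragment, not actual kernel source and not compilable standalone): the buffered write path takes freeze protection via sb_start_write()/sb_end_write(), so the writing 'find' parks almost immediately once the filesystem is frozen, while the read-only lookups never enter that path. The helper do_the_write() is made up.

static ssize_t illustrative_write(struct file *file, const char __user *buf,
				  size_t len, loff_t *ppos)
{
	struct super_block *sb = file_inode(file)->i_sb;
	ssize_t ret;

	sb_start_write(sb);		/* sleeps here while the fs is frozen */
	ret = do_the_write(file, buf, len, ppos);	/* hypothetical helper */
	sb_end_write(sb);
	return ret;
}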

Comment 5 Dave Chinner 2020-12-08 03:14:07 UTC
(In reply to Gonzalo Siero from comment #2)
> I believe that the problem is not with the protection of the list, but with
> the fact that we are now dealing with dispose list *inside* the loop that is
> dealing with LRU.

No, that's not the problem at all. The problem is that we are trying to dispose of the buffer cache on an active filesystem that can still be filling the cache at the same time. i.e. Any change to the LRU code is trying to patch the symptom, not fix the root cause.

When we are freezing the filesystem all we need to do is wait for all in-flight buffer IO to complete - as Brian indicates, we do not need to empty the cache because the filesystem is still active and mounted. We only need to empty the cache after log recovery and at unmount time, so we should be able to avoid this problem just by avoiding the cache emptying in the freeze path.

So, really, I think the fix is to split xfs_wait_buftarg() into two parts - one to wait for in-flight IO to complete and flush the endio work queues, the other to empty the cache - and then to have the freeze path only wait for IO to complete....

-Dave.

Comment 6 Brian Foster 2020-12-08 09:13:06 UTC
(In reply to Dave Chinner from comment #5)
> (In reply to Gonzalo Siero from comment #2)
...
> 
> So, really, I think the fix is to split xfs_wait_buftarg() into two parts -
> one to wait for in-flight IO to complete and flush the endio work queues,
> the other to empty the cache - and then to have the freeze path only wait
> for IO to complete....
> 

Pretty much what I've been playing with locally to address this problem. I'll get patches on the list once I've done more testing.

Comment 7 Gonzalo Siero 2020-12-09 08:27:05 UTC
Thanks a lot Brian and Dave. 

>No, that's not the problem at all.

My comment was more along the lines of what has changed that could trigger the problem now but not in RHEL 7.

Will you need the customer to test a fix, or is that not needed since we have a reproducer?
Let me know.

Regards,
Gonzalo.

Comment 8 Brian Foster 2020-12-09 12:39:48 UTC
(In reply to Gonzalo Siero from comment #7)
...
> will you need customer to test a fix or since we have a reproducer is no
> needed?
> Let me know.
> 

I can reproduce based on your test case so I don't think that will be necessary. Thank you for distilling the customer issue down into something easily identifiable/reproducible.

Comment 9 Brian Foster 2021-01-26 15:19:33 UTC
Fixes have landed upstream as the following commits:

10fb9ac1251f xfs: rename xfs_wait_buftarg() to xfs_buftarg_drain()
8321ddb2fa29 xfs: don't drain buffer lru on freeze and read-only remount
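
For reference, here is a rough sketch of the shape of that split, pieced together from the RHEL 8 xfs_wait_buftarg() excerpt in comment 1 and the commit subjects above; it is illustrative, not the verbatim upstream patch. xfs_buftarg_wait() only waits for in-flight buffer I/O and completion processing, which is all the freeze and read-only remount paths need, while xfs_buftarg_drain() additionally empties the LRU for unmount and log recovery.

void
xfs_buftarg_wait(
	struct xfs_buftarg	*btp)
{
	/* wait for all in-flight buffers to be released ... */
	while (percpu_counter_sum(&btp->bt_io_count))
		delay(100);
	/* ... and for buffer completion processing to finish */
	flush_workqueue(btp->bt_mount->m_buf_workqueue);
}

void
xfs_buftarg_drain(
	struct xfs_buftarg	*btp)
{
	LIST_HEAD(dispose);

	xfs_buftarg_wait(btp);

	/* only unmount and log recovery still need to empty the cache */
	while (list_lru_count(&btp->bt_lru)) {
		list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
			      &dispose, LONG_MAX);
		/* ... dispose of the isolated buffers as before ... */
	}
}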

Comment 14 Jan Stancek 2021-02-02 05:19:06 UTC
Patch(es) available on kernel-4.18.0-280.el8.dt2

Comment 15 Zorro Lang 2021-02-04 06:25:55 UTC
Reproduced on kernel 278.el8, test passed on kernel 280.el8.dt3. I'll write a case for this bug later; for now, let me verify this bug first.
# ./fs_mark -k -S 0 -D 25 -n 400000 -t 25 -s 0 `for i in $(seq 1 25); do echo -n " -d /mnt/scratch/${i}"; done`
# echo 3 >/proc/sys/vm/drop_caches
# find /mnt/scratch >/dev/null

On another console:
[root@hp-dl380pg8-01 ~]# time xfs_freeze -f /mnt/scratch; xfs_freeze -u /mnt/scratch

real    0m1.220s
user    0m0.012s
sys     0m0.125s
[root@hp-dl380pg8-01 ~]# time xfs_freeze -f /mnt/scratch; xfs_freeze -u /mnt/scratch

real    0m1.314s
user    0m0.012s
sys     0m0.126s
[root@hp-dl380pg8-01 ~]# time xfs_freeze -f /mnt/scratch; xfs_freeze -u /mnt/scratch

real    0m0.915s
user    0m0.010s
sys     0m0.129s
[root@hp-dl380pg8-01 ~]# time xfs_freeze -f /mnt/scratch; xfs_freeze -u /mnt/scratch

real    0m1.231s
user    0m0.010s
sys     0m0.144s
[root@hp-dl380pg8-01 ~]# time xfs_freeze -f /mnt/scratch; xfs_freeze -u /mnt/scratch

real    0m0.187s
user    0m0.014s
sys     0m0.121s
....
....

Comment 23 errata-xmlrpc 2021-05-18 14:23:09 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1578

