Bug 1903387
| Summary: | fsfreeze of xfs filesystem can be significantly delayed in xfs_wait_buftarg if a process continues to grab and release buffers | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Frank Sorenson <fsorenso> |
| Component: | kernel | Assignee: | Brian Foster <bfoster> |
| kernel sub component: | XFS | QA Contact: | Zorro Lang <zlang> |
| Status: | CLOSED ERRATA | Severity: | unspecified |
| Priority: | unspecified | CC: | bfoster, dchinner, gsierohu, jbreitwe, jshivers, peter.vreman, xzhou, zlang |
| Version: | 8.3 | Keywords: | Reproducer, Triaged |
| Target Milestone: | rc | Target Release: | 8.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Fixed In Version: | kernel-4.18.0-283.el8 | Doc Type: | If docs needed, set a value |
| Last Closed: | 2021-05-18 14:23:09 UTC | Type: | Bug |
Gonzalo noted the following change between RHEL 7 and RHEL 8. In RHEL 7:

```c
xfs_wait_buftarg(
	struct xfs_buftarg	*btp)
{
	struct xfs_buf		*bp;
	LIST_HEAD(dispose);
	...
	flush_workqueue(btp->bt_mount->m_buf_workqueue);
restart:
	spin_lock(&btp->bt_lru_lock);	/* <<--- this prevents new xfs_bufs from going onto the LRU */
	while (!list_empty(&btp->bt_lru)) {
		bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
		...
		spin_unlock(&btp->bt_lru_lock);
```

In RHEL 8 we are not protecting the list, so new bufs come in over and over again, and this is what prevents the wait from finishing:

```c
xfs_wait_buftarg(
	struct xfs_buftarg	*btp)
{
	LIST_HEAD(dispose);
	int			loop = 0;

	/*
	 * First wait on the buftarg I/O count for all in-flight buffers to be
	 * released. This is critical as new buffers do not make the LRU until
	 * they are released.
	 *
	 * Next, flush the buffer workqueue to ensure all completion processing
	 * has finished. Just waiting on buffer locks is not sufficient for
	 * async IO as the reference count held over IO is not released until
	 * after the buffer lock is dropped. Hence we need to ensure here that
	 * all reference counts have been dropped before we start walking the
	 * LRU list.
	 */
	while (percpu_counter_sum(&btp->bt_io_count))
		delay(100);
	flush_workqueue(btp->bt_mount->m_buf_workqueue);

	/* loop until there is nothing left on the lru list. */
	while (list_lru_count(&btp->bt_lru)) {
		list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
			      &dispose, LONG_MAX);
```

I believe that the problem is not with the protection of the list, but with the fact that we are now dealing with the dispose list *inside* the loop that is dealing with the LRU. In cases where there is I/O in progress (reads), we could end up with lots of bufs to deal with in the dispose list. By doing this, we give new bufs the opportunity to get onto the LRU, and we end up waiting for a very long time...
In our test case, we are adding xfs_bufs here:

```
0xffffffff81c5c6c0 : list_lru_add+0x0/0x140 [kernel]
0xffffffffc02e1de7 : xfs_buf_rele+0x147/0x360 [xfs]
0xffffffffc02ccbd7 : xfs_iread+0x167/0x1f0 [xfs]
0xffffffffc02ed471 : xfs_iget+0x211/0x950 [xfs]
0xffffffffc02f838f : xfs_lookup+0xdf/0x120 [xfs]
0xffffffffc02f4650 : xfs_vn_lookup+0x70/0xa0 [xfs]
0xffffffff81ce6f37 : __lookup_slow+0x97/0x150 [kernel]
0xffffffff81ce7025 : lookup_slow+0x35/0x50 [kernel]
0xffffffff81ce71ff : walk_component+0x1bf/0x2f0 [kernel]
0xffffffff81ce7fb5 : path_lookupat.isra.48+0x75/0x200 [kernel]
0xffffffff81cec4f0 : filename_lookup.part.64+0xa0/0x170 [kernel]
0xffffffff81cdf954 : vfs_statx+0x74/0xd0 [kernel]
0xffffffff81ce0019 : __do_sys_newlstat+0x39/0x70 [kernel]
0xffffffff81a0419b : do_syscall_64+0x5b/0x1a0 [kernel]
0xffffffff824000ad : entry_SYSCALL_64_after_hwframe+0x65/0xca [kernel]
0xffffffff824000ad : entry_SYSCALL_64_after_hwframe+0x65/0xca [kernel]
    xfs_buftarg->bt_lru: ffff94aa3139c480  129335
```

and here:

```
0xffffffff81c5c6c0 : list_lru_add+0x0/0x140 [kernel]
0xffffffffc02e1de7 : xfs_buf_rele+0x147/0x360 [xfs]
0xffffffffc02ba85c : xfs_dir2_block_lookup+0xbc/0x130 [xfs]
0xffffffffc02b92be : xfs_dir_lookup+0x1ae/0x1d0 [xfs]
0xffffffffc02f8308 : xfs_lookup+0x58/0x120 [xfs]
0xffffffffc02f4650 : xfs_vn_lookup+0x70/0xa0 [xfs]
0xffffffff81ce6f37 : __lookup_slow+0x97/0x150 [kernel]
0xffffffff81ce7025 : lookup_slow+0x35/0x50 [kernel]
0xffffffff81ce71ff : walk_component+0x1bf/0x2f0 [kernel]
0xffffffff81ce7fb5 : path_lookupat.isra.48+0x75/0x200 [kernel]
0xffffffff81cec4f0 : filename_lookup.part.64+0xa0/0x170 [kernel]
0xffffffff81cdf954 : vfs_statx+0x74/0xd0 [kernel]
0xffffffff81ce0019 : __do_sys_newlstat+0x39/0x70 [kernel]
0xffffffff81a0419b : do_syscall_64+0x5b/0x1a0 [kernel]
0xffffffff824000ad : entry_SYSCALL_64_after_hwframe+0x65/0xca [kernel]
0xffffffff824000ad : entry_SYSCALL_64_after_hwframe+0x65/0xca [kernel]
    xfs_buftarg->bt_lru: ffff94aa3139c480  4071
```

(In reply to Frank Sorenson from comment #0)
...
> Actual results:
>
> # find $fspath >/dev/null 2>&1 & time fsfreeze -f $fspath ; killall find ;
> fsfreeze -u $fspath ; wait
> [14] 356561
>
> real    3m23.074s
> user    0m0.000s
> sys     0m1.775s
> [14]   Terminated              find $fspath > /dev/null 2>&1
>

Ah, I see. We're only doing reads, so the freeze ends up blocked behind completion of the associated task. While I can see how the change in xfs_wait_buftarg() can introduce this behavior, I don't think the old behavior was necessarily correct. The intent of xfs_wait_buftarg() is to reclaim all buffers and drain the LRU; we just generally expect to be in a quiescent state when that occurs. Freeze only blocks writes, so a continuous read workload never falls into the blocking freeze protection paths, and thus we may end up having to wait for such a workload to fully complete.

> ...
> if the process is writing to the same filesystem, the freeze will happen
> quickly:
>
> # find $fspath >$fspath/find.out 2>&1 & time fsfreeze -f $fspath ;
> killall find ; fsfreeze -u $fspath ; wait
> [13] 357664
>
> real    0m0.122s
> user    0m0.002s
> sys     0m0.004s
> [13]   Terminated              find $fspath > $fspath/find.out 2>&1

I suspect this occurs because the write path blocks on the freeze (i.e. sb_start_write() and friends) and thus fairly quickly blocks the task, which then remains blocked until the filesystem is unfrozen. The read-only variant is fully capable of running to completion on a frozen fs, so really there's no need to quiesce it in the first place.
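To make the asymmetry described above concrete, here is a minimal illustrative sketch (not actual kernel source): the write path takes freeze protection via sb_start_write()/sb_end_write(), so a writer parks almost immediately once the freeze begins, while the read/lookup path takes no freeze protection and keeps releasing buffers onto the buftarg LRU. The function names example_write_path/example_read_path and the helpers do_the_write/do_the_read are hypothetical placeholders.

```c
/*
 * Illustrative sketch only -- not actual kernel code.  The helpers
 * do_the_write()/do_the_read() are hypothetical placeholders.
 */
#include <linux/fs.h>

static ssize_t do_the_write(struct file *file, const char __user *buf, size_t len);
static ssize_t do_the_read(struct file *file, char __user *buf, size_t len);

static ssize_t example_write_path(struct file *file, const char __user *buf,
				  size_t len)
{
	struct super_block *sb = file_inode(file)->i_sb;
	ssize_t ret;

	sb_start_write(sb);		/* sleeps here once the fs is frozen */
	ret = do_the_write(file, buf, len);
	sb_end_write(sb);
	return ret;
}

static ssize_t example_read_path(struct file *file, char __user *buf,
				 size_t len)
{
	/*
	 * No freeze protection on the read/lookup side: a find(1)-style
	 * workload keeps running on a frozen fs and keeps releasing
	 * buffers onto the buftarg LRU that xfs_wait_buftarg() is
	 * trying to empty.
	 */
	return do_the_read(file, buf, len);
}
```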
I think we've wondered in the past whether it's really necessary to drain the LRU (reclaiming all buffers) on freeze. I'm not sure that it is, so that might be the right approach to fix this particular performance issue.

(In reply to Gonzalo Siero from comment #2)
> I believe that the problem is not with the protection of the list, but with
> the fact that we are now dealing with the dispose list *inside* the loop
> that is dealing with the LRU.

No, that's not the problem at all.

The problem is that we are trying to dispose of the buffer cache on an active filesystem that can still be filling the cache at the same time. i.e. any change to the LRU code is trying to patch the symptom, not fix the root cause.

When we are freezing the filesystem, all we need to do is wait for all in-flight buffer IO to complete - as Brian indicates, we do not need to empty the cache because the filesystem is still active and mounted. We only need to empty the cache after log recovery and at unmount time, so we should be able to avoid this problem just by avoiding the cache emptying in the freeze path.

So, really, I think the fix is to split xfs_wait_buftarg() into two parts - one to wait for in-flight IO to complete and flush the endio work queues, the other to empty the cache - and then to have the freeze path only wait for IO to complete....

-Dave.

(In reply to Dave Chinner from comment #5)
> (In reply to Gonzalo Siero from comment #2)
...
> So, really, I think the fix is to split xfs_wait_buftarg() into two parts -
> one to wait for in-flight IO to complete and flush the endio work queues,
> the other to empty the cache - and then to have the freeze path only wait
> for IO to complete....
>

Pretty much what I've been playing with locally to address this problem. I'll get patches on the list once I've done more testing.

Thanks a lot Brian and Dave.
> No, that's not the problem at all.

My comment was more along the lines of what had changed that could trigger the problem now and not in RHEL 7.

Will you need the customer to test a fix, or is that not needed since we have a reproducer?
Let me know.
Regards,
Gonzalo.
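For illustration, here is a rough sketch of the split Dave describes, assembled from the RHEL 8 code quoted earlier in this thread. This is not the actual upstream patch (the real fix landed as the two commits referenced in the next comment); it only shows the shape of separating the "wait for in-flight I/O" half from the "drain the LRU" half, so that the freeze path can call only the former while unmount and log recovery keep the full drain.

```c
/*
 * Sketch only -- not the upstream patch.  Freeze would call
 * xfs_buftarg_wait(); unmount and log recovery would call
 * xfs_buftarg_drain().
 */
void xfs_buftarg_wait(struct xfs_buftarg *btp)
{
	/* wait for all in-flight buffer I/O to complete ... */
	while (percpu_counter_sum(&btp->bt_io_count))
		delay(100);

	/* ... and for end-I/O processing to finish */
	flush_workqueue(btp->bt_mount->m_buf_workqueue);
}

void xfs_buftarg_drain(struct xfs_buftarg *btp)
{
	LIST_HEAD(dispose);

	xfs_buftarg_wait(btp);

	/* only unmount/log recovery still need to empty the cache */
	while (list_lru_count(&btp->bt_lru)) {
		list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
			      &dispose, LONG_MAX);
		/* ... dispose of the reclaimed buffers as before ... */
	}
}
```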
(In reply to Gonzalo Siero from comment #7)
...
> Will you need the customer to test a fix, or is that not needed since we
> have a reproducer?
> Let me know.

I can reproduce based on your test case, so I don't think that will be necessary. Thank you for distilling the customer issue down into something easily identifiable/reproducible.

Fixes have landed upstream as the following commits:

```
10fb9ac1251f xfs: rename xfs_wait_buftarg() to xfs_buftarg_drain()
8321ddb2fa29 xfs: don't drain buffer lru on freeze and read-only remount
```

Patch(es) available on kernel-4.18.0-280.el8.dt2

Reproduced on kernel 278.el8; test passed on kernel 280.el8.dt3. I'll write a test case for this bug later; for now I have tested this bug manually:

```
# ./fs_mark -k -S 0 -D 25 -n 400000 -t 25 -s 0 `for i in $(seq 1 25); do echo -n " -d /mnt/scratch/${i}"; done`
# echo 3 >/proc/sys/vm/drop_caches
# find /mnt/scratch >/dev/null
```

On another console:

```
[root@hp-dl380pg8-01 ~]# time xfs_freeze -f /mnt/scratch; xfs_freeze -u /mnt/scratch

real    0m1.220s
user    0m0.012s
sys     0m0.125s
[root@hp-dl380pg8-01 ~]# time xfs_freeze -f /mnt/scratch; xfs_freeze -u /mnt/scratch

real    0m1.314s
user    0m0.012s
sys     0m0.126s
[root@hp-dl380pg8-01 ~]# time xfs_freeze -f /mnt/scratch; xfs_freeze -u /mnt/scratch

real    0m0.915s
user    0m0.010s
sys     0m0.129s
[root@hp-dl380pg8-01 ~]# time xfs_freeze -f /mnt/scratch; xfs_freeze -u /mnt/scratch

real    0m1.231s
user    0m0.010s
sys     0m0.144s
[root@hp-dl380pg8-01 ~]# time xfs_freeze -f /mnt/scratch; xfs_freeze -u /mnt/scratch

real    0m0.187s
user    0m0.014s
sys     0m0.121s
....
....
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1578
Created attachment 1735427 [details]
program to create directory/file tree

Description of problem:

If a process is quickly grabbing and releasing buffers, a freeze could experience a lengthy delay while waiting in xfs_wait_buftarg for the buffer LRU list to become empty.

Version-Release number of selected component (if applicable):

RHEL 8 kernels - 4.18.0-240.5.1.el8_3.x86_64

How reproducible:

easy, see below

Steps to Reproduce:

create a large directory tree, start 'find', attempt to 'fsfreeze'

particular steps:

```
truncate -s 30G /var/tmp/test.img
mkfs.xfs /var/tmp/test.img

fspath=/mnt/tmp
mkdir -p $fspath
mount /var/tmp/test.img $fspath

gcc create_tree.c -o create_tree
./create_tree $fspath
    (creates a 4-deep tree of 1 million leaf directories, each containing 10 files)

sysctl vm.drop_caches=3

find $fspath >/dev/null 2>&1 &
time fsfreeze -f $fspath

(cleanup)
killall find
fsfreeze -u $fspath
```

Actual results:

```
# find $fspath >/dev/null 2>&1 & time fsfreeze -f $fspath ; killall find ; fsfreeze -u $fspath ; wait
[14] 356561

real    3m23.074s
user    0m0.000s
sys     0m1.775s
[14]   Terminated              find $fspath > /dev/null 2>&1
```

Expected results:

fsfreeze completes quickly

Additional info:

if the process is writing to the same filesystem, the freeze will happen quickly:

```
# find $fspath >$fspath/find.out 2>&1 & time fsfreeze -f $fspath ; killall find ; fsfreeze -u $fspath ; wait
[13] 357664

real    0m0.122s
user    0m0.002s
sys     0m0.004s
[13]   Terminated              find $fspath > $fspath/find.out 2>&1
```
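As a side note for anyone scripting this check, the freeze delay can also be timed from a small C program driving the same FIFREEZE/FITHAW ioctls that fsfreeze(8) issues. This is a hypothetical helper (file name freeze_time.c chosen here for illustration), not part of the original report; it is equivalent to `time fsfreeze -f` followed by `fsfreeze -u`.

```c
/*
 * freeze_time.c -- hypothetical helper, not part of the original report.
 * Times a freeze of a mount point via the FIFREEZE/FITHAW ioctls (the
 * same ioctls fsfreeze(8) uses).  Build and run as root:
 *     gcc -o freeze_time freeze_time.c
 *     ./freeze_time /mnt/tmp
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FIFREEZE, FITHAW */

int main(int argc, char **argv)
{
	struct timespec t0, t1;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (ioctl(fd, FIFREEZE, 0) < 0) {	/* returns once the fs is frozen */
		perror("FIFREEZE");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("freeze took %.3f seconds\n",
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

	if (ioctl(fd, FITHAW, 0) < 0)		/* thaw again */
		perror("FITHAW");

	close(fd);
	return 0;
}
```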