Bug 2166567 - The system hangs in xfs_reserve_blocks while mounting an XFS filesystem
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel
Version: 8.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Andrey Albershteyn (aalbersh)
QA Contact: Murphy Zhou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-02-02 08:07 UTC by YinKe
Modified: 2023-11-14 17:22 UTC
CC List: 3 users

Fixed In Version: kernel-4.18.0-492.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-14 15:40:12 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System                             ID                                                  Last Updated
Gitlab                             redhat/rhel/src/kernel rhel-8 merge_requests 4619   2023-04-25 14:53:07 UTC
Red Hat Issue Tracker              RHELPLAN-147405                                     2023-02-02 08:08:34 UTC
Red Hat Knowledge Base (Solution)  6996415                                             2023-02-02 08:24:41 UTC
Red Hat Product Errata             RHSA-2023:7077                                      2023-11-14 15:40:49 UTC

Description YinKe 2023-02-02 08:07:06 UTC
Description of problem:
The system hangs in xfs_reserve_blocks while mounting an XFS filesystem, with the following messages:

[ 4595.268513] XFS (dm-4): Mounting V5 Filesystem
[ 4595.285369] XFS (dm-4): Reserve blocks depleted! Consider increasing reserve pool size.
[ 4595.285374] XFS (dm-4): Per-AG reservation for AG 0 failed.  Filesystem may run out of space.
[ 4595.285376] XFS (dm-4): Per-AG reservation for AG 0 failed.  Filesystem may run out of space.
[ 4595.285806] XFS (dm-4): Per-AG reservation for AG 1 failed.  Filesystem may run out of space.
[ 4595.285809] XFS (dm-4): Per-AG reservation for AG 1 failed.  Filesystem may run out of space.
[ 4595.286222] XFS (dm-4): Per-AG reservation for AG 2 failed.  Filesystem may run out of space.
[ 4595.286225] XFS (dm-4): Per-AG reservation for AG 2 failed.  Filesystem may run out of space.
[ 4595.286708] XFS (dm-4): Per-AG reservation for AG 3 failed.  Filesystem may run out of space.
[ 4595.286715] XFS (dm-4): Per-AG reservation for AG 3 failed.  Filesystem may run out of space.
[ 4595.286717] XFS (dm-4): ENOSPC reserving per-AG metadata pool, log recovery may fail.
[ 4595.286719] XFS (dm-4): Ending clean mount
[ 4644.873590] watchdog: BUG: soft lockup - CPU#3 stuck for 44s! [mount:10541]
[ 4644.873749] CPU: 3 PID: 10541 Comm: mount Tainted: P           OE    --------- -  - 4.18.0-372.32.1.el8_6.x86_64 #1
[ 4644.873752] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
[ 4644.873754] RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
[ 4644.873762] Code: c0 e9 33 8f 22 00 b8 01 00 00 00 e9 29 8f 22 00 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 e9 05 8f 22 00 0f 1f 44 00 00 0f 1f 44 00 00 8b 07
[ 4644.873764] RSP: 0018:ffffa860d372fcd8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[ 4644.873767] RAX: 0000000000000080 RBX: 0000000000000035 RCX: 0000000000000080
[ 4644.873768] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
[ 4644.873770] RBP: ffff99cea82334d8 R08: 0000000000000020 R09: ffff99cdf290e950
[ 4644.873772] R10: 0000000000000000 R11: ffff99cea8233530 R12: ffffffff8f7adbe0
[ 4644.873773] R13: 0000000000000246 R14: 0000000000002000 R15: ffff99cea82334d8
[ 4644.873775] FS:  00007fe634bd56c0(0000) GS:ffff99dd3dec0000(0000) knlGS:0000000000000000
[ 4644.873776] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4644.873778] CR2: 0000560e582ad248 CR3: 0000001026f52005 CR4: 00000000007706e0
[ 4644.873811] PKRU: 55555554
[ 4644.873812] Call Trace:
[ 4644.873815]  __percpu_counter_sum+0x56/0x70
[ 4644.873823]  xfs_reserve_blocks+0xd7/0x190 [xfs]
[ 4644.873914]  xfs_mountfs+0x589/0x8d0 [xfs]
[ 4644.873958]  xfs_fs_fill_super+0x36c/0x6a0 [xfs]
[ 4644.874007]  ? xfs_mount_free+0x30/0x30 [xfs]
[ 4644.874056]  get_tree_bdev+0x18e/0x270
[ 4644.874064]  vfs_get_tree+0x25/0xc0
[ 4644.874067]  do_mount+0x2e9/0x950
[ 4644.874073]  ? memdup_user+0x4b/0x80
[ 4644.874077]  ksys_mount+0xbe/0xe0
[ 4644.874080]  __x64_sys_mount+0x21/0x30
[ 4644.874084]  unload_network_ops_symbols+0x69d0/0x79b0 [falcon_lsm_pinned_14604]
[ 4644.874092]  ? do_syscall_64+0x5b/0x1b0
[ 4644.874096]  ? entry_SYSCALL_64_after_hwframe+0x61/0xc6



Version-Release number of selected component (if applicable):

RELEASE: 4.18.0-372.32.1.el8_6.x86_64

Additional info:

CPU 3 (which reported "soft lockup") was trying to mount "/dev/mapper/u01_vg-u01_lv" to "/u01":
~~~
crash> rd -u 0000560e582a57b0 4
    560e582a57b0:  70616d2f7665642f 5f3130752f726570   /dev/mapper/u01_
    560e582a57c0:  6c5f3130752d6776 0000000000000076   vg-u01_lv.......

crash> rd -u 0000560e582a57e0 
    560e582a57e0:  000000003130752f                    /u01....
~~~
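
For reference, the stuck task can be located in the vmcore with crash's "ps" and "bt" commands (PID 10541 comes from the soft-lockup message above):
~~~
crash> ps | grep mount   # locate the hung mount task, PID 10541
crash> bt 10541          # its kernel stack should show xfs_reserve_blocks, as in the trace above
~~~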

But it hung in "xfs_reserve_blocks", spinning in the loop between line 378 and line 406 below: "free" (sampled via "percpu_counter_sum()" minus "m_alloc_set_aside") stays positive, while "xfs_mod_fdblocks()" keeps failing with -ENOSPC, apparently because its internal free-space check is stricter than the loop's estimate. Nothing in the loop changes either outcome, so "while (error == -ENOSPC)" never exits:
~~~
316 int
317 xfs_reserve_blocks(
318         xfs_mount_t             *mp,
319         uint64_t              *inval,
320         xfs_fsop_resblks_t      *outval)
321 {
......
373         /*
374          * If the request is larger than the current reservation, reserve the
375          * blocks before we update the reserve counters. Sample m_fdblocks and
376          * perform a partial reservation if the request exceeds free space.
377          */
378         error = -ENOSPC;      <==
379         do {
380                 free = percpu_counter_sum(&mp->m_fdblocks) -
381                                                 mp->m_alloc_set_aside;
382                 if (free <= 0)
383                         break;
384                 
385                 delta = request - mp->m_resblks;
386                 lcounter = free - delta;
387                 if (lcounter < 0)
388                         /* We can't satisfy the request, just get what we can */
389                         fdblks_delta = free;
390                 else    
391                         fdblks_delta = delta;
392                 
393                 /*
394                  * We'll either succeed in getting space from the free block
395                  * count or we'll get an ENOSPC. If we get a ENOSPC, it means
396                  * things changed while we were calculating fdblks_delta and so
397                  * we should try again to see if there is anything left to
398                  * reserve.
399                  *
400                  * Don't set the reserved flag here - we don't want to reserve
401                  * the extra reserve blocks from the reserve.....
402                  */
403                 spin_unlock(&mp->m_sb_lock);
404                 error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
405                 spin_lock(&mp->m_sb_lock);
406         } while (error == -ENOSPC);      <==
~~~ 

Looks like we need the upstream commits:
~~~
commit b32e3819a8230332d7848a6fb067aee52d08557e
Merge: 1fdff407028c 919edbadebe1
Author: Linus Torvalds <torvalds>
Date:   Fri Apr 1 19:30:44 2022 -0700

    Merge tag 'xfs-5.18-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
    
    Pull xfs fixes from Darrick Wong:
     "This fixes multiple problems in the reserve pool sizing functions: an
      incorrect free space calculation, a pointless infinite loop, and even 
      more braindamage that could result in the pool being overfilled. The
      pile of patches from Dave fix myriad races and UAF bugs in the log
      recovery code that much to our mutual surprise nobody's tripped over.
      Dave also fixed a performance optimization that had turned into a
      regression.
    
      Dave Chinner is taking over as XFS maintainer starting Sunday and
      lasting until 5.19-rc1 is tagged so that I can focus on starting a
      massive design review for the (feature complete after five years)
      online repair feature. From then on, he and I will be moving XFS to a 
      co-maintainership model by trading duties every other release. 
    
      NOTE: I hope very strongly that the other pieces of the (X)FS
      ecosystem (fstests and xfsprogs) will make similar changes to spread
      their maintenance load.

      Summary:

       - Fix an incorrect free space calculation in xfs_reserve_blocks that
         could lead to a request for free blocks that will never succeed.

       - Fix a hang in xfs_reserve_blocks caused by an infinite loop and the
         incorrect free space calculation.

       - Fix yet a third problem in xfs_reserve_blocks where multiple racing
         threads can overfill the reserve pool.

       - Fix an accounting error that lead to us reporting reserved space as
         "available".

       - Fix a race condition during abnormal fs shutdown that could cause
         UAF problems when memory reclaim and log shutdown try to clean up
         inodes.

       - Fix a bug where log shutdown can race with unmount to tear down the
         log, thereby causing UAF errors.

       - Disentangle log and filesystem shutdown to reduce confusion.

       - Fix some confusion in xfs_trans_commit such that a race between
         transaction commit and filesystem shutdown can cause unlogged dirty
         inode metadata to be committed, thereby corrupting the filesystem.

       - Remove a performance optimization in the log as it was discovered
         that certain storage hardware handle async log flushes so poorly as
         to cause serious performance regressions. Recent restructuring of
         other parts of the logging code mean that no performance benefit is
         seen on hardware that handle it well"

    * tag 'xfs-5.18-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
      xfs: drop async cache flushes from CIL commits.
      xfs: shutdown during log recovery needs to mark the log shutdown
      xfs: xfs_trans_commit() path must check for log shutdown
      xfs: xfs_do_force_shutdown needs to block racing shutdowns
      xfs: log shutdown triggers should only shut down the log
      xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks
      xfs: shutdown in intent recovery has non-intent items in the AIL
      xfs: aborting inodes on shutdown may need buffer lock
      xfs: don't report reserved bnobt space as available
      xfs: fix overfilling of reserve pool
      xfs: always succeed at setting the reserve pool size
      xfs: remove infinite loop when reserving free block pool
      xfs: don't include bnobt blocks when reserving free block pool
      xfs: document the XFS_ALLOC_AGFL_RESERVE constant
~~~

For example:
~~~
# git show 15f04fdc75aaaa1cccb0b8b3af1be290e118a7bc
commit 15f04fdc75aaaa1cccb0b8b3af1be290e118a7bc
Author: Darrick J. Wong <djwong>
Date:   Fri Mar 11 10:56:01 2022 -0800

    xfs: remove infinite loop when reserving free block pool
    
    Infinite loops in kernel code are scary.  Calls to xfs_reserve_blocks
    should be rare (people should just use the defaults!) so we really don't
    need to try so hard.  Simplify the logic here by removing the infinite
    loop.
    
    Cc: Brian Foster <bfoster>
    Signed-off-by: Darrick J. Wong <djwong>
    Reviewed-by: Dave Chinner <dchinner>

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 710e857bb825..3c6d9d6836ef 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -430,46 +430,36 @@ xfs_reserve_blocks(
         * If the request is larger than the current reservation, reserve the
         * blocks before we update the reserve counters. Sample m_fdblocks and
         * perform a partial reservation if the request exceeds free space.
+        *
+        * The code below estimates how many blocks it can request from
+        * fdblocks to stash in the reserve pool.  This is a classic TOCTOU
+        * race since fdblocks updates are not always coordinated via
+        * m_sb_lock.
         */
-       error = -ENOSPC;
-       do {
-               free = percpu_counter_sum(&mp->m_fdblocks) -
+       free = percpu_counter_sum(&mp->m_fdblocks) -
                                                xfs_fdblocks_unavailable(mp);
-               if (free <= 0)
-                       break;
-
-               delta = request - mp->m_resblks;
-               lcounter = free - delta;
-               if (lcounter < 0)
-                       /* We can't satisfy the request, just get what we can */
-                       fdblks_delta = free;
-               else
-                       fdblks_delta = delta;
-
+       delta = request - mp->m_resblks;
+       if (delta > 0 && free > 0) {
                /*
                 * We'll either succeed in getting space from the free block
-                * count or we'll get an ENOSPC. If we get a ENOSPC, it means
-                * things changed while we were calculating fdblks_delta and so
-                * we should try again to see if there is anything left to
-                * reserve.
-                *
-                * Don't set the reserved flag here - we don't want to reserve
-                * the extra reserve blocks from the reserve.....
+                * count or we'll get an ENOSPC.  Don't set the reserved flag
+                * here - we don't want to reserve the extra reserve blocks
+                * from the reserve.
                 */
+               fdblks_delta = min(free, delta);
                spin_unlock(&mp->m_sb_lock);
                error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
                spin_lock(&mp->m_sb_lock);
-       } while (error == -ENOSPC);
 
-       /*
-        * Update the reserve counters if blocks have been successfully
-        * allocated.
-        */
-       if (!error && fdblks_delta) {
-               mp->m_resblks += fdblks_delta;
-               mp->m_resblks_avail += fdblks_delta;
+               /*
+                * Update the reserve counters if blocks have been successfully
+                * allocated.
+                */
+               if (!error) {
+                       mp->m_resblks += fdblks_delta;
+                       mp->m_resblks_avail += fdblks_delta;
+               }
        }
-
 out:
        if (outval) {
                outval->resblks = mp->m_resblks;
~~~
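
For reference, the reserve pool that this function sizes can also be queried and resized on a mounted filesystem with xfs_io's expert-mode "resblks" command, which goes through the same xfs_reserve_blocks() path ("/u01" is the mount point from this report):
~~~
# Query the current reserve pool size (expert mode).
xfs_io -x -c "resblks" /u01

# Set the pool to an explicit size in filesystem blocks; this exercises
# the xfs_reserve_blocks() path shown above.
xfs_io -x -c "resblks 8192" /u01
~~~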

Please advise, thanks.

Comment 2 Eric Sandeen 2023-02-10 18:44:16 UTC
According to the (now closed) case, the root cause of entering the infinite loop appears to have been a corrupted filesystem. (The infinite loop is a bug, of course, but the reason we hit it appears to be corruption.) The number of free blocks was incorrect, and xfs_repair resolved the issue and allowed the filesystem to mount:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
sb_fdblocks 53, counted 190920
        - found root inode chunk
Phase 3 - for each AG...

I agree that the patch(es) suggested are reasonable to target for a RHEL update, but given that the customer case is resolved and the root cause appears to be a mildly corrupted filesystem, and where we are in the RHEL8.8 development cycle, we will target this for the next release.

Thanks,
-Eric
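
For reference, the sb_fdblocks mismatch above (53 on disk vs. 190920 counted) can be checked read-only before deciding to repair; a minimal sketch, using the device path from this report (the filesystem must be unmounted for xfs_repair):
~~~
# Read the on-disk superblock free-block counter without modifying anything.
xfs_db -r -c "sb 0" -c "print fdblocks" /dev/mapper/u01_vg-u01_lv

# Dry run: report what xfs_repair would change, but fix nothing.
xfs_repair -n /dev/mapper/u01_vg-u01_lv

# Actual repair, once the dry-run output has been reviewed.
xfs_repair /dev/mapper/u01_vg-u01_lv
~~~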

Comment 14 errata-xmlrpc 2023-11-14 15:40:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:7077
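
For reference, whether a given system already carries the fix can be checked against the Fixed In Version noted above (kernel-4.18.0-492.el8):
~~~
# Running kernel; the fix landed in kernel-4.18.0-492.el8 and later.
uname -r

# All installed kernel packages.
rpm -q kernel
~~~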

