Bug 1103792 - Unmounting an RHEL7 XFS filesystem on an offlined drive hangs with the message "metadata I/O error".
Keywords:
Status: CLOSED DUPLICATE of bug 1267042
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.1
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 7.3
Assignee: Eric Sandeen
QA Contact: Zorro Lang
URL:
Whiteboard:
Depends On:
Blocks: 1113511 1203710 1295577 1313485
 
Reported: 2014-06-02 14:45 UTC by Vimal Kumar
Modified: 2019-12-16 04:29 UTC (History)
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-21 23:29:19 UTC
Target Upstream Version:
Embargoed:



Description Vimal Kumar 2014-06-02 14:45:33 UTC
1) Description of problem:

Trying to unmount an XFS filesystem on an offlined drive hangs, with the kernel repeatedly logging the message "metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 19 numblks 1".

2) Version-Release number of selected component (if applicable):

kernel 3.10.0-121.el7.x86_64
xfsprogs-3.2.0-0.10.alpha2.el7.x86_64


3) How reproducible:

Always

4) Steps to Reproduce:

The following script reproduces the problem reliably.

~~~
#!/bin/bash
# Reproducer: repeatedly mkfs/mount/write to a disk, delete the SCSI device
# while the filesystem is still mounted, then try to unmount it.

DISK=/dev/sdc1
# SCSI address (H:C:T:L) of the disk, derived from the block device name
SCSI=$(ls /sys/block/${DISK:5:3}/device/scsi_device/)

mkdir -p /mnt/xfs

while true; do
    echo "mkfs on $DISK"
    mkfs.xfs -f $DISK
    sleep 1
    echo "mount and IO on $DISK"
    mount $DISK /mnt/xfs
    dd if=/dev/zero of=/mnt/xfs/file1 bs=1M count=500
    sync
    sleep 1
    echo "offline $DISK"
    # Delete the SCSI device out from under the mounted filesystem
    echo 1 > /sys/block/${DISK:5:3}/device/delete
    echo "umount..."
    # Retry the unmount every 5 seconds as long as it fails and the
    # filesystem is still listed in /proc/mounts
    while ! umount /mnt/xfs && grep -q /mnt/xfs /proc/mounts; do
        echo "retrying in 5 sec."
        sleep 5
    done
    echo "done"
    sleep 3
    echo "online $SCSI"
    # Rescan the SCSI host to bring the device back
    echo "- - -" > /sys/class/scsi_host/host${SCSI:0:1}/scan
    sleep 5
    # The device may come back under a different name
    DISK=/dev/$(ls /sys/class/scsi_device/$SCSI/device/block)1
    echo "new disk name: $DISK"
done
~~~

5) Actual results:

The unmount hangs at the terminal, and the kernel messages fill up /var/log/messages. The logs are as follows (error 19 is ENODEV, i.e. the offlined device no longer exists):

~~~
May 26 22:50:51 localhost kernel: XFS (sda1): Mounting Filesystem
May 26 22:50:51 localhost kernel: XFS (sda1): Ending clean mount
May 26 22:50:58 localhost kernel: sd 2:0:0:1: [sda] Synchronizing SCSI cache
May 26 22:50:58 localhost kernel: XFS (sda1): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 19 numblks 1
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x10. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x8. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x1. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x0. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x18. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x2. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x10. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x8. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x1. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x0. Retrying async write.
May 26 22:51:03 localhost kernel: XFS (sda1): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 19 numblks 1
May 26 22:51:08 localhost kernel: XFS (sda1): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 19 numblks 1
May 26 22:51:13 localhost kernel: XFS (sda1): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 19 numblks 1
May 26 22:51:18 localhost kernel: XFS (sda1): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 19 numblks 1
May 26 22:51:23 localhost kernel: XFS (sda1): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 19 numblks 1
May 26 22:51:28 localhost kernel: XFS:: 3590 callbacks suppressed
~~~

6) Expected results:

The umount should either fail or finish properly.

Comment 1 Eric Sandeen 2014-06-02 16:09:52 UTC
Moving to kernel; this isn't an xfsprogs (userspace) problem.

Comment 2 Eric Sandeen 2014-06-02 16:41:36 UTC
The retry is intentional, as far as I know; this could be a Fibre Channel failover, which might take considerable time. How can XFS know how long is "too long" and give up, leaving a corrupted filesystem?

So that makes the "fail" option less clear-cut.

As for your second expected result, there is no way for an unmount to finish "properly" if there is no disk to write to.

-Eric

Comment 6 Eric Sandeen 2014-07-15 17:09:39 UTC
We'll reconsider the approach during the RHEL 7.1 timeframe, but XFS intentionally keeps retrying by design.

It would be nice to have some graceful way to recover, though.

Comment 8 Eric Sandeen 2015-02-17 23:01:27 UTC
For what it's worth, this should work to recover from the situation:

# xfs_io -x -c shutdown /mount/point

to shut down the filesystem.  Then it can be unmounted, and the messages will stop.
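
Putting that together with the reproducer above (the /mnt/xfs mount point is an assumption carried over from the script; substitute your own), the recovery would look roughly like this:

~~~
# Force-shutdown the filesystem on the dead device so that pending metadata
# writes are errored out instead of retried forever.
xfs_io -x -c shutdown /mnt/xfs

# The unmount can now complete and the retry messages stop.
umount /mnt/xfs
~~~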

Comment 10 Eric Sandeen 2015-09-16 14:45:10 UTC
There are a few recent upstream changes that might improve this behavior; thanks to bfoster for reminding me of these on the upstream list:

> roughly commits 5e4b538 through d4a97a0 or so

d4a97a0 xfs: add missing bmap cancel calls in error paths
146e54b xfs: add helper to conditionally remove items from the AIL
f307080 xfs: fix btree cursor error cleanups
0ae120f xfs: clean up root inode properly on mount failure
a3f2001 xfs: checksum log record ext headers based on record size
fc0d165 xfs: fix broken icreate log item cancellation
78d57e4 xfs: icreate log item recovery and cancellation tracepoints
f0b2efa xfs: don't leave EFIs on AIL on mount failure
e32a1d1 xfs: use EFI refcount consistently in log recovery
6bc43af xfs: ensure EFD trans aborts on log recovery extent free failure
8d99fe9 xfs: fix efi/efd error handling to avoid fs shutdown hangs
d43ac29 xfs: return committed status from xfs_trans_roll()

Comment 15 Eric Sandeen 2016-07-21 23:29:19 UTC
This is resolved by the configurable error handling patch, which also sets xfs to "fail at unmount" by default.

This behavior is present in kernel-3.10.0-428.el7 and newer.

*** This bug has been marked as a duplicate of bug 1267042 ***

Comment 16 Eric Sandeen 2016-07-21 23:30:23 UTC
To be more clear, the new default behavior is to terminate any outstanding, failing IOs when an unmount command is issued.
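
For reference, on kernels that include the configurable error handling (upstream v4.7 and later, or kernel-3.10.0-428.el7 and newer per comment 15), the behavior can be tuned through sysfs. The sketch below is illustrative only; the device name sda1 is taken from the logs above, and which knobs are exposed depends on the kernel build:

~~~
# Give up on buffers that are still failing when an unmount is requested
# (the new default behavior described in this comment).
echo 1 > /sys/fs/xfs/sda1/error/fail_at_unmount

# Optionally bound how long metadata writes failing with EIO are retried:
echo 10 > /sys/fs/xfs/sda1/error/metadata/EIO/max_retries            # at most 10 retries
echo 30 > /sys/fs/xfs/sda1/error/metadata/EIO/retry_timeout_seconds  # or 30 seconds
~~~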

