Bug 1103792

Summary: Unmounting an RHEL7 XFS filesystem on an offlined drive hangs with the message "metadata I/O error".
Product: Red Hat Enterprise Linux 7
Component: kernel
kernel sub component: XFS
Version: 7.1
Status: CLOSED DUPLICATE
Severity: medium
Priority: high
Keywords: TestCaseProvided
Target Milestone: rc
Target Release: 7.3
Hardware: All
OS: Linux
Reporter: Vimal Kumar <vikumar>
Assignee: Eric Sandeen <esandeen>
QA Contact: Zorro Lang <zlang>
CC: dwysocha, eguan, esandeen, jkachuck, rsussman, swhiteho
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-07-21 23:29:19 UTC
Bug Blocks: 1113511, 1203710, 1295577, 1313485

Description Vimal Kumar 2014-06-02 14:45:33 UTC
1) Description of problem:

Trying to unmount an XFS filesystem on an offlined drive hangs with the message "metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 19 numblks 1"

2) Version-Release number of selected component (if applicable):

kernel 3.10.0-121.el7.x86_64
xfsprogs-3.2.0-0.10.alpha2.el7.x86_64


3) How reproducible:

Always

4) Steps to Reproduce:

The following script helps to reproduce the problem easily.

~~~
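#!/bin/bash
# Reproducer: mkfs, mount, write, delete the SCSI device, then try to unmount.
# WARNING: mkfs.xfs -f destroys all data on $DISK.

DISK=/dev/sdc1
# SCSI address (host:channel:target:lun) of the disk behind $DISK
SCSI=$(ls /sys/block/${DISK:5:3}/device/scsi_device/)

mkdir -p /mnt/xfs

while true; do

    echo "mkfs on $DISK"
    mkfs.xfs -f "$DISK"
    sleep 1
    echo "mount and IO on $DISK"
    mount "$DISK" /mnt/xfs
    dd if=/dev/zero of=/mnt/xfs/file1 bs=1M count=500
    sync
    sleep 1
    echo "offline $DISK"
    # Remove the SCSI device from the system while the filesystem is mounted
    echo 1 > /sys/block/${DISK:5:3}/device/delete
    echo "umount..."
    # Retry until umount succeeds or the mount disappears from /proc/mounts
    while ! umount /mnt/xfs && grep -q "/mnt/xfs" /proc/mounts; do
        echo "retrying in 5 sec."
        sleep 5
    done
    echo "done"
    sleep 3
    echo "online $SCSI"
    # Rescan the SCSI host (the part of the address before the first colon)
    echo "- - -" > /sys/class/scsi_host/host${SCSI%%:*}/scan
    sleep 5
    # The disk may come back under a different name
    DISK=/dev/$(ls /sys/class/scsi_device/$SCSI/device/block)1
    echo "new disk name: $DISK"
done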
~~~

5) Actual results:

The unmount hangs at the terminal, and the error messages fill up /var/log/messages. The logs are as follows:

~~~
May 26 22:50:51 localhost kernel: XFS (sda1): Mounting Filesystem
May 26 22:50:51 localhost kernel: XFS (sda1): Ending clean mount
May 26 22:50:58 localhost kernel: sd 2:0:0:1: [sda] Synchronizing SCSI cache
May 26 22:50:58 localhost kernel: XFS (sda1): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 19 numblks 1
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x10. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x8. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x1. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x0. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x18. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x2. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x10. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x8. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x1. Retrying async write.
May 26 22:50:58 localhost kernel: XFS (sda1): Detected failing async write on buffer block 0x0. Retrying async write.
May 26 22:51:03 localhost kernel: XFS (sda1): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 19 numblks 1
May 26 22:51:08 localhost kernel: XFS (sda1): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 19 numblks 1
May 26 22:51:13 localhost kernel: XFS (sda1): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 19 numblks 1
May 26 22:51:18 localhost kernel: XFS (sda1): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 19 numblks 1
May 26 22:51:23 localhost kernel: XFS (sda1): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 19 numblks 1
May 26 22:51:28 localhost kernel: XFS:: 3590 callbacks suppressed
~~~
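On Linux, error 19 is ENODEV ("No such device"), which matches the device having been deleted underneath the filesystem. As a rough sketch for triaging such logs, the helpers below pull the error number out of a "metadata I/O error" line and name it. Both function names are hypothetical, and the errno table only covers a few values.

```shell
#!/bin/sh
# Hypothetical helper: print the numeric error code from an XFS
# "metadata I/O error" log line; prints nothing if the line does not match.
xfs_log_errno() {
    printf '%s\n' "$1" |
        sed -n 's/.*metadata I\/O error: .* error \([0-9][0-9]*\) numblks.*/\1/p'
}

# Hypothetical helper: map a few Linux errno values to their names.
errno_name() {
    case "$1" in
        5)  echo EIO ;;
        19) echo ENODEV ;;
        28) echo ENOSPC ;;
        *)  echo unknown ;;
    esac
}

line='May 26 22:50:58 localhost kernel: XFS (sda1): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 19 numblks 1'
errno_name "$(xfs_log_errno "$line")"   # prints: ENODEV
```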

6) Expected results:

The umount should either fail or finish properly.

Comment 1 Eric Sandeen 2014-06-02 16:09:52 UTC
Moving to kernel; this is a kernelspace problem, not an xfsprogs (userspace) one.

Comment 2 Eric Sandeen 2014-06-02 16:41:36 UTC
The retry is intentional, as far as I know; this could be a Fibre Channel failover, which might take considerable time.  How can XFS know how long is "too long," and give up, leaving a corrupted filesystem?

So that makes the "fail" option less clear cut.

As for your second expected result, there is no way for an unmount to finish "properly" if there is no disk to write to.

-Eric

Comment 6 Eric Sandeen 2014-07-15 17:09:39 UTC
We'll reconsider the approach during the RHEL7.1 timeframe, but XFS keeps retrying by design.

It would be nice to have some graceful way to recover, though.

Comment 8 Eric Sandeen 2015-02-17 23:01:27 UTC
For what it's worth, this should work to recover from the situation:

# xfs_io -x -c shutdown /mount/point

to shut down the filesystem.  Then it can be unmounted, and the messages will stop.
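
The two steps above could be wrapped in a small function, sketched here. The function name and the DRY_RUN variable are hypothetical; with DRY_RUN=1 the commands are only printed, which makes it possible to check the sequence without touching a filesystem.

```shell
#!/bin/sh
# Sketch of the recovery sequence: force an XFS shutdown, then unmount.
# force_unmount_xfs and DRY_RUN are hypothetical names for illustration.
force_unmount_xfs() {
    mnt=$1
    if [ -z "$mnt" ]; then
        echo "usage: force_unmount_xfs <mountpoint>" >&2
        return 2
    fi
    # With DRY_RUN=1, print the commands instead of running them.
    run=
    if [ "${DRY_RUN:-0}" = 1 ]; then
        run=echo
    fi
    # "shutdown" is an expert-mode xfs_io command, hence -x.
    $run xfs_io -x -c shutdown "$mnt" || return 1
    $run umount "$mnt"
}
```

For example, `DRY_RUN=1 force_unmount_xfs /mnt/xfs` prints the two commands in order; without DRY_RUN it runs them as root would from the command line.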

Comment 10 Eric Sandeen 2015-09-16 14:45:10 UTC
There are a few recent upstream changes that might help this behavior (thanks to bfoster for reminding me of these on the upstream list):

> roughly commits 5e4b538 through d4a97a0 or so

d4a97a0 xfs: add missing bmap cancel calls in error paths
146e54b xfs: add helper to conditionally remove items from the AIL
f307080 xfs: fix btree cursor error cleanups
0ae120f xfs: clean up root inode properly on mount failure
a3f2001 xfs: checksum log record ext headers based on record size
fc0d165 xfs: fix broken icreate log item cancellation
78d57e4 xfs: icreate log item recovery and cancellation tracepoints
f0b2efa xfs: don't leave EFIs on AIL on mount failure
e32a1d1 xfs: use EFI refcount consistently in log recovery
6bc43af xfs: ensure EFD trans aborts on log recovery extent free failure
8d99fe9 xfs: fix efi/efd error handling to avoid fs shutdown hangs
d43ac29 xfs: return committed status from xfs_trans_roll()

Comment 15 Eric Sandeen 2016-07-21 23:29:19 UTC
This is resolved by the configurable error handling patch, which also sets xfs to "fail at unmount" by default.

This behavior is present in kernel-3.10.0-428.el7 and newer.

*** This bug has been marked as a duplicate of bug 1267042 ***
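
To check whether a running system already has a kernel at or beyond the release named above, a version comparison along these lines may help. It is only a sketch: the function name is hypothetical, and it assumes `sort -V` (GNU coreutils version sort) is available.

```shell
#!/bin/sh
# Hypothetical helper: succeed if kernel release $1 is at least $2.
# Assumes GNU "sort -V"; the older of the two versions sorts first.
kernel_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if kernel_at_least "$(uname -r)" 3.10.0-428.el7; then
    echo "kernel is at or beyond 3.10.0-428.el7"
else
    echo "kernel is older than 3.10.0-428.el7"
fi
```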

Comment 16 Eric Sandeen 2016-07-21 23:30:23 UTC
To be more clear, the new default behavior is to terminate any outstanding, failing IOs when an unmount command is issued.
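
Upstream XFS exposes this setting per filesystem in sysfs as `/sys/fs/xfs/<dev>/error/fail_at_unmount`. The sketch below reads it for a given device; that a particular RHEL kernel uses the same path is an assumption, and the function name and the XFS_SYSFS_ROOT override are hypothetical.

```shell
#!/bin/sh
# Print the fail_at_unmount setting for an XFS device such as "sda1".
# Path follows the upstream XFS sysfs layout; its presence on a given
# RHEL kernel is an assumption. XFS_SYSFS_ROOT is a hypothetical override
# that lets the function be exercised against a scratch directory.
xfs_fail_at_unmount() {
    f="${XFS_SYSFS_ROOT:-/sys/fs/xfs}/$1/error/fail_at_unmount"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "not available: $f" >&2
        return 1
    fi
}
```

For a mounted filesystem on sda1, `xfs_fail_at_unmount sda1` would print 1 when failing I/O is terminated at unmount and 0 when it is retried.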