Bug 1192850

Summary: disconnected drive with XFS filesystem is stuck attempting to write the log file.
Product: Fedora
Reporter: George R. Goffe <grgoffe>
Component: xfsprogs
Assignee: Eric Sandeen <esandeen>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 21
CC: esandeen
Hardware: x86_64
OS: Linux
Type: Bug
Doc Type: Bug Fix
Last Closed: 2015-02-17 19:46:22 UTC

Description George R. Goffe 2015-02-15 23:10:51 UTC
Description of problem:

I had a device with an XFS filesystem on it. I tried to unmount it, but the umount process is stuck in xfs_ai* (see the ps | grep umount output below). Something in the kernel, the XFS driver perhaps, is generating messages (see below) with the text "xfs_log_force: error -5 returned."

Usually I umount disks that have dropped ready status, run fsck on them, and remount them. I've been experimenting with XFS and LARGE disks. This disk is a 4TB Seagate SATA drive. It will NOT mount now and gives the messages below. I'm not sure what to do next. I have power cycled the failing disk enclosure (a SIIG USB docking station) and the drives come back, but the XFS drive fails to mount.



Version-Release number of selected component (if applicable):

xfsprogs-3.2.1-2.fc21.x86_64

How reproducible:

Unknown.

Steps to Reproduce:
1. Attach a disk with an XFS filesystem to a USB docking station and make the drive busy.
2. Power off the drive.
3. Possibly see the repeating message about xfs_log_force.
4. Power the drive back on; notice that it may come back as a different drive letter.
5. Attempt to mount the drive.
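
A rough shell approximation of these steps (a sketch, not from the original report; /dev/sdX1 and /mnt are placeholders, and deleting the SCSI device stands in for physically powering the drive off):

  mount /dev/sdX1 /mnt
  # keep the filesystem busy with background writes
  dd if=/dev/zero of=/mnt/busyfile bs=1M &

  # software stand-in for pulling the power: drop the device from the kernel
  echo 1 > /sys/block/sdX/device/delete

  # the repeating xfs_log_force errors should show up here
  dmesg | tail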

Actual results:

see description

Expected results:

see description

Additional info:

psg mount
F S UID        PID  PPID CLS PRI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 D root      3368 18040 TS   19 - 31501 xfs_ai 14:42 pts/31   00:00:00 umount /sdj1


Feb 15 14:58:33 fc21 kernel: [165768.636657] XFS (sdh1): xfs_log_force: error -5 returned.
Feb 15 14:58:33 fc21 kernel: [165768.686710] XFS (sdh1): xfs_log_force: error -5 returned.
Feb 15 14:58:33 fc21 kernel: [165768.736791] XFS (sdh1): xfs_log_force: error -5 returned.
Feb 15 14:58:33 fc21 kernel: [165768.786812] XFS (sdh1): xfs_log_force: error -5 returned.
Feb 15 14:58:33 fc21 kernel: [165768.836888] XFS (sdh1): xfs_log_force: error -5 returned.


mount /dev/sdi1 /mnt
mount: wrong fs type, bad option, bad superblock on /dev/sdi1,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

Comment 1 Eric Sandeen 2015-02-16 21:31:31 UTC
> # mount /dev/sdi1 /mnt
> mount: wrong fs type, bad option, bad superblock on /dev/sdi1,
>       missing codepage or helper program, or other error
>
>       In some cases useful info is found in syslog - try
>       dmesg | tail or so.

and what did dmesg | tail say immediately after that mount attempt?

Comment 2 George R. Goffe 2015-02-16 22:30:49 UTC
Eric,

I rebooted this system due to the LARGE number of messages filling up /var/log/... 

When I mounted the disk, it took a while but finally mounted. XFS was replaying the journal.
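
(An aside: mounting is how XFS replays its log; there is no separate fsck step, and a dirty log will make xfs_repair warn. A sketch, with /dev/sdX1 as a placeholder:)

  xfs_repair -n /dev/sdX1   # no-modify check; warns if the log is dirty
  mount /dev/sdX1 /mnt      # mounting replays the log automatically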

If/when the problem happens again, I will get the dmesg buffer.

We can leave this bug open or close it and I'll re-open it if/when the problem re-appears.

Do you want to proceed differently?

George...

Comment 3 Eric Sandeen 2015-02-16 22:41:42 UTC
Here is what I suspect happened.

> Steps to Reproduce:
> 1. Attach a disk with an XFS filesystem to a USB docking station and make the drive busy.
> 2. Power off the drive.

Ok, at this point, the filesystem is going to be unhappy...

> 3. Possibly see the repeating message about xfs_log_force.

... because it can't do IO to the device anymore (error -5 is -EIO, an I/O error)

> 4. Power the drive back on; notice that it may come back as a different drive letter.

This new device number is as expected, or at least out of xfs's control.  So pending IOs to the old device continue failing; xfs tends to retry until a truly critical error occurs.  You could kill this by manually shutting down the filesystem, i.e. 
# xfs_io -x -c shutdown /path/to/mountpoint
and then unmounting it.
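
Spelled out, with /mnt/baddisk as a placeholder mountpoint:

  # -x enables expert commands; shutdown marks the filesystem dead, so
  # queued retries are abandoned instead of looping forever
  xfs_io -x -c shutdown /mnt/baddisk
  umount /mnt/baddisk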

But, in any case, this fs is still mounted from the kernel POV, so ...

> 5. Attempt to mount the drive.

... mount fails and tells you to look at dmesg; I would imagine that it told you that you were trying to mount a filesystem UUID which was already mounted.  You could look in /var/log/messages, grep -i uuid /var/log/messages perhaps.
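
If that's what happened, the tail of dmesg would show something along these lines (illustrative; exact wording varies by kernel version):

  XFS (sdi1): Filesystem has duplicate UUID <uuid> - can't mount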

In any case, this is pretty much all expected.

We are talking about reducing that "retry forever" behavior, and shutting down after sufficient failed attempts.

-Eric

Comment 4 George R. Goffe 2015-02-17 18:18:48 UTC
Eric,

Unmounting the filesystem failed AND I was getting several error messages per second; /var was filling up quickly even though mine is 7GB. I powered off the drive in the vain hope that XFS would get an interrupt and figure out that the disk was no longer present. It did NOT. I chose to reboot, and that took a while because XFS appeared to be doing an fsck-like recovery (replaying the journal). Things are cool now as far as I can tell.

Your suggested solution sounds good. Is there no way for the system to determine that this error is a SERIOUS one (i.e., no disk) and abandon the retry attempts? Purge its IO queue and all other pending requests for that disk?

Regards,

George...

Comment 5 Eric Sandeen 2015-02-17 19:46:22 UTC
It's possible for a disk to disappear and then reappear; think of an unplugged and re-plugged SAN cable, for example.  We don't necessarily want to bail out right away on an error.

On the other hand, we don't want to retry forever, and we don't need to fill the logs while we retry.

Our best heuristic is probably a time-based approach: give up after we've failed for "long enough."
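
(Tunables along these lines were in fact later added to XFS; the sysfs knobs below are from kernels newer than this report and are shown as a sketch rather than something verified here:)

  # cap metadata EIO retries by count and by elapsed time
  echo 5   > /sys/fs/xfs/sdX1/error/metadata/EIO/max_retries
  echo 300 > /sys/fs/xfs/sdX1/error/metadata/EIO/retry_timeout_seconds
  # fail pending retries at unmount instead of hanging
  echo 1   > /sys/fs/xfs/sdX1/error/fail_at_unmount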

Right now, your best approach in a situation like this is to take administrative action, use xfs_io to shut down the fs, and unmount it.

I'm going to CLOSE/NOTABUG this one, because it's actually all working as currently designed.  We do have plans to change that design, though.  I'm just not sure we need to keep this bug open to do it.  :)

Thanks,
-Eric