Description of problem:
I had a device with an XFS filesystem on it. I tried to unmount it, but umount is stuck in xfs_ai* (see the ps | grep umount output below). Something in the kernel (the XFS driver?) keeps generating messages (see below) with this text: "xfs_log_force: error -5 returned."
Usually I unmount disks that have dropped ready, run fsck on them, and remount them. I've been experimenting with XFS and LARGE disks. This disk is a 4TB Seagate SATA drive. It will NOT mount now and gives the messages below. I'm not sure what to do. I have power cycled the failing disk enclosure (SIIG USB Docking Station) and the drives come back, but the XFS drive fails to mount.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Attach a disk with a filesystem of type XFS in a USB Docking Station and make the drive busy.
2. Power off the drive.
3. Possibly see the repeating message about xfs_log_force.
4. Power on the drive; notice that it may come back as a different drive letter.
5. Attempt to mount the drive.
F S UID PID PPID CLS PRI ADDR SZ WCHAN STIME TTY TIME CMD
4 D root 3368 18040 TS 19 - 31501 xfs_ai 14:42 pts/31 00:00:00 umount /sdj1
Feb 15 14:58:33 fc21 kernel: [165768.636657] XFS (sdh1): xfs_log_force: error -5 returned.
Feb 15 14:58:33 fc21 kernel: [165768.686710] XFS (sdh1): xfs_log_force: error -5 returned.
Feb 15 14:58:33 fc21 kernel: [165768.736791] XFS (sdh1): xfs_log_force: error -5 returned.
Feb 15 14:58:33 fc21 kernel: [165768.786812] XFS (sdh1): xfs_log_force: error -5 returned.
Feb 15 14:58:33 fc21 kernel: [165768.836888] XFS (sdh1): xfs_log_force: error -5 returned.
# mount /dev/sdi1 /mnt
mount: wrong fs type, bad option, bad superblock on /dev/sdi1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so.
> # mount /dev/sdi1 /mnt
> mount: wrong fs type, bad option, bad superblock on /dev/sdi1,
> missing codepage or helper program, or other error
> In some cases useful info is found in syslog - try
> dmesg | tail or so.
and what did dmesg | tail say immediately after that mount attempt?
I rebooted this system due to the LARGE number of messages filling up /var/log/...
When I mounted the disk, it took a while but finally mounted. XFS was replaying the journal.
If/when the problem happens again, I will get the dmesg buffer.
We can leave this bug open or close it and I'll re-open it if/when the problem re-appears.
Do you want to proceed differently?
Here is what I suspect happened.
> Steps to Reproduce:
> 1.Attach a disk with a filesystem of type XFS in a USB Docking Station and make the drive busy.
> 2.power off the drive
Ok, at this point, the filesystem is going to be unhappy...
> 3.possibly see the repeating message about xfs_log_force
... because it can't do IO to the device anymore
> 4.power on the drive; notice that it may come back as a different drive letter
This new device name is as expected, or at least out of xfs's control. Pending IOs to the old device continue failing, and xfs tends to retry until a truly critical error occurs. You could stop the retries by manually shutting down the filesystem, i.e.
# xfs_io -x -c shutdown /path/to/mountpoint
and then unmounting it.
But, in any case, this fs is still mounted from the kernel POV, so ...
> 5.attempt to mount the drive
... mount fails and tells you to look at dmesg; I would imagine it told you that you were trying to mount a filesystem whose UUID was already mounted. You could check /var/log/messages, perhaps with grep -i uuid /var/log/messages.
In any case, this is pretty much all expected.
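To make the log check above concrete, here is a minimal sketch. The helper name find_xfs_uuid_lines is made up for illustration, and since the exact wording of the kernel message may vary between kernel versions, it only matches lines that contain both "XFS" and "uuid" (case-insensitive):

```shell
#!/bin/sh
# Sketch: show XFS log lines that mention a UUID.
# find_xfs_uuid_lines is a hypothetical helper name, not an existing tool.
# With no arguments it reads stdin; otherwise it reads the given file(s).
find_xfs_uuid_lines() {
    # First keep only XFS lines, then keep those mentioning "uuid" in any case.
    grep 'XFS' "$@" | grep -i 'uuid'
}
```

Usage would be something like find_xfs_uuid_lines /var/log/messages, or dmesg | find_xfs_uuid_lines right after the failed mount attempt.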
We are talking about reducing that "retry forever" behavior, and shutting down after sufficient failed attempts.
Unmounting the filesystem failed AND I was getting several error messages per second, so /var was filling up quickly even though mine is 7GB. I powered off the drive in the vain hope that XFS would get an interrupt and figure out that the disk was no longer present. It did NOT. I chose to reboot, and that took a while because XFS appeared to be doing the equivalent of an fsck. Things are cool now as far as I can tell.
Your suggested solution sounds good. Is there no way for the system to determine that this error is a SERIOUS one (i.e., no disk) and abandon the retry attempts? Purge its IO queue and all other requests for operations to that disk?
It's possible for a disk to disappear, and then reappear - think an unplugged and re-plugged SAN cable for example. We don't necessarily want to bail out right away on an error.
On the other hand, we don't want to retry forever, and we don't need to fill the logs while we retry.
Our best heuristic is probably a time-based one: give up after we've failed for "long enough."
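As a rough userspace illustration of that idea (this is NOT how XFS does or will implement it, just a sketch of the policy), the loop below retries a command until it succeeds or a deadline passes, and logs only the first failure instead of every retry. The function name retry_with_deadline and its parameters are made up for this example:

```shell
#!/bin/sh
# Sketch of a time-based give-up policy; not the XFS implementation.
# retry_with_deadline is a hypothetical name.
# Usage: retry_with_deadline DEADLINE_SECS INTERVAL_SECS command [args...]
retry_with_deadline() {
    deadline_secs=$1
    interval_secs=$2
    shift 2
    # date +%s (seconds since the epoch) is widely available but not strictly POSIX.
    start=$(date +%s)
    failures=0
    while ! "$@"; do
        failures=$((failures + 1))
        # Report only the first failure so retries do not flood the log.
        if [ "$failures" -eq 1 ]; then
            echo "retrying: $*" >&2
        fi
        now=$(date +%s)
        # Give up once we have been failing for "long enough".
        if [ $((now - start)) -ge "$deadline_secs" ]; then
            echo "giving up after $failures failed attempts" >&2
            return 1
        fi
        sleep "$interval_secs"
    done
    return 0
}
```

A device that disappears briefly and comes back (the re-plugged cable case) gets through as long as it returns before the deadline; one that never returns produces two log lines rather than several per second.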
Right now, your best approach in a situation like this is to take administrative action, use xfs_io to shut down the fs, and unmount it.
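Spelled out as a small wrapper, that administrative action might look like the sketch below. The function name xfs_shutdown_and_umount and the DRY_RUN variable are made up for illustration; the only real commands involved are xfs_io (with -x, since shutdown is an expert-mode command) and umount:

```shell
#!/bin/sh
# Sketch: force-shutdown a wedged XFS filesystem, then unmount it.
# xfs_shutdown_and_umount and DRY_RUN are hypothetical names for illustration.
xfs_shutdown_and_umount() {
    mnt=$1
    if [ -z "$mnt" ]; then
        echo "usage: xfs_shutdown_and_umount /path/to/mountpoint" >&2
        return 2
    fi
    # With DRY_RUN=1, print the commands instead of running them.
    run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    # Shut the filesystem down first so the kernel stops retrying failed IO.
    $run xfs_io -x -c shutdown "$mnt" || return 1
    # Now the unmount should no longer hang waiting on the log.
    $run umount "$mnt"
}
```

Running it with DRY_RUN=1 first shows exactly what would be executed. Note that it takes the mount point, not the device node, since the device may already have come back under a different name.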
I'm going to CLOSE/NOTABUG this one, because it's actually all working as currently designed. We do have plans to change that design, though. I'm just not sure we need to keep this bug open to do it. :)