Bug 1564186

Summary:	XFS may block endlessly in xlog_wait() on IO error
Product:	Red Hat Enterprise Linux 7	Reporter:	Zdenek Kabelac <zkabelac>
Component:	kernel	Assignee:	Carlos Maiolino <cmaiolin>
kernel sub component:	XFS	QA Contact:	Zorro Lang <zlang>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	unspecified
Priority:	urgent	CC:	brdeoliv, dchinner, dhoward, dwysocha, xzhou
Version:	7.5	Keywords:	ZStream
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	kernel-3.10.0-879.el7	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1611212 1657142 (view as bug list)		Environment:
Last Closed:	2018-10-30 08:54:57 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1657142

Description Zdenek Kabelac 2018-04-05 15:27:42 UTC

Description of problem:

While hitting dm kernel bug 1563697  (which is currently the only quick way to hit the issue) - XFS revealed hidden problem with log_force function which can be stuck waiting on remove of some worker queue - that happened to be already gone because of racing ioerror.


# cat /proc/2938/stack 
[<0>] _xfs_log_force+0x112/0x260 [xfs]
[<0>] xfs_log_force+0x27/0x70 [xfs]
[<0>] xfs_trim_extents+0xa6/0x470 [xfs]
[<0>] xfs_ioc_trim+0x294/0x370 [xfs]
[<0>] xfs_file_ioctl+0x561/0xba0 [xfs]
[<0>] do_vfs_ioctl+0x93/0x680
[<0>] SyS_ioctl+0x58/0x70
[<0>] do_int80_syscall_32+0x54/0xe0
[<0>] entry_INT80_32+0x31/0x31
[<0>] 0xffffffff


stuck here: xfs_log_priv.h

static inline void xlog_wait(wait_queue_head_t *wq, spinlock_t *lock)
{
	DECLARE_WAITQUEUE(wait, current);

	add_wait_queue_exclusive(wq, &wait);
	__set_current_state(TASK_UNINTERRUPTIBLE);
	spin_unlock(lock);
	schedule();
>>>>>>	remove_wait_queue(wq, &wait);
}




Version-Release number of selected component (if applicable):
Even upstream 4.16 kernel ATM

How reproducible:


Steps to Reproduce:
1. See bug 1563697
2.
3.

Actual results:


Expected results:
Not having umountable filesystem for this error case scenario.

Additional info:

Comment 3 Dave Chinner 2018-04-06 03:41:51 UTC

Waiting in xlog_wait() implies that the log IO has not been completed. i.e. that whatever error occurred in the DM layer has not been correctly completed with an error, and XFS is still waiting for it.

Before we go on a wild goose chase in XFS, can you first confirm that DM is completing all outstanding IOs correctly (e.g. via blktrace) and that XFS isn't actually waiting on a log IO completion that has got lost?

-Dave.

Comment 4 Zdenek Kabelac 2018-04-06 07:46:30 UTC

I've already analyzed this case with Carlos - so he will explain more details for XFS part of problem.

DM is in faulty stage - switching to read-only mode - the ext4 handles same case easily.   XFS gets stuck in mentioned function waiting on an action that cannot happen.

So far we have not prepared more 'easier' test since it's nontrivial to hit table switch in the very same moment.

Basic trable is - thin-pool is switch to 'out-of-metadata' read-only mode and when the time is 'right' the action for reporting IO error happens to early (at least that's what I remember from Carlos explanation).

Also - there is already fix made by him - but having better test case that does not rely on faulty thin-pool metadata logic  is still missing (and likely bug in DM target will get fixed soon - so not usable for further testing)

Also feel free to use reproducer from bug 1563697.
The only mandatory condition is - the PV in use must be on very fast device (nearly zero latency -  I'd suggest  ramdisk based loop device - as it's very tricky race condition.)

Comment 5 Carlos Maiolino 2018-04-06 13:12:36 UTC

(In reply to Dave Chinner from comment #3)
> Waiting in xlog_wait() implies that the log IO has not been completed. i.e.
> that whatever error occurred in the DM layer has not been correctly
> completed with an error, and XFS is still waiting for it.
> 
> Before we go on a wild goose chase in XFS, can you first confirm that DM is
> completing all outstanding IOs correctly (e.g. via blktrace) and that XFS
> isn't actually waiting on a log IO completion that has got lost?
> 
> -Dave.

Hi Dave,

I'm still not 100% sure this might be a bug in XFS, mainly due the fact it is so far, only reproducible due a bug in device-mapper. There are more data than the stacktrace mentioned above though.

The stacktrace comes from a fstrim process stuck on the log force.

The IO has not been completed, yes, it failed.

The IO error is detected by XFS in xlog_iodone(), during the io completion of the iclog submitted by xfs_log_force, during the fstrim process.

When xfs_log_force_umount() is called, and it runs the xlog callbacks via xlog_state_do_callbacks() to wake up all the threads waiting on iclog->ic_force_wait, it does skip the thread waiting on the fstrim process, well, at least I thought it was skipping it (more about it below).

At this point XFS has been shut down, and the waiting fstrim has never been woken up.

I was tracking down why the fstrim thread was being skipped during the shutdown process, but I figured it out it's not actually skipping. xlog_state_do_callbacks() get stuck while trying to run the iclog's callback.

I'm actually tracking down why it's getting stuck. Hoping to have some updates soon

Comment 6 Dave Chinner 2018-04-09 10:42:22 UTC

Hi carlos,

Can you post the notes/details of the analysis to the bz? That way everyone can follow along with the analysis you've done, otherwise we're completely in the dark about what the problem you're seeing is and can't actually help at all short of duplicating everything you're already doing...

Cheers,

Dave.

Comment 7 Carlos Maiolino 2018-04-09 13:57:26 UTC

Hi Dave,

(In reply to Dave Chinner from comment #6)
> Hi carlos,
> 
> Can you post the notes/details of the analysis to the bz? That way everyone
> can follow along with the analysis you've done, otherwise we're completely
> in the dark about what the problem you're seeing is and can't actually help
> at all short of duplicating everything you're already doing...

Yes, my apologies to have not updated it sooner, I was trying to get more data and have more details about it, but so far, what I tracked is:


- xlog_iodone runs for the xlog where the IO error occurred, and it falls into the shutdown case, calling xfs_do_force_shutdown, which will end up callign xfs_log_force_umount().

In xfs_log_force_umount(), xlog_state_ioerror() is called and all iclogs are properly marked with XLOG_STATE_IOERROR.

Then we fall into xlog_state_do_callback, which is supposed to wake up the processes sleeping in iclog->ic_force_wait of each iclog.

Most of the iclogs clean up are done properly, but every time, the last iclog on the list get stuck.

Its callback (xlog_cil_committed) is called from xlog_state_do_callback in:

cb->cb_func(cb->cb_arg, aborted), which never returns back to xlog_state_do_callback(), so, it can't wake up the threads sleeping in ic_force_wait.

xlog_cil_committed calls into xfs_trans_committed_bulk(), which will then unpin all the log items.

in the if (aborted) conditional, it calls into the iop_unpin() operation of the log items.

The last cil item which had been marked with XLOG_STATE_IOERROR above, will call its associated log item iop_unpin, which is xfs_buf_item_unpin() and get stuck there.


Into xfs_buf_item_unpin, we try to lock the buffer into:

} else if (freed && remove) {

       xfs_buf_lock()
.
.
}

And it stops trying to acquire the buffer's semaphore into xfs_buf_lock()

down(&bp->b_sema);


That's basically where I am now.

The buffer semaphore is held somewhere else (where I still don't know), preventing the force umount to continue and wake up the fstrim threas sleeping on the ic_force_wait.


I'll update once I have more information.

Cheers

Comment 8 Dave Chinner 2018-04-09 22:41:45 UTC

Carlos,

If we are hanging in xfs_buf_item_unpin() trying to lock a buffer, that implies we are holding the buffer locked while calling xfs_log_force(XFS_LOG_SYNC). XFS_LOG_SYNC should not be used whenever we hold metadata locked....

IOWs, I suspect the bug is in xfs_trim_extents(), where it reads and locks the AGF then does a synchronous log force. Moving the sync log force to before locking the AGF should avoid the hang.....

Cheers,

Dave.

Comment 9 Murphy Zhou 2018-04-26 00:51:01 UTC

we need acks here.

Comment 12 Bruno Meneguele 2018-05-02 13:10:02 UTC

Patch(es) available on kernel-3.10.0-879.el7

Comment 15 Zorro Lang 2018-08-17 13:43:03 UTC

Use scsi_debug to test this bug:
# modprobe scsi_debug dev_size_mb=2048 lbpu=1 lbpws=1



1) Reproduced this bug on kernel 878.el7, which has both bug 1563697 and this bug:
# ./check generic/500
FSTYP         -- xfs (non-debug)
PLATFORM      -- Linux/x86_64 ibm-x3550m3-05 3.10.0-878.el7.x86_64
MKFS_OPTIONS  -- -f -bsize=4096 /dev/sde
MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/sde /mnt/scratch

generic/500  ^C^C



2) Tested on kernel 920.el7 which hasn't fixed bug 1563697, but fixed this bug:

# ./check generic/500
FSTYP         -- xfs (non-debug)
PLATFORM      -- Linux/x86_64 ibm-x3550m3-05 3.10.0-920.el7.x86_64.debug
MKFS_OPTIONS  -- -f -bsize=4096 /dev/sde
MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/sde /mnt/scratch

generic/500     [failed, exit status 1]- output mismatch (see /root/git/xfstests-dev/results//generic/500.out.bad)
    --- tests/generic/500.out   2018-07-25 00:00:54.313201624 -0400
    +++ /root/git/xfstests-dev/results//generic/500.out.bad     2018-08-17 09:09:44.766198368 -0400
    @@ -1,2 +1,26 @@
     QA output created by 500
    -Silence is golden
    +fstrim: /mnt/scratch: FITRIM ioctl failed: Input/output error
    +fstrim: cannot open /mnt/scratch: Input/output error
    +fstrim: cannot open /mnt/scratch: Input/output error
    +fstrim: cannot open /mnt/scratch: Input/output error
    +fstrim: cannot open /mnt/scratch: Input/output error
    ...
    (Run 'diff -u tests/generic/500.out /root/git/xfstests-dev/results//generic/500.out.bad'  to see the entire diff)
Ran: generic/500
Failures: generic/500
Failed 1 of 1 tests

We can see XFS isn't blocked there even hit bug 1563697. So this bug has been fixed.

3) Finally, all test passed on kernel 933.el7:
# ./check generic/500
FSTYP         -- xfs (non-debug)
PLATFORM      -- Linux/x86_64 ibm-x3550m3-05 3.10.0-933.el7.x86_64
MKFS_OPTIONS  -- -f -bsize=4096 /dev/sde
MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/sde /mnt/scratch

generic/500      11s
Ran: generic/500
Passed all 1 tests

So Verify this bug now.

Comment 17 errata-xmlrpc 2018-10-30 08:54:57 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3083