Bug 484409 - XFS related deadlock on MP system
Summary: XFS related deadlock on MP system
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 10
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Eric Sandeen
QA Contact: Fedora Extras Quality Assurance
Reported: 2009-02-06 16:53 UTC by Jussi Eloranta
Modified: 2009-06-30 21:12 UTC (History)
Doc Type: Bug Fix
Last Closed: 2009-06-30 21:12:57 UTC

Attachments:
dmesg output (120.76 KB, text/plain), 2009-02-06 16:54 UTC, Jussi Eloranta

Description Jussi Eloranta 2009-02-06 16:53:37 UTC
Description of problem:

The problem occurred on:

Linux XXX 2.6.27.12-170.2.5.fc10.x86_64 #1 SMP Wed Jan 21 01:33:24 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

This is a 16-processor system (4 x quad-core AMD Opteron) with root on an md array (4 disks) and an XFS filesystem on top of that.

All access to the filesystem led to processes hanging. dmesg revealed that there is an XFS related deadlock. I am attaching the dmesg output. Unfortunately I could not get the early messages because the dmesg buffer rolled over.

Version-Release number of selected component (if applicable):

Fedora 10 - 64 bit

How reproducible:

Not sure. I ran a threaded program (using 16 threads) that was doing heavy file I/O. The hang happened overnight.

Comment 1 Jussi Eloranta 2009-02-06 16:54:29 UTC
Created attachment 331146
dmesg output

Comment 2 Eric Sandeen 2009-02-09 22:27:26 UTC
I suppose /var/log/messages doesn't have any more info?

Ok, 5 processes are blocked:

     15 INFO: task 0logwatch:2767 blocked for more than 120 seconds.
     15 INFO: task auditd:27611 blocked for more than 120 seconds.
     15 INFO: task molprop_2006_1_:1847 blocked for more than 120 seconds.
     15 INFO: task ntpd:27636 blocked for more than 120 seconds.
     15 INFO: task pdflush:691 blocked for more than 120 seconds.
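
A tally like the one above can be produced from a saved dmesg dump with a few lines of Python. This is only a sketch of one way to do it, not necessarily how the list above was generated; the function name `tally_hung_tasks` is made up for illustration, and the pattern assumes the hung-task message wording shown in this report.

```python
import re
from collections import Counter

# Matches the kernel's hung-task warning as it appears in this report, e.g.
# "INFO: task pdflush:691 blocked for more than 120 seconds."
HUNG_TASK = re.compile(r"INFO: task (\S+):(\d+) blocked for more than \d+ seconds")


def tally_hung_tasks(lines):
    """Count hung-task warnings per (task name, pid) pair."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

Passing an open file object for the dmesg attachment as `lines` returns a `Counter` keyed by task name and pid, which makes it easy to see how many distinct tasks are blocked and how often each was reported.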

4 of them are stuck behind pdflush ("?" functions removed):

INFO: task pdflush:691 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pdflush       D ffff8808228771c0     0   691      2
 ffff880a90db1cc0 0000000000000046 ffffe20021440c00 ffffe20021440c38
 ffffffff816e1500 ffffffff816e1500 ffff880c21d32e20 ffff8808229f9710
 ffff880c21d33198 0000000021440768 ffffe200214407a0 ffff880c21d33198
Call Trace:
 [<ffffffffa004dcae>] xfs_buf_wait_unpin+0x7e/0xa5 [xfs]
 [<ffffffffa004eade>] xfs_buf_iorequest+0x28/0x6c [xfs]
 [<ffffffffa0052a93>] xfs_bdstrat_cb+0x19/0x3b [xfs]
 [<ffffffffa004b1f7>] xfs_bwrite+0x5f/0xae [xfs]
 [<ffffffffa004682e>] xfs_syncsub+0x123/0x22f [xfs]
 [<ffffffffa004697c>] xfs_sync+0x42/0x47 [xfs]
 [<ffffffffa0053e88>] xfs_fs_write_super+0x23/0x2b [xfs]
 [<ffffffff810c1902>] sync_supers+0x71/0xc4
 [<ffffffff81095db9>] wb_kupdate+0x35/0x119
 [<ffffffff8109683f>] pdflush+0x16e/0x231
 [<ffffffff81054e9b>] kthread+0x49/0x76
 [<ffffffff810116e9>] child_rip+0xa/0x11
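
For reference, dropping the "?" entries from a raw trace can be scripted. The sketch below assumes the frame layout seen in this trace, with unreliable frames marked by a "? " in front of the symbol name; the names `FRAME` and `reliable_frames` are hypothetical.

```python
import re

# A frame line looks like:
#   " [<ffffffffa004dcae>] xfs_buf_wait_unpin+0x7e/0xa5 [xfs]"
# Frames the unwinder is unsure about carry "? " before the symbol name.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(trace_lines):
    """Return symbol names of frames that are not marked with '?'."""
    names = []
    for line in trace_lines:
        match = FRAME.search(line)
        if match and not match.group(1):
            names.append(match.group(2))
    return names
```

Applied to the pdflush trace, this yields the call chain from `xfs_buf_wait_unpin` down to `child_rip` in the order printed.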

At first glance, this is XFS waiting for an I/O completion. Either XFS got its pin counting wrong or the storage lost an I/O, perhaps. I've not seen this sort of hang before, or at least not recently...

It'd be nice to know if there are any storage errors.  Perhaps a serial console or remote syslog would be good in case this happens again, to gather more info?
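
The two possibilities raised in this comment (a miscount, or a completion that never arrives) both leave a waiter sleeping on a count that never reaches zero. The toy model below illustrates that general pattern only. It is not XFS code, and the class `PinnedBuffer` and its methods are invented for the illustration.

```python
import threading


class PinnedBuffer:
    """Toy stand-in for a buffer that must not be written while pinned."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pin_count = 0

    def pin(self):
        with self._cond:
            self._pin_count += 1

    def unpin(self):
        with self._cond:
            self._pin_count -= 1
            if self._pin_count == 0:
                # Wake anyone waiting for the buffer to become unpinned.
                self._cond.notify_all()

    def wait_unpin(self, timeout):
        """Return True once the pin count is zero, False if the wait times out."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._pin_count == 0, timeout=timeout
            )
```

If a buffer is pinned twice but only one `unpin()` ever arrives, `wait_unpin()` can only time out. A waiter without a timeout would sleep forever, which is what the hung-task warning reports for pdflush.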

Comment 3 Jussi Eloranta 2009-02-10 16:18:09 UTC
I haven't seen any I/O related errors on this system. But maybe something is about to break down; who knows. As this is not easy to reproduce, I will just have to wait until it happens again and try to get more info. I will also try to set up remote syslog.

Comment 4 Eric Sandeen 2009-02-10 17:53:30 UTC
Thanks - I think there have been lost I/O completion issues with md in the past, though very rare... I'd probably chalk this up to that, but I know that's not a very satisfying answer...

Comment 5 Eric Sandeen 2009-06-30 21:00:06 UTC
Have you seen this since? If not, I'll chalk it up to bogons in md and close...

Comment 6 Jussi Eloranta 2009-06-30 21:08:03 UTC
I have not seen it any more - maybe it has been fixed...

Comment 7 Eric Sandeen 2009-06-30 21:12:57 UTC
Ok, I'm going to close it based on the age & lack of info we have about the problem, but if you see it again, please feel free to re-open.

Thanks,
-Eric

