484409 – XFS related deadlock on MP system

Summary: XFS related deadlock on MP system

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	10
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Eric Sandeen
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-02-06 16:53 UTC by Jussi Eloranta
Modified:	2009-06-30 21:12 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-06-30 21:12:57 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
dmesg output (120.76 KB, text/plain) 2009-02-06 16:54 UTC, Jussi Eloranta

Description Jussi Eloranta 2009-02-06 16:53:37 UTC

Description of problem:

The system runs:

Linux XXX 2.6.27.12-170.2.5.fc10.x86_64 #1 SMP Wed Jan 21 01:33:24 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

It is a 16-processor system (4 x quad-core AMD Opteron) with the root filesystem on XFS, on top of a 4-disk md array.

Every access to the filesystem caused the accessing process to hang. dmesg shows what looks like an XFS-related deadlock. I am attaching the dmesg output. Unfortunately I could not get the early messages because the dmesg buffer had already wrapped.

Version-Release number of selected component (if applicable):

Fedora 10 - 64 bit

How reproducible:

Not sure. I was running a multithreaded program (16 threads) that was doing heavy file I/O. The hang happened overnight.

Comment 1 Jussi Eloranta 2009-02-06 16:54:29 UTC

Created attachment 331146
dmesg output

Comment 2 Eric Sandeen 2009-02-09 22:27:26 UTC

I suppose /var/log/messages doesn't have any more info?

Ok, 5 processes are blocked:

     15 INFO: task 0logwatch:2767 blocked for more than 120 seconds.
     15 INFO: task auditd:27611 blocked for more than 120 seconds.
     15 INFO: task molprop_2006_1_:1847 blocked for more than 120 seconds.
     15 INFO: task ntpd:27636 blocked for more than 120 seconds.
     15 INFO: task pdflush:691 blocked for more than 120 seconds.
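
A per-task tally like the one above can be reproduced from a saved dmesg log. The following is a minimal sketch, not the command actually used in this comment; the function name `count_hung_tasks`, the regular expression, and the sample text are illustrative assumptions.

```python
import re
from collections import Counter

# Matches the hung-task watchdog line, e.g.
# "INFO: task pdflush:691 blocked for more than 120 seconds."
HUNG_RE = re.compile(r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds")


def count_hung_tasks(dmesg_text):
    """Return a Counter mapping (task name, pid) to the number of reports."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = HUNG_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Hypothetical sample input; a real run would read the attached dmesg file.
sample = """\
INFO: task pdflush:691 blocked for more than 120 seconds.
INFO: task ntpd:27636 blocked for more than 120 seconds.
INFO: task pdflush:691 blocked for more than 120 seconds.
"""

for (name, pid), n in sorted(count_hung_tasks(sample).items()):
    print(f"{n:7d} INFO: task {name}:{pid}")
```

Keying on both name and pid keeps two different processes with the same command name from being merged into one entry.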

4 of them are stuck behind pdflush ("?" functions removed):

INFO: task pdflush:691 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pdflush       D ffff8808228771c0     0   691      2
 ffff880a90db1cc0 0000000000000046 ffffe20021440c00 ffffe20021440c38
 ffffffff816e1500 ffffffff816e1500 ffff880c21d32e20 ffff8808229f9710
 ffff880c21d33198 0000000021440768 ffffe200214407a0 ffff880c21d33198
Call Trace:
 [<ffffffffa004dcae>] xfs_buf_wait_unpin+0x7e/0xa5 [xfs]
 [<ffffffffa004eade>] xfs_buf_iorequest+0x28/0x6c [xfs]
 [<ffffffffa0052a93>] xfs_bdstrat_cb+0x19/0x3b [xfs]
 [<ffffffffa004b1f7>] xfs_bwrite+0x5f/0xae [xfs]
 [<ffffffffa004682e>] xfs_syncsub+0x123/0x22f [xfs]
 [<ffffffffa004697c>] xfs_sync+0x42/0x47 [xfs]
 [<ffffffffa0053e88>] xfs_fs_write_super+0x23/0x2b [xfs]
 [<ffffffff810c1902>] sync_supers+0x71/0xc4
 [<ffffffff81095db9>] wb_kupdate+0x35/0x119
 [<ffffffff8109683f>] pdflush+0x16e/0x231
 [<ffffffff81054e9b>] kthread+0x49/0x76
 [<ffffffff810116e9>] child_rip+0xa/0x11
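
Removing the "?" entries from a trace can be automated. In kernel call traces, a "?" before the symbol marks a stack entry the unwinder could not confirm as part of the call chain. The sketch below is only an assumption about how such filtering might be done, not the method used here; `reliable_frames` and the "?" line in the sample are hypothetical.

```python
import re

# One trace entry: "[<address>]", an optional "?" marker, then "symbol+off/len".
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+(\?\s+)?(\S+)")


def reliable_frames(trace_text):
    """Return the symbols of trace entries that are not marked with '?'."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match and match.group(2) is None:
            frames.append(match.group(3))
    return frames


# The first and third lines come from the trace above; the "?" line is made up.
sample_trace = """\
 [<ffffffffa004dcae>] xfs_buf_wait_unpin+0x7e/0xa5 [xfs]
 [<ffffffff81054e9b>] ? kthread+0x0/0x76
 [<ffffffffa004eade>] xfs_buf_iorequest+0x28/0x6c [xfs]
"""

for symbol in reliable_frames(sample_trace):
    print(symbol)
```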

At first glance, this is XFS waiting for an I/O completion. Either XFS got the counting wrong or the storage lost an I/O, perhaps. I've not seen this sort of hang before, or at least not recently...

It'd be nice to know if there are any storage errors.  Perhaps a serial console or remote syslog would be good in case this happens again, to gather more info?

Comment 3 Jussi Eloranta 2009-02-10 16:18:09 UTC

I haven't seen any I/O related errors on this system. But maybe something is about to break down, who knows. As this is not easy to reproduce, I will just have to wait until it happens again and try to get more info then. I will also try to set up remote syslog.

Comment 4 Eric Sandeen 2009-02-10 17:53:30 UTC

Thanks - I think there have been lost IO completion issues w/ md in the past, though very rare.... I'd probably chalk this up to that, but I know that's not a very satisfying answer ...

Comment 5 Eric Sandeen 2009-06-30 21:00:06 UTC

Have you seen this since?  if not, I'll chalk it up to bogons in md and close ...

Comment 6 Jussi Eloranta 2009-06-30 21:08:03 UTC

I have not seen it since. Maybe it has been fixed...

Comment 7 Eric Sandeen 2009-06-30 21:12:57 UTC

Ok, I'm going to close it based on the age & lack of info we have about the problem, but if you see it again, please feel free to re-open.

Thanks,
-Eric
