Bug 484409

Summary: XFS related deadlock on MP system
Product: Fedora
Component: kernel
Version: 10
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: low
Reporter: Jussi Eloranta <eloranta>
Assignee: Eric Sandeen <esandeen>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: kernel-maint, quintela
Last Closed: 2009-06-30 21:12:57 UTC
Attachments: dmesg output

Description Jussi Eloranta 2009-02-06 16:53:37 UTC
Description of problem:

The system is running

Linux XXX 2.6.27.12-170.2.5.fc10.x86_64 #1 SMP Wed Jan 21 01:33:24 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

and is a 16-processor machine (4 x quad-core AMD Opteron) with root on an md array (4 disks) and an XFS filesystem on top of that.

All access to the filesystem led to processes hanging, and dmesg revealed an XFS-related deadlock. I am attaching the dmesg output. Unfortunately I could not capture the earliest messages because the dmesg buffer had rolled over.

Version-Release number of selected component (if applicable):

Fedora 10 - 64 bit

How reproducible:

Not sure. I ran a threaded program (using 16 threads) that was doing heavy file I/O; the hang happened overnight.
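
For illustration only (the actual test program is not attached), a comparable 16-way heavy-I/O load could be generated with something like the following sketch; the target path is a placeholder:

     # 16 concurrent writers doing heavy buffered I/O on the XFS filesystem
     for i in $(seq 1 16); do
         dd if=/dev/zero of=/path/on/xfs/stress.$i bs=1M count=4096 &
     done
     wait
     sync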

Comment 1 Jussi Eloranta 2009-02-06 16:54:29 UTC
Created attachment 331146 [details]
dmesg output

Comment 2 Eric Sandeen 2009-02-09 22:27:26 UTC
I suppose /var/log/messages doesn't have any more info?

Ok, 5 processes are blocked:

     15 INFO: task 0logwatch:2767 blocked for more than 120 seconds.
     15 INFO: task auditd:27611 blocked for more than 120 seconds.
     15 INFO: task molprop_2006_1_:1847 blocked for more than 120 seconds.
     15 INFO: task ntpd:27636 blocked for more than 120 seconds.
     15 INFO: task pdflush:691 blocked for more than 120 seconds.

4 of them are stuck behind pdflush ("?" functions removed):

INFO: task pdflush:691 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pdflush       D ffff8808228771c0     0   691      2
 ffff880a90db1cc0 0000000000000046 ffffe20021440c00 ffffe20021440c38
 ffffffff816e1500 ffffffff816e1500 ffff880c21d32e20 ffff8808229f9710
 ffff880c21d33198 0000000021440768 ffffe200214407a0 ffff880c21d33198
Call Trace:
 [<ffffffffa004dcae>] xfs_buf_wait_unpin+0x7e/0xa5 [xfs]
 [<ffffffffa004eade>] xfs_buf_iorequest+0x28/0x6c [xfs]
 [<ffffffffa0052a93>] xfs_bdstrat_cb+0x19/0x3b [xfs]
 [<ffffffffa004b1f7>] xfs_bwrite+0x5f/0xae [xfs]
 [<ffffffffa004682e>] xfs_syncsub+0x123/0x22f [xfs]
 [<ffffffffa004697c>] xfs_sync+0x42/0x47 [xfs]
 [<ffffffffa0053e88>] xfs_fs_write_super+0x23/0x2b [xfs]
 [<ffffffff810c1902>] sync_supers+0x71/0xc4
 [<ffffffff81095db9>] wb_kupdate+0x35/0x119
 [<ffffffff8109683f>] pdflush+0x16e/0x231
 [<ffffffff81054e9b>] kthread+0x49/0x76
 [<ffffffff810116e9>] child_rip+0xa/0x11

At first glance, this is XFS waiting for an I/O completion.  Either XFS got the counting wrong or the storage lost an I/O, perhaps.  I've not seen this sort of hang before, or at least not recently... 

It'd be nice to know if there are any storage errors.  Perhaps a serial console or remote syslog would be good in case this happens again, to gather more info?
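
For reference, a minimal remote-syslog setup on Fedora 10's stock rsyslog might look roughly like this (loghost is a placeholder hostname; this is only a sketch, not a verified config):

     # On the affected machine: forward all messages to a remote loghost over UDP
     echo '*.* @loghost:514' >> /etc/rsyslog.conf
     service rsyslog restart

     # On the loghost, enable UDP reception in /etc/rsyslog.conf:
     #   $ModLoad imudp
     #   $UDPServerRun 514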

Comment 3 Jussi Eloranta 2009-02-10 16:18:09 UTC
I haven't seen any I/O-related errors on this system. But maybe something is about to break down - who knows. As it is not easy to reproduce this, I will just have to wait until it happens again and try to get more info. Also, I will try to set up remote syslog.

Comment 4 Eric Sandeen 2009-02-10 17:53:30 UTC
Thanks - I think there have been lost I/O completion issues with md in the past, though very rare. I'd probably chalk this up to that, but I know that's not a very satisfying answer...

Comment 5 Eric Sandeen 2009-06-30 21:00:06 UTC
Have you seen this since?  If not, I'll chalk it up to bogons in md and close...

Comment 6 Jussi Eloranta 2009-06-30 21:08:03 UTC
I have not seen it anymore - maybe it has been fixed...

Comment 7 Eric Sandeen 2009-06-30 21:12:57 UTC
Ok, I'm going to close it based on the age & lack of info we have about the problem, but if you see it again, please feel free to re-open.

Thanks,
-Eric