Red Hat Bugzilla – Bug 484409
XFS related deadlock on MP system
Last modified: 2009-06-30 17:12:57 EDT
Description of problem:
Linux XXX 126.96.36.199-170.2.5.fc10.x86_64 #1 SMP Wed Jan 21 01:33:24 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
This is a 16-processor system (4 x quad-core AMD Opteron) with root on an md array (4 disks) and an XFS filesystem on top of that.
All access to the filesystem led to processes hanging. dmesg revealed that there is an XFS related deadlock. I am attaching the dmesg output. Unfortunately I could not get the early messages because the dmesg buffer rolled over.
Version-Release number of selected component (if applicable):
Fedora 10 - 64 bit
Not sure how to reproduce this. I ran a threaded program (using 16 threads) that was doing heavy file I/O. The hang happened overnight.
Created attachment 331146
I suppose /var/log/messages doesn't have any more info?
Ok, 5 processes are blocked:
15 INFO: task 0logwatch:2767 blocked for more than 120 seconds.
15 INFO: task auditd:27611 blocked for more than 120 seconds.
15 INFO: task molprop_2006_1_:1847 blocked for more than 120 seconds.
15 INFO: task ntpd:27636 blocked for more than 120 seconds.
15 INFO: task pdflush:691 blocked for more than 120 seconds.
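A per-task summary like the one above can be produced from saved dmesg or /var/log/messages text. The following is only a minimal sketch, not how the list above was actually generated; the function name `count_hung_tasks` and the regular expression are assumptions for illustration, based on the message format shown in this report.

```python
import re
from collections import Counter

# Matches the hung-task report line as it appears in this bug, e.g.
# "INFO: task pdflush:691 blocked for more than 120 seconds."
HUNG_RE = re.compile(r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds")


def count_hung_tasks(log_text):
    """Return a Counter mapping (task name, pid) to the number of reports seen."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HUNG_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

Calling `count_hung_tasks(open("dmesg.txt").read())` (the file name is hypothetical) gives one entry per blocked task, which makes it easy to see how many distinct processes are stuck.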
4 of them are stuck behind pdflush ("?" functions removed):
INFO: task pdflush:691 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pdflush D ffff8808228771c0 0 691 2
ffff880a90db1cc0 0000000000000046 ffffe20021440c00 ffffe20021440c38
ffffffff816e1500 ffffffff816e1500 ffff880c21d32e20 ffff8808229f9710
ffff880c21d33198 0000000021440768 ffffe200214407a0 ffff880c21d33198
[<ffffffffa004dcae>] xfs_buf_wait_unpin+0x7e/0xa5 [xfs]
[<ffffffffa004eade>] xfs_buf_iorequest+0x28/0x6c [xfs]
[<ffffffffa0052a93>] xfs_bdstrat_cb+0x19/0x3b [xfs]
[<ffffffffa004b1f7>] xfs_bwrite+0x5f/0xae [xfs]
[<ffffffffa004682e>] xfs_syncsub+0x123/0x22f [xfs]
[<ffffffffa004697c>] xfs_sync+0x42/0x47 [xfs]
[<ffffffffa0053e88>] xfs_fs_write_super+0x23/0x2b [xfs]
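The kernel marks stack entries it is not sure about with a "?" before the symbol name, and the trace above has those removed. A rough sketch of that filtering step, assuming the usual `[<address>] symbol+0xoff/0xlen [module]` line layout seen in this report; the helper name `reliable_frames` is made up for illustration:

```python
import re

# One stack entry: "[<addr>] [? ]symbol+0xoff/0xlen [module]"
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def reliable_frames(trace_text):
    """Return (symbol, module) pairs for frames not marked with '?'."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        # Skip lines that are not stack entries and entries flagged with "?".
        if match and not match.group("unreliable"):
            frames.append((match.group("sym"), match.group("module")))
    return frames
```

The module field is `None` for symbols in the core kernel, since those lines carry no `[module]` suffix.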
At first glance, this is XFS waiting for an I/O completion. Either XFS got the counting wrong or the storage lost an I/O, perhaps. I've not seen this sort of hang before, or at least not recently...
It'd be nice to know if there are any storage errors. Perhaps a serial console or remote syslog would be good in case this happens again, to gather more info?
I haven't seen any I/O-related errors on this system. But maybe something is about to break down, who knows. As this is not easy to reproduce, I will just have to wait until it happens again and try to get more info. I will also try to set up remote syslog.
Thanks - I think there have been lost I/O completion issues with md in the past, though very rare... I'd probably chalk this up to that, but I know that's not a very satisfying answer...
Have you seen this since? If not, I'll chalk it up to bogons in md and close...
I have not seen it anymore. Maybe it has been fixed...
Ok, I'm going to close it based on the age & lack of info we have about the problem, but if you see it again, please feel free to re-open.