I have Samba running under CTDB, serving files from GFS2 filesystems built on DRBD devices to 20 clients. CTDB runs as a Pacemaker clone resource. The cluster stack is OpenAIS. DRBD runs in active-active mode (protocol C). Today I hit the following:

Mar 25 15:11:42 s01-0 kernel: INFO: task smbd:5510 blocked for more than 120 seconds.
Mar 25 15:11:42 s01-0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 15:11:42 s01-0 kernel: smbd D 0000000000000005 0 5510 18255 0x00000084
Mar 25 15:11:42 s01-0 kernel: ffff88001b34d7b8 0000000000000086 ffff88001b34d768 ffffffffa025f1ee
Mar 25 15:11:42 s01-0 kernel: ffff88001b34dfd8 ffff8800273f8000 00000000000153c0 ffff88001b34dfd8
Mar 25 15:11:42 s01-0 kernel: 00000000000153c0 00000000000153c0 00000000000153c0 00000000000153c0
Mar 25 15:11:42 s01-0 kernel: Call Trace:
Mar 25 15:11:42 s01-0 kernel: [<ffffffffa025f1ee>] ? request_lock+0x97/0xa7 [dlm]
Mar 25 15:11:42 s01-0 kernel: [<ffffffffa02dd90e>] ? gfs2_glock_holder_wait+0x0/0x12 [gfs2]
Mar 25 15:11:42 s01-0 kernel: [<ffffffffa02dd91c>] gfs2_glock_holder_wait+0xe/0x12 [gfs2]
Mar 25 15:11:42 s01-0 kernel: [<ffffffff8144c3b7>] __wait_on_bit+0x48/0x7b
Mar 25 15:11:42 s01-0 kernel: [<ffffffff8144c458>] out_of_line_wait_on_bit+0x6e/0x79
Mar 25 15:11:42 s01-0 kernel: [<ffffffffa02dd90e>] ? gfs2_glock_holder_wait+0x0/0x12 [gfs2]
Mar 25 15:11:42 s01-0 kernel: [<ffffffff81066298>] ? wake_bit_function+0x0/0x33
Mar 25 15:11:42 s01-0 kernel: [<ffffffffa02df470>] wait_on_bit.clone.1+0x1e/0x20 [gfs2]
Mar 25 15:11:42 s01-0 kernel: [<ffffffffa02df4f8>] gfs2_glock_wait+0x3e/0x46 [gfs2]
Mar 25 15:11:42 s01-0 kernel: [<ffffffffa02df797>] gfs2_glock_nq+0x297/0x2a6 [gfs2]

All smbd processes were stuck in iowait state with no way to kill them. Rebooting the single node that held the public IP address used for SMB serving didn't help; only rebooting both cluster nodes did.
Earlier I saw very similar behavior with nfsd, which was one of the reasons I gave up on NFS and switched to Samba.

Version-Release number of selected component (if applicable):
kernel-2.6.34.8-68.fc13.x86_64
clusterlib-3.1.1-1.3.fc13.x86_64

Locally-built packages:
corosync-1.3.0
openais-1.1.4
dlm-pcmk-3.0.17
gfs-pcmk-3.0.17
kmod-drbd-8.3.10

What else can I provide to help diagnose this?
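For triage, a hung-task report like the one in this comment can be reduced to the blocked task and its call chain. The following is a minimal sketch, assuming syslog-prefixed kernel lines in the format quoted in this report; the names `parse_hung_tasks` and `reliable_functions` are hypothetical helpers, not part of any existing tool. Frames that the kernel marks with `?` are treated as unreliable.

```python
import re

# "INFO: task smbd:5510 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

# "[<ffffffffa02df4f8>] gfs2_glock_wait+0x3e/0x46 [gfs2]"
# A "?" before the symbol marks a frame the unwinder could not confirm.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<module>\w+)\])?"
)


def parse_hung_tasks(lines):
    """Return one dict per hung-task report found in the given log lines."""
    reports = []
    for line in lines:
        hung = HUNG_RE.search(line)
        if hung:
            reports.append({
                "comm": hung.group("comm"),
                "pid": int(hung.group("pid")),
                "timeout_secs": int(hung.group("secs")),
                "frames": [],  # (function, module or None, reliable?)
            })
            continue
        frame = FRAME_RE.search(line)
        if frame and reports:
            reports[-1]["frames"].append((
                frame.group("func"),
                frame.group("module"),
                frame.group("unreliable") is None,
            ))
    return reports


def reliable_functions(report):
    """Function names of the frames not marked with '?', innermost first."""
    return [func for func, _module, reliable in report["frames"] if reliable]
```

Applied to the trace above, the reliable frames run from `gfs2_glock_holder_wait` out to `gfs2_glock_nq`, which shows the task waiting for a GFS2 glock.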
A glock dump would be useful, since the task appears to be waiting for a glock to be granted. After that, a dlm lock dump and a sysrq-t are the next most important items for debugging.
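The glock and dlm dumps requested here are normally read from debugfs. The sketch below copies them into one directory so they can be archived per node. It assumes debugfs is mounted at /sys/kernel/debug, that each GFS2 filesystem exposes a `glocks` file under `gfs2/<fsname>/`, and that the dlm lockspace files sit under `dlm/`. The function name `collect_lock_dumps` and the output naming are hypothetical. It has to run as root on a live node. The sysrq-t task dump is triggered separately with `echo t > /proc/sysrq-trigger` and ends up in the kernel log.

```python
from pathlib import Path


def collect_lock_dumps(out_dir, debugfs_root="/sys/kernel/debug"):
    """Copy GFS2 glock dumps and dlm debugfs files into out_dir.

    Returns the list of files written. The debugfs layout is an
    assumption; adjust debugfs_root if debugfs is mounted elsewhere.
    """
    root = Path(debugfs_root)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # One glocks file per mounted GFS2 filesystem, plus every regular
    # file that the dlm module exposes for its lockspaces.
    sources = sorted(root.glob("gfs2/*/glocks"))
    sources += sorted(p for p in root.glob("dlm/*") if p.is_file())

    written = []
    for src in sources:
        # Flatten "gfs2/<fsname>/glocks" into "gfs2_<fsname>_glocks".
        dest = out / "_".join(src.relative_to(root).parts)
        # Read to EOF explicitly: debugfs files report a size of zero.
        dest.write_bytes(src.read_bytes())
        written.append(dest)
    return written
```

Running it on every cluster node at the moment of the hang matters, because a glock that is waiting on one node is usually held on another.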
Created attachment 489036
Archive with requested dumps from both cluster nodes

Hi! The problem just appeared again, and I have put what you requested into the attached tarball. The problem began on node s01-1; then the virtual IP that serves one of the exported filesystems was moved to node s01-0, which then behaved similarly. I was able to obtain a sysrq-t dump from only one node (s01-0); the other rebooted (by watchdog or by a fencing operation) just before I was ready to run echo t > /proc/sysrq-trigger.
I've just posted a patch to cluster-devel which may help resolve this issue. If not then I'll have another look at this.
Can you please be more precise? Currently I see only https://www.redhat.com/archives/cluster-devel/2011-May/msg00061.html from today, but I'm not sure this is it.
Yes, that is it, sorry for not including the URL. I'm currently waiting for Linus to pull the current -nmw tree, so I'll send it along as soon as that's done.
I just finished moving all clusters to GFS-less operation today and plan to put them into production next week :) . I'll try to create a cluster in virtual machines and do some testing next month. The problem I've been experiencing is not easily reproducible. BTW, is it possible to include this fix in F13 before its EOL? It would be much easier for me to test it then.
F13 is likely to be EOL in only a month or so at the most, so I'm afraid we will not be backporting to this particular release. Last night Linus pulled the current GFS2 -nmw git tree, so I'm now updating it and I'll probably end up sending another pull request for the fixes fairly shortly.
Upstream patch: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6d3117b41295150d4ac70622055dd8f5529d86b2
This message is a reminder that Fedora 13 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 13. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 13 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 13 changed to end-of-life (EOL) status on 2011-06-25. Fedora 13 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.