Bug 690922

Summary: GFS2 lock problem (deadlock?) with samba
Product: [Fedora] Fedora
Reporter: Vladislav Bogdanov <bubble>
Component: kernel
Assignee: Abhijith Das <adas>
Status: CLOSED WONTFIX
Severity: unspecified
Priority: unspecified
Version: 13
CC: adas, anprice, bmarzins, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, rpeterso, swhiteho
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2011-06-27 11:56:09 UTC
Attachments:
Description Flags
Archive with requested dumps from both cluster nodes none

Description Vladislav Bogdanov 2011-03-25 20:10:54 UTC
I have Samba running under CTDB, serving files from GFS2 filesystems built on DRBD devices to 20 clients. CTDB runs as a Pacemaker clone resource, the cluster stack is OpenAIS, and DRBD runs in active-active mode (protocol C).

Today I faced the following:
Mar 25 15:11:42 s01-0 kernel: INFO: task smbd:5510 blocked for more than 120 seconds.
Mar 25 15:11:42 s01-0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 15:11:42 s01-0 kernel: smbd          D 0000000000000005     0  5510  18255 0x00000084
Mar 25 15:11:42 s01-0 kernel: ffff88001b34d7b8 0000000000000086 ffff88001b34d768 ffffffffa025f1ee
Mar 25 15:11:42 s01-0 kernel: ffff88001b34dfd8 ffff8800273f8000 00000000000153c0 ffff88001b34dfd8
Mar 25 15:11:42 s01-0 kernel: 00000000000153c0 00000000000153c0 00000000000153c0 00000000000153c0
Mar 25 15:11:42 s01-0 kernel: Call Trace:
Mar 25 15:11:42 s01-0 kernel: [<ffffffffa025f1ee>] ? request_lock+0x97/0xa7 [dlm]
Mar 25 15:11:42 s01-0 kernel: [<ffffffffa02dd90e>] ? gfs2_glock_holder_wait+0x0/0x12 [gfs2]
Mar 25 15:11:42 s01-0 kernel: [<ffffffffa02dd91c>] gfs2_glock_holder_wait+0xe/0x12 [gfs2]
Mar 25 15:11:42 s01-0 kernel: [<ffffffff8144c3b7>] __wait_on_bit+0x48/0x7b
Mar 25 15:11:42 s01-0 kernel: [<ffffffff8144c458>] out_of_line_wait_on_bit+0x6e/0x79
Mar 25 15:11:42 s01-0 kernel: [<ffffffffa02dd90e>] ? gfs2_glock_holder_wait+0x0/0x12 [gfs2]
Mar 25 15:11:42 s01-0 kernel: [<ffffffff81066298>] ? wake_bit_function+0x0/0x33
Mar 25 15:11:42 s01-0 kernel: [<ffffffffa02df470>] wait_on_bit.clone.1+0x1e/0x20 [gfs2]
Mar 25 15:11:42 s01-0 kernel: [<ffffffffa02df4f8>] gfs2_glock_wait+0x3e/0x46 [gfs2]
Mar 25 15:11:42 s01-0 kernel: [<ffffffffa02df797>] gfs2_glock_nq+0x297/0x2a6 [gfs2]
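
When many reports like the one above pile up in the log, it can help to summarize which tasks the kernel flagged as blocked. The following is only a minimal Python sketch, assuming syslog lines in the same format as the excerpt above; the function names `blocked_tasks` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Matches the hung-task header line emitted by the kernel, e.g.
# "INFO: task smbd:5510 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def blocked_tasks(lines):
    """Return (command, pid, seconds) for every hung-task report found in lines."""
    found = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            found.append(
                (match.group("comm"), int(match.group("pid")), int(match.group("secs")))
            )
    return found


def summarize(lines):
    """Count hung-task reports per command name."""
    return Counter(comm for comm, _pid, _secs in blocked_tasks(lines))
```

Feeding it the lines of a log file (for example `summarize(open(path))`, where `path` points at the relevant syslog file) gives a per-command count of hung-task reports.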

All smbd processes were stuck in the iowait state and could not be killed. Rebooting only the node that held the public IP address used for SMB serving didn't help; only rebooting both cluster nodes did.

Earlier I saw something very similar with nfsd, which was one of the reasons I gave up on NFS and switched to Samba.

Version-Release number of selected component (if applicable):
kernel-2.6.34.8-68.fc13.x86_64
clusterlib-3.1.1-1.3.fc13.x86_64

Locally-built packages:
corosync-1.3.0
openais-1.1.4
dlm-pcmk-3.0.17
gfs-pcmk-3.0.17
kmod-drbd-8.3.10

What else can I provide to help diagnose this?

Comment 1 Steve Whitehouse 2011-03-28 12:42:50 UTC
A glock dump would be useful, since it appears to be waiting for a glock to be granted. Also a dlm lock dump and a sysrq-t are the next most important items for debugging purposes.
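
For anyone gathering the same data, here is a rough Python sketch of how these dumps might be collected. It rests on assumptions that do not come from this report: that debugfs is mounted at /sys/kernel/debug, that each mounted GFS2 filesystem has a `gfs2/<name>/glocks` file there, and that DLM exposes `dlm/<lockspace>_locks` files. The helper names `collect_lock_dumps` and `trigger_task_dump` are hypothetical.

```python
from pathlib import Path


def collect_lock_dumps(debugfs_root, out_dir):
    """Copy GFS2 glock dumps and DLM lock dumps from debugfs into out_dir.

    Returns the list of files written. The directory layout is an assumption
    (see the text above); adjust the patterns if your kernel differs.
    """
    root = Path(debugfs_root)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []

    # Assumed layout: one directory per mounted GFS2 filesystem with a "glocks" file.
    for glocks in sorted(root.glob("gfs2/*/glocks")):
        fsname = glocks.parent.name.replace(":", "_")
        dest = out / f"glocks_{fsname}.txt"
        dest.write_bytes(glocks.read_bytes())
        written.append(dest)

    # Assumed layout: one "<lockspace>_locks" file per DLM lockspace.
    for locks in sorted(root.glob("dlm/*_locks")):
        dest = out / f"dlm_{locks.name}.txt"
        dest.write_bytes(locks.read_bytes())
        written.append(dest)

    return written


def trigger_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Request a sysrq-t task dump; the output goes to the kernel log."""
    Path(trigger_path).write_text("t")
```

Run as root on each node, something like `collect_lock_dumps("/sys/kernel/debug", "/tmp/dumps")` followed by `trigger_task_dump()` would gather the three items; the task dump itself then has to be read from the kernel log.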

Comment 2 Vladislav Bogdanov 2011-03-31 11:12:15 UTC
Created attachment 489036
Archive with requested dumps from both cluster nodes

Hi!

The problem just appeared again. I've put what you requested into the attached tarball.

The problem began on node s01-1; then the virtual IP that serves one of the exported filesystems was moved to node s01-0, which behaved similarly.

I was able to obtain sysrq-t output from only one node (s01-0); the other rebooted (by watchdog or by a fencing operation) just before I was ready to run echo t > /proc/sysrq-trigger.

Comment 3 Steve Whitehouse 2011-05-20 16:31:34 UTC
I've just posted a patch to cluster-devel which may help resolve this issue. If not then I'll have another look at this.

Comment 4 Vladislav Bogdanov 2011-05-20 17:13:09 UTC
Can you please be more precise?
Currently I see only https://www.redhat.com/archives/cluster-devel/2011-May/msg00061.html from today, but I'm not sure this is it.

Comment 5 Steve Whitehouse 2011-05-20 17:36:42 UTC
Yes, that is it, sorry for not including the URL. I'm currently waiting for Linus to pull the current -nmw tree, so I'll send it along as soon as that's done.

Comment 6 Vladislav Bogdanov 2011-05-20 17:48:20 UTC
I just finished moving all the clusters to GFS-less operation today and plan to put them into production next week :)
I'll try to set up a cluster in virtual machines and do some testing next month.
The problem I've been experiencing is not easily reproducible.
BTW, is it possible to include this fix in F13 before its EOL? It would then be much easier for me to test.

Comment 7 Steve Whitehouse 2011-05-21 13:39:43 UTC
F13 is likely to be EOL in only a month or so at most, so I'm afraid we will not be backporting to this particular release.

Last night Linus pulled the current GFS2 -nmw git tree, so I'm now updating it and I'll probably land up sending another pull request for the fixes fairly shortly.

Comment 9 Bug Zapper 2011-05-30 10:52:55 UTC
This message is a reminder that Fedora 13 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 13.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 13 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 10 Bug Zapper 2011-06-27 11:56:09 UTC
Fedora 13 changed to end-of-life (EOL) status on 2011-06-25. Fedora 13 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.