Bug 620017

Summary: GFS2 locks out entire cluster when mounted through fstab with the ACL option
Product: Red Hat Enterprise Linux 5
Reporter: Igor Smitran <viruslaki>
Component: gfs2-utils
Assignee: Robert Peterson <rpeterso>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Priority: low
Version: 5.7
CC: adas, bmarzins, edamato, swhiteho
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-10-06 10:00:03 UTC

Description Igor Smitran 2010-07-31 12:16:48 UTC
Description of problem:

A two-node cluster with a quorum disk, with GFS2 mounted through fstab with these options:
/dev/vg1/cluster /mnt/cluster gfs2
noatime,nodiratime,nosuid,noexec,acl,errors=panic 0 0
Everything works OK for a few hours (there is no exact time; it happens
suddenly), and then GFS2 locks up and the entire cluster is locked with it.
A standard restart doesn't help, only reboot -f -n. After removing acl from
fstab the cluster works without problems (currently my cluster has been up
for 25 hours).
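
For reference, a minimal sketch of how to check which options the filesystem is
actually mounted with and how to cycle the mount after dropping acl from fstab
(device and mount point taken from the report above; this only illustrates the
workaround, it is not a verified fix):

  # show the options gfs2 is currently mounted with
  grep gfs2 /proc/mounts

  # after removing "acl" from the fstab entry, cycle the mount
  # (a clean umount may itself hang if the lockup has already happened)
  umount /mnt/cluster
  mount /mnt/cluster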

Version-Release number of selected component (if applicable):
Red Hat 5.5 (CentOS), up to date with the latest updates
gfs2-utils-0.1.62-20.el5.x86_64
cman-2.0.115-34.el5.x86_64
rgmanager-2.0.52-6.el5.centos.x86_64
lvm2-cluster-2.02.56-7.el5_5.4

How reproducible:


Steps to Reproduce:
1. create a two-node cluster with a quorum disk
2. mount a GFS2 partition with the ACL option
3. use the GFS2 partition for a few hours with both nodes active (an example
   ACL workload is sketched below)
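
The report does not name a specific workload, so the loop below is only a
plausible way to keep exercising ACLs on the shared mount from both nodes;
the file names and the user in the ACL entry are assumptions made for
illustration:

  # run on each node against the shared GFS2 mount
  cd /mnt/cluster
  while true; do
      touch testfile.$(hostname)
      setfacl -m u:nobody:rw testfile.$(hostname)   # exercise the ACL write path
      getfacl testfile.$(hostname) > /dev/null      # exercise the ACL read path
      sleep 1
  done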

Actual results:
The cluster locks up after a few hours with errors such as the following:
Jul 29 21:49:22 kernel: INFO: task umount.gfs2:4084 blocked for more than 120 seconds.
Jul 29 21:49:22 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 29 21:49:22 kernel: umount.gfs2   D ffff810002363458     0  4084      1          3530 (NOTLB)
Jul 29 21:49:22 kernel:  ffff8100756dfc58 0000000000000082 0000000000000018 ffffffff884784f3
Jul 29 21:49:22 kernel:  0000000000000296 0000000000000007 ffff81007c9f7040 ffff81006f4837a0
Jul 29 21:49:22 kernel:  000000343ed7cb69 0000000000000952 ffff81007c9f7228 0000000088479e5a
Jul 29 21:49:22 kernel: Call Trace:
Jul 29 21:49:22 kernel:  [<ffffffff884784f3>] :dlm:request_lock+0x93/0xa0
Jul 29 21:49:22 kernel:  [<ffffffff80064cd1>] __reacquire_kernel_lock+0x2c/0x45
Jul 29 21:49:22 kernel:  [<ffffffff884a3ee7>] :gfs2:just_schedule+0x0/0xe
Jul 29 21:49:22 kernel:  [<ffffffff884a3ef0>] :gfs2:just_schedule+0x9/0xe
Jul 29 21:49:22 kernel:  [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
Jul 29 21:49:22 kernel:  [<ffffffff884a3ee7>] :gfs2:just_schedule+0x0/0xe
Jul 29 21:49:22 kernel:  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
Jul 29 21:49:22 kernel:  [<ffffffff800a0a06>] wake_bit_function+0x0/0x23
Jul 29 21:49:22 kernel:  [<ffffffff884a3ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30
Jul 29 21:49:22 kernel:  [<ffffffff884ba5f2>] :gfs2:gfs2_statfs_sync+0x3f/0x165
Jul 29 21:49:22 kernel:  [<ffffffff884ba5ea>] :gfs2:gfs2_statfs_sync+0x37/0x165
Jul 29 21:49:22 kernel:  [<ffffffff884b64b5>] :gfs2:gfs2_quota_sync+0x253/0x268
Jul 29 21:49:22 kernel:  [<ffffffff884b3a79>] :gfs2:gfs2_make_fs_ro+0x27/0x98
Jul 29 21:49:22 kernel:  [<ffffffff800a0758>] kthread_stop+0x7a/0x80
Jul 29 21:49:22 kernel:  [<ffffffff884b3c16>] :gfs2:gfs2_put_super+0x6e/0x187
Jul 29 21:49:22 kernel:  [<ffffffff800e3e50>] generic_shutdown_super+0x79/0xfb
Jul 29 21:49:22 kernel:  [<ffffffff800e3f03>] kill_block_super+0x31/0x45
Jul 29 21:49:22 kernel:  [<ffffffff884b0116>] :gfs2:gfs2_kill_sb+0x63/0x76
Jul 29 21:49:22 kernel:  [<ffffffff800e3fd1>] deactivate_super+0x6a/0x82
Jul 29 21:49:22 kernel:  [<ffffffff800eddc3>] sys_umount+0x245/0x27b
Jul 29 21:49:22 kernel:  [<ffffffff800988b7>] recalc_sigpending+0xe/0x25
Jul 29 21:49:22 kernel:  [<ffffffff8001dca6>] sigprocmask+0xb7/0xdb
Jul 29 21:49:22 kernel:  [<ffffffff80030377>] sys_rt_sigprocmask+0xc0/0xd9
Jul 29 21:49:22 kernel:  [<ffffffff8005d116>] system_call+0x7e/0x83

This is only one example of the error; it is not the only one. Any operation
on the filesystem (ls, cd, cp, ...) will lock the cluster down.

Expected results:
The cluster should work with the ACL mount option.

Additional info:

Comment 1 Robert Peterson 2010-08-02 14:18:23 UTC
In this call trace, the system is trying to unmount the gfs2
mount point, and that is waiting for dlm to send it a response
to a lock request.  Since it's hung, dlm is probably stuck for
another reason, like a prior failure.  Unfortunately, we don't
have any information on any prior failure.

I suspect that GFS2 is either hung due to another bug, or it
encountered an error that caused it to panic due to errors=panic.
In either case we may have already solved the problem but it
hasn't made its way to your system yet.  So here's what I
recommend:

1. First, make sure you have a way to monitor the consoles of your
   nodes.
2. Temporarily adjust your "post_fail_delay" to a large value and reboot
   your cluster (an example is sketched after this list).
3. Recreate the hang.
4. Check to make sure both systems are still up and running after
   the hang.
5. Check dmesg to see if there are any indications of a failure
   on either node.
6. If there aren't any indications of failure on either node,
   use sysrq-t to collect complete call traces from both nodes
   (see the sketch after this list).
7. Attach the call trace output or syslog from both nodes to the
   bugzilla.
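
For steps 2 and 6, a minimal sketch of what this could look like; the
post_fail_delay value is an arbitrary example, and the fragment assumes the
stock fence_daemon syntax in /etc/cluster/cluster.conf:

  # step 2: temporarily raise post_fail_delay in /etc/cluster/cluster.conf,
  # for example (600 seconds is only an example value):
  #   <fence_daemon post_fail_delay="600" post_join_delay="3"/>
  # then propagate the new cluster.conf and reboot the cluster.

  # step 6: after the hang, dump call traces for all tasks on each node
  echo 1 > /proc/sys/kernel/sysrq
  echo t > /proc/sysrq-trigger
  # the traces land in dmesg / /var/log/messages; attach that output here.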

Comment 2 Steve Whitehouse 2010-09-22 10:22:28 UTC
Igor, without further information we are unable to locate the source of this issue. If no further information is available, we'll have to close this bug, I'm afraid.