Description of problem: Two node cluster and quorum disk with GFS2 mounted through fstab with these options: /dev/vg1/cluster /mnt/cluster gfs2 noatime,nodiratime,nosuid,noexec,acl,errors=panic 0 0 everything works ok for few hours (no exact time, it happens suddenly) and then GFS2 locks out and entire cluster is locked at that time. standard restart doesn't help, just reboot -f -n. after removing acl from fstab cluster works without problems (currently my cluster is active for 25 hours). Version-Release number of selected component (if applicable): redhat 5.5 (centos) up2date with latest updates gfs2-utils-0.1.62-20.el5.x86_64 cman-2.0.115-34.el5.x86_64 rgmanager-2.0.52-6.el5.centos.x86_64 lvm2-cluster-2.02.56-7.el5_5.4 How reproducible: Steps to Reproduce: 1. create two node cluster with quorum disk 2. mount GFS2 partition with ACL option 3. use GFS2 partition for few hours with both nodes active Actual results: cluster locks with errors after few hours: Jul 29 21:49:22 kernel: INFO: task umount.gfs2:4084 blocked for more than 120 seconds. Jul 29 21:49:22 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jul 29 21:49:22 kernel: umount.gfs2 D ffff810002363458 0 4084 1 3530 (NOTLB) Jul 29 21:49:22 kernel: ffff8100756dfc58 0000000000000082 0000000000000018 ffffffff884784f3 Jul 29 21:49:22 kernel: 0000000000000296 0000000000000007 ffff81007c9f7040 ffff81006f4837a0 Jul 29 21:49:22 kernel: 000000343ed7cb69 0000000000000952 ffff81007c9f7228 0000000088479e5a Jul 29 21:49:22 kernel: Call Trace: Jul 29 21:49:22 kernel: [<ffffffff884784f3>] :dlm:request_lock+0x93/0xa0 Jul 29 21:49:22 kernel: [<ffffffff80064cd1>] __reacquire_kernel_lock+0x2c/0x45 Jul 29 21:49:22 kernel: [<ffffffff884a3ee7>] :gfs2:just_schedule+0x0/0xe Jul 29 21:49:22 kernel: [<ffffffff884a3ef0>] :gfs2:just_schedule+0x9/0xe Jul 29 21:49:22 kernel: [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e Jul 29 21:49:22 kernel: [<ffffffff884a3ee7>] :gfs2:just_schedule+0x0/0xe Jul 29 21:49:22 kernel: [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78 Jul 29 21:49:22 kernel: [<ffffffff800a0a06>] wake_bit_function+0x0/0x23 Jul 29 21:49:22 kernel: [<ffffffff884a3ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30 Jul 29 21:49:22 kernel: [<ffffffff884ba5f2>] :gfs2:gfs2_statfs_sync+0x3f/0x165 Jul 29 21:49:22 kernel: [<ffffffff884ba5ea>] :gfs2:gfs2_statfs_sync+0x37/0x165 Jul 29 21:49:22 kernel: [<ffffffff884b64b5>] :gfs2:gfs2_quota_sync+0x253/0x268 Jul 29 21:49:22 kernel: [<ffffffff884b3a79>] :gfs2:gfs2_make_fs_ro+0x27/0x98 Jul 29 21:49:22 kernel: [<ffffffff800a0758>] kthread_stop+0x7a/0x80 Jul 29 21:49:22 kernel: [<ffffffff884b3c16>] :gfs2:gfs2_put_super+0x6e/0x187 Jul 29 21:49:22 kernel: [<ffffffff800e3e50>] generic_shutdown_super+0x79/0xfb Jul 29 21:49:22 kernel: [<ffffffff800e3f03>] kill_block_super+0x31/0x45 Jul 29 21:49:22 kernel: [<ffffffff884b0116>] :gfs2:gfs2_kill_sb+0x63/0x76 Jul 29 21:49:22 kernel: [<ffffffff800e3fd1>] deactivate_super+0x6a/0x82 Jul 29 21:49:22 kernel: [<ffffffff800eddc3>] sys_umount+0x245/0x27b Jul 29 21:49:22 kernel: [<ffffffff800988b7>] recalc_sigpending+0xe/0x25 Jul 29 21:49:22 kernel: [<ffffffff8001dca6>] sigprocmask+0xb7/0xdb Jul 29 21:49:22 kernel: [<ffffffff80030377>] sys_rt_sigprocmask+0xc0/0xd9 Jul 29 21:49:22 kernel: [<ffffffff8005d116>] system_call+0x7e/0x83 this only an example of error, this is not the only error, anything will lock the cluster down (ls, cd, cp...) Expected results: cluster should work with ACL mount option Additional info:
In this call trace, the system is trying to unmount the gfs2 mount point, and that is waiting for dlm to send it a response to a lock request. Since it's hung, dlm is probably stuck for another reason, like a prior failure. Unfortunately, we don't have any information on any prior failure. I suspect that GFS2 is either hung due to another bug, or it encountered an error that caused it to panic due to errors=panic. In either case we may have already solved the problem but it hasn't made its way to your system yet. So here's what I recommend: 1. First, make sure you have a way to monitor the consoles of your nodes. 2. Adjust your "post_fail_delay" to a large value temporarily and reboot your cluster. 3. Recreate the hang. 4. Check to make sure both systems are still up and running after the hang. 5. Check dmesg to see if there are any indications of a failure on either node. 6. If there aren't any indications of failure on either node, use sysrq-t to collect complete call traces from both nodes. 7. Attach the call trace output or syslog from both nodes to the bugzilla.
Igor, without further information we are unable to locate the source of this issue. If no further information is available we'll have to close this bug I'm afraid.