Description of problem:
From time to time clusterfs returns a generic error and restarts the service:

Oct 6 05:05:47 tefse-pro2 clurgmgrd[15993]: <notice> status on clusterfs "smtpout-gfs" returned 1 (generic error)
Oct 6 05:05:47 tefse-pro2 clurgmgrd[15993]: <notice> Stopping service smtpout2
Oct 6 05:05:52 tefse-pro2 clurgmgrd[15993]: <notice> Service smtpout2 is recovering
Oct 6 05:05:52 tefse-pro2 clurgmgrd[15993]: <notice> Recovering failed service smtpout2
Oct 6 05:05:53 tefse-pro2 kernel: EXT3-fs warning: checktime reached, running e2fsck is recommended
Oct 6 05:05:53 tefse-pro2 kernel: GFS: Trying to join cluster "lock_dlm", "tefcl-pro:prodatasmtpout"
Oct 6 05:05:55 tefse-pro2 kernel: GFS: fsid=tefcl-pro:prodatasmtpout.0: Joined cluster. Now mounting FS...
Oct 6 05:05:55 tefse-pro2 kernel: GFS: fsid=tefcl-pro:prodatasmtpout.0: jid=0: Trying to acquire journal lock...
Oct 6 05:05:55 tefse-pro2 kernel: GFS: fsid=tefcl-pro:prodatasmtpout.0: jid=0: Looking at journal...
Oct 6 05:05:55 tefse-pro2 kernel: GFS: fsid=tefcl-pro:prodatasmtpout.0: jid=0: Done
Oct 6 05:05:56 tefse-pro2 clurgmgrd[15993]: <notice> Service smtpout2 started

Version-Release number of selected component (if applicable):
GFS-6.1.6-1
GFS-kernel-2.6.9-60.3
GFS-kernel-smp-2.6.9-60.3
GFS-kernheaders-2.6.9-60.3
rgmanager-1.9.54-3.228823test
cman-kernel-2.6.9-45.8
cman-kernel-smp-2.6.9-45.8
dlm-kernel-2.6.9-44.3
dlm-kernel-smp-2.6.9-44.3
Linux tefse-pro2 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:28:02 EDT 2006 i686 i686 i386 GNU/Linux

How reproducible:
not known

Steps to Reproduce:
1.
2.
3.

Actual results:
status on clusterfs " " returned 1 (generic error)

Expected results:
A more precise error message, or the root cause fixed (if this is a bug in the cluster software).

Additional info:
cluster.conf would be needed.
(In reply to comment #1)
> cluster.conf would be needed.

<?xml version="1.0"?>
<cluster config_version="62" name="tefcl-pro">
  <fence_daemon post_fail_delay="0" post_join_delay="25"/>
  <clusternodes>
    <clusternode name="tefse-pro1" votes="1">
      <fence>
        <method name="1">
          <device name="tefse-pro1-ilo"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="tefse-pro2" votes="1">
      <fence>
        <method name="1">
          <device name="tefse-pro2-ilo"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_ilo" hostname="tefse-pro1-ilo" login="fence" name="tefse-pro1-ilo" passwd=""/>
    <fencedevice agent="fence_ilo" hostname="tefse-pro2-ilo" login="fence" name="tefse-pro2-ilo" passwd=""/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="tefsv-pro-fail" ordered="0" restricted="1">
        <failoverdomainnode name="tefse-pro2" priority="1"/>
        <failoverdomainnode name="tefse-pro1" priority="1"/>
      </failoverdomain>
      <failoverdomain name="tefsv-pro1-fail" restricted="1">
        <failoverdomainnode name="tefse-pro1" priority="1"/>
      </failoverdomain>
      <failoverdomain name="tefsv-pro2-fail" restricted="1">
        <failoverdomainnode name="tefse-pro2" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <clusterfs device="/dev/prodatasmtpin" force_unmount="1" fsid="44545" fstype="gfs" mountpoint="/opt/prodata/smtpin" name="smtpin-gfs" options=""/>
      <clusterfs device="/dev/prodatasmtpout" force_unmount="1" fsid="23380" fstype="gfs" mountpoint="/opt/prodata/smtpout" name="smtpout-gfs" options=""/>
    </resources>
    <service autostart="0" domain="tefsv-pro1-fail" name="smtpin1" recovery="restart">
      <fs device="/dev/prosmtpin1" force_fsck="0" force_unmount="1" fsid="20017" fstype="ext3" mountpoint="/opt/pro/smtpin1" name="smtpin1-fs" options="" self_fence="0"/>
      <fs device="/dev/propop31" force_fsck="0" force_unmount="1" fsid="7091" fstype="ext3" mountpoint="/opt/pro/pop31" name="pop31-fs" options="" self_fence="0"/>
      <script file="/opt/pro/smtpin1.init" name="smtpin1"/>
      <script file="/opt/pro/pop31.init" name="pop31"/>
      <clusterfs ref="smtpin-gfs"/>
    </service>
    <service autostart="0" domain="tefsv-pro2-fail" name="smtpin2" recovery="restart">
      <fs device="/dev/prosmtpin2" force_fsck="0" force_unmount="1" fsid="57823" fstype="ext3" mountpoint="/opt/pro/smtpin2" name="smtpin2-fs" options="" self_fence="0"/>
      <fs device="/dev/propop32" force_fsck="0" force_unmount="1" fsid="46172" fstype="ext3" mountpoint="/opt/pro/pop32" name="pop32-fs" options="" self_fence="0"/>
      <script file="/opt/pro/smtpin2.init" name="smtpin2"/>
      <script file="/opt/pro/pop32.init" name="pop32"/>
      <clusterfs ref="smtpin-gfs"/>
    </service>
    <service autostart="0" domain="tefsv-pro1-fail" name="smtpout1" recovery="restart">
      <fs device="/dev/prosmtpout1" force_fsck="0" force_unmount="1" fsid="18742" fstype="ext3" mountpoint="/opt/pro/smtpout1" name="smtpout1-fs" options="" self_fence="0"/>
      <script file="/opt/pro/smtpout1.init" name="smtpout1"/>
      <clusterfs ref="smtpout-gfs"/>
    </service>
    <service autostart="0" domain="tefsv-pro2-fail" name="smtpout2" recovery="restart">
      <fs device="/dev/prosmtpout2" force_fsck="0" force_unmount="1" fsid="59096" fstype="ext3" mountpoint="/opt/pro/smtpout2" name="smtpout2-fs" options="" self_fence="0"/>
      <script file="/opt/pro/smtpout2.init" name="smtpout2"/>
      <clusterfs ref="smtpout-gfs"/>
    </service>
  </rm>
</cluster>
(In reply to comment #1)
> cluster.conf would be needed.

Attached. Any idea what could have gone wrong?
Full file system?
clusterfs.sh and fs.sh try to touch a file on the file system periodically -- if it's full for some reason, this will fail. You can (sort of) disable this check by adding a special child to your clusterfs resources:

<clusterfs ref="smtpout-gfs">
  <action depth="20" name="status" timeout="30" interval="1Y"/>
</clusterfs>
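For context, that depth-20 status check amounts to a writability probe along these lines. This is a paraphrased sketch, not the agent's actual code; the function name and probe file name here are illustrative:

```shell
# Rough sketch of the read/write liveness probe the clusterfs/fs
# agents perform during a deep status check: touch a hidden file
# on the mount point, and fail the check if the write errors.
# fs_is_alive and .clusterfs_statuscheck are illustrative names.
fs_is_alive() {
    mountpoint="$1"
    testfile="$mountpoint/.clusterfs_statuscheck"

    # Any write failure (full FS, exceeded quota, I/O error, failed
    # fork under memory pressure) makes the probe return nonzero,
    # which the agent reports to rgmanager as a failed status.
    touch "$testfile" 2>/dev/null || return 1
    rm -f "$testfile"
    return 0
}

fs_is_alive /tmp && echo "status OK"
```

With the `interval="1Y"` action shown above, rgmanager would schedule this deep probe only once a year, which is what makes it an effective (if blunt) way to disable the check.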
(In reply to comment #4)
> Full file system?

It definitely wasn't a full file system; we monitor such things. This particular FS had no more than 2% of its space used and less than 1% of its inodes used. In any case, even if it were a full-FS issue, "generic error" isn't very verbose.
"Generic error" is the failure code reported to rgmanager. Only a handful of return codes are valid for resource agents, and most of them are specific conditions like "Program not installed". In most cases nothing fits, so "Generic error" is used. What's missing here (or seems to be) is a log message from the resource agent itself explaining *why* it returned a failure code to rgmanager during a status check.
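For reference, the agent exit codes follow the usual OCF resource agent numbering; this list is an assumption about what this rgmanager version maps to its log messages, not taken from the rgmanager source:

```shell
# OCF-style resource agent exit codes (standard OCF numbering;
# assumed here to match what rgmanager interprets on this version).
OCF_SUCCESS=0            # action succeeded
OCF_ERR_GENERIC=1        # "generic error" -- the catch-all seen in the logs
OCF_ERR_ARGS=2           # bad or invalid arguments
OCF_ERR_UNIMPLEMENTED=3  # requested action not implemented
OCF_ERR_PERM=4           # insufficient permissions
OCF_ERR_INSTALLED=5      # "program not installed"
OCF_ERR_CONFIGURED=6     # resource is misconfigured
OCF_NOT_RUNNING=7        # resource is cleanly stopped

echo "status on clusterfs returned $OCF_ERR_GENERIC (generic error)"
```

Since a failed touch in the status probe matches none of the specific conditions, the agent falls through to exit code 1.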
One issue common to the clusterfs and fs agents is that they fork() a *lot* of child processes. Under memory pressure there's a chance that fork() will fail, which is another possible explanation for this problem.
I can add more verbose error reporting, but without knowing where in the resource-agent it is failing, fixing the problem is difficult.
Created attachment 346850 [details] clusterfs patch to fix the race condition during readwrite mount test

I have had this experience with multiple 4-node GFS clusters. We ran into it a lot, and I tracked it down to a race condition in the isAlive() function of clusterfs: the test file name isn't randomized or made unique per node in any way. I hacked more logging into our test environment and eventually caught it in action (unfortunately I no longer have the log output). What seemed to occur was that during the write test, touch was in the middle of writing the test file on one node while another node was finishing its own test and removing the same file. This caused touch to error out, which in turn made isAlive() fail. I set the hidden test file name to ".$HOSTNAME". Patch is attached; I hope it helps.
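The essence of the fix can be sketched as follows. This is a paraphrase of the approach described above, not the text of attachment 346850; the function name is illustrative:

```shell
# Paraphrased sketch of the race fix: make the probe file name unique
# per node so two nodes never touch and remove the same file at once.
# fs_is_alive is an illustrative name, not the literal patch code.
fs_is_alive() {
    mountpoint="$1"
    # Before the fix, every node probed the same shared name, so one
    # node's rm could race another node's touch on a shared GFS mount.
    # Including the node's hostname makes each node's probe file unique.
    testfile="$mountpoint/.$(hostname)"

    touch "$testfile" 2>/dev/null || return 1
    rm -f "$testfile"
    return 0
}

fs_is_alive /tmp && echo "per-node probe OK"
```

Because each node now only ever creates and removes its own file, concurrent status checks on different nodes can no longer interfere with each other.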
The patch does look like it would correct the race as you described it.
Additionally, it does not introduce any upgrade incompatibilities since the only time the file(s) are touched is during a status check.
I started to see this very issue last month. I implemented the patch and am still seeing the error, which was occurring about twice a week, though the increased logging showed that the failure was on the write test. I started doing some write tests as user root, and interestingly the gfs_quota for user root had recently been exceeded (not quite sure why it was set in the first place). The write test would pass most of the time, but some runs would randomly return a gfs_quota-exceeded error and fail to write the test file. I have since disabled the quota for user root. The quota warn and limit were not sending any messages to the kernel buffer or syslog, so there was no way to see this.
I'm merging Nick's patch with one provided for local file systems in bug #562237.
*** Bug 474694 has been marked as a duplicate of this bug. ***
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=f805ba33107b8948da515f7144ee6a3828d4343c
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Previously, the isAlive check could fail if two nodes used the same file name. With this update, the isAlive function prevents two nodes from using the same file name.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,2 +1 @@
-Previously, the isAlive check could fail if two nodes used the same file name. With this update, the isAlive function prevents two nodes from using the same
-file name.
+Previously, the isAlive check could fail if two nodes used the same file name. With this update, the isAlive function prevents two nodes from using the same file name.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0264.html