563555 – clurgmgrd[15993]: <notice> status on clusterfs "gfs" returned 1 (generic error)

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 563555 - clurgmgrd[15993]: <notice> status on clusterfs "gfs" returned 1 (generic error)

Summary: clurgmgrd[15993]: <notice> status on clusterfs "gfs" returned 1 (generic error)

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	resource-agents
Sub Component:
Version:	6.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Marek Grac
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	563733 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2010-02-10 15:31 UTC by Lon Hohberger
Modified:	2018-10-27 14:01 UTC (History)
CC List:	8 users (show)
Fixed In Version:	resource-agents-3.0.7-4.el6
Doc Type:	Bug Fix
Doc Text:
Clone Of:	469815
Environment:
Last Closed:	2010-11-15 14:43:53 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Lon Hohberger 2010-02-10 15:31:40 UTC

+++ This bug was initially created as a clone of Bug #469815 +++

Description of problem:

From time to time clusterfs returns generic error and restarts service:


Oct  6 05:05:47 tefse-pro2 clurgmgrd[15993]: <notice> status on clusterfs "smtpout-gfs" returned 1 (generic error)
Oct  6 05:05:47 tefse-pro2 clurgmgrd[15993]: <notice> Stopping service smtpout2
Oct  6 05:05:52 tefse-pro2 clurgmgrd[15993]: <notice> Service smtpout2 is recovering
Oct  6 05:05:52 tefse-pro2 clurgmgrd[15993]: <notice> Recovering failed service smtpout2
Oct  6 05:05:53 tefse-pro2 kernel: EXT3-fs warning: checktime reached, running e2fsck is recommended
Oct  6 05:05:53 tefse-pro2 kernel: GFS: Trying to join cluster "lock_dlm", "tefcl-pro:prodatasmtpout"
Oct  6 05:05:55 tefse-pro2 kernel: GFS: fsid=tefcl-pro:prodatasmtpout.0: Joined cluster. Now mounting FS...
Oct  6 05:05:55 tefse-pro2 kernel: GFS: fsid=tefcl-pro:prodatasmtpout.0: jid=0: Trying to acquire journal lock...
Oct  6 05:05:55 tefse-pro2 kernel: GFS: fsid=tefcl-pro:prodatasmtpout.0: jid=0: Looking at journal...
Oct  6 05:05:55 tefse-pro2 kernel: GFS: fsid=tefcl-pro:prodatasmtpout.0: jid=0: Done
Oct  6 05:05:56 tefse-pro2 clurgmgrd[15993]: <notice> Service smtpout2 started


Version-Release number of selected component (if applicable):
GFS-6.1.6-1
GFS-kernel-2.6.9-60.3
GFS-kernel-smp-2.6.9-60.3
GFS-kernheaders-2.6.9-60.3
rgmanager-1.9.54-3.228823test
cman-kernel-2.6.9-45.8
cman-kernel-smp-2.6.9-45.8
dlm-kernel-2.6.9-44.3
dlm-kernel-smp-2.6.9-44.3

Linux tefse-pro2 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:28:02 EDT 2006 i686 i686 i386 GNU/Linux

How reproducible:
not known

Steps to Reproduce:
1.
2.
3.
  
Actual results:
status on clusterfs " " returned 1 (generic error)

Expected results:
some more precise error message or fixed root cause (if it's kind of bug in cluster software)

Additional info:

--- Additional comment from lhh on 2008-11-04 15:10:40 EST ---

cluster.conf would be needed.

--- Additional comment from tjaszowski on 2008-11-05 02:47:00 EST ---

(In reply to comment #1)
> cluster.conf would be needed.

<?xml version="1.0"?>
<cluster config_version="62" name="tefcl-pro">
        <fence_daemon post_fail_delay="0" post_join_delay="25"/>
        <clusternodes>
                <clusternode name="tefse-pro1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="tefse-pro1-ilo"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="tefse-pro2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="tefse-pro2-ilo"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_ilo" hostname="tefse-pro1-ilo" login="fence" name="tefse-pro1-ilo" passwd=""/>
                <fencedevice agent="fence_ilo" hostname="tefse-pro2-ilo" login="fence" name="tefse-pro2-ilo" passwd=""/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="tefsv-pro-fail" ordered="0" restricted="1">
                                <failoverdomainnode name="tefse-pro2" priority="1"/>
                                <failoverdomainnode name="tefse-pro1" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="tefsv-pro1-fail" restricted="1">
                                <failoverdomainnode name="tefse-pro1" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="tefsv-pro2-fail" restricted="1">
                                <failoverdomainnode name="tefse-pro2" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <clusterfs device="/dev/prodatasmtpin" force_unmount="1" fsid="44545" fstype="gfs" mountpoint="/opt/prodata/smtpin" name="smtpin-gfs" options=""/>
                        <clusterfs device="/dev/prodatasmtpout" force_unmount="1" fsid="23380" fstype="gfs" mountpoint="/opt/prodata/smtpout" name="smtpout-gfs" options=""/>
                </resources>
                <service autostart="0" domain="tefsv-pro1-fail" name="smtpin1" recovery="restart">
                        <fs device="/dev/prosmtpin1" force_fsck="0" force_unmount="1" fsid="20017" fstype="ext3" mountpoint="/opt/pro/smtpin1" name="smtpin1-fs" options="" self_fence="0"/>
                        <fs device="/dev/propop31" force_fsck="0" force_unmount="1" fsid="7091" fstype="ext3" mountpoint="/opt/pro/pop31" name="pop31-fs" options="" self_fence="0"/>
                        <script file="/opt/pro/smtpin1.init" name="smtpin1"/>
                        <script file="/opt/pro/pop31.init" name="pop31"/>
                        <clusterfs ref="smtpin-gfs"/>
                </service>
                <service autostart="0" domain="tefsv-pro2-fail" name="smtpin2" recovery="restart">
                        <fs device="/dev/prosmtpin2" force_fsck="0" force_unmount="1" fsid="57823" fstype="ext3" mountpoint="/opt/pro/smtpin2" name="smtpin2-fs" options="" self_fence="0"/>
                        <fs device="/dev/propop32" force_fsck="0" force_unmount="1" fsid="46172" fstype="ext3" mountpoint="/opt/pro/pop32" name="pop32-fs" options="" self_fence="0"/>
                        <script file="/opt/pro/smtpin2.init" name="smtpin2"/>
                        <script file="/opt/pro/pop32.init" name="pop32"/>
                        <clusterfs ref="smtpin-gfs"/>
                </service>
                <service autostart="0" domain="tefsv-pro1-fail" name="smtpout1" recovery="restart">
                        <fs device="/dev/prosmtpout1" force_fsck="0" force_unmount="1" fsid="18742" fstype="ext3" mountpoint="/opt/pro/smtpout1" name="smtpout1-fs" options="" self_fence="0"/>
                        <script file="/opt/pro/smtpout1.init" name="smtpout1"/>
                        <clusterfs ref="smtpout-gfs"/>
                </service>
                <service autostart="0" domain="tefsv-pro2-fail" name="smtpout2" recovery="restart">
                        <fs device="/dev/prosmtpout2" force_fsck="0" force_unmount="1" fsid="59096" fstype="ext3" mountpoint="/opt/pro/smtpout2" name="smtpout2-fs" options="" self_fence="0"/>
                        <script file="/opt/pro/smtpout2.init" name="smtpout2"/>
                        <clusterfs ref="smtpout-gfs"/>
                </service>
        </rm>
</cluster>

--- Additional comment from tjaszowski on 2008-11-12 13:59:48 EST ---

(In reply to comment #1)
> cluster.conf would be needed.

attached.


any idea what could went wrong?

--- Additional comment from lhh on 2008-12-09 16:08:09 EST ---

Full file system?

--- Additional comment from lhh on 2008-12-09 16:11:16 EST ---

clusterfs.sh and fs.sh try to touch a file on the file system periodically -- if it's full for some reason, this will fail.

You can (sort of) disable this check by adding a special child to your clusterfs resources:

   <clusterfs ref="smtpout-gfs">
     <action depth="20" name="status" timeout="30" interval="1Y"/>
   </clusterfs>

--- Additional comment from tjaszowski on 2008-12-10 01:06:01 EST ---

(In reply to comment #4)
> Full file system?

for sure it wasn't full file system. We do monitor such things... For this particular FS it was not more than 2% of space used and less than 1% inodes used. 









anyway even if it would be full FS issue - generic error isn't very verbose.

--- Additional comment from lhh on 2008-12-10 09:25:02 EST ---

Generic error is the failure code reported to rgmanager.  There are only a couple of return codes valid for resource-agents, some are things like "Program not installed".  In most cases, nothing fits, so "Generic error" is used.

What's missing here (or seems to be) is a log message from the resource agent itself as to *why* it returned a failure code to rgmanager during a status check.

--- Additional comment from lhh on 2009-02-27 10:23:57 EST ---

One issue common to clusterfs and fs agents are that they fork() a -lot- of stuff.  During memory pressure, there's a chance that fork would fail, which could also be a possible explanation for this problem.

--- Additional comment from lhh on 2009-02-27 17:29:42 EST ---

I can add more verbose error reporting, but without knowing where in the resource-agent it is failing, fixing the problem is difficult.

--- Additional comment from nickryand on 2009-06-08 06:30:47 EDT ---

Created an attachment (id=346850)
clusterfs patch to fix the race condition during readwrite mount test

I have had this experience with multiple 4 node gfs clusters. We ran into this a lot and I tracked this down to being a race condition in the isAlive() function of clusterfs. The file name isn't randomized in any way. I hacked in more logging into our test environment and eventually caught this in action. Unfortunately I do not have the log output. 

What seemed to occur was during the write test, touch was in the middle of writing the test file on one node while another one was finishing its test and removing the file. This caused touch to error out and subsequently caused the isAlive function to fail. I set the hidden file used to write to the ".$HOSTNAME". Patch is attached. 

I hope it helps.

--- Additional comment from lhh on 2009-06-08 14:56:10 EDT ---

The patch does look like it would correct the race as you described it.

--- Additional comment from lhh on 2009-06-08 16:06:43 EDT ---

Additionally, it does not introduce any upgrade incompatibilities since the only time the file(s) are touched is during a status check.

--- Additional comment from tjp on 2009-12-21 11:54:34 EST ---

I started to see this very issue last month.  I implemented the patch, and still seeing the error, which was occuring about twice a week.  Though the increased logging showed that the fail was on the write test.  I started doing some write tests as user root, and interestingly my qfs_quota had recently been exceeded for user root, not quote sure why it was set anyways.  The write test would pass most of the time but some instances would randomly return gfs_quota exceeded and fail to write the test file.  I have since disabled quota for user root.  The quota warn and limit was not spitting any messages to kernel buffer or syslog, so there was no way to see this.

Comment 1 RHEL Program Management 2010-02-10 15:53:37 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 3 Lon Hohberger 2010-02-10 23:29:52 UTC

*** Bug 563733 has been marked as a duplicate of this bug. ***

Comment 4 Lon Hohberger 2010-02-17 19:41:31 UTC

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=23adebdb29dcc8dc34d1508037434121c7a95ec7

Comment 9 releng-rhel@redhat.com 2010-11-15 14:43:53 UTC

Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.

Note You need to log in before you can comment on or make changes to this bug.