506094 – CRM #1919318, GFS mount (shared resources) are unmounted when new service is added using them

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	rgmanager
Sub Component:
Version:	5.4
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	499522

Reported:	2009-06-15 14:35 UTC by Issue Tracker
Modified:	2023-09-14 01:16 UTC
CC List:	6 users
Fixed In Version:	rgmanager-2.0.52-1.11.el5
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-03-30 08:47:48 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Fix clusterfs meta refcnt handling (974 bytes, patch) 2009-07-31 19:44 UTC, Lon Hohberger
Preserve incarnations across configuration changes (1.60 KB, patch) 2009-07-31 19:45 UTC, Lon Hohberger
Fix incarnation handling (3.41 KB, application/octet-stream) 2009-07-31 19:46 UTC, Lon Hohberger

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2010:0280	0	normal	SHIPPED_LIVE	rgmanager bug fix and enhancement update	2010-03-29 13:59:11 UTC

Description Issue Tracker 2009-06-15 14:35:58 UTC

Escalated to Bugzilla from IssueTracker

Comment 1 Issue Tracker 2009-06-15 14:36:00 UTC

Event posted on 06-10-2009 07:56am EDT by avijayak

Description of problem:

The customer has several services that share GFS resources; that is, multiple services use the same GFS resource. I believe this is a valid configuration. Assuming it is, the issue is as follows.
If the customer adds a new service that includes one of these shared GFS resources, rgmanager forcefully unmounts the shared GFS file systems, which indirectly stops all the services that are already using those file systems.

How reproducible:
always.

Steps to Reproduce:
I am able to reproduce the issue as below

I have a two-node RHEL5 cluster using the latest packages. I created a test service that contains a GFS file system named testfs1, and also added an NFS export and client. The GFS file system is mounted on /mnt.
I opened a file using vi /mnt/test.txt to simulate the file system being accessed. I then added a new service that consists of the same GFS resource and a script resource. Before starting this service, rgmanager killed the vi process and then unmounted the GFS resource, which of course affected the already running service. The issue doesn't occur if I don't use force_unmount, but the customer requires it.

Expected results:
The file system shouldn't be unmounted if a service is already using it.

Is this expected? I found Bugzilla https://bugzilla.redhat.com/show_bug.cgi?id=254111, in which the issue was resolved by including refcount support in rgmanager and clusterfs.sh. I can confirm that I am using the latest rgmanager, 2.0.46-1.el5_3.3.
Is any special configuration required to use the same GFS on multiple services? Must force_unmount be left disabled in order to use the same GFS resource in multiple services? If so, how can we get the benefit of force_unmount in these environments?
This event sent from IssueTracker by dejohnso [Support Engineering Group]
issue 305812

Comment 2 Issue Tracker 2009-06-15 14:36:02 UTC

Event posted on 06-10-2009 08:01am EDT by avijayak

SEG Escalation Template

All Issues: Problem Description
---------------------------------------------------
1. Time and date of problem:
2. System architecture(s):
x86_64
3. Provide a clear and concise problem description as it is understood at
the time of escalation. Please be as specific as possible in your
description. Do not use the generic term "hang", as that can mean many
things.
As explained in description of problem section.
4. Specific action requested of SEG:
I need SEG help to check how we can safely put the same GFS resource inside
multiple services with force_unmount enabled.
5. Is a defect (bug) in the product suspected? yes/no
   Bugzilla number (if one already exists):
https://bugzilla.redhat.com/show_bug.cgi?id=254111 looks similar


All Issues: Supporting Information
------------------------------------------------------
1. Other actions already taken in working the problem (tech-list posting,
google searches, fulltext search, consultation with another engineer,
etc.):
Sent email to cluster-list with subject "Same GFS share on multiple
services" but didn't get an update.
2. Attach sosreport.
Attached.
4. Provide issue reproduction information, including location and access
of reproducer machine, if available.
Reproducible step given in description of problem section. I can give IPs
if required.

Please let me know if any more information needs to be attached.


Issue escalated to Support Engineering Group by: avijayak.
avijayak assigned to issue for Production Support (Pune).
Internal Status set to 'Waiting on SEG'
Status set to: Waiting on Tech

This event sent from IssueTracker by dejohnso  [Support Engineering Group]
 issue 305812

Comment 3 Issue Tracker 2009-06-15 14:36:03 UTC

Event posted on 06-11-2009 01:24pm EDT by dejohnso

This issue was fixed in rgmanager-2.0.38-2.el5 and the revision you tested
on was obviously later.  Can you tell me: is the same problem seen when
the service is relocated, or is the unmount only happening during the
creation of the service?
I am asking because I am suspecting that the reference count may not be
checked by rgmanager during the creation phase of the service.  I am doing
code review to see if I can determine this.  refcount is something that is
managed by rgmanager and it is not necessary to put anything in
cluster.conf.  Still researching but I would appreciate your answers in
the meantime.

Thanks,

Debbie

Internal Status set to 'Waiting on Support'

This event sent from IssueTracker by dejohnso  [Support Engineering Group]
 issue 305812

Comment 4 Issue Tracker 2009-06-15 14:36:05 UTC

Event posted on 06-12-2009 01:22am EDT by avijayak

Hello Debbie,

Thanks for your update.

>Can you tell me?  Is the same problem seen when the service is relocated?
 Or is the umount only happening during the creation of the service.
I have tested it and noticed that the issue doesn't occur at relocation
time. Maybe that case was fixed by Bugzilla 254111.
The issue exists in the creation phase as well as the deletion phase of a service.

>I am asking because I am suspecting that the reference count may not be
checked by rgmanager during the creation phase of the service.
Hmm, I also think so.

>I am doing code review to see if I can determine this.
Thanks.

--Aneesh

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by dejohnso  [Support Engineering Group]
 issue 305812

Comment 6 Debbie Johnson 2009-06-29 18:21:28 UTC

Can we get someone working on this, or can someone point me in the right direction so I can look into this issue further?  It is holding up the progress of a customer's production environment.  Thanks in advance.

Debbie

Comment 7 Debbie Johnson 2009-07-07 14:55:26 UTC

Increasing severity to match the increased severity of the IT.  The customer is getting very anxious about this issue.

Comment 8 Lon Hohberger 2009-07-27 14:31:17 UTC

Ok, I understand this.

Comment 9 Lon Hohberger 2009-07-29 21:03:50 UTC

So, I started looking at how to fix this, and there are at least two ways:

* make the 'init' phase not run the 'stop' phase on anything with a nonzero reference count

* make clusterfs init_on_add/destroy_on_delete = 0 and apply this to resources which are not at the top level (currently, this option only works for resources at the top level)

I think the former may be the less intrusive method.
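
The first option above can be sketched roughly as follows. This is an illustrative Python model only, not rgmanager's actual C code: `Resource`, `refcount`, `stop_calls`, and `init_phase` are hypothetical names, and it assumes the 'init' phase otherwise runs 'stop' on every configured resource to bring it to a known state.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical stand-in for a configured resource such as a clusterfs mount."""
    name: str
    refcount: int = 0      # number of running services currently using it
    stop_calls: int = 0    # how many times 'stop' was actually executed

    def stop(self) -> None:
        # In a real agent this is where a forced unmount would happen.
        self.stop_calls += 1


def init_phase(resources):
    """Run 'stop' on each resource, skipping any that is still in use.

    Returns the names of the resources that were skipped.
    """
    skipped = []
    for res in resources:
        if res.refcount > 0:
            # Another service still holds this resource: leave it alone.
            skipped.append(res.name)
            continue
        res.stop()
    return skipped


shared = Resource("testfs1", refcount=1)   # already mounted by a running service
fresh = Resource("newscript")              # belongs only to the new service
skipped = init_phase([shared, fresh])
print(skipped, shared.stop_calls, fresh.stop_calls)
```

In this model the shared file system is never stopped while a running service still references it, which is the behaviour the reporter expected with force_unmount enabled.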

Comment 10 Lon Hohberger 2009-07-31 19:44:54 UTC

Created attachment 355851 [details]
Fix clusterfs meta refcnt handling

Comment 11 Lon Hohberger 2009-07-31 19:45:32 UTC

Created attachment 355852 [details]
Preserve incarnations across configuration changes

Comment 12 Lon Hohberger 2009-07-31 19:46:35 UTC

Created attachment 355853 [details]
Fix incarnation handling

* Don't allow simultaneous starts/stops of the *same*
resource.
* Make incarnation numbers consistent between start/stop
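
The first point (no simultaneous starts/stops of the same resource) amounts to serializing operations per resource. A minimal Python sketch of that idea follows; it is not taken from the attached patch, and `ResourceLocks`, `run_operation`, and the counters are hypothetical names used only for illustration.

```python
import threading
from collections import defaultdict


class ResourceLocks:
    """Hypothetical registry handing out one lock per resource name."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    def for_resource(self, name):
        # Protect the dict itself so two threads cannot create two locks.
        with self._guard:
            return self._locks[name]


locks = ResourceLocks()
active = 0        # operations currently inside the critical section
max_active = 0    # highest concurrency ever observed
counter_guard = threading.Lock()


def run_operation(resource_name):
    """Pretend to start or stop one incarnation of a resource."""
    global active, max_active
    with locks.for_resource(resource_name):
        with counter_guard:
            active += 1
            max_active = max(max_active, active)
        # ... the agent's start/stop action would run here ...
        with counter_guard:
            active -= 1


threads = [threading.Thread(target=run_operation, args=("testfs1",))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(max_active)
```

Because every operation on "testfs1" takes the same lock, at most one start or stop of that resource is in flight at a time, while operations on different resources can still run in parallel.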

Comment 13 Lon Hohberger 2009-07-31 19:50:31 UTC

What the problem ended up being was related to incarnation handling.

(1) Incarnation counts were not preserved across configuration changes.  This caused rgmanager to "forget" that other instances of a resource were already running after a configuration change.

(2) Incarnation counts were inconsistent between start/stop: the meta_refcnt should always be the number of other incarnations of the resource that are running concurrently, excluding the current instance.

(3) Incarnations could start/stop simultaneously.  In theory, it would be possible to have two incarnations stopped at the same time without actually executing the 'real' stop phase of the resource agent.  While this does not matter with clusterfs, it is important that this does not happen with other (yet undeveloped) resource types.
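
Points (1) and (2) can be illustrated with a small Python model. It is a sketch under stated assumptions, not the patched rgmanager code: `IncarnationTable`, `reconfigure`, and `real_unmount_needed` are hypothetical names, and the rule that only an incarnation seeing a meta_refcnt of zero performs the real stop is assumed here for illustration.

```python
class IncarnationTable:
    """Hypothetical per-node table of running incarnations, keyed by resource name."""

    def __init__(self, running=None):
        self.running = dict(running or {})

    def start(self, name):
        """Start one incarnation; return the meta_refcnt the agent would see."""
        others = self.running.get(name, 0)   # incarnations besides this one
        self.running[name] = others + 1
        return others

    def stop(self, name):
        """Stop one incarnation; return the meta_refcnt the agent would see."""
        if self.running.get(name, 0) == 0:
            raise ValueError(f"{name} has no running incarnation")
        self.running[name] -= 1
        return self.running[name]            # again excludes this instance

    def reconfigure(self, new_names):
        """Build the table for a new configuration, carrying counts over."""
        return IncarnationTable(
            {name: self.running.get(name, 0) for name in new_names}
        )


def real_unmount_needed(meta_refcnt):
    # Assumed agent rule: only the last incarnation performs the real stop.
    return meta_refcnt == 0


table = IncarnationTable()
first = table.start("testfs1")               # existing service mounts the GFS

# A new service is added that references the same file system.
table = table.reconfigure(["testfs1", "newscript"])
second = table.start("testfs1")

after_one_stop = table.stop("testfs1")
after_both_stop = table.stop("testfs1")
print(first, second, after_one_stop, after_both_stop)
```

If `reconfigure` instead started from an empty table, the second start would see a count of zero and the first stop would look like the last one, which matches the "forgetting" described in (1).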

These patches have not been pushed to git, and require more testing before this can occur.

Comment 14 Debbie Johnson 2009-07-31 20:07:49 UTC

Lon,

Would it be ok for me to create a patch and provide it to the field for testing?

Deb

Comment 15 Lon Hohberger 2009-07-31 20:27:52 UTC

http://people.redhat.com/lhh/rgmanager-2.0.46-1.el5_3.4bz506094.src.rpm

Source RPM based on current 5.3 + errata.

Comment 17 Lon Hohberger 2009-07-31 20:29:56 UTC

http://people.redhat.com/lhh/rgmanager-2.0.52-1.el5.bz506094.src.rpm

Source RPM based on current RHEL 5.4 beta packages.

Comment 19 Debbie Johnson 2009-08-06 14:35:10 UTC

Lon,

Field tested rgmanager with success.  The customer could not test, as they had worked around the issue and gone into production.

Debbie

Comment 22 Lon Hohberger 2009-08-21 21:20:07 UTC

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=9b060401c0c58405e9d6bfecb57acbfaeb717080
http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=93873dac8ab08f63700d76abe61d70693548323e
http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=e27e6148dd6a033a3e0a57e04c06b6242efb3321

Comment 25 Lon Hohberger 2009-10-21 19:57:59 UTC

There was a bad assertion introduced.  The following patch fixes it:

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=25155283f4797f85fa30b95c94c8ebb7df07dcc3

Comment 28 Chris Ward 2010-02-11 10:23:37 UTC

~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 31 errata-xmlrpc 2010-03-30 08:47:48 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0280.html

Comment 32 Red Hat Bugzilla 2023-09-14 01:16:54 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
