Bug 493207

Summary: groupd assigns zero group id
Product: Red Hat Enterprise Linux 5
Component: cman
Version: 5.3
Reporter: David Teigland <teigland>
Assignee: David Teigland <teigland>
QA Contact: Cluster QE <mspqa-list>
CC: cfeist, cluster-maint, edamato, gpaterno, lhh, rlerch, tao
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Hardware: All
OS: Linux
Target Milestone: rc
Fixed In Version: cman-2.0.100-1.el5
Clone Of: 493165
Bug Depends On: 493165
Last Closed: 2009-09-02 11:09:09 UTC

Doc Type: Bug Fix
Doc Text:
- Cause: a race condition between nodes during group creation (e.g. mounting gfs) could cause dlm or gfs groups to have zero global id's (extremely rare, never actually observed in dlm or gfs).
- Consequence: dlm or gfs startup would fail and usually print errors about a zero lockspace or mountgroup id.
- Fix: dlm and gfs now detect zero global id's from groupd and replace them with an id created from a hash of the group name.
- Result: dlm/gfs startup races among nodes can no longer fail due to zero id's being created in groupd.

Attachments: patch (flags: none)

Description David Teigland 2009-03-31 22:11:52 UTC
+++ This bug was initially created as a clone of Bug #493165 +++

Description of problem:

On some occasions groupd allows the fence domain id to be ZERO:

# group_tool
type             level name          id       state       
fence            0     default       00000000 none        
[1 2 3]

....

When that happens, queries against the 'default' fence domain will fail:

# group_tool ls fence default
groupd has no information about the specified group
# echo $?
1
# group_tool ls fence default &> /dev/null
# echo $?
1

Because such queries are used by rgmanager, this causes rgmanager to hang on startup.

How reproducible:

Every time that the fence domain id is zero.

Steps to Reproduce:

Start up the cluster and obtain id zero for the fence domain. This is not the normal case.

Actual results:

rgmanager blocks. 

Expected results:

rgmanager works.

--- Additional comment from lhh on 2009-03-31 16:23:49 EDT ---

Created an attachment (id=337401)
Fix

--- Additional comment from lhh on 2009-03-31 17:40:36 EDT ---

The previous fix just allows group_tool to work if id == 0; it doesn't address the underlying problem that causes groupd to assign a group an id of 0.

Comment 1 David Teigland 2009-03-31 22:51:22 UTC
The method groupd uses to pick the globally unique id of a new group (global_id) relies on the first cpg confchg having a single member, which isn't always true.  When there is no initial cpg confchg with one member, groupd will not set a global id for the group, and it will remain zero.

If two nodes join a cpg at the very same time, the first cpg confchg on both nodes may indicate both nodes are members.  Most of the time, the timings are such that one node will actually join first on its own, so the zero id's should be uncommon.
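
To make the race concrete, here is a minimal sketch (not groupd's actual source) of the assignment pattern described above; the struct layout and the make_id() encoding are assumptions for illustration:

#include <stdint.h>

struct group {
    uint32_t global_id;     /* stays 0 if never assigned -- the bug */
    int got_first_confchg;
};

/* Hypothetical encoding; the real scheme groupd uses is not shown in
   this report. */
static uint32_t make_id(int nodeid, int counter)
{
    return ((uint32_t)counter << 16) | (uint32_t)nodeid;
}

/* Called on each cpg configuration change delivered for the group. */
static void confchg(struct group *g, int member_count, int our_nodeid,
                    int counter)
{
    if (g->got_first_confchg)
        return;
    g->got_first_confchg = 1;

    /* The id is assigned only when this node joined alone.  If two
       nodes join at the same instant, both see member_count == 2 in
       their first confchg, neither assigns an id, and global_id
       remains 0 on every node. */
    if (member_count == 1)
        g->global_id = make_id(our_nodeid, counter);
}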

If this uncommon case does happen, and a group has a global_id of 0, it is usually harmless.  groupd, fenced, and dlm_controld do not use the global_id at all, so they will be unaffected by groups with global_id 0 (dlm_controld does use it in the deadlock code, but that code is not used).

gfs_controld does use global_id for plocks.  It passes it to gfs-kernel as id=01234 in the hostdata mount option string.  gfs-kernel uses it as a specific fs reference id in plock operations passed back to userspace.  So, if two gfs_controld mountgroups were unlucky enough to get global_id's of 0, then any plock operations on the two filesystems would be mixed together (assuming apps are using plocks on the fs).  This mixing of plocks between two fs's is therefore the first point at which problems would be observed (and it actually would not be a problem until files with the same inode number in both fs's were being locked.)
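
A sketch of why duplicate ids mix plocks (structures are illustrative, not gfs_controld's actual code): userspace locates the fs for a plock operation by its global_id, so two mountgroups that both carry id 0 are indistinguishable.

#include <stddef.h>
#include <stdint.h>

struct mountgroup {
    uint32_t global_id;
    const char *name;
    struct mountgroup *next;
};

/* Route a plock operation from gfs-kernel to a mountgroup using the
   fs reference id (the global_id passed in at mount time). */
static struct mountgroup *find_mg(struct mountgroup *list, uint32_t fs_id)
{
    struct mountgroup *mg;

    for (mg = list; mg; mg = mg->next)
        if (mg->global_id == fs_id)
            return mg;  /* first match wins: with two groups at id 0,
                           one fs receives both fs's plock traffic */
    return NULL;
}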

So, in order for there to be a problem, two gfs_controld groups need to be given global_id of 0, and plocks need to be used on both of those fs's.  It all adds up to a rather unlikely event.

A fix for this will need to be handled as a special case when zero global_id's occur.  The special case may be implemented in gfs_controld since there is more data to work with, so more options exist for picking an alternative global id.

Comment 2 David Teigland 2009-03-31 22:54:36 UTC
Comment 1 ignores the obvious problem that id 0 causes in bug 493165; that problem is more of an incidental effect, easily handled by the group_tool fix.

Comment 3 David Teigland 2009-04-01 17:24:14 UTC
After further study, I've found that dlm/dlm_controld would be affected by 0 global_id's.  First, in 5.3 plocks were shifted to go through the dlm, and the reference id for plock ops is the global_id set for the lockspace, no longer the gfs mountgroup global_id.  Second, the lockspace global_id is used in dlm network message headers to identify the lockspace.
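
For context, the lockspace global_id travels in every dlm message header, so a zero or duplicate id would misroute messages between lockspaces.  A sketch of such a header follows; the field names are modeled on the in-kernel dlm, but treat the exact layout as an assumption:

#include <stdint.h>

/* Illustrative dlm-style message header: each message carries the
   lockspace's global_id so the receiver can deliver it to the right
   lockspace. */
struct dlm_header_sketch {
    uint32_t h_version;
    uint32_t h_lockspace;   /* the lockspace global_id */
    uint32_t h_nodeid;      /* sending node */
    uint16_t h_length;
    uint8_t  h_cmd;
    uint8_t  h_pad;
};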

My initial plan for fixing this is to add code in both dlm_controld and gfs_controld to check:

- if global_id assigned by groupd is zero
- if this is the first group with zero global_id, leave it as zero
- if another group with zero global_id exists, pick a new global_id that is a hash of the group name (the method used in cluster3)

To avoid breaking compatibility with earlier RHEL5 versions where a single group with zero global_id exists and works, we need to continue to allow a single group (at each level) with zero global_id.  Changing the global_id of subsequent groups is also incompatible, but duplicate zero global_id's wouldn't work anyway.

Comment 4 David Teigland 2009-04-01 20:27:56 UTC
It turns out that dlm-kernel doesn't work at all with lockspace id of 0.
So, the plan is now for dlm_controld and gfs_controld to replace any zero global_id with the name hash.
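
A minimal sketch of that replacement step (the 32-bit FNV-1a hash here is only an example; the report says the cluster3 method hashes the group name but doesn't specify which hash):

#include <stdint.h>
#include <string.h>

static uint32_t name_hash(const char *name)
{
    uint32_t h = 2166136261u;   /* FNV-1a offset basis */
    size_t i, len = strlen(name);

    for (i = 0; i < len; i++) {
        h ^= (uint8_t)name[i];
        h *= 16777619u;         /* FNV-1a prime */
    }
    return h;
}

/* Replace a zero global_id from groupd with one derived from the group
   name.  The name is identical on every node, so each node computes
   the same replacement id independently, with no extra messaging. */
static uint32_t fixup_global_id(uint32_t id, const char *name)
{
    if (id == 0)
        id = name_hash(name);
    return id;
}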

Comment 5 David Teigland 2009-04-01 21:22:45 UTC
Created attachment 337669
patch

This patch worked well in some tests where I forced groupd to always assign a zero global id.

Comment 8 David Teigland 2009-05-19 16:28:16 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
- Cause: a race condition between nodes during group creation (e.g. mounting gfs) could cause dlm or gfs groups to have zero global id's (extremely rare, never actually observed in dlm or gfs).

- Consequence: dlm or gfs startup would fail and usually print errors about a zero lockspace or mountgroup id.

- Fix: dlm and gfs now detect zero global id's from groupd and replace them with an id created from a hash of the group name.

- Result: dlm/gfs startup races among nodes can no longer fail due to zero id's being created in groupd.

Comment 10 errata-xmlrpc 2009-09-02 11:09:09 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1341.html