Bug 493207 - groupd assigns zero group id

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 5 |
| Component | cman |
| Version | 5.3 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | David Teigland <teigland> |
| Assignee | David Teigland <teigland> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | cfeist, cluster-maint, edamato, gpaterno, lhh, rlerch, tao |
| Target Milestone | rc |
| Target Release | --- |
| Fixed In Version | cman-2.0.100-1.el5 |
| Doc Type | Bug Fix |
| Clone Of | 493165 |
| Bug Depends On | 493165 |
| Last Closed | 2009-09-02 11:09:09 UTC |

Doc Text:

- Cause: a race condition between nodes during group creation (e.g. mounting gfs) could cause dlm or gfs groups to have zero global ids (extremely rare, never actually observed in dlm or gfs).
- Consequence: dlm or gfs startup would fail and usually print errors about a zero lockspace or mountgroup id.
- Fix: dlm and gfs now detect zero global ids from groupd and replace them with an id created from a hash of the group name.
- Result: dlm/gfs startup races among nodes can no longer fail due to zero ids being created in groupd.
Description (David Teigland, 2009-03-31 22:11:52 UTC)
The method groupd uses to pick the globally unique id of a new group (global_id) relies on the first cpg confchg having a single member, which isn't always true. When there is no initial cpg confchg with one member, groupd does not set a global id for the group, and it remains zero. If two nodes join a cpg at the very same time, the first cpg confchg on both nodes may indicate that both nodes are members. Most of the time, the timing is such that one node actually joins first on its own, so zero ids should be uncommon.

If this uncommon case does happen and a group has a global_id of 0, it is usually harmless. groupd, fenced, and dlm_controld do not use the global_id at all, so they are unaffected by groups with global_id 0 (dlm_controld does use it in the deadlock code, which is not used). gfs_controld does use global_id for plocks: it passes it to gfs-kernel as id=01234 in the hostdata mount option string, and gfs-kernel uses it as a specific fs reference id in plock operations passed back to userspace.

So, if two gfs_controld mountgroups were unlucky enough to both get a global_id of 0, any plock operations on the two filesystems would be mixed together (assuming apps are using plocks on them). This mixing of plocks between two filesystems is therefore the first point at which problems would be observed, and it would not actually be a problem until files with the same inode number in both filesystems were being locked. In other words, for there to be a problem, two gfs_controld groups need to be given a global_id of 0, and plocks need to be used on both of those filesystems. It all adds up to a rather unlikely event.

A fix for this will need to handle zero global_ids as a special case. The special case may be implemented in gfs_controld, since there is more data to work with there, so more options exist for picking an alternative global id.
Comment 1 ignores the obvious problem that id 0 causes in bug 493165, which is more of an incidental effect, easily handled by the group_tool fix.

After further study, I've found that dlm/dlm_controld would be affected by zero global_ids. First, in 5.3 plocks were shifted to go through the dlm, and the reference id for plock ops is now the global_id set for the lockspace, no longer the gfs mountgroup global_id. Second, the lockspace global_id is used in dlm network message headers to identify the lockspace.

My initial plan for fixing this is to add code in both dlm_controld and gfs_controld to check:

- if the global_id assigned by groupd is zero
- if this is the first group with a zero global_id, leave it as zero
- if another group with a zero global_id exists, pick a new global_id that is a hash of the group name (the method used in cluster3)

To avoid breaking compatibility with earlier RHEL5 versions, where a single group with a zero global_id exists and works, we need to continue to allow a single group (at each level) with a zero global_id. Changing the global_id of subsequent groups is also incompatible, but duplicate zero global_ids wouldn't work anyway.

It turns out that dlm-kernel does not work at all with a lockspace id of 0. So, the plan is now for dlm_controld and gfs_controld to replace any zero global_id with the name hash.

Created attachment 337669 [details]
patch
This patch worked well in some tests where I forced groupd to always assign a zero global id.
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:

- Cause: a race condition between nodes during group creation (e.g. mounting gfs) could cause dlm or gfs groups to have zero global ids (extremely rare, never actually observed in dlm or gfs).
- Consequence: dlm or gfs startup would fail and usually print errors about a zero lockspace or mountgroup id.
- Fix: dlm and gfs now detect zero global ids from groupd and replace them with an id created from a hash of the group name.
- Result: dlm/gfs startup races among nodes can no longer fail due to zero ids being created in groupd.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1341.html