Bug 191222
Summary: | read flock broken on single-node
---|---
Product: | [Retired] Red Hat Cluster Suite
Component: | gfs
Status: | CLOSED ERRATA
Severity: | medium
Priority: | medium
Version: | 4
Hardware: | All
OS: | Linux
Reporter: | Abhijith Das <adas>
Assignee: | Abhijith Das <adas>
QA Contact: | GFS Bugs <gfs-bugs>
CC: | cfeist, nobody+wcheng, swhiteho, teigland
Target Milestone: | ---
Target Release: | ---
Fixed In Version: | RHBA-2006-0561
Doc Type: | Bug Fix
Last Closed: | 2006-08-10 21:35:28 UTC
Description (Abhijith Das, 2006-05-09 20:41:12 UTC)

Created attachment 128812 [details]
test-program to simulate bug

At least part of the problem is that the GL_NOCACHE flag used on flock glocks assumes that there's only a single glock holder, so when a NOCACHE holder is dequeued, the glock is unlocked without any thought that other holders may still exist.
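The attachment itself is not reproduced in this report; the following is a minimal, hypothetical sketch of the multi-holder scenario just described (two shared flock holders on the same node, one of which is then dequeued). The mount path, file name, and final check are assumptions, not the contents of attachment 128812.

```c
/*
 * Hypothetical reproduction sketch (NOT attachment 128812): take two
 * shared flock holders on the same GFS file from one node, drop one of
 * them, and check that a conflicting exclusive request is still refused
 * while the other shared holder remains.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/mnt/gfs/flocktest"; /* assumed path */
    int fd1 = open(path, O_RDWR | O_CREAT, 0644);
    int fd2 = open(path, O_RDWR);
    int fd3 = open(path, O_RDWR);

    if (fd1 < 0 || fd2 < 0 || fd3 < 0) {
        perror("open");
        return 1;
    }

    /* Two independent shared holders (separate open file descriptions). */
    if (flock(fd1, LOCK_SH) || flock(fd2, LOCK_SH)) {
        perror("flock(LOCK_SH)");
        return 1;
    }

    /* Dequeue one holder; fd1 still holds its shared lock. */
    flock(fd2, LOCK_UN);

    /* A conflicting exclusive request must still be refused. */
    if (flock(fd3, LOCK_EX | LOCK_NB) == 0)
        printf("BUG: exclusive flock granted while a shared holder remains\n");
    else
        printf("OK: exclusive flock correctly refused\n");

    close(fd1);
    close(fd2);
    close(fd3);
    return 0;
}
```

On a correctly behaving filesystem the final non-blocking exclusive request must fail with EWOULDBLOCK as long as fd1 still holds its shared lock; a grant there would indicate that the remaining holder was lost when the other one was dequeued.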
Created attachment 129069 [details]
Patch to potentially fix this bz
This patch ensures that a GL_NOCACHE glock is removed from the cache only when gfs_glock_dq is called on the last holder. I haven't seen any ill effects from this patch, but will feel more comfortable once it has gone through a round of QA.
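To make the intent of the change concrete, here is a self-contained toy model of that dequeue logic. Apart from the names GL_NOCACHE and gfs_glock_dq, which are taken from the comment above, every type, field, and value is hypothetical and does not reflect the real gfs data structures; the sketch only demonstrates "drop from cache on the last holder only".

```c
/* Toy model of the fix, not actual gfs code. */
#include <stdbool.h>
#include <stdio.h>

#define GL_NOCACHE 0x1

struct glock {
    int holders;        /* number of active holders */
    bool cached;        /* is the lock still held/cached on this node? */
};

struct glock_holder {
    struct glock *gl;
    int flags;
};

static void gfs_glock_dq(struct glock_holder *gh)
{
    struct glock *gl = gh->gl;

    gl->holders--;

    /*
     * Old behaviour: any GL_NOCACHE dequeue dropped the glock,
     * silently unlocking it under the remaining holders.
     *
     * Fixed behaviour: only drop it when this was the last holder.
     */
    if ((gh->flags & GL_NOCACHE) && gl->holders == 0)
        gl->cached = false;
}

int main(void)
{
    struct glock gl = { .holders = 2, .cached = true };
    struct glock_holder a = { &gl, GL_NOCACHE };
    struct glock_holder b = { &gl, GL_NOCACHE };

    gfs_glock_dq(&a);
    printf("after first dq:  holders=%d cached=%d\n", gl.holders, gl.cached);
    gfs_glock_dq(&b);
    printf("after second dq: holders=%d cached=%d\n", gl.holders, gl.cached);
    return 0;
}
```

Running the toy model prints holders=1 cached=1 after the first dequeue and holders=0 cached=0 after the second, i.e. the glock is only released once the last holder is gone.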
Committed above patch into RHEL4, HEAD and STABLE branches.

A little explanation of FLOCKs, GL_NOCACHE etc.:

1. Why do flocks need the GL_NOCACHE flag turned on for their glocks?

If FLOCK glocks are cached on one node after use, another node requesting a conflicting FLOCK coupled with the LOCK_NB flag will be denied. The first node has already used and released the FLOCK and should not conflict with the second node's request. The GL_NOCACHE flag ensures this. (A small flock(2) illustration of this scenario appears at the end of this report.)

2. In RHEL3 there was no GL_NOCACHE flag. How were flocks working then?

Without the GL_NOCACHE flag, the release of the glock depends on a timeout value associated with FLOCK glocks. This timeout mechanism (flock_demote_ok()) is not implemented, and hence the glock gets released immediately. But there is a correctness issue here: the release of the glock doesn't happen synchronously, so the problem in point 1 could still occur if the second node requests the flock within the small window between the release of the flock and the release of the glock. The solution is a correct implementation of GL_NOCACHE, which this patch attempts to accomplish.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0561.html

Just stumbled upon this bug myself using RHEL4U3. The symptoms I saw were that traffic on the heartbeat (DLM) network was high and performance was poorer on nodes which were not the first to mount the filesystem. The first mounter obtained journal locks and then dequeued them while they still had holders. From that moment on, the other nodes had to do network DLM transactions to get the locks and could never cache them locally. This fix solved the performance problem.
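Referring back to point 1 of the explanation above, here is a minimal flock(2) illustration of the cross-node scenario, assuming a shared GFS file at a hypothetical path. Run it on a second node after the first node has taken and then released its flock: with correct GL_NOCACHE behaviour the non-blocking request succeeds, whereas a glock still cached on the first node would cause a spurious EWOULDBLOCK.

```c
/* Non-blocking flock attempt on node B (path is an assumption). */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/gfs/shared-file", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (flock(fd, LOCK_EX | LOCK_NB) == 0) {
        /* Expected: node A released its flock, so nothing conflicts. */
        printf("got exclusive flock\n");
        flock(fd, LOCK_UN);
    } else if (errno == EWOULDBLOCK) {
        printf("refused: a conflicting flock appears to be held elsewhere\n");
    } else {
        printf("flock failed: %s\n", strerror(errno));
    }

    close(fd);
    return 0;
}
```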