Description of problem:
When GFS uses direct-io, PR and CW locks are mixed together on a single resource. To optimize the interaction between these two lock modes, GFS uses LM_FLAG_ANY to request that either of the modes be granted. When the dlm carries out this optimization and grants a PR lock instead of a CW, or a CW instead of a PR, the mode is not switched on the non-master node. So, for example, a lock is requested in PR mode with the ALTCW flag and granted on the master node in CW mode, but the non-master (requesting) node records the granted mode as PR.

In the test used to uncover this bug, the outward sign of trouble was transient hangs in the test program for 2-4 minutes at a time (until the bad lock was released by gfs's normal drop logic). It's not clear whether this bug could have more severe consequences. Only applications using direct-io would be affected.

Version-Release number of selected component (if applicable):
All current RHEL4 code.

How reproducible:
unknown

Steps to Reproduce:
1. run rand_direct -s on some nodes
2. run make_panic -r 10 -l 100 on other nodes

Actual results:
tests hang on all nodes for a few minutes at a time, then run for less than a minute until the next hang

Expected results:
all tests run indefinitely

Additional info:
Created attachment 123593
rand_direct program mentioned in description
Fix by changing the grmode on the non-master node when we get ALTMODE back from the master.

[cluster-STABLE/dlm-kernel/src]% cvs commit
cvs commit: Examining .
Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/lockqueue.c,v  <--  lockqueue.c
new revision: 1.37.2.6.6.5; previous revision: 1.37.2.6.6.4
done

[cluster-RHEL4/dlm-kernel/src]% cvs commit
cvs commit: Examining .
Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/lockqueue.c,v  <--  lockqueue.c
new revision: 1.37.2.9; previous revision: 1.37.2.8
done
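For readers following along, the idea behind the fix can be sketched roughly as below. This is not the code from lockqueue.c: the struct, the function names, and the flag and mode constants (all prefixed sketch_ or SK_) are hypothetical stand-ins chosen for illustration, and their values are not taken from the dlm-kernel headers. The sketch only shows the requesting node recording the alternate mode as granted when the master's reply indicates that the alternate mode was used.

```c
#include <stdint.h>

/* Illustrative constants only: the names and values are assumptions for
 * this sketch, not the definitions from the dlm-kernel headers. */
enum sketch_mode { SK_MODE_CW = 2, SK_MODE_PR = 3 };

#define SK_LKF_ALTPR   0x1u  /* request flag: PR is an acceptable alternate */
#define SK_LKF_ALTCW   0x2u  /* request flag: CW is an acceptable alternate */
#define SK_SBF_ALTMODE 0x1u  /* reply flag: the alternate mode was granted  */

/* Hypothetical, simplified view of a lock on the requesting node. */
struct sketch_lkb {
    int      rqmode;   /* mode that was requested              */
    int      grmode;   /* mode recorded locally as granted     */
    uint32_t lkflags;  /* request flags (ALTPR / ALTCW)        */
};

/* Return the alternate mode named by the request flags, or -1 if none. */
static int sketch_alt_mode(const struct sketch_lkb *lkb)
{
    if (lkb->lkflags & SK_LKF_ALTPR)
        return SK_MODE_PR;
    if (lkb->lkflags & SK_LKF_ALTCW)
        return SK_MODE_CW;
    return -1;
}

/* Process the master's grant reply on the non-master node. */
static void sketch_process_grant(struct sketch_lkb *lkb, uint32_t sbflags)
{
    int alt = sketch_alt_mode(lkb);

    /* Default: the requested mode is what was granted. */
    lkb->grmode = lkb->rqmode;

    /* If the master reports that it granted the alternate mode, record
     * that mode instead. Skipping this step leaves grmode at the
     * requested mode, which is the mismatch described in this bug. */
    if ((sbflags & SK_SBF_ALTMODE) && alt != -1)
        lkb->grmode = alt;
}
```

With a lock requested in PR mode with the ALTCW flag, a reply carrying the alternate-mode flag leaves grmode at CW on the requesting node, matching what the master granted. A reply without the flag leaves grmode at PR.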
In the situation described above, the correct lock mode is always returned to GFS, so problems more severe than the hang already described are unlikely.
[cluster-RHEL4U3/dlm-kernel/src]% cvs commit
cvs commit: Examining .
Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/lockqueue.c,v  <--  lockqueue.c
new revision: 1.37.2.6.10.2; previous revision: 1.37.2.6.10.1
done
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0237.html