Bug 200753
Summary: | "ls" is blocked | |
---|---|---|---
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Mathieu AVILA <mathieu.avila>
Component: | dlm | Assignee: | David Teigland <teigland>
Status: | CLOSED WONTFIX | QA Contact: | Cluster QE <mspqa-list>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 4 | CC: | alban.crequy, ccaulfie, cluster-maint
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2006-11-27 16:15:37 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Mathieu AVILA
2006-07-31 14:40:19 UTC

There's a chance this may have the same root cause as bz 172944. One possible cause in this case is an unlock that happens while another lock operation is still in progress on the same lock. (GFS shouldn't do that, and there is plenty of room for problems like this in the dlm if it does happen for some reason.) In that scenario, the reply to the first operation reports "cancel rep 0" because the dlm expects a return status of EUNLOCK (for the unlock) but gets the 0 for the first op. The reply to the unlock then reports "process_lockqueue_reply ... state 0" because the lock was already removed from the waiting list. If this is what happened, we should see a message of "unlock cancel status %d" on one of the other nodes (even though this has nothing to do with lock cancellation). A sketch of this reply mix-up follows at the end of this comment.

We could print some additional debug info to try to say more definitively what's going wrong. If we're successful at that, the best I think we could hope for is a workaround that detects when this happens and does something special to handle it. If this is still a problem for the reporter, then we should have our QA group run a test like this.

(Sorry for the delay.) It never reproduced. The problem is that this behaviour is not desirable: the processes spend a lot of time communicating and are therefore very slow. We haven't continued investigating in that direction. We will try as hard as possible not to have two nodes accessing the very same data at the very same time, although GFS is designed for exactly that. The bug may therefore still be there, but it hasn't happened again. Still, it would be interesting to have this kind of test run by your QA group. Will reopen this if it becomes a problem again.
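To make the suspected sequence easier to follow, here is a minimal, hypothetical C sketch of the reply mix-up described above. It is not dlm code: the `lock_record` struct, the `EUNLOCK` stand-in value, and the `process_reply()` helper are all invented for illustration, and the real lockqueue machinery is reduced to a single "which op is outstanding" field.

```c
/* Hypothetical model only -- not actual dlm code.  The struct, the
 * EUNLOCK stand-in value, and process_reply() are invented to
 * illustrate how two outstanding ops on one lock confuse the replies. */
#include <stdio.h>

#define EUNLOCK 0x102           /* stand-in for the dlm's unlock status */

enum op { OP_NONE, OP_CONVERT, OP_UNLOCK };

struct lock_record {
    enum op waiting_op;         /* the op this node thinks is in flight */
};

/* Replies from the lock master arrive in the order the ops were sent,
 * but are matched against whatever the lock record says is waiting. */
static void process_reply(struct lock_record *lk, int status)
{
    if (lk->waiting_op == OP_NONE) {
        /* Corresponds to "process_lockqueue_reply ... state 0": a reply
         * arrived but the lock is no longer on the waiting list. */
        printf("reply with status %d, but lock not waiting (state 0)\n",
               status);
        return;
    }
    if (lk->waiting_op == OP_UNLOCK && status != EUNLOCK)
        /* Corresponds to "cancel rep 0": the dlm expected EUNLOCK for
         * the unlock but got the first op's status 0 instead. */
        printf("expected EUNLOCK, got status %d (cancel rep 0)\n", status);

    lk->waiting_op = OP_NONE;   /* take the lock off the waiting list */
}

int main(void)
{
    struct lock_record lk = { .waiting_op = OP_NONE };

    lk.waiting_op = OP_CONVERT; /* 1. convert sent, lock on waiting list  */
    lk.waiting_op = OP_UNLOCK;  /* 2. BUG: unlock issued while the convert
                                 *    is still in flight, clobbering the
                                 *    record of what we are waiting for   */

    process_reply(&lk, 0);        /* 3. convert's reply -> "cancel rep 0" */
    process_reply(&lk, EUNLOCK);  /* 4. unlock's reply  -> "state 0"      */
    return 0;
}
```

Under this simplified model, the convert's reply (status 0) is consumed as if it answered the unlock, producing the "cancel rep 0" message, and the real unlock reply then finds the lock already off the waiting list, producing the "state 0" message.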
There's a chance this may have the same root cause as bz 172944. One possible cause in this case is an unlock that happens while another lock operation is still in progress on the same lock. (GFS shouldn't do that and there is plenty of room for problems like this in the dlm if it does happen for some reason). In that scenario, the reply to the first operation reports "cancel rep 0" because the dlm expects a return status of EUNLOCK (for the unlock) but gets the 0 for the first op. The reply to the unlock then reports "process_lockqueue_reply ... state 0" because the lock was removed from the waiting list. If this is what happened, we should see a message of "unlock cancel status %d" on one of the other nodes. (Even though this has nothing to do with lock cancelation.) We could print some additional debug info to try to say more definatively what's going wrong. If we're successful at that the best I think we could hope for is a work-around to detect when this happens and does something special to handle it. If this is still a problem for the reporter then we should have our QA group run a test like this. (sorry for the delay) It never reproduced. The problem is, this behaviour is not a desired one : the processes are taking lots of time communicating and are therefore very slow. We haven't continued investigating in that way. We will try as hard as possible not to have 2 nodes accessing the very same data at the very time, although GFS is designed just for that. Therefore the bug may be there, but we it didn't happened again. Still, it would be interesting to have this kind of tests run by your QA group. Will reopen this if it becomes a problem again. |