From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; fr; rv:1.7.6) Gecko/20050322

Description of problem:
When the gulm master node also mounts a GFS filesystem, the fencing process does not run properly if the gulm master node has to be fenced. I may lose data because the recovery process begins too early (before fencing is finished).

Version-Release number of selected component (if applicable):
GFS-6.0.2.20-2

How reproducible:
Always

Steps to Reproduce:
1. Get an 8-node cluster.
2. Choose 5 nodes as gulm servers.
3. Mount a GFS filesystem on your 8 nodes.
4. Unplug the network of the current gulm master.
5. Wait until another gulm server becomes the master.
6. Do NOT run fence_ack_manual and check whether the locks of the unplugged node are released.

Actual Results:
1. The locks are released immediately when another gulm server becomes the master.
2. The journal is recovered by another node immediately as well.

Expected Results:
The recovery process should wait until the user runs fence_ack_manual.

Additional info:
More explanations here: https://www.redhat.com/archives/linux-cluster/2005-July/msg00000.html
I filed this bugzilla as requested here: https://www.redhat.com/archives/linux-cluster/2005-July/msg00006.html
Have you tried this with only three nodes as gulm servers? How does it behave then?
Just checked with three nodes; the bug is there too.
check_for_stale_expires() is tripping on everyone. It should only act when a jid mapping is marked 1 (live mappings are marked 2). The only time a jid mapping is marked 1 is when a node other than the owner is replaying the journal. Why are the live mappings getting switched from 2 to 1? I don't know, but I bet that's the bug right there. I'll look deeper.
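For illustration only, here is a minimal C sketch of the jid-mapping states described above. The names jid_map_state, JID_REPLAY, and JID_LIVE, and the whole structure, are assumptions made for this example; they are not the actual GFS/lock_gulm source. It only shows the intended distinction between state 1 (journal being replayed by a non-owner) and state 2 (live mapping), which the bug confuses.

/* Hypothetical sketch of the jid-mapping check described in this comment.
 * Identifiers are invented for illustration, not actual GFS code. */
#include <stdio.h>

#define JID_UNUSED 0
#define JID_REPLAY 1  /* marked 1: journal being replayed by a node other than its owner */
#define JID_LIVE   2  /* marked 2: live mapping, the owner still holds the journal */

#define MAX_JIDS 8
static int jid_map_state[MAX_JIDS];

/* Only mappings marked 1 (JID_REPLAY) should be treated as stale expires.
 * The bug: live mappings (2) were being flipped to 1, so this check
 * tripped on every node instead of only on journals awaiting replay. */
static void check_for_stale_expires(void)
{
	int jid;

	for (jid = 0; jid < MAX_JIDS; jid++) {
		if (jid_map_state[jid] == JID_REPLAY)
			printf("jid %d: stale expire, replay journal\n", jid);
		else if (jid_map_state[jid] == JID_LIVE)
			printf("jid %d: live mapping, leave it alone\n", jid);
	}
}

int main(void)
{
	jid_map_state[0] = JID_LIVE;    /* healthy node */
	jid_map_state[1] = JID_REPLAY;  /* failed node whose journal needs replay */
	check_for_stale_expires();
	return 0;
}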
Fixing the issue that was in comment #3 didn't fix the bug. Digging more.
The bug only appears when the master lock server is also mounting GFS. So a workaround is to put the lock servers onto dedicated nodes (see the sketch below).
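As a rough sketch of that workaround, the cluster.ccs lock_gulm section might look like the following. The hostnames lock1, lock2, lock3 (dedicated lock-server nodes that do not mount GFS) and the cluster name are assumptions for this example; check the GFS 6.0 documentation for the exact syntax on your release.

cluster {
        name = "example"
        lock_gulm {
                servers = ["lock1", "lock2", "lock3"]
        }
}

The nodes that mount the GFS filesystem are simply left out of the servers list, so the gulm master can never be a node that also has the filesystem mounted.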
There was a kludge that tried to fix something, but I cannot find or figure out what it was supposed to fix. That kludge was causing this. Betting on this being a bigger problem than whatever it was trying to fix, and removing the kludge. I think what it tried to fix was some weird edge case where multiple clients and lock servers failed in some way.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-723.html
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-733.html