Description of problem:
gulm_tool getstats on a client will report the existence of a Master if the client was logged into a Master when that Master node dropped into Arbitrating. Newly joining clients will not have this bit of information.

Version-Release number of selected component (if applicable):
GFS-6.0.0-1.2

How reproducible:
Every time

Steps to Reproduce:
1. Make lock_gulmd quorate on the nodes in the servers list.
2. Start lock_gulmd on a client.
3. Shut down lock_gulmd on the slave nodes until the Master moves to Arbitrating.
4. Run "gulm_tool getstats client1"; you will see an entry for the Master.
5. Start lock_gulmd on a new client and run "gulm_tool getstats client2"; there will be no entry for the Master.

Actual results:
[root@trin-01 root]# gulm_tool getstats trin-04
I_am = Client
Master = trin-01.lab.msp.redhat.com
rank = -1
GenerationID = 1092267400182171
run time = 14710
pid = 2505
verbosity = Default,Network2,Locking,Subscribers,LoginLoops,ServerState
failover = enabled

[root@trin-01 root]# gulm_tool getstats trin-05
I_am = Client
quorum_has = 1
quorum_needs = 2
rank = -1
GenerationID = 0
run time = 712
pid = 2500
verbosity = Default,Network2,Locking,Subscribers,LoginLoops,ServerState
failover = enabled

Expected results:
I would expect the fields quorum_has, quorum_needs, and Master to be the same on all the clients.

Additional info:
Perhaps this is another set of fields worth munging for the getstats output (rawstats could stay as it is)?
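The disagreement above can be checked mechanically. A minimal sketch, assuming only the `key = value` layout shown in the Actual results; the sample dump is copied from the trin-04 output in this report, and the `getfield` helper is an illustration, not part of gulm_tool:

```shell
# Extract a field from saved `gulm_tool getstats` output. The dump below is
# the trin-04 sample from this report, trimmed for brevity.
dump='I_am = Client
Master = trin-01.lab.msp.redhat.com
rank = -1
GenerationID = 1092267400182171'

getfield() {
    # Print the value of a "key = value" line; prints nothing if absent.
    printf '%s\n' "$dump" | awk -F' = ' -v k="$1" '$1 == k { print $2 }'
}

getfield Master        # trin-01.lab.msp.redhat.com
getfield quorum_has    # nothing: this dump carries no quorum fields
```

Running the same extraction against trin-05's dump would show the opposite: quorum fields present, Master absent, which is exactly the inconsistency being reported.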
Currently gulm_tool getstats reports what a specific server on a specific node thinks right now, which isn't what you were expecting. It is perfectly possible for a server on a node to be thinking the wrong things when you ask it; that will get corrected in time (or the node is really messed up and will get fenced). I'm not convinced there is anything here to fix.
The node is heartbeating the Arbitrating (old Master) node. It is not really messed up and will not get fenced (unless you mean to imply that all nodes logged into the cluster will eventually get fenced if quorum is lost for whatever reason). This will not get corrected by time: the client will remain logged into the old Master (now Arbitrating) node. The only way this gets corrected is for the Arbitrating node to become quorate again, which makes it membership dependent, not time dependent.

If gulm_tool can't be wrong in this case, then the client itself must be: it thinks there is quorum when in fact there is not. I thought this issue was addressed by the arbitrating node dropping all of its connections and forcing all clients and servers to re-login when quorum was lost. In that case, the clients would see that the Master has lost quorum. Am I missing something?
Is the client really logged into the node it thinks is Master (but which is really Arbitrating)? That would be a bug. The getstats output says who that node thinks the Master is; it does not mean the client is connected to that node. If need really be, I can munge the output of getstats.
I am the louse; I was thinking of something else entirely. Pretty much just delete everything I said in comment #3.
Right, so I think I'm getting somewhere on this. Somewhere back when, clients were changed to not get kicked when the Master dropped to Arbitrating, because that created bogus fences. That is why the client is still logged in. Of course, with all the shuffling of CVS trees and Bugzilla DBs, I cannot find the references to this.
So, out at the end points, the clients don't know whether the main servers are quorate or not. That was not a problem for GFS because of the way the lock paths were designed: the lock tables, which live on the main servers and thus do know the quorum state, would stop lock traffic when quorum was lost. So the end clients running GFS worked as one would expect. Now that we're trying to make things less GFS specific, we cannot rely on this anymore. There is a fix in head which, while minor, changes protocols and the library interface. There are two places in stable where this can be noticed. First, as in the initial bug report (which mostly just appears wrong, even though everything is still working correctly). Second, apps using libgulm that run on client nodes won't get the correct quorum state in this condition (is there anything on 6.0 using libgulm?).
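Until clients report consistent state, the quorum fields can at least be checked where they do appear (as in the trin-05 output in the report). A rough monitoring sketch, assuming only the `key = value` getstats layout shown above; the dump here is a stand-in for live `gulm_tool getstats` output:

```shell
# Check whether a getstats dump claims sufficient quorum. The dump is the
# trin-05 sample from the report above (1 of 2 needed, i.e. inquorate).
dump='I_am = Client
quorum_has = 1
quorum_needs = 2
rank = -1'

has=$(printf '%s\n' "$dump" | awk -F' = ' '$1 == "quorum_has" { print $2 }')
needs=$(printf '%s\n' "$dump" | awk -F' = ' '$1 == "quorum_needs" { print $2 }')

# Only meaningful when the fields are present at all; dumps like trin-04's
# in this report omit them entirely, which is part of the bug.
if [ -n "$has" ] && [ "$has" -lt "$needs" ]; then
    echo "inquorate: $has of $needs"
fi
```

Note this only works on nodes that expose the fields; the report's point is that which fields appear depends on login history, not on actual cluster state.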
The fix is in 6.0.* now too, except that the userspace libgulm interface has not changed.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-466.html