Description of problem:
1. Get all nodes in the cluster to a particular cluster.conf version level greater than 2.
2. Update cluster.conf on one node.
3. Send a -HUP signal on that node.
4. Kill that node.
5. Verify the cluster.conf version levels on all other nodes; they should have been bumped.
6. Bring the killed node back up.
7. Check its cluster.conf version; it should be the same as on all others.
8. Start ccsd.
9. Check its cluster.conf version; it should be the same as on all others.
10. Attempt cman_tool join. You then get: CMAN: Cluster membership rejected
11. Check its cluster.conf version; it has gone down exactly 2 version levels. CRAZY! :)

How reproducible: Always
Updates with the proper version and component name.
Updates with the proper version and component name. Again, just love our tools.
- fix bug 143165, 134604, and 133254
- update related issues

These all seem to be related to the same issue, that is, remote nodes were erroneously processing an update as though they were the originator - taking on some tasks that didn't belong to them. This was causing connect failures, version rollbacks, etc.
With the exact same scenario, the cman_tool join command no longer fails with an error. HOWEVER, rather than dropping 2 version levels, the version drops all the way down to the first known version level. So while all the other nodes are at, say, v8, the recovered node, which was at v8 before the cman_tool join attempt, drops down to v1 after the cman_tool join.
The problem stemmed from the fact that connect and broadcast request processing went through different code paths. When an update happens, a bit is set telling the daemon that the next request should trigger a read of the config file. While this happened correctly for a connect, it did not for a broadcast.

When the node comes back up, the daemon is started, and cman_tool join is initiated, the daemon broadcasts for config files. The other nodes were responding with an old version (ignoring the update bit). Since the other nodes are quorate, ccsd respects their view instead of the more recent view it has - thus resulting in a version regression.

Now that this is fixed, the cman_tool join error will come back. (I have seen it in the logs - I don't know what Dave has done to the printout on the command line.) This is a result of the fact that cman has not been made aware of the update (via 'cman_tool version -r <new>'). So the incoming node will have a version number that is different (higher) than the rest of the nodes - thus, it is rejected. ccs_tool will soon (i.e., now) take away the need to run 'cman_tool version -r <new>'.
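For anyone following along, here is a rough toy model of the regression described above. It is only a sketch, not ccsd code: the Node class, its fields, the honor_update_bit switch, and the choice of v1 as the stale cached version are all made up for illustration. The only behaviour taken from this bug is that the update bit was honored on connect but not on broadcast, and that the view of quorate peers overrides the rejoining node's own newer copy.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one ccsd instance (not ccsd's real structures)."""
    on_disk_version: int          # version in cluster.conf on disk
    cached_version: int           # version the daemon currently holds in memory
    update_pending: bool = False  # the "update bit" set when an update happens
    quorate: bool = True

    def _reload_if_pending(self):
        # Re-read the config file if an update was flagged, then clear the bit.
        if self.update_pending:
            self.cached_version = self.on_disk_version
            self.update_pending = False

    def handle_connect(self):
        # Connect path: the update bit was always honored here.
        self._reload_if_pending()
        return self.cached_version

    def handle_broadcast(self, honor_update_bit):
        # Broadcast path: before the fix the update bit was ignored,
        # so the stale in-memory version was sent back.
        if honor_update_bit:
            self._reload_if_pending()
        return self.cached_version


def version_after_join(local_version, peers, honor_update_bit):
    """Version a rejoining node ends up with after broadcasting for config files."""
    replies = [p.handle_broadcast(honor_update_bit) for p in peers if p.quorate]
    # The view of quorate peers wins, even if the local copy is newer.
    return max(replies) if replies else local_version


def make_peers():
    # Assumed state: peers have v8 on disk but still cache v1, update bit set.
    return [Node(on_disk_version=8, cached_version=1, update_pending=True)
            for _ in range(3)]


print(version_after_join(8, make_peers(), honor_update_bit=False))  # before the fix
print(version_after_join(8, make_peers(), honor_update_bit=True))   # after the fix
```

With the assumed state, the first call models the rejoining node falling back to the stale version, and the second models it keeping the current one once broadcasts go through the same reload check as connects.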
fix verified.