Red Hat Bugzilla – Bug 842634
ctdb crash on unrelated node, when simulating repeated server loss
Last modified: 2014-09-28 20:21:47 EDT
Created attachment 599979 [details]
test log covering the repeated node shutdown and resulting ctdb issue
Description of problem:
I have a test environment of 4 RHS2 nodes, subscribed to RHN and updated. ctdb has been configured across my nodes for 2 virtual IPs. ctdb setup and operation is fine under normal conditions, but when I performed 2 successive halts against node1, node2's ctdb crashed and would not rejoin the ctdb cluster. The resulting cluster appears to suffer from split brain (each node sees a different generation ID).
Version-Release number of selected component (if applicable):
I've repeated the tests 3 times, each time with the same result.
Steps to Reproduce:
1. create a 4 node cluster
2. configure ctdb for each of the nodes, providing 2 virtual IP addresses
3. Mount a volume using nfs on the virtual IP
4. IP maps to node 1
5. drop node 1
6. cluster will recover successfully
7. restart node 1
8. confirm cluster node status is all OK
9. halt node 1 again
10. node 2 is no longer able to communicate in the cluster instance
11. check generation IDs on each of the resulting nodes for correctness
node 2 drops from the cluster. The workaround is to stop ctdb on each of the 4 nodes and restart them one by one, only moving on to the next node when the previous node's status is OK.
I expect ctdb to behave the same way each time.
At no point do I expect a node halt to cause an issue on another node in the cluster.
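The generation ID comparison in step 11 can be sketched as a small script. This is only an illustration, not part of the original test run: it assumes the output of `ctdb status` has already been captured from each node as text and that it contains a line of the form `Generation:<value>`. The helper names `parse_generation` and `generations_agree` are hypothetical.

```python
import re

# Assumption: `ctdb status` prints a line such as "Generation:1362079228".
GENERATION_RE = re.compile(r"^Generation:\s*(\S+)", re.MULTILINE)


def parse_generation(status_text):
    """Return the generation value found in captured status text, or None."""
    match = GENERATION_RE.search(status_text)
    return match.group(1) if match else None


def generations_agree(status_by_node):
    """Compare generation IDs across nodes.

    status_by_node maps a node name to the captured status text for that node.
    Returns (agree, generations). agree is True only when every node reports
    the same numeric generation; a missing or non-numeric value counts as
    disagreement.
    """
    generations = {
        node: parse_generation(text) for node, text in status_by_node.items()
    }
    distinct = set(generations.values())
    agree = len(distinct) == 1 and all(
        g is not None and g.isdigit() for g in distinct
    )
    return agree, generations
```

If the nodes report different values, the returned dictionary shows which node has diverged, which is the split-brain symptom described above.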
Please confirm that the Samba version running is 3.5.10, and please provide the CTDB version number as well. I have checked the attachment and was not able to find this information.
Is this test being run on a "real" cluster or on a set of virtual machines? This may be a moot point but it would be good to test the behavior on real hardware just to be sure.
(In reply to comment #2)
> Please confirm that the Samba version running is 3.5.10, and please provide
> the CTDB version number as well. I have checked the attachment and was not
> able to find this information.
> Is this test being run on a "real" cluster or on a set of virtual machines?
> This may be a moot point but it would be good to test the behavior on real
> hardware just to be sure.
These are the versions:
[root@rhs-1 ~]# rpm -qa | egrep '(samba|ctdb)'
I think these are the same as at GA?
When I did these tests it was all VMs... Real hardware? Chance would be a fine thing ;o)
The boxes are subscribed to RHN, and the only outstanding patches are for tzdata and dracut.
Let me know if you need anything else.
Next step is to try and reproduce the problem, both on RHS2.0 stock and using a newer version of Samba/CTDB.
It may be a few days before I can configure this for testing.
Please see bug 869724.
It is likely that this crash is due to Gluster providing an incorrect response to the low-level F_GETLK fcntl() call. Once the patch is in place, please re-run the tests.
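For readers unfamiliar with the call, the following is a minimal sketch of an F_GETLK query, which asks whether a given byte-range lock would be blocked and by whom. It is not CTDB's code. It assumes 64-bit Linux, where `struct flock` is laid out as `short l_type; short l_whence; off_t l_start; off_t l_len; pid_t l_pid`; other platforms use a different layout. The function name `probe_lock` is hypothetical.

```python
import fcntl
import os
import struct

# Assumption: struct flock layout on 64-bit Linux (native alignment):
#   short l_type; short l_whence; off_t l_start; off_t l_len; pid_t l_pid
FLOCK_FMT = "hhqqi"


def probe_lock(path, start=0, length=1):
    """Ask whether a write lock on [start, start + length) would be blocked.

    Returns None when no conflicting lock is reported, otherwise a dict
    describing the lock that the filesystem says is in the way.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        request = struct.pack(
            FLOCK_FMT, fcntl.F_WRLCK, os.SEEK_SET, start, length, 0
        )
        # F_GETLK does not take the lock; it only fills in the reply.
        reply = fcntl.fcntl(fd, fcntl.F_GETLK, request)
    finally:
        os.close(fd)
    l_type, _whence, l_start, l_len, l_pid = struct.unpack(FLOCK_FMT, reply)
    if l_type == fcntl.F_UNLCK:
        return None
    return {"type": l_type, "start": l_start, "len": l_len, "pid": l_pid}
```

A caller relies on the reply being accurate: if the filesystem reports no conflict when another process holds the lock, or reports a wrong holder, the caller's view of lock ownership is wrong. Running such a probe against a file on the Gluster mount, with and without a lock held by another process, is one way to check the behavior before and after the patch.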
Assigning to QE for regression testing.
Possibly fixed by a GlusterFS patch. See bug 869724.
Verified it on the build -glusterfs 18.104.22.168rhs built on May 2 2013 06:08:46
I did not see the issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.