Bug 842634
| Field | Value |
| --- | --- |
| Summary | ctdb crash on unrelated node, when simulating repeated server loss |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Reporter | Paul Cuzner <pcuzner> |
| Component | glusterd |
| Assignee | Christopher R. Hertel <crh> |
| Status | CLOSED ERRATA |
| QA Contact | Sudhir D <sdharane> |
| Severity | unspecified |
| Priority | high |
| Version | 2.0 |
| CC | amarts, crh, gluster-bugs, jebrown, rfortier, rwheeler, sdharane, shaines, vbellur |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Story Points | --- |
| Clone Of | |
| | 858431 (view as bug list) |
| Last Closed | 2013-09-23 22:38:54 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Bug Blocks | 858431 |
Please confirm that the Samba version running is 3.5.10, and please provide the CTDB version number as well. I have checked the attachment and was not able to find this information.

Is this test being run on a "real" cluster or on a set of virtual machines? This may be a moot point, but it would be good to test the behavior on real hardware just to be sure.

(In reply to comment #2)
> Please confirm that the Samba version running is 3.5.10, and please provide
> the CTDB version number as well. I have checked the attachment and was not
> able to find this information.
>
> Is this test being run on a "real" cluster or on a set of virtual machines?
> This may be a moot point but it would be good to test the behavior on real
> hardware just to be sure.

Hi Chris,

These are the versions:

    [root@rhs-1 ~]# rpm -qa | egrep '(samba|ctdb)'
    ctdb-1.0.114.5-1.el6.x86_64
    samba-winbind-clients-3.5.10-116.el6_2.x86_64
    samba-common-3.5.10-116.el6_2.x86_64
    samba-3.5.10-116.el6_2.x86_64

I think these are the same as at GA? When I did these tests it was all VMs... Real hardware? Chance would be a fine thing ;o)

The boxes are subscribed to RHN, with only updates for tzdata and dracut pending. Let me know if you need anything else.

Cheers,

PC

The next step is to try to reproduce the problem, both on stock RHS 2.0 and with a newer version of Samba/CTDB. It may be a few days before I can configure this for testing.

Please see bug 869724. It is likely that this crash is due to Gluster providing an incorrect response to the low-level F_GETLK fcntl() call. Once the patch is in place, please re-run the tests (a rough way to exercise this locking path directly is sketched below).

Assigning to QE for regression testing. Possibly fixed by a GlusterFS patch. See bug 869724.

Verified on the build glusterfs 3.4.0.2rhs (built on May 2 2013 06:08:46). I did not see the issue.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html
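The root-cause comment above points at the F_GETLK fcntl() behavior of the Gluster mount. As a rough way to exercise that locking path directly, independent of CTDB, the following is a minimal sketch using the well-known ping_pong byte-range lock tester. The mount point, file name, and the availability of ping_pong (it may need to be built from the ctdb source tree) are assumptions for illustration, not details taken from this report.

```sh
# Sketch: exercise fcntl byte-range locking on the shared Gluster mount.
# /mnt/gluster is a placeholder for the FUSE mount of the volume; the
# ping_pong tool is assumed to be available (or built from ctdb sources).

# Run this simultaneously on all 4 nodes; the argument is nodes + 1.
ping_pong /mnt/gluster/locktest.dat 5

# A healthy cluster filesystem shows a steady, roughly equal locks/sec rate
# on every node; a stall or wildly divergent rates points at broken
# fcntl (F_GETLK/F_SETLKW) handling in the underlying filesystem.
```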
Created attachment 599979 [details]
test log covering the repeated node shutdown and resulting ctdb issue

Description of problem:
I have a test environment of 4 RHS 2.0 nodes, subscribed to RHN and updated. ctdb has been configured across the nodes with 2 virtual IPs. ctdb setup and operation is fine under normal conditions, but when I performed 2 successive halts against node 1, node 2's ctdb crashed and would not rejoin the ctdb cluster. The resulting cluster appears to suffer from split brain (each node sees a different generation id).

Version-Release number of selected component (if applicable):
RHS v2.0

How reproducible:
I have repeated the test 3 times, each time with the same result.

Steps to Reproduce:
1. Create a 4 node cluster.
2. Configure ctdb on each of the nodes, providing 2 virtual IP addresses.
3. Mount a volume over NFS using one of the virtual IPs.
4. The IP maps to node 1.
5. Halt node 1.
6. The cluster recovers successfully.
7. Restart node 1.
8. Confirm the cluster node status is all OK.
9. Halt node 1 again.
10. Node 2 is no longer able to communicate in the cluster instance.
11. Check the generation ids on each of the remaining nodes for correctness (see the sketch below).

Actual results:
Node 2 drops from the cluster. The workaround is to stop ctdb on each of the 4 nodes and restart them one by one, only moving on to the next node once the node status is OK (see the sketch below).

Expected results:
I expect ctdb to behave the same way on each run. At no point do I expect a halt of one node to cause an issue on another node in the cluster.

Additional info:
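As a rough illustration of steps 8 and 11 and of the workaround above, the following is a minimal sketch using the stock ctdb command-line tools on RHS 2.0 / RHEL 6. The node hostnames (rhs-1 through rhs-4) and the use of ssh in a loop are assumptions for illustration, not taken from this report.

```sh
# Sketch: confirm node state and generation id on every node (steps 8 and 11).
# Hostnames rhs-1..rhs-4 are placeholders for the 4 cluster nodes.
for h in rhs-1 rhs-2 rhs-3 rhs-4; do
    echo "== $h =="
    ssh "$h" 'ctdb status | egrep -i "pnn:|generation"'
done
# A healthy cluster reports every node as OK and the same Generation value
# everywhere; differing generations match the split-brain symptom above.

# Sketch of the workaround: stop ctdb everywhere, then restart it one node
# at a time, waiting for "ctdb status" to show OK before moving on.
for h in rhs-1 rhs-2 rhs-3 rhs-4; do
    ssh "$h" 'service ctdb stop'
done
for h in rhs-1 rhs-2 rhs-3 rhs-4; do
    ssh "$h" 'service ctdb start'
    sleep 10
    ssh "$h" 'ctdb status'   # verify the node is OK before continuing
done
```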