Bug 842634

Summary: ctdb crash on unrelated node, when simulating repeated server loss
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: glusterd
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: ---
Reporter: Paul Cuzner <pcuzner>
Assignee: Christopher R. Hertel <crh>
QA Contact: Sudhir D <sdharane>
Docs Contact:
CC: amarts, crh, gluster-bugs, jebrown, rfortier, rwheeler, sdharane, shaines, vbellur
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 858431 (view as bug list)
Environment:
Last Closed: 2013-09-23 22:38:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 858431
Attachments: test log covering the repeated node shutdown and resulting ctdb issue

Description Paul Cuzner 2012-07-24 10:00:09 UTC
Created attachment 599979 [details]
test log covering the repeated node shutdown and resulting ctdb issue

Description of problem:
I have a test environment of 4 RHS 2.0 nodes, subscribed to RHN and updated. ctdb has been configured across the nodes with 2 virtual IPs. ctdb setup and operation are fine under normal conditions, but when I perform 2 successive halts against node 1, node 2's ctdb crashes and won't rejoin the ctdb cluster. The resulting cluster appears to be split-brained (each node sees a different generation ID).
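
For reference, a minimal sketch of the kind of CTDB virtual-IP configuration used for this setup, assuming the standard /etc/ctdb/nodes and /etc/ctdb/public_addresses files; the addresses and interface name below are placeholders, not the values from this environment:

# /etc/ctdb/nodes - one internal address per cluster node (placeholder addresses)
cat > /etc/ctdb/nodes <<'EOF'
192.168.10.1
192.168.10.2
192.168.10.3
192.168.10.4
EOF

# /etc/ctdb/public_addresses - the two floating virtual IPs that CTDB moves
# between healthy nodes (placeholder addresses and interface)
cat > /etc/ctdb/public_addresses <<'EOF'
192.168.20.101/24 eth0
192.168.20.102/24 eth0
EOF

# restart ctdb on each node to pick up the configuration
service ctdb restart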

Version-Release number of selected component (if applicable):
RHS v2.0

How reproducible:
I've repeated the tests 3 times, each time with the same result.

Steps to Reproduce:
1. create a 4 node cluster
2. configure ctdb for each of the nodes, providing 2 virtual IP addresses
3. mount a volume over NFS using one of the virtual IPs (as sketched after this list)
4. confirm the virtual IP currently maps to node 1
5. drop node 1
6. cluster will recover successfully
7. restart node 1 
8. confirm cluster node status is all OK
9. halt node 1 again
10. node 2 is no longer able to communicate with the cluster instance
11. check the generation IDs on each of the remaining nodes for consistency (see the sketch after this list)
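
A rough sketch of steps 3 and 11, assuming passwordless ssh between the nodes and a Gluster volume exported over the built-in NFS server; the volume name, virtual IP and mount point are placeholders:

# step 3: mount the volume over NFS (v3) via one of the virtual IPs
mkdir -p /mnt/ctdbvol
mount -t nfs -o vers=3 192.168.20.101:/ctdbvol /mnt/ctdbvol

# step 11: compare the CTDB generation id reported by each node; after a
# clean recovery every healthy node should report the same value
for n in $(cat /etc/ctdb/nodes); do
    echo "== $n =="
    ssh "$n" 'ctdb status | grep -i generation'
done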
  
Actual results:
node 2 drops from the cluster. The workaround is to stop ctdb on all 4 nodes and then restart them one by one, only moving on to the next node once the node status reports OK (a rough sketch of this follows).
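
For the record, a rough sketch of that workaround, assuming passwordless ssh between the nodes and the stock ctdb init script (exact status strings may differ between ctdb versions):

# stop ctdb on every node first
for n in $(cat /etc/ctdb/nodes); do
    ssh "$n" service ctdb stop
done

# bring the nodes back one at a time, waiting for each to report OK
# before moving on to the next
for n in $(cat /etc/ctdb/nodes); do
    ssh "$n" service ctdb start
    until ssh "$n" ctdb status | grep -q 'OK (THIS NODE)'; do
        sleep 5
    done
done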


Expected results:
I expect ctdb to behave the same way each time.
At no point should halting one node cause an issue on another node in the cluster.



Additional info:

Comment 2 Christopher R. Hertel 2012-10-09 15:11:02 UTC
Please confirm that the Samba version running is 3.5.10, and please provide the CTDB version number as well.  I have checked the attachment and was not able to find this information.

Is this test being run on a "real" cluster or on a set of virtual machines?  This may be a moot point but it would be good to test the behavior on real hardware just to be sure.

Comment 3 Paul Cuzner 2012-10-09 16:17:55 UTC
(In reply to comment #2)
> Please confirm that the Samba version running is 3.5.10, and please provide
> the CTDB version number as well.  I have checked the attachment and was not
> able to find this information.
> 
> Is this test being run on a "real" cluster or on a set of virtual machines? 
> This may be a moot point but it would be good to test the behavior on real
> hardware just to be sure.

Hi Chris,

These are the versions;
[root@rhs-1 ~]# rpm -qa | egrep '(samba|ctdb)'
ctdb-1.0.114.5-1.el6.x86_64
samba-winbind-clients-3.5.10-116.el6_2.x86_64
samba-common-3.5.10-116.el6_2.x86_64
samba-3.5.10-116.el6_2.x86_64

I think these are the same as at GA?

When I did these tests it was all VMs... Real hardware? Chance would be a fine thing ;o)

The boxes are subscribed to RHN, with only tzdata and dracut updates outstanding.

Let me know if you need anything else.

Cheers,

PC

Comment 4 Christopher R. Hertel 2012-10-10 01:17:00 UTC
The next step is to try to reproduce the problem, both on stock RHS 2.0 and with a newer version of Samba/CTDB.

It may be a few days before I can configure this for testing.

Comment 5 Christopher R. Hertel 2012-11-20 15:16:00 UTC
Please see bug 869724.

It is likely that this crash is due to Gluster providing an incorrect response to the low-level F_GETLK fcntl() call.  Once the patch is in place, please re-run the tests.
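
Not a substitute for re-testing with the patched build, but one way to exercise the fcntl() byte-range locking path on the shared Gluster storage independently of CTDB is the ping_pong test utility distributed with ctdb/Samba (it may live in a separate sub-package on some builds); a rough sketch, with the file path a placeholder for a location on the shared volume:

# run this concurrently on all 4 nodes against the same file on the shared
# Gluster-backed mount; with 4 nodes, use nodes+1 = 5 locks.
# A steady lock rate with no errors suggests coherent fcntl() locking.
ping_pong /mnt/glustervol/ping_pong.dat 5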

Comment 6 Christopher R. Hertel 2012-11-30 01:58:29 UTC
Assigning to QE for regression testing.
Possibly fixed by a GlusterFS patch.  See bug 869724.

Comment 7 Ujjwala 2013-05-07 07:31:25 UTC
Verified on build glusterfs 3.4.0.2rhs (built on May 2 2013 06:08:46).
I did not see the issue.

Comment 10 Scott Haines 2013-09-23 22:38:54 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html

Comment 11 Scott Haines 2013-09-23 22:41:31 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html