842634 – ctdb crash on unrelated node, when simulating repeated server loss

Summary: ctdb crash on unrelated node, when simulating repeated server loss

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	glusterd
Sub Component:
Version:	2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Christopher R. Hertel
QA Contact:	Sudhir D
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	858431

Reported:	2012-07-24 10:00 UTC by Paul Cuzner
Modified:	2014-09-29 00:21 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	858431
Environment:
Last Closed:	2013-09-23 22:38:54 UTC
Embargoed:
Dependent Products:

Attachments
test log covering the repeated node shutdown and resulting ctdb issue (10.13 KB, application/octet-stream) 2012-07-24 10:00 UTC, Paul Cuzner

Description Paul Cuzner 2012-07-24 10:00:09 UTC

Created attachment 599979
test log covering the repeated node shutdown and resulting ctdb issue

Description of problem:
I have a test environment of 4 RHS2 nodes, subscribed to RHN and updated. ctdb has been configured across my nodes for 2 virtual IPs. ctdb setup and operation are fine under normal conditions, but when I performed 2 successive halts against node1, ctdb on node2 crashed and would not rejoin the ctdb cluster. The resulting cluster appears to suffer from split brain (each node sees a different generation id).

Version-Release number of selected component (if applicable):
RHS v2.0

How reproducible:
I've repeated the tests 3 times, each time with the same result.

Steps to Reproduce:
1. Create a 4-node cluster.
2. Configure ctdb on each of the nodes, providing 2 virtual IP addresses.
3. Mount a volume using NFS on the virtual IP.
4. The IP maps to node 1.
5. Drop node 1.
6. The cluster recovers successfully.
7. Restart node 1.
8. Confirm the cluster node status is all OK.
9. Halt node 1 again.
10. Node 2 is no longer able to communicate in the cluster instance.
11. Check the generation ids on each of the remaining nodes for correctness.
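
The generation check in the last step can be sketched as below. This is only an illustration, not part of the original test: it assumes that `ctdb status` prints a line of the form `Generation:<number>`, and that the output from each node has already been collected (for example over ssh) into a mapping. The function names are hypothetical.

```python
import re

# Assumption: `ctdb status` prints one line of the form "Generation:<number>".
GENERATION_RE = re.compile(r"^Generation:(\d+)\s*$", re.MULTILINE)


def parse_generation(status_output):
    """Return the generation id found in `ctdb status` output, or None."""
    match = GENERATION_RE.search(status_output)
    return int(match.group(1)) if match else None


def generations_agree(outputs_by_node):
    """Compare generation ids across nodes.

    outputs_by_node maps a node name to the text of `ctdb status` run on it.
    Returns (agree, generations), where agree is True only if every node
    reported a numeric generation id and all of them are identical.
    """
    generations = {
        node: parse_generation(output)
        for node, output in outputs_by_node.items()
    }
    distinct = set(generations.values())
    agree = len(distinct) == 1 and None not in distinct
    return agree, generations
```

A node whose output has no numeric generation line is treated as disagreeing, so a node that is still in recovery or not running ctdb makes the check fail rather than pass silently.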
  
Actual results:
Node 2 drops from the cluster. The workaround is to stop ctdb on each of the 4 nodes, then restart them one by one, only moving on to the next node when the current node's status is OK.
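
The workaround can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested procedure: `run(node, command)` is a hypothetical caller-supplied helper (for instance an ssh wrapper) that runs a command on a node and returns its output, ctdb is assumed to be managed with `service ctdb stop` / `service ctdb start`, and the local node's line in `ctdb status` is assumed to look like `pnn:<n> <address> OK (THIS NODE)`.

```python
import time


def this_node_ok(status_output):
    """True if the '(THIS NODE)' line of `ctdb status` reports exactly OK."""
    for line in status_output.splitlines():
        if line.startswith("pnn:") and "(THIS NODE)" in line:
            fields = line.split()
            # Assumed shape: pnn:<n> <address> <flags> (THIS NODE)
            return len(fields) >= 3 and fields[2] == "OK"
    return False


def rolling_restart(nodes, run, timeout=120.0, poll=2.0, sleep=time.sleep):
    """Stop ctdb everywhere, then start it node by node, waiting for OK.

    run(node, command) is a hypothetical helper supplied by the caller.
    Raises TimeoutError if a node does not report OK within `timeout` seconds.
    """
    for node in nodes:
        run(node, "service ctdb stop")
    for node in nodes:
        run(node, "service ctdb start")
        deadline = time.monotonic() + timeout
        while not this_node_ok(run(node, "ctdb status")):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{node} did not reach OK within {timeout}s")
            sleep(poll)
```

Only the node being started is checked before moving on, which matches the manual procedure described above; the nodes not yet restarted are expected to show as disconnected at that point.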


Expected results:
I expect ctdb to behave the same way each time.
At no point do I expect a node halt to cause an issue on another node in the cluster.



Additional info:

Comment 2 Christopher R. Hertel 2012-10-09 15:11:02 UTC

Please confirm that the Samba version running is 3.5.10, and please provide the CTDB version number as well.  I have checked the attachment and was not able to find this information.

Is this test being run on a "real" cluster or on a set of virtual machines?  This may be a moot point but it would be good to test the behavior on real hardware just to be sure.

Comment 3 Paul Cuzner 2012-10-09 16:17:55 UTC

(In reply to comment #2)
> Please confirm that the Samba version running is 3.5.10, and please provide
> the CTDB version number as well.  I have checked the attachment and was not
> able to find this information.
> 
> Is this test being run on a "real" cluster or on a set of virtual machines? 
> This may be a moot point but it would be good to test the behavior on real
> hardware just to be sure.

Hi Chris,

These are the versions:
[root@rhs-1 ~]# rpm -qa | egrep '(samba|ctdb)'
ctdb-1.0.114.5-1.el6.x86_64
samba-winbind-clients-3.5.10-116.el6_2.x86_64
samba-common-3.5.10-116.el6_2.x86_64
samba-3.5.10-116.el6_2.x86_64

I think these are the same as at GA?

When I did these tests it was all VMs... Real hardware? Chance would be a fine thing ;o)

The boxes are subscribed to RHN with the only outstanding patches for tzdata and dracut pending.

Let me know if you need anything else.

Cheers,

PC

Comment 4 Christopher R. Hertel 2012-10-10 01:17:00 UTC

Next step is to try and reproduce the problem, both on RHS2.0 stock and using a newer version of Samba/CTDB.

It may be a few days before I can configure this for testing.

Comment 5 Christopher R. Hertel 2012-11-20 15:16:00 UTC

Please see bug 869724.

It is likely that this crash is due to Gluster providing an incorrect response to the low-level F_GETLK fcntl() call.  Once the patch is in place, please re-run the tests.
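
For context, F_GETLK asks whether a given lock could be placed on a file: if another process holds a conflicting lock, the kernel fills in that lock's type and the holder's pid, otherwise it sets the type to F_UNLCK. The sketch below shows that behaviour on an ordinary local file. It is not the GlusterFS patch and does not reproduce this bug; it assumes 64-bit Linux, where `struct flock` is taken to be laid out as `short, short, off_t, off_t, pid_t`, and the function names are hypothetical.

```python
import fcntl
import os
import struct
import tempfile

# Assumed struct flock layout on 64-bit Linux:
# short l_type; short l_whence; off_t l_start; off_t l_len; pid_t l_pid
FLOCK_FMT = "hhqqi"


def query_lock(fd):
    """Ask F_GETLK whether a whole-file write lock could be taken on fd.

    Returns (l_type, l_pid) as filled in by the kernel.
    """
    request = struct.pack(FLOCK_FMT, fcntl.F_WRLCK, os.SEEK_SET, 0, 0, 0)
    reply = fcntl.fcntl(fd, fcntl.F_GETLK, request)
    l_type, _whence, _start, _len, l_pid = struct.unpack(FLOCK_FMT, reply)
    return l_type, l_pid


def demo():
    """Query a file while a child process holds a lock, then after it exits."""
    with tempfile.NamedTemporaryFile() as tmp:
        ready_r, ready_w = os.pipe()
        done_r, done_w = os.pipe()
        child = os.fork()
        if child == 0:
            # Child: take an exclusive lock, signal readiness, wait, then exit.
            fd = os.open(tmp.name, os.O_RDWR)
            fcntl.lockf(fd, fcntl.LOCK_EX)
            os.write(ready_w, b"1")
            os.read(done_r, 1)
            os._exit(0)
        os.read(ready_r, 1)                 # wait until the child holds the lock
        held = query_lock(tmp.fileno())     # conflicting lock should be reported
        os.write(done_w, b"1")
        os.waitpid(child, 0)                # child exit releases its lock
        free = query_lock(tmp.fileno())     # no conflict expected now
        for fd in (ready_r, ready_w, done_r, done_w):
            os.close(fd)
        return child, held, free
```

On a filesystem that answers correctly, `held` carries `F_WRLCK` together with the child's pid, and `free` carries `F_UNLCK`. A filesystem that returns something else here would mislead any program that relies on the reply to decide who owns a lock.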

Comment 6 Christopher R. Hertel 2012-11-30 01:58:29 UTC

Assigning to QE for regression testing.
Possibly fixed by a GlusterFS patch.  See bug 869724.

Comment 7 Ujjwala 2013-05-07 07:31:25 UTC

Verified on the build glusterfs 3.4.0.2rhs, built on May 2 2013 06:08:46.
I did not see the issue.

Comment 10 Scott Haines 2013-09-23 22:38:54 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html
