Bug 842634

Summary: ctdb crash on unrelated node, when simulating repeated server loss
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: glusterd
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: ---
Reporter: Paul Cuzner <pcuzner>
Assignee: Christopher R. Hertel <crh>
QA Contact: Sudhir D <sdharane>
Docs Contact:
CC: amarts, crh, gluster-bugs, jebrown, rfortier, rwheeler, sdharane, shaines, vbellur
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 858431 (view as bug list)
Environment:
Last Closed: 2013-09-23 22:38:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 858431
Attachments: test log covering the repeated node shutdown and resulting ctdb issue

Description Paul Cuzner 2012-07-24 10:00:09 UTC
Created attachment 599979 [details]
test log covering the repeated node shutdown and resulting ctdb issue

Description of problem:
I have a test environment of 4 RHS 2.0 nodes, subscribed to RHN and updated. ctdb has been configured across the nodes with 2 virtual IPs. ctdb setup and operation are fine under normal conditions, but when I perform 2 successive halts against node 1, node 2's ctdb crashes and won't rejoin the ctdb cluster. The resulting cluster appears to be split-brained (each node sees a different generation ID).
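
For reference, a minimal sketch of the kind of CTDB virtual-IP configuration used for this setup, assuming the standard /etc/ctdb/nodes and /etc/ctdb/public_addresses files; the addresses and interface name below are placeholders, not the values from this environment:

# /etc/ctdb/nodes - one internal address per cluster node (placeholder addresses)
cat > /etc/ctdb/nodes <<'EOF'
192.168.10.1
192.168.10.2
192.168.10.3
192.168.10.4
EOF

# /etc/ctdb/public_addresses - the two floating virtual IPs that CTDB moves
# between healthy nodes (placeholder addresses and interface)
cat > /etc/ctdb/public_addresses <<'EOF'
192.168.20.101/24 eth0
192.168.20.102/24 eth0
EOF

# restart ctdb on each node to pick up the configuration
service ctdb restart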

Version-Release number of selected component (if applicable):
RHS v2.0

How reproducible:
I've repeated the tests 3 times, each time with the same result.

Steps to Reproduce:
1. create a 4 node cluster
2. configure ctdb for each of the nodes, providing 2 virtual IP addresses
3. mount a volume over NFS using one of the virtual IPs (as sketched after this list)
4. confirm the virtual IP currently maps to node 1
5. drop node 1
6. cluster will recover successfully
7. restart node 1 
8. confirm cluster node status is all OK
9. halt node 1 again
10. node 2 is no longer able to communicate with the cluster instance
11. check the generation IDs on each of the remaining nodes for consistency (see the sketch after this list)
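
A rough sketch of steps 3 and 11, assuming passwordless ssh between the nodes and a Gluster volume exported over the built-in NFS server; the volume name, virtual IP and mount point are placeholders:

# step 3: mount the volume over NFS (v3) via one of the virtual IPs
mkdir -p /mnt/ctdbvol
mount -t nfs -o vers=3 192.168.20.101:/ctdbvol /mnt/ctdbvol

# step 11: compare the CTDB generation id reported by each node; after a
# clean recovery every healthy node should report the same value
for n in $(cat /etc/ctdb/nodes); do
    echo "== $n =="
    ssh "$n" 'ctdb status | grep -i generation'
done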
  
Actual results:
node 2 drops from the cluster. The workaround is to stop ctdb on all 4 nodes and then restart them one by one, only moving on to the next node once the node status reports OK (a rough sketch of this follows).
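
For the record, a rough sketch of that workaround, assuming passwordless ssh between the nodes and the stock ctdb init script (exact status strings may differ between ctdb versions):

# stop ctdb on every node first
for n in $(cat /etc/ctdb/nodes); do
    ssh "$n" service ctdb stop
done

# bring the nodes back one at a time, waiting for each to report OK
# before moving on to the next
for n in $(cat /etc/ctdb/nodes); do
    ssh "$n" service ctdb start
    until ssh "$n" ctdb status | grep -q 'OK (THIS NODE)'; do
        sleep 5
    done
done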


Expected results:
I expect ctdb to behave the same way each time.
At no point should halting one node cause an issue on another node in the cluster.



Additional info:

Comment 2 Christopher R. Hertel 2012-10-09 15:11:02 UTC
Please confirm that the Samba version running is 3.5.10, and please provide the CTDB version number as well.  I have checked the attachment and was not able to find this information.

Is this test being run on a "real" cluster or on a set of virtual machines?  This may be a moot point but it would be good to test the behavior on real hardware just to be sure.

Comment 3 Paul Cuzner 2012-10-09 16:17:55 UTC
(In reply to comment #2)
> Please confirm that the Samba version running is 3.5.10, and please provide
> the CTDB version number as well.  I have checked the attachment and was not
> able to find this information.
> 
> Is this test being run on a "real" cluster or on a set of virtual machines? 
> This may be a moot point but it would be good to test the behavior on real
> hardware just to be sure.

Hi Chris,

These are the versions;
[root@rhs-1 ~]# rpm -qa | egrep '(samba|ctdb)'
ctdb-1.0.114.5-1.el6.x86_64
samba-winbind-clients-3.5.10-116.el6_2.x86_64
samba-common-3.5.10-116.el6_2.x86_64
samba-3.5.10-116.el6_2.x86_64

I think these are the same as at GA?

When I did these tests it was all VMs... Real hardware? Chance would be a fine thing ;o)

The boxes are subscribed to RHN, with only tzdata and dracut updates outstanding.

Let me know if you need anything else.

Cheers,

PC

Comment 4 Christopher R. Hertel 2012-10-10 01:17:00 UTC
The next step is to try to reproduce the problem, both on stock RHS 2.0 and with a newer version of Samba/CTDB.

It may be a few days before I can configure this for testing.

Comment 5 Christopher R. Hertel 2012-11-20 15:16:00 UTC
Please see bug 869724.

It is likely that this crash is due to Gluster providing an incorrect response to the low-level F_GETLK fcntl() call.  Once the patch is in place, please re-run the tests.
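
Not a substitute for re-testing with the patched build, but one way to exercise the fcntl() byte-range locking path on the shared Gluster storage independently of CTDB is the ping_pong test utility distributed with ctdb/Samba (it may live in a separate sub-package on some builds); a rough sketch, with the file path a placeholder for a location on the shared volume:

# run this concurrently on all 4 nodes against the same file on the shared
# Gluster-backed mount; with 4 nodes, use nodes+1 = 5 locks.
# A steady lock rate with no errors suggests coherent fcntl() locking.
ping_pong /mnt/glustervol/ping_pong.dat 5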

Comment 6 Christopher R. Hertel 2012-11-30 01:58:29 UTC
Assigning to QE for regression testing.
Possibly fixed by a GlusterFS patch.  See bug 869724.

Comment 7 Ujjwala 2013-05-07 07:31:25 UTC
Verified on build glusterfs 3.4.0.2rhs (built on May 2 2013 06:08:46).
I did not see the issue.

Comment 10 Scott Haines 2013-09-23 22:38:54 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html

Comment 11 Scott Haines 2013-09-23 22:41:31 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html