Bug 842634 - ctdb crash on unrelated node, when simulating repeated server loss
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterd
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assigned To: Christopher R. Hertel
QA Contact: Sudhir D
Depends On:
Blocks: 858431
Reported: 2012-07-24 06:00 EDT by Paul Cuzner
Modified: 2014-09-28 20:21 EDT (History)
CC: 9 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 858431
Environment:
Last Closed: 2013-09-23 18:38:54 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
test log covering the repeated node shutdown and resulting ctdb issue (10.13 KB, application/octet-stream)
2012-07-24 06:00 EDT, Paul Cuzner

Description Paul Cuzner 2012-07-24 06:00:09 EDT
Created attachment 599979 [details]
test log covering the repeated node shutdown and resulting ctdb issue

Description of problem:
I have a test environment of 4 RHS 2.0 nodes, subscribed to RHN and updated. ctdb has been configured across my nodes for 2 virtual IPs. ctdb setup and operation are fine under normal conditions, but when I performed 2 successive halts against node1, node2's ctdb crashed and would not rejoin the ctdb cluster. The resulting cluster appears to suffer from split brain (each node sees a different generation id).

Version-Release number of selected component (if applicable):
RHS v2.0

How reproducible:
I've repeated the tests 3 times, each time with the same result.

Steps to Reproduce:
1. create a 4 node cluster
2. configure ctdb for each of the nodes, providing 2 virtual IP addresses
3. Mount a volume using nfs on the virtual IP
4. confirm the virtual IP maps to node 1
5. drop node 1
6. cluster will recover successfully
7. restart node 1 
8. confirm cluster node status is all OK
9. halt node 1 again
10. node 2 is no longer able to communicate in the cluster instance
11. check generation ids on each of the resulting nodes for correctness
  
Actual results:
node 2 drops from the cluster. The workaround is to stop ctdb on each of the 4 nodes, then restart them one by one, only moving on to the next node when the node status is OK.


Expected results:
I expect ctdb to behave the same way each time.
At no point do I expect a node halt to cause an issue on another node in the cluster.



Additional info:
Comment 2 Christopher R. Hertel 2012-10-09 11:11:02 EDT
Please confirm that the Samba version running is 3.5.10, and please provide the CTDB version number as well.  I have checked the attachment and was not able to find this information.

Is this test being run on a "real" cluster or on a set of virtual machines?  This may be a moot point but it would be good to test the behavior on real hardware just to be sure.
Comment 3 Paul Cuzner 2012-10-09 12:17:55 EDT
(In reply to comment #2)
> Please confirm that the Samba version running is 3.5.10, and please provide
> the CTDB version number as well.  I have checked the attachment and was not
> able to find this information.
> 
> Is this test being run on a "real" cluster or on a set of virtual machines? 
> This may be a moot point but it would be good to test the behavior on real
> hardware just to be sure.

Hi Chris,

These are the versions;
[root@rhs-1 ~]# rpm -qa | egrep '(samba|ctdb)'
ctdb-1.0.114.5-1.el6.x86_64
samba-winbind-clients-3.5.10-116.el6_2.x86_64
samba-common-3.5.10-116.el6_2.x86_64
samba-3.5.10-116.el6_2.x86_64

I think these are the same as at GA?

When I did these tests it was all VMs... Real hardware? Chance would be a fine thing ;o)

The boxes are subscribed to RHN, with only patches for tzdata and dracut outstanding.

Let me know if you need anything else.

Cheers,

PC
Comment 4 Christopher R. Hertel 2012-10-09 21:17:00 EDT
Next step is to try and reproduce the problem, both on RHS2.0 stock and using a newer version of Samba/CTDB.

It may be a few days before I can configure this for testing.
Comment 5 Christopher R. Hertel 2012-11-20 10:16:00 EST
Please see bug 869724.

It is likely that this crash is due to Gluster providing an incorrect response to the low-level F_GETLK fcntl() call.  Once the patch is in place, please re-run the tests.
Comment 6 Christopher R. Hertel 2012-11-29 20:58:29 EST
Assigning to QE for regression testing.
Possibly fixed by a GlusterFS patch.  See bug 869724.
Comment 7 Ujjwala 2013-05-07 03:31:25 EDT
Verified on build glusterfs 3.4.0.2rhs (built on May 2 2013 06:08:46).
I did not see the issue.
Comment 10 Scott Haines 2013-09-23 18:38:54 EDT
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html
