Bug 825180 - ctdbd on a node crashes when another node in the cluster is brought down.
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: samba
Version: 2.0
Hardware: x86_64  OS: Linux
Priority: medium  Severity: high
Assigned To: Christopher R. Hertel
QA Contact: Sudhir D
Depends On: 821715
Blocks: 956495
Reported: 2012-05-25 05:57 EDT by Sumit Bose
Modified: 2014-09-28 20:21 EDT (History)
CC List: 9 users

Fixed In Version: ctdb-1.0.114.5-1.el6
Doc Type: Bug Fix
Clone Of: 821715
Last Closed: 2013-09-23 18:32:16 EDT
Type: Bug
Attachments: None

Description Sumit Bose 2012-05-25 05:57:05 EDT
+++ This bug was initially created as a clone of Bug #821715 +++

Description of problem:
In a CTDB cluster of 4 nodes serving 3 public addresses, one of the nodes is brought down. The ctdbd process on one of the other nodes then crashes.
Glusterfs* is used as the shared filesystem hosting the lock file for CTDB.
The "nodes" file (/etc/ctdb/nodes) is also placed on the shared filesystem; each node has its own /etc/ctdb/public_addresses.

Version-Release number of selected component (if applicable):
CTDB version: 1.0.114.3-3.el6

How reproducible:
always

Steps to Reproduce:
1. Build a CTDB cluster of four nodes.
2. Reboot one of the nodes in the cluster.
3. Observe that ctdbd crashes on one of the remaining nodes (see the status-check sketch below).
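
A rough sketch of how to observe the cluster state around steps 2 and 3, using standard ctdb tools (nothing here is specific to this setup):

# Run on any surviving node, before and after rebooting the victim node:
ctdb status        # lists every node with its flags (OK, DISCONNECTED, BANNED, ...)
ctdb ip            # shows which node currently hosts each public address

# On each remaining node, check whether ctdbd survived:
service ctdb status
pgrep ctdbd || echo "ctdbd is not running on this node"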
  
Actual results:
ctdbd crashes with signal 6.
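
One way to confirm the signal 6 (SIGABRT) crash; the log path below assumes the default location and should be adjusted if CTDB_LOGFILE points elsewhere:

grep -iE 'abort|signal 6' /var/log/log.ctdb   # search the ctdb log for the abort
ls /var/spool/abrt/ 2>/dev/null               # if abrt is installed, the core dump lands here
ls core.* 2>/dev/null                         # or a plain core file, if core dumps are enabled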

Expected results:
ctdbd should not crash.

Additional info:
*Glusterfs is a network filesystem.
Comment 2 Ujjwala 2012-12-26 05:10:49 EST
Verified on the glusterfs 3.4.0qa5 build (built on Dec 17 2012); it works fine.
Comment 3 Ujjwala 2012-12-28 05:45:21 EST
The behavior is inconsistent. The following behaviors were observed:

In a 4-node cluster, when one of the nodes is powered off:
> Sometimes everything works fine.
> One of the nodes goes into a banned state and stays banned forever.
The only way to recover is to restart the ctdb service (happens 5 out of 10 times); see the recovery sketch below.
> On one of the nodes, ctdb crashes (happened only once).
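
A quick sketch of the restart-based recovery mentioned in the second bullet, using standard ctdb commands (the flags shown in the output will vary):

# On the node that is stuck in the banned state:
ctdb status              # the affected node is listed with the BANNED flag
service ctdb restart     # restart ctdbd so the node can rejoin the cluster
ctdb status              # verify the node has returned to OK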

The sosreports of the node that went into the banned state and of the node where ctdb crashed are available at:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/825180/

Tested on the glusterfs 3.4.0qa5 build (built on Dec 17).

(In reply to comment #2)
> verified on the build glusterfs 3.4.0qa5 built on Dec 17 2012 and works fine.
Comment 4 Christopher R. Hertel 2013-07-18 03:06:28 EDT
Please re-test.

Gluster byte-range locking (used by CTDB when negotiating recovery) has been patched, and we have a new version of CTDB with several bugfixes applied.

Note, however, that if the cluster is not running in replicated mode, and if the one node that is shut down is also the one that "owns" the recovery lock file, then CTDB will fail to recover because the lock file is missing.  CTDB needs to have the lock file available in order to perform recovery.
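
For context, a hedged sketch of where the recovery lock is normally configured on these builds; /gluster/lock is a made-up mount point, and the real path is whatever shared Gluster volume mount the cluster uses:

# /etc/sysconfig/ctdb (excerpt)
# The recovery lock file must stay reachable from the surviving nodes.
# On a non-replicated volume, the file can vanish together with the brick
# of the node that was shut down, and recovery will then fail as described.
CTDB_RECOVERY_LOCK=/gluster/lock/lockfile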

Also, this bug was created as a clone of bug 821715, which has been closed.
Comment 6 surabhi 2013-07-25 08:32:45 EDT
Verified with the latest version:
glusterfs 3.4.0.12rhs.beta6

[root@dhcp159-0 ~]# rpm -qa | grep samba
samba-3.6.9-155.5.el6rhs.x86_64
samba-domainjoin-gui-3.6.9-155.5.el6rhs.x86_64
samba-winbind-clients-3.6.9-155.5.el6rhs.x86_64
samba-common-3.6.9-155.5.el6rhs.x86_64
samba-doc-3.6.9-155.5.el6rhs.x86_64
samba-glusterfs-3.6.9-155.5.el6rhs.x86_64
samba-winbind-3.6.9-155.5.el6rhs.x86_64
samba-winbind-devel-3.6.9-155.5.el6rhs.x86_64
samba-debuginfo-3.6.9-155.5.el6rhs.x86_64
samba-swat-3.6.9-155.5.el6rhs.x86_64
samba-winbind-krb5-locator-3.6.9-155.5.el6rhs.x86_64
Comment 8 Scott Haines 2013-09-23 18:32:16 EDT
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html
