Red Hat Bugzilla – Bug 825180
ctdbd on a node crashes when another node in the cluster is brought down.
Last modified: 2014-09-28 20:21:47 EDT
+++ This bug was initially created as a clone of Bug #821715 +++
Description of problem:
In a CTDB cluster of 4 nodes serving 3 public addresses, one of the nodes was brought down. The ctdbd process(es) on one of the other nodes then crashed.
Glusterfs* is being used as the shared filesystem hosting the lockfile for ctdb.
The "nodes" file (/etc/ctdb/nodes) is placed in the shared filesystem as well. Each node has its own /etc/ctdb/public_addresses.
Version-Release number of selected component (if applicable):
CTDB version: 126.96.36.199-3.el6
Steps to Reproduce:
1. Build a CTDB cluster of 4 nodes.
2. Reboot one of the nodes in the cluster.
3. Check the remaining machines; on one of them, ctdbd has crashed.
Actual results:
ctdbd crashes with signal 6 (SIGABRT).
Expected results:
ctdbd should not crash.
*Glusterfs is a network filesystem.
verified on the build glusterfs 3.4.0qa5 built on Dec 17 2012 and works fine.
The behavior is inconsistent. The following behaviors were observed:
In a 4-node cluster, when one of the nodes is powered off:
> Sometimes it works fine.
> One of the nodes goes into a banned state and stays banned indefinitely.
The only way to recover is to restart the ctdb service (happens 5 out of 10 times).
> On one of the nodes, ctdb crashes (happened only once).
The sosreports of the node that goes into the banned state and of the node where ctdb crashed are available at:
Tested on the build - glusterfs 3.4.0qa5 built on Dec 17
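The banned state reported in this comment can be spotted without reading logs by filtering the output of `ctdb status`. The following is a minimal sketch and not part of the original report: the function name `banned_pnns` is hypothetical, and it assumes per-node lines of the form `pnn:<N> <address> <FLAGS>` as shown in the ctdb(1) manual page. The exact format may differ between CTDB versions.

```shell
#!/bin/sh
# banned_pnns: read `ctdb status`-style text on stdin and print the PNN of
# every node whose flags field contains BANNED, one PNN per line.
# Assumed line format (taken from ctdb(1); may vary by version):
#   pnn:1 10.0.0.2     BANNED
banned_pnns() {
    awk '
        /^pnn:[0-9]+/ {
            # Flags may be combined with "|", e.g. DISCONNECTED|UNHEALTHY.
            n = split($3, flags, "|")
            for (i = 1; i <= n; i++) {
                if (flags[i] == "BANNED") {
                    print substr($1, 5)   # strip the leading "pnn:"
                    break
                }
            }
        }
    '
}

# Typical use on a live node (requires a running ctdbd):
#   ctdb status | banned_pnns
```

An empty result means no node is currently reported as banned. It says nothing about why a node was banned or whether it will recover on its own.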
(In reply to comment #2)
> verified on the build glusterfs 3.4.0qa5 built on Dec 17 2012 and works fine.
Gluster byte-range locking (used by CTDB when negotiating recovery) has been patched, and we have a new version of CTDB with several bugfixes applied.
Note, however, that if the cluster is not running in replicated mode, and if the one node that is shut down is also the one that "owns" the recovery lock file, then CTDB will fail to recover because the lock file is missing. CTDB needs the lock file to be available in order to perform recovery.
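The precondition described above can be checked from a surviving node before or after another node is taken down. The following is a rough sketch under stated assumptions: the function name `reclock_available` and the example path are hypothetical, and the check only confirms that the lock file is still visible and readable from this node. It does not exercise the byte-range lock that CTDB itself takes on the file.

```shell
#!/bin/sh
# reclock_available: succeed (exit status 0) if the recovery lock file given
# as $1 exists as a regular file and is readable from this node.
# This is only a reachability check; it does not take any lock.
reclock_available() {
    reclock=$1
    if [ -f "$reclock" ] && [ -r "$reclock" ]; then
        echo "recovery lock file reachable: $reclock"
        return 0
    fi
    echo "recovery lock file missing or unreadable: $reclock" >&2
    return 1
}

# Example (the path is a placeholder for the cluster's configured lock file):
#   reclock_available /mnt/shared/ctdb/lockfile || echo "recovery would fail"
```

If the check fails on the surviving nodes after one node goes down, the volume is probably not replicating the lock file, which matches the failure mode described in this comment.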
Also, this bug was created as a clone of bug 821715, which has been closed.
Verified with the latest version:
[root@dhcp159-0 ~]# rpm -qa | grep samba
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.