Bug 821715 - ctdbd on a node crashes when another node in the cluster is brought down

Product:           Red Hat Enterprise Linux 6
Component:         ctdb
Version:           6.2
Hardware:          x86_64
OS:                Linux
Status:            CLOSED ERRATA
Severity:          high
Priority:          unspecified
Reporter:          Rachana Patel <racpatel>
Assignee:          Sumit Bose <sbose>
QA Contact:        amainkar
CC:                adas, jpayne, kparthas
Target Milestone:  rc
Type:              Bug
Doc Type:          Bug Fix
Fixed In Version:  ctdb-1.0.114.5-2.el6
Clones:            825180
Bug Blocks:        825180
Last Closed:       2013-02-21 08:44:10 UTC
Attachments:       584648 - ctdb logs and gdb "bt full" output
Ctdb config information (missed out in the description):

[Shared information, present on a Glusterfs mount]

[root@QA-42 ~]# cat /gluster/lock/nodes
172.17.251.81
172.17.251.82
172.17.251.83
172.17.251.84

[root@QA-42 ~]# cat /gluster/lock/ctdb
CTDB_RECOVERY_LOCK=/gluster/lock/lockfile
#CIFS only
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_MANAGES_SAMBA=yes
#CIFS only
CTDB_NODES=/etc/ctdb/nodes

---------------------------------
[Node-specific config info]

[root@QA-42 ~]# cat /etc/ctdb/public_addresses
172.17.251.241/24 eth4
172.17.251.242/24 eth4
172.17.251.243/24 eth4

We have not encountered the same issue with the build suggested by Sumit
(https://brewweb.devel.redhat.com/taskinfo?taskID=4408685). From that link we
installed the following RPMs:
ctdb-debuginfo-1.2.39-1.el6.x86_64
ctdb-devel-1.2.39-1.el6.x86_64
ctdb-1.2.39-1.el6.x86_64

This request was not resolved in time for the current release. Red Hat invites
you to ask your support representative to propose this request, if still
desired, for consideration in the next release of Red Hat Enterprise Linux.

FWIW, I tried with my 5-node GFS2 cluster but could not reproduce the failure.
This is the ctdb version on my nodes: ctdb-1.0.114.3-4.el6.x86_64. Note: I
don't have Samba running, just cman, clvmd, gfs2 (for the ctdb lockfile), and
ctdb. Killing one or two nodes doesn't kill ctdbd on the other nodes, and
membership changes are as expected when the nodes go down. Rachana/Krishnan,
is there anything different you're doing?

Abhijith, the differences are:
- Glusterfs is the shared filesystem.
- We "share" the ctdb lockfile, the "nodes" file, and the sysconfig/ctdb file.
- We have ctdb starting Samba.

Please test the packages from
https://brewweb.devel.redhat.com/taskinfo?taskID=4437472 and see whether they
work for you. Thank you.

Verified in ctdb-1.0.114.5-3.

========== Before update ===============
[root@dash-03 ~]# rpm -q ctdb
ctdb-1.0.114.3-3.el6.x86_64
[root@dash-03 ~]# service ctdb status
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168    HEALTHY
pnn:1 10.15.89.169    HEALTHY
pnn:2 10.15.89.170    HEALTHY (THIS NODE)
Generation:185497490
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
[root@dash-03 ~]# ctdb ping -n all
response from 0 time=0.000367 sec (1 clients)
response from 1 time=0.000525 sec (1 clients)
response from 2 time=0.000104 sec (2 clients)
[root@dash-01 ~]# rpm -q ctdb
ctdb-1.0.114.3-3.el6.x86_64
[root@dash-01 ~]# service ctdb status
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168    HEALTHY (THIS NODE)
pnn:1 10.15.89.169    HEALTHY
pnn:2 10.15.89.170    HEALTHY
Generation:185497490
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
[root@dash-01 ~]# ctdb ping -n all
response from 0 time=0.000086 sec (2 clients)
response from 1 time=0.000510 sec (1 clients)
response from 2 time=0.000371 sec (1 clients)
[root@dash-03 ~]# reboot && exit

Broadcast message from root@dash-03
    (/dev/pts/0) at 9:59 ...

The system is going down for reboot NOW!
logout
Connection to dash-03.lab.msp.redhat.com closed.
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i service ctdb status; done
ctdb dead but pid file exists
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168    DISCONNECTED|UNHEALTHY|INACTIVE
pnn:1 10.15.89.169    HEALTHY (THIS NODE)
pnn:2 10.15.89.170    DISCONNECTED|UNHEALTHY|INACTIVE
Generation:488801060
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:0
Checking for ctdbd service:
ctdbd not running.
ctdb is stopped
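(For reference, the per-node sweep above can be scripted. A minimal sketch,
assuming passwordless ssh in place of qarsh and the dash-0N hostnames from
this transcript:)

  # Sketch only: check ctdbd on every cluster node after the reboot test.
  # Assumes passwordless ssh; the verification above used qarsh instead.
  for i in 1 2 3; do
      echo "=== dash-0$i ==="
      ssh "root@dash-0$i" 'service ctdb status' 2>&1 || echo "dash-0$i unreachable"
  done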
========== After update ===============
[root@dash-03 ~]# rpm -q ctdb
ctdb-1.0.114.5-3.el6.x86_64
[root@dash-03 ~]# service ctdb status
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168    HEALTHY
pnn:1 10.15.89.169    HEALTHY
pnn:2 10.15.89.170    HEALTHY (THIS NODE)
Generation:449208127
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
[root@dash-03 ~]# ctdb ping -n all
response from 0 time=0.000389 sec (1 clients)
response from 1 time=0.000474 sec (3 clients)
response from 2 time=0.000113 sec (2 clients)
[root@dash-03 ~]# reboot && exit
[root@dash-01 ~]# rpm -q ctdb
ctdb-1.0.114.5-3.el6.x86_64
[root@dash-01 ~]# service ctdb status
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168    HEALTHY (THIS NODE)
pnn:1 10.15.89.169    HEALTHY
pnn:2 10.15.89.170    DISCONNECTED|UNHEALTHY|INACTIVE
Generation:1754606152
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:0

Since the problem described in this bug report should be resolved in a recent
advisory, it has been closed with a resolution of ERRATA. For information on
the advisory, and where to find the updated files, follow the link below. If
the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0337.html
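(When picking up the errata, a quick way to confirm the fixed build is
actually running before retesting; a sketch, assuming only the versions and
commands already shown in this report:)

  rpm -q ctdb        # expect ctdb-1.0.114.5-2.el6 or later (see "Fixed In Version")
  ctdb ping -n all   # every configured node should answer
  ctdb status        # all nodes should report HEALTHY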
Created attachment 584648 [details]
Contains ctdb logs and gdb "bt full" output (where applicable).

Description of problem:
In a CTDB cluster of 4 nodes serving 3 public addresses, one of the nodes is
brought down, and the ctdbd process on one of the other nodes crashes.
Glusterfs* is used as the shared filesystem hosting the lockfile for ctdb.
The "nodes" file (/etc/ctdb/nodes) is placed on the shared filesystem as
well. Each node has its own /etc/ctdb/public_addresses.

Version-Release number of selected component (if applicable):
CTDB version: 1.0.114.3-3.el6

How reproducible:
Always

Steps to Reproduce:
1. Build a ctdb cluster of size 4.
2. Reboot one of the nodes in the cluster.
3. On one of the remaining machines, observe that ctdbd has crashed. (See the
   scripted sketch after this report.)

Actual results:
ctdbd crashes with signal 6.

Expected results:
ctdbd should not crash.

Additional info:
*Glusterfs is a network filesystem.
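(A scripted sketch of the reproduction steps above. node1 through node4 are
hypothetical hostnames, and ssh stands in for whatever remote shell is
available; the numbers match the steps in the description:)

  ctdb status                # step 1: confirm all four nodes report HEALTHY
  ssh root@node4 reboot      # step 2: bring one member down
  sleep 30                   # allow cluster membership to settle
  # step 3: with the affected build, ctdbd on one survivor aborts (signal 6)
  for n in node1 node2 node3; do
      ssh "root@$n" 'pidof ctdbd >/dev/null && echo "$(hostname): ctdbd running" || echo "$(hostname): ctdbd NOT running"'
  done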