Description of problem:
We have 8 servers in this Gluster cluster, one brick per server with every two bricks forming a replica pair. When glusterd on 172.16.161.5 starts, regardless of whether cluster.self-heal-daemon is on or off, the other servers hang at df -h for the mounted Gluster volume. But when all the gluster processes on 172.16.161.5 are killed, the whole volume is accessible again. Quite a lot of zombie processes also exist on that server:

#ps aux | grep Z | wc -l
641
#ps aux | grep Z | head
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       301  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       327  0.0  0.0      0     0 ?        Z    08:45   0:00 [sh] <defunct>
root       350  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       431  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       478  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       524  0.0  0.0      0     0 ?        Z    08:45   0:00 [sh] <defunct>
root       526  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       573  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       663  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>

Version-Release number of selected component (if applicable):
#gluster --version
glusterfs 3.4.2 built on Nov 6 2014 14:14:26
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

How reproducible:

Steps to Reproduce:
1. Create a zpool using raidz and mount it at /mnt/zpool, then: zfs create zpool/zfs; zfs set xattr=sa zpool/zfs
2. Stop cluster.self-heal-daemon on a normal node
3. Recover the volume ID (grep volume-id /var/lib/glusterd/vols/storage_1/info | cut -d= -f2 | sed 's/-//g') and stamp it onto the brick: setfattr -n trusted.glusterfs.volume-id -v 0x3587ec7fa7574b8b8f02244c5eddf16c /mnt/zpool/zfs (see the sketch below)
4. Start glusterd: /etc/init.d/glusterd start
5. Start cluster.self-heal-daemon
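Not part of the original report: a consolidated shell sketch of steps 1-3. The volume name, dataset, and xattr value come from the report above; the device names (/dev/sdb, /dev/sdc, /dev/sdd) are assumptions for illustration.

# Assumption: three spare disks for the raidz vdev; adjust to the real hardware.
zpool create -m /mnt/zpool zpool raidz /dev/sdb /dev/sdc /dev/sdd
zfs create zpool/zfs
zfs set xattr=sa zpool/zfs

# Read the volume ID from glusterd's info file, strip the dashes, and stamp it
# onto the new brick directory so glusterd accepts it as a replacement brick.
VOLID=$(grep volume-id /var/lib/glusterd/vols/storage_1/info | cut -d= -f2 | sed 's/-//g')
setfattr -n trusted.glusterfs.volume-id -v 0x${VOLID} /mnt/zpool/zfs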
Actual results:
glusterd crashed:

#gluster volume heal storage_1 info
Connection failed. Please check if gluster daemon is operational.

#gluster volume status
Status of volume: storage_1
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 172.16.161.10:/mnt/zpool/zfs                      49152   Y       31628
Brick 172.16.161.3:/mnt/zpool/zfs                       49152   Y       689
Brick 172.16.161.4:/mnt/zpool/zfs                       49153   Y       29349
Brick 172.16.161.5:/mnt/zpool/zfs                       49154   Y       17987
Brick 172.16.161.6:/mnt/zpool/zfs                       49152   Y       13826
Brick 172.16.161.7:/mnt/zpool/zfs                       49152   Y       28246
Brick 172.16.161.8:/mnt/zpool/zfs                       49152   Y       21390
Brick 172.16.161.9:/mnt/zpool/zfs                       49152   Y       24121
NFS Server on localhost                                 2049    Y       24470
Self-heal Daemon on localhost                           N/A     Y       24477
NFS Server on 172.16.161.4                              2049    Y       6262
Self-heal Daemon on 172.16.161.4                        N/A     Y       6270
NFS Server on 172.16.161.3                              2049    Y       21079
Self-heal Daemon on 172.16.161.3                        N/A     Y       21086
NFS Server on 172.16.161.8                              2049    Y       32357
Self-heal Daemon on 172.16.161.8                        N/A     Y       32390
NFS Server on 172.16.161.10                             2049    Y       8899
Self-heal Daemon on 172.16.161.10                       N/A     Y       8915
NFS Server on 172.16.161.7                              2049    Y       5978
Self-heal Daemon on 172.16.161.7                        N/A     Y       5985
NFS Server on 172.16.161.9                              2049    Y       1727
Self-heal Daemon on 172.16.161.9                        N/A     Y       1734
NFS Server on 172.16.161.5                              2049    Y       12371
Self-heal Daemon on 172.16.161.5                        N/A     Y       12375

#gluster volume info
Volume Name: storage_1
Type: Distributed-Replicate
Volume ID: 3587ec7f-a757-4b8b-8f02-244c5eddf16c
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 172.16.161.10:/mnt/zpool/zfs
Brick2: 172.16.161.3:/mnt/zpool/zfs
Brick3: 172.16.161.4:/mnt/zpool/zfs
Brick4: 172.16.161.5:/mnt/zpool/zfs
Brick5: 172.16.161.6:/mnt/zpool/zfs
Brick6: 172.16.161.7:/mnt/zpool/zfs
Brick7: 172.16.161.8:/mnt/zpool/zfs
Brick8: 172.16.161.9:/mnt/zpool/zfs
Options Reconfigured:
cluster.self-heal-daemon: on
performance.flush-behind: off
cluster.min-free-disk: 50GB
nfs.port: 2049

#glustershd.log
[2015-09-14 14:25:16.727491] I [client-handshake.c:1659:select_server_supported_programs] 0-storage_1-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727550] I [client-handshake.c:1659:select_server_supported_programs] 0-storage_1-client-7: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727602] I [client-handshake.c:1659:select_server_supported_programs] 0-storage_1-client-4: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727669] I [client-handshake.c:1659:select_server_supported_programs] 0-storage_1-client-5: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727729] I [client-handshake.c:1659:select_server_supported_programs] 0-storage_1-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727798] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-0: Connected to 172.16.161.10:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.727814] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.727880] I [afr-common.c:3698:afr_notify] 0-storage_1-replicate-0: Subvolume 'storage_1-client-0' came back up; going online.
[2015-09-14 14:25:16.728293] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-7: Connected to 172.16.161.9:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728313] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-7: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.728363] I [afr-common.c:3698:afr_notify] 0-storage_1-replicate-3: Subvolume 'storage_1-client-7' came back up; going online.
[2015-09-14 14:25:16.728432] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-4: Connected to 172.16.161.6:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728449] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-4: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.728494] I [afr-common.c:3698:afr_notify] 0-storage_1-replicate-2: Subvolume 'storage_1-client-4' came back up; going online.
[2015-09-14 14:25:16.728561] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-5: Connected to 172.16.161.7:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728590] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-5: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.728706] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-6: Connected to 172.16.161.8:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728732] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-6: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.728828] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-0: Server lk version = 1
[2015-09-14 14:25:16.728862] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-3: Connected to 172.16.161.5:49154, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728879] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.728931] I [afr-common.c:3698:afr_notify] 0-storage_1-replicate-1: Subvolume 'storage_1-client-3' came back up; going online.
[2015-09-14 14:25:16.728990] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-2: Connected to 172.16.161.4:49153, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.729005] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.729092] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-1: Connected to 172.16.161.3:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.729108] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.729191] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-7: Server lk version = 1
[2015-09-14 14:25:16.729216] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-4: Server lk version = 1
[2015-09-14 14:25:16.729235] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-5: Server lk version = 1
[2015-09-14 14:25:16.729254] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-6: Server lk version = 1
[2015-09-14 14:25:16.729281] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-3: Server lk version = 1
[2015-09-14 14:25:16.729362] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-2: Server lk version = 1
[2015-09-14 14:25:16.729390] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-1: Server lk version = 1
[2015-09-14 14:25:16.900936] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl] 0-storage_1-replicate-1: Another crawl is in progress for storage_1-client-3
[2015-09-14 14:25:17.095066] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl] 0-storage_1-replicate-1: Another crawl is in progress for storage_1-client-3
[2015-09-14 14:27:35.767127] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:27:35.767175] W [socket.c:1962:__socket_proto_state_machine] 0-glusterfs: reading from socket failed. Error (No data available), peer (127.0.0.1:24007)
[2015-09-14 14:27:45.815724] E [socket.c:2157:socket_connect_finish] 0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
[2015-09-14 14:27:45.815785] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:27:48.831108] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:27:51.835174] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:27:54.845877] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:27:57.854196] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:00.869561] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:03.877629] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:06.893191] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:09.899443] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:12.911237] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:15.916225] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:18.928260] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:21.934446] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:24.948352] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:27.954530] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:30.969954] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:33.976168] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:36.986261] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:39.992428] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:43.003749] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:46.009975] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:49.019259] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:52.025489] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:55.037455] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:58.045264] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:01.055652] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:04.068772] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:07.081162] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:10.085722] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:13.096022] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:16.102167] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:19.113583] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:22.119886] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:25.134581] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:28.138851] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:29.138927] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-storage_1-client-3: server 172.16.161.5:49154 has not responded in the last 42 seconds, disconnecting.
[2015-09-14 14:29:29.143105] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:29.144516] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x13d) [0x3d5ca0ea5d] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3d5ca0e5c3] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3d5ca0e4de]))) 0-storage_1-client-3: forced unwinding frame type(GlusterFS 3.3) op(XATTROP(33)) called at 2015-09-14 14:25:17.573714 (xid=0x18x)
[2015-09-14 14:29:29.144544] W [client-rpc-fops.c:1755:client3_3_xattrop_cbk] 0-storage_1-client-3: remote operation failed: Success. Path: (null) (--)
[2015-09-14 14:29:29.154316] I [socket.c:3027:socket_submit_request] 0-storage_1-client-3: not connected (priv->connected = 0)
[2015-09-14 14:29:29.154343] W [rpc-clnt.c:1488:rpc_clnt_submit] 0-storage_1-client-3: failed to submit rpc-request (XID: 0x24x Program: GlusterFS 3.3, ProgVers: 330, Proc: 29) to rpc-transport (storage_1-client-3)
[2015-09-14 14:29:29.154362] W [client-rpc-fops.c:1538:client3_3_inodelk_cbk] 0-storage_1-client-3: remote operation failed: Transport endpoint is not connected
[2015-09-14 14:29:29.154407] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x13d) [0x3d5ca0ea5d] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3d5ca0e5c3] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3d5ca0e4de]))) 0-storage_1-client-3: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2015-09-14 14:28:47.010089 (xid=0x23x)
[2015-09-14 14:29:29.154418] W [client-handshake.c:276:client_ping_cbk] 0-storage_1-client-3: timer must have expired
[2015-09-14 14:29:29.154433] I [client.c:2097:client_rpc_notify] 0-storage_1-client-3: disconnected
[2015-09-14 14:29:29.154478] E [socket.c:2157:socket_connect_finish] 0-storage_1-client-3: connection to 172.16.161.5:24007 failed (Connection refused)
[2015-09-14 14:29:29.154499] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:29.154572] W [client-rpc-fops.c:1640:client3_3_entrylk_cbk] 0-storage_1-client-3: remote operation failed: Transport endpoint is not connected
[2015-09-14 14:29:29.155101] E [afr-self-heal-entry.c:2296:afr_sh_post_nonblocking_entry_cbk] 0-storage_1-replicate-1: Non Blocking entrylks failed for <gfid:d529ffe7-48c7-4b6d-b9d3-a645fc18b180>.
[2015-09-14 14:29:29.155289] W [client-rpc-fops.c:1112:client3_3_getxattr_cbk] 0-storage_1-client-3: remote operation failed: Transport endpoint is not connected. Path: <gfid:d529ffe7-48c7-4b6d-b9d3-a645fc18b180> (00000000-0000-0000-0000-000000000000). Key: glusterfs.gfid2path
[2015-09-14 14:29:29.155383] W [client-rpc-fops.c:2265:client3_3_readdir_cbk] 0-storage_1-client-3: remote operation failed: Transport endpoint is not connected remote_fd = -2
[2015-09-14 14:29:31.154245] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:34.163264] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:37.172062] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:39.176124] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:40.182083] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:42.186819] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:43.198223] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:45.204365] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:46.210179] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:48.215011] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:49.225163] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)

Expected results:
Gluster starts self-heal at full speed.

Additional info:
It seems you are using a version (glusterfs 3.4.2) that we no longer update. Could you try a more recent version? You gave this bug report the subject "glusterfsd crash". Do you have segmentation faults of some kind? The log you posted contains no indication that anything crashed. Please include the logs of all gluster processes once you can reproduce this on a more current version.
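One way to gather the requested logs on each node (a sketch, assuming the stock log location /var/log/glusterfs; adjust if the logs were relocated):

# Archive every gluster log on this node for attachment to the bug.
tar czf /tmp/gluster-logs-$(hostname).tar.gz /var/log/glusterfs

# Optionally capture the runtime state of the volume's processes as well;
# statedump files are written to the directory set by server.statedump-path.
gluster volume statedump storage_1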
GlusterFS 3.4.x has reached end-of-life. If this bug still exists in a later release, please reopen this bug and change the version, or open a new one.