Bug 1263042 - glusterfsd crash [NEEDINFO]
glusterfsd crash
Status: CLOSED EOL
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.4.2
Hardware: x86_64 Linux
Priority: unspecified   Severity: urgent
Assigned To: Pranith Kumar K
: Triaged
Depends On:
Blocks:
Reported: 2015-09-14 22:29 EDT by kelvin0431
Modified: 2015-10-07 09:50 EDT (History)
3 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-10-07 09:49:43 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
ndevos: needinfo? (sunkai0431)


Attachments

  None
Description kelvin0431 2015-09-14 22:29:36 EDT
Description of problem:

We have 8 servers in this Gluster cluster, every two forming a replica pair. When glusterd on 172.16.161.5 starts, no matter whether cluster.self-heal-daemon is on or off, the other servers that mount this volume hang on df -h. But when all the gluster processes on 172.16.161.5 are killed, the whole volume becomes accessible again. Quite a lot of zombie processes also exist on that server:

#ps aux | grep Z | wc -l
641

#ps aux | grep Z | head
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       301  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       327  0.0  0.0      0     0 ?        Z    08:45   0:00 [sh] <defunct>
root       350  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       431  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       478  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       524  0.0  0.0      0     0 ?        Z    08:45   0:00 [sh] <defunct>
root       526  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       573  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       663  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
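(As a side note on the counting method: ps aux | grep Z matches any line containing a capital Z anywhere, including the grep itself and unrelated command names. A sketch of a stricter count that filters only on the STAT column, shown against a captured sample:)

```shell
# Count zombie (defunct) processes by the STAT column only, instead of
# grepping the whole ps line, which can match unrelated output.
ps -eo stat= | awk '$1 ~ /^Z/' | wc -l

# The same filter demonstrated on sample STAT values:
printf 'Z\nSs\nZ+\nR\n' | awk '$1 ~ /^Z/' | wc -l   # -> 2
```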


Version-Release number of selected component (if applicable):

#gluster --version
glusterfs 3.4.2 built on Nov  6 2014 14:14:26
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

How reproducible:


Steps to Reproduce:
1. Create a raidz zpool mounted at /mnt/zpool, then: zfs create zpool/zfs; zfs set xattr=sa zpool/zfs
2. Turn cluster.self-heal-daemon off on a normal node
3. Get the hex volume-id: grep volume-id /var/lib/glusterd/vols/storage_1/info | cut -d= -f2 | sed 's/-//g'
   Then stamp it on the brick: setfattr -n trusted.glusterfs.volume-id -v 0x3587ec7fa7574b8b8f02244c5eddf16c /mnt/zpool/zfs
4. Start glusterd: /etc/init.d/glusterd start
5. Turn cluster.self-heal-daemon back on
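(Step 3 above can be sketched as one small script. The transform is shown here against a sample line; on a real node the input would come from the vol info file, and the setfattr is left commented out since it modifies brick xattrs:)

```shell
# Sketch of step 3: strip the dashes from the volume-id and set it as
# the trusted.glusterfs.volume-id xattr on the new brick path.
sample='volume-id=3587ec7f-a757-4b8b-8f02-244c5eddf16c'   # sample info-file line
VOLID=$(printf '%s\n' "$sample" | cut -d= -f2 | sed 's/-//g')
echo "$VOLID"   # 3587ec7fa7574b8b8f02244c5eddf16c
# On a real node (destructive, shown commented out):
# setfattr -n trusted.glusterfs.volume-id -v "0x${VOLID}" /mnt/zpool/zfs
```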

Actual results:
glusterd crashed

#gluster volume heal  storage_1 info
Connection failed. Please check if gluster daemon is operational.

#gluster volume status
Status of volume: storage_1
Gluster process                     Port    Online  Pid
------------------------------------------------------------------------------
Brick 172.16.161.10:/mnt/zpool/zfs          49152   Y   31628
Brick 172.16.161.3:/mnt/zpool/zfs           49152   Y   689
Brick 172.16.161.4:/mnt/zpool/zfs           49153   Y   29349
Brick 172.16.161.5:/mnt/zpool/zfs           49154   Y   17987
Brick 172.16.161.6:/mnt/zpool/zfs           49152   Y   13826
Brick 172.16.161.7:/mnt/zpool/zfs           49152   Y   28246
Brick 172.16.161.8:/mnt/zpool/zfs           49152   Y   21390
Brick 172.16.161.9:/mnt/zpool/zfs           49152   Y   24121
NFS Server on localhost                 2049    Y   24470
Self-heal Daemon on localhost               N/A Y   24477
NFS Server on 172.16.161.4              2049    Y   6262
Self-heal Daemon on 172.16.161.4            N/A Y   6270
NFS Server on 172.16.161.3              2049    Y   21079
Self-heal Daemon on 172.16.161.3            N/A Y   21086
NFS Server on 172.16.161.8              2049    Y   32357
Self-heal Daemon on 172.16.161.8            N/A Y   32390
NFS Server on 172.16.161.10             2049    Y   8899
Self-heal Daemon on 172.16.161.10           N/A Y   8915
NFS Server on 172.16.161.7              2049    Y   5978
Self-heal Daemon on 172.16.161.7            N/A Y   5985
NFS Server on 172.16.161.9              2049    Y   1727
Self-heal Daemon on 172.16.161.9            N/A Y   1734
NFS Server on 172.16.161.5              2049    Y   12371
Self-heal Daemon on 172.16.161.5            N/A Y   12375

#gluster volume info

Volume Name: storage_1
Type: Distributed-Replicate
Volume ID: 3587ec7f-a757-4b8b-8f02-244c5eddf16c
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 172.16.161.10:/mnt/zpool/zfs
Brick2: 172.16.161.3:/mnt/zpool/zfs
Brick3: 172.16.161.4:/mnt/zpool/zfs
Brick4: 172.16.161.5:/mnt/zpool/zfs
Brick5: 172.16.161.6:/mnt/zpool/zfs
Brick6: 172.16.161.7:/mnt/zpool/zfs
Brick7: 172.16.161.8:/mnt/zpool/zfs
Brick8: 172.16.161.9:/mnt/zpool/zfs
Options Reconfigured:
cluster.self-heal-daemon: on
performance.flush-behind: off
cluster.min-free-disk: 50GB
nfs.port: 2049

#glustershd.log
[2015-09-14 14:25:16.727491] I [client-handshake.c:1659:select_server_supported_programs] 0-storage_1-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727550] I [client-handshake.c:1659:select_server_supported_programs] 0-storage_1-client-7: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727602] I [client-handshake.c:1659:select_server_supported_programs] 0-storage_1-client-4: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727669] I [client-handshake.c:1659:select_server_supported_programs] 0-storage_1-client-5: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727729] I [client-handshake.c:1659:select_server_supported_programs] 0-storage_1-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727798] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-0: Connected to 172.16.161.10:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.727814] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.727880] I [afr-common.c:3698:afr_notify] 0-storage_1-replicate-0: Subvolume 'storage_1-client-0' came back up; going online.
[2015-09-14 14:25:16.728293] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-7: Connected to 172.16.161.9:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728313] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-7: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.728363] I [afr-common.c:3698:afr_notify] 0-storage_1-replicate-3: Subvolume 'storage_1-client-7' came back up; going online.
[2015-09-14 14:25:16.728432] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-4: Connected to 172.16.161.6:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728449] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-4: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.728494] I [afr-common.c:3698:afr_notify] 0-storage_1-replicate-2: Subvolume 'storage_1-client-4' came back up; going online.
[2015-09-14 14:25:16.728561] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-5: Connected to 172.16.161.7:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728590] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-5: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.728706] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-6: Connected to 172.16.161.8:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728732] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-6: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.728828] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-0: Server lk version = 1
[2015-09-14 14:25:16.728862] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-3: Connected to 172.16.161.5:49154, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728879] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.728931] I [afr-common.c:3698:afr_notify] 0-storage_1-replicate-1: Subvolume 'storage_1-client-3' came back up; going online.
[2015-09-14 14:25:16.728990] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-2: Connected to 172.16.161.4:49153, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.729005] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.729092] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-1: Connected to 172.16.161.3:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.729108] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.729191] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-7: Server lk version = 1
[2015-09-14 14:25:16.729216] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-4: Server lk version = 1
[2015-09-14 14:25:16.729235] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-5: Server lk version = 1
[2015-09-14 14:25:16.729254] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-6: Server lk version = 1
[2015-09-14 14:25:16.729281] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-3: Server lk version = 1
[2015-09-14 14:25:16.729362] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-2: Server lk version = 1
[2015-09-14 14:25:16.729390] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-1: Server lk version = 1
[2015-09-14 14:25:16.900936] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl] 0-storage_1-replicate-1: Another crawl is in progress for storage_1-client-3
[2015-09-14 14:25:17.095066] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl] 0-storage_1-replicate-1: Another crawl is in progress for storage_1-client-3
[2015-09-14 14:27:35.767127] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:27:35.767175] W [socket.c:1962:__socket_proto_state_machine] 0-glusterfs: reading from socket failed. Error (No data available), peer (127.0.0.1:24007)
[2015-09-14 14:27:45.815724] E [socket.c:2157:socket_connect_finish] 0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
[2015-09-14 14:27:45.815785] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:27:48.831108] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:27:51.835174] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:27:54.845877] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:27:57.854196] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:00.869561] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:03.877629] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:06.893191] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:09.899443] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:12.911237] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:15.916225] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:18.928260] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:21.934446] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:24.948352] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:27.954530] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:30.969954] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:33.976168] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:36.986261] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:39.992428] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:43.003749] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:46.009975] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:49.019259] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:52.025489] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:55.037455] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:58.045264] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:01.055652] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:04.068772] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:07.081162] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:10.085722] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:13.096022] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:16.102167] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:19.113583] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:22.119886] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:25.134581] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:28.138851] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:29.138927] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-storage_1-client-3: server 172.16.161.5:49154 has not responded in the last 42 seconds, disconnecting.
[2015-09-14 14:29:29.143105] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:29.144516] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x13d) [0x3d5ca0ea5d] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3d5ca0e5c3] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3d5ca0e4de]))) 0-storage_1-client-3: forced unwinding frame type(GlusterFS 3.3) op(XATTROP(33)) called at 2015-09-14 14:25:17.573714 (xid=0x18x)
[2015-09-14 14:29:29.144544] W [client-rpc-fops.c:1755:client3_3_xattrop_cbk] 0-storage_1-client-3: remote operation failed: Success. Path: (null) (--)
[2015-09-14 14:29:29.154316] I [socket.c:3027:socket_submit_request] 0-storage_1-client-3: not connected (priv->connected = 0)
[2015-09-14 14:29:29.154343] W [rpc-clnt.c:1488:rpc_clnt_submit] 0-storage_1-client-3: failed to submit rpc-request (XID: 0x24x Program: GlusterFS 3.3, ProgVers: 330, Proc: 29) to rpc-transport (storage_1-client-3)
[2015-09-14 14:29:29.154362] W [client-rpc-fops.c:1538:client3_3_inodelk_cbk] 0-storage_1-client-3: remote operation failed: Transport endpoint is not connected
[2015-09-14 14:29:29.154407] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x13d) [0x3d5ca0ea5d] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3d5ca0e5c3] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3d5ca0e4de]))) 0-storage_1-client-3: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2015-09-14 14:28:47.010089 (xid=0x23x)
[2015-09-14 14:29:29.154418] W [client-handshake.c:276:client_ping_cbk] 0-storage_1-client-3: timer must have expired
[2015-09-14 14:29:29.154433] I [client.c:2097:client_rpc_notify] 0-storage_1-client-3: disconnected
[2015-09-14 14:29:29.154478] E [socket.c:2157:socket_connect_finish] 0-storage_1-client-3: connection to 172.16.161.5:24007 failed (Connection refused)
[2015-09-14 14:29:29.154499] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:29.154572] W [client-rpc-fops.c:1640:client3_3_entrylk_cbk] 0-storage_1-client-3: remote operation failed: Transport endpoint is not connected
[2015-09-14 14:29:29.155101] E [afr-self-heal-entry.c:2296:afr_sh_post_nonblocking_entry_cbk] 0-storage_1-replicate-1: Non Blocking entrylks failed for <gfid:d529ffe7-48c7-4b6d-b9d3-a645fc18b180>.
[2015-09-14 14:29:29.155289] W [client-rpc-fops.c:1112:client3_3_getxattr_cbk] 0-storage_1-client-3: remote operation failed: Transport endpoint is not connected. Path: <gfid:d529ffe7-48c7-4b6d-b9d3-a645fc18b180> (00000000-0000-0000-0000-000000000000). Key: glusterfs.gfid2path
[2015-09-14 14:29:29.155383] W [client-rpc-fops.c:2265:client3_3_readdir_cbk] 0-storage_1-client-3: remote operation failed: Transport endpoint is not connected remote_fd = -2
[2015-09-14 14:29:31.154245] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:34.163264] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:37.172062] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:39.176124] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:40.182083] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:42.186819] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:43.198223] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:45.204365] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:46.210179] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:48.215011] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:49.225163] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)


Expected results:
Gluster should start self-heal at full speed instead of hanging the volume.

Additional info:
Comment 1 Niels de Vos 2015-09-15 08:23:29 EDT
It seems that you are using a version (glusterfs 3.4.2) that we do not update anymore. Could you try with a more recent version?

You gave this bug report a subject of "glusterfsd crash". Do you have segmentation faults of some kind? The log that you posted does not show that anything crashed. Please include logs of all gluster processes when you can reproduce this on a more current version.
Comment 2 Kaleb KEITHLEY 2015-10-07 09:49:43 EDT
GlusterFS 3.4.x has reached end-of-life.

If this bug still exists in a later release please reopen this and change the version or open a new bug.
Comment 3 Kaleb KEITHLEY 2015-10-07 09:50:53 EDT
GlusterFS 3.4.x has reached end-of-life.

If this bug still exists in a later release please reopen this and change the version or open a new bug.
