Bug 1263042 - glusterfsd crash
Summary: glusterfsd crash
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.4.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-09-15 02:29 UTC by kelvin0431
Modified: 2023-09-14 03:05 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-10-07 13:49:43 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description kelvin0431 2015-09-15 02:29:36 UTC
Description of problem:

We have 8 servers in this Gluster cluster, one brick per server with every two bricks forming a replica pair. When glusterd on 172.16.161.5 starts, no matter whether cluster.self-heal-daemon is on or off, the other servers hang at "df -h" on the mount of this Gluster volume. But when all the gluster processes on 172.16.161.5 are killed, the whole cluster is accessible again. Quite a lot of zombie processes also exist on that server:

#ps aux | grep Z | wc -l
641

#ps aux | grep Z | head
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       301  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       327  0.0  0.0      0     0 ?        Z    08:45   0:00 [sh] <defunct>
root       350  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       431  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       478  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       524  0.0  0.0      0     0 ?        Z    08:45   0:00 [sh] <defunct>
root       526  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       573  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       663  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
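
(Aside: "ps aux | grep Z" also matches the ps header line and any command line containing a capital Z, so the 641 figure may be slightly inflated. A tighter count that matches only the process-state column, as a sketch:

#ps -eo stat= | grep -c '^Z'
)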


Version-Release number of selected component (if applicable):

#gluster --version
glusterfs 3.4.2 built on Nov  6 2014 14:14:26
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

How reproducible:


Steps to Reproduce:
1. Create a zpool on raidz and mount it at /mnt/zpool, then: zfs create zpool/zfs; zfs set xattr=sa zpool/zfs
2. Stop cluster.self-heal-daemon on a normal node
3. Read the volume ID and stamp it onto the brick (a combined sketch follows this list):
   grep volume-id /var/lib/glusterd/vols/storage_1/info | cut -d= -f2 | sed 's/-//g'
   setfattr -n trusted.glusterfs.volume-id -v 0x3587ec7fa7574b8b8f02244c5eddf16c /mnt/zpool/zfs
4. Start glusterd: /etc/init.d/glusterd start
5. Start cluster.self-heal-daemon
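
(For reference, the two commands in step 3 can be combined; a sketch using only the commands shown above, reading the volume-id into a shell variable:

#vol_id=$(grep volume-id /var/lib/glusterd/vols/storage_1/info | cut -d= -f2 | sed 's/-//g')
#setfattr -n trusted.glusterfs.volume-id -v 0x${vol_id} /mnt/zpool/zfs
)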

Actual results:
glusterd crashed

#gluster volume heal  storage_1 info
Connection failed. Please check if gluster daemon is operational.

#gluster volume status
Status of volume: storage_1
Gluster process                     Port    Online  Pid
------------------------------------------------------------------------------
Brick 172.16.161.10:/mnt/zpool/zfs          49152   Y   31628
Brick 172.16.161.3:/mnt/zpool/zfs           49152   Y   689
Brick 172.16.161.4:/mnt/zpool/zfs           49153   Y   29349
Brick 172.16.161.5:/mnt/zpool/zfs           49154   Y   17987
Brick 172.16.161.6:/mnt/zpool/zfs           49152   Y   13826
Brick 172.16.161.7:/mnt/zpool/zfs           49152   Y   28246
Brick 172.16.161.8:/mnt/zpool/zfs           49152   Y   21390
Brick 172.16.161.9:/mnt/zpool/zfs           49152   Y   24121
NFS Server on localhost                 2049    Y   24470
Self-heal Daemon on localhost               N/A Y   24477
NFS Server on 172.16.161.4              2049    Y   6262
Self-heal Daemon on 172.16.161.4            N/A Y   6270
NFS Server on 172.16.161.3              2049    Y   21079
Self-heal Daemon on 172.16.161.3            N/A Y   21086
NFS Server on 172.16.161.8              2049    Y   32357
Self-heal Daemon on 172.16.161.8            N/A Y   32390
NFS Server on 172.16.161.10             2049    Y   8899
Self-heal Daemon on 172.16.161.10           N/A Y   8915
NFS Server on 172.16.161.7              2049    Y   5978
Self-heal Daemon on 172.16.161.7            N/A Y   5985
NFS Server on 172.16.161.9              2049    Y   1727
Self-heal Daemon on 172.16.161.9            N/A Y   1734
NFS Server on 172.16.161.5              2049    Y   12371
Self-heal Daemon on 172.16.161.5            N/A Y   12375

#gluster volume info

Volume Name: storage_1
Type: Distributed-Replicate
Volume ID: 3587ec7f-a757-4b8b-8f02-244c5eddf16c
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 172.16.161.10:/mnt/zpool/zfs
Brick2: 172.16.161.3:/mnt/zpool/zfs
Brick3: 172.16.161.4:/mnt/zpool/zfs
Brick4: 172.16.161.5:/mnt/zpool/zfs
Brick5: 172.16.161.6:/mnt/zpool/zfs
Brick6: 172.16.161.7:/mnt/zpool/zfs
Brick7: 172.16.161.8:/mnt/zpool/zfs
Brick8: 172.16.161.9:/mnt/zpool/zfs
Options Reconfigured:
cluster.self-heal-daemon: on
performance.flush-behind: off
cluster.min-free-disk: 50GB
nfs.port: 2049
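
(Aside: options under "Options Reconfigured" are applied with the standard "gluster volume set" CLI; toggling the self-heal daemon, as in steps 2 and 5 of the reproducer, would look like:

#gluster volume set storage_1 cluster.self-heal-daemon off
#gluster volume set storage_1 cluster.self-heal-daemon on
)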

#glustershd.log
[2015-09-14 14:25:16.727491] I [client-handshake.c:1659:select_server_supported_programs] 0-storage_1-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727550] I [client-handshake.c:1659:select_server_supported_programs] 0-storage_1-client-7: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727602] I [client-handshake.c:1659:select_server_supported_programs] 0-storage_1-client-4: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727669] I [client-handshake.c:1659:select_server_supported_programs] 0-storage_1-client-5: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727729] I [client-handshake.c:1659:select_server_supported_programs] 0-storage_1-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727798] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-0: Connected to 172.16.161.10:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.727814] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.727880] I [afr-common.c:3698:afr_notify] 0-storage_1-replicate-0: Subvolume 'storage_1-client-0' came back up; going online.
[2015-09-14 14:25:16.728293] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-7: Connected to 172.16.161.9:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728313] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-7: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.728363] I [afr-common.c:3698:afr_notify] 0-storage_1-replicate-3: Subvolume 'storage_1-client-7' came back up; going online.
[2015-09-14 14:25:16.728432] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-4: Connected to 172.16.161.6:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728449] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-4: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.728494] I [afr-common.c:3698:afr_notify] 0-storage_1-replicate-2: Subvolume 'storage_1-client-4' came back up; going online.
[2015-09-14 14:25:16.728561] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-5: Connected to 172.16.161.7:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728590] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-5: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.728706] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-6: Connected to 172.16.161.8:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728732] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-6: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.728828] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-0: Server lk version = 1
[2015-09-14 14:25:16.728862] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-3: Connected to 172.16.161.5:49154, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728879] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.728931] I [afr-common.c:3698:afr_notify] 0-storage_1-replicate-1: Subvolume 'storage_1-client-3' came back up; going online.
[2015-09-14 14:25:16.728990] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-2: Connected to 172.16.161.4:49153, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.729005] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.729092] I [client-handshake.c:1456:client_setvolume_cbk] 0-storage_1-client-1: Connected to 172.16.161.3:49152, attached to remote volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.729108] I [client-handshake.c:1468:client_setvolume_cbk] 0-storage_1-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-14 14:25:16.729191] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-7: Server lk version = 1
[2015-09-14 14:25:16.729216] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-4: Server lk version = 1
[2015-09-14 14:25:16.729235] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-5: Server lk version = 1
[2015-09-14 14:25:16.729254] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-6: Server lk version = 1
[2015-09-14 14:25:16.729281] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-3: Server lk version = 1
[2015-09-14 14:25:16.729362] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-2: Server lk version = 1
[2015-09-14 14:25:16.729390] I [client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-1: Server lk version = 1
[2015-09-14 14:25:16.900936] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl] 0-storage_1-replicate-1: Another crawl is in progress for storage_1-client-3
[2015-09-14 14:25:17.095066] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl] 0-storage_1-replicate-1: Another crawl is in progress for storage_1-client-3
[2015-09-14 14:27:35.767127] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:27:35.767175] W [socket.c:1962:__socket_proto_state_machine] 0-glusterfs: reading from socket failed. Error (No data available), peer (127.0.0.1:24007)
[2015-09-14 14:27:45.815724] E [socket.c:2157:socket_connect_finish] 0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
[2015-09-14 14:27:45.815785] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:27:48.831108] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:27:51.835174] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:27:54.845877] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:27:57.854196] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:00.869561] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:03.877629] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:06.893191] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:09.899443] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:12.911237] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:15.916225] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:18.928260] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:21.934446] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:24.948352] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:27.954530] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:30.969954] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:33.976168] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:36.986261] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:39.992428] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:43.003749] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:46.009975] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:49.019259] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:52.025489] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:55.037455] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:28:58.045264] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:01.055652] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:04.068772] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:07.081162] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:10.085722] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:13.096022] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:16.102167] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:19.113583] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:22.119886] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:25.134581] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:28.138851] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:29.138927] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-storage_1-client-3: server 172.16.161.5:49154 has not responded in the last 42 seconds, disconnecting.
[2015-09-14 14:29:29.143105] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:29.144516] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x13d) [0x3d5ca0ea5d] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3d5ca0e5c3] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3d5ca0e4de]))) 0-storage_1-client-3: forced unwinding frame type(GlusterFS 3.3) op(XATTROP(33)) called at 2015-09-14 14:25:17.573714 (xid=0x18x)
[2015-09-14 14:29:29.144544] W [client-rpc-fops.c:1755:client3_3_xattrop_cbk] 0-storage_1-client-3: remote operation failed: Success. Path: (null) (--)
[2015-09-14 14:29:29.154316] I [socket.c:3027:socket_submit_request] 0-storage_1-client-3: not connected (priv->connected = 0)
[2015-09-14 14:29:29.154343] W [rpc-clnt.c:1488:rpc_clnt_submit] 0-storage_1-client-3: failed to submit rpc-request (XID: 0x24x Program: GlusterFS 3.3, ProgVers: 330, Proc: 29) to rpc-transport (storage_1-client-3)
[2015-09-14 14:29:29.154362] W [client-rpc-fops.c:1538:client3_3_inodelk_cbk] 0-storage_1-client-3: remote operation failed: Transport endpoint is not connected
[2015-09-14 14:29:29.154407] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x13d) [0x3d5ca0ea5d] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3d5ca0e5c3] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3d5ca0e4de]))) 0-storage_1-client-3: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2015-09-14 14:28:47.010089 (xid=0x23x)
[2015-09-14 14:29:29.154418] W [client-handshake.c:276:client_ping_cbk] 0-storage_1-client-3: timer must have expired
[2015-09-14 14:29:29.154433] I [client.c:2097:client_rpc_notify] 0-storage_1-client-3: disconnected
[2015-09-14 14:29:29.154478] E [socket.c:2157:socket_connect_finish] 0-storage_1-client-3: connection to 172.16.161.5:24007 failed (Connection refused)
[2015-09-14 14:29:29.154499] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:29.154572] W [client-rpc-fops.c:1640:client3_3_entrylk_cbk] 0-storage_1-client-3: remote operation failed: Transport endpoint is not connected
[2015-09-14 14:29:29.155101] E [afr-self-heal-entry.c:2296:afr_sh_post_nonblocking_entry_cbk] 0-storage_1-replicate-1: Non Blocking entrylks failed for <gfid:d529ffe7-48c7-4b6d-b9d3-a645fc18b180>.
[2015-09-14 14:29:29.155289] W [client-rpc-fops.c:1112:client3_3_getxattr_cbk] 0-storage_1-client-3: remote operation failed: Transport endpoint is not connected. Path: <gfid:d529ffe7-48c7-4b6d-b9d3-a645fc18b180> (00000000-0000-0000-0000-000000000000). Key: glusterfs.gfid2path
[2015-09-14 14:29:29.155383] W [client-rpc-fops.c:2265:client3_3_readdir_cbk] 0-storage_1-client-3: remote operation failed: Transport endpoint is not connected remote_fd = -2
[2015-09-14 14:29:31.154245] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:34.163264] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:37.172062] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:39.176124] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:40.182083] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:42.186819] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:43.198223] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:45.204365] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:46.210179] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2015-09-14 14:29:48.215011] W [socket.c:514:__socket_rwv] 0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:49.225163] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
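
(Aside: the 42-second window in the ping-timer message above corresponds to GlusterFS's default network.ping-timeout; it is tunable per volume, e.g.:

#gluster volume set storage_1 network.ping-timeout 42
)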


Expected results:
Gluster should start self-heal at full speed.

Additional info:

Comment 1 Niels de Vos 2015-09-15 12:23:29 UTC
It seems that you are using a version (glusterfs 3.4.2) that we do not update anymore. Could you try with a more recent version?

You gave this bug report a subject of "glusterfsd crash". Do you have segmentation faults of some kind? The log that you posted does not contain a reference that something crashed. Please include logs of all gluster processes when you can reproduce this on more current versions.

Comment 2 Kaleb KEITHLEY 2015-10-07 13:49:43 UTC
GlusterFS 3.4.x has reached end-of-life.

If this bug still exists in a later release please reopen this and change the version or open a new bug.

Comment 3 Kaleb KEITHLEY 2015-10-07 13:50:53 UTC
GlusterFS 3.4.x has reached end-of-life.

If this bug still exists in a later release please reopen this and change the version or open a new bug.

Comment 4 Red Hat Bugzilla 2023-09-14 03:05:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

