Description of problem:
-----------------------
In a 3-way replicated volume, one brick in each replica set was killed using SIGKILL while I/O was running on a fuse client. After a while, attempts to start the killed bricks using `gluster volume start force' failed repeatedly. The following is from the logs -

Brick logs when the volume is started with force option -

<snip>
+------------------------------------------------------------------------------+
[2015-09-10 23:36:12.837482] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2015-09-10 23:36:14.947021] W [socket.c:642:__socket_rwv] 0-3-test-quota: readv on /var/run/gluster/quotad.socket failed (No data available)
[2015-09-10 23:36:15.980138] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 7b8f0589-7451-4995-8d78-a0da9c702f7a
[2015-09-10 23:36:15.980174] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-3-test-server: accepted client from dhcp37-70.lab.eng.blr.redhat.com-21710-2015/09/11-06:11:12:923821-3-test-client-0-0-0 (version: 3.7.1)
[2015-09-10 23:36:15.982765] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 7b8f0589-7451-4995-8d78-a0da9c702f7a
[2015-09-10 23:36:15.982796] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-3-test-server: accepted client from dhcp37-197.lab.eng.blr.redhat.com-21886-2015/09/11-06:10:46:657748-3-test-client-0-0-0 (version: 3.7.1)
[2015-09-10 23:36:15.982915] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-3-test-server: accepted client from vm10-rhsqa13.lab.eng.blr.redhat.com-14776-2015/09/10-05:36:14:214793-3-test-client-0-0-4 (version: 3.7.1)
[2015-09-10 23:36:16.012835] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 7b8f0589-7451-4995-8d78-a0da9c702f7a
[2015-09-10 23:36:16.012871] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-3-test-server: accepted client from dhcp37-197.lab.eng.blr.redhat.com-21894-2015/09/11-06:10:47:670581-3-test-client-0-0-0 (version: 3.7.1)
[2015-09-10 23:36:16.013150] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 7b8f0589-7451-4995-8d78-a0da9c702f7a
[2015-09-10 23:36:16.013197] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-3-test-server: accepted client from dhcp37-135.lab.eng.blr.redhat.com-20213-2015/09/11-06:10:48:664075-3-test-client-0-0-0 (version: 3.7.1)
[2015-09-10 23:36:16.025388] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 7b8f0589-7451-4995-8d78-a0da9c702f7a
[2015-09-10 23:36:16.025420] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-3-test-server: accepted client from dhcp37-135.lab.eng.blr.redhat.com-20205-2015/09/11-06:10:47:604671-3-test-client-0-0-0 (version: 3.7.1)
[2015-09-10 23:36:16.025539] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 7b8f0589-7451-4995-8d78-a0da9c702f7a
[2015-09-10 23:36:16.025571] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-3-test-server: accepted client from dhcp37-135.lab.eng.blr.redhat.com-20197-2015/09/11-06:10:46:587730-3-test-client-0-0-0 (version: 3.7.1)
</snip>

From glusterd logs -

<snip>
The message "I [MSGID: 106005] [glusterd-handler.c:4899:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.37.126:/rhs/brick2/b1 has disconnected from glusterd." repeated 39 times between [2015-09-11 00:59:50.495698] and [2015-09-11 01:01:47.519805]
The message "I [MSGID: 106005] [glusterd-handler.c:4899:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.37.126:/rhs/brick3/b1 has disconnected from glusterd." repeated 39 times between [2015-09-11 00:59:50.496232] and [2015-09-11 01:01:47.521145]
[2015-09-11 01:01:50.520025] W [socket.c:642:__socket_rwv] 0-management: readv on /var/run/gluster/d688303ff19aece29c724dfbabf0aa3f.socket failed (Invalid argument)
[2015-09-11 01:01:50.520770] I [MSGID: 106005] [glusterd-handler.c:4899:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.37.126:/rhs/brick2/b1 has disconnected from glusterd.
[2015-09-11 01:01:50.521500] W [socket.c:642:__socket_rwv] 0-management: readv on /var/run/gluster/8639fa8939074b2eba37825a7012056c.socket failed (Invalid argument)
[2015-09-11 01:01:50.522167] I [MSGID: 106005] [glusterd-handler.c:4899:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.37.126:/rhs/brick3/b1 has disconnected from glusterd.
[2015-09-11 01:01:53.520813] W [socket.c:642:__socket_rwv] 0-management: readv on /var/run/gluster/d688303ff19aece29c724dfbabf0aa3f.socket failed (Invalid argument)
[2015-09-11 01:01:53.522477] W [socket.c:642:__socket_rwv] 0-management: readv on /var/run/gluster/8639fa8939074b2eba37825a7012056c.socket failed (Invalid argument)
[2015-09-11 01:01:56.521453] W [socket.c:642:__socket_rwv] 0-management: readv on /var/run/gluster/d688303ff19aece29c724dfbabf0aa3f.socket failed (Invalid argument)
[2015-09-11 01:01:56.522860] W [socket.c:642:__socket_rwv] 0-management: readv on /var/run/gluster/8639fa8939074b2eba37825a7012056c.socket failed (Invalid argument)
</snip>

Restarting glusterd also does not help.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-3.7.1-14.el7rhgs.x86_64

How reproducible:
------------------
Haven't tried on another volume.

Steps to Reproduce:
-------------------
1. While I/O is running from a fuse client on a 2x3 volume, kill one brick from each replica set.
2. After a while, start the volume with the force option - `gluster volume start <vol-name> force'

Actual results:
---------------
The bricks that were killed in step 1 do not start, either after starting the volume with the force option or after restarting glusterd.

Expected results:
------------------
Brick processes are expected to start after `gluster volume start force'
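
To see at a glance which bricks glusterd keeps reporting as disconnected, the glusterd log can be summarized per brick. The following is only a minimal sketch, not part of glusterfs: the function name `count_brick_disconnects` is hypothetical, and it assumes the log format shown in the glusterd excerpt of this report, including that each "repeated N times" summary sits on a single line in the actual log file.

```python
import re
from collections import Counter

# Matches the MSGID 106005 notice as it appears in the glusterd log excerpt.
DISCONNECT_RE = re.compile(r"Brick (\S+) has disconnected from glusterd")
# Matches the count in a 'The message "..." repeated N times' summary line.
REPEAT_RE = re.compile(r"repeated (\d+) times")


def count_brick_disconnects(lines):
    """Return a Counter mapping each brick (host:/path) to the number of
    disconnect notices logged for it.

    A plain notice counts once; a "repeated N times" summary line counts
    N times. Lines that do not mention a brick disconnect are ignored.
    """
    counts = Counter()
    for line in lines:
        match = DISCONNECT_RE.search(line)
        if match is None:
            continue
        repeat = REPEAT_RE.search(line)
        counts[match.group(1)] += int(repeat.group(1)) if repeat else 1
    return counts
```

The function accepts any iterable of lines, so an open file object for the glusterd log can be passed directly; `most_common()` on the result lists the bricks with the most disconnect notices first.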
We have noticed that this bug does not reproduce in the latest version of the product (RHGS-3.3.1+). If the bug is still relevant and still reproduces, feel free to reopen it.