Bug 1262212 - Brick process does not start after being killed with SIGKILL and then running `gluster volume start force'
Status: CLOSED WORKSFORME
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: core
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assigned To: Bug Updates Notification Mailing List
QA Contact: Anoop
Keywords: ZStream
Depends On:
Blocks:
Reported: 2015-09-11 03:47 EDT by Shruti Sampat
Modified: 2018-02-06 23:26 EST
CC List: 1 user

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-02-06 23:26:44 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Shruti Sampat 2015-09-11 03:47:30 EDT
Description of problem:
-----------------------

In a 3-way replicated volume, one brick process in each replica set was killed with SIGKILL while I/O was running on a FUSE client. After a while, repeated attempts to start the killed bricks with `gluster volume start force' failed. The following is from the logs -
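
For context, the bricks were killed roughly as follows (a minimal sketch; the brick PID is taken from the PID column of `gluster volume status', and `3-test' is the volume name seen in the logs below):

# List brick processes and their PIDs for the volume
gluster volume status 3-test

# Kill one brick process per replica set with SIGKILL (no cleanup)
kill -KILL <brick-pid>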

Brick logs when the volume is started with the force option -

<snip>

+------------------------------------------------------------------------------+
[2015-09-10 23:36:12.837482] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2015-09-10 23:36:14.947021] W [socket.c:642:__socket_rwv] 0-3-test-quota: readv on /var/run/gluster/quotad.socket failed (No data available)
[2015-09-10 23:36:15.980138] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 7b8f0589-7451-4995-8d78-a0da9c702f7a
[2015-09-10 23:36:15.980174] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-3-test-server: accepted client from dhcp37-70.lab.eng.blr.redhat.com-21710-2015/09/11-06:11:12:923821-3-test-client-0-0-0 (version: 3.7.1)
[2015-09-10 23:36:15.982765] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 7b8f0589-7451-4995-8d78-a0da9c702f7a
[2015-09-10 23:36:15.982796] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-3-test-server: accepted client from dhcp37-197.lab.eng.blr.redhat.com-21886-2015/09/11-06:10:46:657748-3-test-client-0-0-0 (version: 3.7.1)
[2015-09-10 23:36:15.982915] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-3-test-server: accepted client from vm10-rhsqa13.lab.eng.blr.redhat.com-14776-2015/09/10-05:36:14:214793-3-test-client-0-0-4 (version: 3.7.1)
[2015-09-10 23:36:16.012835] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 7b8f0589-7451-4995-8d78-a0da9c702f7a
[2015-09-10 23:36:16.012871] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-3-test-server: accepted client from dhcp37-197.lab.eng.blr.redhat.com-21894-2015/09/11-06:10:47:670581-3-test-client-0-0-0 (version: 3.7.1)
[2015-09-10 23:36:16.013150] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 7b8f0589-7451-4995-8d78-a0da9c702f7a
[2015-09-10 23:36:16.013197] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-3-test-server: accepted client from dhcp37-135.lab.eng.blr.redhat.com-20213-2015/09/11-06:10:48:664075-3-test-client-0-0-0 (version: 3.7.1)
[2015-09-10 23:36:16.025388] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 7b8f0589-7451-4995-8d78-a0da9c702f7a
[2015-09-10 23:36:16.025420] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-3-test-server: accepted client from dhcp37-135.lab.eng.blr.redhat.com-20205-2015/09/11-06:10:47:604671-3-test-client-0-0-0 (version: 3.7.1)
[2015-09-10 23:36:16.025539] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 7b8f0589-7451-4995-8d78-a0da9c702f7a
[2015-09-10 23:36:16.025571] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-3-test-server: accepted client from dhcp37-135.lab.eng.blr.redhat.com-20197-2015/09/11-06:10:46:587730-3-test-client-0-0-0 (version: 3.7.1)

</snip>

From glusterd logs -

<snip>

The message "I [MSGID: 106005] [glusterd-handler.c:4899:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.37.126:/rhs/brick2/b1 has disconnected from glusterd." repeated 39 times between [2015-09-11 00:59:50.495698] and [2015-09-11 01:01:47.519805]
The message "I [MSGID: 106005] [glusterd-handler.c:4899:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.37.126:/rhs/brick3/b1 has disconnected from glusterd." repeated 39 times between [2015-09-11 00:59:50.496232] and [2015-09-11 01:01:47.521145]
[2015-09-11 01:01:50.520025] W [socket.c:642:__socket_rwv] 0-management: readv on /var/run/gluster/d688303ff19aece29c724dfbabf0aa3f.socket failed (Invalid argument)
[2015-09-11 01:01:50.520770] I [MSGID: 106005] [glusterd-handler.c:4899:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.37.126:/rhs/brick2/b1 has disconnected from glusterd.
[2015-09-11 01:01:50.521500] W [socket.c:642:__socket_rwv] 0-management: readv on /var/run/gluster/8639fa8939074b2eba37825a7012056c.socket failed (Invalid argument)
[2015-09-11 01:01:50.522167] I [MSGID: 106005] [glusterd-handler.c:4899:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.37.126:/rhs/brick3/b1 has disconnected from glusterd.
[2015-09-11 01:01:53.520813] W [socket.c:642:__socket_rwv] 0-management: readv on /var/run/gluster/d688303ff19aece29c724dfbabf0aa3f.socket failed (Invalid argument)
[2015-09-11 01:01:53.522477] W [socket.c:642:__socket_rwv] 0-management: readv on /var/run/gluster/8639fa8939074b2eba37825a7012056c.socket failed (Invalid argument)
[2015-09-11 01:01:56.521453] W [socket.c:642:__socket_rwv] 0-management: readv on /var/run/gluster/d688303ff19aece29c724dfbabf0aa3f.socket failed (Invalid argument)
[2015-09-11 01:01:56.522860] W [socket.c:642:__socket_rwv] 0-management: readv on /var/run/gluster/8639fa8939074b2eba37825a7012056c.socket failed (Invalid argument)

</snip>

Restarting glusterd also does not help.
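
For completeness, the restart and the subsequent checks looked roughly like this (a sketch; the socket path is taken from the glusterd log above, and `lsof' is only one possible way to check whether any process still holds the socket):

# Restart the management daemon (RHEL 7, systemd)
systemctl restart glusterd

# The killed bricks still show 'N' in the Online column afterwards
gluster volume status 3-test

# The readv failures above point at a brick's unix socket;
# check whether any process still has it open
lsof /var/run/gluster/d688303ff19aece29c724dfbabf0aa3f.socket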

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-3.7.1-14.el7rhgs.x86_64

How reproducible:
------------------
Haven't tried on another volume.

Steps to Reproduce:
-------------------
1. While I/O is running from a FUSE client on a 2x3 volume, kill one brick process from each replica set with SIGKILL.
2. After a while, start the volume with the force option - `gluster volume start <vol-name> force' (see the sketch below).
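
A sketch of step 2 and how the failure shows up (placeholders as in step 2):

# Force-start the volume to bring the killed bricks back
gluster volume start <vol-name> force

# The killed bricks remain offline (Online column shows 'N')
gluster volume status <vol-name>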

Actual results:
---------------
The bricks that were killed in step 1 do not start, whether the volume is started with the force option or glusterd is restarted.

Expected results:
------------------
Brick processes are expected to start after `gluster volume start force'.
Comment 3 Amar Tumballi 2018-02-06 23:26:44 EST
We have noticed that the bug is not reproducible in the latest version of the product (RHGS 3.3.1+).

If the bug is still relevant and still reproducible, feel free to reopen it.
