Description of problem:
I'm using a container to run the glusterfs server in a kubernetes environment. When the server rebooted, the brick process failed to start on the first attempt. The brick log shows:

[2017-05-11 08:49:28.056753] I [MSGID: 100030] [glusterfsd.c:2454:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.8.5 (args: /usr/sbin/glusterfsd -s 10.3.3.11 --volfile-id gvol0.10.3.3.11.mnt-brick2-vol -p /var/lib/glusterd/vols/gvol0/run/10.3.3.11-mnt-brick2-vol.pid -S /var/run/gluster/16909a0348d1da701cfe2486bf91a886.socket --brick-name /mnt/brick2/vol -l /var/log/glusterfs/bricks/mnt-brick2-vol.log --xlator-option *-posix.glusterd-uuid=a0fd1343-929c-4851-a0d2-9603b7cc4095 --brick-port 49153 --xlator-option gvol0-server.listen-port=49153)
[2017-05-11 08:49:28.064464] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-05-11 08:51:30.661259] W [socket.c:590:__socket_rwv] 0-glusterfs: readv on 10.3.3.11:24007 failed (Connection reset by peer)
[2017-05-11 08:51:30.661699] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f62bdc09002] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f62bd9d084e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f62bd9d095e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f62bd9d20b4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f62bd9d2990] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2017-05-11 08:49:43.653446 (xid=0x1)
[2017-05-11 08:51:30.661716] E [glusterfsd-mgmt.c:1686:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:gvol0.10.3.3.11.mnt-brick2-vol)
[2017-05-11 08:51:30.661738] W [glusterfsd.c:1327:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(saved_frames_unwind+0x205) [0x7f62bd9d0875] -->/usr/sbin/glusterfsd(mgmt_getspec_cbk+0x536) [0x557a89452fc6] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x557a8944cb4b] ) 0-: received signum (0), shutting down
[2017-05-11 08:51:30.664515] I [socket.c:3391:socket_submit_request] 0-glusterfs: not connected (priv->connected = 0)
[2017-05-11 08:51:30.664527] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-glusterfs: failed to submit rpc-request (XID: 0x2 Program: Gluster Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)

I then restarted the brick process manually with the same command, and it worked:

# /usr/sbin/glusterfsd -s 10.3.3.11 --volfile-id gvol0.10.3.3.11.mnt-brick2-vol -p /var/lib/glusterd/vols/gvol0/run/10.3.3.11-mnt-brick2-vol.pid -S /var/run/gluster/16909a0348d1da701cfe2486bf91a886.socket --brick-name /mnt/brick2/vol -l /var/log/glusterfs/bricks/mnt-brick2-vol.log --xlator-option *-posix.glusterd-uuid=a0fd1343-929c-4851-a0d2-9603b7cc4095 --brick-port 49153 --xlator-option gvol0-server.listen-port=49153

The log then showed a normal startup:

[2017-05-11 08:53:18.553398] I [MSGID: 100030] [glusterfsd.c:2454:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.8.5 (args: /usr/sbin/glusterfsd -s 10.3.3.11 --volfile-id gvol0.10.3.3.11.mnt-brick2-vol -p /var/lib/glusterd/vols/gvol0/run/10.3.3.11-mnt-brick2-vol.pid -S /var/run/gluster/16909a0348d1da701cfe2486bf91a886.socket --brick-name /mnt/brick2/vol -l /var/log/glusterfs/bricks/mnt-brick2-vol.log --xlator-option *-posix.glusterd-uuid=a0fd1343-929c-4851-a0d2-9603b7cc4095 --brick-port 49153 --xlator-option gvol0-server.listen-port=49153)
[2017-05-11 08:53:18.560507] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-05-11 08:53:18.563946] I [MSGID: 101173] [graph.c:269:gf_add_cmdline_options] 0-gvol0-server: adding option 'listen-port' for volume 'gvol0-server' with value '49153'
[2017-05-11 08:53:18.563981] I [MSGID: 101173] [graph.c:269:gf_add_cmdline_options] 0-gvol0-posix: adding option 'glusterd-uuid' for volume 'gvol0-posix' with value 'a0fd1343-929c-4851-a0d2-9603b7cc4095'
[2017-05-11 08:53:18.564570] I [MSGID: 115034] [server.c:398:_check_for_auth_option] 0-gvol0-decompounder: skip format check for non-addr auth option auth.login./mnt/brick2/vol.allow
[2017-05-11 08:53:18.564578] I [MSGID: 115034] [server.c:398:_check_for_auth_option] 0-gvol0-decompounder: skip format check for non-addr auth option auth.login.94bedfd1-619d-402a-9826-67dab7600f43.password
[2017-05-11 08:53:18.564652] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2017-05-11 08:53:18.565311] I [rpcsvc.c:2214:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
...

Version-Release number of selected component (if applicable):
OS: coreos 1298.5.0
kubernetes: v1.5.1
Image: official gluster-centos:gluster3u8_centos7
Gluster: 3.8.5

How reproducible:
Reboot the glusterfs server.

Steps to Reproduce:
1. Reboot the glusterfs server.

Actual results:
Some brick processes failed to start on the first attempt.

Expected results:
All brick processes should start successfully.

Additional info:
Available if needed.
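As a workaround while investigating, offline bricks can be detected by parsing `gluster volume status` output. The sketch below is only an illustration: the column layout (Port / Online / Pid, with "N" marking an offline brick) is assumed from 3.x-era CLI output and may differ between releases.

```shell
#!/bin/sh
# Hedged sketch: list bricks whose Online column is "N" in
# `gluster volume status` output. Column positions are an assumption
# based on 3.x-era CLI output.

offline_bricks() {
    # Reads status text on stdin; prints the brick path (field 2)
    # when the second-to-last field (the Online column) is "N".
    awk '/^Brick / { if ($(NF-1) == "N") print $2 }'
}

# Typical use (not run here):
#   gluster volume status gvol0 | offline_bricks
```

Once offline bricks are identified, `gluster volume start gvol0 force` should ask glusterd to respawn the missing brick processes without touching the healthy ones, which may be less error-prone than rerunning the glusterfsd command line by hand.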
It looks like the brick process failed to fetch the volfile from glusterd. Do you have the glusterd log handy?
Created attachment 1285200 [details] glusterd log
From the brick log:

[2017-05-11 08:49:28.064464] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-05-11 08:51:30.661259] W [socket.c:590:__socket_rwv] 0-glusterfs: readv on 10.3.3.11:24007 failed (Connection reset by peer)
[2017-05-11 08:51:30.661699] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f62bdc09002] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f62bd9d084e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f62bd9d095e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f62bd9d20b4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f62bd9d2990] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2017-05-11 08:49:43.653446 (xid=0x1)
[2017-05-11 08:51:30.661716] E [glusterfsd-mgmt.c:1686:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:gvol0.10.3.3.11.mnt-brick2-vol)

From the glusterd log:

[2017-05-11 08:51:30.665606] W [socket.c:590:__socket_rwv] 0-management: readv on /var/run/gluster/16909a0348d1da701cfe2486bf91a886.socket failed (No data available)
[2017-05-11 08:51:30.668913] I [MSGID: 106005] [glusterd-handler.c:5055:__glusterd_brick_rpc_notify] 0-management: Brick 10.3.3.11:/mnt/brick2/vol has disconnected from glusterd.
My question is: what caused the readv on the socket to fail, and why did a second run of the same command succeed? Can't the brick process just retry automatically?
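Since a second invocation of the same command succeeds, a possible stopgap until the root cause is fixed is to wrap the brick start in a small retry loop in the container's startup script. This is a generic illustration, not anything gluster ships; the glusterfsd command line in the comment is the one from the report.

```shell
#!/bin/sh
# Hedged sketch: retry a command a few times before giving up.
# Purely illustrative; gluster itself does not provide this wrapper.

retry() {
    attempts=$1; shift
    i=0
    until "$@"; do
        i=$((i + 1))
        if [ "$i" -ge "$attempts" ]; then
            echo "giving up after $attempts attempts: $*" >&2
            return 1
        fi
        sleep 1   # brief pause before the next attempt
    done
}

# Example use in a container entrypoint (command taken from the report):
#   retry 5 /usr/sbin/glusterfsd -s 10.3.3.11 \
#       --volfile-id gvol0.10.3.3.11.mnt-brick2-vol ...
```

Whether retrying at this layer is the right fix is exactly the open question here; ideally glusterfsd would retry the GETSPEC handshake itself instead of exiting.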
I just pasted the logs for the reference, the analysis is not complete yet.
(In reply to Atin Mukherjee from comment #5) > I just pasted the logs for the reference, the analysis is not complete yet. Is there any progress in this matter?
This bug is getting closed because the 3.8 version is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.