Description of problem:
I'm using a container to run the glusterfs server in a kubernetes environment. When the server rebooted, the brick process failed to start on the first attempt. The brick log shows:

[2017-05-11 08:49:28.056753] I [MSGID: 100030] [glusterfsd.c:2454:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.8.5 (args: /usr/sbin/glusterfsd -s 10.3.3.11 --volfile-id gvol0.10.3.3.11.mnt-brick2-vol -p /var/lib/glusterd/vols/gvol0/run/10.3.3.11-mnt-brick2-vol.pid -S /var/run/gluster/16909a0348d1da701cfe2486bf91a886.socket --brick-name /mnt/brick2/vol -l /var/log/glusterfs/bricks/mnt-brick2-vol.log --xlator-option *-posix.glusterd-uuid=a0fd1343-929c-4851-a0d2-9603b7cc4095 --brick-port 49153 --xlator-option gvol0-server.listen-port=49153)
[2017-05-11 08:49:28.064464] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-05-11 08:51:30.661259] W [socket.c:590:__socket_rwv] 0-glusterfs: readv on 10.3.3.11:24007 failed (Connection reset by peer)
[2017-05-11 08:51:30.661699] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f62bdc09002] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f62bd9d084e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f62bd9d095e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f62bd9d20b4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f62bd9d2990] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2017-05-11 08:49:43.653446 (xid=0x1)
[2017-05-11 08:51:30.661716] E [glusterfsd-mgmt.c:1686:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:gvol0.10.3.3.11.mnt-brick2-vol)
[2017-05-11 08:51:30.661738] W [glusterfsd.c:1327:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(saved_frames_unwind+0x205) [0x7f62bd9d0875] -->/usr/sbin/glusterfsd(mgmt_getspec_cbk+0x536) [0x557a89452fc6] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x557a8944cb4b] ) 0-: received signum (0), shutting down
[2017-05-11 08:51:30.664515] I [socket.c:3391:socket_submit_request] 0-glusterfs: not connected (priv->connected = 0)
[2017-05-11 08:51:30.664527] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-glusterfs: failed to submit rpc-request (XID: 0x2 Program: Gluster Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)

I then restarted the brick process manually with the same command, and it worked:

# /usr/sbin/glusterfsd -s 10.3.3.11 --volfile-id gvol0.10.3.3.11.mnt-brick2-vol -p /var/lib/glusterd/vols/gvol0/run/10.3.3.11-mnt-brick2-vol.pid -S /var/run/gluster/16909a0348d1da701cfe2486bf91a886.socket --brick-name /mnt/brick2/vol -l /var/log/glusterfs/bricks/mnt-brick2-vol.log --xlator-option *-posix.glusterd-uuid=a0fd1343-929c-4851-a0d2-9603b7cc4095 --brick-port 49153 --xlator-option gvol0-server.listen-port=49153

The log then showed a normal startup:

[2017-05-11 08:53:18.553398] I [MSGID: 100030] [glusterfsd.c:2454:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.8.5 (args: /usr/sbin/glusterfsd -s 10.3.3.11 --volfile-id gvol0.10.3.3.11.mnt-brick2-vol -p /var/lib/glusterd/vols/gvol0/run/10.3.3.11-mnt-brick2-vol.pid -S /var/run/gluster/16909a0348d1da701cfe2486bf91a886.socket --brick-name /mnt/brick2/vol -l /var/log/glusterfs/bricks/mnt-brick2-vol.log --xlator-option *-posix.glusterd-uuid=a0fd1343-929c-4851-a0d2-9603b7cc4095 --brick-port 49153 --xlator-option gvol0-server.listen-port=49153)
[2017-05-11 08:53:18.560507] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-05-11 08:53:18.563946] I [MSGID: 101173] [graph.c:269:gf_add_cmdline_options] 0-gvol0-server: adding option 'listen-port' for volume 'gvol0-server' with value '49153'
[2017-05-11 08:53:18.563981] I [MSGID: 101173] [graph.c:269:gf_add_cmdline_options] 0-gvol0-posix: adding option 'glusterd-uuid' for volume 'gvol0-posix' with value 'a0fd1343-929c-4851-a0d2-9603b7cc4095'
[2017-05-11 08:53:18.564570] I [MSGID: 115034] [server.c:398:_check_for_auth_option] 0-gvol0-decompounder: skip format check for non-addr auth option auth.login./mnt/brick2/vol.allow
[2017-05-11 08:53:18.564578] I [MSGID: 115034] [server.c:398:_check_for_auth_option] 0-gvol0-decompounder: skip format check for non-addr auth option auth.login.94bedfd1-619d-402a-9826-67dab7600f43.password
[2017-05-11 08:53:18.564652] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2017-05-11 08:53:18.565311] I [rpcsvc.c:2214:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
...

Version-Release number of selected component (if applicable):
OS: coreos 1298.5.0
kubernetes: v1.5.1
Image: official gluster-centos:gluster3u8_centos7
Gluster: 3.8.5

How reproducible:
Reboot the glusterfs server.

Steps to Reproduce:
1. Reboot the glusterfs server.

Actual results:
Some brick processes failed to start on the first attempt.

Expected results:
All brick processes should start successfully.

Additional info:
Available if needed.
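As a workaround while investigating, offline bricks can be detected by parsing `gluster volume status` output. The sketch below is only an illustration: the column layout (Port / Online / Pid, with "N" marking an offline brick) is assumed from 3.x-era CLI output and may differ between releases.

```shell
#!/bin/sh
# Hedged sketch: list bricks whose Online column is "N" in
# `gluster volume status` output. Column positions are an assumption
# based on 3.x-era CLI output.

offline_bricks() {
    # Reads status text on stdin; prints the brick path (field 2)
    # when the second-to-last field (the Online column) is "N".
    awk '/^Brick / { if ($(NF-1) == "N") print $2 }'
}

# Typical use (not run here):
#   gluster volume status gvol0 | offline_bricks
```

Once offline bricks are identified, `gluster volume start gvol0 force` should ask glusterd to respawn the missing brick processes without touching the healthy ones, which may be less error-prone than rerunning the glusterfsd command line by hand.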
It looks like the brick process failed to fetch the volfile from glusterd. Do you have the glusterd log handy?
Created attachment 1285200 [details] glusterd log
From the brick log:

[2017-05-11 08:49:28.064464] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-05-11 08:51:30.661259] W [socket.c:590:__socket_rwv] 0-glusterfs: readv on 10.3.3.11:24007 failed (Connection reset by peer)
[2017-05-11 08:51:30.661699] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f62bdc09002] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f62bd9d084e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f62bd9d095e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f62bd9d20b4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f62bd9d2990] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2017-05-11 08:49:43.653446 (xid=0x1)
[2017-05-11 08:51:30.661716] E [glusterfsd-mgmt.c:1686:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:gvol0.10.3.3.11.mnt-brick2-vol)

From the glusterd log:

[2017-05-11 08:51:30.665606] W [socket.c:590:__socket_rwv] 0-management: readv on /var/run/gluster/16909a0348d1da701cfe2486bf91a886.socket failed (No data available)
[2017-05-11 08:51:30.668913] I [MSGID: 106005] [glusterd-handler.c:5055:__glusterd_brick_rpc_notify] 0-management: Brick 10.3.3.11:/mnt/brick2/vol has disconnected from glusterd.
My question is: what caused the readv on the socket to fail, and why did a second run of the same command succeed? Can't the brick process just retry automatically?
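Since a second invocation of the same command succeeds, a possible stopgap until the root cause is fixed is to wrap the brick start in a small retry loop in the container's startup script. This is a generic illustration, not anything gluster ships; the glusterfsd command line in the comment is the one from the report.

```shell
#!/bin/sh
# Hedged sketch: retry a command a few times before giving up.
# Purely illustrative; gluster itself does not provide this wrapper.

retry() {
    attempts=$1; shift
    i=0
    until "$@"; do
        i=$((i + 1))
        if [ "$i" -ge "$attempts" ]; then
            echo "giving up after $attempts attempts: $*" >&2
            return 1
        fi
        sleep 1   # brief pause before the next attempt
    done
}

# Example use in a container entrypoint (command taken from the report):
#   retry 5 /usr/sbin/glusterfsd -s 10.3.3.11 \
#       --volfile-id gvol0.10.3.3.11.mnt-brick2-vol ...
```

Whether retrying at this layer is the right fix is exactly the open question here; ideally glusterfsd would retry the GETSPEC handshake itself instead of exiting.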
I just pasted the logs for the reference, the analysis is not complete yet.
(In reply to Atin Mukherjee from comment #5) > I just pasted the logs for the reference, the analysis is not complete yet. Is there any progress in this matter?
This bug is getting closed because the 3.8 version is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.