Bug 1362170 - Disconnection Issues in gluster
Summary: Disconnection Issues in gluster
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: GlusterFS
Classification: Community
Component: protocol
Version: mainline
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-08-01 12:53 UTC by Olia Kremmyda
Modified: 2018-10-08 17:29 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-08 17:29:30 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)

Description Olia Kremmyda 2016-08-01 12:53:23 UTC
Description of problem:

We use v3.6.9 and we have a deployment with 2 storage nodes with bricks and a third one acting as quorum. We have noticed a weird behavior in storage nodes with bricks in every deployment:

glusterd logs are full of the following message:

[2016-07-15 04:08:01.558743] W [socket.c:620:__socket_rwv] 0-management: readv on /var/run/2d164c87498372d99c47b3b627dcd393.socket failed (Invalid argument)

In addition, the following message is reported periodically in glusterd logs:

The message "I [MSGID: 106006] [glusterd-handler.c:4290:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd." repeated 39 times between [2016-07-15 04:06:04.513882] and [2016-07-15 04:08:01.558787]

We found the following bug fix, which is quite similar:

http://review.gluster.org/#/c/9269/


In addition to messages mentioned above, many times we face functionality problems and storage is not operational. We have collected the following logs from glusterfsd:

[2016-07-18 07:48:36.800400] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2016-07-18 07:48:36.974391] I [glusterfsd-mgmt.c:1504:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2016-07-18 07:48:36.982390] I [server.c:518:server_rpc_notify] 0-voldata-server: disconnecting connection from SN-0-1580-2016/07/18-07:48:34:958658-voldata-client-0-0-0
[2016-07-18 07:48:36.982422] I [client_t.c:417:gf_client_unref] 0-voldata-server: Shutting down connection SN-0-1580-2016/07/18-07:48:34:958658-voldata-client-0-0-0
[2016-07-18 07:48:38.015640] I [login.c:82:gf_auth] 0-auth/login: allowed user names: 7a00c0d0-e728-4c9a-92d7-819a10952886
[2016-07-18 07:48:38.015662] I [server-handshake.c:585:server_setvolume] 0-voldata-server: accepted client from SN-0-1622-2016/07/18-07:48:37:983796-voldata-client-0-0-0 (version: 3.6.9)
[2016-07-18 07:48:38.284775] I [server.c:518:server_rpc_notify] 0-voldata-server: disconnecting connection from SN-1-2297-2016/07/18-07:48:36:82735-voldata-client-0-0-0
[2016-07-18 07:48:38.284823] I [client_t.c:417:gf_client_unref] 0-voldata-server: Shutting down connection SN-1-2297-2016/07/18-07:48:36:82735-voldata-client-0-0-0
[2016-07-18 07:48:39.331257] I [login.c:82:gf_auth] 0-auth/login: allowed user names: 7a00c0d0-e728-4c9a-92d7-819a10952886
[2016-07-18 07:48:39.331282] I [server-handshake.c:585:server_setvolume] 0-voldata-server: accepted client from SN-1-2350-2016/07/18-07:48:38:966005-voldata-client-0-0-0 (version: 3.6.9)
[2016-07-18 07:48:41.357084] I [login.c:82:gf_auth] 0-auth/login: allowed user names: 7a00c0d0-e728-4c9a-92d7-819a10952886
[2016-07-18 07:48:41.357109] I [server-handshake.c:585:server_setvolume] 0-voldata-server: accepted client from SN-2-2765-2016/07/18-07:48:41:308956-voldata-client-0-0-0 (version: 3.6.9)
[2016-07-18 07:48:42.915110] I [login.c:82:gf_auth] 0-auth/login: allowed user names: 7a00c0d0-e728-4c9a-92d7-819a10952886
[2016-07-18 07:48:42.915136] I [server-handshake.c:585:server_setvolume] 0-voldata-server: accepted client from SN-0-1894-2016/07/18-07:48:42:894933-voldata-client-0-0-0 (version: 3.6.9)
[2016-07-18 07:48:53.873241] I [server-handshake.c:585:server_setvolume] 0-voldata-server: accepted client from MN-0-740-2016/07/18-07:48:53:295469-voldata-client-0-0-0 (version: 3.6.9)
[2016-07-18 07:48:59.645589] I [server-handshake.c:585:server_setvolume] 0-voldata-server: accepted client from MN-1-728-2016/07/18-07:48:59:628747-voldata-client-0-0-0 (version: 3.6.9)
[2016-07-18 07:56:06.939668] I [server.c:518:server_rpc_notify] 0-voldata-server: disconnecting connection from SN-1-1152-2016/07/18-07:48:09:699332-voldata-client-0-0-0
[2016-07-18 07:56:06.939692] I [client_t.c:417:gf_client_unref] 0-voldata-server: Shutting down connection SN-1-1152-2016/07/18-07:48:09:699332-voldata-client-0-0-0
[2016-07-18 07:56:06.939722] I [server.c:518:server_rpc_notify] 0-voldata-server: disconnecting connection from SN-2-1134-2016/07/18-07:48:10:12643-voldata-client-0-0-0
[2016-07-18 07:56:06.939735] I [client_t.c:417:gf_client_unref] 0-voldata-server: Shutting down connection SN-2-1134-2016/07/18-07:48:10:12643-voldata-client-0-0-0
[2016-07-18 07:56:06.942400] W [glusterfsd.c:1211:cleanup_and_exit] (--> 0-: received signum (15), shutting down

After the last message, glusterfsd restarts and after a few minutes it comes on the same situation with above messages printed again. During the short period of operation we observed many disconnections and re-connections of gluster clients.

At the same time, glusterd prints the following messages:

/usr/lib64/glusterfs/glusterfs/3.6.9/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x356)[0x7f9df8acc1f5] (--> /usr/lib64/glusterfs/glusterfs/3.6.9/xlator/mgmt/glusterd.so(__glusterd_peer_rpc_notify+0x289)[0x7f9df8a2da4b] (--> /usr/lib64/glusterfs/glusterfs/3.6.9/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x61)[0x7f9df8a22384] (--> /usr/lib64/glusterfs/glusterfs/3.6.9/xlator/mgmt/glusterd.so(glusterd_peer_rpc_notify+0x38)[0x7f9df8a2dbea] ))))) 0-management: Lock for vol services not held
The message "I [MSGID: 106006] [glusterd-handler.c:4290:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd." repeated 39 times between [2016-07-18 07:58:08.475445] and [2016-07-18 08:00:06.414826]

and after a while connection with the other storage server and quorum server lost.

Could you please check and provide some help?

Version-Release number of selected component (if applicable):
v3.6.9

How reproducible:
Often

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

FYI we have back-ported the following patches from 3.7 branch:
http://review.gluster.org/#/c/10829/
http://review.gluster.org/#/c/13358/
http://review.gluster.org/#/c/14052/
http://review.gluster.org/#/c/12310/
http://review.gluster.org/#/c/13392/

Comment 1 Atin Mukherjee 2016-08-01 13:22:21 UTC
First of all 3.6 is not going to receive any bug fixes any further, please refer to http://www.gluster.org/pipermail/gluster-users/2016-July/027682.html . You'd need to upgrade your cluster to 3.7/3.8.

Now coming to the relevant glusterd log here is what I think:

/usr/lib64/glusterfs/glusterfs/3.6.9/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x356)[0x7f9df8acc1f5] (--> /usr/lib64/glusterfs/glusterfs/3.6.9/xlator/mgmt/glusterd.so(__glusterd_peer_rpc_notify+0x289)[0x7f9df8a2da4b] (--> /usr/lib64/glusterfs/glusterfs/3.6.9/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x61)[0x7f9df8a22384] (--> /usr/lib64/glusterfs/glusterfs/3.6.9/xlator/mgmt/glusterd.so(glusterd_peer_rpc_notify+0x38)[0x7f9df8a2dbea] ))))) 0-management: Lock for vol services not held
The message "I [MSGID: 106006] [glusterd-handler.c:4290:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd." repeated 39 times between [2016-07-18 07:58:08.475445] and [2016-07-18 08:00:06.414826]

This is expected and on a peer disconnect glusterd tries to release all the stale locks, if the lock is not held by the peer for the volume(s) warning logs are captured.

Comment 2 Atin Mukherjee 2016-08-01 13:23:16 UTC
(In reply to Olia Kremmyda from comment #0)
> Description of problem:
> 
> We use v3.6.9 and we have a deployment with 2 storage nodes with bricks and
> a third one acting as quorum. We have noticed a weird behavior in storage
> nodes with bricks in every deployment:
> 
> glusterd logs are full of the following message:
> 
> [2016-07-15 04:08:01.558743] W [socket.c:620:__socket_rwv] 0-management:
> readv on /var/run/2d164c87498372d99c47b3b627dcd393.socket failed (Invalid
> argument)
> 
> In addition, the following message is reported periodically in glusterd logs:
> 
> The message "I [MSGID: 106006]
> [glusterd-handler.c:4290:__glusterd_nodesvc_rpc_notify] 0-management: nfs
> has disconnected from glusterd." repeated 39 times between [2016-07-15
> 04:06:04.513882] and [2016-07-15 04:08:01.558787]
> 
> We found the following bug fix, which is quite similar:
> 
> http://review.gluster.org/#/c/9269/
> 
> 
> In addition to messages mentioned above, many times we face functionality
> problems and storage is not operational. We have collected the following
> logs from glusterfsd:
> 
> [2016-07-18 07:48:36.800400] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt:
> Volume file changed
> [2016-07-18 07:48:36.974391] I [glusterfsd-mgmt.c:1504:mgmt_getspec_cbk]
> 0-glusterfs: No change in volfile, continuing
> [2016-07-18 07:48:36.982390] I [server.c:518:server_rpc_notify]
> 0-voldata-server: disconnecting connection from
> SN-0-1580-2016/07/18-07:48:34:958658-voldata-client-0-0-0
> [2016-07-18 07:48:36.982422] I [client_t.c:417:gf_client_unref]
> 0-voldata-server: Shutting down connection
> SN-0-1580-2016/07/18-07:48:34:958658-voldata-client-0-0-0
> [2016-07-18 07:48:38.015640] I [login.c:82:gf_auth] 0-auth/login: allowed
> user names: 7a00c0d0-e728-4c9a-92d7-819a10952886
> [2016-07-18 07:48:38.015662] I [server-handshake.c:585:server_setvolume]
> 0-voldata-server: accepted client from
> SN-0-1622-2016/07/18-07:48:37:983796-voldata-client-0-0-0 (version: 3.6.9)
> [2016-07-18 07:48:38.284775] I [server.c:518:server_rpc_notify]
> 0-voldata-server: disconnecting connection from
> SN-1-2297-2016/07/18-07:48:36:82735-voldata-client-0-0-0
> [2016-07-18 07:48:38.284823] I [client_t.c:417:gf_client_unref]
> 0-voldata-server: Shutting down connection
> SN-1-2297-2016/07/18-07:48:36:82735-voldata-client-0-0-0
> [2016-07-18 07:48:39.331257] I [login.c:82:gf_auth] 0-auth/login: allowed
> user names: 7a00c0d0-e728-4c9a-92d7-819a10952886
> [2016-07-18 07:48:39.331282] I [server-handshake.c:585:server_setvolume]
> 0-voldata-server: accepted client from
> SN-1-2350-2016/07/18-07:48:38:966005-voldata-client-0-0-0 (version: 3.6.9)
> [2016-07-18 07:48:41.357084] I [login.c:82:gf_auth] 0-auth/login: allowed
> user names: 7a00c0d0-e728-4c9a-92d7-819a10952886
> [2016-07-18 07:48:41.357109] I [server-handshake.c:585:server_setvolume]
> 0-voldata-server: accepted client from
> SN-2-2765-2016/07/18-07:48:41:308956-voldata-client-0-0-0 (version: 3.6.9)
> [2016-07-18 07:48:42.915110] I [login.c:82:gf_auth] 0-auth/login: allowed
> user names: 7a00c0d0-e728-4c9a-92d7-819a10952886
> [2016-07-18 07:48:42.915136] I [server-handshake.c:585:server_setvolume]
> 0-voldata-server: accepted client from
> SN-0-1894-2016/07/18-07:48:42:894933-voldata-client-0-0-0 (version: 3.6.9)
> [2016-07-18 07:48:53.873241] I [server-handshake.c:585:server_setvolume]
> 0-voldata-server: accepted client from
> MN-0-740-2016/07/18-07:48:53:295469-voldata-client-0-0-0 (version: 3.6.9)
> [2016-07-18 07:48:59.645589] I [server-handshake.c:585:server_setvolume]
> 0-voldata-server: accepted client from
> MN-1-728-2016/07/18-07:48:59:628747-voldata-client-0-0-0 (version: 3.6.9)
> [2016-07-18 07:56:06.939668] I [server.c:518:server_rpc_notify]
> 0-voldata-server: disconnecting connection from
> SN-1-1152-2016/07/18-07:48:09:699332-voldata-client-0-0-0
> [2016-07-18 07:56:06.939692] I [client_t.c:417:gf_client_unref]
> 0-voldata-server: Shutting down connection
> SN-1-1152-2016/07/18-07:48:09:699332-voldata-client-0-0-0
> [2016-07-18 07:56:06.939722] I [server.c:518:server_rpc_notify]
> 0-voldata-server: disconnecting connection from
> SN-2-1134-2016/07/18-07:48:10:12643-voldata-client-0-0-0
> [2016-07-18 07:56:06.939735] I [client_t.c:417:gf_client_unref]
> 0-voldata-server: Shutting down connection
> SN-2-1134-2016/07/18-07:48:10:12643-voldata-client-0-0-0
> [2016-07-18 07:56:06.942400] W [glusterfsd.c:1211:cleanup_and_exit] (--> 0-:
> received signum (15), shutting down

Pranith,

Do you want to check this behavior?

~Atin
> 
> After the last message, glusterfsd restarts and after a few minutes it comes
> on the same situation with above messages printed again. During the short
> period of operation we observed many disconnections and re-connections of
> gluster clients.
> 
> At the same time, glusterd prints the following messages:
> 
> /usr/lib64/glusterfs/glusterfs/3.6.9/xlator/mgmt/glusterd.
> so(glusterd_mgmt_v3_unlock+0x356)[0x7f9df8acc1f5] (-->
> /usr/lib64/glusterfs/glusterfs/3.6.9/xlator/mgmt/glusterd.
> so(__glusterd_peer_rpc_notify+0x289)[0x7f9df8a2da4b] (-->
> /usr/lib64/glusterfs/glusterfs/3.6.9/xlator/mgmt/glusterd.
> so(glusterd_big_locked_notify+0x61)[0x7f9df8a22384] (-->
> /usr/lib64/glusterfs/glusterfs/3.6.9/xlator/mgmt/glusterd.
> so(glusterd_peer_rpc_notify+0x38)[0x7f9df8a2dbea] ))))) 0-management: Lock
> for vol services not held
> The message "I [MSGID: 106006]
> [glusterd-handler.c:4290:__glusterd_nodesvc_rpc_notify] 0-management: nfs
> has disconnected from glusterd." repeated 39 times between [2016-07-18
> 07:58:08.475445] and [2016-07-18 08:00:06.414826]
> 
> and after a while connection with the other storage server and quorum server
> lost.
> 
> Could you please check and provide some help?
> 
> Version-Release number of selected component (if applicable):
> v3.6.9
> 
> How reproducible:
> Often
> 
> Steps to Reproduce:
> 1.
> 2.
> 3.
> 
> Actual results:
> 
> 
> Expected results:
> 
> 
> Additional info:
> 
> FYI we have back-ported the following patches from 3.7 branch:
> http://review.gluster.org/#/c/10829/
> http://review.gluster.org/#/c/13358/
> http://review.gluster.org/#/c/14052/
> http://review.gluster.org/#/c/12310/
> http://review.gluster.org/#/c/13392/

Comment 3 Kaushal 2016-08-23 12:59:39 UTC
GlusterFS-3.6 is nearing its End-Of-Life, only important security bugs still make a chance on getting fixed.
Moving this to the mainline 'version' as I still see similar disconnection messages in GlusterD logs.

Comment 4 Amar Tumballi 2018-10-08 17:29:30 UTC
Please try glusterfs-4.1.4 version, and see if this helps you to get to better state. There were multiple fixes in these areas which helps to get better n/w behavior.

Feel free to re-open this if it is still present!


Note You need to log in before you can comment on or make changes to this bug.