Bug 1460225 - Not cleaning up stale socket file is resulting in spamming glusterd logs with warnings of "got disconnect from stale rpc"
Summary: Not cleaning up stale socket file is resulting in spamming glusterd logs with warnings of "got disconnect from stale rpc"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Atin Mukherjee
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1459900
 
Reported: 2017-06-09 12:15 UTC by Atin Mukherjee
Modified: 2017-09-05 17:33 UTC
CC List: 6 users

Fixed In Version: glusterfs-3.12.0
Clone Of: 1459900
Environment:
Last Closed: 2017-09-05 17:33:35 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Comment 1 Atin Mukherjee 2017-06-09 12:23:50 UTC
Description of problem:
======================
On a brick-multiplex setup, when a brick process is killed and the volume is then started with "volume start ... force", the stale socket file left behind results in glusterd spamming its log with the message below:
[2017-06-08 13:36:09.699089] W [glusterd-handler.c:5678:__glusterd_brick_rpc_notify] 0-management: got disconnect from stale rpc on /rhs/brick31/test3_31

How reproducible:
=================
Always

Steps to Reproduce:
1. Enable brick multiplexing and create 30 volumes, say v1..v30.
2. Kill brick b1 with SIGKILL (say the base volume for the glusterfsd process is v1).
3. Do a vol start force of a volume other than the base volume, say v25.
4. Do a vol start force of all the volumes: for i in $(gluster v list); do gluster v start $i force; done
5. From here on the glusterd log is spammed indefinitely with the error below, because the base volume's glusterfsd socket file is stale but still present (in this case the stale socket file belongs to base volume test3_31). A consolidated shell sketch of these steps follows the log excerpt.

[2017-06-08 13:37:45.712948] W [glusterd-handler.c:5678:__glusterd_brick_rpc_notify] 0-management: got disconnect from stale rpc on /rhs/brick31/test3_31
[2017-06-08 13:37:48.713752] W [glusterd-handler.c:5678:__glusterd_brick_rpc_notify] 0-management: got disconnect from stale rpc on /rhs/brick31/test3_31
[2017-06-08 13:37:51.713870] W [glusterd-handler.c:5678:__glusterd_brick_rpc_notify] 0-management: got disconnect from stale rpc on /rhs/brick31/test3_31
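
A consolidated shell sketch of the steps above, for convenience; the brick paths (/bricks/b*), the single-node layout and the pgrep/tail commands are illustrative assumptions, not taken from the report:

# 1. Enable brick multiplexing and create/start 30 single-brick volumes.
gluster volume set all cluster.brick-multiplex on
for i in $(seq 1 30); do
    gluster volume create v$i $(hostname):/bricks/b$i force
    gluster volume start v$i
done

# 2. SIGKILL the multiplexed brick process so its signal handler (and the
#    socket cleanup it would trigger) never runs.
kill -9 $(pgrep -x glusterfsd)

# 3./4. Force-start a non-base volume, then all the volumes.
gluster volume start v25 force
for i in $(gluster volume list); do gluster volume start "$i" force; done

# 5. Watch glusterd's log for the repeating stale-rpc warning.
tail -f /var/log/glusterfs/glusterd.log | grep 'stale rpc'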


Workaround:
===========
Delete the stale socket file left behind by the killed brick process.
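
A sketch of the workaround, assuming the brick socket files live under /var/run/gluster (or under /var/lib/glusterd on some releases) with a .socket suffix; confirm the location for your build before deleting anything:

# List candidate brick socket files.
find /var/run/gluster /var/lib/glusterd -name '*.socket' 2>/dev/null
# Remove the socket left behind by the SIGKILLed brick; the file name is
# a placeholder.
rm -f /var/run/gluster/<stale-brick>.socket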

RCA:

This only happens when the brick process is killed with SIGKILL, not SIGTERM. Because SIGKILL cannot be handled, the brick's signal handler was never invoked and the usual cleanup did not run, leaving a stale socket file behind; that is why we see a constant stream of stale disconnects. The gf_log instance can be converted to GF_LOG_OCCASIONALLY to avoid this flood.

Comment 2 Worker Ant 2017-06-09 12:24:29 UTC
REVIEW: https://review.gluster.org/17499 (glusterd: log stale rpc disconnects occasionally) posted (#1) for review on master by Atin Mukherjee (amukherj)

Comment 3 Worker Ant 2017-06-09 22:44:29 UTC
COMMIT: https://review.gluster.org/17499 committed in master by Jeff Darcy (jeff.us) 
------
commit 801697cc08928660a8087d08122a3aed622f6790
Author: Atin Mukherjee <amukherj>
Date:   Fri Jun 9 17:10:00 2017 +0530

    glusterd: log stale rpc disconnects occasionally
    
    There might be situations where if a brick process is killed through
    SIGKILL (not SIGTERM) when brick mux is enabled glusterd will continue to
    receive disconnect events from the stale rpc which might flood the
    glusterd log. Fix is to use GF_LOG_OCCASIONALLY.
    
    Change-Id: I95a10c8be2346614e0a3458f98d9f99aab34800a
    BUG: 1460225
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: https://review.gluster.org/17499
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jeff.us>

Comment 4 Shyamsundar 2017-09-05 17:33:35 UTC
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/

