Bug 1589253 - After creating and starting 601 volumes, self heal daemon went down and seeing continuous warning messages in glusterd log
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Sanju
QA Contact:
URL:
Whiteboard:
Depends On: 1581184
Blocks:
 
Reported: 2018-06-08 14:18 UTC by Sanju
Modified: 2018-10-23 15:11 UTC
CC List: 10 users

Fixed In Version: glusterfs-5.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1581184
Environment:
Last Closed: 2018-10-23 15:11:13 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Comment 1 Sanju 2018-06-08 14:20:53 UTC
Description of problem:
--------------------------------------------------------------------
On a three-node cluster, created and started 600 volumes of type 2x3 (distributed-replicate). All the bricks and the self-heal daemon were running properly. Then created and started one more 2x3 volume; the self-heal daemon stopped running and the following warning is logged every 7 seconds.
---------------------------------------------------------------------
[2018-05-22 09:10:54.352926] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
[2018-05-22 09:11:01.354185] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
[2018-05-22 09:11:08.355858] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
[2018-05-22 09:11:15.358315] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
[2018-05-22 09:11:22.360205] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)


Version-Release number of selected component (if applicable):


How reproducible:
1/1

Steps to Reproduce:
1. On a three-node cluster, created 600 volumes of type distributed-replicate (2x3) and started them using a script
2. Created one more distributed-replicate 2x3 volume and started it
3. The volume started successfully

Actual results:
The self-heal daemon went down, and the following warning message is logged every 7 seconds:

[2018-05-22 08:48:09.064406] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
[2018-05-22 08:48:16.065553] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
[2018-05-22 08:48:23.066968] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
[2018-05-22 08:48:30.068186] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
[2018-05-22 08:48:37.069355] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)

Expected results:
The self-heal daemon should remain running.

Additional info:

[root@dhcp37-214 ~]# gluster vol info deadpool
 
Volume Name: deadpool
Type: Distributed-Replicate
Volume ID: 25cf7f2f-3369-4ffc-8349-ce7c146b9ff2
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.37.214:/bricks/brick0/rel
Brick2: 10.70.37.178:/bricks/brick0/rel
Brick3: 10.70.37.46:/bricks/brick0/rel
Brick4: 10.70.37.214:/bricks/brick1/rel
Brick5: 10.70.37.178:/bricks/brick1/rel
Brick6: 10.70.37.46:/bricks/brick1/rel
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Comment 2 Worker Ant 2018-06-08 14:24:28 UTC
REVIEW: https://review.gluster.org/20197 (glusterd: Fix for shd status) posted (#1) for review on master by Sanju Rakonde

Comment 3 Worker Ant 2018-06-13 15:34:39 UTC
COMMIT: https://review.gluster.org/20197 committed in master by "Atin Mukherjee" <amukherj> with a commit message- glusterd: Fix for shd not coming up

Problem: After creating and starting n (where n is large) distributed-replicate
volumes using a script, if we create and start the (n+1)th distributed-replicate
volume manually, the self-heal daemon goes down.

Solution: In glusterd_proc_stop, if the process is still running after it
has been sent SIGTERM, we send it SIGKILL. Since SIGKILL does not perform
any cleanup, we need to remove the pidfile ourselves.

Fixes: bz#1589253
Change-Id: I7c114334eec74c8d0f21b3e45cf7db6b8ef28af1
Signed-off-by: Sanju Rakonde <srakonde>
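
For illustration, here is a minimal sketch in plain C of the stop-and-cleanup logic described in the commit message above. This is not the actual glusterd_proc_stop() code from GlusterFS; the function name proc_stop_sketch, the 1-second wait, and the error handling are illustrative assumptions.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch: ask a daemon to stop, escalate to SIGKILL if needed, and
 * remove the pidfile so no stale state is left behind. */
static int
proc_stop_sketch(pid_t pid, const char *pidfile)
{
        if (kill(pid, SIGTERM) != 0 && errno != ESRCH)
                return -1;

        sleep(1);                    /* give the daemon a moment to exit */

        if (kill(pid, 0) == 0) {     /* still alive after SIGTERM? */
                kill(pid, SIGKILL);  /* force-kill it */
                /* SIGKILL gives the process no chance to clean up, so the
                 * stale pidfile must be removed here (this is the cleanup
                 * the commit above adds). */
                if (unlink(pidfile) != 0 && errno != ENOENT)
                        fprintf(stderr, "failed to remove %s\n", pidfile);
        }
        return 0;
}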

Comment 4 Worker Ant 2018-06-14 15:21:09 UTC
REVIEW: https://review.gluster.org/20277 (glusterd: removing the unnecessary glusterd message) posted (#1) for review on master by Sanju Rakonde

Comment 5 Worker Ant 2018-06-14 21:35:52 UTC
COMMIT: https://review.gluster.org/20277 committed in master by "Atin Mukherjee" <amukherj> with a commit message- glusterd: removing the unnecessary glusterd message

Fixes: bz#1589253
Change-Id: I5510250a3d094e19e471b3ee47bf13ea9ee8aff5
Signed-off-by: Sanju Rakonde <srakonde>

Comment 6 Shyamsundar 2018-10-23 15:11:13 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/

