Bug 1581184

Summary: After creating and starting 601 volumes, the self-heal daemon went down and continuous warning messages appear in the glusterd log
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Bala Konda Reddy M <bmekala>
Component: glusterd
Assignee: Sanju <srakonde>
Status: CLOSED ERRATA
QA Contact: Bala Konda Reddy M <bmekala>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.4
CC: bmekala, rhs-bugs, sankarshan, sheggodu, srakonde, storage-qa-internal, vbellur, vdas
Target Milestone: ---
Target Release: RHGS 3.4.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.12.2-13
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1589253 (view as bug list)
Environment:
Last Closed: 2018-09-04 06:48:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1503137, 1589253

Description Bala Konda Reddy M 2018-05-22 09:56:22 UTC
Description of problem:
--------------------------------------------------------------------
On a three-node cluster, created and started 600 replicate (2 x 3) volumes. All the bricks and the self-heal daemon were running properly. After creating and starting one more volume of the same type (2 x 3), the self-heal daemon stopped running and the following warning is logged in the glusterd log every 7 seconds.
---------------------------------------------------------------------
[2018-05-22 09:10:54.352926] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
[2018-05-22 09:11:01.354185] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
[2018-05-22 09:11:08.355858] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
[2018-05-22 09:11:15.358315] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
[2018-05-22 09:11:22.360205] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
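
(For reference, the usual way to confirm from a node that the daemon is actually down is the pair of checks below. These are standard gluster/OS commands, not output captured for this bug.)

# is a glustershd process still running on this node?
ps aux | grep '[g]lustershd'

# what does glusterd report for the self-heal daemon of a started volume?
gluster volume status deadpool | grep -i 'self-heal'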


Version-Release number of selected component (if applicable):
3.12.2-9

How reproducible:
1/1

Steps to Reproduce:
1. On a three-node cluster, create 600 volumes of type replicate (2 x 3) and start them using a script (see the sketch after this list)
2. Create one more volume of type replicate (2 x 3) and start it
3. The volume starts successfully
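
The creation script itself is not attached to this bug. A minimal sketch of the kind of loop used for step 1 is below, assuming the three peers shown under Additional info; the volume-name prefix and per-volume brick directories are placeholders and are not taken from this report:

#!/bin/bash
# Hypothetical helper for step 1: create and start N replicate (2 x 3) volumes
# on a three-node cluster. Assumes the brick directories already exist on each
# peer; the host addresses match the Additional info section, everything else
# is a placeholder.
N=${1:-600}
HOSTS=(10.70.37.214 10.70.37.178 10.70.37.46)

for i in $(seq 1 "$N"); do
    vol="testvol-$i"
    gluster volume create "$vol" replica 3 \
        "${HOSTS[0]}:/bricks/brick0/$vol" \
        "${HOSTS[1]}:/bricks/brick0/$vol" \
        "${HOSTS[2]}:/bricks/brick0/$vol" \
        "${HOSTS[0]}:/bricks/brick1/$vol" \
        "${HOSTS[1]}:/bricks/brick1/$vol" \
        "${HOSTS[2]}:/bricks/brick1/$vol" \
        force
    gluster volume start "$vol"
done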

Actual results:
The self-heal daemon went down, and warning messages like the following are logged in the glusterd log every 7 seconds:

[2018-05-22 08:48:09.064406] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
[2018-05-22 08:48:16.065553] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
[2018-05-22 08:48:23.066968] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
[2018-05-22 08:48:30.068186] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
[2018-05-22 08:48:37.069355] W [socket.c:3266:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket, (No such file or directory)
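
The path in these warnings is the unix socket glusterd uses to connect to the self-heal daemon. A quick check on the affected node (a generic command, not output captured for this bug):

# the socket named in the warning should exist while glustershd is up
ls -l /var/run/gluster/a218720a3b016edcafc4598e18d17126.socket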

Expected results:
The self-heal daemon should remain running after the additional volume is created and started.

Additional info:

[root@dhcp37-214 ~]# gluster vol info deadpool
 
Volume Name: deadpool
Type: Distributed-Replicate
Volume ID: 25cf7f2f-3369-4ffc-8349-ce7c146b9ff2
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.37.214:/bricks/brick0/rel
Brick2: 10.70.37.178:/bricks/brick0/rel
Brick3: 10.70.37.46:/bricks/brick0/rel
Brick4: 10.70.37.214:/bricks/brick1/rel
Brick5: 10.70.37.178:/bricks/brick1/rel
Brick6: 10.70.37.46:/bricks/brick1/rel
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Comment 16 Bala Konda Reddy M 2018-07-10 14:13:07 UTC
Build: 3.12.2-13

Followed the steps mentioned in the description: created n volumes using the script and then created the (n+1)th volume manually. All processes (brick processes and the self-heal daemon) are running, and there are no warning messages in the glusterd log.
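
(For completeness, the kind of spot checks behind this verification; the commands are the standard ones and the glusterd log path is the usual default, neither is quoted from the actual verification run.)

# self-heal daemon should show up as running for every started volume
gluster volume status | grep -i 'self-heal'

# and no new reconnect warnings should be accumulating in the glusterd log
grep -c 'Ignore failed connection attempt' /var/log/glusterfs/glusterd.log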

Hence marking the bug as verified

Comment 17 errata-xmlrpc 2018-09-04 06:48:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607