+++ This bug was initially created as a clone of Bug #1264245 +++

Description of problem:
After a brick daemon dies, glusterd lost track of new/future brick listen ports.

Two different error scenarios can happen:
a) A replacement brick from the same node where a brick daemon previously died will not be healed.
b) A new volume created using a brick from same server where a brick daemon previously died will not be replicated (by the client)

Version-Release number of selected component (if applicable):
3.7.4

How reproducible:
Every time.

Steps to Reproduce (both scenario a+b):
1a. Create a distributed-replicated 1x2 volume
2a. kill -9 <brick-pid>
3a. stop + delete volume
4a. replace-brick with another brick on same node where <brick-pid> died (healing works)
5a. kill -9 <replacement-brick-pid>
6a. replace-brick with yet another brick (healing fails because wrong pid is used to connect to new brick)
7a. grep "Connection refused" /var/log/glusterfs/glustershd.log

1b. Create a distributed-replicated 1x2 volume
2b. kill -9 <brick-pid>
3b. stop + delete volume
4b. Create new 1x2 volume using same (cleaned) bricks as in 1b
5b. mount it.
6b. On client, grep "Connection refused" /var/log/glusterfs/<volname>.log

Actual results:
a.
# grep "Connection refused" /var/log/glusterfs/glustershd.log
[2015-09-18 00:55:24.717023] E [socket.c:2278:socket_connect_finish] 0-voltest-client-0: connection to 192.168.1.3:49152 failed (Connection refused)

b.
# grep "Connection refused" /var/log/glusterfs/voltest.log
[2015-09-18 00:44:59.117344] E [socket.c:2278:socket_connect_finish] 4-voltest-client-0: connection to 192.168.1.3:49152 failed (Connection refused)

Expected results:

Additional info:
Restarting glusterd after the brick daemon is killed will prevent the "Connection refused" in both a) and b)

--- Additional comment from Atin Mukherjee on 2015-09-17 23:45:47 EDT ---

Request AFR team to check this.

--- Additional comment from Atin Mukherjee on 2015-09-18 00:19:31 EDT ---

Scenario b is reproducible. We will keep you posted once we have the RCA. Thanks for filing the bug.

--- Additional comment from Vijay Bellur on 2015-09-18 01:23:39 EDT ---

REVIEW: http://review.gluster.org/12189 (glusterd: Use GF_PMAP_PORT_BRICKSERVER in pmap_registry_remove from brick disconnects) posted (#1) for review on master by Atin Mukherjee (amukherj)
REVIEW: http://review.gluster.org/12189 (glusterd: Use GF_PMAP_PORT_BRICKSERVER in pmap_registry_remove from brick disconnects) posted (#2) for review on master by Atin Mukherjee (amukherj)
The patch which addresses this issue is http://review.gluster.org/#/c/10785/ . Since Gaurav is the author of the patch, assigning it to him.
This bug was accidentally moved from POST to MODIFIED via an error in automation. Please see mmccune with any questions.
https://review.gluster.org/#/c/15005/ has fixed this issue.
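
For anyone re-checking either scenario after the fix, the grep steps (7a / 6b) can be replaced by a small script that also reports which host:port the client kept trying. This is only a rough sketch: it assumes the log lines look exactly like the two shown under "Actual results", and the function name refused_endpoints and the regex are illustrative, not part of GlusterFS. IPv6 addresses are not handled.

```python
import re
from collections import Counter

# Assumed line shape, copied from the "Actual results" in this report:
# [2015-09-18 00:55:24.717023] E [socket.c:2278:socket_connect_finish]
#   0-voltest-client-0: connection to 192.168.1.3:49152 failed (Connection refused)
REFUSED_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+E\s+\[[^\]]*socket_connect_finish\]\s+"
    r"(?P<xlator>\S+):\s+connection to (?P<host>[^\s:]+):(?P<port>\d+)\s+"
    r"failed \(Connection refused\)"
)


def refused_endpoints(lines):
    """Count "Connection refused" errors per (host, port) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = REFUSED_RE.match(line.strip())
        if match:
            counts[(match.group("host"), int(match.group("port")))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python refused.py /var/log/glusterfs/glustershd.log
    with open(sys.argv[1], errors="replace") as log:
        for (host, port), n in sorted(refused_endpoints(log).items()):
            print(f"{host}:{port} refused {n} time(s)")
```

A port that keeps showing up here after the brick has been replaced or the volume recreated is the stale listen port described in this bug; no output means the symptom was not seen in that log.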