Bug 1264290

Summary:	glusterd lost track of brick port numbers after brick daemon dies
Product:	[Community] GlusterFS	Reporter:	Atin Mukherjee <amukherj>
Component:	glusterd	Assignee:	Atin Mukherjee <amukherj>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	mainline	CC:	amukherj, bugs, eivind
Target Milestone:	---	Keywords:	Triaged
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	glusterfs-4.1.3 (or higher)	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:	1264245	Environment:
Last Closed:	2018-08-29 03:18:32 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1221623, 1264245, 1369766
Bug Blocks:

Description Atin Mukherjee 2015-09-18 06:50:22 UTC

+++ This bug was initially created as a clone of Bug #1264245 +++

Description of problem:
After a brick daemon dies, glusterd loses track of the listen ports of new/future bricks on that node.
Two different error scenarios can result:
a) A replacement brick on the same node where a brick daemon previously died will not be healed.
b) A new volume created using a brick from the same server where a brick daemon previously died will not be replicated (by the client).


Version-Release number of selected component (if applicable):
3.7.4


How reproducible:
Every time.


Steps to Reproduce (scenarios a and b):
1a. Create a distributed-replicated 1x2 volume
2a. kill -9 <brick-pid>
3a. stop + delete volume
4a. replace-brick with another brick on the same node where <brick-pid> died (healing works)
5a. kill -9 <replacement-brick-pid>
6a. replace-brick with yet another brick (healing fails because the wrong port is used to connect to the new brick)
7a. grep "Connection refused" /var/log/glusterfs/glustershd.log

1b. Create a distributed-replicated 1x2 volume
2b. kill -9 <brick-pid>
3b. stop + delete volume
4b. Create a new 1x2 volume using the same (cleaned) bricks as in 1b
5b. Mount it.
6b. On client, grep "Connection refused" /var/log/glusterfs/<volname>.log

Actual results:

a. # grep "Connection refused" /var/log/glusterfs/glustershd.log
[2015-09-18 00:55:24.717023] E [socket.c:2278:socket_connect_finish] 0-voltest-client-0: connection to 192.168.1.3:49152 failed (Connection refused)

b. # grep "Connection refused" /var/log/glusterfs/voltest.log
[2015-09-18 00:44:59.117344] E [socket.c:2278:socket_connect_finish] 4-voltest-client-0: connection to 192.168.1.3:49152 failed (Connection refused)
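
To see at a glance which endpoint is being refused, the two log lines above can be parsed rather than read by eye. The following is a minimal sketch, assuming every relevant line follows the same layout as the two samples; the names REFUSED and refused_endpoints are made up for this illustration and are not part of GlusterFS.

```python
import re

# Layout assumed from the two sample lines in this report:
# [timestamp] E [file:line:function] <translator>: connection to <host>:<port> failed (Connection refused)
REFUSED = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+E\s+\[[^\]]*\]\s+"
    r"(?P<xlator>\S+):\s+connection to (?P<host>[\d.]+):(?P<port>\d+)\s+"
    r"failed \(Connection refused\)"
)


def refused_endpoints(lines):
    """Return (translator, host, port) for each 'Connection refused' line."""
    found = []
    for line in lines:
        match = REFUSED.search(line.strip())
        if match:
            found.append(
                (match.group("xlator"), match.group("host"), int(match.group("port")))
            )
    return found


# The two log lines quoted in this report.
sample = [
    "[2015-09-18 00:55:24.717023] E [socket.c:2278:socket_connect_finish] "
    "0-voltest-client-0: connection to 192.168.1.3:49152 failed (Connection refused)",
    "[2015-09-18 00:44:59.117344] E [socket.c:2278:socket_connect_finish] "
    "4-voltest-client-0: connection to 192.168.1.3:49152 failed (Connection refused)",
]

for xlator, host, port in refused_endpoints(sample):
    print(xlator, host, port)
```

Both scenarios report the same refused endpoint, 192.168.1.3:49152, so the client and the self-heal daemon are being pointed at the same port.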



Expected results:


Additional info:

Restarting glusterd after the brick daemon is killed prevents the "Connection refused" errors in both a) and b).
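
One plausible reading of these symptoms, consistent with the title of the review posted for this bug (pmap_registry_remove on brick disconnects), is that the port entry of a killed brick stays registered and is handed out again later. The toy model below illustrates that reading only. It is not glusterd code: the PortMap class, its methods, and the restart behaviour are all assumptions made for this sketch, and the root cause had not been confirmed at this point in the report.

```python
class PortMap:
    """Toy port registry (hypothetical, not glusterd's implementation)."""

    BASE_PORT = 49152  # first port seen in the logs of this report

    def __init__(self, remove_on_disconnect):
        self.remove_on_disconnect = remove_on_disconnect
        self.ports = {}    # port -> brick path, as remembered by the registry
        self.live = set()  # ports that really have a listening brick process

    def start_brick(self, brick):
        # Allocate the first port the registry believes is free.
        port = self.BASE_PORT
        while port in self.ports:
            port += 1
        self.ports[port] = brick
        self.live.add(port)
        return port

    def brick_killed(self, port):
        # kill -9: the process vanishes without signing out of the registry.
        self.live.discard(port)
        if self.remove_on_disconnect:
            del self.ports[port]

    def restart(self):
        # Assumed effect of restarting the daemon: forget dead entries.
        self.ports = {p: b for p, b in self.ports.items() if p in self.live}

    def lookup(self, brick):
        # Return the lowest registered port for the brick path.
        for port in sorted(self.ports):
            if self.ports[port] == brick:
                return port
        return None

    def is_reachable(self, port):
        return port in self.live


def reuse_brick(registry, restart_after_kill=False):
    """Kill a brick, start it again on the same path, return the port clients get."""
    old_port = registry.start_brick("/bricks/b1")
    registry.brick_killed(old_port)
    if restart_after_kill:
        registry.restart()
    registry.start_brick("/bricks/b1")
    return registry.lookup("/bricks/b1")


for label, registry, restart in [
    ("stale entry kept", PortMap(remove_on_disconnect=False), False),
    ("stale entry kept, daemon restarted", PortMap(remove_on_disconnect=False), True),
    ("entry removed on disconnect", PortMap(remove_on_disconnect=True), False),
]:
    port = reuse_brick(registry, restart)
    print(label, port, "reachable" if registry.is_reachable(port) else "refused")
```

In this model only the first case hands clients a port with no listener behind it, which mirrors the "Connection refused" on 49152 and the observation that a restart avoids it.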

--- Additional comment from Atin Mukherjee on 2015-09-17 23:45:47 EDT ---

Requesting the AFR team to check this.

--- Additional comment from Atin Mukherjee on 2015-09-18 00:19:31 EDT ---

Scenario b is reproducible. We will keep you posted once we have the RCA. Thanks for filing the bug.

--- Additional comment from Vijay Bellur on 2015-09-18 01:23:39 EDT ---

REVIEW: http://review.gluster.org/12189 (glusterd: Use GF_PMAP_PORT_BRICKSERVER in pmap_registry_remove from brick disconnects) posted (#1) for review on master by Atin Mukherjee (amukherj)

Comment 1 Vijay Bellur 2015-09-18 06:51:09 UTC

REVIEW: http://review.gluster.org/12189 (glusterd: Use GF_PMAP_PORT_BRICKSERVER in pmap_registry_remove from brick disconnects) posted (#2) for review on master by Atin Mukherjee (amukherj)

Comment 2 Atin Mukherjee 2016-02-14 03:19:44 UTC

The patch which addresses this issue is http://review.gluster.org/#/c/10785/ . Since Gaurav is the author of the patch, assigning it to him.

Comment 3 Mike McCune 2016-03-28 22:39:30 UTC

This bug was accidentally moved from POST to MODIFIED via an error in automation. Please contact mmccune with any questions.

Comment 5 Atin Mukherjee 2017-01-25 08:39:41 UTC

https://review.gluster.org/#/c/15005/ has fixed this issue.