Bug 1264245

Summary: glusterd lost track of brick port numbers after brick daemon dies
Product: [Community] GlusterFS
Reporter: Eivind Sarto <eivind>
Component: glusterd
Assignee: Atin Mukherjee <amukherj>
Status: CLOSED EOL
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.7.4
CC: amukherj, bugs
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Cloned As: 1264290
Environment:
Last Closed: 2017-03-08 10:54:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1264290

Description Eivind Sarto 2015-09-18 01:06:15 UTC
Description of problem:
After a brick daemon dies, glusterd loses track of new/future brick listen ports.
Two different error scenarios can result:
a) A replacement brick on the same node where a brick daemon previously died will not be healed.
b) A new volume created using a brick from the same node where a brick daemon previously died will not be replicated (by the client).


Version-Release number of selected component (if applicable):
3.7.4


How reproducible:
Every time.


Steps to Reproduce (both scenario a+b):
1a. Create a distributed-replicated 1x2 volume
2a. kill -9 <brick-pid>
3a. stop + delete volume
4a. replace-brick with another brick on same node where <brick-pid> died (healing works)
5a. kill -9 <replacement-brick-pid>
6a. replace-brick with yet another brick (healing fails because a stale port is used to connect to the new brick)
7a. grep "Connection refused" /var/log/glusterfs/glustershd.log
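The scenario (a) steps above can be sketched as a shell script. This is a minimal sketch, not a verified reproducer: the host names (node1/node2), volume name, and brick paths are hypothetical, the commands assume a two-node GlusterFS 3.7.4 cluster with glusterd running, and the sketch keeps the volume up throughout so that replace-brick (step 4a) has a volume to operate on. The live commands are wrapped in a function and only run where a gluster CLI is actually present.

```shell
# Sketch of the scenario (a) reproduction. Hypothetical names throughout:
# hosts node1/node2, volume voltest, brick paths /bricks/b1..b3.
reproduce_scenario_a() {
    # 1a. Create and start a replicated 1x2 volume.
    gluster volume create voltest replica 2 \
        node1:/bricks/b1 node2:/bricks/b1 force
    gluster volume start voltest

    # 2a. Kill one brick daemon out-of-band so glusterd sees only a disconnect.
    kill -9 "$(pgrep -f 'glusterfsd.*voltest.*bricks.b1' | head -1)"

    # 4a. The first replacement brick on the same node still heals.
    gluster volume replace-brick voltest \
        node1:/bricks/b1 node1:/bricks/b2 commit force

    # 5a/6a. Kill the replacement brick and replace again; healing now fails
    # because the self-heal daemon is handed a stale port for the new brick.
    kill -9 "$(pgrep -f 'glusterfsd.*voltest.*bricks.b2' | head -1)"
    gluster volume replace-brick voltest \
        node1:/bricks/b2 node1:/bricks/b3 commit force

    # 7a. The stale port shows up as a refused connection in the shd log.
    grep "Connection refused" /var/log/glusterfs/glustershd.log
}

# Only attempt a live run on a node that actually has the gluster CLI.
if command -v gluster >/dev/null 2>&1; then
    reproduce_scenario_a
else
    echo "gluster CLI not found; skipping live reproduction"
fi
```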

1b. Create a distributed-replicated 1x2 volume
2b. kill -9 <brick-pid>
3b. stop + delete volume
4b. Create new 1x2 volume using same (cleaned) bricks as in 1b
5b. Mount it on a client.
6b. On client, grep "Connection refused" /var/log/glusterfs/<volname>.log
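The scenario (b) steps can be sketched the same way. Again a hedged sketch, not a verified reproducer: host names, volume name, brick paths, and the mount point are hypothetical, the brick cleanup in step 4b is assumed to be a simple wipe of the brick directories on both nodes, and the live commands only run where a gluster CLI exists.

```shell
# Sketch of the scenario (b) reproduction. Hypothetical names throughout:
# hosts node1/node2, volume voltest, brick path /bricks/b1, mount /mnt/voltest.
reproduce_scenario_b() {
    # 1b. Create and start a replicated 1x2 volume.
    gluster volume create voltest replica 2 \
        node1:/bricks/b1 node2:/bricks/b1 force
    gluster volume start voltest

    # 2b/3b. Kill one brick daemon, then stop and delete the volume.
    kill -9 "$(pgrep -f 'glusterfsd.*voltest.*bricks.b1' | head -1)"
    gluster volume stop voltest
    gluster volume delete voltest

    # 4b. Recreate a 1x2 volume on the same, cleaned, brick directories
    # (assumed cleanup: wipe the brick contents on node2 as well).
    rm -rf /bricks/b1/.glusterfs
    gluster volume create voltest replica 2 \
        node1:/bricks/b1 node2:/bricks/b1 force
    gluster volume start voltest

    # 5b. Mount it on a client.
    mount -t glusterfs node1:/voltest /mnt/voltest

    # 6b. The client keeps retrying the killed brick's stale port.
    grep "Connection refused" /var/log/glusterfs/voltest.log
}

# Only attempt a live run on a node that actually has the gluster CLI.
if command -v gluster >/dev/null 2>&1; then
    reproduce_scenario_b
else
    echo "gluster CLI not found; skipping live reproduction"
fi
```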

Actual results:

a. # grep "Connection refused" /var/log/glusterfs/glustershd.log
[2015-09-18 00:55:24.717023] E [socket.c:2278:socket_connect_finish] 0-voltest-client-0: connection to 192.168.1.3:49152 failed (Connection refused)

b. # grep "Connection refused" /var/log/glusterfs/voltest.log
[2015-09-18 00:44:59.117344] E [socket.c:2278:socket_connect_finish] 4-voltest-client-0: connection to 192.168.1.3:49152 failed (Connection refused)



Expected results:


Additional info:

Restarting glusterd after the brick daemon is killed prevents the "Connection refused" errors in both a) and b).

Comment 1 Atin Mukherjee 2015-09-18 03:45:47 UTC
Request AFR team to check this.

Comment 2 Atin Mukherjee 2015-09-18 04:19:31 UTC
Scenario b is reproducible. We will keep you posted once we have the RCA. Thanks for filing the bug.

Comment 3 Vijay Bellur 2015-09-18 05:23:39 UTC
REVIEW: http://review.gluster.org/12189 (glusterd: Use GF_PMAP_PORT_BRICKSERVER in pmap_registry_remove from brick disconnects) posted (#1) for review on master by Atin Mukherjee (amukherj)

Comment 4 Atin Mukherjee 2015-09-18 06:52:14 UTC
(In reply to Vijay Bellur from comment #3)
> REVIEW: http://review.gluster.org/12189 (glusterd: Use
> GF_PMAP_PORT_BRICKSERVER in pmap_registry_remove from brick disconnects)
> posted (#1) for review on master by Atin Mukherjee (amukherj)

The patch has been posted on mainline; moving the state to Assigned.

Comment 5 Kaushal 2017-03-08 10:54:34 UTC
This bug is being closed because GlusterFS-3.7 has reached its end of life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.