Bug 1264245 - glusterd lost track of brick port numbers after brick daemon dies
Summary: glusterd lost track of brick port numbers after brick daemon dies
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.7.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Atin Mukherjee
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1264290
 
Reported: 2015-09-18 01:06 UTC by Eivind Sarto
Modified: 2017-03-08 10:54 UTC (History)
2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1264290 (view as bug list)
Environment:
Last Closed: 2017-03-08 10:54:34 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Eivind Sarto 2015-09-18 01:06:15 UTC
Description of problem:
After a brick daemon dies, glusterd loses track of the listen ports assigned to new/future bricks.
Two different error scenarios can happen:
a) A replacement brick on the same node where a brick daemon previously died will not be healed.
b) A new volume created using a brick from the same server where a brick daemon previously died will not be replicated (by the client).


Version-Release number of selected component (if applicable):
3.7.4


How reproducible:
Every time.


Steps to Reproduce (both scenario a+b):
1a. Create a distributed-replicated 1x2 volume
2a. kill -9 <brick-pid>
3a. stop + delete volume
4a. replace-brick with another brick on the same node where <brick-pid> died (healing works)
5a. kill -9 <replacement-brick-pid>
6a. replace-brick with yet another brick (healing fails because a stale port is used to connect to the new brick)
7a. grep "Connection refused" /var/log/glusterfs/glustershd.log
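Steps 4a-7a above correspond roughly to the CLI session below. The volume name, node name, and brick paths are placeholders, not taken from the report, and the replace-brick invocation uses the 3.7-era syntax:

```shell
# Hypothetical commands for steps 4a-7a; "voltest", "node2" and the brick
# paths are placeholders. Assumes a running 1x2 replicated volume "voltest".

# 4a. Replace the brick whose daemon was killed with a fresh one on the
#     same node (GlusterFS 3.7-era replace-brick syntax).
gluster volume replace-brick voltest node2:/bricks/b1 node2:/bricks/b2 commit force

# 5a. Kill the replacement brick's daemon (run on node2).
kill -9 "$(pgrep -f 'glusterfsd.*bricks/b2')"

# 6a. Replace it again; self-heal now tries to connect to a stale port.
gluster volume replace-brick voltest node2:/bricks/b2 node2:/bricks/b3 commit force

# 7a. The self-heal daemon logs the refused connection.
grep "Connection refused" /var/log/glusterfs/glustershd.log
```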

1b. Create a distributed-replicated 1x2 volume
2b. kill -9 <brick-pid>
3b. stop + delete volume
4b. Create new 1x2 volume using same (cleaned) bricks as in 1b
5b. mount it.
6b. On client, grep "Connection refused" /var/log/glusterfs/<volname>.log
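The scenario b steps might look like the following session; node names, brick paths, the volume name, and the mount point are placeholders rather than details from the report:

```shell
# Hypothetical session for steps 1b-6b; requires a live two-node
# GlusterFS 3.7 setup. All names below are placeholders.

# 1b. Create and start a 1x2 replicated volume.
gluster volume create voltest replica 2 node1:/bricks/b1 node2:/bricks/b1 force
gluster volume start voltest

# 2b. Kill one brick daemon (run on node2).
kill -9 "$(pgrep -f 'glusterfsd.*bricks/b1')"

# 3b. Stop and delete the volume (--mode=script skips the confirmation prompts).
gluster --mode=script volume stop voltest
gluster --mode=script volume delete voltest

# 4b. Clean the bricks (on both nodes), then recreate the same volume.
rm -rf /bricks/b1/.glusterfs
setfattr -x trusted.glusterfs.volume-id /bricks/b1
gluster volume create voltest replica 2 node1:/bricks/b1 node2:/bricks/b1 force
gluster volume start voltest

# 5b. Mount it on a client.
mount -t glusterfs node1:/voltest /mnt/voltest

# 6b. The client log (named after the mount point on typical setups) shows
#     the refused connection to the stale port.
grep "Connection refused" /var/log/glusterfs/mnt-voltest.log
```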

Actual results:

a. # grep "Connection refused" /var/log/glusterfs/glustershd.log
[2015-09-18 00:55:24.717023] E [socket.c:2278:socket_connect_finish] 0-voltest-client-0: connection to 192.168.1.3:49152 failed (Connection refused)

b. # grep "Connection refused" /var/log/glusterfs/voltest.log
[2015-09-18 00:44:59.117344] E [socket.c:2278:socket_connect_finish] 4-voltest-client-0: connection to 192.168.1.3:49152 failed (Connection refused)



Expected results:


Additional info:

Restarting glusterd after the brick daemon is killed will prevent the "Connection refused" in both a) and b)
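On systemd-based distributions the workaround described above would typically amount to (assuming the stock unit name):

```shell
# Restarting glusterd rebuilds its view of brick listen ports, which avoids
# the stale-port "Connection refused" errors. A workaround, not a fix.
systemctl restart glusterd
```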

Comment 1 Atin Mukherjee 2015-09-18 03:45:47 UTC
Request AFR team to check this.

Comment 2 Atin Mukherjee 2015-09-18 04:19:31 UTC
Scenario b is reproducible. We will keep you posted once we have the RCA. Thanks for filing the bug.

Comment 3 Vijay Bellur 2015-09-18 05:23:39 UTC
REVIEW: http://review.gluster.org/12189 (glusterd: Use GF_PMAP_PORT_BRICKSERVER in pmap_registry_remove from brick disconnects) posted (#1) for review on master by Atin Mukherjee (amukherj)

Comment 4 Atin Mukherjee 2015-09-18 06:52:14 UTC
(In reply to Vijay Bellur from comment #3)
> REVIEW: http://review.gluster.org/12189 (glusterd: Use
> GF_PMAP_PORT_BRICKSERVER in pmap_registry_remove from brick disconnects)
> posted (#1) for review on master by Atin Mukherjee (amukherj)

This patch has been posted against mainline; moving the state to Assigned.
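The patch title refers to glusterd's portmap (pmap) registry, which maps brick paths to listen ports. The following is a deliberately simplified, hypothetical Python model of such a registry, not GlusterFS's actual code, illustrating why a disconnect path that fails to deregister a dead brick leaves a stale brick-to-port mapping behind:

```python
# Hypothetical, simplified model of a brick portmap registry. This is NOT
# GlusterFS's pmap implementation; it only illustrates the failure mode.

class PortmapRegistry:
    def __init__(self, base_port=49152):
        self.base_port = base_port
        self.brick_to_port = {}   # brick path -> assigned port
        self.in_use = set()       # ports currently handed out

    def register(self, brickpath):
        """Hand out the lowest free port at or above base_port."""
        port = self.base_port
        while port in self.in_use:
            port += 1
        self.in_use.add(port)
        self.brick_to_port[brickpath] = port
        return port

    def deregister(self, brickpath):
        """Correct behaviour on brick disconnect: free the port and drop
        the mapping so future lookups cannot return a dead port."""
        port = self.brick_to_port.pop(brickpath, None)
        if port is not None:
            self.in_use.discard(port)

    def lookup(self, brickpath):
        return self.brick_to_port.get(brickpath)


reg = PortmapRegistry()
reg.register("/bricks/b1")                    # first brick gets 49152

# If the brick daemon is killed but the disconnect path never calls
# deregister(), the stale entry survives: a lookup still answers 49152
# even though no process listens there, hence "Connection refused".
assert reg.lookup("/bricks/b1") == 49152

# With proper deregistration the stale answer disappears and the port
# becomes available for the replacement brick.
reg.deregister("/bricks/b1")
assert reg.lookup("/bricks/b1") is None
assert reg.register("/bricks/b2") == 49152
```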

Comment 5 Kaushal 2017-03-08 10:54:34 UTC
This bug is being closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.

