| Summary: | glustershd and FUSE logs are getting spammed with continuous "Connection refused" error messages | | |
|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | Prasad Desala <tdesala> |
| Component: | glusterd | Assignee: | Gaurav Yadav <gyadav> |
| Status: | CLOSED WORKSFORME | QA Contact: | Byreddy <bsrirama> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.2 | CC: | amukherj, rhs-bugs, sbairagy, storage-qa-internal, tdesala, vbellur |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-12-21 05:57:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Prasad Desala
2016-11-09 13:57:20 UTC
Looking at the setup, especially the stale portmap entry, it does look like you have deleted and recreated volumes with the same name and brick path, and killed the brick process with kill -9 instead of kill -15. Please confirm.
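For background on why the signal matters: SIGTERM (kill -15) can be caught, so a brick process gets a chance to deregister its port with glusterd on the way out, whereas SIGKILL (kill -9) cannot be caught, so no sign-out is ever sent. A minimal sketch of that asymmetry; pmap_signout() here is a hypothetical stand-in for the brick's real PMAP_SIGNOUT RPC, not gluster code:

```c
#include <signal.h>
#include <unistd.h>

/* Hypothetical stand-in for the brick's PMAP_SIGNOUT RPC, which tells
 * glusterd to drop this brick's entry from the portmap registry. */
static void pmap_signout(void)
{
    static const char msg[] = "PMAP_SIGNOUT sent; portmap entry removed\n";
    (void)write(STDOUT_FILENO, msg, sizeof(msg) - 1); /* async-signal-safe */
}

/* kill -15 delivers SIGTERM, which the process can catch and clean up on. */
static void on_sigterm(int sig)
{
    (void)sig;
    pmap_signout();
    _exit(0);
}

int main(void)
{
    signal(SIGTERM, on_sigterm);
    /* kill -9 delivers SIGKILL, which cannot be caught, blocked, or
     * ignored: the process dies without ever sending a sign-out, and the
     * portmap entry in glusterd goes stale. */
    for (;;)
        pause(); /* wait for a signal */
}
```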
Here are the portmap entries for port 49159, which glusterd hands back to the mount, and port 49156, which is the port the brick process is actually listening on:
```
(gdb) p $4.ports[49159]
$7 = {
  type = GF_PMAP_PORT_BRICKSERVER,
  brickname = 0x7f04a814b400 "/bricks/brick3/b3",
  xprt = 0x7f04a8148380
}
(gdb) p $4.ports[49156]
$8 = {
  type = GF_PMAP_PORT_BRICKSERVER,
  brickname = 0x7f04a8000d00 "/bricks/brick3/b3",
  xprt = 0x7f04a817b820
}
```
From these two entries it is clear that 49159 is a stale entry. The reason glusterd handed this port back to the client is that the portmap search always goes top down, starting from last_alloc and walking down to base_port; that behaviour was introduced recently by BZ 1353426. At the time that patch was merged things were correct, because last_alloc was only ever fast-forwarded. However, the fix for BZ 1263090 that landed afterwards introduced a side effect: pmap_registry_alloc now starts from base_port and reuses any free port that was consumed earlier, so a restarted brick can land on a low port while a stale entry for the same brick path still sits at a higher one.
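For reference, the registry both code paths walk looks roughly like this. It is a simplified sketch modelled on glusterd-pmap.h; the field names match the gdb output above, but the exact type names, extra port states, and other fields are trimmed:

```c
/* Simplified sketch of glusterd's portmap registry (after glusterd-pmap.h). */
typedef enum {
    GF_PMAP_PORT_FREE = 0,
    GF_PMAP_PORT_BRICKSERVER,
    /* other port states elided */
} gf_pmap_port_type_t;

struct pmap_port_status {
    gf_pmap_port_type_t type;  /* what the port is used for */
    char *brickname;           /* brick path, e.g. "/bricks/brick3/b3" */
    void *xprt;                /* RPC transport of the registered brick */
};

struct pmap_registry {
    int base_port;   /* lowest brick port, 49152 */
    int last_alloc;  /* highest port handed out so far */
    struct pmap_port_status ports[65536];
};
```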
Now consider this case (simulated in the sketch after the list):
1. Create four 1x1 volumes, so the brick ports for vol1 to vol4 will be 49152 to 49155.
2. Start all the volumes.
3. Delete vol1, vol2 and vol3.
4. kill -9 <brick pid of vol4>
5. Stop and delete vol4.
6. Create vol4 with the same volume name and brick path (use the force option) and start the volume; note that the brick port will now be 49152.
7. Try to mount the volume; the mount will fail, since glusterd reports back 49155 as the port for vol4.
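The whole sequence condenses into a toy model. This sketch reuses the simplified registry above and implements only the two scan directions at the heart of the bug: allocation walking up from the base port (post BZ 1263090) and search walking down from last_alloc (BZ 1353426). It is a simulation of the logic described in this comment, not glusterd source; the function names mirror glusterd's, but the signatures are invented for the demo. Running it prints the fresh brick on 49152 while glusterd reports the stale 49155:

```c
#include <stdio.h>
#include <string.h>

#define BASE_PORT 49152
#define PORT_MAX  65535

typedef enum { GF_PMAP_PORT_FREE = 0, GF_PMAP_PORT_BRICKSERVER } gf_pmap_port_type_t;

struct pmap_port_status {
    gf_pmap_port_type_t type;
    const char *brickname;
};

static struct pmap_port_status ports[PORT_MAX + 1];
static int last_alloc = BASE_PORT - 1;

/* Allocation scans UP from the base port and reuses the first free slot
 * (the post-BZ-1263090 behaviour). */
static int pmap_registry_alloc(const char *brick)
{
    for (int p = BASE_PORT; p <= PORT_MAX; p++) {
        if (ports[p].type == GF_PMAP_PORT_FREE) {
            ports[p].type = GF_PMAP_PORT_BRICKSERVER;
            ports[p].brickname = brick;
            if (p > last_alloc)
                last_alloc = p;
            return p;
        }
    }
    return -1;
}

/* Search scans DOWN from last_alloc (the BZ 1353426 behaviour), so a stale
 * entry at a higher port shadows the live one for the same brick path. */
static int pmap_registry_search(const char *brick)
{
    for (int p = last_alloc; p >= BASE_PORT; p--)
        if (ports[p].type == GF_PMAP_PORT_BRICKSERVER &&
            strcmp(ports[p].brickname, brick) == 0)
            return p;
    return -1;
}

/* A clean shutdown sends PMAP_SIGNOUT and frees the entry; a brick killed
 * with -9 never reaches this. */
static void pmap_registry_signout(int port)
{
    ports[port].type = GF_PMAP_PORT_FREE;
    ports[port].brickname = NULL;
}

int main(void)
{
    /* Steps 1-2: four 1x1 volumes get ports 49152..49155. */
    int p[4];
    p[0] = pmap_registry_alloc("/bricks/brick1/b1");
    p[1] = pmap_registry_alloc("/bricks/brick2/b2");
    p[2] = pmap_registry_alloc("/bricks/brick3/b3");
    p[3] = pmap_registry_alloc("/bricks/brick4/b4");

    /* Step 3: vol1..vol3 go away cleanly, so their entries are freed. */
    pmap_registry_signout(p[0]);
    pmap_registry_signout(p[1]);
    pmap_registry_signout(p[2]);

    /* Steps 4-5: vol4's brick is killed with -9 and the volume is deleted;
     * no PMAP_SIGNOUT arrives, so the entry at 49155 stays behind, stale. */

    /* Step 6: vol4 is recreated with the same brick path; allocation scans
     * up from the base and hands out 49152. */
    int fresh = pmap_registry_alloc("/bricks/brick4/b4");

    /* Step 7: the mount asks glusterd for vol4's port; the top-down search
     * hits the stale 49155 entry first, and the mount fails to connect. */
    printf("brick listens on %d, glusterd reports %d\n",
           fresh, pmap_registry_search("/bricks/brick4/b4"));
    return 0;
}
```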
We'd need to think about how to fix this. But it *can* only happen if a PMAP_SIGNOUT is not received when a brick process goes down, and then only if you delete and recreate a volume with the same name and brick path. With all this in mind, I am moving this out of 3.2.0.
Feel free to think otherwise with proper justification :)
Gaurav - this is a very interesting bug to work with. Can you please add it to your backlog?