Bug 1528641
| Summary: | Brick processes fail to start | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | robdewit <rob> |
| Component: | rpc | Assignee: | bugs <bugs> |
| Status: | CLOSED WORKSFORME | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | mainline | CC: | bugs, mchangir, rob, rob.dewit |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-01-07 09:17:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
robdewit
2017-12-22 14:07:14 UTC
Release 3.12 has been EOLd and this bug was still found to be in the NEW state, hence moving the version to mainline to triage it and take appropriate action.

This might be a case of an insufficient transport.listen-backlog. Rob, could you set the vol file option transport.listen-backlog to 1024 in the /etc/glusterfs/glusterd.vol file on both nodes, restart the nodes, and report back on the status? In the meantime, a dump of the volume info for all the volumes would help provide insight into the state of affairs.

We've expanded the cluster with another node since then, and as far as I recall this behavior has not occurred since. Could this have been caused by the number of volumes, or rather by some latency in I/O (disk or network)? I'd rather not mess with the settings, since the cluster has now been running fine for several months.

Glad to hear things are working for you. My hypothesis is that glusterd starting a large number of bricks causes a rush of brick processes attempting to connect back to glusterd. This causes SYN flooding and eventually dropped connection requests, resulting in loss of service, because there are insufficient resources to hold connection requests until they are acknowledged. Hence the suggestion to tweak the glusterd vol file option transport.listen-backlog. You could take a look at /var/log/messages and "grep -i" for "SYN flooding" to see whether that is the case. If things are working for you, you could close this BZ as WORKSFORME.
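For reference, a rough sketch of what the suggested edit to /etc/glusterfs/glusterd.vol could look like; the surrounding options are typical defaults from a stock install and may differ on your nodes:

```
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket
    # Raise the accept-queue depth so a burst of brick processes
    # reconnecting to glusterd is not dropped (value suggested above)
    option transport.listen-backlog 1024
end-volume
```

On systemd-based systems, restarting glusterd on each node (systemctl restart glusterd) picks up the new value.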
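Likewise, a sketch of the log check from the last comment; the exact kernel message wording varies by kernel version, but it normally names the flooded port (glusterd listens on 24007 by default):

```
# Search the kernel log for SYN-flood warnings around the time the bricks failed to start
grep -i "syn flooding" /var/log/messages

# A hit typically looks something like:
#   kernel: TCP: request_sock_TCP: Possible SYN flooding on port 24007. Sending cookies.
```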