Bug 1402172
| Summary: | Peer unexpectedly disconnected | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | David Galloway <dgallowa> |
| Component: | glusterd | Assignee: | Atin Mukherjee <amukherj> |
| Status: | CLOSED WORKSFORME | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.8 | CC: | amukherj, bugs, dgallowa |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-01-23 14:08:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
David Galloway
2016-12-07 00:15:15 UTC
We had seen this issue a while back but couldn't RCA it. We could debug to a state that showed a discrepancy between the socket and application layers: the application layer kept retrying the connection while the underlying socket was already connected, so the reconnect failed on every attempt. The good news is we managed to debug and RCA this issue recently; it's now fixed in upstream mainline, and a patch http://review.gluster.org/#/c/16025 has been backported to the release-3.8 branch, which should go into the next 3.8.x release.

Thanks for the quick response! Am I at any risk in continuing to run my cluster in this state? In other words, are the peers really disconnected and replication is not happening, or is 'peer status' output bogus?

No, its a race, restarting glusterd instance where it shows disconnected should bring everything back to normal.

(In reply to Atin Mukherjee from comment #3)
> No, its a race, restarting glusterd instance where it shows disconnected
> should bring everything back to normal.

It doesn't though. I restarted glusterd on store01 and store02 multiple times.

(In reply to David Galloway from comment #4)
> (In reply to Atin Mukherjee from comment #3)
> > No, its a race, restarting glusterd instance where it shows disconnected
> > should bring everything back to normal.
>
> It doesn't though. I restarted glusterd on store01 and store02 multiple
> times.

In that case, my initial suspect doesn't stand true. Is there any firewall setting causing the connection to not go through from store1 to store2?

(In reply to Atin Mukherjee from comment #5)
> (In reply to David Galloway from comment #4)
> > (In reply to Atin Mukherjee from comment #3)
> > > No, its a race, restarting glusterd instance where it shows disconnected
> > > should bring everything back to normal.
> >
> > It doesn't though. I restarted glusterd on store01 and store02 multiple
> > times.
>
> In that case, my initial suspect doesn't stand true. Is there any firewall
> setting causing the connection to not go through from store1 to store2?

Relevant 'iptables -L' output.

store01
-------
ACCEPT  tcp  --  172.21.0.0/20  anywhere  tcp dpt:24007
ACCEPT  tcp  --  172.21.0.0/20  anywhere  tcp dpt:24009
ACCEPT  tcp  --  172.21.0.0/20  anywhere  tcp dpts:38465:38468
ACCEPT  tcp  --  172.21.0.0/20  anywhere  tcp dpts:49152:49162

store02
-------
ACCEPT  tcp  --  172.21.0.0/20  anywhere  tcp dpt:24007
ACCEPT  tcp  --  172.21.0.0/20  anywhere  tcp dpt:24009
ACCEPT  tcp  --  172.21.0.0/20  anywhere  tcp dpts:38465:38467
ACCEPT  tcp  --  172.21.0.0/20  anywhere  tcp dpts:49152:49162

David, I wanted to have a look at the sosreport but was unable to access it due to insufficient permissions. In the meantime, can you check whether glusterd has seen any disconnect event by any chance? If so, the glusterd log should contain a "peer X is disconnected" kind of log entry. Otherwise, could you restart glusterd with debug logging enabled (kill glusterd and then bring it back with 'glusterd -LDEBUG') and share the glusterd logs from both nodes?

I fixed file permissions on the sosreports. Sorry about that. We had a network glitch on 2016-12-03, so there are a few entries about peers disconnecting then, as well as when I tried restarting glusterd on 2016-12-06. Let me know if you still need the debug logging after checking out the sosreport.

Do you still need debug logs, or can you tell from the sosreports what happened? I just logged in to store01 and all peers are showing as connected. Nothing has changed as far as firewalls or configs on my end.
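For reference, the debug-logging restart requested above comes down to a few shell commands. This is only a sketch, not an exact procedure: the glusterd log filename under /var/log/glusterfs/ varies between GlusterFS versions, and store01/store02 are simply the hostnames from this report.

```sh
# On each affected node (store01, store02), stop the running glusterd
# management daemon.
pkill glusterd

# Bring it back with debug logging, as suggested in the comment above.
glusterd -LDEBUG

# Verify how the peers are reported afterwards.
gluster peer status

# The glusterd log to share lives under /var/log/glusterfs/
# (the exact filename differs between versions and distributions).
ls /var/log/glusterfs/
```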
I uploaded new sosreports from store01 and store02 to http://drop.ceph.com/qa/dgalloway/ if you want to investigate.

Sorry for the delay, I'll have a look at the logs and get back.

[2016-12-06 23:40:10.394118] E [socket.c:2309:socket_connect_finish] 0-management: connection to 172.21.0.9:24007 failed (Connection timed out)
[2016-12-06 23:40:10.394317] I [MSGID: 106004] [glusterd-handler.c:5219:__glusterd_peer_rpc_notify] 0-management: Peer <store02> (<97d6abee-f2ac-47d8-bb96-738ffb99b38f>), in state <Peer in Cluster>, has disconnected from glusterd.

The above glusterd log snippet from store1 convinces me that there is an issue with the connectivity between store1 & store2 (172.21.0.9)?

Doesn't restarting all the glusterd instances solve this issue?

(In reply to Atin Mukherjee from comment #12)
> [2016-12-06 23:40:10.394118] E [socket.c:2309:socket_connect_finish]
> 0-management: connection to 172.21.0.9:24007 failed (Connection timed out)
> [2016-12-06 23:40:10.394317] I [MSGID: 106004]
> [glusterd-handler.c:5219:__glusterd_peer_rpc_notify] 0-management: Peer
> <store02> (<97d6abee-f2ac-47d8-bb96-738ffb99b38f>), in state <Peer in
> Cluster>, has disconnected from glusterd.
>
> The above glusterd log snippet from store1 convinces me that there is an
> issue with the connectivity between store1 & store2 (172.21.0.9)?
>
> Doesn't restarting all the glusterd instances solve this issue?

It didn't, no. But all the peers are showing as connected now without any intervention on my part, so I'm not sure this bug is valid anymore.

I am closing this BZ based on comment 13; please reopen if the issue is seen again.
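For anyone hitting the same symptom, a quick way to tell a genuine connectivity problem (such as the "Connection timed out" in the log snippet above) apart from a stale peer state is to probe the glusterd management port directly. A minimal sketch, assuming the default management port 24007, the hostnames used in this report, and a systemd-managed glusterd service as on RHEL/CentOS:

```sh
# From store01, check whether store02's glusterd management port is reachable
# (uses bash's built-in /dev/tcp; nc or telnet work just as well).
timeout 3 bash -c '</dev/tcp/store02/24007' \
  && echo "24007 reachable" || echo "24007 unreachable"

# Confirm the firewall on the peer accepts traffic to the management port.
iptables -L -n | grep 24007

# If the port is reachable but the peer still shows as disconnected,
# restart glusterd on the node reporting the disconnect and re-check.
systemctl restart glusterd
gluster peer status
```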