Bug 1254137
| Summary: | Rebalance fix-layout fails after some time with a timeout | |||
|---|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Trefex <trefex> | |
| Component: | rpc | Assignee: | Mohammed Rafi KC <rkavunga> | |
| Status: | CLOSED EOL | QA Contact: | ||
| Severity: | high | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 3.7.5 | CC: | bugs, rgowdapp, rkavunga, sankarshan | |
| Target Milestone: | --- | Keywords: | Triaged | |
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | rpc-ping-timeout | |||
| Fixed In Version: | Doc Type: | Bug Fix | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1254138 (view as bug list) | Environment: | ||
| Last Closed: | 2017-03-08 10:48:24 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1254138 | |||
Description
Trefex
2015-08-17 09:17:47 UTC
Steps to reproduce:
1. Add a new node to a 2-node setup (already full of files)
2. Run `gluster volume rebalance live fix-layout start`
3. Wait 20k-40k seconds
4. The rebalance fails

REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#2) for review on master by mohammed rafi kc (rkavunga)
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#4) for review on master by mohammed rafi kc (rkavunga)
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#5) for review on master by Vijay Bellur (vbellur)
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#6) for review on master by mohammed rafi kc (rkavunga)
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#7) for review on master by mohammed rafi kc (rkavunga)
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#8) for review on master by mohammed rafi kc (rkavunga)

We block reading from the socket until the event handler completes. This might cause spurious disconnects due to ping-timer expiry if handlers take a long time. In this bug, the load seems to come from readdirp. I just looked at the readdirp reply path. It involves looping over the dentry list in several translators:
1. protocol/client constructs the dentry list, and hence traverses it.
2. afr loops over the dentries.
3. dht loops over the dentries.
4. syncop_readdirp_cbk (the rebalance process uses syncops) copies each dentry and constructs a new list.

I suspect that such heavy processing in the handler might have prevented the client from reading the ping response from the socket (if the ping response was queued behind the readdirp response), resulting in the ping timer expiring. One solution would be to resume reading from the socket as soon as a complete rpc message has been read. We need not wait until the rpc-program/rpc-clnt layer above the transport has processed the reply.

<rpc_clnt_ping_timer_expired>
        gettimeofday (&current, NULL);
        if (((current.tv_sec - conn->last_received.tv_sec) <
             conn->ping_timeout)
            || ((current.tv_sec - conn->last_sent.tv_sec) <
                conn->ping_timeout)) {
                transport_activity = 1;
        }
        if (transport_activity) {
                gf_log (trans->name, GF_LOG_TRACE,
                        "ping timer expired but transport activity "
                        "detected - not bailing transport");
                if (__rpc_clnt_rearm_ping_timer (rpc,
                                 rpc_clnt_ping_timer_expired) == -1) {
                        gf_log (trans->name, GF_LOG_WARNING,
                                "unable to setup ping timer");
                }
        } else {
                conn->ping_started = 0;
                disconnect = 1;
        }
</rpc_clnt_ping_timer_expired>
As can be seen above, ping_timer_expired takes "transport_activity" into account before actually disconnecting (it's not just the ping response we are looking at). Hence the RCA in the previous comment is most likely wrong.
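
To make the condition in the quoted snippet easier to reason about, here is a minimal, self-contained sketch of the same decision. The struct `conn_state` and the function `should_disconnect` are hypothetical names introduced only for illustration; they are not part of the GlusterFS sources, and timestamps are reduced to whole seconds as in the quoted comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified stand-in for the connection state. Only the
 * fields used by the quoted snippet are modelled. */
struct conn_state {
    time_t last_received; /* seconds: last time data was read from the peer */
    time_t last_sent;     /* seconds: last time data was written to the peer */
    int    ping_timeout;  /* seconds */
};

/* Returns true when a ping-timer expiry should lead to a disconnect, that is
 * when neither a receive nor a send happened within ping_timeout seconds. */
static bool
should_disconnect(const struct conn_state *conn, time_t now)
{
    assert(conn != NULL && conn->ping_timeout > 0);

    bool recent_receive = (now - conn->last_received) < conn->ping_timeout;
    bool recent_send    = (now - conn->last_sent) < conn->ping_timeout;

    /* The two checks are ORed: either kind of activity keeps the
     * transport alive and the timer is simply re-armed. */
    return !(recent_receive || recent_send);
}
```

Under this reading, a recent send is enough on its own to count as transport activity, so a ping reply that is delayed behind a slow handler would not by itself force a disconnect.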
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#10) for review on master by Vijay Bellur (vbellur)
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#11) for review on master by Vijay Bellur (vbellur)
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#12) for review on master by mohammed rafi kc (rkavunga)

This bug was accidentally moved from POST to MODIFIED via an error in automation. Please see mmccune with any questions.

REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#13) for review on master by mohammed rafi kc (rkavunga)
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#14) for review on master by mohammed rafi kc (rkavunga)

This bug is being closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check whether it still exists in newer releases of GlusterFS. If this bug still exists in newer GlusterFS releases, please reopen it against the newer release.
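
For readers unfamiliar with the idea named in the review title ("Add ping packets into beginning of ioq list"), the following is a rough sketch of a pending-message queue in which ping entries are placed at the head while everything else is appended at the tail. It is not the patch under review: `out_entry`, `out_queue`, `out_queue_push` and `out_queue_pop` are hypothetical names, and the sketch assumes the ioq is a queue of whole, not yet transmitted messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical queue entry; not the real GlusterFS ioq structure. */
struct out_entry {
    int               id;      /* illustrative message identifier */
    bool              is_ping; /* true for ping packets */
    struct out_entry *next;
};

struct out_queue {
    struct out_entry *head;
    struct out_entry *tail;
};

/* Enqueue an entry: pings go to the front, other messages to the back. */
static void
out_queue_push(struct out_queue *q, struct out_entry *e)
{
    assert(q != NULL && e != NULL);
    e->next = NULL;

    if (q->head == NULL) {
        q->head = q->tail = e;   /* empty queue */
    } else if (e->is_ping) {
        e->next = q->head;       /* ping jumps ahead of queued messages */
        q->head = e;
    } else {
        q->tail->next = e;       /* normal FIFO append */
        q->tail = e;
    }
}

/* Dequeue the entry at the head, or return NULL if the queue is empty. */
static struct out_entry *
out_queue_pop(struct out_queue *q)
{
    assert(q != NULL);

    struct out_entry *e = q->head;
    if (e != NULL) {
        q->head = e->next;
        if (q->head == NULL)
            q->tail = NULL;
        e->next = NULL;
    }
    return e;
}
```

One caveat with this kind of prioritisation on a stream socket: if the entry at the head has already been partially written, a new entry must not be placed in front of it, or the byte stream would be corrupted. The sketch ignores that case.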