Bug 1710159
| Summary: | glusterd: While upgrading (3-node cluster) 'gluster v status' times out on node to be upgraded | ||
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Sanju <srakonde> |
| Component: | glusterd | Assignee: | Sanju <srakonde> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | mainline | CC: | amukherj, bmekala, bugs, rallan, rhinduja, rhs-bugs, sankarshan, storage-qa-internal, vbellur |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | 1707246 | Environment: | |
| Last Closed: | 2019-05-31 03:09:26 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1707246 | ||
|
Description
Sanju
2019-05-15 02:47:53 UTC
REVIEW: https://review.gluster.org/22730 (glusterd: add an op-version check) posted (#1) for review on master by Sanju Rakonde Root cause:
With commit 34e010d64, we have added some conditions to set txn-opinfo to avoid
the memory leak in txn-opinfo object. But, in a heterogeneous cluster the
upgraded and non-upgraded nodes are following different conditions to set
txn-opinfo. This is leading the get-txn-opinfo operation to fail and eventually
the process hungs.
[root@server2 glusterfs]# git show 34e010d64
commit 34e010d64905b7387de57840d3fb16a326853c9b
Author: Atin Mukherjee <amukherj>
Date: Mon Mar 18 16:08:04 2019 +0530
glusterd: fix txn-id mem leak
This commit ensures the following:
1. Don't send commit op request to the remote nodes when gluster v
status all is executed as for the status all transaction the local
commit gets the name of the volumes and remote commit ops are
technically a no-op. So no need for additional rpc requests.
2. In op state machine flow, if the transaction is in staged state and
op_info.skip_locking is true, then no need to set the txn id in the
priv->glusterd_txn_opinfo dictionary which never gets freed.
Fixes: bz#1691164
Change-Id: Ib6a9300ea29633f501abac2ba53fb72ff648c822
Signed-off-by: Atin Mukherjee <amukherj>
diff --git a/xlators/mgmt/glusterd/src/glusterd-op-sm.c
b/xlators/mgmt/glusterd/src/glusterd-op-sm.c
index 6495a9d88..84c34f1fe 100644
--- a/xlators/mgmt/glusterd/src/glusterd-op-sm.c
+++ b/xlators/mgmt/glusterd/src/glusterd-op-sm.c
@@ -5652,6 +5652,9 @@ glusterd_op_ac_stage_op(glusterd_op_sm_event_t *event,
void *ctx)
dict_t *dict = NULL;
xlator_t *this = NULL;
uuid_t *txn_id = NULL;
+ glusterd_op_info_t txn_op_info = {
+ {0},
+ };
this = THIS;
GF_ASSERT(this);
@@ -5686,6 +5689,7 @@ glusterd_op_ac_stage_op(glusterd_op_sm_event_t *event,
void *ctx)
ret = -1;
goto out;
}
+ ret = glusterd_get_txn_opinfo(&event->txn_id, &txn_op_info);
ret = dict_set_bin(rsp_dict, "transaction_id", txn_id, sizeof(*txn_id));
if (ret) {
@@ -5704,6 +5708,12 @@ out:
gf_msg_debug(this->name, 0, "Returning with %d", ret);
+ /* for no volname transactions, the txn_opinfo needs to be cleaned up
+ * as there's no unlock event triggered
+ */
+ if (txn_op_info.skip_locking)
+ ret = glusterd_clear_txn_opinfo(txn_id);
+
if (rsp_dict)
dict_unref(rsp_dict);
@@ -8159,12 +8169,16 @@ glusterd_op_sm()
"Unable to clear "
"transaction's opinfo");
} else {
- ret = glusterd_set_txn_opinfo(&event->txn_id, &opinfo);
- if (ret)
- gf_msg(this->name, GF_LOG_ERROR, 0,
- GD_MSG_TRANS_OPINFO_SET_FAIL,
- "Unable to set "
- "transaction's opinfo");
+ if (!(event_type == GD_OP_EVENT_STAGE_OP &&
+ opinfo.state.state == GD_OP_STATE_STAGED &&
+ opinfo.skip_locking)) { <---- now, upgraded nodes
will not set txn-opinfo when this condition is false, so the
glusterd_get_txn_opinfo() after this is failing. previously we used to set
txn-opinfo in every state of op-sm and glusterd_get_txn_opinfo will be called
in every phase. We need to add an op-version check for this change.
+ ret = glusterd_set_txn_opinfo(&event->txn_id, &opinfo);
+ if (ret)
+ gf_msg(this->name, GF_LOG_ERROR, 0,
+ GD_MSG_TRANS_OPINFO_SET_FAIL,
+ "Unable to set "
+ "transaction's opinfo");
+ }
}
glusterd_destroy_op_event_ctx(event);
diff --git a/xlators/mgmt/glusterd/src/glusterd-syncop.c
b/xlators/mgmt/glusterd/src/glusterd-syncop.c
index 45b221c2e..9bab2cfd5 100644
--- a/xlators/mgmt/glusterd/src/glusterd-syncop.c
+++ b/xlators/mgmt/glusterd/src/glusterd-syncop.c
@@ -1392,6 +1392,8 @@ gd_commit_op_phase(glusterd_op_t op, dict_t *op_ctx,
dict_t *req_dict,
char *errstr = NULL;
struct syncargs args = {0};
int type = GF_QUOTA_OPTION_TYPE_NONE;
+ uint32_t cmd = 0;
+ gf_boolean_t origin_glusterd = _gf_false;
this = THIS;
GF_ASSERT(this);
@@ -1449,6 +1451,20 @@ commit_done:
gd_syncargs_init(&args, op_ctx);
synctask_barrier_init((&args));
peer_cnt = 0;
+ origin_glusterd = is_origin_glusterd(req_dict);
+
+ if (op == GD_OP_STATUS_VOLUME) {
+ ret = dict_get_uint32(req_dict, "cmd", &cmd);
+ if (ret)
+ goto out;
+
+ if (origin_glusterd) {
+ if ((cmd & GF_CLI_STATUS_ALL)) {
+ ret = 0;
+ goto out;
+ }
+ }
+ }
RCU_READ_LOCK;
cds_list_for_each_entry_rcu(peerinfo, &conf->peers, uuid_list)
(END)
Thanks,
Sanju
One correction to the above root cause is process never hangs here. It's just that the response back to the cli is not sent due to the incorrect state machine movement depending on the event trigger. REVIEW: https://review.gluster.org/22730 (glusterd: add an op-version check) merged (#3) on master by Atin Mukherjee |