Bug 1707246
| Summary: | [glusterd]: While upgrading (3-node cluster) 'gluster v status' times out on node to be upgraded | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rochelle <rallan> | |
| Component: | glusterd | Assignee: | Sanju <srakonde> | |
| Status: | CLOSED ERRATA | QA Contact: | Bala Konda Reddy M <bmekala> | |
| Severity: | high | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | rhgs-3.5 | CC: | amukherj, kiyer, pasik, rhinduja, rhs-bugs, sheggodu, storage-qa-internal, vbellur, vdas | |
| Target Milestone: | --- | Keywords: | Regression | |
| Target Release: | RHGS 3.5.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-6.0-4 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1710159 | Environment: | |
| Last Closed: | 2019-10-30 12:21:23 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1710159, 1722131 | |||
| Bug Blocks: | 1696807 | |||
Description
Rochelle
2019-05-07 07:02:22 UTC
Root cause:
Commit 34e010d64 added conditions on when txn-opinfo is set, in order to avoid a memory leak in the txn-opinfo object. In a heterogeneous cluster, however, the upgraded and non-upgraded nodes follow different conditions for setting txn-opinfo. As a result the get-txn-opinfo operation fails and the process eventually hangs.
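The mismatch can be illustrated with a small model. This is only a sketch, not glusterd code: `struct txn_slot`, `store_old()`, `store_new()` and `lookup()` are hypothetical names standing in for the txn-opinfo dictionary and the set/get helpers, and the three boolean flags stand in for the event type, the op-sm state and `skip_locking` from the commit shown below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-entry stand-in for the txn-opinfo dictionary. */
struct txn_slot {
    bool present;
};

/* Non-upgraded node, as described in this report: the opinfo is
 * stored in every state of the op state machine. */
void store_old(struct txn_slot *slot, bool stage_event, bool staged,
               bool skip_locking)
{
    assert(slot != NULL);
    (void)stage_event;
    (void)staged;
    (void)skip_locking;
    slot->present = true;
}

/* Upgraded node: the store is skipped for a stage-op event in staged
 * state when skip_locking is set (the condition added by 34e010d64). */
void store_new(struct txn_slot *slot, bool stage_event, bool staged,
               bool skip_locking)
{
    assert(slot != NULL);
    if (!(stage_event && staged && skip_locking))
        slot->present = true;
}

/* Returns 0 when the entry exists and -1 otherwise, mimicking a
 * failed get-txn-opinfo call. */
int lookup(const struct txn_slot *slot)
{
    assert(slot != NULL);
    return slot->present ? 0 : -1;
}
```

With all three flags true, `lookup()` succeeds after `store_old()` but returns -1 after `store_new()`. That is the disagreement between the two kinds of node that this report describes.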
[root@server2 glusterfs]# git show 34e010d64
commit 34e010d64905b7387de57840d3fb16a326853c9b
Author: Atin Mukherjee <amukherj>
Date: Mon Mar 18 16:08:04 2019 +0530
glusterd: fix txn-id mem leak
This commit ensures the following:
1. Don't send commit op request to the remote nodes when gluster v
status all is executed as for the status all transaction the local
commit gets the name of the volumes and remote commit ops are
technically a no-op. So no need for additional rpc requests.
2. In op state machine flow, if the transaction is in staged state and
op_info.skip_locking is true, then no need to set the txn id in the
priv->glusterd_txn_opinfo dictionary which never gets freed.
Fixes: bz#1691164
Change-Id: Ib6a9300ea29633f501abac2ba53fb72ff648c822
Signed-off-by: Atin Mukherjee <amukherj>
diff --git a/xlators/mgmt/glusterd/src/glusterd-op-sm.c b/xlators/mgmt/glusterd/src/glusterd-op-sm.c
index 6495a9d88..84c34f1fe 100644
--- a/xlators/mgmt/glusterd/src/glusterd-op-sm.c
+++ b/xlators/mgmt/glusterd/src/glusterd-op-sm.c
@@ -5652,6 +5652,9 @@ glusterd_op_ac_stage_op(glusterd_op_sm_event_t *event, void *ctx)
dict_t *dict = NULL;
xlator_t *this = NULL;
uuid_t *txn_id = NULL;
+ glusterd_op_info_t txn_op_info = {
+ {0},
+ };
this = THIS;
GF_ASSERT(this);
@@ -5686,6 +5689,7 @@ glusterd_op_ac_stage_op(glusterd_op_sm_event_t *event, void *ctx)
ret = -1;
goto out;
}
+ ret = glusterd_get_txn_opinfo(&event->txn_id, &txn_op_info);
ret = dict_set_bin(rsp_dict, "transaction_id", txn_id, sizeof(*txn_id));
if (ret) {
@@ -5704,6 +5708,12 @@ out:
gf_msg_debug(this->name, 0, "Returning with %d", ret);
+ /* for no volname transactions, the txn_opinfo needs to be cleaned up
+ * as there's no unlock event triggered
+ */
+ if (txn_op_info.skip_locking)
+ ret = glusterd_clear_txn_opinfo(txn_id);
+
if (rsp_dict)
dict_unref(rsp_dict);
@@ -8159,12 +8169,16 @@ glusterd_op_sm()
"Unable to clear "
"transaction's opinfo");
} else {
- ret = glusterd_set_txn_opinfo(&event->txn_id, &opinfo);
- if (ret)
- gf_msg(this->name, GF_LOG_ERROR, 0,
- GD_MSG_TRANS_OPINFO_SET_FAIL,
- "Unable to set "
- "transaction's opinfo");
+ if (!(event_type == GD_OP_EVENT_STAGE_OP &&
+ opinfo.state.state == GD_OP_STATE_STAGED &&
+ opinfo.skip_locking)) { <---- Upgraded nodes no longer set txn-opinfo when the event is GD_OP_EVENT_STAGE_OP, the state is GD_OP_STATE_STAGED and skip_locking is set, so the glusterd_get_txn_opinfo() call that follows fails. Previously txn-opinfo was set in every state of the op-sm, and glusterd_get_txn_opinfo() is called in every phase. This change needs an op-version check.
+ ret = glusterd_set_txn_opinfo(&event->txn_id, &opinfo);
+ if (ret)
+ gf_msg(this->name, GF_LOG_ERROR, 0,
+ GD_MSG_TRANS_OPINFO_SET_FAIL,
+ "Unable to set "
+ "transaction's opinfo");
+ }
}
glusterd_destroy_op_event_ctx(event);
diff --git a/xlators/mgmt/glusterd/src/glusterd-syncop.c b/xlators/mgmt/glusterd/src/glusterd-syncop.c
index 45b221c2e..9bab2cfd5 100644
--- a/xlators/mgmt/glusterd/src/glusterd-syncop.c
+++ b/xlators/mgmt/glusterd/src/glusterd-syncop.c
@@ -1392,6 +1392,8 @@ gd_commit_op_phase(glusterd_op_t op, dict_t *op_ctx, dict_t *req_dict,
char *errstr = NULL;
struct syncargs args = {0};
int type = GF_QUOTA_OPTION_TYPE_NONE;
+ uint32_t cmd = 0;
+ gf_boolean_t origin_glusterd = _gf_false;
this = THIS;
GF_ASSERT(this);
@@ -1449,6 +1451,20 @@ commit_done:
gd_syncargs_init(&args, op_ctx);
synctask_barrier_init((&args));
peer_cnt = 0;
+ origin_glusterd = is_origin_glusterd(req_dict);
+
+ if (op == GD_OP_STATUS_VOLUME) {
+ ret = dict_get_uint32(req_dict, "cmd", &cmd);
+ if (ret)
+ goto out;
+
+ if (origin_glusterd) {
+ if ((cmd & GF_CLI_STATUS_ALL)) {
+ ret = 0;
+ goto out;
+ }
+ }
+ }
RCU_READ_LOCK;
cds_list_for_each_entry_rcu(peerinfo, &conf->peers, uuid_list)
Patch https://review.gluster.org/#/c/glusterfs/+/22730 has been posted upstream for review.
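As a rough sketch of what an op-version check around the new condition could look like (this is not the content of the patch under review): the function name `should_set_txn_opinfo()` and the constant `ASSUMED_MIN_OP_VERSION` are hypothetical, and the numeric value is a placeholder, since this report does not name the real op-version constant.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder value, assumed for illustration only: the cluster
 * op-version from which every node is expected to carry 34e010d64. */
#define ASSUMED_MIN_OP_VERSION 60000

/* Decide whether the op state machine should store the opinfo.
 * Below the assumed op-version the old behaviour (always store) is
 * kept, so mixed clusters agree; at or above it the leak-avoiding
 * condition from 34e010d64 applies. */
bool should_set_txn_opinfo(int cluster_op_version, bool stage_event,
                           bool staged, bool skip_locking)
{
    assert(cluster_op_version > 0);

    if (cluster_op_version < ASSUMED_MIN_OP_VERSION)
        return true;

    return !(stage_event && staged && skip_locking);
}
```

The idea is that the cluster op-version stays at the lower value until all nodes are upgraded, so every node keeps setting txn-opinfo in every state during the upgrade.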
Thanks,
Sanju
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:3249