Bug 1707246
Summary: | [glusterd]: While upgrading (3-node cluster) 'gluster v status' times out on node to be upgraded | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rochelle <rallan> | |
Component: | glusterd | Assignee: | Sanju <srakonde> | |
Status: | CLOSED ERRATA | QA Contact: | Bala Konda Reddy M <bmekala> | |
Severity: | high | Docs Contact: | ||
Priority: | unspecified | |||
Version: | rhgs-3.5 | CC: | amukherj, kiyer, pasik, rhinduja, rhs-bugs, sheggodu, storage-qa-internal, vbellur, vdas | |
Target Milestone: | --- | Keywords: | Regression | |
Target Release: | RHGS 3.5.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | glusterfs-6.0-4 | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1710159 | Environment: | |
Last Closed: | 2019-10-30 12:21:23 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1710159, 1722131 | |||
Bug Blocks: | 1696807 |
Description
Rochelle
2019-05-07 07:02:22 UTC
Root cause: Commit 34e010d64 added conditions on when txn-opinfo is set, to avoid a memory leak in the txn-opinfo object. In a heterogeneous cluster, however, the upgraded and non-upgraded nodes follow different conditions for setting txn-opinfo. This causes the get-txn-opinfo operation to fail, and eventually the process hangs.

[root@server2 glusterfs]# git show 34e010d64
commit 34e010d64905b7387de57840d3fb16a326853c9b
Author: Atin Mukherjee <amukherj>
Date:   Mon Mar 18 16:08:04 2019 +0530

    glusterd: fix txn-id mem leak

    This commit ensures the following:
    1. Don't send commit op request to the remote nodes when gluster v
    status all is executed as for the status all transaction the local
    commit gets the name of the volumes and remote commit ops are
    technically a no-op. So no need for additional rpc requests.
    2. In op state machine flow, if the transaction is in staged state and
    op_info.skip_locking is true, then no need to set the txn id in the
    priv->glusterd_txn_opinfo dictionary which never gets freed.

    Fixes: bz#1691164
    Change-Id: Ib6a9300ea29633f501abac2ba53fb72ff648c822
    Signed-off-by: Atin Mukherjee <amukherj>

diff --git a/xlators/mgmt/glusterd/src/glusterd-op-sm.c b/xlators/mgmt/glusterd/src/glusterd-op-sm.c
index 6495a9d88..84c34f1fe 100644
--- a/xlators/mgmt/glusterd/src/glusterd-op-sm.c
+++ b/xlators/mgmt/glusterd/src/glusterd-op-sm.c
@@ -5652,6 +5652,9 @@ glusterd_op_ac_stage_op(glusterd_op_sm_event_t *event, void *ctx)
     dict_t *dict = NULL;
     xlator_t *this = NULL;
     uuid_t *txn_id = NULL;
+    glusterd_op_info_t txn_op_info = {
+        {0},
+    };

     this = THIS;
     GF_ASSERT(this);
@@ -5686,6 +5689,7 @@ glusterd_op_ac_stage_op(glusterd_op_sm_event_t *event, void *ctx)
         ret = -1;
         goto out;
     }
+    ret = glusterd_get_txn_opinfo(&event->txn_id, &txn_op_info);

     ret = dict_set_bin(rsp_dict, "transaction_id", txn_id, sizeof(*txn_id));
     if (ret) {
@@ -5704,6 +5708,12 @@ out:

     gf_msg_debug(this->name, 0, "Returning with %d", ret);

+    /* for no volname transactions, the txn_opinfo needs to be cleaned up
+     * as there's no unlock event triggered
+     */
+    if (txn_op_info.skip_locking)
+        ret = glusterd_clear_txn_opinfo(txn_id);
+
     if (rsp_dict)
         dict_unref(rsp_dict);

@@ -8159,12 +8169,16 @@ glusterd_op_sm()
                            "Unable to clear "
                            "transaction's opinfo");
             } else {
-                ret = glusterd_set_txn_opinfo(&event->txn_id, &opinfo);
-                if (ret)
-                    gf_msg(this->name, GF_LOG_ERROR, 0,
-                           GD_MSG_TRANS_OPINFO_SET_FAIL,
-                           "Unable to set "
-                           "transaction's opinfo");
+                if (!(event_type == GD_OP_EVENT_STAGE_OP &&
+                      opinfo.state.state == GD_OP_STATE_STAGED &&
+                      opinfo.skip_locking)) {    <---- Now, upgraded nodes will not set txn-opinfo when this condition is false, so the glusterd_get_txn_opinfo() call after this fails. Previously, we set txn-opinfo in every state of op-sm, and glusterd_get_txn_opinfo() is called in every phase. We need to add an op-version check for this change.
+                    ret = glusterd_set_txn_opinfo(&event->txn_id, &opinfo);
+                    if (ret)
+                        gf_msg(this->name, GF_LOG_ERROR, 0,
+                               GD_MSG_TRANS_OPINFO_SET_FAIL,
+                               "Unable to set "
+                               "transaction's opinfo");
+                }
             }

             glusterd_destroy_op_event_ctx(event);
diff --git a/xlators/mgmt/glusterd/src/glusterd-syncop.c b/xlators/mgmt/glusterd/src/glusterd-syncop.c
index 45b221c2e..9bab2cfd5 100644
--- a/xlators/mgmt/glusterd/src/glusterd-syncop.c
+++ b/xlators/mgmt/glusterd/src/glusterd-syncop.c
@@ -1392,6 +1392,8 @@ gd_commit_op_phase(glusterd_op_t op, dict_t *op_ctx, dict_t *req_dict,
     char *errstr = NULL;
     struct syncargs args = {0};
     int type = GF_QUOTA_OPTION_TYPE_NONE;
+    uint32_t cmd = 0;
+    gf_boolean_t origin_glusterd = _gf_false;

     this = THIS;
     GF_ASSERT(this);
@@ -1449,6 +1451,20 @@ commit_done:
     gd_syncargs_init(&args, op_ctx);
     synctask_barrier_init((&args));
     peer_cnt = 0;
+    origin_glusterd = is_origin_glusterd(req_dict);
+
+    if (op == GD_OP_STATUS_VOLUME) {
+        ret = dict_get_uint32(req_dict, "cmd", &cmd);
+        if (ret)
+            goto out;
+
+        if (origin_glusterd) {
+            if ((cmd & GF_CLI_STATUS_ALL)) {
+                ret = 0;
+                goto out;
+            }
+        }
+    }

     RCU_READ_LOCK;
     cds_list_for_each_entry_rcu(peerinfo, &conf->peers, uuid_list)

Patch https://review.gluster.org/#/c/glusterfs/+/22730 has been posted upstream for review.

Thanks,
Sanju

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:3249
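
For readers following the root-cause comment, which asks for an op-version check around the new condition in glusterd_op_sm(): the following is a minimal, standalone sketch of what such a gate could look like. It is not the posted patch. The enum values, the struct, the function names, and the threshold OPV_SKIP_STAGED_SET (and its value) are all hypothetical stand-ins chosen for illustration; only the shape of the condition is taken from the diff above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are NOT glusterd's real definitions. */
enum ev_type { EV_STAGE_OP, EV_COMMIT_OP };
enum op_state { ST_DEFAULT, ST_STAGED };

/* Assumed cluster op-version at which every peer has the new behaviour. */
#define OPV_SKIP_STAGED_SET 60000

struct txn_view {
    enum ev_type event_type;
    enum op_state state;
    bool skip_locking;
};

/*
 * Decide whether the state machine should store txn-opinfo after an event.
 * Below the threshold the old behaviour is kept (always store), so that
 * peers which look up txn-opinfo in every phase still find an entry.
 * At or above it, the staged + skip_locking case is skipped to avoid the
 * leak described in the commit message.
 */
static bool
should_set_txn_opinfo(int cluster_op_version, const struct txn_view *t)
{
    if (cluster_op_version < OPV_SKIP_STAGED_SET)
        return true;

    return !(t->event_type == EV_STAGE_OP && t->state == ST_STAGED &&
             t->skip_locking);
}

/* Small self-check exercising both sides of the gate. */
static void
gate_self_check(void)
{
    struct txn_view status_all = {EV_STAGE_OP, ST_STAGED, true};
    struct txn_view with_lock = {EV_STAGE_OP, ST_STAGED, false};

    /* Mixed-version cluster (assumed lower op-version): old behaviour. */
    assert(should_set_txn_opinfo(31305, &status_all));
    /* Fully upgraded cluster: the no-lock staged case is skipped. */
    assert(!should_set_txn_opinfo(OPV_SKIP_STAGED_SET, &status_all));
    /* Transactions that take locks are always stored. */
    assert(should_set_txn_opinfo(OPV_SKIP_STAGED_SET, &with_lock));
}
```

The point of gating on the cluster-wide op-version rather than on the local binary version is that the cluster op-version only rises once all peers are upgraded, so during a rolling upgrade every node keeps following the same rule for setting txn-opinfo.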