Bug 1288003
| Summary: | [tiering]: Tier daemon crashed on two of eight nodes and a lot of "demotion failed" errors seen in the system | | |
| --- | --- | --- | --- |
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | krishnaram Karthick <kramdoss> |
| Component: | tier | Assignee: | Bug Updates Notification Mailing List <rhs-bugs> |
| Status: | CLOSED WONTFIX | QA Contact: | krishnaram Karthick <kramdoss> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.1 | CC: | byarlaga, dlambrig, kramdoss, nbalacha, rhs-bugs, storage-qa-internal, vagarwal |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.7.5-12 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 1288995 (view as bug list) | Environment: | |
| Last Closed: | 2016-01-11 15:08:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1260783 | | |
Description
krishnaram Karthick, 2015-12-03 09:54:38 UTC
Version: glusterfs-server-3.7.5-8.el6rhs.x86_64

Sosreports are available here --> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1288003/

Pasting the backtraces for both cores.

BT from dhcp37-121.core.5159:

```
(gdb) bt
#0  0x00007fdf00934d75 in __gf_free (free_ptr=0x7fdebc001970) at mem-pool.c:313
#1  0x00007fdef2887267 in tier_process_ctr_query (args=0x7fdee9fa3c40, query_cbk_args=<value optimized out>, is_promotion=<value optimized out>) at tier.c:878
#2  tier_process_brick (args=0x7fdee9fa3c40, query_cbk_args=<value optimized out>, is_promotion=<value optimized out>) at tier.c:967
#3  tier_build_migration_qfile (args=0x7fdee9fa3c40, query_cbk_args=<value optimized out>, is_promotion=<value optimized out>) at tier.c:1043
#4  0x00007fdef2888a70 in tier_promote (args=0x7fdee9fa3c40) at tier.c:1143
#5  0x00007fdeff9e8a51 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fdeff35293d in clone () from /lib64/libc.so.6
```

BT from dhcp37-111.core.5424:

```
#0  0x00007f6d271ee625 in raise () from /lib64/libc.so.6
#1  0x00007f6d271efe05 in abort () from /lib64/libc.so.6
#2  0x00007f6d2722c537 in __libc_message () from /lib64/libc.so.6
#3  0x00007f6d27231f4e in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f6d27232353 in malloc_consolidate () from /lib64/libc.so.6
#5  0x00007f6d27235c28 in _int_malloc () from /lib64/libc.so.6
#6  0x00007f6d27236b1c in malloc () from /lib64/libc.so.6
#7  0x00007f6d2888b7b2 in __gf_default_malloc () at mem-pool.h:106
#8  glusterfs_lkowner_buf_get () at globals.c:329
#9  0x00007f6d28870188 in lkowner_utoa (lkowner=0x7f6d2625f970) at common-utils.c:2407
#10 0x00007f6d2888ece2 in gf_proc_dump_call_stack (call_stack=0x7f6d2625f718, key_buf=<value optimized out>) at stack.c:167
#11 0x00007f6d2888f04e in gf_proc_dump_pending_frames (call_pool=0x7f6d299727a0) at stack.c:210
#12 0x00007f6d2888dafb in gf_proc_dump_info (signum=<value optimized out>, ctx=0x7f6d29950010) at statedump.c:825
#13 0x00007f6d28d1d10d in glusterfs_sigwaiter (arg=<value optimized out>) at glusterfsd.c:2020
#14 0x00007f6d2793aa51 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f6d272a493d in clone () from /lib64/libc.so.6
```

Analysis of dhcp37-121.core.5159:

From the logs:

```
[2015-12-03 00:01:25.433174] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-tiering-test-vol-01-client-9: server 10.70.37.121:49153 has not responded in the last 42 seconds, disconnecting.
[2015-12-03 00:01:25.433755] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1eb)[0x7fdf009026eb] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fdf006cd227] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fdf006cd33e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xab)[0x7fdf006cd40b] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1c2)[0x7fdf006cda42] ))))) 0-tiering-test-vol-01-client-9: forced unwinding frame type(GlusterFS 3.3) op(IPC(47)) called at 2015-12-03 00:00:01.130434 (xid=0xcc177)
[2015-12-03 00:01:25.433787] W [MSGID: 114031] [client-rpc-fops.c:2265:client3_3_ipc_cbk] 0-tiering-test-vol-01-client-9: remote operation failed [Transport endpoint is not connected]
[2015-12-03 00:01:25.433966] E [MSGID: 109107] [tier.c:838:tier_process_ctr_query] 0-tiering-test-vol-01-tier-dht: Failed query on /rhs/brick2/leg1/.glusterfs/leg1.db ret -107
pending frames:
```

This means that the syncop_ipc() call in tier_process_ctr_query() failed; ret -107 is -ENOTCONN, matching the "Transport endpoint is not connected" error above.
The relevant code in tier_process_ctr_query():

```c
ret = dict_set_bin (ctr_ipc_in_dict, GFDB_IPC_CTR_GET_QUERY_PARAMS,
                    ipc_ctr_params, sizeof (*ipc_ctr_params));
if (ret) {
        gf_msg (this->name, GF_LOG_ERROR, 0, LG_MSG_SET_PARAM_FAILED,
                "Failed setting %s to params dictionary",
                GFDB_IPC_CTR_GET_QUERY_PARAMS);
        goto out;
}

ret = syncop_ipc (local_brick->xlator, GF_IPC_TARGET_CTR,
                  ctr_ipc_in_dict, &ctr_ipc_out_dict);
if (ret) {
        gf_msg (this->name, GF_LOG_ERROR, 0, DHT_MSG_LOG_IPC_TIER_ERROR,
                "Failed query on %s ret %d",
                local_brick->brick_db_path, ret);
        goto out;
}
```

Since the call to syncop_ipc() failed, ctr_ipc_out_dict is NULL. On goto out:

```c
out:
        if (ctr_ipc_in_dict) {
                dict_unref (ctr_ipc_in_dict);   /* <-- this frees ipc_ctr_params */
                ctr_ipc_in_dict = NULL;
        }

        if (ctr_ipc_out_dict) {
                dict_unref (ctr_ipc_out_dict);
                ctr_ipc_out_dict = NULL;
                ipc_ctr_params = NULL;          /* <-- not reached, since
                                                       ctr_ipc_out_dict is NULL,
                                                       so the stale pointer is
                                                       never cleared */
        }

        GF_FREE (ipc_ctr_params);               /* <-- double free */

        return ret;
}
```

The dict_unref(ctr_ipc_in_dict) calls GF_FREE on ipc_ctr_params as part of dict_destroy()->data_unref(), because data->is_static is false for values stored with dict_set_bin(). As that memory has already been freed, the second call, GF_FREE(ipc_ctr_params), crashes the tier daemon.

Karthick, can you please file a separate BZ for each issue described here? That will make tracking easier. We can use this BZ for the tier crash in dhcp37-121.core.5159.

BZ #1289029 has been raised to track the core from dhcp37-111.core.5424. Reference link: https://bugzilla.redhat.com/show_bug.cgi?id=1289029

Don't want to close this out, since the support is not there for the current release. QE will verify once the support is in place. Removing the QE ack.

Based on triage calls, closing this as we do not support tiering on RHEL 6.
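For reference, a minimal, self-contained C sketch of the ownership pattern behind this double free, including the one-line guard that avoids it. The names here (container_set, container_destroy) are hypothetical stand-ins for dict_set_bin()/dict_unref(); this illustrates the bug class only and is not the actual patch shipped in glusterfs-3.7.5-12.

```c
#include <stdio.h>
#include <stdlib.h>

/* A container that, like dict_set_bin(), takes ownership of the stored
 * pointer and frees it when the container is destroyed. */
struct container {
        void *data;
};

static void container_set(struct container *c, void *data)
{
        c->data = data;           /* ownership transfers to the container */
}

static void container_destroy(struct container *c)
{
        free(c->data);            /* analogous to dict_unref() freeing the
                                   * non-static value on dict destruction */
        c->data = NULL;
}

int main(void)
{
        struct container in = { NULL };
        int *params = malloc(sizeof(*params));
        if (!params)
                return 1;
        *params = 42;

        container_set(&in, params);

        /* ... imagine an RPC such as syncop_ipc() failing here, so we
         * jump to cleanup with the "out" container never created ... */

        container_destroy(&in);   /* frees params via the container */
        params = NULL;            /* the guard: clear the stale pointer once
                                   * ownership has passed to the container;
                                   * without this, the free() below would be
                                   * the double free seen in the backtrace */
        free(params);             /* free(NULL) is a safe no-op */

        printf("cleaned up without a double free\n");
        return 0;
}
```

The key point is that once ownership of the buffer has been handed to the dictionary, the caller must clear (or stop using) its own copy of the pointer on every path where the dictionary gets destroyed, not only on the success path.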