Bug 1405299
Summary: | fuse mount crashed when VM installation is in progress & one of the brick killed | ||||||
---|---|---|---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | SATHEESARAN <sasundar> | ||||
Component: | replicate | Assignee: | Krutika Dhananjay <kdhananj> | ||||
Status: | CLOSED ERRATA | QA Contact: | SATHEESARAN <sasundar> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | unspecified | ||||||
Version: | rhgs-3.2 | CC: | amukherj, kdhananj, knarra, rcyriac, rhinduja, rhs-bugs, sabose, storage-qa-internal | ||||
Target Milestone: | --- | ||||||
Target Release: | RHGS 3.2.0 | ||||||
Hardware: | x86_64 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | glusterfs-3.8.4-9 | Doc Type: | If docs needed, set a value | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | |||||||
: | 1405308 (view as bug list) | Environment: | |||||
Last Closed: | 2017-03-23 05:58:01 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1277939, 1351528, 1405308 | ||||||
Attachments: |
|
Description
SATHEESARAN
2016-12-16 06:51:18 UTC
Backtrace: ---------- Core was generated by `/usr/sbin/glusterfs --volfile-server=10.70.37.138 --volfile-id=/rep3vol /mnt/re'. Program terminated with signal 11, Segmentation fault. #0 afr_pre_op_writev_cbk (frame=0x7f24e25d2974, cookie=0x1, this=0x7f24d000a7b0, op_ret=<optimized out>, op_errno=<optimized out>, data=<optimized out>, xdata=0x0) at afr-transaction.c:1255 1255 write_args_cbk = &args_cbk->rsp_list[1]; Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-26.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libselinux-2.5-6.el7.x86_64 pcre-8.32-15.el7_2.1.x86_64 (gdb) bt #0 afr_pre_op_writev_cbk (frame=0x7f24e25d2974, cookie=0x1, this=0x7f24d000a7b0, op_ret=<optimized out>, op_errno=<optimized out>, data=<optimized out>, xdata=0x0) at afr-transaction.c:1255 #1 0x00007f24d6e91dd7 in client3_3_compound_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f24e25ceea8) at client-rpc-fops.c:3214 #2 0x00007f24e48ad785 in saved_frames_unwind (saved_frames=saved_frames@entry=0x7f24c4001620) at rpc-clnt.c:369 #3 0x00007f24e48ad86e in saved_frames_destroy (frames=frames@entry=0x7f24c4001620) at rpc-clnt.c:386 #4 0x00007f24e48aefd4 in rpc_clnt_connection_cleanup (conn=conn@entry=0x7f24d007cf18) at rpc-clnt.c:556 #5 0x00007f24e48af864 in rpc_clnt_handle_disconnect (conn=0x7f24d007cf18, clnt=0x7f24d007cec0) at rpc-clnt.c:881 #6 rpc_clnt_notify (trans=<optimized out>, mydata=0x7f24d007cf18, event=RPC_TRANSPORT_DISCONNECT, data=0x7f24d008cc10) at rpc-clnt.c:937 #7 0x00007f24e48ab883 in rpc_transport_notify (this=this@entry=0x7f24d008cc10, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7f24d008cc10) at rpc-transport.c:537 #8 0x00007f24d9173302 in socket_event_poll_err (this=0x7f24d008cc10) at socket.c:1179 #9 socket_event_handler (fd=<optimized out>, idx=4, data=0x7f24d008cc10, poll_in=1, poll_out=0, poll_err=<optimized out>) at socket.c:2404 #10 0x00007f24e4b3f4f0 in event_dispatch_epoll_handler (event=0x7f24cfffee80, event_pool=0x7f24e5b41f00) at event-epoll.c:571 #11 event_dispatch_epoll_worker (data=0x7f24d003f420) at event-epoll.c:674 #12 0x00007f24e3946dc5 in start_thread (arg=0x7f24cffff700) at pthread_create.c:308 #13 0x00007f24e328b73d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 Created attachment 1232448 [details]
fuse mount logs
Krutika has RCA'ed the issue and found that the patch[1] is missed in the backport, which has caused this issue. [1] - http://review.gluster.org/#/c/15482/9 @Krutika, Requesting you to provide the detailed RCA Crash is in Core was generated by `/usr/sbin/glusterfs --volfile-server=10.70.37.138 --volfile-id=/rep3vol /mnt/re'. Program terminated with signal 11, Segmentation fault. #0 afr_pre_op_writev_cbk (frame=0x7f24e25d2974, cookie=0x1, this=0x7f24d000a7b0, op_ret=<optimized out>, op_errno=<optimized out>, data=<optimized out>, xdata=0x0) at afr-transaction.c:1255 1255 write_args_cbk = &args_cbk->rsp_list[1]; Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-26.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libselinux-2.5-6.el7.x86_64 pcre-8.32-15.el7_2.1.x86_64 (gdb) p args_cbk $1 = <optimized out> Problem is that args_cbk was NULL (gdb) f 1 #1 0x00007f24d6e91dd7 in client3_3_compound_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f24e25ceea8) at client-rpc-fops.c:3214 3214 CLIENT_STACK_UNWIND (compound, frame, rsp.op_ret, (gdb) l 3209 } 3210 3211 } 3212 rsp.op_ret = 0; 3213 out: 3214 CLIENT_STACK_UNWIND (compound, frame, rsp.op_ret, 3215 gf_error_to_errno (rsp.op_errno), args_cbk, xdata); 3216 3217 free (rsp.xdata.xdata_val); 3218 (gdb) p args_cbk $2 = (compound_args_cbk_t *) 0x0 As per code ... 1223 afr_pre_op_writev_cbk (call_frame_t *frame, void *cookie, xlator_t *this, 1224 int op_ret, int op_errno, 1225 void *data, dict_t *xdata) 1226 { 1227 afr_local_t *local = NULL; 1228 afr_private_t *priv = NULL; 1229 call_frame_t *fop_frame = NULL; 1230 default_args_cbk_t *write_args_cbk = NULL; (gdb) 1231 compound_args_cbk_t *args_cbk = data; 1232 int call_count = -1; 1233 int child_index = -1; 1234 1235 local = frame->local; 1236 priv = this->private; 1237 child_index = (long) cookie; 1238 1239 if (local->pre_op_compat) 1240 afr_changelog_pre_op_update (frame, this); (gdb) 1241 1242 if (op_ret == -1) { 1243 local->op_errno = op_errno; 1244 afr_transaction_fop_failed (frame, this, child_index); 1245 } 1246 1247 /* If the compound fop failed due to saved_frame_unwind(), then 1248 * protocol/client fails it even before args_cbk is allocated. 1249 * Handle that case by passing the op_ret, op_errno values explicitly. 1250 */ (gdb) 1251 if ((op_ret == -1) && (args_cbk == NULL)) { 1252 afr_inode_write_fill (frame, this, child_index, op_ret, 1253 op_errno, NULL, NULL, NULL); 1254 } else { 1255 write_args_cbk = &args_cbk->rsp_list[1]; ... if op_ret is -1 and args_cbk is NULL (line 1251 above, implying that proper NULL check is in place), args_cbk by virtue of being NULL will not even be dereferenced. But ... (gdb) p rsp.op_ret $4 = 0 ... op_ret was not -1, which is why the else block at 1254 was executed and NULL pointer dereferenced and the client process crashed. Why was op_ret -1? Because ... 3158 int 3159 client3_3_compound_cbk (struct rpc_req *req, struct iovec *iov, int count, 3160 void *myframe) 3161 { 3162 gfs3_compound_rsp rsp = {0,}; 3163 compound_args_cbk_t *args_cbk = NULL; 3164 call_frame_t *frame = NULL; 3165 xlator_t *this = NULL; (gdb) 3166 dict_t *xdata = NULL; 3167 clnt_local_t *local = NULL; 3168 int op_errno = 0; 3169 int i = 0; 3170 int length = 0; 3171 int ret = -1; 3172 3173 this = THIS; 3174 3175 frame = myframe; (gdb) 3176 local = frame->local; 3177 3178 if (-1 == req->rpc_status) { 3179 op_errno = ENOTCONN; 3180 goto out; 3181 } 3182 In line 3178, when req->rpc_status is -1 (which can happen when a brick went down before the request was sent over the network), rsp.op_ret is not set to -1 when it should be, which is fixed here http://review.gluster.org/#/c/15482/9/xlators/protocol/client/src/client-rpc-fops.c@3165 upstream mainline : https://review.gluster.org/15482 downstream patch : https://code.engineering.redhat.com/gerrit/#/c/93157/ Tested with glusterfs-3.8.4-9.el7rhgs with the following test steps: 1. Created a replica 3 sharded volume with shard-block-size as 4MB, optimized for virt store usecase & started the volume 2. Fuse mounted the volume on RHEL 7.3 server ( hypervisor ) 3. Created a image file (sparse - 20G ) on the volume 4. Started VM installation with RHEL 7.3 as OS 5. While installation is in progress, killed the second brick of the replica 3 volume. Observation: No fuse mount crash. all worked well Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html |