Description of problem:
When converting a replica volume (2 bricks), to a distribute-replica, gluster crashes.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
everytime

Steps to Reproduce:
1. create a replica vol with 2 bricks
2. add a new replica pair

Actual results:
gluster> volume create new replica 2 sng:/export/dir1 sng:/export/dir2
Multiple bricks of a replicate volume are present on the same server. This setup is not optimal.
Do you still want to continue creating the volume? (y/n) y
Creation of volume new has been successful. Please start the volume to access data.
gluster> volume start new
Starting volume new has been successful
gluster> volume add-brick new replica 2 sng:/export/dir3 sng:/export/dir4
gluster> volume info

Expected results:

Additional info:
bt:
Core was generated by `glusterd'.
Program terminated with signal 8, Arithmetic exception.
#0  0x00007f607a67f8c4 in add_brick_at_right_order (brickinfo=0x1a56800, volinfo=0x1a4ba80, count=0, stripe_cnt=0, replica_cnt=2) at glusterd-brick-ops.c:79
79          idx = (count / (replica_cnt - sub_cnt) * sub_cnt) +
(gdb) bt
#0  0x00007f607a67f8c4 in add_brick_at_right_order (brickinfo=0x1a56800, volinfo=0x1a4ba80, count=0, stripe_cnt=0, replica_cnt=2) at glusterd-brick-ops.c:79
#1  0x00007f607a6825d8 in glusterd_op_perform_add_bricks (volinfo=0x1a4ba80, count=2, bricks=0x1a54c50 " sng:/export/dir3 sng:/export/dir4 ", dict=0x1a55b10) at glusterd-brick-ops.c:826
#2  0x00007f607a683dd7 in glusterd_op_add_brick (dict=0x1a55b10, op_errstr=0x7fff14fda620) at glusterd-brick-ops.c:1327
#3  0x00007f607a63424d in glusterd_op_commit_perform (op=GD_OP_ADD_BRICK, dict=0x1a55b10, op_errstr=0x7fff14fda620, rsp_dict=0x0) at glusterd-op-sm.c:2304
#4  0x00007f607a63257b in glusterd_op_ac_send_commit_op (event=0x1a55a40, ctx=0x1a3f3a0) at glusterd-op-sm.c:1681
#5  0x00007f607a6369ca in glusterd_op_sm () at glusterd-op-sm.c:3321
#6  0x00007f607a680f3e in glusterd_handle_add_brick (req=0x7f607a54d02c) at glusterd-brick-ops.c:489
#7  0x00007f607d35f1ef in rpcsvc_handle_rpc_call (svc=0x1a44ec0, trans=0x1a48800, msg=0x1a50da0) at rpcsvc.c:507
#8  0x00007f607d35f56c in rpcsvc_notify (trans=0x1a48800, mydata=0x1a44ec0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x1a50da0) at rpcsvc.c:603
#9  0x00007f607d36518c in rpc_transport_notify (this=0x1a48800, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x1a50da0) at rpc-transport.c:498
#10 0x00007f607a3412e7 in socket_event_poll_in (this=0x1a48800) at socket.c:1675
#11 0x00007f607a341850 in socket_event_handler (fd=14, idx=7, data=0x1a48800, poll_in=1, poll_out=0, poll_err=0) at socket.c:1790
#12 0x00007f607d5b9c6c in event_dispatch_epoll_handler (event_pool=0x1a3e500, events=0x1a47be0, i=0) at event.c:794
#13 0x00007f607d5b9e7f in event_dispatch_epoll (event_pool=0x1a3e500) at event.c:856
#14 0x00007f607d5ba1f2 in event_dispatch (event_pool=0x1a3e500) at event.c:956
#15 0x0000000000407d1e in main (argc=1, argv=0x7fff14fdb468) at glusterfsd.c:1601
(gdb) p replica_cnt
$1 = 2
(gdb) p sub_cnt
$2 = 2
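In the fix for 765774, the code that sets the op dictionary with the modified stripe and replica counts was (accidentally) removed, which broke correctness. The stage and op functions of add-brick rely on the stripe/replica count being zero when it is not changed by the add-brick operation. Here replica_cnt arrived as 2 instead of 0 while sub_cnt was also 2, so the divisor (replica_cnt - sub_cnt) at glusterd-brick-ops.c:79 was zero, hence the arithmetic exception.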
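The invariant described above can be sketched roughly as follows. This is only an illustration, not the code from the merged change: the helper names count_for_op_dict, divisor_is_safe and layout_changes are hypothetical, and the idea that a caller skips the reordering step when both counts are zero is an assumption drawn from the comment above.

```c
/* Hypothetical helper: the value to store in the op dictionary.
 * 0 means "this count is unchanged by the add-brick operation". */
static int count_for_op_dict(int requested_cnt, int current_cnt)
{
    return (requested_cnt == current_cnt) ? 0 : requested_cnt;
}

/* Hypothetical guard for the divisor seen in the backtrace,
 * (replica_cnt - sub_cnt). Returns 1 when dividing by it is safe. */
static int divisor_is_safe(int replica_cnt, int sub_cnt)
{
    return (replica_cnt - sub_cnt) != 0;
}

/* Assumed caller-side check: only reorder bricks when the stripe or
 * replica count actually changes (i.e. at least one is non-zero). */
static int layout_changes(int stripe_cnt, int replica_cnt)
{
    return stripe_cnt != 0 || replica_cnt != 0;
}
```

With the values from the core dump (replica_cnt = 2, sub_cnt = 2), divisor_is_safe(2, 2) is 0, which matches the signal 8. Normalising first gives count_for_op_dict(2, 2) == 0, so layout_changes(0, 0) is 0 and the division would never be reached.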
CHANGE: http://review.gluster.com/2548 (glusterd: Fixed add-brick handler algorithm.) merged in master by Vijay Bellur (vijay)
Checked on release-3.3. glusterd no longer crashes on add-brick. The following are the steps followed and the output obtained:

gluster> volume create test replica 2 arch:/export/test1 arch:/export/test2
Multiple bricks of a replicate volume are present on the same server. This setup is not optimal.
Do you still want to continue creating the volume? (y/n) y
Creation of volume test has been successful. Please start the volume to access data.
gluster> volume add-brick test arch:/export/test3 arch:/export/test4
Add Brick successful
gluster>