Created attachment 577902 [details]
glusterd log file

Description of problem:
------------------------
On a distributed-replicate volume (4x3), performing add-brick and rebalance
(start, stop, status) volume operations followed by a restart of glusterd
resulted in a glusterd crash.

Note:
-----
From the generated core we can see that the volinfo referenced by
glusterd_defrag_notify (glusterd-rebalance.c:182) is corrupted:

#0  0x00007f7e5d0e2ed9 in glusterd_defrag_notify (rpc=0x670d20, mydata=0x673f80,
    event=RPC_CLNT_DISCONNECT, data=0x0) at glusterd-rebalance.c:182
182         if ((event == RPC_CLNT_DISCONNECT) && defrag->connected)
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6_2.3.x86_64 zlib-1.2.3-27.el6.x86_64

(gdb) bt full
#0  0x00007f7e5d0e2ed9 in glusterd_defrag_notify (rpc=0x670d20, mydata=0x673f80,
    event=RPC_CLNT_DISCONNECT, data=0x0) at glusterd-rebalance.c:182
        volinfo = 0x673f80
        defrag = 0x2
        ret = 0

(gdb) p *volinfo
$1 = {volname = "\320?g", '\000' <repeats 21 times>, "\r\360\255\272", '\000' <repeats 12 times>,
    "Q\000\000\000\000\000\000\000(\000\000\000\037\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000/etc/glusterd/vols/dstore/info\000\r\360\255\272\000\000\000\000\000Q\000\000\000\000\000\000\000v\000\000\000\030\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000p@g", '\000' <repeats 21 times>,
    "\r\360\255\272", '\000' <repeats 12 times>, "a\000\000\000\000\000\000\000(\000\000\000\"\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000/etc/gluste"...,
  type = 0, brick_count = 0, vol_list = {next = 0x0, prev = 0x0},
  bricks = {next = 0x0, prev = 0x0}, status = GLUSTERD_STATUS_NONE,
  sub_count = 0, stripe_count = 0, replica_count = 0, dist_leaf_count = 0,
  port = 0, shandle = 0x0, rb_shandle = 0x0,
  defrag_status = GF_DEFRAG_STATUS_NOT_STARTED, rebalance_files = 0,
  rebalance_data = 0, lookedup_files = 85899345920, defrag = 0x2,
  defrag_cmd = GF_DEFRAG_CMD_START, rebalance_failures = 3131961357,
  rb_status = GF_RB_STATUS_NONE, src_brick = 0x41, dst_brick = 0xb00000028,
  version = 0, cksum = 6710464, transport_type = GF_TRANSPORT_TCP,
  nfs_transport_type = 3405691582, dict = 0x0,
  volume_id = "management\000\r", <incomplete sequence \360\255\272>,
  auth = {username = 0x0, password = 0x41 <Address 0x41 out of bounds>},
  logdir = 0x66e730 "", gsync_slaves = 0x661f40,
  decommission_in_progress = 0, xl = 0x0, memory_accounting = 762081142}

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
mainline

Steps to Reproduce:
-------------------
The trusted storage pool has 3 machines: m1, m2, m3. (The volume operations
are sketched as CLI commands after this comment.)
1.  Create a distributed-replicate volume (2x3) and start it.
2.  Create fuse and nfs mounts.
3.  Run gfsc1.sh from the fuse mount.
4.  Run nfsc1.sh from the nfs mount.
5.  Add bricks to the volume.
6.  Start rebalance.
7.  Query rebalance status.
8.  Stop rebalance.
9.  Bring down 2 bricks from each replica set, so that one brick is online
    from each replica set.
10. Bring the bricks back online.
11. Start force rebalance.
12. Query rebalance status.
13. Stop rebalance.
Repeat steps 9 to 13 3-4 times.
14. Add bricks to the volume.
Repeat steps 9 to 13 3-4 times.
15. Stop the volume.
16. Restart the volume.
17. Kill glusterd on m1 and m2.
18. Restart glusterd on m1 and m2.

Actual results:
---------------
glusterd crashed on both m1 and m2.
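For reference, a minimal CLI sketch of the volume operations above, assuming
the dstore volume and the brick paths from the volume info below; repetition
counts and the brick set chosen for add-brick are illustrative, not a
verbatim transcript of the run:

VOL=dstore

# Steps 5/14: expand the volume by one replica set at a time.
gluster volume add-brick $VOL \
    192.168.2.35:/export1/dstore2 \
    192.168.2.36:/export1/dstore2 \
    192.168.2.37:/export1/dstore2

# Steps 6-8 and 11-13: cycle rebalance while the mounts stay loaded.
gluster volume rebalance $VOL start
gluster volume rebalance $VOL status
gluster volume rebalance $VOL stop
gluster volume rebalance $VOL start force
gluster volume rebalance $VOL status
gluster volume rebalance $VOL stop

# Steps 15-18: bounce the volume, then glusterd on m1 and m2
# (--mode=script skips the interactive confirmation on stop).
gluster --mode=script volume stop $VOL
gluster volume start $VOL
pkill glusterd            # on m1 and m2
glusterd                  # restart; the crash surfaces here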
Additional info:
-------------------
[04/17/12 - 17:09:51 root@APP-SERVER3 ~]# gluster volume info

Volume Name: dstore
Type: Distributed-Replicate
Volume ID: 90336962-3cd3-483b-917b-aee27cf34eff
Status: Started
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: 192.168.2.35:/export1/dstore1
Brick2: 192.168.2.36:/export1/dstore1
Brick3: 192.168.2.37:/export1/dstore1
Brick4: 192.168.2.35:/export2/dstore2
Brick5: 192.168.2.36:/export2/dstore2
Brick6: 192.168.2.37:/export2/dstore2
Brick7: 192.168.2.35:/export1/dstore2
Brick8: 192.168.2.36:/export1/dstore2
Brick9: 192.168.2.37:/export1/dstore2
Brick10: 192.168.2.35:/export2/dstore1
Brick11: 192.168.2.36:/export2/dstore1
Brick12: 192.168.2.37:/export2/dstore1
Options Reconfigured:
diagnostics.client-log-level: INFO
cluster.self-heal-daemon: off
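For steps 9-10: with this layout each replica triple spans 192.168.2.35/.36/.37,
so killing every dstore brick process on two of the three machines leaves
exactly one brick online per replica set. A sketch, assuming glusterfsd brick
processes carry the volume name in their arguments and that this build
supports "volume start force" to respawn only the dead bricks:

# On 192.168.2.35 and 192.168.2.36: take down all local dstore bricks.
pkill -f 'glusterfsd.*dstore'

# On any node: bring the killed bricks back online.
gluster volume start dstore force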
Created attachment 577903 [details] Backtrace of core
Created attachment 577904 [details] volume info file data on m1,m2,m3
Attaching scripts to run on the fuse and nfs mounts:

gfsc1.sh:
-----------
#!/bin/bash
# Create 10 top-level dirs, each with 20 subdirs of 100 files of
# increasing size, under the current (fuse) mount point.
mountpoint=`pwd`
for i in {1..10}
do
    level1_dir=$mountpoint/fuse2.$i
    mkdir $level1_dir
    cd $level1_dir
    for j in {1..20}
    do
        level2_dir=dir.$j
        mkdir $level2_dir
        cd $level2_dir
        for k in {1..100}
        do
            echo "Creating File: $level1_dir/$level2_dir/file.$k"
            dd if=/dev/zero of=file.$k bs=1M count=$k
        done
        cd $level1_dir
    done
    cd $mountpoint
done

nfsc1.sh:
----------
#!/bin/bash
# Same layout from the nfs mount, with 5 top-level dirs.
mountpoint=`pwd`
for i in {1..5}
do
    level1_dir=$mountpoint/nfs2.$i
    mkdir $level1_dir
    cd $level1_dir
    for j in {1..20}
    do
        level2_dir=dir.$j
        mkdir $level2_dir
        cd $level2_dir
        for k in {1..100}
        do
            echo "Creating File: $level1_dir/$level2_dir/file.$k"
            dd if=/dev/zero of=file.$k bs=1M count=$k
        done
        cd $level1_dir
    done
    cd $mountpoint
done
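For reference, both scripts create their directory trees relative to the
current working directory, so they are run from the respective mount points;
the mount paths here are assumptions, not taken from the report:

cd /mnt/dstore-fuse && bash gfsc1.sh
cd /mnt/dstore-nfs  && bash nfsc1.sh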
Not reproducible anymore. Removing blocker flag.
Not able to reproduce it again.