| Summary: | glusterd crashed when rebalance was in progress and performed stop/start volume | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Shwetha Panduranga <shwetha.h.panduranga> |
| Component: | glusterd | Assignee: | krishnan parthasarathi <kparthas> |
| Status: | CLOSED WORKSFORME | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | mainline | CC: | amarts, gluster-bugs, nsathyan, vbellur, vinaraya |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-07-11 06:24:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | glusterd log file; Backtrace of core; volume info file data on m1, m2, m3 | | |
Created attachment 577903 [details]
Backtrace of core
Created attachment 577904 [details]
volume info file data on m1,m2,m3
Attaching the scripts run on the fuse and NFS mounts (an example invocation follows the second script):
gfsc1.sh:-
-----------
#!/bin/bash
# Creates 10 top-level directories on the fuse mount, each with 20
# subdirectories holding 100 files of increasing size (1M..100M).
mountpoint=`pwd`
for i in {1..10}
do
    level1_dir=$mountpoint/fuse2.$i
    mkdir $level1_dir
    cd $level1_dir
    for j in {1..20}
    do
        level2_dir=dir.$j
        mkdir $level2_dir
        cd $level2_dir
        for k in {1..100}
        do
            echo "Creating File: $level1_dir/$level2_dir/file.$k"
            dd if=/dev/zero of=file.$k bs=1M count=$k
        done
        cd $level1_dir
    done
    cd $mountpoint
done
nfsc1.sh:-
----------
#!/bin/bash
# Creates 5 top-level directories on the NFS mount, each with 20
# subdirectories holding 100 files of increasing size (1M..100M).
mountpoint=`pwd`
for i in {1..5}
do
    level1_dir=$mountpoint/nfs2.$i
    mkdir $level1_dir
    cd $level1_dir
    for j in {1..20}
    do
        level2_dir=dir.$j
        mkdir $level2_dir
        cd $level2_dir
        for k in {1..100}
        do
            echo "Creating File: $level1_dir/$level2_dir/file.$k"
            dd if=/dev/zero of=file.$k bs=1M count=$k
        done
        cd $level1_dir
    done
    cd $mountpoint
done
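Both scripts take the mount point from `pwd`, so each one must be started from inside its mount. A minimal invocation sketch, assuming hypothetical mount points /mnt/fuse and /mnt/nfs and a hypothetical script location (the report does not record either):
# hypothetical paths; run each script from inside its mount point
cd /mnt/fuse && bash /root/gfsc1.sh
cd /mnt/nfs && bash /root/nfsc1.sh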
Not reproducible anymore; removing the blocker flag.
Created attachment 577902 [details]
glusterd log file

Description of problem:
------------------------
On a distributed-replicate volume (4x3), performing add-brick and rebalance (start, stop, status) volume operations followed by a restart of glusterd resulted in a glusterd crash.

Note:-
--------
From the core generated we can see that the volinfo referred to by glusterd_defrag_notify (glusterd-rebalance.c:182) is corrupted.

#0  0x00007f7e5d0e2ed9 in glusterd_defrag_notify (rpc=0x670d20, mydata=0x673f80, event=RPC_CLNT_DISCONNECT, data=0x0) at glusterd-rebalance.c:182
182             if ((event == RPC_CLNT_DISCONNECT) && defrag->connected)
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6_2.3.x86_64 zlib-1.2.3-27.el6.x86_64

(gdb) bt full
#0  0x00007f7e5d0e2ed9 in glusterd_defrag_notify (rpc=0x670d20, mydata=0x673f80, event=RPC_CLNT_DISCONNECT, data=0x0) at glusterd-rebalance.c:182
        volinfo = 0x673f80
        defrag = 0x2
        ret = 0

(gdb) p *volinfo
$1 = {volname = "\320?g", '\000' <repeats 21 times>, "\r\360\255\272", '\000' <repeats 12 times>, "Q\000\000\000\000\000\000\000(\000\000\000\037\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000/etc/glusterd/vols/dstore/info\000\r\360\255\272\000\000\000\000\000Q\000\000\000\000\000\000\000v\000\000\000\030\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000p@g", '\000' <repeats 21 times>, "\r\360\255\272", '\000' <repeats 12 times>, "a\000\000\000\000\000\000\000(\000\000\000\"\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000/etc/gluste"...,
    type = 0, brick_count = 0, vol_list = {next = 0x0, prev = 0x0}, bricks = {next = 0x0, prev = 0x0}, status = GLUSTERD_STATUS_NONE,
    sub_count = 0, stripe_count = 0, replica_count = 0, dist_leaf_count = 0, port = 0, shandle = 0x0, rb_shandle = 0x0,
    defrag_status = GF_DEFRAG_STATUS_NOT_STARTED, rebalance_files = 0, rebalance_data = 0, lookedup_files = 85899345920, defrag = 0x2,
    defrag_cmd = GF_DEFRAG_CMD_START, rebalance_failures = 3131961357, rb_status = GF_RB_STATUS_NONE, src_brick = 0x41,
    dst_brick = 0xb00000028, version = 0, cksum = 6710464, transport_type = GF_TRANSPORT_TCP, nfs_transport_type = 3405691582, dict = 0x0,
    volume_id = "management\000\r", <incomplete sequence \360\255\272>, auth = {username = 0x0, password = 0x41 <Address 0x41 out of bounds>},
    logdir = 0x66e730 "", gsync_slaves = 0x661f40, decommission_in_progress = 0, xl = 0x0, memory_accounting = 762081142}

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
mainline

Steps to Reproduce:
---------------------
The trusted storage pool has 3 machines: m1, m2, m3.
1. Create a distribute-replicate volume (2x3) and start the volume.
2. Create fuse and NFS mounts.
3. Run gfsc1.sh from the fuse mount.
4. Run nfsc1.sh from the NFS mount.
5. Add bricks to the volume.
6. Start rebalance.
7. Check rebalance status.
8. Stop rebalance.
9. Bring down 2 bricks from each replica set, so that one brick remains online in each replica set.
10. Bring the bricks back online.
11. Start rebalance with force.
12. Query rebalance status.
13. Stop rebalance.
Repeat steps 9 to 13 3-4 times.
14. Add bricks to the volume.
Repeat steps 9 to 13 3-4 times.
15. Stop the volume.
16. Start the volume again.
17. Kill glusterd on m1 and m2.
18. Restart glusterd on m1 and m2.
(A sketch of the corresponding CLI commands is given after the Actual results below.)

Actual results:
glusterd crashed on both m1 and m2.
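The report does not record the exact commands used for the steps above; the following is only a minimal sketch of the usual CLI, assuming the volume name dstore and the brick paths listed under Additional info below. The brick ordering, mount points, and service commands are assumptions.
# Hedged sketch; brick ordering, mount points and service commands are assumptions.
# Steps 1-2: create and start a 2x3 distributed-replicate volume, then mount it
gluster volume create dstore replica 3 \
    192.168.2.35:/export1/dstore1 192.168.2.36:/export1/dstore1 192.168.2.37:/export1/dstore1 \
    192.168.2.35:/export2/dstore2 192.168.2.36:/export2/dstore2 192.168.2.37:/export2/dstore2
gluster volume start dstore
mount -t glusterfs 192.168.2.35:/dstore /mnt/fuse
mount -t nfs -o vers=3 192.168.2.35:/dstore /mnt/nfs

# Steps 5/14: grow the volume by one replica set at a time
gluster volume add-brick dstore \
    192.168.2.35:/export1/dstore2 192.168.2.36:/export1/dstore2 192.168.2.37:/export1/dstore2

# Steps 6-13: rebalance cycle, repeated while bricks are taken down and brought back
gluster volume rebalance dstore start
gluster volume rebalance dstore status
gluster volume rebalance dstore stop
gluster volume rebalance dstore start force

# Steps 15-16: stop/start the volume
gluster volume stop dstore
gluster volume start dstore

# Steps 17-18: on m1 and m2, kill and restart the glusterd daemon
pkill glusterd
service glusterd start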
Additional info:
-------------------
[04/17/12 - 17:09:51 root@APP-SERVER3 ~]# gluster volume info

Volume Name: dstore
Type: Distributed-Replicate
Volume ID: 90336962-3cd3-483b-917b-aee27cf34eff
Status: Started
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: 192.168.2.35:/export1/dstore1
Brick2: 192.168.2.36:/export1/dstore1
Brick3: 192.168.2.37:/export1/dstore1
Brick4: 192.168.2.35:/export2/dstore2
Brick5: 192.168.2.36:/export2/dstore2
Brick6: 192.168.2.37:/export2/dstore2
Brick7: 192.168.2.35:/export1/dstore2
Brick8: 192.168.2.36:/export1/dstore2
Brick9: 192.168.2.37:/export1/dstore2
Brick10: 192.168.2.35:/export2/dstore1
Brick11: 192.168.2.36:/export2/dstore1
Brick12: 192.168.2.37:/export2/dstore1
Options Reconfigured:
diagnostics.client-log-level: INFO
cluster.self-heal-daemon: off
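For reference, the two reconfigured options shown above are normally applied with `gluster volume set`; the report does not show the commands actually used, so this is only a sketch:
# sketch of how the "Options Reconfigured" values would typically be set
gluster volume set dstore diagnostics.client-log-level INFO
gluster volume set dstore cluster.self-heal-daemon off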