Bug 813169

Summary: glusterd crashed when rebalance was in progress and performed stop/start volume
Product: [Community] GlusterFS
Reporter: Shwetha Panduranga <shwetha.h.panduranga>
Component: glusterd
Assignee: krishnan parthasarathi <kparthas>
Status: CLOSED WORKSFORME
Severity: high
Priority: high
Version: mainline
CC: amarts, gluster-bugs, nsathyan, vbellur, vinaraya
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2012-07-11 02:24:27 EDT
Type: Bug
Attachments:
- glusterd log file
- Backtrace of core
- volume info file data on m1, m2, m3

Description Shwetha Panduranga 2012-04-17 02:34:32 EDT
Created attachment 577902 [details]
glusterd log file

Description of problem:
------------------------
On a distributed-replicate volume (4x3), performing add-brick and rebalance (start, stop, status) operations, stopping/starting the volume, and then restarting glusterd resulted in a glusterd crash.

Note:-
--------
From the generated core we can see that the volinfo referenced by glusterd_defrag_notify (glusterd-rebalance.c:182) is corrupted.

#0  0x00007f7e5d0e2ed9 in glusterd_defrag_notify (rpc=0x670d20, mydata=0x673f80, event=RPC_CLNT_DISCONNECT, data=0x0) at glusterd-rebalance.c:182
182	        if ((event == RPC_CLNT_DISCONNECT) && defrag->connected)
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6_2.3.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt full
#0  0x00007f7e5d0e2ed9 in glusterd_defrag_notify (rpc=0x670d20, mydata=0x673f80, event=RPC_CLNT_DISCONNECT, data=0x0) at glusterd-rebalance.c:182
        volinfo = 0x673f80
        defrag = 0x2
        ret = 0

(gdb) p *volinfo
$1 = {
  volname = "\320?g", '\000' <repeats 21 times>, "\r\360\255\272", '\000' <repeats 12 times>, "Q\000\000\000\000\000\000\000(\000\000\000\037\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000/etc/glusterd/vols/dstore/info\000\r\360\255\272\000\000\000\000\000Q\000\000\000\000\000\000\000v\000\000\000\030\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000p@g", '\000' <repeats 21 times>, "\r\360\255\272", '\000' <repeats 12 times>, "a\000\000\000\000\000\000\000(\000\000\000\"\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000/etc/gluste"..., type = 0, brick_count = 0, vol_list = {next = 0x0, prev = 0x0}, bricks = {next = 0x0, prev = 0x0}, status = GLUSTERD_STATUS_NONE, 
  sub_count = 0, stripe_count = 0, replica_count = 0, dist_leaf_count = 0, port = 0, shandle = 0x0, rb_shandle = 0x0, defrag_status = GF_DEFRAG_STATUS_NOT_STARTED, 
  rebalance_files = 0, rebalance_data = 0, lookedup_files = 85899345920, defrag = 0x2, defrag_cmd = GF_DEFRAG_CMD_START, rebalance_failures = 3131961357, 
  rb_status = GF_RB_STATUS_NONE, src_brick = 0x41, dst_brick = 0xb00000028, version = 0, cksum = 6710464, transport_type = GF_TRANSPORT_TCP, 
  nfs_transport_type = 3405691582, dict = 0x0, volume_id = "management\000\r", <incomplete sequence \360\255\272>, auth = {username = 0x0, 
    password = 0x41 <Address 0x41 out of bounds>}, logdir = 0x66e730 "", gsync_slaves = 0x661f40, decommission_in_progress = 0, xl = 0x0, 
  memory_accounting = 762081142}
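
For reference, the state above can be inspected from the core roughly as follows (the glusterd binary path and core file name are assumptions, not taken from this report):

# Sketch only: open the core and look at the volinfo the callback received.
gdb -batch -ex 'bt full' -ex 'print *volinfo' -ex 'print volinfo->defrag' \
    /usr/sbin/glusterd /path/to/core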


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
mainline

Steps to Reproduce:
---------------------
The trusted storage pool has three machines: m1, m2, and m3.

1. Create a distribute-replicate volume (2x3) and start it.
2. Create fuse and nfs mounts.
3. Run gfsc1.sh from the fuse mount.
4. Run nfsc1.sh from the nfs mount.
5. Add bricks to the volume.
6. Start rebalance.
7. Check rebalance status.
8. Stop rebalance.
9. Bring down 2 bricks from each replica set, so that one brick stays online in each replica set.
10. Bring the bricks back online.
11. Start rebalance with force.
12. Query rebalance status.
13. Stop rebalance.

Repeat steps 9 to 13 three or four times.

14. Add bricks to the volume.

Repeat steps 9 to 13 three or four times.

15. Stop the volume.
16. Start the volume again.
17. Kill glusterd on m1 and m2.
18. Restart glusterd on m1 and m2.

(A sketch of the corresponding gluster CLI commands follows this list.)
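
A rough sketch of the CLI commands behind these steps. The volume name, host IPs, and brick paths are taken from the volume info below; which six bricks formed the original 2x3 set is an assumption, and the glusterd init script name may differ on other systems:

# Sketch only, not an exact transcript of the original run.
gluster volume create dstore replica 3 transport tcp \
    192.168.2.35:/export1/dstore1 192.168.2.36:/export1/dstore1 192.168.2.37:/export1/dstore1 \
    192.168.2.35:/export2/dstore2 192.168.2.36:/export2/dstore2 192.168.2.37:/export2/dstore2
gluster volume start dstore

# step 5 (and again at step 14 with the remaining three bricks)
gluster volume add-brick dstore \
    192.168.2.35:/export1/dstore2 192.168.2.36:/export1/dstore2 192.168.2.37:/export1/dstore2

# steps 6-8
gluster volume rebalance dstore start
gluster volume rebalance dstore status
gluster volume rebalance dstore stop

# steps 11-13, after bringing bricks down and back up
gluster volume rebalance dstore start force
gluster volume rebalance dstore status
gluster volume rebalance dstore stop

# steps 15-18
gluster volume stop dstore
gluster volume start dstore
pkill glusterd                # on m1 and m2
/etc/init.d/glusterd start    # on m1 and m2 (init script name assumed)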

Actual results:
glusterd crashed on both m1 and m2.

Additional info:
-------------------
[04/17/12 - 17:09:51 root@APP-SERVER3 ~]# gluster volume info
 
Volume Name: dstore
Type: Distributed-Replicate
Volume ID: 90336962-3cd3-483b-917b-aee27cf34eff
Status: Started
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: 192.168.2.35:/export1/dstore1
Brick2: 192.168.2.36:/export1/dstore1
Brick3: 192.168.2.37:/export1/dstore1
Brick4: 192.168.2.35:/export2/dstore2
Brick5: 192.168.2.36:/export2/dstore2
Brick6: 192.168.2.37:/export2/dstore2
Brick7: 192.168.2.35:/export1/dstore2
Brick8: 192.168.2.36:/export1/dstore2
Brick9: 192.168.2.37:/export1/dstore2
Brick10: 192.168.2.35:/export2/dstore1
Brick11: 192.168.2.36:/export2/dstore1
Brick12: 192.168.2.37:/export2/dstore1
Options Reconfigured:
diagnostics.client-log-level: INFO
cluster.self-heal-daemon: off
Comment 1 Shwetha Panduranga 2012-04-17 02:36:10 EDT
Created attachment 577903 [details]
Backtrace of core
Comment 2 Shwetha Panduranga 2012-04-17 02:37:06 EDT
Created attachment 577904 [details]
volume info file data on m1,m2,m3
Comment 3 Shwetha Panduranga 2012-04-17 06:50:52 EDT
Attaching the scripts that were run on the fuse and nfs mounts:-

gfsc1.sh:-
-----------
#!/bin/bash
# Create a 10 x 20 x 100 directory/file tree from the fuse mount point.

mountpoint=$(pwd)
for i in {1..10}
do
 level1_dir=$mountpoint/fuse2.$i
 mkdir "$level1_dir"
 cd "$level1_dir"
 for j in {1..20}
 do
  level2_dir=dir.$j
  mkdir "$level2_dir"
  cd "$level2_dir"
  for k in {1..100}
  do
   echo "Creating File: $level1_dir/$level2_dir/file.$k"
   dd if=/dev/zero of=file.$k bs=1M count=$k
  done
  cd "$level1_dir"
 done
 cd "$mountpoint"
done


nfsc1.sh:-
----------
#!/bin/bash
# Same workload as gfsc1.sh, run from the nfs mount point (5 x 20 x 100 tree).

mountpoint=$(pwd)
for i in {1..5}
do
 level1_dir=$mountpoint/nfs2.$i
 mkdir "$level1_dir"
 cd "$level1_dir"
 for j in {1..20}
 do
  level2_dir=dir.$j
  mkdir "$level2_dir"
  cd "$level2_dir"
  for k in {1..100}
  do
   echo "Creating File: $level1_dir/$level2_dir/file.$k"
   dd if=/dev/zero of=file.$k bs=1M count=$k
  done
  cd "$level1_dir"
 done
 cd "$mountpoint"
done
Comment 4 Vijay Bellur 2012-05-18 09:09:44 EDT
Not reproducible anymore. Removing blocker flag.
Comment 5 Amar Tumballi 2012-07-11 02:24:27 EDT
Not able to reproduce it again.