Bug 813169 - glusterd crashed when rebalance was in progress and performed stop/start volume
Summary: glusterd crashed when rebalance was in progress and performed stop/start volume
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: krishnan parthasarathi
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-04-17 06:34 UTC by Shwetha Panduranga
Modified: 2015-11-03 23:04 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-07-11 06:24:27 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments
glusterd log file (411.07 KB, text/x-log)
2012-04-17 06:34 UTC, Shwetha Panduranga
Backtrace of core (6.66 KB, application/octet-stream)
2012-04-17 06:36 UTC, Shwetha Panduranga
volume info file data on m1,m2,m3 (2.65 KB, application/octet-stream)
2012-04-17 06:37 UTC, Shwetha Panduranga

Description Shwetha Panduranga 2012-04-17 06:34:32 UTC
Created attachment 577902 [details]
glusterd log file

Description of problem:
------------------------
On a distributed-replicate volume (4x3), performing add-brick and rebalance (start, stop, status) operations, followed by a restart of glusterd, resulted in a glusterd crash.

Note:-
--------
From the generated core we can see that the volinfo referred to by glusterd_defrag_notify (glusterd-rebalance.c:182) is corrupted.

#0  0x00007f7e5d0e2ed9 in glusterd_defrag_notify (rpc=0x670d20, mydata=0x673f80, event=RPC_CLNT_DISCONNECT, data=0x0) at glusterd-rebalance.c:182
182	        if ((event == RPC_CLNT_DISCONNECT) && defrag->connected)
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6_2.3.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt full
#0  0x00007f7e5d0e2ed9 in glusterd_defrag_notify (rpc=0x670d20, mydata=0x673f80, event=RPC_CLNT_DISCONNECT, data=0x0) at glusterd-rebalance.c:182
        volinfo = 0x673f80
        defrag = 0x2
        ret = 0

(gdb) p *volinfo
$1 = {
  volname = "\320?g", '\000' <repeats 21 times>, "\r\360\255\272", '\000' <repeats 12 times>, "Q\000\000\000\000\000\000\000(\000\000\000\037\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000/etc/glusterd/vols/dstore/info\000\r\360\255\272\000\000\000\000\000Q\000\000\000\000\000\000\000v\000\000\000\030\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000p@g", '\000' <repeats 21 times>, "\r\360\255\272", '\000' <repeats 12 times>, "a\000\000\000\000\000\000\000(\000\000\000\"\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000/etc/gluste"..., type = 0, brick_count = 0, vol_list = {next = 0x0, prev = 0x0}, bricks = {next = 0x0, prev = 0x0}, status = GLUSTERD_STATUS_NONE, 
  sub_count = 0, stripe_count = 0, replica_count = 0, dist_leaf_count = 0, port = 0, shandle = 0x0, rb_shandle = 0x0, defrag_status = GF_DEFRAG_STATUS_NOT_STARTED, 
  rebalance_files = 0, rebalance_data = 0, lookedup_files = 85899345920, defrag = 0x2, defrag_cmd = GF_DEFRAG_CMD_START, rebalance_failures = 3131961357, 
  rb_status = GF_RB_STATUS_NONE, src_brick = 0x41, dst_brick = 0xb00000028, version = 0, cksum = 6710464, transport_type = GF_TRANSPORT_TCP, 
  nfs_transport_type = 3405691582, dict = 0x0, volume_id = "management\000\r", <incomplete sequence \360\255\272>, auth = {username = 0x0, 
    password = 0x41 <Address 0x41 out of bounds>}, logdir = 0x66e730 "", gsync_slaves = 0x661f40, decommission_in_progress = 0, xl = 0x0, 
  memory_accounting = 762081142}
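
The same state can be re-examined from the attached core with a gdb invocation along these lines (the glusterd binary path and the core file name are assumptions):

# dump the crashing frame and the corrupted volinfo from the attached core;
# the binary path and core file name below are assumptions
gdb --batch \
    -ex 'bt full' \
    -ex 'frame 0' \
    -ex 'p *volinfo' \
    -ex 'p volinfo->defrag' \
    /usr/sbin/glusterd /var/core/core.glusterd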


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
mainline

Steps to Reproduce:
---------------------
The trusted storage pool has 3 machines m1, m2, m3. 

1. create a distribute-replicate volume (2x3) and start it
2. create fuse and nfs mounts
3. run gfsc1.sh from the fuse mount
4. run nfsc1.sh from the nfs mount
5. add bricks to the volume
6. start rebalance
7. check rebalance status
8. stop rebalance
9. bring down 2 bricks from each replica set, so that one brick is online from each replica set
10. bring the bricks back online
11. start rebalance with force
12. query rebalance status
13. stop rebalance

Repeat steps 9 to 13 three to four times (a shell sketch of this loop and the final restarts follows the list).

14. add bricks to the volume

Repeat steps 9 to 13 three to four times.

15. stop the volume
16. restart the volume
17. kill glusterd on m1 and m2
18. restart glusterd on m1 and m2.
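
For reference, a minimal shell sketch of the stress loop (steps 9 to 13) and the final restarts (steps 15 to 18). The volume name and brick paths are taken from the volume info in the Additional info section; the pkill-based brick kill, all paths, and host placement are assumptions, not the exact commands used by the reporter.

#!/bin/bash
# Rough sketch of the rebalance stress loop and the final restarts, meant to
# be run on one pool member (assumed to be m1).
VOL=dstore

for round in {1..4}
do
 # step 9 (local host only): kill this host's brick processes; the original
 # step takes down two bricks from each replica set across the pool, so the
 # same kill would be repeated on a second host
 pkill -f "glusterfsd.*/export1/dstore1"
 pkill -f "glusterfsd.*/export2/dstore2"

 # step 10: bring the downed bricks back online
 gluster volume start $VOL force

 # steps 11-13: force-start, query and stop rebalance
 gluster volume rebalance $VOL start force
 gluster volume rebalance $VOL status
 gluster volume rebalance $VOL stop
done

# steps 15-16: stop and restart the volume (answer the confirmation prompt)
gluster volume stop $VOL
gluster volume start $VOL

# steps 17-18: kill and restart glusterd on this host; repeat on m2
pkill glusterd
glusterd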

Actual results:
glusterd crashed on both m1,m2. 

Additional info:
-------------------
[04/17/12 - 17:09:51 root@APP-SERVER3 ~]# gluster volume info
 
Volume Name: dstore
Type: Distributed-Replicate
Volume ID: 90336962-3cd3-483b-917b-aee27cf34eff
Status: Started
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: 192.168.2.35:/export1/dstore1
Brick2: 192.168.2.36:/export1/dstore1
Brick3: 192.168.2.37:/export1/dstore1
Brick4: 192.168.2.35:/export2/dstore2
Brick5: 192.168.2.36:/export2/dstore2
Brick6: 192.168.2.37:/export2/dstore2
Brick7: 192.168.2.35:/export1/dstore2
Brick8: 192.168.2.36:/export1/dstore2
Brick9: 192.168.2.37:/export1/dstore2
Brick10: 192.168.2.35:/export2/dstore1
Brick11: 192.168.2.36:/export2/dstore1
Brick12: 192.168.2.37:/export2/dstore1
Options Reconfigured:
diagnostics.client-log-level: INFO
cluster.self-heal-daemon: off

Comment 1 Shwetha Panduranga 2012-04-17 06:36:10 UTC
Created attachment 577903 [details]
Backtrace of core

Comment 2 Shwetha Panduranga 2012-04-17 06:37:06 UTC
Created attachment 577904 [details]
volume info file data on m1,m2,m3

Comment 3 Shwetha Panduranga 2012-04-17 10:50:52 UTC
Attaching scripts to run on fuse, nfs mounts:-

gfsc1.sh:-
-----------
#!/bin/bash

mountpoint=`pwd`
for i in {1..10}
do
 level1_dir=$mountpoint/fuse2.$i
 mkdir $level1_dir
 cd $level1_dir
 for j in {1..20}
 do 
  level2_dir=dir.$j
  mkdir $level2_dir
  cd $level2_dir
  for k in {1..100}
  do 
   echo "Creating File: $leve1_dir/$level2_dir/file.$k"
   dd if=/dev/zero of=file.$k bs=1M count=$k 
  done
  cd $level1_dir
 done
 cd $mountpoint
done


nfsc1.sh:-
----------
#!/bin/bash

mountpoint=`pwd`
for i in {1..5}
do 
 level1_dir=$mountpoint/nfs2.$i
 mkdir $level1_dir
 cd $level1_dir
 for j in {1..20}
 do 
  level2_dir=dir.$j
  mkdir $level2_dir
  cd $level2_dir

  for k in {1..100}
  do 
   echo "Creating File: $leve1_dir/$level2_dir/file.$k"
   dd if=/dev/zero of=file.$k bs=1M count=$k

  done
  cd $level1_dir
 done
 cd $mountpoint
done
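
Both scripts take the current directory as the mount point (mountpoint=`pwd`), so they should be started from inside the respective mounts, for example (mount paths and script locations are assumptions):

# run the data generators from inside the fuse and nfs mount points;
# /mnt/gfs, /mnt/nfs and the script locations are assumed
cd /mnt/gfs && bash /root/gfsc1.sh
cd /mnt/nfs && bash /root/nfsc1.sh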

Comment 4 Vijay Bellur 2012-05-18 13:09:44 UTC
Not reproducible anymore. Removing blocker flag.

Comment 5 Amar Tumballi 2012-07-11 06:24:27 UTC
Not able to reproduce it again.

