Bug 813169 - glusterd crashed when rebalance was in progress and performed stop/start volume
Status: CLOSED WORKSFORME
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Assigned To: krishnan parthasarathi
Depends On:
Blocks:
Reported: 2012-04-17 02:34 EDT by Shwetha Panduranga
Modified: 2015-11-03 18:04 EST (History)
5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-07-11 02:24:27 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---


Attachments (Terms of Use)
glusterd log file (411.07 KB, text/x-log)
2012-04-17 02:34 EDT, Shwetha Panduranga
Backtrace of core (6.66 KB, application/octet-stream)
2012-04-17 02:36 EDT, Shwetha Panduranga
volume info file data on m1,m2,m3 (2.65 KB, application/octet-stream)
2012-04-17 02:37 EDT, Shwetha Panduranga

Description Shwetha Panduranga 2012-04-17 02:34:32 EDT
Created attachment 577902 [details]
glusterd log file

Description of problem:
------------------------
On a distributed-replicate volume (4x3), performing add-brick and rebalance (start, stop, status) volume operations followed by a restart of glusterd resulted in a glusterd crash.

Note:-
--------
From the generated core we can see that the volinfo referred to by glusterd_defrag_notify (glusterd-rebalance.c:182) is corrupted.

#0  0x00007f7e5d0e2ed9 in glusterd_defrag_notify (rpc=0x670d20, mydata=0x673f80, event=RPC_CLNT_DISCONNECT, data=0x0) at glusterd-rebalance.c:182
182	        if ((event == RPC_CLNT_DISCONNECT) && defrag->connected)
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6_2.3.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt full
#0  0x00007f7e5d0e2ed9 in glusterd_defrag_notify (rpc=0x670d20, mydata=0x673f80, event=RPC_CLNT_DISCONNECT, data=0x0) at glusterd-rebalance.c:182
        volinfo = 0x673f80
        defrag = 0x2
        ret = 0

(gdb) p *volinfo
$1 = {
  volname = "\320?g", '\000' <repeats 21 times>, "\r\360\255\272", '\000' <repeats 12 times>, "Q\000\000\000\000\000\000\000(\000\000\000\037\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000/etc/glusterd/vols/dstore/info\000\r\360\255\272\000\000\000\000\000Q\000\000\000\000\000\000\000v\000\000\000\030\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000p@g", '\000' <repeats 21 times>, "\r\360\255\272", '\000' <repeats 12 times>, "a\000\000\000\000\000\000\000(\000\000\000\"\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000/etc/gluste"..., type = 0, brick_count = 0, vol_list = {next = 0x0, prev = 0x0}, bricks = {next = 0x0, prev = 0x0}, status = GLUSTERD_STATUS_NONE, 
  sub_count = 0, stripe_count = 0, replica_count = 0, dist_leaf_count = 0, port = 0, shandle = 0x0, rb_shandle = 0x0, defrag_status = GF_DEFRAG_STATUS_NOT_STARTED, 
  rebalance_files = 0, rebalance_data = 0, lookedup_files = 85899345920, defrag = 0x2, defrag_cmd = GF_DEFRAG_CMD_START, rebalance_failures = 3131961357, 
  rb_status = GF_RB_STATUS_NONE, src_brick = 0x41, dst_brick = 0xb00000028, version = 0, cksum = 6710464, transport_type = GF_TRANSPORT_TCP, 
  nfs_transport_type = 3405691582, dict = 0x0, volume_id = "management\000\r", <incomplete sequence \360\255\272>, auth = {username = 0x0, 
    password = 0x41 <Address 0x41 out of bounds>}, logdir = 0x66e730 "", gsync_slaves = 0x661f40, decommission_in_progress = 0, xl = 0x0, 
  memory_accounting = 762081142}
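A full backtrace like the one above can be pulled from the core non-interactively. A sketch only: the glusterd binary path and core file path below are placeholders, not taken from this report.

```shell
# Sketch: dump "bt full" and the suspect volinfo from a glusterd core file.
# Binary and core paths are placeholders; adjust to the actual crash artifacts.
CORE=/var/core/core.glusterd.12345
if command -v gdb >/dev/null 2>&1 && [ -f "$CORE" ]; then
    gdb /usr/sbin/glusterd "$CORE" -batch \
        -ex 'bt full' -ex 'frame 0' -ex 'p *volinfo' > glusterd-backtrace.txt
else
    echo "gdb or core file not available; nothing to do"
fi
```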


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
mainline

Steps to Reproduce:
---------------------
The trusted storage pool has 3 machines m1, m2, m3. 

1. Create a distribute-replicate volume (2x3). Start the volume.
2. Create fuse and nfs mounts.
3. Run gfsc1.sh from the fuse mount.
4. Run nfsc1.sh from the nfs mount.
5. Add bricks to the volume.
6. Start rebalance.
7. Query rebalance status.
8. Stop rebalance.
9. Bring down 2 bricks from each replica set, so that one brick is online from each replica set.
10. Bring the bricks back online.
11. Start rebalance with force.
12. Query rebalance status.
13. Stop rebalance.

Repeat step 9 to step 13 another 3-4 times.

14. Add bricks to the volume.

Repeat step 9 to step 13 another 3-4 times.

15. Stop the volume.
16. Restart the volume.
17. Kill glusterd on m1 and m2.
18. Restart glusterd on m1 and m2.
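Condensed into CLI form, the sequence above might look like the following. This is a sketch only: the host aliases m1-m3 and brick paths are illustrative, the client-script and brick up/down steps are elided, and the commands execute only when the gluster CLI is actually installed (run only against a disposable test pool).

```shell
# Hypothetical condensation of the reproduction steps into gluster CLI calls.
repro_steps=(
    "gluster volume create dstore replica 3 m1:/export1/dstore1 m2:/export1/dstore1 m3:/export1/dstore1 m1:/export2/dstore2 m2:/export2/dstore2 m3:/export2/dstore2"
    "gluster volume start dstore"
    "gluster volume add-brick dstore m1:/export1/dstore2 m2:/export1/dstore2 m3:/export1/dstore2"
    "gluster volume rebalance dstore start"
    "gluster volume rebalance dstore status"
    "gluster volume rebalance dstore stop"
    "gluster volume rebalance dstore start force"
    "gluster volume rebalance dstore stop"
    "gluster volume stop dstore"
    "gluster volume start dstore"
    "service glusterd restart"
)
# Only execute when the gluster CLI is present on this host.
if command -v gluster >/dev/null 2>&1; then
    for step in "${repro_steps[@]}"; do
        echo "+ $step"
        $step
    done
fi
```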

Actual results:
glusterd crashed on both m1 and m2.

Additional info:
-------------------
[04/17/12 - 17:09:51 root@APP-SERVER3 ~]# gluster volume info
 
Volume Name: dstore
Type: Distributed-Replicate
Volume ID: 90336962-3cd3-483b-917b-aee27cf34eff
Status: Started
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: 192.168.2.35:/export1/dstore1
Brick2: 192.168.2.36:/export1/dstore1
Brick3: 192.168.2.37:/export1/dstore1
Brick4: 192.168.2.35:/export2/dstore2
Brick5: 192.168.2.36:/export2/dstore2
Brick6: 192.168.2.37:/export2/dstore2
Brick7: 192.168.2.35:/export1/dstore2
Brick8: 192.168.2.36:/export1/dstore2
Brick9: 192.168.2.37:/export1/dstore2
Brick10: 192.168.2.35:/export2/dstore1
Brick11: 192.168.2.36:/export2/dstore1
Brick12: 192.168.2.37:/export2/dstore1
Options Reconfigured:
diagnostics.client-log-level: INFO
cluster.self-heal-daemon: off
Comment 1 Shwetha Panduranga 2012-04-17 02:36:10 EDT
Created attachment 577903 [details]
Backtrace of core
Comment 2 Shwetha Panduranga 2012-04-17 02:37:06 EDT
Created attachment 577904 [details]
volume info file data on m1,m2,m3
Comment 3 Shwetha Panduranga 2012-04-17 06:50:52 EDT
Attaching the scripts run on the fuse and nfs mounts:-

gfsc1.sh:-
-----------
#!/bin/bash

mountpoint=$(pwd)
for i in {1..10}
do
 level1_dir=$mountpoint/fuse2.$i
 mkdir "$level1_dir"
 cd "$level1_dir" || exit 1
 for j in {1..20}
 do
  level2_dir=dir.$j
  mkdir "$level2_dir"
  cd "$level2_dir" || exit 1
  for k in {1..100}
  do
   echo "Creating File: $level1_dir/$level2_dir/file.$k"
   dd if=/dev/zero of=file.$k bs=1M count=$k
  done
  cd "$level1_dir" || exit 1
 done
 cd "$mountpoint" || exit 1
done


nfsc1.sh:-
----------
#!/bin/bash

mountpoint=$(pwd)
for i in {1..5}
do
 level1_dir=$mountpoint/nfs2.$i
 mkdir "$level1_dir"
 cd "$level1_dir" || exit 1
 for j in {1..20}
 do
  level2_dir=dir.$j
  mkdir "$level2_dir"
  cd "$level2_dir" || exit 1
  for k in {1..100}
  do
   echo "Creating File: $level1_dir/$level2_dir/file.$k"
   dd if=/dev/zero of=file.$k bs=1M count=$k
  done
  cd "$level1_dir" || exit 1
 done
 cd "$mountpoint" || exit 1
done
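The two client scripts differ only in the top-level count and directory prefix. A parameterized sketch of the same tree-creation logic (hypothetical helper, not part of the report), using a small block size so it can be tried quickly on any machine:

```shell
#!/bin/bash
# Hypothetical generalization of gfsc1.sh/nfsc1.sh: under the current working
# directory, create <outer> top-level directories named <prefix>.N, each with
# <dirs> subdirectories holding <files> files of increasing size.
make_tree() {
    local prefix=$1 outer=$2 dirs=$3 files=$4
    local mountpoint
    mountpoint=$(pwd)
    for i in $(seq 1 "$outer"); do
        for j in $(seq 1 "$dirs"); do
            mkdir -p "$mountpoint/$prefix.$i/dir.$j"
            for k in $(seq 1 "$files"); do
                # 1 KB blocks instead of the original 1 MB keep the sketch fast.
                dd if=/dev/zero of="$mountpoint/$prefix.$i/dir.$j/file.$k" \
                   bs=1024 count="$k" status=none
            done
        done
    done
}
```

With this helper, `make_tree fuse2 10 20 100` recreates gfsc1.sh's layout and `make_tree nfs2 5 20 100` recreates nfsc1.sh's (with smaller files).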
Comment 4 Vijay Bellur 2012-05-18 09:09:44 EDT
Not reproducible anymore. Removing blocker flag.
Comment 5 Amar Tumballi 2012-07-11 02:24:27 EDT
Not able to reproduce it again.
