Bug 813169

Summary: glusterd crashed when rebalance was in progress and performed stop/start volume
Product: [Community] GlusterFS
Reporter: Shwetha Panduranga <shwetha.h.panduranga>
Component: glusterd
Assignee: krishnan parthasarathi <kparthas>
Status: CLOSED WORKSFORME
Severity: high
Priority: high
Version: mainline
CC: amarts, gluster-bugs, nsathyan, vbellur, vinaraya
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2012-07-11 02:24:27 EDT
Type: Bug
Attachments:
- glusterd log file
- Backtrace of core
- volume info file data on m1, m2, m3

Description Shwetha Panduranga 2012-04-17 02:34:32 EDT
Created attachment 577902 [details]
glusterd log file

Description of problem:
------------------------
On a distributed-replicate volume (4x3), performing add-brick and rebalance (start, stop, status) operations, stopping/starting the volume, and then restarting glusterd resulted in a glusterd crash.

Note:-
--------
From the generated core we can see that the volinfo referenced by glusterd_defrag_notify (glusterd-rebalance.c:182) is corrupted.

#0  0x00007f7e5d0e2ed9 in glusterd_defrag_notify (rpc=0x670d20, mydata=0x673f80, event=RPC_CLNT_DISCONNECT, data=0x0) at glusterd-rebalance.c:182
182	        if ((event == RPC_CLNT_DISCONNECT) && defrag->connected)
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6_2.3.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt full
#0  0x00007f7e5d0e2ed9 in glusterd_defrag_notify (rpc=0x670d20, mydata=0x673f80, event=RPC_CLNT_DISCONNECT, data=0x0) at glusterd-rebalance.c:182
        volinfo = 0x673f80
        defrag = 0x2
        ret = 0

(gdb) p *volinfo
$1 = {
  volname = "\320?g", '\000' <repeats 21 times>, "\r\360\255\272", '\000' <repeats 12 times>, "Q\000\000\000\000\000\000\000(\000\000\000\037\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000/etc/glusterd/vols/dstore/info\000\r\360\255\272\000\000\000\000\000Q\000\000\000\000\000\000\000v\000\000\000\030\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000p@g", '\000' <repeats 21 times>, "\r\360\255\272", '\000' <repeats 12 times>, "a\000\000\000\000\000\000\000(\000\000\000\"\000\000\000\000\000\000\000\300df\000\000\000\000\000\276\272\376\312\000\000\000\000\000\000\000\000/etc/gluste"..., type = 0, brick_count = 0, vol_list = {next = 0x0, prev = 0x0}, bricks = {next = 0x0, prev = 0x0}, status = GLUSTERD_STATUS_NONE, 
  sub_count = 0, stripe_count = 0, replica_count = 0, dist_leaf_count = 0, port = 0, shandle = 0x0, rb_shandle = 0x0, defrag_status = GF_DEFRAG_STATUS_NOT_STARTED, 
  rebalance_files = 0, rebalance_data = 0, lookedup_files = 85899345920, defrag = 0x2, defrag_cmd = GF_DEFRAG_CMD_START, rebalance_failures = 3131961357, 
  rb_status = GF_RB_STATUS_NONE, src_brick = 0x41, dst_brick = 0xb00000028, version = 0, cksum = 6710464, transport_type = GF_TRANSPORT_TCP, 
  nfs_transport_type = 3405691582, dict = 0x0, volume_id = "management\000\r", <incomplete sequence \360\255\272>, auth = {username = 0x0, 
    password = 0x41 <Address 0x41 out of bounds>}, logdir = 0x66e730 "", gsync_slaves = 0x661f40, decommission_in_progress = 0, xl = 0x0, 
  memory_accounting = 762081142}
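
For reference, the state above can be inspected from the core roughly as follows (the glusterd binary path and core file name are assumptions, not taken from this report):

# Sketch only: open the core and look at the volinfo the callback received.
gdb -batch -ex 'bt full' -ex 'print *volinfo' -ex 'print volinfo->defrag' \
    /usr/sbin/glusterd /path/to/core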


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
mainline

Steps to Reproduce:
---------------------
The trusted storage pool has three machines: m1, m2, and m3.

1. Create a distribute-replicate volume (2x3) and start it.
2. Create fuse and nfs mounts.
3. Run gfsc1.sh from the fuse mount.
4. Run nfsc1.sh from the nfs mount.
5. Add bricks to the volume.
6. Start rebalance.
7. Check rebalance status.
8. Stop rebalance.
9. Bring down 2 bricks from each replica set, so that one brick stays online in each replica set.
10. Bring the bricks back online.
11. Start rebalance with force.
12. Query rebalance status.
13. Stop rebalance.

Repeat steps 9 to 13 three or four times.

14. Add bricks to the volume.

Repeat steps 9 to 13 three or four times.

15. Stop the volume.
16. Start the volume again.
17. Kill glusterd on m1 and m2.
18. Restart glusterd on m1 and m2.

(A sketch of the corresponding gluster CLI commands follows this list.)
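
A rough sketch of the CLI commands behind these steps. The volume name, host IPs, and brick paths are taken from the volume info below; which six bricks formed the original 2x3 set is an assumption, and the glusterd init script name may differ on other systems:

# Sketch only, not an exact transcript of the original run.
gluster volume create dstore replica 3 transport tcp \
    192.168.2.35:/export1/dstore1 192.168.2.36:/export1/dstore1 192.168.2.37:/export1/dstore1 \
    192.168.2.35:/export2/dstore2 192.168.2.36:/export2/dstore2 192.168.2.37:/export2/dstore2
gluster volume start dstore

# step 5 (and again at step 14 with the remaining three bricks)
gluster volume add-brick dstore \
    192.168.2.35:/export1/dstore2 192.168.2.36:/export1/dstore2 192.168.2.37:/export1/dstore2

# steps 6-8
gluster volume rebalance dstore start
gluster volume rebalance dstore status
gluster volume rebalance dstore stop

# steps 11-13, after bringing bricks down and back up
gluster volume rebalance dstore start force
gluster volume rebalance dstore status
gluster volume rebalance dstore stop

# steps 15-18
gluster volume stop dstore
gluster volume start dstore
pkill glusterd                # on m1 and m2
/etc/init.d/glusterd start    # on m1 and m2 (init script name assumed)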

Actual results:
glusterd crashed on both m1 and m2.

Additional info:
-------------------
[04/17/12 - 17:09:51 root@APP-SERVER3 ~]# gluster volume info
 
Volume Name: dstore
Type: Distributed-Replicate
Volume ID: 90336962-3cd3-483b-917b-aee27cf34eff
Status: Started
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: 192.168.2.35:/export1/dstore1
Brick2: 192.168.2.36:/export1/dstore1
Brick3: 192.168.2.37:/export1/dstore1
Brick4: 192.168.2.35:/export2/dstore2
Brick5: 192.168.2.36:/export2/dstore2
Brick6: 192.168.2.37:/export2/dstore2
Brick7: 192.168.2.35:/export1/dstore2
Brick8: 192.168.2.36:/export1/dstore2
Brick9: 192.168.2.37:/export1/dstore2
Brick10: 192.168.2.35:/export2/dstore1
Brick11: 192.168.2.36:/export2/dstore1
Brick12: 192.168.2.37:/export2/dstore1
Options Reconfigured:
diagnostics.client-log-level: INFO
cluster.self-heal-daemon: off
Comment 1 Shwetha Panduranga 2012-04-17 02:36:10 EDT
Created attachment 577903 [details]
Backtrace of core
Comment 2 Shwetha Panduranga 2012-04-17 02:37:06 EDT
Created attachment 577904 [details]
volume info file data on m1,m2,m3
Comment 3 Shwetha Panduranga 2012-04-17 06:50:52 EDT
Attaching the scripts that were run on the fuse and nfs mounts:-

gfsc1.sh:-
-----------
#!/bin/bash
# Create a 10 x 20 x 100 directory/file tree from the fuse mount point.

mountpoint=$(pwd)
for i in {1..10}
do
 level1_dir=$mountpoint/fuse2.$i
 mkdir "$level1_dir"
 cd "$level1_dir"
 for j in {1..20}
 do
  level2_dir=dir.$j
  mkdir "$level2_dir"
  cd "$level2_dir"
  for k in {1..100}
  do
   echo "Creating File: $level1_dir/$level2_dir/file.$k"
   dd if=/dev/zero of=file.$k bs=1M count=$k
  done
  cd "$level1_dir"
 done
 cd "$mountpoint"
done


nfsc1.sh:-
----------
#!/bin/bash
# Same workload as gfsc1.sh, run from the nfs mount point (5 x 20 x 100 tree).

mountpoint=$(pwd)
for i in {1..5}
do
 level1_dir=$mountpoint/nfs2.$i
 mkdir "$level1_dir"
 cd "$level1_dir"
 for j in {1..20}
 do
  level2_dir=dir.$j
  mkdir "$level2_dir"
  cd "$level2_dir"
  for k in {1..100}
  do
   echo "Creating File: $level1_dir/$level2_dir/file.$k"
   dd if=/dev/zero of=file.$k bs=1M count=$k
  done
  cd "$level1_dir"
 done
 cd "$mountpoint"
done
Comment 4 Vijay Bellur 2012-05-18 09:09:44 EDT
Not reproducible anymore. Removing blocker flag.
Comment 5 Amar Tumballi 2012-07-11 02:24:27 EDT
Not able to reproduce it again.