810103 – rebalance process crashed

Keywords:
Status:	CLOSED DUPLICATE of bug 808977
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	distribute
Sub Component:
Version:	mainline
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	shishir gowda
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2012-04-05 07:25 UTC by Shwetha Panduranga
Modified:	2013-12-09 01:30 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2012-05-08 04:35:53 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments:
rebalance log (860.54 KB, text/x-log) 2012-04-05 12:02 UTC, Shwetha Panduranga	no flags

Description Shwetha Panduranga 2012-04-05 07:25:33 UTC

Description of problem:
(gdb) bt full
#0  0x00000032f1e32885 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00000032f1e34065 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00000032f1e2b9fe in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3  0x00000032f1e2bac0 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4  0x00007fe315a0e6ab in __gf_free (free_ptr=0x1443400) at mem-pool.c:278
        req_size = 0
        ptr = 0x14433f4 ""
        type = 0
        xl = 0x0
        __PRETTY_FUNCTION__ = "__gf_free"
#5  0x00007fe310ff32e3 in gf_defrag_start_crawl (data=0x1425b00) at dht-rebalance.c:1485
        this = 0x1425b00
        conf = 0x14432f0
        defrag = 0x1443400
        ret = -1
        loc = {path = 0x7fe311032eaf "/", name = 0x0, inode = 0x7fe2d161a04c, parent = 0x0, gfid = '\000' <repeats 15 times>, "\001", 
          pargfid = '\000' <repeats 15 times>}
        iatt = {ia_ino = 0, ia_gfid = '\000' <repeats 15 times>, ia_dev = 0, ia_type = IA_INVAL, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', 
            owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', 
              write = 0 '\000', exec = 0 '\000'}}, ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_rdev = 0, ia_size = 0, ia_blksize = 0, ia_blocks = 0, ia_atime = 0, 
          ia_atime_nsec = 0, ia_mtime = 0, ia_mtime_nsec = 0, ia_ctime = 0, ia_ctime_nsec = 0}
        parent = {ia_ino = 0, ia_gfid = '\000' <repeats 15 times>, ia_dev = 0, ia_type = IA_INVAL, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', 
            owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', 
              write = 0 '\000', exec = 0 '\000'}}, ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_rdev = 0, ia_size = 0, ia_blksize = 0, ia_blocks = 0, ia_atime = 0, 
          ia_atime_nsec = 0, ia_mtime = 0, ia_mtime_nsec = 0, ia_ctime = 0, ia_ctime_nsec = 0}
        fix_layout = 0x0
        migrate_data = 0x0
        __FUNCTION__ = "gf_defrag_start_crawl"
#6  0x00007fe315a1f286 in synctask_wrap (old_task=0x7fe2c8200d70) at syncop.c:128
        task = 0x7fe2c8200d70
#7  0x00000032f1e43610 in ?? () from /lib64/libc.so.6
No symbol table info available.


Version-Release number of selected component (if applicable):
3.3.0qa33

gfsc1.sh:-
-----------
#!/bin/bash

mountpoint=`pwd`
for i in {1..10}
do
	level1_dir=$mountpoint/fuse2.$i
	mkdir $level1_dir
	cd $level1_dir
	for j in {1..20}
	do 
		level2_dir=dir.$j
		mkdir $level2_dir
		cd $level2_dir
		for k in {1..100}
		do 
			echo "Creating File: $level1_dir/$level2_dir/file.$k"
			dd if=/dev/zero of=file.$k bs=1M count=$k 
		done
		cd $level1_dir
	done
	cd $mountpoint
done


nfsc1.sh:-
----------
#!/bin/bash

mountpoint=`pwd`
for i in {1..5}
do 
	level1_dir=$mountpoint/nfs2.$i
	mkdir $level1_dir
	cd $level1_dir
	for j in {1..20}
	do 
		level2_dir=dir.$j
		mkdir $level2_dir
		cd $level2_dir

		for k in {1..100}
		do 
			echo "Creating File: $level1_dir/$level2_dir/file.$k"
			dd if=/dev/zero of=file.$k bs=1M count=$k
		
		done
		cd $level1_dir
	done
	cd $mountpoint
done

Steps to Reproduce:
1. Create a distribute-replicate volume (2x3) and start the volume.
2. Create FUSE and NFS mounts.
3. Run gfsc1.sh from the FUSE mount.
4. Run nfsc1.sh from the NFS mount.
5. Add bricks to the volume (add-brick).
6. Start rebalance.
7. Query rebalance status.
8. Stop rebalance.
9. Bring down 2 bricks from each replica set, so that only one brick is online in each replica set.
10. Bring the bricks back online.
11. Start rebalance with force.
12. Query rebalance status.
13. Stop rebalance.

Repeat steps 9 to 13 three or four times.
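
The repeated rebalance cycle from the steps above (start with force, query status, stop) could be scripted roughly as follows. This is only a sketch and not part of the original report: the volume name `testvol`, the `RUN` switch, and the helper names are assumptions for illustration. The report does not say how the bricks were taken down and brought back, so that part is left as a printed reminder. By default the script only prints the commands.

```shell
#!/bin/bash
# Sketch only: prints the rebalance cycle instead of running it unless RUN=1.
# VOLNAME and CYCLES are placeholders, not values taken from the report.

VOLNAME=${VOLNAME:-testvol}   # hypothetical volume name
CYCLES=${CYCLES:-3}           # the report says to repeat 3-4 times

# Run a command, or just print it when RUN is not 1 (dry run).
run() {
	if [ "${RUN:-0}" = "1" ]; then
		"$@"
	else
		echo "$*"
	fi
}

# One start force / status / stop cycle.
rebalance_cycle() {
	run gluster volume rebalance "$VOLNAME" start force
	run gluster volume rebalance "$VOLNAME" status
	run gluster volume rebalance "$VOLNAME" stop
}

# Repeat the cycle CYCLES times, with a reminder before each one.
repro_loop() {
	n=1
	while [ "$n" -le "$CYCLES" ]; do
		echo "# cycle $n: bring bricks down, then back online, before this point"
		rebalance_cycle
		n=$((n + 1))
	done
}

repro_loop
```

Setting `RUN=1` would execute the gluster commands instead of printing them. The dry-run default is deliberate, since the cycle is meant to be interleaved with manual brick failures.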

Actual results:
rebalance process crashed.

Comment 1 shishir gowda 2012-04-05 07:48:29 UTC

Can you please provide the rebalance logs.
Also, please provide the gdb output for the frame in question (frame 5, I believe).
If possible, can the setup information be made available for me to access (it can be mailed across)?
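
One way to collect the requested frame output non-interactively is sketched below. It is not from the original thread. The binary path `/usr/sbin/glusterfs` and the core file path are assumptions, since the report gives neither. `defrag` is the local variable shown in frame 5 of the backtrace. The script only prints the gdb command line and does not run it.

```shell
#!/bin/bash
# Sketch only: builds a non-interactive gdb invocation for frame 5.
# BINARY and CORE are placeholders; the report does not give these paths.
BINARY=${BINARY:-/usr/sbin/glusterfs}   # assumed location of the glusterfs binary
CORE=${CORE:-./core}                    # hypothetical core file path

# Print the gdb command line; nothing is executed here.
frame5_cmd() {
	printf 'gdb -batch -ex "frame 5" -ex "info locals" -ex "print *defrag" %s %s\n' \
		"$BINARY" "$CORE"
}

frame5_cmd
```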

Comment 2 Shwetha Panduranga 2012-04-05 12:02:01 UTC

Created attachment 575383
rebalance log

Comment 3 Shwetha Panduranga 2012-04-05 12:02:24 UTC

Able to recreate the problem with the above-mentioned steps. Attaching the rebalance logs.

Comment 4 shishir gowda 2012-04-05 17:19:44 UTC

This seems to be a case where an AFR background self-heal is in progress when rebalance calls cleanup_and_exit. Sending parent_down to the xlators does not seem to fix this issue.

Comment 5 shishir gowda 2012-05-08 04:35:53 UTC

Closing this bug, as we have switched off self-healing from the rebalance process. Please reopen the bug if you are able to reproduce it.

*** This bug has been marked as a duplicate of bug 808997 ***

Comment 6 shishir gowda 2012-05-08 04:36:50 UTC

Sorry, I marked it as a duplicate of the wrong bug.

*** This bug has been marked as a duplicate of bug 808977 ***

Comment 7 Shwetha Panduranga 2012-05-12 14:01:28 UTC

Unable to re-create the same issue.
