Bug 829168 - self-heal daemon process crashed during self-heal of 50k files
Summary: self-heal daemon process crashed during self-heal of 50k files
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.3-beta
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Amar Tumballi
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-06-06 06:41 UTC by Shwetha Panduranga
Modified: 2018-12-02 17:40 UTC (History)
CC List: 3 users

Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-07-24 17:26:58 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)
glustershd log (85.11 KB, text/x-log)
2012-06-06 06:41 UTC, Shwetha Panduranga

Description Shwetha Panduranga 2012-06-06 06:41:37 UTC
Created attachment 589744 [details]
glustershd log

Description of problem:
-------------------------
self-heal daemon process crashed during self-heal of 50K files.

Core was generated by `/usr/local/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000003271f14f99 in xdrmem_getlong () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt
#0  0x0000003271f14f99 in xdrmem_getlong () from /lib64/libc.so.6
#1  0x0000003271f141f3 in xdr_u_int_internal () from /lib64/libc.so.6
#2  0x0000003271f148e8 in xdr_string_internal () from /lib64/libc.so.6
#3  0x00007f3af77c5c67 in xdr_gd1_mgmt_brick_op_req (xdrs=0x132c5a0, objp=0x132ce40) at glusterd1-xdr.c:478
#4  0x00007f3af77bd71f in xdr_to_generic (inmsg=..., args=0x132ce40, proc=0x7f3af77c5c3e <xdr_gd1_mgmt_brick_op_req>) at xdr-generic.c:60
#5  0x000000000040a4b7 in glusterfs_handle_translator_op (data=0xd5175c) at glusterfsd-mgmt.c:655
#6  0x00007f3af7c4a882 in synctask_wrap (old_task=0xf2dbf0) at syncop.c:120
#7  0x0000003271e43610 in ?? () from /lib64/libc.so.6
#8  0x0000000000000000 in ?? ()
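
For reference, frames #0-#2 are the standard SunRPC xdrmem string-decode path that the generated xdr_gd1_mgmt_brick_op_req() routine (frame #3) drives via xdr_to_generic() (frame #4). The sketch below shows only that generic pattern using plain <rpc/xdr.h>; it is not the glusterfs code, and the string value used is a stand-in.

/* Illustrative xdrmem round-trip: encode a string into a memory buffer,
 * then decode it again. On decode, xdr_string() first reads a 4-byte
 * length word via xdrmem_getlong() (frame #0) before copying the bytes. */
#include <rpc/xdr.h>
#include <stdio.h>
#include <stdlib.h>

int main (void)
{
        char          buf[64];
        char         *name = "heal-op";  /* stand-in for a request field */
        char         *out  = NULL;       /* xdr_string() allocates on decode */
        XDR           enc, dec;
        unsigned int  used;

        xdrmem_create (&enc, buf, sizeof (buf), XDR_ENCODE);
        if (!xdr_string (&enc, &name, ~0u))
                return 1;
        used = xdr_getpos (&enc);

        xdrmem_create (&dec, buf, used, XDR_DECODE);
        if (!xdr_string (&dec, &out, ~0u))
                return 1;

        printf ("decoded: %s\n", out);
        free (out);
        return 0;
}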

create_files.sh (creates 50K files of 1M-10M each; ~270GB of disk space is required)
-----------------
#!/bin/bash
# Usage: ./create_files.sh <target-directory>
# Creates 50,000 files named file.1 .. file.50000, cycling sizes from 1M to 10M.

path=$1
count_value=1
for i in {1..50000};do
	if [ $count_value -gt 10 ]; then
		count_value=1
	fi
	echo -e "Creating File : file.$i\n"
	dd if=/dev/urandom of="$path/file.$i" bs=1M count=$count_value
	echo -e "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
	let "count_value = $count_value + 1"
done


Version-Release number of selected component (if applicable):
--------------------------------------------------------------
3.3.0qa45

Steps to Reproduce:
---------------------
1. Create a distribute-replicate volume (3x3).
2. Bring down 2 bricks from each replicate sub-volume.
3. Create a fuse mount.
4. On the mount, execute the script "create_files.sh".
5. Once the script has completed, bring the bricks back online (gluster v start <vol_name> force).
6. On a storage node, execute:
-------------------------
a. gluster v heal vol full
Heal operation on volume vol has been successful

b. gluster v heal vol info healed
Self-heal daemon is not running. Check self-heal daemon log file.

Actual results:
----------------
After some time, the glusterfs self-heal daemon crashed on one of the storage nodes.

Additional info:
---------------

[06/06/12 - 02:18:58 root@AFR-Server1 ~]# gluster v info
 
Volume Name: vol
Type: Distributed-Replicate
Volume ID: b2f7f458-598e-456f-af7c-aa5af0036393
Status: Started
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.16.159.184:/export_b1/dir1
Brick2: 10.16.159.188:/export_b1/dir1
Brick3: 10.16.159.196:/export_b1/dir1
Brick4: 10.16.159.184:/export_c1/dir1
Brick5: 10.16.159.188:/export_c1/dir1
Brick6: 10.16.159.196:/export_c1/dir1
Brick7: 10.16.159.184:/export_d1/dir1
Brick8: 10.16.159.188:/export_d1/dir1
Brick9: 10.16.159.196:/export_d1/dir1

Found the bug while verifying Bug 798907.

Comment 1 Pranith Kumar K 2012-06-13 10:27:45 UTC
Amar will be fixing the XDR encoding/decoding problems.

Amar,
    there is still a bug in brick-op that blows up the request size because of the brick-op response on the originator. We will have to fix that as well once the XDR problems go away.

Comment 2 Amar Tumballi 2012-07-05 06:23:10 UTC
Did some checking on this. The failure is happening inside xdrmem_getlong(), which is outside the glusterfs scope. Searched for possible failures of xdrmem_getlong() and found an issue from as far back as 1995, but that should be fixed by now (http://mail-index.netbsd.org/netbsd-bugs/1995/06/05/0003.html).

Lowering the priority of this bug for now.
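
For what it is worth, glibc's xdrmem stream is bounds-checked: handing the decoder a buffer that is merely too short makes xdr_string()/xdrmem_getlong() return FALSE rather than fault, so a SIGSEGV inside xdrmem_getlong() usually points at the stream having been created over a bad pointer or length. A small sketch of the truncation case (plain <rpc/xdr.h> only, not the glusterfs code path):

/* Decode from a deliberately truncated (but valid) xdrmem buffer.
 * xdrmem_getlong()/xdrmem_getbytes() check the remaining byte count,
 * so the decode fails cleanly instead of crashing. */
#include <assert.h>
#include <rpc/xdr.h>

int main (void)
{
        char          buf[64];
        char         *in  = "some-translator-op";
        char         *out = NULL;
        XDR           xdrs;
        unsigned int  used;

        xdrmem_create (&xdrs, buf, sizeof (buf), XDR_ENCODE);
        assert (xdr_string (&xdrs, &in, ~0u));
        used = xdr_getpos (&xdrs);

        /* Hand the decoder only half of the encoded bytes. */
        xdrmem_create (&xdrs, buf, used / 2, XDR_DECODE);
        assert (xdr_string (&xdrs, &out, ~0u) == FALSE);  /* fails, no crash */
        return 0;
}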

Comment 4 Amar Tumballi 2012-12-21 06:02:19 UTC
Not happening anymore with the 3.4.0 QA series. Please re-open if seen again.

Comment 5 Elvir Kuric 2013-06-10 08:36:08 UTC
(In reply to Amar Tumballi from comment #4)
> Not happening anymore with the 3.4.0 QA series. Please re-open if seen again.

When can we expect 3.4.0 to be released?

Thank you in advance, 

Kind regards, 

Elvir / Red Hat GSS EMEA

