Bug 795741
| Summary: | Crash when tried to self heal from gluster cli: "gluster volume heal <volume_name>" | | |
| --- | --- | --- | --- |
| Product: | [Community] GlusterFS | Reporter: | Shwetha Panduranga <shwetha.h.panduranga> |
| Component: | replicate | Assignee: | Pranith Kumar K <pkarampu> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | mainline | CC: | gluster-bugs, rabhat |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.4.0 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-07-24 17:26:54 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 817967 | | |
Description
Shwetha Panduranga
2012-02-21 12:38:16 UTC
Ran glustershd under valgrind and reproduced the crash. This is the backtrace of the core:

```
Core was generated by `'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000570a4bd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
82  ../sysdeps/unix/syscall-template.S: No such file or directory.
    in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  0x000000000570a4bd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000000004e61297 in gf_timer_proc (ctx=0x5cad040) at ../../../libglusterfs/src/timer.c:182
#2  0x0000000005701d8c in start_thread (arg=0xb0d8700) at pthread_create.c:304
#3  0x00000000059ff04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4  0x0000000000000000 in ?? ()
(gdb) info thr
  5 Thread 13393  0x00000000059ff6a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
  4 Thread 13394  do_sigwait (set=<value optimized out>, sig=0x87f5eb8) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:65
  3 Thread 13395  0x00000000057078f7 in ?? () from /lib/x86_64-linux-gnu/libpthread.so.0
  2 Thread 13396  0x000000000570ab3b in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
* 1 Thread 13491  0x000000000570a4bd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
(gdb) t 2
[Switching to thread 2 (Thread 13396)]
#0  0x000000000570ab3b in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
42  ../nptl/sysdeps/unix/sysv/linux/pt-raise.c: No such file or directory.
    in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c
(gdb) bt
#0  0x000000000570ab3b in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1  0x0000000004e5d6dc in gf_print_trace (signum=11) at ../../../libglusterfs/src/common-utils.c:437
#2  <signal handler called>
#3  0x000000000b377073 in afr_dir_exclusive_crawl (data=0x7f886d0) at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:978
#4  0x0000000004e8d8da in synctask_wrap (old_task=0x7f88890) at ../../../libglusterfs/src/syncop.c:144
#5  0x000000000595e1a0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) f 3
#3  0x000000000b377073 in afr_dir_exclusive_crawl (data=0x7f886d0) at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:978
978         if (shd->inprogress[child]) {
(gdb) p shd
$1 = (afr_self_heald_t *) 0x6064a08
(gdb) p *shd
$2 = {enabled = _gf_true, pending = 0x0, inprogress = 0x0, pos = 0x0, sh_times = 0x0, timer = 0x0, healed = 0x0, heal_failed = 0x0, split_brain = 0x0}
```

This is what the valgrind log says:

```
For counts of detected and suppressed errors, rerun with: -v
==13383== ERROR SUMMARY: 22 errors from 22 contexts (suppressed: 4 from 4)
==13393== Warning: client switching stacks?  SP change: 0x8ff6e48 --> 0xece0098
==13393==          to suppress, use: --max-stackframe=97423952 or greater
==13393== Thread 3:
==13393== Syscall param time(t) points to unaddressable byte(s)
==13393==    at 0x3804049A: vgPlain_amd64_linux_REDIR_FOR_vtime (m_trampoline.S:167)
==13393==  Address 0x8 is not stack'd, malloc'd or (recently) free'd
==13393==
==13393== Warning: client switching stacks?  SP change: 0x97f7e48 --> 0xf85c028
==13393==          to suppress, use: --max-stackframe=101073376 or greater
==13393== Thread 4:
==13393== Invalid read of size 4
==13393==    at 0xB377073: afr_dir_exclusive_crawl (afr-self-heald.c:978)
==13393==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==13393==
==13393== Warning: client switching stacks?  SP change: 0x8ff6e48 --> 0xfc5c028
==13393==          to suppress, use: --max-stackframe=113660384 or greater
==13393==          further instances of this message will not be shown.
```

Thanks for the steps, guys. The AFR xlator needs to maintain an inode table inside the xlator when it is running in the self-heal daemon. The code was depending on the self-heal-daemon option for this, which is wrong because that option can be reconfigured on/off at runtime. Added a new option, which cannot be reconfigured, for this purpose.

CHANGE: http://review.gluster.com/2787 (cluster/afr: Add new option to know which process it is in) merged in master by Vijay Bellur (vijay)

Bug is fixed; verified on 3.3.0qa39.
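The crash itself is a plain NULL-pointer dereference: `p *shd` shows `inprogress = 0x0`, yet line 978 reads `shd->inprogress[child]`, because the per-child state arrays were only allocated when the reconfigurable self-heal-daemon option happened to be on at init time. Below is a minimal C sketch of that failure mode and the defensive shape of the fix. All names here (`self_heald_t`, `crawl_unsafe`, `crawl_safe`, `shd_init`) are illustrative stand-ins, not GlusterFS code; the actual fix keyed the allocation to a new non-reconfigurable option instead.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-in for afr_self_heald_t: per-child state arrays
 * that may never have been allocated if initialization was skipped. */
typedef struct {
    int  enabled;
    int *inprogress;   /* per-child in-progress flags; may be NULL */
} self_heald_t;

/* Buggy pattern: assumes inprogress was allocated.
 * With inprogress == NULL this is exactly the reported
 * "Invalid read of size 4 ... Address 0x0" / SIGSEGV. */
int crawl_unsafe(self_heald_t *shd, int child) {
    return shd->inprogress[child];
}

/* Defensive pattern: never dereference state that may be missing. */
int crawl_safe(self_heald_t *shd, int child) {
    if (shd->inprogress == NULL)
        return 0;               /* no crawl state, nothing in progress */
    return shd->inprogress[child];
}

/* Shape of the fix: allocate the state based on a value fixed at init
 * time (a non-reconfigurable option), not on a toggle that can change. */
self_heald_t *shd_init(int child_count) {
    self_heald_t *shd = calloc(1, sizeof(*shd));
    if (shd == NULL)
        return NULL;
    shd->inprogress = calloc(child_count, sizeof(int));
    if (shd->inprogress == NULL) {
        free(shd);
        return NULL;
    }
    return shd;
}
```

The point of the sketch is the ordering guarantee: once allocation is tied to a property that cannot be reconfigured after process start, the crawl code can rely on `inprogress` being non-NULL for the lifetime of the daemon.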