Description of problem:

Core was generated by `/usr/local/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /etc/'.
Program terminated with signal 11, Segmentation fault.
#0  0xffffffffff60042a in ?? ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.25.el6_1.3.x86_64 libgcc-4.4.5-6.el6.x86_64
(gdb) bt
#0  0xffffffffff60042a in ?? ()
#1  0x000000004f438aaf in ?? ()
#2  0x000000000002ff27 in ?? ()
#3  0x00000000c4131e4d in ?? ()
#4  0x000000000000000e in ?? ()
#5  0x00007fbc00ee84b0 in ?? ()
#6  0x000000311fc9aa7d in time () from /lib64/libc.so.6
#7  0x00007fbc023d56ae in _do_self_heal_on_subvol (this=0xffffffffff600421, child=32700, crawl=15631440) at afr-self-heald.c:357
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) f 7
#7  0x00007fbc023d56ae in _do_self_heal_on_subvol (this=0xffffffffff600421, child=32700, crawl=15631440) at afr-self-heald.c:357
357             time (&shd->sh_times[child]);
(gdb) l
352             afr_self_heald_t *shd = NULL;
353
354             priv = this->private;
355             shd = &priv->shd;
356
357             time (&shd->sh_times[child]);
358             afr_start_crawl (this, child, crawl, _self_heal_entry,
359                              NULL, _gf_true, STOP_CRAWL_ON_SINGLE_SUBVOL,
360                              afr_crawl_done);
361     }
(gdb) p shd
$1 = (afr_self_heald_t *) 0xefd10
(gdb) p priv
$2 = (afr_private_t *) 0x4f438de2
(gdb) p *priv
Cannot access memory at address 0x4f438de2
(gdb) p *this
Cannot access memory at address 0xffffffffff600421
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) info thre
  5 Thread 0x7fc6350d0700 (LWP 12953)  0x000000311fce5d73 in epoll_wait () from /lib64/libc.so.6
  4 Thread 0x7fc630147700 (LWP 12957)  0x000000312040ecbd in nanosleep () from /lib64/libpthread.so.0
  3 Thread 0x7fc632fe1700 (LWP 12954)  0x000000312040f235 in sigwait () from /lib64/libpthread.so.0
  2 Thread 0x7fc6325e0700 (LWP 12955)  0x000000311fcdda07 in writev () from /lib64/libc.so.6
* 1 Thread 0x7fc631bdf700 (LWP 12956)  0xffffffffff60042a in ?? ()

Version-Release number of selected component (if applicable):
mainline

How reproducible:
often

Steps to Reproduce:
1. Create a replicate volume with 2 bricks (self-heal daemon is off).
2. Start the volume.
3. Mount the volume from a client.
4. Create files and dirs.
5. Bring down one of the bricks.
6. Create files and dirs; change ownership/permissions on existing files.
7. Bring the brick back up.
8. find . | xargs stat (this does not trigger self-heal).
9. Enable the self-heal daemon.
10. gluster volume heal <volume_name>
Output:- Error

Example:-
[02/21/12 - 07:28:09 root@SERVER1 glusterfs]# gluster volume heal replicate
error

Glustershd Log:-
--------------------
[2012-02-21 07:28:18.982200] I [afr-self-heald.c:1047:afr_start_crawl] 0-replicate-replicate-0: starting crawl 1 for replicate-client-0
pending frames:
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
pending frames:
time of crash: 2012-02-21 07:28:18
configuration details:
patchset: git://git.gluster.com/glusterfs.git
argp 1
signal received: 11
backtrace 1
dlfcn 1
fdatasync 1
time of crash:
libpthread 1
2012-02-21 07:28:18
llistxattr 1
configuration details:
setfsid 1
argp 1
spinlock 1
backtrace 1
epoll.h 1
dlfcn 1
xattr.h 1
fdatasync 1
st_atim.tv_nsec 1
libpthread 1
package-string: glusterfs 3git
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3git
/lib64/libc.so.6[0x311fc32980]
[0xffffffffff60042a]
---------
/lib64/libc.so.6[0x311fc32980]
/usr/local/lib/glusterfs/3git/xlator/cluster/replicate.so(+0x5d8d1)[0x7fc62f2dc8d1]
/usr/local/lib/libglusterfs.so.0(synctask_wrap+0x38)[0x7fc63557134f]
Ran glustershd with valgrind and reproduced the crash. This is the backtrace of the core:

Core was generated by `'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000570a4bd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
82      ../sysdeps/unix/syscall-template.S: No such file or directory.
        in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  0x000000000570a4bd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000000004e61297 in gf_timer_proc (ctx=0x5cad040) at ../../../libglusterfs/src/timer.c:182
#2  0x0000000005701d8c in start_thread (arg=0xb0d8700) at pthread_create.c:304
#3  0x00000000059ff04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4  0x0000000000000000 in ?? ()
(gdb) info thr
  5 Thread 13393  0x00000000059ff6a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
  4 Thread 13394  do_sigwait (set=<value optimized out>, sig=0x87f5eb8) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:65
  3 Thread 13395  0x00000000057078f7 in ?? () from /lib/x86_64-linux-gnu/libpthread.so.0
  2 Thread 13396  0x000000000570ab3b in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
* 1 Thread 13491  0x000000000570a4bd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
(gdb) t 2
[Switching to thread 2 (Thread 13396)]#0  0x000000000570ab3b in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
42      ../nptl/sysdeps/unix/sysv/linux/pt-raise.c: No such file or directory.
        in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c
(gdb) bt
#0  0x000000000570ab3b in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1  0x0000000004e5d6dc in gf_print_trace (signum=11) at ../../../libglusterfs/src/common-utils.c:437
#2  <signal handler called>
#3  0x000000000b377073 in afr_dir_exclusive_crawl (data=0x7f886d0) at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:978
#4  0x0000000004e8d8da in synctask_wrap (old_task=0x7f88890) at ../../../libglusterfs/src/syncop.c:144
#5  0x000000000595e1a0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) f 3
#3  0x000000000b377073 in afr_dir_exclusive_crawl (data=0x7f886d0) at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:978
978             if (shd->inprogress[child]) {
(gdb) p shd
$1 = (afr_self_heald_t *) 0x6064a08
(gdb) p *shd
$2 = {enabled = _gf_true, pending = 0x0, inprogress = 0x0, pos = 0x0, sh_times = 0x0, timer = 0x0, healed = 0x0, heal_failed = 0x0, split_brain = 0x0}
(gdb)

This is what the valgrind log says:

For counts of detected and suppressed errors, rerun with: -v
==13383== ERROR SUMMARY: 22 errors from 22 contexts (suppressed: 4 from 4)
==13393== Warning: client switching stacks?  SP change: 0x8ff6e48 --> 0xece0098
==13393==          to suppress, use: --max-stackframe=97423952 or greater
==13393== Thread 3:
==13393== Syscall param time(t) points to unaddressable byte(s)
==13393==    at 0x3804049A: vgPlain_amd64_linux_REDIR_FOR_vtime (m_trampoline.S:167)
==13393==  Address 0x8 is not stack'd, malloc'd or (recently) free'd
==13393==
==13393== Warning: client switching stacks?  SP change: 0x97f7e48 --> 0xf85c028
==13393==          to suppress, use: --max-stackframe=101073376 or greater
==13393== Thread 4:
==13393== Invalid read of size 4
==13393==    at 0xB377073: afr_dir_exclusive_crawl (afr-self-heald.c:978)
==13393==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==13393==
==13393== Warning: client switching stacks?  SP change: 0x8ff6e48 --> 0xfc5c028
==13393==          to suppress, use: --max-stackframe=113660384 or greater
==13393== further instances of this message will not be shown.
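Both cores and the valgrind report point at the same thing: the afr_self_heald_t state exists, but its per-child arrays were never allocated. "p *shd" shows inprogress = 0x0 and sh_times = 0x0, so shd->inprogress[child] (afr-self-heald.c:978) and time (&shd->sh_times[child]) (afr-self-heald.c:357) dereference NULL as soon as a crawl starts. Below is a minimal standalone sketch of that failure mode; the structures and the init guard are simplified stand-ins (only the field names inprogress/sh_times/enabled are taken from the gdb output above), not the actual GlusterFS code.

#include <stdlib.h>
#include <time.h>

/* Simplified stand-ins for the afr structures involved in the crash. */
typedef struct {
        int     enabled;
        int    *inprogress;   /* per-child "crawl in progress" flags */
        time_t *sh_times;     /* per-child last self-heal timestamps */
} afr_self_heald_t;

typedef struct {
        int              child_count;
        afr_self_heald_t shd;
} afr_private_t;

/* In the buggy flow, the arrays are allocated only when the (reconfigurable)
 * self-heal-daemon option is on at init time. If the option is turned on
 * later via reconfigure, the arrays stay NULL. */
static void
shd_init (afr_private_t *priv, int option_on_at_init)
{
        if (!option_on_at_init)
                return;  /* arrays never allocated */

        priv->shd.inprogress = calloc (priv->child_count, sizeof (int));
        priv->shd.sh_times   = calloc (priv->child_count, sizeof (time_t));
}

/* Mirrors the faulting statements at afr-self-heald.c:978 and :357. */
static void
do_crawl (afr_private_t *priv, int child)
{
        afr_self_heald_t *shd = &priv->shd;

        if (shd->inprogress[child])     /* NULL deref when init was skipped */
                return;
        time (&shd->sh_times[child]);   /* same NULL deref on the other path */
}

int
main (void)
{
        afr_private_t priv = { .child_count = 2 };

        shd_init (&priv, 0);   /* daemon started with self-heal-daemon off */
        priv.shd.enabled = 1;  /* option later reconfigured on: flag flips, arrays don't */

        do_crawl (&priv, 0);   /* segfaults, matching the backtraces above */
        return 0;
}

Compiled and run as-is, this sketch segfaults in do_crawl(), matching the invalid reads of address 0x0 that valgrind flags at afr-self-heald.c:978.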
Thanks for the steps, guys. The AFR xlator needs to maintain an inode table inside the xlator when it is running as part of the self-heal daemon. The code was relying on the self-heal-daemon option to decide this, which is wrong because that option can be reconfigured on/off at runtime. Added a new option, which cannot be reconfigured, for this purpose.
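A minimal sketch of that approach, with simplified, assumed names (the real patch is in the change linked below): the process role is fixed once at init time by its own non-reconfigurable flag, and reconfigure only toggles whether healing is enabled, so the state the crawler needs is always allocated in the self-heal daemon no matter how cluster.self-heal-daemon is toggled.

#include <stdlib.h>
#include <time.h>

typedef struct {
        int     iamshd;       /* set once at init, never reconfigured (name illustrative) */
        int     enabled;      /* mirrors cluster.self-heal-daemon, reconfigurable */
        int    *inprogress;
        time_t *sh_times;
} afr_self_heald_t;

typedef struct {
        int              child_count;
        afr_self_heald_t shd;
} afr_private_t;

/* init(): allocate shd state based on the process role, not on the
 * reconfigurable self-heal-daemon option. */
static int
afr_init_sketch (afr_private_t *priv, int iamshd, int heal_enabled)
{
        priv->shd.iamshd  = iamshd;
        priv->shd.enabled = heal_enabled;

        if (iamshd) {
                priv->shd.inprogress = calloc (priv->child_count, sizeof (int));
                priv->shd.sh_times   = calloc (priv->child_count, sizeof (time_t));
                if (!priv->shd.inprogress || !priv->shd.sh_times)
                        return -1;
        }
        return 0;
}

/* reconfigure(): only flips the enable flag; allocation never depends on it. */
static void
afr_reconfigure_sketch (afr_private_t *priv, int heal_enabled)
{
        priv->shd.enabled = heal_enabled;
}

int
main (void)
{
        afr_private_t priv = { .child_count = 2 };

        /* glustershd sets the role at init; healing may start disabled. */
        if (afr_init_sketch (&priv, 1, 0) != 0)
                return 1;

        afr_reconfigure_sketch (&priv, 1);  /* "gluster volume set ... on" later */
        time (&priv.shd.sh_times[0]);       /* safe: arrays exist regardless */
        return 0;
}

Because reconfigure only flips the enable flag, turning self-heal on after the daemon has started can no longer leave the per-child arrays unallocated.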
CHANGE: http://review.gluster.com/2787 (cluster/afr: Add new option to know which process it is in) merged in master by Vijay Bellur (vijay)
Bug is fixed. Verified on 3.3.0qa39.