Bug 764221 (GLUSTER-2489) - GlusterFS crashing with replace-brick
Summary: GlusterFS crashing with replace-brick
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-2489
Product: GlusterFS
Classification: Community
Component: core
Version: 3.1.2
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Assignee: krishnan parthasarathi
QA Contact:
URL:
Whiteboard:
Duplicates: GLUSTER-2650 764491 764683 764686
Depends On:
Blocks:
Reported: 2011-03-04 12:36 UTC by Prasanth
Modified: 2015-11-03 23:03 UTC (History)
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: fuse
Documentation: ---
CRM:
Verified Versions:


Attachments
Core dump following crash on bdan3 (42 bytes, text/plain)
2011-03-04 12:19 UTC, Dan Bretherton

Description Dan Bretherton 2011-03-04 12:19:09 UTC
Created attachment 447

Comment 1 Dan Bretherton 2011-03-04 12:23:10 UTC
I have uploaded the core dump from bdan3, and changed the GlusterFS version number in the bug details to 3.1.2 - I hope that was the right thing to do.

[root@bdan3 ~]# glusterfs --version
glusterfs 3.1.2 built on Jan 14 2011 19:21:08

Let me know if you need more info or need me to run more tests.

Regards,
-Dan Bretherton

Comment 2 Prasanth 2011-03-04 12:36:34 UTC
Hello,

It seems that GlusterFS is crashing while performing a replace-brick. Relevant logs are pasted below:

------------
[2011-02-25 12:43:59.974051] I [afr-common.c:662:afr_lookup_done] raid10-pump: entries are missing in lookup of /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Application Data.
[2011-02-25 12:43:59.974103] I [afr-common.c:716:afr_lookup_done] raid10-pump: background  meta-data data entry self-heal triggered. path: /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Application Data
[2011-02-25 12:43:59.982317] I [afr-self-heal-common.c:1526:afr_self_heal_completion_cbk] raid10-pump: background  meta-data data entry self-heal completed on /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Application Data
[2011-02-25 12:43:59.984233] I [afr-common.c:662:afr_lookup_done] raid10-pump: entries are missing in lookup of /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu.
[2011-02-25 12:43:59.984288] I [afr-common.c:716:afr_lookup_done] raid10-pump: background  meta-data data entry self-heal triggered. path: /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu
[2011-02-25 12:43:59.992450] I [afr-self-heal-common.c:1526:afr_self_heal_completion_cbk] raid10-pump: background  meta-data data entry self-heal completed on /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu
[2011-02-25 12:43:59.994415] I [afr-common.c:662:afr_lookup_done] raid10-pump: entries are missing in lookup of /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu/Programs.
[2011-02-25 12:43:59.994469] I [afr-common.c:716:afr_lookup_done] raid10-pump: background  meta-data data entry self-heal triggered. path: /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu/Programs
[2011-02-25 12:44:00.3184] I [afr-self-heal-common.c:1526:afr_self_heal_completion_cbk] raid10-pump: background  meta-data data entry self-heal completed on /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu/Programs
[2011-02-25 12:44:00.5147] I [afr-common.c:662:afr_lookup_done] raid10-pump: entries are missing in lookup of /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu/Programs/Startup.
[2011-02-25 12:44:00.5206] I [afr-common.c:716:afr_lookup_done] raid10-pump: background  meta-data data entry self-heal triggered. path: /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu/Programs/Startup
[2011-02-25 12:44:00.14464] I [afr-self-heal-common.c:1526:afr_self_heal_completion_cbk] raid10-pump: background  meta-data data entry self-heal completed on /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu/Programs/Startup
pending frames:

patchset: v3.1.1-64-gf2a067c
signal received: 11
time of crash: 2011-02-25 12:44:00
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.2
/lib64/libc.so.6[0x361b0302d0]
/usr/lib64/libglusterfs.so.0(synctask_wake+0x4d)[0x3d60c41c7d]
/usr/lib64/libglusterfs.so.0(syncop_readdirp_cbk+0x104)[0x3d60c43b04]
/usr/lib64/libglusterfs.so.0(default_readdirp_cbk+0x79)[0x3d60c22069]
/usr/lib64/libglusterfs.so.0(default_readdirp_cbk+0x79)[0x3d60c22069]
/usr/lib64/glusterfs/3.1.2/xlator/storage/posix.so(posix_do_readdir+0x409)[0x2aaaab6b1fc9]
/usr/lib64/glusterfs/3.1.2/xlator/storage/posix.so(posix_readdirp+0xf)[0x2aaaab6b254f]
/usr/lib64/libglusterfs.so.0(default_readdirp+0xe9)[0x3d60c1b809]
/usr/lib64/libglusterfs.so.0(default_readdirp+0xe9)[0x3d60c1b809]
/usr/lib64/libglusterfs.so.0(syncop_readdirp+0x135)[0x3d60c433b5]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf560f1]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf5bfa0]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0xb)[0x3d60c4237b]
/lib64/libc.so.6[0x361b041870]
---------
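
The run of identical pump.so frames in the backtrace above suggests a recursive function, possibly one frame per directory level. A minimal sketch for measuring such runs in a pasted backtrace follows; the helper name `longest_frame_run` and the abbreviated sample are illustrative only and are not part of any GlusterFS tool.

```python
import re
from itertools import groupby

# Matches the trailing "[0x...]" return address of a backtrace line.
ADDR_RE = re.compile(r"\[(0x[0-9a-fA-F]+)\]\s*$")


def longest_frame_run(backtrace_text):
    """Return (address, count) for the longest run of consecutive
    identical return addresses, or (None, 0) if no frames are found."""
    addrs = []
    for line in backtrace_text.splitlines():
        match = ADDR_RE.search(line)
        if match:
            addrs.append(match.group(1))

    best = (None, 0)
    for addr, group in groupby(addrs):
        count = len(list(group))
        if count > best[1]:
            best = (addr, count)
    return best


# Abbreviated sample in the same format as the backtrace above
# (the real trace has many more repeated frames).
sample = """\
/usr/lib64/libglusterfs.so.0(syncop_readdirp+0x135)[0x3d60c433b5]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf560f1]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf5bfa0]
"""

print(longest_frame_run(sample))
```

In the full backtrace above, the frame at 0x2aaaabf56656 appears 16 times in a row, which matches the 16 path components of the last directory named in the log.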

Do let me know if you require any more logs or info.


Regards,
Prasanth

Comment 3 Prasanth 2011-03-29 06:16:56 UTC
Raising the priority

Comment 4 Vijay Bellur 2011-04-08 09:05:10 UTC
(In reply to comment #3)
I have not been able to reproduce this so far. Looking into it further.

Comment 5 Anand Avati 2011-06-08 13:57:28 UTC
PATCH: http://patches.gluster.com/patch/7381 in master (syncop: Increase stack size for deep call stack.)

Comment 6 Anand Avati 2011-06-08 13:57:33 UTC
PATCH: http://patches.gluster.com/patch/7382 in master (glusterd: replace brick status grows with dir tree.)

Comment 7 Anand Avati 2011-06-08 15:18:15 UTC
PATCH: http://patches.gluster.com/patch/7405 in release-3.2 (glusterd: replace brick status grows with dir tree.)

Comment 8 Anand Avati 2011-06-08 15:18:20 UTC
PATCH: http://patches.gluster.com/patch/7408 in release-3.2 (pump: cleanup potential dict related memory corruption.)

Comment 9 Anand Avati 2011-06-08 15:18:25 UTC
PATCH: http://patches.gluster.com/patch/7409 in release-3.2 (syncop: Increase stack size for deep call stack.)

Comment 10 krishnan parthasarathi 2011-06-09 01:56:46 UTC
The root cause of the replace-brick operation crashing GlusterFS was a stack overflow.
We looked for a correlation between the height and width of the directory tree and the occurrence of a crash.
Sometimes the overflow caused a segmentation fault by running into memory not owned by the glusterfsd process. In other instances, it (silently) corrupted glusterfsd's data structures, causing a crash much later with 'unhelpful' backtraces.
We began to observe consistent crashes when the directory depth was ~20 (voila!).
On further analysis, we found that tasks running SYNCOPs were using a stack of only 16KB. This stack size cannot accommodate a directory tree beyond a certain depth (~20, determined empirically). After increasing it to 2MB, we tested with a directory depth of ~2048 (limited by POSIX PATH_MAX) without any crash.
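
As a rough sanity check on these numbers, the back-of-envelope estimate below assumes that stack use grows linearly with directory depth. That is an assumption made for illustration, not something measured in this bug. Under that assumption, a 16KB stack exhausted at a depth of ~20 implies roughly 800 bytes per level, and a 2MB stack would then reach a depth of about 2560, which is above the ~2048 that was tested.

```python
KIB = 1024
MIB = 1024 * KIB

old_stack = 16 * KIB   # stack size reported for tasks running SYNCOPs
new_stack = 2 * MIB    # stack size after the fix
crash_depth = 20       # approximate depth at which crashes became consistent

# Assumption: each directory level consumes the same amount of stack.
bytes_per_level = old_stack / crash_depth

# Integer arithmetic avoids floating-point rounding in the final estimate.
estimated_max_depth = new_stack * crash_depth // old_stack

print(f"~{bytes_per_level:.0f} bytes of stack per directory level")
print(f"estimated depth supported by the new stack: ~{estimated_max_depth}")
```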

Comment 11 krishnan parthasarathi 2011-06-09 01:59:22 UTC
*** Bug 2951 has been marked as a duplicate of this bug. ***

Comment 12 krishnan parthasarathi 2011-06-09 06:08:48 UTC
*** Bug 2650 has been marked as a duplicate of this bug. ***

Comment 13 krishnan parthasarathi 2011-06-09 06:10:55 UTC
*** Bug 2759 has been marked as a duplicate of this bug. ***

Comment 14 krishnan parthasarathi 2011-06-09 06:12:52 UTC
*** Bug 2954 has been marked as a duplicate of this bug. ***

Comment 15 Anand Avati 2011-06-10 07:53:50 UTC
PATCH: http://patches.gluster.com/patch/7458 in release-3.1 (syncop: Increase stack size for deep call stack.)

Comment 16 Anand Avati 2011-06-10 07:53:56 UTC
PATCH: http://patches.gluster.com/patch/7459 in release-3.1 (glusterd: replace brick status grows with dir tree.)

Comment 17 Anand Avati 2011-06-10 07:54:02 UTC
PATCH: http://patches.gluster.com/patch/7460 in release-3.1 (pump: cleanup potential dict related memory corruption.)

Comment 18 M S Vishwanath Bhat 2011-06-10 09:22:21 UTC
With the 3.2.1qa4 release, I tried a replace-brick operation on 40 GB of data with a directory depth of 50, and it succeeded.

I also created a directory tree 2041 levels deep and ran replace-brick on it. That succeeded as well.
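
The comment does not say how the deep tree was created. One possible way to build such a test tree is sketched below; the function name `make_deep_tree`, the directory name `d`, and the mount point in the usage comment are all hypothetical.

```python
import os
import tempfile


def make_deep_tree(root, depth, name="d"):
    """Create `depth` nested directories under `root`, one level at a time.

    Changing into each new directory before creating the next keeps every
    path passed to mkdir short, however deep the tree becomes.
    """
    start = os.getcwd()
    try:
        os.chdir(root)
        for _ in range(depth):
            os.mkdir(name)
            os.chdir(name)
    finally:
        # Return to the original working directory even on failure.
        os.chdir(start)


# Hypothetical usage against a mounted volume:
# make_deep_tree("/mnt/glustervol", 2041)
```

Short directory names matter here because the full path of the deepest directory still has to fit within the path length limit when tools address it from the top of the tree.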

Comment 19 M S Vishwanath Bhat 2011-06-15 02:47:44 UTC
Repeated the above exercise with 3.1.5qa2 and it's working fine.

Comment 20 Anand Avati 2011-07-12 06:13:03 UTC
PATCH: http://patches.gluster.com/patch/7826 in master (pump, afr: dict related memory fixes.)

