| Summary: | GlusterFS crashing with replace-brick | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Prasanth <prasanth> |
| Component: | core | Assignee: | krishnan parthasarathi <kparthas> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.1.2 | CC: | amarts, d.a.bretherton, gluster-bugs, jacob, nsathyan, prasanth, rabhat, sac, vbellur, vbhat, vijay |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | --- |
| Regression: | --- | Mount Type: | fuse |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
Description

Dan Bretherton 2011-03-04 12:19:09 UTC
I have uploaded the core dump from bdan3, and changed the GlusterFS version number in the bug details to 3.1.2 - I hope that was the right thing to do.

```
[root@bdan3 ~]# glusterfs --version
glusterfs 3.1.2 built on Jan 14 2011 19:21:08
```

Let me know if you need more info or need me to run more tests.

Regards,
-Dan Bretherton

Hello,

It seems that GlusterFS is crashing while performing a replace-brick. The relevant logs are pasted below:

```
[2011-02-25 12:43:59.974051] I [afr-common.c:662:afr_lookup_done] raid10-pump: entries are missing in lookup of /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Application Data.
[2011-02-25 12:43:59.974103] I [afr-common.c:716:afr_lookup_done] raid10-pump: background meta-data data entry self-heal triggered. path: /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Application Data
[2011-02-25 12:43:59.982317] I [afr-self-heal-common.c:1526:afr_self_heal_completion_cbk] raid10-pump: background meta-data data entry self-heal completed on /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Application Data
[2011-02-25 12:43:59.984233] I [afr-common.c:662:afr_lookup_done] raid10-pump: entries are missing in lookup of /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu.
[2011-02-25 12:43:59.984288] I [afr-common.c:716:afr_lookup_done] raid10-pump: background meta-data data entry self-heal triggered. path: /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu
[2011-02-25 12:43:59.992450] I [afr-self-heal-common.c:1526:afr_self_heal_completion_cbk] raid10-pump: background meta-data data entry self-heal completed on /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu
[2011-02-25 12:43:59.994415] I [afr-common.c:662:afr_lookup_done] raid10-pump: entries are missing in lookup of /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu/Programs.
[2011-02-25 12:43:59.994469] I [afr-common.c:716:afr_lookup_done] raid10-pump: background meta-data data entry self-heal triggered. path: /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu/Programs
[2011-02-25 12:44:00.3184] I [afr-self-heal-common.c:1526:afr_self_heal_completion_cbk] raid10-pump: background meta-data data entry self-heal completed on /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu/Programs
[2011-02-25 12:44:00.5147] I [afr-common.c:662:afr_lookup_done] raid10-pump: entries are missing in lookup of /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu/Programs/Startup.
[2011-02-25 12:44:00.5206] I [afr-common.c:716:afr_lookup_done] raid10-pump: background meta-data data entry self-heal triggered. path: /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu/Programs/Startup
[2011-02-25 12:44:00.14464] I [afr-self-heal-common.c:1526:afr_self_heal_completion_cbk] raid10-pump: background meta-data data entry self-heal completed on /users/dab/backup/data/gorgon/users/dab/cxoffice/support/templates/win2000/drive_c/Windows/Start Menu/Programs/Startup

pending frames:

patchset: v3.1.1-64-gf2a067c
signal received: 11
time of crash: 2011-02-25 12:44:00
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.2
/lib64/libc.so.6[0x361b0302d0]
/usr/lib64/libglusterfs.so.0(synctask_wake+0x4d)[0x3d60c41c7d]
/usr/lib64/libglusterfs.so.0(syncop_readdirp_cbk+0x104)[0x3d60c43b04]
/usr/lib64/libglusterfs.so.0(default_readdirp_cbk+0x79)[0x3d60c22069]
/usr/lib64/libglusterfs.so.0(default_readdirp_cbk+0x79)[0x3d60c22069]
/usr/lib64/glusterfs/3.1.2/xlator/storage/posix.so(posix_do_readdir+0x409)[0x2aaaab6b1fc9]
/usr/lib64/glusterfs/3.1.2/xlator/storage/posix.so(posix_readdirp+0xf)[0x2aaaab6b254f]
/usr/lib64/libglusterfs.so.0(default_readdirp+0xe9)[0x3d60c1b809]
/usr/lib64/libglusterfs.so.0(default_readdirp+0xe9)[0x3d60c1b809]
/usr/lib64/libglusterfs.so.0(syncop_readdirp+0x135)[0x3d60c433b5]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf560f1]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf56656]
/usr/lib64/glusterfs/3.1.2/xlator/cluster/pump.so[0x2aaaabf5bfa0]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0xb)[0x3d60c4237b]
/lib64/libc.so.6[0x361b041870]
---------
```

Do let me know if you require any more logs or info.

Regards,
Prasanth

Raising the priority.

(In reply to comment #3) Have not been able to reproduce so far. Looking into it further.

PATCH: http://patches.gluster.com/patch/7381 in master (syncop: Increase stack size for deep call stack.)
PATCH: http://patches.gluster.com/patch/7382 in master (glusterd: replace brick status grows with dir tree.)
PATCH: http://patches.gluster.com/patch/7405 in release-3.2 (glusterd: replace brick status grows with dir tree.)
PATCH: http://patches.gluster.com/patch/7408 in release-3.2 (pump: cleanup potential dict related memory corruption.)
PATCH: http://patches.gluster.com/patch/7409 in release-3.2 (syncop: Increase stack size for deep call stack.)

The root cause of the replace-brick crash was a stack overflow. We tried to correlate the height and width of the directory tree with the occurrence of a crash. Sometimes the overflow ran into memory not owned by the glusterfsd process and produced an immediate segmentation fault; on other occasions it silently corrupted glusterfsd's data structures, and the process crashed much later with unhelpful backtraces. We began to observe consistent crashes once the directory depth reached ~20 (voila!).
On further analysis, we found that tasks running SYNCOPs were using a stack of only 16KB. That stack size cannot accommodate recursion beyond a certain directory depth (~20, empirically). After increasing it to 2MB, we tested with a directory depth of ~2048 (limited by POSIX PATH_MAX) without any crash.

*** Bug 2951 has been marked as a duplicate of this bug. ***
*** Bug 2650 has been marked as a duplicate of this bug. ***
*** Bug 2759 has been marked as a duplicate of this bug. ***
*** Bug 2954 has been marked as a duplicate of this bug. ***

PATCH: http://patches.gluster.com/patch/7458 in release-3.1 (syncop: Increase stack size for deep call stack.)
PATCH: http://patches.gluster.com/patch/7459 in release-3.1 (glusterd: replace brick status grows with dir tree.)
PATCH: http://patches.gluster.com/patch/7460 in release-3.1 (pump: cleanup potential dict related memory corruption.)

With the 3.2.1qa4 release, I tried a replace-brick operation on 40 GB of data with a directory depth of 50, and it succeeded. I also created a tree 2041 directories deep and ran replace-brick on it; that succeeded as well. I repeated the above exercise with 3.1.5qa2 and it works fine there too.

PATCH: http://patches.gluster.com/patch/7826 in master (pump, afr: dict related memory fixes.)