Description of problem:
Brick process crashed during the self-heal process.

Version-Release number of selected component (if applicable):
glusterfs-3.7.0-2.el6rhs.x86_64
nfs-ganesha-2.2.0-0.el6.x86_64

How reproducible:
Once

Steps to Reproduce:
1. Create a 6x2 distributed-replicate volume and mount it via nfs-ganesha with vers=3 (rough command sketch below)
2. Create directories and files
3. Bring down one brick from each of the replica pairs
4. Rename all the files and directories
5. Force start the volume
6. Self-heal process starts
7. After 5-10 minutes a brick process crashes

Actual results:
During the self-heal process, one of the brick processes crashed.

Expected results:
The self-heal process must complete; none of the bricks must crash.
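For reference, a rough shell sketch of the steps above. The volume name matches the status output below, but the hostnames, brick paths, file counts, and mount options are illustrative reconstructions, not the exact commands used:

    # 6x2 distributed-replicate volume: 12 bricks, replica 2
    gluster volume create testvol replica 2 \
        host1:/rhs/brick1/b0 host2:/rhs/brick1/b0 \
        host3:/rhs/brick1/b1 host4:/rhs/brick1/b1 \
        host1:/rhs/brick2/b2 host2:/rhs/brick2/b2 \
        host3:/rhs/brick2/b3 host4:/rhs/brick2/b3 \
        host1:/rhs/brick3/b4 host2:/rhs/brick3/b4 \
        host3:/rhs/brick3/b5 host4:/rhs/brick3/b5
    gluster volume start testvol

    # On the client, mount the nfs-ganesha export over NFSv3
    mount -t nfs -o vers=3 ganesha-host:/testvol /mnt/testvol

    # Create data, kill one brick per replica pair, rename everything,
    # then force-start so the downed bricks return and self-heal begins
    for i in $(seq 1 300); do mkdir /mnt/testvol/dir.$i; touch /mnt/testvol/file.$i; done
    kill -9 <brick-pid>        # one brick from each replica pair
    for f in /mnt/testvol/*; do mv "$f" "$f.renamed"; done
    gluster volume start testvol force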
Additional info:

Backtrace of the core:

(gdb) bt
#0  0x00007f437831a531 in server_process_event_upcall (this=0x7f437401e650, data=<value optimized out>) at server.c:1145
#1  0x00007f437831a6dd in notify (this=0x7f437401e650, event=<value optimized out>, data=<value optimized out>) at server.c:1182
#2  0x0000003ae0a21916 in xlator_notify (xl=0x7f437401e650, event=19, data=0x7f4360e1f500) at xlator.c:489
#3  0x0000003ae0a2c142 in default_notify (this=0x7f437401d1d0, event=19, data=0x7f4360e1f500) at defaults.c:2331
#4  0x00007f437855a6ae in notify (this=0x7f437401d1d0, event=<value optimized out>, data=0x7f4360e1f500) at io-stats.c:3064
#5  0x0000003ae0a21916 in xlator_notify (xl=0x7f437401d1d0, event=19, data=0x7f4360e1f500) at xlator.c:489
#6  0x0000003ae0a2c142 in default_notify (this=0x7f437401bcb0, event=19, data=0x7f4360e1f500) at defaults.c:2331
#7  0x0000003ae0a21916 in xlator_notify (xl=0x7f437401bcb0, event=19, data=0x7f4360e1f500) at xlator.c:489
#8  0x0000003ae0a2c142 in default_notify (this=0x7f437401a850, event=19, data=0x7f4360e1f500) at defaults.c:2331
#9  0x0000003ae0a21916 in xlator_notify (xl=0x7f437401a850, event=19, data=0x7f4360e1f500) at xlator.c:489
#10 0x0000003ae0a2c142 in default_notify (this=0x7f43740192a0, event=19, data=0x7f4360e1f500) at defaults.c:2331
#11 0x0000003ae0a21916 in xlator_notify (xl=0x7f43740192a0, event=19, data=0x7f4360e1f500) at xlator.c:489
#12 0x0000003ae0a2c142 in default_notify (this=0x7f4374017970, event=19, data=0x7f4360e1f500) at defaults.c:2331
#13 0x0000003ae0a21916 in xlator_notify (xl=0x7f4374017970, event=19, data=0x7f4360e1f500) at xlator.c:489
#14 0x0000003ae0a2c142 in default_notify (this=0x7f4374016510, event=19, data=0x7f4360e1f500) at defaults.c:2331
#15 0x00007f4378fb80bb in notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at index.c:1419
#16 0x0000003ae0a21916 in xlator_notify (xl=0x7f4374016510, event=19, data=0x7f4360e1f500) at xlator.c:489
#17 0x0000003ae0a2c142 in default_notify (this=0x7f4374015020, event=19, data=0x7f4360e1f500) at defaults.c:2331
#18 0x00007f43791c28b9 in notify (this=0x7f4374015020, event=19, data=0x7f4360e1f500) at barrier.c:539
#19 0x0000003ae0a21916 in xlator_notify (xl=0x7f4374015020, event=19, data=0x7f4360e1f500) at xlator.c:489
#20 0x0000003ae0a2c142 in default_notify (this=0x7f4374013c60, event=19, data=0x7f4360e1f500) at defaults.c:2331
#21 0x0000003ae0a21916 in xlator_notify (xl=0x7f4374013c60, event=19, data=0x7f4360e1f500) at xlator.c:489
#22 0x0000003ae0a2c142 in default_notify (this=0x7f4374012810, event=19, data=0x7f4360e1f500) at defaults.c:2331
#23 0x00007f43795d6b68 in notify (this=0x7f4374012810, event=<value optimized out>, data=0x7f4360e1f500) at upcall.c:1747
#24 0x00007f43795dfad0 in upcall_client_cache_invalidate (this=0x7f4374012810, gfid=<value optimized out>, up_client_entry=0x7f434c10a690, flags=<value optimized out>, stbuf=0x0, p_stbuf=0x7f4360e1f9d0, oldp_stbuf=0x0) at upcall-internal.c:578
#25 0x00007f43795e0589 in upcall_cache_invalidate (frame=0x7f4384952724, this=0x7f4374012810, client=0x7f436c0026a0, inode=0x7f43610e66e4, flags=529, stbuf=0x0, p_stbuf=0x7f4360e1f9d0, oldp_stbuf=0x0) at upcall-internal.c:519
#26 0x00007f43795de13b in up_rmdir_cbk (frame=0x7f4384952724, cookie=<value optimized out>, this=0x7f4374012810, op_ret=0, op_errno=39, preparent=0x7f4360e1fa40, postparent=0x7f4360e1f9d0, xdata=0x0) at upcall.c:584
#27 0x00007f4379a0186c in posix_acl_rmdir_cbk (frame=0x7f43849535ec, cookie=<value optimized out>, this=<value optimized out>, op_ret=0, op_errno=39, preparent=<value optimized out>, postparent=0x7f4360e1f9d0, xdata=0x0) at posix-acl.c:1370
#28 0x00007f4379e233c8 in changelog_rmdir_cbk (frame=0x7f4384952f34, cookie=<value optimized out>, this=<value optimized out>, op_ret=0, op_errno=<value optimized out>, preparent=<value optimized out>, postparent=0x7f4360e1f9d0, xdata=0x0) at changelog.c:66
#29 0x00007f437a459a0c in trash_common_rmdir_cbk (frame=0x7f438495287c, cookie=<value optimized out>, this=<value optimized out>, op_ret=0, op_errno=39, preparent=<value optimized out>, postparent=0x7f4360e1f9d0, xdata=0x0) at trash.c:555
#30 0x00007f437aa8b108 in posix_rmdir (frame=0x7f438495333c, this=<value optimized out>, loc=<value optimized out>, flags=1, xdata=<value optimized out>) at posix.c:1798
#31 0x00007f437a45b307 in trash_rmdir (frame=0x7f438495287c, this=0x7f4374009020, loc=0x7f43843dacc8, flags=1, xdata=0x0) at trash.c:1926
#32 0x0000003ae0a2d678 in default_rmdir (frame=0x7f438495287c, this=0x7f437400a6c0, loc=0x7f43843dacc8, flags=1, xdata=<value optimized out>) at defaults.c:1905
#33 0x00007f4379e27b68 in changelog_rmdir (frame=0x7f4384952f34, this=0x7f437400cdf0, loc=0x7f43843dacc8, xflags=1, xdata=0x0) at changelog.c:164
#34 0x0000003ae0a2d678 in default_rmdir (frame=0x7f4384952f34, this=0x7f437400ec40, loc=0x7f43843dacc8, flags=1, xdata=<value optimized out>) at defaults.c:1905
#35 0x00007f4379a03f25 in posix_acl_rmdir (frame=0x7f43849535ec, this=0x7f43740100d0, loc=0x7f43843dacc8, flags=1, xdata=0x0) at posix-acl.c:1387
#36 0x0000003ae0a2d678 in default_rmdir (frame=0x7f43849535ec, this=0x7f43740114a0, loc=0x7f43843dacc8, flags=1, xdata=<value optimized out>) at defaults.c:1905
#37 0x00007f43795daa49 in up_rmdir (frame=0x7f4384952724, this=0x7f4374012810, loc=0x7f43843dacc8, flags=1, xdata=0x0) at upcall.c:610
#38 0x0000003ae0a315c7 in default_rmdir_resume (frame=0x7f4384953698, this=0x7f4374013c60, loc=0x7f43843dacc8, flags=1, xdata=0x0) at defaults.c:1464
#39 0x0000003ae0a4bb60 in call_resume (stub=0x7f43843dac88) at call-stub.c:2576
#40 0x00007f43793d0398 in iot_worker (data=0x7f437404ea50) at io-threads.c:214
#41 0x00000037286079d1 in start_thread () from /lib64/libpthread.so.0
#42 0x00000037282e89dd in clone () from /lib64/libc.so.6

[root@nfs2 /]# gluster v status testvol
Status of volume: testvol
Gluster process                                        TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.180:/rhs/brick1/brick1/testvol_brick0   49230     0          Y       5988
Brick 10.70.46.185:/rhs/brick1/brick1/testvol_brick1   49227     0          Y       12210
Brick 10.70.46.179:/rhs/brick1/brick0/testvol_brick2   49204     0          Y       22927
Brick 10.70.46.172:/rhs/brick1/brick0/testvol_brick3   49204     0          Y       372
Brick 10.70.46.180:/rhs/brick1/brick2/testvol_brick4   49231     0          Y       6005
Brick 10.70.46.185:/rhs/brick1/brick2/testvol_brick5   N/A       N/A        N       12231
Brick 10.70.46.179:/rhs/brick1/brick1/testvol_brick6   49205     0          Y       22944
Brick 10.70.46.172:/rhs/brick1/brick1/testvol_brick7   49205     0          Y       397
Brick 10.70.46.180:/rhs/brick1/brick3/testvol_brick8   49232     0          Y       6022
Brick 10.70.46.185:/rhs/brick1/brick3/testvol_brick9   49229     0          Y       12249
Brick 10.70.46.179:/rhs/brick1/brick2/testvol_brick10  49206     0          Y       22961
Brick 10.70.46.172:/rhs/brick1/brick2/testvol_brick11  49206     0          Y       417
NFS Server on localhost                                N/A       N/A        N       N/A
Self-heal Daemon on localhost                          N/A       N/A        Y       12279
NFS Server on 10.70.46.179                             N/A       N/A        N       N/A
Self-heal Daemon on 10.70.46.179                       N/A       N/A        Y       28432
NFS Server on 10.70.46.172                             N/A       N/A        N       N/A
Self-heal Daemon on 10.70.46.172                       N/A       N/A        Y       476
NFS Server on 10.70.46.180                             N/A       N/A        N       N/A
Self-heal Daemon on 10.70.46.180                       N/A       N/A        Y       11504

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

[root@nfs2 /]# gluster v heal testvol info | grep "Number"
Number of entries: 300
Number of entries: 0
Number of entries: 300
Number of entries: 0
Number of entries: 749
Number of entries: 300
Number of entries: 0
Number of entries: 300
Number of entries: 0
Number of entries: 300
Number of entries: 0
sosreports and core: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1226820/
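For anyone re-examining the attached core, a gdb invocation of roughly this shape regenerates the backtrace above. The paths are illustrative; the debuginfo package must match the exact glusterfs build that produced the core:

    debuginfo-install glusterfs          # match glusterfs-3.7.0-2.el6rhs
    gdb /usr/sbin/glusterfsd /path/to/core.12231 -ex bt -ex quit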
I followed similar steps on my setup but was unable to reproduce this issue. I also could not debug the attached core, as the RPMs have been updated. Please re-run the test on the latest RPMs and let me know if any crash is reported.
Please try to reproduce the issue with the latest build and provide the core.
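In case it helps, a rough sketch for making sure cores are captured on the brick nodes before the re-run; the core pattern and path are illustrative:

    mkdir -p /var/crash
    sysctl -w kernel.core_pattern=/var/crash/core.%e.%p.%t
    ulimit -c unlimited    # applies to this shell; daemons spawned by
                           # glusterd may need the limit raised in the
                           # service configuration instead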
The fix is merged in the RHGS 3.1 branch and should be available in the next build.
Did not see any brick process crash, but the self-heal process seems to be in a hung state; logged a bug for the same: Bug 1234884 - Selfheal on a volume stops at a particular point and does not resume for a long time
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1495.html