Description of problem:
Brick process crashed during the self-heal process.

Version-Release number of selected component (if applicable):
glusterfs-3.7.0-2.el6rhs.x86_64
nfs-ganesha-2.2.0-0.el6.x86_64

How reproducible:
Once

Steps to Reproduce:
1. Create a 6x2 distributed-replicate volume and mount it via nfs-ganesha with vers=3 (rough command sketch below)
2. Create directories and files
3. Bring down one brick from each of the replica pairs
4. Rename all the files and directories
5. Force start the volume
6. Self-heal process starts
7. After 5-10 minutes a brick process crashes

Actual results:
During the self-heal process, one of the brick processes crashed.

Expected results:
The self-heal process must complete; none of the bricks must crash.
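For reference, a rough shell sketch of the steps above. The volume name matches the status output below, but the hostnames, brick paths, file counts, and mount options are illustrative reconstructions, not the exact commands used:

    # 6x2 distributed-replicate volume: 12 bricks, replica 2
    gluster volume create testvol replica 2 \
        host1:/rhs/brick1/b0 host2:/rhs/brick1/b0 \
        host3:/rhs/brick1/b1 host4:/rhs/brick1/b1 \
        host1:/rhs/brick2/b2 host2:/rhs/brick2/b2 \
        host3:/rhs/brick2/b3 host4:/rhs/brick2/b3 \
        host1:/rhs/brick3/b4 host2:/rhs/brick3/b4 \
        host3:/rhs/brick3/b5 host4:/rhs/brick3/b5
    gluster volume start testvol

    # On the client, mount the nfs-ganesha export over NFSv3
    mount -t nfs -o vers=3 ganesha-host:/testvol /mnt/testvol

    # Create data, kill one brick per replica pair, rename everything,
    # then force-start so the downed bricks return and self-heal begins
    for i in $(seq 1 300); do mkdir /mnt/testvol/dir.$i; touch /mnt/testvol/file.$i; done
    kill -9 <brick-pid>        # one brick from each replica pair
    for f in /mnt/testvol/*; do mv "$f" "$f.renamed"; done
    gluster volume start testvol force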
Additional info:

Backtrace of the core:

(gdb) bt
#0  0x00007f437831a531 in server_process_event_upcall (this=0x7f437401e650, data=<value optimized out>) at server.c:1145
#1  0x00007f437831a6dd in notify (this=0x7f437401e650, event=<value optimized out>, data=<value optimized out>) at server.c:1182
#2  0x0000003ae0a21916 in xlator_notify (xl=0x7f437401e650, event=19, data=0x7f4360e1f500) at xlator.c:489
#3  0x0000003ae0a2c142 in default_notify (this=0x7f437401d1d0, event=19, data=0x7f4360e1f500) at defaults.c:2331
#4  0x00007f437855a6ae in notify (this=0x7f437401d1d0, event=<value optimized out>, data=0x7f4360e1f500) at io-stats.c:3064
#5  0x0000003ae0a21916 in xlator_notify (xl=0x7f437401d1d0, event=19, data=0x7f4360e1f500) at xlator.c:489
#6  0x0000003ae0a2c142 in default_notify (this=0x7f437401bcb0, event=19, data=0x7f4360e1f500) at defaults.c:2331
#7  0x0000003ae0a21916 in xlator_notify (xl=0x7f437401bcb0, event=19, data=0x7f4360e1f500) at xlator.c:489
#8  0x0000003ae0a2c142 in default_notify (this=0x7f437401a850, event=19, data=0x7f4360e1f500) at defaults.c:2331
#9  0x0000003ae0a21916 in xlator_notify (xl=0x7f437401a850, event=19, data=0x7f4360e1f500) at xlator.c:489
#10 0x0000003ae0a2c142 in default_notify (this=0x7f43740192a0, event=19, data=0x7f4360e1f500) at defaults.c:2331
#11 0x0000003ae0a21916 in xlator_notify (xl=0x7f43740192a0, event=19, data=0x7f4360e1f500) at xlator.c:489
#12 0x0000003ae0a2c142 in default_notify (this=0x7f4374017970, event=19, data=0x7f4360e1f500) at defaults.c:2331
#13 0x0000003ae0a21916 in xlator_notify (xl=0x7f4374017970, event=19, data=0x7f4360e1f500) at xlator.c:489
#14 0x0000003ae0a2c142 in default_notify (this=0x7f4374016510, event=19, data=0x7f4360e1f500) at defaults.c:2331
#15 0x00007f4378fb80bb in notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at index.c:1419
#16 0x0000003ae0a21916 in xlator_notify (xl=0x7f4374016510, event=19, data=0x7f4360e1f500) at xlator.c:489
#17 0x0000003ae0a2c142 in default_notify (this=0x7f4374015020, event=19, data=0x7f4360e1f500) at defaults.c:2331
#18 0x00007f43791c28b9 in notify (this=0x7f4374015020, event=19, data=0x7f4360e1f500) at barrier.c:539
#19 0x0000003ae0a21916 in xlator_notify (xl=0x7f4374015020, event=19, data=0x7f4360e1f500) at xlator.c:489
#20 0x0000003ae0a2c142 in default_notify (this=0x7f4374013c60, event=19, data=0x7f4360e1f500) at defaults.c:2331
#21 0x0000003ae0a21916 in xlator_notify (xl=0x7f4374013c60, event=19, data=0x7f4360e1f500) at xlator.c:489
#22 0x0000003ae0a2c142 in default_notify (this=0x7f4374012810, event=19, data=0x7f4360e1f500) at defaults.c:2331
#23 0x00007f43795d6b68 in notify (this=0x7f4374012810, event=<value optimized out>, data=0x7f4360e1f500) at upcall.c:1747
#24 0x00007f43795dfad0 in upcall_client_cache_invalidate (this=0x7f4374012810, gfid=<value optimized out>, up_client_entry=0x7f434c10a690, flags=<value optimized out>, stbuf=0x0, p_stbuf=0x7f4360e1f9d0, oldp_stbuf=0x0) at upcall-internal.c:578
#25 0x00007f43795e0589 in upcall_cache_invalidate (frame=0x7f4384952724, this=0x7f4374012810, client=0x7f436c0026a0, inode=0x7f43610e66e4, flags=529, stbuf=0x0, p_stbuf=0x7f4360e1f9d0, oldp_stbuf=0x0) at upcall-internal.c:519
#26 0x00007f43795de13b in up_rmdir_cbk (frame=0x7f4384952724, cookie=<value optimized out>, this=0x7f4374012810, op_ret=0, op_errno=39, preparent=0x7f4360e1fa40, postparent=0x7f4360e1f9d0, xdata=0x0) at upcall.c:584
#27 0x00007f4379a0186c in posix_acl_rmdir_cbk (frame=0x7f43849535ec, cookie=<value optimized out>, this=<value optimized out>, op_ret=0, op_errno=39, preparent=<value optimized out>, postparent=0x7f4360e1f9d0, xdata=0x0) at posix-acl.c:1370
#28 0x00007f4379e233c8 in changelog_rmdir_cbk (frame=0x7f4384952f34, cookie=<value optimized out>, this=<value optimized out>, op_ret=0, op_errno=<value optimized out>, preparent=<value optimized out>, postparent=0x7f4360e1f9d0, xdata=0x0) at changelog.c:66
#29 0x00007f437a459a0c in trash_common_rmdir_cbk (frame=0x7f438495287c, cookie=<value optimized out>, this=<value optimized out>, op_ret=0, op_errno=39, preparent=<value optimized out>, postparent=0x7f4360e1f9d0, xdata=0x0) at trash.c:555
#30 0x00007f437aa8b108 in posix_rmdir (frame=0x7f438495333c, this=<value optimized out>, loc=<value optimized out>, flags=1, xdata=<value optimized out>) at posix.c:1798
#31 0x00007f437a45b307 in trash_rmdir (frame=0x7f438495287c, this=0x7f4374009020, loc=0x7f43843dacc8, flags=1, xdata=0x0) at trash.c:1926
#32 0x0000003ae0a2d678 in default_rmdir (frame=0x7f438495287c, this=0x7f437400a6c0, loc=0x7f43843dacc8, flags=1, xdata=<value optimized out>) at defaults.c:1905
#33 0x00007f4379e27b68 in changelog_rmdir (frame=0x7f4384952f34, this=0x7f437400cdf0, loc=0x7f43843dacc8, xflags=1, xdata=0x0) at changelog.c:164
#34 0x0000003ae0a2d678 in default_rmdir (frame=0x7f4384952f34, this=0x7f437400ec40, loc=0x7f43843dacc8, flags=1, xdata=<value optimized out>) at defaults.c:1905
#35 0x00007f4379a03f25 in posix_acl_rmdir (frame=0x7f43849535ec, this=0x7f43740100d0, loc=0x7f43843dacc8, flags=1, xdata=0x0) at posix-acl.c:1387
#36 0x0000003ae0a2d678 in default_rmdir (frame=0x7f43849535ec, this=0x7f43740114a0, loc=0x7f43843dacc8, flags=1, xdata=<value optimized out>) at defaults.c:1905
#37 0x00007f43795daa49 in up_rmdir (frame=0x7f4384952724, this=0x7f4374012810, loc=0x7f43843dacc8, flags=1, xdata=0x0) at upcall.c:610
#38 0x0000003ae0a315c7 in default_rmdir_resume (frame=0x7f4384953698, this=0x7f4374013c60, loc=0x7f43843dacc8, flags=1, xdata=0x0) at defaults.c:1464
#39 0x0000003ae0a4bb60 in call_resume (stub=0x7f43843dac88) at call-stub.c:2576
#40 0x00007f43793d0398 in iot_worker (data=0x7f437404ea50) at io-threads.c:214
#41 0x00000037286079d1 in start_thread () from /lib64/libpthread.so.0
#42 0x00000037282e89dd in clone () from /lib64/libc.so.6

[root@nfs2 /]# gluster v status testvol
Status of volume: testvol
Gluster process                                        TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.180:/rhs/brick1/brick1/testvol_brick0   49230     0          Y       5988
Brick 10.70.46.185:/rhs/brick1/brick1/testvol_brick1   49227     0          Y       12210
Brick 10.70.46.179:/rhs/brick1/brick0/testvol_brick2   49204     0          Y       22927
Brick 10.70.46.172:/rhs/brick1/brick0/testvol_brick3   49204     0          Y       372
Brick 10.70.46.180:/rhs/brick1/brick2/testvol_brick4   49231     0          Y       6005
Brick 10.70.46.185:/rhs/brick1/brick2/testvol_brick5   N/A       N/A        N       12231
Brick 10.70.46.179:/rhs/brick1/brick1/testvol_brick6   49205     0          Y       22944
Brick 10.70.46.172:/rhs/brick1/brick1/testvol_brick7   49205     0          Y       397
Brick 10.70.46.180:/rhs/brick1/brick3/testvol_brick8   49232     0          Y       6022
Brick 10.70.46.185:/rhs/brick1/brick3/testvol_brick9   49229     0          Y       12249
Brick 10.70.46.179:/rhs/brick1/brick2/testvol_brick10  49206     0          Y       22961
Brick 10.70.46.172:/rhs/brick1/brick2/testvol_brick11  49206     0          Y       417
NFS Server on localhost                                N/A       N/A        N       N/A
Self-heal Daemon on localhost                          N/A       N/A        Y       12279
NFS Server on 10.70.46.179                             N/A       N/A        N       N/A
Self-heal Daemon on 10.70.46.179                       N/A       N/A        Y       28432
NFS Server on 10.70.46.172                             N/A       N/A        N       N/A
Self-heal Daemon on 10.70.46.172                       N/A       N/A        Y       476
NFS Server on 10.70.46.180                             N/A       N/A        N       N/A
Self-heal Daemon on 10.70.46.180                       N/A       N/A        Y       11504

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

[root@nfs2 /]# gluster v heal testvol info | grep "Number"
Number of entries: 300
Number of entries: 0
Number of entries: 300
Number of entries: 0
Number of entries: 749
Number of entries: 300
Number of entries: 0
Number of entries: 300
Number of entries: 0
Number of entries: 300
Number of entries: 0
sosreports and core: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1226820/
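For anyone re-examining the attached core, a gdb invocation of roughly this shape regenerates the backtrace above. The paths are illustrative; the debuginfo package must match the exact glusterfs build that produced the core:

    debuginfo-install glusterfs          # match glusterfs-3.7.0-2.el6rhs
    gdb /usr/sbin/glusterfsd /path/to/core.12231 -ex bt -ex quit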
I followed similar steps on my setup but was unable to reproduce this issue. I also could not debug the attached core, as the RPMs have been updated. Please re-run the test on the latest RPMs and let me know if any crash is reported.
Please try to reproduce the issue with the latest build and provide the core.
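In case it helps, a rough sketch for making sure cores are captured on the brick nodes before the re-run; the core pattern and path are illustrative:

    mkdir -p /var/crash
    sysctl -w kernel.core_pattern=/var/crash/core.%e.%p.%t
    ulimit -c unlimited    # applies to this shell; daemons spawned by
                           # glusterd may need the limit raised in the
                           # service configuration instead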
The fix is merged in the RHGS 3.1 branch and should be available in the next build.
Did not see any brick process crash, but the self-heal process seems to be in a hung state; logged a bug for the same: Bug 1234884 - Selfheal on a volume stops at a particular point and does not resume for a long time
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1495.html