Bug 1459400

Summary: brick process crashes while running bug-1432542-mpx-restart-crash.t in a loop
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Mohit Agrawal <moagrawa>
Component: core
Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED ERRATA
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: urgent
Docs Contact:
Priority: urgent
Version: rhgs-3.3
CC: amukherj, nchilaka, rhs-bugs, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.3.0
Hardware: x86_64
OS: All
Whiteboard: brick-multiplexing
Fixed In Version: glusterfs-3.8.4-28
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1459402 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:45:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1468514
Bug Blocks: 1417151, 1459402

Description Mohit Agrawal 2017-06-07 03:50:52 UTC
Description of problem:
Sometimes the brick process crashes in the index xlator while executing the regression test
suite with brick multiplexing enabled.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
Run the test case (bug-1432542-mpx-restart-crash.t) in a loop 100 times. Roughly once in 100 runs
the brick process crashes.
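
A minimal driver for this reproducer, assuming an upstream glusterfs source tree and that the test lives under tests/bugs/core/ (both the location and the use of prove are assumptions; adjust paths for your tree), might look like:

# Hypothetical loop: run the mpx restart test repeatedly and stop on the
# first failure or newly generated core file (core location depends on
# kernel.core_pattern on the test machine).
cd /path/to/glusterfs/source
for i in $(seq 1 100); do
    echo "===== iteration $i ====="
    prove -vf tests/bugs/core/bug-1432542-mpx-restart-crash.t || break
    ls core* /core* 2>/dev/null && { echo "core found in iteration $i"; break; }
done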

Actual results:
The brick process crashes.

Expected results:
The brick process should not crash.

Additional info:

Comment 2 Mohit Agrawal 2017-06-07 03:54:39 UTC
Hi,

Below is the backtrace pattern from the core generated by the brick process at the time of the crash:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

09:18:46 Program terminated with signal 11, Segmentation fault.
09:18:46 #0  0x00007f8fcab1bcdf in index_get_gfid_type (opaque=0x7f8fac49bcb0) at /home/jenkins/root/workspace/regression-test-with-multiplex/xlators/features/index/src/index.c:1632
09:18:46 1632	        list_for_each_entry (entry, &args->entries->list, list) {
09:18:46 
09:18:46 Thread 64 (Thread 0x7f8fd4700700 (LWP 6861)):
09:18:46 #0  0x00007f8fde222a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
09:18:46 No symbol table info available.
09:18:46 #1  0x00007f8fdef91ba0 in syncenv_task (proc=0xce4d70) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/syncop.c:603
09:18:46         env = 0xce49b0
09:18:46         task = 0x0
09:18:46         sleep_till = {tv_sec = 1496642513, tv_nsec = 0}
09:18:46         ret = 0
09:18:46 #2  0x00007f8fdef91e42 in syncenv_processor (thdata=0xce4d70) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/syncop.c:695
09:18:46         env = 0xce49b0
09:18:46         proc = 0xce4d70
09:18:46         task = 0x7f8fcc3d7670
09:18:46 #3  0x00007f8fde21eaa1 in start_thread () from /lib64/libpthread.so.0
09:18:46 No symbol table info available.
09:18:46 #4  0x00007f8fddb86bcd in clone () from /lib64/libc.so.6
09:18:46 No symbol table info available.
09:18:46 
.....
......
......

18:46 Thread 1 (Thread 0x7f8fd5101700 (LWP 6860)):
09:18:46 #0  0x00007f8fcab1bcdf in index_get_gfid_type (opaque=0x7f8fac49bcb0) at /home/jenkins/root/workspace/regression-test-with-multiplex/xlators/features/index/src/index.c:1632
09:18:46         entry = 0x0
09:18:46         this = 0x7f8fa0a24da0
09:18:46         args = 0x7f8fac49bcb0
09:18:46         loc = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, gfid = '\000' <repeats 15 times>, pargfid = '\000' <repeats 15 times>}
09:18:46         iatt = {ia_ino = 0, ia_gfid = '\000' <repeats 15 times>, ia_dev = 0, ia_type = IA_INVAL, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}}, ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_rdev = 0, ia_size = 0, ia_blksize = 0, ia_blocks = 0, ia_atime = 0, ia_atime_nsec = 0, ia_mtime = 0, ia_mtime_nsec = 0, ia_ctime = 0, ia_ctime_nsec = 0}
09:18:46         ret = 0
09:18:46 #1  0x00007f8fdef91355 in synctask_wrap () at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/syncop.c:375
09:18:46         task = 0x7f8fa4782e50
09:18:46 #2  0x00007f8fddae1760 in ?? () from /lib64/libc.so.6
09:18:46 No symbol table info available.
09:18:46 #3  0x0000000000000000 in ?? ()
09:18:46 No symbol table info available.


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>.
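
For reference, a backtrace in this form can usually be regenerated from the core file with gdb (the binary and core paths below are placeholders, not the actual paths from this run):

gdb -batch -ex 'set pagination off' -ex 'thread apply all bt full' /usr/sbin/glusterfsd /path/to/corefile > backtrace.txt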

After analysing the crash, it appears the brick process crashed because the worker thread was not cleaned up appropriately in the index xlator.
After updating the index_worker code as well as the notify code in the index xlator, the brick process no longer crashes.
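
As a rough, illustrative way to observe this (not part of the fix itself), one can watch the thread count of the multiplexed brick process across a volume stop; if per-brick worker threads are cleaned up on detach, the count should drop back rather than leave stale threads behind. The volume name and pid lookup below are placeholder assumptions:

# Illustrative only: compare glusterfsd thread counts before and after
# detaching a brick (volume stop) in brick-multiplex mode.
VOLNAME=testvol1
PID=$(pgrep -f glusterfsd | head -n1)
echo "threads before: $(ls /proc/$PID/task | wc -l)"
gluster volume stop $VOLNAME --mode=script
sleep 5
echo "threads after stop: $(ls /proc/$PID/task | wc -l)"
gluster volume start $VOLNAME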


Regards
Mohit Agrawal

Comment 5 Atin Mukherjee 2017-06-09 10:03:04 UTC
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/108648/

Comment 8 Nag Pavan Chilakam 2017-07-17 10:41:55 UTC
ON_QA validation:
Had 50 1x3 volumes on a 6-node brick-mux setup (all bricks hosted only on the first 3 nodes).
Started and stopped the volumes in a loop 100 times with IO going on on one volume:

for j in {1..100}; do
    echo "#########################################"
    date
    for i in $(gluster v list); do gluster v stop $i --mode=script; done
    ps -ef | grep glusterfsd
    echo "############ end of loop $j ########################"
    for k in $(gluster v list); do gluster v start $k; done
done



After about 7 iterations, I hit a glusterfsd crash as reported in BZ#1468514.

Hence marking this blocked until BZ#1468514 is fixed.

Comment 9 Nag Pavan Chilakam 2017-07-18 06:21:21 UTC
ON_QA validation on the 3.8.4-34 test version.

Ran the command in comment#8 while doing IO on only one volume (refer to https://bugzilla.redhat.com/show_bug.cgi?id=1468514#c21).

Moving to verified as I didn't hit any crash for about 150 loops.
However, I did hit the same crash at around loop #170, which is mentioned in BZ#468514. As I didn't hit any other crash, I am moving to verified.

Comment 10 Nag Pavan Chilakam 2017-07-18 06:21:51 UTC
(In reply to nchilaka from comment #9)
> on_qa validation 3.8.4-34 test version
> 
> Ran the command in comment#8 while I was doing IO for only one volume(refer
> to https://bugzilla.redhat.com/show_bug.cgi?id=1468514#c21)
> 
> Moving to verified as I didn't hit any crash for about 150 loops.
> However, I did hit the same crash at around loop#170  which is mentioned in
> BZ#468514. As i didn't hit anyother crash  i am moving to verified

Sorry it is BZ#1468514

Comment 12 errata-xmlrpc 2017-09-21 04:45:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
