Bug 1459400 - brick process crashes while running bug-1432542-mpx-restart-crash.t in a loop
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: core
Version: 3.3
Hardware: x86_64 All
Priority: urgent  Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.3.0
Assigned To: Mohit Agrawal
QA Contact: nchilaka
Whiteboard: brick-multiplexing
Depends On: 1468514
Blocks: 1417151 1459402
Reported: 2017-06-06 23:50 EDT by Mohit Agrawal
Modified: 2017-09-21 00:58 EDT (History)
4 users

See Also:
Fixed In Version: glusterfs-3.8.4-28
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1459402 (view as bug list)
Environment:
Last Closed: 2017-09-21 00:45:37 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2774 normal SHIPPED_LIVE glusterfs bug fix and enhancement update 2017-09-21 04:16:29 EDT

Description Mohit Agrawal 2017-06-06 23:50:52 EDT
Description of problem:
The brick process sometimes crashes in the index xlator while executing the regression test
suite with brick multiplexing enabled.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
Run the test case (bug-1432542-mpx-restart-crash.t) in a loop 100 times. Roughly once per
100 runs the brick process crashes.
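A driver loop along the following lines can be used for this kind of run-until-crash reproduction. The test path and the `prove` runner are assumptions about the usual GlusterFS regression-suite layout, not details taken from this report:

```shell
#!/bin/sh
# Hypothetical sketch: run a regression test repeatedly until it fails
# (e.g. until the brick process crashes and the test reports an error).
# The test path below is an assumption about the suite layout.
TEST=${TEST:-tests/bugs/core/bug-1432542-mpx-restart-crash.t}

# run_in_loop N CMD...: run CMD up to N times, stopping at the first failure
run_in_loop() {
    n=$1; shift
    i=1
    while [ "$i" -le "$n" ]; do
        echo "=== iteration $i of $n ==="
        "$@" || { echo "FAILED at iteration $i"; return 1; }
        i=$((i + 1))
    done
}

# Example invocation (uncomment inside a regression-suite checkout):
# run_in_loop 100 prove -v "$TEST"
```

Stopping at the first failure preserves the core file from the crashing iteration instead of overwriting it on later runs.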

Actual results:
The brick process crashes.

Expected results:
The brick process should not crash.

Additional info:
Comment 2 Mohit Agrawal 2017-06-06 23:54:39 EDT
Hi,

Below is the backtrace from the core generated by the brick process at the time of the crash:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

09:18:46 Program terminated with signal 11, Segmentation fault.
09:18:46 #0  0x00007f8fcab1bcdf in index_get_gfid_type (opaque=0x7f8fac49bcb0) at /home/jenkins/root/workspace/regression-test-with-multiplex/xlators/features/index/src/index.c:1632
09:18:46 1632	        list_for_each_entry (entry, &args->entries->list, list) {
09:18:46 
09:18:46 Thread 64 (Thread 0x7f8fd4700700 (LWP 6861)):
09:18:46 #0  0x00007f8fde222a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
09:18:46 No symbol table info available.
09:18:46 #1  0x00007f8fdef91ba0 in syncenv_task (proc=0xce4d70) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/syncop.c:603
09:18:46         env = 0xce49b0
09:18:46         task = 0x0
09:18:46         sleep_till = {tv_sec = 1496642513, tv_nsec = 0}
09:18:46         ret = 0
09:18:46 #2  0x00007f8fdef91e42 in syncenv_processor (thdata=0xce4d70) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/syncop.c:695
09:18:46         env = 0xce49b0
09:18:46         proc = 0xce4d70
09:18:46         task = 0x7f8fcc3d7670
09:18:46 #3  0x00007f8fde21eaa1 in start_thread () from /lib64/libpthread.so.0
09:18:46 No symbol table info available.
09:18:46 #4  0x00007f8fddb86bcd in clone () from /lib64/libc.so.6
09:18:46 No symbol table info available.
09:18:46 
.....
......
......

09:18:46 Thread 1 (Thread 0x7f8fd5101700 (LWP 6860)):
09:18:46 #0  0x00007f8fcab1bcdf in index_get_gfid_type (opaque=0x7f8fac49bcb0) at /home/jenkins/root/workspace/regression-test-with-multiplex/xlators/features/index/src/index.c:1632
09:18:46         entry = 0x0
09:18:46         this = 0x7f8fa0a24da0
09:18:46         args = 0x7f8fac49bcb0
09:18:46         loc = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, gfid = '\000' <repeats 15 times>, pargfid = '\000' <repeats 15 times>}
09:18:46         iatt = {ia_ino = 0, ia_gfid = '\000' <repeats 15 times>, ia_dev = 0, ia_type = IA_INVAL, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}}, ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_rdev = 0, ia_size = 0, ia_blksize = 0, ia_blocks = 0, ia_atime = 0, ia_atime_nsec = 0, ia_mtime = 0, ia_mtime_nsec = 0, ia_ctime = 0, ia_ctime_nsec = 0}
09:18:46         ret = 0
09:18:46 #1  0x00007f8fdef91355 in synctask_wrap () at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/syncop.c:375
09:18:46         task = 0x7f8fa4782e50
09:18:46 #2  0x00007f8fddae1760 in ?? () from /lib64/libc.so.6
09:18:46 No symbol table info available.
09:18:46 #3  0x0000000000000000 in ?? ()
09:18:46 No symbol table info available.


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

After analysing the crash, it appears the brick process crashed because the worker thread was not cleaned up appropriately in the index xlator.
After updating the index_worker code as well as the notify code in the index xlator, the brick process no longer crashes.
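The failing frame walks args->entries in index_get_gfid_type after the worker's state may already have been torn down, so the list traversal dereferences NULL or freed memory. A minimal, self-contained sketch of the defensive pattern described above; the simplified list types and the function name are illustrative stand-ins, not the actual index.c code:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the gluster list types; the real code uses
 * the libglusterfs list.h macros. Names here are hypothetical. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

struct gf_entry {
    struct list_head list;
    int gfid_type;
};

struct index_args {
    struct list_head *entries; /* NULL once the worker's state is torn down */
};

/* Sketch of the defensive check: bail out before walking the list when
 * the shared state is already gone, instead of dereferencing it. */
static int count_entries(struct index_args *args)
{
    struct list_head *pos;
    int count = 0;

    if (!args || !args->entries)
        return -1; /* state already cleaned up: nothing to walk */

    /* Equivalent of a list_for_each_entry traversal over a circular list */
    for (pos = args->entries->next; pos != args->entries; pos = pos->next)
        count++;

    return count;
}
```

Without the NULL guard, the first `args->entries->next` dereference is exactly the kind of access that faults in frame #0 of the backtrace above.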


Regards
Mohit Agrawal
Comment 5 Atin Mukherjee 2017-06-09 06:03:04 EDT
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/108648/
Comment 8 nchilaka 2017-07-17 06:41:55 EDT
ON_QA validation:
Had 50 1x3 volumes on a 6-node brick-mux setup (all bricks hosted only on the first 3 nodes).
Started and stopped the volumes in a loop 100 times with I/O going on on one volume:

for j in {1..100}; do
    echo "#########################################"
    date
    for i in $(gluster v list); do gluster v stop $i --mode=script; done
    ps -ef | grep glusterfsd
    echo "############ end of loop $j ########################"
    for k in $(gluster v list); do gluster v start $k; done
done



After about 7 iterations, I hit the glusterfsd crash reported in BZ#1468514.

Hence this is blocked until BZ#1468514 is fixed.
Comment 9 nchilaka 2017-07-18 02:21:21 EDT
ON_QA validation on test version 3.8.4-34.

Ran the command from comment #8 while doing I/O on only one volume (refer to https://bugzilla.redhat.com/show_bug.cgi?id=1468514#c21).

Moving to verified as I didn't hit any crash for about 150 loops.
However, I did hit the same crash at around loop #170, which is mentioned in BZ#468514. As I didn't hit any other crash, I am moving this to verified.
Comment 10 nchilaka 2017-07-18 02:21:51 EDT
(In reply to nchilaka from comment #9)
> on_qa validation 3.8.4-34 test version
> 
> Ran the command in comment#8 while I was doing IO for only one volume(refer
> to https://bugzilla.redhat.com/show_bug.cgi?id=1468514#c21)
> 
> Moving to verified as I didn't hit any crash for about 150 loops.
> However, I did hit the same crash at around loop#170  which is mentioned in
> BZ#468514. As i didn't hit anyother crash  i am moving to verified

Sorry it is BZ#1468514
Comment 12 errata-xmlrpc 2017-09-21 00:45:37 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
