Bug 1468514 - Brick Mux Setup: brick processes(glusterfsd) crash after a restart of volume which was preceded with some actions
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: core
Version: 3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.3.0
Assigned To: Mohit Agrawal
QA Contact: nchilaka
Whiteboard: brick-multiplexing
Depends On: 1470533
Blocks: 1417151 1459400
 
Reported: 2017-07-07 06:03 EDT by nchilaka
Modified: 2017-09-21 01:02 EDT
CC List: 6 users

See Also:
Fixed In Version: glusterfs-3.8.4-34
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Cloned To: 1470533
Environment:
Last Closed: 2017-09-21 01:02:13 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
thread bt (62.93 KB, text/plain), 2017-07-07 06:10 EDT, nchilaka
cli output (537.79 KB, text/plain), 2017-07-07 06:56 EDT, nchilaka

Description nchilaka 2017-07-07 06:03:35 EDT
Description of problem:
-----------------------
Two of the three bricks in a 1x3 volume crashed while doing a volume restart.
I was working on validating the ON_QA bug BZ#1450807 - Brick Multiplexing: heal info shows root as heal pending for all associated volumes, after bringing down the brick of one volume.

[root@dhcp35-45 ~]# file /core.19032 
/core.19032: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfsd -s 10.70.35.45 --volfile-id rep3_1.10.70.35.45.rhs-brick1-', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/glusterfsd', platform: 'x86_64'
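
A minimal sketch of how a backtrace like the one below can be pulled from the core (assuming gdb and the matching glusterfs debuginfo packages are installed; illustrative commands, not a transcript of the exact session):

# Hypothetical commands: load the core against the glusterfsd binary and dump
# the crashing thread's stack, then the stacks of all threads.
gdb -batch -ex bt /usr/sbin/glusterfsd /core.19032
gdb -batch -ex "thread apply all bt" /usr/sbin/glusterfsd /core.19032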

(gdb) bt
#0  0x00007f3b2bd231f7 in raise () from /lib64/libc.so.6
#1  0x00007f3b2bd248e8 in abort () from /lib64/libc.so.6
#2  0x00007f3b2bd62f47 in __libc_message () from /lib64/libc.so.6
#3  0x00007f3b2bd6a619 in _int_free () from /lib64/libc.so.6
#4  0x00007f3b2bda880d in closedir () from /lib64/libc.so.6
#5  0x00007f3b2d6e1c15 in sys_closedir (dir=<optimized out>) at syscall.c:113
#6  0x00007f3b1bde9157 in notify (this=<optimized out>, event=<optimized out>, data=<optimized out>)
    at posix.c:6618
#7  0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b1a450, event=event@entry=9, 
    data=data@entry=0x7f39b5b1b0c0) at xlator.c:549
#8  0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b1b0c0, event=9, 
    data=<optimized out>) at defaults.c:3139
#9  0x00007f3b1b5add97 in notify (this=0x7f39b5b1b0c0, event=<optimized out>, data=<optimized out>)
    at trash.c:2313
#10 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b1b0c0, event=event@entry=9, 
    data=data@entry=0x7f39b5b1c080) at xlator.c:549
#11 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b1c080, event=event@entry=9, 
    data=data@entry=0x7f39b5b1e410) at defaults.c:3139
#12 0x00007f3b1b38f664 in notify (this=0x7f39b5b1c080, event=9, data=0x7f39b5b1e410)
    at changetimerecorder.c:2184
#13 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b1c080, event=event@entry=9, 
    data=data@entry=0x7f39b5b1e410) at xlator.c:549
#14 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b1e410, event=9, 
    data=data@entry=0x7f39b5b1f760) at defaults.c:3139
#15 0x00007f3b1aca40e3 in notify (this=0x7f39b5b1e410, event=<optimized out>, data=0x7f39b5b1f760)
    at changelog.c:2312
#16 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b1e410, event=event@entry=9, 
    data=data@entry=0x7f39b5b1f760) at xlator.c:549
#17 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b1f760, event=9, 
    data=<optimized out>) at defaults.c:3139
#18 0x00007f3b1a870181 in notify (this=0x7f39b5b1f760, event=<optimized out>, data=<optimized out>)
    at bit-rot-stub.c:243
#19 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b1f760, event=event@entry=9, 
    data=data@entry=0x7f39b5b20810) at xlator.c:549
#20 0x00007f3b2d751292 in default_notify (this=0x7f39b5b20810, event=9, data=<optimized out>)
    at defaults.c:3139
#21 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b20810, event=event@entry=9, 
    data=data@entry=0x7f39b5b21650) at xlator.c:549
#22 0x00007f3b2d751292 in default_notify (this=0x7f39b5b21650, event=9, data=<optimized out>)
    at defaults.c:3139
#23 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b21650, event=event@entry=9, 
    data=data@entry=0x7f39b5b22410) at xlator.c:549
#24 0x00007f3b2d751292 in default_notify (this=0x7f39b5b22410, event=9, data=<optimized out>)
    at defaults.c:3139
#25 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b22410, event=event@entry=9, 
    data=data@entry=0x7f39b5b23510) at xlator.c:549
#26 0x00007f3b2d751292 in default_notify (this=0x7f39b5b23510, event=9, data=<optimized out>)
    at defaults.c:3139
#27 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b23510, event=event@entry=9, 
    data=data@entry=0x7f39b5b24550) at xlator.c:549
#28 0x00007f3b2d751292 in default_notify (this=0x7f39b5b24550, event=9, data=<optimized out>)
    at defaults.c:3139
#29 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b24550, event=event@entry=9, 
    data=data@entry=0x7f39b5b25580) at xlator.c:549
#30 0x00007f3b2d751292 in default_notify (this=0x7f39b5b25580, event=9, data=<optimized out>)
    at defaults.c:3139
#31 0x00007f3b19c00a32 in notify (this=0x7f39b5b25580, event=<optimized out>, data=0x7f39b5b265b0)
    at upcall.c:2511
#32 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b25580, event=event@entry=9, 
    data=data@entry=0x7f39b5b265b0) at xlator.c:549
#33 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b265b0, event=9, 
    data=<optimized out>) at defaults.c:3139
#34 0x00007f3b199e89e3 in notify (this=0x7f39b5b265b0, event=<optimized out>, data=<optimized out>)
    at io-threads.c:1053
#35 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b265b0, event=event@entry=9, 
    data=data@entry=0x7f39b5b274a0) at xlator.c:549
#36 0x00007f3b2d751292 in default_notify (this=0x7f39b5b274a0, event=9, data=<optimized out>)
    at defaults.c:3139
#37 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b274a0, event=event@entry=9, 
    data=data@entry=0x7f39b5b29020) at xlator.c:549
#38 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b29020, event=event@entry=9, 
    data=data@entry=0x7f39b5b2a230) at defaults.c:3139
#39 0x00007f3b195ba4fb in notify (this=0x7f39b5b29020, event=9, data=0x7f39b5b2a230) at barrier.c:534
#40 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b29020, event=event@entry=9, 
    data=data@entry=0x7f39b5b2a230) at xlator.c:549
#41 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b2a230, event=9, 
    data=<optimized out>) at defaults.c:3139
#42 0x00007f3b193aed98 in notify (this=0x7f39b5b2a230, event=<optimized out>, data=<optimized out>)
    at index.c:2530
#43 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b2a230, event=event@entry=9, 
    data=data@entry=0x7f39b5b2b7c0) at xlator.c:549
#44 0x00007f3b2d751292 in default_notify (this=0x7f39b5b2b7c0, event=9, data=<optimized out>)
    at defaults.c:3139
#45 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b2b7c0, event=event@entry=9, 
    data=data@entry=0x7f39b5b2cc30) at xlator.c:549
#46 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b2cc30, event=9, 
    data=data@entry=0x7f39b5b2e270) at defaults.c:3139
#47 0x00007f3b18f713da in notify (this=0x7f39b5b2cc30, event=<optimized out>, data=0x7f39b5b2e270)
    at io-stats.c:4150
#48 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b2cc30, event=event@entry=9, 
    data=data@entry=0x7f39b5b2e270) at xlator.c:549
#49 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b2e270, event=event@entry=9, 
    data=data@entry=0x7f39b5b2e270) at defaults.c:3139
#50 0x00007f3b18afed3f in notify (this=<optimized out>, event=<optimized out>, data=0x7f39b5b2e270)
    at server.c:1584
#51 0x0000555897aa8057 in glusterfs_handle_terminate (req=0x7f3af525d460) at glusterfsd-mgmt.c:263
#52 0x00007f3b2d6f0832 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#53 0x00007f3b2bd34d40 in ?? () from /lib64/libc.so.6
#54 0x0000000000000000 in ?? ()

Version-Release number of selected component (if applicable):
======
3.8.4-32


Steps to Reproduce:
Note: The main intention was to validate BZ#1450807 - Brick Multiplexing: heal info shows root as heal pending for all associated volumes.

I wanted to try a few scenarios, as I had seen the issue once more (as updated in that BZ previously).

1. Enabled brick multiplexing on a 6-node setup.
2. Created about 40 1x3 volumes using bricks on n1, n2 and n3; the volume names are rep3_1 ... rep3_40.
3. Mounted rep3_9.
4. Started a Linux untar on the mount.
5. Killed brick b1 while the untar was in progress.
   => This brings down glusterfsd and hence impacts all volumes.
6. Did a service glusterd restart to bring the brick back online.
   => This worked well.
   => Files got healed; all good so far.
7. Now, with the untar still in progress, did a umount -l of the LV where b1 of rep3_9 was residing (hence not impacting the other bricks).
   => Observations:
   ==> Heal info does not show b1 as offline.
   ==> The parent directory that was in play at the time (as part of the untar) showed up in the heal info entries for b1 (meaning it has pending heals marked).
8. Stopped the Linux untar.
9. Remounted the LV using mount -a.
10. Triggered a manual heal using gluster v heal rep3_9.
11. Heal completed.
12. Now did a restart of rep3_9 (intention: to check the behaviour, since as observed in step 7, heal info did not show the brick as offline, and neither did vol info at the time). A rough shell sketch of these steps follows below.
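
A rough shell sketch of the above sequence (hypothetical brick paths, mount points, tarball location and PID placeholder; illustrative commands only, not a transcript of the exact session):

# 1. Enable brick multiplexing cluster-wide.
gluster volume set all cluster.brick-multiplex on

# 2. Create and start ~40 1x3 volumes on n1, n2, n3 (hypothetical brick paths).
for i in $(seq 1 40); do
    gluster volume create rep3_$i replica 3 \
        n1:/rhs/brick$i/rep3_$i n2:/rhs/brick$i/rep3_$i n3:/rhs/brick$i/rep3_$i
    gluster volume start rep3_$i
done

# 3-4. Mount rep3_9 and start a Linux untar on it in the background.
mount -t glusterfs n1:/rep3_9 /mnt/rep3_9
tar -xf /root/linux.tar.xz -C /mnt/rep3_9 &

# 5-6. Kill brick b1 (with brick mux this takes down the shared glusterfsd),
#      then restart glusterd to bring the bricks back online.
gluster volume status rep3_9        # note the brick PID, then:
kill -9 <brick-pid>                 # placeholder for the PID noted above
service glusterd restart

# 7-9. Lazy-unmount the LV backing b1 of rep3_9 while the untar runs,
#      then stop the untar and remount via fstab.
umount -l /rhs/brick9
kill %1
mount -a

# 10-12. Heal, then restart the volume (the point at which the crash was hit).
gluster volume heal rep3_9
gluster --mode=script volume stop rep3_9
gluster volume start rep3_9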
Comment 2 nchilaka 2017-07-07 06:05:17 EDT
Note: b1 (where the umount was done and the LV later remounted) and b3 could not come up because glusterfsd crashed; b2 was up.



[root@dhcp35-45 ~]# gluster v info rep3_9
 
Volume Name: rep3_9
Type: Replicate
Volume ID: 0bb1db31-b6cb-4b51-ace6-c9de4f16adc3
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.35.45:/rhs/brick9/rep3_9
Brick2: 10.70.35.130:/rhs/brick9/rep3_9
Brick3: 10.70.35.122:/rhs/brick9/rep3_9
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
cluster.brick-multiplex: on
[root@dhcp35-45 ~]# gluster v status rep3_9
Status of volume: rep3_9
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.45:/rhs/brick9/rep3_9        N/A       N/A        N       N/A  
Brick 10.70.35.130:/rhs/brick9/rep3_9       49152     0          Y       9787 
Brick 10.70.35.122:/rhs/brick9/rep3_9       N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       29373
Self-heal Daemon on 10.70.35.138            N/A       N/A        Y       24803
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       11239
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       1985 
Self-heal Daemon on 10.70.35.112            N/A       N/A        Y       19752
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       30171
 
Task Status of Volume rep3_9
------------------------------------------------------------------------------
There are no active volume tasks
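
For context, with cluster.brick-multiplex on, the bricks of all these volumes on a node attach to a single glusterfsd process, which is why one crash takes several bricks offline at once. A quick way to see that sharing from the CLI (a hedged sketch, not part of the original report):

# Brick lines across all volumes: healthy multiplexed bricks on the same node
# report the same PID, while crashed ones show N/A.
gluster volume status all | grep '^Brick'
# Number of glusterfsd processes on this node (with brick mux, typically one).
pgrep -c glusterfsd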
Comment 3 nchilaka 2017-07-07 06:10 EDT
Created attachment 1295251 [details]
thread bt
Comment 4 nchilaka 2017-07-07 06:27:41 EDT
logs and sosreports @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1468514

Cores are available in the dedicated server directories.
Comment 6 nchilaka 2017-07-07 06:56 EDT
Created attachment 1295255 [details]
cli output
Comment 7 Mohit Agrawal 2017-07-10 06:17:46 EDT
Hi Nag,

A test build is available at the link below:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13623735

Please try it and share the result if the issue still reproduces after applying the patch.


Regards
Mohit Agrawal
Comment 16 Atin Mukherjee 2017-07-13 03:06:43 EDT
upstream patch : https://review.gluster.org/17767
Comment 17 Mohit Agrawal 2017-07-14 00:38:05 EDT
Below is the downstream link of the patch:

https://code.engineering.redhat.com/gerrit/112271
Comment 18 nchilaka 2017-07-17 06:42:59 EDT
Blocks ON_QA validation of BZ#1468514.
refer https://bugzilla.redhat.com/show_bug.cgi?id=1459400#c8
Comment 19 nchilaka 2017-07-17 06:43:40 EDT
(In reply to nchilaka from comment #18)
> Blocks ON_QA validation of BZ#1468514.
> refer https://bugzilla.redhat.com/show_bug.cgi?id=1459400#c8

sorry, I meant it blocks BZ#1459400
refer https://bugzilla.redhat.com/show_bug.cgi?id=1459400#c8
Comment 22 Mohit Agrawal 2017-07-18 02:33:26 EDT
Hi Nag,

  Thanks for sharing the core dump. This core dump is not similar to the
  previous one; this time the brick process is crashing in the changetimerecorder
  xlator. Please file a separate bugzilla to fix that crash.

  Kindly also move this bugzilla to the verified state.

Regards
Mohit Agrawal
Comment 23 nchilaka 2017-07-18 02:49:25 EDT
(In reply to Mohit Agrawal from comment #22)
> Hi Nag,
> 
>   Thanks for sharing the core dump. This core dump is not similar to the
>   previous one; this time the brick process is crashing in the
>   changetimerecorder xlator. Please file a separate bugzilla to fix that crash.
> 
>   Kindly also move this bugzilla to the verified state.
> 
> Regards
> Mohit Agrawal


Raised a new bug: BZ#1472129.
Comment 24 nchilaka 2017-07-18 04:41:40 EDT
ON_QA validation:
Re-ran the case mentioned in the description ==> crash not hit.
Also re-ran volume restarts about 150 times, but did not hit this crash.
However, hit another crash (all details are in the previous comment, i.e. comment #23); hence moving to verified.
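
A minimal sketch of the kind of restart loop used for this check (assuming the rep3_9 volume from the description; the exact commands are not recorded in this comment):

# Restart the volume ~150 times, stopping early if a new core file shows up.
for i in $(seq 1 150); do
    gluster --mode=script volume stop rep3_9
    gluster volume start rep3_9
    sleep 5
    ls /core.* 2>/dev/null && { echo "core found after iteration $i"; break; }
done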
Comment 25 nchilaka 2017-07-18 04:42:04 EDT
(In reply to nchilaka from comment #24)
> ON_QA validation:
> Re-ran the case mentioned in the description ==> crash not hit.
> Also re-ran volume restarts about 150 times, but did not hit this crash.
> However, hit another crash (all details are in the previous comment, i.e.
> comment #23); hence moving to verified.

Test build: 3.8.4-34
Comment 27 errata-xmlrpc 2017-09-21 01:02:13 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
