Bug 1468514 - Brick Mux Setup: brick processes(glusterfsd) crash after a restart of volume which was preceded with some actions
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: core
Version: 3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.3.0
Assigned To: Mohit Agrawal
QA Contact: nchilaka
Whiteboard: brick-multiplexing
Depends On: 1470533
Blocks: 1417151 1459400
 
Reported: 2017-07-07 06:03 EDT by nchilaka
Modified: 2017-09-21 01:02 EDT
CC List: 6 users

See Also:
Fixed In Version: glusterfs-3.8.4-34
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Cloned To: 1470533
Environment:
Last Closed: 2017-09-21 01:02:13 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
thread bt (62.93 KB, text/plain), 2017-07-07 06:10 EDT, nchilaka
cli output (537.79 KB, text/plain), 2017-07-07 06:56 EDT, nchilaka

Description nchilaka 2017-07-07 06:03:35 EDT
Description of problem:
-----------------------
Two of the three bricks in a 1x3 volume crashed while doing a volume restart.
I was working on validating the ON_QA bug BZ#1450807 - Brick Multiplexing: heal info shows root as heal pending for all associated volumes, after bringing down the brick of one volume.

[root@dhcp35-45 ~]# file /core.19032 
/core.19032: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfsd -s 10.70.35.45 --volfile-id rep3_1.10.70.35.45.rhs-brick1-', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/glusterfsd', platform: 'x86_64'
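
A minimal sketch of how a backtrace like the one below can be pulled from the core (assuming gdb and the matching glusterfs debuginfo packages are installed; illustrative commands, not a transcript of the exact session):

# Hypothetical commands: load the core against the glusterfsd binary and dump
# the crashing thread's stack, then the stacks of all threads.
gdb -batch -ex bt /usr/sbin/glusterfsd /core.19032
gdb -batch -ex "thread apply all bt" /usr/sbin/glusterfsd /core.19032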

(gdb) bt
#0  0x00007f3b2bd231f7 in raise () from /lib64/libc.so.6
#1  0x00007f3b2bd248e8 in abort () from /lib64/libc.so.6
#2  0x00007f3b2bd62f47 in __libc_message () from /lib64/libc.so.6
#3  0x00007f3b2bd6a619 in _int_free () from /lib64/libc.so.6
#4  0x00007f3b2bda880d in closedir () from /lib64/libc.so.6
#5  0x00007f3b2d6e1c15 in sys_closedir (dir=<optimized out>) at syscall.c:113
#6  0x00007f3b1bde9157 in notify (this=<optimized out>, event=<optimized out>, data=<optimized out>)
    at posix.c:6618
#7  0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b1a450, event=event@entry=9, 
    data=data@entry=0x7f39b5b1b0c0) at xlator.c:549
#8  0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b1b0c0, event=9, 
    data=<optimized out>) at defaults.c:3139
#9  0x00007f3b1b5add97 in notify (this=0x7f39b5b1b0c0, event=<optimized out>, data=<optimized out>)
    at trash.c:2313
#10 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b1b0c0, event=event@entry=9, 
    data=data@entry=0x7f39b5b1c080) at xlator.c:549
#11 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b1c080, event=event@entry=9, 
    data=data@entry=0x7f39b5b1e410) at defaults.c:3139
#12 0x00007f3b1b38f664 in notify (this=0x7f39b5b1c080, event=9, data=0x7f39b5b1e410)
    at changetimerecorder.c:2184
#13 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b1c080, event=event@entry=9, 
    data=data@entry=0x7f39b5b1e410) at xlator.c:549
#14 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b1e410, event=9, 
    data=data@entry=0x7f39b5b1f760) at defaults.c:3139
#15 0x00007f3b1aca40e3 in notify (this=0x7f39b5b1e410, event=<optimized out>, data=0x7f39b5b1f760)
    at changelog.c:2312
#16 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b1e410, event=event@entry=9, 
    data=data@entry=0x7f39b5b1f760) at xlator.c:549
#17 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b1f760, event=9, 
    data=<optimized out>) at defaults.c:3139
#18 0x00007f3b1a870181 in notify (this=0x7f39b5b1f760, event=<optimized out>, data=<optimized out>)
    at bit-rot-stub.c:243
#19 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b1f760, event=event@entry=9, 
    data=data@entry=0x7f39b5b20810) at xlator.c:549
#20 0x00007f3b2d751292 in default_notify (this=0x7f39b5b20810, event=9, data=<optimized out>)
    at defaults.c:3139
#21 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b20810, event=event@entry=9, 
    data=data@entry=0x7f39b5b21650) at xlator.c:549
#22 0x00007f3b2d751292 in default_notify (this=0x7f39b5b21650, event=9, data=<optimized out>)
    at defaults.c:3139
#23 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b21650, event=event@entry=9, 
    data=data@entry=0x7f39b5b22410) at xlator.c:549
#24 0x00007f3b2d751292 in default_notify (this=0x7f39b5b22410, event=9, data=<optimized out>)
    at defaults.c:3139
#25 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b22410, event=event@entry=9, 
    data=data@entry=0x7f39b5b23510) at xlator.c:549
#26 0x00007f3b2d751292 in default_notify (this=0x7f39b5b23510, event=9, data=<optimized out>)
    at defaults.c:3139
#27 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b23510, event=event@entry=9, 
    data=data@entry=0x7f39b5b24550) at xlator.c:549
#28 0x00007f3b2d751292 in default_notify (this=0x7f39b5b24550, event=9, data=<optimized out>)
    at defaults.c:3139
#29 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b24550, event=event@entry=9, 
    data=data@entry=0x7f39b5b25580) at xlator.c:549
#30 0x00007f3b2d751292 in default_notify (this=0x7f39b5b25580, event=9, data=<optimized out>)
    at defaults.c:3139
#31 0x00007f3b19c00a32 in notify (this=0x7f39b5b25580, event=<optimized out>, data=0x7f39b5b265b0)
    at upcall.c:2511
#32 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b25580, event=event@entry=9, 
    data=data@entry=0x7f39b5b265b0) at xlator.c:549
#33 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b265b0, event=9, 
    data=<optimized out>) at defaults.c:3139
#34 0x00007f3b199e89e3 in notify (this=0x7f39b5b265b0, event=<optimized out>, data=<optimized out>)
    at io-threads.c:1053
#35 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b265b0, event=event@entry=9, 
    data=data@entry=0x7f39b5b274a0) at xlator.c:549
#36 0x00007f3b2d751292 in default_notify (this=0x7f39b5b274a0, event=9, data=<optimized out>)
    at defaults.c:3139
#37 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b274a0, event=event@entry=9, 
    data=data@entry=0x7f39b5b29020) at xlator.c:549
#38 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b29020, event=event@entry=9, 
    data=data@entry=0x7f39b5b2a230) at defaults.c:3139
#39 0x00007f3b195ba4fb in notify (this=0x7f39b5b29020, event=9, data=0x7f39b5b2a230) at barrier.c:534
#40 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b29020, event=event@entry=9, 
    data=data@entry=0x7f39b5b2a230) at xlator.c:549
#41 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b2a230, event=9, 
    data=<optimized out>) at defaults.c:3139
#42 0x00007f3b193aed98 in notify (this=0x7f39b5b2a230, event=<optimized out>, data=<optimized out>)
    at index.c:2530
#43 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b2a230, event=event@entry=9, 
    data=data@entry=0x7f39b5b2b7c0) at xlator.c:549
#44 0x00007f3b2d751292 in default_notify (this=0x7f39b5b2b7c0, event=9, data=<optimized out>)
    at defaults.c:3139
#45 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b2b7c0, event=event@entry=9, 
    data=data@entry=0x7f39b5b2cc30) at xlator.c:549
#46 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b2cc30, event=9, 
    data=data@entry=0x7f39b5b2e270) at defaults.c:3139
#47 0x00007f3b18f713da in notify (this=0x7f39b5b2cc30, event=<optimized out>, data=0x7f39b5b2e270)
    at io-stats.c:4150
#48 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b2cc30, event=event@entry=9, 
    data=data@entry=0x7f39b5b2e270) at xlator.c:549
#49 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b2e270, event=event@entry=9, 
    data=data@entry=0x7f39b5b2e270) at defaults.c:3139
#50 0x00007f3b18afed3f in notify (this=<optimized out>, event=<optimized out>, data=0x7f39b5b2e270)
    at server.c:1584
#51 0x0000555897aa8057 in glusterfs_handle_terminate (req=0x7f3af525d460) at glusterfsd-mgmt.c:263
#52 0x00007f3b2d6f0832 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#53 0x00007f3b2bd34d40 in ?? () from /lib64/libc.so.6
#54 0x0000000000000000 in ?? ()

Version-Release number of selected component (if applicable):
======
3.8.4-32


Steps to Reproduce:
Note: The main intention was to validate BZ#1450807 - Brick Multiplexing: heal info shows root as heal pending for all associated volumes.

I wanted to try a few scenarios, as I had seen the issue once more (as updated in that BZ previously).

1. Enabled brick multiplexing on a 6-node setup.
2. Created about 40 1x3 volumes using bricks on n1, n2 and n3; the volume names are rep3_1 ... rep3_40.
3. Mounted rep3_9.
4. Started a Linux untar on the mount.
5. Killed brick b1 while the untar was in progress.
   => This brings down glusterfsd and hence impacts all volumes.
6. Did a service glusterd restart to bring the brick back online.
   => This worked well.
   => Files got healed; all good so far.
7. Now, with the untar still in progress, did a umount -l of the LV where b1 of rep3_9 was residing (hence not impacting the other bricks).
   => Observations:
   ==> Heal info does not show b1 as offline.
   ==> The parent directory that was in play at the time (as part of the untar) showed up in the heal info entries for b1 (meaning it has pending heals marked).
8. Stopped the Linux untar.
9. Remounted the LV using mount -a.
10. Triggered a manual heal using gluster v heal rep3_9.
11. Heal completed.
12. Now did a restart of rep3_9 (intention: to check the behaviour, since as observed in step 7, heal info did not show the brick as offline, and neither did vol info at the time). A rough shell sketch of these steps follows below.
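
A rough shell sketch of the above sequence (hypothetical brick paths, mount points, tarball location and PID placeholder; illustrative commands only, not a transcript of the exact session):

# 1. Enable brick multiplexing cluster-wide.
gluster volume set all cluster.brick-multiplex on

# 2. Create and start ~40 1x3 volumes on n1, n2, n3 (hypothetical brick paths).
for i in $(seq 1 40); do
    gluster volume create rep3_$i replica 3 \
        n1:/rhs/brick$i/rep3_$i n2:/rhs/brick$i/rep3_$i n3:/rhs/brick$i/rep3_$i
    gluster volume start rep3_$i
done

# 3-4. Mount rep3_9 and start a Linux untar on it in the background.
mount -t glusterfs n1:/rep3_9 /mnt/rep3_9
tar -xf /root/linux.tar.xz -C /mnt/rep3_9 &

# 5-6. Kill brick b1 (with brick mux this takes down the shared glusterfsd),
#      then restart glusterd to bring the bricks back online.
gluster volume status rep3_9        # note the brick PID, then:
kill -9 <brick-pid>                 # placeholder for the PID noted above
service glusterd restart

# 7-9. Lazy-unmount the LV backing b1 of rep3_9 while the untar runs,
#      then stop the untar and remount via fstab.
umount -l /rhs/brick9
kill %1
mount -a

# 10-12. Heal, then restart the volume (the point at which the crash was hit).
gluster volume heal rep3_9
gluster --mode=script volume stop rep3_9
gluster volume start rep3_9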
Comment 2 nchilaka 2017-07-07 06:05:17 EDT
Note: b1 (where the umount was done and the LV later remounted) and b3 could not come up because glusterfsd crashed; b2 was up.



[root@dhcp35-45 ~]# gluster v info rep3_9
 
Volume Name: rep3_9
Type: Replicate
Volume ID: 0bb1db31-b6cb-4b51-ace6-c9de4f16adc3
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.35.45:/rhs/brick9/rep3_9
Brick2: 10.70.35.130:/rhs/brick9/rep3_9
Brick3: 10.70.35.122:/rhs/brick9/rep3_9
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
cluster.brick-multiplex: on
[root@dhcp35-45 ~]# gluster v status rep3_9
Status of volume: rep3_9
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.45:/rhs/brick9/rep3_9        N/A       N/A        N       N/A  
Brick 10.70.35.130:/rhs/brick9/rep3_9       49152     0          Y       9787 
Brick 10.70.35.122:/rhs/brick9/rep3_9       N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       29373
Self-heal Daemon on 10.70.35.138            N/A       N/A        Y       24803
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       11239
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       1985 
Self-heal Daemon on 10.70.35.112            N/A       N/A        Y       19752
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       30171
 
Task Status of Volume rep3_9
------------------------------------------------------------------------------
There are no active volume tasks
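
For context, with cluster.brick-multiplex on, the bricks of all these volumes on a node attach to a single glusterfsd process, which is why one crash takes several bricks offline at once. A quick way to see that sharing from the CLI (a hedged sketch, not part of the original report):

# Brick lines across all volumes: healthy multiplexed bricks on the same node
# report the same PID, while crashed ones show N/A.
gluster volume status all | grep '^Brick'
# Number of glusterfsd processes on this node (with brick mux, typically one).
pgrep -c glusterfsd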
Comment 3 nchilaka 2017-07-07 06:10 EDT
Created attachment 1295251 [details]
thread bt
Comment 4 nchilaka 2017-07-07 06:27:41 EDT
logs and sosreports @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1468514

Cores are available in the dedicated server directories.
Comment 6 nchilaka 2017-07-07 06:56 EDT
Created attachment 1295255 [details]
cli output
Comment 7 Mohit Agrawal 2017-07-10 06:17:46 EDT
Hi Nag,

A test build is available at the link below:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13623735

Please try it and share the result if the issue still reproduces after applying the patch.


Regards
Mohit Agrawal
Comment 16 Atin Mukherjee 2017-07-13 03:06:43 EDT
upstream patch : https://review.gluster.org/17767
Comment 17 Mohit Agrawal 2017-07-14 00:38:05 EDT
Below is the downstream link of the patch:

https://code.engineering.redhat.com/gerrit/112271
Comment 18 nchilaka 2017-07-17 06:42:59 EDT
Blocks ON_QA validation of BZ#1468514.
refer https://bugzilla.redhat.com/show_bug.cgi?id=1459400#c8
Comment 19 nchilaka 2017-07-17 06:43:40 EDT
(In reply to nchilaka from comment #18)
> Blocks ON_QA validation of BZ#1468514.
> refer https://bugzilla.redhat.com/show_bug.cgi?id=1459400#c8

sorry, I meant it blocks BZ#1459400
refer https://bugzilla.redhat.com/show_bug.cgi?id=1459400#c8
Comment 22 Mohit Agrawal 2017-07-18 02:33:26 EDT
Hi Nag,

  Thanks for sharing the core dump. This core dump is not similar to the
  previous one; this time the brick process is crashing in the changetimerecorder
  xlator. Please file a separate bugzilla to fix that crash.

  Kindly also move this bugzilla to the verified state.

Regards
Mohit Agrawal
Comment 23 nchilaka 2017-07-18 02:49:25 EDT
(In reply to Mohit Agrawal from comment #22)
> Hi Nag,
> 
>   Thanks for sharing the core dump. This core dump is not similar to the
>   previous one; this time the brick process is crashing in the
>   changetimerecorder xlator. Please file a separate bugzilla to fix that crash.
> 
>   Kindly also move this bugzilla to the verified state.
> 
> Regards
> Mohit Agrawal


Raised a new bug: BZ#1472129.
Comment 24 nchilaka 2017-07-18 04:41:40 EDT
ON_QA validation:
Re-ran the case mentioned in the description ==> crash not hit.
Also re-ran volume restarts about 150 times, but did not hit this crash.
However, hit another crash (all details are in the previous comment, i.e. comment #23); hence moving to verified.
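
A minimal sketch of the kind of restart loop used for this check (assuming the rep3_9 volume from the description; the exact commands are not recorded in this comment):

# Restart the volume ~150 times, stopping early if a new core file shows up.
for i in $(seq 1 150); do
    gluster --mode=script volume stop rep3_9
    gluster volume start rep3_9
    sleep 5
    ls /core.* 2>/dev/null && { echo "core found after iteration $i"; break; }
done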
Comment 25 nchilaka 2017-07-18 04:42:04 EDT
(In reply to nchilaka from comment #24)
> ON_QA validation:
> Re-ran the case mentioned in the description ==> crash not hit.
> Also re-ran volume restarts about 150 times, but did not hit this crash.
> However, hit another crash (all details are in the previous comment, i.e.
> comment #23); hence moving to verified.

Test build: 3.8.4-34
Comment 27 errata-xmlrpc 2017-09-21 01:02:13 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
