Description of problem:
-----------------------
2 of the 3 bricks in a 1x3 volume crashed while doing a volume restart.
Was working on validating on_qa BZ#1450807 - Brick Multiplexing: heal info shows root as heal pending for all associated volumes, after bringing down a brick of one volume.

[root@dhcp35-45 ~]# file /core.19032
/core.19032: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfsd -s 10.70.35.45 --volfile-id rep3_1.10.70.35.45.rhs-brick1-', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/glusterfsd', platform: 'x86_64'

(gdb) bt
#0  0x00007f3b2bd231f7 in raise () from /lib64/libc.so.6
#1  0x00007f3b2bd248e8 in abort () from /lib64/libc.so.6
#2  0x00007f3b2bd62f47 in __libc_message () from /lib64/libc.so.6
#3  0x00007f3b2bd6a619 in _int_free () from /lib64/libc.so.6
#4  0x00007f3b2bda880d in closedir () from /lib64/libc.so.6
#5  0x00007f3b2d6e1c15 in sys_closedir (dir=<optimized out>) at syscall.c:113
#6  0x00007f3b1bde9157 in notify (this=<optimized out>, event=<optimized out>, data=<optimized out>) at posix.c:6618
#7  0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b1a450, event=event@entry=9, data=data@entry=0x7f39b5b1b0c0) at xlator.c:549
#8  0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b1b0c0, event=9, data=<optimized out>) at defaults.c:3139
#9  0x00007f3b1b5add97 in notify (this=0x7f39b5b1b0c0, event=<optimized out>, data=<optimized out>) at trash.c:2313
#10 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b1b0c0, event=event@entry=9, data=data@entry=0x7f39b5b1c080) at xlator.c:549
#11 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b1c080, event=event@entry=9, data=data@entry=0x7f39b5b1e410) at defaults.c:3139
#12 0x00007f3b1b38f664 in notify (this=0x7f39b5b1c080, event=9, data=0x7f39b5b1e410) at changetimerecorder.c:2184
#13 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b1c080, event=event@entry=9, data=data@entry=0x7f39b5b1e410) at xlator.c:549
#14 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b1e410, event=9, data=data@entry=0x7f39b5b1f760) at defaults.c:3139
#15 0x00007f3b1aca40e3 in notify (this=0x7f39b5b1e410, event=<optimized out>, data=0x7f39b5b1f760) at changelog.c:2312
#16 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b1e410, event=event@entry=9, data=data@entry=0x7f39b5b1f760) at xlator.c:549
#17 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b1f760, event=9, data=<optimized out>) at defaults.c:3139
#18 0x00007f3b1a870181 in notify (this=0x7f39b5b1f760, event=<optimized out>, data=<optimized out>) at bit-rot-stub.c:243
#19 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b1f760, event=event@entry=9, data=data@entry=0x7f39b5b20810) at xlator.c:549
#20 0x00007f3b2d751292 in default_notify (this=0x7f39b5b20810, event=9, data=<optimized out>) at defaults.c:3139
#21 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b20810, event=event@entry=9, data=data@entry=0x7f39b5b21650) at xlator.c:549
#22 0x00007f3b2d751292 in default_notify (this=0x7f39b5b21650, event=9, data=<optimized out>) at defaults.c:3139
#23 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b21650, event=event@entry=9, data=data@entry=0x7f39b5b22410) at xlator.c:549
#24 0x00007f3b2d751292 in default_notify (this=0x7f39b5b22410, event=9, data=<optimized out>) at defaults.c:3139
#25 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b22410, event=event@entry=9, data=data@entry=0x7f39b5b23510) at xlator.c:549
#26 0x00007f3b2d751292 in default_notify (this=0x7f39b5b23510, event=9, data=<optimized out>) at defaults.c:3139
#27 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b23510, event=event@entry=9, data=data@entry=0x7f39b5b24550) at xlator.c:549
#28 0x00007f3b2d751292 in default_notify (this=0x7f39b5b24550, event=9, data=<optimized out>) at defaults.c:3139
#29 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b24550, event=event@entry=9, data=data@entry=0x7f39b5b25580) at xlator.c:549
#30 0x00007f3b2d751292 in default_notify (this=0x7f39b5b25580, event=9, data=<optimized out>) at defaults.c:3139
#31 0x00007f3b19c00a32 in notify (this=0x7f39b5b25580, event=<optimized out>, data=0x7f39b5b265b0) at upcall.c:2511
#32 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b25580, event=event@entry=9, data=data@entry=0x7f39b5b265b0) at xlator.c:549
#33 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b265b0, event=9, data=<optimized out>) at defaults.c:3139
#34 0x00007f3b199e89e3 in notify (this=0x7f39b5b265b0, event=<optimized out>, data=<optimized out>) at io-threads.c:1053
#35 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b265b0, event=event@entry=9, data=data@entry=0x7f39b5b274a0) at xlator.c:549
#36 0x00007f3b2d751292 in default_notify (this=0x7f39b5b274a0, event=9, data=<optimized out>) at defaults.c:3139
#37 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b274a0, event=event@entry=9, data=data@entry=0x7f39b5b29020) at xlator.c:549
#38 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b29020, event=event@entry=9, data=data@entry=0x7f39b5b2a230) at defaults.c:3139
#39 0x00007f3b195ba4fb in notify (this=0x7f39b5b29020, event=9, data=0x7f39b5b2a230) at barrier.c:534
#40 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b29020, event=event@entry=9, data=data@entry=0x7f39b5b2a230) at xlator.c:549
#41 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b2a230, event=9, data=<optimized out>) at defaults.c:3139
#42 0x00007f3b193aed98 in notify (this=0x7f39b5b2a230, event=<optimized out>, data=<optimized out>) at index.c:2530
#43 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b2a230, event=event@entry=9, data=data@entry=0x7f39b5b2b7c0) at xlator.c:549
#44 0x00007f3b2d751292 in default_notify (this=0x7f39b5b2b7c0, event=9, data=<optimized out>) at defaults.c:3139
#45 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b2b7c0, event=event@entry=9, data=data@entry=0x7f39b5b2cc30) at xlator.c:549
#46 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b2cc30, event=9, data=data@entry=0x7f39b5b2e270) at defaults.c:3139
#47 0x00007f3b18f713da in notify (this=0x7f39b5b2cc30, event=<optimized out>, data=0x7f39b5b2e270) at io-stats.c:4150
#48 0x00007f3b2d6b7b92 in xlator_notify (xl=0x7f39b5b2cc30, event=event@entry=9, data=data@entry=0x7f39b5b2e270) at xlator.c:549
#49 0x00007f3b2d751292 in default_notify (this=this@entry=0x7f39b5b2e270, event=event@entry=9, data=data@entry=0x7f39b5b2e270) at defaults.c:3139
#50 0x00007f3b18afed3f in notify (this=<optimized out>, event=<optimized out>, data=0x7f39b5b2e270) at server.c:1584
#51 0x0000555897aa8057 in glusterfs_handle_terminate (req=0x7f3af525d460) at glusterfsd-mgmt.c:263
#52 0x00007f3b2d6f0832 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#53 0x00007f3b2bd34d40 in ?? () from /lib64/libc.so.6
#54 0x0000000000000000 in ?? ()
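For reference, a hedged sketch of how a backtrace like the one above is pulled from the core; the binary and core paths are the ones from this report, and the gdb invocation itself is standard:

file /core.19032    # confirm which binary produced the core
# print the crashing thread's backtrace, then per-thread backtraces
# (the latter is what the "thread bt" attachment below contains)
gdb /usr/sbin/glusterfsd /core.19032 -ex 'bt' -ex 'thread apply all bt' -ex 'quit'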
Version-Release number of selected component (if applicable):
======
3.8.4-32

Steps to Reproduce:
Note: The main intention was to validate BZ#1450807 - Brick Multiplexing: heal info shows root as heal pending for all associated volumes. I wanted to try a few scenarios, as I had seen once that the issue was still present (as updated in the BZ previously).
1. Brick mux enabled on a 6-node setup.
2. Created about 40 1x3 volumes using bricks on n1, n2, n3; say the vol names are rep3_1 ... rep3_40.
3. Mounted rep3_9.
4. Started a linux untar.
5. Killed brick b1 while the untar was going on.
   => This brought down glusterfsd and hence impacted all vols.
6. Did a service glusterd restart to bring the brick online.
   => This worked well.
   => Files got healed; all good so far.
7. Now, with the untar still in progress, did a umount -l of the LV where b1 of rep3_9 was residing (hence not impacting other bricks).
   => Observations:
   ==> heal info doesn't show b1 as offline.
   ==> The parent directory that was in play at the time (as part of the untar) showed up in the heal info entries for b1 (meaning it has heal pendings marked).
8. Stopped the linux untar.
9. Mounted the LV using mount -a.
10. Triggered a manual heal using gluster v heal rep3_9.
11. Heal completed.
12. Now did a restart of rep3_9 (intention: wanted to check the behavior, as observed in step 7 the heal info didn't show the brick as offline, and neither did vol info at that time).
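A hedged, scripted sketch of the steps above (the node IPs, brick paths, and volume names are taken from this report; the tarball path, mount point, and pgrep pattern are assumptions for illustration):

# steps 1-2: enable brick multiplexing cluster-wide, create ~40 1x3 volumes
gluster volume set all cluster.brick-multiplex on
for i in $(seq 1 40); do
    gluster volume create rep3_$i replica 3 \
        10.70.35.45:/rhs/brick$i/rep3_$i \
        10.70.35.130:/rhs/brick$i/rep3_$i \
        10.70.35.122:/rhs/brick$i/rep3_$i
    gluster volume start rep3_$i
done

# steps 3-4: mount rep3_9 and start a linux untar on it
mount -t glusterfs 10.70.35.45:/rep3_9 /mnt/rep3_9
tar -xf /root/linux.tar.xz -C /mnt/rep3_9 &

# steps 5-6: kill the multiplexed brick process (takes down all bricks
# on the node), then restart glusterd to bring the brick back online
kill -9 $(pgrep -f glusterfsd)
service glusterd restart

# steps 7-11: lazily unmount the LV backing b1 of rep3_9, remount, heal
umount -l /rhs/brick9
mount -a
gluster v heal rep3_9

# step 12: restart the volume (this is where the crash was hit)
gluster --mode=script volume stop rep3_9
gluster volume start rep3_9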
Note: b1 (where the umount was done and remounted) and b3 couldn't come up as glusterfsd crashed; b2 was up.

[root@dhcp35-45 ~]# gluster v info rep3_9

Volume Name: rep3_9
Type: Replicate
Volume ID: 0bb1db31-b6cb-4b51-ace6-c9de4f16adc3
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.35.45:/rhs/brick9/rep3_9
Brick2: 10.70.35.130:/rhs/brick9/rep3_9
Brick3: 10.70.35.122:/rhs/brick9/rep3_9
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
cluster.brick-multiplex: on

[root@dhcp35-45 ~]# gluster v status rep3_9
Status of volume: rep3_9
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.45:/rhs/brick9/rep3_9        N/A       N/A        N       N/A
Brick 10.70.35.130:/rhs/brick9/rep3_9       49152     0          Y       9787
Brick 10.70.35.122:/rhs/brick9/rep3_9       N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       29373
Self-heal Daemon on 10.70.35.138            N/A       N/A        Y       24803
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       11239
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       1985
Self-heal Daemon on 10.70.35.112            N/A       N/A        Y       19752
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       30171

Task Status of Volume rep3_9
------------------------------------------------------------------------------
There are no active volume tasks
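Side note (hedged): with two of the three bricks showing Online=N as above, a standard way to bring just the dead bricks back without disturbing the running one is a forced start, shown here against this report's volume:

# "force" lets start run even though the volume is already Started,
# spawning brick processes only for the bricks that are down
gluster volume start rep3_9 force
gluster volume status rep3_9    # verify the Online column now shows Y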
Created attachment 1295251 [details] thread bt
Logs and sosreports are available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1468514; cores are available in the dedicated server directories.
Created attachment 1295255 [details] cli output
Hi Nag,

A test build is available at the link below:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13623735

Please try it and share the result if the issue still reproduces after applying the patch.

Regards,
Mohit Agrawal
Upstream patch: https://review.gluster.org/17767
Below is the downstream link of the patch:
https://code.engineering.redhat.com/gerrit/112271
Blocks on_qa validation of BZ#1468514.
Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1459400#c8
(In reply to nchilaka from comment #18)
> Blocks on_qa validation of BZ#1468514.
> Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1459400#c8

Sorry, I meant it blocks BZ#1459400.
Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1459400#c8
Hi Nag,

Thanks for sharing the core dump. This core dump is not similar to the previous one; this time the brick process is crashing in the changetimerecorder xlator. Please file a separate bugzilla to fix this.

Kindly also update this bugzilla to the verified state.

Regards,
Mohit Agrawal
(In reply to Mohit Agrawal from comment #22)
> Hi Nag,
>
> Thanks for sharing the core dump. This core dump is not similar to the
> previous one; this time the brick process is crashing in the
> changetimerecorder xlator. Please file a separate bugzilla to fix this.
>
> Kindly also update this bugzilla to the verified state.
>
> Regards,
> Mohit Agrawal

Raised a new BZ#1472129.
on_qa validation:
Re-ran the case mentioned in the description => crash not hit.
Also re-ran volume restarts about 150 times, but didn't hit this crash.
However, hit another crash (all details mentioned in the previous comment, i.e. comment#23); hence moving to verified.
(In reply to nchilaka from comment #24)
> on_qa validation:
> Re-ran the case mentioned in the description => crash not hit.
> Also re-ran volume restarts about 150 times, but didn't hit this crash.
> However, hit another crash (all details mentioned in the previous comment,
> i.e. comment#23); hence moving to verified.

Test build: 3.8.4-34
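A hedged sketch of the restart soak test described above (the iteration count and volume name are from the comment; the sleep interval and the Online-column check are assumptions):

for i in $(seq 1 150); do
    gluster --mode=script volume stop rep3_9
    gluster volume start rep3_9
    sleep 10    # give the multiplexed brick process time to come back up
    # flag any brick whose Online column is still N after the restart
    gluster volume status rep3_9 | grep -q ' N ' && echo "iteration $i: a brick is down"
done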
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774