Description of problem:

While testing two of my geo-rep patches [1] and [2], the geo-rep mount process crashed. I was running the modified upstream geo-rep regression test suite on a replica 3 (6*3) volume. The geo-rep client process crashed as below. Note that the geo-rep mounts are aux-gfid mounts. I looked into the traceback: the crash happens during dht attr heal, and the gfid is null in both loc and loc->inode. I don't have much context on afr/dht, so I couldn't debug it further.

(gdb) bt
#0  0x00007f260e71b765 in raise () from /lib64/libc.so.6
#1  0x00007f260e71d36a in abort () from /lib64/libc.so.6
#2  0x00007f260e713f97 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f260e714042 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f2602149ec2 in client_pre_inodelk (this=0x7f25fc00ef20, req=0x7f25e4217670, loc=0x7f25e400a298, cmd=6, flock=0x7f25e400a4b8, volume=0x7f25fc0132b0 "slave-replicate-1", xdata=0x0) at client-common.c:841
#5  0x00007f2602138b24 in client3_3_inodelk (frame=0x7f25e4015290, this=0x7f25fc00ef20, data=0x7f25e4217760) at client-rpc-fops.c:5307
#6  0x00007f260210d9d9 in client_inodelk (frame=0x7f25e4015290, this=0x7f25fc00ef20, volume=0x7f25fc0132b0 "slave-replicate-1", loc=0x7f25e400a298, cmd=6, lock=0x7f25e400a4b8, xdata=0x0) at client.c:1679
#7  0x00007f2601ea4444 in afr_nonblocking_inodelk (frame=0x7f25e400f680, this=0x7f25fc015230) at afr-lk-common.c:1093
#8  0x00007f2601e9d149 in afr_lock (frame=0x7f25e400f680, this=0x7f25fc015230) at afr-transaction.c:1652
#9  0x00007f2601e9eb84 in afr_transaction_start (local=0x7f25e4009e60, this=0x7f25fc015230) at afr-transaction.c:2333
#10 0x00007f2601e9eec0 in afr_transaction (frame=0x7f25e400f680, this=0x7f25fc015230, type=AFR_METADATA_TRANSACTION) at afr-transaction.c:2402
#11 0x00007f2601e875d7 in afr_setattr (frame=0x7f25e400ece0, this=0x7f25fc015230, loc=0x7f25e4008e58, buf=0x7f25e4008f58, valid=7, xdata=0x0) at afr-inode-write.c:895
#12 0x00007f261011681d in syncop_setattr (subvol=0x7f25fc015230, loc=0x7f25e4008e58, iatt=0x7f25e4008f58, valid=7, preop=0x0, postop=0x0, xdata_in=0x0, xdata_out=0x0) at syncop.c:1811
#13 0x00007f2601bc0448 in dht_dir_attr_heal (data=0x7f25e4007c60) at dht-selfheal.c:2497
#14 0x00007f261010f894 in synctask_wrap () at syncop.c:375
#15 0x00007f260e72fb60 in ?? () from /lib64/libc.so.6
#16 0x0000000000000000 in ?? ()

(gdb) f 4
#4  0x00007f2602149ec2 in client_pre_inodelk (this=0x7f25fc00ef20, req=0x7f25e4217670, loc=0x7f25e400a298, cmd=6, flock=0x7f25e400a4b8, volume=0x7f25fc0132b0 "slave-replicate-1", xdata=0x0) at client-common.c:841
841         GF_ASSERT_AND_GOTO_WITH_ERROR (this->name,

(gdb) p *loc
$1 = {path = 0x7f25e40102f0 "/.gfid/00000000-0000-0000-0000-", '0' <repeats 11 times>, "1/rsnapshot_symlinkbug", name = 0x7f25e401031c "rsnapshot_symlinkbug", inode = 0x7f25ec030d30, parent = 0x7f25fc078870, gfid = '\000' <repeats 15 times>, pargfid = '\000' <repeats 15 times>, "\001"}

(gdb) p *loc->inode
$2 = {table = 0x7f25fc078770, gfid = '\000' <repeats 15 times>, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, nlookup = 0, fd_count = 0, active_fd_count = 0, ref = 3, ia_type = IA_INVAL, fd_list = {next = 0x7f25ec030d88, prev = 0x7f25ec030d88}, dentry_list = {next = 0x7f25ec030d98, prev = 0x7f25ec030d98}, hash = {next = 0x7f25ec030da8, prev = 0x7f25ec030da8}, list = {next = 0x7f25ec03aa98, prev = 0x7f25fc0787d0}, _ctx = 0x7f25ec032580}

Version-Release number of selected component (if applicable):
3.4 source install: had the two patches [1] and [2] on top of commit 8c9028b560b1f0fd816e7d2a9e0bec70cc526c1a

How reproducible:
Rarely; I have hit it only once.

Steps to Reproduce:
1. Run the upstream regression test suite, modifying the volume type of both master and slave to 6*3:
   # prove -v tests/00-geo-rep/georep-basic-dr-rsync.t

Actual results:
The mount process crashed.

Expected results:
No crash should be seen.

Additional info:
The mount is done by the geo-rep worker process, and it is a gfid-access fuse mount, mounted with the option "-o aux-gfid-mount".

[1] https://code.engineering.redhat.com/gerrit/143400
[2] https://code.engineering.redhat.com/gerrit/143826
Please provide access to the core dump.
I have just uploaded the core to the QE machine; Prasad will share the details. The host is Fedora 24, not RHEL, and it's my local VM. So if you can't use the core file, let me know if I need to update the other gluster binaries.
I am unable to see any symbols in the core file when I try to open it.
Will check and update if dht misses any gfid update in the healing code path.
Nithya, if you are working on this already, could you move this to the ASSIGNED state?

Susant
(In reply to Susant Kumar Palai from comment #9)
> Nithya, If you are working on this already, could you move this to assigned
> state.
>
> Susant

Done. I suspect the heal in dht_lookup_dir_cbk() - the gfid is not set in loc. loc->inode->gfid is also NULL, which is what causes the crash.
Hi,

The lib64 directory which was missing is uploaded here:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1601331/

core/                    2018-07-16 12:04     -
georep-basic-dr-rsyn..>  2018-07-16 12:37  131M
gluster-binares.tar      2018-07-16 12:37  1.6M
lib64.tar                2018-07-17 16:50  333M
libraries.tar            2018-07-16 12:44   30M

Steps to use the core for debugging:

1. Create a directory on the local machine and change into it:
   # mkdir /dht-crash
   # cd /dht-crash

2. Download libraries.tar, gluster-binaries.tar and core/core-glustersproc0-6-0-0-13668-1531717404 into the /dht-crash directory.

3. Untar all the tar files.

4. Open the core in gdb:
   # gdb usr/local/sbin/glusterfs core-glustersproc0-6-0-0-13668-1531717404
   (gdb) set solib-absolute-prefix /dht-crash
   (gdb) bt
Mid air collision, setting the status back to ASSIGNED
(In reply to Kotresh HR from comment #11)
> Hi,
>
> The lib64 directory which was missing is uploaded here
>
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1601331/
>
> core/ 2018-07-16 12:04 -
> georep-basic-dr-rsyn..> 2018-07-16 12:37 131M
> gluster-binares.tar 2018-07-16 12:37 1.6M
> lib64.tar 2018-07-17 16:50 333M
> libraries.tar 2018-07-16 12:44 30M
>
> Steps to use the core for debugging.
>
> 1. Create a directory on local machine and change directory
> #mkdir /dht-crash
> #cd /dht-crash
>
> 2. Download all libraries.tar, gluster-binaries.tar and
> core/core-glustersproc0-6-0-0-13668-1531717404 into /dht-crash directory
>
> 3. untar all the tar files
>
> 4. gdb usr/local/sbin/glusterfs core-glustersproc0-6-0-0-13668-1531717404
>
> (gdb) set solib-absolute-prefix /dht-crash
> (gdb) bt

Thank you. I can now see the symbols.
Still looking into this. I shall update by tomorrow.
On glusterfs version 3.12.2-15.el7rhgs.x86_64, I ran the same test case mentioned in the description multiple times and didn't hit this issue. Hence, moving this BZ to the Verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607