Description of problem:
=========================

Found 218 cores on a single slave node while running sanity check of geo-replication with nl enabled.

[root@dhcp37-82 ~]# gdb glusterfsd /core.20805
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.

warning: core file may not match specified executable file.
[New LWP 20820]
[New LWP 20806]
[New LWP 20807]
[New LWP 20810]
[New LWP 20819]
[New LWP 20823]
[New LWP 20805]
[New LWP 20818]
[New LWP 20809]
[New LWP 20811]
[New LWP 20822]
[New LWP 20808]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-re'.
Program terminated with signal 11, Segmentation fault.
#0  nlc_dir_add_ne (this=0x7fac34021160, inode=0x0, name=0x0) at nl-cache-helper.c:823
823             if (inode->ia_type != IA_IFDIR) {
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7_3.2.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 sssd-client-1.14.0-43.el7_3.14.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  nlc_dir_add_ne (this=0x7fac34021160, inode=0x0, name=0x0) at nl-cache-helper.c:823
#1  0x00007fac3a554a3e in nlc_lookup_cbk (frame=0x7fac3000ff20, cookie=<optimized out>, this=<optimized out>, op_ret=-1, op_errno=2, inode=0x0, buf=0x7fac30001ac8, xdata=0x0, postparent=0x7fac30001cf8) at nl-cache.c:203
#2  0x00007fac3a969cb3 in qr_lookup_cbk (frame=frame@entry=0x7fac30007be0, cookie=<optimized out>, this=<optimized out>, op_ret=op_ret@entry=-1, op_errno=op_errno@entry=2, inode_ret=inode_ret@entry=0x0, buf=buf@entry=0x7fac30001ac8, xdata=xdata@entry=0x0, postparent=postparent@entry=0x7fac30001cf8) at quick-read.c:446
#3  0x00007fac3ab75cee in ioc_lookup_cbk (frame=0x7fac300109a0, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, inode=0x0, stbuf=0x7fac30001ac8, xdata=0x0, postparent=0x7fac30001cf8) at io-cache.c:255
#4  0x00007fac3b3dd507 in dht_discover_complete (this=this@entry=0x7fac34016fe0, discover_frame=discover_frame@entry=0x7fac300175d0) at dht-common.c:572
#5  0x00007fac3b3de29b in dht_discover_cbk (frame=0x7fac300175d0, cookie=<optimized out>, this=0x7fac34016fe0, op_ret=<optimized out>, op_errno=2, inode=0x7fac1c000f30, stbuf=0x7fac300073d0, xattr=0x0, postparent=0x7fac30007440) at dht-common.c:701
#6  0x00007fac3b68a2f0 in afr_discover_done (this=<optimized out>, frame=0x7fac30003270) at afr-common.c:2615
#7  afr_discover_cbk (frame=frame@entry=0x7fac30003270, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, inode=inode@entry=0x7fac1c000f30, buf=buf@entry=0x7fac2bffe940, xdata=0x0, postparent=postparent@entry=0x7fac2bffe9b0) at afr-common.c:2660
#8  0x00007fac3b8c72c7 in client3_3_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7fac30008720) at client-rpc-fops.c:2947
#9  0x00007fac49139840 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fac340767d0, pollin=pollin@entry=0x7fac24004950) at rpc-clnt.c:794
#10 0x00007fac49139b27 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fac34076800, event=<optimized out>, data=0x7fac24004950) at rpc-clnt.c:987
#11 0x00007fac491359e3 in rpc_transport_notify (this=this@entry=0x7fac340769d0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fac24004950) at rpc-transport.c:538
#12 0x00007fac3ddb03b4 in socket_event_poll_in (this=this@entry=0x7fac340769d0) at socket.c:2275
#13 0x00007fac3ddb2895 in socket_event_handler (fd=<optimized out>, idx=2, data=0x7fac340769d0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2411
#14 0x00007fac493c9e00 in event_dispatch_epoll_handler (event=0x7fac2bffee80, event_pool=0x7fac4a524730) at event-epoll.c:572
#15 event_dispatch_epoll_worker (data=0x7fac34076500) at event-epoll.c:675
#16 0x00007fac481cfdc5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007fac47b1473d in clone () from /lib64/libc.so.6
(gdb) list 1
1       /*
2        * Copyright (c) 2017 Red Hat, Inc. <http://www.redhat.com>
3        * This file is part of GlusterFS.
4        *
5        * This file is licensed to you under your choice of the GNU Lesser
6        * General Public License, version 3 or any later version (LGPLv3 or
7        * later), or the GNU General Public License, version 2 (GPLv2), in all
8        * cases as published by the Free Software Foundation.
9        */
10
(gdb) list nl_lookup_cbk
Function "nl_lookup_cbk" not defined.
(gdb) list nlc_lookup_cbk
188
189     static int32_t
190     nlc_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
191                     int32_t op_ret, int32_t op_errno, inode_t *inode,
192                     struct iatt *buf, dict_t *xdata, struct iatt *postparent)
193     {
194             nlc_local_t *local = NULL;
195             nlc_conf_t  *conf  = NULL;
196
197             local = frame->local;
(gdb)
198             conf = this->private;
199
200             /* Donot add to pe, this may lead to duplicate entry and
201              * requires search before adding if list of strings */
202             if (op_ret < 0 && op_errno == ENOENT) {
203                     nlc_dir_add_ne (this, local->loc.parent, local->loc.name);
204                     GF_ATOMIC_INC (conf->nlc_counter.nlc_miss);
205             }
206
207             NLC_STACK_UNWIND (lookup, frame, op_ret, op_errno, inode, buf, xdata,
(gdb)
208                               postparent);
209             return 0;
210     }
211
212
213     static int32_t
214     nlc_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xdata)
215     {
216             nlc_local_t *local = NULL;
217             nlc_conf_t  *conf  = NULL;
(gdb) f 1
#1  0x00007fac3a554a3e in nlc_lookup_cbk (frame=0x7fac3000ff20, cookie=<optimized out>, this=<optimized out>, op_ret=-1, op_errno=2, inode=0x0, buf=0x7fac30001ac8, xdata=0x0, postparent=0x7fac30001cf8) at nl-cache.c:203
203                     nlc_dir_add_ne (this, local->loc.parent, local->loc.name);
(gdb) p *local->loc
Structure has no component named operator*.
(gdb) p local->loc
$1 = {path = 0x0, name = 0x0, inode = 0x7fac1c000f30, parent = 0x0, gfid = "̲ǝ\f1Aڜ\353iq61Y)", pargfid = '\000' <repeats 15 times>}
(gdb) quit
[root@dhcp37-82 ~]#

In general, the steps were as follows:

1. Create Master and Slave cluster and volume
2. Start them
3. Enable nl options on master and slave
4. Mount the Master volume via cifs mount
5. Create following fops on master volume: {create,chmod,chown,chgrp,rename,hardlink,symlink,truncate,remove}

Trying the fops one by one to identify the specific trigger.
However, the following shows that the core was generated by the aux mount of geo-replication:

[root@dhcp37-82 ~]# file /core.9948
/core.9948: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-re', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/glusterfs', platform: 'x86_64'
[root@dhcp37-82 ~]#
Crash during Slave Volume mount.
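The issue is the interaction between nameless lookups and the negative lookup cache (nl-cache). The crash happened because negative lookup caching is enabled on the slave: for a nameless lookup, local->loc.parent and local->loc.name are both NULL (see the gdb output above), so nlc_lookup_cbk passes a NULL inode to nlc_dir_add_ne, which dereferences it at nl-cache-helper.c:823. As discussed with Poornima, negative lookup caching is not possible with nameless lookups, so a fix is required. Assigning it to Poornima.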
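The failure mode can be illustrated with a minimal, self-contained sketch. This is not the posted patch: the mock_* types and the helper below are hypothetical stand-ins that model only the two loc fields seen as 0x0 in the gdb output, and the guard is one plausible way to skip negative caching when a lookup carries no parent or name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the GlusterFS types in the backtrace;
 * only the fields relevant to the crash are modelled. */
typedef struct mock_inode {
        int ia_type;
} mock_inode_t;

typedef struct mock_loc {
        const char   *name;    /* NULL for a nameless (gfid-based) lookup */
        mock_inode_t *parent;  /* NULL for a nameless (gfid-based) lookup */
} mock_loc_t;

/* Returns 1 if a failed lookup may be recorded as a negative entry,
 * 0 otherwise. Assumed policy: only ENOENT failures that carry both a
 * parent inode and a name are cacheable. */
static int
mock_can_cache_negative (const mock_loc_t *loc, int op_ret, int op_errno)
{
        assert (loc != NULL);

        if (op_ret >= 0 || op_errno != ENOENT)
                return 0;

        /* Nameless lookup: there is no parent directory to attach a
         * negative entry to, so skip caching instead of dereferencing
         * a NULL inode. */
        if (loc->parent == NULL || loc->name == NULL)
                return 0;

        return 1;
}
```

With a check of this kind in front of the call at nl-cache.c:203, the lookup shown in the core (parent = 0x0, name = 0x0, op_errno = 2) would simply be unwound without touching the cache.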
Patch posted upstream: https://review.gluster.org/#/c/17316/1
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/106935/
Validated with build: glusterfs-geo-replication-3.8.4-27.el7rhgs.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774