We do not have reproducer steps as of now. The cores are available at
collab-shell.usersys.redhat.com:/cases/01633922:

# cd /cases/01633922
# ls *core*
core.34740.1463053585.dump.1.xz   (from the rhs9 node)
core.23327.1463050194.dump.1.xz   (from the rhs8 node)

Looking at the backtraces, at first glance this looks like a glibc issue,
as the abort happens inside the malloc internals:

# file *
160-core.23327.1463050194.dump.1: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/'
200-core.34740.1463053585.dump.1: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/'

# gdb -c 160-core.23327.1463050194.dump.1 /usr/sbin/glusterfs
(gdb) bt
#0  0x00007f9e3ad5e625 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f9e3ad5fe05 in abort () at abort.c:92
#2  0x00007f9e3ad9c537 in __libc_message (do_abort=2, fmt=0x7f9e3ae848c0 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x00007f9e3ada1f4e in malloc_printerr (action=3, str=0x7f9e3ae829ae "free(): invalid pointer", ptr=<value optimized out>, ar_ptr=<value optimized out>) at malloc.c:6350
#4  0x00007f9e3ada4cad in _int_free (av=0x7f9e3b0bbe80, p=0x7f9e107bee10, have_lock=0) at malloc.c:4836
#5  0x00007f9e3c3bab97 in dict_del (this=0x7f9e39978e14, key=0x7f9e04001534 "") at dict.c:528
#6  0x00007f9e2e13ceb3 in __mountdict_remove (ms=0x7f9e28198f30) at mount3.c:213
#7  __mnt3svc_umountall (ms=0x7f9e28198f30) at mount3.c:2508
#8  0x00007f9e2e13cf14 in mnt3svc_umountall (ms=0x7f9e28198f30) at mount3.c:2527
#9  0x00007f9e2e13d7dc in mnt3svc_umntall (req=0x7f9e2da61bd4) at mount3.c:2553
#10 0x00007f9e3c408332 in synctask_wrap (old_task=<value optimized out>) at syncop.c:381
#11 0x00007f9e3ad6f8f0 in ?? () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()
(gdb) quit

# gdb -c 200-core.34740.1463053585.dump.1 /usr/sbin/glusterfs
(gdb) bt
#0  0x00007f3a6964a625 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f3a6964be05 in abort () at abort.c:92
#2  0x00007f3a69688537 in __libc_message (do_abort=2, fmt=0x7f3a697708c0 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x00007f3a6968df4e in malloc_printerr (action=3, str=0x7f3a6976e9ae "free(): invalid pointer", ptr=<value optimized out>, ar_ptr=<value optimized out>) at malloc.c:6350
#4  0x00007f3a69690cad in _int_free (av=0x7f3a699a7e80, p=0x7f3a3c03e030, have_lock=0) at malloc.c:4836
#5  0x00007f3a6aca6b97 in dict_del (this=0x7f3a68264e14, key=0x7f3a34001534 "") at dict.c:528
#6  0x00007f3a5c825eb3 in __mountdict_remove (ms=0x7f3a58198f30) at mount3.c:213
#7  __mnt3svc_umountall (ms=0x7f3a58198f30) at mount3.c:2508
#8  0x00007f3a5c825f14 in mnt3svc_umountall (ms=0x7f3a58198f30) at mount3.c:2527
#9  0x00007f3a5c8267dc in mnt3svc_umntall (req=0x7f3a5c13c3d8) at mount3.c:2553
#10 0x00007f3a6acf4332 in synctask_wrap (old_task=<value optimized out>) at syncop.c:381
#11 0x00007f3a6965b8f0 in ?? () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()
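Note that both cores abort in the same place: the crash is not a fault inside
glibc itself, but glibc's heap-consistency checks rejecting the address that
dict_del() hands to free(). A minimal standalone demo (illustration only, not
the gluster code) that triggers the same "free(): invalid pointer" abort:

#include <stdlib.h>

int main(void)
{
    char *p = malloc(16);
    free(p + 1);   /* not an address malloc() returned -> glibc abort()s */
    return 0;
}

In other words, frames #0-#4 are just glibc reporting the problem; the memory
corruption originates somewhere in the mount3.c code path of frames #5-#9.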
Maybe related: it seems that Windows 7 and 2008 send UMNTALL requests:
https://bugzilla.redhat.com/show_bug.cgi?id=GLUSTER-1666

These requests are normally sent after a reboot (possibly an unclean one). A
few more details are in https://tools.ietf.org/html/rfc1813#section-5.2.4

Linux and pynfs do not implement UMNTALL, so the problem might only be
reproducible with Windows or another OS that sends it.
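For context, the MOUNT protocol definition in RFC 1813 (Appendix I) shows why
the server cannot tell from the request itself which exports to drop: UMNTALL
carries no arguments, so the only client identification available is the
source of the RPC:

program MOUNT_PROGRAM {
     version MOUNT_V3 {
        void      MOUNTPROC3_NULL(void)    = 0;
        mountres3 MOUNTPROC3_MNT(dirpath)  = 1;
        mountlist MOUNTPROC3_DUMP(void)    = 2;
        void      MOUNTPROC3_UMNT(dirpath) = 3;
        void      MOUNTPROC3_UMNTALL(void) = 4;
        exports   MOUNTPROC3_EXPORT(void)  = 5;
     } = 3;
} = 100005;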
Niels,

I tried playing with the Windows 2012 NFS client yesterday but could not
produce anything useful. Can you suggest a test or steps for a reproducer?

Thanks,
Bipin Kunal
Created attachment 1160478 [details]
1. fix UMNTALL behaviour, 2. remove mountdict

Completely untested patch that addresses the following two points:

1. fix UMNTALL to only UMNT the exports of the client calling the procedure
2. remove the duplication of structures in mountdict, use mountlist everywhere

I am confident that these two changes prevent the crashes that have been
seen. The mountdict was used to optimize lookups when many clients mount
exports; removing it may cause some performance drop while (un)mounting, but
I do not expect that to be critical.

The patch is only build-tested, but it shows the approach I'd like to take in
order to fix this problem.
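To illustrate the first point with a self-contained sketch (hypothetical
types and names, not the actual patch): UMNTALL should walk the mount list
and drop only the entries recorded for the calling host, so each entry is
unlinked and freed exactly once:

#include <stdlib.h>
#include <string.h>

struct mountentry {
    struct mountentry *next;
    char hostname[256];     /* client that sent the MNT request  */
    char exportname[256];   /* export the client mounted         */
};

/* Unlink and free only the caller's entries; other clients' entries
 * stay on the list untouched. */
static void
umntall_for_host(struct mountentry **list, const char *host)
{
    struct mountentry **pp = list;

    while (*pp != NULL) {
        if (strcmp((*pp)->hostname, host) == 0) {
            struct mountentry *victim = *pp;
            *pp = victim->next;   /* unlink */
            free(victim);         /* freed exactly once */
        } else {
            pp = &(*pp)->next;
        }
    }
}

int main(void)
{
    /* Entries would be added at MNT time; UMNTALL from one host then
     * leaves the other hosts' entries intact. Safe on an empty list. */
    struct mountentry *list = NULL;
    umntall_for_host(&list, "clientA");
    return 0;
}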
(In reply to Bipin Kunal from comment #24)
> I tried playing with the Windows 2012 NFS client yesterday but could not
> produce anything useful. Can you suggest a test or steps for a reproducer?

Sorry, I still have no idea how this can happen. It may require multiple
clients mounting at the same time while one client sends the UMNTALL
procedure. But without more details from the customers that hit this problem,
it is very difficult, if not impossible, to understand the cause.

Maybe one of the customers can capture a tcpdump on the NFS server for all
the MOUNT procedures? Depending on their configuration, these run on a
different port than the NFS traffic, so the tcpdump should not contain much
data, only the MNT, UMNT and UMNTALL procedures. Check with
'rpcinfo -p $SERVER' which port is used.

Provide the tcpdump and the nfs.log once the process has crashed again.
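Something along these lines should do. The port below is only an example:
use whatever rpcinfo reports for program 100005 (mountd) on the server;
38465 is merely gNFS' usual default.

# rpcinfo -p $SERVER | grep 100005
# tcpdump -i any -s 0 -w /tmp/mountd.pcap port 38465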
Created attachment 1246422 [details] tcpdump
Created attachment 1246423 [details] images from client
Okay... from the code, I see a potential issue with mountdict.

During gluster-nfs process initialization:

mnt3svc_init (xlator_t *nfsx)
{
        ...
        mstate->mountdict = dict_new ();
        ...
}

The reference taken on mountdict above appears to be dropped in
__mnt3svc_umountall():

__mnt3svc_umountall (struct mount3_state *ms)
{
        ...
        dict_unref (ms->mountdict);
        ...
}

So over the lifetime of a gNFS process, more than one UMNTALL request can
result in a double unref, and hence in accessing already-freed memory.
Ideally the dict_unref (ms->mountdict) should have been in mnt3svc_deinit()
IMO.

Based on the above, I am working with Riyas to check whether we can reproduce
the issue by trying multiple mount/umount or reboot scenarios using a Windows
client.
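A minimal model of that life-cycle bug (simplified stand-ins for gluster's
dict_new()/dict_unref(), not the real implementation): one reference is taken
at init, but every UMNTALL drops one, so a second UMNTALL touches freed
memory:

#include <stdlib.h>

struct dict { int refcount; };

static struct dict *dict_new(void)      /* returns with refcount == 1 */
{
    struct dict *d = calloc(1, sizeof(*d));
    d->refcount = 1;
    return d;
}

static void dict_unref(struct dict *d)  /* frees when refcount hits 0 */
{
    if (--d->refcount == 0)
        free(d);
}

int main(void)
{
    struct dict *mountdict = dict_new(); /* mnt3svc_init()                  */
    dict_unref(mountdict);               /* 1st UMNTALL: refcount 0, freed  */
    dict_unref(mountdict);               /* 2nd UMNTALL: use-after-free     */
    return 0;
}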
As per the current HOTFIX process document, setting NeedInfo to PM https://mojo.redhat.com/docs/DOC-1037888
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/97861/
The hotfix build provided for this issue has been verified as mentioned in
comment #107. The build can be provided to the customers.
The __mnt3svc_umountall() crash is no longer seen while doing gNFS mount and
UMNTALL operations. Verified the fix on build
glusterfs-server-3.8.4-15.el7rhgs.x86_64.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html