Bug 1315544 - [GSS] -Gluster NFS server crashing in __mnt3svc_umountall
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: gluster-nfs
Version: unspecified
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Release: RHGS 3.2.0
Assigned To: Soumya Koduri
QA Contact: Arthy Loganathan
Depends On: 1421759
Blocks: 1351515, 1351530
 
Reported: 2016-03-07 20:45 EST by Oonkwee Lim_
Modified: 2017-04-24 20:56 EDT
CC List: 24 users

Fixed In Version: glusterfs-3.8.4-15
Doc Type: Bug Fix
Doc Text:
Previously, when an NFS client unmounted all volumes, the Red Hat Gluster Storage native NFS server freed a structure that was still in use, which resulted in a segmentation fault on the server (a use-after-free). The server no longer frees the structure while the mount service is available, so the segmentation fault no longer occurs.
Clone Of: ---
Clones: 1421759
Last Closed: 2017-03-23 01:27:19 EDT
Type: Bug
Flags: akaiser: needinfo+


Attachments
1. fix UMNTALL behaviour, 2. remove mountdict (14.82 KB, application/mbox) - 2016-05-23 04:39 EDT, Niels de Vos
tcpdump (2.03 MB, application/octet-stream) - 2017-01-31 14:44 EST, Raghavendra Bhat
images from client (1.20 MB, application/x-gzip) - 2017-01-31 14:45 EST, Raghavendra Bhat

Comment 14 Bipin Kunal 2016-05-16 10:03:04 EDT
We do not have reproducer steps as of now.

The cores are available at:

collab-shell.usersys.redhat.com:/cases/01633922
cd /cases/01633922
ls *core*

core.34740.1463053585.dump.1.xz is from the rhs9 node and
core.23327.1463050194.dump.1.xz is from the rhs8 node.

Looking at the backtraces, it looks more like a glibc issue.

# file *
160-core.23327.1463050194.dump.1: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/'
200-core.34740.1463053585.dump.1: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/'

# gdb -c 160-core.23327.1463050194.dump.1  /usr/sbin/glusterfs

(gdb) bt
#0  0x00007f9e3ad5e625 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f9e3ad5fe05 in abort () at abort.c:92
#2  0x00007f9e3ad9c537 in __libc_message (do_abort=2, fmt=0x7f9e3ae848c0 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x00007f9e3ada1f4e in malloc_printerr (action=3, str=0x7f9e3ae829ae "free(): invalid pointer", ptr=<value optimized out>, ar_ptr=<value optimized out>) at malloc.c:6350
#4  0x00007f9e3ada4cad in _int_free (av=0x7f9e3b0bbe80, p=0x7f9e107bee10, have_lock=0) at malloc.c:4836
#5  0x00007f9e3c3bab97 in dict_del (this=0x7f9e39978e14, key=0x7f9e04001534 "") at dict.c:528
#6  0x00007f9e2e13ceb3 in __mountdict_remove (ms=0x7f9e28198f30) at mount3.c:213
#7  __mnt3svc_umountall (ms=0x7f9e28198f30) at mount3.c:2508
#8  0x00007f9e2e13cf14 in mnt3svc_umountall (ms=0x7f9e28198f30) at mount3.c:2527
#9  0x00007f9e2e13d7dc in mnt3svc_umntall (req=0x7f9e2da61bd4) at mount3.c:2553
#10 0x00007f9e3c408332 in synctask_wrap (old_task=<value optimized out>) at syncop.c:381
#11 0x00007f9e3ad6f8f0 in ?? () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()
(gdb) quit

# gdb -c 200-core.34740.1463053585.dump.1 /usr/sbin/glusterfs

(gdb) bt
#0  0x00007f3a6964a625 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f3a6964be05 in abort () at abort.c:92
#2  0x00007f3a69688537 in __libc_message (do_abort=2, fmt=0x7f3a697708c0 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x00007f3a6968df4e in malloc_printerr (action=3, str=0x7f3a6976e9ae "free(): invalid pointer", ptr=<value optimized out>, ar_ptr=<value optimized out>) at malloc.c:6350
#4  0x00007f3a69690cad in _int_free (av=0x7f3a699a7e80, p=0x7f3a3c03e030, have_lock=0) at malloc.c:4836
#5  0x00007f3a6aca6b97 in dict_del (this=0x7f3a68264e14, key=0x7f3a34001534 "") at dict.c:528
#6  0x00007f3a5c825eb3 in __mountdict_remove (ms=0x7f3a58198f30) at mount3.c:213
#7  __mnt3svc_umountall (ms=0x7f3a58198f30) at mount3.c:2508
#8  0x00007f3a5c825f14 in mnt3svc_umountall (ms=0x7f3a58198f30) at mount3.c:2527
#9  0x00007f3a5c8267dc in mnt3svc_umntall (req=0x7f3a5c13c3d8) at mount3.c:2553
#10 0x00007f3a6acf4332 in synctask_wrap (old_task=<value optimized out>) at syncop.c:381
#11 0x00007f3a6965b8f0 in ?? () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()
Comment 18 Niels de Vos 2016-05-17 06:36:56 EDT
Maybe related, it seems that Windows 7 and 2008 send UMNTALL requests:
  https://bugzilla.redhat.com/show_bug.cgi?id=GLUSTER-1666

These requests are normally sent after a(n unclean?) reboot. A few more details are in https://tools.ietf.org/html/rfc1813#section-5.2.4

Linux and pynfs do not implement UMNTALL, so it might only be reproducible with Windows or another OS.
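
For reference, UMNTALL is procedure 4 of the MOUNT version 3 program (RPC program 100005) described in RFC 1813. Since the Linux mount client never sends it, a rough, untested sketch of how the procedure could be fired by hand from a Linux box with the classic Sun RPC client API is shown below; the server name is a placeholder and this has not been tried against gluster-nfs:

#include <stdio.h>
#include <sys/time.h>
#include <rpc/rpc.h>                     /* Sun RPC / libtirpc client API */

#define MOUNT_PROGRAM       100005       /* MOUNT service, RFC 1813 */
#define MOUNT_V3            3
#define MOUNTPROC3_UMNTALL  4            /* UMNTALL procedure number */

int main(int argc, char **argv)
{
    /* Placeholder host; pass the real NFS server as the first argument. */
    const char *server = argc > 1 ? argv[1] : "nfs-server.example.com";
    struct timeval tv = { 5, 0 };

    /* Bind to the MOUNT v3 service on the server via the portmapper. */
    CLIENT *clnt = clnt_create(server, MOUNT_PROGRAM, MOUNT_V3, "udp");
    if (!clnt) {
        clnt_pcreateerror(server);
        return 1;
    }

    /* UMNTALL takes no arguments and returns no result (void -> void). */
    if (clnt_call(clnt, MOUNTPROC3_UMNTALL,
                  (xdrproc_t)xdr_void, NULL,
                  (xdrproc_t)xdr_void, NULL, tv) != RPC_SUCCESS)
        clnt_perror(clnt, "MOUNTPROC3_UMNTALL");

    clnt_destroy(clnt);
    return 0;
}

On old glibc this builds as-is; newer distributions need libtirpc (-I/usr/include/tirpc -ltirpc). This is only an idea for a reproducer, not something that has been verified to trigger the crash.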
Comment 24 Bipin Kunal 2016-05-20 03:12:02 EDT
Niels,

I tried playing with the Windows 2012 NFS client yesterday but could not produce anything useful.

Can you suggest a test or reproducer steps?

Thanks,
Bipin Kunal
Comment 25 Niels de Vos 2016-05-23 04:39 EDT
Created attachment 1160478 [details]
1. fix UMNTALL behaviour, 2. remove mountdict

Completely untested patches that address the following two points:

  1. fix UMNTALL to only UMNT the exports from the client calling the procedure
  2. remove the duplication of structures in mountdict, use mountlist everywhere

I am confident that these two patches prevent the crashes that have been seen. The mountdict was used to optimize lookups when many clients are mounting exports. The removal of the mountdict may cause some performance drop while (un)mounting, but I do not expect that to be critical.

These patches were only build-tested, but they show the approach I'd like to take to fix this problem.
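
As an illustration of point 1 only (this is not the attached patches), the UMNTALL handler boils down to filtering the mount list by the calling client instead of flushing it entirely. A minimal self-contained sketch; the structure and function names are placeholders, not the real gluster-nfs identifiers:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder types; the real gluster-nfs structures live in mount3.h. */
struct mountentry {
    char hostname[256];              /* client that sent the MNT request */
    char exname[256];                /* export that was mounted */
    struct mountentry *next;
};

struct mount3_state {
    struct mountentry *mountlist;    /* single list, no duplicate dict */
};

/* UMNTALL: unlink only the entries belonging to 'client', keep the rest. */
static void mnt3_umountall_for_client(struct mount3_state *ms, const char *client)
{
    struct mountentry **pp = &ms->mountlist;

    while (*pp) {
        struct mountentry *me = *pp;
        if (strcmp(me->hostname, client) == 0) {
            *pp = me->next;          /* unlink from the list */
            free(me);                /* entry is owned by the list only */
        } else {
            pp = &me->next;
        }
    }
}

static void add_mount(struct mount3_state *ms, const char *host, const char *ex)
{
    struct mountentry *me = calloc(1, sizeof(*me));
    snprintf(me->hostname, sizeof(me->hostname), "%s", host);
    snprintf(me->exname, sizeof(me->exname), "%s", ex);
    me->next = ms->mountlist;
    ms->mountlist = me;
}

int main(void)
{
    struct mount3_state ms = { 0 };
    add_mount(&ms, "clientA", "/vol1");
    add_mount(&ms, "clientB", "/vol1");

    mnt3_umountall_for_client(&ms, "clientA");   /* clientB's mount survives */

    for (struct mountentry *me = ms.mountlist; me; me = me->next)
        printf("%s still has %s mounted\n", me->hostname, me->exname);
    return 0;
}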
Comment 26 Niels de Vos 2016-05-23 04:40:14 EDT
(In reply to Bipin Kunal from comment #24)
> I tried playing with Windows 2012 NFS client yesterday but could not produce
> anything useful.
> 
> Can you suggest me the test or steps for reproducer?

Sorry, I still have no idea how this can happen. It may require multiple clients mounting at the same time while one client sends the UMNTALL procedure. But without more details from the customers who hit this problem, it is very difficult, if not impossible, to understand the cause.

Maybe one of the customers can capture a tcpdump on the NFS server for all the MOUNT procedures? Depending on their configuration, this runs on a different port than the NFS traffic, so the tcpdump should not contain much data, only the MNT, UMNT and UMNTALL procedures. Check with 'rpcinfo -p $SERVER' which port is used. Provide the tcpdump and the nfs.log once the process has crashed again.
Comment 45 Raghavendra Bhat 2017-01-31 14:44 EST
Created attachment 1246422 [details]
tcpdump
Comment 46 Raghavendra Bhat 2017-01-31 14:45 EST
Created attachment 1246423 [details]
images from client
Comment 58 Soumya Koduri 2017-02-07 02:46:17 EST
Okay... from the code, I see a potential issue with mountdict.

During gluster-nfs process initialization -->

mnt3svc_init (xlator_t *nfsx) {

...
...
...
        mstate->mountdict = dict_new ();
...
..
}

The ref taken on mountdict above seems to be getting un-refed in __mnt3svc_umountall():

__mnt3svc_umountall (struct mount3_state *ms) {

       dict_unref (ms->mountdict);
}

So, within the lifetime of a gNFS process, more than one UMNTALL request can result in a double unref, which means accessing freed memory.

Ideally the above dict_unref (ms->mountdict) should have been in mnt3svc_deinit() IMO.
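
To make the suspected life cycle concrete, here is a small self-contained sketch (a toy refcount in place of the real glusterfs dict_t API) of why an unref in the UMNTALL handler blows up on the second request, and where the single unref would live instead:

#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in for dict_t: one refcount, freed when it drops to zero. */
struct toy_dict {
    int refcount;
};

static struct toy_dict *toy_dict_new(void)
{
    struct toy_dict *d = calloc(1, sizeof(*d));
    d->refcount = 1;                 /* the ref owned by mount3_state */
    return d;
}

static void toy_dict_unref(struct toy_dict *d)
{
    if (--d->refcount == 0)
        free(d);                     /* any later use is a use-after-free */
}

struct mount3_state {
    struct toy_dict *mountdict;
};

/* Buggy pattern: the long-lived ref is dropped inside the request handler,
 * so a second UMNTALL touches and frees already-freed memory. */
static void umountall_buggy(struct mount3_state *ms)
{
    toy_dict_unref(ms->mountdict);   /* 1st call frees it, 2nd call crashes */
}

/* Fixed pattern: the handler only empties the dict and keeps it alive; the
 * init-time ref is released exactly once when the mount service goes away. */
static void umountall_fixed(struct mount3_state *ms)
{
    (void)ms;                        /* remove entries here, keep mountdict */
}

static void mnt3svc_deinit_fixed(struct mount3_state *ms)
{
    toy_dict_unref(ms->mountdict);
    ms->mountdict = NULL;
}

int main(void)
{
    struct mount3_state ms = { .mountdict = toy_dict_new() };
    umountall_fixed(&ms);            /* safe any number of times */
    umountall_fixed(&ms);
    mnt3svc_deinit_fixed(&ms);       /* single release at shutdown */
    return 0;
}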

Based on the above, I am working with Riyas to check whether we can reproduce the issue by trying multiple mount/umount or reboot scenarios using a Windows client.
Comment 90 Harold Miller 2017-02-15 15:43:48 EST
As per the current HOTFIX process document, setting NeedInfo to PM: https://mojo.redhat.com/docs/DOC-1037888
Comment 94 Atin Mukherjee 2017-02-16 03:59:28 EST
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/97861/
Comment 108 surabhi 2017-02-27 05:02:44 EST
The hotfix build provided for this issue has been verified as described in comment #107.
The build can be provided to customers.
Comment 113 Arthy Loganathan 2017-03-01 04:01:25 EST
The __mnt3svc_umountall crash is no longer seen while doing gNFS mount and umountall.

Verified the fix on build glusterfs-server-3.8.4-15.el7rhgs.x86_64.
Comment 118 errata-xmlrpc 2017-03-23 01:27:19 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html
