We do not have reproducer steps as of now. The cores are available at
collab-shell.usersys.redhat.com:/cases/01633922:

# cd /cases/01633922
# ls *core*
core.34740.1463053585.dump.1.xz   (from the rhs9 node)
core.23327.1463050194.dump.1.xz   (from the rhs8 node)

Looking at the backtraces, at first glance this looks like a glibc issue,
as the abort happens inside the malloc internals:

# file *
160-core.23327.1463050194.dump.1: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/'
200-core.34740.1463053585.dump.1: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/'

# gdb -c 160-core.23327.1463050194.dump.1 /usr/sbin/glusterfs
(gdb) bt
#0  0x00007f9e3ad5e625 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f9e3ad5fe05 in abort () at abort.c:92
#2  0x00007f9e3ad9c537 in __libc_message (do_abort=2, fmt=0x7f9e3ae848c0 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x00007f9e3ada1f4e in malloc_printerr (action=3, str=0x7f9e3ae829ae "free(): invalid pointer", ptr=<value optimized out>, ar_ptr=<value optimized out>) at malloc.c:6350
#4  0x00007f9e3ada4cad in _int_free (av=0x7f9e3b0bbe80, p=0x7f9e107bee10, have_lock=0) at malloc.c:4836
#5  0x00007f9e3c3bab97 in dict_del (this=0x7f9e39978e14, key=0x7f9e04001534 "") at dict.c:528
#6  0x00007f9e2e13ceb3 in __mountdict_remove (ms=0x7f9e28198f30) at mount3.c:213
#7  __mnt3svc_umountall (ms=0x7f9e28198f30) at mount3.c:2508
#8  0x00007f9e2e13cf14 in mnt3svc_umountall (ms=0x7f9e28198f30) at mount3.c:2527
#9  0x00007f9e2e13d7dc in mnt3svc_umntall (req=0x7f9e2da61bd4) at mount3.c:2553
#10 0x00007f9e3c408332 in synctask_wrap (old_task=<value optimized out>) at syncop.c:381
#11 0x00007f9e3ad6f8f0 in ?? () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()
(gdb) quit

# gdb -c 200-core.34740.1463053585.dump.1 /usr/sbin/glusterfs
(gdb) bt
#0  0x00007f3a6964a625 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f3a6964be05 in abort () at abort.c:92
#2  0x00007f3a69688537 in __libc_message (do_abort=2, fmt=0x7f3a697708c0 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x00007f3a6968df4e in malloc_printerr (action=3, str=0x7f3a6976e9ae "free(): invalid pointer", ptr=<value optimized out>, ar_ptr=<value optimized out>) at malloc.c:6350
#4  0x00007f3a69690cad in _int_free (av=0x7f3a699a7e80, p=0x7f3a3c03e030, have_lock=0) at malloc.c:4836
#5  0x00007f3a6aca6b97 in dict_del (this=0x7f3a68264e14, key=0x7f3a34001534 "") at dict.c:528
#6  0x00007f3a5c825eb3 in __mountdict_remove (ms=0x7f3a58198f30) at mount3.c:213
#7  __mnt3svc_umountall (ms=0x7f3a58198f30) at mount3.c:2508
#8  0x00007f3a5c825f14 in mnt3svc_umountall (ms=0x7f3a58198f30) at mount3.c:2527
#9  0x00007f3a5c8267dc in mnt3svc_umntall (req=0x7f3a5c13c3d8) at mount3.c:2553
#10 0x00007f3a6acf4332 in synctask_wrap (old_task=<value optimized out>) at syncop.c:381
#11 0x00007f3a6965b8f0 in ?? () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()
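Note that both cores abort in the same place: the crash is not a fault inside
glibc itself, but glibc's heap-consistency checks rejecting the address that
dict_del() hands to free(). A minimal standalone demo (illustration only, not
the gluster code) that triggers the same "free(): invalid pointer" abort:

#include <stdlib.h>

int main(void)
{
    char *p = malloc(16);
    free(p + 1);   /* not an address malloc() returned -> glibc abort()s */
    return 0;
}

In other words, frames #0-#4 are just glibc reporting the problem; the memory
corruption originates somewhere in the mount3.c code path of frames #5-#9.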
Maybe related: it seems that Windows 7 and 2008 send UMNTALL requests:
https://bugzilla.redhat.com/show_bug.cgi?id=GLUSTER-1666

These requests are normally sent after a reboot (possibly an unclean one). A
few more details are in https://tools.ietf.org/html/rfc1813#section-5.2.4

Linux and pynfs do not implement UMNTALL, so the problem might only be
reproducible with Windows or another OS that sends it.
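For context, the MOUNT protocol definition in RFC 1813 (Appendix I) shows why
the server cannot tell from the request itself which exports to drop: UMNTALL
carries no arguments, so the only client identification available is the
source of the RPC:

program MOUNT_PROGRAM {
     version MOUNT_V3 {
        void      MOUNTPROC3_NULL(void)    = 0;
        mountres3 MOUNTPROC3_MNT(dirpath)  = 1;
        mountlist MOUNTPROC3_DUMP(void)    = 2;
        void      MOUNTPROC3_UMNT(dirpath) = 3;
        void      MOUNTPROC3_UMNTALL(void) = 4;
        exports   MOUNTPROC3_EXPORT(void)  = 5;
     } = 3;
} = 100005;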
Niels,

I tried playing with the Windows 2012 NFS client yesterday but could not
produce anything useful. Can you suggest a test or steps for a reproducer?

Thanks,
Bipin Kunal
Created attachment 1160478 [details]
1. fix UMNTALL behaviour, 2. remove mountdict

Completely untested patch that addresses the following two points:

1. fix UMNTALL to only UMNT the exports of the client calling the procedure
2. remove the duplication of structures in mountdict, use mountlist everywhere

I am confident that these two changes prevent the crashes that have been
seen. The mountdict was used to optimize lookups when many clients mount
exports; removing it may cause some performance drop while (un)mounting, but
I do not expect that to be critical.

The patch is only build-tested, but it shows the approach I'd like to take in
order to fix this problem.
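To illustrate the first point with a self-contained sketch (hypothetical
types and names, not the actual patch): UMNTALL should walk the mount list
and drop only the entries recorded for the calling host, so each entry is
unlinked and freed exactly once:

#include <stdlib.h>
#include <string.h>

struct mountentry {
    struct mountentry *next;
    char hostname[256];     /* client that sent the MNT request  */
    char exportname[256];   /* export the client mounted         */
};

/* Unlink and free only the caller's entries; other clients' entries
 * stay on the list untouched. */
static void
umntall_for_host(struct mountentry **list, const char *host)
{
    struct mountentry **pp = list;

    while (*pp != NULL) {
        if (strcmp((*pp)->hostname, host) == 0) {
            struct mountentry *victim = *pp;
            *pp = victim->next;   /* unlink */
            free(victim);         /* freed exactly once */
        } else {
            pp = &(*pp)->next;
        }
    }
}

int main(void)
{
    /* Entries would be added at MNT time; UMNTALL from one host then
     * leaves the other hosts' entries intact. Safe on an empty list. */
    struct mountentry *list = NULL;
    umntall_for_host(&list, "clientA");
    return 0;
}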
(In reply to Bipin Kunal from comment #24)
> I tried playing with the Windows 2012 NFS client yesterday but could not
> produce anything useful. Can you suggest a test or steps for a reproducer?

Sorry, I still have no idea how this can happen. It may require multiple
clients mounting at the same time while one client sends the UMNTALL
procedure. But without more details from the customers that hit this problem,
it is very difficult, if not impossible, to understand the cause.

Maybe one of the customers can capture a tcpdump on the NFS server for all
the MOUNT procedures? Depending on their configuration, these run on a
different port than the NFS traffic, so the tcpdump should not contain much
data, only the MNT, UMNT and UMNTALL procedures. Check with
'rpcinfo -p $SERVER' which port is used.

Provide the tcpdump and the nfs.log once the process has crashed again.
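Something along these lines should do. The port below is only an example:
use whatever rpcinfo reports for program 100005 (mountd) on the server;
38465 is merely gNFS' usual default.

# rpcinfo -p $SERVER | grep 100005
# tcpdump -i any -s 0 -w /tmp/mountd.pcap port 38465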
Created attachment 1246422 [details] tcpdump
Created attachment 1246423 [details] images from client
Okay... from the code, I see a potential issue with mountdict.

During gluster-nfs process initialization:

mnt3svc_init (xlator_t *nfsx)
{
        ...
        mstate->mountdict = dict_new ();
        ...
}

The reference taken on mountdict above appears to be dropped in
__mnt3svc_umountall():

__mnt3svc_umountall (struct mount3_state *ms)
{
        ...
        dict_unref (ms->mountdict);
        ...
}

So over the lifetime of a gNFS process, more than one UMNTALL request can
result in a double unref, and hence in accessing already-freed memory.
Ideally the dict_unref (ms->mountdict) should have been in mnt3svc_deinit()
IMO.

Based on the above, I am working with Riyas to check whether we can reproduce
the issue by trying multiple mount/umount or reboot scenarios using a Windows
client.
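A minimal model of that life-cycle bug (simplified stand-ins for gluster's
dict_new()/dict_unref(), not the real implementation): one reference is taken
at init, but every UMNTALL drops one, so a second UMNTALL touches freed
memory:

#include <stdlib.h>

struct dict { int refcount; };

static struct dict *dict_new(void)      /* returns with refcount == 1 */
{
    struct dict *d = calloc(1, sizeof(*d));
    d->refcount = 1;
    return d;
}

static void dict_unref(struct dict *d)  /* frees when refcount hits 0 */
{
    if (--d->refcount == 0)
        free(d);
}

int main(void)
{
    struct dict *mountdict = dict_new(); /* mnt3svc_init()                  */
    dict_unref(mountdict);               /* 1st UMNTALL: refcount 0, freed  */
    dict_unref(mountdict);               /* 2nd UMNTALL: use-after-free     */
    return 0;
}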
As per the current HOTFIX process document, setting NeedInfo to PM https://mojo.redhat.com/docs/DOC-1037888
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/97861/
The hotfix build provided for this issue has been verified as mentioned in
comment #107. The build can be provided to the customers.
The __mnt3svc_umountall() crash is no longer seen while doing gNFS mount and
UMNTALL operations. Verified the fix on build
glusterfs-server-3.8.4-15.el7rhgs.x86_64.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html