Bug 1697971 - Segfault in FUSE process, potential use after free
Summary: Segfault in FUSE process, potential use after free
Keywords:
Status: NEW
Alias: None
Product: GlusterFS
Classification: Community
Component: open-behind
Version: 6
Hardware: x86_64
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ---
Assignee: Csaba Henk
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-09 11:27 UTC by manschwetus
Modified: 2019-08-31 18:01 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments
backtrace and threadlist (6.10 KB, text/plain)
2019-04-09 11:27 UTC, manschwetus
corefile (older; not really intended to be uploaded) (16.40 MB, application/x-7z-compressed)
2019-04-24 10:50 UTC, manschwetus
corefile (1st) (17.06 MB, application/x-7z-compressed)
2019-04-24 10:56 UTC, manschwetus
corefile (2nd) (13.33 MB, application/x-7z-compressed)
2019-04-24 10:57 UTC, manschwetus
corefile (3rd) (14.12 MB, application/x-7z-compressed)
2019-04-24 10:58 UTC, manschwetus
swarm-volumes log from the time of the crash (582.04 KB, text/plain)
2019-04-24 11:02 UTC, manschwetus

Description manschwetus 2019-04-09 11:27:47 UTC
Created attachment 1553829 [details]
backtrace and threadlist

Description of problem:
We face regular segfaults of the FUSE process when running the GitLab Docker container with its data placed on a 1x3 GlusterFS volume; all nodes running the container are also Gluster servers.

Version-Release number of selected component (if applicable):
We use the latest packages available for Ubuntu LTS:

# cat /etc/apt/sources.list.d/gluster-ubuntu-glusterfs-6-bionic.list
deb http://ppa.launchpad.net/gluster/glusterfs-6/ubuntu bionic main
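
(For reference, the exact installed version on the nodes can be checked with, for example:)

# glusterfs --version
# apt list glusterfs-*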


How reproducible:
It happens regularly, at least daily, but we aren't aware of a specific action/activity.

Steps to Reproduce:
1. setup docker swarm backed with glusterfs and run gitlab on it
2. ???
3. ???

Actual results:
Regular crashes of the FUSE process, cutting off the persistent storage used by the dockerized GitLab

Expected results:
Stable availability of glusterfs

Additional info:
We preserve the produced corefiles; I'll attach the backtrace and thread list of the one from today.
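
(For reference, a backtrace and thread list like the attached one can be regenerated from a preserved core roughly as follows; the core path is only a placeholder and the glusterfs-dbg package should be installed to get usable symbols. "bt" prints the crashing thread's backtrace, "info threads" the thread list, and "thread apply all bt" the backtraces of all threads.)

$ gdb /usr/sbin/glusterfs /path/to/core.glusterfs.<pid>
(gdb) bt
(gdb) info threads
(gdb) thread apply all bt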

Comment 1 manschwetus 2019-04-09 11:31:55 UTC
Volume info:
$ sudo gluster  volume info
 
Volume Name: swarm-vols
Type: Replicate
Volume ID: a103c1da-d651-4d65-8f86-a8731e2a670c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.1.81:/gluster/data
Brick2: 192.168.1.86:/gluster/data
Brick3: 192.168.1.85:/gluster/data
Options Reconfigured:
performance.write-behind: off
performance.cache-max-file-size: 1GB
performance.cache-size: 2GB
nfs.disable: on
transport.address-family: inet
auth.allow: 127.0.0.1
$ sudo gluster  volume status
Status of volume: swarm-vols
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.1.81:/gluster/data            49153     0          Y       188805
Brick 192.168.1.86:/gluster/data            49153     0          Y       5543 
Brick 192.168.1.85:/gluster/data            49152     0          Y       2069 
Self-heal Daemon on localhost               N/A       N/A        Y       2080 
Self-heal Daemon on 192.168.1.81            N/A       N/A        Y       144760
Self-heal Daemon on 192.168.1.86            N/A       N/A        Y       63715
 
Task Status of Volume swarm-vols
------------------------------------------------------------------------------
There are no active volume tasks
$ mount |  grep swarm
localhost:/swarm-vols on /swarm/volumes type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072,_netdev)

Comment 2 manschwetus 2019-04-12 13:41:59 UTC
Today I encountered another segfault, same stack.

Comment 3 manschwetus 2019-04-15 06:48:14 UTC
I just checked another corefile from a second system: same stack. This issue is getting quite annoying.
Could someone please look into it.

Comment 4 manschwetus 2019-04-23 08:03:21 UTC
The problem persists with 6.1; the backtrace has changed a bit:

Crashdump1:
#0  0x00007fc3dcaa27f0 in ?? () from /lib/x86_64-linux-gnu/libuuid.so.1
#1  0x00007fc3dcaa2874 in ?? () from /lib/x86_64-linux-gnu/libuuid.so.1
#2  0x00007fc3ddb5cdcc in gf_uuid_unparse (out=0x7fc3c8005580 "c27a90a6-e68b-4b0b-af56-002ea7bf1fb4", uuid=0x8 <error: Cannot access memory at address 0x8>) at ./glusterfs/compat-uuid.h:55
#3  uuid_utoa (uuid=uuid@entry=0x8 <error: Cannot access memory at address 0x8>) at common-utils.c:2777
#4  0x00007fc3d688c529 in ioc_open_cbk (frame=0x7fc3a8b56208, cookie=<optimized out>, this=0x7fc3d001eb80, op_ret=0, op_errno=117, fd=0x7fc3c7678418, xdata=0x0) at io-cache.c:646
#5  0x00007fc3d6cb09b1 in ra_open_cbk (frame=0x7fc3a8b5b698, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, fd=0x7fc3c7678418, xdata=0x0) at read-ahead.c:99
#6  0x00007fc3d71b10b3 in afr_open_cbk (frame=0x7fc3a8b67d48, cookie=0x0, this=<optimized out>, op_ret=0, op_errno=0, fd=0x7fc3c7678418, xdata=0x0) at afr-open.c:97
#7  0x00007fc3d747c5f8 in client4_0_open_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7fc3a8b58d18) at client-rpc-fops_v2.c:284
#8  0x00007fc3dd9013d1 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fc3d0057dd0, pollin=pollin@entry=0x7fc386e7e2b0) at rpc-clnt.c:755
#9  0x00007fc3dd901773 in rpc_clnt_notify (trans=0x7fc3d0058090, mydata=0x7fc3d0057e00, event=<optimized out>, data=0x7fc386e7e2b0) at rpc-clnt.c:922
#10 0x00007fc3dd8fe273 in rpc_transport_notify (this=this@entry=0x7fc3d0058090, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=<optimized out>) at rpc-transport.c:542
#11 0x00007fc3d8333474 in socket_event_poll_in (notify_handled=true, this=0x7fc3d0058090) at socket.c:2522
#12 socket_event_handler (fd=fd@entry=11, idx=idx@entry=2, gen=gen@entry=4, data=data@entry=0x7fc3d0058090, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>, event_thread_died=0 '\000') at socket.c:2924
#13 0x00007fc3ddbb2863 in event_dispatch_epoll_handler (event=0x7fc3cfffee54, event_pool=0x5570195807b0) at event-epoll.c:648
#14 event_dispatch_epoll_worker (data=0x5570195c5a80) at event-epoll.c:761
#15 0x00007fc3dd2bc6db in start_thread (arg=0x7fc3cffff700) at pthread_create.c:463
#16 0x00007fc3dcfe588f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Crashdump2:
#0  0x00007fdd13fd97f0 in ?? () from /lib/x86_64-linux-gnu/libuuid.so.1
#1  0x00007fdd13fd9874 in ?? () from /lib/x86_64-linux-gnu/libuuid.so.1
#2  0x00007fdd15093dcc in gf_uuid_unparse (out=0x7fdd0805a2d0 "1f739bdc-f7c0-4133-84cc-554eb594ae81", uuid=0x8 <error: Cannot access memory at address 0x8>) at ./glusterfs/compat-uuid.h:55
#3  uuid_utoa (uuid=uuid@entry=0x8 <error: Cannot access memory at address 0x8>) at common-utils.c:2777
#4  0x00007fdd0d5c2529 in ioc_open_cbk (frame=0x7fdce44a9f88, cookie=<optimized out>, this=0x7fdd0801eb80, op_ret=0, op_errno=117, fd=0x7fdcf1ad9b78, xdata=0x0) at io-cache.c:646
#5  0x00007fdd0d9e69b1 in ra_open_cbk (frame=0x7fdce44d2a78, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, fd=0x7fdcf1ad9b78, xdata=0x0) at read-ahead.c:99
#6  0x00007fdd0dee70b3 in afr_open_cbk (frame=0x7fdce44a80a8, cookie=0x1, this=<optimized out>, op_ret=0, op_errno=0, fd=0x7fdcf1ad9b78, xdata=0x0) at afr-open.c:97
#7  0x00007fdd0e1b25f8 in client4_0_open_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7fdce4462528) at client-rpc-fops_v2.c:284
#8  0x00007fdd14e383d1 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fdd08054490, pollin=pollin@entry=0x7fdc942904d0) at rpc-clnt.c:755
#9  0x00007fdd14e38773 in rpc_clnt_notify (trans=0x7fdd08054750, mydata=0x7fdd080544c0, event=<optimized out>, data=0x7fdc942904d0) at rpc-clnt.c:922
#10 0x00007fdd14e35273 in rpc_transport_notify (this=this@entry=0x7fdd08054750, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=<optimized out>) at rpc-transport.c:542
#11 0x00007fdd0f86a474 in socket_event_poll_in (notify_handled=true, this=0x7fdd08054750) at socket.c:2522
#12 socket_event_handler (fd=fd@entry=10, idx=idx@entry=4, gen=gen@entry=4, data=data@entry=0x7fdd08054750, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>, event_thread_died=0 '\000') at socket.c:2924
#13 0x00007fdd150e9863 in event_dispatch_epoll_handler (event=0x7fdd0f3e0e54, event_pool=0x55cf9c9277b0) at event-epoll.c:648
#14 event_dispatch_epoll_worker (data=0x55cf9c96ca20) at event-epoll.c:761
#15 0x00007fdd147f36db in start_thread (arg=0x7fdd0f3e1700) at pthread_create.c:463
#16 0x00007fdd1451c88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95


Please tell me if you need further information.

Comment 5 manschwetus 2019-04-24 10:27:43 UTC
Got another set of cores overnight, one for each system in my Gluster setup:

Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=localhost --volfile-id'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
65      ../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x7faee404c700 (LWP 28792))]
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
#1  0x00007faee544e4b5 in ob_fd_free (ob_fd=0x7faebc054df0) at open-behind.c:198
#2  0x00007faee544edd6 in ob_inode_wake (this=this@entry=0x7faed8020d20, ob_fds=ob_fds@entry=0x7faee404bc90) at open-behind.c:355
#3  0x00007faee544f062 in open_all_pending_fds_and_resume (this=this@entry=0x7faed8020d20, inode=0x7faed037cf08, stub=0x7faebc008858) at open-behind.c:442
#4  0x00007faee544f4ff in ob_rename (frame=frame@entry=0x7faebc1ceae8, this=this@entry=0x7faed8020d20, src=src@entry=0x7faed03d9a70, dst=dst@entry=0x7faed03d9ab0, xdata=xdata@entry=0x0) at open-behind.c:1035
#5  0x00007faeed1b0ad0 in default_rename (frame=frame@entry=0x7faebc1ceae8, this=<optimized out>, oldloc=oldloc@entry=0x7faed03d9a70, newloc=newloc@entry=0x7faed03d9ab0, xdata=xdata@entry=0x0) at defaults.c:2631
#6  0x00007faee501f798 in mdc_rename (frame=frame@entry=0x7faebc210468, this=0x7faed80247d0, oldloc=oldloc@entry=0x7faed03d9a70, newloc=newloc@entry=0x7faed03d9ab0, xdata=xdata@entry=0x0) at md-cache.c:1852
#7  0x00007faeed1c6936 in default_rename_resume (frame=0x7faed02d2318, this=0x7faed8026430, oldloc=0x7faed03d9a70, newloc=0x7faed03d9ab0, xdata=0x0) at defaults.c:1897
#8  0x00007faeed14cc45 in call_resume (stub=0x7faed03d9a28) at call-stub.c:2555
#9  0x00007faee4e10cd8 in iot_worker (data=0x7faed8034780) at io-threads.c:232
#10 0x00007faeec88f6db in start_thread (arg=0x7faee404c700) at pthread_create.c:463
#11 0x00007faeec5b888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95



Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=localhost --volfile-id'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
65      ../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x7f7944069700 (LWP 24067))]
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
#1  0x00007f794682c4b5 in ob_fd_free (ob_fd=0x7f79283e94e0) at open-behind.c:198
#2  0x00007f794682cdd6 in ob_inode_wake (this=this@entry=0x7f793801eee0, ob_fds=ob_fds@entry=0x7f7944068c90) at open-behind.c:355
#3  0x00007f794682d062 in open_all_pending_fds_and_resume (this=this@entry=0x7f793801eee0, inode=0x7f79301de788, stub=0x7f7928004578) at open-behind.c:442
#4  0x00007f794682d4ff in ob_rename (frame=frame@entry=0x7f79280ab2b8, this=this@entry=0x7f793801eee0, src=src@entry=0x7f7930558ea0, dst=dst@entry=0x7f7930558ee0, xdata=xdata@entry=0x0) at open-behind.c:1035
#5  0x00007f794e729ad0 in default_rename (frame=frame@entry=0x7f79280ab2b8, this=<optimized out>, oldloc=oldloc@entry=0x7f7930558ea0, newloc=newloc@entry=0x7f7930558ee0, xdata=xdata@entry=0x0) at defaults.c:2631
#6  0x00007f79463fd798 in mdc_rename (frame=frame@entry=0x7f7928363ae8, this=0x7f7938022990, oldloc=oldloc@entry=0x7f7930558ea0, newloc=newloc@entry=0x7f7930558ee0, xdata=xdata@entry=0x0) at md-cache.c:1852
#7  0x00007f794e73f936 in default_rename_resume (frame=0x7f7930298c28, this=0x7f79380245f0, oldloc=0x7f7930558ea0, newloc=0x7f7930558ee0, xdata=0x0) at defaults.c:1897
#8  0x00007f794e6c5c45 in call_resume (stub=0x7f7930558e58) at call-stub.c:2555
#9  0x00007f79461eecd8 in iot_worker (data=0x7f7938032940) at io-threads.c:232
#10 0x00007f794de086db in start_thread (arg=0x7f7944069700) at pthread_create.c:463
#11 0x00007f794db3188f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95



Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=localhost --volfile-id'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
65      ../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x7f736ce78700 (LWP 42526))]
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
#1  0x00007f73733499f7 in fd_unref (fd=0x7f73544165a8) at fd.c:515
#2  0x00007f736c3fb618 in client_local_wipe (local=local@entry=0x7f73441ff8c8) at client-helpers.c:124
#3  0x00007f736c44a60a in client4_0_open_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f7344038b48) at client-rpc-fops_v2.c:284
#4  0x00007f73730d03d1 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f7368054490, pollin=pollin@entry=0x7f73601b2730) at rpc-clnt.c:755
#5  0x00007f73730d0773 in rpc_clnt_notify (trans=0x7f7368054750, mydata=0x7f73680544c0, event=<optimized out>, data=0x7f73601b2730) at rpc-clnt.c:922
#6  0x00007f73730cd273 in rpc_transport_notify (this=this@entry=0x7f7368054750, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=<optimized out>) at rpc-transport.c:542
#7  0x00007f736db02474 in socket_event_poll_in (notify_handled=true, this=0x7f7368054750) at socket.c:2522
#8  socket_event_handler (fd=fd@entry=10, idx=idx@entry=4, gen=gen@entry=1, data=data@entry=0x7f7368054750, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>, event_thread_died=0 '\000') at socket.c:2924
#9  0x00007f7373381863 in event_dispatch_epoll_handler (event=0x7f736ce77e54, event_pool=0x55621c2917b0) at event-epoll.c:648
#10 event_dispatch_epoll_worker (data=0x55621c2d6a80) at event-epoll.c:761
#11 0x00007f7372a8b6db in start_thread (arg=0x7f736ce78700) at pthread_create.c:463
#12 0x00007f73727b488f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Comment 6 Xavi Hernandez 2019-04-24 10:38:59 UTC
Can you share these core dumps and reference the exact version of Gluster and Ubuntu used for each one?

As per your comments, I understand that you are not doing anything special when this happens, right? No volume reconfiguration or any other management task.

Mount logs would also be helpful.

Comment 7 manschwetus 2019-04-24 10:46:40 UTC
Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-47-generic x86_64)

administrator@ubuntu-docker:~$ apt list glusterfs-*
Listing... Done
glusterfs-client/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed,automatic]
glusterfs-common/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed,automatic]
glusterfs-dbg/bionic 6.1-ubuntu1~bionic1 amd64
glusterfs-server/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed]

Mount logs? Not sure what you are referring to.

Comment 8 Xavi Hernandez 2019-04-24 10:50:13 UTC
(In reply to manschwetus from comment #7)
> Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-47-generic x86_64)
> 
> administrator@ubuntu-docker:~$ apt list glusterfs-*
> Listing... Done
> glusterfs-client/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed,automatic]
> glusterfs-common/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed,automatic]
> glusterfs-dbg/bionic 6.1-ubuntu1~bionic1 amd64
> glusterfs-server/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed]
> 
> Mount logs? Not sure what you are referring to.

On the server where Gluster crashed, you should find a file like /var/log/glusterfs/swarm-volumes.log. I would also need the coredump files and the Ubuntu version.
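
(For example, something along these lines would package them for attachment; paths and names are only illustrative:)

    # cp /var/log/glusterfs/swarm-volumes.log swarm-volumes-$(hostname).log
    # 7z a core-$(hostname).7z /path/to/the/corefile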

Comment 9 manschwetus 2019-04-24 10:50:43 UTC
Created attachment 1558139 [details]
corefile (older; not really intended to be uploaded)

Comment 10 manschwetus 2019-04-24 10:56:25 UTC
Created attachment 1558142 [details]
corefile (1st)

Comment 11 manschwetus 2019-04-24 10:57:10 UTC
Created attachment 1558143 [details]
corefile (2nd)

Comment 12 manschwetus 2019-04-24 10:58:02 UTC
Created attachment 1558144 [details]
corefile (3rd)

Comment 13 manschwetus 2019-04-24 11:00:05 UTC
administrator@ubuntu-docker:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.2 LTS
Release:        18.04
Codename:       bionic


Same for the others

Comment 14 manschwetus 2019-04-24 11:02:44 UTC
Created attachment 1558157 [details]
swarm-volumes log from the time of the crash

Comment 15 Xavi Hernandez 2019-04-24 13:22:50 UTC
I'm still analyzing the core dumps, but as a first test, could you disable the open-behind feature using the following command?

    # gluster volume set <volname> open-behind off
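
Afterwards you can confirm that the option took effect with something like:

    # gluster volume get <volname> performance.open-behind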

Comment 16 manschwetus 2019-04-24 13:43:34 UTC
OK, I modified it; let's see if it helps. Hopefully it won't kill the performance.

sudo gluster volume info swarm-vols

Volume Name: swarm-vols
Type: Replicate
Volume ID: a103c1da-d651-4d65-8f86-a8731e2a670c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.1.81:/gluster/data
Brick2: 192.168.1.86:/gluster/data
Brick3: 192.168.1.85:/gluster/data
Options Reconfigured:
performance.open-behind: off
performance.cache-size: 1GB
cluster.self-heal-daemon: enable
performance.write-behind: off
auth.allow: 127.0.0.1
transport.address-family: inet
nfs.disable: on
performance.cache-max-file-size: 1GB

Comment 17 Xavi Hernandez 2019-04-25 10:33:31 UTC
Right now the issue seems related to open-behind, so changing the component accordingly.

Comment 18 manschwetus 2019-04-30 07:42:31 UTC
OK, as we haven't had a crash since we disabled open-behind, it seems valid to say that disabling open-behind bypasses the issue.
As I disabled write-behind due to another defect report, is it worth re-enabling it, or would you expect problems with it?

Comment 19 Amar Tumballi 2019-05-01 12:56:35 UTC
Yes, please; you can re-enable write-behind if you have upgraded to the glusterfs-5.5 or glusterfs-6.x releases.
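
For example, mirroring how it was disabled on this volume:

    # gluster volume set swarm-vols performance.write-behind on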

Comment 20 Xavi Hernandez 2019-05-02 17:10:34 UTC
From my debugging I think the issue is related to a missing fd_ref() when ob_open_behind() is used. This could potentially cause a race where the same fd is being unref'ed (refcount dropping to 0) and ref'ed at the same time to handle some open_and_resume() requests. I have not yet identified the exact sequence of operations that causes the problem, though. Once I confirm that the problem really comes from here, I'll investigate further.
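
For anyone else digging into the preserved cores, a rough way to inspect this in the ones that crash inside ob_fd_free (the first two from comment 5) is something like the following; the struct member names are assumptions based on the open-behind and fd code and may differ between versions, so treat it only as a sketch:

    $ gdb /usr/sbin/glusterfs /path/to/core
    (gdb) frame 1
    (gdb) print *ob_fd
    (gdb) print *ob_fd->fd

If the fd had already been released at that point (its refcount having dropped to 0), the last print should show a freed or garbage fd_t, which would fit the missing fd_ref() theory.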

Comment 21 Amar Tumballi 2019-05-27 15:37:53 UTC
Marking it back to NEW, as the assignee is still bugs@gluster.org

