Bug 1810934 - Segfault in FUSE process, potential use after free
Summary: Segfault in FUSE process, potential use after free
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: open-behind
Version: mainline
Hardware: x86_64
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ---
Assignee: Xavi Hernandez
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1697971
 
Reported: 2020-03-06 09:16 UTC by Xavi Hernandez
Modified: 2020-08-18 11:35 UTC (History)
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1697971
Environment:
Last Closed: 2020-03-12 13:32:38 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments


Links
System ID: Gluster.org Gerrit 24204 | Private: 0 | Priority: None | Status: Merged | Summary: open-behind: fix missing fd reference | Last Updated: 2020-03-17 04:24:02 UTC

Description Xavi Hernandez 2020-03-06 09:16:57 UTC
+++ This bug was initially created as a clone of Bug #1697971 +++

Description of problem:
We face regular segfaults of the FUSE process when running the GitLab docker container with its data placed on a 1x3 GlusterFS volume; all nodes running the container are also Gluster servers.

Version-Release number of selected component (if applicable):
We use the latest packages available for Ubuntu LTS:

# cat /etc/apt/sources.list.d/gluster-ubuntu-glusterfs-6-bionic.list
deb http://ppa.launchpad.net/gluster/glusterfs-6/ubuntu bionic main


How reproducible:
It happens regularly, at least daily, but we aren't aware of a specific action/activity.

Steps to Reproduce:
1. Set up a Docker Swarm cluster backed by GlusterFS and run GitLab on it
2. ???
3. ???

Actual results:
Regular crashes of FUSE, cutting off the persistent storage used for the dockerized GitLab

Expected results:
Stable availability of glusterfs

Additional info:
We preserve the produced corefiles; I'll attach the backtrace and thread list of the one from today.

--- Additional comment from  on 2019-04-09 13:31:55 CEST ---

Volume info:
$ sudo gluster  volume info
 
Volume Name: swarm-vols
Type: Replicate
Volume ID: a103c1da-d651-4d65-8f86-a8731e2a670c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.1.81:/gluster/data
Brick2: 192.168.1.86:/gluster/data
Brick3: 192.168.1.85:/gluster/data
Options Reconfigured:
performance.write-behind: off
performance.cache-max-file-size: 1GB
performance.cache-size: 2GB
nfs.disable: on
transport.address-family: inet
auth.allow: 127.0.0.1
$ sudo gluster  volume status
Status of volume: swarm-vols
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.1.81:/gluster/data            49153     0          Y       188805
Brick 192.168.1.86:/gluster/data            49153     0          Y       5543 
Brick 192.168.1.85:/gluster/data            49152     0          Y       2069 
Self-heal Daemon on localhost               N/A       N/A        Y       2080 
Self-heal Daemon on 192.168.1.81            N/A       N/A        Y       144760
Self-heal Daemon on 192.168.1.86            N/A       N/A        Y       63715
 
Task Status of Volume swarm-vols
------------------------------------------------------------------------------
There are no active volume tasks
$ mount |  grep swarm
localhost:/swarm-vols on /swarm/volumes type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072,_netdev)

--- Additional comment from  on 2019-04-12 15:41:59 CEST ---

Today I encountered another segfault, same stack.

--- Additional comment from  on 2019-04-15 08:48:14 CEST ---

I just checked another corefile from a second system, same stack; this issue is getting quite annoying.
Could someone please look into it?

--- Additional comment from  on 2019-04-23 10:03:21 CEST ---

Problem persists with 6.1; the backtrace has changed a bit:

Crashdump1:
#0  0x00007fc3dcaa27f0 in ?? () from /lib/x86_64-linux-gnu/libuuid.so.1
#1  0x00007fc3dcaa2874 in ?? () from /lib/x86_64-linux-gnu/libuuid.so.1
#2  0x00007fc3ddb5cdcc in gf_uuid_unparse (out=0x7fc3c8005580 "c27a90a6-e68b-4b0b-af56-002ea7bf1fb4", uuid=0x8 <error: Cannot access memory at address 0x8>) at ./glusterfs/compat-uuid.h:55
#3  uuid_utoa (uuid=uuid@entry=0x8 <error: Cannot access memory at address 0x8>) at common-utils.c:2777
#4  0x00007fc3d688c529 in ioc_open_cbk (frame=0x7fc3a8b56208, cookie=<optimized out>, this=0x7fc3d001eb80, op_ret=0, op_errno=117, fd=0x7fc3c7678418, xdata=0x0) at io-cache.c:646
#5  0x00007fc3d6cb09b1 in ra_open_cbk (frame=0x7fc3a8b5b698, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, fd=0x7fc3c7678418, xdata=0x0) at read-ahead.c:99
#6  0x00007fc3d71b10b3 in afr_open_cbk (frame=0x7fc3a8b67d48, cookie=0x0, this=<optimized out>, op_ret=0, op_errno=0, fd=0x7fc3c7678418, xdata=0x0) at afr-open.c:97
#7  0x00007fc3d747c5f8 in client4_0_open_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7fc3a8b58d18) at client-rpc-fops_v2.c:284
#8  0x00007fc3dd9013d1 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fc3d0057dd0, pollin=pollin@entry=0x7fc386e7e2b0) at rpc-clnt.c:755
#9  0x00007fc3dd901773 in rpc_clnt_notify (trans=0x7fc3d0058090, mydata=0x7fc3d0057e00, event=<optimized out>, data=0x7fc386e7e2b0) at rpc-clnt.c:922
#10 0x00007fc3dd8fe273 in rpc_transport_notify (this=this@entry=0x7fc3d0058090, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=<optimized out>) at rpc-transport.c:542
#11 0x00007fc3d8333474 in socket_event_poll_in (notify_handled=true, this=0x7fc3d0058090) at socket.c:2522
#12 socket_event_handler (fd=fd@entry=11, idx=idx@entry=2, gen=gen@entry=4, data=data@entry=0x7fc3d0058090, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>, event_thread_died=0 '\000') at socket.c:2924
#13 0x00007fc3ddbb2863 in event_dispatch_epoll_handler (event=0x7fc3cfffee54, event_pool=0x5570195807b0) at event-epoll.c:648
#14 event_dispatch_epoll_worker (data=0x5570195c5a80) at event-epoll.c:761
#15 0x00007fc3dd2bc6db in start_thread (arg=0x7fc3cffff700) at pthread_create.c:463
#16 0x00007fc3dcfe588f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Crashdump2:
#0  0x00007fdd13fd97f0 in ?? () from /lib/x86_64-linux-gnu/libuuid.so.1
#1  0x00007fdd13fd9874 in ?? () from /lib/x86_64-linux-gnu/libuuid.so.1
#2  0x00007fdd15093dcc in gf_uuid_unparse (out=0x7fdd0805a2d0 "1f739bdc-f7c0-4133-84cc-554eb594ae81", uuid=0x8 <error: Cannot access memory at address 0x8>) at ./glusterfs/compat-uuid.h:55
#3  uuid_utoa (uuid=uuid@entry=0x8 <error: Cannot access memory at address 0x8>) at common-utils.c:2777
#4  0x00007fdd0d5c2529 in ioc_open_cbk (frame=0x7fdce44a9f88, cookie=<optimized out>, this=0x7fdd0801eb80, op_ret=0, op_errno=117, fd=0x7fdcf1ad9b78, xdata=0x0) at io-cache.c:646
#5  0x00007fdd0d9e69b1 in ra_open_cbk (frame=0x7fdce44d2a78, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, fd=0x7fdcf1ad9b78, xdata=0x0) at read-ahead.c:99
#6  0x00007fdd0dee70b3 in afr_open_cbk (frame=0x7fdce44a80a8, cookie=0x1, this=<optimized out>, op_ret=0, op_errno=0, fd=0x7fdcf1ad9b78, xdata=0x0) at afr-open.c:97
#7  0x00007fdd0e1b25f8 in client4_0_open_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7fdce4462528) at client-rpc-fops_v2.c:284
#8  0x00007fdd14e383d1 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fdd08054490, pollin=pollin@entry=0x7fdc942904d0) at rpc-clnt.c:755
#9  0x00007fdd14e38773 in rpc_clnt_notify (trans=0x7fdd08054750, mydata=0x7fdd080544c0, event=<optimized out>, data=0x7fdc942904d0) at rpc-clnt.c:922
#10 0x00007fdd14e35273 in rpc_transport_notify (this=this@entry=0x7fdd08054750, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=<optimized out>) at rpc-transport.c:542
#11 0x00007fdd0f86a474 in socket_event_poll_in (notify_handled=true, this=0x7fdd08054750) at socket.c:2522
#12 socket_event_handler (fd=fd@entry=10, idx=idx@entry=4, gen=gen@entry=4, data=data@entry=0x7fdd08054750, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>, event_thread_died=0 '\000') at socket.c:2924
#13 0x00007fdd150e9863 in event_dispatch_epoll_handler (event=0x7fdd0f3e0e54, event_pool=0x55cf9c9277b0) at event-epoll.c:648
#14 event_dispatch_epoll_worker (data=0x55cf9c96ca20) at event-epoll.c:761
#15 0x00007fdd147f36db in start_thread (arg=0x7fdd0f3e1700) at pthread_create.c:463
#16 0x00007fdd1451c88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95


Please tell me if you need further information.

--- Additional comment from  on 2019-04-24 12:27:43 CEST ---

Got another set of cores overnight, one for each system in my Gluster setup:

Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=localhost --volfile-id'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
65      ../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x7faee404c700 (LWP 28792))]
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
#1  0x00007faee544e4b5 in ob_fd_free (ob_fd=0x7faebc054df0) at open-behind.c:198
#2  0x00007faee544edd6 in ob_inode_wake (this=this@entry=0x7faed8020d20, ob_fds=ob_fds@entry=0x7faee404bc90) at open-behind.c:355
#3  0x00007faee544f062 in open_all_pending_fds_and_resume (this=this@entry=0x7faed8020d20, inode=0x7faed037cf08, stub=0x7faebc008858) at open-behind.c:442
#4  0x00007faee544f4ff in ob_rename (frame=frame@entry=0x7faebc1ceae8, this=this@entry=0x7faed8020d20, src=src@entry=0x7faed03d9a70, dst=dst@entry=0x7faed03d9ab0, xdata=xdata@entry=0x0) at open-behind.c:1035
#5  0x00007faeed1b0ad0 in default_rename (frame=frame@entry=0x7faebc1ceae8, this=<optimized out>, oldloc=oldloc@entry=0x7faed03d9a70, newloc=newloc@entry=0x7faed03d9ab0, xdata=xdata@entry=0x0) at defaults.c:2631
#6  0x00007faee501f798 in mdc_rename (frame=frame@entry=0x7faebc210468, this=0x7faed80247d0, oldloc=oldloc@entry=0x7faed03d9a70, newloc=newloc@entry=0x7faed03d9ab0, xdata=xdata@entry=0x0) at md-cache.c:1852
#7  0x00007faeed1c6936 in default_rename_resume (frame=0x7faed02d2318, this=0x7faed8026430, oldloc=0x7faed03d9a70, newloc=0x7faed03d9ab0, xdata=0x0) at defaults.c:1897
#8  0x00007faeed14cc45 in call_resume (stub=0x7faed03d9a28) at call-stub.c:2555
#9  0x00007faee4e10cd8 in iot_worker (data=0x7faed8034780) at io-threads.c:232
#10 0x00007faeec88f6db in start_thread (arg=0x7faee404c700) at pthread_create.c:463
#11 0x00007faeec5b888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95



Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=localhost --volfile-id'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
65      ../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x7f7944069700 (LWP 24067))]
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
#1  0x00007f794682c4b5 in ob_fd_free (ob_fd=0x7f79283e94e0) at open-behind.c:198
#2  0x00007f794682cdd6 in ob_inode_wake (this=this@entry=0x7f793801eee0, ob_fds=ob_fds@entry=0x7f7944068c90) at open-behind.c:355
#3  0x00007f794682d062 in open_all_pending_fds_and_resume (this=this@entry=0x7f793801eee0, inode=0x7f79301de788, stub=0x7f7928004578) at open-behind.c:442
#4  0x00007f794682d4ff in ob_rename (frame=frame@entry=0x7f79280ab2b8, this=this@entry=0x7f793801eee0, src=src@entry=0x7f7930558ea0, dst=dst@entry=0x7f7930558ee0, xdata=xdata@entry=0x0) at open-behind.c:1035
#5  0x00007f794e729ad0 in default_rename (frame=frame@entry=0x7f79280ab2b8, this=<optimized out>, oldloc=oldloc@entry=0x7f7930558ea0, newloc=newloc@entry=0x7f7930558ee0, xdata=xdata@entry=0x0) at defaults.c:2631
#6  0x00007f79463fd798 in mdc_rename (frame=frame@entry=0x7f7928363ae8, this=0x7f7938022990, oldloc=oldloc@entry=0x7f7930558ea0, newloc=newloc@entry=0x7f7930558ee0, xdata=xdata@entry=0x0) at md-cache.c:1852
#7  0x00007f794e73f936 in default_rename_resume (frame=0x7f7930298c28, this=0x7f79380245f0, oldloc=0x7f7930558ea0, newloc=0x7f7930558ee0, xdata=0x0) at defaults.c:1897
#8  0x00007f794e6c5c45 in call_resume (stub=0x7f7930558e58) at call-stub.c:2555
#9  0x00007f79461eecd8 in iot_worker (data=0x7f7938032940) at io-threads.c:232
#10 0x00007f794de086db in start_thread (arg=0x7f7944069700) at pthread_create.c:463
#11 0x00007f794db3188f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95



Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=localhost --volfile-id'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
65      ../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x7f736ce78700 (LWP 42526))]
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
#1  0x00007f73733499f7 in fd_unref (fd=0x7f73544165a8) at fd.c:515
#2  0x00007f736c3fb618 in client_local_wipe (local=local@entry=0x7f73441ff8c8) at client-helpers.c:124
#3  0x00007f736c44a60a in client4_0_open_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f7344038b48) at client-rpc-fops_v2.c:284
#4  0x00007f73730d03d1 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f7368054490, pollin=pollin@entry=0x7f73601b2730) at rpc-clnt.c:755
#5  0x00007f73730d0773 in rpc_clnt_notify (trans=0x7f7368054750, mydata=0x7f73680544c0, event=<optimized out>, data=0x7f73601b2730) at rpc-clnt.c:922
#6  0x00007f73730cd273 in rpc_transport_notify (this=this@entry=0x7f7368054750, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=<optimized out>) at rpc-transport.c:542
#7  0x00007f736db02474 in socket_event_poll_in (notify_handled=true, this=0x7f7368054750) at socket.c:2522
#8  socket_event_handler (fd=fd@entry=10, idx=idx@entry=4, gen=gen@entry=1, data=data@entry=0x7f7368054750, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>, event_thread_died=0 '\000') at socket.c:2924
#9  0x00007f7373381863 in event_dispatch_epoll_handler (event=0x7f736ce77e54, event_pool=0x55621c2917b0) at event-epoll.c:648
#10 event_dispatch_epoll_worker (data=0x55621c2d6a80) at event-epoll.c:761
#11 0x00007f7372a8b6db in start_thread (arg=0x7f736ce78700) at pthread_create.c:463
#12 0x00007f73727b488f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

--- Additional comment from Xavi Hernandez on 2019-04-24 12:38:59 CEST ---

Can you share these core dumps and reference the exact version of Gluster and Ubuntu used for each one?

As per your comments, I understand that you are not doing anything special when this happens, right? No volume reconfiguration or any other management task.

Mount logs would also be helpful.

--- Additional comment from  on 2019-04-24 12:46:40 CEST ---

Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-47-generic x86_64)

administrator@ubuntu-docker:~$ apt list glusterfs-*
Listing... Done
glusterfs-client/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed,automatic]
glusterfs-common/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed,automatic]
glusterfs-dbg/bionic 6.1-ubuntu1~bionic1 amd64
glusterfs-server/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed]

Mount logs? I'm not sure what you are referring to.

--- Additional comment from Xavi Hernandez on 2019-04-24 12:50:13 CEST ---

(In reply to manschwetus from comment #7)
> Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-47-generic x86_64)
> 
> administrator@ubuntu-docker:~$ apt list glusterfs-*
> Listing... Done
> glusterfs-client/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed,automatic]
> glusterfs-common/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed,automatic]
> glusterfs-dbg/bionic 6.1-ubuntu1~bionic1 amd64
> glusterfs-server/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed]
> 
> mount logs, not sure what you refer?

In the server where gluster crashed, you should find a file like /var/log/glusterfs/swarm-volumes.log. I would also need the coredump files and ubuntu version.

--- Additional comment from  on 2019-04-24 12:50:43 CEST ---



--- Additional comment from  on 2019-04-24 12:56:25 CEST ---



--- Additional comment from  on 2019-04-24 12:57:10 CEST ---



--- Additional comment from  on 2019-04-24 12:58:02 CEST ---



--- Additional comment from  on 2019-04-24 13:00:05 CEST ---

administrator@ubuntu-docker:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.2 LTS
Release:        18.04
Codename:       bionic


Same for the others

--- Additional comment from  on 2019-04-24 13:02:44 CEST ---



--- Additional comment from Xavi Hernandez on 2019-04-24 15:22:50 CEST ---

I'm still analyzing the core dumps, but as a first test, could you disable the open-behind feature using the following command?

    # gluster volume set <volname> open-behind off

--- Additional comment from  on 2019-04-24 15:43:34 CEST ---

Ok, I modified it; let's see if it helps, and hopefully it won't kill the performance.

sudo gluster volume info swarm-vols

Volume Name: swarm-vols
Type: Replicate
Volume ID: a103c1da-d651-4d65-8f86-a8731e2a670c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.1.81:/gluster/data
Brick2: 192.168.1.86:/gluster/data
Brick3: 192.168.1.85:/gluster/data
Options Reconfigured:
performance.open-behind: off
performance.cache-size: 1GB
cluster.self-heal-daemon: enable
performance.write-behind: off
auth.allow: 127.0.0.1
transport.address-family: inet
nfs.disable: on
performance.cache-max-file-size: 1GB

--- Additional comment from Xavi Hernandez on 2019-04-25 12:33:31 CEST ---

Right now the issue seems related to open-behind, so changing the component accordingly.

--- Additional comment from  on 2019-04-30 09:42:31 CEST ---

Ok, as we haven't had a crash since we disabled open-behind, it seems valid to say that disabling open-behind bypasses the issue.
As I disabled write-behind due to another defect report, is it worth re-enabling it, or would you expect problems with it?

--- Additional comment from Amar Tumballi on 2019-05-01 14:56:35 CEST ---

Yes please, you can re-enable write-behind if you have upgraded to glusterfs-5.5 or glusterfs-6.x releases.

--- Additional comment from Xavi Hernandez on 2019-05-02 19:10:34 CEST ---

From my debugging I think the issue is related to a missing fd_ref() when ob_open_behind() is used. This could potentially cause a race when the same fd is being unref'ed (refcount becoming 0) and ref'ed at the same time to handle some open_and_resume() requests. I have not yet identified the exact sequence of operations that causes the problem, though. Once I confirm that the problem really comes from here, I'll investigate further.
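
For illustration only, the sketch below shows the kind of race this comment describes; the types and function names (fake_fd_t, ob_fd_ctx_t, ob_wake, etc.) are hypothetical and are not the real GlusterFS fd API. The idea: if open-behind keeps only a raw pointer to an fd without taking its own reference, the application's final unref can destroy the fd while another thread is about to ref and use it.

#include <pthread.h>
#include <stdlib.h>

typedef struct {
    int             refcount;  /* last unref destroys the object        */
    pthread_mutex_t lock;      /* crash site once the fd has been freed */
} fake_fd_t;

typedef struct {
    fake_fd_t *fd;             /* raw pointer: no reference taken (bug) */
} ob_fd_ctx_t;

static fake_fd_t *fd_new(void)
{
    fake_fd_t *fd = calloc(1, sizeof(*fd));
    fd->refcount = 1;
    pthread_mutex_init(&fd->lock, NULL);
    return fd;
}

static void fd_unref(fake_fd_t *fd)
{
    pthread_mutex_lock(&fd->lock);
    int left = --fd->refcount;
    pthread_mutex_unlock(&fd->lock);
    if (left == 0) {           /* last reference: destroy the fd        */
        pthread_mutex_destroy(&fd->lock);
        free(fd);
    }
}

/* Thread A: the application closes the file and drops its only ref.    */
static void *app_close(void *arg)
{
    fd_unref(((ob_fd_ctx_t *)arg)->fd);
    return NULL;
}

/* Thread B: a rename (or similar) forces the pending open to complete;
 * it reaches the fd only through the raw pointer kept by open-behind.  */
static void *ob_wake(void *arg)
{
    fake_fd_t *fd = ((ob_fd_ctx_t *)arg)->fd;
    pthread_mutex_lock(&fd->lock);   /* may run on freed memory -> UB   */
    fd->refcount++;                  /* "resurrecting" a dead fd        */
    pthread_mutex_unlock(&fd->lock);
    fd_unref(fd);
    return NULL;
}

int main(void)
{
    ob_fd_ctx_t ctx = { .fd = fd_new() };  /* stored without an fd ref  */
    pthread_t a, b;

    pthread_create(&a, NULL, app_close, &ctx);
    pthread_create(&b, NULL, ob_wake, &ctx);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Depending on scheduling this is a use-after-free; run it under
     * ASan or valgrind to observe the invalid access reliably.         */
    return 0;
}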

--- Additional comment from Amar Tumballi on 2019-05-27 17:37:53 CEST ---

Marking it back to NEW, as the assignee is still the default "bugs" account.

--- Additional comment from Xavi Hernandez on 2020-01-09 17:53:15 CET ---

Is it possible to check if the problem still persists with the latest 6.x or 7.x?

--- Additional comment from  on 2020-01-10 14:42:01 CET ---

Was there a potential fix?
If yes, I'll have to update all systems and remove the workarounds, so it may take a week.

--- Additional comment from Xavi Hernandez on 2020-01-10 15:33:33 CET ---

I don't see a clear candidate to fix this, but I haven't seen this problem in other tests either.

Is it possible to try it in a test environment so as not to touch production? I can't guarantee that it will work.

--- Additional comment from  on 2020-01-29 08:19:14 CET ---

No, that is the problem, but I'll try. Would it be sufficient to just reset the options, or would the fs need to be unmounted and remounted?

--- Additional comment from  on 2020-03-03 16:30:45 CET ---

I updated today to

 apt list glusterfs-*
Listing... Done
glusterfs-client/bionic,now 6.7-ubuntu1~bionic1 amd64 [installed,automatic]
glusterfs-common/bionic,now 6.7-ubuntu1~bionic1 amd64 [installed,automatic]
glusterfs-dbg/bionic 6.7-ubuntu1~bionic1 amd64
glusterfs-server/bionic,now 6.7-ubuntu1~bionic1 amd64 [installed]


Then I enabled open-behind; it crashed during the afternoon, so the problem persists. I disabled open-behind again to prevent further instability.

--- Additional comment from Csaba Henk on 2020-03-04 17:53:51 CET ---

Can you please share all the fresh data? Backtraces, coredumps, gluster vol info/status, swarm-volumes.log. Can you please succinctly describe the workload you put on this setup?

Note: I can't find 6.7-ubuntu1~bionic1 available; what I see is 6.8-ubuntu1~bionic1 at https://launchpad.net/~gluster/+archive/ubuntu/glusterfs-6/+packages.

--- Additional comment from Xavi Hernandez on 2020-03-05 12:05:28 CET ---

I've been analyzing the core dumps again and I think I've identified a possible cause. I'm trying to reproduce it but apparently it's hard to hit it with a simple test.

I'll update on my findings soon.

--- Additional comment from Xavi Hernandez on 2020-03-05 16:53:32 CET ---

Finally I've been able to cause the same issue with the master branch. I needed to add artificial delays and use gfapi to trigger the bug, but the race is there and under certain workloads it could manifest.

The backtrace I've got is the same:

(gdb) bt
#0  0x00007ffff7f77e74 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007ffff0fb5145 in ob_fd_free (ob_fd=0x7fffd40063d0) at open-behind.c:214
#2  0x00007ffff0fb5b18 in ob_inode_wake (this=this@entry=0x7fffe4018fb0, ob_fds=ob_fds@entry=0x7ffff062ded0) at open-behind.c:361
#3  0x00007ffff0fb5ed1 in open_all_pending_fds_and_resume (this=this@entry=0x7fffe4018fb0, inode=0x4b12b8, stub=0x7fffd400f138) at open-behind.c:453
#4  0x00007ffff0fb61c3 in ob_rename (frame=frame@entry=0x7fffd4007c28, this=0x7fffe4018fb0, src=src@entry=0x7fffcc0025f0, dst=dst@entry=0x7fffcc002630, xdata=xdata@entry=0x0) at open-behind.c:1057
#5  0x00007ffff0f927a3 in mdc_rename (frame=frame@entry=0x7fffd4007da8, this=0x7fffe401abc0, oldloc=oldloc@entry=0x7fffcc0025f0, newloc=newloc@entry=0x7fffcc002630, xdata=xdata@entry=0x0) at md-cache.c:1848
#6  0x00007ffff7cf5c7c in default_rename_resume (frame=0x7fffcc001bf8, this=0x7fffe401c7d0, oldloc=0x7fffcc0025f0, newloc=0x7fffcc002630, xdata=0x0) at defaults.c:1897
#7  0x00007ffff7c77785 in call_resume (stub=0x7fffcc0025a8) at call-stub.c:2392
#8  0x00007ffff0f82700 in iot_worker (data=0x7fffe402e970) at io-threads.c:232
#9  0x00007ffff7f754e2 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ffff7ea46d3 in clone () from /lib64/libc.so.6

The issue is caused by a missing reference on an fd. I'll fix it.
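
A minimal sketch of the kind of change the patch title suggests (take the missing fd reference when the fd is stashed for a deferred open, and release it once the deferred open is resumed). It reuses the hypothetical fake_fd_t / ob_fd_ctx_t types and fd_unref() from the sketch in the description above; it is not the actual patch, which is tracked in Gerrit 24204.

/* Reuses fake_fd_t, ob_fd_ctx_t and fd_unref() from the earlier sketch. */

static void fd_ref(fake_fd_t *fd)
{
    pthread_mutex_lock(&fd->lock);
    fd->refcount++;
    pthread_mutex_unlock(&fd->lock);
}

/* When the open is deferred ("opened behind"), keep our own reference
 * so the application's close cannot destroy the fd underneath us.      */
static void ob_stash_fd(ob_fd_ctx_t *ctx, fake_fd_t *fd)
{
    fd_ref(fd);                /* the previously missing reference      */
    ctx->fd = fd;
}

/* When the deferred open is finally resumed, use the fd and then drop
 * the reference that ob_stash_fd() took.                               */
static void ob_resume_fd(ob_fd_ctx_t *ctx)
{
    fake_fd_t *fd = ctx->fd;
    ctx->fd = NULL;
    /* ... perform the real open / wind the pending stub here ... */
    fd_unref(fd);              /* balanced with ob_stash_fd()           */
}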

Comment 1 Worker Ant 2020-03-08 17:40:28 UTC
REVIEW: https://review.gluster.org/24204 (open-behind: fix missing fd reference) posted (#3) for review on master by Xavi Hernandez

Comment 2 Worker Ant 2020-03-12 13:32:38 UTC
This bug is moved to https://github.com/gluster/glusterfs/issues/1028, and will be tracked there from now on. Visit GitHub issues URL for further details

Comment 3 Worker Ant 2020-03-17 04:24:03 UTC
REVIEW: https://review.gluster.org/24204 (open-behind: fix missing fd reference) merged (#5) on master by Amar Tumballi

