Bug 1697971 - Segfault in FUSE process, potential use after free
Summary: Segfault in FUSE process, potential use after free
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: open-behind
Version: 6
Hardware: x86_64
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ---
Assignee: Xavi Hernandez
QA Contact:
URL:
Whiteboard:
Depends On: 1810934
Blocks:
Reported: 2019-04-09 11:27 UTC by manschwetus
Modified: 2020-08-18 11:36 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned to: 1810934
Environment:
Last Closed: 2020-03-12 13:02:48 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
backtrace and threadlist (6.10 KB, text/plain), 2019-04-09 11:27 UTC, manschwetus
corefile (older, uploaded not really intended) (16.40 MB, application/x-7z-compressed), 2019-04-24 10:50 UTC, manschwetus
corefile (1st) (17.06 MB, application/x-7z-compressed), 2019-04-24 10:56 UTC, manschwetus
corefile (2nd) (13.33 MB, application/x-7z-compressed), 2019-04-24 10:57 UTC, manschwetus
corefile (3rd) (14.12 MB, application/x-7z-compressed), 2019-04-24 10:58 UTC, manschwetus
swarm-volumes log of time of crash (582.04 KB, text/plain), 2019-04-24 11:02 UTC, manschwetus

Description manschwetus 2019-04-09 11:27:47 UTC
Created attachment 1553829 [details]
backtrace and threadlist

Description of problem:
We see regular segfaults of the FUSE process when running the GitLab Docker image with its data placed on a 1x3 GlusterFS volume; all nodes running the container are also Gluster servers.

Version-Release number of selected component (if applicable):
We use the latest packages available for Ubuntu LTS:

# cat /etc/apt/sources.list.d/gluster-ubuntu-glusterfs-6-bionic.list
deb http://ppa.launchpad.net/gluster/glusterfs-6/ubuntu bionic main


How reproducible:
It happens regularly, at least daily, but we aren't aware of a specific triggering action or activity.

Steps to Reproduce:
1. Set up a Docker Swarm backed by GlusterFS and run GitLab on it
2. ???
3. ???

Actual results:
Regular crashes of the FUSE process, cutting off the persistent storage used by the dockerized GitLab

Expected results:
Stable availability of glusterfs

Additional info:
We preserve the produced corefiles; I'll attach the backtrace and thread list of the one from today.

Comment 1 manschwetus 2019-04-09 11:31:55 UTC
Volume info:
$ sudo gluster  volume info
 
Volume Name: swarm-vols
Type: Replicate
Volume ID: a103c1da-d651-4d65-8f86-a8731e2a670c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.1.81:/gluster/data
Brick2: 192.168.1.86:/gluster/data
Brick3: 192.168.1.85:/gluster/data
Options Reconfigured:
performance.write-behind: off
performance.cache-max-file-size: 1GB
performance.cache-size: 2GB
nfs.disable: on
transport.address-family: inet
auth.allow: 127.0.0.1
$ sudo gluster  volume status
Status of volume: swarm-vols
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.1.81:/gluster/data            49153     0          Y       188805
Brick 192.168.1.86:/gluster/data            49153     0          Y       5543 
Brick 192.168.1.85:/gluster/data            49152     0          Y       2069 
Self-heal Daemon on localhost               N/A       N/A        Y       2080 
Self-heal Daemon on 192.168.1.81            N/A       N/A        Y       144760
Self-heal Daemon on 192.168.1.86            N/A       N/A        Y       63715
 
Task Status of Volume swarm-vols
------------------------------------------------------------------------------
There are no active volume tasks
$ mount |  grep swarm
localhost:/swarm-vols on /swarm/volumes type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072,_netdev)

Comment 2 manschwetus 2019-04-12 13:41:59 UTC
Today I encountered another segfault, same stack.

Comment 3 manschwetus 2019-04-15 06:48:14 UTC
I just checked another corefile from a second system: same stack. This issue is getting quite annoying.
Could someone please look into it?

Comment 4 manschwetus 2019-04-23 08:03:21 UTC
The problem persists with 6.1; the backtrace has changed a bit:

Crashdump1:
#0  0x00007fc3dcaa27f0 in ?? () from /lib/x86_64-linux-gnu/libuuid.so.1
#1  0x00007fc3dcaa2874 in ?? () from /lib/x86_64-linux-gnu/libuuid.so.1
#2  0x00007fc3ddb5cdcc in gf_uuid_unparse (out=0x7fc3c8005580 "c27a90a6-e68b-4b0b-af56-002ea7bf1fb4", uuid=0x8 <error: Cannot access memory at address 0x8>) at ./glusterfs/compat-uuid.h:55
#3  uuid_utoa (uuid=uuid@entry=0x8 <error: Cannot access memory at address 0x8>) at common-utils.c:2777
#4  0x00007fc3d688c529 in ioc_open_cbk (frame=0x7fc3a8b56208, cookie=<optimized out>, this=0x7fc3d001eb80, op_ret=0, op_errno=117, fd=0x7fc3c7678418, xdata=0x0) at io-cache.c:646
#5  0x00007fc3d6cb09b1 in ra_open_cbk (frame=0x7fc3a8b5b698, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, fd=0x7fc3c7678418, xdata=0x0) at read-ahead.c:99
#6  0x00007fc3d71b10b3 in afr_open_cbk (frame=0x7fc3a8b67d48, cookie=0x0, this=<optimized out>, op_ret=0, op_errno=0, fd=0x7fc3c7678418, xdata=0x0) at afr-open.c:97
#7  0x00007fc3d747c5f8 in client4_0_open_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7fc3a8b58d18) at client-rpc-fops_v2.c:284
#8  0x00007fc3dd9013d1 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fc3d0057dd0, pollin=pollin@entry=0x7fc386e7e2b0) at rpc-clnt.c:755
#9  0x00007fc3dd901773 in rpc_clnt_notify (trans=0x7fc3d0058090, mydata=0x7fc3d0057e00, event=<optimized out>, data=0x7fc386e7e2b0) at rpc-clnt.c:922
#10 0x00007fc3dd8fe273 in rpc_transport_notify (this=this@entry=0x7fc3d0058090, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=<optimized out>) at rpc-transport.c:542
#11 0x00007fc3d8333474 in socket_event_poll_in (notify_handled=true, this=0x7fc3d0058090) at socket.c:2522
#12 socket_event_handler (fd=fd@entry=11, idx=idx@entry=2, gen=gen@entry=4, data=data@entry=0x7fc3d0058090, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>, event_thread_died=0 '\000') at socket.c:2924
#13 0x00007fc3ddbb2863 in event_dispatch_epoll_handler (event=0x7fc3cfffee54, event_pool=0x5570195807b0) at event-epoll.c:648
#14 event_dispatch_epoll_worker (data=0x5570195c5a80) at event-epoll.c:761
#15 0x00007fc3dd2bc6db in start_thread (arg=0x7fc3cffff700) at pthread_create.c:463
#16 0x00007fc3dcfe588f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Crashdump2:
#0  0x00007fdd13fd97f0 in ?? () from /lib/x86_64-linux-gnu/libuuid.so.1
#1  0x00007fdd13fd9874 in ?? () from /lib/x86_64-linux-gnu/libuuid.so.1
#2  0x00007fdd15093dcc in gf_uuid_unparse (out=0x7fdd0805a2d0 "1f739bdc-f7c0-4133-84cc-554eb594ae81", uuid=0x8 <error: Cannot access memory at address 0x8>) at ./glusterfs/compat-uuid.h:55
#3  uuid_utoa (uuid=uuid@entry=0x8 <error: Cannot access memory at address 0x8>) at common-utils.c:2777
#4  0x00007fdd0d5c2529 in ioc_open_cbk (frame=0x7fdce44a9f88, cookie=<optimized out>, this=0x7fdd0801eb80, op_ret=0, op_errno=117, fd=0x7fdcf1ad9b78, xdata=0x0) at io-cache.c:646
#5  0x00007fdd0d9e69b1 in ra_open_cbk (frame=0x7fdce44d2a78, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, fd=0x7fdcf1ad9b78, xdata=0x0) at read-ahead.c:99
#6  0x00007fdd0dee70b3 in afr_open_cbk (frame=0x7fdce44a80a8, cookie=0x1, this=<optimized out>, op_ret=0, op_errno=0, fd=0x7fdcf1ad9b78, xdata=0x0) at afr-open.c:97
#7  0x00007fdd0e1b25f8 in client4_0_open_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7fdce4462528) at client-rpc-fops_v2.c:284
#8  0x00007fdd14e383d1 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fdd08054490, pollin=pollin@entry=0x7fdc942904d0) at rpc-clnt.c:755
#9  0x00007fdd14e38773 in rpc_clnt_notify (trans=0x7fdd08054750, mydata=0x7fdd080544c0, event=<optimized out>, data=0x7fdc942904d0) at rpc-clnt.c:922
#10 0x00007fdd14e35273 in rpc_transport_notify (this=this@entry=0x7fdd08054750, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=<optimized out>) at rpc-transport.c:542
#11 0x00007fdd0f86a474 in socket_event_poll_in (notify_handled=true, this=0x7fdd08054750) at socket.c:2522
#12 socket_event_handler (fd=fd@entry=10, idx=idx@entry=4, gen=gen@entry=4, data=data@entry=0x7fdd08054750, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>, event_thread_died=0 '\000') at socket.c:2924
#13 0x00007fdd150e9863 in event_dispatch_epoll_handler (event=0x7fdd0f3e0e54, event_pool=0x55cf9c9277b0) at event-epoll.c:648
#14 event_dispatch_epoll_worker (data=0x55cf9c96ca20) at event-epoll.c:761
#15 0x00007fdd147f36db in start_thread (arg=0x7fdd0f3e1700) at pthread_create.c:463
#16 0x00007fdd1451c88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95


Please tell me if you need further information.

Comment 5 manschwetus 2019-04-24 10:27:43 UTC
Got another set of cores overnight, one for each system in my Gluster setup:

Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=localhost --volfile-id'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
65      ../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x7faee404c700 (LWP 28792))]
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
#1  0x00007faee544e4b5 in ob_fd_free (ob_fd=0x7faebc054df0) at open-behind.c:198
#2  0x00007faee544edd6 in ob_inode_wake (this=this@entry=0x7faed8020d20, ob_fds=ob_fds@entry=0x7faee404bc90) at open-behind.c:355
#3  0x00007faee544f062 in open_all_pending_fds_and_resume (this=this@entry=0x7faed8020d20, inode=0x7faed037cf08, stub=0x7faebc008858) at open-behind.c:442
#4  0x00007faee544f4ff in ob_rename (frame=frame@entry=0x7faebc1ceae8, this=this@entry=0x7faed8020d20, src=src@entry=0x7faed03d9a70, dst=dst@entry=0x7faed03d9ab0, xdata=xdata@entry=0x0) at open-behind.c:1035
#5  0x00007faeed1b0ad0 in default_rename (frame=frame@entry=0x7faebc1ceae8, this=<optimized out>, oldloc=oldloc@entry=0x7faed03d9a70, newloc=newloc@entry=0x7faed03d9ab0, xdata=xdata@entry=0x0) at defaults.c:2631
#6  0x00007faee501f798 in mdc_rename (frame=frame@entry=0x7faebc210468, this=0x7faed80247d0, oldloc=oldloc@entry=0x7faed03d9a70, newloc=newloc@entry=0x7faed03d9ab0, xdata=xdata@entry=0x0) at md-cache.c:1852
#7  0x00007faeed1c6936 in default_rename_resume (frame=0x7faed02d2318, this=0x7faed8026430, oldloc=0x7faed03d9a70, newloc=0x7faed03d9ab0, xdata=0x0) at defaults.c:1897
#8  0x00007faeed14cc45 in call_resume (stub=0x7faed03d9a28) at call-stub.c:2555
#9  0x00007faee4e10cd8 in iot_worker (data=0x7faed8034780) at io-threads.c:232
#10 0x00007faeec88f6db in start_thread (arg=0x7faee404c700) at pthread_create.c:463
#11 0x00007faeec5b888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95



Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=localhost --volfile-id'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
65      ../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x7f7944069700 (LWP 24067))]
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
#1  0x00007f794682c4b5 in ob_fd_free (ob_fd=0x7f79283e94e0) at open-behind.c:198
#2  0x00007f794682cdd6 in ob_inode_wake (this=this@entry=0x7f793801eee0, ob_fds=ob_fds@entry=0x7f7944068c90) at open-behind.c:355
#3  0x00007f794682d062 in open_all_pending_fds_and_resume (this=this@entry=0x7f793801eee0, inode=0x7f79301de788, stub=0x7f7928004578) at open-behind.c:442
#4  0x00007f794682d4ff in ob_rename (frame=frame@entry=0x7f79280ab2b8, this=this@entry=0x7f793801eee0, src=src@entry=0x7f7930558ea0, dst=dst@entry=0x7f7930558ee0, xdata=xdata@entry=0x0) at open-behind.c:1035
#5  0x00007f794e729ad0 in default_rename (frame=frame@entry=0x7f79280ab2b8, this=<optimized out>, oldloc=oldloc@entry=0x7f7930558ea0, newloc=newloc@entry=0x7f7930558ee0, xdata=xdata@entry=0x0) at defaults.c:2631
#6  0x00007f79463fd798 in mdc_rename (frame=frame@entry=0x7f7928363ae8, this=0x7f7938022990, oldloc=oldloc@entry=0x7f7930558ea0, newloc=newloc@entry=0x7f7930558ee0, xdata=xdata@entry=0x0) at md-cache.c:1852
#7  0x00007f794e73f936 in default_rename_resume (frame=0x7f7930298c28, this=0x7f79380245f0, oldloc=0x7f7930558ea0, newloc=0x7f7930558ee0, xdata=0x0) at defaults.c:1897
#8  0x00007f794e6c5c45 in call_resume (stub=0x7f7930558e58) at call-stub.c:2555
#9  0x00007f79461eecd8 in iot_worker (data=0x7f7938032940) at io-threads.c:232
#10 0x00007f794de086db in start_thread (arg=0x7f7944069700) at pthread_create.c:463
#11 0x00007f794db3188f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95



Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=localhost --volfile-id'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
65      ../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x7f736ce78700 (LWP 42526))]
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
#1  0x00007f73733499f7 in fd_unref (fd=0x7f73544165a8) at fd.c:515
#2  0x00007f736c3fb618 in client_local_wipe (local=local@entry=0x7f73441ff8c8) at client-helpers.c:124
#3  0x00007f736c44a60a in client4_0_open_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f7344038b48) at client-rpc-fops_v2.c:284
#4  0x00007f73730d03d1 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f7368054490, pollin=pollin@entry=0x7f73601b2730) at rpc-clnt.c:755
#5  0x00007f73730d0773 in rpc_clnt_notify (trans=0x7f7368054750, mydata=0x7f73680544c0, event=<optimized out>, data=0x7f73601b2730) at rpc-clnt.c:922
#6  0x00007f73730cd273 in rpc_transport_notify (this=this@entry=0x7f7368054750, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=<optimized out>) at rpc-transport.c:542
#7  0x00007f736db02474 in socket_event_poll_in (notify_handled=true, this=0x7f7368054750) at socket.c:2522
#8  socket_event_handler (fd=fd@entry=10, idx=idx@entry=4, gen=gen@entry=1, data=data@entry=0x7f7368054750, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>, event_thread_died=0 '\000') at socket.c:2924
#9  0x00007f7373381863 in event_dispatch_epoll_handler (event=0x7f736ce77e54, event_pool=0x55621c2917b0) at event-epoll.c:648
#10 event_dispatch_epoll_worker (data=0x55621c2d6a80) at event-epoll.c:761
#11 0x00007f7372a8b6db in start_thread (arg=0x7f736ce78700) at pthread_create.c:463
#12 0x00007f73727b488f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Comment 6 Xavi Hernandez 2019-04-24 10:38:59 UTC
Can you share these core dumps and indicate the exact version of Gluster and Ubuntu used for each one?

As per your comments, I understand that you are not doing anything special when this happens, right? No volume reconfiguration or any other management task.

Mount logs would also be helpful.

Comment 7 manschwetus 2019-04-24 10:46:40 UTC
Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-47-generic x86_64)

administrator@ubuntu-docker:~$ apt list glusterfs-*
Listing... Done
glusterfs-client/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed,automatic]
glusterfs-common/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed,automatic]
glusterfs-dbg/bionic 6.1-ubuntu1~bionic1 amd64
glusterfs-server/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed]

Mount logs? I'm not sure what you are referring to.

Comment 8 Xavi Hernandez 2019-04-24 10:50:13 UTC
(In reply to manschwetus from comment #7)
> Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-47-generic x86_64)
> 
> administrator@ubuntu-docker:~$ apt list glusterfs-*
> Listing... Done
> glusterfs-client/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed,automatic]
> glusterfs-common/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed,automatic]
> glusterfs-dbg/bionic 6.1-ubuntu1~bionic1 amd64
> glusterfs-server/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed]
> 
> mount logs, not sure what you refer?

On the server where Gluster crashed, you should find a file like /var/log/glusterfs/swarm-volumes.log. I would also need the coredump files and the Ubuntu version.

Comment 9 manschwetus 2019-04-24 10:50:43 UTC
Created attachment 1558139 [details]
corefile (older, uploaded not really intended)

Comment 10 manschwetus 2019-04-24 10:56:25 UTC
Created attachment 1558142 [details]
corefile (1st)

Comment 11 manschwetus 2019-04-24 10:57:10 UTC
Created attachment 1558143 [details]
corefile (2nd)

Comment 12 manschwetus 2019-04-24 10:58:02 UTC
Created attachment 1558144 [details]
corefile (3rd)

Comment 13 manschwetus 2019-04-24 11:00:05 UTC
administrator@ubuntu-docker:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.2 LTS
Release:        18.04
Codename:       bionic


Same for the others

Comment 14 manschwetus 2019-04-24 11:02:44 UTC
Created attachment 1558157 [details]
swarm-volumes log of time of crash

Comment 15 Xavi Hernandez 2019-04-24 13:22:50 UTC
I'm still analyzing the core dumps, but as a first test, could you disable the open-behind feature using the following command?

    # gluster volume set <volname> open-behind off

Comment 16 manschwetus 2019-04-24 13:43:34 UTC
OK, I modified it. Let's see if it helps; hopefully it won't kill the performance.

sudo gluster volume info swarm-vols

Volume Name: swarm-vols
Type: Replicate
Volume ID: a103c1da-d651-4d65-8f86-a8731e2a670c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.1.81:/gluster/data
Brick2: 192.168.1.86:/gluster/data
Brick3: 192.168.1.85:/gluster/data
Options Reconfigured:
performance.open-behind: off
performance.cache-size: 1GB
cluster.self-heal-daemon: enable
performance.write-behind: off
auth.allow: 127.0.0.1
transport.address-family: inet
nfs.disable: on
performance.cache-max-file-size: 1GB

Comment 17 Xavi Hernandez 2019-04-25 10:33:31 UTC
Right now the issue seems related to open-behind, so changing the component accordingly.

Comment 18 manschwetus 2019-04-30 07:42:31 UTC
OK, as we haven't had a crash since we disabled open-behind, it seems valid to say that disabling open-behind bypasses the issue.
As I had disabled write-behind due to another defect report, is it worth re-enabling it, or would you expect problems with it?

Comment 19 Amar Tumballi 2019-05-01 12:56:35 UTC
Yes, please: you can re-enable write-behind if you have upgraded to the glusterfs-5.5 or glusterfs-6.x releases.

Comment 20 Xavi Hernandez 2019-05-02 17:10:34 UTC
From my debugging I think the issue is related to a missing fd_ref() when ob_open_behind() is used. This could potentially cause a race when the same fd is being unref'ed (refcount dropping to 0) and ref'ed at the same time to handle some open_and_resume() requests. I have not yet identified the exact sequence of operations that causes the problem, though. Once I know that the problem really comes from here, I'll investigate further.
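
To illustrate the kind of refcount race described here, below is a minimal, self-contained C sketch. The types and functions (my_fd_t, my_fd_ref, my_fd_unref, wake_pending_open_*) are hypothetical stand-ins, not the real GlusterFS fd API; the sketch only models why a consumer that keeps a raw fd pointer without taking its own reference can end up locking freed memory once the last reference is dropped elsewhere (compare the pthread_mutex_lock(mutex=0x18) frames in the backtraces above). It builds with gcc -pthread; the unsafe path is shown for contrast but never executed.

/* Minimal sketch of the refcount race, with hypothetical stand-in types. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    pthread_mutex_t lock;
    int             refcount;
} my_fd_t;

static my_fd_t *my_fd_new(void) {
    my_fd_t *fd = calloc(1, sizeof(*fd));
    pthread_mutex_init(&fd->lock, NULL);
    fd->refcount = 1;
    return fd;
}

static my_fd_t *my_fd_ref(my_fd_t *fd) {
    pthread_mutex_lock(&fd->lock);
    fd->refcount++;
    pthread_mutex_unlock(&fd->lock);
    return fd;
}

static void my_fd_unref(my_fd_t *fd) {
    pthread_mutex_lock(&fd->lock);
    int refs = --fd->refcount;
    pthread_mutex_unlock(&fd->lock);
    if (refs == 0) {
        /* Last reference gone: the fd and its mutex are destroyed. Any
         * other thread still holding only a raw pointer now points at
         * freed memory. */
        pthread_mutex_destroy(&fd->lock);
        free(fd);
    }
}

/* Bug pattern: the pending-open entry stores only the raw pointer. If the
 * last reference is dropped first by another thread, this lock operates
 * on freed memory. */
static void wake_pending_open_unsafe(my_fd_t *raw_fd) {
    pthread_mutex_lock(&raw_fd->lock);   /* potential use after free */
    pthread_mutex_unlock(&raw_fd->lock);
}

/* Safe pattern: take an extra reference when the fd is queued and drop it
 * only after the pending open has been woken. */
static void wake_pending_open_safe(my_fd_t *fd_with_ref) {
    pthread_mutex_lock(&fd_with_ref->lock);
    pthread_mutex_unlock(&fd_with_ref->lock);
    my_fd_unref(fd_with_ref);            /* release the queue's reference */
}

int main(void) {
    my_fd_t *fd = my_fd_new();

    my_fd_t *queued = my_fd_ref(fd);     /* reference owned by the queue */
    my_fd_unref(fd);                     /* "application" closes the fd  */
    wake_pending_open_safe(queued);      /* still valid: queue held a ref */

    (void)wake_pending_open_unsafe;      /* shown for contrast, not run */
    printf("done\n");
    return 0;
}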

Comment 21 Amar Tumballi 2019-05-27 15:37:53 UTC
Marking it back to NEW, as the assignee is still the default bugs list.

Comment 22 Xavi Hernandez 2020-01-09 16:53:15 UTC
Is it possible to check whether the problem still persists with the latest 6.x or 7.x?

Comment 23 manschwetus 2020-01-10 13:42:01 UTC
Was there a potential fix?
If yes, I'll have to update all systems and remove the workarounds, so it may take a week.

Comment 24 Xavi Hernandez 2020-01-10 14:33:33 UTC
I don't see a clear candidate to fix this, but I haven't seen this problem in other tests either.

Is it possible to try it in a test environment so as not to touch production? I can't guarantee that it will work.

Comment 25 manschwetus 2020-01-29 07:19:14 UTC
No, that is the problem, but I'll try. Would it be sufficient to just reset the options, or would the filesystem need to be unmounted and remounted?

Comment 26 manschwetus 2020-03-03 15:30:45 UTC
I updated today to:

 apt list glusterfs-*
Listing... Done
glusterfs-client/bionic,now 6.7-ubuntu1~bionic1 amd64 [installed,automatic]
glusterfs-common/bionic,now 6.7-ubuntu1~bionic1 amd64 [installed,automatic]
glusterfs-dbg/bionic 6.7-ubuntu1~bionic1 amd64
glusterfs-server/bionic,now 6.7-ubuntu1~bionic1 amd64 [installed]


Then I enabled open-behind, and it crashed during the afternoon, so the problem persists. I disabled open-behind again to prevent further instability.

Comment 27 Csaba Henk 2020-03-04 16:53:51 UTC
Can you please share all the fresh data? Backtraces, coredumps, gluster vol info/status, swarm-volumes.log. Can you please succinctly describe the workload you put on this setup?

Note: I can't find the availability of 6.7-ubuntu1~bionic; indeed, what I see is 6.8-ubuntu1~bionic1 at https://launchpad.net/~gluster/+archive/ubuntu/glusterfs-6/+packages.

Comment 28 Xavi Hernandez 2020-03-05 11:05:28 UTC
I've been analyzing the core dumps again and I think I've identified a possible cause. I'm trying to reproduce it but apparently it's hard to hit it with a simple test.

I'll update on my findings soon.

Comment 29 Xavi Hernandez 2020-03-05 15:53:32 UTC
I've finally been able to reproduce the same issue with the master branch. I needed to add artificial delays and use gfapi to trigger the bug, but the race is there and under certain workloads it could manifest.

The backtrace I've got is the same:

(gdb) bt
#0  0x00007ffff7f77e74 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007ffff0fb5145 in ob_fd_free (ob_fd=0x7fffd40063d0) at open-behind.c:214
#2  0x00007ffff0fb5b18 in ob_inode_wake (this=this@entry=0x7fffe4018fb0, ob_fds=ob_fds@entry=0x7ffff062ded0) at open-behind.c:361
#3  0x00007ffff0fb5ed1 in open_all_pending_fds_and_resume (this=this@entry=0x7fffe4018fb0, inode=0x4b12b8, stub=0x7fffd400f138) at open-behind.c:453
#4  0x00007ffff0fb61c3 in ob_rename (frame=frame@entry=0x7fffd4007c28, this=0x7fffe4018fb0, src=src@entry=0x7fffcc0025f0, dst=dst@entry=0x7fffcc002630, xdata=xdata@entry=0x0) at open-behind.c:1057
#5  0x00007ffff0f927a3 in mdc_rename (frame=frame@entry=0x7fffd4007da8, this=0x7fffe401abc0, oldloc=oldloc@entry=0x7fffcc0025f0, newloc=newloc@entry=0x7fffcc002630, xdata=xdata@entry=0x0) at md-cache.c:1848
#6  0x00007ffff7cf5c7c in default_rename_resume (frame=0x7fffcc001bf8, this=0x7fffe401c7d0, oldloc=0x7fffcc0025f0, newloc=0x7fffcc002630, xdata=0x0) at defaults.c:1897
#7  0x00007ffff7c77785 in call_resume (stub=0x7fffcc0025a8) at call-stub.c:2392
#8  0x00007ffff0f82700 in iot_worker (data=0x7fffe402e970) at io-threads.c:232
#9  0x00007ffff7f754e2 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ffff7ea46d3 in clone () from /lib64/libc.so.6

The issue is caused by a missing reference on an fd. I'll fix it.
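
For reference, the "artificial delay" reproduction approach mentioned above can be sketched roughly as follows. This is a hypothetical, self-contained C model, not GlusterFS or gfapi code: instead of actually freeing memory, the shared object is only marked as released, so when the widened race window is hit the program prints a message rather than exhibiting undefined behaviour. Build with gcc -pthread.

/* Sketch of widening a race window with an artificial delay (hypothetical). */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <stdatomic.h>

typedef struct {
    atomic_int alive;               /* 1 while the last reference is held */
} fake_fd_t;

static fake_fd_t shared_fd = { 1 };

/* Models the open-behind side: it grabs a raw pointer to the fd, and the
 * usleep() artificially widens the window before the pointer is used. */
static void *waker(void *arg) {
    (void)arg;
    fake_fd_t *fd = &shared_fd;     /* raw pointer, no reference taken */
    usleep(1000);                   /* artificial delay to make the race easy to hit */
    if (!atomic_load(&fd->alive))
        printf("race hit: fd was already released\n");
    else
        printf("no race this time\n");
    return NULL;
}

/* Models the application thread releasing its last reference to the fd. */
static void *closer(void *arg) {
    (void)arg;
    atomic_store(&shared_fd.alive, 0);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, waker, NULL);
    pthread_create(&b, NULL, closer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}

In the real code the closer side corresponds to fd_unref() dropping the last reference, and the fix hinted at above would presumably have open-behind hold its own reference for as long as the fd sits on its pending list.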

Comment 30 Worker Ant 2020-03-12 13:02:48 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/991 and will be tracked there from now on. Visit the GitHub issue URL for further details.

