Description of problem:

After running the fs-drift.py mixed-workload generator for about half an hour, glusterfs core dumped and I got the error "Transport endpoint is not connected" from "df". Details below.

Version-Release number of selected component (if applicable):

glusterfs-5.2-1.el7.x86_64
kernel-3.10.0-693.2.2.el7.x86_64

How reproducible:

Not sure yet.

Steps to Reproduce:

I'm using Kaleb's longevity cluster; I don't know how that was installed. All I did was create glusterfs mountpoints on 6 of the nodes rhs-srv-{08,09,10,11,12,13,14} and launch fs-drift.py from -09 using the script below. I just happened to notice the loss of a mountpoint on rhs-srv-10; I'm not sure whether it happened anywhere else, or exactly when it happened, only that the mountpoint was functioning when the test started. A coredump URL will be provided.

Actual results:

[root@rhs-srv-10 ~]# ls -l /mnt/longevity anaconda-ks.cfg libntirpc-1.6.2-1.el7.x86_64.rpm
[root@rhs-srv-10 ~]# ls -l /core*
-rw------- 1 root root 346918912 Jan 7 18:37 /core.12648
-rw------- 1 root root 26152960 Nov 14 2016 /core.27656
-rw------- 1 root root 26152960 Nov 14 2016 /core.27681
-rw------- 1 root root 26152960 Nov 14 2016 /core.27704

Expected results:

glusterfs should never core dump, no matter how heavy the load.

Additional info:

Just before that, I had paused the benchmark (see below) and done a find command to count the files created by the benchmark so far; only about 1000. Then I resumed the benchmark, and that's when I noticed the mountpoint was gone.

The fs-drift.py mixed-workload benchmark is described here:
https://github.com/parallel-fs-utils/fs-drift/blob/master/README.md

The workload was this script:

[root@rhs-srv-09 fs-drift]# more runit.sh
#!/bin/bash -x
./fs-drift.py \
  --host-set rhs-srv-08,rhs-srv-10,rhs-srv-11,rhs-srv-12,rhs-srv-13,rhs-srv-14 \
  --top /mnt/longevity/bene \
  --max-files 1000000 \
  --max-file-size-kb 1024 \
  --max-record-size-kb 256 \
  --max-random-writes 5 \
  --max-random-reads 10 \
  --threads 16 \
  --dirs-per-level 50 \
  --duration 36000 \
  --tolerate-stale-file-handles true \
  --output-json

It is still running AFAIK; it should be done tomorrow morning.

To pause the workload: touch /mnt/longevity/bene/network_shared/pause.tmp
To unpause it, just rm that file.
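One more note: a simple way to catch the exact moment a mountpoint drops during a run like this is to poll df from each client. A minimal sketch (this loop is not part of fs-drift, just a hypothetical helper that could be run alongside it on each client):

while sleep 60; do
    if ! df /mnt/longevity >/dev/null 2>&1; then
        # df fails once the fuse mount dies ("Transport endpoint is not connected")
        date
        echo "mountpoint /mnt/longevity is gone"
        break
    fi
done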
Hi Ben, can you please share backtraces or make the cores accessible? Or, if you can reproduce the phenomenon, is it even worth bothering with them at all?

Thanks,
Csaba
Sorry Csaba, I thought the cores were accessible by virtue of being in the longevity cluster. This most likely is reproducible. Can you log in to rhs_srv_10.storage-dev.lab.eng.bos.redhat.com and access the core dump above? If not, I'll post the core dump. Kaleb Keithley is the owner of this cluster and can further assist.

My main purpose here is to demonstrate enhancements to the fs-drift tool that can be used to verify that Gluster holds up under a heavy workload for a long period of time, and that it functions correctly under realistic but extreme conditions, such as a near-full filesystem. This benchmark is also being used on Cephfs, but it should work on any POSIX-like filesystem, distributed or not.

I'll try it one more time next week (with newer fs-drift) and post the core dump. My preference would be to help someone in the Gluster community use the workload generator and have them reproduce the problem and analyze it. Of course I'll do my best to resolve any problems you find with the workload generator. HTH -ben
Sorry again, it was rhs-srv-10, not rhs_srv_10 :-(
I see the crash happens on the fuse_forget -> inode_unref -> inode_table_prune code path. That part of the code has been significantly rewritten after 5.2 by https://review.gluster.org/c/glusterfs/+/19778. Would it be possible to recheck with that patch applied?
I don't know how to apply the patch, but if you or someone else does it, then I can re-run fs-drift, which I planned to do anyway.
Created attachment 1521159 [details]
patched glusterfs 5.2 rpms (glusterfs-{fuse,libs})

Here we ship glusterfs-{libs,fuse}-5.2-0.2.git477643b.el7.x86_64.rpm in a tarball. The rpms were built from 5.2 with the following changes backported:

https://review.gluster.org/c/glusterfs/+/21305
https://review.gluster.org/c/glusterfs/+/19778

They are expected to be usable with the rest of a glusterfs 5.2 installation on a RHEL 7.* system. Please retest with these.
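For reference, one possible way to get these in place (a sketch only; the tarball name glusterfs-5.2-patched.tar is hypothetical, substitute whatever the attachment is saved as):

# on each affected node, with the attachment copied to /tmp
cd /tmp && tar xf glusterfs-5.2-patched.tar
# --oldpackage because 5.2-0.2.git477643b sorts below the installed 5.2-1
rpm -Uvh --oldpackage glusterfs-libs-5.2-0.2.git477643b.el7.x86_64.rpm \
                      glusterfs-fuse-5.2-0.2.git477643b.el7.x86_64.rpm
# remount the fuse clients afterwards so the new bits are actually loaded
umount /mnt/longevity && mount -t glusterfs rhs-srv-09:/longevity /mnt/longevity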
Citing the backtrace from /core.12648 for further reference.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=rhs-srv-09 --volfile-i'.
Program terminated with signal 11, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x800b07f4d388) at ../nptl/pthread_mutex_lock.c:66
66        unsigned int type = PTHREAD_MUTEX_TYPE_ELISION (mutex);
Missing separate debuginfos, use: debuginfo-install glusterfs-fuse-5.2-1.el7.x86_64
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x800b07f4d388) at ../nptl/pthread_mutex_lock.c:66
#1  0x00007fcdd96ca7ec in __gf_free () from /lib64/libglusterfs.so.0
#2  0x00007fcdcb8d16ab in afr_inode_ctx_destroy () from /usr/lib64/glusterfs/5.2/xlator/cluster/replicate.so
#3  0x00007fcdcb8d46a7 in afr_forget () from /usr/lib64/glusterfs/5.2/xlator/cluster/replicate.so
#4  0x00007fcdd96ae77e in __inode_ctx_free () from /lib64/libglusterfs.so.0
#5  0x00007fcdd96afbb2 in inode_table_prune () from /lib64/libglusterfs.so.0
#6  0x00007fcdd96afe94 in inode_unref () from /lib64/libglusterfs.so.0
#7  0x00007fcdd0a81292 in fuse_forget () from /usr/lib64/glusterfs/5.2/xlator/mount/fuse.so
#8  0x00007fcdd0a98c5a in fuse_thread_proc () from /usr/lib64/glusterfs/5.2/xlator/mount/fuse.so
#9  0x00007fcdd8502e25 in start_thread (arg=0x7fcdc8a17700) at pthread_create.c:308
#10 0x00007fcdd7dcf34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
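For anyone who wants to pull the same information from the other cores, a sketch of the gdb invocation (the glusterfs-libs debuginfo package name is an assumption based on the installed versions above; gdb itself suggests the glusterfs-fuse one):

[root@rhs-srv-10 ~]# debuginfo-install glusterfs-fuse-5.2-1.el7.x86_64 glusterfs-libs-5.2-1.el7.x86_64
[root@rhs-srv-10 ~]# gdb /usr/sbin/glusterfs /core.12648
(gdb) bt full                # crashing thread, with locals
(gdb) thread apply all bt    # all threads, to see what else was going on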
Created attachment 1521358 [details]
backport of https://review.gluster.org//21305 to 5.3

As Kaleb pointed out, 5.3 is out now. However, it still does not have one of the changes referred to in comment #7 (https://review.gluster.org//21305), so I'm posting a backport of it to 5.3.
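In case it helps, a rough sketch of applying the attached backport to a 5.3 source tree before building (the patch filename is hypothetical; the build steps are the usual autotools ones for a glusterfs checkout):

git clone https://github.com/gluster/glusterfs.git
cd glusterfs
git checkout v5.3
# apply the attachment (saved here as ../backport-21305-5.3.patch);
# "git am" works if it is a git-formatted patch, otherwise use "patch -p1 < ..."
git am ../backport-21305-5.3.patch
./autogen.sh && ./configure && make -j$(nproc)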
I am re-running the test. I think Gluster has been updated on the longevity cluster to v5.2.1 on all hosts, plus whatever was done by Kaleb. Kaleb said in e-mail:

"I ended up extracting the libglusterfs shlib and the fuse xlator from the rpms and copying them into place on all the nodes. It probably would have been easier to send me the diff of the backport (or a src.rpm) and then I'd have built a complete set of packages."

He also said that it might be easier to upgrade to v5.3 and retry the test.

To run this test, log in to rhs-srv-09 and:

- if not mounted, mount the filesystem on all clients:
  # export ANSIBLE_INVENTORY=~/bene/host.inv
  # ansible -m shell -a 'mount -t glusterfs rhs-srv-09:/longevity /mnt/longevity' all

- kick off the test:
  # cd ~/bene/fs-drift
  # ./runit.sh

I see these fairly regularly; I think I saw them before:

tail -f glusterd.log
...
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 39 times between [2019-01-21 18:27:30.613805] and [2019-01-21 18:29:27.741999]
[2019-01-21 18:29:30.745368] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 39 times between [2019-01-21 18:29:30.745368] and [2019-01-21 18:31:27.873675]
[2019-01-21 18:31:30.876996] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 39 times between [2019-01-21 18:31:30.876996] and [2019-01-21 18:33:28.005059]
[2019-01-21 18:33:31.008389] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 46 times between [2019-01-21 18:33:31.008389] and [2019-01-21 18:35:28.136950]
[2019-01-21 18:35:31.140260] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 39 times between [2019-01-21 18:35:31.140260] and [2019-01-21 18:37:28.270902]
[2019-01-21 18:37:31.274049] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler

Are these real errors or not?
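To gauge how widespread the "Failed to dispatch handler" messages are across the nodes, something like this could be run (just a sketch, assuming the logs live in the standard /var/log/glusterfs location):

[root@rhs-srv-09 ~]# ansible -i ~/bene/host.inv -m shell -a 'grep -c "Failed to dispatch handler" /var/log/glusterfs/glusterd.log /var/log/glusterfs/mnt-longevity.log 2>/dev/null' all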
mnt-longevity.log:

[2019-01-21 18:41:40.349391] I [MSGID: 109066] [dht-rename.c:1922:dht_rename] 0-longevity-dht: renaming /bene/d0029/d0047/f000000336 (04407899-c84c-4d72-b1f3-c6d31fbb38d9) (hash=longevity-replicate-0/cache=longevity-replicate-0) => /bene/d0010/d0036/f000000183 ((null)) (hash=longevity-replicate-3/cache=<nul>)
[2019-01-21 18:41:40.540286] W [MSGID: 101159] [inode.c:1330:__inode_unlink] 0-inode: bb4302c7-2c54-432d-9949-d6a3c9a28c54/f000000179: dentry not found in c71679b1-a915-4c7d-8419-3c4e5d16f0cf
[2019-01-21 18:41:40.703773] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
[2019-01-21 18:41:40.703994] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
[2019-01-21 18:41:40.704323] I [MSGID: 109066] [dht-rename.c:1922:dht_rename] 0-longevity-dht: renaming /bene/d0045/d0025/f000000128 (4e3e43a6-bf32-4260-8fc3-692991d67484) (hash=longevity-replicate-2/cache=longevity-replicate-2) => /bene/d0011/d0011/f000000370 ((null)) (hash=longevity-replicate-0/cache=<nul>)
[2019-01-21 18:41:40.705841] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 159 times between [2019-01-21 18:41:40.705841] and [2019-01-21 18:41:40.746764]
[2019-01-21 18:41:40.962160] I [MSGID: 109066] [dht-rename.c:1922:dht_rename] 0-longevity-dht: renaming /bene/d0015/d0020/f000000268 (0a99892f-6a4b-46c4-af73-10b7882542c3) (hash=longevity-replicate-0/cache=longevity-replicate-0) => /bene/d0027/d0020/f000000012 ((null)) (hash=longevity-replicate-1/cache=<nul>)
[2019-01-21 18:41:41.012307] E [MSGID: 133021] [shard.c:3569:shard_delete_shards] 0-longevity-shard: Failed to clean up shards of gfid b1123a96-44b0-47f7-97aa-5d536b3308ec [No such file or directory]
[2019-01-21 18:41:41.039286] E [MSGID: 133021] [shard.c:3569:shard_delete_shards] 0-longevity-shard: Failed to clean up shards of gfid f4a4c3f3-8fa1-45ce-ba94-052ff3ce79c9 [No such file or directory]
[2019-01-21 18:41:41.134390] E [fuse-bridge.c:219:check_and_dump_fuse_W] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f19cafd7fbb] (--> /usr/lib64/glusterfs/5.2/xlator/mount/fuse.so(+0x7f81)[0x7f19c23b6f81] (--> /usr/lib64/glusterfs/5.2/xlator/mount/fuse.so(+0x8e38)[0x7f19c23b7e38] (--> /lib64/libpthread.so.0(+0x7e25)[0x7f19c9e39e25] (--> /lib64/libc.so.6(clone+0x6d)[0x7f19c970634d] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
[2019-01-21 18:41:41.316802] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 179 times between [2019-01-21 18:41:41.316802] and [2019-01-21 18:41:41.388848]

bricks/home-bricks-longevity.log:

[2019-01-21 18:43:03.641697] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 9630 times between [2019-01-21 18:43:03.641697] and [2019-01-21 18:43:05.598147]
[2019-01-21 18:43:05.618000] I [MSGID: 115062] [server-rpc-fops_v2.c:1133:server4_link_cbk] 0-longevity-server: 85655: LINK /bene/d0009/d0049/f000000296 (e568165b-c337-4168-9583-c8b02a54257a) -> bb08e1f6-3828-45c6-8c75-aa475a3eeacc/f000000037, client: CTX_ID:20d0af45-d4ba-48a7-8898-d131ea327af7-GRAPH_ID:0-PID:15008-HOST:rhs-srv-08.storage-dev.lab.eng.bos.redhat.com-PC_NAME:longevity-client-3-RECON_NO:-0, error-xlator: - [File exists]
[2019-01-21 18:43:05.618883] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 139 times between [2019-01-21 18:43:05.618883] and [2019-01-21 18:43:05.637162]
[2019-01-21 18:43:05.731700] E [MSGID: 113039] [posix-inode-fd-ops.c:1457:posix_open] 0-longevity-posix: open on /home/bricks/longevity/.glusterfs/82/3c/823c46d5-91fc-401a-a77d-b35543ebccbe, flags: 1 [No such file or directory]
[2019-01-21 18:43:06.744458] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 21 times between [2019-01-21 18:43:06.744458] and [2019-01-21 18:43:06.750397]
[2019-01-21 18:43:06.750419] E [MSGID: 113001] [posix-inode-fd-ops.c:3266:posix_getxattr] 0-longevity-posix: getxattr failed on /home/bricks/longevity/.glusterfs/82/3c/823c46d5-91fc-401a-a77d-b35543ebccbe: trusted.glusterfs.dht.linkto [No such file or directory]
[2019-01-21 18:43:06.750491] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 3669 times between [2019-01-21 18:43:06.750491] and [2019-01-21 18:43:07.539439]
and just as before, about 1/2 hour later...

    fd = os.open(fn, os.O_WRONLY)
OSError: [Errno 107] Transport endpoint is not connected: './d0042/d0045/f000000317'
thrd.05 2019-01-21 14:00:13,333 - ERROR - append: ./d0042/d0045/f000000317 syscall errno 107(Transport endpoint is not connected)
thrd.05 2019-01-21 14:00:13,333 - ERROR - [Errno 107] Transport endpoint is not connected: './d0015'
Traceback (most recent call last):
  File "/root/bene/fs-drift/fsop.py", line 314, in op_create
    os.makedirs(subdir)
  File "/usr/lib64/python2.7/os.py", line 150, in makedirs
    makedirs(head, mode)
  File "/usr/lib64/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 107] Transport endpoint is not connected: './d0015'
thrd.05 2019-01-21 14:00:13,334 - ERROR - dir create: ./d0015/d0016/f000000368 syscall errno 107(Transport endpoint is not connected)
thrd.05 2019-01-21 14:00:13,334 - ERROR - [Errno 107] Transport endpoint is not connected: './d0001'
Traceback (most recent call last):

[root@rhs-srv-09 fs-drift]# ansible -i ../host.inv -m shell -a 'ls -ltr /core*' all
rhs-srv-10 | CHANGED | rc=0 >>
...
-rw------- 1 root root 164057088 Jan 21 14:00 /core.6658

I'll try to fix fs-drift so that it doesn't keep logging errors like this one if you tell it to ignore stale file handles:

OSError: [Errno 116] Stale file handle
thrd.01 2019-01-21 13:44:23,008 - ERROR - close: ./d0020/d0039/f000000103 syscall errno 116(Stale file handle)
thrd.02 2019-01-21 13:45:14,640 - ERROR - [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/root/bene/fs-drift/fsop.py", line 199, in try_to_close
    os.close(closefd)
Here is a new core dump, so this is definitely reproducible.

[root@rhs-srv-09 ~]# ansible -i ~/bene/host.inv -m shell -a 'ls -ltr /core* | grep "Jan 21"' all
...
rhs-srv-14 | CHANGED | rc=0 >>
-rw------- 1 root root 206307328 Jan 21 17:16 /core.6676

The fs-drift logging problem mentioned above is fixed.
Raising priority to high because the entire cluster locked up: all the fs-drift processes on all hosts enter "D" state (ps awux shows "D") and can't be killed, even with -9. Example:

root 21762 0.0 0.1 274592 20772 ? D 17:12 0:00 python /root/bene/fs-drift/fs-drift-remote.py --network-sync-dir /mnt/longevity/bene/network-shared --as-host rhs-srv-08

And unmount fails:

[root@rhs-srv-08 ~]# umount -f /mnt/longevity
umount: /mnt/longevity: target is busy.
        (In some cases useful info about processes that use the device is found by lsof(8) or fuser(1))

Literally the only thing I can do at that point is reboot!

One next step would be to change the workload mix, removing operation types until this stops happening, to isolate the operation mix that makes it happen. You could also remove the sharding translator and see if it still happens. I'm not going to do that; I have to return to my day job. But if someone wants to run fs-drift and try to isolate this problem further, I can provide assistance with fs-drift. Here's some documentation I wrote about it:

https://docs.google.com/presentation/d/10DpxUCVC4dvZKwcEov4TEKbbS26DdiZ2L-EMR5-6SGo/edit#slide=id.p
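A couple of things that may help the next person who hits the hang (a sketch only; whether they work at all depends on how wedged the fuse client is):

# a lazy unmount sometimes detaches the dead mountpoint without a reboot
[root@rhs-srv-08 ~]# umount -l /mnt/longevity

# the kernel stacks of the D-state processes show where they are blocked (run as root)
[root@rhs-srv-08 ~]# for pid in $(ps -eo pid,stat | awk '$2 ~ /^D/ {print $1}'); do
>   echo "== pid $pid =="; cat /proc/$pid/stack
> done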
clarifying bz title
Status?