Bug 1561999 - rm command hangs in fuse_request_send
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: distribute
Version: 3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Release: RHGS 3.4.0
Assigned To: Raghavendra G
QA Contact: Prasad Desala
Blocks: 1503137, 1574305, 1574798
Reported: 2018-03-29 06:34 EDT by Prasad Desala
Modified: 2018-09-17 09:25 EDT
Fixed In Version: glusterfs-3.12.2-9
Doc Type: If docs needed, set a value
Clones: 1574305
Last Closed: 2018-09-04 02:45:40 EDT
Type: Bug

Attachments
client statedump (239.59 KB, application/x-gzip), 2018-03-31 05:52 EDT, Raghavendra G

External Trackers
Red Hat Product Errata RHSA-2018:2607 (Last Updated: 2018-09-04 02:46 EDT)
Description Prasad Desala 2018-03-29 06:34:15 EDT
Description of problem:
=======================
An rm -rvf * command on a FUSE mount hung, and has been stuck in the same state for more than an hour and a half. The volume is mounted on 4 clients, but rm was issued from only one client.

Client dump:
=============
Mar 29 15:32:37 dhcp41-254 kernel: rm              D ffff927aa92adee0     0 13643  13625 0x00000084
Mar 29 15:32:37 dhcp41-254 kernel: Call Trace:
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffff87912f49>] schedule+0x29/0x70
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffffc066c4e5>] __fuse_request_send+0xf5/0x2e0 [fuse]
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffff872bbe20>] ? wake_up_atomic_t+0x30/0x30
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffffc066c6e2>] fuse_request_send+0x12/0x20 [fuse]
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffffc06727eb>] fuse_lookup_name+0x12b/0x2c0 [fuse]
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffffc06729ce>] fuse_lookup+0x4e/0x130 [fuse]
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffff874259b3>] lookup_real+0x23/0x60
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffff874263c2>] __lookup_hash+0x42/0x60
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffff8790ac07>] lookup_slow+0x42/0xa7
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffff8742a248>] path_lookupat+0x838/0x8b0
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffff872cebfc>] ? try_to_wake_up+0x18c/0x350
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffff8742af9d>] ? do_last+0x66d/0x12c0
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffff873f7275>] ? kmem_cache_alloc+0x35/0x1f0
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffff8742c58f>] ? getname_flags+0x4f/0x1a0
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffff8742a2eb>] filename_lookup+0x2b/0xc0
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffff8742d727>] user_path_at_empty+0x67/0xc0
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffffc0674118>] ? fuse_file_put+0x48/0x80 [fuse]
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffff8742d791>] user_path_at+0x11/0x20
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffff87420763>] vfs_fstatat+0x63/0xc0
Mar 29 15:32:37 dhcp41-254 kernel: [<ffffffff87420d84>] SYSC_newfstatat+0x24/0x60
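The trace shows rm parked in D state inside __fuse_request_send: the kernel has queued a LOOKUP for the userspace filesystem and is waiting for a reply that never arrives. A rough sketch of that wait follows; it is illustrative only, not the actual fs/fuse/dev.c source, and queue_request/wait_answer are hypothetical helper names:

    /* Illustrative sketch: the kernel hands the LOOKUP to /dev/fuse
     * (picked up by glusterfs via readv(), see Thread 3 below) and
     * then sleeps until the daemon writes a reply. If the reply never
     * comes, the caller stays blocked indefinitely, which is exactly
     * where rm is stuck above. */
    static void fuse_request_send_sketch (struct fuse_conn *fc,
                                          struct fuse_req *req)
    {
            queue_request (fc, req);  /* hand the LOOKUP to /dev/fuse */
            wait_answer (fc, req);    /* sleep until the daemon replies */
    }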

fuse mount bt:
==============

(gdb) thread apply all bt

Thread 11 (Thread 0x7fc24f126700 (LWP 13051)):
#0  0x00007fc256bbbeed in nanosleep () from /lib64/libpthread.so.0
#1  0x00007fc257d64126 in gf_timer_proc (data=0x56104907fb00) at timer.c:165
#2  0x00007fc256bb4dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc25647db3d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7fc24e925700 (LWP 13052)):
#0  0x00007fc256bbc411 in sigwait () from /lib64/libpthread.so.0
#1  0x0000561047e3a40b in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2149
#2  0x00007fc256bb4dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc25647db3d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7fc24e124700 (LWP 13053)):
#0  0x00007fc2564444fd in nanosleep () from /lib64/libc.so.6
#1  0x00007fc256444394 in sleep () from /lib64/libc.so.6
#2  0x00007fc257d7e93d in pool_sweeper (arg=<optimized out>) at mem-pool.c:481
#3  0x00007fc256bb4dd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fc25647db3d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fc24d923700 (LWP 13054)):
#0  0x00007fc256bb8cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc257d91568 in syncenv_task (proc=proc@entry=0x561049080320) at syncop.c:603
#2  0x00007fc257d92430 in syncenv_processor (thdata=0x561049080320) at syncop.c:695
#3  0x00007fc256bb4dd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fc25647db3d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fc24d122700 (LWP 13055)):
#0  0x00007fc256bb8cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc257d91568 in syncenv_task (proc=proc@entry=0x5610490806e0) at syncop.c:603
#2  0x00007fc257d92430 in syncenv_processor (thdata=0x5610490806e0) at syncop.c:695
#3  0x00007fc256bb4dd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fc25647db3d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fc24ae64700 (LWP 13057)):
#0  0x00007fc25647e113 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fc257db3252 in event_dispatch_epoll_worker (data=0x5610490a6990) at event-epoll.c:649
#2  0x00007fc256bb4dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc25647db3d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fc2580d8700 (LWP 13058)):
#0  0x00007fc256bb8cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc249088d3c in iot_worker (data=0x7fc2440b2dc0) at io-threads.c:193
#2  0x00007fc256bb4dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc25647db3d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fc242d80700 (LWP 13059)):
#0  0x00007fc25647e113 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fc257db3252 in event_dispatch_epoll_worker (data=0x7fc244164b80) at event-epoll.c:649
#2  0x00007fc256bb4dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc25647db3d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fc242240700 (LWP 13062)):
#0  0x00007fc2564745f0 in readv () from /lib64/libc.so.6
#1  0x00007fc257d7fd05 in sys_readv (fd=<optimized out>, iov=<optimized out>, iovcnt=<optimized out>) at syscall.c:295
#2  0x00007fc24f144a3a in fuse_thread_proc (data=0x56104906c8b0) at fuse-bridge.c:4984
#3  0x00007fc256bb4dd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fc25647db3d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fc241a3f700 (LWP 13063)):
#0  0x00007fc256bb8945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc24f12f953 in notify_kernel_loop (data=<optimized out>) at fuse-bridge.c:4004
#2  0x00007fc256bb4dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc25647db3d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fc25823a780 (LWP 13050)):
#0  0x00007fc256bb5f47 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fc257db39b8 in event_dispatch_epoll (event_pool=0x561049065210) at event-epoll.c:746
#2  0x0000561047e370e7 in main (argc=4, argv=<optimized out>) at glusterfsd.c:2562


Version-Release number of selected component (if applicable):
3.12.2-6.el7rhgs.x86_64

How reproducible:
Reported at first occurrence

Steps to Reproduce:
==================
Not sure how to reproduce this issue; as usual, I was trying to remove the data set on the mount point. I have left the system in the same state so it can be used for live debugging. Please see the comments below for system details.

Dataset on mount point: 5000 cat files, linux-4.9.27.tar.gz, linux-4.9.27 (kernel untar)
 
Actual results:
===============
rm hung.

Expected results:
================
rm should complete without hanging.
Comment 3 Raghavendra G 2018-03-30 00:44:07 EDT
Prasad,

Can you please attach statedumps of client and bricks?

regards,
Raghavendra
Comment 6 Raghavendra G 2018-03-31 05:52 EDT
Created attachment 1415446 [details]
client statedump
Comment 10 Raghavendra G 2018-04-04 06:24:42 EDT
Note that dht_selfheal_dir_mkdir doesn't call dht_selfheal_dir_finish on failures, so any failure in dht_selfheal_dir_mkdir results in a hung frame. Also, dht_protect_namespace requires a non-NULL hashed subvolume; if it is absent, it returns a failure, which in turn causes dht_selfheal_dir_mkdir to fail.

I found dht_protect_namespace logging errors in the rmdir codepath:

[2018-03-29 08:47:05.511404] W [MSGID: 109011] [dht-layout.c:186:dht_layout_search] 0-distrepx3-dht: no subvolume for hash (value) = 2788654309
[2018-03-29 08:47:05.511640] W [MSGID: 109011] [dht-common.c:8247:dht_rmdir_do] 0-distrepx3-dht: Failed to get hashed subvol for /1/linux-4.9.27/arch/arm/mach-bcm (gfid = e27e7994-31bc-4c71-be27-337e39d16401)
[2018-03-29 08:47:05.515732] E [dht-lock.c:1271:dht_protect_namespace] (-->/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x521a7) [0x7fc249f431a7] -->/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x5029e) [0x7fc249f4129e] -->/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x79395) [0x7fc249f6a395] ) 0-distrepx3-dht: invalid argument: subvol [Invalid argument]
[2018-03-29 08:47:05.515882] W [fuse-bridge.c:1457:fuse_unlink_cbk] 0-glusterfs-fuse: 5360: RMDIR() /1/linux-4.9.27/arch/arm/mach-bcm => -1 (Invalid argument)
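For context, the failing lookup above comes from DHT's layout search: each subvolume owns a contiguous hash range, and a name whose hash falls into a hole in the layout maps to no subvolume. A simplified, self-contained sketch of that logic (the stand-in types are mine; the real definitions live in dht-layout.h):

    #include <stdint.h>
    #include <stddef.h>

    typedef struct xlator xlator_t;

    typedef struct {
            struct {
                    uint32_t  start;
                    uint32_t  stop;
                    xlator_t *xlator;
            } list[16];
            int cnt;
    } dht_layout_sketch_t;

    xlator_t *
    dht_layout_search_sketch (dht_layout_sketch_t *layout, uint32_t hash)
    {
            for (int i = 0; i < layout->cnt; i++) {
                    if (hash >= layout->list[i].start &&
                        hash <= layout->list[i].stop)
                            return layout->list[i].xlator;
            }
            /* Hole in the layout: caller logs "no subvolume for hash". */
            return NULL;
    }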

I suspect that a similar condition, not being able to find the hashed subvolume, caused a failure in dht_selfheal_dir_mkdir and hence a hung frame. Will send a patch to fix this.
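A minimal sketch of the failure pattern and the intended fix; the control flow and signatures here are abridged assumptions for illustration, not the actual patch:

    /* If the hashed subvolume can't be found, the buggy code returns
     * without unwinding, leaving the frame alive and the namespace
     * lock held -- a hung frame. The fix is to tear down via
     * dht_selfheal_dir_finish() before bailing out. */
    int
    dht_selfheal_dir_mkdir_sketch (call_frame_t *frame, loc_t *loc)
    {
            xlator_t *hashed = dht_subvol_get_hashed (THIS, loc);

            if (hashed == NULL) {
                    /* Fixed path: unwind and release locks on failure.
                     * (Signature abridged.) */
                    dht_selfheal_dir_finish (frame, THIS, -1);
                    return -1;
            }

            /* ... proceed with the directory self-heal ... */
            return 0;
    }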
Comment 11 Raghavendra G 2018-04-04 09:54:48 EDT
[root@xyz glusterfs]# grep "hashed subvolume not found" mnt-distrepx3.log*
mnt-distrepx3.log-20180401:[2018-03-29 08:56:34.746447] W [MSGID: 109011] [dht-selfheal.c:1531:dht_selfheal_dir_mkdir] 0-distrepx3-dht: ( O���=B������X��/block) (path: /1/linux-4.9.27/block): hashed subvolume not found [Invalid argument]

This confirms that the hung frame comes from dht_selfheal_dir_mkdir and that the hypothesis is correct.
Comment 14 Raghavendra G 2018-05-03 00:11:57 EDT
https://review.gluster.org/#/c/19953/
Comment 17 Prasad Desala 2018-07-11 05:41:43 EDT
On 3.12.2-13.el7rhgs.x86_64, ran rm -rf * on the FUSE mount multiple times and did not see any rm hangs.

Moving the BZ to Verified.
Comment 19 errata-xmlrpc 2018-09-04 02:45:40 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607
