Bug 1567030 - [Cephfs:Fuse]: Fuse service stopped and crefi IO failed during MDS in starting state.
Summary: [Cephfs:Fuse]: Fuse service stopped and crefi IO failed during MDS in starting state.
Status: CLOSED DUPLICATE of bug 1566016
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: CephFS
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: z5
Target Release: 3.0
Assignee: Yan, Zheng
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Keywords: Reopened
Depends On:
Blocks: 1557269
 
Reported: 2018-04-13 10:14 UTC by Ramakrishnan Periyasamy
Modified: 2018-06-05 19:52 UTC
CC: 7 users

.Client I/O sometimes fails for CephFS FUSE clients

Client I/O sometimes fails for Ceph File System (CephFS) File System in User Space (FUSE) clients with the error `transport endpoint shutdown` due to an assert in the FUSE service. To work around this issue, unmount and then remount the CephFS FUSE client, and then restart the client I/O.
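
A minimal sketch of the workaround, assuming a `ceph-fuse` mount at `/mnt/cephfs` (the mount point used in this report) and that the monitors are reachable via /etc/ceph/ceph.conf:

 fusermount -u /mnt/cephfs    # unmount the FUSE client; use -uz for a lazy unmount if the mount point is stuck
 ceph-fuse /mnt/cephfs        # remount the CephFS FUSE client
 # restart the client I/O (for example, the crefi.py run) once the remount succeeds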
Clone Of:
Last Closed: 2018-06-05 19:52:16 UTC


Attachments
/var/log/message logs (861.94 KB, text/plain)
2018-04-13 10:14 UTC, Ramakrishnan Periyasamy
fuse client log (934.76 KB, text/plain)
2018-04-13 10:14 UTC, Ramakrishnan Periyasamy


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Bugzilla 1566016 None None None Never
Red Hat Bugzilla 1576908 None None None Never

Internal Trackers: 1566016 1576908

Description Ramakrishnan Periyasamy 2018-04-13 10:14:07 UTC
Created attachment 1421310 [details]
/var/log/message logs

Description of problem:

Crefi I/Os failed with the error below while the MDS was in the starting state.
> "OSError: [Errno 108] Cannot send after transport endpoint shutdown: '/mnt/cephfs/test4'"
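
For reference, errno 108 is ESHUTDOWN on Linux; a quick way to confirm the mapping on the client (assumes a python interpreter is available):

 python -c 'import errno, os; print(os.strerror(errno.ESHUTDOWN))'   # errno.ESHUTDOWN == 108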

Crefi IO profile:
 for i in create rename chmod chown chgrp symlink hardlink setxattr truncate; do sudo ./crefi.py --multi -b 10 -d 2 -n 10000 -t sparse --random --min=1K --max=64K /mnt/nfs/@#dir8/ ; done

FUSE assert:
ceph version 12.2.4-6.el7cp (78f60b924802e34d44f7078029a40dbe6c0c922f) luminous (stable)
 1: (()+0x66d2e1) [0x5652343a62e1]
 2: (()+0xf680) [0x7f077da88680]
 3: (gsignal()+0x37) [0x7f077c894207]
 4: (abort()+0x148) [0x7f077c8958f8]
 5: (Client::_do_remount()+0x2b1) [0x565233fc8d11]
 6: (Context::complete(int)+0x9) [0x565233fccac9]
 7: (Finisher::finisher_thread_entry()+0x198) [0x565234031428]
 8: (()+0x7dd5) [0x7f077da80dd5]
 9: (clone()+0x6d) [0x7f077c95cb3d]

Call trace in /var/log/messages:
Apr 11 07:31:26 magna117 kernel: INFO: task crefi.py:16811 blocked for more than 120 seconds.
Apr 11 07:31:26 magna117 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 07:31:26 magna117 kernel: crefi.py        D ffff92f01a752f70     0 16811      1 0x00000084
Apr 11 07:31:26 magna117 kernel: Call Trace:
Apr 11 07:31:26 magna117 kernel: [<ffffffffaa312f49>] schedule+0x29/0x70
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a394e5>] __fuse_request_send+0xf5/0x2e0 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9cbbe20>] ? wake_up_atomic_t+0x30/0x30
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a396e2>] fuse_request_send+0x12/0x20 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a3e8b1>] fuse_dentry_revalidate+0x1c1/0x310 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e26e7a>] lookup_fast+0x1da/0x230
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9ed0632>] ? security_inode_permission+0x22/0x30
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e28463>] link_path_walk+0x1b3/0x8b0
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9df7606>] ? kmem_cache_alloc_trace+0x1d6/0x200
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e2bca5>] path_openat+0xb5/0x640
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a39771>] ? flush_bg_queue+0x81/0xa0 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e2d85d>] do_filp_open+0x4d/0xb0
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e3acd7>] ? __alloc_fd+0x47/0x170
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e19b67>] do_sys_open+0x137/0x240
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e19c8e>] SyS_open+0x1e/0x20
Apr 11 07:31:26 magna117 kernel: [<ffffffffaa31f7d5>] system_call_fastpath+0x1c/0x21
Apr 11 07:31:26 magna117 kernel: INFO: task crefi.py:18471 blocked for more than 120 seconds.
Apr 11 07:31:26 magna117 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 07:31:26 magna117 kernel: crefi.py        D ffff92f005e16eb0     0 18471      1 0x00000084
Apr 11 07:31:26 magna117 kernel: Call Trace:
Apr 11 07:31:26 magna117 kernel: [<ffffffffaa312f49>] schedule+0x29/0x70
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a394e5>] __fuse_request_send+0xf5/0x2e0 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9cbbe20>] ? wake_up_atomic_t+0x30/0x30
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a396e2>] fuse_request_send+0x12/0x20 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a3eb0a>] fuse_do_getattr+0x10a/0x2d0 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a3fc25>] fuse_update_attributes+0x75/0x80 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a3fc73>] fuse_getattr+0x43/0x50 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e20649>] vfs_getattr+0x49/0x80
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e20775>] vfs_fstatat+0x75/0xc0
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e20cce>] SYSC_newstat+0x2e/0x60
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e1ad78>] ? vfs_write+0x168/0x1f0
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9d2f3d6>] ? __audit_syscall_exit+0x1e6/0x280
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e20fae>] SyS_newstat+0xe/0x10
Apr 11 07:31:26 magna117 kernel: [<ffffffffaa31f7d5>] system_call_fastpath+0x1c/0x21
Apr 11 07:33:26 magna117 kernel: INFO: task crefi.py:16811 blocked for more than 120 seconds.
Apr 11 07:33:26 magna117 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 07:33:26 magna117 kernel: crefi.py        D ffff92f01a752f70     0 16811      1 0x00000084


Version-Release number of selected component (if applicable):
12.2.4-6.el7cp

How reproducible:
1/3

Steps to Reproduce:
1. Ran the Crefi workload.
2. Manually upgraded the MDS using the steps in https://github.com/ceph/ceph/pull/21263/files
3. After increasing max_mds to 2 in the last step (as per the link above), the MDS stayed in the starting state for a long time; see the sketch below.
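
A minimal sketch of the commands behind steps 2-3, assuming the file system is named 'cephfs' (the actual name is not shown in this report):

 ceph fs set cephfs max_mds 2   # allow a second active MDS, the final upgrade step
 ceph mds stat                  # watch the new rank; in this run it stayed in 'starting'
 ceph fs status                 # show per-rank state and standbys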

Actual results:
Client I/O failed with a 'transport endpoint shutdown' message.

Expected results:
Client I/O should not fail and the FUSE service should not crash.

Additional info:
NA

Comment 3 Ramakrishnan Periyasamy 2018-04-13 10:14 UTC
Created attachment 1421311 [details]
fuse client log

Comment 7 Ramakrishnan Periyasamy 2018-04-20 16:18:41 UTC
added doc text

Comment 27 Yan, Zheng 2018-05-23 04:02:52 UTC
I only found one MDS crash in the 'starting' state, which is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1566016.

Comment 28 Ken Dreyer (Red Hat) 2018-05-31 22:51:59 UTC
Patrick and Zheng, is there anything else for engineering to do on this BZ, or should we mark as CLOSED -> DUPLICATE for 1566016?

