1567030 – [Cephfs:Fuse]: Fuse service stopped and crefi IO failed during MDS in starting state.

Bug 1567030 - [Cephfs:Fuse]: Fuse service stopped and crefi IO failed during MDS in starting state.

Summary: [Cephfs:Fuse]: Fuse service stopped and crefi IO failed during MDS in startin...

Keywords:
Status:	CLOSED DUPLICATE of bug 1566016
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	CephFS
Sub Component:
Version:	3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	z5
Target Release:	3.0
Assignee:	Yan, Zheng
QA Contact:	ceph-qe-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1557269
TreeView+	depends on / blocked

Reported:	2018-04-13 10:14 UTC by Ramakrishnan Periyasamy
Modified:	2018-06-05 19:52 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	Known Issue
Doc Text:	.Client I/O sometimes fails for CephFS FUSE clients Client I/O sometimes fails for Ceph File System (CephFS) as a File System in User Space (FUSE) clients with the error `transport endpoint shutdown` due to assert in the FUSE service. To workaround this issues, unmount and then remount CephFS FUSE, and then start the client I/Os.
Clone Of:
Environment:
Last Closed:	2018-06-05 19:52:16 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
/var/log/message logs (861.94 KB, text/plain) 2018-04-13 10:14 UTC, Ramakrishnan Periyasamy	no flags	Details
fuse client log (934.76 KB, text/plain) 2018-04-13 10:14 UTC, Ramakrishnan Periyasamy	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1566016	0	high	CLOSED	[cephfs]: MDS asserted while in Starting/resolve state	2021-09-09 13:44:05 UTC
Red Hat Bugzilla	1576908	0	high	CLOSED	[CephFS]: Client IO's hung Fuse service asserted with error FAILED assert(oset.objects.empty()	2021-02-22 00:41:40 UTC

Internal Links: 1566016 1576908

Description Ramakrishnan Periyasamy 2018-04-13 10:14:07 UTC

Created attachment 1421310 [details]
/var/log/message logs

Description of problem:

Crefi IO's failed with below errors while MDS was in starting state.
> "OSError: [Errno 108] Cannot send after transport endpoint shutdown: '/mnt/cephfs/test4'"

Crefi IO profile:
 for i in create rename chmod chown chgrp symlink hardlink setxattr truncate; do sudo ./crefi.py --multi -b 10 -d 2 -n 10000 -t sparse --random --min=1K --max=64K /mnt/nfs/@#dir8/ ; done

FUSE assert:
ceph version 12.2.4-6.el7cp (78f60b924802e34d44f7078029a40dbe6c0c922f) luminous (stable)
 1: (()+0x66d2e1) [0x5652343a62e1]
 2: (()+0xf680) [0x7f077da88680]
 3: (gsignal()+0x37) [0x7f077c894207]
 4: (abort()+0x148) [0x7f077c8958f8]
 5: (Client::_do_remount()+0x2b1) [0x565233fc8d11]
 6: (Context::complete(int)+0x9) [0x565233fccac9]
 7: (Finisher::finisher_thread_entry()+0x198) [0x565234031428]
 8: (()+0x7dd5) [0x7f077da80dd5]
 9: (clone()+0x6d) [0x7f077c95cb3d]

Call trace in /var/log/messages:
Apr 11 07:31:26 magna117 kernel: INFO: task crefi.py:16811 blocked for more than 120 seconds.
Apr 11 07:31:26 magna117 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 07:31:26 magna117 kernel: crefi.py        D ffff92f01a752f70     0 16811      1 0x00000084
Apr 11 07:31:26 magna117 kernel: Call Trace:
Apr 11 07:31:26 magna117 kernel: [<ffffffffaa312f49>] schedule+0x29/0x70
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a394e5>] __fuse_request_send+0xf5/0x2e0 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9cbbe20>] ? wake_up_atomic_t+0x30/0x30
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a396e2>] fuse_request_send+0x12/0x20 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a3e8b1>] fuse_dentry_revalidate+0x1c1/0x310 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e26e7a>] lookup_fast+0x1da/0x230
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9ed0632>] ? security_inode_permission+0x22/0x30
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e28463>] link_path_walk+0x1b3/0x8b0
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9df7606>] ? kmem_cache_alloc_trace+0x1d6/0x200
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e2bca5>] path_openat+0xb5/0x640
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a39771>] ? flush_bg_queue+0x81/0xa0 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e2d85d>] do_filp_open+0x4d/0xb0
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e3acd7>] ? __alloc_fd+0x47/0x170
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e19b67>] do_sys_open+0x137/0x240
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e19c8e>] SyS_open+0x1e/0x20
Apr 11 07:31:26 magna117 kernel: [<ffffffffaa31f7d5>] system_call_fastpath+0x1c/0x21
Apr 11 07:31:26 magna117 kernel: INFO: task crefi.py:18471 blocked for more than 120 seconds.
Apr 11 07:31:26 magna117 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 07:31:26 magna117 kernel: crefi.py        D ffff92f005e16eb0     0 18471      1 0x00000084
Apr 11 07:31:26 magna117 kernel: Call Trace:
Apr 11 07:31:26 magna117 kernel: [<ffffffffaa312f49>] schedule+0x29/0x70
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a394e5>] __fuse_request_send+0xf5/0x2e0 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9cbbe20>] ? wake_up_atomic_t+0x30/0x30
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a396e2>] fuse_request_send+0x12/0x20 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a3eb0a>] fuse_do_getattr+0x10a/0x2d0 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a3fc25>] fuse_update_attributes+0x75/0x80 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a3fc73>] fuse_getattr+0x43/0x50 [fuse]
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e20649>] vfs_getattr+0x49/0x80
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e20775>] vfs_fstatat+0x75/0xc0
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e20cce>] SYSC_newstat+0x2e/0x60
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e1ad78>] ? vfs_write+0x168/0x1f0
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9d2f3d6>] ? __audit_syscall_exit+0x1e6/0x280
Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e20fae>] SyS_newstat+0xe/0x10
Apr 11 07:31:26 magna117 kernel: [<ffffffffaa31f7d5>] system_call_fastpath+0x1c/0x21
Apr 11 07:33:26 magna117 kernel: INFO: task crefi.py:16811 blocked for more than 120 seconds.
Apr 11 07:33:26 magna117 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 07:33:26 magna117 kernel: crefi.py        D ffff92f01a752f70     0 16811      1 0x00000084


Version-Release number of selected component (if applicable):
12.2.4-6.el7cp

How reproducible:
1/3

Steps to Reproduce:
1. Ran crefi workload 
2. Manually upgrade MDS using steps in  https://github.com/ceph/ceph/pull/21263/files
3. After increasing max_mds to 2 at last step(as per link in above step), MDS was in starting state for long time.

Actual results:
Client IO failed with transport endpoint shutdown message

Expected results:
Client IO should not fail and Fuse service should not crash

Additional info:
NA

Comment 3 Ramakrishnan Periyasamy 2018-04-13 10:14:47 UTC

Created attachment 1421311 [details]
fuse client log

Comment 7 Ramakrishnan Periyasamy 2018-04-20 16:18:41 UTC

added doc text

Comment 27 Yan, Zheng 2018-05-23 04:02:52 UTC

Only found one mds crashed in 'staring' state, which is duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1566016.

Comment 28 Ken Dreyer (Red Hat) 2018-05-31 22:51:59 UTC

Patrick and Zheng, is there anything else for engineering to do on this BZ, or should we mark as CLOSED -> DUPLICATE for 1566016?

Note You need to log in before you can comment on or make changes to this bug.