Created attachment 1421310 [details] /var/log/message logs Description of problem: Crefi IO's failed with below errors while MDS was in starting state. > "OSError: [Errno 108] Cannot send after transport endpoint shutdown: '/mnt/cephfs/test4'" Crefi IO profile: for i in create rename chmod chown chgrp symlink hardlink setxattr truncate; do sudo ./crefi.py --multi -b 10 -d 2 -n 10000 -t sparse --random --min=1K --max=64K /mnt/nfs/@#dir8/ ; done FUSE assert: ceph version 12.2.4-6.el7cp (78f60b924802e34d44f7078029a40dbe6c0c922f) luminous (stable) 1: (()+0x66d2e1) [0x5652343a62e1] 2: (()+0xf680) [0x7f077da88680] 3: (gsignal()+0x37) [0x7f077c894207] 4: (abort()+0x148) [0x7f077c8958f8] 5: (Client::_do_remount()+0x2b1) [0x565233fc8d11] 6: (Context::complete(int)+0x9) [0x565233fccac9] 7: (Finisher::finisher_thread_entry()+0x198) [0x565234031428] 8: (()+0x7dd5) [0x7f077da80dd5] 9: (clone()+0x6d) [0x7f077c95cb3d] Call trace in /var/log/messages: Apr 11 07:31:26 magna117 kernel: INFO: task crefi.py:16811 blocked for more than 120 seconds. Apr 11 07:31:26 magna117 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Apr 11 07:31:26 magna117 kernel: crefi.py D ffff92f01a752f70 0 16811 1 0x00000084 Apr 11 07:31:26 magna117 kernel: Call Trace: Apr 11 07:31:26 magna117 kernel: [<ffffffffaa312f49>] schedule+0x29/0x70 Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a394e5>] __fuse_request_send+0xf5/0x2e0 [fuse] Apr 11 07:31:26 magna117 kernel: [<ffffffffa9cbbe20>] ? wake_up_atomic_t+0x30/0x30 Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a396e2>] fuse_request_send+0x12/0x20 [fuse] Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a3e8b1>] fuse_dentry_revalidate+0x1c1/0x310 [fuse] Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e26e7a>] lookup_fast+0x1da/0x230 Apr 11 07:31:26 magna117 kernel: [<ffffffffa9ed0632>] ? security_inode_permission+0x22/0x30 Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e28463>] link_path_walk+0x1b3/0x8b0 Apr 11 07:31:26 magna117 kernel: [<ffffffffa9df7606>] ? kmem_cache_alloc_trace+0x1d6/0x200 Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e2bca5>] path_openat+0xb5/0x640 Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a39771>] ? flush_bg_queue+0x81/0xa0 [fuse] Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e2d85d>] do_filp_open+0x4d/0xb0 Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e3acd7>] ? __alloc_fd+0x47/0x170 Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e19b67>] do_sys_open+0x137/0x240 Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e19c8e>] SyS_open+0x1e/0x20 Apr 11 07:31:26 magna117 kernel: [<ffffffffaa31f7d5>] system_call_fastpath+0x1c/0x21 Apr 11 07:31:26 magna117 kernel: INFO: task crefi.py:18471 blocked for more than 120 seconds. Apr 11 07:31:26 magna117 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Apr 11 07:31:26 magna117 kernel: crefi.py D ffff92f005e16eb0 0 18471 1 0x00000084 Apr 11 07:31:26 magna117 kernel: Call Trace: Apr 11 07:31:26 magna117 kernel: [<ffffffffaa312f49>] schedule+0x29/0x70 Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a394e5>] __fuse_request_send+0xf5/0x2e0 [fuse] Apr 11 07:31:26 magna117 kernel: [<ffffffffa9cbbe20>] ? wake_up_atomic_t+0x30/0x30 Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a396e2>] fuse_request_send+0x12/0x20 [fuse] Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a3eb0a>] fuse_do_getattr+0x10a/0x2d0 [fuse] Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a3fc25>] fuse_update_attributes+0x75/0x80 [fuse] Apr 11 07:31:26 magna117 kernel: [<ffffffffc0a3fc73>] fuse_getattr+0x43/0x50 [fuse] Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e20649>] vfs_getattr+0x49/0x80 Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e20775>] vfs_fstatat+0x75/0xc0 Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e20cce>] SYSC_newstat+0x2e/0x60 Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e1ad78>] ? vfs_write+0x168/0x1f0 Apr 11 07:31:26 magna117 kernel: [<ffffffffa9d2f3d6>] ? __audit_syscall_exit+0x1e6/0x280 Apr 11 07:31:26 magna117 kernel: [<ffffffffa9e20fae>] SyS_newstat+0xe/0x10 Apr 11 07:31:26 magna117 kernel: [<ffffffffaa31f7d5>] system_call_fastpath+0x1c/0x21 Apr 11 07:33:26 magna117 kernel: INFO: task crefi.py:16811 blocked for more than 120 seconds. Apr 11 07:33:26 magna117 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Apr 11 07:33:26 magna117 kernel: crefi.py D ffff92f01a752f70 0 16811 1 0x00000084 Version-Release number of selected component (if applicable): 12.2.4-6.el7cp How reproducible: 1/3 Steps to Reproduce: 1. Ran crefi workload 2. Manually upgrade MDS using steps in https://github.com/ceph/ceph/pull/21263/files 3. After increasing max_mds to 2 at last step(as per link in above step), MDS was in starting state for long time. Actual results: Client IO failed with transport endpoint shutdown message Expected results: Client IO should not fail and Fuse service should not crash Additional info: NA
Created attachment 1421311 [details] fuse client log
added doc text
Only found one mds crashed in 'staring' state, which is duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1566016.
Patrick and Zheng, is there anything else for engineering to do on this BZ, or should we mark as CLOSED -> DUPLICATE for 1566016?