Bug 1576908 - [CephFS]: Client IOs hung, Fuse service asserted with error FAILED assert(oset.objects.empty())
Summary: [CephFS]: Client IOs hung, Fuse service asserted with error FAILED assert(oset.objects.empty())
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: z4
Target Release: 3.0
Assignee: Yan, Zheng
QA Contact: Ramakrishnan Periyasamy
Docs Contact: Aron Gunn
URL:
Whiteboard:
Depends On:
Blocks: 1576030 1585029
 
Reported: 2018-05-10 16:47 UTC by Ramakrishnan Periyasamy
Modified: 2018-07-11 18:12 UTC
CC List: 7 users

Fixed In Version: RHEL: ceph-12.2.4-22.el7cp Ubuntu: 12.2.4-27redhat1xenial
Doc Type: Known Issue
Doc Text:
.Client I/O sometimes fails for CephFS FUSE clients
Client I/O sometimes fails for Ceph File System (CephFS) File System in User Space (FUSE) clients with the error `transport endpoint shutdown` due to an assert in the FUSE service. To work around this issue, unmount and then remount the CephFS FUSE client, and then restart the client I/O.
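A minimal sketch of that workaround, assuming the FUSE mount point /mnt/fuse and a placeholder monitor address (substitute the values from the affected client):

# Unmount the hung CephFS FUSE mount; add -z for a lazy unmount if the plain unmount hangs.
fusermount -u /mnt/fuse
# Remount with ceph-fuse, then restart the client I/O (for example, the dd loop from the description).
ceph-fuse -m <mon-host>:6789 /mnt/fuse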
Clone Of:
: 1585029
Environment:
Last Closed: 2018-07-11 18:11:10 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 23837 0 None None None 2018-05-15 20:05:17 UTC
Ceph Project Bug Tracker 24207 0 None None None 2018-05-22 15:16:35 UTC
Red Hat Bugzilla 1567030 0 urgent CLOSED [Cephfs:Fuse]: Fuse service stopped and crefi IO failed during MDS in starting state. 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2018:2177 0 None None None 2018-07-11 18:12:03 UTC

Internal Links: 1567030

Description Ramakrishnan Periyasamy 2018-05-10 16:47:54 UTC
Description of problem:

dd IO failed with "dd: failed to open ‘/mnt/fuse/fuse/client1/dir1/1M_1/magna021_000000104856’: Cannot send after transport endpoint shutdown"

dd script used during testing:
function n {
    # Build a unique file name: "<hostname>_<12-digit zero-padded index>".
    printf "%s_%012d" "$(hostname)" "$1"
}
dir=/mnt/fuse/fuse/client1/dir1
# Create ~104k (2^20/10) files of 4 KB each from /dev/urandom, recording the current index in /root/1M_1_CURRENT.
time mkdir -p "$dir/1M_1" && for ((i = 0; i < 2**20/10; i++)); do echo "$i" > /root/1M_1_CURRENT; dd status=none if=/dev/urandom of="$dir/1M_1/$(n "$i")" bs=4k count=1; done

Fuse assert:
2018-05-10 10:20:50.709658 7f96cadb90c0  0 ~Inode: leftover objects on inode 0x0x10000082b16
2018-05-10 10:20:50.726976 7f96cadb90c0 -1 /builddir/build/BUILD/ceph-12.2.1/src/client/Inode.cc: In function 'Inode::~Inode()' thread 7f96cadb90c0 time 2018-05-10 10:20:50.709671
/builddir/build/BUILD/ceph-12.2.1/src/client/Inode.cc: 27: FAILED assert(oset.objects.empty())

 ceph version 12.2.1-46.el7cp (b6f6f1b141c306a43f669b974971b9ec44914cb0) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55ea9e2857d0]
 2: (Inode::~Inode()+0x5e5) [0x55ea9e250865]
 3: (Client::put_inode(Inode*, int)+0x350) [0x55ea9e1cdda0]
 4: (Client::unlink(Dentry*, bool, bool)+0x11a) [0x55ea9e1d2cca]
 5: (Client::trim_dentry(Dentry*)+0x93) [0x55ea9e1d3433]
 6: (Client::trim_cache(bool)+0x328) [0x55ea9e1d38c8]
 7: (Client::tear_down_cache()+0x2eb) [0x55ea9e1e987b]
 8: (Client::~Client()+0x53) [0x55ea9e1e9ab3]
 9: (StandaloneClient::~StandaloneClient()+0x9) [0x55ea9e1ea1c9]
 10: (main()+0x534) [0x55ea9e19c3a4]
 11: (__libc_start_main()+0xf5) [0x7f96c7af23d5]
 12: (()+0x212bf3) [0x55ea9e1a5bf3]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
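Per the NOTE in the backtrace, interpreting the frames needs the ceph-fuse binary with symbols. A minimal sketch on RHEL 7, assuming the ceph-fuse debuginfo package is available in the enabled repositories:

# Install the debug symbols for the FUSE client, then disassemble with interleaved source as the NOTE suggests.
debuginfo-install -y ceph-fuse
objdump -rdS /usr/bin/ceph-fuse > /tmp/ceph-fuse.objdump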


Version-Release number of selected component (if applicable):
Client ceph version: 12.2.1-46.el7cp
Cluster daemons are on 12.2.4-10.el7cp (the cluster was upgraded from 12.2.1-46.el7cp to 12.2.4-10.el7cp).
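A quick way to confirm the client/daemon version skew described above (standard ceph and rpm commands; the ceph-fuse package name on the client is an assumption about this setup):

# On the FUSE client: report the locally installed client version.
rpm -q ceph-fuse
ceph-fuse --version
# From a node with an admin keyring: report the versions the running daemons advertise.
ceph versions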

How reproducible:
1/1

Steps to Reproduce:
1. Configure a cluster with MDS using 12.2.1-46.el7cp.
2. Create around 800k inodes in the file system.
3. Upgrade the cluster to 12.2.4-10.el7cp following the manual MDS upgrade procedure (deactivation step sketched after these steps).
4. After deactivating the active MDS, while it was in the stopping state, the client hung and the FUSE service asserted.
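A minimal sketch of the deactivation step referenced in step 3, assuming a luminous file system named cephfs with two active ranks (both assumptions; this is not the full manual upgrade procedure):

# Drop to a single active MDS, then deactivate the extra rank; it enters the "stopping" state while it drains.
ceph fs set cephfs max_mds 1
ceph mds deactivate cephfs:1
# Watch the rank state; the client hang in this report was observed while the rank was stopping.
ceph mds stat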
Actual results:
Client I/O hung with "Cannot send after transport endpoint shutdown" and the FUSE service asserted with FAILED assert(oset.objects.empty()).

Expected results:
Client I/O should not fail.

Additional info:
NA

Comment 7 Yan, Zheng 2018-05-23 03:49:20 UTC
https://github.com/ceph/ceph/pull/22168

Comment 13 Yan, Zheng 2018-06-11 11:40:39 UTC
The cherry-pick for 3.1 is done: https://bugzilla.redhat.com/show_bug.cgi?id=1585029

Comment 14 Ramakrishnan Periyasamy 2018-06-25 04:42:25 UTC
Manual Cluster upgrade passed without any issues.

FS sanity and regression automation runs passed.
a) The FS sanity run in Jenkins is clean and the results are posted to Polarion:

http://cistatus.ceph.redhat.com/ui/#cephci/launches/New_filter1%7Cpage.page=1&page.size=50&filter.cnt.name=sanity_fs&pag… 

b) http://pulpito.ceph.redhat.com/vasu-2018-06-21_16:30:55-fs-luminous-distro-basic-argo/ 

Moving this bug to verified state.

Comment 18 errata-xmlrpc 2018-07-11 18:11:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2177

