Bug 1576908 - [CephFS]: Client IOs hung, Fuse service asserted with error FAILED assert(oset.objects.empty())
Summary: [CephFS]: Client IOs hung, Fuse service asserted with error FAILED assert(oset.objects.empty())
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: z4
Target Release: 3.0
Assignee: Yan, Zheng
QA Contact: Ramakrishnan Periyasamy
Docs Contact: Aron Gunn
URL:
Whiteboard:
Depends On:
Blocks: 1576030 1585029
 
Reported: 2018-05-10 16:47 UTC by Ramakrishnan Periyasamy
Modified: 2018-07-11 18:12 UTC
CC List: 7 users

Fixed In Version: RHEL: ceph-12.2.4-22.el7cp Ubuntu: 12.2.4-27redhat1xenial
Doc Type: Known Issue
Doc Text:
.Client I/O sometimes fails for CephFS FUSE clients
Client I/O sometimes fails for Ceph File System (CephFS) File System in User Space (FUSE) clients with the error `transport endpoint shutdown` due to an assert in the FUSE service. To work around this issue, unmount and then remount the CephFS FUSE client, and then restart the client I/O.
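A minimal sketch of that workaround, assuming the FUSE mount point /mnt/fuse and a placeholder monitor address (substitute the values from the affected client):

# Unmount the hung CephFS FUSE mount; add -z for a lazy unmount if the plain unmount hangs.
fusermount -u /mnt/fuse
# Remount with ceph-fuse, then restart the client I/O (for example, the dd loop from the description).
ceph-fuse -m <mon-host>:6789 /mnt/fuse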
Clone Of:
: 1585029
Environment:
Last Closed: 2018-07-11 18:11:10 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 23837 0 None None None 2018-05-15 20:05:17 UTC
Ceph Project Bug Tracker 24207 0 None None None 2018-05-22 15:16:35 UTC
Red Hat Bugzilla 1567030 0 urgent CLOSED [Cephfs:Fuse]: Fuse service stopped and crefi IO failed during MDS in starting state. 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2018:2177 0 None None None 2018-07-11 18:12:03 UTC

Internal Links: 1567030

Description Ramakrishnan Periyasamy 2018-05-10 16:47:54 UTC
Description of problem:

dd IO failed with "dd: failed to open ‘/mnt/fuse/fuse/client1/dir1/1M_1/magna021_000000104856’: Cannot send after transport endpoint shutdown"

dd script used during testing:
function n {
    # Build a unique file name: "<hostname>_<12-digit zero-padded index>".
    printf "%s_%012d" "$(hostname)" "$1"
}
dir=/mnt/fuse/fuse/client1/dir1
# Create ~104k (2^20/10) files of 4 KB each from /dev/urandom, recording the current index in /root/1M_1_CURRENT.
time mkdir -p "$dir/1M_1" && for ((i = 0; i < 2**20/10; i++)); do echo "$i" > /root/1M_1_CURRENT; dd status=none if=/dev/urandom of="$dir/1M_1/$(n "$i")" bs=4k count=1; done

Fuse assert:
2018-05-10 10:20:50.709658 7f96cadb90c0  0 ~Inode: leftover objects on inode 0x0x10000082b16
2018-05-10 10:20:50.726976 7f96cadb90c0 -1 /builddir/build/BUILD/ceph-12.2.1/src/client/Inode.cc: In function 'Inode::~Inode()' thread 7f96cadb90c0 time 2018-05-10 10:20:50.709671
/builddir/build/BUILD/ceph-12.2.1/src/client/Inode.cc: 27: FAILED assert(oset.objects.empty())

 ceph version 12.2.1-46.el7cp (b6f6f1b141c306a43f669b974971b9ec44914cb0) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55ea9e2857d0]
 2: (Inode::~Inode()+0x5e5) [0x55ea9e250865]
 3: (Client::put_inode(Inode*, int)+0x350) [0x55ea9e1cdda0]
 4: (Client::unlink(Dentry*, bool, bool)+0x11a) [0x55ea9e1d2cca]
 5: (Client::trim_dentry(Dentry*)+0x93) [0x55ea9e1d3433]
 6: (Client::trim_cache(bool)+0x328) [0x55ea9e1d38c8]
 7: (Client::tear_down_cache()+0x2eb) [0x55ea9e1e987b]
 8: (Client::~Client()+0x53) [0x55ea9e1e9ab3]
 9: (StandaloneClient::~StandaloneClient()+0x9) [0x55ea9e1ea1c9]
 10: (main()+0x534) [0x55ea9e19c3a4]
 11: (__libc_start_main()+0xf5) [0x7f96c7af23d5]
 12: (()+0x212bf3) [0x55ea9e1a5bf3]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
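Per the NOTE in the backtrace, interpreting the frames needs the ceph-fuse binary with symbols. A minimal sketch on RHEL 7, assuming the ceph-fuse debuginfo package is available in the enabled repositories:

# Install the debug symbols for the FUSE client, then disassemble with interleaved source as the NOTE suggests.
debuginfo-install -y ceph-fuse
objdump -rdS /usr/bin/ceph-fuse > /tmp/ceph-fuse.objdump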


Version-Release number of selected component (if applicable):
Client ceph version: 12.2.1-46.el7cp
Cluster daemons are on 12.2.4-10.el7cp (the cluster was upgraded from 12.2.1-46.el7cp to 12.2.4-10.el7cp).
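A quick way to confirm the client/daemon version skew described above (standard ceph and rpm commands; the ceph-fuse package name on the client is an assumption about this setup):

# On the FUSE client: report the locally installed client version.
rpm -q ceph-fuse
ceph-fuse --version
# From a node with an admin keyring: report the versions the running daemons advertise.
ceph versions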

How reproducible:
1/1

Steps to Reproduce:
1. Configure a cluster with MDS using 12.2.1-46.el7cp.
2. Create around 800k inodes in the file system.
3. Upgrade the cluster to 12.2.4-10.el7cp following the manual MDS upgrade procedure (deactivation step sketched after these steps).
4. After deactivating the active MDS, while it was in the stopping state, the client hung and the FUSE service asserted.
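A minimal sketch of the deactivation step referenced in step 3, assuming a luminous file system named cephfs with two active ranks (both assumptions; this is not the full manual upgrade procedure):

# Drop to a single active MDS, then deactivate the extra rank; it enters the "stopping" state while it drains.
ceph fs set cephfs max_mds 1
ceph mds deactivate cephfs:1
# Watch the rank state; the client hang in this report was observed while the rank was stopping.
ceph mds stat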
Actual results:
Client I/O hung with "Cannot send after transport endpoint shutdown" and the FUSE service asserted with FAILED assert(oset.objects.empty()).

Expected results:
Client I/O should not fail.

Additional info:
NA

Comment 7 Yan, Zheng 2018-05-23 03:49:20 UTC
https://github.com/ceph/ceph/pull/22168

Comment 13 Yan, Zheng 2018-06-11 11:40:39 UTC
The cherry-pick for 3.1 is done: https://bugzilla.redhat.com/show_bug.cgi?id=1585029

Comment 14 Ramakrishnan Periyasamy 2018-06-25 04:42:25 UTC
Manual Cluster upgrade passed without any issues.

FS sanity and regression automation runs passed.
a) The FS sanity run in Jenkins is clean and the results are posted to Polarion:

http://cistatus.ceph.redhat.com/ui/#cephci/launches/New_filter1%7Cpage.page=1&page.size=50&filter.cnt.name=sanity_fs&pag… 

b) http://pulpito.ceph.redhat.com/vasu-2018-06-21_16:30:55-fs-luminous-distro-basic-argo/ 

Moving this bug to verified state.

Comment 18 errata-xmlrpc 2018-07-11 18:11:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2177

