Bug 1576551 - [CephFS] IOs from fuse-clients were hung
Summary: [CephFS] IOs from fuse-clients were hung
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: CephFS
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z5
Target Release: 3.0
Assignee: Patrick Donnelly
QA Contact: Rishabh Dave
Docs Contact: Shreekar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-09 17:29 UTC by Shreekar
Modified: 2018-08-09 18:27 UTC (History)
7 users

Fixed In Version: RHEL: ceph-12.2.4-32 Ubuntu: ceph_12.2.4-36redhat1xenial
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-09 18:27:11 UTC
Target Upstream Version:


Attachments (Terms of Use)
fuse client log (22.41 KB, text/plain)
2018-05-09 17:29 UTC, Shreekar


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:2375 0 None None None 2018-08-09 18:27:42 UTC

Description Shreekar 2018-05-09 17:29:55 UTC
Created attachment 1433973 [details]
fuse client log

Description of problem:
While running MDS failover tests, I/Os were running from 2 FUSE and 2 kernel clients at the start of the test to fill up the cluster. While I/Os were running from a FUSE client, I observed this message in the ceph status output:
===================================
  cluster:
    id:     40469cc1-e467-4a60-a122-d6b7716f7fd5
    health: HEALTH_WARN
            1 clients failing to respond to capability release
            1 MDSs report slow requests
 
  services:
    mon: 1 daemons, quorum ceph-jenkins3-build-run201-node1-monmgrinstaller
    mgr: ceph-jenkins3-build-run201-node1-monmgrinstaller(active)
    mds: cephfs-2/2/2 up  {0=ceph-jenkins3-build-run201-node4-mds=up:active,1=ceph-jenkins3-build-run201-node3-mds=up:active}, 2 up:standby
    osd: 12 osds: 12 up, 12 in
 
  data:
    pools:   3 pools, 192 pgs
    objects: 21683 objects, 6622 MB
    usage:   22236 MB used, 326 GB / 347 GB avail
    pgs:     192 active+clean
=====================================
There was also no client I/O information in the ceph status output. This test was run on VMs.
IO tools used: Crefi, fio
Crefi, which was running on the FUSE client, was the tool that hung.

Version-Release number of selected component (if applicable):
ceph version 12.2.4-10.el7cp (03fd19535b3701f3322c68b5f424335d6fc8dd66) luminous (stable)
OS: Red Hat Enterprise Linux Server release 7.4 (Maipo)

How reproducible:
Always

Steps to Reproduce:
1. Set up a Ceph cluster with 4 MDS daemons (2 active, 2 standby), 4 clients (2 FUSE, 2 kernel), 1 mon+mgr, and 3 OSD nodes.
2. Fill up the cluster with I/Os.
3. Fail the active MDS daemons, one after the other, while I/Os are running.
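The failover step above can be sketched as follows. This is a hedged sketch, not the exact procedure from the test run: it assumes a systemd-managed deployment, and the MDS instance name is taken from this run's ceph status output purely for illustration.

```shell
# Identify the currently active MDS ranks for the filesystem.
ceph fs status cephfs

# On the node hosting an active MDS, stop the daemon to force a failover.
# (Assumes systemd management; the instance name below is illustrative.)
systemctl stop ceph-mds@ceph-jenkins3-build-run201-node4-mds

# Watch a standby take over the rank, then repeat for the other active MDS.
ceph -s
ceph fs status cephfs
```

Note that these commands act on a live cluster, so the client I/O load (Crefi, fio) must already be running when the daemon is stopped.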

Actual results:
I/O hung on the FUSE client. I observed this in the FUSE client log:
2018-05-09 08:59:15.098969 7fb5d49ec700  0 -- 172.16.115.35:0/83230076 >> 172.16.115.77:6800/562710057 conn(0x56346dc4d800 :-1 s=STATE_OPEN pgs=24 cs=1 l=0).fault initiating reconnect
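When this state is hit, the MDS and client side can be inspected as follows. A hedged sketch: the MDS instance name is reused from this run's ceph status output, and the client admin-socket path assumes a default ceph-fuse deployment, so both are illustrative.

```shell
# Show which client is reported as failing to release capabilities.
ceph health detail

# On the active MDS node: list in-flight (slow) requests and client sessions.
# (MDS instance name is illustrative, taken from this cluster's status output.)
ceph daemon mds.ceph-jenkins3-build-run201-node4-mds dump_ops_in_flight
ceph daemon mds.ceph-jenkins3-build-run201-node4-mds session ls

# On the FUSE client: dump its MDS sessions via the admin socket.
# (The socket filename varies with the client id; this path is an assumption.)
ceph daemon /var/run/ceph/ceph-client.admin.asok mds_sessions
```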


Expected results:
No I/O failures; MDS failover should succeed.

Additional info:
Before running this test, another MDS failover test with different directory pinning had been run and completed without any issues. Log of that test: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-1525864967774/cephfs-mds-failover_0.log

Comment 13 errata-xmlrpc 2018-08-09 18:27:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2375

