Bug 1576551

Summary: [CephFS] IOs from fuse-clients were hung
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Persona non grata <nobody+410372>
Component: CephFS
Assignee: Patrick Donnelly <pdonnell>
Status: CLOSED ERRATA
QA Contact: Rishabh Dave <ridave>
Severity: medium
Priority: medium
Docs Contact: Persona non grata <nobody+410372>
Version: 3.0
Target Release: 3.0
Target Milestone: z5
Keywords: Automation
CC: anharris, ceph-eng-bugs, john.spray, kdreyer, nobody+410372, rperiyas, zyan
Hardware: Unspecified
OS: Unspecified
Fixed In Version: RHEL: ceph-12.2.4-32; Ubuntu: ceph_12.2.4-36redhat1xenial
Last Closed: 2018-08-09 18:27:11 UTC
Type: Bug
Attachments: fuse client log

Description Persona non grata 2018-05-09 17:29:55 UTC
Created attachment 1433973 [details]
fuse client log

Description of problem:
While running MDS failover tests, at the start of the test, IOs were running from 2 fuse and 2 kernel clients to fill up the cluster. While IOs were running from one of the fuse clients, I observed this message in ceph status:
===================================
  cluster:
    id:     40469cc1-e467-4a60-a122-d6b7716f7fd5
    health: HEALTH_WARN
            1 clients failing to respond to capability release
            1 MDSs report slow requests
 
  services:
    mon: 1 daemons, quorum ceph-jenkins3-build-run201-node1-monmgrinstaller
    mgr: ceph-jenkins3-build-run201-node1-monmgrinstaller(active)
    mds: cephfs-2/2/2 up  {0=ceph-jenkins3-build-run201-node4-mds=up:active,1=ceph-jenkins3-build-run201-node3-mds=up:active}, 2 up:standby
    osd: 12 osds: 12 up, 12 in
 
  data:
    pools:   3 pools, 192 pgs
    objects: 21683 objects, 6622 MB
    usage:   22236 MB used, 326 GB / 347 GB avail
    pgs:     192 active+clean
=====================================
There was also no client IO information in the ceph status output. This test was running on VMs.
IO tools used: Crefi, fio
On the fuse client, Crefi was used, and it hung.
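
For reference, the two health warnings above can usually be narrowed down from the MDS side. A minimal sketch, assuming the commands are run on the MDS host and using the MDS name from the status output above:

# ceph health detail                                               # names the client and MDS involved in the warnings
# ceph daemon mds.ceph-jenkins3-build-run201-node4-mds ops         # dump the slow/in-flight MDS requests
# ceph daemon mds.ceph-jenkins3-build-run201-node4-mds session ls  # list the client sessions held by this MDS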

Version-Release number of selected component (if applicable):
ceph version 12.2.4-10.el7cp (03fd19535b3701f3322c68b5f424335d6fc8dd66) luminous (stable)
OS: Red Hat Enterprise Linux Server release 7.4 (Maipo)

How reproducible:
Always

Steps to Reproduce:
1. Set up a Ceph cluster with 4 MDSs (2 active, 2 standby), 4 clients (2 fuse, 2 kernel), 1 MON+MGR, and 3 OSD nodes.
2. Fill up the cluster with IOs.
3. Fail the active MDSs one after the other while IOs are running (a command sketch follows the list).
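
A rough sketch of step 3, assuming the MDS daemon names from the status output above (the actual automation may drive the failover differently):

# ceph fs status                                       # confirm both ranks are up:active before failing one
# ceph mds fail ceph-jenkins3-build-run201-node4-mds   # fail the rank 0 MDS; a standby should take over
# ceph fs status                                       # wait until a standby has come back up:active
# ceph mds fail ceph-jenkins3-build-run201-node3-mds   # then fail the rank 1 MDS the same way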

Actual results:
IO hung on the fuse client. I observed the following in the fuse client log:
2018-05-09 08:59:15.098969 7fb5d49ec700  0 -- 172.16.115.35:0/83230076 >> 172.16.115.77:6800/562710057 conn(0x56346dc4d800 :-1 s=STATE_OPEN pgs=24 cs=1 l=0).fault initiating reconnect
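
When the client is in this state, the hang can usually be inspected from the ceph-fuse admin socket on the client node. A sketch, assuming the default asok location (the exact client name and pid in the path vary):

# ceph daemon /var/run/ceph/ceph-client.admin.<pid>.asok mds_sessions       # state of the client's MDS sessions
# ceph daemon /var/run/ceph/ceph-client.admin.<pid>.asok mds_requests       # metadata requests still outstanding
# ceph daemon /var/run/ceph/ceph-client.admin.<pid>.asok objecter_requests  # OSD ops stuck in flight, if any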


Expected results:
No IO failures, and MDS failover should succeed.

Additional info:
Before running this test, another MDS failover test with different directory pinning had completed without any issues. Log of that test: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-1525864967774/cephfs-mds-failover_0.log

Comment 13 errata-xmlrpc 2018-08-09 18:27:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2375