Bug 1576551

Summary: [CephFS] IOs from fuse-clients were hung
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Persona non grata <nobody+410372>
Component: CephFS
Assignee: Patrick Donnelly <pdonnell>
Status: CLOSED ERRATA
QA Contact: Rishabh Dave <ridave>
Severity: medium
Priority: medium
Docs Contact: Persona non grata <nobody+410372>
Version: 3.0
Target Release: 3.0
Target Milestone: z5
Keywords: Automation
CC: anharris, ceph-eng-bugs, john.spray, kdreyer, nobody+410372, rperiyas, zyan
Hardware: Unspecified
OS: Unspecified
Fixed In Version: RHEL: ceph-12.2.4-32; Ubuntu: ceph_12.2.4-36redhat1xenial
Last Closed: 2018-08-09 18:27:11 UTC
Type: Bug
Attachments: fuse client log

Description Persona non grata 2018-05-09 17:29:55 UTC
Created attachment 1433973 [details]
fuse client log

Description of problem:
While running MDS failover tests, at the start of the test, IOs were running from 2 fuse and 2 kernel clients to fill up the cluster. While IOs were running from one of the fuse clients, I observed this message in ceph status:
===================================
  cluster:
    id:     40469cc1-e467-4a60-a122-d6b7716f7fd5
    health: HEALTH_WARN
            1 clients failing to respond to capability release
            1 MDSs report slow requests
 
  services:
    mon: 1 daemons, quorum ceph-jenkins3-build-run201-node1-monmgrinstaller
    mgr: ceph-jenkins3-build-run201-node1-monmgrinstaller(active)
    mds: cephfs-2/2/2 up  {0=ceph-jenkins3-build-run201-node4-mds=up:active,1=ceph-jenkins3-build-run201-node3-mds=up:active}, 2 up:standby
    osd: 12 osds: 12 up, 12 in
 
  data:
    pools:   3 pools, 192 pgs
    objects: 21683 objects, 6622 MB
    usage:   22236 MB used, 326 GB / 347 GB avail
    pgs:     192 active+clean
=====================================
There was also no client IO information in the ceph status output. This test was running on VMs.
IO tools used: Crefi, fio
On the fuse client, Crefi was used, and it hung.
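
For reference, the two health warnings above can usually be narrowed down from the MDS side. A minimal sketch, assuming the commands are run on the MDS host and using the MDS name from the status output above:

# ceph health detail                                               # names the client and MDS involved in the warnings
# ceph daemon mds.ceph-jenkins3-build-run201-node4-mds ops         # dump the slow/in-flight MDS requests
# ceph daemon mds.ceph-jenkins3-build-run201-node4-mds session ls  # list the client sessions held by this MDS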

Version-Release number of selected component (if applicable):
ceph version 12.2.4-10.el7cp (03fd19535b3701f3322c68b5f424335d6fc8dd66) luminous (stable)
OS: Red Hat Enterprise Linux Server release 7.4 (Maipo)

How reproducible:
Always

Steps to Reproduce:
1. Set up a Ceph cluster with 4 MDSs (2 active, 2 standby), 4 clients (2 fuse, 2 kernel), 1 MON+MGR, and 3 OSD nodes.
2. Fill up the cluster with IOs.
3. Fail the active MDSs one after the other while IOs are running (a command sketch follows the list).
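
A rough sketch of step 3, assuming the MDS daemon names from the status output above (the actual automation may drive the failover differently):

# ceph fs status                                       # confirm both ranks are up:active before failing one
# ceph mds fail ceph-jenkins3-build-run201-node4-mds   # fail the rank 0 MDS; a standby should take over
# ceph fs status                                       # wait until a standby has come back up:active
# ceph mds fail ceph-jenkins3-build-run201-node3-mds   # then fail the rank 1 MDS the same way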

Actual results:
IO hung on the fuse client. I observed the following in the fuse client log:
2018-05-09 08:59:15.098969 7fb5d49ec700  0 -- 172.16.115.35:0/83230076 >> 172.16.115.77:6800/562710057 conn(0x56346dc4d800 :-1 s=STATE_OPEN pgs=24 cs=1 l=0).fault initiating reconnect
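
When the client is in this state, the hang can usually be inspected from the ceph-fuse admin socket on the client node. A sketch, assuming the default asok location (the exact client name and pid in the path vary):

# ceph daemon /var/run/ceph/ceph-client.admin.<pid>.asok mds_sessions       # state of the client's MDS sessions
# ceph daemon /var/run/ceph/ceph-client.admin.<pid>.asok mds_requests       # metadata requests still outstanding
# ceph daemon /var/run/ceph/ceph-client.admin.<pid>.asok objecter_requests  # OSD ops stuck in flight, if any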


Expected results:
No IO failures, and MDS failover should succeed.

Additional info:
Before running this test, another MDS failover test with different directory pinning had completed without any issues. Log of that test: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-1525864967774/cephfs-mds-failover_0.log

Comment 13 errata-xmlrpc 2018-08-09 18:27:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2375