Bug 2196900

Summary: CephFS Unstable with Performance Issues Related to Logging Workload
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Craig Wayman <crwayman>
Component: ceph    Assignee: Patrick Donnelly <pdonnell>
ceph sub component: CephFS    QA Contact: Elad <ebenahar>
Status: ASSIGNED --- Docs Contact:
Severity: high    
Priority: urgent CC: amanzane, bhull, bkunal, bniver, flucifre, gfarnum, johender, kelwhite, lema, linuxkidd, lsantann, mcaldeir, mduasope, muagarwa, odf-bz-bot, pdhange, pdonnell, poleary, sostapov, tnielsen, vshankar, vumrao
Version: unspecified   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Craig Wayman 2023-05-10 18:04:24 UTC
Description of problem (please be detailed as possible and provide log snippets):


  First, it's important to convey that this case has been open for almost a month now. The ODF support team has had many senior personnel analyze the case and has not yet been able to stabilize the environment. After syncing with team members in other geos, it was conveyed that this situation is almost identical to the BZ opened for Barclays: https://bugzilla.redhat.com/show_bug.cgi?id=2171825


  Some background: the customer recently migrated from Portworx (NFS) to ODF. The particular workload giving this cluster issues is logging (Logstash/Elasticsearch), which runs on CephFS. During business hours they're seeing increased load and high file-locking wait times in Dynatrace metrics, and this is believed to be caused by log4j synchronous logging being held up by ODF.

  ODF Support recommended this workload be moved to RBD; however, the customer stated that Ceph RBD was not an option, and using S3 with their particular application was not an option either.

  As of right now, the MDSs are crashing frequently and CephFS is unstable. We can see the crashes in the MDS logs; however, we are not able to determine why the crashes are happening. When we attempt to capture debug logs by increasing the debug levels (ms=1 on the MDSs and the MONs), the cluster is severely impacted, and that impact prevents us from gathering everything needed to determine the underlying origin of the crashes. A possible workaround is to set the debug levels to 0/1; however, we are not sure that will give us what we need.
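
  As a point of reference, below is a minimal sketch (run from the rook-ceph-tools pod) of how the debug levels in question could be raised only while actively reproducing, then reverted afterwards. The exact values are a proposal and have not been validated against this cluster's load:

sh-4.4$ ceph config set mds debug_mds 0/1   # proposed low-impact level (no file logging, in-memory level 1)
sh-4.4$ ceph config set mds debug_ms 1      # only while a crash/slowdown is being reproduced
sh-4.4$ ceph config set mon debug_ms 1
sh-4.4$ ceph config rm mds debug_ms         # remove the overrides once the capture is done
sh-4.4$ ceph config rm mon debug_ms
sh-4.4$ ceph config rm mds debug_mds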

  One observation is that we're constantly seeing usage spikes in the MDS metadata pool. The two ceph fs status outputs below were taken within minutes of each other and show a metadata usage change of almost 500GB, whereas on a similar cluster the metadata pool holds steady at around 19GB:

sh-4.4$ ceph fs status
ocs-storagecluster-cephfilesystem - 1528 clients
=================================
RANK      STATE                       MDS                     ACTIVITY     DNS    INOS   DIRS   CAPS
 0        active      ocs-storagecluster-cephfilesystem-b  Reqs:  230 /s   297k   292k  3711    108k
0-s   standby-replay  ocs-storagecluster-cephfilesystem-a  Evts:  232 /s  87.6k  5089   1278      0
                   POOL                       TYPE     USED  AVAIL
ocs-storagecluster-cephfilesystem-metadata  metadata   777G  9118G
 ocs-storagecluster-cephfilesystem-data0      data    6552G  9118G
MDS version: ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
sh-4.4$

---

sh-4.4$ ceph fs status
ocs-storagecluster-cephfilesystem - 1526 clients
=================================
RANK      STATE                       MDS                     ACTIVITY     DNS    INOS   DIRS   CAPS
 0        active      ocs-storagecluster-cephfilesystem-b  Reqs:  151 /s   295k   294k  3728   91.8k
0-s   standby-replay  ocs-storagecluster-cephfilesystem-a  Evts:  100 /s  9664   6109   1319      0
                   POOL                       TYPE     USED  AVAIL
ocs-storagecluster-cephfilesystem-metadata  metadata   224G  9390G
 ocs-storagecluster-cephfilesystem-data0      data    6515G  9390G
MDS version: ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
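
  To quantify how quickly the metadata pool swings, something like the sampling loop below (run from the toolbox pod; the 60-second interval is arbitrary) could be left running so the spikes can be correlated with Logstash activity. This is only a suggestion for the next capture, not something already in place:

sh-4.4$ while true; do echo "$(date -u +%FT%TZ)  $(ceph df detail | grep cephfilesystem-metadata)"; sleep 60; done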


  Something important to note is that there is a similar cluster with similar workloads, perhaps not as demanding, that seems to be performing fine. If needed, we can use that cluster as a baseline to gauge any configuration issues related to the application workload.



Version of all relevant components (if applicable):

OCP v4.10.11
ODF v4.10.53
Ceph: ceph version 16.2.7-126.el8cp

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?

Still determining root cause


Can this issue be reproduced?

  The tenant reported that when the Logstash deployment is scaled down, performance returns to normal, but when Logstash runs in parallel, the performance of the Java application is affected.

  The Logstash deployment is only scaled to 1 pod at any time. The problem seems to be that when Logstash is reading the data, it affects the write performance of the application's 20 replicas to the point where it affects the application's response time (presumably because log4j logging is synchronous).

  The customer has Dynatrace monitoring enabled on the app (20 replicas), and when Logstash is running, Dynatrace reports a lot of time spent in a "locking" state, which I can only assume means waiting on disk. When Logstash is not running, Dynatrace doesn't show "locking" on the 20-replica Spring Java app.
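
  If it helps with reproducing this outside the customer application, below is a rough sketch of the same access pattern against any CephFS-backed mount (the path and counts are hypothetical): 20 processes appending log-style lines while a single Logstash-like reader repeatedly scans the same files, which forces the MDS to shuffle caps between the writers and the reader:

# 20 "application" writers appending to their own log files (hypothetical mount path)
for i in $(seq 1 20); do
  ( while true; do echo "$(date -u) writer $i test line" >> /mnt/cephfs-test/app-$i.log; done ) &
done

# single Logstash-like reader repeatedly scanning all of the writers' files
while true; do cat /mnt/cephfs-test/app-*.log > /dev/null; sleep 1; done &

# timing an individual append on a writer's file should approximate the "locking"
# waits that Dynatrace reports
time sh -c 'echo probe >> /mnt/cephfs-test/app-1.log'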


Additional info:

  From the historic MDS ops we can see client ops spending almost 600 seconds in the queue and effectively no time executing, which was conveyed to us as a sign of over-utilized disks and/or an environment issue (network/platform). We've reviewed the collectl data for disk/network issues and everything seems fine, as does OSD perf.

2023-05-04T17:52:38.541687+0000 client_request(client.25412918:2576 setattr size=0 mtime=2023-05-04T17:52:38.524175+0000 #0x10004775186 2023-05-04T17:52:38.524175+0000 caller_uid=1000440000, caller_gid=0{1000440000,})
  Age: 593.511175423 / Duration: 9.650933153
    0.000000 2023-05-04T17:52:38.541687+0000 initiated
    0.000001 2023-05-04T17:52:38.541687+0000 header_read
    0.000006 2023-05-04T17:52:38.541688+0000 throttled
    0.000017 2023-05-04T17:52:38.541694+0000 all_read
    0.000021 2023-05-04T17:52:38.541711+0000 dispatched
    0.011516 2023-05-04T17:52:38.541732+0000 failed to xlock, waiting
    2.290387 2023-05-04T17:52:38.553248+0000 failed to xlock, waiting
    2.919767 2023-05-04T17:52:40.843635+0000 failed to xlock, waiting
    0.000284 2023-05-04T17:52:43.763402+0000 failed to xlock, waiting
    0.000062 2023-05-04T17:52:43.763686+0000 failed to xlock, waiting
    0.009766 2023-05-04T17:52:43.763748+0000 failed to xlock, waiting
    0.000013 2023-05-04T17:52:43.773514+0000 failed to xlock, waiting
    0.000207 2023-05-04T17:52:43.773527+0000 failed to xlock, waiting
    0.001738 2023-05-04T17:52:43.773734+0000 failed to xlock, waiting
    2.203056 2023-05-04T17:52:43.775472+0000 failed to xlock, waiting
    0.000411 2023-05-04T17:52:45.978528+0000 failed to xlock, waiting
    0.000027 2023-05-04T17:52:45.978939+0000 failed to xlock, waiting
    0.000027 2023-05-04T17:52:45.978966+0000 failed to xlock, waiting
    0.000049 2023-05-04T17:52:45.978993+0000 failed to xlock, waiting
    0.013416 2023-05-04T17:52:45.979042+0000 failed to xlock, waiting
    0.000021 2023-05-04T17:52:45.992458+0000 failed to xlock, waiting
    0.000146 2023-05-04T17:52:45.992479+0000 acquired locks
    0.000001 2023-05-04T17:52:45.992625+0000 early_replied
    2.199930 2023-05-04T17:52:45.992626+0000 submit entry: journal_and_reply
    0.000041 2023-05-04T17:52:48.192556+0000 journal_committed:
    0.000019 2023-05-04T17:52:48.192597+0000 replying
    0.000004 2023-05-04T17:52:48.192616+0000 finishing request
    0.000000 2023-05-04T17:52:48.192620+0000 cleaned up request
    0.000006 2023-05-04T17:52:48.192620+0000 done

2023-05-04T17:52:48.215151+0000 client_request(client.25413413:4153 create #0x10004774b5c/master.txt 2023-05-04T17:52:48.213682+0000 caller_uid=1000440000, caller_gid=0{1000440000,})
  Age: 583.837711767 / Duration: 14.581236078
    0.000000 2023-05-04T17:52:48.215151+0000 initiated
    0.000001 2023-05-04T17:52:48.215151+0000 header_read
    0.000006 2023-05-04T17:52:48.215152+0000 throttled
    0.127949 2023-05-04T17:52:48.215158+0000 all_read
    0.000014 2023-05-04T17:52:48.343107+0000 dispatched
    2.447871 2023-05-04T17:52:48.343121+0000 failed to xlock, waiting
    0.000005 2023-05-04T17:52:50.790992+0000 acquired locks
    0.000002 2023-05-04T17:52:50.790997+0000 acquired locks
    2.817897 2023-05-04T17:52:50.790999+0000 failed to xlock, waiting
    0.000014 2023-05-04T17:52:53.608896+0000 acquired locks
    2.189104 2023-05-04T17:52:53.608910+0000 acquired locks
    2.494759 2023-05-04T17:52:55.798014+0000 failed to wrlock, waiting
    0.000003 2023-05-04T17:52:58.292773+0000 acquired locks
    0.000013 2023-05-04T17:52:58.292776+0000 acquired locks
    0.000290 2023-05-04T17:52:58.292789+0000 failed to xlock, waiting
    0.000003 2023-05-04T17:52:58.293079+0000 acquired locks
    2.138049 2023-05-04T17:52:58.293082+0000 failed to xlock, waiting
    0.000008 2023-05-04T17:53:00.431131+0000 acquired locks
    0.083715 2023-05-04T17:53:00.431139+0000 failed to rdlock, waiting
    0.000011 2023-05-04T17:53:00.514854+0000 acquired locks
    0.007369 2023-05-04T17:53:00.514865+0000 failed to rdlock, waiting
    0.000004 2023-05-04T17:53:00.522234+0000 acquired locks
    0.000101 2023-05-04T17:53:00.522238+0000 acquired locks
    0.000001 2023-05-04T17:53:00.522339+0000 early_replied
    2.273945 2023-05-04T17:53:00.522340+0000 submit entry: journal_and_reply
    0.000084 2023-05-04T17:53:02.796285+0000 journal_committed:
    0.000011 2023-05-04T17:53:02.796369+0000 replying
    0.000004 2023-05-04T17:53:02.796380+0000 finishing request
    0.000003 2023-05-04T17:53:02.796384+0000 cleaned up request
    0.000006 2023-05-04T17:53:02.796387+0000 done
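
  For completeness, the in-flight and historic MDS ops can be pulled again from the toolbox pod while the slowdown is occurring (assuming ocs-storagecluster-cephfilesystem-b is still the active rank 0); note that the raw output is JSON rather than the summarized form shown above:

sh-4.4$ ceph tell mds.ocs-storagecluster-cephfilesystem-b ops                 # ops currently in flight
sh-4.4$ ceph tell mds.ocs-storagecluster-cephfilesystem-b dump_historic_ops   # recently completed slow ops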