Bug 2015520 - AWS - Random IO performance degradation on ODF 4.9 CephFS and RBD as compared to OCS 4.8
Summary: AWS - Random IO performance degradation on ODF 4.9 CephFS and RBD as compared to OCS 4.8
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Neha Ojha
QA Contact: Joy John Pinto
URL:
Whiteboard:
Duplicates: 1984590
Depends On:
Blocks:
 
Reported: 2021-10-19 12:33 UTC by Joy John Pinto
Modified: 2023-08-09 16:37 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-29 02:18:46 UTC
Embargoed:



Description Joy John Pinto 2021-10-19 12:33:48 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Performance degradation is observed with ODF 4.9 as compared to OCS 4.8 with 4 KiB, 16 KiB, and 64 KiB block sizes.

Full comparison report is available here:
https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#heading=h.w4cb5hofz7uj (page 14 and Page 15)

Version of all relevant components (if applicable):

HW platform:              AWS
Number of OCS nodes:      3
Number of total OSDs:     3
OSD size:                 2 TiB
Total available storage:  6140 GiB
OCP version:              4.9.0-0.nightly-2021-09-30-022250
ODF version:              odf-operator.v4.9.0-164.ci
Ceph version:             16.2.0-136.el8cp


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
NA

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
Regression. Results are compared between OCP 4.8.0-0.nightly-2021-07-04-112043 and OCP 4.9.0-0.nightly-2021-09-30-022250 with the same hardware configuration.

Steps to Reproduce:
0. Deploy an AWS cluster with 2 TiB OSD size.
1. Run tests/e2e/performance/test_fio_benchmark.py (a hypothetical invocation is sketched below).
2. Compare IOPS and throughput results between 4.8 and 4.9 for random IO (4, 16, and 64 KiB) on CephFS and RBD.
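A minimal sketch of step 1, assuming an already-deployed cluster and the ocs-ci run-ci entry point; the cluster path and config file below are placeholders, not the ones used in this run:

# Placeholder paths/flags; adjust to the actual ocs-ci cluster directory and config.
run-ci tests/e2e/performance/test_fio_benchmark.py \
  --cluster-path ~/clusterdir \
  --ocsci-conf my-aws-config.yaml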


Actual results:
A 20%-40% degradation is observed.

Expected results:
ODF 4.9 IO results should be on par with or better than OCS 4.8.

Additional info:
Test log can be found at:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/lr5-jopinto-perf-run/lr5-jopinto-perf-run_20211011T182425/logs/ocs-ci-logs-1633980696/tests/e2e/performance/test_fio_benchmark.py/TestFIOBenchmark/

Comment 4 Joy John Pinto 2021-10-20 04:40:51 UTC
Tests were run on OCS 4.8.0-444.ci (OCP 4.8) and odf-operator.v4.9.0-164.ci (OCP 4.9).

Comment 8 Yaniv Kaul 2021-10-21 08:13:36 UTC
No one bothered to put a needinfo on the reporter... ?

Comment 10 Mudit Agarwal 2021-10-21 10:01:22 UTC
It is on my radar; it is being discussed here:
https://chat.google.com/room/AAAAcocpRq8/mdPD4kRDaWA

Comment 15 Mudit Agarwal 2021-10-28 23:49:13 UTC
Josh, PTAL

Comment 19 Yaniv Kaul 2021-11-03 08:04:37 UTC
Any updates?

Comment 23 Ben England 2021-11-09 14:20:45 UTC
I reproduced at least part of Joy's regression. I do not yet have an analysis or root cause for it. I created a full-blown 6-node AWS cluster with 3 masters and 3 separate workers to minimize interaction between Kubernetes and Ceph. Workers were 3 AWS m5.4xlarge nodes, each with one 2 TB gp2 volume, all defaults (no encryption).

Next steps: will discuss with Joy why our test parameters differ and whether this matters. I suspect that 51 pods each queueing 16 I/O requests in parallel (~750 I/Os in flight to 3 OSDs) is going to cause some high latency, but this would not account for a regression, and it is a use case worth testing (customers in my experience typically underprovision OCS/ODF). But my 9-pod test was sufficient to drive the block devices to max IOPS/throughput; you can see the prometheus query graph screenshot in the ODF 4.9 dataset.

See the google sheet here (tab is 2021-11-08):

https://docs.google.com/spreadsheets/d/1hnq-N4eEEvpxg82JYAFTsRSHMSMHn3OGLAWTAqrEQsU/edit#gid=1627837689

From the graph we see that there is a significant regression for 16-KiB random writes (about 25%), and just under 10% regressions for 1- and 4-KiB random writes.   Everything else is pretty much even.  I saw some variance in the 16-KiB random writes (%dev ~= 15%) using OCS 4.8 but this does not seem to explain a regression of the magnitude observed.   All other variances are quite small.

Differences from Joy's RBD fio test:

Parameter         / My test               / Joy's test
--------------------------------------------------------
duration            10-min                   5-min
pods                9                        51
filesizeGB          80                       16
bs                  4096,1024,256,64,16,4,1  1024,64,16,4

The total amount of data in the files is roughly the same (mine = 80*9 = 720 GB x 3 ~= 2 TB = 1/3 available space, Joy's = 51*16 ~= 750 GB)

You can see my CRs at fio.yaml in the test directories below.

I measured OCP4.8.3 + OCS 4.8.3, then did an upgrade of the cluster to OCP 4.9.5 (after removing OCS 4.8), then installed ODF quay.io/rhceph-dev/ocs-registry:4.9.0-228.ci using Shekhar's procedure (details available upon request).  
 
Data for the runs is stored in Elasticsearch and viewable with a Grafana dashboard. Sorry that I can't embed hyperlinks in Bugzilla.

ODF 4.9:

http://marquez.perf.lab.eng.rdu2.redhat.com:3000/d/dWSOkJi7z/aws-fio-summary-with-rook-ceph-es-backed?orgId=1&from=now-7d&to=now&var-data_source_type=Production&var-results_ES=Production-AWS-ES_fio-results&var-analyzed_ES=Production-AWS-ES_fio-analyzed&var-log_ES=Production-AWS-ES_fio-log&var-prom_ES=Production-AWS-ES_fio-prometheus-data&var-user=fio-ocs48-quick&var-UUID=75201963-eb67-56ce-85be-1a8a906345e7&var-clustername=bene-2021-11-03-09-10-aws&var-Operation=All&var-io_size=All&var-interval=30s&var-es_block_device=All&var-es_net_device=All&var-datasource=Public-AWS-ES_fio_prometheus_data&var-instances=All&var-sample=All

OCS 4.8:

http://marquez.perf.lab.eng.rdu2.redhat.com:3000/d/dWSOkJi7z/aws-fio-summary-with-rook-ceph-es-backed?orgId=1&from=now-7d&to=now&var-data_source_type=Production&var-results_ES=Production-AWS-ES_fio-results&var-analyzed_ES=Production-AWS-ES_fio-analyzed&var-log_ES=Production-AWS-ES_fio-log&var-prom_ES=Production-AWS-ES_fio-prometheus-data&var-user=fio-ocs48-quick&var-UUID=48048ac2-d6c0-5476-9bb6-b14b3d51d554&var-clustername=bene-2021-11-03-09-10-aws&var-Operation=All&var-io_size=All&var-interval=30s&var-es_block_device=All&var-es_net_device=All&var-datasource=Public-AWS-ES_fio_prometheus_data&var-instances=All&var-sample=All

Do not pay attention to the user or clustername in the dashboard; those are descriptive fields that I failed to update properly. I'm going by the test UUID, which is in the fio-client pod log. For example, for the UUID beginning with 4804, you get it from log messages like these (a one-liner for extracting it is sketched after the snippet):

2021-11-07T14:37:35Z - INFO     - MainProcess - trigger_fio: fio has successfully finished sample 1 executing for jobname randread and 
   results are in the dir /tmp/fiod-48048ac2-d6c0-5476-9bb6-b14b3d51d554/fiojob-randread-4096KiB-1/1/randread
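A hedged one-liner for pulling that UUID out of the fio client pod log; the namespace and label selector are assumptions about the benchmark-operator deployment, not taken from this setup:

# Assumed namespace/label; adjust to wherever the fio client pod actually runs.
oc -n benchmark-operator logs -l app=fio-benchmark --tail=-1 \
  | grep -oE 'fiod-[0-9a-f-]{36}' | sort -u | sed 's/^fiod-//'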

The raw data for the runs and the must-gather data are at these URLs:

OCP4.8+OCS4.8: http://nfs-storage01.scalelab.redhat.com/bengland/public/openshift/ripsaw/fio-logs/2021-11-07-08-35/
OCP4.9+ODF4.9: http://nfs-storage01.scalelab.redhat.com/bengland/public/openshift/ripsaw/fio-logs/2021-11-08-08-15/

These logs include pod logs, dmesg from all nodes, must-gather output, and some Ceph outputs taken before the benchmark was deleted.

Comment 24 Ben England 2021-11-09 16:05:15 UTC
One other difference: I use the Ceph OSD cache dropper and kernel cache dropper features of benchmark-operator (see the sketch below). You can see that this has a major impact because random write tests show a read spike at the beginning of each sample, where OSDs are reading in metadata. This is intended to tighten up variance so we can distinguish small regressions accurately.
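For reference, a sketch of what those knobs look like in a benchmark-operator fio CR; the field names (drop_cache_kernel, drop_cache_rook_ceph) are from my reading of the fio_distributed workload and may differ in the exact benchmark-operator version used here:

# Writes a fragment of the workload args showing the assumed cache-drop switches.
cat <<'EOF' > fio-cache-drop-fragment.yaml
spec:
  workload:
    name: fio_distributed
    args:
      drop_cache_kernel: true      # drop the kernel page cache on worker nodes before each sample
      drop_cache_rook_ceph: true   # ask the Ceph OSDs to drop their caches before each sample
EOF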

Comment 26 Yuli Persky 2021-11-09 20:12:05 UTC
Please note that we see similar degradation in random IO results on OCP 4.9 + OCS 4.9 vs OCP 4.8 + OCS 4.8.

The comparison is available here : http://ocsperf.ceph.redhat.com:8080/index.php?version1=5&build1=11&platform1=2&az_topology1=1&test_name%5B%5D=1&version2=6&build2=18&platform2=2&az_topology2=1&version3=&build3=&submit=Choose+options

Comment 27 Ben England 2021-11-09 20:13:00 UTC
forgot to add this article as background.

https://medium.com/omio-engineering/cpu-limits-and-aggressive-throttling-in-kubernetes-c5b20bd8a718

Comment 28 Yuli Persky 2021-11-09 20:13:49 UTC
Addition to Comment#26:  The platform is VMware LSO

Comment 34 Ben England 2021-11-18 19:29:42 UTC
Lowering the queue depth from 16 to 2 for my 9-pod test results in stable numbers for 16-KiB randwrites, while still getting reasonably good throughput. We need to retest across the whole range of parameters to verify, but I am getting excellent stability from the tests this way.

Now for the actual regression numbers, from here:

https://docs.google.com/spreadsheets/d/1hnq-N4eEEvpxg82JYAFTsRSHMSMHn3OGLAWTAqrEQsU/edit#gid=1049516327

It shows that for 16-KiB size (drum roll....):

randread throughput increased by 10% (yay!)
randwrite throughput decreased by 30% (boo!)

These numbers are very consistent and the percent deviation of the samples is VERY LOW. But why? I see that more reads are happening during the randwrite test with ODF 4.9:

with OCS 4.8:  http://nfs-storage01.scalelab.redhat.com/bengland/public/openshift/ripsaw/fio-logs/2021-11-16-14-32/Screenshot%20from%202021-11-16%2017-13-20.png
with ODF 4.9:  http://nfs-storage01.scalelab.redhat.com/bengland/public/openshift/ripsaw/fio-logs/2021-11-17-17-45/iops-over-time.png

To get these PromQL graphs, go to the OpenShift console and replace everything to the right of the hostname in the URL with:

monitoring/query-browser?query0=irate%28node_disk_writes_completed_total%7Bdevice%3D~"nvme%5B0-9%5Dn1"%7D%5B1m%5D%29&query1=irate%28node_disk_reads_completed_total%7Bdevice%3D~"nvme%5B0-9%5Dn1"%7D%5B1m%5D%29
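Decoded, those two URL-encoded query parameters are just the per-device write and read IOPS rates. A sketch of running the same queries against the in-cluster Prometheus API instead of the console; the thanos-querier route and token-based access are assumptions about the monitoring setup:

# query0 / query1 from the URL above, decoded:
#   irate(node_disk_writes_completed_total{device=~"nvme[0-9]n1"}[1m])
#   irate(node_disk_reads_completed_total{device=~"nvme[0-9]n1"}[1m])
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
curl -skG "https://${HOST}/api/v1/query" -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode 'query=irate(node_disk_writes_completed_total{device=~"nvme[0-9]n1"}[1m])'
curl -skG "https://${HOST}/api/v1/query" -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode 'query=irate(node_disk_reads_completed_total{device=~"nvme[0-9]n1"}[1m])'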

I see that more reads are happening with ODF 4.9 during the randwrite test than happened with 4.8. If so, this could explain the regression.

The next step is to run this test with higher memory limits and see if this difference goes away. If so, the solution might be to just give OSDs a little more memory to account for changes in RHCS 5 (Ceph Pacific), which wouldn't be so bad. We can investigate why this might have happened, but the important thing for now is to stop the regression so we can release 4.9. Right now I've altered the ocs-storagecluster CR as shown here:

[bengland@localhost ripsaw]$ oc -n openshift-storage get storagecluster ocs-storagecluster -o json | jq '.spec.storageDeviceSets[0].resources'
{
  "limits": {
    "cpu": "2",
    "memory": "10Gi"
  },
  "requests": {
    "cpu": "2",
    "memory": "10Gi"
  }
}
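One way to apply an equivalent change non-interactively (a sketch, not necessarily how the CR was edited here):

# JSON-patch the first storageDeviceSet's resources; rook then rolls the OSD pods
# with the new requests/limits.
oc -n openshift-storage patch storagecluster ocs-storagecluster --type json -p '[
  {"op": "add",
   "path": "/spec/storageDeviceSets/0/resources",
   "value": {"limits":   {"cpu": "2", "memory": "10Gi"},
             "requests": {"cpu": "2", "memory": "10Gi"}}}
]'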

So now I get OSDs with more memory and CPU:

[bengland@localhost ripsaw]$ oc -n openshift-storage get pod rook-ceph-osd-2-86dcbc5fbb-wfvnp -o json | jq '.spec.containers[0].resources'
{
  "limits": {
    "cpu": "2",
    "memory": "10Gi"
  },
  "requests": {
    "cpu": "2",
    "memory": "10Gi"
  }
}
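The same check across all OSD pods at once (app=rook-ceph-osd is the standard rook label; adjust if it differs in this deployment):

oc -n openshift-storage get pods -l app=rook-ceph-osd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'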

Rerunning the test now. If this works, I'll experiment with different amounts of memory to see how much is actually needed. Shekhar Berry has already shed light on this with his BZ:

https://bugzilla.redhat.com/show_bug.cgi?id=2012082

Comment 36 Yaniv Kaul 2021-11-29 11:19:22 UTC
Ben, do we have updated numbers?

Comment 37 Ben England 2021-12-06 16:24:49 UTC
I reran ODF 4.9 with a 10 GB pod memory limit and a 4-core CPU limit, and I still see a major regression with no real improvement. I re-ran for all block sizes and rand{read,write}; the regression remains in {1,4,16,64}-KiB random writes with no significant regression elsewhere. I will re-focus on 16-KiB randwrites only. One thing to rule out: since my experiment includes cache dropping (to eliminate variance between samples, which it does successfully), there may be some hit associated with that, but customer systems run without cache dropping, so I'll run hour-long samples and see if the regression stays the same. Spreadsheet at:

https://docs.google.com/spreadsheets/d/1hnq-N4eEEvpxg82JYAFTsRSHMSMHn3OGLAWTAqrEQsU/edit#gid=553959841

The raw data is at

http://nfs-storage01.scalelab.redhat.com/bengland/public/openshift/ripsaw/fio-logs/2021-12-05-22-32/

The latest run with ODF 4.9 includes a graph of OSD block device IOPS over time (blkdev-iops-*.png) from the OpenShift console; it shows significant reads. It also includes the requested dumps of Ceph OSD stats and parameters in dump-mempools*log, heap-stats*log, and ceph-config*log.
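For reference, a sketch of how dumps like those can be collected from the rook-ceph toolbox pod; the toolbox label, OSD ids, and the tell-based form of the commands are assumptions, not necessarily what produced the *.log files above:

TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -1)
for osd in 0 1 2; do
  oc -n openshift-storage rsh "$TOOLS" ceph tell osd.$osd dump_mempools > dump-mempools-osd$osd.log
  oc -n openshift-storage rsh "$TOOLS" ceph tell osd.$osd heap stats    > heap-stats-osd$osd.log
done
oc -n openshift-storage rsh "$TOOLS" ceph config dump > ceph-config.log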

Let me know if there are any other stats I can get, or if you need these stats for OCS 4.8.

Comment 43 Mark Nelson 2021-12-15 15:06:45 UTC
@bengland

Comment 62 Mudit Agarwal 2022-03-11 02:38:10 UTC
No decision yet on this, not a 4.10 blocker. Moving it out.
Setting NI on Ben based on Patrick's comment.

Comment 64 Greg Farnum 2022-03-14 14:36:44 UTC
*** Bug 1984590 has been marked as a duplicate of this bug. ***

