Description of problem (please be as detailed as possible and provide log snippets):

Performance degradation is observed with 4.9 as compared to 4.8 with 4 KiB, 16 KiB and 64 KiB block sizes. The full comparison report is available here (pages 14 and 15):
https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#heading=h.w4cb5hofz7uj

Version of all relevant components (if applicable):
HW Platform: AWS
Number of OCS nodes: 3
Number of total OSDs: 3
OSD Size (TiB): 2
Total available storage (GiB): 6140
OCP Version: 4.9.0-0.nightly-2021-09-30-022250
ODF Version: odf-operator.v4.9.0-164.ci
Ceph Version: 16.2.0-136.el8cp

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
NA

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
Regression. Results are compared between OCP 4.8.0-0.nightly-2021-07-04-112043 and OCP 4.9.0-0.nightly-2021-09-30-022250 with the same hardware configuration.

Steps to Reproduce:
0. Deploy an AWS cluster with 2 TB OSD size.
1. Run tests/e2e/performance/test_fio_benchmark.py (one possible invocation is sketched below, after Additional info).
2. Compare IOPS and throughput results between 4.8 and 4.9 for random IO (4, 16 and 64 KiB) on CephFS and RBD.

Actual results:
20%-40% degradation is observed.

Expected results:
ODF 4.9 IO results should be on par with or better than OCS 4.8.

Additional info:
Test logs can be found at:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/lr5-jopinto-perf-run/lr5-jopinto-perf-run_20211011T182425/logs/ocs-ci-logs-1633980696/tests/e2e/performance/test_fio_benchmark.py/TestFIOBenchmark/
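For anyone trying to reproduce step 1, a rough sketch of one possible ocs-ci invocation (the cluster name, cluster path, and config file below are placeholders, and the exact flags depend on the local ocs-ci setup):

# Run from an ocs-ci checkout with its virtualenv activated; all paths are placeholders
run-ci --cluster-name my-perf-cluster \
       --cluster-path ~/clusters/my-perf-cluster \
       --ocsci-conf conf/my-aws-2tb-osd.yaml \
       tests/e2e/performance/test_fio_benchmark.py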
Tests were run on OCS 4.8.0-444.ci (OCP 4.8) and odf-operator.v4.9.0-164.ci (OCP 4.9).
No one bothered to put a needinfo on the reporter... ?
It is on my radar; it is being discussed here: https://chat.google.com/room/AAAAcocpRq8/mdPD4kRDaWA
Josh, PTAL
Any updates?
I reproduced at least part of Joy's regression. I do not yet have an analysis or root cause for it.

I created a full-blown 6-node AWS cluster with 3 masters and 3 separate workers to minimize interaction between Kubernetes and Ceph. Workers were 3 AWS m5.4xlarge nodes, each with one 2-TB gp2 volume, all defaults (no encryption).

Next steps: will discuss with Joy why our test parameters differ and whether this matters. I suspect that 51 pods each queueing 16 I/O requests in parallel (~750 I/Os in flight to 3 OSDs) is going to cause some high latency, but this would not account for a regression, and it is a use case worth testing (in my experience customers typically underprovision OCS/ODF). My 9-pod test was sufficient to drive the block devices to max IOPS/throughput; you can see the Prometheus query graph screenshot in the ODF 4.9 dataset.

See the Google sheet here (tab is 2021-11-08):
https://docs.google.com/spreadsheets/d/1hnq-N4eEEvpxg82JYAFTsRSHMSMHn3OGLAWTAqrEQsU/edit#gid=1627837689

From the graph we see that there is a significant regression for 16-KiB random writes (about 25%), and just under 10% regressions for 1- and 4-KiB random writes. Everything else is pretty much even. I saw some variance in the 16-KiB random writes (%dev ~= 15%) using OCS 4.8, but this does not seem to explain a regression of the magnitude observed. All other variances are quite small.

Differences from Joy's RBD fio test:

Parameter    My test                   Joy's test
----------   -----------------------   ------------
duration     10 min                    5 min
pods         9                         51
filesizeGB   80                        16
bs (KiB)     4096,1024,256,64,16,4,1   1024,64,16,4

The total amount of data in the files is roughly the same (mine: 80 GB * 9 pods = 720 GB, x3 replication ~= 2 TB = 1/3 of available space; Joy's: 51 * 16 GB ~= 750 GB).

You can see my CRs at fio.yaml in the test directories below. I measured OCP 4.8.3 + OCS 4.8.3, then upgraded the cluster to OCP 4.9.5 (after removing OCS 4.8), then installed ODF quay.io/rhceph-dev/ocs-registry:4.9.0-228.ci using Shekhar's procedure (details available upon request).

Data for the runs is stored in Elasticsearch and viewable with the Grafana dashboard (links below). Sorry that I can't embed hyperlinks in bugzilla.
OCS 4.9:
http://marquez.perf.lab.eng.rdu2.redhat.com:3000/d/dWSOkJi7z/aws-fio-summary-with-rook-ceph-es-backed?orgId=1&from=now-7d&to=now&var-data_source_type=Production&var-results_ES=Production-AWS-ES_fio-results&var-analyzed_ES=Production-AWS-ES_fio-analyzed&var-log_ES=Production-AWS-ES_fio-log&var-prom_ES=Production-AWS-ES_fio-prometheus-data&var-user=fio-ocs48-quick&var-UUID=75201963-eb67-56ce-85be-1a8a906345e7&var-clustername=bene-2021-11-03-09-10-aws&var-Operation=All&var-io_size=All&var-interval=30s&var-es_block_device=All&var-es_net_device=All&var-datasource=Public-AWS-ES_fio_prometheus_data&var-instances=All&var-sample=All

OCS 4.8:
http://marquez.perf.lab.eng.rdu2.redhat.com:3000/d/dWSOkJi7z/aws-fio-summary-with-rook-ceph-es-backed?orgId=1&from=now-7d&to=now&var-data_source_type=Production&var-results_ES=Production-AWS-ES_fio-results&var-analyzed_ES=Production-AWS-ES_fio-analyzed&var-log_ES=Production-AWS-ES_fio-log&var-prom_ES=Production-AWS-ES_fio-prometheus-data&var-user=fio-ocs48-quick&var-UUID=48048ac2-d6c0-5476-9bb6-b14b3d51d554&var-clustername=bene-2021-11-03-09-10-aws&var-Operation=All&var-io_size=All&var-interval=30s&var-es_block_device=All&var-es_net_device=All&var-datasource=Public-AWS-ES_fio_prometheus_data&var-instances=All&var-sample=All

Do not pay attention to the user or clustername fields in the dashboard; those are descriptive fields that I failed to update properly. I'm going by the test UUID, which is in the fio-client pod log. For example, for the UUID beginning with 4804, you get it from log messages like this:

2021-11-07T14:37:35Z - INFO - MainProcess - trigger_fio: fio has successfully finished sample 1 executing for jobname randread and results are in the dir /tmp/fiod-48048ac2-d6c0-5476-9bb6-b14b3d51d554/fiojob-randread-4096KiB-1/1/randread

The raw data for the runs and the must-gather data are under these URLs:

OCP 4.8 + OCS 4.8: http://nfs-storage01.scalelab.redhat.com/bengland/public/openshift/ripsaw/fio-logs/2021-11-07-08-35/
OCP 4.9 + ODF 4.9: http://nfs-storage01.scalelab.redhat.com/bengland/public/openshift/ripsaw/fio-logs/2021-11-08-08-15/

These logs include pod logs + dmesg from all nodes + must-gather + some Ceph outputs taken before the benchmark was deleted.
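If it helps, a quick way to pull that test UUID out of a fio client pod log (the pod name and namespace below are placeholders; adjust to wherever benchmark-operator ran the workload pods):

# Placeholder pod/namespace; prints the first fiod-<uuid> string found in the log
oc -n benchmark-operator logs fio-client-pod | grep -om1 'fiod-[0-9a-f-]*'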
One other difference: I use the Ceph OSD cache dropper and kernel cache dropper features of benchmark-operator. You can see that this has a major impact, because random write tests show a read spike at the beginning of each sample where the OSDs are reading metadata back in. This is intended to tighten up variance so we can distinguish small regressions accurately.
Please note that we see similar degradation in random IO results in OCP 4.9 + OCS 4.9 vs OCP 4.8 + OCS 4.8. The comparison is available here: http://ocsperf.ceph.redhat.com:8080/index.php?version1=5&build1=11&platform1=2&az_topology1=1&test_name%5B%5D=1&version2=6&build2=18&platform2=2&az_topology2=1&version3=&build3=&submit=Choose+options
Forgot to add this article as background: https://medium.com/omio-engineering/cpu-limits-and-aggressive-throttling-in-kubernetes-c5b20bd8a718
Addition to Comment#26: The platform is VMware LSO
Lowering the queue depth from 16 to 2 for my 9-pod test results in stable numbers for 16-KiB randwrites, while still getting reasonably good throughput. We need to retest across the whole range of parameters to verify, but I am getting excellent stability from the tests this way.

Now for the actual regression numbers, from here:
https://docs.google.com/spreadsheets/d/1hnq-N4eEEvpxg82JYAFTsRSHMSMHn3OGLAWTAqrEQsU/edit#gid=1049516327

For 16-KiB block size (drum roll....):
- randread throughput increased by 10% (yay!)
- randwrite throughput decreased by 30% (boo!)

These numbers are very consistent and the percent deviation of the samples is VERY LOW. But why?

I see that more reads are happening during the randwrite test with ODF 4.9:

with OCS 4.8: http://nfs-storage01.scalelab.redhat.com/bengland/public/openshift/ripsaw/fio-logs/2021-11-16-14-32/Screenshot%20from%202021-11-16%2017-13-20.png
with ODF 4.9: http://nfs-storage01.scalelab.redhat.com/bengland/public/openshift/ripsaw/fio-logs/2021-11-17-17-45/iops-over-time.png

To get these PromQL graphs, go to the OpenShift console and replace everything to the right of the hostname in the URL with:

monitoring/query-browser?query0=irate%28node_disk_writes_completed_total%7Bdevice%3D~"nvme%5B0-9%5Dn1"%7D%5B1m%5D%29&query1=irate%28node_disk_reads_completed_total%7Bdevice%3D~"nvme%5B0-9%5Dn1"%7D%5B1m%5D%29

If more reads are happening with ODF 4.9 during the randwrite test than happened with 4.8, this could explain the regression. The next step is to run this test with higher memory limits and see whether this difference goes away. If so, the solution might be to just give the OSDs a little more memory to account for changes in RHCS 5 (Ceph Pacific), which wouldn't be so bad. We can investigate why this might have happened, but the important thing for now is to stop the regression so we can release 4.9.

Right now I've altered the ocs-storagecluster CR as shown here:

[bengland@localhost ripsaw]$ ocos get storagecluster ocs-storagecluster -o json | jq .spec.storageDeviceSets[0].resources
{
  "limits": {
    "cpu": "2",
    "memory": "10Gi"
  },
  "requests": {
    "cpu": "2",
    "memory": "10Gi"
  }
}

So now I get OSDs with more memory and CPU:

[bengland@localhost ripsaw]$ oc -n openshift-storage get pod rook-ceph-osd-2-86dcbc5fbb-wfvnp -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "2",
    "memory": "10Gi"
  },
  "requests": {
    "cpu": "2",
    "memory": "10Gi"
  }
}

Rerunning the test now. If this works, I'll experiment with different amounts of memory to see how much is actually needed. Shekhar Berry has already shed light on this with his BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=2012082
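For readability, the URL-encoded query string above decodes to these two PromQL expressions (per-device disk write and read completions per second, 1-minute irate):

irate(node_disk_writes_completed_total{device=~"nvme[0-9]n1"}[1m])
irate(node_disk_reads_completed_total{device=~"nvme[0-9]n1"}[1m])

And a minimal sketch of making the same resource change with oc patch instead of editing the CR by hand (this assumes the device set to change is storageDeviceSets[0], as in the jq output above; rook-ceph should then reconcile the OSD deployments with the new limits):

# Sketch only: bump OSD resources on storageDeviceSets[0]
oc -n openshift-storage patch storagecluster ocs-storagecluster --type json -p '[
  {"op": "add",
   "path": "/spec/storageDeviceSets/0/resources",
   "value": {"limits":   {"cpu": "2", "memory": "10Gi"},
             "requests": {"cpu": "2", "memory": "10Gi"}}}
]'

# Verify the OSD pods picked up the new limits
oc -n openshift-storage get pods -l app=rook-ceph-osd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'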
Ben, do we have updated numbers?
I reran ODF 4.9 with a 10-GB pod memory limit and a 4-core CPU limit, and I still see a major regression with no real improvement. I re-ran for all block sizes and rand{read,write}; the regression remains in {1,4,16,64}-KiB random writes with no significant regression elsewhere. Will re-focus on 16-KiB randwrites only.

One thing to rule out: since my experiment includes cache dropping (to eliminate variance between samples, which it does successfully), there may be some hit associated with that. Customer systems run without cache dropping, so I'll run hour-long samples and see if the regression stays the same.

Spreadsheet at:
https://docs.google.com/spreadsheets/d/1hnq-N4eEEvpxg82JYAFTsRSHMSMHn3OGLAWTAqrEQsU/edit#gid=553959841

The raw data is at:
http://nfs-storage01.scalelab.redhat.com/bengland/public/openshift/ripsaw/fio-logs/2021-12-05-22-32/

The latest run with ODF 4.9 includes a graph of OSD block device IOPS over time (blkdev-iops-*.png) from the OpenShift console; it shows significant reads. It also includes the requested dumps of Ceph OSD stats and parameters in dump-mempools*log, heap-stats*log, and ceph-config*log. Let me know if there are any other stats I can get, or if you need these stats for OCS 4.8.
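In case someone wants to gather the same kind of stats on another cluster, a minimal sketch using the rook-ceph-tools pod (this assumes the toolbox is deployed in openshift-storage; the osd.0 target and output file names are just examples, not necessarily how the dumps above were produced):

TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -1)

# Per-OSD memory pool usage (bluestore caches, pglog, etc.)
oc -n openshift-storage rsh "$TOOLS" ceph tell osd.0 dump_mempools > dump-mempools-osd0.log

# tcmalloc heap statistics for the same OSD
oc -n openshift-storage rsh "$TOOLS" ceph tell osd.0 heap stats > heap-stats-osd0.log

# Cluster configuration database (shows any osd_memory_target and similar overrides)
oc -n openshift-storage rsh "$TOOLS" ceph config dump > ceph-config.log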
@bengland
The comparison of the FIO benchmark test on AWS, 4.10.0 build 148 vs 4.9.0 build 164, can be found here: http://ocsperf.ceph.redhat.com:8080/index.php?version1=6&build1=13&platform1=1&az_topology1=1&test_name%5B%5D=1&version2=14&build2=36&platform2=1&az_topology2=1&version3=&build3=&version4=&build4=&submit=Choose+options
No decision yet on this, not a 4.10 blocker. Moving it out. Setting NI on Ben based on Patrick's comment.
*** Bug 1984590 has been marked as a duplicate of this bug. ***