Created attachment 1452684 [details]
perf stats from start of cluster fill workload job

Description of problem:
Using build RHCEPH-3.1-RHEL-7-20180530.ci.0
  filestore journals on NVMe devices
  bucket index pools on NVMe devices
  12x OSD nodes, 24 HDDs and 2 NVMe's each
  24 Client nodes
  COSbench workload generator

While performing a 25% cluster fill workload, the client I/O performance as
recorded by COSbench in throughput (ops/sec) and bandwidth (MB/s) drops
dramatically over the life of the run (see attached files: w21start.png and
w21end.png).

COSbench workload:
# cat fillWorkload.xml
<?xml version="1.0" encoding="UTF-8" ?>
<workload name="fillCluster" description="RGW GCrate testing">
  <!-- Initialization -->
  <storage type="swift" config="timeout=900000" retry="3"/>
  <auth type="swauth" config="username=johndoe:swift;password=KOVjOkEJbapU2RDyihys96FtOyAEGUtwCmRDNwUe;auth_url=http://localhost:5000/auth/1.0" />

  <workflow>
    <!-- Initialization -->
    <workstage name="init_containers">
      <work type="init" workers="1" config="containers=r(1,5)"/>
    </workstage>

    <!-- Prepare using container counts, object counts and object sizes from vars.shinc -->
    <workstage name="preparing_cluster">
      <work type="prepare" workers="72" config="containers=r(1,5);objects=r(1,400000);sizes=h(1|1|50,64|64|15,8192|8192|15,65536|65536|15,1048576|1048576|5)KB"/>
    </workstage>
  </workflow>
</workload>

Expected results:
Performance should remain (mostly) consistent for the life of the run.
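For context, a back-of-the-envelope reading of the size histogram in the
prepare stage (assuming the h(size|size|weight) buckets are in KB with
percentage weights, as in the config above): the average object size works
out to roughly 0.50*1 + 0.15*64 + 0.15*8192 + 0.15*65536 + 0.05*1048576
≈ 63,500 KB, i.e. about 62 MB per object. With up to 5 containers of 400,000
objects each, the prepare stage writes on the order of 2M RGW objects,
roughly 120 TB of data.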
Created attachment 1452686 [details] perf stats from end of cluster fill workload job
Created attachment 1452717 [details] charts
I have added a new attachment which shows the perf chart from COSbench.
The slowdown hits hard around sample 1600. With a 5 sec sampling interval,
that correlates to roughly 133 minutes into the run.
Created attachment 1452719 [details] latency plot over cluster fill runtime
Created attachment 1452720 [details] throughput plot over cluster fill runtime
Using build RHCEPH-3.1-RHEL-7-20180530.ci.0

Also worth noting that once the performance drops off, it does not seem to
come back. Throughput drops from ~200 ops/sec to around 50 ops/sec.

At the job start 'ceph -s' shows:
    client: 4148 kB/s rd, 8910 MB/s wr, 5002 op/s rd, 4345 op/s wr
After 133 minutes of runtime it's:
    client: 1046 kB/s rd, 2988 MB/s wr, 1273 op/s rd, 1636 op/s wr

I ran a 50% cluster fill test as well. The perf results indicate the same
behaviour: after 133 minutes of runtime the client I/O rates drop hard.

Total time to do the cluster fills:
* 25% fill: 300 min
* 50% fill: 1200 min
So four times the runtime for 2x the capacity.
Created attachment 1452722 [details]
OSD log from c04-h17-6048r (osd.83) during the run

Random OSD log from the run
An RGW log from the run (too large to attach): http://file.rdu.redhat.com/~dfuller/ceph-rgw-c04-h17-6048r.log-20180618.gz
A lengthy but informative email thread has raised the likely possibility that
this slowdown is due to a known issue with filestore split/merge activity.
Previously filed as BZ1219974, BZ1332874 and BZ1367448.

Based on previous testing results which proved to mitigate the effect, I will
be re-testing using increased filestore split/merge values, 16 and 48
respectively.

For a further explanation see pgs 34-36 of this document:
https://www.redhat.com/cms/managed-files/st-ceph-storage-qct-object-storage-reference-architecture-f7901-201706-v2-en.pdf
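In ceph.conf terms that would look roughly like the sketch below (this is an
assumed mapping of the two numbers onto the option names, 16 as the split
multiple and 48 as the merge threshold, not the exact retest config):

[osd]
# assumed mapping: 16 -> filestore_split_multiple, 48 -> filestore_merge_threshold
filestore_split_multiple = 16
filestore_merge_threshold = 48
# with these values a PG subdirectory splits at abs(48) * 16 * 16 = 12288
# objects, versus abs(10) * 2 * 16 = 320 with the defaults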
Reassigned to RADOS given the current discussion.
Created attachment 1453018 [details] filestore 50% fill using RHCS 3.0 (but w/different object sizes)
Created attachment 1453019 [details] bluestore 50% fill using RHCS 3.0 (but w/different object sizes)
> What about 'is this a regression from 3.0'?

The RGW failure testing I did in the Scale Lab earlier this year used RHCS
3.0. However, most of that testing was on bluestore, and we were using a
different object size distribution and more containers (100 vs 5) - so be
careful using this as a direct comparison.

We did see a significant cluster fill rate improvement for bluestore over
filestore. Those results can be seen here (SEE SLIDE 3):
https://docs.google.com/presentation/d/1sapLXG0652iHL8QCjnrn83gvzh6gfJqB2zt0QFo40R8/edit?usp=sharing

I looked through the COSbench results for those cluster fill runs. I do see a
performance dropoff during the filestore runs which I don't see with
bluestore. It is not as dramatic as what I am currently seeing, but it is
there.

Comparing a 50% cluster fill for bluestore vs filestore, both jobs completed
4.3 million operations and wrote 91 TB:
* filestore (jobID w168)
* bluestore (jobID w186)

I have attached two screenshots which show performance rates for the above two
COSbench runs. Notice that the bluestore run delivers consistent throughput,
response time and bandwidth. The filestore screenshot shows a reduction as the
run continues, starting at around sample number 2200.
Performed a run using these commands (expected_num_objects=50M):
# ceph osd crush rule create-erasure default.rgw.buckets.data myprofile
# ceph osd pool create default.rgw.buckets.data 4096 erasure myprofile default.rgw.buckets.data 50000000

The job started at 17:16:15 and completed at 20:58:28.
The first split event on osd.256 was at 19:17:44 UTC - two hours into the run.

I have attached the screenshot with the COSbench charts
(w34ExpectedNumObjects50M).

I am going to kick off another run increasing expected_num_objects to
1 trillion and will update the BZ with those results later.
Created attachment 1453029 [details] cluster fill with expectNumObj=50M at pool creation
It is worth noting that the performance dropoff for run w34 is much more
subtle than what I was observing on earlier runs. You can definitely see it
(reduced throughput, increased resTime) in the screenshot, but it is nowhere
near the slowdown that was seen when running without precreation. Again, I
will add the results from the 1 trillion precreation run later.
The 1 trillion precreate failed, leaving the cluster in HEALTH_WARN with 140
OSDs down. I will be at an offsite tomorrow, so further testing will have to
wait until Thursday. It seems we need to find a more reasonable value.
At the completion of my 25% cluster fill test, 'ceph df' reports 26M objects
in the data pool. In my most recent run (w34) I precreated 50M objects and
directory splitting still occurred - why?

And the BIGGIE:
Given an RGW workload (# of RGW objects, size of RGW objects, ...), what is
the formula for determining how many objects need to be precreated?

Also, when using precreate, how should these values be set?
* filestore_merge_threshold (default 10) - currently using -1
* filestore_split_multiple (default 2) - using default
Are they still impactful?

thanks,
John
(In reply to John Harrigan from comment #25)
> At the completion of my 25% cluster fill test 'ceph df' reports 26M objects
> in the data pool. In my most recent run (w34) I precreated 50M objects and
> directory splitting still occurred, why?

A couple possibilities come to mind:
- other pools are getting split (perhaps some rgw metadata pool ends up with
  more objects than I thought)
- distribution of objects is skewed
- there's a bug in pre-splitting

Do you have the output of 'tree' on an osd after pre-splitting, or 'ceph df'
and 'ceph osd dump' to see the number of objects per pg in other pools?

> And the BIGGIE:
> Given an RGW workload (# of RGW objects, size of RGW objects, ...) what is
> the formula for determining how many need to be precreated?
>
> Also, when using precreate how should these values be set?
> * filestore_merge_threshold (default 10) - currently using -1
> * filestore_split_multiple (default 2) - using default
>
> are they still impactful?

These still matter - as the docs mention, the threshold at which directories
are split is abs(merge_threshold) * split_multiple * 16, so using -1 instead
of 10 as the merge threshold results in a deeper directory hierarchy.

The most urgent thing to determine is not the optimal settings though, but
whether this is a regression from 3.0 - how soon can we run the same workload
with the same settings on 3.0 to compare?
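To put a number on that threshold (an illustrative calculation): with the
defaults (merge_threshold=10, split_multiple=2) a PG subdirectory splits at
abs(10) * 2 * 16 = 320 objects, while with the merge_threshold=-1 used in
these runs it splits at abs(-1) * 2 * 16 = 32 objects per subdirectory, so
splitting starts sooner and recurses into more levels for the same object
count.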
Created attachment 1454404 [details] cluster fill with expectNumObj=500M at pool creation (w38)
Further testing with RHCEPH-3.1-RHEL-7-20180530.ci.0 proves that using a large
enough expected_num_objects at pool creation avoids the slowdown. See
attachment "cluster fill with expectedNumObjects=500M".

Comparing the first attachment "charts" with this recent attachment, you will
see how the runtime for the cluster fill is dramatically reduced. Also, the
responseTime does not increase as the run progresses.

There are questions that remain:
* provide field guidance on the value to set for expected_num_objects. Given a
  certain number and size of RGW objects, containers and what else - what is
  the formula to calculate it?
* what are the tradeoffs for setting expected_num_objects to a high value?
  What value is too high?
* could this be done automatically for RGW data pools?

I see no mention of this topic in the "Ceph Object Gateway for Production"
guide:
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_object_gateway_for_production/
(In reply to John Harrigan from comment #28)
> Further testing with RHCEPH-3.1-RHEL-7-20180530.ci.0 proves that
> using a large enough expected_num_objects at pool creation avoids the
> slowdown. See attachment "cluster fill with expectedNumObjects=500M"
>
> Comparing first attachment "charts" and this recent attachment you will
> see how the runtime for cluster fill is dramatically reduced. Also the
> responseTime does not increase as the run progresses.
>
> There are questions that remain:
> * provide field guidance on value to set for expected_num_objects. Given a
>   certain number and size of RGW objects, containers and what else - what is
>   the formula to calculate?

Buckets don't matter, only the number of objects and their size are factors.
The number of rados objects is roughly:
  number of rgw objects * ceil(avg rgw object size / rgw_max_chunk_size)

> * what are the tradeoffs for setting expected_num_objects to high value.
>   what value is too high?

You get a deeper directory hierarchy, and thus less efficient dentry cache
use, which can increase latency.

> * could this be done automatically for RGW data pools?

We could use a conservative estimate as a default to avoid at least a few
levels of splitting when setting up rgw pools with ceph-ansible. E.g. the
default rgw data pool could be created with
expected_num_objects = num_osds * 10M.

> I see no mention of this topic in the "Ceph Object Gateway for Production
> guide"
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-
> single/ceph_object_gateway_for_production/

These guidelines should certainly be added to the docs.
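As a rough, illustrative application of that formula (assuming the default
rgw_max_chunk_size of 4 MB and the size histogram from the original
description): the 1 KB and 64 KB objects map to one rados object each, the
8 MB objects to about 2, the 64 MB objects to about 16, and the 1 GB objects
to about 256. Weighted by the histogram (50/15/15/15/5 percent) that is
roughly 0.5*1 + 0.15*1 + 0.15*2 + 0.15*16 + 0.05*256 ≈ 16 rados objects per
rgw object, so the ~2M rgw objects in the fill workload come out to on the
order of 30M rados objects - in line with the ~26M objects 'ceph df' reported
after the 25% fill, and a reasonable lower bound when choosing
expected_num_objects.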
Now we print a warning instead of trying to guess a good value automatically. Guessing the initial value too high can cause suicide timeouts on the OSDs.
Looks like Matt moved this to 3.2 by mistake. It will be marked as verified
in 3.1.
The following commits are in the upstream PR and missing from the downstream
ceph-3.1-rhel-patches branch:
  "mon/OSDMonitor: Warn when expected_num_objects will have no effect"
  "mon/OSDMonitor: Warn if missing expected_num_objects"
These patches have been applied downstream.
I do not know why this was assigned to me as QA contact. Resetting.
To verify, you can create a pool with expectedNumObjs and look at the
directory structure on one of the OSD nodes, like this:

# cd /var/lib/ceph/osd; for D in */; do
    cd $D
    echo -n "${PWD##*/} "
    find . -type d -print | wc -l
    cd ..
  done

This will print the number of directories found under each of the OSDs.
On a freshly deployed cluster these counts will be low. After you create
a pool with expectedNumObjs, these counts should go up.

In addition, if you run a write-intense workload (like fillCluster.sh from
https://github.com/jharriga/GCrate/tree/master) it will perform consistently
throughout its runtime when the default.rgw.buckets.data pool is created with
expectedNumObjs. The GCrate scripts contain two versions for pool creation:
* resetRGW.sh does not use expectedNumObj
* resetRGW.shPRECREATE does use expectedNumObj

Any questions, let me know.
- John
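For reference, pool creation with expectedNumObjs follows the same form used
for the earlier w34/w38 runs in this BZ. A hedged example only - the rule,
profile and pool names below are placeholders, the erasure-code profile is
assumed to already exist, and the final argument is expected_num_objects:

# ceph osd crush rule create-erasure verify.rgw.buckets.data testprofile
# ceph osd pool create verify.rgw.buckets.data 1024 erasure testprofile verify.rgw.buckets.data 500000000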
(In reply to John Harrigan from comment #40)
> to verify you can create a pool with expectedNumObjs and look at the
> directory structure on one of the OSD nodes, like this
>
> # cd /var/lib/ceph/osd; for D in */; do
>     cd $D
>     echo -n "${PWD##*/} "
>     find . -type d -print | wc -l
>     cd ..
>   done
>
> This will print the number of directories found under each of the OSDs.
> On a freshly deployed cluster these counts will be low. After you create
> a pool with expectedNumObjs, these counts should go up.
>
> In addition, if you run a write intense workload (like fillCluster.sh from
> https://github.com/jharriga/GCrate/tree/master) it will perform consistently
> throughout its runtime when the default.rgw.buckets.data pool is created
> with expectedNumObjs. The GCrate scripts contain two versions for pool
> creation:
> * resetRGW.sh does not use expectedNumObj
> * resetRGW.shPRECREATE does use expectedNumObj
>
> any questions, let me know.
> - John

The other aspect of verifying this BZ has to do with the warning msgs cited in
comment #30 above and in the External Bug ID (Github):
https://github.com/ceph/ceph/pull/23072

Here is a suggested procedure:
Try to create a filestore pool with >=100 PGs per OSD and verify a warning
message is printed.

# ceph osd pool create testpool $(( `ceph osd ls |wc -l` * 100 ))

A pool should be created and a warning message should be printed.
Verified with build 12.2.5-39.el7cp with the scenarios below:

1. Create a pool with expectedNumObjs and check the directory structure:
The directory count on the OSD node increases dramatically compared to not
using expectedNumObj:

[root@gprfs041 ~]# cd /var/lib/ceph/osd; for D in */; do
    cd $D
    echo -n "${PWD##*/} "
    find . -type d -print | wc -l
    cd ..
  done
> ceph-13 20245
> ceph-16 20937
> ceph-19 17551
> ceph-2 16559
> ceph-20 16835
> ceph-24 13679
> ceph-25 24103
> ceph-26 20653
> ceph-28 12759
> ceph-33 21079
> ceph-5 18875
> ceph-8 19011

2.
* Run fillCluster 30% with use of expectedNumObject (resetRGW.shPRECREATE):
  Performance is consistent during the run.
  Using numOBJ=40k and expectednumobj=50M, workers=10
  Bandwidth=1.03GB/S  Avg-ResTime=513ms  99%-RT=9710ms
* Run fillCluster 30% without use of expectedNumObjects (resetRGW.sh):
  numOBJ=40k and expectednumobj=50M, workers=10
  Performance is consistent during the run:
  Bandwidth=758.4MB/S  Avg-ResTime=703ms  99%-RT=15460ms

3. Also, testing the message printed out when creating a pool: the message is
displayed and the pool is created, which is the correct behavior.

[root@c04-h01-6048r yum.repos.d]# ceph osd pool create testpool 1024
For better initial performance on pools expected to store a large number of
objects, consider supplying the expected_num_objects parameter when creating
the pool.
[root@c04-h01-6048r yum.repos.d]# ceph osd lspools
0 rbd,1 .rgw.root,2 default.rgw.control,3 default.rgw.data.root,4 default.rgw.gc,5 default.rgw.log,6 testpool,
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2819