.Ceph Object Gateway garbage collection decreases client performance by up to 50% during mixed workload
In testing with a mixed workload of 60% read operations, 16% write operations, 14% delete operations, and 10% list operations, client throughput and bandwidth dropped to half their earlier levels 18 hours into the testing run.
Created attachment 1455401: PERF CHART COSbench
Description of problem:
While running a mixed-operation, customer-representative RGW workload, the client
performance levels (throughput and bandwidth) decrease significantly at a time that aligns with RGW garbage collection activity.
Version-Release number of selected component (if applicable):
RHCEPH-3.1-RHEL-7-20180530.ci.0
Steps to Reproduce:
1. Run the COSbench workload, which contains 60% reads, 16% writes, 14% deletes, and 10% lists (see attached ioWorkload.xml).
2. Let the workload run for the specified runtime of 24 hours.
3. At 18 hours into the runtime, client throughput and bandwidth drop to half
their earlier levels (sharp cliff). See attachment PERF CHART.
4. Review the attached 'garbage collection logfile'.
Job starts at timestamp:
2018/06/27:16:54:38: Pending GC's == 55106
Pending GC count climbs steadily until timestamp (18 hours into the run):
2018/06/28:10:47:05: %RAW USED 55.81; Pending GCs 3404570
From that point on, the 'Pending GC' count falls, indicating increased RGW garbage collection activity. On the PERF CHART there is a sharp decline in performance levels at that same time (sample #12937), and the earlier performance levels do not return. COSbench is using a 5-second sampling interval, so sample #12937 falls roughly 18 hours into the run.
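The alignment between the garbage collection log and the PERF CHART can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the original test tooling. It assumes the log lines have exactly the two shapes quoted above (they come from the reporter's monitoring script, not from Ceph itself) and that COSbench sample numbering starts when the job starts. The names `LINE_RE` and `parse_line` are hypothetical helpers introduced for illustration.

```python
import re
from datetime import datetime

# The two log lines quoted in this report. The format is assumed to be that
# of the reporter's monitoring script; it is not a Ceph log format.
LOG_LINES = [
    "2018/06/27:16:54:38: Pending GC's == 55106",
    "2018/06/28:10:47:05: %RAW USED 55.81; Pending GCs 3404570",
]

# Timestamp, any text, "Pending GC's" or "Pending GCs", optional "==", count.
LINE_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2}:\d{2}:\d{2}:\d{2}):.*Pending GC'?s\s*(?:==)?\s*(\d+)"
)


def parse_line(line):
    """Return (timestamp, pending_gc_count), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group(1), "%Y/%m/%d:%H:%M:%S")
    return ts, int(m.group(2))


samples = [p for p in map(parse_line, LOG_LINES) if p is not None]
(start_ts, start_gc), (peak_ts, peak_gc) = samples[0], samples[-1]

# Elapsed time between job start and the peak pending GC count, from the log.
log_elapsed_s = (peak_ts - start_ts).total_seconds()

# Elapsed time at the performance cliff, from the chart:
# sample #12937 at a 5-second sampling interval.
chart_elapsed_s = 12937 * 5

print(f"log:   {log_elapsed_s / 3600:.2f} h, pending GCs grew by {peak_gc - start_gc}")
print(f"chart: {chart_elapsed_s / 3600:.2f} h")
print(f"offset between the two: {chart_elapsed_s - log_elapsed_s:.0f} s")
```

Under these assumptions, both sources place the event just under 18 hours into the run, a few minutes apart, which is consistent with the performance drop coinciding with the start of the pending GC drain.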
Actual results:
Cluster performance is cut in half during a long-running mixed-operation workload.
Expected results:
Cluster sustains reasonably consistent performance for a long-running mixed-operation workload.
Attachments
1) PERF CHART COSbench
2) ioWorkload.xml
3) garbage collection logfile