.RGW garbage collection fails to keep pace during evenly balanced delete-write workloads
In testing, an evenly balanced delete-write (50% / 50%) workload filled the cluster completely in eleven hours because Object Gateway garbage collection failed to keep pace, at which point the cluster status switched to HEALTH_ERR. Aggressive settings for the new parallel/async garbage collection tunables significantly delayed the onset of cluster fill in testing and can be helpful for many workloads. Typical real-world cluster workloads are unlikely to fill a cluster primarily because of garbage collection.
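For reference, the parallel/async garbage collection tunables live in the RGW section of ceph.conf; a minimal sketch follows (the values shown are illustrative placeholders, not recommendations):

    [client.rgw.<instance_name>]
    # parallel/async GC tunables (illustrative values)
    rgw_gc_max_concurrent_io = 20      # concurrent I/Os issued by the GC processor
    rgw_gc_max_trim_chunk = 64         # GC log entries trimmed per operation
    # pre-existing GC knobs that interact with the above
    rgw_gc_obj_min_wait = 7200         # seconds before deleted data is eligible for GC
    rgw_gc_processor_period = 3600     # seconds between GC processing cycles
    rgw_gc_processor_max_time = 3600   # maximum seconds a single GC cycle may run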
Description of problem:
Running an evenly balanced delete-write (50% / 50%) workload fills the cluster
in 11 hours. RGW garbage collection fails to keep pace. Note that with the
previous version, RHCS 3, the cluster would fill in about 3 hours, so there
is definite improvement with this release.
Version-Release number of selected component (if applicable):
RHCEPH-3.1-RHEL-7-20180530.ci.0
Steps to Reproduce:
1. Fill the cluster to 30%
2. Start an evenly balanced delete-write workload
3. Run it for an extended period, monitoring cluster capacity and pending GC entries (see the example commands below)
4. The cluster %RAW USED keeps rising and the pending GC entries keep increasing
5. Eventually the cluster fills and reaches the HEALTH_ERR state
I have automation at https://github.com/jharriga/GCrate to assist
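While the workload runs, cluster capacity and the GC backlog can be watched with standard commands, for example:

    # overall and per-pool capacity (%RAW USED)
    ceph df detail

    # pending GC entries; --include-all also lists entries that have not yet
    # passed the rgw_gc_obj_min_wait expiration. Counting "tag" lines in the
    # JSON output gives a rough count of queued GC entries.
    radosgw-admin gc list --include-all | grep -c tag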
Actual results:
Cluster fills and reaches HEALTH_ERR state
Expected results:
When the workload requires it, garbage collection can be made aggressive enough to keep pace with the workload
Additional info:
Product documentation (Ceph Object Gateway for Production) should guide users on monitoring and tuning garbage collection.
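As one example of what that guidance could cover, the GC queue can be processed on demand instead of waiting for the next scheduled cycle:

    # run a garbage collection cycle immediately rather than waiting for the
    # next rgw_gc_processor_period
    radosgw-admin gc process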
Did we try an experiment where we do a mixture of reads, writes, and deletes? Is a pure write-delete workload normal? I would suggest we try it with something like half read and half write and see whether GC keeps up in that case. If it does, then perhaps this is acceptable for now, and we can document tuning for the case where the workload is pure write-delete.
But my original suggestion was to speed up garbage-collection activity as the system fills up. There is no harm in doing garbage collection aggressively if the system is about to run out of storage anyway. I think librados lets you ask how full storage is, perhaps with the rados_cluster_stat() function? Could that be considered for a future release? See
http://docs.ceph.com/docs/luminous/rados/api/librados/
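To illustrate the idea, here is a minimal sketch using the librados C API linked above; the 85% threshold and the notion of switching GC behaviour based on it are hypothetical, not existing RGW behaviour:

    /* fullness.c - report cluster fullness via librados (build: gcc fullness.c -lrados) */
    #include <rados/librados.h>
    #include <stdio.h>

    int main(void)
    {
        rados_t cluster;
        struct rados_cluster_stat_t st;

        if (rados_create(&cluster, NULL) < 0 ||         /* default client.admin */
            rados_conf_read_file(cluster, NULL) < 0 ||  /* default ceph.conf search path */
            rados_connect(cluster) < 0)
            return 1;

        if (rados_cluster_stat(cluster, &st) == 0) {
            double full = (double)st.kb_used / (double)st.kb;
            printf("cluster is %.1f%% full\n", full * 100.0);
            /* hypothetical: RGW could ramp up GC concurrency once, say, 85% full */
            if (full > 0.85)
                printf("would switch GC to aggressive settings here\n");
        }

        rados_shutdown(cluster);
        return 0;
    }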
I have also done a number of runs using a workload I refer to as 'hybrid'. It has
this operation mix: 60% read, 16% write, 14% delete, and 10% list. I have been
able to run this for extended periods (24 hours) and the RGW garbage collection
in RHCS 3.1 does keep pace.
Unfortunately, while running the hybrid workload for extended periods I have observed
a significant drop-off in client performance once GC activity kicks in.
See https://bugzilla.redhat.com/show_bug.cgi?id=1596401
I added the deleteWrite COSbench XML file.
The runtime=36000 setting means the workload will run for ten hours; obviously
that can be changed. Be aware that on the Scale Lab configuration I have, with
312 OSDs and 486 TB of storage, the cluster gets full in 11 hours.
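The attached file is not reproduced here, but for anyone building their own, this is a rough sketch of what such a delete-write COSbench workload definition generally looks like (endpoint, credentials, worker count, bucket/object ranges, and object size are all placeholders):

    <workload name="deleteWrite" description="50/50 delete-write">
      <storage type="s3" config="accesskey=KEY;secretkey=SECRET;endpoint=http://rgw-host:8080" />
      <workflow>
        <workstage name="main">
          <work name="deletewrite" workers="64" runtime="36000">
            <operation type="write"  ratio="50"
                       config="cprefix=gcbkt;containers=u(1,10);objects=u(1,100000);sizes=c(4)MB" />
            <operation type="delete" ratio="50"
                       config="cprefix=gcbkt;containers=u(1,10);objects=u(1,100000)" />
          </work>
        </workstage>
      </workflow>
    </workload>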
Ceph doesn't look kindly on cluster-full situations, and in my experience
that condition necessitates a cluster purge and redeploy.
Comment 9 - Ken Dreyer (Red Hat) 2018-07-24 20:41:04 UTC
Discussed with Matt and Scott, re-targeting to 3.2