Description of problem:
With RGWs configured behind a load balancer, the quota stats cache can end up holding wildly out-of-range values. We have found errors like the ones below in our clusters running Jewel. This happens when PUT and DELETE operations do not hit the same RGW and an eventual update_stats() from a DELETE operation tries to decrement the stats cache. This is easy to verify by putting the RGWs behind a load balancer (I used HAProxy in round-robin mode) and running a script that uploads and deletes objects with the user quota enabled.

20 quota: can't use cached stats, exceeded soft threshold (num objs): 18446744073709551615 >= 190000
10 quota exceeded: stats.num_kb_rounded=18446744073709549572 size_kb=1024 user_quota.max_size_kb=5242880000

Version-Release number of selected component (if applicable): 2.3

How reproducible: 90%

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
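For illustration, the num objs value in the log above, 18446744073709551615, is exactly 2^64 - 1, which is what a 64-bit unsigned counter holds after being decremented from zero. The following is a minimal sketch of that effect, not Ceph's actual code: the GatewayCache class and its methods are hypothetical names, and the clamped variant is only one possible guard, not necessarily what the upstream fix does.

```python
UINT64_MASK = (1 << 64) - 1  # model a 64-bit unsigned counter in Python


class GatewayCache:
    """Hypothetical per-RGW quota stats cache (not Ceph's real structure)."""

    def __init__(self):
        self.num_objects = 0

    def apply_delta(self, delta):
        # Unchecked unsigned arithmetic: going below zero wraps around.
        self.num_objects = (self.num_objects + delta) & UINT64_MASK

    def apply_delta_clamped(self, delta):
        # Assumed guard for illustration only: never go below zero.
        self.num_objects = max(0, self.num_objects + delta)


# Each gateway keeps its own cache; the load balancer splits the requests.
rgw_a, rgw_b = GatewayCache(), GatewayCache()
rgw_a.apply_delta(+1)  # PUT lands on gateway A
rgw_b.apply_delta(-1)  # DELETE lands on gateway B, whose cache never saw the PUT

wrapped = rgw_b.num_objects
print(wrapped)  # 18446744073709551615

guarded = GatewayCache()
guarded.apply_delta_clamped(-1)
print(guarded.num_objects)  # 0
```

Any quota comparison against the wrapped value (for example against the 190000 soft threshold in the log) will then report the quota as exceeded.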
An additional fix is needed; upstream PR: https://github.com/ceph/ceph/pull/17116
@orit, Could you please suggest a reproducer for this?
(In reply to shilpa from comment #9)
> @orit,
>
> Could you please suggest a reproducer for this?

Hi Shilpa,

You will need a cluster with at least two radosgw instances and a load balancer in front of them, for example HAProxy. Use a script that does several object PUTs and then several object DELETEs. The load balancer will send the requests to different radosgws, and you should hit the bug.
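A script along the lines described above could look like the sketch below. It assumes boto3 is installed and that the S3 endpoint is the load balancer address; the environment variable names (RGW_LB_ENDPOINT, RGW_ACCESS_KEY, RGW_SECRET_KEY, RGW_BUCKET), the key prefix, the 1 MiB object size, and the loop counts are all placeholders chosen for illustration. The bucket is assumed to exist already and to belong to a user with quota enabled.

```python
import os


def make_client(endpoint_url, access_key, secret_key):
    # boto3 is a third-party dependency; imported here so the helper
    # below can also be exercised with a stand-in client.
    import boto3

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )


def put_then_delete(client, bucket, count, size_bytes=1024 * 1024):
    """Upload `count` objects, then delete them all.

    Behind a round-robin load balancer the PUTs and DELETEs are spread
    over different radosgw instances, which is the condition described
    in this report.
    """
    body = b"x" * size_bytes
    keys = ["quota-repro-%d" % i for i in range(count)]
    for key in keys:
        client.put_object(Bucket=bucket, Key=key, Body=body)
    for key in keys:
        client.delete_object(Bucket=bucket, Key=key)
    return keys


def main():
    client = make_client(
        os.environ["RGW_LB_ENDPOINT"],  # e.g. the HAProxy frontend URL
        os.environ["RGW_ACCESS_KEY"],
        os.environ["RGW_SECRET_KEY"],
    )
    bucket = os.environ.get("RGW_BUCKET", "quota-repro")
    for _ in range(100):  # repeat; the report says roughly 90% reproducible
        put_then_delete(client, bucket, count=10)
```

Call main() with the environment variables set, then check the radosgw logs for the "quota exceeded" messages quoted in the description.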
Moving back to ASSIGNED based on Marcus' comments above. Matt/Orit, is this something we have the ability to fix and retest in the next week?
Thanks Orit!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2819