Updating the QA Contact to Hemant. Hemant will reroute it to the appropriate QE Associate. Regards, Giri
About comment 1: you said you set the cgroup memory limit to 2 GB, but the OOM kill happened at 6 GB. Why wasn't it killed at 2 GB? Also, why didn't the clients fail over to a different RGW server and continue running? Perhaps a load balancer wasn't used?

About comment 3: if "we cannot identify a reliable memory limit", then the proposed workaround is not really preventing the problem from occurring later, just postponing it, right? We have to know ahead of time how much memory RGW requires, for a variety of reasons.
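To help rule out a misconfigured limit on the comment 1 question, here is a minimal sketch (not part of the original report) that checks which memory cgroup the radosgw process actually belongs to and what limit that cgroup enforces. It assumes cgroup v1 paths under /sys/fs/cgroup/memory and a daemon process named radosgw; under cgroup v2 the relevant file is memory.max instead.

```python
#!/usr/bin/env python3
"""Sketch: verify the effective cgroup memory limit for the radosgw process.

Assumes cgroup v1 (memory controller mounted at /sys/fs/cgroup/memory) and a
process named 'radosgw'. Under cgroup v2 the limit file is 'memory.max'.
"""
import os

def find_pids(name="radosgw"):
    """Return PIDs whose executable name matches `name` (from /proc/<pid>/comm)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited or is not readable
    return pids

def memory_cgroup(pid):
    """Return the memory-controller cgroup path for a PID (cgroup v1 format)."""
    with open(f"/proc/{pid}/cgroup") as f:
        for line in f:
            _, controllers, path = line.strip().split(":", 2)
            if "memory" in controllers.split(","):
                return path
    return None

for pid in find_pids():
    cg = memory_cgroup(pid)
    limit_file = f"/sys/fs/cgroup/memory{cg}/memory.limit_in_bytes"
    usage_file = f"/sys/fs/cgroup/memory{cg}/memory.usage_in_bytes"
    try:
        with open(limit_file) as f:
            limit = int(f.read())
        with open(usage_file) as f:
            usage = int(f.read())
        print(f"pid={pid} cgroup={cg} limit={limit/2**30:.2f}GiB usage={usage/2**30:.2f}GiB")
    except OSError as e:
        print(f"pid={pid} cgroup={cg}: could not read limit/usage ({e})")
```

If the reported cgroup path is "/" or the limit file shows the default (effectively unlimited) value, the 2 GB cap was never applied to the daemon, which would explain why the kill only happened at 6 GB.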
cc'ing Karan Singh, who has worked with RGW in some really large configurations (1 billion objects). https://docs.google.com/document/d/1uKq5TLZFDc5IWVCa5EekWQU6eoB5QOmBXE05FVpy6QU/edit Karan, any sign of RGW daemon memory usage growth during your tests?
Matt, what's the next step here?
Matt / Mkogon: I am in the middle of ingesting 10 billion objects (as I write this, 800 million have been successfully ingested). If you want me to capture this data point, please provide instructions for capturing it. Currently I do not see any RGW memory metrics in Prometheus. If you like, I can give you SSH access to the environment.
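In the absence of an RGW memory metric in Prometheus, one low-effort option is to sample the resident set size of the radosgw process directly from /proc while the ingest runs. The sketch below is only a suggestion, not an official tool; the process name "radosgw", the 60-second interval, and the output path are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Sketch: periodically log radosgw RSS to a CSV file during the ingest run.

Assumptions (not from the original report): the daemon process is named
'radosgw', a 60-second sample interval is acceptable, and writing to
/tmp/rgw_rss.csv is fine.
"""
import csv
import os
import time

PROCESS_NAME = "radosgw"   # adjust to the actual daemon name if different
INTERVAL_SECONDS = 60      # illustrative sample interval
OUTPUT_PATH = "/tmp/rgw_rss.csv"

def rss_kib_by_pid(name):
    """Map PID -> VmRSS in KiB for every process whose comm matches `name`."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() != name:
                    continue
            with open(f"/proc/{entry}/status") as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        result[int(entry)] = int(line.split()[1])  # value is in kB
                        break
        except OSError:
            continue  # process exited between listdir and open
    return result

with open(OUTPUT_PATH, "a", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["timestamp", "pid", "rss_kib"])
    while True:
        now = int(time.time())
        for pid, rss in rss_kib_by_pid(PROCESS_NAME).items():
            writer.writerow([now, pid, rss])
        out.flush()
        time.sleep(INTERVAL_SECONDS)
```

Left running in the background (e.g. under nohup or a tmux session), this would give a per-daemon memory timeline that can be correlated with the object count at each point of the ingest.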
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 4.1 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4144