Description of problem (please be as detailed as possible and provide log snippets):

We (managed service team) have 20 TiB production clusters on OCP 4.10.15/4.10.16. When we upgrade them to OCP 4.11, the rook-ceph-operator goes into the OOMKilled state, right after the ceph detect-version pod terminates. On the 4.10.16 cluster, restarting the rook-ceph-operator after the upgrade fixed it; on the 4.10.15 cluster, the operator still goes into the OOMKilled state after the upgrade and a restart. We also upgraded a 4 TiB cluster from OCP 4.10.52 to 4.11.38, where we did not have to restart the rook-ceph-operator. We looked at the rook-ceph-operator logs and found only INFO-level messages.

We have resource limits defined by the ocs-osd-deployer (https://github.com/red-hat-storage/ocs-osd-deployer/blob/main/utils/resources.go):

    "rook-ceph-operator": {
        Limits: corev1.ResourceList{
            "cpu":    resource.MustParse("300m"),
            "memory": resource.MustParse("200Mi"),
        },
        Requests: corev1.ResourceList{
            "cpu":    resource.MustParse("300m"),
            "memory": resource.MustParse("200Mi"),
        },
    },

OHSS ticket for the incident: https://issues.redhat.com/browse/OHSS-21838

Version of all relevant components (if applicable):
OCP version 4.11.38
ODF version 4.10.9

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced? Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:
The rook-ceph-operator goes into the OOMKilled state.

Expected results:
The rook-ceph-operator should be in the Running state.

Additional info:
200Mi is rather low for the operator to burst and perform some operations. Upstream we have recommended a 512Mi memory limit for the operator: https://github.com/rook/rook/blob/master/deploy/charts/rook-ceph/values.yaml#L23-L27 Since the operator is getting OOM-killed, you'll need to increase the memory limit.
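For reference, a minimal sketch of what the straightforward change could look like in the ocs-osd-deployer resources map quoted above, with the memory limit raised from 200Mi to the upstream-recommended 512Mi. The 512Mi figure and leaving the request at 200Mi are assumptions for illustration, not confirmed values for the managed-service configuration.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
    )

    // Hypothetical adjusted entry: memory limit raised to 512Mi per the
    // upstream recommendation; the request is left at 200Mi here only
    // for illustration.
    var rookCephOperatorResources = corev1.ResourceRequirements{
        Limits: corev1.ResourceList{
            "cpu":    resource.MustParse("300m"),
            "memory": resource.MustParse("512Mi"),
        },
        Requests: corev1.ResourceList{
            "cpu":    resource.MustParse("300m"),
            "memory": resource.MustParse("200Mi"),
        },
    }

    func main() {
        lim := rookCephOperatorResources.Limits["memory"]
        req := rookCephOperatorResources.Requests["memory"]
        fmt.Printf("rook-ceph-operator memory: request %s, limit %s\n", req.String(), lim.String())
    }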
Thank you @tnielsen for the answer. I would like to mention that we have been running with this setup for at least a year on 5 different production clusters, and on countless more staging clusters, and we have never encountered this OOM issue. The only change here is the ROSA (OCP) version, from 4.10.z to 4.11.z; we did not upgrade or change ODF, OCS, or Rook. So it is really strange that this situation manifests itself only now. Is there a way to figure out which operation is trying to consume the memory? I don't believe in raising memory limits without an explicit explanation of the need, mainly because we might run into something like this again, even after raising the limits, if we don't know the underlying reason. Could you help us identify the root cause?
4.11 is where we also saw the need to increase the memory limits upstream, and no further increase has been necessary since then. The change was made upstream in https://github.com/rook/rook/pull/10195, based on the upstream issue https://github.com/rook/rook/issues/10192. A full memory analysis was not done at the time, but the most likely related change was the update of the controller-runtime library that Rook was using in that release. It is possible some Rook feature in that release triggered the increase in resource usage, but nothing stands out.
@tnielsen We are running ODF 4.10, which brings Rook 4.10; the upgrade was to the OCP version, not to the ODF version. There is a planned update from ODF 4.10 to ODF 4.11, but we are blocked by this issue because the deployer, which is responsible for updating ODF, is scaled down. On top of that, we have no headroom to jump to 512Mi, as we are running on ROSA with a very tight fit. We would need to lower some other pod's memory requests and limits to get there, and I am not sure we have any we can touch. What would you suggest?
Since bursts are so important to the operator, upstream we have set the limits to 4x the requests for memory. Any reason this pattern wouldn't work for you?
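For illustration, the 4x pattern described above applied to this operator might look roughly like the sketch below, with a 512Mi memory limit over a 128Mi request. The 128Mi request is chosen here only so that the limit is 4x the request, and the CPU values are carried over from the deployer entry quoted earlier; none of these are confirmed values for this deployment.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
    )

    // Hypothetical "limit = 4x request" shape for the operator's memory:
    // 128Mi requested, 512Mi limit, so the pod can burst during upgrades
    // and version detection without being sized for the burst at steady state.
    var burstableOperatorResources = corev1.ResourceRequirements{
        Requests: corev1.ResourceList{
            "cpu":    resource.MustParse("300m"),
            "memory": resource.MustParse("128Mi"),
        },
        Limits: corev1.ResourceList{
            "cpu":    resource.MustParse("300m"),
            "memory": resource.MustParse("512Mi"),
        },
    }

    func main() {
        req := burstableOperatorResources.Requests["memory"]
        lim := burstableOperatorResources.Limits["memory"]
        fmt.Printf("memory request %s, limit %s\n", req.String(), lim.String())
    }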
ROSA guidelines, and a requirement for addon-based software, prevent us from setting a memory limit that differs from the memory request for any pod in our deployment.
Yes, we have increased the memory from the Managed Service side and it is working. We can close this.