Bug 2193367
| Summary: | Rook ceph operator goes to OOMKilled state | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Rewant <resoni> |
| Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED NOTABUG | QA Contact: | Neha Berry <nberry> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.11 | CC: | dbindra, muagarwa, ocs-bugs, odf-bz-bot, omitrani, owasserm, tnielsen |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2203788 | Environment: | |
| Last Closed: | 2023-05-17 04:44:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2203788 | | |
Description
Rewant
2023-05-05 12:07:30 UTC
200Mi is rather low memory for the operator to burst and perform some operations. Upstream we recommend a 512Mi memory limit for the operator: https://github.com/rook/rook/blob/master/deploy/charts/rook-ceph/values.yaml#L23-L27 Since the operator is getting OOM-killed, you'll need to increase the memory.

Thank you @tnielsen for the answer. I would like to mention that we have been running with this setup for at least a year on 5 different production clusters and countless more on staging, and we never encountered this OOM issue. The only change here is the ROSA (OCP) version, from 4.10.z to 4.11.z. We didn't upgrade or change ODF, OCS, or Rook, so it is really strange that this situation manifests itself just now. Is there a way to figure out which operation is consuming the memory? I don't believe in raising memory limits without an explicit explanation of the need, mainly because we might hit something like this again even after raising the limits if we don't know the underlying reason. Could you help us identify the root cause?

4.11 is where we also saw the need to increase the memory limits, and no further increase has been necessary since then. The change was made upstream in https://github.com/rook/rook/pull/10195, based on the upstream issue https://github.com/rook/rook/issues/10192. A full memory analysis was not done at the time, but the most likely related change was the update of the controller-runtime library Rook was using in that release. It's possible some Rook feature in that release triggered the increase in resource usage, but nothing stands out.

@tnielsen We are running ODF 4.10, which brings Rook 4.10. The upgrade was to the OCP version, not the ODF version. There is a planned update from ODF 4.10 to ODF 4.11, but we are blocked by this issue, as the deployer responsible for updating ODF is scaled down. On top of that, we have no headroom to jump to 512Mi, since we are running on ROSA with a very tight fit. We would need to lower some other pod's memory requests and limits to get there, and I am not sure we have any we can touch. What would you suggest?

Since bursts are so important to the operator, upstream we set the memory limit to 4x the request. Any reason this pattern wouldn't work for you?

ROSA guidelines, and a requirement for addon-based software, limit our ability to set a memory limit different from the memory request for each pod of our deployment.

Yes, we have increased the memory from Managed Service and it's working. We can close this.
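For reference, a minimal sketch of the kind of resource bump discussed above, using the kubernetes Python client. The `openshift-storage` namespace and the 128Mi request are assumptions (chosen to match the upstream 4x limit:request pattern mentioned in the thread), and in an ODF or managed-service install the rook-ceph-operator Deployment is typically reconciled by the operator's CSV or the managed-service tooling, so a direct patch like this may be reverted; in this bug the change was ultimately applied from Managed Service.

```python
# Hedged sketch: raise the rook-ceph-operator memory limit toward the upstream
# recommendation (512Mi limit, ~4x the request). Namespace and request values
# are assumptions; adjust to your cluster. Requires the "kubernetes" package.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "rook-ceph-operator",
                        "resources": {
                            "requests": {"memory": "128Mi"},
                            "limits": {"memory": "512Mi"},
                        },
                    }
                ]
            }
        }
    }
}

# Strategic-merge patch of the operator Deployment; containers are matched by name.
apps.patch_namespaced_deployment(
    name="rook-ceph-operator",
    namespace="openshift-storage",
    body=patch,
)
```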