Bug 1918683
| Summary: | prometheus faces inexplicable OOM | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Pablo Alonso Rodriguez <palonsor> |
| Component: | Monitoring | Assignee: | Simon Pasquier <spasquie> |
| Status: | CLOSED NOTABUG | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.5 | CC: | adeshpan, akhaire, alegrand, anpicker, dahernan, ddelcian, dtarabor, erich, erooth, hongyli, igreen, jkaur, kakkoyun, kiyyappa, lcosic, mhernon, ocasalsa, pchavan, pkrupa, rugouvei, spasquie, ssadhale, ssonigra, vlours, wking |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-06-02 06:27:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Pablo Alonso Rodriguez, 2021-01-21 11:20:24 UTC
Comment 4, Simon Pasquier:

I've looked at the gathered data in supportshell and unfortunately I didn't find any obvious cause for the issue. I also can't decompress the rar files attached to the BZ: on my laptop, the operation never ends and consumes all free space.

It may well be that Prometheus is crashlooping because it consumes too much memory at startup during WAL replay. That said, given the size of the cluster, it shouldn't need 80G in steady state. Can you remove the WAL directory once more and graph the following metrics over a couple of hours (ideally at least 4h)?
* prometheus_tsdb_head_series
* sum by(pod) (rate(prometheus_tsdb_head_samples_appended_total[5m]))
* container_memory_working_set_bytes{namespace="openshift-monitoring",container=""}
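To graph these metrics, the queries above can be run against the Prometheus range-query endpoint. A minimal sketch of building the request URLs, assuming a placeholder base URL (inside the cluster you would typically go through the prometheus-k8s service with a bearer token, which is omitted here; the timestamps are invented for illustration):

```python
# Build /api/v1/query_range URLs for graphing the three metrics above
# over a 4h window. The base URL and timestamps are placeholder assumptions.
from urllib.parse import urlencode

QUERIES = [
    'prometheus_tsdb_head_series',
    'sum by(pod) (rate(prometheus_tsdb_head_samples_appended_total[5m]))',
    'container_memory_working_set_bytes{namespace="openshift-monitoring",container=""}',
]

def range_query_url(base, query, start, end, step="30s"):
    """Return a Prometheus range-query URL for the [start, end] Unix timestamps."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base}/api/v1/query_range?{params}"

# Example: a 4h window (end - start = 14400s) against an assumed local endpoint.
urls = [range_query_url("http://localhost:9090", q, 1611300000, 1611314400)
        for q in QUERIES]
```

Each URL can then be fetched with curl or any HTTP client and the returned matrix plotted.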
(In reply to Simon Pasquier from comment #4)
> Can you remove the WAL directory once more and graph the following metrics
> over a couple of hours (ideally at least 4h)?

The customer has even recreated the PVCs, as well as trying to remove the WAL directory. Is the WAL data kept somewhere else, or are we possibly missing something? The rar collectl files do extract on supportshell; they just show the Prometheus memory growth, which happens within minutes.

Ilan, the WAL data is part of the PVC contents. I was about to write that we have already tried doing so (both removing only the WAL and the whole PV) and Prometheus was only running for a few minutes, so I guess that is not enough time to gather the metrics you mentioned, or is it? Regarding collectl, I am re-uploading in tar.xz (I actually had no issues extracting them, but just in case it works better for you).

Created attachment 1751251 [details]
prometheus memory usage surge in a short time

After upgrading from 4.6.13 to 4.7.0-0.nightly-2021-01-22-134922, memory usage for Prometheus increased within a short time.
*** Bug 1922035 has been marked as a duplicate of this bug. ***

Created attachment 1756338 [details]
Prometheus series and memory metrics during upgrade

Regarding the memory rise reported in attachment 1751251 [details], it is not as bad as it seems (though there is definitely an increase in memory usage during and after the upgrade). I've done a 4.6 -> 4.7 upgrade and the peak isn't that high if you look at the raw data. My assumption is that when the Prometheus container is restarted, it doesn't have time to mark its metrics as stale, which means that for about 5 minutes the sum() operation adds the memory of the running pod to the memory of the old pod.

*** Bug 1929875 has been marked as a duplicate of this bug. ***

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.
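The double-counting effect described in the upgrade analysis above can be illustrated with a small sketch (the pod labels and byte values are invented for illustration):

```python
# Until the old container's series are marked stale (roughly 5 minutes),
# an instant-vector sum() over container_memory_working_set_bytes sees
# samples from both the old and the new pod, so the graph shows a "spike".
def summed_working_set(series):
    """Mimic sum() over all series present at one evaluation instant."""
    return sum(series.values())

steady = {"prometheus-k8s-0": 8 * 1024**3}  # one live series, 8 GiB
overlap = {                                  # restart window, both series present
    "prometheus-k8s-0 (old, not yet stale)": 8 * 1024**3,
    "prometheus-k8s-0 (new)": 9 * 1024**3,
}

print(summed_working_set(steady) // 1024**3)   # 8 GiB in steady state
print(summed_working_set(overlap) // 1024**3)  # 17 GiB: apparent spike
```

Looking at the raw per-series data, rather than the sum, avoids misreading this window as a real memory surge.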