Description of problem:

- OpenShift 3.11 metric nodes lock up and become unresponsive, requiring a reboot.
- The reason containerd appears to be unresponsive is that the node has run out of all usable memory, which has been consumed by Prometheus.
- In vmcores captured from two separate nodes with two separate occurrences, all available memory was used by Prometheus:

crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  10254501      39.1 GB         ----
         FREE     58857     229.9 MB    0% of TOTAL MEM
         USED  10195644      38.9 GB   99% of TOTAL MEM
       SHARED      9881      38.6 MB    0% of TOTAL MEM
      BUFFERS         0            0    0% of TOTAL MEM
       CACHED     35778     139.8 MB    0% of TOTAL MEM
         SLAB     86828     339.2 MB    0% of TOTAL MEM

   TOTAL HUGE         0            0         ----
    HUGE FREE         0            0    0% of TOTAL HUGE

   TOTAL SWAP         0            0         ----
    SWAP USED         0            0    0% of TOTAL SWAP
    SWAP FREE         0            0    0% of TOTAL SWAP

 COMMIT LIMIT   5127250      19.6 GB         ----
    COMMITTED  11012811        42 GB  214% of TOTAL LIMIT

- The following shows 38.1 GB of the 42 GB of committed memory belongs to prometheus, leaving none for basic operations.

crash> ps | head -n1
   PID    PPID  CPU       TASK        ST  %MEM     VSZ       RSS     COMM
crash> ps | sort -nrk 8 | head -n50
>  54299  38195   2  ffff9881a87f2100  RU  88.6  70417064  38109120  prometheus
>  54298  38195   5  ffff9881a87f0000  RU  88.6  70417064  38109120  prometheus
>  38307  38195   3  ffff9881b1ba9080  RU  88.6  70417064  38109120  prometheus
   54746  38195   3  ffff9879037c6300  UN  88.6  70417064  38109120  prometheus
   54301  38195   0  ffff98820752e300  UN  88.6  70417064  38109120  prometheus
   54300  38195   1  ffff9881a87f6300  IN  88.6  70417064  38109120  prometheus
   54297  38195   4  ffff98819c3a2100  IN  88.6  70417064  38109120  prometheus
   38811  38195   0  ffff987a239aa100  IN  88.6  70417064  38109120  prometheus
   38467  38195   4  ffff98820708a100  UN  88.6  70417064  38109120  prometheus
   38466  38195   0  ffff987ce2ee1080  IN  88.6  70417064  38109120  prometheus
   38311  38195   7  ffff98820aa28000  IN  88.6  70417064  38109120  prometheus
   38310  38195   3  ffff98820aa2b180  IN  88.6  70417064  38109120  prometheus
   38308  38195   1  ffff9881b8d31080  UN  88.6  70417064  38109120  prometheus
   38305  38195   1  ffff9881cafc2100  UN  88.6  70417064  38109120  prometheus
   38304  38195   2  ffff9881b1bac200  IN  88.6  70417064  38109120  prometheus
   38303  38195   3  ffff98820a8dd280  IN  88.6  70417064  38109120  prometheus
   38302  38195   0  ffff9881b5944200  IN  88.6  70417064  38109120  prometheus
   38301  38195   1  ffff98819ba99080  IN  88.6  70417064  38109120  prometheus
   38300  38195   6  ffff98819ba98000  IN  88.6  70417064  38109120  prometheus
   38299  38195   7  ffff987dca870000  IN  88.6  70417064  38109120  prometheus
   38298  38195   3  ffff987dca872100  IN  88.6  70417064  38109120  prometheus
   38228  38195   1  ffff9879037c1080  IN  88.6  70417064  38109120  prometheus

We applied a memory limit on Prometheus as per https://access.redhat.com/solutions/3867881 to prevent it from crashing the nodes. The pods are now OOMKilled instead:

prometheus-k8s-1   3/4   OOMKilled   5   6m

Version-Release number of selected component (if applicable):
v3.11.286

How reproducible:
Unconfirmed

Additional info:
A previous bug on Prometheus memory usage (https://bugzilla.redhat.com/show_bug.cgi?id=1790265) suggests checking against the known capacity requirements:
https://docs.openshift.com/container-platform/3.11/scaling_performance/scaling_cluster_monitoring.html#cluster-monitoring-capacity-planning

However, this is not a very large cluster:

# oc get nodes | wc -l
20
# oc get pods --all-namespaces | wc -l
532

We have currently set a 3Gi memory limit so that Prometheus is OOMKilled instead of crashing the node. I will upload config details.
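For illustration, a memory limit of this kind is normally expressed through the monitoring stack's configuration. The following is only a sketch assuming the cluster-monitoring-config ConfigMap mechanism with illustrative request/limit values; the file name is hypothetical and the authoritative steps are in the KB article linked above:

# cluster-monitoring-config.yaml (hypothetical file name)
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      resources:
        requests:
          memory: 1Gi   # illustrative request value
        limits:
          memory: 3Gi   # the 3Gi limit mentioned above; later raised to 8G in this bug

$ oc apply -f cluster-monitoring-config.yaml

After the ConfigMap is applied, the cluster-monitoring-operator should roll out new prometheus-k8s pods with the limit in place.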
Hi, after setting the limit to 8G, the issue seems to be resolved.

I wonder if we should consider updating the scaling / performance doc:
https://docs.openshift.com/container-platform/3.11/scaling_performance/scaling_cluster_monitoring.html#cluster-monitoring-capacity-planning

As noted in the initial bug description, the customer's cluster is actually smaller than the smallest configuration in the capacity-planning table, yet it needed more memory. Are there certain situations where more memory is required? I'm not sure if this is a priority since it's 3.11, but if this is a common issue, I wonder if we should tweak the document. Otherwise I think we can close the bz.
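For completeness, a quick sanity check after raising the limit is to confirm the prometheus-k8s pods stay Running with no further OOMKilled restarts, and to compare actual usage against the new limit. These commands are illustrative rather than taken from the case notes:

$ oc -n openshift-monitoring get pods | grep prometheus-k8s
$ oc -n openshift-monitoring describe pod prometheus-k8s-1 | grep -A2 'Last State'
$ oc adm top pod -n openshift-monitoring    # requires cluster metrics to be available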
3.11 PR - https://github.com/openshift/openshift-docs/pull/29502
4.5-4.7 PR - https://github.com/openshift/openshift-docs/pull/29503

Simon, PTAL at the above PRs and let me know if more details are needed in the new note. Thank you.
LGTM
Verified fix is published and live on docs.openshift.com [3.11]:
https://docs.openshift.com/container-platform/3.11/scaling_performance/scaling_cluster_monitoring.html
Verified fix is published and live on docs.openshift.com [4.5/4.6]:
https://docs.openshift.com/container-platform/4.5/scalability_and_performance/scaling-cluster-monitoring-operator.html
https://docs.openshift.com/container-platform/4.6/scalability_and_performance/scaling-cluster-monitoring-operator.html

Verified fix will be available upon release of 4.7:
https://docs.openshift.com/container-platform/4.7/scalability_and_performance/scaling-cluster-monitoring-operator.html