Description of problem:

Based on https://github.com/prometheus-operator/kube-prometheus/issues/294, fsSpaceFillingUpCriticalThreshold was adjusted because it conflicted with the Kubernetes Garbage Collection default values (nicely documented in https://www.openshift.com/blog/image-garbage-collection-in-openshift and https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/).

However, with the default value for imageGCLowThresholdPercent, NodeFilesystemSpaceFillingUp can still be triggered at warning level, because the warning threshold already applies when 40% of the space is left. Hence, even with Garbage Collection running and doing the necessary clean-up, the alert may continue to fire.

This is why fsSpaceFillingUpWarningThreshold should be adjusted to a reasonable value. If going to 20% is not possible, it will be necessary to talk to the kubelet Engineering group to understand whether the Garbage Collection default values need to be adjusted.

Version-Release number of selected component (if applicable):
- OpenShift Container Platform 4.x

How reproducible:
- Always

Steps to Reproduce:
1. N/A

Actual results:
The NodeFilesystemSpaceFillingUp alert fires before Kubernetes Garbage Collection has kicked in, which effectively produces a false-positive alert.

Expected results:
NodeFilesystemSpaceFillingUp should only fire once Kubernetes Garbage Collection has run and could not reach the expected threshold, so that manual activity may be required.

Additional info:
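The threshold arithmetic behind this report can be sketched in a few lines of Python. This is only an illustration, not code from any component: the function names are hypothetical, and the values 85 and 80 are assumed to be the kubelet defaults for imageGCHighThresholdPercent and imageGCLowThresholdPercent as given in the kubelet configuration reference linked above. The alert thresholds are expressed as percent of space still available, the GC thresholds as percent of space used.

```python
# Hypothetical helpers; the GC defaults (85 / 80) are assumptions taken from
# the kubelet configuration reference, not read from a live cluster.

def free_pct_at_gc_start(image_gc_high_threshold_percent: int = 85) -> int:
    """Percent of disk still free at the moment image GC is triggered."""
    return 100 - image_gc_high_threshold_percent


def free_pct_after_gc(image_gc_low_threshold_percent: int = 80) -> int:
    """Percent of disk free that image GC tries to get back to."""
    return 100 - image_gc_low_threshold_percent


def alert_can_outlast_gc(alert_threshold_pct: int,
                         image_gc_low_threshold_percent: int = 80) -> bool:
    """True if the alert threshold (percent available) is above the free-space
    target of image GC, so the alert condition can stay true after a
    successful clean-up."""
    return alert_threshold_pct > free_pct_after_gc(image_gc_low_threshold_percent)


if __name__ == "__main__":
    for threshold in (40, 20, 15):
        print(threshold, alert_can_outlast_gc(threshold))
```

Under these assumptions, a warning threshold of 40% lies above the 20% free space that GC aims for, while 20% and 15% do not. This only compares the space thresholds; the real alerts additionally require a downward trend and a `for: 1h` duration.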
Checked with 4.10.0-0.nightly-2021-09-09-225032:

    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemSpaceFillingUp.md
        summary: Filesystem is predicted to run out of space within the next 24 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 20
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up fast.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemSpaceFillingUp.md
        summary: Filesystem is predicted to run out of space within the next 4 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
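To make the three conditions in the rule expressions above easier to follow, here is a rough Python sketch of the same logic. It is not how Prometheus evaluates rules: the helper names are hypothetical, the extrapolation is a plain least-squares fit that treats the timestamp of the last sample as "now", and the `for: 1h` clause is not modelled.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, available bytes)


def predict_linear(samples: Sequence[Sample], horizon_s: float) -> float:
    """Least-squares line through the samples, evaluated horizon_s seconds
    after the last sample (a simplification of PromQL's predict_linear)."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return mean_v  # a single timestamp gives no trend to extrapolate
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_s)


def space_filling_up(samples: Sequence[Sample], size_bytes: float,
                     threshold_pct: float, horizon_s: float,
                     readonly: bool) -> bool:
    """All three conditions of the rule expression must hold."""
    avail_pct = samples[-1][1] / size_bytes * 100
    return (avail_pct < threshold_pct
            and predict_linear(samples, horizon_s) < 0
            and not readonly)


if __name__ == "__main__":
    # Made-up data: a 100 GB filesystem losing 2 GB per hour over 6 hours.
    data = [(0.0, 30e9), (3 * 3600.0, 24e9), (6 * 3600.0, 18e9)]
    print("warning: ", space_filling_up(data, 100e9, 20, 24 * 3600, False))
    print("critical:", space_filling_up(data, 100e9, 15, 4 * 3600, False))
```

With the made-up samples, 18% of the space is available and the trend reaches zero within 24 hours, so the warning condition holds, while the critical condition does not.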
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056