Description of problem:

It looks like the intention of KubePersistentVolumeFullInFourDays is to filter out non-critical namespaces:

```
alert: KubePersistentVolumeUsageCritical
expr: 100 * kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}
  / kubelet_volume_stats_capacity_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"} < 3
```

and

```
alert: KubePersistentVolumeFullInFourDays
expr: kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}
  and predict_linear(kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}[6h], 4 * 24 * 3600) < 0
```

However, the problematic namespace seems to be exposed in the 'exported_namespace' label instead:

```
alertname="KubePersistentVolumeFullInFourDays"
endpoint="https-metrics"
exported_namespace="emergencytime"
instance="172.31.30.61:10250"
job="kubelet"
namespace="kube-system"
persistentvolumeclaim="php-storage"
service="kubelet"
severity="critical"
```

Version-Release number of selected component (if applicable): v3.11.16

How reproducible: 100%
What are you referring to with 'exported_namespace', Justin?
If you look at the example alert:

```
alertname="KubePersistentVolumeFullInFourDays"
endpoint="https-metrics"
exported_namespace="emergencytime"
instance="172.31.30.61:10250"
job="kubelet"
namespace="kube-system"
persistentvolumeclaim="php-storage"
service="kubelet"
severity="critical"
```

you see exported_namespace="emergencytime". This is the actual namespace containing the PV that is nearly full (persistentvolumeclaim="php-storage").

The rule KubePersistentVolumeFullInFourDays attempts to filter out non-critical alerts by using a regex that selects only critical namespaces:

```
namespace=~"(openshift.*|kube.*|default|logging)"
```

However, this alert is still being raised because the 'true' problematic namespace is in exported_namespace, *not* namespace. Ideally, the problematic namespace would be in the 'namespace' label. Barring that, the regex should be applied to 'exported_namespace' instead of 'namespace' (see the sketch below).
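Purely as an illustration of that last suggestion (the later comments point to a different root cause), a rule variant matching on `exported_namespace` could look roughly like the sketch below; the group name and rule metadata are placeholders, not the actual mixin source:

```yaml
# Hypothetical variant only: filter on exported_namespace instead of namespace.
# The actual fix discussed later in this bug is to honor the target's own labels.
groups:
- name: kubernetes-storage
  rules:
  - alert: KubePersistentVolumeFullInFourDays
    expr: |
      kubelet_volume_stats_available_bytes{job="kubelet",exported_namespace=~"(openshift.*|kube.*|default|logging)"}
        and
      predict_linear(kubelet_volume_stats_available_bytes{job="kubelet",exported_namespace=~"(openshift.*|kube.*|default|logging)"}[6h], 4 * 24 * 3600) < 0
    labels:
      severity: critical
```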
Right, thanks for the clarification Justin! Tracked here: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/105
Hey, this should not be a problem with the alerts themselves, but rather a configuration issue on Prometheus itself. The Prometheus Operator uses a ServiceMonitor for the kubelet which sets `honorLabels: true`, and I think you might be missing that on your cluster. See my GitHub comment for additional information: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/105#issuecomment-427807261
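For reference, a kubelet ServiceMonitor with `honorLabels: true` looks roughly like the sketch below; the selectors and port name are assumptions based on the labels in the alert and may differ from what the cluster-monitoring stack actually deploys:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet
  namespace: openshift-monitoring
spec:
  endpoints:
  - port: https-metrics   # matches the endpoint="https-metrics" label in the alert
    scheme: https
    honorLabels: true     # keep the metric's own namespace label instead of
                          # renaming it to exported_namespace
  selector:
    matchLabels:
      k8s-app: kubelet
  namespaceSelector:
    matchNames:
    - kube-system
```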
Yes, I think when this was originally reported the cluster-monitoring stack did not honor labels from the kubelet's metrics endpoint, but it does now. Justin, can you let us know on which cluster you are seeing this? Then we can verify whether an upgrade would solve it, or whether it might already be solved now.
First observed on starter-ca-central-1 and still observed as of now (v3.11.16):

```
alertname="KubePersistentVolumeFullInFourDays"
endpoint="https-metrics"
exported_namespace="my-project-new"
instance="172.31.30.100:10250"
job="kubelet"
namespace="kube-system"
persistentvolumeclaim="mysql"
service="kubelet"
severity="critical"
```
Could you share the output of:

```
kubectl get servicemonitor kubelet -oyaml
```

I don't seem to have permissions to view ServiceMonitor objects.
Actually, never mind. I just found that we indeed do not set the `honor_labels` configuration properly: https://github.com/openshift/cluster-monitoring-operator/blob/1f465a5a9e1a2959d67f21102762a43228fadf4e/jsonnet/prometheus.jsonnet#L175-L181 This is where it needs to be fixed.
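In the Prometheus configuration generated from that jsonnet, this boils down to setting `honor_labels: true` on the kubelet scrape job. A minimal sketch, with the job name assumed and all other settings omitted:

```yaml
scrape_configs:
- job_name: kubelet
  scheme: https
  honor_labels: true  # without this, Prometheus renames the metric's original
                      # namespace label to exported_namespace and sets namespace
                      # to the namespace of the scraped endpoint (kube-system)
```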
Setting to MODIFIED, as https://github.com/openshift/cluster-monitoring-operator/pull/127 has been merged.
Tested with cluster monitoring v3.11.82; the cluster-monitoring stack now honors labels from the kubelet's metrics endpoint, e.g.:

```
ALERTS{alertname="KubePersistentVolumeFullInFourDays",alertstate="firing",endpoint="https-metrics",instance="10.0.77.44:10250",job="kubelet",namespace="openshift-monitoring",persistentvolumeclaim="prometheus-k8s-db-prometheus-k8s-1",service="kubelet",severity="critical"}
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0326