Bug 1634302
| Summary: | KubePersistentVolumeUsageCritical/KubePersistentVolumeFullInFourDays should regex check exported_namespace not namespace | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Justin Pierce <jupierce> |
| Component: | Monitoring | Assignee: | Frederic Branczyk <fbranczy> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 3.11.0 | CC: | minden, mloibl |
| Target Milestone: | --- | ||
| Target Release: | 3.11.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-02-20 14:11:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Justin Pierce
2018-09-29 15:44:13 UTC
What are you referring to with 'exported_namespace', Justin?

If you look at the example alert:

```
alertname="KubePersistentVolumeFullInFourDays" endpoint="https-metrics" exported_namespace="emergencytime" instance="172.31.30.61:10250" job="kubelet" namespace="kube-system" persistentvolumeclaim="php-storage" service="kubelet" severity="critical"
```

you see `exported_namespace="emergencytime"`. This is the actual namespace containing the PV which is nearly full (`persistentvolumeclaim="php-storage"`).

The rule, KubePersistentVolumeFullInFourDays, attempts to filter out non-critical alerts by using a regex to select only critical namespaces:

```
namespace=~"(openshift.*|kube.*|default|logging)"
```

However, this alert is still being raised because the 'true' problematic namespace is in `exported_namespace`, *not* `namespace`. Ideally, the problematic namespace would be in the `namespace` attribute. Barring that, the regex should be applied to `exported_namespace` instead of `namespace`.

Right, thanks for the clarification, Justin! Tracked here: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/105

Hey, this should not be a problem of the alerts themselves, but rather a configuration issue on the Prometheus side. The Prometheus Operator uses a ServiceMonitor for the kubelet which sets `honorLabels: true`; I think you might be missing that on your cluster. See my GitHub comment for additional information: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/105#issuecomment-427807261

Yes, I think when this was originally reported the cluster-monitoring stack did not honor labels from the kubelet's metrics endpoint, but it does now. Justin, can you let us know on which cluster you are seeing this? Then we can verify whether an upgrade would solve it, or whether it might already be solved now.

First observed on starter-ca-central-1 and still observed as of now (v3.11.16):

```
alertname="KubePersistentVolumeFullInFourDays" endpoint="https-metrics" exported_namespace="my-project-new" instance="172.31.30.100:10250" job="kubelet" namespace="kube-system" persistentvolumeclaim="mysql" service="kubelet" severity="critical"
```

Could you share the output of:

```
kubectl get servicemonitor kubelet -oyaml
```

I don't seem to have permissions to view ServiceMonitor objects.

Actually, never mind. I just found that we indeed do not set the `honor_labels` configuration properly: https://github.com/openshift/cluster-monitoring-operator/blob/1f465a5a9e1a2959d67f21102762a43228fadf4e/jsonnet/prometheus.jsonnet#L175-L181 This is where it needs to be fixed.

Setting to MODIFIED, as https://github.com/openshift/cluster-monitoring-operator/pull/127 was merged.

Tested with cluster monitoring v3.11.82: the cluster-monitoring stack now honors labels from the kubelet's metrics endpoint, e.g.:

```
ALERTS{alertname="KubePersistentVolumeFullInFourDays",alertstate="firing",endpoint="https-metrics",instance="10.0.77.44:10250",job="kubelet",namespace="openshift-monitoring",persistentvolumeclaim="prometheus-k8s-db-prometheus-k8s-1",service="kubelet",severity="critical"}
```
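For readers without the 3.11 rule file at hand, here is a rough sketch of the shape of the rule being discussed. This is not the verbatim shipped expression; the thresholds, the `predict_linear` window, and the group name are illustrative, approximating the upstream kubernetes-mixin rule with the namespace matcher quoted in the report added in:

```yaml
# Hypothetical sketch, not the shipped rule: it illustrates where the
# namespace regex sits inside the alert expression.
groups:
- name: kubernetes-storage
  rules:
  - alert: KubePersistentVolumeFullInFourDays
    expr: |
      100 * (
        kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}
          /
        kubelet_volume_stats_capacity_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}
      ) < 15
      and
      predict_linear(
        kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}[6h],
        4 * 24 * 3600
      ) < 0
    labels:
      severity: critical
```

The matcher can only ever see the `namespace` label on the stored series. When label honoring is off, that label holds the scrape target's namespace (`kube-system`), while the PVC's real namespace sits in `exported_namespace`, so the filter cannot do its job.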
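The `exported_` prefix itself is standard Prometheus behavior rather than anything OpenShift-specific: when `honor_labels` is false (the default) and a scraped sample already carries a label that collides with a target label, Prometheus keeps the target's value and renames the scraped one to `exported_<label>`. A minimal sketch of the relevant scrape configuration (the job name and discovery details are illustrative):

```yaml
scrape_configs:
- job_name: kubelet
  scheme: https
  # false (the default): a kubelet metric's own `namespace` label is
  # renamed to `exported_namespace`, because the target already carries
  # a `namespace` label from service discovery.
  # true: the scraped label wins and stays `namespace`.
  honor_labels: true
  kubernetes_sd_configs:
  - role: node
```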
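With the Prometheus Operator, the same switch is spelled `honorLabels` on a ServiceMonitor endpoint, which is the knob the cluster-monitoring-operator fix flips. A hedged sketch of a kubelet ServiceMonitor with label honoring enabled (metadata and selector values are illustrative, not the operator's exact manifest):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet
  namespace: openshift-monitoring
spec:
  endpoints:
  - port: https-metrics
    scheme: https
    # Keep the scraped metrics' own `namespace` label instead of
    # renaming it to `exported_namespace`.
    honorLabels: true
  selector:
    matchLabels:
      k8s-app: kubelet
```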
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0326