Bug 1634302 - KubePersistentVolumeUsageCritical/KubePersistentVolumeFullInFourDays should regex check exported_namespace not namespace
Summary: KubePersistentVolumeUsageCritical/KubePersistentVolumeFullInFourDays should regex check exported_namespace not namespace
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.11.z
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-29 15:44 UTC by Justin Pierce
Modified: 2019-02-20 14:11 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-20 14:11:01 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2019:0326 (Last Updated: 2019-02-20 14:11:07 UTC)

Description Justin Pierce 2018-09-29 15:44:13 UTC
Description of problem:
It looks like the intention of KubePersistentVolumeUsageCritical/KubePersistentVolumeFullInFourDays is to filter out alerts from non-critical namespaces:

>>>>
alert: KubePersistentVolumeUsageCritical
expr: 100
  * kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}
  / kubelet_volume_stats_capacity_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}
  < 3
<<<<

and
>>>>
alert: KubePersistentVolumeFullInFourDays
expr: kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}
  and predict_linear(kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}[6h],
  4 * 24 * 3600) < 0
<<<<

However, the problematic namespace seems to be exposed in the 'exported_namespace' label rather than in 'namespace'.

>>>>
alertname="KubePersistentVolumeFullInFourDays" endpoint="https-metrics" exported_namespace="emergencytime" instance="172.31.30.61:10250" job="kubelet" namespace="kube-system" persistentvolumeclaim="php-storage" service="kubelet" severity="critical"
<<<<

Version-Release number of selected component (if applicable):
v3.11.16

How reproducible:
100%

Comment 1 minden 2018-10-01 12:20:55 UTC
What are you referring to with 'exported_namespace', Justin?

Comment 2 Justin Pierce 2018-10-05 20:59:59 UTC
If you look at the example alert:
alertname="KubePersistentVolumeFullInFourDays" endpoint="https-metrics" exported_namespace="emergencytime" instance="172.31.30.61:10250" job="kubelet" namespace="kube-system" persistentvolumeclaim="php-storage" service="kubelet" severity="critical"

You see: exported_namespace="emergencytime"

This is the actual namespace containing the PV which is nearly full (persistentvolumeclaim="php-storage").

The rule, KubePersistentVolumeFullInFourDays, is attempting to filter out non-critical alerts by using a regex to select only critical namespaces: namespace=~"(openshift.*|kube.*|default|logging)"

However, this alert is still being raised because the 'true' problematic namespace is in exported_namespace, *not* namespace.


Ideally, the problematic namespace would be in the 'namespace' label. Barring that, the regex should be applied to 'exported_namespace' instead of 'namespace'.
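
As a rough sketch, the KubePersistentVolumeUsageCritical expression with the regex applied to exported_namespace would look something like this (assuming the labels stay as in the example alert above):

```
alert: KubePersistentVolumeUsageCritical
expr: 100
  * kubelet_volume_stats_available_bytes{job="kubelet",exported_namespace=~"(openshift.*|kube.*|default|logging)"}
  / kubelet_volume_stats_capacity_bytes{job="kubelet",exported_namespace=~"(openshift.*|kube.*|default|logging)"}
  < 3
```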

Comment 3 minden 2018-10-08 09:24:28 UTC
Right, thanks for the clarification Justin!

Tracked here: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/105

Comment 4 Matthias Loibl 2018-10-08 12:04:23 UTC
Hey,
this should not be a problem with the alerts themselves, but rather a configuration issue on the Prometheus side. The Prometheus Operator uses a ServiceMonitor for the kubelet which sets `honorLabels: true`, and I think that might be missing on your cluster.
See my GitHub comment for additional information: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/105#issuecomment-427807261
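
For reference, a minimal sketch of a kubelet ServiceMonitor with that field set (field names follow the Prometheus Operator API; the exact object shipped by cluster monitoring may differ, and the namespace and selectors below are assumptions):

```
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet
  namespace: openshift-monitoring    # assumed namespace of the monitoring stack
spec:
  endpoints:
  - port: https-metrics
    scheme: https
    honorLabels: true                # keep the metric's own namespace label instead of renaming it to exported_namespace
  selector:
    matchLabels:
      k8s-app: kubelet               # assumed selector
  namespaceSelector:
    matchNames:
    - kube-system                    # assumed; the kubelet Service typically lives here
```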

Comment 5 Frederic Branczyk 2018-10-15 15:05:56 UTC
Yes, I think when this was originally reported the cluster-monitoring stack did not honor labels from the kubelet's metrics endpoint, but it does now. Justin, can you let us know which cluster you are seeing this on? Then we can verify whether an upgrade would solve it, or whether it might already be solved.

Comment 6 Justin Pierce 2018-10-15 15:24:19 UTC
First observed on starter-ca-central-1 and still observed as of now (v3.11.16):

alertname="KubePersistentVolumeFullInFourDays" endpoint="https-metrics" exported_namespace="my-project-new" instance="172.31.30.100:10250" job="kubelet" namespace="kube-system" persistentvolumeclaim="mysql" service="kubelet" severity="critical"

Comment 7 Frederic Branczyk 2018-10-15 15:56:14 UTC
Could you share the output of:

```
kubectl get servicemonitor kubelet -oyaml
```

I don't seem to have permissions to view ServiceMonitor objects.
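
If it helps, just that field can also be checked directly with something like the following (assuming the ServiceMonitor lives in the openshift-monitoring namespace):

```
kubectl -n openshift-monitoring get servicemonitor kubelet \
  -o jsonpath='{.spec.endpoints[*].honorLabels}'
```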

Comment 8 Frederic Branczyk 2018-10-15 15:58:40 UTC
Actually, never mind. I just found that we indeed do not set the `honorLabels` configuration properly: https://github.com/openshift/cluster-monitoring-operator/blob/1f465a5a9e1a2959d67f21102762a43228fadf4e/jsonnet/prometheus.jsonnet#L175-L181

This is where it needs to be fixed.

Comment 9 Frederic Branczyk 2019-01-23 16:06:20 UTC
Setting to MODIFIED, as https://github.com/openshift/cluster-monitoring-operator/pull/127 was merged.

Comment 11 Junqi Zhao 2019-02-12 03:30:44 UTC
Tested with cluster monitoring v3.11.82; the cluster-monitoring stack now honors labels from the kubelet's metrics endpoint.
e.g.:
ALERTS{alertname="KubePersistentVolumeFullInFourDays",alertstate="firing",endpoint="https-metrics",instance="10.0.77.44:10250",job="kubelet",namespace="openshift-monitoring",persistentvolumeclaim="prometheus-k8s-db-prometheus-k8s-1",service="kubelet",severity="critical"}
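
A simple way to double-check the raw metrics (a sketch, to be run against the cluster Prometheus) is to confirm that the kubelet volume metrics now carry the PVC's own namespace in the `namespace` label:

```
# Should list the namespaces that actually own the PVCs (e.g. openshift-monitoring),
# not just the namespace of the kubelet scrape target
count by (namespace) (kubelet_volume_stats_available_bytes{job="kubelet"})
```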

Comment 13 errata-xmlrpc 2019-02-20 14:11:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0326

