Bug 1634302 - KubePersistentVolumeUsageCritical/KubePersistentVolumeFullInFourDays should regex check exported_namespace not namespace
Summary: KubePersistentVolumeUsageCritical/KubePersistentVolumeFullInFourDays should regex check exported_namespace not namespace
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.11.z
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-29 15:44 UTC by Justin Pierce
Modified: 2019-02-20 14:11 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-20 14:11:01 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2019:0326 (Last Updated: 2019-02-20 14:11:07 UTC)

Description Justin Pierce 2018-09-29 15:44:13 UTC
Description of problem:
It looks like the intention of KubePersistentVolumeUsageCritical/KubePersistentVolumeFullInFourDays is to filter out alerts from non-critical namespaces:

>>>>
alert: KubePersistentVolumeUsageCritical
expr: 100
  * kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}
  / kubelet_volume_stats_capacity_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}
  < 3
<<<<

and
>>>>
alert: KubePersistentVolumeFullInFourDays
expr: kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}
  and predict_linear(kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}[6h],
  4 * 24 * 3600) < 0
<<<<

However, the problematic namespace seems to be exposed in the 'exported_namespace' label rather than in 'namespace'.

>>>>
alertname="KubePersistentVolumeFullInFourDays" endpoint="https-metrics" exported_namespace="emergencytime" instance="172.31.30.61:10250" job="kubelet" namespace="kube-system" persistentvolumeclaim="php-storage" service="kubelet" severity="critical"
<<<<

Version-Release number of selected component (if applicable):
v3.11.16

How reproducible:
100%

Comment 1 minden 2018-10-01 12:20:55 UTC
What are you referring to with 'exported_namespace', Justin?

Comment 2 Justin Pierce 2018-10-05 20:59:59 UTC
If you look at the example alert:
alertname="KubePersistentVolumeFullInFourDays" endpoint="https-metrics" exported_namespace="emergencytime" instance="172.31.30.61:10250" job="kubelet" namespace="kube-system" persistentvolumeclaim="php-storage" service="kubelet" severity="critical"

You see: exported_namespace="emergencytime"

This is the actual namespace containing the PV which is nearly full (persistentvolumeclaim="php-storage").

The rule, KubePersistentVolumeFullInFourDays, is attempting to filter out non-critical alerts by using a regex to select only critical namespaces: namespace=~"(openshift.*|kube.*|default|logging)"

However, this alert is still being raised because the 'true' problematic namespace is in exported_namespace, *not* namespace.


Ideally, the problematic namespace would be in the 'namespace' label. Barring that, the regex should be applied to 'exported_namespace' instead of 'namespace'.
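
As a rough sketch, the KubePersistentVolumeUsageCritical expression with the regex applied to exported_namespace would look something like this (assuming the labels stay as in the example alert above):

```
alert: KubePersistentVolumeUsageCritical
expr: 100
  * kubelet_volume_stats_available_bytes{job="kubelet",exported_namespace=~"(openshift.*|kube.*|default|logging)"}
  / kubelet_volume_stats_capacity_bytes{job="kubelet",exported_namespace=~"(openshift.*|kube.*|default|logging)"}
  < 3
```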

Comment 3 minden 2018-10-08 09:24:28 UTC
Right, thanks for the clarification Justin!

Tracked here: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/105

Comment 4 Matthias Loibl 2018-10-08 12:04:23 UTC
Hey,
this should not be a problem with the alerts themselves, but rather a configuration issue on the Prometheus side. The Prometheus Operator uses a ServiceMonitor for the kubelet which sets `honorLabels: true`, and I think that might be missing on your cluster.
See my GitHub comment for additional information: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/105#issuecomment-427807261
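
For reference, a minimal sketch of a kubelet ServiceMonitor with that field set (field names follow the Prometheus Operator API; the exact object shipped by cluster monitoring may differ, and the namespace and selectors below are assumptions):

```
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet
  namespace: openshift-monitoring    # assumed namespace of the monitoring stack
spec:
  endpoints:
  - port: https-metrics
    scheme: https
    honorLabels: true                # keep the metric's own namespace label instead of renaming it to exported_namespace
  selector:
    matchLabels:
      k8s-app: kubelet               # assumed selector
  namespaceSelector:
    matchNames:
    - kube-system                    # assumed; the kubelet Service typically lives here
```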

Comment 5 Frederic Branczyk 2018-10-15 15:05:56 UTC
Yes, I think when this was originally reported the cluster-monitoring stack did not honor labels from the kubelet's metrics endpoint, but it does now. Justin, can you let us know which cluster you are seeing this on? Then we can verify whether an upgrade would solve it, or whether it might already be solved.

Comment 6 Justin Pierce 2018-10-15 15:24:19 UTC
First observed on starter-ca-central-1 and still observed as of now (v3.11.16):

alertname="KubePersistentVolumeFullInFourDays" endpoint="https-metrics" exported_namespace="my-project-new" instance="172.31.30.100:10250" job="kubelet" namespace="kube-system" persistentvolumeclaim="mysql" service="kubelet" severity="critical"

Comment 7 Frederic Branczyk 2018-10-15 15:56:14 UTC
Could you share the output of:

```
kubectl get servicemonitor kubelet -oyaml
```

I don't seem to have permissions to view ServiceMonitor objects.
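
If it helps, just that field can also be checked directly with something like the following (assuming the ServiceMonitor lives in the openshift-monitoring namespace):

```
kubectl -n openshift-monitoring get servicemonitor kubelet \
  -o jsonpath='{.spec.endpoints[*].honorLabels}'
```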

Comment 8 Frederic Branczyk 2018-10-15 15:58:40 UTC
Actually, never mind. I just found that we indeed do not set the `honorLabels` configuration properly: https://github.com/openshift/cluster-monitoring-operator/blob/1f465a5a9e1a2959d67f21102762a43228fadf4e/jsonnet/prometheus.jsonnet#L175-L181

This is where it needs to be fixed.

Comment 9 Frederic Branczyk 2019-01-23 16:06:20 UTC
Setting to MODIFIED, as https://github.com/openshift/cluster-monitoring-operator/pull/127 was merged.

Comment 11 Junqi Zhao 2019-02-12 03:30:44 UTC
Tested with cluster monitoring v3.11.82; the cluster-monitoring stack now honors labels from the kubelet's metrics endpoint.
e.g.:
ALERTS{alertname="KubePersistentVolumeFullInFourDays",alertstate="firing",endpoint="https-metrics",instance="10.0.77.44:10250",job="kubelet",namespace="openshift-monitoring",persistentvolumeclaim="prometheus-k8s-db-prometheus-k8s-1",service="kubelet",severity="critical"}
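
A simple way to double-check the raw metrics (a sketch, to be run against the cluster Prometheus) is to confirm that the kubelet volume metrics now carry the PVC's own namespace in the `namespace` label:

```
# Should list the namespaces that actually own the PVCs (e.g. openshift-monitoring),
# not just the namespace of the kubelet scrape target
count by (namespace) (kubelet_volume_stats_available_bytes{job="kubelet"})
```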

Comment 13 errata-xmlrpc 2019-02-20 14:11:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0326

