Bug 1634302

Summary: KubePersistentVolumeUsageCritical/KubePersistentVolumeFullInFourDays should regex check exported_namespace not namespace
Product: OpenShift Container Platform Reporter: Justin Pierce <jupierce>
Component: Monitoring Assignee: Frederic Branczyk <fbranczy>
Status: CLOSED ERRATA QA Contact: Junqi Zhao <juzhao>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 3.11.0 CC: minden, mloibl
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-02-20 14:11:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Justin Pierce 2018-09-29 15:44:13 UTC
Description of problem:
It looks like the intention of KubePersistentVolumeUsageCritical and KubePersistentVolumeFullInFourDays is to alert only on critical namespaces and filter out the rest:

>>>>
alert: KubePersistentVolumeUsageCritical
expr: 100
  * kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}
  / kubelet_volume_stats_capacity_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}
  < 3
<<<<

and
>>>>
alert: KubePersistentVolumeFullInFourDays
expr: kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}
  and predict_linear(kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}[6h],
  4 * 24 * 3600) < 0
<<<<

However, the problematic namespace seems to be exposed in the 'exported_namespace' label, not in 'namespace':

>>>>
alertname="KubePersistentVolumeFullInFourDays" endpoint="https-metrics" exported_namespace="emergencytime" instance="172.31.30.61:10250" job="kubelet" namespace="kube-system" persistentvolumeclaim="php-storage" service="kubelet" severity="critical"
<<<<

Version-Release number of selected component (if applicable):
v3.11.16

How reproducible:
100%

Comment 1 minden 2018-10-01 12:20:55 UTC
What are you referring to with 'exported_namespace', Justin?

Comment 2 Justin Pierce 2018-10-05 20:59:59 UTC
If you look at the example alert:
alertname="KubePersistentVolumeFullInFourDays" endpoint="https-metrics" exported_namespace="emergencytime" instance="172.31.30.61:10250" job="kubelet" namespace="kube-system" persistentvolumeclaim="php-storage" service="kubelet" severity="critical"

You see: exported_namespace="emergencytime"

This is the actual namespace containing the pv which is nearly full: persistentvolumeclaim="php-storage"

The rule, KubePersistentVolumeFullInFourDays, is attempting to filter out non-critical alerts by using a regex to select only critical namespaces: namespace=~"(openshift.*|kube.*|default|logging)"

However, this alert is still being raised because the 'true' problematic namespace is in exported_namespace, *not* namespace.


Ideally, the problematic namespace would be in the 'namespace' attribute. Barring that, the regex should be applied to 'exported_namespace' instead of 'namespace'.
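
To illustrate why the filter does not exclude this alert, here is a minimal Python sketch (not taken from the monitoring stack) that applies the rule's regex to the labels of the example alert above. The `matches` helper is hypothetical; it assumes PromQL's documented behavior that `=~` matchers are fully anchored and that a missing label is treated as an empty string. Python's `re` stands in for Prometheus's RE2 engine, which behaves the same for this simple pattern.

```python
import re

# Namespace filter copied from the alert rules in this report.
CRITICAL_NS = r"(openshift.*|kube.*|default|logging)"

# A subset of the labels on the example alert quoted above.
labels = {
    "exported_namespace": "emergencytime",
    "namespace": "kube-system",
    "persistentvolumeclaim": "php-storage",
    "job": "kubelet",
}


def matches(label_name, series_labels, pattern=CRITICAL_NS):
    """Mimic a PromQL =~ matcher: the regex must match the whole label value."""
    value = series_labels.get(label_name, "")  # missing label acts as ""
    return re.fullmatch(pattern, value) is not None


# The rule as written checks 'namespace', which holds the kubelet's own namespace.
on_namespace = matches("namespace", labels)
# Checking 'exported_namespace' would test the namespace that owns the PVC.
on_exported = matches("exported_namespace", labels)

print(on_namespace, on_exported)
```

Under these assumptions the matcher on 'namespace' accepts the series (the value is "kube-system"), while the same matcher on 'exported_namespace' would reject it, which is the behavior the rule appears to intend.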

Comment 3 minden 2018-10-08 09:24:28 UTC
Right, thanks for the clarification Justin!

Tracked here: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/105

Comment 4 Matthias Loibl 2018-10-08 12:04:23 UTC
Hey,
This should not be a problem with the alerts themselves, but rather a configuration issue in Prometheus itself. The Prometheus Operator uses a ServiceMonitor for the kubelet which sets `honorLabels: true`, and I think you might be missing that on your cluster.
See my GitHub comment for additional information: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/105#issuecomment-427807261
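
As a rough illustration of what that setting changes, the Python sketch below models how Prometheus resolves a conflict between a label exposed by the scraped endpoint and a label attached from the target. It follows the documented `honor_labels` behavior: when false, the scraped label is renamed to `exported_<name>` and the target's value is kept; when true, the scraped value is kept. The function `merge_labels` and the sample label sets are hypothetical and are not code from Prometheus or the operator.

```python
def merge_labels(scraped, target, honor_labels):
    """Combine labels exposed by the endpoint with labels attached by Prometheus."""
    merged = dict(scraped)
    for name, value in target.items():
        if name in scraped and honor_labels:
            # Conflict, honor_labels on: keep the value exposed by the endpoint.
            continue
        if name in scraped:
            # Conflict, honor_labels off: preserve the scraped value under a prefix.
            merged["exported_" + name] = scraped[name]
        merged[name] = value
    return merged


# Labels the kubelet exposes on the volume metric (illustrative values).
scraped = {"namespace": "emergencytime", "persistentvolumeclaim": "php-storage"}
# Labels attached from service discovery for the kubelet target.
target = {"namespace": "kube-system", "job": "kubelet", "service": "kubelet"}

without_honor = merge_labels(scraped, target, honor_labels=False)
with_honor = merge_labels(scraped, target, honor_labels=True)

print(without_honor["namespace"], without_honor.get("exported_namespace"))
print(with_honor["namespace"], with_honor.get("exported_namespace"))
```

With `honor_labels` off, the result matches the label set in the reported alert (namespace="kube-system", exported_namespace="emergencytime"). With it on, 'namespace' carries the PVC's namespace, so the existing regex in the rules would filter as intended.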

Comment 5 Frederic Branczyk 2018-10-15 15:05:56 UTC
Yes, I think when this was originally reported the cluster-monitoring stack did not honor labels from the kubelet's metrics endpoint, but it does now. Justin, can you let us know on which cluster you are seeing this? Then we can verify whether an upgrade would solve it, or whether it might already be solved.

Comment 6 Justin Pierce 2018-10-15 15:24:19 UTC
First observed on starter-ca-central-1 and still observed as of now (v3.11.16):

alertname="KubePersistentVolumeFullInFourDays" endpoint="https-metrics" exported_namespace="my-project-new" instance="172.31.30.100:10250" job="kubelet" namespace="kube-system" persistentvolumeclaim="mysql" service="kubelet" severity="critical"

Comment 7 Frederic Branczyk 2018-10-15 15:56:14 UTC
Could you share the output of:

```
kubectl get servicemonitor kubelet -oyaml
```

I don't seem to have permissions to view ServiceMonitor objects.

Comment 8 Frederic Branczyk 2018-10-15 15:58:40 UTC
Actually, never mind. I just found that we indeed do not set the `honor_labels` configuration properly: https://github.com/openshift/cluster-monitoring-operator/blob/1f465a5a9e1a2959d67f21102762a43228fadf4e/jsonnet/prometheus.jsonnet#L175-L181

This is where it needs to be fixed.

Comment 9 Frederic Branczyk 2019-01-23 16:06:20 UTC
Setting to MODIFIED, as https://github.com/openshift/cluster-monitoring-operator/pull/127 was merged.

Comment 11 Junqi Zhao 2019-02-12 03:30:44 UTC
Tested with cluster monitoring v3.11.82; the cluster-monitoring stack honors labels from the kubelet's metrics endpoint.
E.g.:
ALERTS{alertname="KubePersistentVolumeFullInFourDays",alertstate="firing",endpoint="https-metrics",instance="10.0.77.44:10250",job="kubelet",namespace="openshift-monitoring",persistentvolumeclaim="prometheus-k8s-db-prometheus-k8s-1",service="kubelet",severity="critical"}

Comment 13 errata-xmlrpc 2019-02-20 14:11:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0326