1647492 – Missing Node-specific Metrics in Grafana-Dashboards

Summary: Missing Node-specific Metrics in Grafana-Dashboards

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	medium
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Frederic Branczyk
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:	1678645
Blocks:

Reported:	2018-11-07 15:33 UTC by Vladislav Walek
Modified:	2020-02-24 12:03 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:40:51 UTC
Target Upstream Version:
Embargoed:

Attachments
Kubernetes / USE Method / Cluster grafana UI (97.04 KB, image/png) 2019-03-01 03:59 UTC, Junqi Zhao
Kubernetes / Compute Resources / Cluster grafana UI (127.97 KB, image/png) 2019-03-01 03:59 UTC, Junqi Zhao

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:40:59 UTC

Description Vladislav Walek 2018-11-07 15:33:22 UTC

Description of problem:

The OpenShift monitoring stack provided by 3.11 is missing node-specific metrics:
- node_cpu
- node:node_num_cpu:sum
- node_memory_MemFreeCachedBuffers
- node_memory_MemTotal
- node:node_cpu_utilisation:avg1m
- node:node_num_cpu:sum
- node:node_cpu_saturation_load1
- node:node_memory_utilisation:ratio
- node:node_memory_swap_io_bytes:sum_rate
- node:node_disk_utilisation:avg_irate
These metrics should be available in Prometheus and subsequently be displayed in the Grafana dashboards.
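One way to confirm which of the listed metrics are absent is to compare them against the metric names Prometheus reports through its label-values HTTP API (`/api/v1/label/__name__/values`). The following is only a minimal sketch: the base URL and bearer token passed to `fetch_metric_names` are placeholders, and how you reach and authenticate against Prometheus in your cluster (route, TLS, token) is an assumption that will differ per installation.

```python
import json
import urllib.request

# Metric names listed in the report (the repeated entry is listed once).
EXPECTED_METRICS = [
    "node_cpu",
    "node:node_num_cpu:sum",
    "node_memory_MemFreeCachedBuffers",
    "node_memory_MemTotal",
    "node:node_cpu_utilisation:avg1m",
    "node:node_cpu_saturation_load1",
    "node:node_memory_utilisation:ratio",
    "node:node_memory_swap_io_bytes:sum_rate",
    "node:node_disk_utilisation:avg_irate",
]


def parse_metric_names(body):
    """Return the set of metric names from a label-values API response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus returned status %r" % payload.get("status"))
    return set(payload["data"])


def missing_metrics(expected, available):
    """Return the expected names that are not in `available`, keeping order."""
    return [name for name in expected if name not in available]


def fetch_metric_names(base_url, token=None):
    """Fetch all metric names; base_url and token are placeholders (assumptions)."""
    url = base_url.rstrip("/") + "/api/v1/label/__name__/values"
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", "Bearer " + token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_metric_names(response.read().decode("utf-8"))
```

Calling `missing_metrics(EXPECTED_METRICS, fetch_metric_names(base_url, token))` would then return the names from the list above that Prometheus does not currently know about.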

Version-Release number of selected component (if applicable):
OpenShift Container Platform 3.11

Additional info:
will attach the screenshots from grafana

Comment 3 Frederic Branczyk 2018-11-09 10:47:24 UTC

Are the node-exporter targets healthy? (You can see this on the /targets page of the Prometheus UI)
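The same health information shown on the /targets page is also exposed by the Prometheus HTTP API at `/api/v1/targets`. As a rough sketch, the function below takes the JSON body of that response and lists the targets whose health is not "up". The `job_substring` default is an assumption; the job label depends on the stack in use (for example, a later comment refers to "kubernetes-nodes-exporter").

```python
import json


def unhealthy_targets(body, job_substring="node-exporter"):
    """List (job, scrapeUrl, lastError) for matching targets that are not "up".

    `body` is the JSON text returned by GET /api/v1/targets.
    `job_substring` is an assumed filter; adjust it to your job names.
    """
    payload = json.loads(body)
    rows = []
    for target in payload["data"]["activeTargets"]:
        job = target.get("labels", {}).get("job", "")
        if job_substring in job and target.get("health") != "up":
            rows.append((job, target.get("scrapeUrl", ""), target.get("lastError", "")))
    return rows
```

An empty list means every matching target is reported as healthy.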

Comment 15 trumbaut 2019-01-17 12:59:48 UTC

Having the same issue on OCP v3.11.59.

> Are the node-exporter targets healthy? (You can see this on the /targets
> page of the Prometheus UI)

Yes, all are healthy. There are more than 1,000 metrics available in Prometheus, but only a few node_* metrics:

node_collector_evictions_number
node_collector_unhealthy_nodes_in_zone
node_collector_zone_health
node_collector_zone_size
node_lifecycle_controller_rate_limiter_use
node_namespace_pod:kube_pod_info:

Comment 16 Frederic Branczyk 2019-01-17 13:30:55 UTC

Could you share the Pod definition of one of those node-exporter Pods as well as sample logs? Are you sure these are Pods from the `openshift-monitoring` namespace? The tech preview had a number of node-exporter collectors turned off, but the new stack should have all of these metrics. What you're seeing might be the node-exporter of the tech-preview stack.

Comment 17 trumbaut 2019-01-17 14:21:01 UTC

Looks like https://bugzilla.redhat.com/show_bug.cgi?id=1608288 is the reason. Despite the release of https://access.redhat.com/errata/RHBA-2018:2652, our legacy openshift-metrics project was still using port 9100 on all nodes (we upgraded from OCP 3.10.59 to 3.11.59 and performed https://docs.openshift.com/container-platform/3.11/upgrading/automated_upgrades.html#upgrading-cluster-metrics afterwards).

Even after applying https://github.com/openshift/openshift-ansible/pull/9749/commits/d328bebd71c57692024cb693a72e15d0cb8f6676 manually and removing the old DaemonSet for the openshift-metrics project, we have some issues within the openshift-monitoring project:

- targets for kubernetes-nodes-exporter are down
- node-exporter pods are running except for the Infra Nodes due to port 9101/1936 being used by HAProxy

I will create a support case for this.

Comment 22 trumbaut 2019-02-04 13:19:07 UTC

The issues have been fixed for us:

- node-exporter pods on the Infra Nodes could not start properly because of the prom/haproxy-exporter pods running as part of the HAProxy routers (https://docs.openshift.com/container-platform/3.5/install_config/router/default_haproxy_router.html#exposing-the-router-metrics). As we did not use these metrics, we deleted these pods (by editing the deployment configs).
- targets for kubernetes-nodes-exporter were down (except on the node where Prometheus was running) because of a missing iptables rule, despite https://bugzilla.redhat.com/show_bug.cgi?id=1563888. Fixed by adding "iptables -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9000:10000 -j ACCEPT" to all nodes.
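To see whether the firewall still blocks scrapes, one option is a plain TCP connection test from the node running Prometheus to the exporter port on every other node. This is only a sketch: the default port 9100 is taken from the earlier comment about the legacy stack, the host list is whatever node addresses you supply, and a successful connection shows reachability only, not that the exporter serves metrics.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def unreachable_nodes(hosts, port=9100, timeout=3.0):
    """Return the hosts on which the given port cannot be reached."""
    return [host for host in hosts if not port_reachable(host, port, timeout)]
```

Running `unreachable_nodes([...])` with the node addresses before and after adding the iptables rule should show the list shrinking to empty if the rule was the only obstacle.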

Comment 25 Christian Heidenreich 2019-02-14 11:26:29 UTC

MH, is there still something needed?

Comment 29 Junqi Zhao 2019-03-01 03:59:02 UTC

Created attachment 1539723 [details]
Kubernetes / USE Method / Cluster grafana UI

Comment 30 Junqi Zhao 2019-03-01 03:59:35 UTC

Created attachment 1539724 [details]
Kubernetes / Compute Resources / Cluster grafana UI

Comment 31 Junqi Zhao 2019-03-01 04:00:41 UTC

The issue is fixed; see the screenshots in Comment 29 and Comment 30.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-27-213933   True        False         80m     Cluster version is 4.0.0-0.nightly-2019-02-27-213933

Comment 32 Junqi Zhao 2019-03-01 07:39:52 UTC

Did not find the 3.11 issue on AWS/GCE.
The only issue is on 3.11 OpenStack: searching for "node:node_disk_utilisation:avg_irate" and "node:node_disk_saturation:avg_irate" in Prometheus returns the error "No datapoints found."; this issue caused "Disk IO Utilisation" and "Disk IO Saturation" in grafana "K8s / USE Method / Cluster" page, and it is tracked in bug 1680517. AWS/GCE don't have this issue.
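The "No datapoints found." check can also be done against the Prometheus instant-query API (`/api/v1/query`), which returns an empty `result` list when an expression has no samples. A minimal sketch follows; the base URL is a placeholder, and fetching the response body (including any authentication) is left to the caller.

```python
import json
import urllib.parse


def query_url(base_url, expr):
    """Build the instant-query URL for a PromQL expression."""
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def has_datapoints(body):
    """Return True if an instant-query response body contains any result."""
    payload = json.loads(body)
    return payload.get("status") == "success" and len(payload["data"]["result"]) > 0
```

For the two recording rules named above, `has_datapoints` returning False for the fetched response corresponds to the empty panels in the Grafana page.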

Comment 33 Junqi Zhao 2019-03-01 07:42:01 UTC

(In reply to Junqi Zhao from comment #32)
> this issue caused "Disk IO Utilisation" and "Disk IO
> Saturation" in grafana "K8s / USE Method / Cluster" page

change to
this issue caused "No data points" to show for "Disk IO Utilisation" and "Disk IO Saturation" in the grafana "K8s / USE Method / Cluster" page on OpenStack

Comment 36 errata-xmlrpc 2019-06-04 10:40:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
