Bug 1647492
Summary: Missing Node-specific Metrics in Grafana Dashboards
Product: OpenShift Container Platform
Component: Monitoring
Version: 3.11.0
Target Release: 4.1.0
Reporter: Vladislav Walek <vwalek>
Assignee: Frederic Branczyk <fbranczy>
QA Contact: Junqi Zhao <juzhao>
CC: ckoep, cvogel, fbranczy, pdwyer, spasquie, thomas.rumbaut
Severity: medium
Priority: urgent
Status: CLOSED ERRATA
Type: Bug
Hardware: Unspecified
OS: Unspecified
Bug Depends On: 1678645
Last Closed: 2019-06-04 10:40:51 UTC
Description
Vladislav Walek, 2018-11-07 15:33:22 UTC

Are the node-exporter targets healthy? (You can see this on the /targets page of the Prometheus UI)

Having the same issue on OCP v3.11.59.
> Are the node-exporter targets healthy? (You can see this on the /targets
> page of the Prometheus UI)
Yes, all are healthy. There are more than 1,000 metrics available in Prometheus, but only a few node_* metrics:
node_collector_evictions_number
node_collector_unhealthy_nodes_in_zone
node_collector_zone_health
node_collector_zone_size
node_lifecycle_controller_rate_limiter_use
node_namespace_pod:kube_pod_info:
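The metrics listed above come from the Kubernetes node lifecycle controller, not from node-exporter, which is why the dashboards stay empty. A minimal sketch of distinguishing the two, given metric names pulled from Prometheus (e.g. via its `/api/v1/label/__name__/values` endpoint); the prefix list is illustrative, not an exhaustive node-exporter inventory:

```python
# Expected node-exporter metric families (illustrative subset).
EXPECTED_NODE_EXPORTER_PREFIXES = [
    "node_cpu", "node_memory_", "node_disk_",
    "node_filesystem_", "node_network_",
]

def missing_node_exporter_families(metric_names):
    """Return expected node-exporter prefixes with no matching metric."""
    return [p for p in EXPECTED_NODE_EXPORTER_PREFIXES
            if not any(name.startswith(p) for name in metric_names)]

# The node_* metrics the reporter actually sees: all of them come from
# the controller manager's node lifecycle controller, not node-exporter.
reported = [
    "node_collector_evictions_number",
    "node_collector_unhealthy_nodes_in_zone",
    "node_collector_zone_health",
    "node_collector_zone_size",
    "node_lifecycle_controller_rate_limiter_use",
]
print(missing_node_exporter_families(reported))
# Every expected family is absent, matching the empty dashboards.
```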
Could you share the Pod definition of one of those node-exporter Pods as well as sample logs? Are you sure these are Pods from the `openshift-monitoring` namespace? The tech preview had a number of node-exporter collectors turned off, but the new stack should have all of these metrics. What you're seeing might be the node-exporter of the tech-preview stack.

Looks like https://bugzilla.redhat.com/show_bug.cgi?id=1608288 is the reason. Despite the release of https://access.redhat.com/errata/RHBA-2018:2652, our legacy openshift-metrics project was still using port 9100 on all nodes (we upgraded from OCP 3.10.59 to 3.11.59 and performed https://docs.openshift.com/container-platform/3.11/upgrading/automated_upgrades.html#upgrading-cluster-metrics afterwards). Even after applying https://github.com/openshift/openshift-ansible/pull/9749/commits/d328bebd71c57692024cb693a72e15d0cb8f6676 manually and removing the old DaemonSet for the openshift-metrics project, we still have some issues within the openshift-monitoring project:

- targets for kubernetes-nodes-exporter are down
- node-exporter pods are running, except on the Infra Nodes, due to ports 9101/1936 being used by HAProxy

I will create a support case for this.

The issues have been fixed for us:

- node-exporter pods on Infra Nodes couldn't start properly because of the prom/haproxy-exporter pods that are part of the HAProxy routers (https://docs.openshift.com/container-platform/3.5/install_config/router/default_haproxy_router.html#exposing-the-router-metrics). As we were not using these metrics, we deleted those pods (edited the deployment configs).
- targets for kubernetes-nodes-exporter were down (except for the node where Prometheus was running) due to a missing iptables rule, despite https://bugzilla.redhat.com/show_bug.cgi?id=1563888. Fixed by adding "iptables -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9000:10000 -j ACCEPT" on all nodes.

MH, is there still something needed?

Created attachment 1539723 [details]
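The down kubernetes-nodes-exporter targets described above can also be spotted programmatically from Prometheus's /api/v1/targets endpoint (the same data shown on the /targets page). A minimal sketch, assuming the standard targets API response shape; the sample payload is an illustrative stand-in, not captured from the affected cluster:

```python
def down_targets(targets_response):
    """Return (job, instance, lastError) for every active target
    whose health is not 'up', from a parsed /api/v1/targets reply."""
    return [
        (t["labels"].get("job", ""),
         t["labels"].get("instance", ""),
         t.get("lastError", ""))
        for t in targets_response["data"]["activeTargets"]
        if t.get("health") != "up"
    ]

# Illustrative payload resembling this bug: the node-exporter scrape
# on port 9100 times out because the firewall drops the connection.
sample = {"data": {"activeTargets": [
    {"labels": {"job": "kubernetes-nodes-exporter", "instance": "10.0.0.5:9100"},
     "health": "down", "lastError": "context deadline exceeded"},
    {"labels": {"job": "kubernetes-apiservers", "instance": "10.0.0.1:443"},
     "health": "up", "lastError": ""},
]}}
print(down_targets(sample))
# → [('kubernetes-nodes-exporter', '10.0.0.5:9100', 'context deadline exceeded')]
```

A "context deadline exceeded" error on every node except the one running Prometheus is consistent with the missing OS_FIREWALL_ALLOW rule described above.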
Kubernetes / USE Method / Cluster grafana UI
Created attachment 1539724 [details]
Kubernetes / Compute Resources / Cluster grafana UI
Issue is fixed, see the pictures in Comment 29 and Comment 30.

    $ oc get clusterversion
    NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.0.0-0.nightly-2019-02-27-213933   True        False         80m     Cluster version is 4.0.0-0.nightly-2019-02-27-213933

Did not find the 3.11 issue on AWS/GCE. There is only one issue on 3.11 OpenStack: searching for "node:node_disk_utilisation:avg_irate" and "node:node_disk_saturation:avg_irate" in Prometheus returns "No datapoints found." This causes the "Disk IO Utilisation" and "Disk IO Saturation" panels in the Grafana "K8s / USE Method / Cluster" page to show no data; it is tracked in bug 1680517. AWS/GCE don't have this issue.

(In reply to Junqi Zhao from comment #32)
> this issue caused "Disk IO Utilisation" and "Disk IO
> Saturation" in grafana "K8s / USE Method / Cluster" page

Change to: this issue causes "No data points" to show for "Disk IO Utilisation" and "Disk IO Saturation" in the Grafana "K8s / USE Method / Cluster" page on OpenStack.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
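The OpenStack verification above boils down to checking two recording rules for empty results. A minimal sketch of that check, assuming the standard Prometheus instant-query response format from /api/v1/query; the responses here are illustrative stand-ins for what the UI renders as "No datapoints found.":

```python
# Recording rules verified in this bug.
RULES = [
    "node:node_disk_utilisation:avg_irate",
    "node:node_disk_saturation:avg_irate",
]

def rules_with_no_data(responses):
    """Given {rule: parsed /api/v1/query JSON}, return the rules whose
    result vector is empty (shown as 'No datapoints found.' in the UI)."""
    return [rule for rule, resp in responses.items()
            if not resp.get("data", {}).get("result")]

# Illustrative stand-in for the OpenStack case: both rules come back
# with an empty result vector (tracked separately in bug 1680517).
responses = {
    rule: {"status": "success",
           "data": {"resultType": "vector", "result": []}}
    for rule in RULES
}
print(rules_with_no_data(responses))
# Both rules are reported, matching the empty Grafana panels.
```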