Bug 1588420 - SDN related metrics cannot be captured by prometheus
Summary: SDN related metrics cannot be captured by prometheus
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Jacob Tanenbaum
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-07 09:57 UTC by Meng Bo
Modified: 2019-06-04 10:40 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:40:21 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 89 0 None None None 2019-02-13 12:44:25 UTC
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:40:28 UTC

Description Meng Bo 2018-06-07 09:57:10 UTC
Description of problem:
When an environment is set up with Prometheus enabled, the networking-related metrics cannot be found in the Prometheus console.

Version-Release number of selected component (if applicable):
v3.10.0-0.63.0

How reproducible:
always

Steps to Reproduce:
1. Set up an OCP cluster with Prometheus enabled
2. Log in to the Prometheus console and check for the SDN-related metrics

Actual results:
The OpenShift SDN related metrics cannot be found.

Expected results:
Should be able to capture the networking related metrics by prometheus.

Additional info:

Comment 1 Meng Bo 2018-06-07 10:01:21 UTC
The following metrics should be shown in prometheus console. 

 openshift_sdn_ovs_flows
 openshift_sdn_arp_cache_entries
 openshift_sdn_pod_ips
 openshift_sdn_pod_setup_errors
 openshift_sdn_pod_setup_latency
 openshift_sdn_pod_teardown_errors
 openshift_sdn_pod_teardown_latency

Comment 2 Ben Bennett 2018-06-07 13:04:23 UTC
This is presumably not a regression so moving to 3.11.

Reassigning to the metrics team since I assume they are in charge of hooking up metrics.

Comment 3 Frederic Branczyk 2018-06-12 13:34:56 UTC
Correct, this is not a regression, it's simply not configured. Assigning this to Casey for the SDN team to configure metrics collection for 3.11.

Comment 4 Casey Callendrello 2018-06-12 13:40:07 UTC
Where did the list of metrics in comment 1 come from? Were these metrics previously exposed via hawkular?

Comment 5 Meng Bo 2018-06-20 08:05:05 UTC
(In reply to Casey Callendrello from comment #4)
> Where did the list of metrics in comment 1 come from? Were these metrics
> previously exposed via hawkular?

The networking-related metrics are defined in the file below, and I designed the test case based on that list:
https://github.com/openshift/origin/blob/master/pkg/network/node/metrics.go#L19

I am not sure whether they work with Hawkular; we do not have a test case for networking metrics via Hawkular.

Comment 6 Casey Callendrello 2018-06-20 09:25:09 UTC
Great, thanks for the info.

Assigning to dcbw, who wrote that code.

Comment 7 Jessica Forrester 2018-08-02 20:05:24 UTC
Since this is specifically monitoring of the networking stack I'm moving this to their BZ component for tracking.

Comment 8 Meng Bo 2018-08-07 09:29:29 UTC
Any progress on this? I can still see this problem on v3.11.0

Comment 9 Dan Williams 2018-08-07 21:05:16 UTC
Casey, I'd thought I did everything needed to expose those, but now that the SDN is a daemonset perhaps more is required?  Any idea what's needed to actually push the metrics out now?

Comment 10 Casey Callendrello 2018-08-09 13:55:18 UTC
The question is: where's the problem? Is the daemonset correctly answering on the metrics endpoint? Is Prometheus configured to scrape that endpoint?

Should be easy enough to answer the first question. I'll also ask the monitoring people if they can answer the second.
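
The first question can be checked by reading the daemonset's metrics endpoint directly and looking for the names from comment 1. A minimal Python sketch, assuming the endpoint returns the standard Prometheus text exposition format; the helper names are hypothetical, and the endpoint address is not stated in this bug, so fetching the text is left out:

```python
from typing import Iterable, List, Set

# Metric names listed in comment 1 of this bug.
EXPECTED_SDN_METRICS = [
    "openshift_sdn_ovs_flows",
    "openshift_sdn_arp_cache_entries",
    "openshift_sdn_pod_ips",
    "openshift_sdn_pod_setup_errors",
    "openshift_sdn_pod_setup_latency",
    "openshift_sdn_pod_teardown_errors",
    "openshift_sdn_pod_teardown_latency",
]

# Histogram and summary metrics are exposed as several series with these suffixes.
SERIES_SUFFIXES = ("", "_bucket", "_sum", "_count")


def metric_names(exposition_text: str) -> Set[str]:
    """Collect the series names found in Prometheus text-format output."""
    names: Set[str] = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like `name{label="x"} value` or `name value`.
        names.add(line.split("{", 1)[0].split()[0])
    return names


def missing_metrics(exposition_text: str,
                    expected: Iterable[str] = EXPECTED_SDN_METRICS) -> List[str]:
    """Return the expected metric names that do not appear in the output."""
    present = metric_names(exposition_text)
    return [
        name for name in expected
        if not any(name + suffix in present for suffix in SERIES_SUFFIXES)
    ]
```

If `missing_metrics` returns an empty list for the text served by an SDN pod, the daemonset side is fine and the remaining question is whether Prometheus scrapes that endpoint.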

Comment 11 Meng Bo 2018-08-10 02:46:20 UTC
The prometheus related services are running in the openshift-metrics project, and the sdn related services are running in the openshift-sdn project. 

Can Prometheus in openshift-metrics read the metrics exposed in openshift-sdn?

Comment 12 Casey Callendrello 2018-08-14 16:01:59 UTC
Figured this one out.

We need to:
1) Decide on a port for the sdn to use
2) Add that to "metrics-bind-address" in ProxyArguments
3) Configure a headless service for the metrics port with appropriate labels (see etcd for an example)
4) Create a ServiceMonitor object
5) Profit!
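
Steps 3 and 4 can be sketched as two manifests. The Python below only builds them as plain dictionaries and is an illustration, not the change that was eventually merged: the port number, object names, and labels are hypothetical placeholders, and only the openshift-sdn namespace comes from this bug. The ServiceMonitor fields follow the prometheus-operator `monitoring.coreos.com/v1` API.

```python
import json
from typing import Any, Dict


def headless_metrics_service(namespace: str, name: str, port: int,
                             labels: Dict[str, str],
                             pod_selector: Dict[str, str]) -> Dict[str, Any]:
    """Build a headless Service (clusterIP: None) exposing the metrics port."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace, "labels": labels},
        "spec": {
            "clusterIP": "None",  # headless: Prometheus scrapes each pod endpoint
            "selector": pod_selector,
            "ports": [{"name": "metrics", "port": port, "targetPort": port}],
        },
    }


def service_monitor(namespace: str, name: str,
                    service_labels: Dict[str, str]) -> Dict[str, Any]:
    """Build a ServiceMonitor selecting the Service above by its labels."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": service_labels},
            "namespaceSelector": {"matchNames": [namespace]},
            "endpoints": [{"port": "metrics", "interval": "30s"}],
        },
    }


if __name__ == "__main__":
    # All values below except the namespace are assumptions for illustration.
    labels = {"app": "sdn-metrics"}
    svc = headless_metrics_service("openshift-sdn", "sdn-metrics", 9101,
                                   labels, {"app": "sdn"})
    mon = service_monitor("openshift-sdn", "sdn-metrics", labels)
    print(json.dumps([svc, mon], indent=2))
```

The ServiceMonitor refers to the endpoint by port name rather than number, so the port chosen in step 1 only needs to be set on the Service and in the SDN's metrics bind address.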

Comment 13 Casey Callendrello 2018-08-16 14:09:50 UTC
Clayton says he'll "take care of this..."

Comment 14 zhaozhanqi 2019-01-25 06:56:27 UTC
Hi, this issue is still not fixed in 4.0, payload 4.0.0-0.nightly-2019-01-24-184525.

Since the target release is 4.0.0, is there any progress on this?

Comment 15 Casey Callendrello 2019-01-25 16:14:53 UTC
Yes, Jacob has been making progress and this should be in soon.

Comment 16 Jacob Tanenbaum 2019-02-13 12:43:51 UTC
The fix has been posted and merged:

https://github.com/openshift/cluster-network-operator/pull/89

Comment 19 Anurag saxena 2019-03-29 20:00:39 UTC
Verified on 4.0.0-0.nightly-2019-03-28-210640. The following SDN-related metrics are now captured in the Prometheus console:

openshift_sdn_ovs_flows
openshift_sdn_arp_cache_entries
openshift_sdn_pod_ips
openshift_sdn_pod_setup_errors
openshift_sdn_pod_setup_latency
openshift_sdn_pod_teardown_errors
openshift_sdn_pod_teardown_latency

Thanks!

Comment 21 errata-xmlrpc 2019-06-04 10:40:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

