Bug 2039294 - SDN controller metrics cannot be consumed correctly by prometheus
Summary: SDN controller metrics cannot be consumed correctly by prometheus
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: All
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Martin Kennelly
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-11 13:01 UTC by Martin Kennelly
Modified: 2022-03-12 04:40 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-12 04:40:34 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift sdn pull 388 | 0 | None | open | Bug 2039294: SDN controller metrics cannot be scraped by prometheus | 2022-01-11 13:11:26 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-12 04:40:44 UTC

Description Martin Kennelly 2022-01-11 13:01:22 UTC
Description of problem:
Currently, only the leader[1] SDN controller serves metrics. Non-leader instances do not.

This is incompatible with consuming the metrics through Prometheus + ServiceMonitor[2] + a Kubernetes Service. The Service contains an endpoint for every SDN controller instance, but non-leader instances do not expose a metrics endpoint, so those endpoints are invalid and Prometheus reports an error when it scrapes them.

Version-Release number of selected component (if applicable):
4.10

How reproducible:
Always

Steps to Reproduce:
1. Launch 4.10
2. Observe that non-leader SDN controller instances do not serve metrics


Actual results:
Prometheus scrapes metrics from the leader controller, but scrapes of non-leader controllers fail with a 5XX HTTP error code.

Expected results:
Prometheus scrapes metrics without an error
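
The difference between the actual and expected results can also be checked by hand, without going through Prometheus. The following is a minimal sketch, not part of the original report. It assumes the controller pods carry the label app=sdn-controller and that metrics are served on localhost:29100/metrics inside the pod (the port used in the verification comment on this bug). Both are assumptions to adjust for the cluster at hand. classify_status and check_controllers are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: the label selector and metrics URL are assumptions.
NAMESPACE="openshift-sdn"
SELECTOR="app=sdn-controller"            # assumed pod label
METRICS_URL="localhost:29100/metrics"    # port taken from the verification comment

# classify_status CODE
# Prints "ok" for 2xx, "server-error" for 5xx, and "other" for anything else
# (including curl's "000" when no connection could be made).
classify_status() {
    case "$1" in
        2[0-9][0-9]) echo "ok" ;;
        5[0-9][0-9]) echo "server-error" ;;
        *)           echo "other" ;;
    esac
}

# check_controllers
# Requests the metrics URL from inside every matching pod and prints
# one line per pod: <pod> <http code> <classification>.
check_controllers() {
    for pod in $(oc get pods -n "$NAMESPACE" -l "$SELECTOR" -o name); do
        code=$(oc exec -n "$NAMESPACE" "$pod" -- \
            curl -s -o /dev/null -w '%{http_code}' "$METRICS_URL")
        echo "$pod $code $(classify_status "$code")"
    done
}
```

Running check_controllers against a logged-in cluster prints one line per controller pod. With the behaviour described under "Actual results", only the leader would be expected to report "ok".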


Additional info:
[1] https://pkg.go.dev/k8s.io/client-go/tools/leaderelection
[2] https://docs.openshift.com/container-platform/4.7/monitoring/managing-metrics.html#setting-up-metrics-collection-for-user-defined-projects_managing-metrics

Comment 3 Weibin Liang 2022-01-21 17:00:03 UTC
Tested and verified in 4.10.0-0.nightly-2022-01-21-074618

[weliang@weliang verification-tests]$ oc get pod -o wide -n openshift-sdn
NAME                   READY   STATUS    RESTARTS   AGE   IP             NODE                                         NOMINATED NODE   READINESS GATES
sdn-26qp5              2/2     Running   0          38m   10.0.161.127   ip-10-0-161-127.us-east-2.compute.internal   <none>           <none>
sdn-controller-74tzj   1/1     Running   0          45m   10.0.135.231   ip-10-0-135-231.us-east-2.compute.internal   <none>           <none>
sdn-controller-m5r5s   1/1     Running   0          45m   10.0.179.115   ip-10-0-179-115.us-east-2.compute.internal   <none>           <none>
sdn-controller-xb4lj   1/1     Running   0          45m   10.0.204.42    ip-10-0-204-42.us-east-2.compute.internal    <none>           <none>
sdn-dr5kn              2/2     Running   0          45m   10.0.204.42    ip-10-0-204-42.us-east-2.compute.internal    <none>           <none>
sdn-hrpth              2/2     Running   0          45m   10.0.179.115   ip-10-0-179-115.us-east-2.compute.internal   <none>           <none>
sdn-n7b9g              2/2     Running   0          37m   10.0.215.226   ip-10-0-215-226.us-east-2.compute.internal   <none>           <none>
sdn-xpnsr              2/2     Running   0          38m   10.0.128.67    ip-10-0-128-67.us-east-2.compute.internal    <none>           <none>
sdn-z7pdw              2/2     Running   0          45m   10.0.135.231   ip-10-0-135-231.us-east-2.compute.internal   <none>           <none>
[weliang@weliang verification-tests]$ oc -n openshift-sdn get cm openshift-network-controller -o yaml | grep holderIdentity
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"ip-10-0-179-115","leaseDurationSeconds":137,"acquireTime":"2022-01-21T15:23:11Z","renewTime":"2022-01-21T16:08:50Z","leaderTransitions":0}'
[weliang@weliang verification-tests]$ oc exec -n openshift-sdn sdn-hrpth -- curl localhost:29100/metrics | grep -i egress_fire
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1017  100  1017    0     0   993k      0 --:--:-- --:--:-- --:--:--  993k
# HELP sdn_controller_num_egress_firewall_rules The number of egress firewall rules defined
# TYPE sdn_controller_num_egress_firewall_rules gauge
sdn_controller_num_egress_firewall_rules 2
# HELP sdn_controller_num_egress_firewalls The number of egress firewall policies
# TYPE sdn_controller_num_egress_firewalls gauge
sdn_controller_num_egress_firewalls 1
[weliang@weliang verification-tests]$ oc exec -n openshift-sdn sdn-controller-74tzj -- curl localhost:29100/metrics | grep -i egress_fire
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   495  100   495    0     0   483k      0 --:--:-- --:--:-- --:--:--  483k
[weliang@weliang verification-tests]$ oc exec -n openshift-sdn sdn-controller-m5r5s -- curl localhost:29100/metrics | grep -i egress_fire
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1017  100  1017    0     0   993k      0 --:--:-- --:--:-- --:--:--  993k
# HELP sdn_controller_num_egress_firewall_rules The number of egress firewall rules defined
# TYPE sdn_controller_num_egress_firewall_rules gauge
sdn_controller_num_egress_firewall_rules 2
# HELP sdn_controller_num_egress_firewalls The number of egress firewall policies
# TYPE sdn_controller_num_egress_firewalls gauge
sdn_controller_num_egress_firewalls 1
[weliang@weliang verification-tests]$ oc exec -n openshift-sdn sdn-controller-xb4lj -- curl localhost:29100/metrics | grep -i egress_fire
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   495  100   495    0     0   483k      0 --:--:-- --:--:-- --:--:--  483k
[weliang@weliang verification-tests]$
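
In the transcript above, the holderIdentity (ip-10-0-179-115) matches the node of sdn-controller-m5r5s, which is the pod that returns the egress firewall gauges. The other two controllers respond as well, but without those gauges. The leader lookup can be scripted with a small helper. This is a sketch under the assumption that the leader annotation has the same JSON shape as shown above. holder_identity is a hypothetical helper name.

```shell
#!/bin/sh
# holder_identity
# Reads text containing the leader annotation on stdin and prints the
# value of "holderIdentity". Prints nothing if the key is not present.
holder_identity() {
    sed -n 's/.*"holderIdentity":"\([^"]*\)".*/\1/p'
}
```

For example, `oc -n openshift-sdn get cm openshift-network-controller -o yaml | holder_identity` prints the leader's identity, which can then be matched against the NODE column of `oc get pod -o wide -n openshift-sdn`.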

Comment 6 errata-xmlrpc 2022-03-12 04:40:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

