Bug 2039294

Summary: SDN controller metrics cannot be consumed correctly by prometheus
Product: OpenShift Container Platform Reporter: Martin Kennelly <mkennell>
Component: Networking Assignee: Martin Kennelly <mkennell>
Networking sub component: openshift-sdn QA Contact: Weibin Liang <weliang>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: weliang
Version: 4.10   
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: All   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-03-12 04:40:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Martin Kennelly 2022-01-11 13:01:22 UTC
Description of problem:
Currently, only the leader[1] SDN controller serves metrics. Non-leader instances do not.

This is not compatible with consuming the metrics through Prometheus + ServiceMonitor[2] + a Kubernetes Service. The Service lists an endpoint for every SDN controller instance, but non-leader instances do not expose a metrics endpoint, so Prometheus reports an error for each of those targets.

Version-Release number of selected component (if applicable):
4.10

How reproducible:
Always

Steps to Reproduce:
1. Launch 4.10
2. Observe that non-leader SDN controller instances do not serve metrics


Actual results:
Prometheus scrapes metrics from the leader controller, but scrapes of non-leader instances fail with a 5XX HTTP error code.

Expected results:
Prometheus scrapes metrics from every instance without an error.


Additional info:
[1] https://pkg.go.dev/k8s.io/client-go/tools/leaderelection
[2] https://docs.openshift.com/container-platform/4.7/monitoring/managing-metrics.html#setting-up-metrics-collection-for-user-defined-projects_managing-metrics

Comment 3 Weibin Liang 2022-01-21 17:00:03 UTC
Tested and verified in 4.10.0-0.nightly-2022-01-21-074618

[weliang@weliang verification-tests]$ oc get pod -o wide -n openshift-sdn
NAME                   READY   STATUS    RESTARTS   AGE   IP             NODE                                         NOMINATED NODE   READINESS GATES
sdn-26qp5              2/2     Running   0          38m   10.0.161.127   ip-10-0-161-127.us-east-2.compute.internal   <none>           <none>
sdn-controller-74tzj   1/1     Running   0          45m   10.0.135.231   ip-10-0-135-231.us-east-2.compute.internal   <none>           <none>
sdn-controller-m5r5s   1/1     Running   0          45m   10.0.179.115   ip-10-0-179-115.us-east-2.compute.internal   <none>           <none>
sdn-controller-xb4lj   1/1     Running   0          45m   10.0.204.42    ip-10-0-204-42.us-east-2.compute.internal    <none>           <none>
sdn-dr5kn              2/2     Running   0          45m   10.0.204.42    ip-10-0-204-42.us-east-2.compute.internal    <none>           <none>
sdn-hrpth              2/2     Running   0          45m   10.0.179.115   ip-10-0-179-115.us-east-2.compute.internal   <none>           <none>
sdn-n7b9g              2/2     Running   0          37m   10.0.215.226   ip-10-0-215-226.us-east-2.compute.internal   <none>           <none>
sdn-xpnsr              2/2     Running   0          38m   10.0.128.67    ip-10-0-128-67.us-east-2.compute.internal    <none>           <none>
sdn-z7pdw              2/2     Running   0          45m   10.0.135.231   ip-10-0-135-231.us-east-2.compute.internal   <none>           <none>
[weliang@weliang verification-tests]$ oc -n openshift-sdn get cm openshift-network-controller -o yaml | grep holderIdentity
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"ip-10-0-179-115","leaseDurationSeconds":137,"acquireTime":"2022-01-21T15:23:11Z","renewTime":"2022-01-21T16:08:50Z","leaderTransitions":0}'
[weliang@weliang verification-tests]$ oc exec -n openshift-sdn sdn-hrpth -- curl localhost:29100/metrics | grep -i egress_fire
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1017  100  1017    0     0   993k      0 --:--:-- --:--:-- --:--:--  993k
# HELP sdn_controller_num_egress_firewall_rules The number of egress firewall rules defined
# TYPE sdn_controller_num_egress_firewall_rules gauge
sdn_controller_num_egress_firewall_rules 2
# HELP sdn_controller_num_egress_firewalls The number of egress firewall policies
# TYPE sdn_controller_num_egress_firewalls gauge
sdn_controller_num_egress_firewalls 1
[weliang@weliang verification-tests]$ oc exec -n openshift-sdn sdn-controller-74tzj -- curl localhost:29100/metrics | grep -i egress_fire
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   495  100   495    0     0   483k      0 --:--:-- --:--:-- --:--:--  483k
[weliang@weliang verification-tests]$ oc exec -n openshift-sdn sdn-controller-m5r5s -- curl localhost:29100/metrics | grep -i egress_fire
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1017  100  1017    0     0   993k      0 --:--:-- --:--:-- --:--:--  993k
# HELP sdn_controller_num_egress_firewall_rules The number of egress firewall rules defined
# TYPE sdn_controller_num_egress_firewall_rules gauge
sdn_controller_num_egress_firewall_rules 2
# HELP sdn_controller_num_egress_firewalls The number of egress firewall policies
# TYPE sdn_controller_num_egress_firewalls gauge
sdn_controller_num_egress_firewalls 1
[weliang@weliang verification-tests]$ oc exec -n openshift-sdn sdn-controller-xb4lj -- curl localhost:29100/metrics | grep -i egress_fire
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   495  100   495    0     0   483k      0 --:--:-- --:--:-- --:--:--  483k
[weliang@weliang verification-tests]$

Comment 6 errata-xmlrpc 2022-03-12 04:40:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056