Description of problem:
Currently, only the leader[1] SDN controller serves metrics; non-leader instances do not. This is incompatible with consuming the metrics through Prometheus + ServiceMonitor[2] + a Kubernetes Service: the Service's endpoints include the non-leader SDN controller instances, which do not expose a metrics endpoint, so Prometheus reports scrape errors for them.

Version-Release number of selected component (if applicable):
4.10

How reproducible:
Always

Steps to Reproduce:
1. Launch 4.10
2. Observe that non-leader SDN controller instances do not serve metrics

Actual results:
Prometheus scrapes metrics from the leader controller, but scrapes of non-leader instances fail with a 5XX HTTP error code.

Expected results:
Prometheus scrapes metrics without an error

Additional info:
[1] https://pkg.go.dev/k8s.io/client-go/tools/leaderelection
[2] https://docs.openshift.com/container-platform/4.7/monitoring/managing-metrics.html#setting-up-metrics-collection-for-user-defined-projects_managing-metrics
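The behavior Prometheus needs can be sketched as follows. This is a minimal illustration, not the SDN controller's actual code (the controller is written in Go and the fix itself is not shown in this report). It assumes the metrics endpoint is started on every instance regardless of leadership, and that leader-only gauges are simply omitted on non-leaders. `ControllerState`, `render_metrics`, `start_metrics_server`, `scrape`, and the `controller_is_leader` gauge are hypothetical names introduced for the example.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ControllerState:
    """Hypothetical stand-in for one controller instance."""

    def __init__(self):
        self.is_leader = threading.Event()  # set once this instance holds the lease
        self.egress_firewalls = 0           # only maintained by the leader


def render_metrics(state):
    """Build a Prometheus text payload; leader-only gauges are omitted on non-leaders."""
    leader = state.is_leader.is_set()
    lines = [
        "# TYPE controller_is_leader gauge",  # hypothetical metric name
        f"controller_is_leader {1 if leader else 0}",
    ]
    if leader:
        lines += [
            "# TYPE sdn_controller_num_egress_firewalls gauge",
            f"sdn_controller_num_egress_firewalls {state.egress_firewalls}",
        ]
    return "\n".join(lines) + "\n"


def make_handler(state):
    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            # Always answer 200, leader or not, so every Service endpoint is scrapeable.
            body = render_metrics(state).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the example quiet

    return MetricsHandler


def start_metrics_server(state, port=0):
    """Start the endpoint before (and independently of) any leader election."""
    server = HTTPServer(("127.0.0.1", port), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def scrape(server):
    """Fetch /metrics from a server started above; returns (status, body)."""
    port = server.server_address[1]
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # bypass proxies
    with opener.open(f"http://127.0.0.1:{port}/metrics") as resp:
        return resp.status, resp.read().decode()
```

With this shape, a scrape of a non-leader returns HTTP 200 with a shorter payload instead of a connection failure or 5XX, which matches the smaller responses seen from non-leader pods in the verification below.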
Tested and verified in 4.10.0-0.nightly-2022-01-21-074618

[weliang@weliang verification-tests]$ oc get pod -o wide -n openshift-sdn
NAME                   READY   STATUS    RESTARTS   AGE   IP             NODE                                         NOMINATED NODE   READINESS GATES
sdn-26qp5              2/2     Running   0          38m   10.0.161.127   ip-10-0-161-127.us-east-2.compute.internal   <none>           <none>
sdn-controller-74tzj   1/1     Running   0          45m   10.0.135.231   ip-10-0-135-231.us-east-2.compute.internal   <none>           <none>
sdn-controller-m5r5s   1/1     Running   0          45m   10.0.179.115   ip-10-0-179-115.us-east-2.compute.internal   <none>           <none>
sdn-controller-xb4lj   1/1     Running   0          45m   10.0.204.42    ip-10-0-204-42.us-east-2.compute.internal    <none>           <none>
sdn-dr5kn              2/2     Running   0          45m   10.0.204.42    ip-10-0-204-42.us-east-2.compute.internal    <none>           <none>
sdn-hrpth              2/2     Running   0          45m   10.0.179.115   ip-10-0-179-115.us-east-2.compute.internal   <none>           <none>
sdn-n7b9g              2/2     Running   0          37m   10.0.215.226   ip-10-0-215-226.us-east-2.compute.internal   <none>           <none>
sdn-xpnsr              2/2     Running   0          38m   10.0.128.67    ip-10-0-128-67.us-east-2.compute.internal    <none>           <none>
sdn-z7pdw              2/2     Running   0          45m   10.0.135.231   ip-10-0-135-231.us-east-2.compute.internal   <none>           <none>

[weliang@weliang verification-tests]$ oc -n openshift-sdn get cm openshift-network-controller -o yaml | grep holderIdentity
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"ip-10-0-179-115","leaseDurationSeconds":137,"acquireTime":"2022-01-21T15:23:11Z","renewTime":"2022-01-21T16:08:50Z","leaderTransitions":0}'

[weliang@weliang verification-tests]$ oc exec -n openshift-sdn sdn-hrpth -- curl localhost:29100/metrics | grep -i egress_fire
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1017  100  1017    0     0   993k      0 --:--:-- --:--:-- --:--:--  993k
# HELP sdn_controller_num_egress_firewall_rules The number of egress firewall rules defined
# TYPE sdn_controller_num_egress_firewall_rules gauge
sdn_controller_num_egress_firewall_rules 2
# HELP sdn_controller_num_egress_firewalls The number of egress firewall policies
# TYPE sdn_controller_num_egress_firewalls gauge
sdn_controller_num_egress_firewalls 1

[weliang@weliang verification-tests]$ oc exec -n openshift-sdn sdn-controller-74tzj -- curl localhost:29100/metrics | grep -i egress_fire
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   495  100   495    0     0   483k      0 --:--:-- --:--:-- --:--:--  483k

[weliang@weliang verification-tests]$ oc exec -n openshift-sdn sdn-controller-m5r5s -- curl localhost:29100/metrics | grep -i egress_fire
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1017  100  1017    0     0   993k      0 --:--:-- --:--:-- --:--:--  993k
# HELP sdn_controller_num_egress_firewall_rules The number of egress firewall rules defined
# TYPE sdn_controller_num_egress_firewall_rules gauge
sdn_controller_num_egress_firewall_rules 2
# HELP sdn_controller_num_egress_firewalls The number of egress firewall policies
# TYPE sdn_controller_num_egress_firewalls gauge
sdn_controller_num_egress_firewalls 1

[weliang@weliang verification-tests]$ oc exec -n openshift-sdn sdn-controller-xb4lj -- curl localhost:29100/metrics | grep -i egress_fire
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   495  100   495    0     0   483k      0 --:--:-- --:--:-- --:--:--  483k
[weliang@weliang verification-tests]$
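The manual check above (curl each pod, then grep for the egress firewall gauges) can be approximated with a small parser. This is a sketch under stated assumptions: `parse_gauges` is a hypothetical helper, it handles only unlabelled samples in the Prometheus text format, and the payload is assumed to have been captured separately, for example with the `oc exec ... -- curl localhost:29100/metrics` commands shown in the transcript.

```python
def parse_gauges(payload, prefix="sdn_controller_"):
    """Return {metric_name: value} for unlabelled samples whose name starts with prefix."""
    gauges = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        parts = line.split()
        if len(parts) < 2:
            continue  # not a "name value" sample line
        name, value = parts[0], parts[1]
        if name.startswith(prefix) and "{" not in name:
            gauges[name] = float(value)
    return gauges


# Sample lines copied from the leader's response in the transcript above.
LEADER_SAMPLE = """\
# HELP sdn_controller_num_egress_firewall_rules The number of egress firewall rules defined
# TYPE sdn_controller_num_egress_firewall_rules gauge
sdn_controller_num_egress_firewall_rules 2
# HELP sdn_controller_num_egress_firewalls The number of egress firewall policies
# TYPE sdn_controller_num_egress_firewalls gauge
sdn_controller_num_egress_firewalls 1
"""
```

Calling `parse_gauges(LEADER_SAMPLE)` yields both egress firewall gauges. A payload that was served successfully but contains no `sdn_controller_` samples yields an empty dict, which is the result expected for a non-leader pod after the fix.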
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056