Description of problem:

The MetalLB upstream metrics endpoint is incorrect for the controller.

[1] The default MetalLB Operator setup has no issue; the Prometheus target for the controller is UP.
[2] The issue is observed when trying to host the speakers on specific infra nodes. The desired outcome is achieved in that the speakers run on the infra nodes with the specific taints applied, but the Prometheus target fails with a "server returned HTTP status 502 Bad Gateway" error.
[3] The kube-rbac-proxy container in the controller pod shows "http: proxy error: dial tcp 10.249.80.164:7472: connect: connection refused".

Version-Release number of selected component (if applicable):
OCP v4.10.5

How reproducible:
Always

Steps to Reproduce:
1. Install an IPI cluster.
2. Install the MetalLB Operator.
3. Configure a MetalLB instance and verify that monitoring finds the metrics target via Prometheus target discovery.
4. Add infra nodes with a taint using an infra machine set: https://docs.openshift.com/container-platform/4.10/networking/metallb/metallb-operator-install.html#nw-metallb-operator-limit-speaker-to-nodes_metallb-operator-install
5. Configure MetalLB to place the speakers on these infra nodes.

Actual results:
The metrics service endpoint for the MetalLB controller is down at https://prometheus-k8s-openshift-monitoring.apps../targets. After a while an alert is also shown in the UI.

Expected results:
The MetalLB controller target should be UP in Prometheus after the speaker pods are placed on the tainted infra nodes.

Additional info:
The customer suspects this issue occurs because the kube-rbac-proxy container in the "controller" deployment points to upstream "--upstream=http://$(METALLB_HOST):7472/", where METALLB_HOST is populated from "status.hostIP". We believe this is a bug: the controller deployment should point kube-rbac-proxy's upstream at status.podIP:7472.
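The change suggested in the additional info can be sketched as a fragment of the controller Deployment. This is an illustrative snippet based only on the description above, not the actual manifest; surrounding fields and other container args are omitted:

```yaml
# Illustrative fragment of the metallb-system "controller" Deployment.
# kube-rbac-proxy proxies the controller's metrics; today METALLB_HOST is
# filled from status.hostIP, which fails because the controller (unlike the
# speakers) does not listen on the host IP. The suggested fix is status.podIP.
containers:
- name: kube-rbac-proxy
  args:
  - --upstream=http://$(METALLB_HOST):7472/
  env:
  - name: METALLB_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.podIP   # was: status.hostIP
```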
We found the issue: the ports currently used for metrics are not in OpenShift's reserved range. Working on a fix.
Just an extra note: the relevant range is the reserved range for pods that run with hostNetwork: true.
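For context, the speaker pods run on the host network, so their metrics ports bind directly on the node and must come from the range OpenShift reserves for host-networked pods. A minimal sketch of the relevant part of the speaker pod spec; port 29151 matches the verification output later in this bug, while the container and port names here are illustrative:

```yaml
# Illustrative fragment of the speaker DaemonSet pod spec: with
# hostNetwork: true, containerPort is opened on the node itself, so it
# must fall in the host-port range OpenShift reserves for such pods.
spec:
  hostNetwork: true
  containers:
  - name: frr-metrics        # container name illustrative
    ports:
    - name: metrics
      containerPort: 29151   # port seen in the verification comment below
```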
I just filed https://bugzilla.redhat.com/show_bug.cgi?id=2089179 for tracking the backport.
4.11.0-0.nightly-2022-05-18-171831
metallb-operator.4.11.0-202205191659
=====================================
Cannot find MetalLB metrics on the Prometheus pods:
******************************************************************
oc exec speaker-kvs66 -n metallb-system -- curl localhost:29151/metrics | grep metallb_bfd_control_packet_output
# HELP metallb_bfd_control_packet_output Number of sent BFD control packets
# TYPE metallb_bfd_control_packet_output counter
metallb_bfd_control_packet_output{peer="10.46.55.34"} 2763
******************************************************************
oc exec prometheus-k8s-0 -n openshift-monitoring -- curl http://localhost:9090/api/v1/query?query=metallb_bfd_control_packet_output
{"status":"success","data":{"resultType":"vector","result":[]}}
******************************************************************
MetalLB Prometheus targets (ports 9120 & 9121) have status "down":
******************************************************************
Scrape failed: server returned HTTP status 401 Unauthorized
******************************************************************
metallb-operator.4.11.0-202205242136
OCP 4.11.0-0.nightly-2022-05-18-171831
=======================================
• [SLOW TEST:50.158 seconds]
MetalLB BGP
/home/elevin/projects/cnf-gotestMy/remove/onemore/cnf-gotests/test/network/metallb/tests/bgp-test.go:25
  updates
  /home/elevin/projects/cnf-gotestMy/remove/onemore/cnf-gotests/test/network/metallb/tests/bgp-test.go:106
    metrics
    /home/elevin/projects/cnf-gotestMy/remove/onemore/cnf-gotests/test/network/metallb/tests/bgp-test.go:173
      provides Prometheus BGP metrics
      /home/elevin/projects/cnf-gotestMy/remove/onemore/cnf-gotests/test/network/metallb/tests/bgp-test.go:200
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069