Bug 2069705 - prometheus target "serviceMonitor/openshift-metallb-system/monitor-metallb-controller/0" has a failure with "server returned HTTP status 502 Bad Gateway"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Federico Paolinelli
QA Contact: Arti Sood
URL:
Whiteboard:
Depends On:
Blocks: 2089179
 
Reported: 2022-03-29 13:57 UTC by Sunil Gurnale
Modified: 2022-08-10 11:02 UTC (History)
2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2089179 (view as bug list)
Environment:
Last Closed: 2022-08-10 11:02:32 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github metallb metallb-operator pull 197 0 None open Expose the ports used by metrics and memberlist as parameters 2022-04-27 14:19:43 UTC
Github openshift metallb-operator pull 82 0 None Merged Sync with upstream 05-05-2022 2022-05-06 08:25:33 UTC
Github openshift metallb-operator pull 83 0 None open Bug 2069705: Pin ports to allowed ranges 2022-05-06 08:22:59 UTC
Github openshift metallb-operator pull 88 0 None open Bug 2069705: Speaker: grant create subject accessreviews / token reviews 2022-05-24 15:38:48 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:02:50 UTC

Description Sunil Gurnale 2022-03-29 13:57:36 UTC
Description of problem:
The MetalLB upstream metrics endpoint is incorrect for the controller.
[1] The default MetalLB Operator setup has no issue, and the Prometheus target for the controller is UP.
[2] The issue is observed when hosting the speakers on specific infra nodes. The desired outcome is achieved: the speakers run on the infra nodes with the specific taints applied, but the Prometheus target fails with a "server returned HTTP status 502 Bad Gateway" error.
[3] The kube-rbac-proxy container in the controller pod shows "http: proxy error: dial tcp 10.249.80.164:7472: connect: connection refused"

Version-Release number of selected component (if applicable): OCPv4.10.5


How reproducible: Always


Steps to Reproduce:

1. Install IPI cluster

2. Install MetalLB operator

3. Configure MetalLB instance and verify that monitoring finds metrics target using prometheus target discovery

4. Add infra nodes with a taint using an infra machine set.
https://docs.openshift.com/container-platform/4.10/networking/metallb/metallb-operator-install.html#nw-metallb-operator-limit-speaker-to-nodes_metallb-operator-install

5. Configure MetalLB and place the MetalLB speakers on these infra nodes.
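
Steps 4–5 above can be sketched as a MetalLB custom resource that restricts the speakers to nodes with a given label and tolerates their taint. This is a hedged example based on the linked OpenShift documentation; the label and the taint key/effect below are placeholders for the customer's actual values.

```yaml
apiVersion: metallb.io/v1beta1
kind: MetalLB
metadata:
  name: metallb
  namespace: metallb-system
spec:
  # Schedule speakers only on nodes carrying this label (placeholder label).
  nodeSelector:
    node-role.kubernetes.io/infra: ""
  # Tolerate the taint applied to the infra nodes (placeholder key/effect).
  speakerTolerations:
  - key: "node-role.kubernetes.io/infra"
    operator: "Exists"
    effect: "NoExecute"
```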


Actual results:
The metrics service endpoint for the MetalLB controller is down at https://prometheus-k8s-openshift-monitoring.apps../targets
After a while, an alert is also shown in the UI.

Expected results:
The MetalLB controller target should be UP in Prometheus after the speaker pods are placed on the tainted infra nodes.

Additional info:

The customer suspects this issue occurs because the "kube-rbac-proxy" container in the "controller" deployment points to the upstream
"--upstream=http://$(METALLB_HOST):7472/", where METALLB_HOST is populated from "status.hostIP".

We believe this is a bug, and the controller deployment from MetalLB should point the kube-rbac-proxy container to the upstream status.podIP:7472
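
The suspected misconfiguration would look roughly like the following fragment of the controller Deployment. This is a hedged sketch, not the actual manifest: surrounding fields are omitted and only the `--upstream` argument and the `METALLB_HOST` downward-API reference are taken from the report.

```yaml
# Sketch of the relevant part of the controller Deployment (abbreviated):
containers:
- name: kube-rbac-proxy
  args:
  - --upstream=http://$(METALLB_HOST):7472/
  env:
  - name: METALLB_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP   # customer suggests status.podIP instead
```

With `status.hostIP`, the proxy dials the node's address, which only works if the controller actually listens on the host network on that node; the customer's suggestion is to dial the pod's own IP instead.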

Comment 5 Federico Paolinelli 2022-04-27 14:18:34 UTC
We found the issue: the ports currently used for metrics are not in OpenShift's reserved range. Working on a fix.

Comment 6 Federico Paolinelli 2022-04-27 14:20:57 UTC
Just an extra note: the reserved range applies to pods that run with hostNetwork: true.
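
The fix ("Pin ports to allowed ranges", PR 197/83 above) amounts to checking that each metrics port falls inside the range the platform permits for hostNetwork pods. A minimal shell sketch of that check follows; the range bounds used here are placeholders, not OpenShift's actual reserved range.

```shell
# Return success if $1 lies in the inclusive range [$2, $3].
port_in_range() {
  local port=$1 lo=$2 hi=$3
  [ "$port" -ge "$lo" ] && [ "$port" -le "$hi" ]
}

# Example with placeholder bounds: 7472 is outside [9000, 9999].
port_in_range 7472 9000 9999 && echo allowed || echo blocked   # prints "blocked"
```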

Comment 9 Federico Paolinelli 2022-05-23 07:33:30 UTC
I just filed https://bugzilla.redhat.com/show_bug.cgi?id=2089179 for tracking the backport.

Comment 10 elevin 2022-05-24 09:00:56 UTC
4.11.0-0.nightly-2022-05-18-171831
metallb-operator.4.11.0-202205191659
=====================================
Cannot find MetalLB metrics on the Prometheus pods:
******************************************************************
oc exec speaker-kvs66 -n metallb-system -- curl localhost:29151/metrics | grep metallb_bfd_control_packet_output

# HELP metallb_bfd_control_packet_output Number of sent BFD control packets
# TYPE metallb_bfd_control_packet_output counter
metallb_bfd_control_packet_output{peer="10.46.55.34"} 2763
******************************************************************
oc exec prometheus-k8s-0 -n openshift-monitoring -- curl http://localhost:9090/api/v1/query?query=metallb_bfd_control_packet_output

{"status":"success","data":{"resultType":"vector","result":[]}}
******************************************************************

The MetalLB Prometheus targets (ports 9120 & 9121) have status "down".

******************************************************************

Scrape failed
server returned HTTP status 401 Unauthorized

******************************************************************
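
The 401 Unauthorized above is consistent with kube-rbac-proxy being unable to validate Prometheus's bearer token, which is what PR 88 ("Speaker: grant create subject accessreviews / token reviews") addresses. A hedged sketch of the kind of RBAC rules that grant this follows; the ClusterRole name is an assumption, not taken from the operator's manifests.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metallb-speaker-rbac-proxy   # name is a placeholder
rules:
# Lets kube-rbac-proxy validate incoming bearer tokens.
- apiGroups: ["authentication.k8s.io"]
  resources: ["tokenreviews"]
  verbs: ["create"]
# Lets kube-rbac-proxy authorize the scraping identity.
- apiGroups: ["authorization.k8s.io"]
  resources: ["subjectaccessreviews"]
  verbs: ["create"]
```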

Comment 12 elevin 2022-05-30 06:25:29 UTC
metallb-operator.4.11.0-202205242136
OCP 4.11.0-0.nightly-2022-05-18-171831
=======================================

• [SLOW TEST:50.158 seconds]
MetalLB BGP
/home/elevin/projects/cnf-gotestMy/remove/onemore/cnf-gotests/test/network/metallb/tests/bgp-test.go:25
  updates
  /home/elevin/projects/cnf-gotestMy/remove/onemore/cnf-gotests/test/network/metallb/tests/bgp-test.go:106
    metrics
    /home/elevin/projects/cnf-gotestMy/remove/onemore/cnf-gotests/test/network/metallb/tests/bgp-test.go:173
      provides Prometheus BGP metrics
      /home/elevin/projects/cnf-gotestMy/remove/onemore/cnf-gotests/test/network/metallb/tests/bgp-test.go:200

Comment 14 errata-xmlrpc 2022-08-10 11:02:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

