Bug 2069705 - prometheus target "serviceMonitor/openshift-metallb-system/monitor-metallb-controller/0" has a failure with "server returned HTTP status 502 Bad Gateway"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Federico Paolinelli
QA Contact: Arti Sood
URL:
Whiteboard:
Depends On:
Blocks: 2089179
 
Reported: 2022-03-29 13:57 UTC by Sunil Gurnale
Modified: 2022-08-10 11:02 UTC (History)
2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2089179 (view as bug list)
Environment:
Last Closed: 2022-08-10 11:02:32 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github metallb metallb-operator pull 197 0 None open Expose the ports used by metrics and memberlist as parameters 2022-04-27 14:19:43 UTC
Github openshift metallb-operator pull 82 0 None Merged Sync with upstream 05-05-2022 2022-05-06 08:25:33 UTC
Github openshift metallb-operator pull 83 0 None open Bug 2069705: Pin ports to allowed ranges 2022-05-06 08:22:59 UTC
Github openshift metallb-operator pull 88 0 None open Bug 2069705: Speaker: grant create subject accessreviews / token reviews 2022-05-24 15:38:48 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:02:50 UTC

Description Sunil Gurnale 2022-03-29 13:57:36 UTC
Description of problem:
The MetalLB upstream metrics endpoint is incorrect for the controller.
[1] The default MetalLB Operator setup has no issue, and the Prometheus target for the controller is UP.
[2] The issue is observed when hosting the speakers on specific infra nodes. The desired outcome is achieved: the speakers run on the infra nodes with the specific taints applied, but the Prometheus target fails with a "server returned HTTP status 502 Bad Gateway" error.
[3] The kube-rbac-proxy container in the controller pod shows "http: proxy error: dial tcp 10.249.80.164:7472: connect: connection refused"

Version-Release number of selected component (if applicable): OCPv4.10.5


How reproducible: Always


Steps to Reproduce:

1. Install IPI cluster

2. Install MetalLB operator

3. Configure MetalLB instance and verify that monitoring finds metrics target using prometheus target discovery

4. Add infra nodes with a taint using an infra machine set.
https://docs.openshift.com/container-platform/4.10/networking/metallb/metallb-operator-install.html#nw-metallb-operator-limit-speaker-to-nodes_metallb-operator-install

5. Configure MetalLB and place the MetalLB speakers on these infra nodes.
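
Steps 4–5 above can be sketched as a MetalLB custom resource that restricts the speakers to nodes with a given label and tolerates their taint. This is a hedged example based on the linked OpenShift documentation; the label and the taint key/effect below are placeholders for the customer's actual values.

```yaml
apiVersion: metallb.io/v1beta1
kind: MetalLB
metadata:
  name: metallb
  namespace: metallb-system
spec:
  # Schedule speakers only on nodes carrying this label (placeholder label).
  nodeSelector:
    node-role.kubernetes.io/infra: ""
  # Tolerate the taint applied to the infra nodes (placeholder key/effect).
  speakerTolerations:
  - key: "node-role.kubernetes.io/infra"
    operator: "Exists"
    effect: "NoExecute"
```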


Actual results:
The metrics service endpoint for the MetalLB controller is down at https://prometheus-k8s-openshift-monitoring.apps../targets
After a while, an alert is also shown in the UI.

Expected results:
The MetalLB controller target should be UP in Prometheus after the speaker pods are placed on the tainted infra nodes.

Additional info:

The customer suspects this issue occurs because the "kube-rbac-proxy" container in the "controller" deployment points to the upstream
"--upstream=http://$(METALLB_HOST):7472/", where METALLB_HOST is populated from "status.hostIP".

We believe this is a bug, and the controller deployment from MetalLB should point the kube-rbac-proxy container to the upstream status.podIP:7472
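
The suspected misconfiguration would look roughly like the following fragment of the controller Deployment. This is a hedged sketch, not the actual manifest: surrounding fields are omitted and only the `--upstream` argument and the `METALLB_HOST` downward-API reference are taken from the report.

```yaml
# Sketch of the relevant part of the controller Deployment (abbreviated):
containers:
- name: kube-rbac-proxy
  args:
  - --upstream=http://$(METALLB_HOST):7472/
  env:
  - name: METALLB_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP   # customer suggests status.podIP instead
```

With `status.hostIP`, the proxy dials the node's address, which only works if the controller actually listens on the host network on that node; the customer's suggestion is to dial the pod's own IP instead.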

Comment 5 Federico Paolinelli 2022-04-27 14:18:34 UTC
We found the issue: the ports currently used for metrics are not in OpenShift's reserved range. Working on a fix.

Comment 6 Federico Paolinelli 2022-04-27 14:20:57 UTC
Just an extra note: the reserved range applies to pods that run with hostNetwork: true.
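
The fix ("Pin ports to allowed ranges", PR 197/83 above) amounts to checking that each metrics port falls inside the range the platform permits for hostNetwork pods. A minimal shell sketch of that check follows; the range bounds used here are placeholders, not OpenShift's actual reserved range.

```shell
# Return success if $1 lies in the inclusive range [$2, $3].
port_in_range() {
  local port=$1 lo=$2 hi=$3
  [ "$port" -ge "$lo" ] && [ "$port" -le "$hi" ]
}

# Example with placeholder bounds: 7472 is outside [9000, 9999].
port_in_range 7472 9000 9999 && echo allowed || echo blocked   # prints "blocked"
```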

Comment 9 Federico Paolinelli 2022-05-23 07:33:30 UTC
I just filed https://bugzilla.redhat.com/show_bug.cgi?id=2089179 for tracking the backport.

Comment 10 elevin 2022-05-24 09:00:56 UTC
4.11.0-0.nightly-2022-05-18-171831
metallb-operator.4.11.0-202205191659
=====================================
Cannot find MetalLB metrics on the Prometheus pods:
******************************************************************
oc exec speaker-kvs66 -n metallb-system -- curl localhost:29151/metrics | grep metallb_bfd_control_packet_output

# HELP metallb_bfd_control_packet_output Number of sent BFD control packets
# TYPE metallb_bfd_control_packet_output counter
metallb_bfd_control_packet_output{peer="10.46.55.34"} 2763
******************************************************************
oc exec prometheus-k8s-0 -n openshift-monitoring -- curl http://localhost:9090/api/v1/query?query=metallb_bfd_control_packet_output

{"status":"success","data":{"resultType":"vector","result":[]}}
******************************************************************

The MetalLB Prometheus targets (ports 9120 & 9121) have status "down".

******************************************************************

Scrape failed
server returned HTTP status 401 Unauthorized

******************************************************************
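
The 401 Unauthorized above is consistent with kube-rbac-proxy being unable to validate Prometheus's bearer token, which is what PR 88 ("Speaker: grant create subject accessreviews / token reviews") addresses. A hedged sketch of the kind of RBAC rules that grant this follows; the ClusterRole name is an assumption, not taken from the operator's manifests.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metallb-speaker-rbac-proxy   # name is a placeholder
rules:
# Lets kube-rbac-proxy validate incoming bearer tokens.
- apiGroups: ["authentication.k8s.io"]
  resources: ["tokenreviews"]
  verbs: ["create"]
# Lets kube-rbac-proxy authorize the scraping identity.
- apiGroups: ["authorization.k8s.io"]
  resources: ["subjectaccessreviews"]
  verbs: ["create"]
```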

Comment 12 elevin 2022-05-30 06:25:29 UTC
metallb-operator.4.11.0-202205242136
OCP 4.11.0-0.nightly-2022-05-18-171831
=======================================

• [SLOW TEST:50.158 seconds]
MetalLB BGP
/home/elevin/projects/cnf-gotestMy/remove/onemore/cnf-gotests/test/network/metallb/tests/bgp-test.go:25
  updates
  /home/elevin/projects/cnf-gotestMy/remove/onemore/cnf-gotests/test/network/metallb/tests/bgp-test.go:106
    metrics
    /home/elevin/projects/cnf-gotestMy/remove/onemore/cnf-gotests/test/network/metallb/tests/bgp-test.go:173
      provides Prometheus BGP metrics
      /home/elevin/projects/cnf-gotestMy/remove/onemore/cnf-gotests/test/network/metallb/tests/bgp-test.go:200

Comment 14 errata-xmlrpc 2022-08-10 11:02:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

