Bug 1926329

Summary:

[Assisted-4.7][Staging] Monitoring stack in staging is being overloaded by the number of metrics exposed by assisted-installer pods and scraped by Prometheus.

Product:

OpenShift Container Platform

Reporter:

Yuri Obshansky <yobshans>

Component:

assisted-installer

Assignee:

Sarah Lavie <slavie>

assisted-installer sub component:

assisted-service

QA Contact:

Yuri Obshansky <yobshans>

Status:

CLOSED ERRATA

Docs Contact:

Severity:

high

Priority:

high

CC:

alazar, aos-bugs, slavie

Version:

4.7

Keywords:

Triaged

Target Milestone:

---

Target Release:

4.8.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Ai-Team-Cloud

Fixed In Version:

OCP-Metal-v1.0.17.1

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2021-07-27 22:42:26 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
Grafana screenshot	none

Description Yuri Obshansky 2021-02-08 16:23:09 UTC

Description of problem:
Hello,

You are receiving this email because you are listed as service owners for the assisted-installer service in app-interface.

AppSRE has been paged multiple times since 4am EST on Saturday, Feb 6, because the monitoring stack in staging is being overloaded by the number of metrics exposed by assisted-installer pods and scraped by Prometheus.

Notes from the investigation can be found here: https://coreos.slack.com/archives/CKN746TDW/p1612640719039800

The ServiceMonitor deployment job for _staging_ has been disabled in this MR https://gitlab.cee.redhat.com/service/app-interface/-/merge_requests/14714

My recommendation to your team is to investigate why more than 1.2 million metrics are being exposed by the service and to ensure that the number of metrics returned is reasonable. The metrics returned should be only what is necessary to monitor the service and ensure its reliability. If you have any questions, the AppSRE team can offer some guidance.
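As a rough aid for that investigation, the sketch below counts how many series each metric name contributes in a scrape of a /metrics endpoint, using the Prometheus text exposition format. It is only an illustration: the function name count_series_by_metric and the demo_* metrics in SAMPLE are hypothetical and not taken from the service, and the parsing is deliberately minimal (it only looks at the metric name at the start of each sample line).

```python
from collections import Counter


def count_series_by_metric(exposition_text: str) -> Counter:
    """Count sample lines per metric name in Prometheus text exposition output.

    Comment lines (# HELP / # TYPE) and blank lines are skipped. The metric
    name is taken as everything before the first '{' or whitespace.
    """
    counts = Counter()
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        name = line.split("{", 1)[0].split()[0]
        counts[name] += 1
    return counts


# Hypothetical scrape output, used only to exercise the function.
SAMPLE = """\
# HELP demo_requests_total Hypothetical counter.
# TYPE demo_requests_total counter
demo_requests_total 7
# TYPE demo_phase_seconds histogram
demo_phase_seconds_bucket{clusterId="a",le="1"} 0
demo_phase_seconds_bucket{clusterId="a",le="+Inf"} 2
demo_phase_seconds_bucket{clusterId="b",le="1"} 1
demo_phase_seconds_bucket{clusterId="b",le="+Inf"} 1
"""

if __name__ == "__main__":
    # Print metric names ordered by how many series each one exposes.
    for name, n in count_series_by_metric(SAMPLE).most_common():
        print(f"{n:8d}  {name}")
```

Running the same function over the text saved from a real pod's metrics endpoint would show which metric names account for most of the exposed series.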

Once the problem is fixed, an MR can be sent to re-enable the job and allow the ServiceMonitor to be deployed to staging again.

It would also be worth running additional checks to ensure this issue does not occur in production.

Thanks,

Version-Release number of selected component (if applicable):
staging v1.0.15.3


Comment 1 Yuri Obshansky 2021-02-08 16:24:52 UTC

Created attachment 1755759 [details]
Grafana screenshot

Comment 2 Yuri Obshansky 2021-03-04 15:14:21 UTC

Verified on Staging v1.0.17.1
The metric service_assisted_installer_host_installation_phase_seconds_bucket is still being collected, and most of its series come from the clusterId label.
Reopened.
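
To illustrate why a per-cluster label on a histogram can dominate the series count, here is a back-of-the-envelope sketch. The bucket count, the "phase" label, and all cardinalities are made-up assumptions for illustration and are not measurements from staging; the only fixed fact used is that a Prometheus histogram exposes one _bucket series per explicit bucket plus one for le="+Inf", for every label combination.

```python
from math import prod
from typing import Mapping


def histogram_bucket_series(explicit_buckets: int,
                            label_cardinalities: Mapping[str, int]) -> int:
    """Estimate the number of *_bucket series for one histogram metric.

    Each label combination yields one series per explicit bucket plus the
    mandatory le="+Inf" bucket. The *_sum and *_count series are not included.
    """
    per_combination = explicit_buckets + 1  # +1 for le="+Inf"
    return per_combination * prod(label_cardinalities.values())


if __name__ == "__main__":
    # Hypothetical numbers: 10 explicit buckets, 8 phases, 5000 clusters.
    with_cluster = histogram_bucket_series(10, {"phase": 8, "clusterId": 5000})
    without_cluster = histogram_bucket_series(10, {"phase": 8})
    print(with_cluster)     # 11 * 8 * 5000 = 440000
    print(without_cluster)  # 11 * 8 = 88
```

Under these assumed numbers, dropping the clusterId label shrinks the bucket series by a factor equal to the number of clusters, which is why an unbounded identifier label is usually the first thing to look at.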

Comment 3 Ronnie Lazar 2021-05-23 15:05:35 UTC

Will be resolved by Epic https://issues.redhat.com/browse/MGMT-4525

Comment 4 Sarah Lavie 2021-05-26 11:31:38 UTC

Per our meeting yesterday, moving this bug to UI for further handling as discussed (showing %).

Comment 8 errata-xmlrpc 2021-07-27 22:42:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438