Bug 1926329 - [Assisted-4.7][Staging] monitoring stack in staging is being overloaded by the amount of metrics being exposed by assisted-installer pods and scraped by prometheus.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Sarah Lavie
QA Contact: Yuri Obshansky
URL:
Whiteboard: Ai-Team-Cloud
Depends On:
Blocks:
Reported: 2021-02-08 16:23 UTC by Yuri Obshansky
Modified: 2021-07-27 22:42 UTC (History)

Fixed In Version: OCP-Metal-v1.0.17.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:42:26 UTC
Target Upstream Version:
Embargoed:


Attachments
Grafana screenshot (274.07 KB, image/png)
2021-02-08 16:24 UTC, Yuri Obshansky


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:42:59 UTC

Description Yuri Obshansky 2021-02-08 16:23:09 UTC
Description of problem:
Hello,

you are receiving this email because you are listed as service owners for the assisted-installer service in app-interface.

AppSRE has been paged multiple times since 4am EST Saturday Feb 6 because the monitoring stack in staging is being overloaded by the amount of metrics being exposed by assisted-installer pods and scraped by prometheus.

Notes of the investigation can be found here https://coreos.slack.com/archives/CKN746TDW/p1612640719039800

The ServiceMonitor deployment job for _staging_ has been disabled in this MR https://gitlab.cee.redhat.com/service/app-interface/-/merge_requests/14714

My recommendation to your team is to investigate why more than 1.2 million metrics are being exposed by the service, and to ensure that the number of metrics returned is reasonable. The metrics returned should be limited to what is necessary to monitor the service and ensure its reliability. If you have any questions, the AppSRE team can offer some guidance.

Once the problem is fixed, an MR can be sent to re-enable the job and allow the servicemonitor to be deployed to staging again.

It might also be good to do additional checks to ensure this issue does not happen in production.

Thanks,

Version-Release number of selected component (if applicable):
staging v1.0.15.3

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
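One possible starting point for the investigation recommended above (why more than 1.2 million metrics are exposed) is to count how many sample lines each metric name contributes to a metrics payload. The sketch below is only an illustration: it assumes the payload is in the Prometheus text exposition format, and the function name and the sample metric names are hypothetical, not taken from the service.

```python
import re
from collections import Counter

# Matches the metric name at the start of a sample line in the
# Prometheus text exposition format.
_NAME_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def count_series_per_metric(exposition_text: str) -> Counter:
    """Count sample lines per metric name in a metrics payload."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _NAME_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical payload for illustration only; real data would come from
# the service's own metrics endpoint.
SAMPLE = """\
# TYPE demo_phase_seconds histogram
demo_phase_seconds_bucket{clusterId="a",le="1"} 0
demo_phase_seconds_bucket{clusterId="a",le="+Inf"} 1
demo_phase_seconds_bucket{clusterId="b",le="1"} 1
demo_phase_seconds_bucket{clusterId="b",le="+Inf"} 1
demo_phase_seconds_count{clusterId="a"} 1
demo_requests_total 7
"""

if __name__ == "__main__":
    # Print metric names ordered by how many series they contribute.
    for name, n in count_series_per_metric(SAMPLE).most_common():
        print(name, n)
```

Sorting by count puts the metric names that contribute the most series at the top, which is usually where a cardinality problem shows up first.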

Comment 1 Yuri Obshansky 2021-02-08 16:24:52 UTC
Created attachment 1755759 [details]
Grafana screenshot

Comment 2 Yuri Obshansky 2021-03-04 15:14:21 UTC
Verified on Staging v1.0.17.1
The metric service_assisted_installer_host_installation_phase_seconds_bucket is still being collected,
mostly on the clusterId label.
Reopened.
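To check which label drives the series count of a single metric, as reported here for clusterId, the distinct values of each label can be counted. This is a minimal sketch, assuming the Prometheus text exposition format; the function name, the demo metric name, and the phase label in the sample are hypothetical, and only clusterId comes from this report.

```python
import re
from collections import defaultdict

# Matches one label pair such as clusterId="abc" inside the {...} part
# of a sample line, allowing backslash-escaped characters in the value.
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:\\.|[^"\\])*)"')


def label_cardinality(exposition_text: str, metric_name: str) -> dict:
    """Return {label name: number of distinct values} for one metric."""
    values = defaultdict(set)
    prefix = metric_name + "{"
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        # Only inspect the label set between the first "{" and the last "}".
        labels = line[len(prefix):line.rfind("}")]
        for name, value in _LABEL_RE.findall(labels):
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


# Hypothetical payload for illustration only.
SAMPLE_BUCKETS = """\
demo_phase_seconds_bucket{clusterId="a",phase="writing",le="1"} 0
demo_phase_seconds_bucket{clusterId="a",phase="writing",le="+Inf"} 1
demo_phase_seconds_bucket{clusterId="b",phase="writing",le="1"} 1
demo_phase_seconds_bucket{clusterId="b",phase="writing",le="+Inf"} 1
demo_phase_seconds_bucket{clusterId="c",phase="writing",le="1"} 0
demo_phase_seconds_bucket{clusterId="c",phase="writing",le="+Inf"} 2
"""

if __name__ == "__main__":
    print(label_cardinality(SAMPLE_BUCKETS, "demo_phase_seconds_bucket"))
```

A label whose distinct-value count grows with every new cluster, as a per-cluster identifier would, multiplies the number of histogram bucket series accordingly.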

Comment 3 Ronnie Lazar 2021-05-23 15:05:35 UTC
Will be resolved by Epic https://issues.redhat.com/browse/MGMT-4525

Comment 4 Sarah Lavie 2021-05-26 11:31:38 UTC
Per our meeting yesterday, moving this bug to UI to be handled further as discussed (showing %).

Comment 8 errata-xmlrpc 2021-07-27 22:42:26 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

