Bug 1926329

Summary: [Assisted-4.7][Staging] monitoring stack in staging is being overloaded by the amount of metrics being exposed by assisted-installer pods and scraped by prometheus.
Product: OpenShift Container Platform Reporter: Yuri Obshansky <yobshans>
Component: assisted-installer Assignee: Sarah Lavie <slavie>
assisted-installer sub component: assisted-service QA Contact: Yuri Obshansky <yobshans>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: alazar, aos-bugs, slavie
Version: 4.7 Keywords: Triaged
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: Ai-Team-Cloud
Fixed In Version: OCP-Metal-v1.0.17.1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:42:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Grafana screenshot none

Description Yuri Obshansky 2021-02-08 16:23:09 UTC
Description of problem:
Hello,

you are receiving this email because you are listed as service owners for the assisted-installer service in app-interface.

AppSRE has been paged multiple times since 4 AM EST on Saturday, Feb 6, because the monitoring stack in staging is being overloaded by the number of metrics exposed by the assisted-installer pods and scraped by Prometheus.

Notes of the investigation can be found here https://coreos.slack.com/archives/CKN746TDW/p1612640719039800

The ServiceMonitor deployment job for _staging_ has been disabled in this MR https://gitlab.cee.redhat.com/service/app-interface/-/merge_requests/14714

My recommendation to your team is to investigate why more than 1.2 million metrics are being exposed by the service and to ensure that the number of metrics returned is reasonable. The service should return only the metrics needed to monitor it and ensure its reliability. If you have any questions, the AppSRE team can offer some guidance.
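One way to start such an investigation is to count how many series each metric name contributes in a single scrape of the service's metrics endpoint. The following is a minimal sketch, assuming the scrape output is available as text in the Prometheus text exposition format; the function name `series_per_metric` and the `demo_*` metric names in the sample are hypothetical and are not taken from assisted-service.

```python
import re
from collections import Counter

# Matches the metric name at the start of a sample line in the
# Prometheus text exposition format.
_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def series_per_metric(exposition_text):
    """Count sample lines (one per series) for each metric name."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _NAME.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical scrape output, for illustration only.
sample = """\
# HELP demo_requests_total Hypothetical counter.
# TYPE demo_requests_total counter
demo_requests_total{code="200"} 10
demo_requests_total{code="500"} 1
demo_phase_seconds_bucket{clusterId="a",le="1"} 3
demo_phase_seconds_bucket{clusterId="a",le="+Inf"} 4
demo_phase_seconds_bucket{clusterId="b",le="1"} 0
demo_phase_seconds_bucket{clusterId="b",le="+Inf"} 2
"""

# Print metric names, largest series count first.
for name, count in series_per_metric(sample).most_common():
    print(name, count)
```

Sorting by series count puts the metrics that dominate the scrape at the top, which is usually where an unbounded label shows up.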

Once the problem is fixed, an MR can be sent to re-enable the job so that the ServiceMonitor is deployed to staging again.

It might also be good to do additional checks to ensure this issue does not happen in production.

Thanks,

Version-Release number of selected component (if applicable):
staging v1.0.15.3


Comment 1 Yuri Obshansky 2021-02-08 16:24:52 UTC
Created attachment 1755759 [details]
Grafana screenshot

Comment 2 Yuri Obshansky 2021-03-04 15:14:21 UTC
Verified on Staging v1.0.17.1
The metric service_assisted_installer_host_installation_phase_seconds_bucket is still being collected,
mostly on the clusterId label.
Reopened.
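To check which label drives the series count of a single metric, the distinct values of each label can be counted from a scrape. This is a minimal sketch under the same assumption as before (scrape output as Prometheus text exposition format); `label_cardinality` and the `demo_phase_seconds_bucket` sample data are hypothetical stand-ins, not real output from staging.

```python
import re
from collections import defaultdict

# Matches one label pair, e.g. clusterId="abc", allowing escaped characters
# inside the quoted value.
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def label_cardinality(exposition_text, metric_name):
    """Return the number of distinct values seen per label for one metric."""
    values = defaultdict(set)
    prefix = metric_name + "{"
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        # Take the text between the opening "{" and the last "}".
        labels = line[len(prefix):line.rfind("}")]
        for key, value in _LABEL.findall(labels):
            values[key].add(value)
    return {key: len(vals) for key, vals in values.items()}


# Hypothetical scrape output, for illustration only.
sample = """\
demo_phase_seconds_bucket{clusterId="a",le="1"} 3
demo_phase_seconds_bucket{clusterId="a",le="+Inf"} 4
demo_phase_seconds_bucket{clusterId="b",le="1"} 0
demo_phase_seconds_bucket{clusterId="b",le="+Inf"} 2
demo_phase_seconds_bucket{clusterId="c",le="1"} 1
demo_phase_seconds_bucket{clusterId="c",le="+Inf"} 1
"""

print(label_cardinality(sample, "demo_phase_seconds_bucket"))
```

For a histogram, every distinct value of a per-cluster label is multiplied by the number of buckets, so a label whose value set grows with each new cluster makes the series count grow without bound.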

Comment 3 Ronnie Lazar 2021-05-23 15:05:35 UTC
Will be resolved by Epic https://issues.redhat.com/browse/MGMT-4525

Comment 4 Sarah Lavie 2021-05-26 11:31:38 UTC
Per our meeting yesterday, moving this bug to UI to be handled further as discussed (showing %).

Comment 8 errata-xmlrpc 2021-07-27 22:42:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438