Description of problem:

Hello,

You are receiving this email because you are listed as service owners for the assisted-installer service in app-interface.

AppSRE has been paged multiple times since 4am EST Saturday Feb 6 because the monitoring stack in staging is being overloaded by the number of metrics exposed by assisted-installer pods and scraped by Prometheus.

Notes from the investigation can be found here: https://coreos.slack.com/archives/CKN746TDW/p1612640719039800

The ServiceMonitor deployment job for _staging_ has been disabled in this MR: https://gitlab.cee.redhat.com/service/app-interface/-/merge_requests/14714

My recommendation to your team is to investigate why more than 1.2 million metrics are being exposed by the service and to ensure that the number of metrics returned is reasonable. The metrics returned should be limited to what is necessary to monitor the service and ensure its reliability. If you have any questions, the AppSRE team can offer some guidance.

Once the problem is fixed, an MR can be sent to re-enable the job and allow the ServiceMonitor to be deployed to staging again. It might also be good to do additional checks to ensure this issue does not happen in production.

Thanks,

Version-Release number of selected component (if applicable): staging v1.0.15.3

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
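One possible way to start the investigation recommended above is to save the text that a pod returns on its metrics endpoint and count how many series each metric name contributes. The sketch below assumes the standard Prometheus text exposition format. The function name `count_series_by_metric` is hypothetical and is not part of assisted-installer or of any AppSRE tooling.

```python
from collections import Counter


def count_series_by_metric(exposition_text: str) -> Counter:
    """Count sample lines per metric name in Prometheus text exposition output.

    Hypothetical helper for illustration only. Comment lines (# HELP, # TYPE)
    and blank lines are skipped; every other line counts as one series.
    """
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or at whitespace.
        name = line.split("{", 1)[0].split(None, 1)[0]
        counts[name] += 1
    return counts
```

Calling `count_series_by_metric(text).most_common(10)` on the saved output would list the ten metric names with the most series, which is usually enough to see where the bulk of the volume comes from.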
Created attachment 1755759 [details] Grafana screenshot
Verified on Staging v1.0.17.1.

The metric service_assisted_installer_host_installation_phase_seconds_bucket is still being collected, mostly on the clusterId label.

Reopened.
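To check which label drives the series count for a single metric such as the histogram named above, the distinct values per label can be counted from the same saved metrics output. This is a minimal sketch assuming the Prometheus text exposition format. The helper name `label_cardinality` is hypothetical, and the sample values in the usage note are made up for illustration.

```python
import re
from collections import defaultdict

# Matches label="value" pairs, allowing escaped characters inside the value.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def label_cardinality(exposition_text: str, metric_name: str) -> dict:
    """Return the number of distinct values seen per label for one metric.

    Hypothetical helper for illustration only. Only lines of the form
    metric_name{...} value are considered.
    """
    values = defaultdict(set)
    prefix = metric_name + "{"
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        # Take everything between the opening '{' and the last '}'.
        label_block = line[len(prefix):line.rfind("}")]
        for label, value in LABEL_RE.findall(label_block):
            values[label].add(value)
    return {label: len(vals) for label, vals in values.items()}
```

For a histogram, each distinct clusterId value is multiplied by the number of `le` buckets, so a label with one value per cluster can dominate the total even when the bucket count is small.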
Will be resolved by Epic https://issues.redhat.com/browse/MGMT-4525
Per our meeting yesterday, moving this bug to UI to be handled further as discussed (showing %).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438