Bug 1897501
| Summary: | Can NOT preview MTC metrics on local Prometheus server after using must-gather to dump past 24h metrics data | | |
|---|---|---|---|
| Product: | Migration Toolkit for Containers | Reporter: | whu |
| Component: | General | Assignee: | Derek Whatley <dwhatley> |
| Status: | CLOSED ERRATA | QA Contact: | Xin jiang <xjiang> |
| Severity: | high | Docs Contact: | Avital Pinnick <apinnick> |
| Priority: | unspecified | | |
| Version: | 1.3.z | CC: | ernelson, rjohnson, sregidor, whu, xjiang |
| Target Milestone: | --- | | |
| Target Release: | 1.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-11 12:54:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
whu
2020-11-13 08:43:39 UTC

I can confirm seeing the same behavior on the latest available must-gather image. Intermittently for me, `make prometheus-run` will fail on startup (looks like this is due to "wal" file corruption).

Even when I was able to start a local prometheus with `make prometheus-run`, I did _not_ see the expected `cam_app_workload_migrations` metric visible in the local querying interface, despite the same metric being available on cluster. I _was_ able to see other metrics exported from the cluster.

We are doing a blanket copy of the wal, so we should be grabbing everything. Marek did most of the work on this so he knows more of the fine details, but as I understand it we borrowed our technique for copying prometheus metrics off the cluster from an OpenShift CI ansible playbook.

Our method of data extraction from the wal (write ahead log) is not officially supported by prometheus so it's unsurprising to me that it's behaving unexpectedly. The strange thing to me is that I've observed this working reliably in the past.

One possible cause that comes to mind: perhaps newer versions of OpenShift have updated their included Prometheus version with a new wal schema, and maybe we need to update the container image we're launching locally to view the files we've copied. I'll try to find out what version the on-cluster Prometheus is and report back here.

This issue seems related to https://bugzilla.redhat.com/show_bug.cgi?id=1898522

Prometheus build info on OpenShift 4.6:

| Build Information | |
|---|---|
| Version | 2.21.0 |
| Revision | 65ae9312f8eb78f710b33216aab96dc51957de0e |
| Branch | rhaos-4.6-rhel-8 |
| BuildUser | root@9fc7745b753a |
| BuildDate | 20200929-05:31:44 |
| GoVersion | go1.15.0 |

Locally running prometheus version: prom/prometheus:v2.6.0

Seen in release notes for Prometheus 2.22.0: [ENHANCEMENT] Gracefully handle unknown WAL record types. #8004 https://github.com/prometheus/prometheus/releases/tag/v2.22.0

This bug should be resolved by https://github.com/konveyor/must-gather/pull/16

I was able to view `cam_app_workload_migrations` from OCP 4.6 locally with this PR.

PR is merged. Available on master and release-1.4.0 branch. @rayford should I also add this to a release 1.3.x branch?

Verified using MTC 1.4.0 in AWS OCP 4.5

openshift-migration-must-gather-container-v1.4.0-0.8
rhmtc-openshift-migration-must-gather@sha256:9c17c38e8f0a4cb8aa885d98581ec83ef81080ea089a9aa15d8c14ccadf7cb0d

"cam_app_workload_migrations" can be queried without problems. Move to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Migration Toolkit for Containers (MTC) tool image release advisory 1.4.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5329
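The version mismatch discussed in this report (on-cluster Prometheus 2.21.0 versus a local `prom/prometheus:v2.6.0` viewer) can be checked mechanically before trying to read a copied WAL. The following is a minimal sketch, not part of must-gather: the helper names `parse_version`, `viewer_is_older`, `has_series`, and `query_local` are hypothetical, and the `http://localhost:9090` address is an assumption about where the local server listens. It relies only on the Prometheus HTTP query endpoint `/api/v1/query` and its standard JSON response shape.

```python
import json
import re
import urllib.parse
import urllib.request


def parse_version(text):
    """Extract (major, minor, patch) from e.g. '2.21.0' or 'prom/prometheus:v2.6.0'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no semantic version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def viewer_is_older(cluster_version, local_image):
    """True when the local viewer image is older than the on-cluster Prometheus.

    Versions are compared numerically; a plain string comparison would
    wrongly rank '2.6.0' above '2.21.0'.
    """
    return parse_version(local_image) < parse_version(cluster_version)


def has_series(query_response):
    """True when a decoded /api/v1/query response contains at least one series."""
    if query_response.get("status") != "success":
        return False
    return bool(query_response.get("data", {}).get("result"))


def query_local(metric, base_url="http://localhost:9090"):
    """Run an instant query against a local Prometheus (base_url is an assumption)."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": metric})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Values taken from this bug report.
    if viewer_is_older("2.21.0", "prom/prometheus:v2.6.0"):
        print("local viewer is older than the cluster Prometheus; WAL may not load fully")
```

With a local server running, `has_series(query_local("cam_app_workload_migrations"))` would indicate whether the metric was recovered from the copied data. An older viewer is only a warning sign, since this report attributes the missing metric to the version gap without establishing the exact WAL incompatibility.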