Description of problem:
After using must-gather to dump the past 24h of metrics data, MTC metrics data can NOT be previewed on a local Prometheus server.

Version-Release number of selected component (if applicable):
source cluster: GCP 3.11
target cluster: GCP 4.6
MTC: 1.3.2
image: registry.redhat.io/rhmtc/openshift-migration-rhel7-operator@sha256:28f9a1399cf480dd5de763187b1ded4dc29e663afeb6c32bfa1eda75c04b86c6

How reproducible:
Always

Steps to Reproduce:
1. Deploy MTC 1.3.2.
2. Migrate any application, such as nginx, just to generate some MTC migration data.
3. mkdir /tmp/must-gather && cd /tmp/must-gather
4. git clone https://github.com/konveyor/must-gather.git
5. oc adm must-gather --image quay.io/konveyor/must-gather:latest -- /usr/bin/gather_metrics_dump
6. make prometheus-run
7. Open the Prometheus web console (http://localhost:9090) in a browser.
8. Query "cam_app_workload_migrations" in the Prometheus web console; nothing is returned.

Actual results:
When previewing MTC metrics on the local Prometheus server, a query for "cam_app_workload_migrations" returns nothing. However, the Prometheus shipped with OCP can query the "cam_app_workload_migrations" metric.

Expected results:
MTC metrics information can be previewed from the local Prometheus server.

Additional info:

$ ll /tmp/must-gather
total 20
drwxrwxr-x. 2 hwh hwh 4096 Nov  4 21:21 collection-scripts
-rw-rw-r--. 1 hwh hwh  586 Nov  4 21:21 Dockerfile
-rw-rw-r--. 1 hwh hwh 1909 Nov 13 14:55 Makefile
drwxrwxr-x. 3 hwh hwh 4096 Nov 13 13:03 must-gather.local.6042321186268319498
-rw-rw-r--. 1 hwh hwh 2578 Nov  4 21:21 README.md

$ tree must-gather.local.6042321186268319498/
must-gather.local.6042321186268319498/
├── event-filter.html
└── quay-io-konveyor-must-gather-sha256-842d3396bbc53fe54f24f1c56048cec2c99d8e0c9cb72bab74d9cb250121ebc8
    └── metrics
        ├── /tmp/must-gather

$ tar xvf /tmp/must-gather

$ ls -lRsh prometheus/
prometheus/:
total 8.0K
4.0K drwxr-xr-x. 2 fedora fedora 4.0K Nov 13 03:31 chunks_head
4.0K drwxr-xr-x. 2 fedora fedora 4.0K Nov 13 03:55 wal

prometheus/chunks_head:
total 52M
 52M -rw-r--r--. 1 fedora fedora  52M Nov 13 04:00 000001

prometheus/wal:
total 212M
128M -rw-r--r--. 1 fedora fedora 128M Nov 13 03:55 00000000
 85M -rw-r--r--. 1 fedora fedora  85M Nov 13 04:33 00000001

$ sudo docker ps
CONTAINER ID   IMAGE                    COMMAND             CREATED          STATUS          PORTS                      NAMES
44ab1fd46fd4   prom/prometheus:v2.6.0   "/bin/prometheus"   31 minutes ago   Up 31 minutes   127.0.0.1:9090->9090/tcp   mig-metrics-prometheus

$ sudo docker exec -it mig-metrics-prometheus ls -lRsh /prometheus
/prometheus:
total 0
0 drwxrwsrwx 2 root   root     60 Nov 13 03:31 chunks_head
0 -rw-r--r-- 1 nobody nogroup   0 Nov 13 06:56 lock
0 drwxrwsrwx 2 root   root     60 Nov 13 06:56 wal

/prometheus/chunks_head:
total 53244
53244 -rwxrwxrwx 1 root root 52.0M Nov 13 04:00 000001

/prometheus/wal:
total 540
540 -rw-r--r-- 1 nobody root 537.6K Nov 13 07:27 00000000

Not sure why, but after mounting the Prometheus dump data into the local Prometheus container, the data files under "/prometheus/wal" changed: the dumped 128M and 85M WAL segments were replaced by a single 537.6K segment.
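For context, `make prometheus-run` appears to boil down to mounting the extracted dump at /prometheus inside a `prom/prometheus` container. The exact Makefile target isn't reproduced here, so this is a sketch reconstructed from the `docker ps` output above; it only echoes the docker invocation rather than executing it (no Docker daemon assumed), and the data path is an assumption:

```shell
# Dry-run sketch of what `make prometheus-run` presumably does.
# Image tag, container name, and port mapping are taken from the
# `docker ps` output above; DATA_DIR is an assumed path.
PROM_IMAGE="prom/prometheus:v2.6.0"
DATA_DIR="$(pwd)/prometheus"

# Echo (not run) the container invocation that mounts the dump read-write
# at /prometheus, which is Prometheus's default data directory.
echo docker run --rm -d --name mig-metrics-prometheus \
  -p 127.0.0.1:9090:9090 \
  -v "${DATA_DIR}:/prometheus" \
  "${PROM_IMAGE}"
```

Note that because the dump is mounted read-write at the data directory, the container's Prometheus is free to rewrite the WAL on startup, which matches the truncation observed above.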
I can confirm seeing the same behavior on the latest available must-gather image. Intermittently for me, `make prometheus-run` will fail on startup (this looks like it is due to "wal" file corruption). Even when I was able to start a local prometheus with `make prometheus-run`, I did _not_ see the expected `cam_app_workload_migrations` metric visible in the local querying interface, despite the same metric being available on cluster. I _was_ able to see other metrics exported from the cluster.

We are doing a blanket copy of the wal, so we should be grabbing everything. Marek did most of the work on this so he knows more of the fine details, but as I understand it, we borrowed our technique for copying prometheus metrics off the cluster from an OpenShift CI ansible playbook. Our method of data extraction from the wal (write-ahead log) is not officially supported by prometheus, so it's unsurprising to me that it's behaving unexpectedly. The strange thing to me is that I've observed this working reliably in the past.

One possible cause that comes to mind: perhaps newer versions of OpenShift have updated their included Prometheus version with a new wal schema, and maybe we need to update the container image we're launching locally to view the files we've copied. I'll try to find out what version the on-cluster Prometheus is and report back here.
This issue seems related to https://bugzilla.redhat.com/show_bug.cgi?id=1898522
=======================================
Prometheus build info on OpenShift 4.6:

Version    2.21.0
Revision   65ae9312f8eb78f710b33216aab96dc51957de0e
Branch     rhaos-4.6-rhel-8
BuildUser  root@9fc7745b753a
BuildDate  20200929-05:31:44
GoVersion  go1.15.0
=======================================

===================================
Locally running Prometheus version: prom/prometheus:v2.6.0
===================================
Seen in the release notes for Prometheus 2.22.0:

[ENHANCEMENT] Gracefully handle unknown WAL record types. #8004

https://github.com/prometheus/prometheus/releases/tag/v2.22.0
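Given the version gap above (the WAL was written by the on-cluster Prometheus 2.21.0 but read locally by 2.6.0), one plausible local workaround is to launch the viewer with an image at least as new as the on-cluster Prometheus. This is a sketch of that idea, not necessarily what the eventual fix does; the v2.22.0 tag is my suggestion, and as before the command is echoed rather than executed:

```shell
# Sketch: use a local Prometheus new enough to read a 2.21.0-era WAL.
# v2.22.0 also gracefully handles unknown WAL record types (see the
# release note above). Tag choice is an assumption, not the shipped fix.
PROM_IMAGE="prom/prometheus:v2.22.0"

# Echo (not run) the same container invocation as before, with the newer image.
echo docker run --rm -d --name mig-metrics-prometheus \
  -p 127.0.0.1:9090:9090 \
  -v "$(pwd)/prometheus:/prometheus" \
  "${PROM_IMAGE}"
```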
This bug should be resolved by https://github.com/konveyor/must-gather/pull/16. I was able to view `cam_app_workload_migrations` from OCP 4.6 locally with this PR.
The PR is merged and available on the master and release-1.4.0 branches. @rayford, should I also add this to a release-1.3.x branch?
Verified using MTC 1.4.0 in AWS OCP 4.5:

openshift-migration-must-gather-container-v1.4.0-0.8
rhmtc-openshift-migration-must-gather@sha256:9c17c38e8f0a4cb8aa885d98581ec83ef81080ea089a9aa15d8c14ccadf7cb0d

"cam_app_workload_migrations" can be queried without problems. Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Migration Toolkit for Containers (MTC) tool image release advisory 1.4.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5329