Description of problem:
After using must-gather to dump the past 24h of metrics data, MTC metrics data can NOT be previewed on a local Prometheus server.

Version-Release number of selected component (if applicable):
source cluster: GCP 3.11
target cluster: GCP 4.6
MTC: 1.3.2
image: registry.redhat.io/rhmtc/openshift-migration-rhel7-operator@sha256:28f9a1399cf480dd5de763187b1ded4dc29e663afeb6c32bfa1eda75c04b86c6

How reproducible:
Always

Steps to Reproduce:
1. Deploy MTC 1.3.2.
2. Migrate any application, such as nginx, just to generate some MTC migration data.
3. mkdir /tmp/must-gather && cd /tmp/must-gather
4. git clone https://github.com/konveyor/must-gather.git
5. oc adm must-gather --image quay.io/konveyor/must-gather:latest -- /usr/bin/gather_metrics_dump
6. make prometheus-run
7. Open the Prometheus web console (http://localhost:9090) in a browser.
8. Query "cam_app_workload_migrations" in the Prometheus web console; nothing is returned.

Actual results:
When previewing MTC metrics on the local Prometheus server, a query for "cam_app_workload_migrations" returns nothing. However, the Prometheus shipped with OCP can query the "cam_app_workload_migrations" metric.

Expected results:
MTC metrics information can be previewed from the local Prometheus server.

Additional info:

$ ll /tmp/must-gather
total 20
drwxrwxr-x. 2 hwh hwh 4096 Nov  4 21:21 collection-scripts
-rw-rw-r--. 1 hwh hwh  586 Nov  4 21:21 Dockerfile
-rw-rw-r--. 1 hwh hwh 1909 Nov 13 14:55 Makefile
drwxrwxr-x. 3 hwh hwh 4096 Nov 13 13:03 must-gather.local.6042321186268319498
-rw-rw-r--. 1 hwh hwh 2578 Nov  4 21:21 README.md

$ tree must-gather.local.6042321186268319498/
must-gather.local.6042321186268319498/
├── event-filter.html
└── quay-io-konveyor-must-gather-sha256-842d3396bbc53fe54f24f1c56048cec2c99d8e0c9cb72bab74d9cb250121ebc8
    └── metrics
        ├── /tmp/must-gather

$ tar xvf /tmp/must-gather

$ ls -lRsh prometheus/
prometheus/:
total 8.0K
4.0K drwxr-xr-x. 2 fedora fedora 4.0K Nov 13 03:31 chunks_head
4.0K drwxr-xr-x. 2 fedora fedora 4.0K Nov 13 03:55 wal

prometheus/chunks_head:
total 52M
 52M -rw-r--r--. 1 fedora fedora  52M Nov 13 04:00 000001

prometheus/wal:
total 212M
128M -rw-r--r--. 1 fedora fedora 128M Nov 13 03:55 00000000
 85M -rw-r--r--. 1 fedora fedora  85M Nov 13 04:33 00000001

$ sudo docker ps
CONTAINER ID   IMAGE                    COMMAND             CREATED          STATUS          PORTS                      NAMES
44ab1fd46fd4   prom/prometheus:v2.6.0   "/bin/prometheus"   31 minutes ago   Up 31 minutes   127.0.0.1:9090->9090/tcp   mig-metrics-prometheus

$ sudo docker exec -it mig-metrics-prometheus ls -lRsh /prometheus
/prometheus:
total 0
0 drwxrwsrwx 2 root   root     60 Nov 13 03:31 chunks_head
0 -rw-r--r-- 1 nobody nogroup   0 Nov 13 06:56 lock
0 drwxrwsrwx 2 root   root     60 Nov 13 06:56 wal

/prometheus/chunks_head:
total 53244
53244 -rwxrwxrwx 1 root root 52.0M Nov 13 04:00 000001

/prometheus/wal:
total 540
540 -rw-r--r-- 1 nobody root 537.6K Nov 13 07:27 00000000

Not sure why, but after mounting the Prometheus dump data into the local Prometheus container, the data files under "/prometheus/wal" changed: the dumped 128M and 85M WAL segments were replaced by a single 537.6K segment.
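For context, `make prometheus-run` appears to boil down to mounting the extracted dump at /prometheus inside a `prom/prometheus` container. The exact Makefile target isn't reproduced here, so this is a sketch reconstructed from the `docker ps` output above; it only echoes the docker invocation rather than executing it (no Docker daemon assumed), and the data path is an assumption:

```shell
# Dry-run sketch of what `make prometheus-run` presumably does.
# Image tag, container name, and port mapping are taken from the
# `docker ps` output above; DATA_DIR is an assumed path.
PROM_IMAGE="prom/prometheus:v2.6.0"
DATA_DIR="$(pwd)/prometheus"

# Echo (not run) the container invocation that mounts the dump read-write
# at /prometheus, which is Prometheus's default data directory.
echo docker run --rm -d --name mig-metrics-prometheus \
  -p 127.0.0.1:9090:9090 \
  -v "${DATA_DIR}:/prometheus" \
  "${PROM_IMAGE}"
```

Note that because the dump is mounted read-write at the data directory, the container's Prometheus is free to rewrite the WAL on startup, which matches the truncation observed above.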
I can confirm seeing the same behavior on the latest available must-gather image. Intermittently for me, `make prometheus-run` will fail on startup (this looks like it is due to "wal" file corruption). Even when I was able to start a local prometheus with `make prometheus-run`, I did _not_ see the expected `cam_app_workload_migrations` metric visible in the local querying interface, despite the same metric being available on cluster. I _was_ able to see other metrics exported from the cluster.

We are doing a blanket copy of the wal, so we should be grabbing everything. Marek did most of the work on this so he knows more of the fine details, but as I understand it, we borrowed our technique for copying prometheus metrics off the cluster from an OpenShift CI ansible playbook. Our method of data extraction from the wal (write-ahead log) is not officially supported by prometheus, so it's unsurprising to me that it's behaving unexpectedly. The strange thing to me is that I've observed this working reliably in the past.

One possible cause that comes to mind: perhaps newer versions of OpenShift have updated their included Prometheus version with a new wal schema, and maybe we need to update the container image we're launching locally to view the files we've copied. I'll try to find out what version the on-cluster Prometheus is and report back here.
This issue seems related to https://bugzilla.redhat.com/show_bug.cgi?id=1898522
=======================================
Prometheus build info on OpenShift 4.6:

Version    2.21.0
Revision   65ae9312f8eb78f710b33216aab96dc51957de0e
Branch     rhaos-4.6-rhel-8
BuildUser  root@9fc7745b753a
BuildDate  20200929-05:31:44
GoVersion  go1.15.0
=======================================

===================================
Locally running Prometheus version: prom/prometheus:v2.6.0
===================================
Seen in the release notes for Prometheus 2.22.0:

[ENHANCEMENT] Gracefully handle unknown WAL record types. #8004

https://github.com/prometheus/prometheus/releases/tag/v2.22.0
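Given the version gap above (the WAL was written by the on-cluster Prometheus 2.21.0 but read locally by 2.6.0), one plausible local workaround is to launch the viewer with an image at least as new as the on-cluster Prometheus. This is a sketch of that idea, not necessarily what the eventual fix does; the v2.22.0 tag is my suggestion, and as before the command is echoed rather than executed:

```shell
# Sketch: use a local Prometheus new enough to read a 2.21.0-era WAL.
# v2.22.0 also gracefully handles unknown WAL record types (see the
# release note above). Tag choice is an assumption, not the shipped fix.
PROM_IMAGE="prom/prometheus:v2.22.0"

# Echo (not run) the same container invocation as before, with the newer image.
echo docker run --rm -d --name mig-metrics-prometheus \
  -p 127.0.0.1:9090:9090 \
  -v "$(pwd)/prometheus:/prometheus" \
  "${PROM_IMAGE}"
```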
This bug should be resolved by https://github.com/konveyor/must-gather/pull/16. I was able to view `cam_app_workload_migrations` from OCP 4.6 locally with this PR.
The PR is merged and available on the master and release-1.4.0 branches. @rayford, should I also add this to a release-1.3.x branch?
Verified using MTC 1.4.0 in AWS OCP 4.5:

openshift-migration-must-gather-container-v1.4.0-0.8
rhmtc-openshift-migration-must-gather@sha256:9c17c38e8f0a4cb8aa885d98581ec83ef81080ea089a9aa15d8c14ccadf7cb0d

"cam_app_workload_migrations" can be queried without problems. Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Migration Toolkit for Containers (MTC) tool image release advisory 1.4.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5329