Bug 1898522

Summary:	local prometheus server can not load the prom_data.tar.gz due to dump file has problem
Product:	Migration Toolkit for Containers	Reporter:	whu
Component:	General	Assignee:	Derek Whatley <dwhatley>
Status:	CLOSED ERRATA	QA Contact:	Xin jiang <xjiang>
Severity:	urgent	Docs Contact:	Avital Pinnick <apinnick>
Priority:	unspecified
Version:	1.3.z	CC:	ernelson, maufart, rjohnson, sregidor, whu, xjiang
Target Milestone:	---
Target Release:	1.4.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-02-11 12:54:49 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description whu 2020-11-17 12:34:34 UTC

Description of Problem:
After useing command "oc adm must-gather --image=brew.registry.redhat.io/rh-osbs/rhmtc-openshift-migration-must-gather:v1.3.2 -- /usr/bin/gather_metrics_dump" to dump past 24h prometheus metrics, container "mig-metrics-prometheus" can not start due to the dump file "data/wal/checkpoint.00000001/00000000" has problem.


How Reproducible:
Always

Steps to Reproduce:
1.  deploy MTC 1.3.2 
2.  migrate any application, such as ngnix. just for preparing some MTC migration data. 
3. mkdir /tmp/must-gather  && cd /tmp/must-gather
4. git clone https://github.com/konveyor/must-gather.git
5. cd /tmp/must-gather/must-gather/
6. oc adm must-gather --image=brew.registry.redhat.io/rh-osbs/rhmtc-openshift-migration-must-gather:v1.3.2  -- /usr/bin/gather_metrics_dump
7. sudo make prometheus-run
8. open Prometheus web console http://localhost:9090 by browser 


Actual Results:
Container "mig-metrics-prometheus" can not start due to the dump file "data/wal/checkpoint.00000001/00000000" has problem. The local Prometheus web console http://localhost:9090 can not be accessed.


Expected Results:
Container "mig-metrics-prometheus" can start successfully, The local Prometheus web console http://localhost:9090 can be accessed.


Additional info:

$ oc adm must-gather --image=brew.registry.redhat.io/rh-osbs/rhmtc-openshift-migration-must-gather:v1.3.2 -- /usr/bin/gather_metrics_dump
[must-gather      ] OUT Using must-gather plugin-in image: brew.registry.redhat.io/rh-osbs/rhmtc-openshift-migration-must-gather:v1.3.2
[must-gather      ] OUT namespace/openshift-must-gather-rsx5m created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-5vbtq created
[must-gather      ] OUT pod for plug-in image brew.registry.redhat.io/rh-osbs/rhmtc-openshift-migration-must-gather:v1.3.2 created
[must-gather-47dw7] POD Running Prometheus data files last day dump ... (might take a while)
[must-gather-47dw7] POD Defaulting container name to prometheus.
[must-gather-47dw7] POD Use 'oc describe pod/prometheus-k8s-0 -n openshift-monitoring' to see all of the containers in this pod.
[must-gather-47dw7] POD tar: Removing leading `/' from member names
[must-gather-47dw7] POD /prometheus/
[must-gather-47dw7] POD /prometheus/wal/
[must-gather-47dw7] POD /prometheus/wal/00000002
[must-gather-47dw7] POD /prometheus/wal/00000003
[must-gather-47dw7] POD /prometheus/wal/00000004
[must-gather-47dw7] POD tar: /prometheus/wal/00000004: file changed as we read it
[must-gather-47dw7] POD /prometheus/wal/checkpoint.00000001/
[must-gather-47dw7] POD /prometheus/wal/checkpoint.00000001/00000000
[must-gather-47dw7] POD /prometheus/chunks_head/
[must-gather-47dw7] POD /prometheus/chunks_head/000001
[must-gather-47dw7] POD /prometheus/chunks_head/000002
[must-gather-47dw7] POD /prometheus/chunks_head/000003
[must-gather-47dw7] POD /prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/
[must-gather-47dw7] POD /prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/chunks/
[must-gather-47dw7] POD /prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/chunks/000001
[must-gather-47dw7] POD /prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/index
[must-gather-47dw7] POD /prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/meta.json
[must-gather-47dw7] POD /prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/tombstones
[must-gather-47dw7] POD command terminated with exit code 1
[must-gather-47dw7] OUT waiting for gather to complete
[must-gather-47dw7] OUT downloading gather output
[must-gather-47dw7] OUT receiving incremental file list
[must-gather-47dw7] OUT ./
[must-gather-47dw7] OUT metrics/
[must-gather-47dw7] OUT metrics/prom_data.tar.gz
[must-gather-47dw7] OUT 
[must-gather-47dw7] OUT sent 54 bytes  received 243,159,469 bytes  54,035,449.56 bytes/sec
[must-gather-47dw7] OUT total size is 243,099,967  speedup is 1.00
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-5vbtq deleted
[must-gather      ] OUT namespace/openshift-must-gather-rsx5m deleted

$ tree must-gather.local.6406855851152770461/
must-gather.local.6406855851152770461/
├── brew-registry-redhat-io-rh-osbs-rhmtc-openshift-migration-must-gather-sha256-32f4504d7954402256d74104d5f8f549ced49278f1a1a33ca6a1b3f9e99a2649
│   └── metrics
│       └── prom_data.tar.gz
└── event-filter.html

$ sudo make prometheus-run
# delete data files directly from the container to allow delete data directory from outside of the container
docker exec mig-metrics-prometheus rm -rf /prometheus || true
Error: No such container: mig-metrics-prometheus
docker rm -f mig-metrics-prometheus || true
Error: No such container: mig-metrics-prometheus
test -f "./must-gather.local.6406855851152770461/brew-registry-redhat-io-rh-osbs-rhmtc-openshift-migration-must-gather-sha256-32f4504d7954402256d74104d5f8f549ced49278f1a1a33ca6a1b3f9e99a2649/metrics/prom_data.tar.gz" || (echo "Error: Prometheus archive file does not exist. Specify valid file in PROMETHEUS_DUMP_PATH environment variable."; exit 1)
rm -rf /tmp/mig-prometheus-data-dump
mkdir -p /tmp/mig-prometheus-data-dump
tar xvf ./must-gather.local.6406855851152770461/brew-registry-redhat-io-rh-osbs-rhmtc-openshift-migration-must-gather-sha256-32f4504d7954402256d74104d5f8f549ced49278f1a1a33ca6a1b3f9e99a2649/metrics/prom_data.tar.gz -C /tmp/mig-prometheus-data-dump --strip-components=1 --no-same-owner
prometheus/wal/
prometheus/wal/00000002
prometheus/wal/00000003
prometheus/wal/00000004
prometheus/wal/checkpoint.00000001/
prometheus/wal/checkpoint.00000001/00000000
prometheus/chunks_head/
prometheus/chunks_head/000001
prometheus/chunks_head/000002
prometheus/chunks_head/000003
prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/
prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/chunks/
prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/chunks/000001
prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/index
prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/meta.json
prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/tombstones
chmod 777 -R /tmp/mig-prometheus-data-dump
docker run -d \
  --mount type=bind,source=/tmp/mig-prometheus-data-dump,target=/etc/prometheus/data \
  --name mig-metrics-prometheus \
  --publish 127.0.0.1:9090:9090 \
  prom/prometheus:v2.6.0 \
&& echo "Started Prometheus on http://localhost:9090"
c06da2b6137137b32f5161972a9bfe54ef1d8c205d5aab9cc5650c1cfaa44259
Started Prometheus on http://localhost:9090



$ sudo docker ps -a
CONTAINER ID        IMAGE                    COMMAND             CREATED             STATUS                     PORTS               NAMES
c06da2b61371        prom/prometheus:v2.6.0   "/bin/prometheus"   4 seconds ago       Exited (1) 3 seconds ago   mig-metrics-prometheus
                    
                    
[hwh@bogon must-gather]$ sudo docker logs c06da2b61371
level=info ts=2020-11-17T11:48:11.944057435Z caller=main.go:243 msg="Starting Prometheus" version="(version=2.6.0, branch=HEAD, revision=dbd1d58c894775c0788470944b818cc724f550fb)"
level=info ts=2020-11-17T11:48:11.944116896Z caller=main.go:244 build_context="(go=go1.11.3, user=root@bf5760470f13, date=20181217-15:14:46)"
level=info ts=2020-11-17T11:48:11.944151594Z caller=main.go:245 host_details="(Linux 5.0.9-301.fc30.x86_64 #1 SMP Tue Apr 23 23:57:35 UTC 2019 x86_64 c06da2b61371 (none))"
level=info ts=2020-11-17T11:48:11.944186403Z caller=main.go:246 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2020-11-17T11:48:11.944216049Z caller=main.go:247 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2020-11-17T11:48:11.944778711Z caller=main.go:561 msg="Starting TSDB ..."
level=info ts=2020-11-17T11:48:11.945107839Z caller=web.go:429 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-11-17T11:48:11.945390775Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1605600122676 maxt=1605607200000 ulid=01EQAXWSPGPS58NWQEDZZKQKJN
level=warn ts=2020-11-17T11:48:11.946715337Z caller=wal.go:116 component=tsdb msg="last page of the wal is torn, filling it with zeros" segment=data/wal/00000004
level=info ts=2020-11-17T11:48:11.975366918Z caller=main.go:430 msg="Stopping scrape discovery manager..."
level=info ts=2020-11-17T11:48:11.975402445Z caller=main.go:444 msg="Stopping notify discovery manager..."
level=info ts=2020-11-17T11:48:11.975411327Z caller=main.go:466 msg="Stopping scrape manager..."
level=info ts=2020-11-17T11:48:11.975422699Z caller=main.go:440 msg="Notify discovery manager stopped"
level=info ts=2020-11-17T11:48:11.975465952Z caller=manager.go:664 component="rule manager" msg="Stopping rule manager..."
level=info ts=2020-11-17T11:48:11.975546863Z caller=manager.go:670 component="rule manager" msg="Rule manager stopped"
level=info ts=2020-11-17T11:48:11.975565077Z caller=notifier.go:521 component=notifier msg="Stopping notification manager..."
level=info ts=2020-11-17T11:48:11.975597645Z caller=main.go:615 msg="Notifier manager stopped"
level=info ts=2020-11-17T11:48:11.975481444Z caller=main.go:460 msg="Scrape manager stopped"
level=info ts=2020-11-17T11:48:11.975460368Z caller=main.go:426 msg="Scrape discovery manager stopped"
level=error ts=2020-11-17T11:48:11.975720503Z caller=main.go:624 err="opening storage failed: read WAL: backfill checkpoint: read records: corruption in segment data/wal/checkpoint.00000001/00000000 at 1231: unexpected record type 9"

Comment 1 Marek Aufart 2020-11-18 08:37:18 UTC

Hi, the reading of WAL data _could_ fail on starting the local container. It is read from filesystem without stopping Prometheus service in the source OCP containers, so WAL is captured as nice to have to ensure we have the most up-to-date data (if it works).

Generally all data should be in prometheus/<SOME_ID>/ data directories, but Prometheus persists samples to data directories in batches. The batch size and frequency depends on Prometheus config and amount of captured metrics samples. In my testing, the worst case was that data files were missing last hour of samples (this was reason to capture also WAL which sometimes is readable from the dump sometimes not).

Assuming the main problem is not seeing expected metric samples in locally running Prometheus instance - could you try execute the must-gather for metrics 30mins or better 1 hour after a testing migration has finished?

Thanks

Comment 2 Marek Aufart 2020-11-18 15:56:04 UTC

Based on more detailed information, the time delay before dumping the metrics shouldn't cause error blocking start prometheus container locally. I'm going to do more experiments with downstream must-gather image and potentially skip WAL directory.

Comment 4 Derek Whatley 2020-12-15 20:06:06 UTC

I was able to reproduce this problem intermittently, wrote up more details in https://bugzilla.redhat.com/show_bug.cgi?id=1897501#c1

At this time I'm not aware of a solution. The first thing I'm going to try is making sure my local prometheus version matches that of the on-cluster Prometheus that generated the wal data.

Comment 5 Derek Whatley 2020-12-15 20:29:23 UTC

@whu I think this issue is caused by a gap between the on-cluster Prometheus version and the version we run locally to view the data.

```
level=error ts=2020-11-17T11:48:11.975720503Z caller=main.go:624 err="opening storage failed: read WAL: backfill checkpoint: read records: corruption in segment data/wal/checkpoint.00000001/00000000 at 1231: unexpected record type 9"
```

I was able to reproduce using the latest available Makefile in the must-gather repo. The fix for me was to change the must-gather version that is launched locally to match the version on the OpenShift cluster. For OpenShift 4.6 I found that Prometheus v2.21.0 was running on-cluster, so I modified my Makefile to use the same.

git diff Makefile
diff --git a/Makefile b/Makefile
index 25ecb76..3ba8bc2 100644
--- a/Makefile
+++ b/Makefile
@@ -21,7 +21,7 @@ prometheus-run: prometheus-cleanup-container prometheus-load-dump
          --mount type=bind,source=${PROMETHEUS_LOCAL_DATA_DIR},target=/etc/prometheus/data \
          --name mig-metrics-prometheus \
          --publish 127.0.0.1:9090:9090 \
-         prom/prometheus:v2.6.0 \
+         prom/prometheus:v2.21.0 \
        && echo "Started Prometheus on http://localhost:9090"
 
 prometheus-load-dump: prometheus-check-archive-file prometheus-cleanup-data


PR posted https://github.com/konveyor/must-gather/pull/16 . Worth seeing if this will fix the issue across the board for OCP 4.x.

Comment 6 Derek Whatley 2020-12-15 20:56:37 UTC

This bug should be resolved by https://github.com/konveyor/must-gather/pull/16

I was able to view `cam_app_workload_migrations` from OCP 4.6 locally with this PR.

Comment 7 Derek Whatley 2020-12-16 15:58:30 UTC

PR is merged. Available on master and release-1.4.0 branch. @rayford should I also add this to a release 1.3.x branch?

Comment 13 Sergio 2021-01-19 16:54:02 UTC

Verified using MTC 1.4.0 openshift-migration-must-gather-container-v1.4.0-0.6

rhmtc-openshift-migration-must-gather@sha256:8f9d760d574e58af86097097af1cf5534c73e1bcc4e50bdf6788641fef7fc59c


The pod could run without problems and the prometheus application could be used.


Moved to VERIFIED status.

Comment 15 errata-xmlrpc 2021-02-11 12:54:49 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Migration Toolkit for Containers (MTC) tool image release advisory 1.4.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5329

Comment 16 Red Hat Bugzilla 2023-09-15 00:51:18 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days