Bug 1898522
| Summary: | local Prometheus server cannot load prom_data.tar.gz because the dump file is corrupt | | |
|---|---|---|---|
| Product: | Migration Toolkit for Containers | Reporter: | whu |
| Component: | General | Assignee: | Derek Whatley <dwhatley> |
| Status: | CLOSED ERRATA | QA Contact: | Xin jiang <xjiang> |
| Severity: | urgent | Docs Contact: | Avital Pinnick <apinnick> |
| Priority: | unspecified | ||
| Version: | 1.3.z | CC: | ernelson, maufart, rjohnson, sregidor, whu, xjiang |
| Target Milestone: | --- | ||
| Target Release: | 1.4.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-11 12:54:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
whu
2020-11-17 12:34:34 UTC
Hi, reading the WAL data _could_ fail when starting the local container. The WAL is read from the filesystem without stopping the Prometheus service in the source OCP containers, so it is captured on a best-effort basis to ensure we have the most up-to-date data (if it works). Generally all data should be in the prometheus/<SOME_ID>/ data directories, but Prometheus persists samples to those directories in batches; the batch size and frequency depend on the Prometheus configuration and the volume of captured metric samples. In my testing, the worst case was that the data files were missing the last hour of samples (which is why the WAL is also captured, even though it is sometimes readable from the dump and sometimes not). Assuming the main problem is not seeing the expected metric samples in the locally running Prometheus instance, could you try executing the must-gather for metrics 30 minutes, or better one hour, after a test migration has finished? Thanks

Based on more detailed information, the time delay before dumping the metrics shouldn't cause an error that blocks starting the Prometheus container locally. I'm going to run more experiments with the downstream must-gather image and potentially skip the WAL directory.

I was able to reproduce this problem intermittently and wrote up more details in https://bugzilla.redhat.com/show_bug.cgi?id=1897501#c1. At this time I'm not aware of a solution. The first thing I'm going to try is making sure my local Prometheus version matches that of the on-cluster Prometheus that generated the WAL data.

@whu I think this issue is caused by a gap between the on-cluster Prometheus version and the version we run locally to view the data:
```
level=error ts=2020-11-17T11:48:11.975720503Z caller=main.go:624 err="opening storage failed: read WAL: backfill checkpoint: read records: corruption in segment data/wal/checkpoint.00000001/00000000 at 1231: unexpected record type 9"
```
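As a quick way to confirm the suspected version gap, one can compare the Prometheus version running on the cluster with the image launched locally. The following is only a minimal sketch, not taken from this report; it assumes the default openshift-monitoring stack (prometheus-k8s-0 pod) and that podman is used locally, as in the Makefile below.

```
# Print the on-cluster Prometheus version (assumes the default
# openshift-monitoring namespace and the prometheus-k8s-0 pod name).
oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- prometheus --version

# Print the version of the image that would be launched locally.
podman run --rm prom/prometheus:v2.21.0 --version
```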
I was able to reproduce this using the latest available Makefile in the must-gather repo. The fix for me was to change the Prometheus version that is launched locally to match the version running on the OpenShift cluster. For OpenShift 4.6 I found that Prometheus v2.21.0 was running on-cluster, so I modified my Makefile to use the same:
```
git diff Makefile

diff --git a/Makefile b/Makefile
index 25ecb76..3ba8bc2 100644
--- a/Makefile
+++ b/Makefile
@@ -21,7 +21,7 @@ prometheus-run: prometheus-cleanup-container prometheus-load-dump
 --mount type=bind,source=${PROMETHEUS_LOCAL_DATA_DIR},target=/etc/prometheus/data \
 --name mig-metrics-prometheus \
 --publish 127.0.0.1:9090:9090 \
- prom/prometheus:v2.6.0 \
+ prom/prometheus:v2.21.0 \
 && echo "Started Prometheus on http://localhost:9090"

 prometheus-load-dump: prometheus-check-archive-file prometheus-cleanup-data
```
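With the image version bumped, a minimal sanity check of the fix might look like the sketch below. The `prometheus-run` target and the published port come from the diff above; the `curl` call simply hits the standard Prometheus HTTP query API for the MTC metric mentioned in the verification comment further down. This is an illustrative sequence, not a documented procedure.

```
# Launch the local Prometheus container against the extracted dump
# (target name taken from the Makefile diff above).
make prometheus-run

# Query a known MTC metric through the standard Prometheus HTTP API to
# confirm that the restored TSDB/WAL data is actually readable.
curl -s 'http://localhost:9090/api/v1/query?query=cam_app_workload_migrations'
```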
PR posted: https://github.com/konveyor/must-gather/pull/16. Worth seeing if this will fix the issue across the board for OCP 4.x.

This bug should be resolved by https://github.com/konveyor/must-gather/pull/16. I was able to view `cam_app_workload_migrations` from OCP 4.6 locally with this PR.

PR is merged. Available on the master and release-1.4.0 branches. @rayford should I also add this to a release 1.3.x branch?

Verified using MTC 1.4.0, openshift-migration-must-gather-container-v1.4.0-0.6, rhmtc-openshift-migration-must-gather@sha256:8f9d760d574e58af86097097af1cf5534c73e1bcc4e50bdf6788641fef7fc59c. The pod ran without problems and the Prometheus application could be used. Moved to VERIFIED status.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Migration Toolkit for Containers (MTC) tool image release advisory 1.4.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5329

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
This bug should be resolved by https://github.com/konveyor/must-gather/pull/16 I was able to view `cam_app_workload_migrations` from OCP 4.6 locally with this PR. PR is merged. Available on master and release-1.4.0 branch. @rayford should I also add this to a release 1.3.x branch? Verified using MTC 1.4.0 openshift-migration-must-gather-container-v1.4.0-0.6 rhmtc-openshift-migration-must-gather@sha256:8f9d760d574e58af86097097af1cf5534c73e1bcc4e50bdf6788641fef7fc59c The pod could run without problems and the prometheus application could be used. Moved to VERIFIED status. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Migration Toolkit for Containers (MTC) tool image release advisory 1.4.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5329 The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days |