Description of Problem:
After using the command "oc adm must-gather --image=brew.registry.redhat.io/rh-osbs/rhmtc-openshift-migration-must-gather:v1.3.2 -- /usr/bin/gather_metrics_dump" to dump the past 24h of Prometheus metrics, the container "mig-metrics-prometheus" cannot start because the dump file "data/wal/checkpoint.00000001/00000000" is corrupted.

How Reproducible:
Always

Steps to Reproduce:
1. Deploy MTC 1.3.2
2. Migrate any application, such as nginx, just to prepare some MTC migration data
3. mkdir /tmp/must-gather && cd /tmp/must-gather
4. git clone https://github.com/konveyor/must-gather.git
5. cd /tmp/must-gather/must-gather/
6. oc adm must-gather --image=brew.registry.redhat.io/rh-osbs/rhmtc-openshift-migration-must-gather:v1.3.2 -- /usr/bin/gather_metrics_dump
7. sudo make prometheus-run
8. Open the Prometheus web console http://localhost:9090 in a browser

Actual Results:
The container "mig-metrics-prometheus" cannot start because the dump file "data/wal/checkpoint.00000001/00000000" is corrupted, so the local Prometheus web console http://localhost:9090 cannot be accessed.

Expected Results:
The container "mig-metrics-prometheus" starts successfully and the local Prometheus web console http://localhost:9090 can be accessed.

Additional info:

```
$ oc adm must-gather --image=brew.registry.redhat.io/rh-osbs/rhmtc-openshift-migration-must-gather:v1.3.2 -- /usr/bin/gather_metrics_dump
[must-gather ] OUT Using must-gather plugin-in image: brew.registry.redhat.io/rh-osbs/rhmtc-openshift-migration-must-gather:v1.3.2
[must-gather ] OUT namespace/openshift-must-gather-rsx5m created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-5vbtq created
[must-gather ] OUT pod for plug-in image brew.registry.redhat.io/rh-osbs/rhmtc-openshift-migration-must-gather:v1.3.2 created
[must-gather-47dw7] POD Running Prometheus data files last day dump ... (might take a while)
[must-gather-47dw7] POD Defaulting container name to prometheus.
[must-gather-47dw7] POD Use 'oc describe pod/prometheus-k8s-0 -n openshift-monitoring' to see all of the containers in this pod.
[must-gather-47dw7] POD tar: Removing leading `/' from member names
[must-gather-47dw7] POD /prometheus/
[must-gather-47dw7] POD /prometheus/wal/
[must-gather-47dw7] POD /prometheus/wal/00000002
[must-gather-47dw7] POD /prometheus/wal/00000003
[must-gather-47dw7] POD /prometheus/wal/00000004
[must-gather-47dw7] POD tar: /prometheus/wal/00000004: file changed as we read it
[must-gather-47dw7] POD /prometheus/wal/checkpoint.00000001/
[must-gather-47dw7] POD /prometheus/wal/checkpoint.00000001/00000000
[must-gather-47dw7] POD /prometheus/chunks_head/
[must-gather-47dw7] POD /prometheus/chunks_head/000001
[must-gather-47dw7] POD /prometheus/chunks_head/000002
[must-gather-47dw7] POD /prometheus/chunks_head/000003
[must-gather-47dw7] POD /prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/
[must-gather-47dw7] POD /prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/chunks/
[must-gather-47dw7] POD /prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/chunks/000001
[must-gather-47dw7] POD /prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/index
[must-gather-47dw7] POD /prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/meta.json
[must-gather-47dw7] POD /prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/tombstones
[must-gather-47dw7] POD command terminated with exit code 1
[must-gather-47dw7] OUT waiting for gather to complete
[must-gather-47dw7] OUT downloading gather output
[must-gather-47dw7] OUT receiving incremental file list
[must-gather-47dw7] OUT ./
[must-gather-47dw7] OUT metrics/
[must-gather-47dw7] OUT metrics/prom_data.tar.gz
[must-gather-47dw7] OUT
[must-gather-47dw7] OUT sent 54 bytes received 243,159,469 bytes 54,035,449.56 bytes/sec
[must-gather-47dw7] OUT total size is 243,099,967 speedup is 1.00
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-5vbtq deleted
[must-gather ] OUT namespace/openshift-must-gather-rsx5m deleted

$ tree must-gather.local.6406855851152770461/
must-gather.local.6406855851152770461/
├── brew-registry-redhat-io-rh-osbs-rhmtc-openshift-migration-must-gather-sha256-32f4504d7954402256d74104d5f8f549ced49278f1a1a33ca6a1b3f9e99a2649
│   └── metrics
│       └── prom_data.tar.gz
└── event-filter.html

$ sudo make prometheus-run
# delete data files directly from the container to allow delete data directory from outside of the container
docker exec mig-metrics-prometheus rm -rf /prometheus || true
Error: No such container: mig-metrics-prometheus
docker rm -f mig-metrics-prometheus || true
Error: No such container: mig-metrics-prometheus
test -f "./must-gather.local.6406855851152770461/brew-registry-redhat-io-rh-osbs-rhmtc-openshift-migration-must-gather-sha256-32f4504d7954402256d74104d5f8f549ced49278f1a1a33ca6a1b3f9e99a2649/metrics/prom_data.tar.gz" || (echo "Error: Prometheus archive file does not exist. Specify valid file in PROMETHEUS_DUMP_PATH environment variable."; exit 1)
rm -rf /tmp/mig-prometheus-data-dump
mkdir -p /tmp/mig-prometheus-data-dump
tar xvf ./must-gather.local.6406855851152770461/brew-registry-redhat-io-rh-osbs-rhmtc-openshift-migration-must-gather-sha256-32f4504d7954402256d74104d5f8f549ced49278f1a1a33ca6a1b3f9e99a2649/metrics/prom_data.tar.gz -C /tmp/mig-prometheus-data-dump --strip-components=1 --no-same-owner
prometheus/wal/
prometheus/wal/00000002
prometheus/wal/00000003
prometheus/wal/00000004
prometheus/wal/checkpoint.00000001/
prometheus/wal/checkpoint.00000001/00000000
prometheus/chunks_head/
prometheus/chunks_head/000001
prometheus/chunks_head/000002
prometheus/chunks_head/000003
prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/
prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/chunks/
prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/chunks/000001
prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/index
prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/meta.json
prometheus/01EQAXWSPGPS58NWQEDZZKQKJN/tombstones
chmod 777 -R /tmp/mig-prometheus-data-dump
docker run -d \
    --mount type=bind,source=/tmp/mig-prometheus-data-dump,target=/etc/prometheus/data \
    --name mig-metrics-prometheus \
    --publish 127.0.0.1:9090:9090 \
    prom/prometheus:v2.6.0 \
    && echo "Started Prometheus on http://localhost:9090"
c06da2b6137137b32f5161972a9bfe54ef1d8c205d5aab9cc5650c1cfaa44259
Started Prometheus on http://localhost:9090

$ sudo docker ps -a
CONTAINER ID  IMAGE                   COMMAND            CREATED        STATUS                    PORTS  NAMES
c06da2b61371  prom/prometheus:v2.6.0  "/bin/prometheus"  4 seconds ago  Exited (1) 3 seconds ago         mig-metrics-prometheus

[hwh@bogon must-gather]$ sudo docker logs c06da2b61371
level=info ts=2020-11-17T11:48:11.944057435Z caller=main.go:243 msg="Starting Prometheus" version="(version=2.6.0, branch=HEAD, revision=dbd1d58c894775c0788470944b818cc724f550fb)"
level=info ts=2020-11-17T11:48:11.944116896Z caller=main.go:244 build_context="(go=go1.11.3, user=root@bf5760470f13, date=20181217-15:14:46)"
level=info ts=2020-11-17T11:48:11.944151594Z caller=main.go:245 host_details="(Linux 5.0.9-301.fc30.x86_64 #1 SMP Tue Apr 23 23:57:35 UTC 2019 x86_64 c06da2b61371 (none))"
level=info ts=2020-11-17T11:48:11.944186403Z caller=main.go:246 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2020-11-17T11:48:11.944216049Z caller=main.go:247 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2020-11-17T11:48:11.944778711Z caller=main.go:561 msg="Starting TSDB ..."
level=info ts=2020-11-17T11:48:11.945107839Z caller=web.go:429 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-11-17T11:48:11.945390775Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1605600122676 maxt=1605607200000 ulid=01EQAXWSPGPS58NWQEDZZKQKJN
level=warn ts=2020-11-17T11:48:11.946715337Z caller=wal.go:116 component=tsdb msg="last page of the wal is torn, filling it with zeros" segment=data/wal/00000004
level=info ts=2020-11-17T11:48:11.975366918Z caller=main.go:430 msg="Stopping scrape discovery manager..."
level=info ts=2020-11-17T11:48:11.975402445Z caller=main.go:444 msg="Stopping notify discovery manager..."
level=info ts=2020-11-17T11:48:11.975411327Z caller=main.go:466 msg="Stopping scrape manager..."
level=info ts=2020-11-17T11:48:11.975422699Z caller=main.go:440 msg="Notify discovery manager stopped"
level=info ts=2020-11-17T11:48:11.975465952Z caller=manager.go:664 component="rule manager" msg="Stopping rule manager..."
level=info ts=2020-11-17T11:48:11.975546863Z caller=manager.go:670 component="rule manager" msg="Rule manager stopped"
level=info ts=2020-11-17T11:48:11.975565077Z caller=notifier.go:521 component=notifier msg="Stopping notification manager..."
level=info ts=2020-11-17T11:48:11.975597645Z caller=main.go:615 msg="Notifier manager stopped"
level=info ts=2020-11-17T11:48:11.975481444Z caller=main.go:460 msg="Scrape manager stopped"
level=info ts=2020-11-17T11:48:11.975460368Z caller=main.go:426 msg="Scrape discovery manager stopped"
level=error ts=2020-11-17T11:48:11.975720503Z caller=main.go:624 err="opening storage failed: read WAL: backfill checkpoint: read records: corruption in segment data/wal/checkpoint.00000001/00000000 at 1231: unexpected record type 9"
```
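When triaging a failed `make prometheus-run`, a quick way to surface the startup failure is to pull only the warning and error lines out of the container logs. This is just a convenience sketch built from the commands already shown above; the container name and log format come from this report.

```
# Convenience sketch: show only warnings/errors from the local Prometheus
# container started by `make prometheus-run`.
sudo docker logs mig-metrics-prometheus 2>&1 | grep -E 'level=(warn|error)' || \
  echo "no warnings or errors logged"
```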
Hi, reading the WAL data _could_ fail when starting the local container. The WAL is copied from the filesystem without stopping the Prometheus service in the source OCP containers, so it is captured only as a nice-to-have to get the most up-to-date data (when it works). Generally all data should be in the prometheus/<SOME_ID>/ block directories, but Prometheus persists samples to those directories in batches; the batch size and frequency depend on the Prometheus configuration and the amount of captured metric samples. In my testing, the worst case was that the block data was missing the last hour of samples (which is why the WAL is captured as well, even though it is sometimes readable from the dump and sometimes not). Assuming the main problem is not seeing the expected metric samples in the locally running Prometheus instance - could you try executing the metrics must-gather 30 minutes, or better 1 hour, after a test migration has finished? Thanks
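As a rough way to judge how much data lives only in the WAL versus in the persisted blocks, you can look at the `minTime`/`maxTime` fields of each block's `meta.json` in the extracted dump. This is a hedged sketch of my own; the paths come from the extraction output above, and the grep-based approach simply avoids assuming any JSON tooling is installed.

```
# Sketch: print the time range (minTime/maxTime, milliseconds since epoch)
# covered by each persisted block in the extracted dump. Samples newer than
# the latest maxTime existed only in the WAL at dump time.
for meta in /tmp/mig-prometheus-data-dump/*/meta.json; do
  echo "== $meta"
  grep -E -o '"(minTime|maxTime)" *: *[0-9]+' "$meta" || cat "$meta"
done
```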
Based on more detailed information, the time delay before dumping the metrics shouldn't cause an error that blocks starting the Prometheus container locally. I'm going to run more experiments with the downstream must-gather image and potentially skip the WAL directory.
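For reference, that kind of experiment can be tried by hand against an existing dump: set the extracted `wal` directory aside and restart the local container, accepting that the most recent, not-yet-compacted samples are lost. This is only a workaround sketch reusing the paths and the `docker run` invocation from the `make prometheus-run` output above, not the eventual fix.

```
# Workaround sketch (not the official fix): move the WAL out of the data
# directory and restart the local container; only the newest samples that
# were not yet compacted into a block are lost.
mv /tmp/mig-prometheus-data-dump/wal /tmp/mig-prometheus-wal.bak
docker rm -f mig-metrics-prometheus || true
docker run -d \
    --mount type=bind,source=/tmp/mig-prometheus-data-dump,target=/etc/prometheus/data \
    --name mig-metrics-prometheus \
    --publish 127.0.0.1:9090:9090 \
    prom/prometheus:v2.6.0 \
    && echo "Started Prometheus on http://localhost:9090"
```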
I was able to reproduce this problem intermittently and wrote up more details in https://bugzilla.redhat.com/show_bug.cgi?id=1897501#c1. At this time I'm not aware of a solution. The first thing I'm going to try is making sure my local Prometheus version matches that of the on-cluster Prometheus that generated the WAL data.
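One way to check what the cluster is running, assuming the standard openshift-monitoring layout and that the `prometheus` binary is on the container's PATH (both assumptions on my side, using the pod name from the gather output above):

```
# Sketch: ask the on-cluster Prometheus for its version so the local
# prom/prometheus image tag can be matched to it.
oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- prometheus --version
```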
@whu I think this issue is caused by a gap between the on-cluster Prometheus version and the version we run locally to view the data.

```
level=error ts=2020-11-17T11:48:11.975720503Z caller=main.go:624 err="opening storage failed: read WAL: backfill checkpoint: read records: corruption in segment data/wal/checkpoint.00000001/00000000 at 1231: unexpected record type 9"
```

I was able to reproduce this using the latest available Makefile in the must-gather repo. The fix for me was to change the Prometheus version that is launched locally to match the version on the OpenShift cluster. For OpenShift 4.6 I found that Prometheus v2.21.0 was running on-cluster, so I modified my Makefile to use the same:

```
$ git diff Makefile
diff --git a/Makefile b/Makefile
index 25ecb76..3ba8bc2 100644
--- a/Makefile
+++ b/Makefile
@@ -21,7 +21,7 @@ prometheus-run: prometheus-cleanup-container prometheus-load-dump
 		--mount type=bind,source=${PROMETHEUS_LOCAL_DATA_DIR},target=/etc/prometheus/data \
 		--name mig-metrics-prometheus \
 		--publish 127.0.0.1:9090:9090 \
-		prom/prometheus:v2.6.0 \
+		prom/prometheus:v2.21.0 \
 		&& echo "Started Prometheus on http://localhost:9090"
 
 prometheus-load-dump: prometheus-check-archive-file prometheus-cleanup-data
```

PR posted: https://github.com/konveyor/must-gather/pull/16. Worth seeing if this will fix the issue across the board for OCP 4.x.
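As a quick sanity check (my own addition, not part of the PR), the version baked into the locally pulled image can be printed directly, since the image's entrypoint is the `prometheus` binary:

```
# Sketch: confirm the local image version matches the on-cluster one.
docker run --rm prom/prometheus:v2.21.0 --version
```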
This bug should be resolved by https://github.com/konveyor/must-gather/pull/16. With this PR I was able to view `cam_app_workload_migrations` from OCP 4.6 locally.
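For anyone re-running this check, a hedged alternative to the web UI is to query the metric through the local instance's HTTP API (a standard Prometheus endpoint; the metric name is the one mentioned above):

```
# Sketch: query the migration metric from the locally running container.
curl -s 'http://localhost:9090/api/v1/query?query=cam_app_workload_migrations'
```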
The PR is merged and available on the master and release-1.4.0 branches. @rayford should I also add this to a release 1.3.x branch?
Verified using MTC 1.4.0:
- openshift-migration-must-gather-container-v1.4.0-0.6
- rhmtc-openshift-migration-must-gather@sha256:8f9d760d574e58af86097097af1cf5534c73e1bcc4e50bdf6788641fef7fc59c

The pod ran without problems and the Prometheus application could be used. Moved to VERIFIED status.
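For completeness, the checks behind a verification like this can be scripted roughly as follows (a sketch of my own, using only commands from this report plus Prometheus's standard readiness endpoint):

```
# Sketch: confirm the container stayed up and the local Prometheus is ready.
sudo docker ps --filter name=mig-metrics-prometheus
curl -s http://localhost:9090/-/ready && echo
```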
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Migration Toolkit for Containers (MTC) tool image release advisory 1.4.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5329
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.