Bug 1664174
| Summary: | Prometheus fails to start due to "Opening storage failed unexpected end of JSON input" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Robert Bost <rbost> |
| Component: | Monitoring | Assignee: | Frederic Branczyk <fbranczy> |
| Status: | CLOSED UPSTREAM | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.11.0 | CC: | aabhishe, fbranczy, grodrigu, hgomes, jfoots, nberry, surbania |
| Target Milestone: | --- | | |
| Target Release: | 4.2.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-08-05 06:17:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Robert Bost
2019-01-07 23:59:16 UTC
Added case to the BZ and provided customer with the related upstream issue from the parent comment. Waiting for update from customer.

Customer provided the following update:

~~~
Greg,

We tried to delete the meta.json files with 0 size (as mentioned in https://github.com/prometheus/prometheus/issues/4058), but it said they are on a read-only filesystem:

cannot remove '01D223VQWPRQGAFFQ5SHCXHCS1/meta.json': Read-only file system

We also think one of our "glusterfs-infrastorage" pods got corrupted. We deleted those 3 pods one after the other, and after the new pods started, the Prometheus pod went into running state.

Thanks,
Vamshi
~~~

Note that Prometheus requires a POSIX filesystem, which most NFS implementations are not, and I believe Gluster is not either; otherwise corruption is more likely to happen. In case of corruption there is nothing to be done except remove the affected block, which typically means losing a two-hour window of data. Given that our retention is only 15 days, this should not be too terrible a problem: after at most 15 days the gap will be gone.

*** Bug 1669641 has been marked as a duplicate of this bug. ***

*** Bug 1683033 has been marked as a duplicate of this bug. ***

Moving the target release out to 4.2, as 4.1 is a very short release cycle.

At this point it is unclear whether blocks corrupted in this way are recoverable at all, but we will look into whether it is possible to handle these situations more gracefully, with no guarantee that this is actually possible. For now, in order to get a working stack again, you will need to delete the corrupted block.
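A minimal sketch of the block-deletion workaround, assuming the default openshift-monitoring stack; the pod name (prometheus-k8s-0), namespace, container name, and data path (/prometheus) are assumptions that may differ in your cluster, and the underlying volume must be writable (unlike the read-only Gluster mount in the customer report) for the removal to succeed:

~~~
# Illustrative only: adjust pod, namespace, container, and data path to your environment,
# and back up the volume first if the data matters.

# 1. List TSDB block directories whose meta.json is empty (0 bytes) -- these are the
#    blocks Prometheus cannot open ("unexpected end of JSON input").
oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- \
  find /prometheus -maxdepth 2 -name meta.json -size 0

# 2. Remove the whole block directory for each hit, e.g. the block from the report:
oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- \
  rm -rf /prometheus/01D223VQWPRQGAFFQ5SHCXHCS1

# 3. Restart the pod so Prometheus reopens the storage without the corrupted block.
oc delete pod -n openshift-monitoring prometheus-k8s-0
~~~

Each removed block costs roughly a two-hour window of metrics, as noted above; with 15-day retention the resulting gap ages out on its own.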