Bug 1664174

Summary: Prometheus fails to start due to "Opening storage failed unexpected end of JSON input"
Product: OpenShift Container Platform
Reporter: Robert Bost <rbost>
Component: Monitoring
Assignee: Frederic Branczyk <fbranczy>
Status: CLOSED UPSTREAM
QA Contact: Junqi Zhao <juzhao>
Severity: unspecified
Docs Contact:
Priority: high
Version: 3.11.0
CC: aabhishe, fbranczy, grodrigu, hgomes, jfoots, nberry, surbania
Target Milestone: ---   
Target Release: 4.2.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-08-05 06:17:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Robert Bost 2019-01-07 23:59:16 UTC
Description of problem:

Prometheus fails to start and logs the following:

level=info ts=2019-01-02T13:29:26.619568937Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=, revision=)"
level=info ts=2019-01-02T13:29:26.61965685Z caller=main.go:223 build_context="(go=go1.10.3, user=mockbuild.eng.bos.redhat.com, date=20181203-06:09:17)"
level=info ts=2019-01-02T13:29:26.61968429Z caller=main.go:224 host_details="(Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 prometheus-k8s-1 (none))"
level=info ts=2019-01-02T13:29:26.619707528Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-01-02T13:29:26.620479309Z caller=main.go:533 msg="Starting TSDB ..."
level=info ts=2019-01-02T13:29:26.620537954Z caller=web.go:415 component=web msg="Start listening for connections" address=127.0.0.1:9090
level=info ts=2019-01-02T13:29:26.620816828Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1542693600000 maxt=1542758400000 ulid=01CWTYPEVT60FS41F9V3WH25V1
....
level=info ts=2019-01-02T13:29:26.624150248Z caller=main.go:402 msg="Stopping scrape discovery manager..."
level=info ts=2019-01-02T13:29:26.624189428Z caller=main.go:416 msg="Stopping notify discovery manager..."
level=info ts=2019-01-02T13:29:26.624209756Z caller=main.go:438 msg="Stopping scrape manager..."
level=info ts=2019-01-02T13:29:26.624226041Z caller=main.go:412 msg="Notify discovery manager stopped"
level=info ts=2019-01-02T13:29:26.624264698Z caller=main.go:432 msg="Scrape manager stopped"
level=info ts=2019-01-02T13:29:26.624270021Z caller=manager.go:464 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-01-02T13:29:26.6242707Z caller=main.go:398 msg="Scrape discovery manager stopped"
level=info ts=2019-01-02T13:29:26.624318455Z caller=manager.go:470 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-01-02T13:29:26.624350464Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
level=info ts=2019-01-02T13:29:26.62439182Z caller=main.go:587 msg="Notifier manager stopped"
level=error ts=2019-01-02T13:29:26.624472077Z caller=main.go:596 err="Opening storage failed unexpected end of JSON input"
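
The log above is in logfmt (space-separated key=value pairs, with quoted values where needed), so the fatal line can be picked out mechanically. The following is a minimal sketch, assuming Python's standard `shlex` module is an acceptable stand-in for a real logfmt parser; the helper names `parse_logfmt` and `error_lines` are hypothetical and not part of Prometheus.

```python
import shlex


def parse_logfmt(line):
    """Split one logfmt line into a dict of key -> value.

    shlex handles the double-quoted values; this is an approximation,
    not a full logfmt implementation.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that carry no '='
            fields[key] = value
    return fields


def error_lines(lines):
    """Yield the parsed records whose level is 'error'."""
    for line in lines:
        record = parse_logfmt(line)
        if record.get("level") == "error":
            yield record


# Two lines copied from the log in this report.
sample = [
    'level=info ts=2019-01-02T13:29:26.620479309Z caller=main.go:533 msg="Starting TSDB ..."',
    'level=error ts=2019-01-02T13:29:26.624472077Z caller=main.go:596 '
    'err="Opening storage failed unexpected end of JSON input"',
]

for record in error_lines(sample):
    print(record["caller"], record["err"])
```

Run against the full pod log, this would surface only the final `main.go:596` line, which is the one that names the storage failure.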


Version-Release number of selected component (if applicable): 


How reproducible: Always, on every startup in the customer's environment.

Additional info:
Similar upstream issue: https://github.com/prometheus/prometheus/issues/4058

Comment 1 Greg Rodriguez II 2019-01-28 22:24:57 UTC
Added case to the BZ and provided customer with the related upstream issue from the parent comment.  Waiting for update from customer.

Comment 2 Greg Rodriguez II 2019-01-29 19:21:26 UTC
Customer provided the following update:

~~~

Greg,

We tried to delete the zero-size meta.json files (as mentioned in https://github.com/prometheus/prometheus/issues/4058), but the removal failed because the file system is read-only:
cannot remove '01D223VQWPRQGAFFQ5SHCXHCS1/meta.json': Read-only file system

We also suspect that one of our "glusterfs-infrastorage" pods was corrupted. We deleted those 3 pods one after the other, and after the new pods started, the Prometheus pod went into the Running state.

Thanks,
Vamshi

~~~
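
The zero-size meta.json files the customer mentions match the reported error: "unexpected end of JSON input" is what Go's JSON decoder reports for empty or truncated input. Below is a minimal sketch for locating such blocks, assuming each block is a subdirectory of the data directory containing its own meta.json (as the path in the customer's error message suggests). The function name `find_bad_blocks` and the demo directory names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def find_bad_blocks(data_dir):
    """Return (block_dir, reason) pairs for blocks whose meta.json is unusable.

    Only blocks that have a meta.json are inspected; a block directory
    with no meta.json at all is not reported by this sketch.
    """
    bad = []
    for meta in sorted(Path(data_dir).glob("*/meta.json")):
        if meta.stat().st_size == 0:
            bad.append((meta.parent, "meta.json is empty"))
            continue
        try:
            json.loads(meta.read_text())
        except ValueError as exc:  # JSONDecodeError subclasses ValueError
            bad.append((meta.parent, f"meta.json is not valid JSON: {exc}"))
    return bad


# Demo on a throwaway directory with made-up block names.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp, "block-good")
    good.mkdir()
    (good / "meta.json").write_text('{"version": 1}')

    empty = Path(tmp, "block-empty")
    empty.mkdir()
    (empty / "meta.json").write_text("")

    for block, reason in find_bad_blocks(tmp):
        print(block.name, "->", reason)
```

The scan only reads files, so it also works on a volume that is mounted read-only, as the customer's was.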

Comment 3 Frederic Branczyk 2019-02-04 16:22:47 UTC
Note that Prometheus requires a POSIX-compliant filesystem (which most NFS implementations, and I believe Gluster as well, are not); otherwise corruption is more likely to happen. In case of corruption there is nothing that can be done except remove the affected block. That typically means losing a 2-hour window of data, but given that our retention is only 15 days this shouldn't be too terrible a problem, as the gap will be gone after 15 days at most.

Comment 4 Frederic Branczyk 2019-02-04 16:28:52 UTC
*** Bug 1669641 has been marked as a duplicate of this bug. ***

Comment 6 Frederic Branczyk 2019-02-27 15:43:51 UTC
*** Bug 1683033 has been marked as a duplicate of this bug. ***

Comment 7 Frederic Branczyk 2019-02-27 15:48:11 UTC
Moving the target release out to 4.2, as 4.1 is a very short release cycle. At this point it is unclear whether blocks corrupted in this way are recoverable at all, but we will look into whether it is possible to handle these situations more gracefully, with no guarantee that this is actually possible.

For now, in order to get a working stack again, you will need to delete the corrupted block.
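
Deleting the corrupted block can be sketched as follows. This is a cautious illustration, not an official procedure: it assumes Prometheus is stopped while the block is removed, that block directories are named with 26-character ULIDs like the ones shown in this report, and that the volume is writable (the customer's attempt failed with "Read-only file system", in which case the removal here would raise `OSError`). The function name `remove_block` and its `dry_run` flag are hypothetical.

```python
import re
import shutil
import tempfile
from pathlib import Path

# ULIDs use Crockford base32: digits and upper-case letters except I, L, O, U.
ULID_RE = re.compile(r"^[0-9A-HJKMNP-TV-Z]{26}$")


def remove_block(block_dir, dry_run=True):
    """Remove one TSDB block directory; by default only report what would happen."""
    block = Path(block_dir)
    if not block.is_dir() or not ULID_RE.match(block.name):
        raise ValueError(f"{block} does not look like a TSDB block directory")
    if dry_run:
        return f"would remove {block.name}"
    shutil.rmtree(block)  # raises OSError on a read-only file system
    return f"removed {block.name}"


# Demo in a throwaway directory, reusing a block name from this report.
with tempfile.TemporaryDirectory() as tmp:
    block = Path(tmp, "01D223VQWPRQGAFFQ5SHCXHCS1")
    block.mkdir()
    (block / "meta.json").write_text("")  # simulate the zero-size meta.json

    print(remove_block(block))                 # dry run, nothing is deleted
    print(remove_block(block, dry_run=False))  # actually deletes the block
    print(block.exists())
```

Defaulting to a dry run and checking the directory name are design choices of this sketch, meant to reduce the chance of deleting the wrong directory; they are not behavior described in this report.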