Bug 1664174 - Prometheus fails to start due to "Opening storage failed unexpected end of JSON input"
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: 4.2.z
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1669641 1683033
Depends On:
Blocks:
Reported: 2019-01-07 23:59 UTC by Robert Bost
Modified: 2019-08-05 06:17 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-05 06:17:24 UTC


Links
System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3891791 None None None 2019-02-06 15:57:50 UTC

Description Robert Bost 2019-01-07 23:59:16 UTC
Description of problem:

Prometheus fails to start with the following log:

level=info ts=2019-01-02T13:29:26.619568937Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=, revision=)"
level=info ts=2019-01-02T13:29:26.61965685Z caller=main.go:223 build_context="(go=go1.10.3, user=mockbuild@x86-037.build.eng.bos.redhat.com, date=20181203-06:09:17)"
level=info ts=2019-01-02T13:29:26.61968429Z caller=main.go:224 host_details="(Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 prometheus-k8s-1 (none))"
level=info ts=2019-01-02T13:29:26.619707528Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-01-02T13:29:26.620479309Z caller=main.go:533 msg="Starting TSDB ..."
level=info ts=2019-01-02T13:29:26.620537954Z caller=web.go:415 component=web msg="Start listening for connections" address=127.0.0.1:9090
level=info ts=2019-01-02T13:29:26.620816828Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1542693600000 maxt=1542758400000 ulid=01CWTYPEVT60FS41F9V3WH25V1
....
level=info ts=2019-01-02T13:29:26.624150248Z caller=main.go:402 msg="Stopping scrape discovery manager..."
level=info ts=2019-01-02T13:29:26.624189428Z caller=main.go:416 msg="Stopping notify discovery manager..."
level=info ts=2019-01-02T13:29:26.624209756Z caller=main.go:438 msg="Stopping scrape manager..."
level=info ts=2019-01-02T13:29:26.624226041Z caller=main.go:412 msg="Notify discovery manager stopped"
level=info ts=2019-01-02T13:29:26.624264698Z caller=main.go:432 msg="Scrape manager stopped"
level=info ts=2019-01-02T13:29:26.624270021Z caller=manager.go:464 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-01-02T13:29:26.6242707Z caller=main.go:398 msg="Scrape discovery manager stopped"
level=info ts=2019-01-02T13:29:26.624318455Z caller=manager.go:470 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-01-02T13:29:26.624350464Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
level=info ts=2019-01-02T13:29:26.62439182Z caller=main.go:587 msg="Notifier manager stopped"
level=error ts=2019-01-02T13:29:26.624472077Z caller=main.go:596 err="Opening storage failed unexpected end of JSON input"


Version-Release number of selected component (if applicable): 


How reproducible: Always upon startup for customer.

Additional info:
Similar upstream issue: https://github.com/prometheus/prometheus/issues/4058

Comment 1 Greg Rodriguez II 2019-01-28 22:24:57 UTC
Added case to the BZ and provided customer with the related upstream issue from the parent comment.  Waiting for update from customer.

Comment 2 Greg Rodriguez II 2019-01-29 19:21:26 UTC
Customer provided the following update:

~~~

Greg,

We tried to delete meta.json files with 0 size (as mentioned in https://github.com/prometheus/prometheus/issues/4058) but it said they are Read-only file-system.
cannot remove '01D223VQWPRQGAFFQ5SHCXHCS1/meta.json': Read-only file system

We also think it might be with one of our "glusterfs-infrastorage" pods got corrupted. We deleted those 3 pods one after the other, and after the new pods got started, Prometheus pod went into running state.

Thanks,
Vamshi

~~~

Comment 3 Frederic Branczyk 2019-02-04 16:22:47 UTC
Note that Prometheus requires a POSIX-compliant filesystem (which, for example, most NFS implementations are not, and I believe Gluster is not either); otherwise corruption is more likely to happen. In case of corruption there is nothing that can be done other than removing the affected block. That typically means losing a 2-hour window of data, but given that our retention is only 15 days this shouldn't be too terrible a problem: after at most 15 days the gap will be gone.
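
To make the filesystem point above easier to check, here is a minimal sketch that looks up which mount backs the Prometheus data directory and flags two conditions mentioned in this bug: a filesystem type that may not be fully POSIX-compliant, and a read-only mount (as seen in comment 2). It is not part of Prometheus or OpenShift. The function names, the SUSPECT_FS_TYPES set, and the idea of feeding it the contents of /proc/mounts are assumptions made for illustration.

```python
import os

# Filesystem types flagged by this sketch. The set is an assumption based on
# the comment above (NFS and Gluster); extend it for your environment.
SUSPECT_FS_TYPES = {"nfs", "nfs4", "fuse.glusterfs"}


def find_mount(path, mounts_text):
    """Return (mount_point, fs_type, options) for the mount containing path.

    mounts_text is text in /proc/mounts format: device, mount point,
    filesystem type, options, and two numeric fields per line.
    Mount points with escaped characters (such as \\040) are not handled.
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point, fs_type, options = fields[1], fields[2], fields[3].split(",")
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest mount point; on ties the later entry wins.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type, options)
    return best


def storage_warnings(path, mounts_text):
    """Return human-readable warnings about the mount backing path."""
    mount = find_mount(path, mounts_text)
    if mount is None:
        return [f"no mount found for {path}"]
    mount_point, fs_type, options = mount
    warnings = []
    if fs_type in SUSPECT_FS_TYPES:
        warnings.append(f"{mount_point} uses {fs_type}, which may not be POSIX-compliant")
    if "ro" in options:
        warnings.append(f"{mount_point} is mounted read-only")
    return warnings
```

Inside the Prometheus container this could be called as storage_warnings(data_dir, open("/proc/mounts").read()), where data_dir is whatever path the TSDB is configured to use. An empty list only means that neither of the two checks fired; it does not prove the storage is safe.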

Comment 4 Frederic Branczyk 2019-02-04 16:28:52 UTC
*** Bug 1669641 has been marked as a duplicate of this bug. ***

Comment 6 Frederic Branczyk 2019-02-27 15:43:51 UTC
*** Bug 1683033 has been marked as a duplicate of this bug. ***

Comment 7 Frederic Branczyk 2019-02-27 15:48:11 UTC
Moving the target release out to 4.2, as 4.1 is a very short release cycle. At this point it is unclear whether blocks corrupted in this way are recoverable at all, but we will look into whether it is possible to handle these situations more gracefully, with no guarantee that this is actually possible.

For now, in order to get a working stack again, you will need to delete the corrupted block.
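
To help locate the block to delete, the following is a minimal sketch that scans a TSDB data directory for block directories whose meta.json is empty or does not parse as JSON, which matches the zero-size meta.json files reported in comment 2 and the upstream issue. It only lists candidates and deletes nothing. The function name find_corrupt_blocks is hypothetical, and the assumption that block directories are named with 26-character ULIDs is taken from the directory names seen in this bug.

```python
import json
import re
from pathlib import Path

# Block directories are assumed to be named with 26-character ULIDs
# (Crockford base32), e.g. 01CWTYPEVT60FS41F9V3WH25V1.
ULID_RE = re.compile(r"^[0-9A-HJKMNP-TV-Z]{26}$")


def find_corrupt_blocks(data_dir):
    """Return (block_dir, reason) pairs for blocks with an unusable meta.json.

    Directories that are not named like a ULID (for example "wal") and
    blocks without a meta.json file are skipped.
    """
    corrupt = []
    for block in sorted(Path(data_dir).iterdir()):
        if not block.is_dir() or not ULID_RE.match(block.name):
            continue
        meta = block / "meta.json"
        if not meta.is_file():
            continue
        raw = meta.read_bytes()
        if not raw:
            corrupt.append((block, "meta.json is empty"))
            continue
        try:
            json.loads(raw)
        except ValueError as exc:
            # Covers truncated JSON as well as undecodable bytes.
            corrupt.append((block, f"meta.json is not valid JSON: {exc}"))
    return corrupt
```

Each returned directory is a candidate for removal while Prometheus is stopped. As comment 2 shows, removal can still fail if the underlying volume is mounted read-only, in which case the storage has to be repaired first.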

