Prometheus pods in CrashLoopBackOff state after a full OCP cluster shutdown and restart

Description of problem:
==========================
OCP 3.11.82 was installed on a 3-node setup (master+infra+compute, all roles on the same nodes). Ceph storage was configured with Rook to create block RBD PVCs. Prometheus and logging (Elasticsearch) pods were installed, backed by block RBD PVCs. All pods were in Running state.

The cluster was shut down overnight and restarted the next day. Although the alertmanager pods (also backed by PVCs) came up, both prometheus-k8s pods stayed in CrashLoopBackOff state. The nodes were checked and the mount points were accessible.

Some logs from the pod prometheus-k8s-1:
=========================
$ oc logs prometheus-k8s-1 -c prometheus
level=info ts=2019-02-26T06:18:21.933545117Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=, revision=)"
level=info ts=2019-02-26T06:18:21.933625904Z caller=main.go:223 build_context="(go=go1.10.3, user=mockbuild.eng.bos.redhat.com, date=20190208-01:54:33)"
level=info ts=2019-02-26T06:18:21.933654895Z caller=main.go:224 host_details="(Linux 3.10.0-957.5.1.el7.x86_64 #1 SMP Wed Dec 19 10:46:58 EST 2018 x86_64 prometheus-k8s-1 (none))"
level=info ts=2019-02-26T06:18:21.933673239Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-02-26T06:18:21.934319393Z caller=main.go:533 msg="Starting TSDB ..."
level=info ts=2019-02-26T06:18:21.934449797Z caller=web.go:415 component=web msg="Start listening for connections" address=127.0.0.1:9090
level=info ts=2019-02-26T06:18:21.934594772Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551088800000 maxt=1551096000000 ulid=01D4JBV7964MZMHYR3PW4X5WNG
level=info ts=2019-02-26T06:18:21.934672398Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551096000000 maxt=1551103200000 ulid=01D4JJPYH5GNFJFGCHWDFPGX73
level=info ts=2019-02-26T06:18:21.934727094Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551103200000 maxt=1551110400000 ulid=01D4JSJNS5FJQ26K899H5JM0GQ
level=info ts=2019-02-26T06:18:21.934780868Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551110400000 maxt=1551117600000 ulid=01D4K0ED15917HTDCP67DDY53H
level=info ts=2019-02-26T06:18:21.934832811Z caller=main.go:402 msg="Stopping scrape discovery manager..."
level=info ts=2019-02-26T06:18:21.934851694Z caller=main.go:416 msg="Stopping notify discovery manager..."
level=info ts=2019-02-26T06:18:21.934861388Z caller=main.go:438 msg="Stopping scrape manager..."
level=info ts=2019-02-26T06:18:21.934870596Z caller=main.go:412 msg="Notify discovery manager stopped"
level=info ts=2019-02-26T06:18:21.93488949Z caller=main.go:398 msg="Scrape discovery manager stopped"
level=info ts=2019-02-26T06:18:21.934903914Z caller=main.go:432 msg="Scrape manager stopped"
level=info ts=2019-02-26T06:18:21.934885969Z caller=manager.go:464 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-02-26T06:18:21.934927481Z caller=manager.go:470 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-02-26T06:18:21.934958634Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
level=info ts=2019-02-26T06:18:21.934987393Z caller=main.go:587 msg="Notifier manager stopped"
level=error ts=2019-02-26T06:18:21.935075032Z caller=main.go:596 err="Opening storage failed unexpected end of JSON input"

oc describe
===============
Events:
  Type     Reason           Age              From                                                  Message
  ----     ------           ----             ----                                                  -------
  Warning  FailedMount      1h               kubelet, ip-172-16-50-242.us-east-2.compute.internal  MountVolume.SetUp failed for volume "pvc-0cd31f06-38ea-11e9-94f4-0a86a43d410c" : mount command failed, status: Failure, reason: Rook: Error getting RPC client: error connecting to socket /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ceph.rook.io~rook-ceph-system/.rook.sock: dial unix /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ceph.rook.io~rook-ceph-system/.rook.sock: connect: no such file or directory
  Warning  NetworkNotReady  1h (x2 over 1h)  kubelet, ip-172-16-50-242.us-east-2.compute.internal  network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]
  Warning  NetworkFailed    1h               openshift-sdn, ip-172-16-50-242.us-east-2.compute.internal  The pod's network interface has been lost and the pod will be stopped.
  Warning  FailedMount      1h (x7 over 1h)  kubelet, ip-172-16-50-242.us-east-2.compute.internal  MountVolume.SetUp failed for volume "pvc-0cd31f06-38ea-11e9-94f4-0a86a43d410c" : mount command failed, status: Failure, reason: Rook: Error getting RPC client: error connecting to socket /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ceph.rook.io~rook-ceph-system/.rook.sock: dial unix /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ceph.rook.io~rook-ceph-system/.rook.sock: connect: connection refused
  Warning  NetworkFailed    1h               openshift-sdn, ip-172-16-50-242.us-east-2.compute.internal  The pod's network interface has been lost and the pod will be stopped.
  Normal   SandboxChanged   1h                kubelet, ip-172-16-50-242.us-east-2.compute.internal  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled           1h                kubelet, ip-172-16-50-242.us-east-2.compute.internal  Container image "registry.access.redhat.com/openshift3/ose-prometheus-config-reloader:v3.11.82" already present on machine
  Normal   Created          1h                kubelet, ip-172-16-50-242.us-east-2.compute.internal  Created container
  Normal   Started          1h                kubelet, ip-172-16-50-242.us-east-2.compute.internal  Started container
  Normal   Pulled           1h                kubelet, ip-172-16-50-242.us-east-2.compute.internal  Container image "registry.access.redhat.com/openshift3/ose-configmap-reloader:v3.11.82" already present on machine
  Normal   Created          1h                kubelet, ip-172-16-50-242.us-east-2.compute.internal  Created container
  Normal   Started          1h                kubelet, ip-172-16-50-242.us-east-2.compute.internal  Started container
  Normal   Pulled           1h                kubelet, ip-172-16-50-242.us-east-2.compute.internal  Container image "registry.access.redhat.com/openshift3/oauth-proxy:v3.11.82" already present on machine
  Normal   Created          1h                kubelet, ip-172-16-50-242.us-east-2.compute.internal  Created container
  Normal   Started          1h                kubelet, ip-172-16-50-242.us-east-2.compute.internal  Started container
  Normal   Started          1h (x2 over 1h)   kubelet, ip-172-16-50-242.us-east-2.compute.internal  Started container
  Normal   Created          1h (x3 over 1h)   kubelet, ip-172-16-50-242.us-east-2.compute.internal  Created container
  Warning  FailedSync       35m               kubelet, ip-172-16-50-242.us-east-2.compute.internal  error determining status: rpc error: code = Unknown desc = Error: No such container: 2e267e3450931378db5e67996f51a211c9e0501e3141ee793b59d00be4e67911
  Normal   Pulled           31m (x14 over 1h) kubelet, ip-172-16-50-242.us-east-2.compute.internal  Container image "registry.access.redhat.com/openshift3/prometheus:v3.11.82" already present on machine
  Warning  BackOff          1m (x78 over 1h)  kubelet, ip-172-16-50-242.us-east-2.compute.internal  Back-off restarting failed container

Version-Release number of selected
component (if applicable):
===============================
$ oc version
oc v3.11.82
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://master.refarch50.storage-strategy.com:8446
openshift v3.11.82
kubernetes v1.11.0+d4cacc0

ceph version = 12.2.58 (luminous)

How reproducible:
=============
The cluster is shut down almost every night and powered on again the next morning. We have seen this issue multiple times (more than once) during these daily activities.

Steps to Reproduce:
1. Install Prometheus backed by a Ceph RBD volume
2. Shut down the cluster overnight
3. Power on the cluster and check the status of the Prometheus pods

Actual results:
==============
The prometheus pods are in CrashLoopBackOff state after a cluster shutdown + power on.

Expected results:
====================
All pods should be in Running state after a cluster shutdown + power on.
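Additional info: the startup error above ("Opening storage failed unexpected end of JSON input") typically points to a block-level meta.json in the Prometheus TSDB data directory that was left empty or truncated by the unclean shutdown, since Prometheus parses every block's meta.json when opening storage. A minimal diagnostic sketch for locating such files, assuming the RBD PV can be mounted and inspected at some path (the /prometheus default below is an assumption for illustration, not taken from this report):

```python
#!/usr/bin/env python3
"""Scan a Prometheus TSDB data directory for block meta.json files that
fail to parse as JSON. Such files would cause Prometheus to exit with
"Opening storage failed ... unexpected end of JSON input" at startup."""
import json
import sys
from pathlib import Path


def find_bad_meta(data_dir):
    """Return sorted paths of meta.json files that are empty or invalid JSON.

    Each TSDB block is a ULID-named subdirectory of data_dir containing
    a meta.json; only those meta.json files are checked.
    """
    bad = []
    for meta in Path(data_dir).glob("*/meta.json"):
        try:
            with open(meta) as fh:
                json.load(fh)
        except (json.JSONDecodeError, OSError):
            bad.append(str(meta))
    return sorted(bad)


if __name__ == "__main__":
    # Path is a placeholder: point this at wherever the PV is mounted.
    data_dir = sys.argv[1] if len(sys.argv) > 1 else "/prometheus"
    for path in find_bad_meta(data_dir):
        print(path)
```

Blocks flagged by a scan like this can then be inspected before restarting Prometheus; whether to remove or repair a corrupted block directory is an operator decision that this sketch deliberately does not make.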