Bug 1683033 - Both Prometheus pods in CrashLoopBackOff state after a full OCP cluster shutdown and restart
Summary: Both Prometheus pods in CrashLoopBackOff state after a full OCP cluster shutdown...
Keywords:
Status: CLOSED DUPLICATE of bug 1664174
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-02-26 06:49 UTC by Neha Berry
Modified: 2019-02-27 15:43 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-27 15:43:51 UTC
Target Upstream Version:
Embargoed:



Description Neha Berry 2019-02-26 06:49:37 UTC
Prometheus pods in CrashLoopBackOff state after a full OCP cluster shutdown and restart

Description of problem:
==========================
OCP 3.11.82 was installed on a 3-node setup (master + infra + compute), with all roles co-located on the same nodes.

Ceph storage was configured with Rook to provision block RBD PVCs.

Prometheus and logging (Elasticsearch) pods were installed, backed by block RBD PVCs. All pods were in Running state.

The cluster was shut down overnight and restarted the next day. Although the Alertmanager pods (also backed by PVCs) came up, both prometheus-k8s pods remained in CrashLoopBackOff state.

Checked the nodes and the mount points were accessible. 
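The node-side check can be as simple as confirming that the Ceph RBD devices are still mapped and their filesystems mounted; a generic sketch (not necessarily the exact commands used, and the exact mount paths depend on the Rook flexvolume driver):

# Confirm the RBD devices are still mapped and mounted on the node
$ lsblk | grep rbd
$ mount | grep rbd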

Some logs from the pod prometheus-k8s-1:
=========================
$ oc logs prometheus-k8s-1 -c prometheus
level=info ts=2019-02-26T06:18:21.933545117Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=, revision=)"
level=info ts=2019-02-26T06:18:21.933625904Z caller=main.go:223 build_context="(go=go1.10.3, user=mockbuild.eng.bos.redhat.com, date=20190208-01:54:33)"
level=info ts=2019-02-26T06:18:21.933654895Z caller=main.go:224 host_details="(Linux 3.10.0-957.5.1.el7.x86_64 #1 SMP Wed Dec 19 10:46:58 EST 2018 x86_64 prometheus-k8s-1 (none))"
level=info ts=2019-02-26T06:18:21.933673239Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-02-26T06:18:21.934319393Z caller=main.go:533 msg="Starting TSDB ..."
level=info ts=2019-02-26T06:18:21.934449797Z caller=web.go:415 component=web msg="Start listening for connections" address=127.0.0.1:9090
level=info ts=2019-02-26T06:18:21.934594772Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551088800000 maxt=1551096000000 ulid=01D4JBV7964MZMHYR3PW4X5WNG
level=info ts=2019-02-26T06:18:21.934672398Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551096000000 maxt=1551103200000 ulid=01D4JJPYH5GNFJFGCHWDFPGX73
level=info ts=2019-02-26T06:18:21.934727094Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551103200000 maxt=1551110400000 ulid=01D4JSJNS5FJQ26K899H5JM0GQ
level=info ts=2019-02-26T06:18:21.934780868Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551110400000 maxt=1551117600000 ulid=01D4K0ED15917HTDCP67DDY53H
level=info ts=2019-02-26T06:18:21.934832811Z caller=main.go:402 msg="Stopping scrape discovery manager..."
level=info ts=2019-02-26T06:18:21.934851694Z caller=main.go:416 msg="Stopping notify discovery manager..."
level=info ts=2019-02-26T06:18:21.934861388Z caller=main.go:438 msg="Stopping scrape manager..."
level=info ts=2019-02-26T06:18:21.934870596Z caller=main.go:412 msg="Notify discovery manager stopped"
level=info ts=2019-02-26T06:18:21.93488949Z caller=main.go:398 msg="Scrape discovery manager stopped"
level=info ts=2019-02-26T06:18:21.934903914Z caller=main.go:432 msg="Scrape manager stopped"
level=info ts=2019-02-26T06:18:21.934885969Z caller=manager.go:464 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-02-26T06:18:21.934927481Z caller=manager.go:470 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-02-26T06:18:21.934958634Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
level=info ts=2019-02-26T06:18:21.934987393Z caller=main.go:587 msg="Notifier manager stopped"
level=error ts=2019-02-26T06:18:21.935075032Z caller=main.go:596 err="Opening storage failed unexpected end of JSON input"
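The "Opening storage failed unexpected end of JSON input" error typically points at a truncated or zero-byte block meta.json in the Prometheus TSDB, left behind by the unclean shutdown. A minimal node-side check might look like the following sketch (the kubelet volume path is the OCP 3.11 default and the PVC name is taken from the FailedMount events further below; adjust to the affected replica):

# Locate the RBD mount for the Prometheus PVC on the node hosting the crash-looping replica
$ mount | grep pvc-0cd31f06-38ea-11e9-94f4-0a86a43d410c

# A zero-byte block meta.json is the usual cause of "unexpected end of JSON input"
$ find /var/lib/origin/openshift.local.volumes -name meta.json -size 0 2>/dev/null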



oc describe
===============

Events:
  Type     Reason           Age                From                                                        Message
  ----     ------           ----               ----                                                        -------
  Warning  FailedMount      1h                 kubelet, ip-172-16-50-242.us-east-2.compute.internal        MountVolume.SetUp failed for volume "pvc-0cd31f06-38ea-11e9-94f4-0a86a43d410c" : mount command failed, status: Failure, reason: Rook: Error getting RPC client: error connecting to socket /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ceph.rook.io~rook-ceph-system/.rook.sock: dial unix /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ceph.rook.io~rook-ceph-system/.rook.sock: connect: no such file or directory
  Warning  NetworkNotReady  1h (x2 over 1h)    kubelet, ip-172-16-50-242.us-east-2.compute.internal        network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]
  Warning  NetworkFailed    1h                 openshift-sdn, ip-172-16-50-242.us-east-2.compute.internal  The pod's network interface has been lost and the pod will be stopped.
  Warning  FailedMount      1h (x7 over 1h)    kubelet, ip-172-16-50-242.us-east-2.compute.internal        MountVolume.SetUp failed for volume "pvc-0cd31f06-38ea-11e9-94f4-0a86a43d410c" : mount command failed, status: Failure, reason: Rook: Error getting RPC client: error connecting to socket /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ceph.rook.io~rook-ceph-system/.rook.sock: dial unix /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ceph.rook.io~rook-ceph-system/.rook.sock: connect: connection refused
  Warning  NetworkFailed    1h                 openshift-sdn, ip-172-16-50-242.us-east-2.compute.internal  The pod's network interface has been lost and the pod will be stopped.
  Normal   SandboxChanged   1h                 kubelet, ip-172-16-50-242.us-east-2.compute.internal        Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled           1h                 kubelet, ip-172-16-50-242.us-east-2.compute.internal        Container image "registry.access.redhat.com/openshift3/ose-prometheus-config-reloader:v3.11.82" already present on machine
  Normal   Created          1h                 kubelet, ip-172-16-50-242.us-east-2.compute.internal        Created container
  Normal   Started          1h                 kubelet, ip-172-16-50-242.us-east-2.compute.internal        Started container
  Normal   Pulled           1h                 kubelet, ip-172-16-50-242.us-east-2.compute.internal        Container image "registry.access.redhat.com/openshift3/ose-configmap-reloader:v3.11.82" already present on machine
  Normal   Created          1h                 kubelet, ip-172-16-50-242.us-east-2.compute.internal        Created container
  Normal   Started          1h                 kubelet, ip-172-16-50-242.us-east-2.compute.internal        Started container
  Normal   Pulled           1h                 kubelet, ip-172-16-50-242.us-east-2.compute.internal        Container image "registry.access.redhat.com/openshift3/oauth-proxy:v3.11.82" already present on machine
  Normal   Created          1h                 kubelet, ip-172-16-50-242.us-east-2.compute.internal        Created container
  Normal   Started          1h                 kubelet, ip-172-16-50-242.us-east-2.compute.internal        Started container
  Normal   Started          1h (x2 over 1h)    kubelet, ip-172-16-50-242.us-east-2.compute.internal        Started container
  Normal   Created          1h (x3 over 1h)    kubelet, ip-172-16-50-242.us-east-2.compute.internal        Created container
  Warning  FailedSync       35m                kubelet, ip-172-16-50-242.us-east-2.compute.internal        error determining status: rpc error: code = Unknown desc = Error: No such container: 2e267e3450931378db5e67996f51a211c9e0501e3141ee793b59d00be4e67911
  Normal   Pulled           31m (x14 over 1h)  kubelet, ip-172-16-50-242.us-east-2.compute.internal        Container image "registry.access.redhat.com/openshift3/prometheus:v3.11.82" already present on machine
  Warning  BackOff          1m (x78 over 1h)   kubelet, ip-172-16-50-242.us-east-2.compute.internal        Back-off restarting failed container
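The FailedMount events above suggest the Rook agent socket was not yet available when the kubelet first tried to mount the volume after the restart. A quick sanity check of the agent pods might look like this (the rook-ceph-system namespace is implied by the flexvolume path in the events; the app=rook-ceph-agent label is an assumption based on a default Rook deployment):

$ oc -n rook-ceph-system get pods -l app=rook-ceph-agent -o wide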


Version-Release number of selected component (if applicable):
===============================

$ oc version
oc v3.11.82
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://master.refarch50.storage-strategy.com:8446
openshift v3.11.82
kubernetes v1.11.0+d4cacc0

ceph version = 12.2.58 (luminous)




How reproducible:
=============
The cluster is shut down almost every night and powered on again the next morning. We have seen this issue multiple times during these daily shutdown/power-on cycles.

Steps to Reproduce:
1. Install Prometheus backed by a Ceph RBD volume
2. Shut down the cluster overnight
3. Power on the cluster and check the status of the Prometheus pods (see the sketch after this list)
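A minimal way to verify step 3 from the CLI might look like the sketch below (the openshift-monitoring namespace and the app=prometheus label are assumptions based on the default OCP 3.11 cluster-monitoring deployment):

# Watch the Prometheus pods as the cluster comes back up
$ oc -n openshift-monitoring get pods -l app=prometheus -w

# Pull the prometheus container log from a crash-looping replica, including the previous attempt
$ oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus --previous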




Actual results:
==============

The Prometheus pods are in CrashLoopBackOff state after a cluster shutdown + power ON

Expected results:
====================
All pods should be in Running state after a cluster shutdown + Power ON

