Bug 1500627 - Prometheus pod in CrashLoopBackOff status, prometheus container failed to start up
Summary: Prometheus pod in CrashLoopBackOff status, prometheus container failed to start up
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.7.0
Assignee: Paul Gier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-10-11 08:36 UTC by Junqi Zhao
Modified: 2017-11-28 22:16 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-28 22:16:24 UTC
Target Upstream Version:
Embargoed:


Attachments
pod log (21.44 KB, text/plain)
2017-10-11 08:36 UTC, Junqi Zhao
pod log (22.86 KB, text/plain)
2017-10-11 08:42 UTC, Junqi Zhao


Links:
Red Hat Product Errata RHSA-2017:3188 (normal, SHIPPED_LIVE): Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update. Last updated 2017-11-29 02:34:54 UTC

Description Junqi Zhao 2017-10-11 08:36:30 UTC
Created attachment 1337047 [details]
pod log

Description of problem:
Prometheus pod in CrashLoopBackOff status
# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
prometheus-1095623639-nrqvw   4/5       CrashLoopBackOff   10         31m

# oc logs prometheus-1095623639-nrqvw -c prometheus
level=info ts=2017-10-11T08:09:31.322830654Z caller=main.go:214 msg="Starting prometheus" version="(version=2.0.0-beta.5, branch=non-git, revision=non-git)"
level=info ts=2017-10-11T08:09:31.323310641Z caller=main.go:215 build_context="(go=go1.8.3, user=mockbuild.eng.bos.redhat.com, date=20171003-17:40:48)"
level=info ts=2017-10-11T08:09:31.32333094Z caller=main.go:216 host_details="(Linux 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017 x86_64 prometheus-1095623639-nrqvw (none))"
level=info ts=2017-10-11T08:09:31.326497919Z caller=web.go:367 component=web msg="Start listening for connections" address=localhost:9090
level=info ts=2017-10-11T08:09:31.326350719Z caller=main.go:308 msg="Starting TSDB"
level=info ts=2017-10-11T08:09:31.326365309Z caller=targetmanager.go:68 component="target manager" msg="Starting target manager..."
level=error ts=2017-10-11T08:09:31.340600313Z caller=main.go:317 msg="Opening storage failed" err="create leveled compactor: at least one range must be provided"

Version-Release number of the following components:
prometheus:v3.7.0-26
prometheus-alert-buffer:v3.7.0-26
prometheus-alertmanager:v3.7.0-26
oauth-proxy:v3.7.0-27

How reproducible:
Always

Steps to Reproduce:
1. Deploy Prometheus via ansible; see the [Additional info] section for the inventory file.

Actual results:
Prometheus pod in CrashLoopBackOff status, prometheus container failed to start up

Expected results:
Prometheus pod should be healthy

Additional info:
# Inventory file
[OSEv3:children]
masters
etcd
nfs

[masters]
${MASTER} openshift_public_hostname=${MASTER}

[etcd]
${ETCD} openshift_public_hostname=${ETCD}

[nfs]
${NFS} openshift_public_hostname=${NFS}


[OSEv3:vars]
ansible_ssh_user=root
ansible_ssh_private_key_file="~/libra.pem"
deployment_type=openshift-enterprise
openshift_docker_additional_registries=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888


# prometheus
openshift_prometheus_state=present
openshift_prometheus_namespace=prometheus

openshift_prometheus_replicas=1
openshift_prometheus_node_selector={'role': 'node'}

openshift_prometheus_image_proxy=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/oauth-proxy:v3.7
openshift_prometheus_image_prometheus=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus:v3.7
openshift_prometheus_image_alertmanager=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus-alertmanager:v3.7
openshift_prometheus_image_alertbuffer=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus-alert-buffer:v3.7

Comment 1 Junqi Zhao 2017-10-11 08:37:58 UTC
This blocks all Prometheus testing.

Comment 2 Junqi Zhao 2017-10-11 08:42:10 UTC
Created attachment 1337049 [details]
pod log

Comment 3 Tim Bielawa 2017-10-11 17:22:06 UTC
I can reproduce this

level=info ts=2017-10-11T17:20:54.739605775Z caller=main.go:214 msg="Starting prometheus" version="(version=2.0.0-beta.5, branch=non-git, revision=non-git)"
level=info ts=2017-10-11T17:20:54.739738561Z caller=main.go:215 build_context="(go=go1.8.3, user=mockbuild.eng.bos.redhat.com, date=20171003-17:40:48)"
level=info ts=2017-10-11T17:20:54.739758868Z caller=main.go:216 host_details="(Linux 3.10.0-693.2.1.el7.x86_64 #1 SMP Fri Aug 11 04:58:43 EDT 2017 x86_64 prometheus-3537460379-q44s5 (none))"
level=info ts=2017-10-11T17:20:54.743760222Z caller=web.go:367 component=web msg="Start listening for connections" address=localhost:9090
level=info ts=2017-10-11T17:20:54.74381493Z caller=main.go:308 msg="Starting TSDB"
level=info ts=2017-10-11T17:20:54.744095515Z caller=targetmanager.go:68 component="target manager" msg="Starting target manager..."
level=error ts=2017-10-11T17:20:54.767331121Z caller=main.go:317 msg="Opening storage failed" err="create leveled compactor: at least one range must be provided"


I am going to see whether older images launch correctly.

Comment 4 Tim Bielawa 2017-10-11 18:14:52 UTC
Error reproduced using images with the 'latest' tag as well. 4/5 containers come up; Prometheus still fails with the message:

> err="create leveled compactor: at least one range must be provided"

Everything comes up when I comment out the "openshift_prometheus_image_<COMPONENT>" lines in my inventory:

[root@m01 ~]# oc get pods
NAME                         READY     STATUS    RESTARTS   AGE
prometheus-662406382-ngr9k   5/5       Running   0          3m


And here are their running images:

[root@m01 ~]# oc describe pod prometheus-662406382-ngr9k  | grep Image
    Image:              openshift/oauth-proxy:v1.0.0
    Image ID:           docker-pullable://docker.io/openshift/oauth-proxy@sha256:48191b4beb0abcb6b81cee329133f7aa4c6e9d2cfca2ce55ef68b817d669f337
    Image:              openshift/prometheus:v2.0.0-dev
    Image ID:           docker-pullable://docker.io/openshift/prometheus@sha256:856ae17355bf635aa8a741fa717a3d1162df961bd0d245a1adc07489be886f52
    Image:              openshift/oauth-proxy:v1.0.0
    Image ID:           docker-pullable://docker.io/openshift/oauth-proxy@sha256:48191b4beb0abcb6b81cee329133f7aa4c6e9d2cfca2ce55ef68b817d669f337
    Image:              openshift/prometheus-alert-buffer:v0.0.1
    Image ID:           docker-pullable://docker.io/openshift/prometheus-alert-buffer@sha256:07ab354604db7d96ec1015e9ffdfedacba153215bbd0667aed8ef2d223a79833
    Image:              openshift/prometheus-alertmanager:dev
    Image ID:           docker-pullable://docker.io/openshift/prometheus-alertmanager@sha256:b965ec4cf9e5884e5681e7e58888e3b35a5c4ffa96932c7246faa1e998eddbf6


Of course, I am not certain whether these are the latest images we expect users to be consuming when OCP 3.7 is released.

Comment 5 Tim Bielawa 2017-10-11 18:21:20 UTC
I tried the following configuration as well:

openshift_prometheus_image_proxy=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/oauth-proxy:v3.7
# openshift_prometheus_image_prometheus=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus:v3.7
openshift_prometheus_image_alertmanager=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus-alertmanager:v3.7
openshift_prometheus_image_alertbuffer=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus-alert-buffer:v3.7


Notice that 'openshift_prometheus_image_prometheus' is commented out. It is using this image:

    Image:              openshift/prometheus:v2.0.0-dev
    Image ID:           docker-pullable://docker.io/openshift/prometheus@sha256:856ae17355bf635aa8a741fa717a3d1162df961bd0d245a1adc07489be886f52


I am wondering now if something broke in the upstream image. Changes were recently committed to the prometheus/tsdb and prometheus/prometheus repos that touched code related to the original error, specifically NewLeveledCompactor.

Comment 6 Tim Bielawa 2017-10-11 19:24:36 UTC
(In reply to Tim Bielawa from comment #5)
> I tried the following configuration as well:
> 
> ...

I forgot to note that with the configuration I tried above, the prometheus pod came up with all 5/5 containers.

@Junqi Zhao, can you explain to me why you overrode the "openshift_prometheus_image_<COMPONENT>" parameters from their defaults?

I am now more concerned that this may be an upstream bug in prometheus or its tsdb configuration.

----

NEED INFO: Why are you using the image name overrides that you chose for each component? Is that just to test the to-be-released tags of each pod container image? I don't think we can necessarily solve this ourselves; we may need to tag in someone from upstream about this.

Comment 7 Junqi Zhao 2017-10-12 00:34:47 UTC
(In reply to Tim Bielawa from comment #6)
> 
> NEED INFO: Why are you using the image name overrides that you chose for
> each component? Is that just to test the to-be-released tags of each pod
> container image? I don't think we can necessarily solve this ourselves; we
> may need to tag in someone from upstream about this.

When we test on OpenStack, we should use images from the brew repo (brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3); when we test on AWS/GCE, we use images from the ops repo (registry.ops.openshift.com/openshift3/). We do not use images from docker.io.

Comment 8 Tim Bielawa 2017-10-12 19:30:06 UTC
I don't think this is a problem with openshift-ansible or the role. The error message is coming from the prometheus application. 

The following containers come up OK:

alert-buffer
alertmanager
alerts-proxy
prom-proxy

However, the 'prometheus' container is not coming up OK:


The application cannot open the TSDB storage. It appears from the source (prometheus/main.go) that when tsdb.Open() is called, the config parameter (`&cfg.tsdb`) does not contain a valid value for the block ranges used when creating a NewLeveledCompactor in the TSDB code.

prometheus/cmd/prometheus/main.go:
> #317 level.Error(logger).Log("msg", "Opening storage failed", "err", err)

prometheus/vendor/github.com/prometheus/tsdb:
> #127 return nil, errors.Errorf("at least one range must be provided")


level=error ts=2017-10-12T19:22:55.547339685Z caller=main.go:317 msg="Opening storage failed" err="create leveled compactor: at least one range must be provided"


What causes this, I am uncertain. I'm going to try to give this to the prometheus team to look at.
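
For reference, here is a minimal Go sketch of the guard that appears to be tripping, based only on the error text and source lines quoted above (the function name and shape are illustrative, not the vendored code verbatim):

package main

import (
    "errors"
    "fmt"
)

// newLeveledCompactor is an illustrative stand-in for the vendored constructor:
// an empty slice of block ranges is rejected, which is what surfaces in the pod log.
func newLeveledCompactor(ranges []int64) error {
    if len(ranges) == 0 {
        return errors.New("at least one range must be provided")
    }
    return nil
}

func main() {
    // If the configured block ranges end up empty, Open() fails as in the logs.
    if err := newLeveledCompactor(nil); err != nil {
        fmt.Println("create leveled compactor:", err)
    }
}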

Comment 14 Paul Gier 2017-10-25 22:11:37 UTC
This was due to the effective storage.tsdb.max-block-duration being set smaller than the storage.tsdb.min-block-duration.

The value of storage.tsdb.max-block-duration defaults to 10% of the retention period. The retention period set in the ansible installer was 6h, so 10% * 6h = 36m, which is smaller than the default value of min-block-duration (2h).

This should already be fixed in the installer by Zohar's recent change which included setting the min-block-duration to 2m: https://github.com/openshift/openshift-ansible/commit/3792787d7e7cc3b8c44ccbbc83a3c2f9a9299f38#diff-3bd4adcc0b4e5e576e296f8c5e52d07bL74
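
To illustrate the arithmetic (a simplified sketch only; the real range derivation lives in the prometheus/tsdb code and may differ in detail, and the growth factor used here is an assumption): ranges grow from min-block-duration and are capped at max-block-duration, so a max of 36m under a min of 2h yields no ranges at all, while the installer's min of 2m yields a non-empty list.

package main

import (
    "fmt"
    "time"
)

// blockRanges is a simplified stand-in for the real range derivation:
// ranges grow exponentially from min and stop once they exceed max.
func blockRanges(min, max time.Duration) []time.Duration {
    var rngs []time.Duration
    for r := min; r <= max; r *= 3 {
        rngs = append(rngs, r)
    }
    return rngs
}

func main() {
    retention := 6 * time.Hour
    maxBlock := retention / 10 // default max: 10% of retention = 36m

    fmt.Println(blockRanges(2*time.Hour, maxBlock))   // [] -> "at least one range must be provided"
    fmt.Println(blockRanges(2*time.Minute, maxBlock)) // [2m0s 6m0s 18m0s] with the installer fix (min=2m)
}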

I'll create an upstream issue to improve the input validation and error message.

Comment 15 Paul Gier 2017-10-26 01:15:03 UTC
Upstream PR: https://github.com/prometheus/prometheus/pull/3354

Comment 16 Junqi Zhao 2017-10-26 01:55:28 UTC
The PR is not merged yet; will verify later.

Comment 17 Paul Gier 2017-10-26 13:10:57 UTC
Just to clarify, the linked upstream PR only improves the error message. The fix is in the ansible installer and was merged last week, so it should be OK to test.
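
As an aside, a minimal sketch of the kind of up-front validation and clearer error message the upstream change aims for (an assumed shape for illustration, not the actual patch):

package main

import (
    "fmt"
    "os"
    "time"
)

// validateBlockDurations rejects a max block duration smaller than the min at
// startup, with an actionable message, instead of failing later with the opaque
// "at least one range must be provided" compactor error.
func validateBlockDurations(min, max time.Duration) error {
    if max < min {
        return fmt.Errorf("storage.tsdb.max-block-duration (%s) must not be smaller than storage.tsdb.min-block-duration (%s)", max, min)
    }
    return nil
}

func main() {
    // The failing combination from this bug: min defaults to 2h, max to 10% of a 6h retention.
    if err := validateBlockDurations(2*time.Hour, 36*time.Minute); err != nil {
        fmt.Fprintln(os.Stderr, "error:", err)
        os.Exit(1)
    }
}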

Comment 19 Junqi Zhao 2017-10-27 12:45:39 UTC
The prometheus pod can be started now. Testing scenarios:
1. with a PV attached
2. without a PV

# oc get po
NAME           READY     STATUS    RESTARTS   AGE
prometheus-0   5/5       Running   0          15m


prometheus/images/v3.7.0-51
prometheus-alert-buffer/images/v3.7.0-48
oauth-proxy/images/v3.7.0-51
prometheus-alertmanager/images/v3.7.0-51

# rpm -qa | grep openshift-ansible
openshift-ansible-filter-plugins-3.7.0-0.181.0.git.0.34f6e3e.el7.noarch
openshift-ansible-playbooks-3.7.0-0.181.0.git.0.34f6e3e.el7.noarch
openshift-ansible-3.7.0-0.181.0.git.0.34f6e3e.el7.noarch
openshift-ansible-callback-plugins-3.7.0-0.181.0.git.0.34f6e3e.el7.noarch
openshift-ansible-roles-3.7.0-0.181.0.git.0.34f6e3e.el7.noarch
openshift-ansible-docs-3.7.0-0.181.0.git.0.34f6e3e.el7.noarch
openshift-ansible-lookup-plugins-3.7.0-0.181.0.git.0.34f6e3e.el7.noarch


# openshift version
openshift v3.7.0-0.181.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Comment 22 errata-xmlrpc 2017-11-28 22:16:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

