Created attachment 1337047 [details]
pod log

Description of problem:
Prometheus pod in CrashLoopBackOff status

# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
prometheus-1095623639-nrqvw   4/5       CrashLoopBackOff   10         31m

# oc logs prometheus-1095623639-nrqvw -c prometheus
level=info ts=2017-10-11T08:09:31.322830654Z caller=main.go:214 msg="Starting prometheus" version="(version=2.0.0-beta.5, branch=non-git, revision=non-git)"
level=info ts=2017-10-11T08:09:31.323310641Z caller=main.go:215 build_context="(go=go1.8.3, user=mockbuild.eng.bos.redhat.com, date=20171003-17:40:48)"
level=info ts=2017-10-11T08:09:31.32333094Z caller=main.go:216 host_details="(Linux 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017 x86_64 prometheus-1095623639-nrqvw (none))"
level=info ts=2017-10-11T08:09:31.326497919Z caller=web.go:367 component=web msg="Start listening for connections" address=localhost:9090
level=info ts=2017-10-11T08:09:31.326350719Z caller=main.go:308 msg="Starting TSDB"
level=info ts=2017-10-11T08:09:31.326365309Z caller=targetmanager.go:68 component="target manager" msg="Starting target manager..."
level=error ts=2017-10-11T08:09:31.340600313Z caller=main.go:317 msg="Opening storage failed" err="create leveled compactor: at least one range must be provided"

Version-Release number of the following components:
prometheus:v3.7.0-26
prometheus-alert-buffer:v3.7.0-26
prometheus-alertmanager:v3.7.0-26
oauth-proxy:v3.7.0-27

How reproducible:
Always

Steps to Reproduce:
1. Deploy prometheus via ansible; see the [Additional info] section for the inventory file.

Actual results:
Prometheus pod is in CrashLoopBackOff status; the prometheus container fails to start up.

Expected results:
The Prometheus pod should be healthy.

Additional info:
# Inventory file
[OSEv3:children]
masters
etcd
nfs

[masters]
${MASTER} openshift_public_hostname=${MASTER}

[etcd]
${ETCD} openshift_public_hostname=${ETCD}

[nfs]
${NFS} openshift_public_hostname=${NFS}

[OSEv3:vars]
ansible_ssh_user=root
ansible_ssh_private_key_file="~/libra.pem"
deployment_type=openshift-enterprise
openshift_docker_additional_registries=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888

# prometheus
openshift_prometheus_state=present
openshift_prometheus_namespace=prometheus
openshift_prometheus_replicas=1
openshift_prometheus_node_selector={'role': 'node'}
openshift_prometheus_image_proxy=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/oauth-proxy:v3.7
openshift_prometheus_image_prometheus=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus:v3.7
openshift_prometheus_image_alertmanager=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus-alertmanager:v3.7
openshift_prometheus_image_alertbuffer=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus-alert-buffer:v3.7
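The deploy was driven via openshift-ansible with the inventory above; a typical invocation (the playbook path here is an assumption based on the 3.7-era openshift-ansible layout and may vary by version) would be:

# ansible-playbook -i hosts playbooks/byo/openshift-cluster/openshift-prometheus.yml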
This blocks all Prometheus testing.
Created attachment 1337049 [details]
pod log
I can reproduce this:

level=info ts=2017-10-11T17:20:54.739605775Z caller=main.go:214 msg="Starting prometheus" version="(version=2.0.0-beta.5, branch=non-git, revision=non-git)"
level=info ts=2017-10-11T17:20:54.739738561Z caller=main.go:215 build_context="(go=go1.8.3, user=mockbuild.eng.bos.redhat.com, date=20171003-17:40:48)"
level=info ts=2017-10-11T17:20:54.739758868Z caller=main.go:216 host_details="(Linux 3.10.0-693.2.1.el7.x86_64 #1 SMP Fri Aug 11 04:58:43 EDT 2017 x86_64 prometheus-3537460379-q44s5 (none))"
level=info ts=2017-10-11T17:20:54.743760222Z caller=web.go:367 component=web msg="Start listening for connections" address=localhost:9090
level=info ts=2017-10-11T17:20:54.74381493Z caller=main.go:308 msg="Starting TSDB"
level=info ts=2017-10-11T17:20:54.744095515Z caller=targetmanager.go:68 component="target manager" msg="Starting target manager..."
level=error ts=2017-10-11T17:20:54.767331121Z caller=main.go:317 msg="Opening storage failed" err="create leveled compactor: at least one range must be provided"

I am going to see what happens if I can get older images to launch correctly.
Error reproduced using images with tag 'latest' as well. 4/5 pods come up; Prometheus still fails with the message:

> err="create leveled compactor: at least one range must be provided"

Everything comes up when I comment out the "openshift_prometheus_image_<COMPONENT>" lines in my inventory:

[root@m01 ~]# oc get pods
NAME                         READY     STATUS    RESTARTS   AGE
prometheus-662406382-ngr9k   5/5       Running   0          3m

And to see their running images:

[root@m01 ~]# oc describe pod prometheus-662406382-ngr9k | grep Image
    Image:     openshift/oauth-proxy:v1.0.0
    Image ID:  docker-pullable://docker.io/openshift/oauth-proxy@sha256:48191b4beb0abcb6b81cee329133f7aa4c6e9d2cfca2ce55ef68b817d669f337
    Image:     openshift/prometheus:v2.0.0-dev
    Image ID:  docker-pullable://docker.io/openshift/prometheus@sha256:856ae17355bf635aa8a741fa717a3d1162df961bd0d245a1adc07489be886f52
    Image:     openshift/oauth-proxy:v1.0.0
    Image ID:  docker-pullable://docker.io/openshift/oauth-proxy@sha256:48191b4beb0abcb6b81cee329133f7aa4c6e9d2cfca2ce55ef68b817d669f337
    Image:     openshift/prometheus-alert-buffer:v0.0.1
    Image ID:  docker-pullable://docker.io/openshift/prometheus-alert-buffer@sha256:07ab354604db7d96ec1015e9ffdfedacba153215bbd0667aed8ef2d223a79833
    Image:     openshift/prometheus-alertmanager:dev
    Image ID:  docker-pullable://docker.io/openshift/prometheus-alertmanager@sha256:b965ec4cf9e5884e5681e7e58888e3b35a5c4ffa96932c7246faa1e998eddbf6

Of course, I am not certain these are the latest images we expect users to be consuming when OCP 3.7 is released.
I tried the following configuration as well:

openshift_prometheus_image_proxy=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/oauth-proxy:v3.7
# openshift_prometheus_image_prometheus=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus:v3.7
openshift_prometheus_image_alertmanager=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus-alertmanager:v3.7
openshift_prometheus_image_alertbuffer=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus-alert-buffer:v3.7

Notice that 'openshift_prometheus_image_prometheus' is commented out. It falls back to this image:

    Image:     openshift/prometheus:v2.0.0-dev
    Image ID:  docker-pullable://docker.io/openshift/prometheus@sha256:856ae17355bf635aa8a741fa717a3d1162df961bd0d245a1adc07489be886f52

I am wondering now if something broke in the upstream image. There were changes committed recently to the prometheus/tsdb and prometheus/prometheus repos which changed code related to the original error (wrt NewLeveledCompactor).
(In reply to Tim Bielawa from comment #5)
> I tried the following configuration as well:
>
> ...

I forgot to note that with the configuration I tried there, the prometheus services all came up: 5/5 containers for the prometheus pod.

@Junqi Zhao, can you explain why you overrode the "openshift_prometheus_image_<COMPONENT>" parameters from their defaults? I have stronger concerns now that this may be a bug upstream with prometheus or their tsdb configuration.

----

NEED INFO: Why are you using the image name overrides that you chose for each component? Is that just to test the to-release tags of each pod container image? I don't think we can necessarily solve this ourselves; we may need to tag in someone from upstream about this.
(In reply to Tim Bielawa from comment #6)
>
> NEED INFO: Why are you using the image name overrides that you chose for
> each component? Is that just to test the to-release tags of each pod
> container image? I don't think we can necessarily solve this ourselves, we
> may need to tag in someone from upstream about this.

When we test on OpenStack, we should use images from the brew repo (brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3), and images from the ops repo (registry.ops.openshift.com/openshift3/) when we test on AWS/GCE. We do not use images from docker.io.
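For example (illustrative only, mirroring the brew-repo overrides in the original inventory), an AWS/GCE run would point the same variables at the ops registry:

openshift_prometheus_image_proxy=registry.ops.openshift.com/openshift3/oauth-proxy:v3.7
openshift_prometheus_image_prometheus=registry.ops.openshift.com/openshift3/prometheus:v3.7
openshift_prometheus_image_alertmanager=registry.ops.openshift.com/openshift3/prometheus-alertmanager:v3.7
openshift_prometheus_image_alertbuffer=registry.ops.openshift.com/openshift3/prometheus-alert-buffer:v3.7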
I don't think this is a problem with openshift-ansible or the role. The error message is coming from the prometheus application.

The following containers come up OK:

alert-buffer
alertmanager
alerts-proxy
prom-proxy

However, the 'prometheus' container is not coming up OK: the application cannot open the TSDB storage.

It appears from the source (prometheus/main.go) that when the tsdb.Open() function is called, the cfg (`&cfg.tsdb`) parameter does not have a valid value for the range parameter, which is used when creating a NewLeveledCompactor in the TSDB code.

prometheus/cmd/prometheus/main.go:
> #317 level.Error(logger).Log("msg", "Opening storage failed", "err", err)

prometheus/vendor/github.com/prometheus/tsdb:
> #127 return nil, errors.Errorf("at least one range must be provided")

level=error ts=2017-10-12T19:22:55.547339685Z caller=main.go:317 msg="Opening storage failed" err="create leveled compactor: at least one range must be provided"

What causes this, I am uncertain. I'm going to try to give this to the prometheus team to look at.
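For reference, a minimal runnable sketch of that guard (a paraphrase of the tsdb check quoted above, not the verbatim upstream code):

package main

import (
	"errors"
	"fmt"
)

// newLeveledCompactor paraphrases the guard quoted above from
// prometheus/tsdb: given an empty slice of compaction ranges,
// construction fails with the error seen in the pod log.
func newLeveledCompactor(ranges []int64) error {
	if len(ranges) == 0 {
		return errors.New("create leveled compactor: at least one range must be provided")
	}
	return nil
}

func main() {
	// tsdb.Open ends up passing an empty ranges slice here.
	fmt.Println(newLeveledCompactor(nil))
}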
This was due to the effective storage.tsdb.max-block-duration being set smaller than storage.tsdb.min-block-duration.

The value of storage.tsdb.max-block-duration defaults to 10% of the retention period. The retention period set in the ansible installer was 6h, so 10% * 6h = 36m, which is smaller than the default value of min-block-duration (2h).

This should already be fixed in the installer by Zohar's recent change, which included setting the min-block-duration to 2m:

https://github.com/openshift/openshift-ansible/commit/3792787d7e7cc3b8c44ccbbc83a3c2f9a9299f38#diff-3bd4adcc0b4e5e576e296f8c5e52d07bL74

I'll create an upstream issue to improve the input validation and error message.
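To make the arithmetic concrete, here is a small self-contained Go sketch of the block-range computation (a reconstruction based on the description above, not the verbatim Prometheus code; the x3 step is an assumption matching upstream's exponential ranges):

package main

import (
	"fmt"
	"time"
)

// blockRanges reconstructs the idea behind the exponential block ranges:
// start at the minimum block duration, grow each step by a fixed
// multiplier, and keep only ranges that fit under the maximum.
func blockRanges(min, max time.Duration, steps int) []time.Duration {
	var kept []time.Duration
	cur := min
	for i := 0; i < steps; i++ {
		if cur <= max {
			kept = append(kept, cur)
		}
		cur *= 3 // assumed exponential step (x3 per level)
	}
	return kept
}

func main() {
	retention := 6 * time.Hour
	maxBlock := retention / 10 // max defaults to 10% of retention = 36m

	// Broken case: default min (2h) > effective max (36m) -> zero ranges,
	// which is exactly "at least one range must be provided".
	fmt.Println(blockRanges(2*time.Hour, maxBlock, 10)) // []

	// Installer fix: min-block-duration=2m -> usable ranges again.
	fmt.Println(blockRanges(2*time.Minute, maxBlock, 10)) // [2m0s 6m0s 18m0s]
}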
Upstream PR: https://github.com/prometheus/prometheus/pull/3354
The PR is not merged yet; will verify later.
Just to clarify, the linked upstream PR only improves the error message. The actual fix is in the ansible installer and was merged last week, so it should be OK to test.
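One quick way to confirm the installer fix landed is to inspect the deployed prometheus container arguments (assuming the resource name the role creates is 'prometheus'; adjust as needed):

# oc -n prometheus get statefulset prometheus -o yaml | grep 'storage.tsdb'

The args should include --storage.tsdb.min-block-duration=2m, per the installer change linked above.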
The prometheus pod can be started up now. Testing scenarios:

1. with PV attached
2. without PV

# oc get po
NAME           READY     STATUS    RESTARTS   AGE
prometheus-0   5/5       Running   0          15m

Images:
prometheus/images/v3.7.0-51
prometheus-alert-buffer/images/v3.7.0-48
oauth-proxy/images/v3.7.0-51
prometheus-alertmanager/images/v3.7.0-51

# rpm -qa | grep openshift-ansible
openshift-ansible-filter-plugins-3.7.0-0.181.0.git.0.34f6e3e.el7.noarch
openshift-ansible-playbooks-3.7.0-0.181.0.git.0.34f6e3e.el7.noarch
openshift-ansible-3.7.0-0.181.0.git.0.34f6e3e.el7.noarch
openshift-ansible-callback-plugins-3.7.0-0.181.0.git.0.34f6e3e.el7.noarch
openshift-ansible-roles-3.7.0-0.181.0.git.0.34f6e3e.el7.noarch
openshift-ansible-docs-3.7.0-0.181.0.git.0.34f6e3e.el7.noarch
openshift-ansible-lookup-plugins-3.7.0-0.181.0.git.0.34f6e3e.el7.noarch

# openshift version
openshift v3.7.0-0.181.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188