Bug 1615732 - prometheus-operator ReplicaSet has timed out progressing
Summary: prometheus-operator ReplicaSet has timed out progressing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-08-14 06:54 UTC by Junqi Zhao
Modified: 2018-10-11 07:25 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-11 07:24:39 UTC
Target Upstream Version:
Embargoed:


Attachments
prometheus-operator pod in CrashLoopBackOff status (9.27 KB, text/plain)
2018-08-14 06:54 UTC, Junqi Zhao
installation log (417.49 KB, text/plain)
2018-08-14 06:56 UTC, Junqi Zhao


Links
System                   ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata   RHBA-2018:2652  0        None      None    None     2018-10-11 07:25:07 UTC

Description Junqi Zhao 2018-08-14 06:54:09 UTC
Created attachment 1475753 [details]
prometheus-operator pod in CrashLoopBackOff status

Description of problem:
After deploying cluster monitoring, the prometheus-operator pod is in CrashLoopBackOff status. This currently blocks the cluster monitoring installation.

# kubectl -n openshift-monitoring get pod
NAME                                          READY     STATUS             RESTARTS   AGE
cluster-monitoring-operator-9f7578d96-c2m8p   1/1       Running            0          49m
prometheus-operator-9f6cffdb-vrrtf            0/1       CrashLoopBackOff   13         47m
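
The pod listing above can also be checked mechanically. The following is a minimal Python sketch, assuming the whitespace-separated column layout shown in this report (NAME, READY, STATUS, RESTARTS, AGE); newer kubectl versions may print extra text in the RESTARTS column, which this sketch does not handle. The helper name unhealthy_pods is hypothetical, and the sample text is copied from the output above.

```python
# Sample output copied from this report; in practice it would come from
# `kubectl -n openshift-monitoring get pod`.
POD_LISTING = """\
NAME                                          READY     STATUS             RESTARTS   AGE
cluster-monitoring-operator-9f7578d96-c2m8p   1/1       Running            0          49m
prometheus-operator-9f6cffdb-vrrtf            0/1       CrashLoopBackOff   13         47m
"""


def unhealthy_pods(listing):
    """Return (name, status, restarts) for pods that are not Running or not fully ready.

    Hypothetical helper: assumes one pod per line and no spaces inside a column.
    """
    lines = [ln for ln in listing.splitlines() if ln.strip()]
    header = lines[0].split()
    bad = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        ready, total = row["READY"].split("/")
        if row["STATUS"] != "Running" or ready != total:
            bad.append((row["NAME"], row["STATUS"], int(row["RESTARTS"])))
    return bad


if __name__ == "__main__":
    for name, status, restarts in unhealthy_pods(POD_LISTING):
        print(f"{name}: {status} after {restarts} restarts")
```

Note that this simple check would also flag pods in the Completed state, which may be acceptable for jobs; it is meant only as a quick filter for output like the one above.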

# kubectl -n openshift-monitoring get deploy prometheus-operator -o yaml
status:
  conditions:
  - lastTransitionTime: 2018-08-14T05:31:33Z
    lastUpdateTime: 2018-08-14T05:31:33Z
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: 2018-08-14T05:41:34Z
    lastUpdateTime: 2018-08-14T05:41:34Z
    message: ReplicaSet "prometheus-operator-9f6cffdb" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 4
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1
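
The two conditions above are what mark this rollout as failed: Available is "False", and Progressing is "False" with reason ProgressDeadlineExceeded. A minimal Python sketch of that check follows. It assumes the status block has already been loaded into a dict (for example from `-o json` output); the STATUS literal mirrors the values in this report, and the function names are hypothetical.

```python
# Status block transcribed from the deployment output in this report.
STATUS = {
    "conditions": [
        {
            "type": "Available",
            "status": "False",
            "reason": "MinimumReplicasUnavailable",
            "message": "Deployment does not have minimum availability.",
        },
        {
            "type": "Progressing",
            "status": "False",
            "reason": "ProgressDeadlineExceeded",
            "message": 'ReplicaSet "prometheus-operator-9f6cffdb" has timed out progressing.',
        },
    ],
    "observedGeneration": 4,
    "replicas": 1,
    "unavailableReplicas": 1,
    "updatedReplicas": 1,
}


def rollout_timed_out(status):
    """True if the Progressing condition reports ProgressDeadlineExceeded."""
    return any(
        c.get("type") == "Progressing"
        and c.get("status") == "False"
        and c.get("reason") == "ProgressDeadlineExceeded"
        for c in status.get("conditions", [])
    )


def failing_conditions(status):
    """Return (type, reason) for every condition whose status is "False"."""
    return [
        (c["type"], c["reason"])
        for c in status.get("conditions", [])
        if c.get("status") == "False"
    ]


if __name__ == "__main__":
    print("timed out:", rollout_timed_out(STATUS))
    for cond_type, reason in failing_conditions(STATUS):
        print(f"{cond_type}: {reason}")
```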

The installation log also shows that the ServiceMonitor CRD was not created.

Version-Release number of selected component (if applicable):
ose-prometheus-operator:v3.11.0-0.14.0.0

How reproducible:
Always

Steps to Reproduce:
1. Deploy cluster monitoring

Actual results:
The prometheus-operator pod is in CrashLoopBackOff status.

Expected results:
The prometheus-operator pod should be running and ready.

Additional info:
# parameters
openshift_cluster_monitoring_operator_install=true
openshift_cluster_monitoring_operator_node_selector={'role': 'node'}

Comment 1 Junqi Zhao 2018-08-14 06:56:29 UTC
Created attachment 1475755 [details]
installation log

Comment 2 Frederic Branczyk 2018-08-14 09:44:26 UTC
Could you also share the logs of the Prometheus Operator?

Comment 3 Junqi Zhao 2018-08-15 04:18:24 UTC
(In reply to Frederic Branczyk from comment #2)
> Could you also share the logs of the Prometheus Operator?

# kubectl logs prometheus-operator-c7dd5cb69-vc85r
standard_init_linux.go:178: exec user process caused "operation not permitted"

Comment 4 Junqi Zhao 2018-08-15 04:19:52 UTC
This appears to be the same issue as
https://github.com/google/metallb/issues/21

Comment 5 Junqi Zhao 2018-08-15 05:24:35 UTC
# docker ps -a | grep operator
83ea8c727627        6313079d656b                                                                                                                                       "/usr/bin/operator..."   4 minutes ago       Exited (1) 4 minutes ago                       k8s_prometheus-operator_prometheus-operator-c7dd5cb69-vc85r_openshift-monitoring_429902ea-a039-11e8-8c6d-42010af00009_29
68785f345171        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.0-0.14.0                                                                               "/usr/bin/pod"           2 hours ago         Up 2 hours                                     k8s_POD_prometheus-operator-c7dd5cb69-vc85r_openshift-monitoring_429902ea-a039-11e8-8c6d-42010af00009_0
# docker logs 83ea8c727627
standard_init_linux.go:178: exec user process caused "operation not permitted"


# docker version
Client:
 Version:         1.13.1
 API version:     1.26
 Package version: <unknown>
 Go version:      go1.8.3
 Git commit:      774336d/1.13.1
 Built:           Tue Feb 20 13:46:34 2018
 OS/Arch:         linux/amd64

Server:
 Version:         1.13.1
 API version:     1.26 (minimum version 1.12)
 Package version: <unknown>
 Go version:      go1.8.3
 Git commit:      774336d/1.13.1
 Built:           Tue Feb 20 13:46:34 2018
 OS/Arch:         linux/amd64
 Experimental:    false

Comment 6 Frederic Branczyk 2018-08-15 08:55:18 UTC
We just merged https://github.com/openshift/cluster-monitoring-operator/pull/67, so this should be fixed in the next 3.11 build.

Comment 7 Junqi Zhao 2018-08-16 04:27:40 UTC
The issue is fixed by that change, but the kube-state-metrics pod, service, deployment, and replicaset are not created; that defect is tracked in Bug 1617695.

Comment 8 Junqi Zhao 2018-08-20 05:56:33 UTC
The issue is fixed in ose-prometheus-operator-v3.11.0-0.17.0.0.


# openshift version
openshift v3.11.0-0.17.0
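
To check whether a given build already carries the fix, the build tags can be compared numerically. This is a rough Python sketch under a loud assumption: that tags such as v3.11.0-0.14.0.0 and v3.11.0-0.17.0.0 order correctly when their numeric components are compared left to right. That ordering is inferred from the tags quoted in this report, not from any documented versioning rule. The helper names build_key and has_fix are hypothetical.

```python
import re

# First build reported as fixed in this bug.
FIXED_IN = "v3.11.0-0.17.0.0"


def build_key(tag, width=7):
    """Turn a tag like 'v3.11.0-0.17.0.0' into a tuple of ints.

    Shorter tags (e.g. 'v3.11.0-0.17.0') are padded with zeros so that
    tuples of equal length are compared. Assumes the only digits in the
    string belong to the version.
    """
    nums = [int(n) for n in re.findall(r"\d+", tag)]
    return tuple(nums + [0] * (width - len(nums)))


def has_fix(tag):
    """True if the tag is at or after the first fixed build (assumed ordering)."""
    return build_key(tag) >= build_key(FIXED_IN)


if __name__ == "__main__":
    for tag in ("v3.11.0-0.14.0.0", "v3.11.0-0.17.0.0", "v3.11.0-0.17.0"):
        print(tag, "has fix:", has_fix(tag))
```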

Comment 10 errata-xmlrpc 2018-10-11 07:24:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

