Bug 1615732

Summary: prometheus-operator ReplicaSet has timed out progressing
Product: OpenShift Container Platform Reporter: Junqi Zhao <juzhao>
Component: Monitoring Assignee: Frederic Branczyk <fbranczy>
Status: CLOSED ERRATA QA Contact: Junqi Zhao <juzhao>
Severity: high Docs Contact:
Priority: high    
Version: 3.11.0 CC: xtian
Target Milestone: --- Keywords: TestBlocker
Target Release: 3.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-10-11 07:24:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                                          Flags
prometheus-operator pod in CrashLoopBackOff status   none
installation log                                     none

Description Junqi Zhao 2018-08-14 06:54:09 UTC
Created attachment 1475753 [details]
prometheus-operator pod in CrashLoopBackOff status

Description of problem:
After deploying cluster monitoring, the prometheus-operator pod is in CrashLoopBackOff status. This currently blocks cluster monitoring installation.

# kubectl -n openshift-monitoring get pod
NAME                                          READY     STATUS             RESTARTS   AGE
cluster-monitoring-operator-9f7578d96-c2m8p   1/1       Running            0          49m
prometheus-operator-9f6cffdb-vrrtf            0/1       CrashLoopBackOff   13         47m
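
A listing like the one above can be checked automatically. The following is a minimal sketch, not part of the original report: the helper name `unhealthy_pods` is hypothetical, and it assumes the five-column `kubectl get pod` layout shown here, with single-token values in every column.

```python
def unhealthy_pods(listing):
    """Return rows for pods that are not Running or are not fully ready.

    Assumes the column layout shown in this report
    (NAME, READY, STATUS, RESTARTS, AGE) with no spaces inside a value.
    """
    lines = [ln for ln in listing.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    result = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        ready, desired = (int(n) for n in row["READY"].split("/"))
        if row["STATUS"] != "Running" or ready < desired:
            result.append(row)
    return result


# Listing copied from this report.
SAMPLE = """\
NAME                                          READY     STATUS             RESTARTS   AGE
cluster-monitoring-operator-9f7578d96-c2m8p   1/1       Running            0          49m
prometheus-operator-9f6cffdb-vrrtf            0/1       CrashLoopBackOff   13         47m
"""

for pod in unhealthy_pods(SAMPLE):
    print(pod["NAME"], pod["STATUS"], "restarts=" + pod["RESTARTS"])
```

Note that this simple rule would also flag pods that finished normally (for example, status Completed), so it is only suited to namespaces where every pod is expected to keep running.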

# kubectl -n openshift-monitoring get deploy prometheus-operator -o yaml
status:
  conditions:
  - lastTransitionTime: 2018-08-14T05:31:33Z
    lastUpdateTime: 2018-08-14T05:31:33Z
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: 2018-08-14T05:41:34Z
    lastUpdateTime: 2018-08-14T05:41:34Z
    message: ReplicaSet "prometheus-operator-9f6cffdb" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 4
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1
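
The two conditions above are what mark this rollout as failed. As an illustrative sketch (the helper `failing_conditions` is hypothetical and not part of any Kubernetes client), the same check can be applied to the `status` section once it is loaded as a dictionary, for example from `kubectl ... -o json`. It assumes that only the `Available` and `Progressing` condition types matter and that both should be "True" on a healthy Deployment.

```python
# Condition types checked here, with the value expected when healthy.
# Assumption: other condition types are ignored by this sketch.
EXPECTED = {"Available": "True", "Progressing": "True"}


def failing_conditions(status):
    """Return (type, reason) pairs for conditions that are not as expected."""
    failures = []
    for cond in status.get("conditions", []):
        expected = EXPECTED.get(cond.get("type"))
        if expected is not None and cond.get("status") != expected:
            failures.append((cond["type"], cond.get("reason", "")))
    return failures


# Status values copied from this report.
STATUS = {
    "conditions": [
        {
            "type": "Available",
            "status": "False",
            "reason": "MinimumReplicasUnavailable",
            "message": "Deployment does not have minimum availability.",
        },
        {
            "type": "Progressing",
            "status": "False",
            "reason": "ProgressDeadlineExceeded",
            "message": 'ReplicaSet "prometheus-operator-9f6cffdb" has timed out progressing.',
        },
    ],
    "observedGeneration": 4,
    "replicas": 1,
    "unavailableReplicas": 1,
    "updatedReplicas": 1,
}

for cond_type, reason in failing_conditions(STATUS):
    print(cond_type + ": " + reason)
```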

The installation log also showed that the ServiceMonitor CRD was not created.

Version-Release number of selected component (if applicable):
ose-prometheus-operator:v3.11.0-0.14.0.0

How reproducible:
Always

Steps to Reproduce:
1. Deploy cluster monitoring

Actual results:
prometheus-operator pod in CrashLoopBackOff status

Expected results:
prometheus-operator pod should be Running and ready (1/1)

Additional info:
# parameters
openshift_cluster_monitoring_operator_install=true
openshift_cluster_monitoring_operator_node_selector={'role': 'node'}

Comment 1 Junqi Zhao 2018-08-14 06:56:29 UTC
Created attachment 1475755 [details]
installation log

Comment 2 Frederic Branczyk 2018-08-14 09:44:26 UTC
Could you also share the logs of the Prometheus Operator?

Comment 3 Junqi Zhao 2018-08-15 04:18:24 UTC
(In reply to Frederic Branczyk from comment #2)
> Could you also share the logs of the Prometheus Operator?

# kubectl logs prometheus-operator-c7dd5cb69-vc85r
standard_init_linux.go:178: exec user process caused "operation not permitted"

Comment 4 Junqi Zhao 2018-08-15 04:19:52 UTC
This appears to be the same issue as
https://github.com/google/metallb/issues/21

Comment 5 Junqi Zhao 2018-08-15 05:24:35 UTC
# docker ps -a | grep operator
83ea8c727627        6313079d656b                                                                                                                                       "/usr/bin/operator..."   4 minutes ago       Exited (1) 4 minutes ago                       k8s_prometheus-operator_prometheus-operator-c7dd5cb69-vc85r_openshift-monitoring_429902ea-a039-11e8-8c6d-42010af00009_29
68785f345171        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.0-0.14.0                                                                               "/usr/bin/pod"           2 hours ago         Up 2 hours                                     k8s_POD_prometheus-operator-c7dd5cb69-vc85r_openshift-monitoring_429902ea-a039-11e8-8c6d-42010af00009_0
# docker logs 83ea8c727627
standard_init_linux.go:178: exec user process caused "operation not permitted"


# docker version
Client:
 Version:         1.13.1
 API version:     1.26
 Package version: <unknown>
 Go version:      go1.8.3
 Git commit:      774336d/1.13.1
 Built:           Tue Feb 20 13:46:34 2018
 OS/Arch:         linux/amd64

Server:
 Version:         1.13.1
 API version:     1.26 (minimum version 1.12)
 Package version: <unknown>
 Go version:      go1.8.3
 Git commit:      774336d/1.13.1
 Built:           Tue Feb 20 13:46:34 2018
 OS/Arch:         linux/amd64
 Experimental:    false

Comment 6 Frederic Branczyk 2018-08-15 08:55:18 UTC
We just merged https://github.com/openshift/cluster-monitoring-operator/pull/67, so this should be fixed in the next 3.11 build.

Comment 7 Junqi Zhao 2018-08-16 04:27:40 UTC
The CrashLoopBackOff issue is fixed, but the kube-state-metrics pod, service, deployment, and replicaset are not created; that defect is tracked in Bug 1617695.

Comment 8 Junqi Zhao 2018-08-20 05:56:33 UTC
Issue is fixed in ose-prometheus-operator-v3.11.0-0.17.0.0


# openshift version
openshift v3.11.0-0.17.0
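
To check whether another environment is at or beyond the build verified here, the version strings can be compared numerically. This is a rough sketch with hypothetical helper names (`build_tuple`, `at_least`). It assumes that every run of digits is an ordered version component and that a missing trailing component counts as zero, which is a guess about the tag scheme rather than documented behavior.

```python
import re


def build_tuple(version):
    """Turn a string such as 'v3.11.0-0.17.0.0' into a tuple of integers."""
    return tuple(int(n) for n in re.findall(r"\d+", version))


def at_least(installed, verified):
    """Return True if `installed` is the same build as `verified` or newer.

    Shorter versions are padded with zeros before comparing (an assumption).
    """
    a, b = build_tuple(installed), build_tuple(verified)
    width = max(len(a), len(b))
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


VERIFIED_FIXED = "v3.11.0-0.17.0.0"  # build verified in this comment

print(at_least("v3.11.0-0.17.0", VERIFIED_FIXED))    # cluster version from this comment
print(at_least("v3.11.0-0.14.0.0", VERIFIED_FIXED))  # build from the original report
```

A False result only means the build predates the one verified here; this report does not state which intermediate build first contained the fix.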

Comment 10 errata-xmlrpc 2018-10-11 07:24:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652