Description of problem:
When installing Prometheus in a multi-zone cluster and using the cloud provider storage class, the PVCs that are created may land in different zones. Because all of the Prometheus containers are currently in the same pod, OpenShift cannot schedule the pod.

Version-Release number of selected component (if applicable):
v3.9.14

How reproducible:
Deploy Prometheus on a multi-zone cluster; the bug will almost certainly be reproduced.

Actual results:
The pod is not deployed.

Expected results:
If each component of the template that uses storage had its own pod, the deployment would succeed.

Additional info:
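With an Immediate-binding storage class, each of the three PVCs is provisioned and pinned to an availability zone before the pod is scheduled, so the volumes can end up in zones that no single node can satisfy. One general way to keep dynamically provisioned EBS volumes co-located is a StorageClass pinned to a single zone; the sketch below is only illustrative (the class name and zone are made up, and this is not necessarily the workaround referenced later in this bug):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-single-zone            # illustrative name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zone: us-east-1d                 # pin every volume from this class to one AZ
reclaimPolicy: Delete

The prometheus, prometheus-alertmanager and prometheus-alertbuffer PVCs would then all have to request this class via storageClassName.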
In 3.11 the cluster-monitoring stack deploys all components as separate Pods, so this will not be an issue anymore.
Issue is not fixed: the prometheus and prometheus-alertmanager PVs are in the same zone, but prometheus-alertbuffer is in another zone.

Prometheus image version: v3.11.0-0.25.0
openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch

# oc get pod -n openshift-metrics
NAME                             READY     STATUS    RESTARTS   AGE
prometheus-0                     0/6       Pending   0          16m
prometheus-node-exporter-6bndx   1/1       Running   0          16m
prometheus-node-exporter-lnsx9   1/1       Running   0          16m
prometheus-node-exporter-m78zx   1/1       Running   0          16m
prometheus-node-exporter-mlzcc   1/1       Running   0          16m
prometheus-node-exporter-v74tk   1/1       Running   0          16m

# oc describe po prometheus-0 -n openshift-metrics
************************snip**************************************************
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  2m (x484 over 17m)  default-scheduler  0/5 nodes are available: 1 node(s) didn't match node selector, 4 node(s) had no available volume zone.
************************snip**************************************************

# oc get pv | grep prometheus
pvc-d25f0e84-ab76-11e8-9515-0ede6b3c22da   10Gi   RWO   Delete   Bound   openshift-metrics/prometheus                gp2   18m
pvc-d43e82be-ab76-11e8-9515-0ede6b3c22da   10Gi   RWO   Delete   Bound   openshift-metrics/prometheus-alertmanager   gp2   18m
pvc-d66a1d85-ab76-11e8-9515-0ede6b3c22da   10Gi   RWO   Delete   Bound   openshift-metrics/prometheus-alertbuffer    gp2   18m

# oc get pv pvc-d25f0e84-ab76-11e8-9515-0ede6b3c22da -o yaml
************************snip**************************************************
  labels:
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1d
************************snip**************************************************

# oc get pv pvc-d43e82be-ab76-11e8-9515-0ede6b3c22da -o yaml
************************snip**************************************************
  labels:
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1d
************************snip**************************************************

# oc get pv pvc-d66a1d85-ab76-11e8-9515-0ede6b3c22da -o yaml
************************snip**************************************************
  labels:
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1c
************************snip**************************************************

# oc get sc
NAME            PROVISIONER             AGE
gp2 (default)   kubernetes.io/aws-ebs   2h

# oc get sc gp2 -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
  creationTimestamp: 2018-08-29T08:17:50Z
  name: gp2
  resourceVersion: "2075"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/gp2
  uid: 04ecf704-ab64-11e8-9515-0ede6b3c22da
parameters:
  encrypted: "false"
  kmsKeyId: ""
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: Immediate
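Note that the gp2 class above uses volumeBindingMode: Immediate, so each PV is provisioned in a zone chosen at PVC creation time, before prometheus-0 is ever scheduled. On Kubernetes 1.10+ (so in 3.11, but not in the 3.9 release this bug was originally filed against), the WaitForFirstConsumer binding mode defers provisioning until the consuming pod is scheduled, which keeps all three volumes in that pod's zone. A sketch of such a class, assuming the same AWS EBS provisioner (the class name is illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-wffc                            # illustrative name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer     # provision only once the consuming pod is scheduled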
For a workaround, please see https://bugzilla.redhat.com/show_bug.cgi?id=1554921#c20
I'm marking this as WONTFIX: we are not going to fix this for the deprecated Tech Preview stack, but it is going to be solved in the new Prometheus-based cluster monitoring stack.