Description of problem:
When installing Prometheus in a multi-zone cluster and using the cloud provider storage class, the PVCs that are created may land in different zones. Because all of the Prometheus containers are currently in the same pod, OpenShift cannot schedule the pod.

Version-Release number of selected component (if applicable):
v3.9.14

How reproducible:
Deploy Prometheus on a multi-zone cluster; the bug will almost certainly be reproduced.

Actual results:
The pod is not deployed.

Expected results:
If each component of the template that uses storage had its own pod, the deployment would succeed.

Additional info:
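With an Immediate-binding storage class, each of the three PVCs is provisioned and pinned to an availability zone before the pod is scheduled, so the volumes can end up in zones that no single node can satisfy. One general way to keep dynamically provisioned EBS volumes co-located is a StorageClass pinned to a single zone; the sketch below is only illustrative (the class name and zone are made up, and this is not necessarily the workaround referenced later in this bug):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-single-zone            # illustrative name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zone: us-east-1d                 # pin every volume from this class to one AZ
reclaimPolicy: Delete

The prometheus, prometheus-alertmanager and prometheus-alertbuffer PVCs would then all have to request this class via storageClassName.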
In 3.11 the cluster-monitoring stack deploys all components as separate Pods, so this will not be an issue anymore.
Issue is not fixed: the prometheus and prometheus-alertmanager PVs are in the same zone, but prometheus-alertbuffer is in another zone.

Prometheus image version: v3.11.0-0.25.0
openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch

# oc get pod -n openshift-metrics
NAME                             READY     STATUS    RESTARTS   AGE
prometheus-0                     0/6       Pending   0          16m
prometheus-node-exporter-6bndx   1/1       Running   0          16m
prometheus-node-exporter-lnsx9   1/1       Running   0          16m
prometheus-node-exporter-m78zx   1/1       Running   0          16m
prometheus-node-exporter-mlzcc   1/1       Running   0          16m
prometheus-node-exporter-v74tk   1/1       Running   0          16m

# oc describe po prometheus-0 -n openshift-metrics
************************snip**************************************************
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  2m (x484 over 17m)  default-scheduler  0/5 nodes are available: 1 node(s) didn't match node selector, 4 node(s) had no available volume zone.
************************snip**************************************************

# oc get pv | grep prometheus
pvc-d25f0e84-ab76-11e8-9515-0ede6b3c22da   10Gi   RWO   Delete   Bound   openshift-metrics/prometheus                gp2   18m
pvc-d43e82be-ab76-11e8-9515-0ede6b3c22da   10Gi   RWO   Delete   Bound   openshift-metrics/prometheus-alertmanager   gp2   18m
pvc-d66a1d85-ab76-11e8-9515-0ede6b3c22da   10Gi   RWO   Delete   Bound   openshift-metrics/prometheus-alertbuffer    gp2   18m

# oc get pv pvc-d25f0e84-ab76-11e8-9515-0ede6b3c22da -o yaml
************************snip**************************************************
  labels:
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1d
************************snip**************************************************

# oc get pv pvc-d43e82be-ab76-11e8-9515-0ede6b3c22da -o yaml
************************snip**************************************************
  labels:
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1d
************************snip**************************************************

# oc get pv pvc-d66a1d85-ab76-11e8-9515-0ede6b3c22da -o yaml
************************snip**************************************************
  labels:
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1c
************************snip**************************************************

# oc get sc
NAME            PROVISIONER             AGE
gp2 (default)   kubernetes.io/aws-ebs   2h

# oc get sc gp2 -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
  creationTimestamp: 2018-08-29T08:17:50Z
  name: gp2
  resourceVersion: "2075"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/gp2
  uid: 04ecf704-ab64-11e8-9515-0ede6b3c22da
parameters:
  encrypted: "false"
  kmsKeyId: ""
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: Immediate
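Note that the gp2 class above uses volumeBindingMode: Immediate, so each PV is provisioned in a zone chosen at PVC creation time, before prometheus-0 is ever scheduled. On Kubernetes 1.10+ (so in 3.11, but not in the 3.9 release this bug was originally filed against), the WaitForFirstConsumer binding mode defers provisioning until the consuming pod is scheduled, which keeps all three volumes in that pod's zone. A sketch of such a class, assuming the same AWS EBS provisioner (the class name is illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-wffc                            # illustrative name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer     # provision only once the consuming pod is scheduled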
For a workaround, please see https://bugzilla.redhat.com/show_bug.cgi?id=1554921#c20
I'm marking this as WONTFIX: we are not going to fix this for the deprecated Tech Preview stack, but it is going to be solved in the new Prometheus-based cluster monitoring stack.