Bug 1579607 - Prometheus pod failed to start up if mounted multiple PVs are in multizone cluster
Keywords:
Status: CLOSED DUPLICATE of bug 1554921
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.0
Assignee: Paul Gier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-18 02:39 UTC by Junqi Zhao
Modified: 2018-08-23 16:56 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-23 16:56:49 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
pv info (4.30 KB, text/plain)
2018-05-18 02:39 UTC, Junqi Zhao


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1554921 0 medium CLOSED prometheus deployment fails in OCP3.7 on AWS platform with EBS storage 2023-09-07 19:05:21 UTC

Internal Links: 1554921

Description Junqi Zhao 2018-05-18 02:39:01 UTC
Created attachment 1438279 [details]
pv info

Description of problem:
Deploy prometheus 3.10 with dynamic PVs in an LB (multizone) environment; the prometheus pod failed to start up (prometheus-0 uses three PVs: prometheus, prometheus-alertmanager, prometheus-alertbuffer).

# oc get po -n openshift-metrics
NAME                             READY     STATUS    RESTARTS   AGE
prometheus-0                     0/6       Pending   0          2m
prometheus-node-exporter-27vsc   1/1       Running   0          2m
prometheus-node-exporter-9n7rx   1/1       Running   0          2m
prometheus-node-exporter-h5dng   1/1       Running   0          2m
prometheus-node-exporter-pkwr4   1/1       Running   0          2m
prometheus-node-exporter-q7k85   1/1       Running   0          2m
prometheus-node-exporter-shs7d   1/1       Running   0          2m
prometheus-node-exporter-wjjdp   1/1       Running   0          2m

Error: node(s) had no available volume zone
# oc describe po prometheus-0 -n openshift-metrics
************************snip****************************************************
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  26s (x15 over 3m)  default-scheduler  0/7 nodes are available: 7 node(s) had no available volume zone.

From the following, we can see that two PVs are in the us-east-1d zone and the third is in us-east-1c; this zone mismatch is why the prometheus pod failed to start up.
# oc get pv | grep prometheus
pvc-b23b072c-5a3d-11e8-bdf1-0ec461d5fa72   10Gi       RWO            Delete           Bound     openshift-metrics/prometheus                gp2                      4m
pvc-b356f89c-5a3d-11e8-bdf1-0ec461d5fa72   10Gi       RWO            Delete           Bound     openshift-metrics/prometheus-alertmanager   gp2                      4m
pvc-b46f6292-5a3d-11e8-a85e-0ede90de0f32   10Gi       RWO            Delete           Bound     openshift-metrics/prometheus-alertbuffer    gp2                      4m

# oc get pv pvc-b23b072c-5a3d-11e8-bdf1-0ec461d5fa72 -o yaml
************************snip****************************************************
  labels:
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1d

# oc get pv pvc-b356f89c-5a3d-11e8-bdf1-0ec461d5fa72 -o yaml
************************snip****************************************************
  labels:
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1d

# oc get pv pvc-b46f6292-5a3d-11e8-a85e-0ede90de0f32 -o yaml
************************snip****************************************************
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1c
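
The scheduling failure follows directly from the labels above: a pod is placed on exactly one node in one zone, but EBS volumes can only attach to instances in their own zone, so no node can mount all three PVs. A minimal shell sketch of that consistency check, using the zone values reported above as hard-coded sample data (on a live cluster one could instead read the label via `oc get pv <name> -o jsonpath=...`):

```shell
# Zone labels of the three prometheus PVs, copied from the output above
# (sample data for illustration, not queried live).
zones="us-east-1d
us-east-1d
us-east-1c"

# A pod mounting all three volumes is schedulable only if every zone matches.
unique=$(printf '%s\n' "$zones" | sort -u | wc -l | tr -d ' ')
if [ "$unique" -eq 1 ]; then
  echo "all PVs in one zone: schedulable"
else
  echo "PVs span $unique zones: no node can mount all volumes"
fi
```

With the values from this bug, the check reports that the PVs span 2 zones, matching the scheduler's "no available volume zone" event.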


********************************************************************************
# oc get sc
NAME            PROVISIONER             AGE
gp2 (default)   kubernetes.io/aws-ebs   10h
********************************************************************************
# oc get sc gp2 -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
  creationTimestamp: 2018-05-17T15:10:02Z
  name: gp2
  resourceVersion: "173353"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/gp2
  uid: 5fbaff32-59e4-11e8-a6c6-0aa38464fcf8
parameters:
  encrypted: "false"
  kmsKeyId: ""
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: Immediate
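
For reference, the StorageClass above uses `volumeBindingMode: Immediate`, so each PV is provisioned in some zone before the pod is scheduled. The zone-topology-aware binding mentioned in comment 3 later became available as `WaitForFirstConsumer`, which delays provisioning until the consuming pod is scheduled so all volumes land in that pod's zone. A hypothetical sketch (not a StorageClass from this cluster, and not supported on the Kubernetes 1.10 release under test):

```yaml
# Hypothetical StorageClass for illustration: binding/provisioning is
# deferred until a pod using the PVC is scheduled, so all PVs are
# created in that pod's zone.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-topology-aware
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```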

# oc get nodes --show-labels
NAME                            STATUS    ROLES           AGE       VERSION           LABELS
ip-172-18-1-161.ec2.internal    Ready     compute,infra   18h       v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.2xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-1-161.ec2.internal,node-role.kubernetes.io/compute=true,node-role.kubernetes.io/infra=true,region=infra,registry=enabled,role=node,router=enabled
ip-172-18-12-157.ec2.internal   Ready     master          18h       v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.2xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-12-157.ec2.internal,node-role.kubernetes.io/master=true,role=node
ip-172-18-13-242.ec2.internal   Ready     master          18h       v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.2xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-13-242.ec2.internal,node-role.kubernetes.io/master=true,role=node
ip-172-18-28-213.ec2.internal   Ready     compute         18h       v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.2xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1c,kubernetes.io/hostname=ip-172-18-28-213.ec2.internal,node-role.kubernetes.io/compute=true,region=primary,role=node
ip-172-18-30-205.ec2.internal   Ready     compute,infra   18h       v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.2xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1c,kubernetes.io/hostname=ip-172-18-30-205.ec2.internal,node-role.kubernetes.io/compute=true,node-role.kubernetes.io/infra=true,region=infra,registry=enabled,role=node,router=enabled
ip-172-18-30-216.ec2.internal   Ready     master          18h       v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.2xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1c,kubernetes.io/hostname=ip-172-18-30-216.ec2.internal,node-role.kubernetes.io/master=true,role=node
ip-172-18-4-120.ec2.internal    Ready     compute         18h       v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.2xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-4-120.ec2.internal,node-role.kubernetes.io/compute=true,region=primary,role=node

Version-Release number of selected component (if applicable):
openshift-ansible-3.10.0-0.47.0.git.0.c018c8f.el7.noarch

# openshift version
openshift v3.10.0-0.47.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

All prometheus component versions:
v3.10.0-0.47.0.0

How reproducible:
always

Steps to Reproduce:
1. Deploy prometheus 3.10 with dynamic PVs; for the inventory, see the [Additional info] section.

Actual results:
The prometheus pod fails to start up (stuck in Pending).

Expected results:
The prometheus pod should start and stay healthy with dynamically provisioned PVs.

Additional info:
openshift_prometheus_state=present
openshift_prometheus_node_selector={'role': 'node'}
openshift_prometheus_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_prometheus_image_version=v3.10
openshift_prometheus_storage_type=pvc
openshift_prometheus_alertmanager_storage_type=pvc
openshift_prometheus_alertbuffer_storage_type=pvc

Comment 1 Junqi Zhao 2018-05-18 02:41:08 UTC
Reporting this defect against prometheus first; if it should be fixed in another component, please feel free to change it.

Comment 2 Paul Gier 2018-05-18 18:57:27 UTC
Assigning to bchilds to look into why the PVs are being created in two separate zones.

Comment 3 Hemant Kumar 2018-05-18 19:48:36 UTC
Did you intentionally run a multi-zone cluster? If yes: until zone-topology-aware volume binding is available, this should be considered a known bug. One can work around it by specifying the zone parameter in the StorageClass:

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: slow
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zone: us-east-1d

If you did not mean to start a multizone cluster, then please make sure that all participating nodes are in one zone and that a kubernetes.io/cluster/<cluster_key>: <cluster_val> tag exists on the master nodes (and on the other nodes as well).

Where cluster_key and cluster_val can be any simple string. Something like:

kubernetes.io/cluster/macy-main-cluster: "us_east"

Comment 4 Junqi Zhao 2018-05-22 00:41:24 UTC
(In reply to Hemant Kumar from comment #3)
> Did you intentionally run multi-zone cluster? if yes - until we can do zone
> topology aware volume binding this should be considered a known bug. One can
> workaround this by specifying zone parameter in SC:
> 
> kind: StorageClass
> apiVersion: storage.k8s.io/v1beta1
> metadata:
>   name: slow
> provisioner: kubernetes.io/aws-ebs
> parameters:
>   type: gp2
>   zone: us-east-1d

We have one public LB environment, and it is a multi-zone cluster; we found this known defect when testing on it. We can use the workaround to place all PVs in the same zone.

Comment 6 Paul Gier 2018-08-23 16:56:49 UTC
Closing this issue as I believe it is the same problem described in bug 1554921.

*** This bug has been marked as a duplicate of bug 1554921 ***

