Created attachment 1438279 [details]
pv info

Description of problem:
Deploy prometheus 3.10 with dynamic PVs in an LB environment; the prometheus pod fails to start up (prometheus-0 uses these three PVs: prometheus, prometheus-alertmanager, prometheus-alertbuffer).

# oc get po -n openshift-metrics
NAME                             READY     STATUS    RESTARTS   AGE
prometheus-0                     0/6       Pending   0          2m
prometheus-node-exporter-27vsc   1/1       Running   0          2m
prometheus-node-exporter-9n7rx   1/1       Running   0          2m
prometheus-node-exporter-h5dng   1/1       Running   0          2m
prometheus-node-exporter-pkwr4   1/1       Running   0          2m
prometheus-node-exporter-q7k85   1/1       Running   0          2m
prometheus-node-exporter-shs7d   1/1       Running   0          2m
prometheus-node-exporter-wjjdp   1/1       Running   0          2m

Error: node(s) had no available volume zone

# oc describe po prometheus-0 -n openshift-metrics
************************snip****************************************************
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  26s (x15 over 3m)  default-scheduler  0/7 nodes are available: 7 node(s) had no available volume zone.
From the following, we can see that two PVs are provisioned in zone us-east-1d while the third is in us-east-1c, which caused the prometheus pod to fail to start:

# oc get pv | grep prometheus
pvc-b23b072c-5a3d-11e8-bdf1-0ec461d5fa72   10Gi   RWO   Delete   Bound   openshift-metrics/prometheus                gp2   4m
pvc-b356f89c-5a3d-11e8-bdf1-0ec461d5fa72   10Gi   RWO   Delete   Bound   openshift-metrics/prometheus-alertmanager   gp2   4m
pvc-b46f6292-5a3d-11e8-a85e-0ede90de0f32   10Gi   RWO   Delete   Bound   openshift-metrics/prometheus-alertbuffer    gp2   4m

# oc get pv pvc-b23b072c-5a3d-11e8-bdf1-0ec461d5fa72 -o yaml
************************snip****************************************************
  labels:
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1d

# oc get pv pvc-b356f89c-5a3d-11e8-bdf1-0ec461d5fa72 -o yaml
************************snip****************************************************
  labels:
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1d

# oc get pv pvc-b46f6292-5a3d-11e8-a85e-0ede90de0f32 -o yaml
************************snip****************************************************
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1c
********************************************************************************
# oc get sc
NAME            PROVISIONER             AGE
gp2 (default)   kubernetes.io/aws-ebs   10h
********************************************************************************
# oc get sc gp2 -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
  creationTimestamp: 2018-05-17T15:10:02Z
  name: gp2
  resourceVersion: "173353"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/gp2
  uid: 5fbaff32-59e4-11e8-a6c6-0aa38464fcf8
parameters:
  encrypted: "false"
  kmsKeyId: ""
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: Immediate

# oc get nodes --show-labels
NAME                            STATUS    ROLES           AGE       VERSION           LABELS
ip-172-18-1-161.ec2.internal    Ready     compute,infra   18h       v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.2xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-1-161.ec2.internal,node-role.kubernetes.io/compute=true,node-role.kubernetes.io/infra=true,region=infra,registry=enabled,role=node,router=enabled
ip-172-18-12-157.ec2.internal   Ready     master          18h       v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.2xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-12-157.ec2.internal,node-role.kubernetes.io/master=true,role=node
ip-172-18-13-242.ec2.internal   Ready     master          18h       v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.2xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-13-242.ec2.internal,node-role.kubernetes.io/master=true,role=node
ip-172-18-28-213.ec2.internal   Ready     compute         18h       v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.2xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1c,kubernetes.io/hostname=ip-172-18-28-213.ec2.internal,node-role.kubernetes.io/compute=true,region=primary,role=node
ip-172-18-30-205.ec2.internal   Ready     compute,infra   18h       v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.2xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1c,kubernetes.io/hostname=ip-172-18-30-205.ec2.internal,node-role.kubernetes.io/compute=true,node-role.kubernetes.io/infra=true,region=infra,registry=enabled,role=node,router=enabled
ip-172-18-30-216.ec2.internal   Ready     master          18h       v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.2xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1c,kubernetes.io/hostname=ip-172-18-30-216.ec2.internal,node-role.kubernetes.io/master=true,role=node
ip-172-18-4-120.ec2.internal    Ready     compute         18h       v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.2xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-4-120.ec2.internal,node-role.kubernetes.io/compute=true,region=primary,role=node

Version-Release number of selected component (if applicable):
openshift-ansible-3.10.0-0.47.0.git.0.c018c8f.el7.noarch

# openshift version
openshift v3.10.0-0.47.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

All prometheus components are version v3.10.0-0.47.0.0.

How reproducible:
always

Steps to Reproduce:
1. Deploy prometheus 3.10 with dynamic PVs; for the inventory, see the [Additional info] part
2.
3.

Actual results:
prometheus pod failed to start up

Expected results:
prometheus pod should be healthy with dynamic PVs

Additional info:
openshift_prometheus_state=present
openshift_prometheus_node_selector={'role': 'node'}
openshift_prometheus_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_prometheus_image_version=v3.10
openshift_prometheus_storage_type=pvc
openshift_prometheus_alertmanager_storage_type=pvc
openshift_prometheus_alertbuffer_storage_type=pvc
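For background on why the three claims landed in different zones: with volumeBindingMode: Immediate, the in-tree AWS EBS provisioner picks a zone for each PVC at provisioning time, deliberately spreading claims across the cluster's zones, without knowing which node (and therefore which zone) the consuming pod will be scheduled onto. The following Python sketch illustrates this kind of name-based spreading; it is an illustration only, not the actual Kubernetes zone-selection code, and the hash used here is a made-up stand-in.

```python
def choose_zone(claim_name: str, zones: list) -> str:
    """Pick a zone for a volume by hashing the claim name over the
    sorted zone list -- a simplified stand-in for the per-claim zone
    spreading the in-tree cloud provisioners perform."""
    ordered = sorted(zones)
    # Deterministic toy hash: each claim always maps to the same zone,
    # but different claims can map to different zones.
    index = sum(ord(c) for c in claim_name) % len(ordered)
    return ordered[index]

# The three Prometheus claims from this report can therefore end up in
# different zones, even though one pod must mount all three volumes.
zones = ["us-east-1c", "us-east-1d"]
for name in ("prometheus", "prometheus-alertmanager", "prometheus-alertbuffer"):
    print(name, "->", choose_zone(name, zones))
```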
Reporting this defect against prometheus first; if it should be fixed in another component, please feel free to change it.
Assigning to bchilds to look into why the PVs are being created in two separate zones.
Did you intentionally run a multi-zone cluster? If yes - until we can do zone topology aware volume binding, this should be considered a known bug. One can work around this by specifying the zone parameter in the SC:

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: slow
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zone: us-east-1d

If you did not mean to start a multi-zone cluster, then please make sure that all participating nodes are in one zone and that a kubernetes.io/cluster/<cluster_key>: <cluster_val> tag exists on the master nodes (and the other nodes as well), where cluster_key and cluster_val can be any simple string. Something like:

kubernetes.io/cluster/macy-main-cluster: "us_east"
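For reference, the "zone topology aware volume binding" mentioned above shipped in later Kubernetes releases as volumeBindingMode: WaitForFirstConsumer, which delays provisioning until a pod using the claim is scheduled, so all of a pod's volumes are created in the zone of the chosen node. A sketch under that assumption (the class name is illustrative, and this mode is not usable as a workaround on this 1.10-based release):

```yaml
# Hypothetical topology-aware StorageClass for later Kubernetes releases.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp2-wait-for-consumer   # illustrative name, not from this report
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
# Delay volume creation until the consuming pod is scheduled, so the
# volume is provisioned in the zone of the node the scheduler picked.
volumeBindingMode: WaitForFirstConsumer
```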
(In reply to Hemant Kumar from comment #3)
> Did you intentionally run multi-zone cluster? if yes - until we can do zone
> topology aware volume binding this should be considered a known bug. One can
> workaround this by specifying zone parameter in SC:
>
> kind: StorageClass
> apiVersion: storage.k8s.io/v1beta1
> metadata:
>   name: slow
> provisioner: kubernetes.io/aws-ebs
> parameters:
>   type: gp2
>   zone: us-east-1d

We have one public LB environment and it is a multi-zone cluster; we found this known defect when testing on it. We can use the workaround to keep all the PVs in the same zone.
Closing this issue as I believe it is the same problem described in 1554921 *** This bug has been marked as a duplicate of bug 1554921 ***