Bug 1554921
| Summary: | prometheus deployment fails in OCP3.7 on AWS platform with EBS storage | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | mmariyan |
| Component: | Installer | Assignee: | Paul Gier <pgier> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.9.0 | CC: | aos-bugs, aos-storage-staff, avagarwa, bchilds, decarr, jdesousa, jokerman, juzhao, mmariyan, mmccomas, pgier, wmeng |
| Target Milestone: | --- | | |
| Target Release: | 3.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1621887 (view as bug list) | Environment: | |
| Last Closed: | 2018-10-11 07:19:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1621887 | | |
Doc Text:

Cause: Installing prometheus in a multi-zone/multi-region cluster with dynamic storage provisioning can leave the prometheus pod unschedulable.

Consequence: The prometheus pod requires three persistent volumes (PVs): one for the prometheus server, one for the alertmanager, and one for the alert-buffer. In a multi-zone cluster with dynamic storage, one or more of these volumes may be allocated in a different zone than the others. Because each node can only access PVs in its own zone, no single node can attach all three volumes, so the prometheus pod cannot be scheduled.

Workaround (if any): Create a storage class that restricts volumes to a single zone using the "zone:" parameter, and assign it to the prometheus volumes with the ansible installer inventory variable "openshift_prometheus_<COMPONENT>_storage_class=<zone_restricted_storage_class>" (a sketch of such a storage class follows below).

Result: All three volumes are created in the same zone, and the prometheus pod is automatically scheduled to a node in that zone.
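A zone-restricted class as described in the workaround might look like the following minimal sketch for AWS EBS; the class name and zone are illustrative, not values taken from this bug, so substitute ones that match your cluster.

```yaml
# Sketch of a StorageClass that pins dynamically provisioned EBS volumes
# to a single availability zone (name and zone are illustrative).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: prometheus-single-zone
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zone: us-east-1d
```

The inventory would then point the prometheus server, alertmanager, and alert-buffer volumes at this class through the `openshift_prometheus_<COMPONENT>_storage_class` variables mentioned above; the exact `<COMPONENT>` names depend on the openshift-ansible version in use.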
Description
mmariyan
2018-03-13 15:03:05 UTC
This might be a configuration issue. Could you provide controller logs so we can see what is going on?

It also affects 3.9: same steps, same result. Looks like this issue: https://github.com/kubernetes/kubernetes/issues/39178

Hello, I am facing the same issue when deploying prometheus on OCP3.7 on Google Storage as well. The PVs are created, but they are not assigned to nodes, and the pod stays stuck in "Pending" indefinitely. Cheers, /JM

jmselmi, this issue is not specific to AWS; any cluster spanning multiple AZs whose storage class provisions storage bound to a single AZ can run into it. You can work around it by manually creating all the persistent volume claims in the same AZ (a sketch of such a claim appears at the end of this page).

*** Bug 1579607 has been marked as a duplicate of this bug. ***

See also Bug 1565405.

The workaround works: specify the zone in the StorageClass:

    parameters:
      type: gp2
      zone: us-east-1d

prometheus images version: v3.11.0-0.25.0
openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652
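For the manual workaround mentioned above (creating the claims yourself so they all land in one AZ), a claim pinned to a zone-restricted class might look like the sketch below; the claim name, namespace, and size are hypothetical values for illustration, not ones taken from this bug.

```yaml
# Hypothetical PVC for one of the three prometheus volumes, bound to the
# zone-restricted StorageClass so that all claims land in the same AZ.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus            # assumed claim name
  namespace: openshift-metrics  # assumed prometheus namespace
spec:
  storageClassName: prometheus-single-zone
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
```

Repeating the same pattern for the alertmanager and alert-buffer claims keeps all three volumes, and therefore the pod, in one zone.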