Bug 1554921 - prometheus deployment fails in OCP3.7 on AWS platform with EBS storage
Summary: prometheus deployment fails in OCP3.7 on AWS platform with EBS storage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.11.0
Assignee: Paul Gier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1579607
Depends On:
Blocks: 1621887
Reported: 2018-03-13 15:03 UTC by mmariyan
Modified: 2018-10-11 07:20 UTC

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: Installing Prometheus in a multi-zone/region cluster with dynamic storage provisioning can leave the Prometheus pod unschedulable.
Consequence: The Prometheus pod requires three PVs (persistent volumes): one for the Prometheus server, one for the alertmanager, and one for the alert-buffer. In a multi-zone cluster with dynamic storage, one or more of these volumes may be allocated in a different zone than the others. Because each node in the cluster can only access PVs in its own zone, no node can then run the Prometheus pod and access all three PVs, so the pod becomes unschedulable.
Workaround (if any): Create a storage class that restricts volumes to a single zone by using the "zone:" parameter, and assign this storage class to the Prometheus volumes with the Ansible installer inventory variable "openshift_prometheus_<COMPONENT>_storage_class=<zone_restricted_storage_class>".
Result: All three volumes are created in the same zone/region, and the Prometheus pod is automatically scheduled to a node in that zone.
Clone Of:
Clones: 1621887
Environment:
Last Closed: 2018-10-11 07:19:09 UTC
Target Upstream Version:



Links
System | ID | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1565405 | None | CLOSED | the Prometheus ansible installer playbook does not take into account a multi AZ deployment | 2019-07-23 06:40:23 UTC
Red Hat Bugzilla | 1579607 | None | CLOSED | Prometheus pod failed to start up if mounted multiple PVs are in multizone cluster | 2019-07-23 06:40:23 UTC
Red Hat Product Errata | RHBA-2018:2652 | None | None | None | 2018-10-11 07:20:11 UTC

Internal Links: 1565405 1579607

Description mmariyan 2018-03-13 15:03:05 UTC
Description of problem:
When Prometheus is deployed in OCP 3.7 on the AWS platform with EBS storage, the Prometheus playbook finishes without any failure, but the Prometheus pod always stays in the Pending state. The pod reports the following: "0/6 nodes are available: 3 CheckServiceAffinity, 3 MatchNodeSelector, 6 NoVolumeZoneConflict".
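
To see which check rejects every node, the message quoted above can be split into per-predicate counts. This is only a small Python sketch; the function name is made up for illustration, and it assumes the message always has the "N/M nodes are available: count Predicate, ..." shape shown in this report.

```python
import re

def parse_scheduler_message(message):
    """Split a scheduling failure message into (available, total, {predicate: count})."""
    head, _, tail = message.partition(":")
    match = re.match(r"\s*(\d+)/(\d+) nodes are available", head)
    if match is None:
        raise ValueError("unrecognized scheduler message: %r" % message)
    available, total = int(match.group(1)), int(match.group(2))
    failures = {}
    for item in tail.strip().rstrip(".").split(","):
        item = item.strip()
        if not item:
            continue
        # Each item looks like "6 NoVolumeZoneConflict".
        count, predicate = item.split(None, 1)
        failures[predicate] = int(count)
    return available, total, failures

message = ("0/6 nodes are available: 3 CheckServiceAffinity, "
           "3 MatchNodeSelector, 6 NoVolumeZoneConflict")
available, total, failures = parse_scheduler_message(message)

# Predicates that, on their own, rejected every node in the cluster.
blocking = [name for name, count in failures.items() if count == total]
print(blocking)  # ['NoVolumeZoneConflict']
```

Here NoVolumeZoneConflict is the only predicate that excludes all 6 nodes, which points at the zones of the volumes rather than at node selectors.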


Version-Release number of selected component (if applicable):


How reproducible:
prometheus deployment fails in OCP3.7 on AWS platform with EBS storage

Steps to Reproduce:
1. Run the playbook (it always completes successfully):
# ansible-playbook -i <inventory-host> /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/openshift-prometheus.yml
2. Check the pod status.

Actual results:
The Prometheus pod always remains in the Pending state.

Expected results:
Prometheus pods should run without any error

Additional info:
When tested with emptyDir for Prometheus storage, the Prometheus pod runs successfully.

Comment 1 Avesh Agarwal 2018-03-13 15:39:07 UTC
This might be a configuration issue. Could you provide controller logs to see what is going on?

Comment 5 Juan Luis de Sousa-Valadas 2018-04-20 13:32:18 UTC
It also affects 3.9, same steps, same result.

Looks like issue: https://github.com/kubernetes/kubernetes/issues/39178

Comment 14 jmselmi 2018-08-07 18:53:02 UTC
Hello, 

I am facing the same issue when deploying Prometheus on OCP 3.7 on Google Storage as well.
The PVs are created, but they were not assigned to nodes.

The pod is stuck in "Pending" indefinitely.

Cheers,
/JM

Comment 17 Juan Luis de Sousa-Valadas 2018-08-09 10:58:45 UTC
jmselmi, this issue is not specific to AWS. Any cluster that spans multiple AZs and has a storage class whose volumes are bound to a single AZ is prone to this issue.

You can work around it by manually creating all the persistent volume claims in the same AZ.
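
For illustration, the zone constraint can be modelled in a few lines of Python. This is a simplified sketch and not the scheduler's actual code: it assumes every node and every volume belongs to exactly one zone, and the node names, zone names, and function name are made up.

```python
def schedulable_nodes(node_zones, volume_zones):
    """Return the nodes whose zone matches the zone of every volume the pod mounts."""
    return sorted(
        node
        for node, zone in node_zones.items()
        if all(volume_zone == zone for volume_zone in volume_zones.values())
    )

# Hypothetical cluster with one node per zone.
node_zones = {
    "node-a": "us-east-1a",
    "node-b": "us-east-1b",
    "node-d": "us-east-1d",
}

# Dynamic provisioning spread the three Prometheus volumes over two zones.
spread = {
    "prometheus": "us-east-1a",
    "alertmanager": "us-east-1d",
    "alert-buffer": "us-east-1a",
}

# All three claims created in the same zone, as in the workaround.
pinned = {name: "us-east-1d" for name in spread}

print(schedulable_nodes(node_zones, spread))  # []
print(schedulable_nodes(node_zones, pinned))  # ['node-d']
```

With the volumes spread across zones no node qualifies, which matches the pod staying in Pending. Once all volumes share a zone, the nodes in that zone become candidates again.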

Comment 24 Paul Gier 2018-08-23 16:56:49 UTC
*** Bug 1579607 has been marked as a duplicate of this bug. ***

Comment 25 Junqi Zhao 2018-08-27 00:59:58 UTC
See also Bug 1565405

Comment 26 Junqi Zhao 2018-08-29 11:10:10 UTC
The workaround works.

Specify the zone in the StorageClass:
parameters:
  type: gp2
  zone: us-east-1d

Prometheus image version: v3.11.0-0.25.0

openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch
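
For reference, a complete StorageClass object carrying the parameters quoted in this comment might be generated as below. This is a sketch under assumptions: only "type: gp2" and "zone: us-east-1d" come from this comment. The object name "prometheus-us-east-1d" is hypothetical, and the apiVersion and the kubernetes.io/aws-ebs provisioner are assumed for an AWS EBS setup.

```python
import json

def zone_restricted_storage_class(name, zone, volume_type="gp2"):
    """Build a StorageClass manifest that restricts new volumes to a single zone."""
    return {
        "apiVersion": "storage.k8s.io/v1",       # assumption: not stated in this bug
        "kind": "StorageClass",
        "metadata": {"name": name},              # hypothetical name
        "provisioner": "kubernetes.io/aws-ebs",  # assumption: in-tree AWS EBS provisioner
        "parameters": {"type": volume_type, "zone": zone},
    }

manifest = zone_restricted_storage_class("prometheus-us-east-1d", "us-east-1d")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and created with "oc create -f <file>". The resulting storage class name would then be the value for the "openshift_prometheus_<COMPONENT>_storage_class" inventory variables described in the Doc Text.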

Comment 28 errata-xmlrpc 2018-10-11 07:19:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

