Bug 1564939

Summary: RFE: Cassandra pods for metrics are not failure domain aware
Product: OpenShift Container Platform
Reporter: raffaele spazzoli <rspazzol>
Component: Hawkular
Assignee: John Sanda <jsanda>
Status: CLOSED WONTFIX
QA Contact: Junqi Zhao <juzhao>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.9.0
CC: aos-bugs, rspazzol, rvargasp
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-10 18:57:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host: ---
Cloudforms Team: ---
Target Upstream Version: ---
Embargoed:

Description raffaele spazzoli 2018-04-09 01:33:00 UTC
Description of problem:
When installing metrics in a cluster that has failure domains, the Cassandra pods should be failure-domain aware and distribute themselves across those domains.

The conventional labels that identify the failure domains are:
failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a

These labels are set automatically when running in a cloud environment, but they can also be set manually for on-premises deployments.
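
For illustration, a node labeled manually for an on-premises deployment would carry the labels in its metadata roughly as in the sketch below (the node name is hypothetical; the label values mirror the cloud example above):

  apiVersion: v1
  kind: Node
  metadata:
    name: node1.example.com        # hypothetical node name
    labels:
      # conventional failure-domain labels, set by hand on premises
      failure-domain.beta.kubernetes.io/region: us-central1
      failure-domain.beta.kubernetes.io/zone: us-central1-a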


Version-Release number of selected component (if applicable):
metrics-cassandra:v3.9.14


How reproducible:
100%


Steps to Reproduce:
1. Deploy a Cassandra cluster in a multi-AZ OCP cluster



Actual results:

The multiple Cassandra replication controllers ignore the failure domains.
The pods may happen to distribute themselves correctly by chance, but there is no directive that creates anti-affinity behavior based on the failure domain; a sketch of such a directive follows.
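
A minimal sketch of the kind of pod anti-affinity rule being requested (this is illustrative only; the pod label selector is hypothetical and no such rule exists in the shipped templates):

  spec:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              name: hawkular-cassandra   # hypothetical pod label
          topologyKey: failure-domain.beta.kubernetes.io/zone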



Additional info:
If a StatefulSet were used to deploy the Cassandra cluster, this behavior would be easier to configure.

Comment 1 John Sanda 2018-05-29 19:44:03 UTC
Is this the same as bug 1563853? If so, can we close this ticket?

Comment 2 raffaele spazzoli 2018-05-29 20:35:49 UTC
John, it is not the same.
Bug 1564939 asks that two pods not be started in the same failure domain.
Bug 1563853 asks that two pods not be started on the same node.
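
In anti-affinity terms, the difference between the two requests is only the topologyKey; a sketch, assuming the conventional labels mentioned in the description:

  # bug 1564939: spread pods across failure domains (zones)
  topologyKey: failure-domain.beta.kubernetes.io/zone
  # bug 1563853: spread pods across nodes
  topologyKey: kubernetes.io/hostname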