Bug 1459968 - [RFE] Highly available Metrics
Summary: [RFE] Highly available Metrics
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 3.8.0
Assignee: John Sanda
QA Contact: Liming Zhou
URL:
Whiteboard:
Duplicates: 1459877
Depends On:
Blocks:
Reported: 2017-06-08 17:10 UTC by Brennan Vincello
Modified: 2023-09-18 00:12 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-13 12:42:10 UTC
Target Upstream Version:
Embargoed:



Description Brennan Vincello 2017-06-08 17:10:00 UTC
Description of problem:

As an OpenShift admin, I'd like to deploy Metrics in a highly available manner so that if one infra node running Metrics becomes unavailable, we can continue to track metrics across the cluster.

Version-Release number of selected component (if applicable): OCP 3.4

Comment 1 Matt Wringe 2017-06-08 18:52:07 UTC
I believe we have this functionality today with a couple of configuration changes. The problem is that scaling up and down clusters where we have data replication requires someone to manually run commands.

The Hawkular Metrics team is looking at management capabilities which would allow this to be more easily managed and automated. But I don't know when this will be available.

We could technically update the documentation on how to handle this if we expect this is something that we would want users to be configuring manually until a more automated system is in place.

Comment 2 Matt Wringe 2017-06-08 18:55:12 UTC
*** Bug 1459877 has been marked as a duplicate of this bug. ***

Comment 3 Eric Jones 2017-06-20 20:08:02 UTC
@Matt, Does this mean if we switch this to a documentation BZ you would be able to work with that team to help us come up with steps for this?

Comment 13 Louis Santillan 2017-09-02 15:59:01 UTC
Ron (Subhankar) Segupta recently shared with me a very rough procedure for scaling Metrics in 3.5 (https://docs.google.com/document/d/1aaVaNY_k6bIA3jWDe1a-LbBddCN8apFdTlaIRsVBWGk/edit?ts=59a56988).  It looks to be private for right now so I'll copy a version of it here.  Joohoo Lee & I had planned to confirm his results.  I still need to reconfigure my lab cluster to match this setup.

Independent verification by CEE and Engineering would be much appreciated.  Also, after verifying the procedure, a determination as to Supportability by CEE would also be required.  Brennan, could you drive these two processes?  Thanks.

I had promised to get the procedure documented (openshift-docs or openshift-playbooks) once it had been verified and I'll make the same promise here.

=====================
OCP version 3.5 (Desk Notes)



Ansible hosts file group vars

openshift_hosted_metrics_deploy=true
openshift_hosted_metrics_public_url=https://metrics.{{openshift_master_default_subdomain}}/hawkular/metrics
openshift_metrics_cassandra_storage_type=emptydir
openshift_metrics_cassandra_replicas=3
openshift_metrics_hawkular_replicas=3

Test the default installation is working consistently
Scale all replicas to zero for Hawkular, Cassandra and Heapster
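
The scale-down step could be sketched roughly as below. This is only an illustration: the OC and NAMESPACE variables and the scale_rcs helper are hypothetical, and the replication controller names in the example comment (heapster, hawkular-metrics) are assumed from a default 3.5 metrics deployment rather than taken from these notes.

```shell
#!/bin/sh
# Sketch: scale metrics replication controllers to a given replica count.
# OC and NAMESPACE are illustrative, overridable settings.
OC="${OC:-oc}"
NAMESPACE="${NAMESPACE:-openshift-infra}"

scale_rcs() {
    # $1 = desired replica count, remaining args = replication controller names
    replicas="$1"
    shift
    for rc in "$@"; do
        "$OC" scale rc "$rc" --replicas="$replicas" -n "$NAMESPACE" || return 1
    done
}

# Example (not executed here; RC names are assumptions, check with `oc get rc`):
#   scale_rcs 0 heapster hawkular-metrics hawkular-cassandra-1 hawkular-cassandra-2
```

Listing Heapster and Hawkular Metrics before Cassandra stops the writers before the datastore, which seems the safer order, though the notes do not specify one.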

Edit the ReplicationController or DC settings for Cassandra, replacing each "emptyDir" entry in the volumes section with a hostPath entry.

#oc edit rc hawkular-cassandra-1
#oc edit rc hawkular-cassandra-2


volumes:
- name: cassandra-data
  hostPath:
    path: /mnt/metricdata

Node selector for metrics:

#oc project openshift-infra

Change the project's node selector annotation so that it reads as follows:
openshift.io/node-selector: metric=true

3 nodes are already labeled with metric=true

SELinux

Run
#chmod 777 /mnt/metricdata

Assign the SELinux label to the /mnt/metricdata directory:

#chcon -R -t svirt_sandbox_file_t /mnt/metricdata

OR

#chcon -u system_u -r object_r -t svirt_sandbox_file_t -l s0 /mnt/metricdata  (if the context shows unconfined_u instead of system_u)
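
The directory preparation on each labeled node might be wrapped up along these lines. This is a sketch under assumptions: the prepare_metrics_dir helper is hypothetical, the mkdir step is an addition not present in the notes, and the relabel is skipped when the SELinux tools are not found or SELinux is disabled.

```shell
#!/bin/sh
# Sketch: prepare the hostPath directory on one node (run as root on each node
# labeled metric=true). The helper name and the mkdir step are assumptions.
prepare_metrics_dir() {
    # $1 = directory backing the cassandra-data hostPath volume
    dir="${1:-/mnt/metricdata}"
    mkdir -p "$dir" || return 1
    chmod 777 "$dir" || return 1
    # Relabel so containers may read and write the directory (SELinux hosts only).
    if command -v chcon >/dev/null 2>&1 &&
       command -v selinuxenabled >/dev/null 2>&1 && selinuxenabled; then
        chcon -R -t svirt_sandbox_file_t "$dir" || return 1
    fi
}

# Example (not executed here):
#   prepare_metrics_dir /mnt/metricdata
```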



Assign privileges to cassandra service account

#oadm policy add-scc-to-user privileged -z cassandra -n openshift-infra

Scale up the metrics pods again.

Update the Cassandra replication factor from inside a Cassandra pod.

$ oc rsh <cassandra_pod>
$ cqlsh --ssl
   cqlsh>  ALTER KEYSPACE hawkular_metrics WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2'} AND durable_writes = true;
   cqlsh>  exit

$ nodetool repair
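
A non-interactive variant of the replication change could look roughly like this. It is a sketch, not a verified procedure: the helper names, OC and NAMESPACE are hypothetical, it assumes cqlsh in the Cassandra image accepts -e to execute a single statement, and it assumes the repair should be run in every Cassandra pod rather than only one.

```shell
#!/bin/sh
# Sketch: apply the ALTER KEYSPACE statement from the notes without an
# interactive cqlsh session, then repair each Cassandra pod in turn.
OC="${OC:-oc}"
NAMESPACE="${NAMESPACE:-openshift-infra}"

replication_cql() {
    # $1 = replication factor; prints the CQL statement used in the notes
    printf "ALTER KEYSPACE hawkular_metrics WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '%s'} AND durable_writes = true;\n" "$1"
}

set_replication_factor() {
    # $1 = name of a running Cassandra pod, $2 = replication factor
    "$OC" exec -n "$NAMESPACE" "$1" -- cqlsh --ssl -e "$(replication_cql "$2")"
}

repair_pods() {
    # Run nodetool repair in each pod given, one after another
    for pod in "$@"; do
        "$OC" exec -n "$NAMESPACE" "$pod" -- nodetool repair || return 1
    done
}

# Example (not executed here; pod names are placeholders):
#   set_replication_factor <cassandra_pod> 2
#   repair_pods <cassandra_pod_1> <cassandra_pod_2> <cassandra_pod_3>
```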

Comment 14 Louis Santillan 2017-09-02 16:03:33 UTC
(In reply to Matt Wringe from comment #1)
> I believe we have this functionality today with a couple of configuration
> changes. The problem is that scaling up and down clusters where we have data
> replication requires someone to manually run commands.
> 
> The Hawkular Metrics team is looking at management capabilities which would
> allow this to be more easily managed and automated. But I don't know when
> this will be available.
> 
> We could technically update the documentation on how to handle this if we
> expect this is something that we would want users to be configuring manually
> until a more automated system is in place.

I believe this will be an increasingly requested feature in the field (I was asked about this at a large US retailer which generated the original support case) as many are looking to have a level of HA/fault tolerance throughout the infrastructure.

Comment 15 John Sanda 2017-09-03 00:46:44 UTC
When scaling up Cassandra, it should really be done one pod at a time. The reason is that if multiple Cassandra instances try to join the cluster at the same time, their token range assignments can get messed up.
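
A rough sketch of scaling up one instance at a time is shown below. The helper names, OC, NAMESPACE and POLL_SECONDS are hypothetical, and it assumes that the replication controller status exposes readyReplicas and that pod readiness is an adequate proxy for the instance having joined the ring; checking nodetool status inside the pod would be a stricter test.

```shell
#!/bin/sh
# Sketch: bring Cassandra replication controllers up one by one, waiting for
# each instance to report ready before starting the next. No timeout is applied.
OC="${OC:-oc}"
NAMESPACE="${NAMESPACE:-openshift-infra}"
POLL_SECONDS="${POLL_SECONDS:-10}"

ready_replicas() {
    # Prints .status.readyReplicas for one RC (empty while nothing is ready)
    "$OC" get rc "$1" -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}'
}

scale_up_one_at_a_time() {
    # Args = Cassandra replication controller names, in the order to start them
    for rc in "$@"; do
        "$OC" scale rc "$rc" --replicas=1 -n "$NAMESPACE" || return 1
        # Block until this instance is ready before touching the next one
        until [ "$(ready_replicas "$rc")" = "1" ]; do
            sleep "$POLL_SECONDS"
        done
    done
}

# Example (not executed here):
#   scale_up_one_at_a_time hawkular-cassandra-1 hawkular-cassandra-2
```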

We really should consider moving to stateful sets for Cassandra. This is something that the engineering team has been discussing.

Comment 19 Red Hat Bugzilla 2023-09-18 00:12:29 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

