Description of problem:
At startup, hawkular-metrics applies schema updates to Cassandra if necessary. In theory, concurrent schema updates to a Cassandra cluster should not be a problem. In practice, they are often a source of problems, so schema updates should generally be applied serially to avoid inconsistencies between Cassandra nodes.
If the replica count for hawkular-metrics is greater than one, there is a possibility of concurrent schema updates. We use an Infinispan cache at startup in hawkular-metrics to coordinate schema updates. Introducing Infinispan just for this one small use case seems like overkill, but at the time it seemed like a reasonable approach because it could be used in environments other than OpenShift.
As it turns out now, OpenShift is the only environment we need to worry about for hawkular-metrics. The Infinispan integration has been a source of some problems (see bug 1469423). Most importantly, I do not think it has prevented concurrent schema updates.
To properly address (or prevent) the issue of concurrent schema updates and the problems with Infinispan/JGroups, we will move schema updates out of the hawkular-metrics server and into a separate standalone installer that will run as a Kubernetes job. Running schema updates in a Kubernetes job means we do not have to worry about concurrent updates; therefore, there is no longer a need to use Infinispan and JGroups.
The startup order of pods is not guaranteed, so hawkular-metrics will poll Cassandra for a property that is set by the installer. The installer will set the property only after all schema updates are done, at which point hawkular-metrics can proceed with its startup.
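The handshake between the installer and the server could look roughly like the sketch below. It is not the hawkular-metrics implementation: the function names, the `read_property`/`write_property` callables (stand-ins for whatever Cassandra query reads or writes the property), the version comparison, and the interval and timeout values are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterable, Optional


def run_installer(steps: Iterable[Callable[[], None]],
                  write_property: Callable[[str], None],
                  version: str) -> None:
    """Apply schema updates one at a time, then publish the marker property.

    `steps` and `write_property` are hypothetical callables; in a real
    installer they would execute statements against Cassandra.
    """
    for step in steps:
        step()  # serial: each update finishes before the next starts
    # The property is written only after every update has been applied.
    write_property(version)


def wait_for_schema(read_property: Callable[[], Optional[str]],
                    expected_version: str,
                    interval_s: float = 5.0,
                    timeout_s: float = 300.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until the marker property matches, or give up after timeout_s.

    Returns True once the installer has published `expected_version`,
    False if the deadline passes first.
    """
    deadline = clock() + timeout_s
    while True:
        if read_property() == expected_version:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Because the server only reads the property and the single job is the only writer, the server replicas need no coordination among themselves, regardless of the order in which the pods start.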
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Tested with default and non-default values for openshift_metrics_hawkular_replicas and openshift_metrics_cassandra_replicas. All metrics pods were running well, and sanity testing passed.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.