Description of problem: At startup, hawkular-metrics applies schema updates to Cassandra if necessary. Schema updates should generally be applied serially in Cassandra to avoid schema inconsistencies between Cassandra nodes. In theory, concurrent schema updates to a Cassandra cluster should not be a problem. In practice, they are often a source of problems. If the replica count for hawkular-metrics is greater than one, concurrent schema updates are possible. We use an Infinispan cache at startup in hawkular-metrics to coordinate schema updates. Introducing Infinispan just for this one small use case seems like overkill. At the time it seemed like a reasonable approach because it could be used in environments other than OpenShift. As it turns out, OpenShift is the only environment we need to worry about for hawkular-metrics. The Infinispan integration has been a source of some problems (see bug 1469423). With a deployment config, we can use a lifecycle hook to run a single container that applies schema updates before any hawkular-metrics pods are started. By using the lifecycle hook, we do not have to worry about concurrent schema updates. Coupled with bug 1543647, we can completely remove all dependencies on Infinispan, which will further simplify things.
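To illustrate the idea proposed above, here is a minimal Python sketch of "one serial schema step, then start the replicas". It is not hawkular-metrics or OpenShift code: the names `apply_schema_updates` and `run_deployment` are hypothetical, a plain list stands in for the schema state in Cassandra, and threads stand in for pods.

```python
import threading


def apply_schema_updates(applied, pending):
    """Apply pending schema changes one at a time, in order (hypothetical helper)."""
    for change in pending:
        if change not in applied:
            applied.append(change)  # stand-in for executing one schema change


def run_deployment(replicas, pending):
    """Run a single schema step, then let the replicas proceed (hypothetical)."""
    applied = []                       # stand-in for the schema state in Cassandra
    schema_ready = threading.Event()   # set once the single schema step is done
    started = []

    def replica(name):
        # Replicas block until the schema step has finished and never
        # modify the schema themselves.
        schema_ready.wait()
        started.append((name, tuple(applied)))

    threads = [
        threading.Thread(target=replica, args=(f"replica-{i}",))
        for i in range(replicas)
    ]
    for t in threads:
        t.start()

    # Only this one caller applies schema changes, so updates are serial.
    apply_schema_updates(applied, pending)
    schema_ready.set()

    for t in threads:
        t.join()
    return applied, started


if __name__ == "__main__":
    applied, started = run_deployment(3, ["v1", "v2", "v3"])
    print(applied)
    print(sorted(name for name, _ in started))
```

Because only one caller ever applies changes, the replicas need no cluster-wide coordination among themselves, which is the property the lifecycle hook was expected to provide.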
I'm adding case 02022261 to this BZ. The errors reported seem to be related to Infinispan/JGroups, and they always appear when hawkular-metrics is scaled to 2 replicas.

Errors:
2018-01-29 08:21:58,577 ERROR [org.jgroups.protocols.ASYM_ENCRYPT] (thread-1,ee,hawkular-metrics-l1l9x) null: key server is currently not set

JOIN fails:
2018-01-29 08:22:26,995 ERROR [org.jgroups.protocols.ASYM_ENCRYPT] (thread-1,ee,hawkular-metrics-l1l9x) null: key server is currently not set
2018-01-29 08:22:29,894 WARN [org.jgroups.protocols.pbcast.GMS] (MSC service thread 1-6) hawkular-metrics-l1l9x: JOIN(hawkular-metrics-l1l9x) sent to hawkular-metrics-rz9qc timed out (after 3000 ms), on try 10
2018-01-29 08:22:29,894 WARN [org.jgroups.protocols.pbcast.GMS] (MSC service thread 1-6) hawkular-metrics-l1l9x: too many JOIN attempts (10): becoming singleton
2018-01-29 08:22:34,518 WARN [org.jgroups.protocols.ASYM_ENCRYPT] (thread-1,ee,hawkular-metrics-l1l9x) hawkular-metrics-l1l9x: unrecognized cipher; discarding message from hawkular-metrics-rz9qc
The lifecycle hooks in the deployment config do not work the way I thought and will not be a suitable solution. I am closing this ticket. I have created bug 1560695 to address the deployment issues.