Bug 1632853 - The schema installer job may not complete successfully
Summary: The schema installer job may not complete successfully
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.10.z
Assignee: Ruben Vargas Palma
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1632852
Blocks:
Reported: 2018-09-25 17:08 UTC by John Sanda
Modified: 2019-11-20 18:58 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1632852
Environment:
Last Closed: 2019-11-20 18:58:03 UTC
Target Upstream Version:
Embargoed:


Description John Sanda 2018-09-25 17:08:36 UTC
+++ This bug was initially created as a clone of Bug #1632852 +++

Description of problem:
The schema installer job was introduced in 3.10. It applies Hawkular Metrics schema changes to Cassandra. Prior to 3.10, hawkular-metrics applied these changes itself at startup.

The hawkular-metrics pod(s) cannot start until the schema installer job finishes successfully. The job updates a row in the sys_config table to indicate that the schema is up to date. Hawkular Metrics polls Cassandra, checking that row to determine if the schema is up to date.
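
For illustration, the polling described above can be sketched roughly as follows. This is not the Hawkular Metrics implementation: `wait_for_schema`, `read_marker`, the poll interval, and the version strings are all hypothetical, and `read_marker` stands in for whatever query reads the row from sys_config.

```python
import time
from typing import Callable, Optional


def wait_for_schema(read_marker: Callable[[], Optional[str]],
                    expected: str,
                    poll_interval: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll until the schema marker equals `expected`; return the number of polls.

    `read_marker` is a hypothetical callable that returns the value stored in
    the sys_config row, or None if the row does not exist yet.
    """
    polls = 0
    while True:
        polls += 1
        if read_marker() == expected:
            return polls
        # Schema is not up to date yet; wait before checking again.
        sleep(poll_interval)


if __name__ == "__main__":
    # Simulated marker values: missing, stale, then up to date.
    values = iter([None, "old", "new"])
    print(wait_for_schema(lambda: next(values), "new", sleep=lambda s: None))
```

Note that this loop never gives up, which is why a permanently failed installer job leaves hawkular-metrics waiting indefinitely.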

The job is configured with a restart policy of OnFailure, which made me think that the job would be rerun continually until it completed successfully. However, https://github.com/kubernetes/kubernetes/issues/54870 explains that backoffLimit caps the number of retries for a job. The schema installer does not set backoffLimit, so it gets the default value of 6.

This has led to situations where hawkular-metrics cannot start because the job has failed after six attempts. This has happened, for example, when the Cassandra cluster is in a bad state and requires manual intervention to become healthy again.

We could increase the backoffLimit, but that does not really address the problem. The installer has to be updated with additional retry logic so that it keeps running until changes are applied successfully.
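
A minimal sketch of the kind of retry logic meant here, assuming the installer wraps its schema update in a loop with capped exponential backoff. The names `run_until_applied` and `apply_schema` and the delay values are hypothetical and are not taken from the installer's code.

```python
import time
from typing import Callable


def run_until_applied(apply_schema: Callable[[], None],
                      initial_delay: float = 1.0,
                      max_delay: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `apply_schema` until it succeeds; return the number of attempts.

    `apply_schema` is a hypothetical callable that applies the schema changes
    and raises an exception on failure (for example, Cassandra is unavailable).
    """
    attempts = 0
    delay = initial_delay
    while True:
        attempts += 1
        try:
            apply_schema()
            return attempts
        except Exception as exc:
            print(f"schema update attempt {attempts} failed: {exc}; "
                  f"retrying in {delay:g}s")
            sleep(delay)
            # Double the wait after each failure, up to max_delay.
            delay = min(delay * 2, max_delay)
```

Because the loop runs inside the container, the process does not exit on failure, so failed attempts are not counted against backoffLimit. The trade-off is that the job never reports a failure on its own, so the log output is the only sign that something is wrong.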

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Stephen Cuppett 2019-11-20 18:58:03 UTC
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift

