Bug 1418099 - Metrics are failing to fully deploy on OCP 3.4
Summary: Metrics are failing to fully deploy on OCP 3.4
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.4.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.4.z
Assignee: Matt Wringe
QA Contact: Peng Li
URL:
Whiteboard:
Duplicates: 1432150
Depends On:
Blocks:
Reported: 2017-01-31 21:06 UTC by Matt Woodson
Modified: 2018-07-19 06:04 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: If the Hawkular Metrics pod is restarted while it is creating its schema in Cassandra, it may not be able to properly restart again.
Consequence: When the pod would be restarted, it would fail to properly start and an admin would have to manually fix it.
Fix: The schema processing has been updated to better handle being restarted during schema creation and should in more situations be able to continue where it left off when the pod is restarted.
Result: The pod should continue to function properly when it is restarted.
Clone Of:
Environment:
Last Closed: 2017-02-22 18:33:42 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2017:0290 | 0 | normal | SHIPPED_LIVE | OpenShift Container Platform 3.4.1.7, 3.3.1.14, and 3.2.1.26 images update | 2017-02-22 23:30:53 UTC

Description Matt Woodson 2017-01-31 21:06:53 UTC
Description of problem:

When we deploy metrics on 3.4, we get errors from the hawkular-metrics pod.

The error we saw when running "oc logs hawkular-metrics-xxxxx" was:

 ERROR [org.jboss.as] (Controller Boot Thread) WFLYSRV0026: JBoss EAP 7.0.3.GA (WildFly Core 2.1.9.Final-redhat-1) started (with errors) in 265006ms - Started 896 of 1420 services (15 services failed or missing dependencies, 772 services are lazy, passive or on-demand)


We installed logging and metrics with our custom ansible scripts.  These can be found here:

https://github.com/openshift/openshift-tools/blob/prod/ansible/roles/openshift_metrics/tasks/main.yml

These scripts attempt to take the documented metrics install and automate it using ansible and ansible modules.  If there is an issue with our scripts, we could look at fixing that.

Version-Release number of selected component (if applicable):

3.4.0

How reproducible:

This happens when we install metrics on a 3.4 cluster

Comment 1 Matt Woodson 2017-01-31 21:09:05 UTC
This is related to the following bug; noting it here:

https://issues.jboss.org/browse/HWKALERTS-220

Comment 3 Jay Shaughnessy 2017-02-01 17:57:56 UTC
The logs indicate that the Cassandra DB already had the alerting keyspace defined at install time. This is strange: it would mean that either a previous install had been interrupted or the Cassandra DB was somehow "dirty" for some other reason.

Is this failure repeatable on clean 3.4 environments?

Comment 4 John Sanda 2017-02-01 19:59:40 UTC
The simplest resolution may involve dropping the alerts schema and restarting the hawkular-metrics pod. To drop the keyspace, execute `cqlsh --ssl -e "drop keyspace hawkular_alerts"`.
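
A minimal sketch of that workaround as a shell function. It rests on assumptions that are not stated in this report: that cqlsh is run inside the Cassandra pod through `oc exec`, that deleting the hawkular-metrics pod causes it to be recreated, and that the commands run in the project where metrics are deployed. The function name and the pod-name arguments are placeholders.

```shell
# Sketch only: drop the alerts keyspace, then delete the hawkular-metrics pod
# so that it is recreated and runs schema creation from scratch.
# Assumptions (not from this report): cqlsh is reachable via `oc exec` in the
# Cassandra pod, and the deleted pod is recreated automatically.
drop_alerts_and_restart() {
  cassandra_pod="$1"   # placeholder, e.g. hawkular-cassandra-1-xxxxx
  metrics_pod="$2"     # placeholder, e.g. hawkular-metrics-xxxxx
  if [ -z "$cassandra_pod" ] || [ -z "$metrics_pod" ]; then
    echo "usage: drop_alerts_and_restart <cassandra-pod> <hawkular-metrics-pod>" >&2
    return 64
  fi
  # Destructive: run this only after confirming the keyspace holds no data.
  oc exec "$cassandra_pod" -- cqlsh --ssl -e "drop keyspace hawkular_alerts" || return 1
  oc delete pod "$metrics_pod"
}
```

The function stops before deleting the pod if the drop command fails, so a failed drop does not trigger a restart against the same half-created schema.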

Comment 5 Matt Wringe 2017-02-01 20:14:59 UTC
The creation of the schema at startup is problematic if it gets interrupted. The assumption that we are going on here is that the pod got killed while the schema was being created and it didn't finish creating the schema properly.

Since the alerts keyspace should be empty here, it should be safe to drop it.

@jsanda: how can we determine if the keyspace actually is empty and this command will not be destructive to the data?

This shouldn't be happening too often for users, but this is concerning that the schema creation is not more robust and cannot automatically continue if the server is shut down during creation.

Comment 6 John Sanda 2017-02-01 22:05:14 UTC
(In reply to Matt Wringe from comment #5)
> The creation of the schema at startup is problematic if it gets interrupted.
> The assumption that we are going on here is that the pod got killed while
> the schema was being created and it didn't finish creating the schema
> properly.
> 
> Since the alerts keyspace should be empty here, it should be safe to drop it.
> 
> @jsanda: how can we determine if the keyspace actually is empty and this
> command will not be destructive to the data?
> 
> This shouldn't be happening too often for users, but this is concerning that
> the schema creation is not more robust and cannot automatically continue if
> the server is shut down during creation.

It could be automated with a bash script. Here is what you need to check.

First run cqlsh --ssl -e "select table_name from system_schema.tables where keyspace_name = 'hawkular_alerts'"

This should return results that look like:

 table_name
-------------------------
          action_plugins
     actions_definitions
         actions_history
 actions_history_actions
  actions_history_alerts
  actions_history_ctimes
 actions_history_results
                  alerts
           alerts_ctimes
        alerts_lifecycle
         alerts_triggers
                cassalog
              conditions
              dampenings
                  events
       events_categories
           events_ctimes
         events_triggers
              sys_config
                    tags
                triggers
        triggers_actions

Then you need to loop over each table and run a query like:

cqlsh --ssl -e "select * from hawkular_alerts.$table limit 1"

If the table is empty you will get back a header row with each of the columns and at the bottom it will say "(o rows)" for an empty result set.
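
A rough sketch of how the check described above might be scripted. It assumes cqlsh is on the PATH and can reach Cassandra with `--ssl`, exactly as in the commands above; the function names are made up for illustration. A table is treated as empty only when cqlsh prints a "(0 rows)" footer, so any other output, including an error, counts as "not empty".

```shell
# Sketch only: report whether every table in a keyspace is empty.
# Assumes cqlsh is on PATH and connects with --ssl as in the commands above.

# Print one table name per line. Keeps only lines that consist of a single
# identifier, which skips the header, the dashed separator and the row count.
list_tables() {
  _out="$(cqlsh --ssl -e "select table_name from system_schema.tables where keyspace_name = '$1'")" || return 2
  printf '%s\n' "$_out" |
    awk 'NF == 1 && $1 ~ /^[a-z_][a-z0-9_]*$/ && $1 != "table_name" { print $1 }'
}

# Succeed only if cqlsh reports "(0 rows)" for the given keyspace and table.
table_is_empty() {
  _out="$(cqlsh --ssl -e "select * from $1.$2 limit 1")" || return 2
  printf '%s\n' "$_out" | grep -q '(0 rows)'
}

# Return 0 if all tables are empty, 1 if any table has data or could not be
# checked, 2 if the table list could not be read.
keyspace_is_empty() {
  _tables="$(list_tables "$1")" || return 2
  _status=0
  for _t in $_tables; do
    if table_is_empty "$1" "$_t"; then
      echo "empty: $_t"
    else
      echo "NOT empty (or query failed): $_t"
      _status=1
    fi
  done
  return "$_status"
}
```

For example, `keyspace_is_empty hawkular_alerts && echo "no rows found"` prints the final message only when every table reported zero rows. A keyspace with no tables at all also passes the check, so the per-table output is worth reading before dropping anything.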

Comment 7 John Sanda 2017-02-01 22:11:21 UTC
(In reply to John Sanda from comment #6)
> (In reply to Matt Wringe from comment #5)
> > The creation of the schema at startup is problematic if it gets interrupted.
> > The assumption that we are going on here is that the pod got killed while
> > the schema was being created and it didn't finish creating the schema
> > properly.
> > 
> > Since the alerts keyspace should be empty here, it should be safe to drop it.
> > 
> > @jsanda: how can we determine if the keyspace actually is empty and this
> > command will not be destructive to the data?
> > 
> > This shouldn't be happening too often for users, but this is concerning that
> > the schema creation is not more robust and cannot automatically continue if
> > the server is shut down during creation.
> 
> It could be automated with a bash script. Here is what you need to check.
> 
> First run cqlsh --ssl -e "select table_name from system_schema.tables where
> keyspace_name = 'hawkular_alerts'"
> 
> This should return results that look like:
> 
>  table_name
> -------------------------
>           action_plugins
>      actions_definitions
>          actions_history
>  actions_history_actions
>   actions_history_alerts
>   actions_history_ctimes
>  actions_history_results
>                   alerts
>            alerts_ctimes
>         alerts_lifecycle
>          alerts_triggers
>                 cassalog
>               conditions
>               dampenings
>                   events
>        events_categories
>            events_ctimes
>          events_triggers
>               sys_config
>                     tags
>                 triggers
>         triggers_actions
> 
> Then you need to loop over each table and run a query like:
> 
> cqlsh --ssl -e "select * from hawkular_alerts.$table limit 1"
> 
> If the table is empty you will get back a header row with each of the
> columns and at the bottom it will say "(o rows)" for an empty result set.

That should be (0 rows)

Comment 8 Matt Wringe 2017-02-01 22:14:02 UTC
The alerts keyspace didn't appear to have any data in it and it was dropped.

After dropping the alerts keyspace, Hawkular Metrics was able to start up properly.

I am not sure whether this is going to be a rare occurrence for which we only need to document the workaround, or whether we need to look into making the code more robust here.

From talking with @jsanda, it doesn't seem feasible to make the schema creation process continue where it left off if the server is shut down midway through.

Comment 9 Matt Wringe 2017-02-01 22:45:29 UTC
I created https://issues.jboss.org/browse/HWKMETRICS-594 to track whether we can automatically fix some of these issues in Hawkular Metrics.

Comment 10 John Sanda 2017-02-03 14:34:51 UTC
I have submitted pull requests for both metrics and alerts. The code changes are complete.

Comment 11 Matt Wringe 2017-02-13 19:32:43 UTC
Does this still need to be an opsblocker for now?

We have already explained how to prevent this issue from happening (increasing the timeout value) and how to work around this issue if it happens again. It also should only be happening in certain specific setups.

We are working on getting this backported to 3.4, and it should hopefully show up in the next release.

Comment 13 Troy Dawson 2017-02-15 23:11:41 UTC
This should be fixed in openshift3/metrics-hawkular-metrics:3.4.1-6 which should be in all the standard testing areas.

Comment 15 Peng Li 2017-02-16 08:04:17 UTC
Verified with:
openshift3/metrics-hawkular-metrics   3.4.1-6             e0efa6ff3575        13 hours ago        1.5 GB

Steps:
1. Deploy 3.4.1 metrics.
2. While the pods are starting, monitor the hawkular-metrics pod log:
# oc logs -f hawkular-metrics-xxxx
When the lines below appear, kill the pod.

07:08:53,013 INFO  [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Checking Schema existence for keyspace: hawkular_alerts
07:08:54,667 INFO  [org.cassalog.core.CassalogImpl] (ServerService Thread Pool -- 67) Executing [script:vfs:/content/hawkular-metrics.ear/hawkular-alerts.war/WEB-INF/lib/hawkular-alerts-engine-1.3.4.Final-redhat-1.jar/org/hawkular/alerts/schema/cassalog.groovy, tags:[], vars:[keyspace:hawkular_alerts, reset:false, session:com.datastax.driver.core.SessionManager@48e91b7f, logger:org.jboss.logging.JBossLogManagerLogger@12bc69e5]]
07:08:55,508 INFO  [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table triggers
07:08:55,881 INFO  [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table triggers_actions
07:08:56,140 INFO  [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table conditions
07:08:56,728 INFO  [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table dampenings
07:08:57,337 INFO  [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table action_plugins
07:08:57,615 INFO  [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table actions_definitions
07:08:57,881 INFO  [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table actions_history
07:08:58,194 INFO  [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table actions_history_actions
07:08:58,646 INFO  [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table actions_history_alerts
07:08:59,031 INFO  [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table actions_history_ctimes
07:08:59,394 INFO  [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table actions_history_results
07:08:59,828 INFO  [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table tags
07:09:00,223 INFO  [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table alerts
*** JBossAS process (226) received TERM signal ***

3. Wait a while to see whether the hawkular-metrics pod restarts and creates the keyspace again. Note that the hawkular-metrics pod is younger than the other pods, and the log shows that the schema is created again.

# oc get pod
NAME                         READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-pgt9j   1/1       Running     0          23m
hawkular-metrics-2vwff       1/1       Running     0          21m
heapster-slx5t               1/1       Running     0          23m
metrics-deployer-rp1jp       0/1       Completed   0          23m

07:09:58,991 INFO  [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 70) Done creating Schema for keyspace: hawkular_alerts
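
The interruption in step 2 could be scripted roughly as follows. This is only a sketch: it assumes that deleting the pod with `oc delete pod` is an acceptable way to "kill" it, that `grep` supports `-m1`, and that the first "Creating table" log line is a good enough trigger. The function name and the pod-name argument are placeholders.

```shell
# Sketch only: follow the pod log until schema creation starts, then delete
# the pod so that schema creation is interrupted partway through.
kill_during_schema_creation() {
  metrics_pod="$1"   # placeholder, e.g. hawkular-metrics-xxxx
  # grep -m1 prints the first matching line and exits; the log follow then
  # ends the next time it tries to write to the closed pipe.
  oc logs -f "$metrics_pod" | grep -m1 'Creating table' || return 1
  oc delete pod "$metrics_pod"
}
```

If the log ends without a "Creating table" line, the function returns without deleting anything.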

Comment 19 errata-xmlrpc 2017-02-22 18:33:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0290

Comment 20 Matt Wringe 2017-03-14 19:49:28 UTC
*** Bug 1432150 has been marked as a duplicate of this bug. ***

Comment 21 Paul Dwyer 2017-03-15 13:10:19 UTC
*** Bug 1432150 has been marked as a duplicate of this bug. ***

