Description of problem:
When we deploy metrics on 3.4, we get errors from hawkular-metrics. The error we saw when running "oc logs hawkular-metrics-xxxxx" was:

ERROR [org.jboss.as] (Controller Boot Thread) WFLYSRV0026: JBoss EAP 7.0.3.GA (WildFly Core 2.1.9.Final-redhat-1) started (with errors) in 265006ms - Started 896 of 1420 services (15 services failed or missing dependencies, 772 services are lazy, passive or on-demand)

We installed logging and metrics with our custom ansible scripts, which can be found here: https://github.com/openshift/openshift-tools/blob/prod/ansible/roles/openshift_metrics/tasks/main.yml

These scripts attempt to take the documented metrics install and automate it using ansible and ansible modules. If there is an issue with our scripts, we could look at fixing that.

Version-Release number of selected component (if applicable):
3.4.0

How reproducible:
This happens when we install metrics on a 3.4 cluster.
This is related to the following bug, marking it here: https://issues.jboss.org/browse/HWKALERTS-220
The logs would indicate that the Cassandra DB already had the alerting keyspace defined at install time. This is strange: it would indicate that either a previous install had already been interrupted, or that the Cassandra DB was somehow "dirty" for some other reason. Is this failure repeatable on clean 3.4 environments?
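For what it's worth, whether the alerting keyspace already exists can be checked directly; a quick check, assuming the same SSL-enabled cqlsh setup used elsewhere in this thread:

cqlsh --ssl -e "select keyspace_name from system_schema.keyspaces where keyspace_name = 'hawkular_alerts'"

If this returns a row, the keyspace was already defined.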
The simplest resolution may involve dropping the alerts schema and restarting the hawkular-metrics pod. To drop the keyspace, execute:

cqlsh --ssl -e "drop keyspace hawkular_alerts"
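If running from outside the container, a hypothetical equivalent via oc exec (the Cassandra pod name below is a placeholder, and this assumes cqlsh inside the container is already configured for SSL):

oc exec hawkular-cassandra-1-xxxxx -- cqlsh --ssl -e "drop keyspace hawkular_alerts"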
The creation of the schema at startup is problematic if it gets interrupted. The assumption we are going on here is that the pod got killed while the schema was being created and it didn't finish creating the schema properly.

Since the alerts keyspace should be empty here, it should be safe to drop it.

@jsanda: how can we determine if the keyspace actually is empty and this command will not be destructive to the data?

This shouldn't happen often for users, but it is concerning that the schema creation is not more robust and cannot automatically continue if the server is shut down during creation.
(In reply to Matt Wringe from comment #5)
> The creation of the schema at startup is problematic if it gets interrupted.
> The assumption we are going on here is that the pod got killed while the
> schema was being created and it didn't finish creating the schema properly.
>
> Since the alerts keyspace should be empty here, it should be safe to drop it.
>
> @jsanda: how can we determine if the keyspace actually is empty and this
> command will not be destructive to the data?

It could be automated with a bash script. Here is what you need to check.

First run:

cqlsh --ssl -e "select table_name from system_schema.tables where keyspace_name = 'hawkular_alerts'"

This should return results that look like:

 table_name
-------------------------
 action_plugins
 actions_definitions
 actions_history
 actions_history_actions
 actions_history_alerts
 actions_history_ctimes
 actions_history_results
 alerts
 alerts_ctimes
 alerts_lifecycle
 alerts_triggers
 cassalog
 conditions
 dampenings
 events
 events_categories
 events_ctimes
 events_triggers
 sys_config
 tags
 triggers
 triggers_actions

Then you need to loop over each table and run a query like:

cqlsh --ssl -e "select * from hawkular_alerts.$table limit 1"

If the table is empty you will get back a header row with each of the columns and at the bottom it will say "(o rows)" for an empty result set.
(In reply to John Sanda from comment #6)
> If the table is empty you will get back a header row with each of the
> columns and at the bottom it will say "(o rows)" for an empty result set.

That should be (0 rows).
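For reference, a minimal bash sketch of the check described in comments #6 and #7. It assumes cqlsh is on the PATH and already configured for SSL (for example, when run from inside the hawkular-cassandra pod); the output parsing is illustrative and may need adjusting for a particular cqlsh version:

#!/bin/bash
# Check that every table in the hawkular_alerts keyspace is empty
# before dropping the keyspace.

KEYSPACE=hawkular_alerts

# List the tables, stripping cqlsh's header, separator, and "(N rows)" footer.
tables=$(cqlsh --ssl -e "select table_name from system_schema.tables where keyspace_name = '$KEYSPACE'" \
         | grep -v -e 'table_name' -e '^-' -e 'rows)' \
         | tr -d ' ')

empty=true
for table in $tables; do
  # An empty table prints "(0 rows)" at the bottom of its result set.
  if ! cqlsh --ssl -e "select * from $KEYSPACE.$table limit 1" | grep -q '(0 rows)'; then
    echo "$KEYSPACE.$table contains data"
    empty=false
  fi
done

if $empty; then
  echo "All tables in $KEYSPACE are empty; it should be safe to drop the keyspace."
else
  echo "$KEYSPACE is NOT empty; do not drop it."
fi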
The alerts keyspace didn't appear to have any data in it, so it was dropped. After dropping the alerts keyspace, Hawkular Metrics was able to start up properly.

I am not sure if this is going to be a rare occurrence that we just need to document a workaround for, or if we need to look into making the code more robust here. From talking with @jsanda, it doesn't seem feasible to make the schema creation process able to continue if the server was shut down midway through.
I created https://issues.jboss.org/browse/HWKMETRICS-594 to help track if we can automatically fix some of these issues in Hawkular Metrics or not.
I have submitted pull requests for both metrics and alerts. The code changes are complete.
Does this still need to be an opsblocker for now? We have already explained how to prevent this issue from happening (increasing the timeout value) and how to work around it if it happens again. It should also only be happening in certain specific setups. We are working on getting this backported to 3.4, and it should hopefully show up in the next release.
This should be fixed in openshift3/metrics-hawkular-metrics:3.4.1-6 which should be in all the standard testing areas.
Verified with:

openshift3/metrics-hawkular-metrics   3.4.1-6   e0efa6ff3575   13 hours ago   1.5 GB

Steps:
1. Deploy 3.4.1 metrics.
2. While the pods are starting, monitor the pod log with "oc logs -f hawkular-metrics-xxxx". When the lines below show up, kill the pod:

07:08:53,013 INFO [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Checking Schema existence for keyspace: hawkular_alerts
07:08:54,667 INFO [org.cassalog.core.CassalogImpl] (ServerService Thread Pool -- 67) Executing [script:vfs:/content/hawkular-metrics.ear/hawkular-alerts.war/WEB-INF/lib/hawkular-alerts-engine-1.3.4.Final-redhat-1.jar/org/hawkular/alerts/schema/cassalog.groovy, tags:[], vars:[keyspace:hawkular_alerts, reset:false, session:com.datastax.driver.core.SessionManager@48e91b7f, logger:org.jboss.logging.JBossLogManagerLogger@12bc69e5]]
07:08:55,508 INFO [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table triggers
07:08:55,881 INFO [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table triggers_actions
07:08:56,140 INFO [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table conditions
07:08:56,728 INFO [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table dampenings
07:08:57,337 INFO [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table action_plugins
07:08:57,615 INFO [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table actions_definitions
07:08:57,881 INFO [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table actions_history
07:08:58,194 INFO [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table actions_history_actions
07:08:58,646 INFO [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table actions_history_alerts
07:08:59,031 INFO [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table actions_history_ctimes
07:08:59,394 INFO [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table actions_history_results
07:08:59,828 INFO [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table tags
07:09:00,223 INFO [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 67) Creating table alerts
*** JBossAS process (226) received TERM signal ***

3. Wait a while to see if the hawkular-metrics pod resumes and creates the keyspace again. Note the AGE column: hawkular-metrics is younger than the other pods. Then check the log; the schema is created again.

# oc get pod
NAME                         READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-pgt9j   1/1       Running     0          23m
hawkular-metrics-2vwff       1/1       Running     0          21m
heapster-slx5t               1/1       Running     0          23m
metrics-deployer-rp1jp       0/1       Completed   0          23m

07:09:58,991 INFO [org.hawkular.alerts.engine.impl.CassCluster] (ServerService Thread Pool -- 70) Done creating Schema for keyspace: hawkular_alerts
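For reference, step 2 above can be scripted rather than done by hand; a rough sketch (the pod name is a placeholder) that kills the pod as soon as the alerts table creation starts:

oc logs -f hawkular-metrics-xxxx | while read -r line; do
  echo "$line"
  case "$line" in
    # Delete the pod mid-way through table creation to interrupt schema setup.
    *"Creating table"*)
      oc delete pod hawkular-metrics-xxxx
      break
      ;;
  esac
done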
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0290
*** Bug 1432150 has been marked as a duplicate of this bug. ***