Bug 1376074

Summary: Error in Last Refresh status after adding hawkular provider
Product: [JBoss] Middleware Manager Reporter: Sunil Kondkar <skondkar>
Component: OtherAssignee: Heiko W. Rupp <hrupp>
Status: VERIFIED --- QA Contact: Matt Mahoney <mmahoney>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.0.0 TP2CC: fbrychta, hrupp, jhardy, mmahoney
Target Milestone: ---Keywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: Hawkular
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: Middleware Target Upstream Version:
Attachments:
Description Flags
Screenshot
none
evm.log
none
HS DR2 log
none
Cassandra Log
none
hs-container-restart
none
hs-post-cassandra-auto_snapshot-change
none
/hs-post-cassandra-auto_snapshot-change-failed
none
Cassandra logback.xml with debug logging enabled
none
hs-with-logback-debug-endabled
none
cassandra-with-logback-debug-endabled none

Description Sunil Kondkar 2016-09-14 15:10:44 UTC
Created attachment 1200868 [details]
Screenshot

Description of problem:

Followed the quick-start guide to launch Hawkular services and CloudForms containers. 

When adding the Hawkular services provider(that was started using the quickstart guide) to the Cloudforms UI , the 'Last Refresh' status on provider details shows below:

WELD-000052: Cannot return null from a non-dependent producer method: Producer for Producer Method [MetricsService]

The hawkular services server.log shows below continuously:

2016-09-14 09:53:25,171 ERROR [org.hawkular.alerts.engine.log] (default task-4) HAWKALERT220008: Database Exception. Msg: [unconfigured table events_ctimes]
2016-09-14 09:53:25,173 ERROR [org.jboss.as.ejb3.invocation] (default task-4) WFLYEJB0034: EJB Invocation failed on component CassAlertsServiceImpl for method public abstract org.hawkular.alerts.api.model.paging.Page org.hawkular.alerts.api.services.AlertsService.getEvents(java.lang.String,org.hawkular.alerts.api.services.EventsCriteria,org.hawkular.alerts.api.model.paging.Pager) throws java.lang.Exception: javax.ejb.EJBException: com.datastax.driver.core.exceptions.InvalidQueryException: unconfigured table events_ctimes


Version-Release number of selected component (if applicable):

DR1

How reproducible:

Always

Steps to Reproduce:

1. Launch cassandra, hawkular services and CloudForms containers using the quickstart guide:

https://docs.engineering.redhat.com/display/JP/CloudForms+Middleware+-+Quickstart+Guide#CloudFormsMiddleware-QuickstartGuide-CloudFormswithMiddlewareIntegrationEnabled

2. Once the CloudForms UI is on, add the hawkular provider.
3. After some time check the 'Last Refresh' status in 'Status' section
4. The 'Last Refresh' status shows error :
WELD-000052: Cannot return null from a non-dependent producer method: Producer for Producer Method [MetricsService]

Actual results:

The 'Last Refresh' status shows error after adding hawkular provider

Expected results:

No errors

Additional info:

Comment 2 Sunil Kondkar 2016-09-14 15:31:19 UTC
Created attachment 1200872 [details]
evm.log

Comment 3 Matt Mahoney 2016-09-15 13:16:47 UTC
Created attachment 1201238 [details]
HS DR2 log

Comment 5 Heiko W. Rupp 2016-09-15 13:52:13 UTC
This is the same as https://issues.jboss.org/browse/HAWKULAR-1120 and seems to be an issue in the upstream Hawkular server.
Is this also run here on Kubernetes?
Can you try on "plain Docker" as described e.g. at http://pilhuhn.blogspot.de/2016/06/using-hawkular-services-via-docker.html

Comment 6 Sunil Kondkar 2016-09-15 14:42:56 UTC
Tried hawkular services installation steps with docker-compose at:  http://pilhuhn.blogspot.in/2016/06/using-hawkular-services-via-docker.html

It shows the same error in provider details Last Refresh status after adding the provider:

WELD-000052: Cannot return null from a non-dependent producer method: Producer for Producer Method [MetricsService]

Comment 7 Matt Mahoney 2016-09-15 14:58:32 UTC
Created attachment 1201283 [details]
Cassandra Log

Comment 8 John Sanda 2016-09-16 14:06:47 UTC
Sunil/Matt, can you restart the Hawkular server and attach the server log after it has been running for a few minutes? I would like to compare that with the server log already provided. After restarting, we shouldn't see all the schema initialization. I am wondering if that is contributing to the errors.

Comment 11 Matt Mahoney 2016-09-16 17:10:49 UTC
Created attachment 1201727 [details]
hs-container-restart

Comment 12 John Sanda 2016-09-16 21:25:18 UTC
Thanks for the additional log. It looks like due to the errors from the initial deployment, metrics got into somewhat of an inconsistent state in the second deployment with respect to schema updates. It looks like during the first deployment a schema change was performed, but we failed to record the change. During the second deployment, we tried to apply the change again (dropping a table) since we didn't have record of the change. Because the table didn't exist, we got an error and metrics failed to initialize. We certainly want to avoid getting into an inconsistent state like this, but the situation with second deployment is tangential to the original problem.

If network storage is being used, that could certainly be an issue, but I don't think that in and of itself is the problem. I haven't done any benchmarking but some earlier investigation makes me think that DDL statements can some times take a bit of time to complete. If that is the case, then maybe having a bunch of concurrent schema updates take place while all of the hawkular services initialize can be problematic. This is just a theory of which I am not entirely convinced. 

How often does this happen? Note that my question applies strictly to new installations in which all schema changes are applied.

Comment 13 John Sanda 2016-09-19 16:46:37 UTC
In the server.log provided one of the exceptions we're focusing on is:

---
HAWKMETRICS200006: An error occurred trying to connect to the Cassandra cluster: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/172.17.0.2:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/172.17.0.2] Timed out waiting for server response))
---

which happens during schema creation. OperationTimedOutException is a client-side timeout. It does not necessarily mean that there is a timeout on the Cassandra side. In another QE environment, we have the same exception, but at a different place:

---
2016-09-19 05:03:34,118 ERROR [org.hawkular.alerts.engine.impl.CassDefinitionsServiceImpl] (ServerService Thread Pool -- 70) Failed schema creation: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/172.17.0.2:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/172.17.0.2] Timed out waiting for server response))
---

In this case the time out occurs while initializing alerts schema. We can increase timeout settings, but that would have to be done separately for each of the hawkular services as each service manages driver connections separately. And I am not entirely convinced increasing timeout settings is the best course of action.

Comment 14 Matt Mahoney 2016-09-19 18:46:08 UTC
Created attachment 1202612 [details]
hs-post-cassandra-auto_snapshot-change

Comment 16 Matt Mahoney 2016-09-19 20:26:32 UTC
Created attachment 1202624 [details]
/hs-post-cassandra-auto_snapshot-change-failed

Comment 17 John Sanda 2016-09-20 03:28:23 UTC
There were a handful of minor schema changes that went into Hawkular Metrics 0.19.0. The last of those was dropping an used table. After more investigation and testing with Matt, I am reasonably certain that dropping the table is indeed the impetus for the timeout exceptions. I want to do more testing to figure out what is happening, but I will undo the schema change for now so we get a stable, working build for QE to test.

Comment 18 John Sanda 2016-09-20 13:58:35 UTC
Matt,

Can we get another test run with debug logging turned on for Cassandra? There are a couple of ways you could turn on debug logging. I will cover both since I am not sure which will be easier to do.

1) Use nodetool
After Cassandra is started run `/opt/apache-cassandra/bin/nodetool setlogginglevel org.apache.cassandra DEBUG`. The changes are not persisted and therefore logging levels will change back to original values on restart.

2) Update logback.xml
Edit/replace /opt/apache-cassandra/conf/logback.xml. I will upload an example file that can be used.

Comment 19 John Sanda 2016-09-20 13:59:36 UTC
Created attachment 1202904 [details]
Cassandra logback.xml with debug logging enabled

Comment 21 Matt Mahoney 2016-09-20 17:14:53 UTC
Created attachment 1202982 [details]
hs-with-logback-debug-endabled

Comment 22 Matt Mahoney 2016-09-20 17:15:22 UTC
Created attachment 1202983 [details]
cassandra-with-logback-debug-endabled

Comment 24 Sunil Kondkar 2016-09-27 14:53:41 UTC
Adding update for recent test on DR3 on couple of instances.

Launched cassandra and hawkular services using the quickstart guide. The hawkular services shows below versions:

Hawkular Services: 	Running	0.0.13.Final-redhat-1 (commit 4cb6ee134197c1af6cc5c908db863299f1a012d7)	
Hawkular Metrics: 	Running	0.19.3.Final	
Hawkular Alerts: 	Running	1.1.4.Final	
Hawkular Inventory:	Running	0.18.0.Final

After adding the hawkular provider, it is still showing the last refresh error as below:

WELD-000052: Cannot return null from a non-dependent producer method: Producer for Producer Method [MetricsService]

Comment 25 John Sanda 2016-09-27 15:18:33 UTC
(In reply to Sunil Kondkar from comment #24)
> Adding update for recent test on DR3 on couple of instances.
> 
> Launched cassandra and hawkular services using the quickstart guide. The
> hawkular services shows below versions:
> 
> Hawkular Services: 	Running	0.0.13.Final-redhat-1 (commit
> 4cb6ee134197c1af6cc5c908db863299f1a012d7)	
> Hawkular Metrics: 	Running	0.19.3.Final	
> Hawkular Alerts: 	Running	1.1.4.Final	
> Hawkular Inventory:	Running	0.18.0.Final
> 
> After adding the hawkular provider, it is still showing the last refresh
> error as below:
> 
> WELD-000052: Cannot return null from a non-dependent producer method:
> Producer for Producer Method [MetricsService]

Is this a new installation or an upgrade? And you provide a link to the quick start guide?

Comment 26 Sunil Kondkar 2016-09-27 16:51:30 UTC
Latest update:

Launched new cassandra, hawkular services and CloudForms containers using the quickstart guide:

https://docs.engineering.redhat.com/display/JP/CloudForms+Middleware+-+Quickstart+Guide

The hawkular version now running is:
Hawkular Services: 	Running	0.0.14.Final-redhat-1 (commit 4b5ae40eb4f44fe95b0fa2d0997c315b09e75bd8)

( Note that previous version where error was visible, was showing HS version as - 0.0.13.Final-redhat-1 (commit 4cb6ee134197c1af6cc5c908db863299f1a012d7)

The error is not visible after adding the hawkular provider (with version-0.0.14.Final) . It shows the success message.

Verified adding the provider with hawkular services version 0.0.14.Final-redhat-1 (commit 4b5ae40eb4f44fe95b0fa2d0997c315b09e75bd8) on existing CloudForms instance and new CloudForms instance does not display the error.

Comment 27 Filip Brychta 2016-10-04 15:22:00 UTC
This issue is very probably caused by hawkular metrics not being started.
See bz1381616 even though the hawkular services summary page is reporting hawkular metrics as running it's not true in some cases.
This is another issue caused by slow environment similar to https://issues.jboss.org/browse/HWKALERTS-155

Comment 28 Filip Brychta 2016-10-06 11:33:20 UTC
I created new JIRA for timeout issue in metrics component - https://issues.jboss.org/browse/HWKMETRICS-496