| Summary: | Error in Last Refresh status after adding hawkular provider | ||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [JBoss] Middleware Manager | Reporter: | Sunil Kondkar <skondkar> | ||||||||||||||||||||||
| Component: | Other | Assignee: | Heiko W. Rupp <hrupp> | ||||||||||||||||||||||
| Status: | VERIFIED --- | QA Contact: | Matt Mahoney <mmahoney> | ||||||||||||||||||||||
| Severity: | urgent | Docs Contact: | |||||||||||||||||||||||
| Priority: | urgent | ||||||||||||||||||||||||
| Version: | 7.0.0 TP2 | CC: | fbrychta, hrupp, jhardy, mmahoney | ||||||||||||||||||||||
| Target Milestone: | --- | Keywords: | Triaged | ||||||||||||||||||||||
| Target Release: | --- | ||||||||||||||||||||||||
| Hardware: | Unspecified | ||||||||||||||||||||||||
| OS: | Unspecified | ||||||||||||||||||||||||
| Whiteboard: | Hawkular | ||||||||||||||||||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||||||||||||||||||
| Doc Text: | Story Points: | --- | |||||||||||||||||||||||
| Clone Of: | Environment: | ||||||||||||||||||||||||
| Last Closed: | Type: | Bug | |||||||||||||||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||||||||||||||
| Documentation: | --- | CRM: | |||||||||||||||||||||||
| Verified Versions: | Category: | --- | |||||||||||||||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||||||||||||
| Cloudforms Team: | Middleware | Target Upstream Version: | |||||||||||||||||||||||
| Attachments: |
|
||||||||||||||||||||||||
|
Description
Sunil Kondkar
2016-09-14 15:10:44 UTC
Created attachment 1200872 [details]
evm.log
Created attachment 1201238 [details]
HS DR2 log
This is the same as https://issues.jboss.org/browse/HAWKULAR-1120 and seems to be an issue in the upstream Hawkular server. Is this also run here on Kubernetes? Can you try on "plain Docker" as described e.g. at http://pilhuhn.blogspot.de/2016/06/using-hawkular-services-via-docker.html Tried hawkular services installation steps with docker-compose at: http://pilhuhn.blogspot.in/2016/06/using-hawkular-services-via-docker.html It shows the same error in provider details Last Refresh status after adding the provider: WELD-000052: Cannot return null from a non-dependent producer method: Producer for Producer Method [MetricsService] Created attachment 1201283 [details]
Cassandra Log
Sunil/Matt, can you restart the Hawkular server and attach the server log after it has been running for a few minutes? I would like to compare that with the server log already provided. After restarting, we shouldn't see all the schema initialization. I am wondering if that is contributing to the errors. Created attachment 1201727 [details]
hs-container-restart
Thanks for the additional log. It looks like due to the errors from the initial deployment, metrics got into somewhat of an inconsistent state in the second deployment with respect to schema updates. It looks like during the first deployment a schema change was performed, but we failed to record the change. During the second deployment, we tried to apply the change again (dropping a table) since we didn't have record of the change. Because the table didn't exist, we got an error and metrics failed to initialize. We certainly want to avoid getting into an inconsistent state like this, but the situation with second deployment is tangential to the original problem. If network storage is being used, that could certainly be an issue, but I don't think that in and of itself is the problem. I haven't done any benchmarking but some earlier investigation makes me think that DDL statements can some times take a bit of time to complete. If that is the case, then maybe having a bunch of concurrent schema updates take place while all of the hawkular services initialize can be problematic. This is just a theory of which I am not entirely convinced. How often does this happen? Note that my question applies strictly to new installations in which all schema changes are applied. In the server.log provided one of the exceptions we're focusing on is: --- HAWKMETRICS200006: An error occurred trying to connect to the Cassandra cluster: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/172.17.0.2:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/172.17.0.2] Timed out waiting for server response)) --- which happens during schema creation. OperationTimedOutException is a client-side timeout. It does not necessarily mean that there is a timeout on the Cassandra side. In another QE environment, we have the same exception, but at a different place: --- 2016-09-19 05:03:34,118 ERROR [org.hawkular.alerts.engine.impl.CassDefinitionsServiceImpl] (ServerService Thread Pool -- 70) Failed schema creation: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/172.17.0.2:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/172.17.0.2] Timed out waiting for server response)) --- In this case the time out occurs while initializing alerts schema. We can increase timeout settings, but that would have to be done separately for each of the hawkular services as each service manages driver connections separately. And I am not entirely convinced increasing timeout settings is the best course of action. Created attachment 1202612 [details]
hs-post-cassandra-auto_snapshot-change
Created attachment 1202624 [details]
/hs-post-cassandra-auto_snapshot-change-failed
There were a handful of minor schema changes that went into Hawkular Metrics 0.19.0. The last of those was dropping an used table. After more investigation and testing with Matt, I am reasonably certain that dropping the table is indeed the impetus for the timeout exceptions. I want to do more testing to figure out what is happening, but I will undo the schema change for now so we get a stable, working build for QE to test. Matt, Can we get another test run with debug logging turned on for Cassandra? There are a couple of ways you could turn on debug logging. I will cover both since I am not sure which will be easier to do. 1) Use nodetool After Cassandra is started run `/opt/apache-cassandra/bin/nodetool setlogginglevel org.apache.cassandra DEBUG`. The changes are not persisted and therefore logging levels will change back to original values on restart. 2) Update logback.xml Edit/replace /opt/apache-cassandra/conf/logback.xml. I will upload an example file that can be used. Created attachment 1202904 [details]
Cassandra logback.xml with debug logging enabled
Created attachment 1202982 [details]
hs-with-logback-debug-endabled
Created attachment 1202983 [details]
cassandra-with-logback-debug-endabled
Adding update for recent test on DR3 on couple of instances. Launched cassandra and hawkular services using the quickstart guide. The hawkular services shows below versions: Hawkular Services: Running 0.0.13.Final-redhat-1 (commit 4cb6ee134197c1af6cc5c908db863299f1a012d7) Hawkular Metrics: Running 0.19.3.Final Hawkular Alerts: Running 1.1.4.Final Hawkular Inventory: Running 0.18.0.Final After adding the hawkular provider, it is still showing the last refresh error as below: WELD-000052: Cannot return null from a non-dependent producer method: Producer for Producer Method [MetricsService] (In reply to Sunil Kondkar from comment #24) > Adding update for recent test on DR3 on couple of instances. > > Launched cassandra and hawkular services using the quickstart guide. The > hawkular services shows below versions: > > Hawkular Services: Running 0.0.13.Final-redhat-1 (commit > 4cb6ee134197c1af6cc5c908db863299f1a012d7) > Hawkular Metrics: Running 0.19.3.Final > Hawkular Alerts: Running 1.1.4.Final > Hawkular Inventory: Running 0.18.0.Final > > After adding the hawkular provider, it is still showing the last refresh > error as below: > > WELD-000052: Cannot return null from a non-dependent producer method: > Producer for Producer Method [MetricsService] Is this a new installation or an upgrade? And you provide a link to the quick start guide? Latest update: Launched new cassandra, hawkular services and CloudForms containers using the quickstart guide: https://docs.engineering.redhat.com/display/JP/CloudForms+Middleware+-+Quickstart+Guide The hawkular version now running is: Hawkular Services: Running 0.0.14.Final-redhat-1 (commit 4b5ae40eb4f44fe95b0fa2d0997c315b09e75bd8) ( Note that previous version where error was visible, was showing HS version as - 0.0.13.Final-redhat-1 (commit 4cb6ee134197c1af6cc5c908db863299f1a012d7) The error is not visible after adding the hawkular provider (with version-0.0.14.Final) . It shows the success message. Verified adding the provider with hawkular services version 0.0.14.Final-redhat-1 (commit 4b5ae40eb4f44fe95b0fa2d0997c315b09e75bd8) on existing CloudForms instance and new CloudForms instance does not display the error. This issue is very probably caused by hawkular metrics not being started. See bz1381616 even though the hawkular services summary page is reporting hawkular metrics as running it's not true in some cases. This is another issue caused by slow environment similar to https://issues.jboss.org/browse/HWKALERTS-155 I created new JIRA for timeout issue in metrics component - https://issues.jboss.org/browse/HWKMETRICS-496 |