Bug 1121282

Summary:	Storage cluster reported as down, but not really down
Product:	[Other] RHQ Project	Reporter:	Elias Ross <genman>
Component:	Storage Node	Assignee:	RHQ Project Maintainer <rhq-maint>
Status:	CLOSED NOTABUG	QA Contact:	Mike Foley <mfoley>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	4.12
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2014-07-23 16:44:17 UTC	Type:	Bug

Description Elias Ross 2014-07-18 21:09:21 UTC

Description of problem:

[rhq@vp25q03ad-hadoop097 bin]$ ./nodetool -p 7299 status
Datacenter: 176
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  17.176.208.117  15.93 GB   256     69.6%             cad149ed-d5e1-4633-8e5a-6d6cb8a3da6b  208
UN  17.176.208.118  56.93 GB   256     64.9%             c421b915-9bc5-46bd-b26f-e88c89f114bf  208
UN  17.176.208.119  53.17 GB   256     65.5%             7367d69c-8fa6-4162-8b18-963c0ae1a229  208
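
To check output like the above without reading it by eye, a small script can parse the `nodetool status` rows and confirm that every node is UN (Up/Normal). This is only a sketch: the helper names `parse_nodes` and `all_up_normal` are made up for illustration, and the regex assumes the column layout shown in this report.

```python
import re

# Matches node rows such as "UN  17.176.208.117  15.93 GB  256 ...".
# The two-letter prefix is status (U/D) followed by state (N/L/J/M),
# as described by the legend lines in the nodetool output.
NODE_ROW = re.compile(r"^([UD])([NLJM])\s+(\S+)\s")


def parse_nodes(status_output):
    """Return a list of (address, status, state) tuples from `nodetool status` text."""
    nodes = []
    for line in status_output.splitlines():
        match = NODE_ROW.match(line.strip())
        if match:
            status, state, address = match.groups()
            nodes.append((address, status, state))
    return nodes


def all_up_normal(status_output):
    """True only if at least one node is listed and every listed node is UN."""
    nodes = parse_nodes(status_output)
    return bool(nodes) and all(s == "U" and st == "N" for _, s, st in nodes)


# Sample copied from this report.
SAMPLE = """\
Datacenter: 176
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  17.176.208.117  15.93 GB   256     69.6%             cad149ed-d5e1-4633-8e5a-6d6cb8a3da6b  208
UN  17.176.208.118  56.93 GB   256     64.9%             c421b915-9bc5-46bd-b26f-e88c89f114bf  208
UN  17.176.208.119  53.17 GB   256     65.5%             7367d69c-8fa6-4162-8b18-963c0ae1a229  208
"""

if __name__ == "__main__":
    for address, status, state in parse_nodes(SAMPLE):
        print(address, status + state)
    print("all nodes up/normal:", all_up_normal(SAMPLE))
```

Note that this only reflects Cassandra's own view of the ring. It says nothing about whether the RHQ server's driver session can reach those nodes.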

Logs:

21:07:37,792 WARN  [org.rhq.server.metrics.StorageSession] (http-/0.0.0.0:7080-585) Encountered NoHostAvailableException due to following error(s): {}
21:07:37,792 INFO  [org.rhq.enterprise.server.storage.StorageClusterMonitor] (http-/0.0.0.0:7080-513) Storage cluster is down
21:07:37,793 INFO  [org.rhq.enterprise.server.storage.StorageClusterMonitor] (http-/0.0.0.0:7080-585) Storage cluster is down
21:07:37,793 INFO  [org.rhq.enterprise.server.storage.StorageClusterMonitor] (http-/0.0.0.0:7080-459) Storage cluster is down
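
A rough way to quantify this in a larger server log is to count the "Storage cluster is down" reports and the `NoHostAvailableException` warnings whose error map is empty (`{}`). The sketch below assumes the log line layout seen above (time, level, logger in brackets, thread in parentheses, message). The function name `summarize` and the returned keys are hypothetical.

```python
import re

# Assumed layout, taken from the lines in this report:
#   HH:MM:SS,mmm LEVEL  [logger] (thread) message
LOG_LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s+(?P<level>\w+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+\((?P<thread>[^)]*)\)\s+(?P<message>.*)$"
)

# A NoHostAvailableException message whose per-host error map is empty.
EMPTY_ERRORS = re.compile(r"NoHostAvailableException.*error\(s\): \{\}\s*$")


def summarize(log_text):
    """Count cluster-down reports and empty-error-map exceptions in log text."""
    down_reports = 0
    empty_error_maps = 0
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip blank lines and anything not in the assumed layout
        message = match.group("message")
        if message == "Storage cluster is down":
            down_reports += 1
        elif EMPTY_ERRORS.search(message):
            empty_error_maps += 1
    return {"down_reports": down_reports, "empty_error_maps": empty_error_maps}


# Sample copied from this report.
SAMPLE_LOG = """\
21:07:37,792 WARN  [org.rhq.server.metrics.StorageSession] (http-/0.0.0.0:7080-585) Encountered NoHostAvailableException due to following error(s): {}
21:07:37,792 INFO  [org.rhq.enterprise.server.storage.StorageClusterMonitor] (http-/0.0.0.0:7080-513) Storage cluster is down
21:07:37,793 INFO  [org.rhq.enterprise.server.storage.StorageClusterMonitor] (http-/0.0.0.0:7080-585) Storage cluster is down
21:07:37,793 INFO  [org.rhq.enterprise.server.storage.StorageClusterMonitor] (http-/0.0.0.0:7080-459) Storage cluster is down
"""

if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```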


Version-Release number of selected component (if applicable): 4.12


How reproducible: Unclear. It seems to have happened after I got some timeouts at startup. Startup took a long time, so I wonder if there is some sort of conflict.

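The error message looks suspicious, though: the NoHostAvailableException lists no underlying errors ({}), even though nodetool shows all three nodes as UN (Up/Normal).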

Comment 1 Elias Ross 2014-07-21 19:11:22 UTC

I had trouble running repair. It seems there is an installation issue with Cassandra.

After I ran repair repeatedly over the weekend, things seem to work okay. I don't know the root cause, though. The Cassandra logs don't give enough detail to tell whether there were any IO errors.

My suspicion is that there is either a capacity or a load issue, but since this also happened with 4.9, I'm guessing it is not an RHQ issue.