1121282 – Storage cluster reported as down, but not really down

Summary: Storage cluster reported as down, but not really down

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Storage Node
Sub Component:
Version:	4.12
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	RHQ Project Maintainer
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-07-18 21:09 UTC by Elias Ross
Modified:	2014-07-23 16:44 UTC
CC List:	0 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-07-23 16:44:17 UTC
Embargoed:

Description Elias Ross 2014-07-18 21:09:21 UTC

Description of problem:

[rhq@vp25q03ad-hadoop097 bin]$ ./nodetool -p 7299 status
Datacenter: 176
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  17.176.208.117  15.93 GB   256     69.6%             cad149ed-d5e1-4633-8e5a-6d6cb8a3da6b  208
UN  17.176.208.118  56.93 GB   256     64.9%             c421b915-9bc5-46bd-b26f-e88c89f114bf  208
UN  17.176.208.119  53.17 GB   256     65.5%             7367d69c-8fa6-4162-8b18-963c0ae1a229  208
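
A check like the one above could be scripted. The following is only a minimal sketch, assuming the `nodetool status` layout shown in this report; the helper names `parse_node_states` and `all_up_and_normal` are hypothetical and are not part of RHQ or Cassandra. It reads the two-letter status/state code at the start of each node row and reports whether every listed node is `UN` (Up/Normal).

```python
import re

# Matches one node row of `nodetool status` output: a two-letter
# status/state code (U/D followed by N/L/J/M), then the node address.
NODE_ROW = re.compile(r"^([UD][NLJM])\s+(\S+)\s")


def parse_node_states(status_output):
    """Return {address: status_code} for every node row in the output."""
    states = {}
    for line in status_output.splitlines():
        match = NODE_ROW.match(line.strip())
        if match:
            code, address = match.groups()
            states[address] = code
    return states


def all_up_and_normal(status_output):
    """True only if at least one node is listed and every node is 'UN'."""
    states = parse_node_states(status_output)
    return bool(states) and all(code == "UN" for code in states.values())


# Sample input copied from the output in this report.
SAMPLE = """\
Datacenter: 176
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  17.176.208.117  15.93 GB   256     69.6%             cad149ed-d5e1-4633-8e5a-6d6cb8a3da6b  208
UN  17.176.208.118  56.93 GB   256     64.9%             c421b915-9bc5-46bd-b26f-e88c89f114bf  208
UN  17.176.208.119  53.17 GB   256     65.5%             7367d69c-8fa6-4162-8b18-963c0ae1a229  208
"""

if __name__ == "__main__":
    print(parse_node_states(SAMPLE))
    print(all_up_and_normal(SAMPLE))
```

In practice the text would come from running `./nodetool -p 7299 status` and capturing its standard output. A result of `True` here, alongside "Storage cluster is down" in the server log, would point at the server's view of the cluster rather than at the nodes themselves.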

Logs:

21:07:37,792 WARN  [org.rhq.server.metrics.StorageSession] (http-/0.0.0.0:7080-585) Encountered NoHostAvailableException due to following error(s): {}
21:07:37,792 INFO  [org.rhq.enterprise.server.storage.StorageClusterMonitor] (http-/0.0.0.0:7080-513) Storage cluster is down
21:07:37,793 INFO  [org.rhq.enterprise.server.storage.StorageClusterMonitor] (http-/0.0.0.0:7080-585) Storage cluster is down
21:07:37,793 INFO  [org.rhq.enterprise.server.storage.StorageClusterMonitor] (http-/0.0.0.0:7080-459) Storage cluster is down


Version-Release number of selected component (if applicable): 4.12


How reproducible: Unclear. It seems to have happened when I got some timeouts at startup. Startup took a long time, so I wonder whether there is some sort of conflict.

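The error message looks really suspicious, though: the NoHostAvailableException reports an empty set of errors ({}), so it gives no per-host reason for the failure.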

Comment 1 Elias Ross 2014-07-21 19:11:22 UTC

I had trouble running repair. It seems there is an installation issue with Cassandra.

After enough attempts at running repair, things seemed to work okay once I ran repair over the weekend. I don't know the root cause, though. The Cassandra logs don't reveal much detail, such as whether there were any IO errors.

I suspect either a capacity or a load issue, but since this also happened with 4.9, I'm guessing it is not an RHQ issue.
