Bug 1121282 - Storage cluster reported as down, but not really down
Summary: Storage cluster reported as down, but not really down
Alias: None
Product: RHQ Project
Classification: Other
Component: Storage Node
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: RHQ Project Maintainer
QA Contact: Mike Foley
Depends On:
Reported: 2014-07-18 21:09 UTC by Elias Ross
Modified: 2014-07-23 16:44 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2014-07-23 16:44:17 UTC

Description Elias Ross 2014-07-18 21:09:21 UTC
Description of problem:

[rhq@vp25q03ad-hadoop097 bin]$ ./nodetool -p 7299 status
Datacenter: 176
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  15.93 GB   256     69.6%             cad149ed-d5e1-4633-8e5a-6d6cb8a3da6b  208
UN  56.93 GB   256     64.9%             c421b915-9bc5-46bd-b26f-e88c89f114bf  208
UN  53.17 GB   256     65.5%             7367d69c-8fa6-4162-8b18-963c0ae1a229  208


21:07:37,792 WARN  [org.rhq.server.metrics.StorageSession] (http-/ Encountered NoHostAvailableException due to following error(s): {}
21:07:37,792 INFO  [org.rhq.enterprise.server.storage.StorageClusterMonitor] (http-/ Storage cluster is down
21:07:37,793 INFO  [org.rhq.enterprise.server.storage.StorageClusterMonitor] (http-/ Storage cluster is down
21:07:37,793 INFO  [org.rhq.enterprise.server.storage.StorageClusterMonitor] (http-/ Storage cluster is down

Version-Release number of selected component (if applicable): 4.12

How reproducible: Unclear. It seems to have happened after I got some timeouts at startup. Startup took a long time, so I wonder whether there is some sort of conflict.

The error message looks really suspicious, though.
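The mismatch above (nodetool shows every node as UN while the server logs "Storage cluster is down") can be checked mechanically. The following is only a minimal sketch, not anything RHQ ships: the function names and the SAMPLE text are hypothetical, and it assumes each node row of `nodetool status` output begins with a two-letter status/state code such as UN, as in the output pasted above.

```python
import re

# Assumption: each node row starts with a status letter (U/D) followed by a
# state letter (N/L/J/M), as in the pasted output. Header lines do not match.
NODE_ROW = re.compile(r"^([UD][NLJM])\s+")

# Hypothetical sample, copied from the report (address column is absent there).
SAMPLE = """Datacenter: 176
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  15.93 GB   256     69.6%             cad149ed-d5e1-4633-8e5a-6d6cb8a3da6b  208
UN  56.93 GB   256     64.9%             c421b915-9bc5-46bd-b26f-e88c89f114bf  208
UN  53.17 GB   256     65.5%             7367d69c-8fa6-4162-8b18-963c0ae1a229  208
"""


def summarize_nodetool_status(output):
    """Count node rows by their two-letter status/state code."""
    counts = {}
    for line in output.splitlines():
        match = NODE_ROW.match(line.strip())
        if match:
            code = match.group(1)
            counts[code] = counts.get(code, 0) + 1
    return counts


def all_nodes_up_normal(output):
    """True only if at least one node row exists and every row is UN."""
    counts = summarize_nodetool_status(output)
    return bool(counts) and set(counts) == {"UN"}


if __name__ == "__main__":
    print(summarize_nodetool_status(SAMPLE))
    print(all_nodes_up_normal(SAMPLE))
```

If a check like this reports all nodes as UN at the same moment the server logs the cluster as down, that would point toward the client side (for example the driver's view of the hosts) rather than the Cassandra nodes themselves, though that is a guess and not a confirmed diagnosis.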

Comment 1 Elias Ross 2014-07-21 19:11:22 UTC
I had trouble running repair. It seems there is an installation issue with Cassandra.

After running repair several times, including over the weekend, things seemed to work okay. I don't know the root cause, though. The Cassandra logs don't give much detail about whether there were any IO errors.

My suspicion is that there is either a capacity or a load issue, but since this also happened with 4.9, I'm guessing it is not an RHQ issue.
