Bug 1121282

Summary: Storage cluster reported as down, but not really down
Product: [Other] RHQ Project
Reporter: Elias Ross <genman>
Component: Storage Node
Assignee: RHQ Project Maintainer <rhq-maint>
Status: CLOSED NOTABUG
QA Contact: Mike Foley <mfoley>
Severity: unspecified
Docs Contact:
Priority: unspecified    
Version: 4.12   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Last Closed: 2014-07-23 16:44:17 UTC
Type: Bug

Description Elias Ross 2014-07-18 21:09:21 UTC
Description of problem:

[rhq@vp25q03ad-hadoop097 bin]$ ./nodetool -p 7299 status
Datacenter: 176
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  17.176.208.117  15.93 GB   256     69.6%             cad149ed-d5e1-4633-8e5a-6d6cb8a3da6b  208
UN  17.176.208.118  56.93 GB   256     64.9%             c421b915-9bc5-46bd-b26f-e88c89f114bf  208
UN  17.176.208.119  53.17 GB   256     65.5%             7367d69c-8fa6-4162-8b18-963c0ae1a229  208

Logs:

21:07:37,792 WARN  [org.rhq.server.metrics.StorageSession] (http-/0.0.0.0:7080-585) Encountered NoHostAvailableException due to following error(s): {}
21:07:37,792 INFO  [org.rhq.enterprise.server.storage.StorageClusterMonitor] (http-/0.0.0.0:7080-513) Storage cluster is down
21:07:37,793 INFO  [org.rhq.enterprise.server.storage.StorageClusterMonitor] (http-/0.0.0.0:7080-585) Storage cluster is down
21:07:37,793 INFO  [org.rhq.enterprise.server.storage.StorageClusterMonitor] (http-/0.0.0.0:7080-459) Storage cluster is down
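
The mismatch between the two outputs above (nodetool lists all three nodes as UN while the server logs "Storage cluster is down") can be checked mechanically. The following is only a minimal sketch in Python, not part of RHQ or Cassandra: the function names `parse_nodetool_status` and `all_nodes_up_and_normal` are hypothetical, and the parser assumes the column layout shown in the `nodetool status` output above. Note that nodetool reports the state as seen by the node it connects to, which is not necessarily what the RHQ server's client connection sees.

```python
import re

# Matches one node row of `nodetool status`: a two-letter status/state code
# (U/D followed by N/L/J/M) and then an IPv4 address.
NODE_ROW = re.compile(r"^([UD][NLJM])\s+(\d{1,3}(?:\.\d{1,3}){3})\s")


def parse_nodetool_status(text):
    """Return a list of (address, is_up, state_letter) tuples, one per node row."""
    nodes = []
    for line in text.splitlines():
        match = NODE_ROW.match(line.strip())
        if match:
            code, address = match.groups()
            nodes.append((address, code[0] == "U", code[1]))
    return nodes


def all_nodes_up_and_normal(text):
    """True only if at least one node row was found and every row reads UN."""
    nodes = parse_nodetool_status(text)
    return bool(nodes) and all(up and state == "N" for _, up, state in nodes)


# Sample input copied from the nodetool output in this report.
SAMPLE = """\
Datacenter: 176
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  17.176.208.117  15.93 GB   256     69.6%             cad149ed-d5e1-4633-8e5a-6d6cb8a3da6b  208
UN  17.176.208.118  56.93 GB   256     64.9%             c421b915-9bc5-46bd-b26f-e88c89f114bf  208
UN  17.176.208.119  53.17 GB   256     65.5%             7367d69c-8fa6-4162-8b18-963c0ae1a229  208
"""

if __name__ == "__main__":
    print(all_nodes_up_and_normal(SAMPLE))
```

If this returns true at the same moment the server logs the cluster as down, the problem is more likely on the client connection side than in the ring membership itself.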


Version-Release number of selected component (if applicable): 4.12


How reproducible: Unclear. It seems to have happened when I got some timeouts at startup. Startup took a long time, so I wonder if there is some sort of conflict.

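The error message looks really suspicious, though: the NoHostAvailableException lists no per-host errors ({}), so it does not say which node failed or why.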

Comment 1 Elias Ross 2014-07-21 19:11:22 UTC
I had trouble running repair. It seems there is an installation issue with Cassandra.

After running repair enough times, things seemed to work okay; the run over the weekend finally went through. I don't know the root cause, though. The Cassandra logs don't show much detail about IO errors, if there were any.

I suspect either a capacity issue or a load issue, but since this also happened with 4.9, I'm guessing it is not an RHQ issue.