Bug 1545698
Summary: | Storage service resource is down when additional storage node is deployed | |
---|---|---|---
Product: | [JBoss] JBoss Operations Network | Reporter: | Filip Brychta <fbrychta>
Component: | Storage Node, Agent | Assignee: | Michael Burman <miburman>
Status: | CLOSED WONTFIX | QA Contact: | Mike Foley <mfoley>
Severity: | medium | Priority: | high
Version: | JON 3.3.10 | Keywords: | Triaged
Target Milestone: | ER01 | Target Release: | JON 3.3.11
Hardware: | Unspecified | OS: | Unspecified
Last Closed: | 2018-06-29 13:00:53 UTC | Type: | Bug
Description
Filip Brychta
2018-02-15 13:46:05 UTC
Update: it does not always happen. I tried again and saw different behavior: in 5 tries the issue was visible 4 times, and in 1 try it was not. In that one try, the following warnings (many, for different column families) were written to agent.log during the first startup:

2018-02-16 02:38:40,395 WARN [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Failure to collect measurement data for Resource[id=10171, uuid=8de3a003-29a1-45b7-80e5-5d9ed1b0e708, type={RHQStorage}ColumnFamily, key=peers, name=peers, parent=system] - cause: java.lang.RuntimeException:Unable to load attributes on bean [org.apache.cassandra.db:type=ColumnFamilies,keyspace=system,columnfamily=peers] null -> java.lang.reflect.UndeclaredThrowableException:null -> java.rmi.ConnectException:Connection refused to host: 127.0.0.1; nested exception is: java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException:Connection refused (Connection refused)

After that everything was OK. I'm not sure whether this should be reported in separate BZs, but it seems there are several possible paths, depending on timing, that result in different issues.

For the first bug, is it corrected only by "update plugins", or also by waiting for the next availability scan (10 minutes or so)? "Update plugins" triggers a new availability scan, which is why I'm asking. To me these errors look like the storage node was just very slow to come up, and at least the first error indicates that: Cassandra's management interface is up, but the server itself is not yet able to receive messages. It reports this as the native transport not being available, but it does get a connection to Cassandra's management interface.

The availability scan did not change the availability of the Storage service resource. It was still down after several availability scans (both scheduled and manual). Again, the resource became UP after the "update plugins on all agents" operation.
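The distinction discussed above (management interface reachable, native transport not yet accepting connections) can be checked roughly from the storage node host with a plain TCP probe. This is only a sketch, not what the RHQ agent does: the helper names `port_open` and `classify` are hypothetical, and the ports 7299 (JMX) and 9142 (CQL native transport) are assumed defaults that should be replaced with the values from the actual storage node configuration.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


def classify(jmx_up, cql_up):
    """Map the two probe results to a rough startup state."""
    if jmx_up and cql_up:
        return "up"
    if jmx_up:
        return "starting: JMX reachable, native transport not accepting connections"
    if cql_up:
        return "unexpected: native transport reachable, JMX not"
    return "down"


# Example usage (ports are assumptions, adjust to your installation):
#   state = classify(port_open("127.0.0.1", 7299), port_open("127.0.0.1", 9142))
#   print(state)
```

A "down" result for both ports would be consistent with the `Connection refused to host: 127.0.0.1` warnings, while the "starting" result corresponds to the slow-startup case described in the comment. A successful TCP connect only shows that something is listening; it does not prove the service behind the port is fully functional.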
Attaching full agent.log

Created attachment 1447413: full agent.log
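To skim a full agent.log for the measurement collection warnings quoted earlier, something like the following could be used. It is a minimal sketch that assumes each warning sits on a single line in the format shown in this report; the function name `summarize_failures` and the regular expression are illustrative, not part of RHQ.

```python
import re
from collections import Counter

# Matches the WARN line format quoted in this report (assumed to be one line per warning).
FAILURE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) WARN .*?"
    r"Failure to collect measurement data for Resource\["
    r"id=(?P<id>\d+), uuid=(?P<uuid>[0-9a-f-]+), type=(?P<type>[^,]+), "
    r"key=(?P<key>[^,]+), name=(?P<name>[^,]+), parent=(?P<parent>[^\]]+)\]"
)


def summarize_failures(lines):
    """Count collection failures per (parent, key), e.g. (keyspace, column family)."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[(match.group("parent"), match.group("key"))] += 1
    return counts


# Example usage (the file path is an assumption):
#   with open("agent.log", encoding="utf-8", errors="replace") as fh:
#       for (parent, key), n in summarize_failures(fh).most_common():
#           print(parent, key, n)
```

Grouping by parent and key makes it easy to see whether the failures are spread across many column families, as described in the update, or concentrated on a few.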
The Restart Plugin Container operation also resolves the issue.

Triage: Larry, Simeon, Filip: not a customer issue, workaround is very simple -> closing as won't fix