Bug 1545698 - Storage service resource is down when additional storage node is deployed
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Storage Node, Agent
Version: JON 3.3.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ER01
Target Release: JON 3.3.11
Assignee: Michael Burman
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-02-15 13:46 UTC by Filip Brychta
Modified: 2018-06-29 13:00 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-29 13:00:53 UTC
Type: Bug
Embargoed:


Attachments
full agent.log (53.00 KB, text/plain)
2018-06-04 12:32 UTC, Filip Brychta

Description Filip Brychta 2018-02-15 13:46:05 UTC
Description of problem:
When an additional JON server with a storage node is added, the Storage Service resource stays down until the plugins on the agents are updated.

Version-Release number of selected component (if applicable):
JON 3.3.10

How reproducible:
Always

Steps to Reproduce:
1. Prepare the PostgreSQL database on host 1 to accept remote connections
2. Unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10)
3. Apply the patch: jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
4. Install the server: jon-server-3.3.0.GA/bin/rhqctl install
5. Start it and wait until it is fully up
6. Install an additional JON server and storage node on another host (host 2)
   a) Unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10)
   b) Apply the patch: jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
   c) Edit jon-server-3.3.0.GA/bin/rhq-server.properties to use the PostgreSQL database on host 1
   d) Install the server: jon-server-3.3.0.GA/bin/rhqctl install
   e) Start it



Actual results:
The Storage Service resource is down and the following warning is visible in agent.log:
2018-02-15 07:55:23,499 WARN  [ResourceContainer.invoker.availCheck.daemon-75] (org.rhq.plugins.cassandra.StorageServiceComponent)- Native transport is disabled for org.apache.cassandra.db:type=StorageService

Expected results:
Resources are up and there are no warnings in the logs

Additional info:
The resource comes up and the warning disappears from agent.log when the Update Plugins on Agents operation is invoked (Administration->Agent plugins->Update Plugins on Agents)
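
To check whether the warning has really stopped after a workaround, the agent.log can be scanned for its last occurrence. The following is only a sketch in Python, not part of JON: the function names are hypothetical, and the timestamp layout is assumed from the log lines quoted in this report.

```python
import re
from datetime import datetime

# Timestamp and level layout assumed from the agent.log lines quoted in this
# report, e.g. "2018-02-15 07:55:23,499 WARN  [thread] (logger)- message".
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) (\w+)\s")
WARNING_TEXT = "Native transport is disabled"


def native_transport_warnings(lines):
    """Return the timestamps of every 'Native transport is disabled' WARN line."""
    stamps = []
    for line in lines:
        match = LINE_RE.match(line)
        # Continuation lines (stack traces) have no timestamp and are skipped.
        if match and match.group(3) == "WARN" and WARNING_TEXT in line:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            stamps.append(stamp.replace(microsecond=int(match.group(2)) * 1000))
    return stamps


def last_warning(lines):
    """Return the timestamp of the most recent warning, or None if there is none."""
    stamps = native_transport_warnings(lines)
    return max(stamps) if stamps else None
```

Calling `last_warning(open("agent.log"))` before and after invoking the operation shows whether any new warning was logged in between; the path to agent.log depends on the agent installation.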


All agent plugins on host 1 and host 2 are the same, e.g.:
102f74feb3d24454e6994fd4d1e62331  rhq-agent/plugins/rhq-cassandra-plugin-4.12.0.JON330GA-redhat-1.jar

This plugin remains the same even after the Update Plugins on Agents operation is invoked.
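
The checksum comparison above can be repeated for all plugin jars at once. This is a minimal Python sketch under the assumption that both agents' plugin directories (for example rhq-agent/plugins copied from host 1 and host 2) are reachable from one machine; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plugin_checksums(plugin_dir):
    """Map each *.jar file name in plugin_dir to its MD5 digest."""
    return {jar.name: md5_of(jar) for jar in sorted(Path(plugin_dir).glob("*.jar"))}


def differing_plugins(dir_a, dir_b):
    """Return jar names that are missing on one side or differ in checksum."""
    sums_a = plugin_checksums(dir_a)
    sums_b = plugin_checksums(dir_b)
    names = sums_a.keys() | sums_b.keys()
    return sorted(name for name in names if sums_a.get(name) != sums_b.get(name))
```

An empty list from `differing_plugins` means the two directories hold identical jars, which matches what is described here.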

Comment 1 Filip Brychta 2018-02-16 08:09:41 UTC
Update: it does not always happen.
I tried again and saw different behavior.

In 5 tries the issue appeared 4 times and was absent once.
In that one try, the following warnings (many, for different column families) were written to agent.log during the first startup:
2018-02-16 02:38:40,395 WARN  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Failure to collect measurement data for Resource[id=10171, uuid=8de3a003-29a1-45b7-80e5-5d9ed1b0e708, type={RHQStorage}ColumnFamily, key=peers, name=peers, parent=system] - cause: java.lang.RuntimeException:Unable to load attributes on bean [org.apache.cassandra.db:type=ColumnFamilies,keyspace=system,columnfamily=peers] null -> java.lang.reflect.UndeclaredThrowableException:null -> java.rmi.ConnectException:Connection refused to host: 127.0.0.1; nested exception is:
        java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException:Connection refused (Connection refused)

After that everything was OK. I am not sure whether this should be reported in separate BZs, but it seems that, depending on timing, there are several possible paths resulting in different issues.

Comment 2 Michael Burman 2018-06-01 12:20:25 UTC
For the first bug, is it corrected only by "update plugins", or also by waiting for the next availability scan (10 mins or so)? The update plugins operation causes a new availability scan, that's why I'm wondering.

To me these errors look like the storage node was just very slow in starting up, and at least the first error is an indication of that: the management interface of Cassandra is up, but the server itself is not yet capable of receiving messages. This is reported as native transport being disabled, even though a connection to Cassandra's management interface itself does succeed.
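
The state described here (management interface reachable, native transport not yet serving) can be roughly approximated from outside with two TCP probes. This is only a sketch in Python: the function names and the returned labels are hypothetical, the port numbers must be taken from the storage node's own configuration, and a successful TCP connect shows only that something is listening, not that Cassandra is ready to process requests.

```python
import socket


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def classify_storage_node(host, jmx_port, native_port, timeout=2.0):
    """Classify a node by which of its two ports accept connections.

    Returns "ready" when both ports accept connections, "management-only"
    when only the JMX port does, and "unreachable" when the JMX port does not.
    """
    jmx_up = port_accepts_connections(host, jmx_port, timeout)
    native_up = port_accepts_connections(host, native_port, timeout)
    if jmx_up and native_up:
        return "ready"
    if jmx_up:
        return "management-only"
    return "unreachable"
```

A node that stays in "management-only" across several availability scans would point away from a slow startup.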

Comment 3 Filip Brychta 2018-06-04 12:31:46 UTC
Availability scans did not change the availability of the Storage Service resource. It was still down after several availability scans (both scheduled and manual).
Again, the resource became UP after the "update plugins on all agents" operation.

Attaching full agent.log

Comment 4 Filip Brychta 2018-06-04 12:32:07 UTC
Created attachment 1447413
full agent.log

Comment 5 Filip Brychta 2018-06-29 09:46:30 UTC
The Restart Plugin Container operation resolves the issue too.

Comment 6 Filip Brychta 2018-06-29 13:00:53 UTC
Triage: Larry, Simeon, Filip: not a customer issue, workaround is very simple -> closing as won't fix

