1545698 – Storage service resource is down when additional storage node is deployed

Summary: Storage service resource is down when additional storage node is deployed

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	JBoss Operations Network
Classification:	JBoss
Component:	Storage Node, Agent
Sub Component:
Version:	JON 3.3.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	ER01
Target Release:	JON 3.3.11
Assignee:	Michael Burman
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-02-15 13:46 UTC by Filip Brychta
Modified:	2018-06-29 13:00 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-06-29 13:00:53 UTC
Type:	Bug
Embargoed:

Attachments
full agent.log (53.00 KB, text/plain) 2018-06-04 12:32 UTC, Filip Brychta

Description Filip Brychta 2018-02-15 13:46:05 UTC

Description of problem:
When an additional JON server with a storage node is added, the Storage Service resource stays down until the plugins on the agents are updated.

Version-Release number of selected component (if applicable):
JON 3.3.10

How reproducible:
Always

Steps to Reproduce:
1. prepare postgres db to be able to accept remote connections on host 1
2. unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10)
3. apply the patch: jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
4. install the server: jon-server-3.3.0.GA/bin/rhqctl install
5. start it and wait until it's fully up
6. install additional jon server and storage node on another host
   a) unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10)
   b) apply the patch: jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
   c) edit jon-server-3.3.0.GA/bin/rhq-server.properties to use postgres on host 1
   d) install the server: jon-server-3.3.0.GA/bin/rhqctl install
   e) start it



Actual results:
The Storage Service resource is down and the following warning is visible in agent.log:
2018-02-15 07:55:23,499 WARN  [ResourceContainer.invoker.availCheck.daemon-75] (org.rhq.plugins.cassandra.StorageServiceComponent)- Native transport is disabled for org.apache.cassandra.db:type=StorageService
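
To check an agent.log for this warning without reading it by hand, a small parser can help. The following is a minimal sketch, not part of the product: the line pattern is inferred only from the log lines quoted in this report, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Pattern inferred from the agent.log lines quoted in this report
# (timestamp, level, [thread], (logger)- message). This is an assumption;
# other agent versions or log4j layouts may format lines differently.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<thread>[^\]]+)\]\s+"
    r"\((?P<logger>[^)]+)\)-\s*(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict for a recognised log line, or None for anything else."""
    match = LOG_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return record


def native_transport_warnings(lines):
    """Collect WARN records that mention the disabled native transport."""
    found = []
    for line in lines:
        record = parse_line(line)
        if (
            record is not None
            and record["level"] == "WARN"
            and "Native transport is disabled" in record["msg"]
        ):
            found.append(record)
    return found
```

Typical use would be `native_transport_warnings(open("agent.log"))`; comparing the timestamp of the last hit with the time of a plugin update or plugin container restart shows whether the warning really stopped afterwards.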

Expected results:
Resources are up and there are no warnings in the logs

Additional info:
The resource comes up and the warning disappears from agent.log when the Update Plugins on Agents operation is invoked (Administration->Agent plugins->Update Plugins on Agents)


All agent plugins on host 1 and host 2 are the same, e.g.:
102f74feb3d24454e6994fd4d1e62331  rhq-agent/plugins/rhq-cassandra-plugin-4.12.0.JON330GA-redhat-1.jar

This plugin remains the same even after Update Plugins on Agents operation is invoked.
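
The before/after comparison of plugin checksums described above can be scripted. This is a minimal sketch under the assumption that the agent plugins are plain .jar files in one directory (such as rhq-agent/plugins); the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plugin_checksums(plugin_dir):
    """Map each .jar file name in plugin_dir to its MD5 digest."""
    return {p.name: md5_of(p) for p in sorted(Path(plugin_dir).glob("*.jar"))}


def diff_checksums(before, after):
    """Return {name: (old, new)} for plugins added, removed, or changed."""
    names = set(before) | set(after)
    return {
        name: (before.get(name), after.get(name))
        for name in names
        if before.get(name) != after.get(name)
    }
```

Taking one snapshot with `plugin_checksums(...)` before invoking Update Plugins on Agents and one after, an empty result from `diff_checksums` would match the observation that the plugin files themselves do not change.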

Comment 1 Filip Brychta 2018-02-16 08:09:41 UTC

Update: it does not always happen.
I tried again and saw different behavior.

Out of 5 tries, the issue was visible 4 times and not visible once.
In that one try, the following warnings (many, for different column families) were written to agent.log during the first startup:
2018-02-16 02:38:40,395 WARN  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Failure to collect measurement data for Resource[id=10171, uuid=8de3a003-29a1-45b7-80e5-5d9ed1b0e708, type={RHQStorage}ColumnFamily, key=peers, name=peers, parent=system] - cause: java.lang.RuntimeException:Unable to load attributes on bean [org.apache.cassandra.db:type=ColumnFamilies,keyspace=system,columnfamily=peers] null -> java.lang.reflect.UndeclaredThrowableException:null -> java.rmi.ConnectException:Connection refused to host: 127.0.0.1; nested exception is:
        java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException:Connection refused (Connection refused)

After that everything was OK. I am not sure whether this should be reported in separate BZs, but it seems there are several possible paths, depending on timing, that result in different issues.

Comment 2 Michael Burman 2018-06-01 12:20:25 UTC

For the first bug, is it corrected only by "update plugins", or also by waiting for the next availability scan (10 minutes or so)? Updating the plugins triggers a new availability scan, which is why I'm wondering.

To me these errors look like the storage node was just very slow to come up, and at least the first error indicates that: the management interface of Cassandra is up, but the server itself is not yet able to receive messages. The plugin reports this as native transport being disabled, even though it does get a connection to Cassandra's management interface.
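
The "management interface up, native transport not ready" state can be roughly approximated from outside with two TCP connection attempts. This is only a sketch and not what the plugin does: the plugin reads the state over JMX, whereas this checks whether the ports accept connections. The function names are hypothetical and the port numbers are placeholders to be taken from the storage node configuration.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(management_up, native_up):
    """Turn the two port states into a short, human-readable verdict."""
    if management_up and native_up:
        return "up"
    if management_up:
        return "starting: management interface up, native transport not ready"
    if native_up:
        return "inconsistent: native transport up, management interface down"
    return "down"


def storage_node_state(host, management_port, native_port, timeout=2.0):
    """Probe both ports (values are caller-supplied assumptions) and classify."""
    return classify(
        port_open(host, management_port, timeout),
        port_open(host, native_port, timeout),
    )
```

If polling `storage_node_state(...)` shows the node reaching "up" while the resource stays down across availability scans, that would suggest the state is stuck on the agent side and not in the storage node.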

Comment 3 Filip Brychta 2018-06-04 12:31:46 UTC

The availability scan did not change the availability of the Storage Service resource. It was still down after several availability scans (both scheduled and manual).
Again, the resource came UP after the "update plugins on all agents" operation.

Attaching full agent.log

Comment 4 Filip Brychta 2018-06-04 12:32:07 UTC

Created attachment 1447413
full agent.log

Comment 5 Filip Brychta 2018-06-29 09:46:30 UTC

The Restart Plugin Container operation also resolves the issue.

Comment 6 Filip Brychta 2018-06-29 13:00:53 UTC

Triage: Larry, Simeon, Filip: not a customer issue, workaround is very simple -> closing as won't fix
