Bug 1545698

Summary:

Storage service resource is down when additional storage node is deployed

Product:

[JBoss] JBoss Operations Network

Reporter:

Filip Brychta <fbrychta>

Component:

Storage Node, Agent

Assignee:

Michael Burman <miburman>

Status:

CLOSED WONTFIX

QA Contact:

Mike Foley <mfoley>

Severity:

medium

Priority:

high

Version:

JON 3.3.10

Keywords:

Triaged

Target Milestone:

ER01

Target Release:

JON 3.3.11

Hardware:

Unspecified

OS:

Unspecified

Last Closed:

2018-06-29 13:00:53 UTC

Type:

Bug

Attachments:

Description	Flags
full agent.log	none

Description Filip Brychta 2018-02-15 13:46:05 UTC

Description of problem:
When an additional JON server with a storage node is added, the Storage service resource stays down until the plugins on the agents are updated.

Version-Release number of selected component (if applicable):
JON 3.3.10

How reproducible:
Always

Steps to Reproduce:
1. Prepare the PostgreSQL database on host 1 to accept remote connections
2. Unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10)
3. Apply the patch: jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
4. Install the server: jon-server-3.3.0.GA/bin/rhqctl install
5. Start it and wait until it is fully up
6. Install an additional JON server and storage node on another host (host 2)
   a) unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10)
   b) apply the patch: jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
   c) edit jon-server-3.3.0.GA/bin/rhq-server.properties to use PostgreSQL on host 1
   d) install the server: jon-server-3.3.0.GA/bin/rhqctl install
   e) start it



Actual results:
The Storage service resource is down and the following warning is visible in agent.log:
2018-02-15 07:55:23,499 WARN  [ResourceContainer.invoker.availCheck.daemon-75] (org.rhq.plugins.cassandra.StorageServiceComponent)- Native transport is disabled for org.apache.cassandra.db:type=StorageService

Expected results:
Resources are up and there are no warnings in the logs
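
To check the "no warnings in the logs" expectation without reading agent.log by hand, the WARN lines can be counted per logger class. The following is only a minimal sketch: the regular expression is inferred from the two log lines quoted in this report, and the function name count_warnings is hypothetical, not part of JON or RHQ.

```python
import re
from collections import Counter

# Layout inferred from the agent.log lines quoted in this report:
#   <timestamp> <LEVEL>  [<thread>] (<logger>)- <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\] "
    r"\((?P<logger>[^)]+)\)- (?P<message>.*)$"
)


def count_warnings(lines):
    """Return a Counter mapping logger class -> number of WARN lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        # Continuation lines (stack traces etc.) do not match and are skipped.
        if match and match.group("level") == "WARN":
            counts[match.group("logger")] += 1
    return counts
```

For example, count_warnings(open("agent.log", encoding="utf-8", errors="replace")) returns the per-logger counts; an entry for org.rhq.plugins.cassandra.StorageServiceComponent would correspond to the warning shown under "Actual results".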

Additional info:
The resource comes up and the warning disappears from agent.log once the Update Plugins on Agents operation is invoked (Administration->Agent plugins->Update Plugins on Agents)


All agent plugins on host 1 and host 2 are the same, e.g.:
102f74feb3d24454e6994fd4d1e62331  rhq-agent/plugins/rhq-cassandra-plugin-4.12.0.JON330GA-redhat-1.jar

This plugin remains the same even after the Update Plugins on Agents operation is invoked.
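
The checksum comparison above can be repeated for every plugin jar on both hosts. The following is a minimal sketch that produces the same kind of MD5 digests as the line quoted above; the helper names md5_of_file and plugin_checksums are hypothetical, and the plugin directory (rhq-agent/plugins in this report) is passed in by the caller.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plugin_checksums(plugin_dir):
    """Map each *.jar file name in plugin_dir to its MD5 digest."""
    return {
        jar.name: md5_of_file(jar)
        for jar in sorted(Path(plugin_dir).glob("*.jar"))
    }
```

Running plugin_checksums("rhq-agent/plugins") on host 1 and host 2, before and after Update Plugins on Agents, and comparing the resulting dictionaries shows whether any jar actually changed.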

Comment 1 Filip Brychta 2018-02-16 08:09:41 UTC

Update: it does not always happen.
I tried again and saw different behavior.

In 4 of 5 attempts the issue was visible; in 1 attempt it was not.
In that one attempt, the following warnings (many, for different column families) were written to agent.log during the first startup:
2018-02-16 02:38:40,395 WARN  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Failure to collect measurement data for Resource[id=10171, uuid=8de3a003-29a1-45b7-80e5-5d9ed1b0e708, type={RHQStorage}ColumnFamily, key=peers, name=peers, parent=system] - cause: java.lang.RuntimeException:Unable to load attributes on bean [org.apache.cassandra.db:type=ColumnFamilies,keyspace=system,columnfamily=peers] null -> java.lang.reflect.UndeclaredThrowableException:null -> java.rmi.ConnectException:Connection refused to host: 127.0.0.1; nested exception is:
        java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException:Connection refused (Connection refused)

After that everything was OK. I am not sure whether this should be reported in separate BZs, but it seems that, depending on timing, there are several possible paths that result in different issues.

Comment 2 Michael Burman 2018-06-01 12:20:25 UTC

For the first bug, is it corrected only by "update plugins", or also by waiting for the next availability scan (10 mins or so)? Update plugins triggers a new availability scan, which is why I'm asking.

To me these errors look like the storage node was just very slow to start, and at least the first error is an indication of that: the management interface of Cassandra is up, but the server itself is not yet able to receive messages. The plugin reports this as native transport being disabled, but it does get a connection to Cassandra's management interface itself.
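
The "management interface up, native transport not yet available" hypothesis can be probed from outside the agent with a plain TCP check of both ports. This is only a sketch under stated assumptions: it tests TCP reachability, which is not necessarily what the plugin's availability check does, the function names are hypothetical, and the JMX and native transport port numbers are not given in this report, so they must be taken from the storage node's own configuration.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


def classify(jmx_open, native_open):
    """Translate the two port states into a rough startup phase."""
    if jmx_open and native_open:
        return "up"
    if jmx_open:
        # Matches the hypothesis above: JMX answers, native transport not yet.
        return "starting"
    return "down"


def storage_node_phase(host, jmx_port, native_port):
    """Probe both ports; the port numbers are supplied by the caller."""
    return classify(port_is_open(host, jmx_port),
                    port_is_open(host, native_port))
```

If storage_node_phase keeps returning "up" while the Storage service resource is still reported as down, slow startup alone would not explain the behavior.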

Comment 3 Filip Brychta 2018-06-04 12:31:46 UTC

Availability scans did not change the availability of the Storage service resource. It was still down after several availability scans (both scheduled and manual).
Again, the resource came UP after the "update plugins on all agents" operation.

Attaching full agent.log

Comment 4 Filip Brychta 2018-06-04 12:32:07 UTC

Created attachment 1447413
full agent.log

Comment 5 Filip Brychta 2018-06-29 09:46:30 UTC

The Restart Plugin Container operation resolves the issue too.

Comment 6 Filip Brychta 2018-06-29 13:00:53 UTC

Triage (Larry, Simeon, Filip): not a customer issue and the workaround is very simple -> closing as WONTFIX