Description of problem:
When adding an additional JON server with a storage node, the Storage Service resource is down until the plugins on the agents are updated.

Version-Release number of selected component (if applicable):
JON 3.3.10

How reproducible:
Always

Steps to Reproduce:
1. Prepare the postgres db on host 1 so that it accepts remote connections
2. Unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10)
3. Apply the patch: jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
4. Install the server: jon-server-3.3.0.GA/bin/rhqctl install
5. Start it and wait until it's fully up
6. Install an additional JON server and storage node on another host:
   a) unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10)
   b) apply the patch: jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
   c) edit jon-server-3.3.0.GA/bin/rhq-server.properties to use postgres on host 1
   d) install the server: jon-server-3.3.0.GA/bin/rhqctl install
   e) start it

Actual results:
The Storage Service resource is down and the following warning is visible in agent.log:
2018-02-15 07:55:23,499 WARN [ResourceContainer.invoker.availCheck.daemon-75] (org.rhq.plugins.cassandra.StorageServiceComponent)- Native transport is disabled for org.apache.cassandra.db:type=StorageService

Expected results:
Resources are up and there are no warnings in the logs

Additional info:
The resource becomes UP and the warning disappears from agent.log when the Update Plugins on Agents operation is invoked (Administration->Agent plugins->Update Plugins on Agents).
All agent plugins on host 1 and host 2 are the same, e.g.:
102f74feb3d24454e6994fd4d1e62331 rhq-agent/plugins/rhq-cassandra-plugin-4.12.0.JON330GA-redhat-1.jar
This plugin remains the same even after the Update Plugins on Agents operation is invoked.
Update: it does not always happen. I tried again and saw different behavior: in 5 tries the issue was visible 4 times and absent once. In the one try where it was absent, the following warnings (many, for different column families) were written to agent.log during the first startup:

2018-02-16 02:38:40,395 WARN [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Failure to collect measurement data for Resource[id=10171, uuid=8de3a003-29a1-45b7-80e5-5d9ed1b0e708, type={RHQStorage}ColumnFamily, key=peers, name=peers, parent=system] - cause: java.lang.RuntimeException:Unable to load attributes on bean [org.apache.cassandra.db:type=ColumnFamilies,keyspace=system,columnfamily=peers] null -> java.lang.reflect.UndeclaredThrowableException:null -> java.rmi.ConnectException:Connection refused to host: 127.0.0.1; nested exception is: java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException:Connection refused (Connection refused)

After that everything was OK. I'm not sure whether this should be reported in separate BZs, but it seems that, depending on timing, there are several possible paths that result in different issues.
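Since the behavior differs between runs, it may help to compare the agent.log files from different tries by counting WARN entries per logger. The following is a minimal sketch, assuming every entry of interest uses the same line layout as the two warnings quoted in this report; the function name `count_warnings` and the regular expression are illustrative and not part of RHQ/JON.

```python
import re
from collections import Counter

# Layout assumed from the lines quoted in this report:
# "<date> <time>,<ms> LEVEL [thread] (logger)- message"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<thread>[^\]]+)\]\s+"
    r"\((?P<logger>[^)]+)\)-\s*"
    r"(?P<message>.*)$"
)


def count_warnings(lines):
    """Count WARN entries per logger name.

    Lines that do not match the assumed layout (for example stack-trace
    continuation lines) are skipped rather than treated as errors.
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "WARN":
            counts[match.group("logger")] += 1
    return counts
```

Passing an open file object for agent.log to `count_warnings` and printing `most_common()` on the result gives a per-logger summary that can be compared between a run where the resource stays down and a run where it comes up.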
For the first bug, is it corrected only by "update plugins", or also by waiting for the next availability scan (10 minutes or so)? Update plugins triggers a new availability scan, which is why I'm asking. To me these errors look like the storage node was just very slow to start, and at least the first error indicates that: the management interface of Cassandra is up, but the server itself is not yet able to receive messages. The plugin reports this as native transport being disabled, but it does get a connection to Cassandra's management interface.
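One rough way to check the "slow startup" hypothesis from outside the agent is to probe the storage node's management and native transport ports separately. This is only a sketch: the port numbers must be taken from the storage node's own configuration (7299 for JMX and 9142 for CQL in the usage note are assumptions, not values stated in this report), and an open TCP port is weaker evidence than the MBean attribute the plugin reads. The helper names `port_is_open` and `classify` are hypothetical.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


def classify(jmx_open, native_open):
    """Map the two probe results to a coarse startup state."""
    if jmx_open and native_open:
        return "up"
    if jmx_open:
        # Management interface reachable, native transport not yet listening.
        return "starting"
    if native_open:
        # Unexpected: native transport reachable without the management port.
        return "inconsistent"
    return "down"
```

For example, `classify(port_is_open("127.0.0.1", 7299), port_is_open("127.0.0.1", 9142))` returning "starting" for a long time after the server start would be consistent with the explanation above, while "up" combined with a resource that is still reported as down would point at the agent side instead.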
The availability scan did not change the availability of the Storage Service resource. It was still down after several availability scans (both scheduled and manual). Again, the resource became UP after the "update plugins on all agents" operation. Attaching the full agent.log.
Created attachment 1447413
full agent.log
The Restart Plugin Container operation resolves the issue too.
Triage: Larry, Simeon, Filip: not a customer issue, workaround is very simple -> closing as won't fix