Created attachment 1396902
all logs

Description of problem:
$Summary

Version-Release number of selected component (if applicable):
JON 3.3.10

How reproducible:
Sometimes

Steps to Reproduce:
1. prepare the postgres db on host 1 so that it accepts remote connections
2. unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10)
3. apply the patch
   jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
4. install the server:
   jon-server-3.3.0.GA/bin/rhqctl install
5. start it and wait until it's fully up
6. install an additional jon server and storage node on another host (host 2)
   a) unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10)
   b) apply the patch
      jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
   c) edit jon-server-3.3.0.GA/bin/rhq-server.properties to use postgres on host 1
   d) install the server:
      jon-server-3.3.0.GA/bin/rhqctl install
   e) start it and wait until it's fully up
7. restart services on host 2
   jon-server-3.3.0.GA/bin/rhqctl restart

Actual results:
The AntiEntropySessions resource on host 2 is marked as down and the following warnings are written to agent.log:

WARN [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Failure to collect measurement data for Resource[id=10221, uuid=4fdd7c18-b65b-487b-adcb-03381ae9473f, type={RHQStorage}ConfigurableInternalServerMetrics, key=org.apache.cassandra.internal:type=AntiEntropySessions, name=Anti Entropy Sessions, parent=RHQ Storage Node(fbr-pre-ha2.bc.jonqe.lab.eng.bos.redhat.com)] - cause: java.lang.IllegalStateException:EMS bean was null for Resource with type [ResourceType[id=0, name=ConfigurableInternalServerMetrics, plugin=RHQStorage, category=Service]] and key [org.apache.cassandra.internal:type=AntiEntropySessions].

Expected results:
All expected resources are UP and there are no warnings in the logs.

Additional info:
The AntiEntropySessions resource becomes UP again after the StorageNodeManager.runClusterMaintenance() operation is invoked.
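
To check whether an agent.log contains the warning quoted under "Actual results", a small script along these lines could be used. This is only a sketch: the regular expression is derived from the single sample line in this report, and the function name find_ems_bean_failures is made up for illustration, it is not part of RHQ or JON.

```python
import re

# Assumption: the line layout is inferred from the one WARN line quoted in
# this report; other agent versions may format the message differently.
FAILURE_RE = re.compile(
    r"Failure to collect measurement data for Resource\[id=(?P<id>\d+),.*?"
    r"key=(?P<key>[^,\]]+),.*?"
    r"cause: (?P<cause>.*)$"
)


def find_ems_bean_failures(lines):
    """Return (resource id, resource key) pairs for 'EMS bean was null' warnings.

    `lines` is any iterable of log lines, for example an open agent.log file.
    """
    failures = []
    for line in lines:
        match = FAILURE_RE.search(line.rstrip("\n"))
        if match and "EMS bean was null" in match.group("cause"):
            failures.append((int(match.group("id")), match.group("key")))
    return failures
```

Passing an open agent.log file object to find_ems_bean_failures yields one pair per matching warning, which makes it easy to see whether only the AntiEntropySessions key is affected after the restart in step 7.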
I also saw this issue after upgrading a JON 3.3.0 HA environment to JON 3.3.10, upgrading the hosts one by one without stopping services on the other host. All logs from both hosts are attached. Note that the NPE in the server log is reported in bz1544424, and the "Connection refused to host: 127.0.0.1" WARNs in agent.log are reported in bz1545698.
As indicated in Bug 1084056, the decision to resolve this was to add support for the missing availability policy (supportsMissingAvailabilityType). This allows the user to set the Administration > Configuration > Missing Resource Policy for the RHQ Storage Node / ConfigurableInternalServerMetrics resource type to IGNORE or UNINVENTORY to prevent this transient resource from continuously being reported as DOWN while cluster repair is not executing. Closing as WONTFIX.