Bug 1546066

Summary: AntiEntropySessions resource can be marked as down after restart of JON services in HA env
Product: [JBoss] JBoss Operations Network
Reporter: Filip Brychta <fbrychta>
Component: Storage Node
Assignee: Michael Burman <miburman>
Status: CLOSED WONTFIX
QA Contact: Mike Foley <mfoley>
Severity: medium
Priority: low
Version: JON 3.3.10
CC: loleary, miburman
Target Milestone: ER01
Keywords: Triaged
Target Release: JON 3.3.11
Hardware: Unspecified
OS: Unspecified
Last Closed: 2018-04-05 16:58:03 UTC
Type: Bug
Attachments: all logs

Description Filip Brychta 2018-02-16 09:06:28 UTC
Created attachment 1396902: all logs

Description of problem:
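The AntiEntropySessions resource can be marked as DOWN after the JON services are restarted in an HA environment.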

Version-Release number of selected component (if applicable):
JON 3.3.10

How reproducible:
Sometimes

Steps to Reproduce:
1. On host 1, prepare a PostgreSQL database that accepts remote connections.
2. Unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10).
3. Apply the patch: jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
4. Install the server: jon-server-3.3.0.GA/bin/rhqctl install
5. Start it and wait until it is fully up.
6. Install an additional JON server and storage node on a second host (host 2):
   a) Unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10).
   b) Apply the patch: jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
   c) Edit jon-server-3.3.0.GA/bin/rhq-server.properties to use the PostgreSQL database on host 1.
   d) Install the server: jon-server-3.3.0.GA/bin/rhqctl install
   e) Start it and wait until it is fully up.
7. Restart the services on host 2: jon-server-3.3.0.GA/bin/rhqctl restart
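
For convenience, the host 2 part of the steps above (6a to 6e, then 7) can be sketched as a small script. This is only a rough sketch, not a verified reproducer: it runs in dry-run mode by default and just prints the commands. The DB_HOST value, the database port 5432, the database name "rhq", and the two rhq.server.database.* property names edited in step 6c are assumptions and should be checked against the actual rhq-server.properties file.

```shell
#!/bin/sh
# Sketch of steps 6a-6e and 7 on host 2.
# Dry-run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"
JON_HOME="jon-server-3.3.0.GA"
DB_HOST="${DB_HOST:-host1.example.com}"   # assumption: address of host 1
PROPS="$JON_HOME/bin/rhq-server.properties"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce_on_host2() {
    # 6a, 6b: unpack the server and the CP10 patch, then apply the patch
    run unzip jon-server-3.3.0.GA.zip
    run unzip jon-plugins-patch-3.3.0.GA.zip
    run jon-server-3.3.0.GA-update-10/apply-updates.sh "$JON_HOME"
    # 6c: point the server at the database on host 1
    # (assumed property names, port and database name)
    run sed -i "s|^rhq.server.database.server-name=.*|rhq.server.database.server-name=$DB_HOST|" "$PROPS"
    run sed -i "s|^rhq.server.database.connection-url=.*|rhq.server.database.connection-url=jdbc:postgresql://$DB_HOST:5432/rhq|" "$PROPS"
    # 6d, 6e: install and start; wait until everything is fully up before restarting
    run "$JON_HOME/bin/rhqctl" install
    run "$JON_HOME/bin/rhqctl" start
    # 7: the restart after which the resource is sometimes reported as down
    run "$JON_HOME/bin/rhqctl" restart
}

reproduce_on_host2
```

Since the issue only reproduces sometimes, the restart in step 7 may need to be repeated several times.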


Actual results:
The AntiEntropySessions resource on host 2 is marked as down and the following warnings are logged to agent.log:
WARN  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Failure to collect measurement data for Resource[id=10221, uuid=4fdd7c18-b65b-487b-adcb-03381ae9473f, type={RHQStorage}ConfigurableInternalServerMetrics, key=org.apache.cassandra.internal:type=AntiEntropySessions, name=Anti Entropy Sessions, parent=RHQ Storage Node(fbr-pre-ha2.bc.jonqe.lab.eng.bos.redhat.com)] - cause: java.lang.IllegalStateException:EMS bean was null for Resource with type [ResourceType[id=0, name=ConfigurableInternalServerMetrics, plugin=RHQStorage, category=Service]] and key [org.apache.cassandra.internal:type=AntiEntropySessions].
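
To check quickly whether a given agent.log contains this warning, something like the following can be used. It is a minimal sketch: the function name is made up for illustration, and the log path passed to it depends on where the agent is installed.

```shell
# Count "EMS bean was null" warnings that refer to the AntiEntropySessions key.
# $1: path to agent.log (location depends on the agent installation).
count_antientropy_warnings() {
    grep -F 'EMS bean was null' "$1" | grep -cF 'type=AntiEntropySessions'
}
```

Usage: count_antientropy_warnings /path/to/agent.log prints the number of matching lines (0 if there are none).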

Expected results:
All expected resources are UP and there are no warnings in the logs.

Additional info:
The AntiEntropySessions resource becomes UP again after the StorageNodeManager.runClusterMaintenance() operation is invoked.

I also saw this issue after upgrading a JON 3.3.0 HA environment to JON 3.3.10, when upgrading the hosts one by one without stopping the services on the other host.

All logs from both hosts are attached. Note that the NPE in the server log is reported in bz1544424, and the "Connection refused to host: 127.0.0.1" warnings in agent.log are reported in bz1545698.

Comment 4 Larry O'Leary 2018-04-05 16:58:03 UTC
As indicated in Bug 1084056, the decision was to resolve this by adding support for the missing availability policy (supportsMissingAvailabilityType). This allows the user to set Administration > Configuration > Missing Resource Policy for the RHQ Storage Node / ConfigurableInternalServerMetrics resource type to IGNORE or UNINVENTORY, which prevents this transient resource from being continuously reported as DOWN while cluster repair is not executing.

Closing as WONTFIX.