Bug 1546066 - AntiEntropySessions resource can be marked as down after restart of JON services in HA env
Summary: AntiEntropySessions resource can be marked as down after restart of JON services in HA env
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Storage Node
Version: JON 3.3.10
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ER01
Target Release: JON 3.3.11
Assignee: Michael Burman
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-02-16 09:06 UTC by Filip Brychta
Modified: 2018-04-05 16:58 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-05 16:58:03 UTC
Type: Bug
Embargoed:


Attachments
all logs (116.63 KB, application/x-gzip)
2018-02-16 09:06 UTC, Filip Brychta


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1084056 0 unspecified CLOSED Storage node has internal server metrics "Anti Entropy Sessions" marked as unavailable 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 743393 0 None None None 2018-04-05 16:58:02 UTC

Internal Links: 1084056

Description Filip Brychta 2018-02-16 09:06:28 UTC
Created attachment 1396902
all logs

Description of problem:
AntiEntropySessions resource can be marked as down after a restart of JON services in an HA environment.

Version-Release number of selected component (if applicable):
JON 3.3.10

How reproducible:
Sometimes

Steps to Reproduce:
1. Prepare the PostgreSQL database on host 1 so that it accepts remote connections (a condensed shell sketch of these steps follows the list).
2. Unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10).
3. Apply the patch: jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
4. Install the server: jon-server-3.3.0.GA/bin/rhqctl install
5. Start it and wait until it is fully up.
6. Install an additional JON server and storage node on another host (host 2):
   a) unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10)
   b) apply the patch: jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
   c) edit jon-server-3.3.0.GA/bin/rhq-server.properties to use the PostgreSQL database on host 1
   d) install the server: jon-server-3.3.0.GA/bin/rhqctl install
   e) start it and wait until it is fully up
7. Restart the services on host 2: jon-server-3.3.0.GA/bin/rhqctl restart
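
A condensed shell sketch of the steps above. The archive names, apply-updates.sh invocation and rhqctl commands are taken from this report; the PostgreSQL settings and the rhq-server.properties property name shown in comments are illustrative assumptions, not verified against JON 3.3.10.

    ## host 1: allow remote connections to PostgreSQL (file locations vary),
    ## e.g. postgresql.conf: listen_addresses = '*'
    ##      pg_hba.conf:     host  rhq  rhqadmin  <host2-ip>/32  md5
    ## then restart PostgreSQL.

    # host 1: unpack, patch and install the first JON server
    unzip jon-server-3.3.0.GA.zip
    unzip jon-plugins-patch-3.3.0.GA.zip        # CP10
    jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
    jon-server-3.3.0.GA/bin/rhqctl install
    # start it and wait until it is fully up

    # host 2: repeat the unpack/patch/install, but first point
    # jon-server-3.3.0.GA/bin/rhq-server.properties at the DB on host 1,
    # e.g. (assumed property name):
    #   rhq.server.database.connection-url=jdbc:postgresql://<host1>:5432/rhq
    jon-server-3.3.0.GA/bin/rhqctl install
    # once both servers are fully up, restart the services on host 2:
    jon-server-3.3.0.GA/bin/rhqctl restart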


Actual results:
The AntiEntropySessions resource on host 2 is marked as DOWN and the following WARN messages are logged to agent.log:
WARN  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Failure to collect measurement data for Resource[id=10221, uuid=4fdd7c18-b65b-487b-adcb-03381ae9473f, type={RHQStorage}ConfigurableInternalServerMetrics, key=org.apache.cassandra.internal:type=AntiEntropySessions, name=Anti Entropy Sessions, parent=RHQ Storage Node(fbr-pre-ha2.bc.jonqe.lab.eng.bos.redhat.com)] - cause: java.lang.IllegalStateException:EMS bean was null for Resource with type [ResourceType[id=0, name=ConfigurableInternalServerMetrics, plugin=RHQStorage, category=Service]] and key [org.apache.cassandra.internal:type=AntiEntropySessions].

Expected results:
All expected resources are UP and there are no warnings in the logs.

Additional info:
The AntiEntropySessions resource becomes UP again after the StorageNodeManager.runClusterMaintenance() operation is invoked.
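
For reference, a minimal sketch of invoking that operation from the RHQ CLI shipped with the server; the login options shown and the StorageNodeManager binding name are assumptions about a default JON 3.3.x install:

    # connect the interactive CLI to the JON server
    jon-server-3.3.0.GA/bin/rhq-cli.sh -u rhqadmin -p <password> -s <server-host> -t 7080
    # at the CLI prompt, trigger cluster maintenance via the remote API:
    #   StorageNodeManager.runClusterMaintenance()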

I saw this issue even after upgrading a JON 3.3.0 HA environment to JON 3.3.10, when the hosts were upgraded one by one without stopping the services on the other host.

All logs from both hosts are attached. Note that the NPE in the server log is reported in bz1544424 and the "Connection refused to host: 127.0.0.1" WARNs in agent.log are reported in bz1545698.

Comment 4 Larry O'Leary 2018-04-05 16:58:03 UTC
As indicated in Bug 1084056, the decision was to resolve this by adding support for the missing availability policy (supportsMissingAvailabilityType). This allows the user to set the Administration > Configuration > Missing Resource Policy for the RHQ Storage Node / ConfigurableInternalServerMetrics resource type to IGNORE or UNINVENTORY, which prevents this transient resource from continuously being reported as DOWN while cluster repair is not executing.

Closing as WONTFIX.

