Bug 1546066 - AntiEntropySessions resource can be marked as down after restart of JON services in HA env
Summary: AntiEntropySessions resource can be marked as down after restart of JON services in HA env
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Storage Node
Version: JON 3.3.10
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ER01
Target Release: JON 3.3.11
Assignee: Michael Burman
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-02-16 09:06 UTC by Filip Brychta
Modified: 2018-04-05 16:58 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-05 16:58:03 UTC
Type: Bug
Embargoed:


Attachments
all logs (116.63 KB, application/x-gzip)
2018-02-16 09:06 UTC, Filip Brychta


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1084056 0 unspecified CLOSED Storage node has internal server metrics "Anti Entropy Sessions" marked as unavailable 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 743393 0 None None None 2018-04-05 16:58:02 UTC

Internal Links: 1084056

Description Filip Brychta 2018-02-16 09:06:28 UTC
Created attachment 1396902
all logs

Description of problem:
AntiEntropySessions resource can be marked as down after a restart of JON services in an HA environment.

Version-Release number of selected component (if applicable):
JON 3.3.10

How reproducible:
Sometimes

Steps to Reproduce:
1. Prepare the PostgreSQL database on host 1 so that it accepts remote connections (a condensed shell sketch of these steps follows the list).
2. Unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10).
3. Apply the patch: jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
4. Install the server: jon-server-3.3.0.GA/bin/rhqctl install
5. Start it and wait until it is fully up.
6. Install an additional JON server and storage node on another host (host 2):
   a) unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10)
   b) apply the patch: jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
   c) edit jon-server-3.3.0.GA/bin/rhq-server.properties to use the PostgreSQL database on host 1
   d) install the server: jon-server-3.3.0.GA/bin/rhqctl install
   e) start it and wait until it is fully up
7. Restart the services on host 2: jon-server-3.3.0.GA/bin/rhqctl restart
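
A condensed shell sketch of the steps above. The archive names, apply-updates.sh invocation and rhqctl commands are taken from this report; the PostgreSQL settings and the rhq-server.properties property name shown in comments are illustrative assumptions, not verified against JON 3.3.10.

    ## host 1: allow remote connections to PostgreSQL (file locations vary),
    ## e.g. postgresql.conf: listen_addresses = '*'
    ##      pg_hba.conf:     host  rhq  rhqadmin  <host2-ip>/32  md5
    ## then restart PostgreSQL.

    # host 1: unpack, patch and install the first JON server
    unzip jon-server-3.3.0.GA.zip
    unzip jon-plugins-patch-3.3.0.GA.zip        # CP10
    jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
    jon-server-3.3.0.GA/bin/rhqctl install
    # start it and wait until it is fully up

    # host 2: repeat the unpack/patch/install, but first point
    # jon-server-3.3.0.GA/bin/rhq-server.properties at the DB on host 1,
    # e.g. (assumed property name):
    #   rhq.server.database.connection-url=jdbc:postgresql://<host1>:5432/rhq
    jon-server-3.3.0.GA/bin/rhqctl install
    # once both servers are fully up, restart the services on host 2:
    jon-server-3.3.0.GA/bin/rhqctl restart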


Actual results:
The AntiEntropySessions resource on host 2 is marked as DOWN and the following WARN messages are logged to agent.log:
WARN  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Failure to collect measurement data for Resource[id=10221, uuid=4fdd7c18-b65b-487b-adcb-03381ae9473f, type={RHQStorage}ConfigurableInternalServerMetrics, key=org.apache.cassandra.internal:type=AntiEntropySessions, name=Anti Entropy Sessions, parent=RHQ Storage Node(fbr-pre-ha2.bc.jonqe.lab.eng.bos.redhat.com)] - cause: java.lang.IllegalStateException:EMS bean was null for Resource with type [ResourceType[id=0, name=ConfigurableInternalServerMetrics, plugin=RHQStorage, category=Service]] and key [org.apache.cassandra.internal:type=AntiEntropySessions].

Expected results:
All expected resources are UP and there are no warnings in the logs.

Additional info:
The AntiEntropySessions resource becomes UP again after the StorageNodeManager.runClusterMaintenance() operation is invoked.
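
For reference, a minimal sketch of invoking that operation from the RHQ CLI shipped with the server; the login options shown and the StorageNodeManager binding name are assumptions about a default JON 3.3.x install:

    # connect the interactive CLI to the JON server
    jon-server-3.3.0.GA/bin/rhq-cli.sh -u rhqadmin -p <password> -s <server-host> -t 7080
    # at the CLI prompt, trigger cluster maintenance via the remote API:
    #   StorageNodeManager.runClusterMaintenance()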

I saw this issue even after upgrading a JON 3.3.0 HA environment to JON 3.3.10, when the hosts were upgraded one by one without stopping the services on the other host.

All logs from both hosts are attached. Note that the NPE in the server log is reported in bz1544424 and the "Connection refused to host: 127.0.0.1" WARNs in agent.log are reported in bz1545698.

Comment 4 Larry O'Leary 2018-04-05 16:58:03 UTC
As indicated in Bug 1084056, the decision was to resolve this by adding support for the missing availability policy (supportsMissingAvailabilityType). This allows the user to set the Administration > Configuration > Missing Resource Policy for the RHQ Storage Node / ConfigurableInternalServerMetrics resource type to IGNORE or UNINVENTORY, which prevents this transient resource from continuously being reported as DOWN while cluster repair is not executing.

Closing as WONTFIX.

