Bug 1546066

Summary:

AntiEntropySessions resource can be marked as down after restart of JON services in HA env

Product:

[JBoss] JBoss Operations Network

Reporter:

Filip Brychta <fbrychta>

Component:

Storage Node

Assignee:

Michael Burman <miburman>

Status:

CLOSED WONTFIX

QA Contact:

Mike Foley <mfoley>

Severity:

medium

Docs Contact:

Priority:

low

Version:

JON 3.3.10

CC:

loleary, miburman

Target Milestone:

ER01

Keywords:

Triaged

Target Release:

JON 3.3.11

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-04-05 16:58:03 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
all logs	none

Description Filip Brychta 2018-02-16 09:06:28 UTC

Created attachment 1396902
all logs

Description of problem:
AntiEntropySessions resource can be marked as down after restart of JON services in HA env

Version-Release number of selected component (if applicable):
JON 3.3.10

How reproducible:
Sometimes

Steps to Reproduce:
1. Prepare the PostgreSQL database on host 1 to accept remote connections
2. Unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10)
3. Apply the patch: jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
4. Install the server: jon-server-3.3.0.GA/bin/rhqctl install
5. Start it and wait until it is fully up
6. Install an additional JON server and storage node on host 2:
   a) Unzip jon-server-3.3.0.GA.zip and jon-plugins-patch-3.3.0.GA.zip (CP10)
   b) Apply the patch: jon-server-3.3.0.GA-update-10/apply-updates.sh jon-server-3.3.0.GA
   c) Edit jon-server-3.3.0.GA/bin/rhq-server.properties to use the PostgreSQL database on host 1
   d) Install the server: jon-server-3.3.0.GA/bin/rhqctl install
   e) Start it and wait until it is fully up
7. Restart the services on host 2: jon-server-3.3.0.GA/bin/rhqctl restart


Actual results:
The AntiEntropySessions resource on host 2 is marked as DOWN and the following warning is logged in agent.log:
WARN  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Failure to collect measurement data for Resource[id=10221, uuid=4fdd7c18-b65b-487b-adcb-03381ae9473f, type={RHQStorage}ConfigurableInternalServerMetrics, key=org.apache.cassandra.internal:type=AntiEntropySessions, name=Anti Entropy Sessions, parent=RHQ Storage Node(fbr-pre-ha2.bc.jonqe.lab.eng.bos.redhat.com)] - cause: java.lang.IllegalStateException:EMS bean was null for Resource with type [ResourceType[id=0, name=ConfigurableInternalServerMetrics, plugin=RHQStorage, category=Service]] and key [org.apache.cassandra.internal:type=AntiEntropySessions].
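
To check whether an agent.log contains this warning, a small script along the following lines may help. It is only a sketch: the regular expression is derived solely from the single log line quoted above, and the function name find_null_ems_bean_warnings is made up for illustration. Other JON versions may format the message differently.

```python
import re

# Pattern built from the one WARN line quoted in this report (an assumption:
# the layout may differ in other versions or for other resource types).
WARNING_RE = re.compile(
    r"Failure to collect measurement data for Resource\[id=(?P<id>\d+),.*?"
    r"key=(?P<key>[^,\]]+),.*?"
    r"cause: java\.lang\.IllegalStateException:\s*EMS bean was null"
)


def find_null_ems_bean_warnings(lines):
    """Return a list of (resource_id, resource_key) for each matching line."""
    hits = []
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            hits.append((int(match.group("id")), match.group("key")))
    return hits


def scan_log(path):
    """Scan a log file (hypothetical helper; pass the path to agent.log)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_null_ems_bean_warnings(handle)
```

Calling scan_log() with the path to agent.log returns one tuple per occurrence, so an empty list means the warning was not found in that file.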

Expected results:
All expected resources are UP and there are no warnings in the logs

Additional info:
The AntiEntropySessions resource becomes UP again after the StorageNodeManager.runClusterMaintenance() operation is invoked.

I also saw this issue after upgrading a JON 3.3.0 HA environment to JON 3.3.10, when the hosts were upgraded one by one without stopping the services on the other host.

All logs from both hosts are attached. Note that the NPE in the server log is reported in bz1544424, and the "Connection refused to host: 127.0.0.1" WARNs in agent.log are reported in bz1545698.

Comment 4 Larry O'Leary 2018-04-05 16:58:03 UTC

As indicated in Bug 1084056, the decision to resolve this was to add support for the missing availability policy (supportsMissingAvailabilityType). This allows the user to set the Administration > Configuration > Missing Resource Policy for the RHQ Storage Node / ConfigurableInternalServerMetrics resource type to IGNORE or UNINVENTORY to prevent this transient resource from continuously being reported as DOWN while cluster repair is not executing.

Closing as WONTFIX.