884338 – Agent's availability report is ignored due to bogus stale resource error

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	JBoss Operations Network
Classification:	JBoss
Component:	Inventory
Sub Component:
Version:	JON 3.1.0
Hardware:	All
OS:	All
Priority:	medium
Severity:	medium
Target Milestone:	ER01
Target Release:	JON 3.2.0
Assignee:	Jay Shaughnessy
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	916373
Depends On:
Blocks:	895743

Reported:	2012-12-06 00:30 UTC by Larry O'Leary
Modified:	2018-12-05 15:43 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-01-02 20:43:03 UTC
Type:	Bug
Embargoed:

Attachments
jon server log files (719.05 KB, application/x-zip-compressed) 2013-02-27 10:52 UTC, bkramer	no flags
jon agent log files (2.46 MB, application/x-zip-compressed) 2013-02-27 10:53 UTC, bkramer	no flags

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	877176	urgent	CLOSED	CLI DiscoveryBoss.importResources fails when upgraded inventory contains empty/null resource availability	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	895743	unspecified	CLOSED	Back-port commit 3605ce3 for better logging and recovery attempt of stale resource situation similar to what was identif...	2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution)	268103	None	None	None	Never

Internal Links: 877176 895743

Description Larry O'Leary 2012-12-06 00:30:41 UTC

Description of problem:
All resources for an agent show an availability of UNKNOWN in the UI even though the agent is sending an availability report and resources are intact on both server and agent.

The server appears to be ignoring the availability report sent by the agent and the following INFO message is logged:

    INFO  [org.rhq.enterprise.server.measurement.AvailabilityManagerBean] Skipping mergeAvailabilityReport() for stale resource [Resource[id=10008, uuid=null, type=<null>, key=null, name=null, parent=<null>]]. These messages should go away after the next agent synchronization with the server....


In this case, resource 10008 is a JBoss AS5 resource and, from the agent side, is available and working as expected. The server also reflects the exact same state for this resource; however, it still logs the message indicating that the resource is stale.

Version-Release number of selected component (if applicable):
RHQ_4_4_0_JON310GA

How reproducible:
Unknown

  

Additional info:
After reviewing the source, it appears that perhaps this is an issue with a missing row in the RHQ_AVAILABILITY table. This seems very similar to bug 877176.

Additional diagnostics are necessary.
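
To make the hypothesis above concrete, here is a minimal sketch of the check it implies: list every inventoried resource that has no "current" availability row, i.e. no row with a null end time. This is an in-memory illustration only, not the server's code. The AvailRow class, its field names, and resourcesMissingCurrentRow are hypothetical and do not reflect the actual RHQ_AVAILABILITY column names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for one availability row; field names are assumptions. */
final class AvailRow {
    final int resourceId;
    final long startTime;
    final Long endTime; // null marks the current (open) availability row

    AvailRow(int resourceId, long startTime, Long endTime) {
        this.resourceId = resourceId;
        this.startTime = startTime;
        this.endTime = endTime;
    }
}

final class AvailabilityRowCheck {
    /** Returns ids of inventoried resources with no row whose endTime is null, in input order. */
    static List<Integer> resourcesMissingCurrentRow(List<Integer> resourceIds, List<AvailRow> rows) {
        // Collect every resource that owns at least one open row.
        Set<Integer> withCurrentRow = new HashSet<>();
        for (AvailRow row : rows) {
            if (row.endTime == null) {
                withCurrentRow.add(row.resourceId);
            }
        }
        // Anything in inventory that is not in that set is a candidate for this bug.
        List<Integer> missing = new ArrayList<>();
        for (Integer id : resourceIds) {
            if (!withCurrentRow.contains(id)) {
                missing.add(id);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<Integer> inventory = Arrays.asList(10001, 10008, 10009);
        List<AvailRow> rows = Arrays.asList(
            new AvailRow(10001, 1000L, null),  // has a current row
            new AvailRow(10008, 1000L, 2000L)  // only a closed, historical row
        );                                     // 10009 has no rows at all
        System.out.println(resourcesMissingCurrentRow(inventory, rows));
    }
}
```

The sketch treats "no rows at all" and "only closed rows" the same way, since in both cases there is no open row for an incoming report to be merged against.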

Comment 1 Larry O'Leary 2012-12-06 00:33:45 UTC

I am waiting for more information from the user who reported this problem. Specifically, the state of the RHQ_AVAILABILITY table for the impacted resource and hopefully an upgrade history that may explain how we ended up in such a state. At the moment, this is a hypothesis based on the exception being thrown when executing AvailabilityManagerBean.mergeAvailabilityReport at line 512[1].



[1]: http://git.fedorahosted.org/cgit/rhq/rhq.git/tree/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/measurement/AvailabilityManagerBean.java?id=RHQ_4_4_0_JON310GA#n512

Comment 2 Jay Shaughnessy 2012-12-06 16:01:42 UTC

I believe they have hit Bug 865167 and will need to apply the Workaround DB Query found in that BZ.

Comment 3 Larry O'Leary 2012-12-06 16:07:58 UTC

I think so too, but the workaround in bug 865167 is limited to the RHQ_RESOURCE_AVAIL table. In this case, that issue seems to apply to the RHQ_AVAILABILITY table too.

Comment 6 Jay Shaughnessy 2012-12-18 15:09:57 UTC

I have spent a decent amount of time trying to find a code-path that can explain either the removal of the latest availability row (i.e. null end-time), or the errant setting of the end-time of this row, and I can't reproduce the situation.

- purge seems ok
- uninventory seems ok
- out-of-order avail reports (i.e. insertion of past avail change) seem ok
- interplay with agent backfilling seems ok
- edge cases (e.g. repeated reporting for the same start time) seem ok

This is reported not to be an upgrade, so upgrade issues do not seem to be a factor.  The customer sees the problem on several resources, including non-deletable resources, which indicates that it's not a deleted-resource issue.  So, the cause remains unexplained, but if it's a valid issue it must be a very unusual/unexpected use case.

If reported again please collect the oldest possible server logs to potentially help determine a root cause.

I have updated the code such that the error handling can tell whether we really have a stale resource problem, which is a rare but expected situation and can be resolved with an agent sync, or if this is an instance of the reported corruption.  In the latter case I have added code that logs differently and will attempt to repair the situation.  Unfortunately it will not help with determining the root cause.
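
A rough sketch of the distinction described above, under stated assumptions: it is not the code from the commit, and MergeSketch, Outcome, addResource, current, and merge are all hypothetical names. It assumes a "stale" resource is one the server no longer has in inventory, while the corruption case is a resource that is in inventory but has no current availability row.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class MergeSketch {
    enum Outcome { MERGED, SKIPPED_STALE, REPAIRED_AND_MERGED }

    // Assumption: resource ids the server knows about.
    private final Set<Integer> inventory = new HashSet<>();
    // Assumption: one "current" availability value per resource, absent when the row is missing.
    private final Map<Integer, String> currentAvail = new HashMap<>();

    /** Registers a resource; a null initialAvail simulates the missing current row. */
    void addResource(int id, String initialAvail) {
        inventory.add(id);
        if (initialAvail != null) {
            currentAvail.put(id, initialAvail);
        }
    }

    String current(int id) {
        return currentAvail.get(id);
    }

    Outcome merge(int id, String reported) {
        if (currentAvail.containsKey(id)) {
            currentAvail.put(id, reported); // normal path
            return Outcome.MERGED;
        }
        if (!inventory.contains(id)) {
            // Genuinely stale: the agent reports a resource the server does not have.
            // Expected to clear after the next agent sync, so the datum is skipped.
            return Outcome.SKIPPED_STALE;
        }
        // Resource exists but has no current row: treat as corruption, recreate the
        // row from the reported value, and carry on instead of skipping.
        currentAvail.put(id, reported);
        return Outcome.REPAIRED_AND_MERGED;
    }

    public static void main(String[] args) {
        MergeSketch server = new MergeSketch();
        server.addResource(10001, "UNKNOWN");
        server.addResource(10008, null); // in inventory, current row missing
        System.out.println(server.merge(10001, "UP"));
        System.out.println(server.merge(10008, "UP"));
        System.out.println(server.merge(99999, "UP")); // not in inventory
    }
}
```

Keeping the two cases as separate outcomes is what allows them to be logged differently, so a real stale resource is no longer confused with a missing row.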

I've committed the changes to master: 

commit 3605ce3398557277d8ddf4deb3ffaa83337b7c58
Author: Jay Shaughnessy <jshaughn>
Date:   Tue Dec 18 10:06:59 2012 -0500

I failed to reproduce this issue but:
- Add some better logging and some attempted repair code for this situation
- Add a couple more tests

Comment 7 Larry O'Leary 2012-12-18 16:40:39 UTC

Considering that we are unable to reproduce this or identify a possible cause of this state, I am closing this bug as per comment 6.

Comment 9 bkramer 2013-02-27 10:51:22 UTC

Managed to reproduce this with the following test case:

1. Ran JON Server and JON Agent 3.0.0.
2. Ran discovery and imported some resources. I left RHQ Agent and postgres in the discovery queue.
3. Shut down both JON Agent and JON Server.
4. Upgraded JON Server and JON Agent to 3.1.2.
5. Checked the discovery queue and confirmed that the above resources were still there.
6. Added them to the inventory. Their availability was UNKNOWN.
7. Ran avail --force, but that didn't make a difference (both resources were still in the UNKNOWN state although JON Agent claimed that everything was UP and running).
8. Removed the RHQ Agent resource from the inventory and added it back again. Its availability changed to UP.
9. For postgres, I ran the SQL from comment 8 and after that "avail --force", and that resolved the issue.

I will attach the log files as well.

Comment 10 bkramer 2013-02-27 10:52:31 UTC

Created attachment 703367
jon server log files

Comment 11 bkramer 2013-02-27 10:53:15 UTC

Created attachment 703368
jon agent log files

Comment 12 Larry O'Leary 2013-02-27 15:02:14 UTC

Re-opening as this seems to be a severe issue that impacts inventories that have been upgraded from prior releases of JBoss ON. Specifically, we should probably get the resolution from solution 268103[1] added to the schema upgrade task to ensure that existing inventory is properly upgraded to handle the availability changes that were introduced during the 3.1.x time frame.



[1]: https://access.redhat.com/knowledge/solutions/268103

Comment 13 Jay Shaughnessy 2013-02-28 22:22:11 UTC

The db upgrade now contains steps that resolve the issues with missing ResourceAvailability and also Availability.  These situations arise when an upgrade is performed with resources in the discovery queue.

It does not have the queries above, which also seem to deal with resources that have historical Availability records but no current Availability record (null end-time).  How that happens is not clear, but that issue, if it occurs, is cleared up as it is encountered when merging availability reports.

I think the work for 3.2 is done.

Comment 14 Larry O'Leary 2013-03-01 00:41:02 UTC

As long as both situations are handled by 3.2, I agree. Should this be moved to MODIFIED then? Also, can we get the commit IDs copied here considering this is the product bug that links to the errata, knowledge base, and customer portal so we have reference for tracking?

Comment 15 Charles Crouch 2013-04-09 20:24:26 UTC

Jay, can you reply to Larry's comment and document here the test cases that should be executed to verify this change?

Comment 16 Jay Shaughnessy 2013-04-11 18:29:19 UTC

The commit to fix db-upgrade.xml: 
  068f664483a2013fa84123cb6b6ba85b54ee7c5c

The primary commit for the runtime "repair code" mentioned in Comment 13:
  955ef8974d5c8782c048704550ad8da395365e93

There may have been a few other relevant commits. It was fixed up during the availability report perf work in Feb '13.  But after the commit above everything was in place.


The primary test case is the one mentioned in the BZ, regarding upgrades from 3.1.x or earlier, with resources in the discovery queue.

This can be moved to ON_QA as these commits were in prior to any 3.2 build, so any 3.2 build should have the changes already.

Comment 17 Larry O'Leary 2013-09-06 14:31:26 UTC

As this is MODIFIED or ON_QA, setting milestone to ER1.

Comment 18 Larry O'Leary 2013-09-18 02:30:18 UTC

Looks like comment #9 outlines the steps necessary to test this. Specifically this is for upgrade testing and will require the installation of JON 3.0 and upgrade to JON 3.2.

Comment 19 Larry O'Leary 2013-09-18 02:35:24 UTC

Actually, comment #9 doesn't contain a complete set of instructions as it is not specific about the actual cause. Here is a new set from ccrouch's comment in bug 916373:

The key part is: "You only need to do this IF you committed resources AFTER the 3.1.2 upgrade where those resources were in the discovery queue pre-upgrade (that is, when 3.0.0 was running.)"

So steps to reproduce:
1) Install JON300
2) Start an agent, but don't import its platform, leave it in the autodiscovery queue.
3) Upgrade to JON320. Import the agent and platform from step 2), see that their availabilities are green.

Comment 20 Larry O'Leary 2013-09-18 02:36:34 UTC

*** Bug 916373 has been marked as a duplicate of this bug. ***
