Bug 830158

Summary: java.sql.BatchUpdateException when you uninventory a platform immediately after importing it
Product: [Other] RHQ Project
Reporter: Libor Zoubek <lzoubek>
Component: Agent
Assignee: RHQ Project Maintainer <rhq-maint>
Status: CLOSED WONTFIX
QA Contact: Mike Foley <mfoley>
Severity: medium
Priority: low
Version: 4.4
CC: fbrychta, hrupp, jkremser, jshaughn, theute
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-09-05 14:36:28 EDT
Attachments: server.log

Description Libor Zoubek 2012-06-08 08:20:42 EDT
Created attachment 590417 (server.log)

Description of problem: This is a strange issue. I have a JON server with 2 agents (agent A is on the same host as the JON server, agent B is on a different host). I've been running CLI automation that was periodically importing and unimporting platforms.

Version-Release number of selected component (if applicable):
JON 3.1.CR3

How reproducible: always

Steps to Reproduce:
1. Have JON with more than 1 agent.
2. Have something running on the agents (for example AS7 on each host).
3. Import the whole discovery queue.
4. Uninventory everything right after importing it.
Actual results:

you get a java.sql.BatchUpdateException: Batch entry XXX was aborted.

The consequences vary; I've seen these:

1. One of the agents just disconnects, or it completely disappears from the 'Administration' -> 'Agents' list. I restarted the agent and it just connected and started to work again like nothing had happened.

2. An asynchronous uninventory request fails - I am not able to uninventory the platform anymore.

3. There are no consequences - or rather, I didn't find any.

Expected results: JON does not get broken when user does these things.

Additional info: the attached log reflects consequence #2 of the repro steps a few times
Comment 1 Jirka Kremser 2012-06-14 07:54:31 EDT
I've reproduced it with one agent as well
Comment 2 Jay Shaughnessy 2012-06-14 10:05:37 EDT
I believe it is expected that the entry in the rhq_agent table (i.e. Administration-> Agents list) go away when you uninventory a platform.

Other than the log messages, is there any unexpected behavior?  Does the uninventory fail?  Or subsequent discovery of resources from the agent?  I'm not sure I understand consequence #2 above. Are you saying the uninventory fails and then the doomed platform still shows up and can't be removed?
Comment 3 Jay Shaughnessy 2012-06-14 14:38:34 EDT
Other than the scary server logging, and possibly a bad message to the user when the uninventory fails, I'm not sure this is a really bad problem. Jirka and I are still looking at it.

The problem basically results from the fact that inventory is being added to the platform, from the initial import, at the same time that we are trying to remove the platform and its associated Agent entry.  This issue happens only when uninventorying a platform.

This is not a very production-oriented use case.  Uninventorying a platform is almost always done only after the platform and agent have gone away, as a clean-up mechanism. Even for an active platform it would typically be well after initial import.

So, any fix here should probably be limited to error handling and/or improved messaging. No major effort should be expended trying to get the 

NOTE: With recent security enhancements we do our token handling a bit differently.  This means that if you uninventory a platform with a running RHQ Agent that it will no longer be able to submit commands to the server. This is because its security token is no longer valid (because we've removed the Agent record in the db).  There are only two ways to get the agent to again send commands to the server: 1) Restart the agent.  2) If it is running interactively, you can execute the 'register' command prompt command.
Comment 4 Jay Shaughnessy 2012-06-14 16:39:56 EDT
whoops, I never finished my sentence above:

...No major effort should be expended trying to get the uninventory to actually succeed.  If the user just waits for the new resources to get merged a retry should see success.

master commit 6b4d1b0baf4d1d95d51d14fc7387fbc78cbf97fe
The exception is due to badness when trying to uninventory a platform that is
actively merging in new resources.  This is an unlikely use case.  This commit
does not try to prevent the problem but rather adds some exception handling
and an improved log message indicating the probable scenario, and suggesting that the
user retry after waiting for the resources to be merged.
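The commit's approach (catch the low-level batch failure and surface a clearer message) can be sketched roughly as follows; the `translate` helper and its signature are illustrative, not the actual RHQ code:

```java
import java.sql.BatchUpdateException;

public class UninventoryErrorHandling {

    // Walk the cause chain looking for the batch failure described in this bug.
    static boolean causedByBatchFailure(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof BatchUpdateException) {
                return true;
            }
        }
        return false;
    }

    // Wrap the raw SQL failure in an exception whose message tells the user
    // what likely happened and what to do (wait for the merge, then retry).
    static RuntimeException translate(Throwable t) {
        if (causedByBatchFailure(t)) {
            return new IllegalStateException(
                "Failed to uninventory platform. This can happen if new resources "
                + "were actively being imported. Please wait and try again shortly.", t);
        }
        return new RuntimeException(t);
    }
}
```

The point is only to classify the known race and attach actionable wording; the underlying uninventory still fails and must be retried by the caller.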

I'm sending to QA for an opinion on whether this is sufficient.

Test Notes:
- remember the note above about the agent restart. We've also added a FAQ:
- Nothing is changed other than the improved message in the root cause.
- The platform should uninventory successfully if you wait a minute for everything to get imported.
Comment 5 Libor Zoubek 2012-06-25 17:03:20 EDT
I was going to move this to VERIFIED and just live with the fact that I have to wait some time when unimporting a platform. But I've now reproduced this BZ on a recent RHQ master build and it said:

"Failed to uninventory platform. This can happen if new resources were actively being imported. Please wait and try again shortly."

But I didn't unimport the platform immediately after import (I did it after a few hours). This means this BZ can happen any time I remove a platform from inventory while there is an ongoing inventory merge at the same time.

Plus, sometimes (not this time) the agent completely disconnects and needs to be restarted manually (consequence #1).
Comment 6 Jay Shaughnessy 2012-06-26 11:13:14 EDT
It's true that it can happen at any time, if you are unlucky enough to try the uninventory while new resources are being merged in.  But this is very unlikely in the real world as:
1) platform uninventory is typically performed for down agents or obsolete platforms.
2) resource discovery happens relatively rarely.
3) new resources are rare after initial discovery.

I recommend no further work on this issue.  As a note, I have recently added slightly more information (I added the root cause Exception) to the thrown exception in DEBUG mode.

It should be noted that actually avoiding this situation would likely require major work.

Setting to MODIFIED and asking for Triage.
Comment 7 Filip Brychta 2012-07-18 11:58:33 EDT
Found JON in the following state:
[1342626387798] javax.ejb.EJBException:java.lang.IllegalStateException: Failed to uninventory platform. This can happen if new resources were actively being imported. Please wait and try again shortly. -> java.lang.IllegalStateException:Failed to uninventory platform. This can happen if new resources were actively being imported. Please wait and try again shortly.

This exception was thrown repeatedly even a few hours after the last import. So it is not possible to uninventory the resource at all. Unfortunately, no reproduction scenario.
Comment 8 Filip Brychta 2013-03-19 07:09:18 EDT
This approach (not fixing the cause) is not suitable for our CLI automation (or any kind of automation). The last 3 runs failed because of this BZ, so I had to add the following ugly workaround:
var result = common.waitFor(function (){
	try {
		// (uninventory call elided in this snippet)
		return true;
	} catch (err) {
		var errMsg = err.message;
		common.warn("Caught following error during uninventory: " + errMsg);
		if(errMsg.indexOf("Failed to uninventory platform. " +
				"This can happen if new resources were actively being imported. " +
				"Please wait and try again shortly") != -1){
			return false; // keep waiting and retry
		}
		throw "Failed to uninventory. See previous errors."
	}
});

Is there any better solution?
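A bounded retry that only swallows this specific error (and rethrows anything else) is one cleaner shape for the workaround. A sketch of that pattern, in Java for illustration; the `retryWhileMerging` helper and `Action` interface are hypothetical, not an RHQ API:

```java
public class UninventoryRetry {

    @FunctionalInterface
    interface Action {
        void run() throws Exception;
    }

    // Retry the action only while it fails with the known "resources were being
    // imported" message; rethrow anything else immediately.
    static boolean retryWhileMerging(Action action, int maxAttempts, long sleepMillis)
            throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return true; // uninventory succeeded
            } catch (IllegalStateException e) {
                String msg = String.valueOf(e.getMessage());
                if (!msg.contains("Failed to uninventory platform")) {
                    throw e; // unrelated failure: don't mask it
                }
                Thread.sleep(sleepMillis); // merge may still be running; back off
            }
        }
        return false; // gave up after maxAttempts
    }
}
```

Matching on the message string is fragile (it breaks if the wording changes), which is why a dedicated exception type for this condition would be friendlier to automation.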
Comment 9 Heiko W. Rupp 2013-09-04 05:40:06 EDT
Resetting to on_dev for further investigation
Comment 10 Jay Shaughnessy 2014-02-07 15:17:28 EST
master commit dbd05bd84ff577c98fbd723bf557b0face03736b
Author: Jay Shaughnessy <jshaughn@redhat.com>
Date:   Fri Feb 7 11:28:41 2014 -0500

Just an attempt to solve this issue by detaching the agent entity while
the [potentially] large resource update executes, and flushing changes to the
DB prior to deleting the agent.

If this doesn't work (i.e. pass automation testing, which currently can produce
the issue) then we may want to look at changing around the transactioning such
that the resource update and the agent removal happen in distinct transactions,
the former finishing prior to the latter executing.
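The ordering the commit describes (detach the agent, run the large update, flush, then delete) can be sketched with a tiny stand-in for the entity manager; the `Em` interface and method names below are illustrative, not RHQ's actual persistence code:

```java
public class AgentRemovalOrder {

    // Minimal stand-in for the JPA-style operations the commit mentions.
    interface Em {
        void detach(Object entity); // stop tracking the agent entity
        void flush();               // push pending changes to the DB
        void remove(Object entity); // delete the agent row
    }

    static void uninventoryPlatform(Em em, Object agent, Runnable largeResourceUpdate) {
        em.detach(agent);          // keep the agent out of the [potentially] large update
        largeResourceUpdate.run(); // merge/update the platform's resources
        em.flush();                // ensure those changes hit the DB first...
        em.remove(agent);          // ...before the agent record is deleted
    }
}
```

If this ordering within one transaction is not enough, the fallback the commit mentions is to split the resource update and the agent removal into two transactions, committing the first before the second starts.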
Comment 11 Jay Shaughnessy 2014-03-10 14:10:17 EDT
I'm going to re-target this for 4.11; as far as I know, this is only an issue in our testing envs and has not really been a real-world problem.
Comment 12 Heiko W. Rupp 2014-05-08 10:42:51 EDT
Bump the target version now that 4.11 is out.
Comment 13 Jay Shaughnessy 2014-07-07 12:09:57 EDT
Unsetting target version and throwing back in the pool as this is mainly an in-house QE automation issue.
Comment 14 Jay Shaughnessy 2014-09-05 14:36:28 EDT
Closing this as won't fix because we don't know exactly what to do about it and it seems not to be affecting real-world users.