614591 – After uninventorying resources, other resources are shown as DOWN but they are really UP

Bug 614591 - After uninventorying resources, other resources are shown as DOWN but they are really UP

Summary: After uninventorying resources, other resources are shown as DOWN but they ar...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Core Server
Sub Component:
Version:	3.0.0
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Joseph Marques
QA Contact:	Corey Welton
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	rhq_triage jon-sprint12-bugs
TreeView+	depends on / blocked

Reported:	2010-07-14 19:54 UTC by Jeff Weiss
Modified:	2014-11-09 22:50 UTC (History)
CC List:	3 users (show)
Fixed In Version:	2.4
Clone Of:
Environment:
Last Closed:	2010-08-12 16:57:44 UTC
Embargoed:

Attachments	(Terms of Use)

Description Jeff Weiss 2010-07-14 19:54:05 UTC

Description of problem:
This was discovered in an overnight run of automated test suite.  After the suite was finished, the inventory was in an inconsistent state.  Many of the servers' Availability was shown as DOWN even though metrics were still being collected.

Version-Release number of selected component (if applicable):


How reproducible:
Not sure, probably not reproducible without running the overnight suite.  But I believe I've seen it cause this behavior before.

Steps to Reproduce:
1.
2.
3.
  
Actual results:
Some but not all running services are shown as DOWN (including the agent itself)

Expected results:
All servers that are running should be shown UP

Additional info:
From server log: 

2010-07-14 11:34:17,216 WARN  [org.hibernate.util.JDBCExceptionReporter] SQL Error: 0, SQLState: 23503
2010-07-14 11:34:17,216 ERROR [org.hibernate.util.JDBCExceptionReporter] Batch entry 0 delete from RHQ_RESOURCE where ID='10952' was aborted.  Call getNextException to see the cause.
2010-07-14 11:34:17,216 WARN  [org.hibernate.util.JDBCExceptionReporter] SQL Error: 0, SQLState: 23503
2010-07-14 11:34:17,216 ERROR [org.hibernate.util.JDBCExceptionReporter] ERROR: update or delete on table "rhq_resource" violates foreign key constraint "rhq_resource_parent_resource_id_fkey" on table "rhq_resource"
  Detail: Key (id)=(10952) is still referenced from table "rhq_resource".
2010-07-14 11:34:17,217 ERROR [org.hibernate.event.def.AbstractFlushingEventListener] Could not synchronize database state with session
org.hibernate.exception.ConstraintViolationException: Could not execute JDBC batch update
	at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:71)
	at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43)
	at org.hibernate.jdbc.AbstractBatcher.executeBatch(AbstractBatcher.java:254)
	at org.hibernate.jdbc.AbstractBatcher.prepareStatement(AbstractBatcher.java:93)
	at org.hibernate.jdbc.AbstractBatcher.prepareStatement(AbstractBatcher.java:88)
	at org.hibernate.jdbc.AbstractBatcher.prepareBatchStatement(AbstractBatcher.java:223)
	at org.hibernate.persister.entity.AbstractEntityPersister.delete(AbstractEntityPersister.java:2484)
	at org.hibernate.persister.entity.AbstractEntityPersister.delete(AbstractEntityPersister.java:2702)
	at org.hibernate.action.EntityDeleteAction.execute(EntityDeleteAction.java:77)
	at org.hibernate.engine.ActionQueue.execute(ActionQueue.java:250)
	at org.hibernate.engine.ActionQueue.executeActions(ActionQueue.java:234)
	at org.hibernate.engine.ActionQueue.executeActions(ActionQueue.java:146)
	at org.hibernate.event.def.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:298)
	at org.hibernate.event.def.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:27)
	at org.hibernate.impl.SessionImpl.flush(SessionImpl.java:1000)
	at org.hibernate.impl.SessionImpl.managedFlush(SessionImpl.java:338)
	at org.hibernate.ejb.AbstractEntityManagerImpl$1.beforeCompletion(AbstractEntityManagerImpl.java:515)
	at com.arjuna.ats.internal.jta.resources.arjunacore.SynchronizationImple.beforeCompletion(SynchronizationImple.java:114)
	at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.beforeCompletion(TwoPhaseCoordinator.java:247)
	at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.end(TwoPhaseCoordinator.java:86)
	at com.arjuna.ats.arjuna.AtomicAction.commit(AtomicAction.java:177)
	at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1389)
	at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:135)
	at com.arjuna.ats.jbossatx.BaseTransactionManagerDelegate.commit(BaseTransactionManagerDelegate.java:87)
	at org.jboss.aspects.tx.TxPolicy.endTransaction(TxPolicy.java:175)
	at org.jboss.aspects.tx.TxPolicy.invokeInOurTx(TxPolicy.java:87)
	at org.jboss.aspects.tx.TxInterceptor$RequiresNew.invoke(TxInterceptor.java:262)
	at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:101)
	at org.jboss.aspects.tx.TxPropagationInterceptor.invoke(TxPropagationInterceptor.java:95)
	at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:101)
	at org.jboss.ejb3.stateless.StatelessInstanceInterceptor.invoke(StatelessInstanceInterceptor.java:62)
	at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:101)
	at org.jboss.aspects.security.AuthenticationInterceptor.invoke(AuthenticationInterceptor.java:77)
	at org.jboss.ejb3.security.Ejb3AuthenticationInterceptor.invoke(Ejb3AuthenticationInterceptor.java:110)
	at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:101)
	at org.jboss.ejb3.ENCPropagationInterceptor.invoke(ENCPropagationInterceptor.java:46)
	at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:101)
	at org.jboss.ejb3.asynchronous.AsynchronousInterceptor.invoke(AsynchronousInterceptor.java:106)
	at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:101)
	at org.jboss.ejb3.stateless.StatelessContainer.localInvoke(StatelessContainer.java:240)
	at org.jboss.ejb3.stateless.StatelessContainer.localInvoke(StatelessContainer.java:210)
	at org.jboss.ejb3.stateless.StatelessLocalProxy.invoke(StatelessLocalProxy.java:84)
	at $Proxy336.uninventoryResourceAsyncWork(Unknown Source)
	at org.rhq.enterprise.server.scheduler.jobs.AsyncResourceDeleteJob.executeJobCode(AsyncResourceDeleteJob.java:54)
	at org.rhq.enterprise.server.scheduler.jobs.AbstractStatefulJob.execute(AbstractStatefulJob.java:48)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525)
Caused by: java.sql.BatchUpdateException: Batch entry 0 delete from RHQ_RESOURCE where ID='10952' was aborted.  Call getNextException to see the cause.
	at org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2569)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1796)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:407)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2708)
	at sun.reflect.GeneratedMethodAccessor133.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at org.postgresql.ds.jdbc23.AbstractJdbc23PooledConnection$StatementHandler.invoke(AbstractJdbc23PooledConnection.java:455)
	at $Proxy52.executeBatch(Unknown Source)
	at org.jboss.resource.adapter.jdbc.CachedPreparedStatement.executeBatch(CachedPreparedStatement.java:476)
	at org.jboss.resource.adapter.jdbc.WrappedStatement.executeBatch(WrappedStatement.java:774)
	at org.hibernate.jdbc.BatchingBatcher.doExecuteBatch(BatchingBatcher.java:48)
	at org.hibernate.jdbc.AbstractBatcher.executeBatch(AbstractBatcher.java:247)
	... 44 more

Comment 1 John Mazzitelli 2010-07-14 20:00:56 UTC

one symptom - you'll notice some child resources with an inventory status of COMMITTED whose parent resources are UNINVENTORIED state. Getting into that state is what I believe is the cause to the whole mess. Fix that, and we fix the availability issue too.

Another symptom - it appears the agent is being asked (or its doing it on its own) to send an availability report very fast, like every few seconds.

Comment 2 Joseph Marques 2010-07-15 01:16:38 UTC

commit f84d18e73052c67a0bebe6eb08063c7cc417ef9b
Author: Joseph Marques <joseph>
Date:   Wed Jul 14 21:12:52 2010 -0400

BZ-614591: fix inventory sync issues that manifest during specific timings of async uninventory
    
part 1 - do not merge inventory report resources that have already been uninventoried
    
there exists a small window of time after the synchronous part of the uninventory and before the async quartz job comes along to perform the actual removal of the resource from the database, that an inventory report can come across the wire and !OVERWROTE! the UNINVENTORIED status back to COMMITTED.  if we find, during an inventory report merge, that the existing resource was already uninventoried (indicating that the quartz job has not yet come along to remove this resource from the database) we should stop all processing from this node and return immediately.  this short-cuts the processing for the entire sub-tree under this resource, but that's OK because the in-band uninventory logic will have marked entire sub-tree for uninventory atomically.  in other words, all of the descendants under a resource would also be marked for async uninventory too.
    
part 2 - async uninventory work should recursively delete resources
    
this handles the case when the the rare timings in part 1 made it into the system and the inventory is current in an inconsistent state.  the likely effect is that a resource marked with UNINVENTORIED status has a child that is marked as COMMITTED.  thus, if we recursively delete resources in the out-of-band job, we ensure that synchronization inconsistencies are eliminated and the inventory is brought back to a consistent state.
    
part 3 - async uninventory work should remove resources from their containing groups
    
even though the group removal occurs in the in-band work, there can be some group definitions that just happens to perform its recalculation (either manually or schedules) in the period after the in-band work completes but before the async job triggers. since the ExpressionEvaluator that underlies the bulk of the dynagroup query generations automatically adds a filter to only manipulate COMMITTED resource, this work should be a no-op most of the time.  however, in rare circumstances it's possible for an InventoryReport to come across the wire and flip the status of resources from UNINVENTORIED back to COMMITTED.  in this case, this group removal logic needs to be executed again just prior to removing the rest of the reosurce history.

Comment 3 John Mazzitelli 2010-07-15 14:52:22 UTC

these queries, if they return any rows, indicate you have this problem. You can run these in the http://<server>:7080/admin/test/sql.jsp page:


select * from rhq_resource where inventory_status = 'COMMITTED' and parent_resource_id in (select id from rhq_resource where inventory_status='UNINVENTORIED' order by id);

select * from rhq_resource where id in (select distinct parent_resource_id from rhq_resource where inventory_status = 'COMMITTED' and parent_resource_id in (select id from rhq_resource where inventory_status='UNINVENTORIED' order by id));

Comment 4 John Mazzitelli 2010-07-15 19:06:36 UTC

I kinda replicated this.

Get a JBossAS instance in the auto-discovery queue. Import it. As soon as you click the "import" button, hurry to the resource browser list page and immediately uninventory that new JBossAS server instance as fast as you can.

When I did this, those queries I mention in the previous returned rows - meaning I ended up with committed children of an uninventoried resource.

Weird thing is, the async deletion job reported it couldn't delete 1 resource however, my queries ended up returning 0 rows after the async delete job ran. Meaning the situation somehow cleaned itself up (at least in my use case).

I bet if I somehow had put this jboss-as server as an implicit member in a recursive group, it would have been unable to clean up. I'll check that next.

Comment 5 John Mazzitelli 2010-07-15 19:18:58 UTC

That's it - make it a member of a group. Here's what I did that replicated this:

1) Start server and agent and JBossAS server
2) Import JUST the platform - do not commit the jboss-as server to inventory yet
3) Create a compatible group of platforms - make it a recursive group
4) Add the newly important platform to that group
5) Now import the JBossAS Server that was auto-discovered
6) as soon as you click "import" hurry up to the resource list view and "uninventory" the JBossAS server you just committed. You must do this prior to the series of service scan inventory reports that come in from the agent due to the commit of the jboss as instance. I had two browser tabs open - one on the resource list view and one on the dashboard to make it easier/faster to switch to the different pages
7) Wait for the async delete job to run - in the old code, you get the exception mentioned in this bug. if this is fixed, you'll get no errors in the logs and the two queries that I posted in an earlier comment will return 0 rows each.

Comment 6 Corey Welton 2010-07-16 15:25:49 UTC

I am going to mark this as verified -- it was discovered w/ automation, and auto can certainly do a better job of accomplishing the steps in comment #5 than I could.  As it stands, the latest automation run against the build with this fix seemed to be pretty clean.

We can reopen if this issue rears its head again.

Comment 7 Corey Welton 2010-08-12 16:57:44 UTC

Mass-closure of verified bugs against JON.

Note You need to log in before you can comment on or make changes to this bug.