Bug 801432

Summary: ConcurrentModificationException on agent when resource is ignored in the discovery queue
Product: [Other] RHQ Project Reporter: Lukas Krejci <lkrejci>
Component: AgentAssignee: John Mazzitelli <mazz>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Foley <mfoley>
Severity: unspecified Docs Contact:
Priority: high    
Version: 4.3CC: ccrouch, hrupp, jshaughn, mazz
Target Milestone: ---   
Target Release: RHQ 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=794790
https://bugzilla.redhat.com/show_bug.cgi?id=807465
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-09-01 06:07:17 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Bug Depends On:    
Bug Blocks: 782579    

Description Lukas Krejci 2012-03-08 09:22:35 EST
Description of problem:
This appears in the server.log after an attempt to ignore a resource in the discovery queue (actually I tried to ignore several resources at once, not sure if that makes any difference)

15:17:56,508 WARN  [DiscoveryBossBean] Could not perform commit synchronization with agent for server [OpenSSH sshd]
java.lang.RuntimeException: java.util.ConcurrentModificationException
        at org.rhq.core.pc.inventory.InventoryManager.synchInventory(InventoryManager.java:1072)
        at org.rhq.core.pc.inventory.InventoryManager.synchronizeInventory(InventoryManager.java:2020)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.rhq.enterprise.communications.command.impl.remotepojo.server.RemotePojoInvocationCommandService.execute(RemotePojoInvocationCommandService.java:184)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
        at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
        at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
        at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
        at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
        at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
        at javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:288)
        at $Proxy0.execute(Unknown Source)
        at org.rhq.enterprise.communications.command.server.CommandProcessor.handleIncomingInvocationRequest(CommandProcessor.java:290)
        at org.rhq.enterprise.communications.command.server.CommandProcessor.invoke(CommandProcessor.java:184)
        at org.jboss.remoting.ServerInvoker.invoke(ServerInvoker.java:809)
        at org.jboss.remoting.transport.socket.ServerThread.processInvocation(ServerThread.java:608)
        at org.jboss.remoting.transport.socket.ServerThread.dorun(ServerThread.java:420)
        at org.jboss.remoting.transport.socket.ServerThread.run(ServerThread.java:173)
Caused by: java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
        at java.util.HashMap$KeyIterator.next(HashMap.java:828)
        at org.rhq.core.pc.inventory.InventoryManager.removeResourceAndIndicateIfScanIsNeeded(InventoryManager.java:1175)
        at org.rhq.core.pc.inventory.InventoryManager.uninventoryResource(InventoryManager.java:1146)
        at org.rhq.core.pc.inventory.InventoryManager.purgeObsoleteResources(InventoryManager.java:2760)
        at org.rhq.core.pc.inventory.InventoryManager.synchInventory(InventoryManager.java:1043)
        ... 25 more


Version-Release number of selected component (if applicable):
4.3.0

How reproducible:
frequently

Steps to Reproduce:
1.Try to ignore serveral resources in the discovery queue
  
Actual results:
The above stacktrace in the server.log

Expected results:
no exceptions

Additional info:
Comment 1 Mike Foley 2012-03-12 11:42:46 EDT
per BZ traige (crouch, foley, loleary), medium priority, not targetted for JON 3.1
Comment 2 Charles Crouch 2012-03-19 18:59:40 EDT
Actually triaging this into JON3.1 since it looks like https://bugzilla.redhat.com/show_bug.cgi?id=794790 didn't catch all the CMEs
Comment 3 John Mazzitelli 2012-03-26 11:48:53 EDT
(In reply to comment #2)
> Actually triaging this into JON3.1 since it looks like
> https://bugzilla.redhat.com/show_bug.cgi?id=794790 didn't catch all the CMEs

I don't think its that it didn't catch all CMEs, it looks like it put them back in.

For example, this issue shows the error happening in the stack:

Caused by: java.util.ConcurrentModificationException
...
        at
org.rhq.core.pc.inventory.InventoryManager.removeResourceAndIndicateIfScanIsNeeded(InventoryManager.java:1175)

Looking at commit 6f7913acb3583d89f8e1c5dd61f220cef1e05b9b which was for bug #794790, you will see this (notice it took OUT the copying of the child resources collection which was in here specifically to avoid the CME error):

we went from here (lines 1170-2):

            // put in new set to avoid concurrent mod exceptions
            Set<Resource> children = new HashSet<Resource>(resource.getChildResources());
            for (Resource child : children) {

to here (line 1176):

            for (Resource child : resource.getChildResources()) {

I see the same kind of thing in other places (that is, the commit removed the copying of the child collection - thus reintroducing the CME)... for example:

this (lines 1864-7):
                // Copy child Resources to an array and iterate that, rather than iterating the Set, which could cause
                // a ConcurrentModificationException if another thread tries to modify the Set while we're iterating it.
                Resource[] childResources = resource.getChildResources().toArray(new Resource[resource.getChildResources().size()]);
                for (Resource child : childResources) {

went to (line 1866):

                for (Resource child : resource.getChildResources()) {


Probably need Ian to take a look at his fix for 794790, because it looks like it caused regressions. I think we need to reopen bug #794790
Comment 4 Lukas Krejci 2012-03-26 12:21:40 EDT
Along with the removal of copying child collections, the fix also introduced the "Collections.newSetFromMap(new ConcurrentHashMap())" when a setting the child resources in InventoryManager#createNewResource().

That should remove the CMEs, too, but I guess that is not the only place we set the childresources collection.

A quick idea might be that the resources deserialized from the serverside wouldn't be treated with this code and thus the collection still be suspectible to the CMEs.
Comment 5 Ian Springer 2012-03-26 12:42:50 EDT
I make sure Resources from the server-side are cloned with a concurrent childResources field (find usages of InventoryManager.cloneResourceWithoutChildren()), but I must have missed some code path.
Comment 6 John Mazzitelli 2012-03-26 15:17:58 EDT
I'm here and I see a non-empty child resources set that is of type HashSet:

Daemon Thread [InventoryManager.discovery-1]
	Resource.getChildResources() line: 1375	
	InventoryManager.mergeResource(Resource) line: 2797	
	InventoryManager.mergeUnknownResources(Set<Integer>) line: 2687	
	InventoryManager.synchInventory(ResourceSyncInfo) line: 1089	
	InventoryManager.handleReport(InventoryReport) line: 1054	
	AutoDiscoveryExecutor.call() line: 131	

I was normally seeing the child resources set to be Collections$SetFromMap which I think is what we want now. So why is this still HashSet?
Comment 7 John Mazzitelli 2012-03-26 16:36:05 EDT
I cannot replicate this. however, I did push git commit in master3e93e33b82b96cc3cf73c54f5907ac8f9d97bbe6 only to ensure that our child collection remains as it was if it was already set. This really doesn't seem to fix anything but it at least serves to maintain our expectations.

I've run this through a debugger several times and the only time I see that the child collection is HashSet (as opposed to the thread-friendly Collections$SetFromMap) is from the stack trace I posted earlier (from mergeUnknownResources during a sync).
Comment 8 John Mazzitelli 2012-03-26 17:15:08 EDT
after staring at this for hours, I think I might see something:

from InventoryManager.removeResourceAndIndicateIfScanIsNeeded

            deactivateResource(resource);

            for (Resource child : resource.getChildResources()) {
                scanIsNeeded |= removeResourceAndIndicateIfScanIsNeeded(child);
            }

            Resource parent = resource.getParentResource();
            if (parent != null) {
                parent.removeChildResource(resource);
            }

Now, deactivateResource and, more importantly, removeResourceAndIndicateIfScanIsNeeded is recursive (as you see here).

What happens if one of the recursive calls hits the parent.removeChildResource(resource) line, returns and the for-loop tries to pick the next item out of the child set (for (Resource child : resource.getChildResources())) ?? That appears to be a potential source for ConcurrentModificationException? It is hard to see because you have to notice the recursive call.

I think we need to re-introduce the copying of the child set here. We'll have to re-examine all the places where the childResource set it used and see if there are any other recursive workflows that modify the set.
Comment 9 Lukas Krejci 2012-03-27 06:51:08 EDT
The ConcurrentHashMap allows simulatenous traversal and modification. So I do not think that the code snippet above should cause any problems iff the child resources collection is a set backed by this map:

- for loop creates an iterator over child resources set (set A)
  - recursive call (iterated value becomes the "resource")
    - for loop over children (set B)
    - parent.removeChildResource(resource) - this will remove the the resource that has just been iterated over in the set A
  - end of recursive call
  - iterator moves to the next element

so this is just iteration over and modification of a single set (or indeed the map that backs the set) in the same thread.

The following code simulates in a simple manner what is happening in the code snippet you identified:

Set<Object> objects = Collections.newSetFromMap(new ConcurrentHashMap<Object, Boolean>());
objects.add(new Object());
objects.add(new Object());
for(Object o : objects) objects.remove(o);

This code works OK. The CME *is* thrown though if the "objects" variable is initialized with a mere HashSet<Object> instead of the concurrent variant.

We need to identify the call chain that you once saw which leads to the resource child collection having type of HashSet and prevent that from happening.

Also note that, IMHO, the copy-before-iteration approach that we employed before didn't really work, but only lowered the chance of the CMEs happening.

If you look at the HashSet(Collection) constructor, you can see that it merely calls addAll(Collection) which *iterates* over the collection. So the CME still *would* happen if we for example removed a resource at the same time a scheduled discovery was running and child resources of the parent resource of the removed resource were being iterated over.

In short, we can't assume the inventory not changing out-of-band in any code that iterates over the resources because of the async nature of all of our operations on the inventory in the PC. 

There are several solutions to that - locking the subtree of a resource for every access to its children (i.e. essentially lock the whole inventory during a scheduled discovery for the whole time of the discovery (because this goes recursively top down from the platform level)), reimplement the discovery so that it is not a naive single-threaded recursive tree traversal - split it up in chunks that can happen concurrently, but you still need locking while traversing those chunks (currently we just hold N iterators "open" when doing discovery, where N is the depth of the resource tree) or employ concurrent-access-friendly collections.
Comment 10 John Mazzitelli 2012-03-27 08:17:52 EDT
> This code works OK. The CME *is* thrown though if the "objects" variable is
> initialized with a mere HashSet<Object> instead of the concurrent variant.

and this is what I am seeing - I was seeing a non-empty collection as a HashSet which causes the problem. I
Comment 11 Lukas Krejci 2012-03-27 08:34:14 EDT
IMHO, as I mentioned in that long post, the solution is not to re-introduce the copy-before-iteration concept, because I think it actually doesn't solve the CME problem in general, but to make sure that we are using the concurrent-safe collections in all workflows.
Comment 12 Charles Crouch 2012-03-27 09:27:49 EDT
So looking at https://bugzilla.redhat.com/show_bug.cgi?id=801432#c6 is the issue finding out where this Resource instance came from? (since it couldn't have been created by InventoryManager since all the places its calls new Resource it passes in a ConcurrentHashMap based set for childresources, except when doing things like new Resource(resourceError.getResource().getId())))

If that is the issue, then can't you just temporarily add a member variable to Resource and in 

public Resource(Set<Resource> childResources)

set it to be Thread.currentThread().getStackTrace(). Then when you're in the debugger you can see where this rogue Resource is being created from.
Comment 13 John Mazzitelli 2012-03-27 13:43:03 EDT
I see server-side calls like this:

            Set<Resource> unknownResources = configuration.getServerServices().getDiscoveryServerService()
                .getResources(unknownResourceIds, true);

and through the debugger, I see coming back over the wire the child set is of type HashSet as expected. Now, the question is - do we squirrel away these resources that have child HashSet objects or do we operate on the hashset child object in a recursive way (while modifying those sets)? If so, we need to convert those sets to concurrent versions.

I am still not able to reproduce any ConcurrentModificationExceptions - I'm just going by the original BZ description and looking at the code and stepping through the debugger to find where it might occur. But I've yet to actually see the problem.
Comment 14 John Mazzitelli 2012-03-27 16:07:47 EDT
(In reply to comment #8)
> after staring at this for hours, I think I might see something:
> 
> from InventoryManager.removeResourceAndIndicateIfScanIsNeeded

I've examined this and I do not see any HashSet problems in here so far.
Comment 15 John Mazzitelli 2012-03-27 16:55:36 EDT
added test code to try to capture the stack trace when this happens, since I
can't replicate.

bug #807465 was written to remind us to delete that test code before release
Comment 16 Charles Crouch 2012-06-05 10:37:14 EDT
All commits were made for RHQ4.4. all outstanding issues should have been addressed either in this bug or in related issues: 797999, 807465
Comment 17 Heiko W. Rupp 2013-09-01 06:07:17 EDT
Bulk closing of items that are on_qa and in old RHQ releases, which are out for a long time and where the issue has not been re-opened since.
Comment 18 Jay Shaughnessy 2014-07-24 22:26:43 EDT
OK, so in RHQ 4.12-ish I stumbled onto the log message added for this issue (see also Bug 807465).  In my agent log I saw messages like this when uninventorying resources:

2014-07-24 09:05:41,859 DEBUG [WorkerThread#0[127.0.0.1:54666]] (rhq.core.pc.inventory.InventoryManager)- [BZ 801432]
java.lang.Exception: Unexpected child set - if you see this, please notify support or log it in bugzillajava.util.HashSet:10230:test-simple.war
	at org.rhq.core.pc.inventory.InventoryManager.removeResourceAndIndicateIfScanIsNeeded(InventoryManager.java:1551)


This is a problem because it means that we are perhaps more susceptible to ConcurrentModificationException than we may have thought.  This indicates that we are working with HashSet when we are expecting to have the child resources in a safer Set impl.

After some investigation I found some issues, leading to this commit:

master commit 3e47741694b55fedecf9c904d89de11a7b92ccb2
Author: Jay Shaughnessy <jshaughn@redhat.com>
Date:   Thu Jul 24 22:22:55 2014 -0400

    Uninventory triggered our "assert-style" debug logging indicating that we
    were working with HashSet for child resources when we expected to be using
    a safer Set impl (wrt preventing ConcurrentModoficationException). There
    were a few issues in play:
    - When loading persisted resources from disk convert childResources to
      CopyOnWriteArraySet if necessary.  This repairs resources persisted with
      the wrong Set impl.
    - Eliminate some potential concurrent modification danger (or, at a minimum,
      some unnecessary work) by only calling deactivateResource() on the root
      resource. It recursively deactivates the subtree, so no need to then call
      it on every node in the subtree.
    - Fix a couple of other places where we weren't using CopyOnWriteArraySet
    - Fix Resource entity to ensure that when
      customChildResourcesCollection=true that the childResources Set is truly
      protected. It was here that we were losing the proper Set impl.
    - Fix our "assert-logging" conditional, which despite finding this issue was
      actually looking for the wrong set impl. It had not been updated when we
      moved to CopyOnWriteArraySet.