Bug 534385 (RHQ-1187)

Summary:	improve performance of resource uninventory
Product:	[Other] RHQ Project	Reporter:	John Mazzitelli <mazz>
Component:	Performance	Assignee:	Joseph Marques <jmarques>
Status:	CLOSED NEXTRELEASE	QA Contact:	Pavel Kralik <pkralik>
Severity:	medium	Docs Contact:
Priority:	high
Version:	unspecified	CC:	mvecera
Target Milestone:	---	Keywords:	Improvement
Target Release:	---
Hardware:	All
OS:	All
URL:	http://jira.rhq-project.org/browse/RHQ-1187
Whiteboard:
Fixed In Version:	1.3	Doc Type:	Enhancement
Bug Depends On:	534390, 534391, 535428
Bug Blocks:

Description John Mazzitelli 2008-11-26 17:18:00 UTC

Currently, when you click "uninventory" in the UI, we attempt to remove the resources synchronously on the UI thread. We need to just mark the resources as "uninventoried" and have an async job purge them all.

(11:51:04 AM) mazz: (asynchronous deletion of resources - i.e. mark a resource as "deleted", don't show deleted resources in the UI and have a job scan for deleted resources and purge them)
(11:51:12 AM) joseph: ah, right.  ; )
(11:52:12 AM) mazz: it looks like we already have a DELETED status: http://jira.rhq-project.org/browse/RHQ-922
(11:52:29 AM) mazz: er... that's not uninventory, nm
(11:52:44 AM) joseph: but that's DELETED
(11:52:52 AM) joseph: riiiiight
(11:53:15 AM) mazz: do we need an UNINVENTORIED status?
(11:53:31 AM) joseph: that would probably be the indicator
(11:53:41 AM) joseph: would naturally work in the UI because we only display COMMITTED
(11:53:48 AM) mazz: we keep DELETED resources around for their audit history, from what I understand
(11:54:09 AM) mazz: I think we should consider trying something like this soon.
(11:54:14 AM) mazz: WE need it for one!
(11:54:17 AM) joseph: yes, well, everything gets blown away when you uninventory today
(11:54:27 AM) mazz: and now that we can support large inventories, it is going to be needed
(11:54:32 AM) joseph: if we wanted to keep history around post uninventory, that would be a large, fundamental change
(11:54:42 AM) mazz: no no - I agree
(11:54:48 AM) mazz: uninventory blows everything away
(11:54:59 AM) mazz: but if I DELETE a child service, its history remains
(11:55:04 AM) joseph: right
(11:55:09 AM) mazz: if I UNINVENTORY it, it goes away
(11:55:17 AM) joseph: correct
(11:55:38 AM) mazz: so... our "uninventory" button would be fast.. just set inventory status to UNINVENTORIED
(11:55:52 AM) mazz: we write a simple quartz job that scans for that status and purges them all
(11:56:36 AM) mazz: or we have a JMS message :)
(11:56:47 AM) mazz: "delete this resource mr. message bean"
(11:56:56 AM) joseph: mazz, yeaup
(11:57:04 AM) mazz: I like that better
(11:57:12 AM) joseph: it would probably collect them on a per agent basis
(11:57:21 AM) joseph: so we can properly delete bottom up
(11:57:40 AM) mazz: well, I'm thinking, you just pass the resource IDs of the ones selected by the user
(11:57:48 AM) mazz: and we do the same thing we do today, only in a JMS bean
(11:57:59 AM) joseph: define "the ones selected by the user"
(11:58:03 AM) mazz: it already handles the bottom up stuff
(11:58:11 AM) mazz: the ones you click the checkboxes on
(11:58:25 AM) joseph: the only thing the out-of-band job would know of is the resources that are UNINVENTORIED
(11:58:28 AM) mazz: click the checkboxes in the UI, click uninventory, those resource IDs you checked are passed to the JMS bean
(11:58:41 AM) mazz: yeah, that's all we know today too
(11:59:01 AM) joseph: but..the sync job that updates the resources would have to update the entire tree
(11:59:03 AM) mazz: the UI passes the selected/checked resource IDs to the SLSB
(11:59:12 AM) joseph: uninventory of a platform needs to update state on nested servers and services too
(11:59:15 AM) mazz: there is no sync job in my scenario
(11:59:26 AM) mazz: the JMS bean calls the SLSB, just like how the UI does it today
(11:59:40 AM) joseph: there needs to be sync work
(11:59:45 AM) mazz: we could have one JMS method invocation per selected resource ID
(11:59:45 AM) joseph: to get the resources hidden from the UI
(12:00:02 PM) joseph: so that they appear to have been deleted
(12:00:18 PM) joseph: how the delete is done after that can be jms via passed ids if you want
(12:00:33 PM) joseph: i'd rather NOT do it that way
(12:00:35 PM) mazz: oh, I'm saying do THAT work synchronously from the UI
(12:00:40 PM) joseph: because i only want a single job doing the async deletes
(12:00:56 PM) joseph: instead of multiple JMS guys
(12:01:00 PM) mazz: you click "uninventory" - the UI calls the SLSB method that is now fast - it just marks them all as UNINVENTORIED
(12:01:04 PM) joseph: right
(12:01:19 PM) joseph: and an async job comes along and figures out how to delete
(12:01:29 PM) mazz: at that point, we send a JMS message saying, "I just marked these resources as UNINVENTORIED, please purge them"
(12:01:41 PM) mazz: now, at that point, it's the same code we have today - nothing is different
(12:01:43 PM) joseph: how about the async job does...
(12:01:57 PM) joseph: "delete any marked platforms"
(12:02:01 PM) mazz: only we are calling the deleteResources SLSB from a JMS bean
(12:02:01 PM) joseph: "delete any marked servers"
(12:02:07 PM) joseph: "delete any marked services"
(12:02:14 PM) joseph: cause the SLSB will take care of the recursion
(12:02:21 PM) mazz: that's more code we'd have to write that we don't have to
(12:02:23 PM) joseph: so we just delete top-down based on category
(12:02:39 PM) mazz: the recursion code is already written (and tested! this was not trivial with all the timeouts and REQUIRES_NEW and all)
(12:02:40 PM) joseph: again, i don't want multiple threads doing this delete
(12:02:52 PM) mazz: it wouldn't be multiple threads - 
(12:03:07 PM) mazz: we can just have a single JMS bean call make a single SLSB call to deleteResources() just like the UI does today
(12:03:13 PM) joseph: two UI ops will be 2 threads
(12:03:24 PM) joseph: a reaper/uninventory job will always be 1
(12:04:06 PM) mazz: I see, what happens if I ask to delete X platforms, using individual UI clicks
(12:04:15 PM) joseph: uninventory doesn't happen often, but if users do happen to be stupid and trigger multiple deletes through multiple, separate UI ops...we can reduce contention by logically serializing by only having a single reaper
(12:04:15 PM) mazz: that's X threads now killing the server
(12:04:22 PM) joseph: right
(12:04:23 PM) mazz: yes, I see your point
(12:04:31 PM) joseph: protect against the ignorant
(12:04:46 PM) joseph: and keep rhq_resource contention as low as possible
(12:04:59 PM) joseph: by logically serializing updates to it
(12:05:08 PM) mazz: well, I think that new recursion code might be difficult
(12:05:25 PM) mazz: well, maybe you can
(12:05:52 PM) mazz: what - you do platforms first? and get the platform IDs, pass them to deleteResources() (let it do the recursion down that tree)... then do servers, then services?
(12:06:15 PM) joseph: would it?  job: loop marked platforms (uninventory platform), loop marked servers (uninventory server), loop marked services (uninventory service)
(12:06:18 PM) jshaughn: can't the reaper just call delete resource in the same order as the ui clicks?
(12:06:28 PM) joseph: the SLSB will handle the recursion
(12:06:49 PM) mazz: yeah, SLSB handles the recursion of the child resources of the marked resource
(12:06:55 PM) joseph: each uninventory is done in a separate transaction, so the results will be visible to the next level
(12:07:04 PM) joseph: level = category
(12:07:19 PM) joseph: jshaughn: that's possible too
(12:07:47 PM) jshaughn: what happens if the uninventory fails for whatever reason. The resources are hidden from view but perhaps sticking around indefinitely.  I suppose this is unlikely.
(12:07:57 PM) joseph: but requires storing the user clicks somewhere, plus then you get into a scenario where the user might have deleted the server, then deleted the platform
(12:08:19 PM) jshaughn: I guess I was thinking the "flag" could help in ordering the requests
(12:08:21 PM) joseph: jshaughn: that'll be handled naturally because the resource state is still UNINVENTORIED, so it will be retried on the next go around
(12:08:51 PM) jshaughn: oh, I see, you're using the state as the flag
(12:08:54 PM) mazz: the marking of UNINV would be in a single tx - it all goes or none of it
(12:09:06 PM) joseph: right
(12:09:18 PM) mazz: that should be a fast operation though
(12:09:45 PM) joseph: updates aren't *that* fast
(12:09:50 PM) mazz: question: if I click "uninventory" on a platform... do I JUST mark that platform's resource as UNINVENTORIED?
(12:09:58 PM) mazz: what about its children? 
(12:10:04 PM) joseph: but in this case they are faster than removing ALL associated data for that resource
(12:10:10 PM) joseph: mark the entire tree
(12:10:11 PM) jshaughn: I think you'd have to do the tree
(12:10:16 PM) mazz: yeah, that's what I think
(12:10:23 PM) mazz: which is gonna be slower
(12:10:24 PM) jshaughn: it still shouldn't be horrible 
(12:10:36 PM) mazz: but as you said, not as slow as deleting the entire set of data associated with all children
(12:10:37 PM) jshaughn: esp compared to uninv
(12:11:42 PM) joseph: this might appear to be nominally slower on tiny inventories, but as the size of the inventory grows there will be no appreciable increase in the time to uninventory vast quantities of resources
(12:13:13 PM) jshaughn: One small diff is that users will not know exactly how long it will be before any re-discovery will take place.  But I think this is fine, especially if uninv is typically used on dead resources.
(12:15:45 PM) joseph: correct, the re-discovery will be variable, but at least they have UI responsiveness and can do other things in the meantime
(12:16:25 PM) jshaughn: so how much work will it be to hide UNINVENTORY resources?  Pretty much none?
(12:16:40 PM) joseph: jshaughn: 0
(12:16:47 PM) jshaughn: nice
(12:16:57 PM) joseph: all meaningful resource-specific ops in the UI today already switch on COMMITTED status
(12:17:22 PM) jshaughn: will it affect inventory sync?
(12:17:36 PM) ccrouch: (11:13:15 AM) jshaughn:  But I think this is fine, especially if uninv is typically used on dead resources.
(12:17:36 PM) ccrouch: unfortunately i think a fairly common use case when people are having problems with jon is to uninventory the platform, clean the agent and reimport
(12:18:01 PM) mazz: as long as we do not sync for things not COMMITTED
(12:18:03 PM) mazz: we should be fine
(12:18:17 PM) mazz: this is no different than if the status was DELETED
(12:18:26 PM) mazz: (i.e. it is something that != COMMITTED)
(12:18:28 PM) jshaughn: The sync does some interesting things, we'll probably just have to take a look
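
The "hiding costs nothing" point from the chat can be illustrated with a small sketch. It assumes only what the discussion states: the UI displays resources whose status is COMMITTED, so any other status, including the proposed UNINVENTORIED, drops out of view. The class and method names are hypothetical and are not taken from the RHQ code base.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only; not RHQ code.
class VisibleResources {
    // COMMITTED and DELETED are mentioned in the discussion; UNINVENTORIED is the proposed new status.
    enum InventoryStatus { COMMITTED, DELETED, UNINVENTORIED }

    // Returns the names of the resources a COMMITTED-only view would show, sorted by name.
    static List<String> visible(Map<String, InventoryStatus> statusByName) {
        return statusByName.entrySet().stream()
                .filter(e -> e.getValue() == InventoryStatus.COMMITTED)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

Because the filter tests for equality with COMMITTED rather than inequality with DELETED, a new status needs no change to the filter itself.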

summary:
* take the resources selected in the UI for uninventory and mark their entire respective resource trees with the UNINVENTORIED status
* a new quartz job comes along and uninventories any resource marked as UNINVENTORIED in a semi-ordered fashion:
** uninventory each marked platform (which will naturally uninventory nested servers and services)
** uninventory each marked server (which will naturally uninventory nested servers and services)
** uninventory each marked service (which will naturally uninventory nested services)

this semi-ordered fashion, by virtue of the fact that the ResourceManagerBean SLSB deletes the resource tree properly when given parent resource ids, will guarantee that each level of the semi-ordered uninventory will only muck with resources that were in resource trees distinct from those of all higher-level uninventories.  in other words, if there is a left-over service that needs to be explicitly uninventoried, we know that it was not part of any of the servers explicitly uninventoried, and that neither the service nor any of those servers were descendants of the explicitly marked platforms.
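
A minimal sketch of the semi-ordered pass described above, using an in-memory map as a stand-in for the inventory. Everything here is hypothetical: the class, the `add` and `deleteTree` helpers, and the `reap` method are illustration only, and `deleteTree` merely imitates the recursive delete that the ResourceManagerBean SLSB is said to perform.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model; none of these names come from the RHQ code base.
class SemiOrderedReaper {
    enum Category { PLATFORM, SERVER, SERVICE }
    enum Status { COMMITTED, UNINVENTORIED }

    static final class Resource {
        final int id;
        final Category category;
        final List<Resource> children = new ArrayList<>();
        Status status = Status.COMMITTED;

        Resource(int id, Category category, Resource parent) {
            this.id = id;
            this.category = category;
            if (parent != null) parent.children.add(this);
        }
    }

    // Creates a resource and registers it in the inventory map.
    static Resource add(Map<Integer, Resource> inventory, int id, Category category, Resource parent) {
        Resource r = new Resource(id, category, parent);
        inventory.put(id, r);
        return r;
    }

    // Stand-in for the existing recursive delete: removes descendants first, then the resource.
    static void deleteTree(Resource r, Map<Integer, Resource> inventory) {
        for (Resource child : r.children) deleteTree(child, inventory);
        inventory.remove(r.id);
    }

    // One reaper pass: platforms, then servers, then services.
    // Returns the ids that had to be deleted explicitly (the roots of each purged tree).
    static List<Integer> reap(Map<Integer, Resource> inventory) {
        List<Integer> roots = new ArrayList<>();
        for (Category category : Category.values()) {
            // Snapshot the marked resources of this category that are still in inventory.
            List<Resource> marked = new ArrayList<>();
            for (Resource r : inventory.values()) {
                if (r.category == category && r.status == Status.UNINVENTORIED) marked.add(r);
            }
            for (Resource r : marked) {
                // A nested server or service may already be gone via its parent's delete.
                if (!inventory.containsKey(r.id)) continue;
                roots.add(r.id);
                deleteTree(r, inventory);
            }
        }
        return roots;
    }
}
```

In this sketch a marked server under a marked platform never reaches the server pass, because the platform pass has already removed it. Only marked resources whose ancestors were not marked show up as roots at the lower levels.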

Comment 1 John Mazzitelli 2008-11-26 17:20:53 UTC

we need to think about how this affects deleting agents too

Comment 2 Heiko W. Rupp 2008-11-27 09:08:04 UTC

When we start doing this, we should also allow individual services to be ignored.
Those should not be uninventoried, but marked as ignored so that they don't immediately show up again after a discovery.
The GUI should perhaps change the 'uninventory' button for those to 'ignore'.
The AD portlet (or the parent's inventory?) should allow bringing them back in.

Comment 3 John Mazzitelli 2009-05-28 17:22:16 UTC

we *need* this implemented.

(btw: in addition to changing the status, we need to change the resource key to something like "~~dummy~~")

Comment 4 Joseph Marques 2009-05-29 14:54:16 UTC

my recent thoughts:

* need a new UNINVENTORIED type for InventoryStatus to distinguish from deleted resources (delete button from inventory tab of the parent resource)

for the in-band work:

* change all statuses to UNINVENTORIED in one fell swoop - there are queries in ResourceManager that show how to do this to a resource tree with at most 6 levels of depth in a single query
* modify the resource key to junk so that the discovery mechanisms on the agent think that the entire resource tree has been successfully deleted
* set the agent references on all uninventoried resources to null - this will prevent the synchronization job from seeing two platforms (the old one with the junk resource key, and the new one after the next discovery run)
* after the first three steps are complete, notify the agent of the uninventory action for the given resources - this should trigger a new auto-discovery, which will work because the ResourceSyncInfo tree that the server side returns to the agent will be the empty set
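
The in-band steps above can be sketched against an in-memory tree. This is an illustration under assumptions, not the actual change: the comment describes doing the status update with a single bulk query, whereas this sketch walks the tree in Java. The `Resource` class, its field names, and `markTree` are hypothetical. The "~~dummy~~" key is the placeholder suggested in comment 3.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical in-memory stand-in for the in-band marking; not RHQ code.
class InBandUninventory {
    enum Status { COMMITTED, UNINVENTORIED }

    static final class Resource {
        final int id;
        String resourceKey;
        Integer agentId; // null means "no agent reference"
        Status status = Status.COMMITTED;
        final List<Resource> children = new ArrayList<>();

        Resource(int id, String resourceKey, Integer agentId, Resource parent) {
            this.id = id;
            this.resourceKey = resourceKey;
            this.agentId = agentId;
            if (parent != null) parent.children.add(this);
        }
    }

    // Marks the whole tree under root and returns the affected ids in breadth-first order.
    // A caller could pass the returned ids to whatever notifies the agent (step four).
    static List<Integer> markTree(Resource root) {
        List<Integer> markedIds = new ArrayList<>();
        Deque<Resource> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Resource r = queue.poll();
            r.status = Status.UNINVENTORIED; // step 1: hide from COMMITTED-only views
            r.resourceKey = "~~dummy~~";     // step 2: junk key so rediscovery does not collide
            r.agentId = null;                // step 3: drop the agent reference
            markedIds.add(r.id);
            queue.addAll(r.children);
        }
        return markedIds;
    }
}
```

The agent notification is deliberately left to the caller so that it happens only after all three updates have succeeded, matching the ordering in the list above.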

for the out-of-band work:

* quartz job can come along, look to see if any resources have the UNINVENTORIED status, and take its sweet time cleaning up after them - no need to batch resources in sizes of 200, the size can be 1 to keep the transaction small, especially considering that resources which have been in inventory for a long time can conceivably have lots of audited data scattered across all parts of the domain
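
A rough sketch of one pass of such a job with a batch size of 1. The names are hypothetical, and the `Consumer` stands in for whatever would purge a single resource in its own transaction. A failed purge simply leaves the id in the marked set, so the next pass retries it, as discussed in the chat above. The sketch does not address when, if ever, to give up on a resource.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch of one out-of-band pass; not RHQ code.
class PurgeJob {
    // Purges marked ids one at a time and returns the ids purged in this pass.
    // purgeInOwnTransaction is assumed to throw a RuntimeException on failure.
    static List<Integer> runOnce(Set<Integer> markedIds, Consumer<Integer> purgeInOwnTransaction) {
        List<Integer> purged = new ArrayList<>();
        // Iterate over a snapshot so that the marked set can be modified safely.
        for (Integer id : new ArrayList<>(markedIds)) {
            try {
                purgeInOwnTransaction.accept(id); // batch size 1: one resource per transaction
                markedIds.remove(id);
                purged.add(id);
            } catch (RuntimeException e) {
                // Leave the id marked; the next scheduled pass will try it again.
            }
        }
        return purged;
    }
}
```

Keeping each purge in its own small unit of work means one slow or failing resource does not roll back the others.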

other thoughts:

* use HibernatePerformanceMonitor utility to profile uninventorying a single resource today (which has a reasonable amount of history across various parts of the domain) to see where our bottlenecks are.  consider reworking the slowest points
* what happens if there is an error during the async uninventory, will that resource forever remain in a loop trying to uninventory itself and always failing?  do we ever mark a resource as "bad" and stop trying to uninventory it?  how would this error even be reported back to one or more registered users of the system now that it's done completely in the background?

Comment 5 Joseph Marques 2009-06-23 07:06:39 UTC

rev4160 - this commit adds asynchronous uninventory and fixes several spots of row contention

[RHQ-1187][RHQ-1191][RHQ-1192] - asynchronous uninventory by setting uninventoried resources agent references to null to stop majority agent-side sync, setting the parent to null to take it out of the object graph, using a special UNINVENTORY inventoryStatus so it doesn't conflict with existing semantics around any other state, and using a dummy resourceKey so that the next discovery doesn't collide; 
[RHQ-1324] - specific timings during uninventory calling reinventory failure are no longer possible because uninventory of the entire resource tree occurs atomically in one bulk update statement, then the agent is notified if successful; 
[RHQ-2124][RHQ-1656][RHQ-1221] - removed hot spots and various other points of contention by shortening transaction times or using indexes as available for: a) uninventory work, b) cloud manager job, c) check for suspect agent job, d) dynagroup recalculation job, e) alerts cache in-band agent and server status bit setting, f) isAgentBackfilled checking

Comment 6 Pavel Kralik 2009-07-03 13:41:09 UTC

Asynchronous uninventory verified.  r4181

Comment 7 Red Hat Bugzilla 2009-11-10 20:27:37 UTC

This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1187
This bug is related to RHQ-914
This bug is related to RHQ-1324
This bug relates to RHQ-2218