823942 – Deadlock in plugin container (possible issue with resource hierarchy traversal)

Bug 823942 - Deadlock in plugin container (possible issue with resource hierarchy traversal)

Summary: Deadlock in plugin container (possible issue with resource hierarchy traversal)

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Agent
Sub Component:
Version:	4.4
Hardware:	All
OS:	All
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	RHQ 4.5.0
Assignee:	Jay Shaughnessy
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	jon310-sprint11, rhq44-sprint11 825331
TreeView+	depends on / blocked

Reported:	2012-05-22 13:43 UTC by Jay Shaughnessy
Modified:	2013-09-01 10:14 UTC (History)
CC List:	1 user (show)
Fixed In Version:
Clone Of:
Clones:	825331 (view as bug list)
Environment:
Last Closed:	2013-09-01 10:14:18 UTC
Embargoed:

Attachments	(Terms of Use)
agent log (1.17 MB, text/plain) 2012-05-22 13:43 UTC, Jay Shaughnessy	no flags	Details
thread dump (22.63 KB, text/plain) 2012-05-22 13:44 UTC, Jay Shaughnessy	no flags	Details
View All

Description Jay Shaughnessy 2012-05-22 13:43:14 UTC

Created attachment 586040 [details]
agent log

Deadlock detected in PC while testing with AS7 inventory.  Agent log and thread dump attached.

Notes:
- Async avail checking may have been turned on

Comment 1 Jay Shaughnessy 2012-05-22 13:44:01 UTC

Created attachment 586041 [details]
thread dump

Comment 2 Jay Shaughnessy 2012-05-22 23:48:21 UTC

commit ade022c91458c79e472ba62841735de985db58a6

    Tweek the MeasurementManager initialization to leverage the
    ResourceContainer because it may not be valid to assume the
    Resource hierarchy can be correctly accessed as is. Also, add
    some logging in the area of code that seemed like it may have
    been causing a block.

IRC chat about this:

(3:51:55 PM) jshaughn: The thread dump analysis did not implicate async avail checking or discovery, or avail checking for that matter
(3:52:19 PM) jshaughn: the culprit seemed, as unlikely as it may be, to be MeasurementManager init
(3:53:06 PM) jshaughn: the symptoms seemed to indicate that  MM.initialize was not completing
(3:53:16 PM) jshaughn: perhaps was stuck in a loop.
(3:54:10 PM) jshaughn: also, InventoryManager.initialize was taking quite a while, and so various executors were starting up and having to wait because PC init was not done
(3:54:52 PM) jshaughn: after 60s of waiting for a PC lock you get theat deadlock message
(3:55:09 PM) jshaughn: but really the PC was still initializing
(3:55:35 PM) jshaughn: the AS7 standalone server components seem to take about 30s to start by the way
(3:57:41 PM) jshaughn: mazz and I see one tweak we could make to the MM init that could possibly help things
(3:57:58 PM) jshaughn: we may just make that change and see whether this thing re-occurs ever
(3:58:10 PM) ips: it doesn't seem right to me that getInventoryManager() and the other get*Manager() methods use that read lock
(3:58:37 PM) ips: because PC.initialize() obtains the write block before calling initialize on all the managers
(3:58:44 PM) ips: write lock
(3:58:47 PM) jshaughn: we don't want other threads calling into the PC before it's initialized
(3:59:47 PM) ips: but then most of the managers will end up blocking for 60s in their init() methods when they try to call PC.get*Manager() methods
(4:00:02 PM) jshaughn: A manager's init may call upon a previously initialized manager but external threads can not
(4:00:21 PM) ips: so it seems it would slow down the entire PC startup and potentially cause the managers to start up in the wrong order
(4:00:22 PM) jshaughn: so the ordering is important
(4:00:24 PM) mazz: ips: they won't
(4:00:34 PM) mazz: the main init thread already has the write lock
(4:00:48 PM) mazz: so all the getters automatically get the read lock and it is very fast obtaining that read lock
(4:01:08 PM) mazz: this is working as intended.
(4:01:15 PM) ips: ah, so as long as it's the same thread it already has the lock..
(4:01:18 PM) mazz: right
(4:01:22 PM) jshaughn: right
(4:01:31 PM) ips: hmm
(4:02:00 PM) mazz: the problem was - the MeasurementManager.reschedule either was in an infinite loop or just was taking a long time (and you may not have waited long enough)
(4:02:15 PM) mazz: either way, the problem was the measurement manager did not initialize in a timely manner
(4:02:17 PM) ips: 30s for the as7 server component to start seems long. i'll look into why that's so slow
(4:02:19 PM) mazz: now, what we do is
(4:02:29 PM) mazz: after 60s, we "release the hounds" and let the threads all go
(4:02:38 PM) mazz: crossing our fingers that things wont' break
(4:02:49 PM) mazz: and if they do break, cross our fingers the PC will self-correct over time
(4:03:02 PM) mazz: that's why this locking is useful - because it allows us to free the PC to continue
(4:03:04 PM) mazz: after that 60s
(4:03:18 PM) jshaughn: one thing I don't love is that avail and discovery and measurement collection executors all get going a little early in the process, I'm not sure 60s is enough of a delay
(4:03:21 PM) mazz: so yeah, things might break, but it should let the PC move forward
(4:03:35 PM) mazz: that 60s is definitely arbitrary
(4:03:40 PM) mazz: it could be whatever
(4:04:04 PM) jshaughn: because you get deadlock messages if they hit that 60s limit at startup
(4:04:17 PM) mazz: right. that is what I was talking about ^^^
(4:04:31 PM) mazz: we spit out a message saying "hey, we think there is a problem"
(4:04:45 PM) mazz: "and we are breaking the deadlock - you might want to look to see what's going on"
(4:05:24 PM) mazz: that's what this is all about:
(4:05:32 PM) ips: what i was seeing is service discovery taking forever. i could tell because when i ran discovery -f it told me a scan was already in progress
(4:05:33 PM) mazz:     private Lock obtainReadLock() {
(4:05:33 PM) mazz:         // try to obtain the lock, but if we can't get the lock in 60 seconds,
(4:05:33 PM) mazz:         // keep going. The PC is usually fine within seconds after its initializes,
(4:05:33 PM) mazz:         // so not getting this lock within 60 seconds probably isn't detrimental.
(4:05:33 PM) mazz:         // But if there is a deadlock, blocking forever here would be detrimental,
(4:05:33 PM) mazz:         // so we do not do it. We'll just log a warning and let the thread keep going.
(4:05:56 PM) ips: and not all descendant resources of the AS5 server got discovered
(4:05:57 PM) mazz: that comment explains the purpose of the lock and the timeout
(4:06:06 PM) jshaughn: (ips, see line 3485 of the agent logyou provided for an example of that Standalone Server start time)
(4:06:37 PM) jshaughn: yeah, the discovery may have been blocked forever by MM not finishing initializing
(4:07:00 PM) mazz: it actually won't be blocked forever
(4:07:01 PM) jshaughn: but mazz, would you even get an agent prompt if the PC didn't finish initializing?
(4:07:06 PM) mazz: after 60s, we free it up
(4:07:38 PM) mazz: so it will be SLOW
(4:07:43 PM) mazz: but it won't hang forever
(4:07:53 PM) mazz: each time a getXXXManager is called, it willl wait 60s
(4:08:05 PM) mazz: but if it can't lock, it breaks the deadlock, and still gets you the manager
(4:08:41 PM) mazz: this is what that comment was trying to say  above - its detrimental to block forever, so we don't do it. its "probable" the PC is OK after 60s - so we'll let you have the manager after that timeperiod
(4:08:42 PM) jshaughn: ok, so then how do we explain what ips was seeing?  Are you saying each request for a manager will hang for 60s?  That would seem like forever
(4:08:50 PM) mazz: but we log that deadlock message to let yo know something is going wrong
(4:09:01 PM) mazz: that's what I'm saying.
(4:09:13 PM) mazz: it would seem like forever yes if the code keeps calling getXXXManager
(4:09:19 PM) jshaughn: so if MM init did get stuck then you would still get a prompt eventually and so on
(4:09:19 PM) mazz: but its NOT blocked forever :)
(4:09:34 PM) mazz: just really really slow due to the waiting on the MM for those 60s
(4:09:41 PM) mazz: correct
(4:09:58 PM) mazz: and either that MM was in an infinite loop
(4:10:00 PM) jshaughn: mazz, I think we should make the tweak to MM and go from there
(4:10:12 PM) mazz: (in which case, the agent will be slow all the time)
(4:10:35 PM) mazz: or the MM's reschedule is just taking a long time and EVENTUALLY the PC will recover because the MM wilil finish the initialize
(4:10:52 PM) mazz: this is what I mean by the PC self-correcting - eventually, the MM.reschedule will finish, and the agent will continue normally
(4:11:10 PM) mazz: now - if we blocked forever on that lock acquisition (or, rather than use a lock, replace it with checks for some kind of "isStarting" flag)
(4:11:23 PM) mazz: that would be worse - because then it WOULD block forever and the PC would never get a chance to recover
(4:13:01 PM) mazz: well, I suppose that if we poll some kind of "isStarting" flag, that too would work
(4:13:29 PM) mazz: but this uses a standard R/W lock that works well and is designed for this kind of thing
(4:21:12 PM) lzoubek left the room.
(4:28:51 PM) ips: could the MM scheduling be done in a separate thread?
(4:29:27 PM) ips: or is it important for all metrics to be scheduled before MM.init() returns?
(4:34:00 PM) mazz: good question.
(4:34:09 PM) mazz: note that this MM code hasn't been touched since 2008 :)
(4:34:34 PM) mazz: so whose to say it couldn't be improved upon :}
(4:38:02 PM) mazz: I think jshaughn is going to tweek this reschedule code
(4:38:10 PM) mazz: let's see if that fixes something (it might)
(4:38:28 PM) mazz: note that we have seen the child collection returned by resource.getChildResources() not be correct
(4:38:40 PM) mazz: we now need to make sure we only get that through the resource container
(4:39:27 PM) mazz: don't know why that happened, but its interesting that in this reschedule code - the loop is going around resource.getChildResources() rather than through the child resources from the agent's resource container

(5:22:18 PM) ips: interesting..
(5:39:31 PM) ips: so what is the fix? replacing the one PluginContainer.getInstance().getInventoryManager() in initialize() w/ this.inventoryManager?
(5:40:07 PM) ips: i thought that call was not a big deal when in the same thread as PC.init()
(5:40:49 PM) ips: or is the child resources thing what you think was the issue?
(5:42:55 PM) jshaughn: ips, the child resources thiung
(5:43:45 PM) jshaughn: Traversing the hierarchy is safer if done through getting resource from the container
(5:44:06 PM) jshaughn: I'm not sure it shoud be safer, but I've seen it be the case
(5:44:22 PM) ips: hmm
(5:44:23 PM) jshaughn: all in all it's a shot in the dark
(5:44:33 PM) jshaughn: and there is some added logging
(5:45:00 PM) jshaughn: but the thread dump you provided basically indicated that the MM init was stuck in that area
(5:45:12 PM) jshaughn: almost as if it were in some sort of loop
(5:45:46 PM) ips: i thought we kept the res hierarchy rooted at InventoryManager.platform up to date, but perhaps we botch that somewhere
(5:46:12 PM) jshaughn: I saw it recently be the case that it was not in sync with the resource containers
(5:46:37 PM) jshaughn: In my case children were missing
(5:46:44 PM) jshaughn: it was weird
(5:47:04 PM) jshaughn: but traversing the hierarchy seems safer done this way
(5:47:26 PM) mazz: I think the logging part will help the most
(5:47:38 PM) mazz: if this happens again, it should help us figure out what's going on
(5:47:59 PM) jshaughn: yeah, at least if it happens again it will tell us whether it's proceeding normally, or in fact looping
(5:48:01 PM) mazz: ips: if you see this again, turn on agent debugging
(5:48:19 PM) mazz: you can do that on the fly from the prompt (debug -e)
(5:49:15 PM) ips: i can see the local resource var potentially becoming defunct if inv sync kicks in in another thread and updates a bunch of resources
(5:49:51 PM) ips: ideally i think this code should not even run until the initial inv sync is complete
(5:50:15 PM) ips: of course if the server is down there will be no sync at startup
(5:50:35 PM) jshaughn: ips, I think there are issues we also have with inv synch in that way
(5:51:22 PM) jshaughn: If the resource tree is updated while discovery or avail are traversing/recursing it then we don't always cover the bases
(5:51:35 PM) jshaughn: There is an instance of this at the top of your log file
(5:52:07 PM) jshaughn: the parent container "disappears" while recursing the children, and when the child refers to the parent it goes boom
(5:52:17 PM) ips: interesting
(5:52:38 PM) ips: i guess we may need to make a lot more calls to invManager.getResContainer()
(5:52:40 PM) jshaughn: it's not common but it can happen
(5:53:08 PM) ips: just to make sure the res we're about to do something with didn't just get removed from inv
(5:53:13 PM) jshaughn: in this case nothing will really help other than protecting against the fact that the parent container may go away
(5:53:41 PM) jshaughn: but in general, a sync that us handling uninv can bork things in progress, I belive
(5:56:34 PM) jshaughn: ips, did you see the deadlocking/slowness even after a --purgedata?
(5:56:47 PM) ips: actually it look like we do an initial sync synchronously in invManager.init() (activateAndUpgradeResources()), so that's good
(5:57:30 PM) jshaughn: ah, ok.  good.   
(5:57:33 PM) ips: so at at MM.init() and the other managers are dealing w/ a fairly in-sync inventory in the case when the Server is reachable
(5:58:21 PM) ips: jshaughn: yeah, i did a -lu a couple times and still had issues
(5:58:31 PM) jshaughn: ok
(5:58:49 PM) ips: i only saw them clear up this morning after doing a full clean container rebuild and turning off async avail checking
(5:59:50 PM) jshaughn: It makes me uneasy but I think we don't have much choice but to press on at this point and see whether anything re-occurs
(6:36:30 PM) galderz left the room (quit: Quit: This computer has gone to sleep).
(6:36:58 PM) ips: yeah, agreed. i'll keep an eye out for it

Comment 3 Charles Crouch 2012-05-23 13:29:10 UTC

Setting to urgent, pending triage into jon3.1.x

Comment 4 Jay Shaughnessy 2012-05-24 16:40:59 UTC

commit 14d53ea73b219a85d1b54584457ed48a60e1a556

 Prefer resource container Resource when recursing hierarchy
 Given recent optimizations it's necessary to use the ResourceContainer's
 version of a resource when requiring the most up to date version of
 the resource.  This commit ensures that when walking the hierarchy we
 prefer the container's version of the resource when getting children.

commit 5c91b4659d17ec66c01260dfa3339d182558702b

 Update test code related to fixes for this issue.

Comment 5 Jay Shaughnessy 2012-05-25 17:33:59 UTC

Test Notes
This is an implicit code change and no specific test is recommended. Only
that the issue does not re-appear and no agent logging generates
seemingly related errors.

Comment 6 Heiko W. Rupp 2013-09-01 10:14:18 UTC

Bulk closing of items that are on_qa and in old RHQ releases, which are out for a long time and where the issue has not been re-opened since.

Note You need to log in before you can comment on or make changes to this bug.