1121270 – Deadlock in configuration update

Summary: Deadlock in configuration update

Keywords:
Status:	NEW
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Core Server
Sub Component:
Version:	4.12
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Nobody
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-07-18 20:32 UTC by Elias Ross
Modified:	2022-03-31 04:27 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:

Description Elias Ross 2014-07-18 20:32:23 UTC

Description of problem:

Seen by our DBA in the Oracle deadlock trace:

----- Information for the OTHER waiting sessions -----
Session 470:
  sid: 470 ser: 41209 audsid: 397421 user: 87/RHQ
    flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
    flags2: (0x40009) -/-/INC
  pid: 62 O/S info: user: oracle, term: UNKNOWN, ospid: 2045
  client details:
    O/S info: user: rhq, term: unknown, ospid: 1234
    machine: vp25q03ad-hadoop098.iad.apple.com program: JDBC Thin Client
    application name: JDBC Thin Client, hash value=2546894660
  current SQL:
  insert into RHQ_CONFIG (CTIME, MTIME, NOTES, VERSION, id) values (:1 , :2 , :3 , :4 , :5 )
 
Session 398:
  sid: 398 ser: 3403 audsid: 397429 user: 87/RHQ
    flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
    flags2: (0x40009) -/-/INC
  pid: 60 O/S info: user: oracle, term: UNKNOWN, ospid: 2088
  client details:
    O/S info: user: rhq, term: unknown, ospid: 1234
    application name: JDBC Thin Client, hash value=2546894660
  current SQL:
  delete from RHQ_CONFIG where id=:1 
 
----- End of information for the OTHER waiting sessions -----
 
Information for THIS session:
 
----- Current SQL Statement for this session (sql_id=3b79gqjg9d7u2) -----
delete from RHQ_CONFIG where id in (select resourceco1_.CONFIGURATION_ID from RHQ_CONFIG_UPDATE resourceco1_, RHQ_RESOURCE resource2_ where resourceco1_.DTYPE='resource' and resourceco1_.CONFIG_RES_ID=resource2_.ID and (resourceco1_.CONFIG_RES_ID in (:1 )) and resourceco1_.CONFIGURATION_ID<>resource2_.RES_CONFIGURATION_ID)


Version-Release number of selected component (if applicable): 4.12


How reproducible: Unclear. Probably triggered by a plugin update, or possibly by a bulk delete of resources.

Comment 1 Elias Ross 2014-07-18 20:36:07 UTC

Seems to come from:

        at org.rhq.enterprise.server.resource.ResourceManagerLocal$$$view23.uninventoryResourceAsyncWork(Unknown Source) [rhq-server.jar:4.12.0]
        at org.rhq.enterprise.server.scheduler.jobs.AsyncResourceDeleteJob.uninventoryResource(AsyncResourceDeleteJob.java:102) [rhq-server.jar:4.12.0]
        at org.rhq.enterprise.server.scheduler.jobs.AsyncResourceDeleteJob.executeJobCode(AsyncResourceDeleteJob.java:66) [rhq-server.jar:4.12.0]
        at org.rhq.enterprise.server.scheduler.jobs.AbstractStatefulJob.execute(AbstractStatefulJob.java:48) [rhq-server.jar:4.12.0]
        at org.rhq.enterprise.server.resource.metadata.ResourceMetadataManagerBean.removeObsoleteTypes(ResourceMetadataManagerBean.java:201) [rhq-server.jar:4.12.0]
...
        at org.rhq.enterprise.server.resource.metadata.PluginManagerLocal$$$view149.registerPlugin(Unknown Source) [rhq-server.jar:4.12.0]

Comment 2 Elias Ross 2014-07-18 20:39:14 UTC

                // 2) Immediately remove the uninventoried resources by forcing the normally async work to run in-band
                new AsyncResourceDeleteJob().execute(null);

                // 3) Immediately finish removing the deleted types by forcing the normally async work to run in-band
                new PurgeResourceTypesJob().executeJobCode(null);

Forcing both jobs to run in-band like this might not be a good idea. Is it possible to simply chain the two together?

Comment 3 Jay Shaughnessy 2014-07-22 19:08:10 UTC

Right, looking at this it seems dangerous because these in-band executions could overlap scheduled executions.  The question is how to get the behavior we want.  We need the uninventoried resources to go away before we can purge the type completely.  So the uninventory job code must execute to completion before the purge job code runs.  And then we have the issue of existing scheduled jobs, which could already be in-progress.

This should only be an issue for plugin updates executing after server startup, because the scheduled jobs haven't been started prior to the initial plugin update.  I suppose we may also be at risk during on-demand plugin delete/purge.

How to do this seems a little difficult. As far as I know, the Quartz API (at least in our ancient version) cannot determine in a cluster-aware way whether a job is actually executing, although it can for a single node. So I think we may need to do something like:

1) Wait for there to be no resources in an uninventory state
2) Pause the two relevant scheduled jobs
3) Execute the job code synchronously, like we do now
4) Un-pause the jobs

Thoughts?
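
The four steps above could be sketched roughly as follows. This is an illustration only, not RHQ code: `JobControl` and `InBandPurge` are hypothetical names, the job names passed to `pause`/`resume` are assumed to match the job classes quoted in comment 2, and the pending-count supplier stands in for whatever query reports resources still awaiting uninventory. A real version would delegate `pause`/`resume` to the scheduler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

/** Hypothetical abstraction over the scheduler; not an RHQ or Quartz type. */
interface JobControl {
    void pause(String jobName);
    void resume(String jobName);
}

/** Sketch of the proposed wait / pause / execute / un-pause sequence. */
final class InBandPurge {
    // Assumed job names, taken from the job classes mentioned in this bug.
    static final String[] JOBS = { "AsyncResourceDeleteJob", "PurgeResourceTypesJob" };

    private final JobControl scheduler;
    private final IntSupplier pendingUninventoryCount; // hypothetical query
    private final Runnable uninventoryWork;            // stands in for the delete job code
    private final Runnable purgeTypesWork;             // stands in for the purge job code

    InBandPurge(JobControl scheduler, IntSupplier pendingUninventoryCount,
                Runnable uninventoryWork, Runnable purgeTypesWork) {
        this.scheduler = scheduler;
        this.pendingUninventoryCount = pendingUninventoryCount;
        this.uninventoryWork = uninventoryWork;
        this.purgeTypesWork = purgeTypesWork;
    }

    /** Returns true if the in-band work ran, false if the wait gave up or was interrupted. */
    boolean run(int maxPolls, long pollMillis) {
        // 1) Wait until no resources remain in an uninventory state.
        int polls = 0;
        while (pendingUninventoryCount.getAsInt() > 0) {
            if (++polls > maxPolls) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }

        List<String> paused = new ArrayList<>();
        try {
            // 2) Pause the two scheduled jobs.
            for (String job : JOBS) {
                scheduler.pause(job);
                paused.add(job);
            }
            // 3) Run the job code synchronously, uninventory first, then purge.
            uninventoryWork.run();
            purgeTypesWork.run();
            return true;
        } finally {
            // 4) Un-pause whatever was paused, even if step 3 failed.
            for (String job : paused) {
                scheduler.resume(job);
            }
        }
    }
}
```

Resuming in a `finally` block keeps the scheduled jobs from staying paused if the in-band work throws. The sketch does not address a scheduled execution that is already in progress when the pause is issued, nor the window between the check in step 1 and the pause in step 2.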
