Bug 1139433 - Deadlock in deletion of RHQ_RESOURCE_ERROR, possibly in resource upgrade?
Summary: Deadlock in deletion of RHQ_RESOURCE_ERROR, possibly in resource upgrade?
Keywords:
Status: ON_QA
Alias: None
Product: RHQ Project
Classification: Other
Component: Core Server, Database
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHQ 4.13
Assignee: Thomas Segismont
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-09-08 22:10 UTC by Elias Ross
Modified: 2022-03-31 04:28 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1145778 0 unspecified ON_QA Purge RHQ_RESOURCE_ERROR table as well 2022-03-31 04:28:40 UTC

Internal Links: 1145778

Description Elias Ross 2014-09-08 22:10:34 UTC
Description of problem:

It seems there is a possibility of deadlock. I'm not sure exactly which SQL statement is the problem, as there is no precise match for it in the code.

Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TX-00060012-00080534        54     208     X             75     107           X
TX-002d0005-0002ae45        75     107     X             54     208           X
 
session 208: DID 0001-0036-24B4BB3C     session 107: DID 0001-004B-0B812C4E 
session 107: DID 0001-004B-0B812C4E     session 208: DID 0001-0036-24B4BB3C 
 
Rows waited on:
  Session 208: obj - rowid = 00014DA7 - AAAVXjAAFAAES3IAAG
  (dictionary objn - 85415, file - 5, block - 1125832, slot - 6)
  Session 107: obj - rowid = 00014DA7 - AAAVXjAAFAAAJp/AAB
  (dictionary objn - 85415, file - 5, block - 39551, slot - 1)
 
----- Information for the OTHER waiting sessions -----
Session 107:
  sid: 107 ser: 1531 audsid: 409493 user: 87/RHQ
    flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
    flags2: (0x40009) -/-/INC
  pid: 75 O/S info: user: oracle, term: UNKNOWN, ospid: 28705
    image: oracle.apple.com
  client details:
    O/S info: user: rhq, term: unknown, ospid: 1234
    machine: vp25q03ad-hadoop098.iad.apple.com program: JDBC Thin Client
    application name: JDBC Thin Client, hash value=2546894660
  current SQL:
  delete from RHQ_RESOURCE_ERROR where RESOURCE_ID=:1  and ERROR_TYPE=:2 
 
----- End of information for the OTHER waiting sessions -----
 
Information for THIS session:
 
----- Current SQL Statement for this session (sql_id=489wxtgjx80ct) -----
delete from RHQ_RESOURCE_ERROR where RESOURCE_ID=:1  and ERROR_TYPE=:2 


My suspicion is this:

    @NamedQuery(name = ResourceError.QUERY_FIND_BY_RESOURCE_ID_AND_ERROR_TYPE, query = "SELECT re FROM ResourceError re WHERE re.resource.id = :resourceId AND re.errorType = :errorType") })

Which is used here:

            // finally let's remove the potential previous upgrade error. we've now successfully
            // upgraded the resource.
            List<ResourceError> upgradeErrors = resourceManager.findResourceErrors(subjectManager.getOverlord(),
                resource.getId(), ResourceErrorType.UPGRADE);
            for (ResourceError error : upgradeErrors) {
                entityManager.remove(error);
            }

But that named query is a SELECT, not a DELETE, while the statement in the deadlock trace is a DELETE. I'm wondering if Hibernate is optimizing this somehow.
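
To make the distinction concrete, here is a minimal sketch of the two SQL shapes Hibernate can emit here (this is not the RHQ code; only the entity, query and error-type names are taken from the snippets above, the rest is assumed): entityManager.remove() deletes each entity by primary key, while a bulk JPQL delete would translate to exactly the kind of statement seen in the trace.

    // Sketch only, assuming a plain JPA EntityManager; not taken from the RHQ code base.
    import java.util.List;
    import javax.persistence.EntityManager;

    public class ResourceErrorDeleteSketch {

        // Shape 1: load-then-remove, as in the upgrade code quoted above.
        // Hibernate deletes each row by primary key ("delete ... where ID = ?").
        static void removeOneByOne(EntityManager em, int resourceId) {
            List<ResourceError> errors = em.createNamedQuery(
                    ResourceError.QUERY_FIND_BY_RESOURCE_ID_AND_ERROR_TYPE, ResourceError.class)
                .setParameter("resourceId", resourceId)
                .setParameter("errorType", ResourceErrorType.UPGRADE)
                .getResultList();
            for (ResourceError error : errors) {
                em.remove(error);
            }
        }

        // Shape 2: bulk JPQL delete. This translates to the statement in the trace:
        // "delete from RHQ_RESOURCE_ERROR where RESOURCE_ID=:1 and ERROR_TYPE=:2".
        static void removeInBulk(EntityManager em, int resourceId) {
            em.createQuery("DELETE FROM ResourceError re"
                    + " WHERE re.resource.id = :resourceId AND re.errorType = :errorType")
                .setParameter("resourceId", resourceId)
                .setParameter("errorType", ResourceErrorType.UPGRADE)
                .executeUpdate();
        }
    }

The bind variables in the deadlocked statement match the second shape, so either Hibernate rewrites the first into the second, or the statement comes from a different, bulk-delete code path.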

Version-Release number of selected component (if applicable): 4.12


How reproducible:

Unclear. Perhaps if two agents connect to one server and call into:

    @Override
    @SuppressWarnings("deprecation")
    public Set<ResourceUpgradeResponse> upgradeResources(Set<ResourceUpgradeRequest> upgradeRequests) {

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Thomas Segismont 2014-10-01 21:22:02 UTC
I've been looking at the code and trying to understand the deadlock report throughout the evening.

Due to the type of locks (row locks) and the query involved, it looks like it's the consequence of concurrent calls to ResourceManagerBean#clearResourceConfigErrorByType with the same parameters.

Such concurrent calls would not necessarily lead to a deadlock; it would happen only if the two sessions lock the rows in a different order. My knowledge of Oracle is limited, but I'd say this is possible (if the two queries get different execution plans), though very unlikely.
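
To illustrate, a minimal, hypothetical reproduction sketch (plain JDBC, nothing from the RHQ code base; the connection URL, credentials and bind values are made up): two sessions run the same bulk DELETE with identical parameters and hold their row locks until commit. This only deadlocks if the two sessions lock the matching rows in a different order, e.g. because they got different execution plans.

    // Hypothetical sketch: two sessions issue the same bulk DELETE concurrently.
    // Each session locks the rows its execution plan visits, in plan order, and
    // holds the locks until commit; a different visiting order in the two
    // sessions can turn this into a deadlock (ORA-00060 for one of them).
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class ConcurrentResourceErrorDelete {

        private static final String URL = "jdbc:oracle:thin:@//dbhost:1521/rhq"; // placeholder

        public static void main(String[] args) throws Exception {
            Runnable session = () -> {
                try (Connection con = DriverManager.getConnection(URL, "rhq", "rhq")) {
                    con.setAutoCommit(false);
                    try (PreparedStatement ps = con.prepareStatement(
                            "delete from RHQ_RESOURCE_ERROR where RESOURCE_ID = ? and ERROR_TYPE = ?")) {
                        ps.setInt(1, 10001);          // same parameters in both sessions
                        ps.setString(2, "UPGRADE");
                        ps.executeUpdate();           // row locks are taken here
                    }
                    Thread.sleep(2000);               // keep the transaction open for a while
                    con.commit();
                } catch (Exception e) {
                    e.printStackTrace();              // one session would see ORA-00060 on deadlock
                }
            };
            Thread t1 = new Thread(session);
            Thread t2 = new Thread(session);
            t1.start();
            t2.start();
            t1.join();
            t2.join();
        }
    }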

How often does it happen?

Maybe purging the table as requested per Bug 1145778 will be enough.

Besides, as only one of the two sessions will be rolled back by Oracle, the data will be deleted anyway.

Comment 2 Elias Ross 2014-10-01 22:49:51 UTC
Sorry to have potentially wasted a lot of your time. I report bugs that I see, often without knowing or reporting the severity or impact.

My thinking, though, is that it isn't really about a single row lock, since the delete can remove N or more rows. Basically, the delete in session 1 is waiting for any inserts or deletes in the other session to complete before it can commit.

I do see something similar with the rhq_config table, where two sessions attempt to delete the same row, but I don't really understand how that could be a deadlock either.

Comment 3 Thomas Segismont 2014-10-02 08:06:12 UTC
(In reply to Elias Ross from comment #2)
> Sorry to have potentially wasted a lot your time. I report bugs that I see,
> often without knowing or reporting the severity or impact.

No worries.

> My thinking, though, is it isn't really a row lock, since the delete can
> delete N or more rows. Basically, the delete in session 1 is waiting for any
> sort of other inserts or deletes to take place in the other session, before
> it can commit.

What makes me think it is two sessions holding write locks on different rows and waiting for each other:

* the "TX" prefix in resource name:
> Deadlock graph:
>                        ---------Blocker(s)--------  ---------Waiter(s)---------
> Resource Name          process session holds waits  process session holds waits
> TX-00060012-00080534        54     208     X             75     107           X
> TX-002d0005-0002ae45        75     107     X             54     208           X

* the "Rows waited on" section:

> Rows waited on:
>   Session 208: obj - rowid = 00014DA7 - AAAVXjAAFAAES3IAAG
>   (dictionary objn - 85415, file - 5, block - 1125832, slot - 6)
>   Session 107: obj - rowid = 00014DA7 - AAAVXjAAFAAAJp/AAB
>   (dictionary objn - 85415, file - 5, block - 39551, slot - 1)

I tried to figure out why the same query could get different execution plans, and one possible reason would be that one session is using the index on resourceId while the other is not.

Comment 4 Thomas Segismont 2014-10-16 12:34:14 UTC
Pull request sent and waiting for review:
https://github.com/rhq-project/rhq/pull/132

Comment 5 Thomas Segismont 2014-11-20 09:36:07 UTC
Merged in master

commit 8efb4f275b98d85a8bee2f12110dd955a1ebf67c
Merge: 86968f0 1f08ab3
Author: Libor Zoubek <lzoubek>
Date:   Thu Nov 13 18:16:58 2014 +0100

    Merge branch 'purgepurgepurge' of https://github.com/tsegismont/rhq into tsegismont-purgepurgepurge
    
    Conflicts:
        modules/core/dbutils/src/main/scripts/dbupgrade/db-upgrade.xml
        modules/enterprise/gui/coregui/src/main/resources/org/rhq/coregui/client/Messages_de.properties
        modules/enterprise/gui/coregui/src/main/resources/org/rhq/coregui/client/Messages_ja.properties
        modules/enterprise/gui/coregui/src/main/resources/org/rhq/coregui/client/Messages_ru.properties

