Bug 1095939 - during simultaneous unsubscribe --all, we can cause: Runtime Error ERROR: deadlock detected
Summary: during simultaneous unsubscribe --all, we can cause: Runtime Error ERROR: dea...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Candlepin
Classification: Community
Component: candlepin
Version: 0.9
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
: ---
Assignee: Bryan Kearney
QA Contact: Katello QA List
URL:
Whiteboard:
Depends On:
Blocks: rhsm-rhel511
TreeView+ depends on / blocked
 
Reported: 2014-05-08 21:39 UTC by John Sefler
Modified: 2015-05-14 14:52 UTC (History)
1 user (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2014-09-29 17:49:42 UTC
Embargoed:


Attachments (Terms of Use)

Description John Sefler 2014-05-08 21:39:44 UTC
Description of problem:
With a little whitebox knowledge from ckozak, we now have a scenario by which two consumers can somewhat reliably cause a deadlock situation.  The scenario has been automated in our test bucket and the manual description is below.


Version-Release number of selected component (if applicable):
[root@jsefler-7server ~]# subscription-manager version
server type: Red Hat Subscription Management
subscription management server: 0.9.9-1   <=======
subscription-manager: 1.10.14-7.el7
python-rhsm: 1.10.12-2.el7


How reproducible:
often, but not absolutely every time


Steps to Reproduce:
1. from two separate systems, register consumer1 and consumer2 within the same org
2. from each system, list --all --available pools (the more the better)
3. from consumer1 attach all the pools
4. from consumer2 attach all the pools in the reverse order attached by consumer 1 (a different order is crucial)
5  simultaneously unsubscribe --all from each consumer

Actual results (taken from an automated test log):
201405081727:34.995 - INFO: ssh root.redhat.com subscription-manager unsubscribe --all
201405081727:35.127 - INFO: ssh root.redhat.com subscription-manager unsubscribe --all
201405081727:43.152 - INFO: SSHCommandResult from an attempt to unsubscribe all on 'jsefler-7server.usersys.redhat.com': 
exitCode=0
stdout='55 subscriptions removed at the server.
55 local certificates have been deleted.'
stderr=''
201405081727:43.153 - INFO: SSHCommandResult from an attempt to unsubscribe all on 'jsefler-7workstation.usersys.redhat.com': 
exitCode=255
stdout=''
stderr='Runtime Error ERROR: deadlock detected
  Detail: Process 20383 waits for ShareLock on transaction 49111010; blocked by process 24227.
Process 24227 waits for ShareLock on transaction 49111011; blocked by process 20383.
  Hint: See server log for query details. at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse:2,102'


From the log above, you will see that consumer1 was able to successfully unsubscribe from all 55 attached subscriptions.  However, consumer 2 encountered a deadlock runtime error.

Comment 1 Carter Kozak 2014-05-14 14:12:34 UTC
commit 334bcceab9708c9e6457618afe1e581726d89cde
Author: ckozak <ckozak>
Date:   Thu May 8 13:14:50 2014 -0400

    Some preliminary code to avoid deadlock situations
    
    This is not a complete sollution, but does reduce the chances
    of deadlocks in the most common places.
    
    Deadlocks may still occur when removing entitlements that
    create subpools while subpools are being bound (with other pools)
    As well as updating subpools.
    
    We need to lock the source pool before updating derived pools,
    which makes enforcing order much harder.  We might be able to
    get around that by leveraging the "created" date (breaking
    ties with id)

Comment 2 John Sefler 2014-06-17 12:49:02 UTC
Verifying Candlepin Version 0.9.19-1....
[root@jsefler-7server ~]# subscription-manager version
server type: This system is currently not registered.
subscription management server: 0.9.19-1     <=======
subscription-manager: 1.10.14-7.el7
python-rhsm: 1.10.12-2.el7
[root@jsefler-7workstation ~]# subscription-manager version
server type: This system is currently not registered.
subscription management server: 0.9.19-1     <=======
subscription-manager: 1.10.14-7.el7
python-rhsm: 1.10.12-2.el7



The automated test that was implemented to catch the failure in comment 0 is now passing (rhsm.cli.tests.UnsubscribeTests.MultiClientAttemptToDeadLockOnUnsubscribeAll_Test).
Below is a verification excerpt from the automated test log corresponding to the failure point in comment 0.

201406170838:38.857 - INFO: Simultaneously attempting to unsubscribe all on 'jsefler-7server.usersys.redhat.com' and 'jsefler-7workstation.usersys.redhat.com'...
201406170838:38.857 - INFO: ssh root.redhat.com subscription-manager unsubscribe --all
201406170838:38.946 - INFO: ssh root.redhat.com subscription-manager unsubscribe --all
201406170838:56.981 - INFO: SSHCommandResult from an attempt to unsubscribe all on 'jsefler-7server.usersys.redhat.com': 
exitCode=0 stdout='52 subscriptions removed at the server.
52 local certificates have been deleted.
' stderr=''
201406170838:56.981 - INFO: SSHCommandResult from an attempt to unsubscribe all on 'jsefler-7workstation.usersys.redhat.com': 
exitCode=0 stdout='52 subscriptions removed at the server.
52 local certificates have been deleted.
' stderr=''
201406170838:56.982 - INFO: Asserted: Actual value of '0' matches expected value: The exit code from the unsubscribe all command on 'jsefler-7server.usersys.redhat.com'.
201406170838:56.982 - INFO: Asserted: Actual value of '' matches expected value: Stderr from the unsubscribe all on 'jsefler-7server.usersys.redhat.com'.
201406170838:56.982 - INFO: Asserted: Actual value of '0' matches expected value: The exit code from the unsubscribe all command on 'jsefler-7workstation.usersys.redhat.com'.
201406170838:56.982 - INFO: Asserted: Actual value of '' matches expected value: Stderr from the unsubscribe all on 'jsefler-7workstation.usersys.redhat.com'.

Moving to status: VERIFIED

Comment 4 Bryan Kearney 2014-09-29 17:49:42 UTC
Cleaning up old bugs. These were verified but never closed.


Note You need to log in before you can comment on or make changes to this bug.