Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 886211

Summary: Exception Thrown when Deleting Consumers
Product: [Community] Candlepin (Migrated to Jira)
Reporter: Dennis Crissman <dcrissman>
Component: candlepin
Assignee: Devan Goodwin <dgoodwin>
Status: CLOSED CURRENTRELEASE
QA Contact: Eric Sammons <esammons>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 0.9
CC: dcrissman, dgoodwin, jallen
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-01-10 19:55:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Exception Stacktrace (flags: none)
  Exception Continues (flags: none)

Description Dennis Crissman 2012-12-11 19:12:43 UTC
Created attachment 661608 [details]
Exception Stacktrace

Reproducing this bug requires a little playing around with it, but the end result is a database deadlock and the attached exception.

I reproduced by creating two virtual host consumers and adding all available subscriptions to each. I deleted the first one I created just fine, but got the exception when deleting the second. This second consumer still exists with uuid 6186b33f-b22f-462c-a723-1dc80ea58012; the owner is rhn-cservice-acarter.

=====

jallen reproduced it this way:

In RHSM Web, create the following system units (consumers), all virtual:
powerpc
x86_64
itanium
s390
x86
sparc solaris
system z
(ie, one of each available type).
As you create each one, select Attach a Subscription.
When the Attach a Subscription modal window opens, deselect the "Only show subscriptions that match this Virtual System" checkbox.
Select all subscriptions and click Attach Selected.

Once all of these system units are created, navigate to the Units page. Select one of the units you created and click Delete Selected. When the "Are you sure" dialog pops up, click Delete.
This error banner appears: [500] [Internal Server Error]

Comment 2 Devan Goodwin 2012-12-11 19:43:05 UTC
Can't pick this up right at the moment, but if anyone starts looking at it: the stack trace indicates a deadlock happening during consumer delete, while revoking all entitlements and going into a listModifying call. We totally don't need to do this if we're deleting the consumer itself entirely and revoking all certs, so the problem may be avoidable that way.

Not sure if this will be reproducible on postgresql; you'd need to play with modifier entitlements and the ordering in which they get deleted, I think, but it's possible.

It's also possible it's mysql specific though.

Comment 3 Devan Goodwin 2012-12-12 19:24:09 UTC
We're unable to reproduce in postgresql, and the suspected cause (modifier entitlements) does not appear to be in play here: I listed all subscriptions in this org and checked all the content for modifiedProductIds, and none of them have any.

So the deadlock is just happening by deleting entitlements and then checking for modifier entitlements, even though none exist.

We can't provide a solid reproducer, but if IT has one, I think I have a fix: the code that's running here technically doesn't need to run when we're deleting the whole consumer, so I can make it conditional on that.

Comment 4 Jon Allen 2012-12-12 20:06:26 UTC
Just to confirm: dcrissman's reproduction path is reliable and quicker than the longer one I initially reproduced it with.
Also, only seeing this behavior at stage - can't reproduce at dev, qa, or prod. Confusing, because CP versions are the same at stage and qa and I tested extensively at qa...

Comment 5 Devan Goodwin 2012-12-13 14:11:58 UTC
Fix is pending review. The issue was that when deleting all entitlements for a consumer, we were re-using a method intended for deleting just one entitlement. That method tries to find any other entitlements for the consumer which need to be updated (due to a feature in candlepin), but clearly when revoking all of them we don't need to do this. So to fix it, I've skipped this call when doing a mass revoke.
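The conditional skip described above can be sketched roughly as follows. This is an illustrative Python model of the Java logic, not Candlepin's actual API; every name here (Consumer, find_modifying_entitlements, and so on) is hypothetical:

```python
# Illustrative sketch (not Candlepin's real API): skipping the
# modifier-entitlement scan during a full consumer delete avoids the
# extra per-entitlement work that was implicated in the deadlock.

class Consumer:
    def __init__(self, entitlements):
        self.entitlements = list(entitlements)

regenerated = []  # certificates regenerated by the modifier scan

def find_modifying_entitlements(consumer, entitlement):
    # Stand-in for the listModifying() query from the stack trace.
    return [e for e in consumer.entitlements if e != entitlement]

def revoke_entitlement(consumer, entitlement, mass_revoke=False):
    consumer.entitlements.remove(entitlement)
    if not mass_revoke:
        # Only needed when the consumer keeps other entitlements.
        for other in find_modifying_entitlements(consumer, entitlement):
            regenerated.append(other)

def revoke_all_entitlements(consumer):
    # The whole consumer is going away: skip the per-entitlement scan.
    for ent in list(consumer.entitlements):
        revoke_entitlement(consumer, ent, mass_revoke=True)

c = Consumer(["ent-a", "ent-b", "ent-c"])
revoke_all_entitlements(c)
print(regenerated)            # [] -- no modifier scans during mass revoke

c2 = Consumer(["ent-a", "ent-b"])
revoke_entitlement(c2, "ent-a")   # single revoke: the scan still runs
print(regenerated)            # ['ent-b']
```

The contrast at the end shows why the fix is safe: the scan only matters for entitlements that survive the operation, and in a mass revoke none do.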

We still haven't seen this on postgresql; it's probably worth having a DBA take a look on mysql if possible, as I'm not totally sure how this is deadlocking. It might have something to do with nested transactions, or a transaction not getting closed.

Comment 6 Devan Goodwin 2012-12-13 14:16:20 UTC
candlepin.git master 7832f41c6a6982e5a467b1c26ab405ea2bd9b0b8
cherry-picking to 0.7.13-hotfix branch, will be in 0.7.13.7.

Comment 7 Dennis Crissman 2012-12-17 18:51:22 UTC
Changes are now in stage. Please verify.

Thanks,
Dennis

Comment 8 Jon Allen 2012-12-18 13:46:22 UTC
Unfortunately the problem is still there - no change.

Comment 9 Dennis Crissman 2012-12-18 14:12:34 UTC
Here is an updated exception from today. It does look very much like the earlier one.

http://pastebin.test.redhat.com/120302

Comment 10 Dennis Crissman 2012-12-18 15:38:38 UTC
This issue is attached to https://url.corp.redhat.com/INC000000615498.

Comment 11 Devan Goodwin 2012-12-18 17:21:30 UTC
Issue looks pretty bad, sad to see it resurface in another area. 

Questions for IT:

Is there any chance you guys could provide more logging? Anything around the stack trace could be useful, ideally with debug logging enabled:

log4j.logger.org.candlepin=DEBUG in candlepin.conf

When this happens, does a repeat attempt to delete the consumer succeed? What if mysql was restarted?

For engineering, this is what I have so far:

Anil has provided this info:

"""
The two sqls associated with the locks are below. The first sql was waiting to obtain a lock while the second was holding it .. but eventually the second transaction was rolled back because of the dead lock

delete from cp_cert_serial where expiration<='2012-12-12' and revoked=1
update cp_cert_serial set created='2012-05-08 20:34:05', updated='2012-12-13 13:29:05', collected=0, expiration='2012-11-29 23:59:59', revoked=1 where id=1082569973189627258

Can you please check from what functionality of the application calls the above two sqls

Thanks
Anil 
"""

The first query looks to be from CertificateSerialCurator.deleteExpiredSerials. I believe the only place this is called is from the CrlGenerator, which is in turn only used by the CertificateRevocationListTask (pinsetter job to regenerate the CRL), and by an explicit api call to get the current CRL in CrlResource. Interestingly enough, I cannot find any @Transactional annotations *anywhere* in this call-stack. Will research if there is a default transaction or not.

The second query looks like it's a part of the actual delete consumer routine that's failing. (CandlepinPoolManager.revokeAllEntitlements)

So based on this it looks like it happens when two jobs collide, a CRL regenerate and a consumer delete somehow trouncing on each other.
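The collision can be illustrated with a toy wait-for graph (a hand-rolled sketch, not how InnoDB actually detects deadlocks): the CRL job's bulk DELETE and the consumer delete's UPDATE each hold a lock on one cp_cert_serial row while waiting for the row the other holds, which is a cycle the database must break by rolling one transaction back:

```python
# Toy wait-for graph: transaction A (the CRL job's bulk DELETE) and
# transaction B (the consumer delete's UPDATE) each hold one row lock
# in cp_cert_serial and wait on the row the other holds -- a cycle.
# Row names and the graph itself are illustrative only.

holds = {
    "crl_delete":      {"serial_row_1"},  # DELETE ... WHERE expiration<=... AND revoked=1
    "consumer_delete": {"serial_row_2"},  # UPDATE ... WHERE id=...
}
waits_for_row = {
    "crl_delete":      "serial_row_2",
    "consumer_delete": "serial_row_1",
}

def blocked_by(txn):
    """Which transaction holds the row that `txn` is waiting for?"""
    row = waits_for_row.get(txn)
    if row is None:
        return None  # txn is not waiting on anything
    return next(t for t, rows in holds.items() if row in rows)

def has_deadlock(start):
    """Walk the wait-for chain; revisiting a transaction means a cycle."""
    seen, txn = set(), start
    while txn is not None and txn not in seen:
        seen.add(txn)
        txn = blocked_by(txn)
    return txn is not None

print(has_deadlock("crl_delete"))  # True
```

With only one of the two jobs running, each chain ends at a transaction that isn't waiting, and no cycle forms, which matches the observation that the deadlock needs both operations in flight at once.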

Comment 12 Dennis Crissman 2012-12-20 17:59:57 UTC
Changes have been pushed into stage. Please verify.

Thanks.

Comment 13 Dennis Crissman 2012-12-20 20:49:27 UTC
Everything is working properly now. Closing issue.

Comment 14 Dennis Crissman 2012-12-20 20:57:51 UTC
Oops. Closed the wrong ticket. This is still an issue being worked on. Sorry for any confusion I may have caused.

Comment 15 Dennis Crissman 2012-12-20 22:07:37 UTC
Created attachment 666973 [details]
Exception Continues

Comment 16 Dennis Crissman 2012-12-20 22:08:30 UTC
Sadly, the exception continues. Please see my last update for the attached exception.

Comment 17 Dennis Crissman 2012-12-21 20:31:43 UTC
<anil> dcrissman: latest lock was at 9:17:56
<anil> delete from cp_cert_serial where expiration<='2012-12-20' and revoked=1
<anil> update cp_cert_serial set created='2012-12-20 16:57:29', updated='2012-12-21 09:17:56', collected=0, expiration='2013-02-23 23:59:59', revoked=1 where id=7009666720645816922

Comment 18 Dennis Crissman 2013-01-03 19:29:12 UTC
*** Bug 891314 has been marked as a duplicate of this bug. ***

Comment 19 Dennis Crissman 2013-01-03 19:52:53 UTC
We found that the qrtz_cron_triggers table had multiple rows with varying trigger times, causing the CRL job to fire every 2 seconds and every 5 seconds, in addition to the correct every 2 hours. The frequency of these firings caused the deadlocks that we were getting.

mysql> select * from QRTZ_CRON_TRIGGERS;
+----------------------+--------------------------------------------------------------------+---------------+-----------------+------------------+
| SCHED_NAME           | TRIGGER_NAME                                                       | TRIGGER_GROUP | CRON_EXPRESSION | TIME_ZONE_ID     |
+----------------------+--------------------------------------------------------------------+---------------+-----------------+------------------+
| ITClusteredScheduler | CertificateRevocationListTask-0f63b281-8046-4f6b-a80a-4053a658d9c6 | cron group    | 0/5 * * * * ?   | US/Eastern       |
| ITClusteredScheduler | CertificateRevocationListTask-2cff169b-01ef-4f54-a946-5e441c4cdad2 | cron group    | 0/2 * * * * ?   | US/Eastern       |
| ITClusteredScheduler | CertificateRevocationListTask-4db75071-193c-484f-b896-5ed371e3fb62 | cron group    | 0 0 0/2 * * ?   | America/New_York |
| ITClusteredScheduler | CertificateRevocationListTask-53c12dba-09c8-4ddc-90a0-47802ac2944f | cron group    | 0/2 * * * * ?   | US/Eastern       |
| ITClusteredScheduler | CertificateRevocationListTask-939f284e-2695-4823-a90a-43be1bb8702a | cron group    | 0 0 0/2 * * ?   | US/Eastern       |
| ITClusteredScheduler | CertificateRevocationListTask-a5b8a23c-a2fa-495c-924b-574a645bc68f | cron group    | 0 0 0/2 * * ?   | US/Eastern       |
| ITClusteredScheduler | JobCleaner-ceff3cb3-23d0-4be9-9cb9-6742d422a720                    | cron group    | 0 0 12 * * ?    | US/Eastern       |
+----------------------+--------------------------------------------------------------------+---------------+-----------------+------------------+
7 rows in set (0.00 sec)

By removing the erroneous rows, things appear to be back to normal.

At this time we are waiting for 2 hours until the job fires to make sure that things continue to work properly.

Verification is still ongoing...

Comment 20 Dennis Crissman 2013-01-03 21:28:17 UTC
Jon and I are no longer able to reproduce this error. Moving to verified status.

Comment 21 Devan Goodwin 2013-01-04 19:59:13 UTC
Caused by a Candlepin bug where, every time you changed the schedule for a job in candlepin.conf:

pinsetter.org.candlepin.pinsetter.tasks.CertificateRevocationListTask.schedule=0 0/4 * * * ?

A new trigger was created without cleaning up the old one, so each change resulted in an additional schedule for the same job. In stage's case we had two servers simultaneously trying to regenerate the CRL every 2 seconds, another every 5 seconds, in addition to the more normal schedules. We believe this very quickly led to the deadlocks. They are still theoretically possible but will be very hard to hit, and simply retrying the deletion should succeed.
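The trigger-accumulation bug and its fix can be sketched with a simplified model (not the real Quartz API; both scheduler functions here are hypothetical): the buggy path appends a fresh trigger on every config change, while the fixed path replaces any existing trigger for the same job.

```python
# Simplified model of the trigger-accumulation bug (not real Quartz).
# Each config change schedules the CRL job again; the buggy scheduler
# keeps the old trigger, so stale schedules pile up in
# qrtz_cron_triggers and the job fires far too often.

buggy_triggers = []   # list: old triggers are never removed
fixed_triggers = {}   # dict keyed by job name: replaced on each change

def schedule_buggy(job, cron):
    buggy_triggers.append((job, cron))

def schedule_fixed(job, cron):
    fixed_triggers[job] = cron   # overwrites any stale trigger

# Three successive config changes, ending on the every-2-hours schedule:
for cron in ["0/2 * * * * ?", "0/5 * * * * ?", "0 0 0/2 * * ?"]:
    schedule_buggy("CertificateRevocationListTask", cron)
    schedule_fixed("CertificateRevocationListTask", cron)

print(len(buggy_triggers))   # 3 -- every-2s and every-5s triggers linger
print(len(fixed_triggers))   # 1 -- only the current schedule remains
```

This mirrors the qrtz_cron_triggers listing in comment 19: several leftover CertificateRevocationListTask triggers with different cron expressions for what should be a single schedule.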

This config issue will be fixed in master: 0bf2760a3b47ba7d593e06572e694b393f84e07a

Will appear in both candlepin-0.7.24-1 as well as candlepin-0.7.13.10-1.

Comment 22 Dennis Crissman 2013-01-10 19:55:04 UTC
Closing as these changes are now in production and all appears well.