Bug 886211
| Summary: | Exception Thrown when Deleting Consumers | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | [Community] Candlepin (Migrated to Jira) | Reporter: | Dennis Crissman <dcrissman> | ||||||
| Component: | candlepin | Assignee: | Devan Goodwin <dgoodwin> | ||||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Eric Sammons <esammons> | ||||||
| Severity: | unspecified | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | 0.9 | CC: | dcrissman, dgoodwin, jallen | ||||||
| Target Milestone: | --- | Keywords: | Reopened | ||||||
| Target Release: | --- | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2013-01-10 19:55:04 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Attachments: |
|
||||||||
Can't pick this up right at the moment but if anyone starts looking at it, the stack trace indicates a deadlock happening during consumer delete, revoking all entitlements, going into a listModifying call. We totally don't need to do this if we're just deleting the consumer itself entirely and revoking all certs, so the problem may be avoidable that way. Not sure if this will be reproducible on postgresql, you'd need to play with modifier entitlements and the ordering in which they get deleted I think, it's possible. It's also possible it's mysql specific though. We're unable to reproduce in postgresql, and the suspected cause (modifier entitlements) do not appear to be in play here, I listed all subscriptions in this org, checked all the content for modifiedProductIds, none of them have any. So the deadlock is just happening by deleting entitlements and then checking for modifier entitlements, even though none exist. We can't provide a solid reproducer but if IT has one, I think I have a fix, as the code that's running here technically doesn't need to run when we're deleting the whole consumer, so I can make it conditional on that. Just to confirm that dcrissman's reproduce path is reliable and quicker than the longer one I initially reproduced it with. Also, only seeing this behavior at stage - can't reproduce at dev, qa, or prod. Confusing, because CP versions are the same at stage and qa and I tested extensively at qa... Fix is pending review. The issue was in deleting all entitlements for a consumer, we were re-using a method that is used when deleting just one entitlement. That method tries to find any other entitlements for the consumer which need to be updated (due to a feature in candlepin), but clearly when revoking all we don't need to do this. So to fix I've skipped this call when doing a mass revoke. We still haven't seen this on postgresql, it's probably worth having a DBA take a look on mysql if possible, as I'm not totally sure how this is deadlocking. It might have something to do with nested transactions, or a transaction not getting closed. candlepin.git master 7832f41c6a6982e5a467b1c26ab405ea2bd9b0b8 cherry-picking to 0.7.13-hotfix branch, will be in 0.7.13.7. Changes are now in stage. Please verify. Thanks, Dennis Unfortunately the problem is still there - no change. Here is an updated exception from today. It does look very much like the earlier one. http://pastebin.test.redhat.com/120302 This issue is attached to https://url.corp.redhat.com/INC000000615498. Issue looks pretty bad, sad to see it resurface in another area. Questions for IT: Is there any chance you guys could provide more logging? Anything around the stack trace could be useful, ideally with debug logging enabled: log4j.logger.org.candlepin=DEBUG in candlepin.conf When this happens, does a repeat attempt to delete the consumer succeed? What if mysql was restarted? For engineering, this is what I have so far: Anil has provided this info: """ The two sqls associated with the locks are below. The first sql was waiting to obtain a lock while the second was holding it .. but eventually the second transaction was rolled back because of the dead lock delete from cp_cert_serial where expiration<='2012-12-12' and revoked=1 update cp_cert_serial set created='2012-05-08 20:34:05', updated='2012-12-13 13:29:05', collected=0, expiration='2012-11-29 23:59:59', revoked=1 where id=1082569973189627258 Can you please check from what functionality of the application calls the above two sqls Thanks Anil """ The first query looks to be from CertificateSerialCurator.deleteExpiredSerials. I believe the only place this is called is from the CrlGenerator, which is in turn only used by the CertificateRevocationListTask (pinsetter job to regenerate the CRL), and by an explicit api call to get the current CRL in CrlResource. Interestingly enough, I cannot find any @Transactional annotations *anywhere* in this call-stack. Will research if there is a default transaction or not. The second query looks like it's a part of the actual delete consumer routine that's failing. (CandlepinPoolManager.revokeAllEntitlements) So based on this it looks like it happens when two jobs collide, a CRL regenerate and a consumer delete somehow trouncing on each other. Changes have been pushed into stage. Please verify. Thanks. Everything is working properly now. Closing issue. Oops. Closed the wrong ticket. This is still an issue being worked on. Sorry for any confusion I may have caused. Created attachment 666973 [details]
Exception Continues
Sadly, the exception continues. Please see my last update for the attached exception. <anil> dcrissman: latest lock was at 9:17:56 <anil> delete from cp_cert_serial where expiration<='2012-12-20' and revoked=1 <anil> update cp_cert_serial set created='2012-12-20 16:57:29', updated='2012-12-21 09:17:56', collected=0, expiration='2013-02-23 23:59:59', revoked=1 where id=7009666720645816922 *** Bug 891314 has been marked as a duplicate of this bug. *** We found that the qrtz_cron_triggers table had multiple rows with varying trigger times, causing the CRL job to fire ever 2 seconds, 5 seconds, in addition to the correct every 2 hours. The frequency of this caused the deadlocks that we were getting. mysql> select * from QRTZ_CRON_TRIGGERS; +----------------------+--------------------------------------------------------------------+---------------+-----------------+------------------+ | SCHED_NAME | TRIGGER_NAME | TRIGGER_GROUP | CRON_EXPRESSION | TIME_ZONE_ID | +----------------------+--------------------------------------------------------------------+---------------+-----------------+------------------+ | ITClusteredScheduler | CertificateRevocationListTask-0f63b281-8046-4f6b-a80a-4053a658d9c6 | cron group | 0/5 * * * * ? | US/Eastern | | ITClusteredScheduler | CertificateRevocationListTask-2cff169b-01ef-4f54-a946-5e441c4cdad2 | cron group | 0/2 * * * * ? | US/Eastern | | ITClusteredScheduler | CertificateRevocationListTask-4db75071-193c-484f-b896-5ed371e3fb62 | cron group | 0 0 0/2 * * ? | America/New_York | | ITClusteredScheduler | CertificateRevocationListTask-53c12dba-09c8-4ddc-90a0-47802ac2944f | cron group | 0/2 * * * * ? | US/Eastern | | ITClusteredScheduler | CertificateRevocationListTask-939f284e-2695-4823-a90a-43be1bb8702a | cron group | 0 0 0/2 * * ? | US/Eastern | | ITClusteredScheduler | CertificateRevocationListTask-a5b8a23c-a2fa-495c-924b-574a645bc68f | cron group | 0 0 0/2 * * ? | US/Eastern | | ITClusteredScheduler | JobCleaner-ceff3cb3-23d0-4be9-9cb9-6742d422a720 | cron group | 0 0 12 * * ? | US/Eastern | +----------------------+--------------------------------------------------------------------+---------------+-----------------+------------------+ 7 rows in set (0.00 sec) By removing the erroneous rows, things appear to be back to normal. At this time we are waiting for 2 hours until the job fires to make sure that things continue to work properly. verification is still ongoing... Jon and I are no longer able to reproduce this error. Moving to verified status. Caused by Candlepin bug where every time you changed the schedule for a job in candlepin.conf: pinsetter.org.candlepin.pinsetter.tasks.CertificateRevocationListTask.schedule=0 0/4 * * * ? A new trigger was created without cleaning up the old, so each change results in a new schedule for the same job. In stage's case we had two servers simultaneously trying to regenerate CRL every 2 seconds, another every 5 seconds, in addition to the more normal schedules. We believe this very quickly lead to the deadlocks. They are still theoretically possible but will be very hard to hit, and should be possible to just re-try the deletion and see success. This config issue will be fixed in master: 0bf2760a3b47ba7d593e06572e694b393f84e07a Will appear in both candlepin-0.7.24-1 as well as candlepin-0.7.13.10-1. Closing as these changes are now in production and all appears well. |
Created attachment 661608 [details] Exception Stacktrace Reproducing this bug requires a little playing around with it, but the end result is a database deadlock and the attached exception. I reproduced by creating two virtual host consumers, adding all available subscriptions to each. Then I deleted the first one I created just fine, but when got the exception when deleting the second. This second consumer still exists with uuid 6186b33f-b22f-462c-a723-1dc80ea58012, owner is rhn-cservice-acarter. ===== jallen reproduced it this way: In RHSM Web, create the following system units (consumers), all virtual: powerpc x86_64 itanium s390 x86 sparc solaris system z (ie, one of each available type). As you create each one, select Attach a Subscription. When the Attach a Subscription modal window opens, deselect the "Only show subscriptions that match this Virtual System" checkbox. Select all subscriptions and click Attach Selected. Once all of these system units are selected, navigate to the Units page. Select one of the units you created and click Delete Selected. When the "Are you sure" dialog pops up, click Delete. This error banner appears: [500] [Internal Server Error]