Bug 1122626
Summary: | deadlock when multiple system registrations update entitlement counts in rhnPrivateChannelFamily | ||
---|---|---|---|
Product: | [Community] Spacewalk | Reporter: | Tasos Papaioannou <tpapaioa> |
Component: | Server | Assignee: | Tomáš Kašpárek <tkasparek> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Red Hat Satellite QA List <satqe-list> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 2.2 | CC: | byodlows, ggainey, mmello, pgervase, satqe-list, xdmoon |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | spacewalk-backend-2.3.24-1 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | 1122625 | Environment: | |
Last Closed: | 2016-06-28 14:45:41 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1122625 | ||
Bug Blocks: | 1207293 |
Description
Tasos Papaioannou
2014-07-23 16:20:57 UTC
--- Additional comment from Stephen Herr on 2014-10-21 11:11:59 EDT ---

I don't think the problem is a lack of explicit lock acquisition. rhn_channel.update_family_counts (since it's an update statement) will always acquire a ROW EXCLUSIVE lock on the appropriate rows of rhnPrivateChannelFamily, the same lock that the rhn_channel.obtain_read_lock method (poorly named, as that's an update lock, not a read lock) will acquire if it is called.

Deadlocks are caused by transaction 1 holding a lock on table X and attempting to get a lock on table Y at the same time that transaction 2 holds a lock on table Y and attempts to get a lock on table X. In this case it looks to me like tables X and Y are rhnPrivateChannelFamily and rhnServerGroup, and that there are paths through rhn_entitlements.entitle_server and rhn_channel.subscribe_server that require locks on both of them and acquire them in reverse order.

The fix should be ensuring that the locks are acquired in the same order in either path, which may or may not be resolved by adding more explicit calls to rhn_channel.obtain_read_lock. I'll investigate further.

--- Additional comment from Stephen Herr on 2014-10-21 15:28:24 EDT ---

I believe I have found a problem that could cause this behavior, although there's no way to know for sure if this is *the* problem that is causing this issue for the customer.
If you register a system with an activationKey then it will eventually drop down to server_token.py:process_token(), where we (in this order):
1) call token_obj.entitle, which will lock rhnServerGroup
2) call token_channels, which will lock rhnPrivateChannelFamily

If you register a system without an activationKey then it will eventually drop down to server_class.py:__save, where we (in this order):
1) call rhnChannel.subscribe_server_channels, which will lock rhnPrivateChannelFamily
2) call self.autoentitle, which will lock rhnServerGroup

If you are registering systems with an activationKey constantly due to the massive kickstarting effort that is going on, then it is not unlikely that if you happen to register a system without a registration key you would hit this potential deadlock situation.

Committing to Spacewalk master: 2d58184cf3ec59b62fad3cb14e11e60e885b9380

Moving bugs to ON_QA as we move to release Spacewalk 2.3

Spacewalk 2.3 has been released. See https://fedorahosted.org/spacewalk/wiki/ReleaseNotes23

Reopening, as we have encountered this issue once again. Determining the root cause of the deadlock and a workaround or hotfix is urgently needed.

We did the following to reproduce:
1) Ran the latest Satellite 5.7 with all of the latest packages updated
2) Ran an external PostgreSQL database
3) Upon restarting PostgreSQL and Satellite, the first deadlock is seen.

Wed Mar 16 09:56:12 CDT 2016
blocked_pid | blocked_user | blocking_statement | blocking_duration | blocking_pid | blocking_user | blocked_statement | blocked_duration
-------------+--------------+------------------------------------------------------------------------------+-------------------+--------------+---------------
29915 | satadmin | SELECT rhn_channel.update_family_counts(1020, 2) | 00:00:00.542909 | 30147 | satadmin | SELECT rhn_channel.subscribe_server(1000330104,

This appears to match this bug, and the customer is still getting this error. Thanks.
Re-closing CURRENTRELEASE, for two reasons:
* The issue that caused it to be reopened is a different codepath than that addressed in this BZ, and
* with the release of SW2.5, Spacewalk no longer counts consumption. This eliminated the deadlocking codepath completely.
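
For readers following the lock-ordering diagnosis in the comments above, here is a minimal sketch of the general remedy (always take the locks in one fixed order). It is not the Spacewalk code or the committed fix. The two threading.Lock objects are hypothetical stand-ins for the row locks on rhnPrivateChannelFamily and rhnServerGroup, and the function names are made up for illustration.

```python
import threading
from contextlib import ExitStack

# Hypothetical stand-ins for the database row locks on the two tables
# named in this bug; real PostgreSQL locks behave differently in detail.
LOCKS = {
    "rhnPrivateChannelFamily": threading.Lock(),
    "rhnServerGroup": threading.Lock(),
}


def acquire_in_order(stack, names):
    """Acquire the named locks in one global (alphabetical) order.

    Because every caller uses the same order, no two threads can each
    hold one lock while waiting for the other.
    """
    for name in sorted(names):
        stack.enter_context(LOCKS[name])


def register_with_activation_key(log):
    # Illustrative path that needs rhnServerGroup, then rhnPrivateChannelFamily.
    with ExitStack() as stack:
        acquire_in_order(stack, ["rhnServerGroup", "rhnPrivateChannelFamily"])
        log.append("with_key")


def register_without_activation_key(log):
    # Illustrative path that needs the same two locks in the opposite order.
    with ExitStack() as stack:
        acquire_in_order(stack, ["rhnPrivateChannelFamily", "rhnServerGroup"])
        log.append("without_key")


def run_registrations(n=200):
    """Run n registrations concurrently, alternating between the two paths."""
    log = []
    threads = []
    for i in range(n):
        target = (register_with_activation_key if i % 2 == 0
                  else register_without_activation_key)
        threads.append(threading.Thread(target=target, args=(log,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    return log


if __name__ == "__main__":
    print(len(run_registrations()))
```

If each path instead acquired the locks in the order it lists them, two concurrent registrations could each hold one lock and wait forever for the other, which is the situation the comments describe.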