+++ This bug was initially created as a clone of Bug #1122625 +++ Description of problem: Deadlocks detected in postgresql when multiple systems are registering simultaneously, e.g., **** 2014-07-08 05:49:45.027 EDT ERROR: deadlock detected 2014-07-08 05:49:45.027 EDT DETAIL: Process 32173 waits for ShareLock on transaction 15425362; blocked by process 31792. Process 31792 waits for ShareLock on transaction 15425338; blocked by process 31758. Process 31758 waits for ShareLock on transaction 15425340; blocked by process 32573. Process 32573 waits for ExclusiveLock on tuple (433,6) of relation 18339 of database 16384; blocked by process 32173. Process 32173: SELECT rhn_channel.subscribe_server(1000156735, E'130', 0) Process 31792: SELECT rhn_entitlements.entitle_server(1000156557, E'enterprise_entitled') Process 31758: SELECT rhn_channel.subscribe_server(1000158308, 200, 0) Process 32573: SELECT rhn_channel.update_family_counts(1024, 1) **** Version-Release number of selected component (if applicable): spacewalk-backend-server-2.0.3-29.el6sat.noarch spacewalk-schema-2.0.2-13.el6sat.noarch satellite-schema-5.6.0.18-1.el6sat.noarch How reproducible: 100% by customer Steps to Reproduce: 1.) Do a mass kickstart of ~3,800 systems (over the course of ~3 hours, approximately 1 system registration every 2-3 seconds) using the same activation key. 2.) See 'deadlock detected' errors in database and httpd error logs. Actual results: Deadlocks detected. Expected results: No deadlocks. Additional info: Postgres database log attached, showing the full deadlock output. It looks like there are a few different entry points in the system registration code to accessing the rhnPrivateChannelFamily table, and inconsistent use of locking may be the problem: **** The xmlrpc entry point is: xmlrpc/registration.create_system calls rhnServer/server_class.server_token which calls rhnServer/server_token.process_token which first calls rhnServer/server_token.entitle [called once for each token] then later calls rhnServer/server_token.token_channels Looking at the entitle and token_channels functions in turn: a.) rhnServer/server_token.entitle: calls rhn_entitlements.entitle_server [once for each entitlement] which calls rhn_server.insert_into_servergroup which [if is_virt=1] calls rhn_entitlements.repoll_virt_guest_entitlements [once for each server group] which calls rhn_channel.update_family_counts [once for each channel family] [assumes read lock] *** b.) rhnServer/server_token.token_channels: first calls rhn_channel.subscribe_server [w/ recalc_family_in parameter set to 0] then later calls rhn_channel.update_family_counts [once for each channel family] [assumes read lock] *** rhn_channel.subscribe_server does the following: c.) rhn_channel.subscribe_server: first calls rhn_channel.obtain_read_lock [always] then [if recalc_family_in=1] calls rhn_channel.update_family_counts [assumes read lock] *** **** So, there are three different spots where rhn_channel.update_family_counts can be called, but only one of them, rhn_channel.subscribe_server, bothers to call rhn_channel.obtain_read_lock, which locks rhnPrivateChannelFamily for the given channel family id and org id: select created into read_lock from rhnPrivateChannelFamily where channel_family_id = channel_family_id_in and org_id = org_id_in for update; Even so, subscribe_server is called with recalc_family_in=0, so it obtains the lock but doesn't actually call update_family_counts. The other two places in the code that call update_family_counts, database function rhn_entitlements.repoll_virt_guest_entitlements and rhnServer/server_token.token_channels in python, do not obtain this lock, but should, to make sure that consistent locking takes place and deadlocks are avoided.
--- Additional comment from Stephen Herr on 2014-10-21 11:11:59 EDT --- I don't think the problem is lack of explicit lock aquiring. rhn_channel.update_family_counts (since it's an update statement) will always aquire a ROW EXCLUSIVE lock on the appropriate rows of rhnPrivateChannel family, the same lock as the rhn_channel.obtain_read_lock (poorly named, as that's an update lock not a read lock) method will aquire if it is called. Deadlocks are caused by transaction 1 holding a lock on table X and attempting to get a lock on table Y at the same time that transaction 2 holds a lock on table Y and attempts to get a lock on table X. In this case it looks to me like tables X and Y are rhnPrivateChannelFamily and rhnServerGroup, and that there are paths through rhn_entitlements.entitle_server and rhn_channel.subscribe_server that require locks on both of them and aquire them in reverse order. The fix should be ensuring that the locks are aquired in the same order in either path, which may or may not be resolved by adding more explicit calls to rhn_channel.obtain_read_lock. I'll investigate further. --- Additional comment from Stephen Herr on 2014-10-21 15:28:24 EDT --- I believe I have found a problem that could cause this behavior, although there's no way to know for sure if this is *the* problem that is causing this issue for the customer. If you register a system with an activationKey then it will eventually drop down to server_token.py:process_token(), where we (in this order): call token_obj.entitle which will lock rhnServerGroup call token_channels which will lock rhnPrivateChannelFamily If you register a system without an activationKey then it will eventually drop down to server_class.py:__save where we (in this order); call rhnChannel.subscribe_server_channels which will lock rhnPrivateChannelFamily call self.autoentitle which will lock rhnServerGroup If you are registering systems with an activationKey constantly due to the massive kickstarting effort that is going on then it is not unlikely that if you happen to register a system without a registration key that you would hit this potential deadlock situation.
Committing to Spacewalk master: 2d58184cf3ec59b62fad3cb14e11e60e885b9380
Moving bugs to ON_QA as we move to release Spacewalk 2.3
Spacewalk 2.3 has been released. See https://fedorahosted.org/spacewalk/wiki/ReleaseNotes23
Reopening, as we have encountered this issue once again. Determining the root cause of the deadlock and a workaround or hotfix is urgently needed. We did the following to reproduce: 1) Ran latest satellite 5.7 with latest packages all updated 2) Run external postegres database 3) Upon restarting postgres and satellite. The first deadlock is seen. Wed Mar 16 09:56:12 CDT 2016 blocked_pid | blocked_user | blocking_statement | blocking_duration | blocking_pid | blocking_user | blocked_statement | blocked_duration -------------+--------------+------------------------------------------------------------------------------+-------------------+--------------+--------------- 29915 | satadmin | SELECT rhn_channel.update_family_counts(1020, 2) | 00:00:00.542909 | 30147 | satadmin | SELECT rhn_channel.subscribe_server(1000330104, This appears to match with this bug, and customer is still getting this error. Thanks.
Re-closing CURRENTRELEASE, for two reasons: * The issue that caused it to be reopened is a different codepath than that addressed in this BZ, and * with the release of SW2.5, Spacewalk no longer counts consumption. This eliminated the deadlocking codepath completely.