Bug 1122626

Summary:	deadlock when multiple system registrations update entitlement counts in rhnPrivateChannelFamily
Product:	[Community] Spacewalk	Reporter:	Tasos Papaioannou <tpapaioa>
Component:	Server	Assignee:	Tomáš Kašpárek <tkasparek>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Red Hat Satellite QA List <satqe-list>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	2.2	CC:	byodlows, ggainey, mmello, pgervase, satqe-list, xdmoon
Target Milestone:	---	Keywords:	Reopened
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	spacewalk-backend-2.3.24-1	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:	1122625	Environment:
Last Closed:	2016-06-28 14:45:41 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1122625
Bug Blocks:	1207293

Description Tasos Papaioannou 2014-07-23 16:20:57 UTC

+++ This bug was initially created as a clone of Bug #1122625 +++

Description of problem:

Deadlocks detected in postgresql when multiple systems are registering simultaneously, e.g.,

****
2014-07-08 05:49:45.027 EDT ERROR:  deadlock detected
2014-07-08 05:49:45.027 EDT DETAIL:  Process 32173 waits for ShareLock on transaction 15425362; blocked by process 31792.
        Process 31792 waits for ShareLock on transaction 15425338; blocked by process 31758.
        Process 31758 waits for ShareLock on transaction 15425340; blocked by process 32573.
        Process 32573 waits for ExclusiveLock on tuple (433,6) of relation 18339 of database 16384; blocked by process 32173.
        Process 32173: SELECT rhn_channel.subscribe_server(1000156735, E'130', 0)
        Process 31792: SELECT rhn_entitlements.entitle_server(1000156557, E'enterprise_entitled')
        Process 31758: SELECT rhn_channel.subscribe_server(1000158308, 200, 0)
        Process 32573: SELECT rhn_channel.update_family_counts(1024, 1)
****


Version-Release number of selected component (if applicable):

spacewalk-backend-server-2.0.3-29.el6sat.noarch
spacewalk-schema-2.0.2-13.el6sat.noarch
satellite-schema-5.6.0.18-1.el6sat.noarch

How reproducible:

100% by customer

Steps to Reproduce:
1.) Do a mass kickstart of ~3,800 systems (over the course of ~3 hours, approximately 1 system registration every 2-3 seconds) using the same activation key.

2.) See 'deadlock detected' errors in database and httpd error logs.

Actual results:

Deadlocks detected.

Expected results:

No deadlocks.

Additional info:

Postgres database log attached, showing the full deadlock output. It looks like there are a few different entry points in the system registration code to accessing the rhnPrivateChannelFamily table, and inconsistent use of locking may be the problem:

****
The xmlrpc entry point is:

xmlrpc/registration.create_system
calls rhnServer/server_class.server_token
which calls rhnServer/server_token.process_token
which first calls rhnServer/server_token.entitle [called once for each token]
then later calls rhnServer/server_token.token_channels

Looking at the entitle and token_channels functions in turn:

a.) rhnServer/server_token.entitle:
calls rhn_entitlements.entitle_server [once for each entitlement]
which calls rhn_server.insert_into_servergroup
which [if is_virt=1] calls rhn_entitlements.repoll_virt_guest_entitlements [once for each server group]
which calls rhn_channel.update_family_counts [once for each channel family] [assumes read lock] ***

b.) rhnServer/server_token.token_channels:
first calls rhn_channel.subscribe_server [w/ recalc_family_in parameter set to 0]
then later calls rhn_channel.update_family_counts [once for each channel family] [assumes read lock] ***


rhn_channel.subscribe_server does the following:

c.) rhn_channel.subscribe_server:
first calls rhn_channel.obtain_read_lock [always]
then [if recalc_family_in=1] calls rhn_channel.update_family_counts [assumes read lock] ***
****

So, there are three different spots where rhn_channel.update_family_counts can be called, but only one of them, rhn_channel.subscribe_server, bothers to call rhn_channel.obtain_read_lock, which locks rhnPrivateChannelFamily for the given channel family id and org id:

        select created into read_lock
          from rhnPrivateChannelFamily
         where channel_family_id = channel_family_id_in and org_id = org_id_in
           for update;

Even so, subscribe_server is called with recalc_family_in=0, so it obtains the lock but doesn't actually call update_family_counts. The other two places in the code that call update_family_counts, database function rhn_entitlements.repoll_virt_guest_entitlements and rhnServer/server_token.token_channels in python, do not obtain this lock, but should, to make sure that consistent locking takes place and deadlocks are avoided.

Comment 1 Stephen Herr 2014-11-18 15:54:47 UTC

--- Additional comment from Stephen Herr on 2014-10-21 11:11:59 EDT ---

I don't think the problem is lack of explicit lock aquiring. rhn_channel.update_family_counts (since it's an update statement) will always aquire a ROW EXCLUSIVE lock on the appropriate rows of rhnPrivateChannel family, the same lock as the rhn_channel.obtain_read_lock (poorly named, as that's an update lock not a read lock) method will aquire if it is called.

Deadlocks are caused by transaction 1 holding a lock on table X and attempting to get a lock on table Y at the same time that transaction 2 holds a lock on table Y and attempts to get a lock on table X. In this case it looks to me like tables X and Y are rhnPrivateChannelFamily and rhnServerGroup, and that there are paths through rhn_entitlements.entitle_server and rhn_channel.subscribe_server that require locks on both of them and aquire them in reverse order. The fix should be ensuring that the locks are aquired in the same order in either path, which may or may not be resolved by adding more explicit calls to rhn_channel.obtain_read_lock.

I'll investigate further.

--- Additional comment from Stephen Herr on 2014-10-21 15:28:24 EDT ---

I believe I have found a problem that could cause this behavior, although there's no way to know for sure if this is *the* problem that is causing this issue for the customer.

If you register a system with an activationKey then it will eventually drop down to server_token.py:process_token(), where we (in this order):
call token_obj.entitle  which will lock rhnServerGroup
call token_channels which will lock rhnPrivateChannelFamily

If you register a system without an activationKey then it will eventually drop down to server_class.py:__save where we (in this order);
call rhnChannel.subscribe_server_channels which will lock rhnPrivateChannelFamily
call self.autoentitle which will lock rhnServerGroup

If you are registering systems with an activationKey constantly due to the massive kickstarting effort that is going on then it is not unlikely that if you happen to register a system without a registration key that you would hit this potential deadlock situation.

Comment 2 Stephen Herr 2014-11-18 16:01:45 UTC

Committing to Spacewalk master:
2d58184cf3ec59b62fad3cb14e11e60e885b9380

Comment 3 Grant Gainey 2015-03-23 16:59:35 UTC

Moving bugs to ON_QA as we move to release Spacewalk 2.3

Comment 4 Grant Gainey 2015-04-14 19:04:06 UTC

Spacewalk 2.3 has been released. See

https://fedorahosted.org/spacewalk/wiki/ReleaseNotes23

Comment 6 Bill Yodlowsky 2016-03-17 06:31:36 UTC

Reopening, as we have encountered this issue once again.

Determining the root cause of the deadlock and a workaround or hotfix is urgently needed.

We did the following to reproduce:


1) Ran latest satellite 5.7 with latest packages all updated
2) Run external postegres database


3) Upon restarting postgres and satellite. The first deadlock is seen.


Wed Mar 16 09:56:12 CDT 2016
 blocked_pid | blocked_user |     blocking_statement      | blocking_duration | blocking_pid | blocking_user |  blocked_statement  | blocked_duration 
-------------+--------------+------------------------------------------------------------------------------+-------------------+--------------+---------------
29915 |        satadmin     | SELECT rhn_channel.update_family_counts(1020, 2)   | 00:00:00.542909   |        30147 | satadmin      | SELECT rhn_channel.subscribe_server(1000330104,


This appears to match with this bug, and customer is still getting this error.

Thanks.

Comment 11 Grant Gainey 2016-06-28 14:45:41 UTC

Re-closing CURRENTRELEASE, for two reasons:

* The issue that caused it to be reopened is a different codepath than that addressed in this BZ, and
* with the release of SW2.5, Spacewalk no longer counts consumption. This eliminated the deadlocking codepath completely.