Bug 1165251 - deadlock when multiple system registrations update entitlement counts in rhnPrivateChannelFamily
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Satellite 5
Classification: Red Hat
Component: Server
Version: 570
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Tomáš Kašpárek
QA Contact: Pavel Studeník
URL:
Whiteboard:
Depends On: 1122625
Blocks: sat570-blockers
 
Reported: 2014-11-18 15:51 UTC by Stephen Herr
Modified: 2021-03-11 14:17 UTC
CC List: 12 users

Fixed In Version: spacewalk-backend-2.3.3-20
Doc Type: Bug Fix
Doc Text:
Clone Of: 1122625
Environment:
Last Closed: 2016-04-08 07:26:33 UTC
Target Upstream Version:
Embargoed:


Internal Links: 1265804

Description Stephen Herr 2014-11-18 15:51:38 UTC
+++ This bug was initially created as a clone of Bug #1122625 +++

Description of problem:

Deadlocks detected in postgresql when multiple systems are registering simultaneously, e.g.,

****
2014-07-08 05:49:45.027 EDT ERROR:  deadlock detected
2014-07-08 05:49:45.027 EDT DETAIL:  Process 32173 waits for ShareLock on transaction 15425362; blocked by process 31792.
        Process 31792 waits for ShareLock on transaction 15425338; blocked by process 31758.
        Process 31758 waits for ShareLock on transaction 15425340; blocked by process 32573.
        Process 32573 waits for ExclusiveLock on tuple (433,6) of relation 18339 of database 16384; blocked by process 32173.
        Process 32173: SELECT rhn_channel.subscribe_server(1000156735, E'130', 0)
        Process 31792: SELECT rhn_entitlements.entitle_server(1000156557, E'enterprise_entitled')
        Process 31758: SELECT rhn_channel.subscribe_server(1000158308, 200, 0)
        Process 32573: SELECT rhn_channel.update_family_counts(1024, 1)
****


Version-Release number of selected component (if applicable):

spacewalk-backend-server-2.0.3-29.el6sat.noarch
spacewalk-schema-2.0.2-13.el6sat.noarch
satellite-schema-5.6.0.18-1.el6sat.noarch

How reproducible:

100% by customer

Steps to Reproduce:
1.) Do a mass kickstart of ~3,800 systems (over the course of ~3 hours, approximately 1 system registration every 2-3 seconds) using the same activation key.

2.) See 'deadlock detected' errors in database and httpd error logs.

Actual results:

Deadlocks detected.

Expected results:

No deadlocks.

Additional info:

Postgres database log attached, showing the full deadlock output. It looks like there are a few different entry points in the system registration code that access the rhnPrivateChannelFamily table, and inconsistent use of locking may be the problem:

****
The xmlrpc entry point is:

xmlrpc/registration.create_system
calls rhnServer/server_class.server_token
which calls rhnServer/server_token.process_token
which first calls rhnServer/server_token.entitle [called once for each token]
then later calls rhnServer/server_token.token_channels

Looking at the entitle and token_channels functions in turn:

a.) rhnServer/server_token.entitle:
calls rhn_entitlements.entitle_server [once for each entitlement]
which calls rhn_server.insert_into_servergroup
which [if is_virt=1] calls rhn_entitlements.repoll_virt_guest_entitlements [once for each server group]
which calls rhn_channel.update_family_counts [once for each channel family] [assumes read lock] ***

b.) rhnServer/server_token.token_channels:
first calls rhn_channel.subscribe_server [w/ recalc_family_in parameter set to 0]
then later calls rhn_channel.update_family_counts [once for each channel family] [assumes read lock] ***


rhn_channel.subscribe_server does the following:

c.) rhn_channel.subscribe_server:
first calls rhn_channel.obtain_read_lock [always]
then [if recalc_family_in=1] calls rhn_channel.update_family_counts [assumes read lock] ***
****

So, there are three different spots where rhn_channel.update_family_counts can be called, but only one of them, rhn_channel.subscribe_server, bothers to call rhn_channel.obtain_read_lock, which locks rhnPrivateChannelFamily for the given channel family id and org id:

        select created into read_lock
          from rhnPrivateChannelFamily
         where channel_family_id = channel_family_id_in and org_id = org_id_in
           for update;

Even so, subscribe_server is called with recalc_family_in=0, so it obtains the lock but doesn't actually call update_family_counts. The other two places in the code that call update_family_counts, database function rhn_entitlements.repoll_virt_guest_entitlements and rhnServer/server_token.token_channels in python, do not obtain this lock, but should, to make sure that consistent locking takes place and deadlocks are avoided.
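
For illustration, the calling pattern proposed here would look roughly like the following (a sketch only, using the example values from the log above and assuming obtain_read_lock takes the same (channel_family_id, org_id) pair that its FOR UPDATE query selects on):

    -- Sketch of the proposed calling pattern, not the actual fix:
    -- lock the rhnPrivateChannelFamily row first, then recalculate.
    SELECT rhn_channel.obtain_read_lock(1024, 1);      -- row lock on rhnPrivateChannelFamily
    SELECT rhn_channel.update_family_counts(1024, 1);  -- recount under a consistent lock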

--- Additional comment from Stephen Herr on 2014-10-21 11:11:59 EDT ---

I don't think the problem is a lack of explicit lock acquisition. rhn_channel.update_family_counts (since it's an update statement) will always acquire a ROW EXCLUSIVE lock on the appropriate rows of rhnPrivateChannelFamily, the same lock that the rhn_channel.obtain_read_lock method (poorly named, as that's an update lock, not a read lock) will acquire if it is called.

Deadlocks are caused by transaction 1 holding a lock on table X and attempting to get a lock on table Y at the same time that transaction 2 holds a lock on table Y and attempts to get a lock on table X. In this case it looks to me like tables X and Y are rhnPrivateChannelFamily and rhnServerGroup, and that there are paths through rhn_entitlements.entitle_server and rhn_channel.subscribe_server that require locks on both of them and acquire them in reverse order. The fix should be ensuring that the locks are acquired in the same order in either path, which may or may not be resolved by adding more explicit calls to rhn_channel.obtain_read_lock.
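
The relation OID in the deadlock DETAIL can be used to confirm which table each tuple lock is on, e.g. for "tuple (433,6) of relation 18339" in the log above (the OID is specific to the customer's database, so the result will differ on other installations):

    -- Map a relation OID from the deadlock DETAIL to its table name
    SELECT 18339::regclass;
    -- equivalently:
    SELECT relname FROM pg_class WHERE oid = 18339;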

I'll investigate further.

--- Additional comment from Stephen Herr on 2014-10-21 15:28:24 EDT ---

I believe I have found a problem that could cause this behavior, although there's no way to know for sure if this is *the* problem that is causing this issue for the customer.

If you register a system with an activationKey then it will eventually drop down to server_token.py:process_token(), where we (in this order):
call token_obj.entitle  which will lock rhnServerGroup
call token_channels which will lock rhnPrivateChannelFamily

If you register a system without an activationKey then it will eventually drop down to server_class.py:__save where we (in this order):
call rhnChannel.subscribe_server_channels which will lock rhnPrivateChannelFamily
call self.autoentitle which will lock rhnServerGroup

Since systems are being registered with an activationKey almost constantly due to the massive kickstarting effort that is going on, it is not unlikely that a registration that happens without an activation key will hit this potential deadlock, roughly as sketched below.
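
Illustration only, with hypothetical ids (the real locks are taken inside rhn_server.insert_into_servergroup and the rhnPrivateChannelFamily queries in rhn_channel); the point is just the opposite ordering:

    -- Transaction 1: registration with an activation key
    BEGIN;
    SELECT * FROM rhnServerGroup WHERE id = 42 FOR UPDATE;      -- entitle: locks rhnServerGroup
    SELECT created FROM rhnPrivateChannelFamily
     WHERE channel_family_id = 1024 AND org_id = 1 FOR UPDATE;  -- token_channels: waits for transaction 2

    -- Transaction 2: registration without an activation key, running concurrently
    BEGIN;
    SELECT created FROM rhnPrivateChannelFamily
     WHERE channel_family_id = 1024 AND org_id = 1 FOR UPDATE;  -- subscribe_server_channels: locks rhnPrivateChannelFamily
    SELECT * FROM rhnServerGroup WHERE id = 42 FOR UPDATE;      -- autoentitle: waits for transaction 1 -> deadlock detected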

Comment 3 Pavel Studeník 2014-11-25 13:52:41 UTC
I got the following traceback when 4 systems were registered to the Satellite simultaneously while, during the registration, I removed 500 systems via the web UI.


Reproducer with spacewalk-backend-2.0.3-30.el6sat.noarch

>> rhnreg_ks ..
Error Message:
    ERROR:  deadlock detected
Error Class Code: 90
Error Class Info: Unable to entitle system
Explanation: 
     An error has occurred while processing your request. If this problem
     persists please enter a bug report at bugzilla.redhat.com.
     If you choose to submit the bug report, please be sure to include
     details of what you were trying to do when this error occurred and
     details on how to reproduce this problem.

>> tail -f /var/lib/pgsql/data/pg_log/postgresql-*
...
2014-11-25 08:36:18.372 EST DETAIL:  Process 5480 waits for ShareLock on transaction 22287017; blocked by process 3823.
        Process 3823 waits for ShareLock on transaction 22287030; blocked by process 5480.
        Process 5480: SELECT rhn_entitlements.entitle_server(1000029179, E'enterprise_entitled')
        Process 3823: select * from delete_server($1)  as result
2014-11-25 08:36:18.372 EST HINT:  See server log for query details.
2014-11-25 08:36:18.372 EST CONTEXT:  SQL statement "select sg.group_type, sg.org_id, sg.current_members, sg.max_members from rhnServerGroup sg where sg.id =  $1  for update of sg"
        PL/pgSQL function "insert_into_servergroup" line 19 at SQL statement
        SQL statement "SELECT  rhn_server.insert_into_servergroup ( $1 ,  $2 )"
        PL/pgSQL function "entitle_server" line 39 at PERFORM

>> tail -f /var/lib/pgsql/data/pg_log/postgresql-*
2014-11-25 08:41:50.373 EST ERROR:  deadlock detected
2014-11-25 08:41:50.373 EST DETAIL:  Process 7775 waits for ExclusiveLock on tuple (3,16) of relation 20231 of database 16384; blocked by process 9542.
        Process 9542 waits for ShareLock on transaction 22298652; blocked by process 9540.
        Process 9540 waits for ShareLock on transaction 22298642; blocked by process 7775.
        Process 7775: select * from delete_server($1)  as result
        Process 9542: SELECT rhn_channel.subscribe_server(1000029782, E'101', 0)
        Process 9540: SELECT rhn_entitlements.entitle_server(1000029781, E'enterprise_entitled')
2014-11-25 08:41:50.373 EST HINT:  See server log for query details.
2014-11-25 08:41:50.373 EST CONTEXT:  SQL statement "update rhnPrivateChannelFamily set current_members = current_members -1 where org_id in ( select org_id from rhnServer where id =  $1  ) and channel_family_id in ( select rcfm.channel_family_id from rhnChannelFamilyMembers rcfm, rhnServerChannel rsc where rsc.server_id =  $1  and rsc.channel_id = rcfm.channel_id and not exists ( select 1 from rhnChannelFamilyVirtSubLevel cfvsl, rhnSGTypeVirtSubLevel sgtvsl, rhnServerEntitlementView sev, rhnVirtualInstance vi where vi.virtual_system_id =  $1  and vi.host_system_id = sev.server_id and sev.label in ('virtualization_host', 'virtualization_host_platform') and sev.server_group_type_id = sgtvsl.server_group_type_id and sgtvsl.virt_sub_level_id = cfvsl.virt_sub_level_id and cfvsl.channel_family_id = rcfm.channel_family_id ) )"
        PL/pgSQL function "delete_server_channels" line 2 at SQL statement
        SQL statement "SELECT  rhn_channel.delete_server_channels( $1 )"
        PL/pgSQL function "delete_server" line 24 at PERFORM
2014-11-25 08:41:50.373 EST STATEMENT:  select * from delete_server($1)  as result
2014-11-25 08:41:50.558 EST ERROR:  current transaction is aborted, commands ignored until end of transaction block
2014-11-25 08:41:50.558 EST STATEMENT:  select 'c3p0 ping' from dual
2014-11-25 08:41:50.632 EST ERROR:  current transaction is aborted, commands ignored until end of transaction block
2014-11-25 08:41:50.632 EST STATEMENT:  UPDATE rhnSsmOperation
                 SET status = $1,
                     modified = current_timestamp
               WHERE id = $2
                 AND user_id = $3
2014-11-25 08:41:50.633 EST ERROR:  current transaction is aborted, commands ignored until end of transaction block
2014-11-25 08:41:50.633 EST STATEMENT:  select 'c3p0 ping' from dual

Comment 4 Stephen Herr 2014-11-25 14:01:49 UTC
So this bug was for the deadlock that could happen when systems were being registered with and without activation keys at the same time. However, it's very timely that you were able to reproduce a deadlock with this fix in place while deleting systems, since we have another bug open right now for that issue and I just passed this patch off to them as a hotfix. I'll comment in bug 1159914 that the hotfix probably doesn't work and more investigation is needed.

Comment 6 Pavel Studeník 2014-12-04 14:57:02 UTC
Verified with packages spacewalk-java-2.3.8-81.el6sat.noarch and spacewalk-schema-2.3.2-10.el6sat.noarch with PostgreSQL 9.2.

Comment 7 Clifford Perry 2015-01-13 10:24:58 UTC
With the release of Red Hat Satellite 5.7 on January 12th 2015 this bug is
being moved to a Closed Current Release state. 

The Satellite 5.7 GA Errata:
 - https://rhn.redhat.com/errata/RHSA-2015-0033.html 

Satellite 5.7 Release Notes:
 - https://access.redhat.com/documentation/en-US/Red_Hat_Satellite/5.7/html-single/Release_Notes/index.html

Satellite Customer Portal Blog announcement for release:
 - https://access.redhat.com/blogs/1169563/posts/1315743 

Cliff

NOTE: This bug has not been re-verified (moved to RELEASE_PENDING) prior to release. We assume that the bug has indeed been fixed and not regressed since we initially verified it. Please re-open in the future if needed.

Comment 8 Bill Yodlowsky 2016-03-17 06:42:27 UTC
Hi,

We urgently need to determine the root cause of the deadlock and to obtain a workaround or hotfix.

We did the following to reproduce:


1) Ran the latest Satellite 5.7 with all packages updated
2) Ran an external postgres database
3) Upon restarting postgres and Satellite, the first deadlock is seen:


Wed Mar 16 09:56:12 CDT 2016
 blocked_pid        | 29915
 blocked_user       | satadmin
 blocking_statement | SELECT rhn_channel.update_family_counts(1020, 2)
 blocking_duration  | 00:00:00.542909
 blocking_pid       | 30147
 blocking_user      | satadmin
 blocked_statement  | SELECT rhn_channel.subscribe_server(1000330104,


This appears to match this bug, and the customer is still getting this error.
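
For reference, output in that shape can be produced with the usual pg_locks / pg_stat_activity blocking query; a minimal sketch is below (an assumption on my part, not necessarily the exact query that was run, and it relies on PostgreSQL 9.2+, where pg_stat_activity exposes the pid and query columns):

    -- Sketch: show each statement that is waiting on a lock alongside the
    -- statement currently holding the conflicting lock.
    SELECT w.pid                  AS blocked_pid,
           wa.usename             AS blocked_user,
           ba.query               AS blocking_statement,
           now() - ba.query_start AS blocking_duration,
           b.pid                  AS blocking_pid,
           ba.usename             AS blocking_user,
           wa.query               AS blocked_statement,
           now() - wa.query_start AS blocked_duration
      FROM pg_locks w
      JOIN pg_stat_activity wa ON wa.pid = w.pid
      JOIN pg_locks b
        ON b.locktype = w.locktype
       AND b.database      IS NOT DISTINCT FROM w.database
       AND b.relation      IS NOT DISTINCT FROM w.relation
       AND b.page          IS NOT DISTINCT FROM w.page
       AND b.tuple         IS NOT DISTINCT FROM w.tuple
       AND b.virtualxid    IS NOT DISTINCT FROM w.virtualxid
       AND b.transactionid IS NOT DISTINCT FROM w.transactionid
       AND b.classid       IS NOT DISTINCT FROM w.classid
       AND b.objid         IS NOT DISTINCT FROM w.objid
       AND b.objsubid      IS NOT DISTINCT FROM w.objsubid
       AND b.pid <> w.pid
       AND b.granted
      JOIN pg_stat_activity ba ON ba.pid = b.pid
     WHERE NOT w.granted;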

Thanks.

Comment 10 Tomas Lestach 2016-04-08 07:26:33 UTC
There's some misunderstanding. This bug has been fixed, properly QAed, and shipped as part of Satellite 5.7 GA. We're not reopening bugs that have already been shipped. Please open a separate bug if needed.

I'm closing this bug back with CURRENTRELEASE.

