+++ This bug was initially created as a clone of Bug #1168286 +++
+++ This bug was initially created as a clone of Bug #1168285 +++
+++ This bug was initially created as a clone of Bug #1159914 +++

Description of problem:
X is in the process of building batches of servers (around 50 at a time) and ran into database locks and Satellite Server problems. They reported these errors:

e = <class 'cx_Oracle.DatabaseError'>
ORA-00060: deadlock detected while waiting for resource
ORA-06512: at "RHNADM.RHN_CHANNEL", line 31
ORA-06512: at "RHNADM.RHN_CHANNEL", line 118
ORA-06512: at line 1

Their DBAs investigated and reported the following:

DEADLOCK DETECTED ( ORA-00060 )
[Transaction Deadlock]
The following deadlock is not an ORACLE error. It is a deadlock due to user error in the design of an application or from issuing incorrect ad-hoc SQL. The following information may aid in determining the deadlock:

Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TM-000044ad-00000000        46    2656    SX             28    2298    SX   SSX
TX-006c001d-000084e9        28    2298     X             46    2656          X

session 2656: DID 0001-002E-00001AF5   session 2298: DID 0001-001C-0000341D
session 2298: DID 0001-001C-0000341D   session 2656: DID 0001-002E-00001AF5

Session History:
SID   USERNAME STATUS    OSUSER  MACHINE   PROGRAM                     SQL_ID
----- -------- --------- ------- --------- --------------------------- -------------
2298  RHNADM   INACTIVE  tomcat  st00li01  java@st00li01 (TNS V1-V3)   az33m61ym46y4
2656  RHNADM   INACTIVE  tomcat  st00li01  java@st00li01 (TNS V1-V3)   az33m61ym46y4

Query for az33m61ym46y4:
SELECT NULL AS table_cat, o.owner AS table_schem, o.object_name AS table_name,
       o.object_type AS table_type, NULL AS remarks
  FROM all_objects o
 WHERE o.owner LIKE :1 ESCAPE '/'
   AND o.object_name LIKE :2 ESCAPE '/'
   AND o.object_type IN ('xxx', 'TABLE')
 ORDER BY table_type, table_schem, table_name

Version-Release number of selected component (if applicable):
Satellite 5.5

How reproducible:
My understanding of the issue is that when they launch automated builds of a number of servers, they run into issues where the registration of the newly built server fails as described below. They also state that 'when this is happening, other things like yum install begin to fail as well'.

Steps to Reproduce:
1.
2.
3.

Actual results:
Deadlock ensues, causing failures

Expected results:
No deadlock

Additional info:

--- Additional comment from Xixi on 2014-11-21 20:58:36 EST ---

Expanding on the deadlock trace from Comment 4:

From RHNPROD_ora_9447.trc and others:

*** 2014-10-27 16:42:19.615
DEADLOCK DETECTED ( ORA-00060 )
[Transaction Deadlock]
...
Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TM-000044ad-00000000       104    1526    SX             25    1768    SX   SSX
TX-00800003-00002bae        25    1768     X            108    2280          X
TX-009f0005-00002bbc       108    2280     X            104    1526          X
...
----- Information for the OTHER waiting sessions -----
Session 1768:
  sid: 1768 ser: 31239 audsid: 23667258 user: 36/RHNADM
    flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
    flags2: (0x40009) -/-/INC
  pid: 25 O/S info: user: oracle, term: UNKNOWN, ospid: 22009
    image: oracle@sv00or03
  client details:
    O/S info: user: tomcat, term: , ospid: 31069
    machine: st00li01 program: java@st00li01 (TNS V1-V3)
    application name: java@st00li01 (TNS V1-V3), hash value=1200114154
  current SQL:
  DELETE FROM RHNSERVER WHERE ID = :B1
Session 2280:
  sid: 2280 ser: 45625 audsid: 23667370 user: 36/RHNADM
    flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
    flags2: (0x40009) -/-/INC
  pid: 108 O/S info: user: oracle, term: UNKNOWN, ospid: 9455
    image: oracle@sv00or03
  client details:
    O/S info: user: apache, term: , ospid: 1923
    machine: st00li01 program: httpd@st00li01 (TNS V1-V3)
    application name: mod_wsgi, hash value=1764118423
  current SQL:
  SELECT SG.GROUP_TYPE, SG.ORG_ID, SG.CURRENT_MEMBERS, SG.MAX_MEMBERS FROM RHNSERVERGROUP SG WHERE SG.ID = :B1 FOR UPDATE OF SG.CURRENT_MEMBERS
----- End of information for the OTHER waiting sessions -----

Information for THIS session:

----- Current SQL Statement for this session (sql_id=ajx1p7a7bkfs3) -----
SELECT CREATED FROM RHNPRIVATECHANNELFAMILY WHERE CHANNEL_FAMILY_ID = :B2 AND ORG_ID = :B1 FOR UPDATE
----- PL/SQL Stack -----
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
0x15b476db8     31  package body RHNADM.RHN_CHANNEL
0x15b476db8    118  package body RHNADM.RHN_CHANNEL
0x13d894fd0      1  anonymous block
===================================================
...

The SQL comes from insert_into_servergroup and delete_server, during server registration and deletion, where group membership and channel membership are updated. It may be similar/related to Bug 1122625 "deadlock when multiple system registrations update entitlement counts in rhnPrivateChannelFamily", especially per https://bugzilla.redhat.com/show_bug.cgi?id=1122625#c13.

--- Additional comment from Stephen Herr on 2014-11-25 09:02:05 EST ---

Our QE team was actually just able to reproduce this deadlock on a system that had the patch applied, so I don't think these hotfix packages will help this customer. More investigation is needed; I'll see what I can do.

https://bugzilla.redhat.com/show_bug.cgi?id=1165251#c3

--- Additional comment from Stephen Herr on 2014-11-25 15:23:27 EST ---

Copying the "new" traceback that occurs after the hotfix patch has been applied from https://bugzilla.redhat.com/show_bug.cgi?id=1165251#c5

My belief is that this is related to deleting servers with different channel entitlements at the same time as other systems are registering. For example, the SSM delete-server transaction deletes the first server it is operating on (which happens to be, say, a RHEL 5 server), locking a number of related tables in the process. At the same time a RHEL 6 machine is registering, so it subscribes to the RHEL 6 channel (which is not locked) but then blocks on something that is locked, like rhnServerGroup. The ssm-delete-system transaction then goes to delete a RHEL 6 machine and cannot get a lock on the RHEL 6 row of rhnChannelFamilyMembers. The two processes end up waiting on each other, resulting in a deadlock. Or something.

I believe that changing the SSM Delete System action to commit after each deleted system will prevent this from happening. QE has a test package that does that and will test it tomorrow to see if it makes things better.
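[Illustration only, not Satellite code] The opposite-order locking pattern described above can be reduced to a minimal, self-contained Java sketch. The two ReentrantLock objects are hypothetical stand-ins for the database rows each transaction locks (e.g. a rhnPrivateChannelFamily row and a rhnServerGroup row); a real database would block until its deadlock detector kills one session (ORA-00060 / SQLSTATE 40P01), so the sketch uses a timeout to terminate.

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderDeadlockSketch {
    // Hypothetical stand-ins for the two contended rows.
    static final ReentrantLock channelFamilyRow = new ReentrantLock();
    static final ReentrantLock serverGroupRow = new ReentrantLock();

    static void lockBoth(String who, ReentrantLock first, ReentrantLock second) {
        first.lock();
        try {
            System.out.println(who + " holds its first lock, waiting for the second...");
            try {
                // Timeout stands in for the database's deadlock detection.
                if (second.tryLock(2, TimeUnit.SECONDS)) {
                    try {
                        System.out.println(who + " got both locks");
                    } finally {
                        second.unlock();
                    }
                } else {
                    System.out.println(who + " timed out: deadlock caused by opposite lock order");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // "ssm-delete" takes the channel-family row first, then the server-group row.
        Thread ssmDelete = new Thread(() ->
                lockBoth("ssm-delete", channelFamilyRow, serverGroupRow));
        // "registration" takes the same two locks in the opposite order.
        Thread registration = new Thread(() ->
                lockBoth("registration", serverGroupRow, channelFamilyRow));
        ssmDelete.start();
        registration.start();
        ssmDelete.join();
        registration.join();
    }
}

Because each thread holds one lock while waiting for the other, neither can make progress; that is the shape of the deadlock between the delete and registration transactions.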
-------------

Stephen, I tried it again with package spacewalk-java-2.3.8-71.el6sat.noarch on Satellite 5.7 and got a different traceback.

Reproducer: 4 systems were registered to the Satellite simultaneously, and during this process I removed systems via the WebUI.

2014-11-25 10:10:38.689 EST ERROR:  deadlock detected
2014-11-25 10:10:38.689 EST DETAIL:  Process 11400 waits for ExclusiveLock on tuple (5,90) of relation 389625 of database 16384; blocked by process 11460.
        Process 11460 waits for ShareLock on transaction 2839699; blocked by process 11439.
        Process 11439 waits for ShareLock on transaction 2839644; blocked by process 11400.
        Process 11400: select * from delete_server($1) as result
        Process 11460: SELECT rhn_channel.subscribe_server(1000012309, '101', 0)
        Process 11439: SELECT rhn_channel.subscribe_server(1000012308, 110, 0)
2014-11-25 10:10:38.689 EST HINT:  See server log for query details.
2014-11-25 10:10:38.689 EST CONTEXT:  SQL statement "update rhnPrivateChannelFamily
        set current_members = current_members -1
        where org_id in (
                select org_id
                from rhnServer
                where id = server_id_in )
        and channel_family_id in (
                select rcfm.channel_family_id
                from rhnChannelFamilyMembers rcfm,
                     rhnServerChannel rsc
                where rsc.server_id = server_id_in
                  and rsc.channel_id = rcfm.channel_id
                  and not exists (
                        select 1
                        from rhnChannelFamilyVirtSubLevel cfvsl,
                             rhnSGTypeVirtSubLevel sgtvsl,
                             rhnServerEntitlementView sev,
                             rhnVirtualInstance vi
                        where
                            -- system is a virtual instance
                            vi.virtual_system_id = server_id_in
                            and vi.host_system_id = sev.server_id
                            -- system's host has a virt ent
                            and sev.label in ('virtualization_host', 'virtualization_host_platform')
                            and sev.server_group_type_id = sgtvsl.server_group_type_id
                            -- the host's virt ent grants a cf virt sub level
                            and sgtvsl.virt_sub_level_id = cfvsl.virt_sub_level_id
                            -- the cf is in that virt sub level
                            and cfvsl.channel_family_id = rcfm.channel_family_id
                        )
                )"
        PL/pgSQL function rhn_channel.delete_server_channels(numeric) line 3 at SQL statement
        SQL statement "SELECT rhn_channel.delete_server_channels(server_id_in)"
        PL/pgSQL function delete_server(numeric) line 25 at PERFORM
2014-11-25 10:10:38.689 EST STATEMENT:  select * from delete_server($1) as result
2014-11-25 10:10:38.724 EST ERROR:  current transaction is aborted, commands ignored until end of transaction block
2014-11-25 10:10:38.724 EST STATEMENT:  select 'c3p0 ping' from dual
2014-11-25 10:10:38.978 EST ERROR:  current transaction is aborted, commands ignored until end of transaction block
2014-11-25 10:10:38.978 EST STATEMENT:  UPDATE rhnSsmOperation SET status = $1, modified = current_timestamp WHERE id = $2 AND user_id = $3
2014-11-25 10:10:38.978 EST ERROR:  current transaction is aborted, commands ignored until end of transaction block
2014-11-25 10:10:38.978 EST STATEMENT:  select 'c3p0 ping' from dual
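[Illustration only, not the committed fix] Since the database aborts the losing transaction with a detectable error (PostgreSQL SQLSTATE 40P01, Oracle ORA-00060 / vendor code 60), one common client-side mitigation is to retry the failed call on a fresh transaction. A minimal JDBC sketch of that idea, using a hypothetical helper around the delete_server() stored procedure seen in the log above:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

public class DeleteServerWithRetry {

    // Hypothetical helper: calls delete_server(id) and retries a few times
    // when the database reports a deadlock.
    public static void deleteServer(DataSource ds, long serverId) throws SQLException {
        final int maxAttempts = 3;
        for (int attempt = 1; ; attempt++) {
            try (Connection conn = ds.getConnection()) {
                conn.setAutoCommit(false);
                try (PreparedStatement ps =
                         conn.prepareStatement("select * from delete_server(?) as result")) {
                    ps.setLong(1, serverId);
                    ps.executeQuery();
                }
                conn.commit();
                return;                      // success
            } catch (SQLException e) {
                boolean deadlock = "40P01".equals(e.getSQLState()) || e.getErrorCode() == 60;
                if (!deadlock || attempt >= maxAttempts) {
                    throw e;                 // not a deadlock, or out of retries
                }
                // The losing transaction was already aborted by the database;
                // back off briefly and retry on a fresh connection.
                try {
                    Thread.sleep(200L * attempt);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}

This only papers over the contention; it does not change the lock ordering or transaction scope that causes the deadlock in the first place.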
Bug 1122626 fixes the first part of this problem. This change makes the SSM Delete System action commit after each deleted system, to avoid deadlocks with concurrent registrations.

Committing to Spacewalk master: 35026fc0cb7de1b7752434001088da214bba0a1f
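[Sketch of the idea only, not the actual Spacewalk code; class and method names are hypothetical] Instead of deleting every selected system inside one long transaction (holding row locks on rhnPrivateChannelFamily, rhnServerGroup, etc. for the whole batch), the action commits after each individual delete, so concurrent registrations are never blocked for more than one system's worth of work:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

public class SsmDeleteServersSketch {

    // Delete each system in its own short transaction.
    public static void deleteSystems(DataSource ds, List<Long> serverIds) throws SQLException {
        try (Connection conn = ds.getConnection()) {
            conn.setAutoCommit(false);
            for (Long id : serverIds) {
                try (PreparedStatement ps =
                         conn.prepareStatement("select * from delete_server(?) as result")) {
                    ps.setLong(1, id);
                    ps.executeQuery();
                }
                conn.commit();   // release row locks before moving on to the next system
            }
        }
    }
}

The trade-off is that a failure partway through leaves the already-deleted systems gone rather than rolling back the whole batch, which is acceptable for SSM delete and greatly shrinks the window in which registrations and deletions can deadlock.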
Moving bugs to ON_QA as we move to release Spacewalk 2.3
Spacewalk 2.3 has been released. See https://fedorahosted.org/spacewalk/wiki/ReleaseNotes23