+++ This bug was initially created as a clone of Bug #1168286 +++
+++ This bug was initially created as a clone of Bug #1168285 +++
+++ This bug was initially created as a clone of Bug #1159914 +++

Description of problem:
X is in the process of building batches of servers (around 50 at a time) and ran into database locks and Satellite Server problems. They reported these errors:

e = <class 'cx_Oracle.DatabaseError'>
ORA-00060: deadlock detected while waiting for resource
ORA-06512: at "RHNADM.RHN_CHANNEL", line 31
ORA-06512: at "RHNADM.RHN_CHANNEL", line 118
ORA-06512: at line 1

Their DBAs investigated and reported the following:

DEADLOCK DETECTED ( ORA-00060 )
[Transaction Deadlock]
The following deadlock is not an ORACLE error. It is a deadlock due to user error in the design of an application or from issuing incorrect ad-hoc SQL. The following information may aid in determining the deadlock:

Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TM-000044ad-00000000        46    2656    SX             28    2298    SX   SSX
TX-006c001d-000084e9        28    2298     X             46    2656          X

session 2656: DID 0001-002E-00001AF5   session 2298: DID 0001-001C-0000341D
session 2298: DID 0001-001C-0000341D   session 2656: DID 0001-002E-00001AF5

Session History:
SID   USERNAME STATUS    OSUSER  MACHINE   PROGRAM                     SQL_ID
----- -------- --------- ------- --------- --------------------------- -------------
2298  RHNADM   INACTIVE  tomcat  st00li01  java@st00li01 (TNS V1-V3)   az33m61ym46y4
2656  RHNADM   INACTIVE  tomcat  st00li01  java@st00li01 (TNS V1-V3)   az33m61ym46y4

Query for az33m61ym46y4:
SELECT NULL AS table_cat, o.owner AS table_schem, o.object_name AS table_name,
       o.object_type AS table_type, NULL AS remarks
  FROM all_objects o
 WHERE o.owner LIKE :1 ESCAPE '/'
   AND o.object_name LIKE :2 ESCAPE '/'
   AND o.object_type IN ('xxx', 'TABLE')
 ORDER BY table_type, table_schem, table_name

Version-Release number of selected component (if applicable):
Satellite 5.5

How reproducible:
My understanding of the issue is that when they launch automated builds of a number of servers, they run into issues where the registration of the newly built server fails as described below. They also state that 'when this is happening, other things like yum install begin to fail as well'.

Steps to Reproduce:
1.
2.
3.

Actual results:
Deadlock ensues, causing failures

Expected results:
No deadlock

Additional info:

--- Additional comment from Xixi on 2014-11-21 20:58:36 EST ---

Expanding on the deadlock trace from Comment 4:

From RHNPROD_ora_9447.trc and others:

*** 2014-10-27 16:42:19.615
DEADLOCK DETECTED ( ORA-00060 )
[Transaction Deadlock]
...
Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TM-000044ad-00000000       104    1526    SX             25    1768    SX   SSX
TX-00800003-00002bae        25    1768     X            108    2280          X
TX-009f0005-00002bbc       108    2280     X            104    1526          X
...
----- Information for the OTHER waiting sessions -----
Session 1768:
  sid: 1768 ser: 31239 audsid: 23667258 user: 36/RHNADM
    flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
    flags2: (0x40009) -/-/INC
  pid: 25 O/S info: user: oracle, term: UNKNOWN, ospid: 22009
    image: oracle@sv00or03
  client details:
    O/S info: user: tomcat, term: , ospid: 31069
    machine: st00li01 program: java@st00li01 (TNS V1-V3)
    application name: java@st00li01 (TNS V1-V3), hash value=1200114154
  current SQL:
  DELETE FROM RHNSERVER WHERE ID = :B1
Session 2280:
  sid: 2280 ser: 45625 audsid: 23667370 user: 36/RHNADM
    flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
    flags2: (0x40009) -/-/INC
  pid: 108 O/S info: user: oracle, term: UNKNOWN, ospid: 9455
    image: oracle@sv00or03
  client details:
    O/S info: user: apache, term: , ospid: 1923
    machine: st00li01 program: httpd@st00li01 (TNS V1-V3)
    application name: mod_wsgi, hash value=1764118423
  current SQL:
  SELECT SG.GROUP_TYPE, SG.ORG_ID, SG.CURRENT_MEMBERS, SG.MAX_MEMBERS FROM RHNSERVERGROUP SG WHERE SG.ID = :B1 FOR UPDATE OF SG.CURRENT_MEMBERS
----- End of information for the OTHER waiting sessions -----

Information for THIS session:

----- Current SQL Statement for this session (sql_id=ajx1p7a7bkfs3) -----
SELECT CREATED FROM RHNPRIVATECHANNELFAMILY WHERE CHANNEL_FAMILY_ID = :B2 AND ORG_ID = :B1 FOR UPDATE
----- PL/SQL Stack -----
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
0x15b476db8     31  package body RHNADM.RHN_CHANNEL
0x15b476db8    118  package body RHNADM.RHN_CHANNEL
0x13d894fd0      1  anonymous block
===================================================
...

The SQL comes from insert_into_servergroup and delete_server, during server registration and deletion, where group membership and channel membership are updated. It may be similar/related to Bug 1122625 "deadlock when multiple system registrations update entitlement counts in rhnPrivateChannelFamily", especially per https://bugzilla.redhat.com/show_bug.cgi?id=1122625#c13.

--- Additional comment from Stephen Herr on 2014-11-25 09:02:05 EST ---

Our QE team was actually just able to reproduce this deadlock on a system that had the patch applied, so I don't think these hotfix packages will help this customer. More investigation is needed; I'll see what I can do.

https://bugzilla.redhat.com/show_bug.cgi?id=1165251#c3

--- Additional comment from Stephen Herr on 2014-11-25 15:23:27 EST ---

Copying the "new" traceback that occurs after the hotfix patch has been applied from https://bugzilla.redhat.com/show_bug.cgi?id=1165251#c5

My belief is that this is related to deleting servers with different channel entitlements at the same time as other systems are registering. For example, the SSM delete-server transaction deletes the first server it is operating on (which happens to be, say, a RHEL 5 server), locking a number of related tables in the process. At the same time a RHEL 6 machine is registering, so it subscribes to the RHEL 6 channel (which is not locked) but then blocks on something that is locked, like rhnServerGroup. The ssm-delete-system transaction then goes to delete a RHEL 6 machine and cannot get a lock on the RHEL 6 row of rhnChannelFamilyMembers. The two processes end up waiting on each other, resulting in a deadlock. Or something.

I believe that changing the SSM Delete System action to commit after each deleted system will prevent this from happening. QE has a test package that does that and will test it tomorrow to see if it makes things better.
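[Illustration only, not Satellite code] The opposite-order locking pattern described above can be reduced to a minimal, self-contained Java sketch. The two ReentrantLock objects are hypothetical stand-ins for the database rows each transaction locks (e.g. a rhnPrivateChannelFamily row and a rhnServerGroup row); a real database would block until its deadlock detector kills one session (ORA-00060 / SQLSTATE 40P01), so the sketch uses a timeout to terminate.

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderDeadlockSketch {
    // Hypothetical stand-ins for the two contended rows.
    static final ReentrantLock channelFamilyRow = new ReentrantLock();
    static final ReentrantLock serverGroupRow = new ReentrantLock();

    static void lockBoth(String who, ReentrantLock first, ReentrantLock second) {
        first.lock();
        try {
            System.out.println(who + " holds its first lock, waiting for the second...");
            try {
                // Timeout stands in for the database's deadlock detection.
                if (second.tryLock(2, TimeUnit.SECONDS)) {
                    try {
                        System.out.println(who + " got both locks");
                    } finally {
                        second.unlock();
                    }
                } else {
                    System.out.println(who + " timed out: deadlock caused by opposite lock order");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // "ssm-delete" takes the channel-family row first, then the server-group row.
        Thread ssmDelete = new Thread(() ->
                lockBoth("ssm-delete", channelFamilyRow, serverGroupRow));
        // "registration" takes the same two locks in the opposite order.
        Thread registration = new Thread(() ->
                lockBoth("registration", serverGroupRow, channelFamilyRow));
        ssmDelete.start();
        registration.start();
        ssmDelete.join();
        registration.join();
    }
}

Because each thread holds one lock while waiting for the other, neither can make progress; that is the shape of the deadlock between the delete and registration transactions.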
-------------

Stephen, I tried it again with package spacewalk-java-2.3.8-71.el6sat.noarch on Satellite 5.7 and got a different traceback.

Reproducer: 4 systems were registered to the Satellite simultaneously, and during this process I removed systems via the WebUI.

2014-11-25 10:10:38.689 EST ERROR:  deadlock detected
2014-11-25 10:10:38.689 EST DETAIL:  Process 11400 waits for ExclusiveLock on tuple (5,90) of relation 389625 of database 16384; blocked by process 11460.
        Process 11460 waits for ShareLock on transaction 2839699; blocked by process 11439.
        Process 11439 waits for ShareLock on transaction 2839644; blocked by process 11400.
        Process 11400: select * from delete_server($1) as result
        Process 11460: SELECT rhn_channel.subscribe_server(1000012309, '101', 0)
        Process 11439: SELECT rhn_channel.subscribe_server(1000012308, 110, 0)
2014-11-25 10:10:38.689 EST HINT:  See server log for query details.
2014-11-25 10:10:38.689 EST CONTEXT:  SQL statement "update rhnPrivateChannelFamily
        set current_members = current_members -1
        where org_id in (
                select org_id
                from rhnServer
                where id = server_id_in )
        and channel_family_id in (
                select rcfm.channel_family_id
                from rhnChannelFamilyMembers rcfm,
                     rhnServerChannel rsc
                where rsc.server_id = server_id_in
                  and rsc.channel_id = rcfm.channel_id
                  and not exists (
                        select 1
                        from rhnChannelFamilyVirtSubLevel cfvsl,
                             rhnSGTypeVirtSubLevel sgtvsl,
                             rhnServerEntitlementView sev,
                             rhnVirtualInstance vi
                        where
                            -- system is a virtual instance
                            vi.virtual_system_id = server_id_in
                            and vi.host_system_id = sev.server_id
                            -- system's host has a virt ent
                            and sev.label in ('virtualization_host', 'virtualization_host_platform')
                            and sev.server_group_type_id = sgtvsl.server_group_type_id
                            -- the host's virt ent grants a cf virt sub level
                            and sgtvsl.virt_sub_level_id = cfvsl.virt_sub_level_id
                            -- the cf is in that virt sub level
                            and cfvsl.channel_family_id = rcfm.channel_family_id
                        )
                )"
        PL/pgSQL function rhn_channel.delete_server_channels(numeric) line 3 at SQL statement
        SQL statement "SELECT rhn_channel.delete_server_channels(server_id_in)"
        PL/pgSQL function delete_server(numeric) line 25 at PERFORM
2014-11-25 10:10:38.689 EST STATEMENT:  select * from delete_server($1) as result
2014-11-25 10:10:38.724 EST ERROR:  current transaction is aborted, commands ignored until end of transaction block
2014-11-25 10:10:38.724 EST STATEMENT:  select 'c3p0 ping' from dual
2014-11-25 10:10:38.978 EST ERROR:  current transaction is aborted, commands ignored until end of transaction block
2014-11-25 10:10:38.978 EST STATEMENT:  UPDATE rhnSsmOperation SET status = $1, modified = current_timestamp WHERE id = $2 AND user_id = $3
2014-11-25 10:10:38.978 EST ERROR:  current transaction is aborted, commands ignored until end of transaction block
2014-11-25 10:10:38.978 EST STATEMENT:  select 'c3p0 ping' from dual
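[Illustration only, not the committed fix] Since the database aborts the losing transaction with a detectable error (PostgreSQL SQLSTATE 40P01, Oracle ORA-00060 / vendor code 60), one common client-side mitigation is to retry the failed call on a fresh transaction. A minimal JDBC sketch of that idea, using a hypothetical helper around the delete_server() stored procedure seen in the log above:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

public class DeleteServerWithRetry {

    // Hypothetical helper: calls delete_server(id) and retries a few times
    // when the database reports a deadlock.
    public static void deleteServer(DataSource ds, long serverId) throws SQLException {
        final int maxAttempts = 3;
        for (int attempt = 1; ; attempt++) {
            try (Connection conn = ds.getConnection()) {
                conn.setAutoCommit(false);
                try (PreparedStatement ps =
                         conn.prepareStatement("select * from delete_server(?) as result")) {
                    ps.setLong(1, serverId);
                    ps.executeQuery();
                }
                conn.commit();
                return;                      // success
            } catch (SQLException e) {
                boolean deadlock = "40P01".equals(e.getSQLState()) || e.getErrorCode() == 60;
                if (!deadlock || attempt >= maxAttempts) {
                    throw e;                 // not a deadlock, or out of retries
                }
                // The losing transaction was already aborted by the database;
                // back off briefly and retry on a fresh connection.
                try {
                    Thread.sleep(200L * attempt);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}

This only papers over the contention; it does not change the lock ordering or transaction scope that causes the deadlock in the first place.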
Bug 1122626 fixes the first part of this problem. This change makes the SSM Delete System action commit after each deleted system, to avoid deadlocks with concurrent registrations.

Committing to Spacewalk master: 35026fc0cb7de1b7752434001088da214bba0a1f
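[Sketch of the idea only, not the actual Spacewalk code; class and method names are hypothetical] Instead of deleting every selected system inside one long transaction (holding row locks on rhnPrivateChannelFamily, rhnServerGroup, etc. for the whole batch), the action commits after each individual delete, so concurrent registrations are never blocked for more than one system's worth of work:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

public class SsmDeleteServersSketch {

    // Delete each system in its own short transaction.
    public static void deleteSystems(DataSource ds, List<Long> serverIds) throws SQLException {
        try (Connection conn = ds.getConnection()) {
            conn.setAutoCommit(false);
            for (Long id : serverIds) {
                try (PreparedStatement ps =
                         conn.prepareStatement("select * from delete_server(?) as result")) {
                    ps.setLong(1, id);
                    ps.executeQuery();
                }
                conn.commit();   // release row locks before moving on to the next system
            }
        }
    }
}

The trade-off is that a failure partway through leaves the already-deleted systems gone rather than rolling back the whole batch, which is acceptable for SSM delete and greatly shrinks the window in which registrations and deletions can deadlock.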
Moving bugs to ON_QA as we move to release Spacewalk 2.3
Spacewalk 2.3 has been released. See https://fedorahosted.org/spacewalk/wiki/ReleaseNotes23