Bug 1459964

Summary: [5.6] DEADLOCK: reactivation-key, snapshot_server(), register/delete
Product:          Red Hat Satellite 5
Component:        Server
Version:          560
Status:           CLOSED ERRATA
Severity:         high
Priority:         unspecified
Reporter:         Grant Gainey <ggainey>
Assignee:         Grant Gainey <ggainey>
QA Contact:       Radovan Drazny <rdrazny>
CC:               jhutar, rdrazny, satqe-list, tlestach
Hardware:         Unspecified
OS:               Unspecified
Fixed In Version: spacewalk-backend-2.0.3-47-sat spacewalk-schema-2.0.2-27-sat satellite-schema-5.6.0.36-1-sat
Clone Of:         1459963
Type:             Bug
Last Closed:      2017-07-05 14:28:07 UTC
Bug Depends On: 1458880, 1459963    
Bug Blocks: 1452353    

Description Grant Gainey 2017-06-08 17:01:06 UTC
+++ This bug was initially created as a clone of Bug #1459963 +++

+++ This bug was initially created as a clone of Bug #1458880 +++

This BZ covers a new potential-deadlock path found during the testing of bug 1402935. The deadlock can be reproduced in 5.7, so it is not a 5.8 regression. Summary from the comments in bug 1402935:

* Running the reproducer-scripts from bug 1388160 on 4 clients
* Running rhnreg_ks --user --pass in a loop on a 5th machine

Expected: reactivations and registrations without deadlocks.
Actual:

===
Jan Hutař 2017-05-30 16:40:31 EDT
Grant, might this be related please?

==> /var/opt/rh/rh-postgresql95/lib/pgsql/data/pg_log/postgresql-Tue.log <==
2017-05-30 10:39:07.320 EDT ERROR:  deadlock detected
2017-05-30 10:39:07.320 EDT DETAIL:  Process 21183 waits for ShareLock on transaction 450090; blocked by process 21476.
	Process 21476 waits for ShareLock on transaction 450097; blocked by process 21183.
	Process 21183: SELECT rhn_channel.subscribe_server(1000015983, '123', 0)
	Process 21476: select * from delete_server($1)  as result
2017-05-30 10:39:07.320 EDT HINT:  See server log for query details.
2017-05-30 10:39:07.320 EDT CONTEXT:  while locking tuple (0,2) in relation "rhnservergroup"
	SQL statement "select sg.created                     from rhnservergroup sg
	           inner join rhnserver s on s.org_id = sg.org_id
	     where s.id = server_id_in
	       for update"
	PL/pgSQL function lock_counts(numeric) line 16 at SQL statement
	SQL statement "SELECT lock_counts(server_id_in)"
	PL/pgSQL function rhn_channel.subscribe_server(numeric,numeric,numeric,numeric,numeric) line 16 at PERFORM
2017-05-30 10:39:07.320 EDT STATEMENT:  SELECT rhn_channel.subscribe_server(1000015983, '123', 0)
2017-05-30 10:39:23.471 EDT ERROR:  deadlock detected
2017-05-30 10:39:23.471 EDT DETAIL:  Process 21337 waits for ShareLock on transaction 450880; blocked by process 21599.
	Process 21599 waits for ShareLock on transaction 450883; blocked by process 21337.
	Process 21337: select * from delete_server($1)  as result
	Process 21599: SELECT rhn_channel.unsubscribe_server(1000016009, 116, 0)
2017-05-30 10:39:23.471 EDT HINT:  See server log for query details.
2017-05-30 10:39:23.471 EDT CONTEXT:  while locking tuple (0,2) in relation "rhnservergroup"
	SQL statement "select sg.created                     from rhnservergroup sg
	           inner join rhnserver s on s.org_id = sg.org_id
	     where s.id = server_id_in
	       for update"
	PL/pgSQL function lock_counts(numeric) line 16 at SQL statement
	SQL statement "SELECT lock_counts(server_id_in)"
	PL/pgSQL function delete_server(numeric) line 27 at PERFORM
2017-05-30 10:39:23.471 EDT STATEMENT:  select * from delete_server($1)  as result
2017-05-30 10:39:23.471 EDT ERROR:  current transaction is aborted, commands ignored until end of transaction block
2017-05-30 10:39:23.471 EDT STATEMENT:  select 'c3p0 ping' from dual


I have been registering 4 systems in a loop using an activation key (into a tiny custom channel) and deleting a big batch of previously registered profiles (a mix of username/password and activation-key registrations).

I forgot to mention that the activation key included a system group.
===
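
In rough terms this is a lock-ordering problem: register/reactivate, subscribe/unsubscribe, delete_server() and snapshot_server() all end up wanting FOR UPDATE locks on the same org's rhnServerGroup rows, but they do not all reach lock_counts() at the same point in their transactions, so the rows can be acquired in different orders. One plausible interleaving, as a simplified two-session sketch (the group IDs are placeholders, and the UPDATE merely stands in for whichever earlier statement takes the first group lock on that path; these are not the literal statements from the stored procedures):

    -- Session A: register/reactivate. Some earlier statement in the
    -- transaction has already locked one of the org's rhnServerGroup rows.
    BEGIN;
    UPDATE rhnServerGroup
       SET current_members = current_members + 1
     WHERE id = 1017;                  -- t1: A holds a row lock on group 1017
    SELECT lock_counts(1000015983);    -- t3: A now wants group 1021, held by B

    -- Session B: delete_server(). It calls lock_counts() up front, which locks
    -- the org's group rows one by one, in whatever order the plan produces
    -- (the FOR UPDATE query has no ORDER BY).
    BEGIN;
    SELECT lock_counts(1000016009);    -- t2: B locks group 1021, then blocks
                                       --     on group 1017, which A holds

    -- t2 waits on t1, t3 waits on t2: PostgreSQL detects the cycle and aborts
    -- one transaction with "deadlock detected".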

Comments from the ensuing investigation:

===
Jan Hutař 2017-06-01 06:52:37 EDT

DB log:


2017-05-31 05:29:12.282 EDT ERROR:  deadlock detected
2017-05-31 05:29:12.282 EDT DETAIL:  Process 6997 waits for ShareLock on transaction 651256; blocked by process 6658.
        Process 6658 waits for ShareLock on transaction 651261; blocked by process 6997.
        Process 6997: SELECT rhn_channel.unsubscribe_server(1000017821, 169, 0)
        Process 6658: select * from rhn_channel.unsubscribe_server($1, $2, 0) as result
2017-05-31 05:29:12.282 EDT HINT:  See server log for query details.
2017-05-31 05:29:12.282 EDT CONTEXT:  while locking tuple (0,2) in relation "rhnservergroup"
        SQL statement "select sg.created                     from rhnservergroup sg
                   inner join rhnserver s on s.org_id = sg.org_id
             where s.id = server_id_in
               for update"
        PL/pgSQL function lock_counts(numeric) line 16 at SQL statement
        SQL statement "SELECT lock_counts(server_id_in)"
        PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
2017-05-31 05:29:12.282 EDT STATEMENT:  SELECT rhn_channel.unsubscribe_server(1000017821, 169, 0)


and the Tomcat log shows (not sure whether all of this is relevant, but this is what the timestamps say):


2017-05-31 05:29:11,281 [TP-Processor10] ERROR com.redhat.rhn.frontend.xmlrpc.BaseHandler - Error calling method: 
java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor789.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
        at java.lang.reflect.Method.invoke(Method.java:508)
        at com.redhat.rhn.frontend.xmlrpc.BaseHandler.invoke(BaseHandler.java:174)
        at redstone.xmlrpc.XmlRpcDispatcher.dispatch(XmlRpcDispatcher.java:123)
        at com.redhat.rhn.frontend.xmlrpc.RhnXmlRpcServer.execute(RhnXmlRpcServer.java:54)
        at com.redhat.rhn.frontend.xmlrpc.XmlRpcServlet.doPost(XmlRpcServlet.java:162)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:643)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:723)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at com.opensymphony.sitemesh.webapp.SiteMeshFilter.obtainContent(SiteMeshFilter.java:129)
        at com.opensymphony.sitemesh.webapp.SiteMeshFilter.doFilter(SiteMeshFilter.java:77)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at com.redhat.rhn.frontend.servlets.LocalizedEnvironmentFilter.doFilter(LocalizedEnvironmentFilter.java:67)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at com.redhat.rhn.frontend.servlets.EnvironmentFilter.doFilter(EnvironmentFilter.java:101)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at com.redhat.rhn.frontend.servlets.SessionFilter.doFilter(SessionFilter.java:57)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at com.redhat.rhn.frontend.servlets.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:97)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:299)
        at org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:190)
        at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291)
        at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:769)
        at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:698)
        at org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:891)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:690)
        at java.lang.Thread.run(Thread.java:785)
Caused by: 
com.redhat.rhn.common.db.WrappedSQLException: ERROR: deadlock detected
  Detail: Process 6611 waits for ShareLock on transaction 651256; blocked by process 6658.
Process 6658 waits for AccessExclusiveLock on tuple (1,10) of relation 58289 of database 16384; blocked by process 6997.
Process 6997 waits for ShareLock on transaction 651259; blocked by process 6611.
  Hint: See server log for query details.
  Where: while locking tuple (0,2) in relation "rhnservergroup"
SQL statement "select sg.created                     from rhnservergroup sg
           inner join rhnserver s on s.org_id = sg.org_id
     where s.id = server_id_in
       for update"
PL/pgSQL function lock_counts(numeric) line 16 at SQL statement
SQL statement "SELECT lock_counts(server_id_in)"
PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
        at com.redhat.rhn.common.translation.SqlExceptionTranslator.postgreSqlException(SqlExceptionTranslator.java:54)
        at com.redhat.rhn.common.translation.SqlExceptionTranslator.sqlException(SqlExceptionTranslator.java:44)
        at com.redhat.rhn.common.db.datasource.CachedStatement.executeCallable(CachedStatement.java:487)
        at com.redhat.rhn.common.db.datasource.CallableMode.execute(CallableMode.java:39)
        at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1557)
        at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1531)
        at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1488)
        at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1449)
        at com.redhat.rhn.manager.system.UpdateChildChannelsCommand.subscribeToNewChannels(UpdateChildChannelsCommand.java:156)
        at com.redhat.rhn.manager.system.UpdateChildChannelsCommand.store(UpdateChildChannelsCommand.java:103)
        at com.redhat.rhn.frontend.xmlrpc.system.SystemHandler.setChildChannels(SystemHandler.java:381)
        ... 40 more
Caused by: 
org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 6611 waits for ShareLock on transaction 651256; blocked by process 6658.
Process 6658 waits for AccessExclusiveLock on tuple (1,10) of relation 58289 of database 16384; blocked by process 6997.
Process 6997 waits for ShareLock on transaction 651259; blocked by process 6611.
  Hint: See server log for query details.
  Where: while locking tuple (0,2) in relation "rhnservergroup"
SQL statement "select sg.created                     from rhnservergroup sg
           inner join rhnserver s on s.org_id = sg.org_id
     where s.id = server_id_in
       for update"
PL/pgSQL function lock_counts(numeric) line 16 at SQL statement
SQL statement "SELECT lock_counts(server_id_in)"
PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2094)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1827)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:508)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:384)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:377)
        at com.mchange.v2.c3p0.impl.NewProxyCallableStatement.execute(NewProxyCallableStatement.java:2706)
        at com.redhat.rhn.common.db.NamedPreparedStatement.execute(NamedPreparedStatement.java:117)
        at com.redhat.rhn.common.db.datasource.CachedStatement.executeCallable(CachedStatement.java:474)
        ... 48 more
2017-05-31 05:29:11,282 [TP-Processor10] ERROR com.redhat.rhn.frontend.xmlrpc.BaseHandler - Caused by: 
com.redhat.rhn.common.db.WrappedSQLException: ERROR: deadlock detected
  Detail: Process 6611 waits for ShareLock on transaction 651256; blocked by process 6658.
Process 6658 waits for AccessExclusiveLock on tuple (1,10) of relation 58289 of database 16384; blocked by process 6997.
Process 6997 waits for ShareLock on transaction 651259; blocked by process 6611.
  Hint: See server log for query details.
  Where: while locking tuple (0,2) in relation "rhnservergroup"
SQL statement "select sg.created                     from rhnservergroup sg
           inner join rhnserver s on s.org_id = sg.org_id
     where s.id = server_id_in
       for update"
PL/pgSQL function lock_counts(numeric) line 16 at SQL statement
SQL statement "SELECT lock_counts(server_id_in)"
PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
        at com.redhat.rhn.common.translation.SqlExceptionTranslator.postgreSqlException(SqlExceptionTranslator.java:54)
        at com.redhat.rhn.common.translation.SqlExceptionTranslator.sqlException(SqlExceptionTranslator.java:44)
        at com.redhat.rhn.common.db.datasource.CachedStatement.executeCallable(CachedStatement.java:487)
        at com.redhat.rhn.common.db.datasource.CallableMode.execute(CallableMode.java:39)
        at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1557)
        at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1531)
        at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1488)
        at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1449)
        at com.redhat.rhn.manager.system.UpdateChildChannelsCommand.subscribeToNewChannels(UpdateChildChannelsCommand.java:156)
        at com.redhat.rhn.manager.system.UpdateChildChannelsCommand.store(UpdateChildChannelsCommand.java:103)
        at com.redhat.rhn.frontend.xmlrpc.system.SystemHandler.setChildChannels(SystemHandler.java:381)
        at sun.reflect.GeneratedMethodAccessor789.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
        at java.lang.reflect.Method.invoke(Method.java:508)
        at com.redhat.rhn.frontend.xmlrpc.BaseHandler.invoke(BaseHandler.java:174)
        at redstone.xmlrpc.XmlRpcDispatcher.dispatch(XmlRpcDispatcher.java:123)
        at com.redhat.rhn.frontend.xmlrpc.RhnXmlRpcServer.execute(RhnXmlRpcServer.java:54)
        at com.redhat.rhn.frontend.xmlrpc.XmlRpcServlet.doPost(XmlRpcServlet.java:162)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:643)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:723)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at com.opensymphony.sitemesh.webapp.SiteMeshFilter.obtainContent(SiteMeshFilter.java:129)
        at com.opensymphony.sitemesh.webapp.SiteMeshFilter.doFilter(SiteMeshFilter.java:77)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at com.redhat.rhn.frontend.servlets.LocalizedEnvironmentFilter.doFilter(LocalizedEnvironmentFilter.java:67)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at com.redhat.rhn.frontend.servlets.EnvironmentFilter.doFilter(EnvironmentFilter.java:101)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at com.redhat.rhn.frontend.servlets.SessionFilter.doFilter(SessionFilter.java:57)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at com.redhat.rhn.frontend.servlets.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:97)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:299)
        at org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:190)
        at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291)
        at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:769)
        at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:698)
        at org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:891)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:690)
        at java.lang.Thread.run(Thread.java:785)
Caused by:
org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 6611 waits for ShareLock on transaction 651256; blocked by process 6658.
Process 6658 waits for AccessExclusiveLock on tuple (1,10) of relation 58289 of database 16384; blocked by process 6997.
Process 6997 waits for ShareLock on transaction 651259; blocked by process 6611.
  Hint: See server log for query details.
  Where: while locking tuple (0,2) in relation "rhnservergroup"
SQL statement "select sg.created                     from rhnservergroup sg
           inner join rhnserver s on s.org_id = sg.org_id
     where s.id = server_id_in
       for update"
PL/pgSQL function lock_counts(numeric) line 16 at SQL statement
SQL statement "SELECT lock_counts(server_id_in)"
PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2094)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1827)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:508)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:384)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:377)
        at com.mchange.v2.c3p0.impl.NewProxyCallableStatement.execute(NewProxyCallableStatement.java:2706)
        at com.redhat.rhn.common.db.NamedPreparedStatement.execute(NamedPreparedStatement.java:117)
        at com.redhat.rhn.common.db.datasource.CachedStatement.executeCallable(CachedStatement.java:474)
        ... 48 more
2017-05-31 05:29:11,282 [TP-Processor10] ERROR com.redhat.rhn.frontend.xmlrpc.BaseHandler - Rolling back transaction
May 31, 2017 5:29:11 AM redstone.xmlrpc.XmlRpcDispatcher writeError
WARNING: redstone.xmlrpc.XmlRpcFault: unhandled internal exception: ERROR: deadlock detected
  Detail: Process 6611 waits for ShareLock on transaction 651256; blocked by process 6658.
Process 6658 waits for AccessExclusiveLock on tuple (1,10) of relation 58289 of database 16384; blocked by process 6997.
Process 6997 waits for ShareLock on transaction 651259; blocked by process 6611.
  Hint: See server log for query details.
  Where: while locking tuple (0,2) in relation "rhnservergroup"
SQL statement "select sg.created                     from rhnservergroup sg
           inner join rhnserver s on s.org_id = sg.org_id
     where s.id = server_id_in
       for update"
PL/pgSQL function lock_counts(numeric) line 16 at SQL statement
SQL statement "SELECT lock_counts(server_id_in)"
PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
===

===
Grant Gainey 2017-06-01 16:24:23 EDT

Using the scripts from bug 1388160, everything is fine - until I start using SSM to delete profiles at the same time. Investigating the SSM-delete-servers codepath to see if there's a place where Java is locking out of order.

Jan - ignore the "can we get 5.8 to look at 9.2?" from #c17 - instead, can you point your clients at a *5.7* satellite, and re-run your reproducer? If this happens against 5.7, then it is not this BZ, and not a 5.8 regression - would still need its own BZ, tho. In the meantime, I will attempt same, assuming I can get an env to work before I have to leave tonight.

Comment 19 Grant Gainey 2017-06-01 16:32:28 EDT
I take it back - it wasn't SSM-delete that caused deadlocks, it was registering new systems (which goes down the python path, and which I could very easily believe is not locking everything in the right order). Investigation continues.
===

===
Grant Gainey 2017-06-02 11:52:30 EDT
Current status:

* can reproduce frequent deadlocks on 5.7 using the scripts from bug 1388160 alone, on 4 clients.
* explicitly locking reactivation-path in python appears to address these
* adding vanilla-rhnreg_ks to the 4-clients-deadlocks.sh causes ~5-7% of the registrations to deadlock
* currently appears to be caused by something on the rhn_server.snapshot_server() path (all systems have Provisioning and are snapshotting at registration time)

Current thoughts:

* This is not a 5.8 regression
* In fact, things appear to be *better* on 5.8 than 5.7. Somehow.
* spacewalk-backend/python appears to be the problem path
* explicit locking of the reactivation-key path helps
* we're not quite done yet

Investigation continues. Suggested current course: do *not* hold up 5.8's schedule for this issue, since it is reproducible in 5.7 (and possibly in 5.6, based on a parallel investigation).
===


Current Results:

* Locking is needed, in python, on the reactivation-key path
* Locking is needed, in stored-procedure, in rhn_server.snapshot_server()

Doing both of these things resolves the deadlocks observed during this investigation, on both 5.8 and 5.7.

--- Additional comment from Grant Gainey on 2017-06-05 14:10 EDT ---

The proposed patch teaches the python reactivation-key path and rhn_server.snapshot_server() to explicitly call lock_counts().
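
In sketch form, both halves of the patch amount to taking the group locks up front via the existing helper, before anything else on those paths touches group-related rows. This is not the literal patch: the snapshot_server() signature shown is abbreviated from memory and its body is elided.

    -- Stored-procedure half: acquire the org's rhnServerGroup locks first,
    -- at the top of rhn_server.snapshot_server(), the same way the other
    -- locking paths do.
    create or replace function rhn_server.snapshot_server(
        server_id_in in numeric,
        reason_in    in varchar
    ) returns void as $$
    begin
        perform lock_counts(server_id_in);  -- take the group locks up front
        -- ... existing snapshot logic, unchanged ...
    end;
    $$ language plpgsql;

    -- Python half: the reactivation-key path issues the same call before it
    -- starts re-joining the server to its groups and channels:
    --   SELECT lock_counts(:server_id)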

--- Additional comment from Jan Hutař on 2017-06-05 16:55 EDT ---

I was able to get this deadlock on PostgreSQL only; basically, these are the steps:

1. register ~1000 profiles to your satellite (I had a mix of profiles registered
   via username/password and activation key, but I believe that is not important)
2. save deadlock.sh/reactivationkey.py/updatechannels.py combo to all your
   clients (I had 4 client VMs with 1GB RAM and 1 Satellite VM with 4GB RAM all
   on my laptop)
3. create activation key "1-shell"
     channel set to some custom channel
     join to some system group
4. clone rhel-x86_64-server-6 to shell-rhel-x86_64-server-6 (I believe content
   does not matter)
5. clone rhn-tools-rhel-x86_64-server-6 twice, as clone-1-rhel-x86_64-server-6
   and clone-2-rhel-x86_64-server-6, as children of the previously cloned base
   channel
6. update the satellite hostname in both reactivationkey.py and updatechannels.py
7. register all your client systems and give them the Provisioning add-on entitlement

--- Additional comment from Jan Hutař on 2017-06-05 17:02:17 EDT ---

And of course I forgot the most important step:

8. start deadlock.sh on all your clients; once it is running, it should be
   printing something like:

# ./deadlock.sh 
ACT KEY: 1-shell,re-1-d4ddf783dc2182e5bdf4ed21476831d2.
ACT KEY: 1-shell,re-1-25d94990a0887dfe976ea4332b1133bd.
[...]

So, once that is running, select all previously registered profiles, add them to SSM and use SSM to delete these profiles.

Now watch the PostgreSQL and Tomcat logs. In my case, deleting only 25 profiles caused the deadlock once.
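
While the reproducer runs, sessions stuck behind each other can also be watched directly in the database rather than waiting for a deadlock log entry. This is a generic PostgreSQL 9.x query (nothing Satellite-specific), run from psql as the postgres user (or via spacewalk-sql, if available):

    -- List ungranted locks together with the waiting session's query.
    SELECT l.pid,
           l.relation::regclass AS relation,
           l.mode,
           a.query
      FROM pg_locks l
      JOIN pg_stat_activity a ON a.pid = l.pid
     WHERE NOT l.granted;

During the deadlock window this shows the waiters piling up on the rhnservergroup rows a moment before PostgreSQL aborts one of them.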

Comment 6 Radovan Drazny 2017-06-29 11:07:08 UTC
Tested on 
spacewalk-backend-2.0.3-47-sat
spacewalk-schema-2.0.2-27-sat
satellite-schema-5.6.0.36-1-sat

using:
 - the reproducer scripts from bug 1388160 running on four clients 
 - two systems doing rhnreg_ks in a loop 
 - and manually deleting hundreds of previously registered systems using the SSM

No deadlocks in the Tomcat or PG logs; all deletions, registrations, and re-activations were successful. The only error encountered was in the PG logs:

<...>
2017-06-29 06:15:17.756 EDT LOG:  unexpected EOF on client connection
2017-06-29 06:15:22.568 EDT LOG:  unexpected EOF on client connection
2017-06-29 06:15:45.619 EDT LOG:  unexpected EOF on client connection
2017-06-29 06:15:52.843 EDT LOG:  unexpected EOF on client connection
2017-06-29 06:16:17.802 EDT LOG:  unexpected EOF on client connection
<...>

These messages look harmless, as there wasn't any other problem.

VERIFIED

Comment 7 Radovan Drazny 2017-06-29 11:08:09 UTC
Just to add - I have tested both the PG and Oracle backend. I was able to reproduce the error only on PG.

Comment 8 Grant Gainey 2017-06-29 11:53:35 UTC
(In reply to Radovan Drazny from comment #7)
> Just to add - I have tested both the PG and Oracle backend. I was able to
> reproduce the error only on PG.

Correct - we've never seen this particular issue in ORA. The two DBs have subtly different locking strategies.

Comment 10 errata-xmlrpc 2017-07-05 14:28:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1683