This BZ covers a new potential-deadlock path found during the testing of bug 1402935. The deadlock can be reproduced in 5.7, so it is not a 5.8 regression.

Summary from the comments in bug 1402935:

* Running the reproducer scripts from bug 1388160 on 4 clients
* Running rhnreg_ks --user --pass in a loop on a 5th machine

Expected: reactivations and registrations without deadlocks.

Actual:

=== Jan Hutař 2017-05-30 16:40:31 EDT

Grant, might this be related please?

==> /var/opt/rh/rh-postgresql95/lib/pgsql/data/pg_log/postgresql-Tue.log <==
2017-05-30 10:39:07.320 EDT ERROR: deadlock detected
2017-05-30 10:39:07.320 EDT DETAIL: Process 21183 waits for ShareLock on transaction 450090; blocked by process 21476.
    Process 21476 waits for ShareLock on transaction 450097; blocked by process 21183.
    Process 21183: SELECT rhn_channel.subscribe_server(1000015983, '123', 0)
    Process 21476: select * from delete_server($1) as result
2017-05-30 10:39:07.320 EDT HINT: See server log for query details.
2017-05-30 10:39:07.320 EDT CONTEXT: while locking tuple (0,2) in relation "rhnservergroup"
    SQL statement "select sg.created from rhnservergroup sg inner join rhnserver s on s.org_id = sg.org_id where s.id = server_id_in for update"
    PL/pgSQL function lock_counts(numeric) line 16 at SQL statement
    SQL statement "SELECT lock_counts(server_id_in)"
    PL/pgSQL function rhn_channel.subscribe_server(numeric,numeric,numeric,numeric,numeric) line 16 at PERFORM
2017-05-30 10:39:07.320 EDT STATEMENT: SELECT rhn_channel.subscribe_server(1000015983, '123', 0)
2017-05-30 10:39:23.471 EDT ERROR: deadlock detected
2017-05-30 10:39:23.471 EDT DETAIL: Process 21337 waits for ShareLock on transaction 450880; blocked by process 21599.
    Process 21599 waits for ShareLock on transaction 450883; blocked by process 21337.
    Process 21337: select * from delete_server($1) as result
    Process 21599: SELECT rhn_channel.unsubscribe_server(1000016009, 116, 0)
2017-05-30 10:39:23.471 EDT HINT: See server log for query details.
2017-05-30 10:39:23.471 EDT CONTEXT: while locking tuple (0,2) in relation "rhnservergroup"
    SQL statement "select sg.created from rhnservergroup sg inner join rhnserver s on s.org_id = sg.org_id where s.id = server_id_in for update"
    PL/pgSQL function lock_counts(numeric) line 16 at SQL statement
    SQL statement "SELECT lock_counts(server_id_in)"
    PL/pgSQL function delete_server(numeric) line 27 at PERFORM
2017-05-30 10:39:23.471 EDT STATEMENT: select * from delete_server($1) as result
2017-05-30 10:39:23.471 EDT ERROR: current transaction is aborted, commands ignored until end of transaction block
2017-05-30 10:39:23.471 EDT STATEMENT: select 'c3p0 ping' from dual

I have been registering 4 systems in a loop using an activation key (into a tiny custom channel) and deleting a big bunch of previously registered profiles (a mix of username/password and activation-key registrations). I forgot to mention that the activation key included a system group.
===

Comments from the ensuing investigation:

=== Jan Hutař 2017-06-01 06:52:37 EDT

DB log:

2017-05-31 05:29:12.282 EDT ERROR: deadlock detected
2017-05-31 05:29:12.282 EDT DETAIL: Process 6997 waits for ShareLock on transaction 651256; blocked by process 6658.
    Process 6658 waits for ShareLock on transaction 651261; blocked by process 6997.
    Process 6997: SELECT rhn_channel.unsubscribe_server(1000017821, 169, 0)
    Process 6658: select * from rhn_channel.unsubscribe_server($1, $2, 0) as result
2017-05-31 05:29:12.282 EDT HINT: See server log for query details.
2017-05-31 05:29:12.282 EDT CONTEXT: while locking tuple (0,2) in relation "rhnservergroup"
    SQL statement "select sg.created from rhnservergroup sg inner join rhnserver s on s.org_id = sg.org_id where s.id = server_id_in for update"
    PL/pgSQL function lock_counts(numeric) line 16 at SQL statement
    SQL statement "SELECT lock_counts(server_id_in)"
    PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
2017-05-31 05:29:12.282 EDT STATEMENT: SELECT rhn_channel.unsubscribe_server(1000017821, 169, 0)

and the tomcat log is (not sure if all of this is relevant, but this is what the timestamps say):

2017-05-31 05:29:11,281 [TP-Processor10] ERROR com.redhat.rhn.frontend.xmlrpc.BaseHandler - Error calling method:
java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor789.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
    at java.lang.reflect.Method.invoke(Method.java:508)
    at com.redhat.rhn.frontend.xmlrpc.BaseHandler.invoke(BaseHandler.java:174)
    at redstone.xmlrpc.XmlRpcDispatcher.dispatch(XmlRpcDispatcher.java:123)
    at com.redhat.rhn.frontend.xmlrpc.RhnXmlRpcServer.execute(RhnXmlRpcServer.java:54)
    at com.redhat.rhn.frontend.xmlrpc.XmlRpcServlet.doPost(XmlRpcServlet.java:162)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:643)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:723)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.opensymphony.sitemesh.webapp.SiteMeshFilter.obtainContent(SiteMeshFilter.java:129)
    at com.opensymphony.sitemesh.webapp.SiteMeshFilter.doFilter(SiteMeshFilter.java:77)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.redhat.rhn.frontend.servlets.LocalizedEnvironmentFilter.doFilter(LocalizedEnvironmentFilter.java:67)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.redhat.rhn.frontend.servlets.EnvironmentFilter.doFilter(EnvironmentFilter.java:101)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.redhat.rhn.frontend.servlets.SessionFilter.doFilter(SessionFilter.java:57)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.redhat.rhn.frontend.servlets.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:97)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:299)
    at org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:190)
    at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291)
    at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:769)
    at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:698)
    at org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:891)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:690)
    at java.lang.Thread.run(Thread.java:785)
Caused by: com.redhat.rhn.common.db.WrappedSQLException: ERROR: deadlock detected
  Detail: Process 6611 waits for ShareLock on transaction 651256; blocked by process 6658.
    Process 6658 waits for AccessExclusiveLock on tuple (1,10) of relation 58289 of database 16384; blocked by process 6997.
    Process 6997 waits for ShareLock on transaction 651259; blocked by process 6611.
  Hint: See server log for query details.
  Where: while locking tuple (0,2) in relation "rhnservergroup"
    SQL statement "select sg.created from rhnservergroup sg inner join rhnserver s on s.org_id = sg.org_id where s.id = server_id_in for update"
    PL/pgSQL function lock_counts(numeric) line 16 at SQL statement
    SQL statement "SELECT lock_counts(server_id_in)"
    PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
    at com.redhat.rhn.common.translation.SqlExceptionTranslator.postgreSqlException(SqlExceptionTranslator.java:54)
    at com.redhat.rhn.common.translation.SqlExceptionTranslator.sqlException(SqlExceptionTranslator.java:44)
    at com.redhat.rhn.common.db.datasource.CachedStatement.executeCallable(CachedStatement.java:487)
    at com.redhat.rhn.common.db.datasource.CallableMode.execute(CallableMode.java:39)
    at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1557)
    at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1531)
    at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1488)
    at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1449)
    at com.redhat.rhn.manager.system.UpdateChildChannelsCommand.subscribeToNewChannels(UpdateChildChannelsCommand.java:156)
    at com.redhat.rhn.manager.system.UpdateChildChannelsCommand.store(UpdateChildChannelsCommand.java:103)
    at com.redhat.rhn.frontend.xmlrpc.system.SystemHandler.setChildChannels(SystemHandler.java:381)
    ... 40 more
Caused by: org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 6611 waits for ShareLock on transaction 651256; blocked by process 6658.
    Process 6658 waits for AccessExclusiveLock on tuple (1,10) of relation 58289 of database 16384; blocked by process 6997.
    Process 6997 waits for ShareLock on transaction 651259; blocked by process 6611.
  Hint: See server log for query details.
  Where: while locking tuple (0,2) in relation "rhnservergroup"
    SQL statement "select sg.created from rhnservergroup sg inner join rhnserver s on s.org_id = sg.org_id where s.id = server_id_in for update"
    PL/pgSQL function lock_counts(numeric) line 16 at SQL statement
    SQL statement "SELECT lock_counts(server_id_in)"
    PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2094)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1827)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:508)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:384)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:377)
    at com.mchange.v2.c3p0.impl.NewProxyCallableStatement.execute(NewProxyCallableStatement.java:2706)
    at com.redhat.rhn.common.db.NamedPreparedStatement.execute(NamedPreparedStatement.java:117)
    at com.redhat.rhn.common.db.datasource.CachedStatement.executeCallable(CachedStatement.java:474)
    ... 48 more
2017-05-31 05:29:11,282 [TP-Processor10] ERROR com.redhat.rhn.frontend.xmlrpc.BaseHandler - Caused by: com.redhat.rhn.common.db.WrappedSQLException: ERROR: deadlock detected
  Detail: Process 6611 waits for ShareLock on transaction 651256; blocked by process 6658.
    Process 6658 waits for AccessExclusiveLock on tuple (1,10) of relation 58289 of database 16384; blocked by process 6997.
    Process 6997 waits for ShareLock on transaction 651259; blocked by process 6611.
  Hint: See server log for query details.
  Where: while locking tuple (0,2) in relation "rhnservergroup"
    SQL statement "select sg.created from rhnservergroup sg inner join rhnserver s on s.org_id = sg.org_id where s.id = server_id_in for update"
    PL/pgSQL function lock_counts(numeric) line 16 at SQL statement
    SQL statement "SELECT lock_counts(server_id_in)"
    PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
    at com.redhat.rhn.common.translation.SqlExceptionTranslator.postgreSqlException(SqlExceptionTranslator.java:54)
    at com.redhat.rhn.common.translation.SqlExceptionTranslator.sqlException(SqlExceptionTranslator.java:44)
    at com.redhat.rhn.common.db.datasource.CachedStatement.executeCallable(CachedStatement.java:487)
    at com.redhat.rhn.common.db.datasource.CallableMode.execute(CallableMode.java:39)
    at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1557)
    at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1531)
    at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1488)
    at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1449)
    at com.redhat.rhn.manager.system.UpdateChildChannelsCommand.subscribeToNewChannels(UpdateChildChannelsCommand.java:156)
    at com.redhat.rhn.manager.system.UpdateChildChannelsCommand.store(UpdateChildChannelsCommand.java:103)
    at com.redhat.rhn.frontend.xmlrpc.system.SystemHandler.setChildChannels(SystemHandler.java:381)
    at sun.reflect.GeneratedMethodAccessor789.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
    at java.lang.reflect.Method.invoke(Method.java:508)
    at com.redhat.rhn.frontend.xmlrpc.BaseHandler.invoke(BaseHandler.java:174)
    at redstone.xmlrpc.XmlRpcDispatcher.dispatch(XmlRpcDispatcher.java:123)
    at com.redhat.rhn.frontend.xmlrpc.RhnXmlRpcServer.execute(RhnXmlRpcServer.java:54)
    at com.redhat.rhn.frontend.xmlrpc.XmlRpcServlet.doPost(XmlRpcServlet.java:162)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:643)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:723)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.opensymphony.sitemesh.webapp.SiteMeshFilter.obtainContent(SiteMeshFilter.java:129)
    at com.opensymphony.sitemesh.webapp.SiteMeshFilter.doFilter(SiteMeshFilter.java:77)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.redhat.rhn.frontend.servlets.LocalizedEnvironmentFilter.doFilter(LocalizedEnvironmentFilter.java:67)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.redhat.rhn.frontend.servlets.EnvironmentFilter.doFilter(EnvironmentFilter.java:101)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.redhat.rhn.frontend.servlets.SessionFilter.doFilter(SessionFilter.java:57)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.redhat.rhn.frontend.servlets.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:97)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:299)
    at org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:190)
    at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291)
    at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:769)
    at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:698)
    at org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:891)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:690)
    at java.lang.Thread.run(Thread.java:785)
Caused by: org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 6611 waits for ShareLock on transaction 651256; blocked by process 6658.
    Process 6658 waits for AccessExclusiveLock on tuple (1,10) of relation 58289 of database 16384; blocked by process 6997.
    Process 6997 waits for ShareLock on transaction 651259; blocked by process 6611.
  Hint: See server log for query details.
  Where: while locking tuple (0,2) in relation "rhnservergroup"
    SQL statement "select sg.created from rhnservergroup sg inner join rhnserver s on s.org_id = sg.org_id where s.id = server_id_in for update"
    PL/pgSQL function lock_counts(numeric) line 16 at SQL statement
    SQL statement "SELECT lock_counts(server_id_in)"
    PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2094)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1827)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:508)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:384)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:377)
    at com.mchange.v2.c3p0.impl.NewProxyCallableStatement.execute(NewProxyCallableStatement.java:2706)
    at com.redhat.rhn.common.db.NamedPreparedStatement.execute(NamedPreparedStatement.java:117)
    at com.redhat.rhn.common.db.datasource.CachedStatement.executeCallable(CachedStatement.java:474)
    ... 48 more
2017-05-31 05:29:11,282 [TP-Processor10] ERROR com.redhat.rhn.frontend.xmlrpc.BaseHandler - Rolling back transaction
May 31, 2017 5:29:11 AM redstone.xmlrpc.XmlRpcDispatcher writeError
WARNING: redstone.xmlrpc.XmlRpcFault: unhandled internal exception: ERROR: deadlock detected
  Detail: Process 6611 waits for ShareLock on transaction 651256; blocked by process 6658.
    Process 6658 waits for AccessExclusiveLock on tuple (1,10) of relation 58289 of database 16384; blocked by process 6997.
    Process 6997 waits for ShareLock on transaction 651259; blocked by process 6611.
  Hint: See server log for query details.
  Where: while locking tuple (0,2) in relation "rhnservergroup"
    SQL statement "select sg.created from rhnservergroup sg inner join rhnserver s on s.org_id = sg.org_id where s.id = server_id_in for update"
    PL/pgSQL function lock_counts(numeric) line 16 at SQL statement
    SQL statement "SELECT lock_counts(server_id_in)"
    PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
===

=== Grant Gainey 2017-06-01 16:24:23 EDT

Using the scripts from bug 1388160, everything is fine - until I start using SSM to delete profiles at the same time. Investigating the SSM-delete-servers codepath, to see if there is a place where Java is locking out of order.

Jan - ignore the "can we get 5.8 to look at 9.2?" from #c17 - instead, can you point your clients at a *5.7* satellite and re-run your reproducer? If this happens against 5.7, then it is not this BZ and not a 5.8 regression - it would still need its own BZ, though. In the meantime, I will attempt the same, assuming I can get an env to work before I have to leave tonight.
===

=== Grant Gainey 2017-06-01 16:32:28 EDT

I take it back - it wasn't SSM-delete that caused the deadlocks, it was registering new systems (which goes down the python path, and which I could very easily believe is not locking everything in the right order). Investigation continues.
===

=== Grant Gainey 2017-06-02 11:52:30 EDT

Current status:

* can reproduce frequent deadlocks on 5.7 using the scripts from bug 1388160 alone, on 4 clients
* explicitly locking the reactivation path in python appears to address these
* adding vanilla rhnreg_ks to the 4-clients-deadlocks.sh run causes ~5-7% of registrations to deadlock
* that currently appears to be caused by something on the rhn_server.snapshot_server() path (all systems have Provisioning and are snapshotting at registration time)

Current thoughts:

* This is not a 5.8 regression
* In fact, things appear to be *better* on 5.8 than 5.7. Somehow.
* spacewalk-backend/python appears to be the problem path
* explicit locking of the reactivation-key path helps
* we're not quite done yet

Investigation continues. Suggest that the current state is that we do *not* hold up 5.8's schedule for this issue, since it is reproducible in 5.7 (and possibly in 5.6, based on a parallel investigation).
===

Current Results:

* Locking is needed, in python, on the reactivation-key path
* Locking is needed, in the stored procedure, in rhn_server.snapshot_server()

Doing both of these things resolves the deadlocks observed during this investigation, on both 5.8 and 5.7.
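For illustration only, a minimal sketch of what the stored-procedure side of that fix means. The reason_in parameter and the simplified body are assumptions made for the sketch (the shipped rhn_server.snapshot_server() takes more arguments and does the real snapshot work); the point is just that lock_counts() is PERFORMed first, so this path takes the rhnServerGroup row locks at the same place, and in the same order, as subscribe_server(), unsubscribe_server() and delete_server() already do:

-- Illustrative sketch, not the shipped patch.
create or replace function rhn_server.snapshot_server(
    server_id_in in numeric,
    reason_in    in varchar
) returns void as $$
begin
    -- Take the per-org rhnServerGroup row locks up front, exactly as the
    -- channel subscribe/unsubscribe and delete_server paths already do,
    -- instead of acquiring them implicitly part-way through the snapshot.
    perform lock_counts(server_id_in);

    -- ... existing snapshot logic (groups, channels, packages, config
    -- channels, etc.) would follow here, running with the locks held ...
end;
$$ language plpgsql;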
Created attachment 1285092 [details]
Proposed patch

The proposed patch teaches the python reactivation-key path and rhn_server.snapshot_server() to explicitly call lock_counts().
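To make the lock-ordering argument concrete: lock_counts() boils down to the row-locking query quoted in the deadlock CONTEXT above. A hedged sketch follows; the query is verbatim from the logs (with :server_id_in standing in for the function's numeric argument), while the interleaving described in the comments is a hypothetical illustration, not a trace of an actual run:

-- The query inside lock_counts(), as quoted in the deadlock CONTEXT above.
select sg.created
  from rhnservergroup sg
       inner join rhnserver s on s.org_id = sg.org_id
 where s.id = :server_id_in
   for update;

-- Hypothetical interleaving of the kind seen in the logs: without a common
-- entry point, transaction A (a subscribe/snapshot path) locks a rhnServer
-- row first and then waits on a rhnServerGroup row, while transaction B
-- (a delete_server path) has locked that rhnServerGroup row and is waiting
-- on A's rhnServer row - each waits on the other, and PostgreSQL aborts one.
-- When every path runs lock_counts() before touching anything else, the
-- rhnServerGroup rows for the server's org are always the first locks taken,
-- so one transaction simply queues behind the other instead of deadlocking.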
Re-verified on spacewalk-backend-2.5.3-151.el6sat and satellite-schema-5.8.0.33-1.el6sat, against both Oracle and PG servers.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2645