+++ This bug was initially created as a clone of Bug #1480367 +++

Description of problem:

BZ to track the deadlock work occurring on the activation-key and server-snapshot path during registration:

2017-08-09 15:35:30.511 EDT ERROR: deadlock detected
2017-08-09 15:35:30.511 EDT DETAIL: Process 12475 waits for ShareLock on transaction 949432; blocked by process 12488.
Process 12488 waits for ShareLock on transaction 949435; blocked by process 12475.
Process 12475: SELECT rhn_server.snapshot_server(1000012238, 'Package profile changed')
Process 12488: SELECT rhn_entitlements.entitle_server(1000012239, 'enterprise_entitled')
2017-08-09 15:35:30.511 EDT HINT: See server log for query details.
2017-08-09 15:35:30.511 EDT CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."rhnservergroup" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x"
SQL statement "insert into rhnSnapshotServerGroup (snapshot_id, server_group_id) ( select snapshot_id_v, sgm.server_group_id from rhnServerGroupMembers sgm where sgm.server_id = server_id_in )"
PL/pgSQL function rhn_server.snapshot_server(numeric,character varying) line 37 at SQL statement
2017-08-09 15:35:30.511 EDT STATEMENT: SELECT rhn_server.snapshot_server(1000012238, 'Package profile changed')

Turning off snapshots removes the deadlock, but current work is looking at adding rhnSnapshotServerGroup to lock_counts().

Currently seen on Satellite 5.7, but we would expect this to affect 5.8 too.

--- Additional comment from Grant Gainey on 2017-08-10 16:31 EDT ---

After a lot of digging, it appears that the act of storing a server snapshot locks rows in rhnservergroup as a side effect. This caused DB locks to be acquired out of order, and hence the deadlock.

The attached file adds rhnSnapshotServerGroup to the list of locks managed by lock_counts(), and enforces lock acquisition in the right order.
It should address the deadlocks between snapshot_server() and entitle_server().

Apply by downloading lock_counts.sql onto the Sat5 instance and running the following command:

# spacewalk-sql --select-mode lock_counts.sql

--- Additional comment from Tomas Lestach on 2017-08-14 09:26:45 EDT ---

Grant, does Comment 1 mean we have a fix? Or is there anything else needed?

--- Additional comment from Grant Gainey on 2017-08-14 10:19:09 EDT ---

It means we have a *potential* fix, and SEG is working with the customer to apply the patch and see what happens :)
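The core idea behind the lock_counts() change — every transaction acquires its row locks in one fixed global order before doing any work, so no wait cycle between snapshot_server() and entitle_server() is possible — can be sketched in Python. This is an illustration only, not the Satellite code; the lock names are stand-ins for row locks on the tables involved:

```python
import threading

# Stand-ins for row locks on rhnSnapshotServerGroup and rhnServerGroup
# (names illustrative only, not actual Satellite objects).
lock_snapshot_group = threading.Lock()
lock_server_group = threading.Lock()

# The deadlock arose because snapshot_server() and entitle_server() each
# took these locks in a different order.  The fix enforces one global
# order: lock_counts() locks rows in a fixed sequence before any
# side-effect locking can occur.
LOCK_ORDER = [lock_snapshot_group, lock_server_group]

def transaction(work):
    """Acquire all locks in the global order, run the work, release."""
    for lk in LOCK_ORDER:
        lk.acquire()
    try:
        return work()
    finally:
        for lk in reversed(LOCK_ORDER):
            lk.release()

results = []
t1 = threading.Thread(target=transaction, args=(lambda: results.append("snapshot"),))
t2 = threading.Thread(target=transaction, args=(lambda: results.append("entitle"),))
t1.start(); t2.start()
t1.join(); t2.join()
# Both threads finish: with a single acquisition order, a cycle of
# transactions waiting on each other cannot form.
```

With inconsistent ordering (one thread taking the locks reversed), the same two workers could each hold one lock and wait forever on the other — which is exactly the ShareLock cycle PostgreSQL reports above.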
Testing against 5.8 and postgresql95, using the updated lock_counts.sql from https://bugzilla.redhat.com/show_bug.cgi?id=1480367#c1 , shows a greatly reduced frequency of deadlocks - but they still happen. Investigation for the 5.8 fix continues.
After updating the test 5.8 instance with the fixes from the soon-to-be-released https://bugzilla.redhat.com/show_bug.cgi?id=1458880 *and* the lock_counts change from 1480367#c1 , the situation is improved dramatically - no deadlocks after two hours of testing with the previous recreators.

A separate stress-script that forces setBaseChannel API calls from multiple sources, as fast as we can call them, was able to produce 2 deadlocks in 30 minutes of stress. The tracebacks generated involve snapshots, but as (separately) invoked from SystemHandler.setBaseChannel(). This *may* be https://bugzilla.redhat.com/show_bug.cgi?id=1469744 (which has not yet had its own reproducer).

NOTE: I can only recreate this, very occasionally, when stressing a 5.8 system that has had both the lock_counts fix for 1480367 *and* the changes for 1458880. The stress-script I was running was 4 instances of the script from https://bugzilla.redhat.com/show_bug.cgi?id=1469744#c6

The postgresql tracebacks in the setBaseChannel case are:

===
2017-08-30 15:06:18.906 EDT ERROR: deadlock detected
2017-08-30 15:06:18.906 EDT DETAIL: Process 29117 waits for ShareLock on transaction 246381; blocked by process 29115.
Process 29115 waits for ShareLock on transaction 246377; blocked by process 29117.
Process 29117: select * from rhn_channel.unsubscribe_server($1, $2, 0) as result
Process 29115: select * from rhn_channel.unsubscribe_server($1, $2, 0) as result
2017-08-30 15:06:18.906 EDT HINT: See server log for query details.
2017-08-30 15:06:18.906 EDT CONTEXT: while locking tuple (0,1) in relation "rhnsnapshotservergroup"
SQL statement "select ssg.snapshot_id from rhnSnapshotServerGroup ssg inner join rhnservergroup sg on sg.id = ssg.server_group_id inner join rhnserver s on s.org_id = sg.org_id where s.id = server_id_in for update"
PL/pgSQL function lock_counts(numeric) line 17 at SQL statement
SQL statement "SELECT lock_counts(server_id_in)"
PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
2017-08-30 15:06:18.906 EDT STATEMENT: select * from rhn_channel.unsubscribe_server($1, $2, 0) as result
2017-08-30 15:06:18.917 EDT ERROR: current transaction is aborted, commands ignored until end of transaction block
2017-08-30 15:06:18.917 EDT STATEMENT: select 'c3p0 ping' from dual
===

and the matching tomcat tracebacks from catalina.out are here:

===
2017-08-30 15:06:18,926 [TP-Processor5] ERROR com.redhat.rhn.frontend.xmlrpc.BaseHandler - Error calling method:
java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor676.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
    at java.lang.reflect.Method.invoke(Method.java:508)
    at com.redhat.rhn.frontend.xmlrpc.BaseHandler.invoke(BaseHandler.java:174)
    at redstone.xmlrpc.XmlRpcDispatcher.dispatch(XmlRpcDispatcher.java:123)
    at com.redhat.rhn.frontend.xmlrpc.RhnXmlRpcServer.execute(RhnXmlRpcServer.java:54)
    at com.redhat.rhn.frontend.xmlrpc.XmlRpcServlet.doPost(XmlRpcServlet.java:162)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:643)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:723)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.opensymphony.sitemesh.webapp.SiteMeshFilter.obtainContent(SiteMeshFilter.java:129)
    at com.opensymphony.sitemesh.webapp.SiteMeshFilter.doFilter(SiteMeshFilter.java:77)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.redhat.rhn.frontend.servlets.LocalizedEnvironmentFilter.doFilter(LocalizedEnvironmentFilter.java:67)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.redhat.rhn.frontend.servlets.EnvironmentFilter.doFilter(EnvironmentFilter.java:101)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.redhat.rhn.frontend.servlets.SessionFilter.doFilter(SessionFilter.java:57)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.redhat.rhn.frontend.servlets.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:97)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:299)
    at org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:190)
    at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291)
    at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:769)
    at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:698)
    at org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:891)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:690)
    at java.lang.Thread.run(Thread.java:785)
Caused by: com.redhat.rhn.common.db.WrappedSQLException: ERROR: deadlock detected
  Detail: Process 29117 waits for ShareLock on transaction 246381; blocked by process 29115.
Process 29115 waits for ShareLock on transaction 246377; blocked by process 29117.
  Hint: See server log for query details.
  Where: while locking tuple (0,1) in relation "rhnsnapshotservergroup"
SQL statement "select ssg.snapshot_id from rhnSnapshotServerGroup ssg inner join rhnservergroup sg on sg.id = ssg.server_group_id inner join rhnserver s on s.org_id = sg.org_id where s.id = server_id_in for update"
PL/pgSQL function lock_counts(numeric) line 17 at SQL statement
SQL statement "SELECT lock_counts(server_id_in)"
PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
    at com.redhat.rhn.common.translation.SqlExceptionTranslator.postgreSqlException(SqlExceptionTranslator.java:54)
    at com.redhat.rhn.common.translation.SqlExceptionTranslator.sqlException(SqlExceptionTranslator.java:44)
    at com.redhat.rhn.common.db.datasource.CachedStatement.executeCallable(CachedStatement.java:487)
    at com.redhat.rhn.common.db.datasource.CallableMode.execute(CallableMode.java:39)
    at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1557)
    at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1531)
    at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1488)
    at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1449)
    at com.redhat.rhn.manager.system.UpdateChildChannelsCommand.unsubscribeFromOldChannels(UpdateChildChannelsCommand.java:194)
    at com.redhat.rhn.manager.system.UpdateChildChannelsCommand.store(UpdateChildChannelsCommand.java:107)
    at com.redhat.rhn.manager.system.UpdateBaseChannelCommand.store(UpdateBaseChannelCommand.java:101)
    at com.redhat.rhn.frontend.xmlrpc.system.SystemHandler.setBaseChannel(SystemHandler.java:417)
    ... 40 more
Caused by: org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 29117 waits for ShareLock on transaction 246381; blocked by process 29115.
Process 29115 waits for ShareLock on transaction 246377; blocked by process 29117.
  Hint: See server log for query details.
  Where: while locking tuple (0,1) in relation "rhnsnapshotservergroup"
SQL statement "select ssg.snapshot_id from rhnSnapshotServerGroup ssg inner join rhnservergroup sg on sg.id = ssg.server_group_id inner join rhnserver s on s.org_id = sg.org_id where s.id = server_id_in for update"
PL/pgSQL function lock_counts(numeric) line 17 at SQL statement
SQL statement "SELECT lock_counts(server_id_in)"
PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2094)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1827)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:508)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:384)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:377)
    at com.mchange.v2.c3p0.impl.NewProxyCallableStatement.execute(NewProxyCallableStatement.java:2706)
    at com.redhat.rhn.common.db.NamedPreparedStatement.execute(NamedPreparedStatement.java:117)
    at com.redhat.rhn.common.db.datasource.CachedStatement.executeCallable(CachedStatement.java:474)
    ... 49 more
===
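The stress pattern described above - several script instances hammering the setBaseChannel XML-RPC call concurrently until a deadlock surfaces - can be sketched as a small Python harness. This is a hypothetical illustration, not the actual script from 1469744#c6; the Satellite URL, credentials, and ID lists are placeholders:

```python
import threading
import xmlrpc.client

# Placeholder connection details -- substitute the real Satellite values.
SAT_URL = "http://satellite.example.com/rpc/api"

def stress(call, workers=4, iterations=50):
    """Run call(worker_id) concurrently from several threads, mimicking
    the 4 parallel script instances, and collect any errors raised
    (e.g. a 'deadlock detected' fault from the server)."""
    errors = []
    def worker(wid):
        for _ in range(iterations):
            try:
                call(wid)
            except Exception as exc:
                errors.append((wid, exc))
    threads = [threading.Thread(target=worker, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors

# Against a real server the call would look roughly like this
# (untested sketch; SERVER_IDS and CHANNEL_LABEL are placeholders):
#   client = xmlrpc.client.ServerProxy(SAT_URL)
#   key = client.auth.login("admin", "password")
#   errs = stress(lambda wid: client.system.setBaseChannel(
#       key, SERVER_IDS[wid], CHANNEL_LABEL))
```

Any collected errors whose message contains "deadlock detected" would correspond to the PostgreSQL tracebacks shown above.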
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:3445