Bug 1481848 - [5.8] rhn_server.snapshot_server deadlock on Activation Key path
Summary: [5.8] rhn_server.snapshot_server deadlock on Activation Key path
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite 5
Classification: Red Hat
Component: Server
Version: 580
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
Assignee: Grant Gainey
QA Contact: Martin Korbel
URL:
Whiteboard:
Depends On: 1480367
Blocks: sat58-errata sat5-deadlock
TreeView+ depends on / blocked
 
Reported: 2017-08-15 21:13 UTC by Grant Gainey
Modified: 2017-12-13 07:59 UTC (History)
12 users (show)

Fixed In Version: spacewalk-schema-2.5.1-54-sat satellite-schema-5.8.0.38-1-sat spacewalk-backend-2.5.3-154-sat
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1480367
Environment:
Last Closed: 2017-12-13 07:59:26 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:3445 0 normal SHIPPED_LIVE Red Hat Satellite 5.8.0 bug fix update 2017-12-12 19:00:37 UTC

Description Grant Gainey 2017-08-15 21:13:45 UTC
+++ This bug was initially created as a clone of Bug #1480367 +++

Description of problem:

BZ to track the deadlock work that is occurring on the activation key and server snapshot path during registration: 

2017-08-09 15:35:30.511 EDT ERROR:  deadlock detected
2017-08-09 15:35:30.511 EDT DETAIL:  Process 12475 waits for ShareLock on transaction 949432; blocked by process 12488.
        Process 12488 waits for ShareLock on transaction 949435; blocked by process 12475.
        Process 12475: SELECT rhn_server.snapshot_server(1000012238, 'Package profile changed')
        Process 12488: SELECT rhn_entitlements.entitle_server(1000012239, 'enterprise_entitled')
2017-08-09 15:35:30.511 EDT HINT:  See server log for query details.
2017-08-09 15:35:30.511 EDT CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."rhnservergroup" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x"
        SQL statement "insert into rhnSnapshotServerGroup (snapshot_id, server_group_id) (
                        select        snapshot_id_v, sgm.server_group_id
                        from        rhnServerGroupMembers sgm
                        where        sgm.server_id = server_id_in
                )"
        PL/pgSQL function rhn_server.snapshot_server(numeric,character varying) line 37 at SQL statement
2017-08-09 15:35:30.511 EDT STATEMENT:  SELECT rhn_server.snapshot_server(1000012238, 'Package profile changed')


Turning off snapshots removes the deadlock but current work is looking at rhnServerSnapshopGroup within lock_counts()

Currently seeing on Satellite 5.7 but would expect this to affect 5.8 too.

--- Additional comment from Grant Gainey on 2017-08-10 16:31 EDT ---

After a lot of digging, it appears that the act of storing a server-snapshot is locking rows in rhnservergroup as a side-effect. This caused DB locks to be acquired out-of-order, and hence the deadlock.

This file adds the rhnSnapshotServerGroup to the list of locks managed by lock_counts(), and enforces lock-acquisition in the right order. It should address the deadlocks between snapshot_server() and entitle_server()

Apply by downloading lock_counts.sql onto the Sat5 instance, and running the foolowing command:

# spacewalk-sql --select-mode lock_counts.sql

--- Additional comment from Tomas Lestach on 2017-08-14 09:26:45 EDT ---

Grant, does Comment 1 mean we have a fix? Or is there anything else needed?

--- Additional comment from Grant Gainey on 2017-08-14 10:19:09 EDT ---

It means we have a *potential* fix, and SEG is working with the customer to apply the patch and see what happens :)

Comment 1 Grant Gainey 2017-08-30 15:56:28 UTC
Testing against 5.8 and postgresql95, using the updated lock_count.sql from https://bugzilla.redhat.com/show_bug.cgi?id=1480367#c1 , shows a greatly-reduced frequency of deadlocks - but they still happen. Investigation for the 5.8 fix continues.

Comment 2 Grant Gainey 2017-08-30 19:40:07 UTC
After updating the test 5.8 instance with the fixes from the soon-to-be-released https://bugzilla.redhat.com/show_bug.cgi?id=1458880 *and* the lock_counts change from 1480367#c1 , situation is improved dramatically - no deadlocks after two hours of testing with previous recreators.

A separate stress-script that forces setBaseChannel API calls from multiple sources as fast as we can call it, was able to produce 2 deadlocks in 30 minutes of stress. The tracebacks generated involve snapshots, but as (separately) invoked from SystemHandler.setBaseChannel(). This *may* be the https://bugzilla.redhat.com/show_bug.cgi?id=1469744 (which has not yet had its own reproducer).

NOTE: I can only create this, very occasionally, when stressing a 5.8 system that has had both the lock_counts fix for 1480367 *and* the changes for 1458880. The stress-script I was running was 4 instances of the script from https://bugzilla.redhat.com/show_bug.cgi?id=1469744#c6

The postgresql tracebacks in the setBaseChannel case are
===
2017-08-30 15:06:18.906 EDT ERROR:  deadlock detected
2017-08-30 15:06:18.906 EDT DETAIL:  Process 29117 waits for ShareLock on transaction 246381; blocked by process 29115.
    Process 29115 waits for ShareLock on transaction 246377; blocked by process 29117.
    Process 29117: select * from rhn_channel.unsubscribe_server($1, $2, 0) as result
    Process 29115: select * from rhn_channel.unsubscribe_server($1, $2, 0) as result
2017-08-30 15:06:18.906 EDT HINT:  See server log for query details.
2017-08-30 15:06:18.906 EDT CONTEXT:  while locking tuple (0,1) in relation "rhnsnapshotservergroup"
    SQL statement "select ssg.snapshot_id                          from rhnSnapshotServerGroup ssg
               inner join rhnservergroup sg on sg.id = ssg.server_group_id
               inner join rhnserver s on s.org_id = sg.org_id
         where s.id = server_id_in
           for update"
    PL/pgSQL function lock_counts(numeric) line 17 at SQL statement
    SQL statement "SELECT lock_counts(server_id_in)"
    PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
2017-08-30 15:06:18.906 EDT STATEMENT:  select * from rhn_channel.unsubscribe_server($1, $2, 0) as result
2017-08-30 15:06:18.917 EDT ERROR:  current transaction is aborted, commands ignored until end of transaction block
2017-08-30 15:06:18.917 EDT STATEMENT:  select 'c3p0 ping' from dual
===

and the matching tomcat tracebacks from catalina.out are here:
===
2017-08-30 15:06:18,926 [TP-Processor5] ERROR com.redhat.rhn.frontend.xmlrpc.BaseHandler - Error calling method:
java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor676.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
    at java.lang.reflect.Method.invoke(Method.java:508)
    at com.redhat.rhn.frontend.xmlrpc.BaseHandler.invoke(BaseHandler.java:174)
    at redstone.xmlrpc.XmlRpcDispatcher.dispatch(XmlRpcDispatcher.java:123)
    at com.redhat.rhn.frontend.xmlrpc.RhnXmlRpcServer.execute(RhnXmlRpcServer.java:54)
    at com.redhat.rhn.frontend.xmlrpc.XmlRpcServlet.doPost(XmlRpcServlet.java:162)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:643)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:723)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.opensymphony.sitemesh.webapp.SiteMeshFilter.obtainContent(SiteMeshFilter.java:129)
    at com.opensymphony.sitemesh.webapp.SiteMeshFilter.doFilter(SiteMeshFilter.java:77)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.redhat.rhn.frontend.servlets.LocalizedEnvironmentFilter.doFilter(LocalizedEnvironmentFilter.java:67)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.redhat.rhn.frontend.servlets.EnvironmentFilter.doFilter(EnvironmentFilter.java:101)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.redhat.rhn.frontend.servlets.SessionFilter.doFilter(SessionFilter.java:57)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at com.redhat.rhn.frontend.servlets.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:97)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:299)
    at org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:190)
    at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291)
    at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:769)
    at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:698)
    at org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:891)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:690)
    at java.lang.Thread.run(Thread.java:785)
Caused by:
com.redhat.rhn.common.db.WrappedSQLException: ERROR: deadlock detected
  Detail: Process 29117 waits for ShareLock on transaction 246381; blocked by process 29115.
Process 29115 waits for ShareLock on transaction 246377; blocked by process 29117.
  Hint: See server log for query details.
  Where: while locking tuple (0,1) in relation "rhnsnapshotservergroup"
SQL statement "select ssg.snapshot_id                          from rhnSnapshotServerGroup ssg
           inner join rhnservergroup sg on sg.id = ssg.server_group_id
           inner join rhnserver s on s.org_id = sg.org_id
     where s.id = server_id_in
       for update"
PL/pgSQL function lock_counts(numeric) line 17 at SQL statement
SQL statement "SELECT lock_counts(server_id_in)"
PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
    at com.redhat.rhn.common.translation.SqlExceptionTranslator.postgreSqlException(SqlExceptionTranslator.java:54)
    at com.redhat.rhn.common.translation.SqlExceptionTranslator.sqlException(SqlExceptionTranslator.java:44)
    at com.redhat.rhn.common.db.datasource.CachedStatement.executeCallable(CachedStatement.java:487)
    at com.redhat.rhn.common.db.datasource.CallableMode.execute(CallableMode.java:39)
    at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1557)
    at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1531)
    at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1488)
    at com.redhat.rhn.manager.system.SystemManager.unsubscribeServerFromChannel(SystemManager.java:1449)
    at com.redhat.rhn.manager.system.UpdateChildChannelsCommand.unsubscribeFromOldChannels(UpdateChildChannelsCommand.java:194)
    at com.redhat.rhn.manager.system.UpdateChildChannelsCommand.store(UpdateChildChannelsCommand.java:107)
    at com.redhat.rhn.manager.system.UpdateBaseChannelCommand.store(UpdateBaseChannelCommand.java:101)
    at com.redhat.rhn.frontend.xmlrpc.system.SystemHandler.setBaseChannel(SystemHandler.java:417)
    ... 40 more
Caused by:
org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 29117 waits for ShareLock on transaction 246381; blocked by process 29115.
Process 29115 waits for ShareLock on transaction 246377; blocked by process 29117.
  Hint: See server log for query details.
  Where: while locking tuple (0,1) in relation "rhnsnapshotservergroup"
SQL statement "select ssg.snapshot_id                          from rhnSnapshotServerGroup ssg
           inner join rhnservergroup sg on sg.id = ssg.server_group_id
           inner join rhnserver s on s.org_id = sg.org_id
     where s.id = server_id_in
       for update"
PL/pgSQL function lock_counts(numeric) line 17 at SQL statement
SQL statement "SELECT lock_counts(server_id_in)"
PL/pgSQL function rhn_channel.unsubscribe_server(numeric,numeric,numeric,numeric,numeric,numeric) line 19 at PERFORM
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2094)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1827)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:508)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:384)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:377)
    at com.mchange.v2.c3p0.impl.NewProxyCallableStatement.execute(NewProxyCallableStatement.java:2706)
    at com.redhat.rhn.common.db.NamedPreparedStatement.execute(NamedPreparedStatement.java:117)
    at com.redhat.rhn.common.db.datasource.CachedStatement.executeCallable(CachedStatement.java:474)
    ... 49 more
===

Comment 18 errata-xmlrpc 2017-12-13 07:59:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3445


Note You need to log in before you can comment on or make changes to this bug.