Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1350771

Summary: ExclusiveLock held by Candlepin, resulting in INSERTs waiting for a lock and thus a hanging Satellite
Product: [Community] Candlepin (Migrated to Jira) Reporter: Evgeni Golov <egolov>
Component: candlepinAssignee: Filip Nguyen <fnguyen>
Status: CLOSED CURRENTRELEASE QA Contact: Katello QA List <katello-qa-list>
Severity: high Docs Contact:
Priority: high    
Version: 0.9.49CC: bcourt, egolov, fnguyen, mstead, redakkan, skallesh, vrjain
Target Milestone: ---Keywords: Triaged
Target Release: 0.9.49   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: candlepin-0.9.49.17-1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1350834 1351774 1351776 (view as bug list) Environment:
Last Closed: 2016-10-20 20:36:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1350834, 1351774, 1351776    

Description Evgeni Golov 2016-06-28 11:06:33 UTC
Description of problem:
We are running Satellite 6.1.8 which gets reports submitted by three virt-who instances (each in the ballpark of 500 ESXi and 10-15k VMs). At some point the Satellite UI will only return error 500 because Passenger has no more idle threads left.

Digging into the stack it seems the issue is that Candlepin has a few slow SELECT queries and also tries to do INSERTs in another transaction, but that waits for the ExclusiveLock of the SELECT one.

The ExclusiveLock seems never released and thus the INSERT never happens, but Satellite waitis for it, blocking a Passenger thread after another, until everything goes belly up.

PostgreSQL's lock logging reveals this:
LOG:  process 29474 still waiting for ShareLock on transaction 61147756 after 1000.171 ms
CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."cp_consumer" x WHERE "id"::pg_catalog.text OPERATOR(pg_catalog.=) $1::pg_catalog.text FOR SHARE OF x"
STATEMENT:  insert into cp_consumer_checkin (created, updated, checkInTime, consumer_id, id) values ($1, $2, $3, $4, $5)
LOG:  process 18678 still waiting for ShareLock on transaction 61147756 after 1000.157 ms
CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."cp_consumer" x WHERE "id"::pg_catalog.text OPERATOR(pg_catalog.=) $1::pg_catalog.text FOR SHARE OF x"
STATEMENT:  insert into cp_consumer_checkin (created, updated, checkInTime, consumer_id, id) values ($1, $2, $3, $4, $5)
LOG:  process 11839 still waiting for ShareLock on transaction 61147756 after 1000.120 ms
CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."cp_consumer" x WHERE "id"::pg_catalog.text OPERATOR(pg_catalog.=) $1::pg_catalog.text FOR SHARE OF x"
STATEMENT:  insert into cp_consumer_checkin (created, updated, checkInTime, consumer_id, id) values ($1, $2, $3, $4, $5)
LOG:  process 22409 still waiting for ShareLock on transaction 61147756 after 1000.093 ms
CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."cp_consumer" x WHERE "id"::pg_catalog.text OPERATOR(pg_catalog.=) $1::pg_catalog.text FOR SHARE OF x"
STATEMENT:  insert into cp_consumer_checkin (created, updated, checkInTime, consumer_id, id) values ($1, $2, $3, $4, $5)
LOG:  process 30816 still waiting for ShareLock on transaction 61147756 after 1000.099 ms
CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."cp_consumer" x WHERE "id"::pg_catalog.text OPERATOR(pg_catalog.=) $1::pg_catalog.text FOR SHARE OF x"
STATEMENT:  insert into cp_consumer_checkin (created, updated, checkInTime, consumer_id, id) values ($1, $2, $3, $4, $5)


And if we look further in the transaction holding and requesting the lock:
LOG:  process 29474 still waiting for ShareLock on transaction 61147756 after 1000.171 ms
CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."cp_consumer" x WHERE "id"::pg_catalog.text OPERATOR(pg_catalog.=) $1::pg_catalog.text FOR SHARE OF x"
STATEMENT:  insert into cp_consumer_checkin (created, updated, checkInTime, consumer_id, id) values ($1, $2, $3, $4, $5)
LOG:  process 18678 still waiting for ShareLock on transaction 61147756 after 1000.157 ms
CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."cp_consumer" x WHERE "id"::pg_catalog.text OPERATOR(pg_catalog.=) $1::pg_catalog.text FOR SHARE OF x"
STATEMENT:  insert into cp_consumer_checkin (created, updated, checkInTime, consumer_id, id) values ($1, $2, $3, $4, $5)
LOG:  process 11839 still waiting for ShareLock on transaction 61147756 after 1000.120 ms
CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."cp_consumer" x WHERE "id"::pg_catalog.text OPERATOR(pg_catalog.=) $1::pg_catalog.text FOR SHARE OF x"
STATEMENT:  insert into cp_consumer_checkin (created, updated, checkInTime, consumer_id, id) values ($1, $2, $3, $4, $5)
LOG:  process 22409 still waiting for ShareLock on transaction 61147756 after 1000.093 ms
CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."cp_consumer" x WHERE "id"::pg_catalog.text OPERATOR(pg_catalog.=) $1::pg_catalog.text FOR SHARE OF x"
STATEMENT:  insert into cp_consumer_checkin (created, updated, checkInTime, consumer_id, id) values ($1, $2, $3, $4, $5)
LOG:  process 30816 still waiting for ShareLock on transaction 61147756 after 1000.099 ms
CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."cp_consumer" x WHERE "id"::pg_catalog.text OPERATOR(pg_catalog.=) $1::pg_catalog.text FOR SHARE OF x"
STATEMENT:  insert into cp_consumer_checkin (created, updated, checkInTime, consumer_id, id) values ($1, $2, $3, $4, $5)

postgres=# select * from pg_stat_activity where pid = 27233;
-[ RECORD 1 ]----+-
datid            | 16387
datname          | candlepin
pid              | 27233
usesysid         | 16386
usename          | candlepin
application_name |
client_addr      | 127.0.0.1
client_hostname  |
client_port      | 43111
backend_start    | 2016-06-28 10:46:49.553728+02
xact_start       | 2016-06-28 10:52:17.088618+02
query_start      | 2016-06-28 11:16:01.05266+02
state_change     | 2016-06-28 11:16:01.055267+02
waiting          | f
state            | idle in transaction
query            | select this_.consumer_id as y0_ from cp_consumer_guests this_ inner join cp_consumer gconsumer1_ on this_.consumer_id=gconsumer1_.id inner join cp_guest_ids_checkin checkins2_ on gconsumer1_.id=checkins2_.consumer_id where gconsumer1_.owner_id=$1 and (lower(this_.guest_id)=$2 or lower(this_.guest_id)=$3) order by checkins2_.updated desc limit $4

postgres=# select * from pg_stat_activity where pid = 27233;
-[ RECORD 1 ]----+-
datid            | 16387
datname          | candlepin
pid              | 27233
usesysid         | 16386
usename          | candlepin
application_name |
client_addr      | 127.0.0.1
client_hostname  |
client_port      | 43111
backend_start    | 2016-06-28 10:46:49.553728+02
xact_start       | 2016-06-28 10:52:17.088618+02
query_start      | 2016-06-28 11:16:01.05266+02
state_change     | 2016-06-28 11:16:01.055267+02
waiting          | f
state            | idle in transaction
query            | select this_.consumer_id as y0_ from cp_consumer_guests this_ inner join cp_consumer gconsumer1_ on this_.consumer_id=gconsumer1_.id inner join cp_guest_ids_checkin checkins2_ on gconsumer1_.id=checkins2_.consumer_id where gconsumer1_.owner_id=$1 and (lower(this_.guest_id)=$2 or lower(this_.guest_id)=$3) order by checkins2_.updated desc limit $4

Version-Release number of selected component (if applicable):
0.9.49.12-1.el7

How reproducible:
100%, but needs quite some time to lock up properly

Steps to Reproduce:
1. configure 3 big virt-who instances to report to satellite
2. wait

Actual results:
inserts never get the lock, blocking everything


Expected results:
insert happens instantly.

Additional info:

Comment 2 Evgeni Golov 2016-06-28 11:30:49 UTC
Seems I miss-pasted some output

postgres=# select * from pg_locks where transactionid=61147756;
   locktype    | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |     mode      | granted | fastpath
---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+---------------+---------+----------
 transactionid |          |          |      |       |            |      61147756 |         |       |          | 49/148001          | 27233 | ExclusiveLock | t       | f
 transactionid |          |          |      |       |            |      61147756 |         |       |          | 10/578471          | 22408 | ShareLock     | f       | f
 transactionid |          |          |      |       |            |      61147756 |         |       |          | 39/189985          | 27232 | ShareLock     | f       | f
 transactionid |          |          |      |       |            |      61147756 |         |       |          | 38/202683          | 25702 | ShareLock     | f       | f
 transactionid |          |          |      |       |            |      61147756 |         |       |          | 17/418134          | 18678 | ShareLock     | f       | f
 transactionid |          |          |      |       |            |      61147756 |         |       |          | 16/415382          | 22409 | ShareLock     | f       | f
 transactionid |          |          |      |       |            |      61147756 |         |       |          | 55/15872           | 30817 | ShareLock     | f       | f
 transactionid |          |          |      |       |            |      61147756 |         |       |          | 51/101912          | 29472 | ShareLock     | f       | f
 transactionid |          |          |      |       |            |      61147756 |         |       |          | 54/30737           | 30816 | ShareLock     | f       | f
 transactionid |          |          |      |       |            |      61147756 |         |       |          | 50/105856          | 27234 | ShareLock     | f       | f
 transactionid |          |          |      |       |            |      61147756 |         |       |          | 12/491939          | 11839 | ShareLock     | f       | f
 transactionid |          |          |      |       |            |      61147756 |         |       |          | 53/39830           | 29474 | ShareLock     | f       | f
(12 rows)



postgres=# select * from pg_stat_activity where pid = 22408;
-[ RECORD 1 ]----+-------------------------------------------------------------------------------------------------------------
datid            | 16387
datname          | candlepin
pid              | 22408
usesysid         | 16386
usename          | candlepin
application_name |
client_addr      | 127.0.0.1
client_hostname  |
client_port      | 41453
backend_start    | 2016-06-28 10:38:05.011045+02
xact_start       | 2016-06-28 11:14:47.21995+02
query_start      | 2016-06-28 11:14:47.220685+02
state_change     | 2016-06-28 11:14:47.220686+02
waiting          | t
state            | active
query            | insert into cp_consumer_checkin (created, updated, checkInTime, consumer_id, id) values ($1, $2, $3, $4, $5)

Comment 3 Evgeni Golov 2016-06-28 14:38:06 UTC
Some more info based on https://wiki.postgresql.org/wiki/Lock_Monitoring
Having two virt-who daemons seems to be enough to trigger this.

candlepin=#  SELECT blocked_locks.pid     AS blocked_pid,
         blocked_activity.usename  AS blocked_user,
         blocking_locks.pid     AS blocking_pid,
         blocking_activity.usename AS blocking_user,
         blocked_activity.query    AS blocked_statement,
         blocking_activity.query   AS current_statement_in_blocking_process
   FROM  pg_catalog.pg_locks         blocked_locks
    JOIN pg_catalog.pg_stat_activity blocked_activity  ON blocked_activity.pid = blocked_locks.pid
    JOIN pg_catalog.pg_locks         blocking_locks
        ON blocking_locks.locktype = blocked_locks.locktype
        AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
        AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
        AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
        AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
        AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
        AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
        AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
        AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
        AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
        AND blocking_locks.pid != blocked_locks.pid

    JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
   WHERE NOT blocked_locks.GRANTED;
-[ RECORD 1 ]-------------------------+-
blocked_pid                           | 31895
blocked_user                          | candlepin
blocking_pid                          | 31898
blocking_user                         | candlepin
blocked_statement                     | COMMIT
current_statement_in_blocking_process | select this_.consumer_id as y0_ from cp_consumer_guests this_ inner join cp_consumer gconsumer1_ on this_.consumer_id=gconsumer1_.id inner join cp_guest_ids_checkin checkins2_ on gconsumer1_.id=checkins2_.consumer_id where gconsumer1_.owner_id=$1 and (lower(this_.guest_id)=$2 or lower(this_.guest_id)=$3) order by checkins2_.updated desc limit $4


candlepin=#  SELECT bl.pid                 AS blocked_pid,
         a.usename              AS blocked_user,
         ka.query               AS current_or_recent_statement_in_blocking_process,
         ka.state               AS state_of_blocking_process,
         now() - ka.query_start AS blocking_duration,
         kl.pid                 AS blocking_pid,
         ka.usename             AS blocking_user,
         a.query                AS blocked_statement,
         now() - a.query_start  AS blocked_duration
  FROM  pg_catalog.pg_locks         bl
   JOIN pg_catalog.pg_stat_activity a  ON a.pid = bl.pid
   JOIN pg_catalog.pg_locks         kl ON kl.transactionid = bl.transactionid AND kl.pid != bl.pid
   JOIN pg_catalog.pg_stat_activity ka ON ka.pid = kl.pid
  WHERE NOT bl.GRANTED;
-[ RECORD 1 ]-----------------------------------+-
blocked_pid                                     | 31895
blocked_user                                    | candlepin
current_or_recent_statement_in_blocking_process | select this_.consumer_id as y0_ from cp_consumer_guests this_ inner join cp_consumer gconsumer1_ on this_.consumer_id=gconsumer1_.id inner join cp_guest_ids_checkin checkins2_ on gconsumer1_.id=checkins2_.consumer_id where gconsumer1_.owner_id=$1 and (lower(this_.guest_id)=$2 or lower(this_.guest_id)=$3) order by checkins2_.updated desc limit $4
state_of_blocking_process                       | idle in transaction
blocking_duration                               | -00:00:21.61859
blocking_pid                                    | 31898
blocking_user                                   | candlepin
blocked_statement                               | COMMIT
blocked_duration                                | 00:01:01.802835

Comment 9 Filip Nguyen 2016-07-14 11:47:20 UTC
https://github.com/candlepin/candlepin/pull/1291

Comment 10 Filip Nguyen 2016-07-14 11:48:26 UTC
Disregard Comment 9


PR for this BZ: 
https://github.com/candlepin/candlepin/pull/1285

Comment 11 Barnaby Court 2016-07-22 19:08:24 UTC
Adding external tracker for github and target release