Bug 1334924

Summary: Slow response times due to Foreman locking query
Product: Red Hat Satellite Reporter: Chris Adams <cadams>
Component: Tasks PluginAssignee: Ivan Necas <inecas>
Status: CLOSED ERRATA QA Contact: jcallaha
Severity: high Docs Contact:
Priority: high    
Version: 6.1.7CC: bbuckingham, bkearney, cwelton, erinn.looneytriggs, inecas, jcallaha, jentrena, ktordeur, mmccune, oshtaier, rdixon, sabnave, smalik, xdmoon
Target Milestone: UnspecifiedKeywords: Triaged
Target Release: Unused   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: rubygem-foreman-tasks-0.7.14.7-1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-10-04 06:43:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Chris Adams 2016-05-10 19:48:56 UTC
Using Satellite 6.1.7, we experienced extremely slow UI and API performance (response times measured in the minutes, tons of ancillary failures due to lock contention). On the server, we hit the Passenger process limit on Foreman instances which were each tied to a Postgres query using 100% CPU continuously for tens of minutes.

We added memory and reduced concurrency to avoid swapping and increased Postgres's shared_buffers & work_mem but the queries were still taking minutes to execute.

The long running queries were all the same Foreman lock check, something of this form:

> SELECT COUNT(*) FROM "foreman_tasks_locks" INNER JOIN "foreman_tasks_tasks" ON "foreman_tasks_tasks"."id" = "foreman_tasks_locks"."task_id" WHERE "foreman_tasks_locks"."name" = 'read' AND "foreman_tasks_locks"."resource_id" = 529 AND "foreman_tasks_locks"."resource_type" = 'Katello::System' AND (foreman_tasks_tasks.state != 'stopped') AND ("foreman_tasks_locks"."task_id" NOT IN ('ab80a792-a95e-455a-a0cf-23801e84eba2'));

A quick EXPLAIN showed that triggers a table scan:

> foreman=# EXPLAIN ANALYZE SELECT COUNT(*) FROM "foreman_tasks_locks" INNER JOIN "foreman_tasks_tasks" ON "foreman_tasks_tasks"."id" = "foreman_tasks_locks"."task_id" WHERE "foreman_tasks_locks"."name" = 'read' AND "foreman_tasks_locks"."resource_id" = 529 AND "foreman_tasks_locks"."resource_type" = 'Katello::System' AND (foreman_tasks_tasks.state != 'stopped') AND ("foreman_tasks_locks"."task_id" NOT IN ('ab80a792-a95e-455a-a0cf-23801e84eba2'));
>                                                                                             QUERY PLAN                                                                                            
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=16218.46..16218.47 rows=1 width=0) (actual time=115714.130..115714.131 rows=1 loops=1)
>    ->  Nested Loop  (cost=0.00..16218.46 rows=1 width=0) (actual time=7013.001..115714.110 rows=2 loops=1)
>          Join Filter: ((foreman_tasks_locks.task_id)::text = (foreman_tasks_tasks.id)::text)
>          Rows Removed by Join Filter: 3029738
>          ->  Index Scan using index_foreman_tasks_locks_on_resource_type_and_resource_id on foreman_tasks_locks  (cost=0.00..23.13 rows=1 width=37) (actual time=0.111..16.142 rows=1615 loops=1)
>                Index Cond: (((resource_type)::text = 'Katello::System'::text) AND (resource_id = 529))
>                Filter: (((task_id)::text <> 'ab80a792-a95e-455a-a0cf-23801e84eba2'::text) AND ((name)::text = 'read'::text))
>                Rows Removed by Filter: 1615
>          ->  Seq Scan on foreman_tasks_tasks  (cost=0.00..16171.96 rows=1869 width=37) (actual time=0.020..71.448 rows=1876 loops=1615)
>                Filter: ((state)::text <> 'stopped'::text)
>                Rows Removed by Filter: 371338
>  Total runtime: 115714.238 ms
> (12 rows)

After adding a couple of indexes, the response times are significantly improved:

> foreman=# CREATE INDEX ON foreman_tasks_tasks (id, state);
> CREATE INDEX
> foreman=# CREATE INDEX ON foreman_tasks_locks (name, resource_type, resource_id);
> CREATE INDEX
> foreman=# EXPLAIN ANALYZE SELECT COUNT(*) FROM "foreman_tasks_locks" INNER JOIN "foreman_tasks_tasks" ON "foreman_tasks_tasks"."id" = "foreman_tasks_locks"."task_id" WHERE "foreman_tasks_locks"."name" = 'read' AND "foreman_tasks_locks"."resource_id" = 529 AND "foreman_tasks_locks"."resource_type" = 'Katello::System' AND (foreman_tasks_tasks.state != 'stopped') AND ("foreman_tasks_locks"."task_id" NOT IN ('ab80a792-a95e-455a-a0cf-23801e84eba2'));
>                                                                                          QUERY PLAN                                                                                          
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=69.76..69.77 rows=1 width=0) (actual time=30.810..30.811 rows=1 loops=1)
>    ->  Nested Loop  (cost=0.00..69.75 rows=1 width=0) (actual time=30.787..30.807 rows=2 loops=1)
>          ->  Index Scan using foreman_tasks_locks_name_resource_type_resource_id_idx on foreman_tasks_locks  (cost=0.00..22.91 rows=7 width=37) (actual time=0.149..1.926 rows=1615 loops=1)
>                Index Cond: (((name)::text = 'read'::text) AND ((resource_type)::text = 'Katello::System'::text) AND (resource_id = 529))
>                Filter: ((task_id)::text <> 'ab80a792-a95e-455a-a0cf-23801e84eba2'::text)
>          ->  Index Only Scan using foreman_tasks_tasks_id_state_idx on foreman_tasks_tasks  (cost=0.00..6.68 rows=1 width=37) (actual time=0.018..0.018 rows=0 loops=1615)
>                Index Cond: (id = (foreman_tasks_locks.task_id)::text)
>                Filter: ((state)::text <> 'stopped'::text)
>                Rows Removed by Filter: 1
>                Heap Fetches: 1057
>  Total runtime: 30.873 ms
> (11 rows)

(As an aside, I don't know whether ~300k tasks is considered excessive but https://bugzilla.redhat.com/show_bug.cgi?id=1289384 makes me glad 6.2 will include foreman_tasks:cleanup)

Comment 1 Chris Adams 2016-05-13 15:01:31 UTC
One other improvement we made to the base config was adding an Apache conf file in the foreman .d directories which capped the maximum number of processes for Foreman below the overall limit. That reduced the number of 100% CPU Postgres processes but also, more importantly, ensured that non-Foreman Apache requests weren't getting blocked by those Foreman workers running for hours so the other apps didn't stop responding.

Comment 3 Ivan Necas 2016-06-03 09:55:48 UTC
I don't think it's addressed in 6.2, this should be an easy fix with adding the indexes as described in https://bugzilla.redhat.com/show_bug.cgi?id=1334924#c0

Comment 6 Bryan Kearney 2016-06-03 12:19:25 UTC
Created redmine issue http://projects.theforeman.org/issues/15283 from this bug

Comment 7 Bryan Kearney 2016-06-03 14:17:08 UTC
Upstream bug component is Remote Execution

Comment 8 Bryan Kearney 2016-06-07 22:14:49 UTC
Upstream bug component is Tasks Plugin

Comment 10 Erinn Looney-Triggs 2016-09-08 02:29:04 UTC
Chris,
Would it be possible for you to include the config you cite here: https://bugzilla.redhat.com/show_bug.cgi?id=1334924#c1

I'd like to get something like this in place,

-Erinn

Comment 12 Chris Adams 2016-09-08 16:10:35 UTC
Erinn: the current version of our config is actually quite simple (“PassengerMaxPoolSize 10” dropped into both /etc/httpd/conf.d/05-foreman.d/site-tuning.conf and /etc/httpd/conf.d/05-foreman-ssl.d/site-tuning.conf). 

That's actually the second pass at that config – one of the earlier troubleshooting steps silently removed the earlier config file – and the values are probably quite conservative. I'm not the owner for our Satellite server & those values were an emergency fix to end the downtime since we have a full rebuild in progress. After bumping up Postgres's shared_buffers and work_mem values and creating those indexes I'd expect the numbers could be increased considerably.

Comment 13 Bryan Kearney 2016-09-20 14:16:33 UTC
Upstream bug assigned to inecas

Comment 14 Bryan Kearney 2016-09-20 14:16:38 UTC
Upstream bug assigned to inecas

Comment 15 Bryan Kearney 2016-09-23 20:16:27 UTC
Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/15283 has been resolved.

Comment 18 Ivan Necas 2016-09-26 20:33:03 UTC
*** Bug 1378830 has been marked as a duplicate of this bug. ***

Comment 20 Erinn Looney-Triggs 2016-09-29 02:34:43 UTC
I don't understand a lot about postgres but I can tell you even after putting those indexes in place I am running into query performance issues, these are consuming all CPU resources on the system:

foreman=# explain analyze SELECT COUNT(*) FROM "foreman_tasks_locks" INNER JOIN "foreman_tasks_tasks" ON "foreman_tasks_tasks"."id" = "foreman_tasks_locks"."task_id" WHERE (foreman_tasks_tasks.state != 'stopped') AND ("foreman_tasks_locks"."task_id" NOT IN ('8abb9ca3-815f-49ea-b97b-6aa505f2a29a')) AND "foreman_tasks_locks"."name" = 'task_owner' AND "foreman_tasks_locks"."resource_id" = 1 AND "foreman_tasks_locks"."resource_type" = 'User' AND "foreman_tasks_locks"."exclusive" = 't';
                                                                                
            QUERY PLAN                                                          
                                  
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
----------------------------------
 Aggregate  (cost=99686.75..99686.76 rows=1 width=0) (actual time=117418.632..11
7418.633 rows=1 loops=1)
   ->  Nested Loop  (cost=10291.17..99686.75 rows=1 width=0) (actual time=117418
.610..117418.610 rows=0 loops=1)
         Join Filter: ((foreman_tasks_locks.task_id)::text = (foreman_tasks_task
s.id)::text)
         ->  Seq Scan on foreman_tasks_tasks  (cost=0.00..28213.80 rows=1 width=
37) (actual time=0.101..857.333 rows=100 loops=1)
               Filter: ((state)::text <> 'stopped'::text)
               Rows Removed by Filter: 590769
         ->  Bitmap Heap Scan on foreman_tasks_locks  (cost=10291.17..70187.01 r
ows=102875 width=37) (actual time=1165.321..1165.321 rows=0 loops=100)
               Recheck Cond: (((name)::text = 'task_owner'::text) AND ((resource
_type)::text = 'User'::text) AND (resource_id = 1))
               Rows Removed by Index Recheck: 1197069
               Filter: (exclusive AND ((task_id)::text <> '8abb9ca3-815f-49ea-b9
7b-6aa505f2a29a'::text))
               Rows Removed by Filter: 1647091
               ->  Bitmap Index Scan on index_foreman_tasks_locks_name_resource_
type_resource_id  (cost=0.00..10265.45 rows=291542 width=0) (actual time=346.569
..346.569 rows=1647091 loops=100)
                     Index Cond: (((name)::text = 'task_owner'::text) AND ((reso
urce_type)::text = 'User'::text) AND (resource_id = 1))
 Total runtime: 117418.692 ms
(14 rows)

As mentioned the indexes are in place, but I really don't understand the above or the internals of postgresql so something is probably missing.

foreman=# select                                                                    t.relname as table_name,
    i.relname as index_name,
    array_to_string(array_agg(a.attname), ', ') as column_names
from
    pg_class t,
    pg_class i,
    pg_index ix,
    pg_attribute a
where
    t.oid = ix.indrelid
    and i.oid = ix.indexrelid
    and a.attrelid = t.oid
    and a.attnum = ANY(ix.indkey)
    and t.relkind = 'r'
    and t.relname like 'foreman_tasks_%'
group by
    t.relname,
    i.relname
order by
    t.relname,
    i.relname;
            table_name            |                         index_name          
               |           column_names           
----------------------------------+---------------------------------------------
---------------+----------------------------------
 foreman_tasks_locks              | foreman_tasks_locks_pkey                    
               | id
 foreman_tasks_locks              | index_foreman_tasks_locks_name_resource_type
_resource_id   | resource_id, name, resource_type
 foreman_tasks_locks              | index_foreman_tasks_locks_on_resource_type_a
nd_resource_id | resource_type, resource_id
 foreman_tasks_recurring_logics   | foreman_tasks_recurring_logics_pkey         
               | id
 foreman_tasks_task_group_members | foreman_tasks_task_group_members_index      
               | task_id, task_group_id
 foreman_tasks_task_group_members | foreman_tasks_task_group_members_pkey       
               | id
 foreman_tasks_task_groups        | foreman_tasks_task_groups_pkey              
               | id
 foreman_tasks_tasks              | index_foreman_tasks_id_state                
               | state, id
 foreman_tasks_tasks              | index_foreman_tasks_tasks_on_type_and_label 
               | label, type
 foreman_tasks_triggerings        | foreman_tasks_triggerings_pkey              
               | id
(10 rows)

Comment 21 Erinn Looney-Triggs 2016-09-29 03:09:28 UTC
And it looks a lot like I have fixed it (this is 6.2.2 by the way):

foreman=# explain analyze SELECT COUNT(*) FROM "foreman_tasks_locks" INNER JOIN "foreman_tasks_tasks" ON "foreman_tasks_tasks"."id" = "foreman_tasks_locks"."task_id" WHERE (foreman_tasks_tasks.state != 'stopped') AND ("foreman_tasks_locks"."task_id" NOT IN ('8abb9ca3-815f-49ea-b97b-6aa505f2a29a')) AND "foreman_tasks_locks"."name" = 'task_owner' AND "foreman_tasks_locks"."resource_id" = 1 AND "foreman_tasks_locks"."resource_type" = 'User' AND "foreman_tasks_locks"."exclusive" = 't';
                                                                                
         QUERY PLAN                                                             
                            
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
----------------------------
 Aggregate  (cost=89414.34..89414.35 rows=1 width=0) (actual time=0.072..0.072 r
ows=1 loops=1)
   ->  Hash Join  (cost=32232.62..89414.31 rows=12 width=0) (actual time=0.069..
0.069 rows=0 loops=1)
         Hash Cond: ((foreman_tasks_locks.task_id)::text = (foreman_tasks_tasks.
id)::text)
         ->  Bitmap Heap Scan on foreman_tasks_locks  (cost=4014.51..60810.19 ro
ws=102904 width=37) (actual time=0.069..0.069 rows=0 loops=1)
               Recheck Cond: (((name)::text = 'task_owner'::text) AND ((resource
_type)::text = 'User'::text) AND (resource_id = 1))
               Filter: (exclusive AND ((task_id)::text <> '8abb9ca3-815f-49ea-b9
7b-6aa505f2a29a'::text))
               ->  Bitmap Index Scan on index_foreman_tasks_locks_name_resource_
type_resource_id_exclus  (cost=0.00..3988.78 rows=102904 width=0) (actual time=0
.067..0.067 rows=0 loops=1)
                     Index Cond: (((name)::text = 'task_owner'::text) AND ((reso
urce_type)::text = 'User'::text) AND (resource_id = 1) AND (exclusive = true))
         ->  Hash  (cost=28217.12..28217.12 rows=79 width=37) (never executed)
               ->  Seq Scan on foreman_tasks_tasks  (cost=0.00..28217.12 rows=79
 width=37) (never executed)
                     Filter: ((state)::text <> 'stopped'::text)
 Total runtime: 0.130 ms
(12 rows)


By creating this index:
foreman=# CREATE INDEX index_foreman_tasks_locks_name_resource_type_resource_id_exclusive ON foreman_tasks_locks (name, resource_type, resource_id, exclusive);NOTICE:  identifier "index_foreman_tasks_locks_name_resource_type_resource_id_exclusive" will be truncated to "index_foreman_tasks_locks_name_resource_type_resource_id_exclus"
CREATE INDEX

Obviously the name can't be that, but it might be worth adding this to the pull request that is currently going through QA.

Comment 22 jcallaha 2016-09-29 19:44:48 UTC
Satellite 6.2.2 Snap 1 results

[root@rhsm-qe-1 ~]# sudo su - postgres -s /bin/bash -c "psql -d foreman -c '\d foreman_tasks_tasks'"
            Table "public.foreman_tasks_tasks"
     Column     |            Type             | Modifiers 
----------------+-----------------------------+-----------
 id             | character varying(255)      | not null
 type           | character varying(255)      | not null
 label          | character varying(255)      | 
 started_at     | timestamp without time zone | 
 ended_at       | timestamp without time zone | 
 state          | character varying(255)      | not null
 result         | character varying(255)      | not null
 external_id    | character varying(255)      | 
 parent_task_id | character varying(255)      | 
 start_at       | timestamp without time zone | 
 start_before   | timestamp without time zone | 
Indexes:
    "index_foreman_tasks_id_state" btree (id, state)
    "index_foreman_tasks_tasks_on_type_and_label" btree (type, label)

[root@rhsm-qe-1 ~]# sudo su - postgres -s /bin/bash -c "psql -d foreman -c '\d foreman_tasks_locks'"
                                    Table "public.foreman_tasks_locks"
    Column     |          Type          |                            Modifiers                             
---------------+------------------------+------------------------------------------------------------------
 id            | integer                | not null default nextval('foreman_tasks_locks_id_seq'::regclass)
 task_id       | character varying(255) | not null
 name          | character varying(255) | not null
 resource_type | character varying(255) | 
 resource_id   | integer                | 
 exclusive     | boolean                | 
Indexes:
    "foreman_tasks_locks_pkey" PRIMARY KEY, btree (id)
    "index_foreman_tasks_locks_name_resource_type_resource_id" btree (name, resource_type, resource_id)
    "index_foreman_tasks_locks_on_resource_type_and_resource_id" btree (resource_type, resource_id)

Ivan, can you weight in on #21 and #22?

Comment 23 Ivan Necas 2016-10-03 07:26:10 UTC
Thanks @Erinn for giving this a try. I've opened additional PR to include this params combination into index as well. https://github.com/theforeman/foreman-tasks/pull/205

Comment 24 jcallaha 2016-10-03 13:12:53 UTC
Verified in Satellite 6.2.2 Async Snap 2

Since Ivan's original patch is in place and has already shown some significant performance boosts. I will verify this one and vote we push the next patch to 6.2.3.

Comment 26 errata-xmlrpc 2016-10-04 06:43:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1993