Bug 1234604 - update_needed_cache fails to return
Summary: update_needed_cache fails to return
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Spacewalk
Classification: Community
Component: Server
Version: 2.3
Hardware: All
OS: Linux
urgent
urgent
Target Milestone: ---
Assignee: Grant Gainey
QA Contact: Red Hat Satellite QA List
URL:
Whiteboard:
Depends On: 1230280 1245540
Blocks: 465198 sat570-triage space24
TreeView+ depends on / blocked
 
Reported: 2015-06-22 19:20 UTC by Grant Gainey
Modified: 2019-09-12 08:34 UTC (History)
13 users (show)

Fixed In Version: spacewalk-schema-2.4.15-1
Doc Type: Bug Fix
Doc Text:
Clone Of: 1230280
Environment:
Last Closed: 2015-10-08 13:26:53 UTC
Embargoed:


Attachments (Terms of Use)

Description Grant Gainey 2015-06-22 19:20:01 UTC
+++ This bug was initially created as a clone of Bug #1230280 +++

This bug created to investigate issues raised under 1211729, that have nothing to do with that BZ or its fix.

[lots of comments elided, tried to keep the core of the technical discussion]

+++ This bug was initially created as a clone of Bug #1211729 +++


--- Additional comment from Grant Gainey on 2015-06-04 10:18:47 EDT ---

pid 5176 starts out waiting-on 18080:

  Tue Jun 2 08:40:01 GMT 2015:
  blocked_pid | blocking_pid | blocking_duration
  5176        | 18080        | 00:00:01.481463

Later, it is blocked by 1389:

  Tue Jun 2 08:59:01 GMT 2015:
  blocked_pid | blocking_pid | blocking_duration
  5176        | 1389         | 00:18:41.464813

So clearly 5176 became unblocked and then reblocked on a new call: 08:59:01 - 18:41 == 08:40:20, which means the second lock-attempt started after the time (08:40:01) when 5176 was waiting-on 18080.

So - at 08:40:20, more or less, 5176 dropped into update_needed_cache() again, and found that it had to wait (again), this time on pid 1389 that was in the same code.

As of 08:59:01, 1389 is *still* attempting to do 'select * from rhn_channel.update_needed_cache($1) as result'. But we don't see it as 'waiting for' anything, in any of the 'waiting' checks in #c12.

The question I have is, "what is happening with PID 1389?"  It doesn't appear to be the other side of a deadly embrace, as we don't see it 'waiting on' anything; it looks like just another instance of update_needed_cache() doing its job.  Until it completes whatever it's doing, everything *has to* back up behind it - that's what the locking is for, after all.

It would appear that everything here is stuck because

  a) update_needed_cache() acquires locks and holds them until it's done (which it has to do, and is fine assuming that whoever has the lock eventually gives it up)

  b) process 1389 grabbed a lock and now seems to have disappeared.

Can we find anything from the stuck instance that might explain what happened with that process, that might have caused it to hang without showing us what it was waiting for? Because *that* is the root-cause for whatever is going on in #c12

--- Additional comment from Martin Osvald on 2015-06-16 08:51:27 EDT ---

It finally isn't stuck anywhere it simply takes huge time to complete which makes other queries to be "stuck" and this is where I need your help, we definitely need to improve/optimize locking behaviour to speed things for other queries, it is simply wrong that other queries need to wait hours before this query gets finished or are influenced by the locking mechanism which makes them very slow.

Due to the bellow:

################################################################################
rhnschema=# select sc.server_id as id from rhnServerChannel sc where sc.channel_id = 7021;
...
(7009 rows)

rhnschema=#
################################################################################

the code executes rhn_server.update_needed_cache(server.id) 7009 times:

################################################################################
1223    create or replace function update_needed_cache(channel_id_in in numeric) returns void
1224    as $$
1225    declare
1226    server record;
1227    begin
1228       -- we intentionaly do a loop here instead of one huge select
1229       -- b/c we want to break update into smaller transaction to unblock other sessions
1230       -- querying rhnServerNeededCache
1231       for server in (
1232                 select sc.server_id as id
1233                   from rhnServerChannel sc
1234                  where sc.channel_id = channel_id_in
1235       ) loop
1236          perform rhn_server.update_needed_cache(server.id);
1237       end loop;
1238    end$$ language plpgsql;
################################################################################

and if we look at the plan we can see it is costly query:

################################################################################
/var/lib/pgsql/data/pg_log/postgresql-Tue.log:
...
2015-06-16 17:23:14.842 IST|[local]|14096|557fef7e.3710|1923878|SELECT| STATEMENT:  select * from rhn_channel.update_needed_cache(7021);
2015-06-16 17:23:14.843 IST|[local]|14096|557fef7e.3710|1923878|SELECT| LOG:  duration: 983.540 ms  plan:
        Unique  (cost=12692.52..12692.55 rows=2 width=55) (actual time=981.457..981.553 rows=49 loops=1)
          Output: sp_sp.server_id, ep.errata_id, p.id, ce.channel_id
          ->  Sort  (cost=12692.52..12692.53 rows=2 width=55) (actual time=981.455..981.483 rows=53 loops=1)
                Output: sp_sp.server_id, ep.errata_id, p.id, ce.channel_id
                Sort Key: ep.errata_id, p.id, ce.channel_id
                Sort Method:  quicksort  Memory: 29kB
                ->  Hash Left Join  (cost=10942.52..12692.51 rows=2 width=55) (actual time=382.188..981.331 rows=53 loops=1)
                      Output: sp_sp.server_id, ep.errata_id, p.id, ce.channel_id
                      Hash Cond: ((sc.channel_id = ce.channel_id) AND (cp.package_id = ep.package_id))
                      ->  Nested Loop  (cost=4062.29..5811.53 rows=2 width=55) (actual time=229.005..827.955 rows=53 loops=1)
                            Output: sp_sp.server_id, sc.channel_id, cp.package_id, p.id
                            ->  Nested Loop  (cost=4062.29..5790.26 rows=68 width=55) (actual time=56.937..821.168 rows=1090 loops=1)
                                  Output: sp_sp.server_id, p.id, cp.package_id, cp.channel_id
                                  ->  Nested Loop  (cost=4062.29..5776.92 rows=21 width=40) (actual time=56.914..810.128 rows=765 loops=1)
                                        Output: sp_sp.server_id, p.id
                                        Join Filter: ((max(sp_pe.evr)) < pe.evr)
                                        ->  Hash Join  (cost=4062.29..5741.07 rows=64 width=80) (actual time=55.805..97.073 rows=5812 loops=1)
                                              Output: sp_sp.server_id, (max(sp_pe.evr)), p.id, p.evr_id
                                              Hash Cond: ((sp_sp.package_arch_id = puac.package_arch_id) AND (p.package_arch_id = puac.package_upgrade_arch_id))
                                              ->  Nested Loop  (cost=4058.54..5710.28 rows=352 width=119) (actual time=55.690..83.950 rows=7384 loops=1)
                                                    Output: sp_sp.server_id, (max(sp_pe.evr)), sp_sp.package_arch_id, p.id, p.evr_id, p.package_arch_id
                                                    ->  HashAggregate  (cost=4058.54..4059.21 rows=53 width=61) (actual time=55.665..56.337 rows=414 loops=1)
                                                          Output: sp_sp.server_id, sp_sp.name_id, sp_sp.package_arch_id, max(sp_pe.evr)
                                                          ->  Merge Join  (cost=1934.43..4053.34 rows=520 width=61) (actual time=1.528..53.897 rows=416 loops=1)
                                                                Output: sp_sp.server_id, sp_sp.name_id, sp_sp.package_arch_id, sp_pe.evr
                                                                Merge Cond: (sp_pe.id = sp_sp.evr_id)
                                                                ->  Index Scan using rhn_pe_id_pk on rhnpackageevr sp_pe  (cost=0.00..1999.98 rows=44808 width=45) (actual time=0.012..27.342 rows=44774 loops=1)
                                                                      Output: sp_pe.id, sp_pe.epoch, sp_pe.version, sp_pe.release, sp_pe.evr
                                                                ->  Sort  (cost=1933.92..1935.22 rows=520 width=32) (actual time=1.485..1.675 rows=416 loops=1)
                                                                      Output: sp_sp.server_id, sp_sp.name_id, sp_sp.package_arch_id, sp_sp.evr_id
                                                                      Sort Key: sp_sp.evr_id
                                                                      Sort Method:  quicksort  Memory: 57kB
                                                                      ->  Bitmap Heap Scan on rhnserverpackage sp_sp  (cost=21.27..1910.46 rows=520 width=32) (actual time=0.154..0.691 rows=416 loops=1)
                                                                            Output: sp_sp.server_id, sp_sp.name_id, sp_sp.package_arch_id, sp_sp.evr_id
                                                                            Recheck Cond: (server_id = $1)
                                                                            ->  Bitmap Index Scan on rhn_sp_snep_uq  (cost=0.00..21.14 rows=520 width=0) (actual time=0.117..0.117 rows=416 loops=1)
                                                                                  Index Cond: (server_id = $1)
                                                    ->  Index Scan using rhn_package_nid_id_idx on rhnpackage p  (cost=0.00..31.05 rows=7 width=30) (actual time=0.010..0.051 rows=18 loops=414)
                                                          Output: p.id, p.org_id, p.name_id, p.evr_id, p.package_arch_id, p.package_group, p.rpm_version, p.description, p.summary, p.package_size, p.payload_size, p.installed_size, p.build_host, p.build_time, p.source_rpm_id, p.checksum_id, p.vendor, p.payload_format, p.compat, p.path, p.header_sig, p.copyright, p.cookie, p.last_modified, p.created, p.modified, p.header_start, p.header_end
                                                          Index Cond: (p.name_id = sp_sp.name_id)
                                              ->  Hash  (cost=2.10..2.10 rows=110 width=14) (actual time=0.095..0.095 rows=110 loops=1)
                                                    Output: puac.package_arch_id, puac.package_upgrade_arch_id
                                                    ->  Seq Scan on rhnpackageupgradearchcompat puac  (cost=0.00..2.10 rows=110 width=14) (actual time=0.008..0.041 rows=110 loops=1)
                                                          Output: puac.package_arch_id, puac.package_upgrade_arch_id
                                        ->  Index Scan using rhn_pe_id_pk on rhnpackageevr pe  (cost=0.00..0.30 rows=1 width=45) (actual time=0.006..0.008 rows=1 loops=5812)
                                              Output: pe.id, pe.epoch, pe.version, pe.release, pe.evr
                                              Index Cond: (pe.id = p.evr_id)
                                  ->  Index Scan using rhn_cp_pid_idx on rhnchannelpackage cp  (cost=0.00..0.57 rows=5 width=15) (actual time=0.010..0.011 rows=1 loops=765)
                                        Output: cp.channel_id, cp.package_id, cp.created, cp.modified
                                        Index Cond: (cp.package_id = p.id)
                            ->  Index Scan using rhn_sc_sid_cid_uq on rhnserverchannel sc  (cost=0.00..0.30 rows=1 width=17) (actual time=0.005..0.005 rows=0 loops=1090)
                                  Output: sc.server_id, sc.channel_id, sc.is_fve, sc.created, sc.modified
                                  Index Cond: ((sc.server_id = $1) AND (sc.channel_id = cp.channel_id))
                      ->  Hash  (cost=6279.23..6279.23 rows=40066 width=23) (actual time=153.144..153.144 rows=33199 loops=1)
                            Output: ce.channel_id, ep.errata_id, ep.package_id
                            ->  Hash Join  (cost=2249.32..6279.23 rows=40066 width=23) (actual time=18.077..131.749 rows=33199 loops=1)
                                  Output: ce.channel_id, ep.errata_id, ep.package_id
                                  Hash Cond: (ep.errata_id = ce.errata_id)
                                  ->  Seq Scan on rhnerratapackage ep  (cost=0.00..2566.29 rows=141729 width=16) (actual time=0.012..41.431 rows=141675 loops=1)
                                        Output: ep.errata_id, ep.package_id, ep.created, ep.modified
                                  ->  Hash  (cost=2163.78..2163.78 rows=6843 width=15) (actual time=7.549..7.549 rows=4598 loops=1)
                                        Output: ce.channel_id, ce.errata_id
                                        ->  Nested Loop  (cost=0.00..2163.78 rows=6843 width=15) (actual time=0.073..5.069 rows=4598 loops=1)
                                              Output: ce.channel_id, ce.errata_id
                                              ->  Index Scan using rhn_sc_sid_cid_uq on rhnserverchannel sc_sc  (cost=0.00..48.27 rows=11 width=7) (actual time=0.012..0.020 rows=13 loops=1)
                                                    Output: sc_sc.server_id, sc_sc.channel_id, sc_sc.is_fve, sc_sc.created, sc_sc.modified
                                                    Index Cond: (server_id = $1)
                                              ->  Index Scan using rhn_ce_ce_uq on rhnchannelerrata ce  (cost=0.00..179.59 rows=1018 width=15) (actual time=0.010..0.199 rows=354 loops=13)
                                                    Output: ce.channel_id, ce.errata_id, ce.created, ce.modified
                                                    Index Cond: (ce.channel_id = sc_sc.channel_id)
2015-06-16 17:23:14.843 IST|[local]|14096|557fef7e.3710|1923878|SELECT| CONTEXT:  SQL statement "insert into rhnServerNeededCache (server_id, errata_id, package_id, channel_id) (select distinct sp.server_id, x.errata_id, p.id, x.channel_id FROM (SELECT sp_sp.server_id, sp_sp.name_id, sp_sp.package_arch_id, max(sp_pe.evr) AS max_evr FROM rhnServerPackage sp_sp join rhnPackageEvr sp_pe ON sp_pe.id = sp_sp.evr_id GROUP BY sp_sp.server_id, sp_sp.name_id, sp_sp.package_arch_id) sp join rhnPackage p ON p.name_id = sp.name_id join rhnPackageEvr pe ON pe.id = p.evr_id AND sp.max_evr < pe.evr join rhnPackageUpgradeArchCompat puac ON puac.package_arch_id = sp.package_arch_id AND puac.package_upgrade_arch_id = p.package_arch_id join rhnServerChannel sc ON sc.server_id = sp.server_id join rhnChannelPackage cp ON cp.package_id = p.id AND cp.channel_id = sc.channel_id left join (SELECT ep.errata_id, ce.channel_id, ep.package_id FROM rhnChannelErrata ce join rhnErrataPackage ep ON ep.errata_id = ce.errata_id join rhnServerChannel sc_sc ON sc_sc.channel_id = ce.channel_id WHERE sc_sc.server_id =  $1 ) x ON x.channel_id = sc.channel_id AND x.package_id = cp.package_id where sp.server_id =  $1 )"
        PL/pgSQL function "update_needed_cache" line 4 at SQL statement
        SQL statement "SELECT  rhn_server.update_needed_cache( $1 )"
        PL/pgSQL function "update_needed_cache" line 13 at PERFORM
################################################################################

--- Additional comment from Grant Gainey on 2015-06-16 10:38:11 EDT ---


Channel 7021 is ft_6_6.2_ft-infrastructure-x86_64-6

rhnschema=# select label, last_modified, last_synced, created, modified from rhnchannel where id = 7021;
                label                |          last_modified           | last_synced |             created              |             modified             
-------------------------------------+----------------------------------+-------------+----------------------------------+----------------------------------
 ft_6_6.2_ft-infrastructure-x86_64-6 | 2015-05-22 12:44:40.876978+05:30 |             | 2015-03-11 03:47:48.053545+05:30 | 2015-03-11 03:47:48.053545+05:30
(1 row)

That channel has only 51 packages:

rhnschema=# select count(package_id) from rhnchannelpackage where channel_id = 7021;
 count 
-------
    51
(1 row)

> 
> Due to the bellow:
#############################################################################
> ###
> rhnschema=# select sc.server_id as id from rhnServerChannel sc where
> sc.channel_id = 7021;
> ...
> (7009 rows)
> 
> rhnschema=#
> #############################################################################
> ###

7021 isn't even the most-subscribed channel in the DB:

rhnschema=# select channel_id, count(server_id) from rhnserverchannel group by channel_id order by count(server_id);
[...rows elided...]
       7007 |    99
       6989 |   115
       7008 |   299
       6998 |   312
       7029 |  1271
       7002 |  2048
       6981 |  2133
       7009 |  2650
       7019 |  2658
       6992 |  3083
       6987 |  3084
       7027 |  5745
       7011 |  7007
       6994 |  7008
       7014 |  7009
       7021 |  7009  <<<<<
       7006 |  7025
       6982 |  7037
       7023 |  7037
       6980 |  7039
(93 rows)

Channel 6980, for example, has more systems subscribed, and far more packages:

rhnschema=# select count(package_id) from rhnchannelpackage where channel_id = 6980;
 count 
-------
  4684
(1 row)

> the code executes rhn_server.update_needed_cache(server.id) 7009 times:

Yes - but it does not hold table-locks across each call. You will note that rhn_channel.update_needed_cache(cid) doesn't hold a global lock itself, so we will hold only be locking as we enter rhn_server.update_needed_cache(sid), and dropping them as we exit that procedure. There may be some difference in the rhn_server.update_needed_cache(sid) code between postgres and oracle that's biting us.

> and if we look at the plan we can see it is costly query:

Yes. Figuring out the server-cache is possibly the most expensive query Satellite executes. People have been trying to make it faster for a decade or so now.

> 
> #############################################################################
> ###
> /var/lib/pgsql/data/pg_log/postgresql-Tue.log:
> ...
> 2015-06-16 17:23:14.842 IST|[local]|14096|557fef7e.3710|1923878|SELECT|
> STATEMENT:  select * from rhn_channel.update_needed_cache(7021);
> 2015-06-16 17:23:14.843 IST|[local]|14096|557fef7e.3710|1923878|SELECT| LOG:
> duration: 983.540 ms  plan:

But regardless of how complicated and expensive it is, this suggests that it finished in 980 milliseconds?

>         Unique  (cost=12692.52..12692.55 rows=2 width=55) (actual
> time=981.457..981.553 rows=49 loops=1)

I see a lot of index-scans and bitmap-joins here. I don't see any full-table-scans. Looks to me like this query is as performant as it's going to get.

> 2015-06-16 17:23:14.843 IST|[local]|14096|557fef7e.3710|1923878|SELECT|
> CONTEXT:  SQL statement "insert into rhnServerNeededCache (server_id,
> errata_id, package_id, channel_id) (select distinct sp.server_id,
> x.errata_id, p.id, x.channel_id FROM (SELECT sp_sp.server_id, sp_sp.name_id,
> sp_sp.package_arch_id, max(sp_pe.evr) AS max_evr FROM rhnServerPackage sp_sp
> join rhnPackageEvr sp_pe ON sp_pe.id = sp_sp.evr_id GROUP BY
> sp_sp.server_id, sp_sp.name_id, sp_sp.package_arch_id) sp join rhnPackage p
> ON p.name_id = sp.name_id join rhnPackageEvr pe ON pe.id = p.evr_id AND
> sp.max_evr < pe.evr join rhnPackageUpgradeArchCompat puac ON
> puac.package_arch_id = sp.package_arch_id AND puac.package_upgrade_arch_id =
> p.package_arch_id join rhnServerChannel sc ON sc.server_id = sp.server_id
> join rhnChannelPackage cp ON cp.package_id = p.id AND cp.channel_id =
> sc.channel_id left join (SELECT ep.errata_id, ce.channel_id, ep.package_id
> FROM rhnChannelErrata ce join rhnErrataPackage ep ON ep.errata_id =
> ce.errata_id join rhnServerChannel sc_sc ON sc_sc.channel_id = ce.channel_id
> WHERE sc_sc.server_id =  $1 ) x ON x.channel_id = sc.channel_id AND
> x.package_id = cp.package_id where sp.server_id =  $1 )"
>         PL/pgSQL function "update_needed_cache" line 4 at SQL statement
>         SQL statement "SELECT  rhn_server.update_needed_cache( $1 )"
>         PL/pgSQL function "update_needed_cache" line 13 at PERFORM
> #############################################################################
> ###
> 

Actually, that plan looks like it's for rhn_SERVER.update_needed_cache(sid). In that context, "980 ms per server" would explain the "takes a long time" issue.

On the reproducer-machine, I did some very very rough experiments, measuring wallclock-time to update-cache on channels with a variety of systems subscribed.

10 systems, ~14 seconds
# select * from rhn_channel.update_needed_cache(7022); 

100 systems, ~120 seconds
# select * from rhn_channel.update_needed_cache(7007);

1200 systems, ~20 minutes
select * from rhn_channel.update_needed_cache(7029);

Basically, on this machine, updating the cache looks to be taking 1 minute per 50-60 systems subscribed to a channel - which maps pretty closely to the 980ms in the explain-plan. So channels with 7K subscribed systems are going to take about two hours apiece.

5.7 has had a number of updates to rhn_channel and rhn_server - I will investigate and see if any of them might have an impact on us. 

--- Additional comment from Martin Osvald on 2015-06-16 12:48:13 EDT ---

maybe wrong question, but couldn't be this somehow related to oracle -> postgres transition? 

--- Additional comment from Grant Gainey on 2015-06-16 13:45:46 EDT ---

(In reply to Martin Osvald from comment #8)
> maybe wrong question, but couldn't be this somehow related to oracle ->
> postgres transition? 

The short answer is "yes". If you read the investigation at

http://www.moioli.net/progetti/a-postgres-deadlock-bug-case-study/ ,

you can see that this problem crops up as fallout of Postgres' implementation of foreign-keys as triggers.

> I don't remember any such problems with oracle, at
> least during 2009 to 2012 or 13 when I for some time ended working on
> Satellite cases and one year back I didn't see this too, although there are
> currently only 2 cases attached I see multiple cases related to problems
> with postgres db.

Postgres9 appears to be much better (in this and many other respects) than Postgres8.

--- Additional comment from Grant Gainey on 2015-06-16 15:29:36 EDT ---

Martin - OK, I have made a change in the reproducer, that seems to at least not-break-anything. Let me tell you what I did, and see if it makes sense to you:

The current problem is that:
- We call rhn_channel.update_needed_cache(cid), which
  - calls rhn_server.update_needed_cache(sid) forall servers subscribed to CID, which 
    - locks rhnServer where id = sid
    ...and holds it until rhn_channel.update_needed_cache(cid) *commits*.

That last bit is the problem, because (as noted) it can take a Very Long Time to process thousands of systems, and we don't commit until it's done.

There exists a proc in Satellite called 'queue_server(sid, now)'. Its job is to do one of two things:

1) Call rhn_server.update_needed_cache(sid) if now > 0, OR
2) Place an 'update_server_errata_cache' entry onto rhnTaskQueue, for later pickup by Taskomatic, which will then call rhn_server.update_needed_cache(sid).

The main difference is, the Taskomatic call *commits after each server*. This means the locks acquired by rhn_server.update_needed_cache(sid) are released, allowing Something Else to get work done before the next server is processed.

The net from the user's POV is that servers will correct their errata-needed state a little more slowly than currently, but the rest of Satellite should continue working.

I have set the reproducer up to work this way. If you call rhn_channel.update_needed_cache(CID) now, it returns immediately. If you then look at the rhnTaskQueue:

select count(*) from rhnTaskQueue where task_name = 'update_server_errata_cache';

you'll see one entry per-server-subscribed-to that CID.

If you continue to monitor rhnTaskQueue, you will see that number begin to decrease, as Taskomatic does its thing.

So - I have gotten that far. Could you hammer on the reproducer, and see if this change frees up the rest of the system?  Processing manymany channels at once, it is still going to take quite some time before every system's errata-cache-state makes sense - but a) it should get there eventually, and b) it shouldn't cause the instance to grind to a halt.

--- Additional comment from Grant Gainey on 2015-06-16 15:33:50 EDT ---

Note to clarify #c10: I modified 

  rhn_channel.update_needed_cache(CID) 

to call 

  queue_server(server.id), 

instead of calling 

  rhn_server.update_needed_cache(server.id) 

directly.

Comment 1 Grant Gainey 2015-06-22 19:58:16 UTC
spacewalk.github commit c52cf40f012136b5469ef773da8dd3ed8a55e618

Comment 2 Grant Gainey 2015-06-22 20:12:52 UTC
spacewalk.github commit 41926ed10ea3d08e228691f6243734da9ffdd4f2

Comment 3 Tomas Lestach 2015-06-30 11:50:40 UTC
Re-setting back to ASSIGNED, as an oracle specific upgrade script is missing in the patch.

Comment 4 Grant Gainey 2015-06-30 18:59:59 UTC
spacewalk.github commit: 3b6c652741e22427972eb21398a9e42fbdf1b1e5

Comment 7 Jan Dobes 2015-10-08 13:26:53 UTC
Spacewalk 2.4 has been released.


Note You need to log in before you can comment on or make changes to this bug.