Bug 1869812 - Tasks fail to complete under load
Summary: Tasks fail to complete under load
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Dynflow
Version: 6.8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: 6.8.0
Assignee: Adam Ruzicka
QA Contact: Imaan
URL:
Whiteboard:
Depends On:
Blocks: 1691416
 
Reported: 2020-08-18 17:29 UTC by Mike McCune
Modified: 2021-01-04 11:22 UTC

Fixed In Version: tfm-rubygem-dynflow-1.4.7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 13:05:30 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github Dynflow dynflow pull 362/ 0 None None None 2020-10-05 06:09:12 UTC
Github theforeman puppet-foreman pull 882 0 None closed Fixes #30789 - Set DB pool size dynamically 2021-02-10 13:44:45 UTC
Red Hat Product Errata RHSA-2020:4366 0 None None None 2020-10-27 13:08:38 UTC

Description Mike McCune 2020-08-18 17:29:44 UTC
When synchronizing 20+ repositories, users may see one or more of the synchronization tasks never complete, with a step stuck showing:

"waiting for Pulp to start the task"

when, in fact, Pulp has already completed the task.

This is resolved upstream with:

https://github.com/Dynflow/dynflow/pull/362/

Errors in the log include:

PersistenceError in executor 
caused by Sequel::PoolTimeout: timeout: 5.0, elapsed: 5.000096781004686 (Dynflow::Errors::PersistenceError)

This is a regression from 6.7.
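
For anyone checking whether an installation is hitting this, both messages can be grepped out of the logs. A rough sketch, assuming default log locations (adjust paths for your setup):

    # Look for the pool timeout / persistence errors (log locations are assumptions)
    grep "Sequel::PoolTimeout" /var/log/foreman/production.log | tail -n 5
    journalctl -u 'dynflow-sidekiq@*' | grep -c "Dynflow::Errors::PersistenceError"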

Comment 1 Adam Ruzicka 2020-08-28 11:52:43 UTC
Upstream fix was merged, moving to POST

Comment 2 Adam Ruzicka 2020-08-29 09:00:25 UTC
Upstream release 1.4.7 containing the fix for this BZ is out, moving to MODIFIED.

Comment 3 Jan Hutař 2020-09-07 10:21:22 UTC
Hello. Even with tfm-rubygem-dynflow-1.4.7-1.fm2_1.el7sat.noarch (upgraded using `yum upgrade; satellite-installer --scenario satellite`) I'm still getting lots of these errors when registering 62 hosts in parallel:

    2020-09-07T09:51:20 [E|kat|6d773629] ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.003 seconds); all pooled connections were in use

(30 passed, 32 failed).

Looking into our monitoring, we are now topping out at 59 connections to PostgreSQL (47 for foreman, 12 for candlepin) where we had 48 before (39 for foreman and 9 for candlepin) - so there probably is some improvement, but either there is still some issue, or I'm missing something?
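
(For reference, the per-database connection counts can also be read directly from PostgreSQL; a quick sketch, run on the Satellite box as the postgres user:)

    # Count open connections per database (foreman, candlepin, ...)
    su - postgres -c "psql -c 'SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname ORDER BY 2 DESC;'"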

Comment 4 Adam Ruzicka 2020-09-07 10:30:48 UTC
Could you confirm that all dynflow-sidekiq@* services were restarted during that upgrade?
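
(As a sketch, restarting those units directly, or the whole stack, and confirming they are back up; exact unit names may differ per install:)

    # Restart only the dynflow workers, then list them
    systemctl restart 'dynflow-sidekiq@*'
    systemctl list-units 'dynflow-sidekiq@*' --no-legend

    # Or restart everything Satellite-related
    foreman-maintain service restart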

Comment 5 Jan Hutař 2020-09-07 10:50:14 UTC
I have tried increasing the "pool" value from 5 to 10 and then to 20 in /etc/foreman/database.yml and that helped partially (~30 and ~40 registrations passed, respectively), but I still do not see a corresponding number of DB connections in our monitoring (max was 78; 63 of them for foreman).

Now I have tried changing the "concurrency" value from 5 to 10 in /etc/foreman/dynflow/worker.yml (with "pool" set back to 5 in /etc/foreman/database.yml) and got 25 passes (out of the 62 concurrent registrations).
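
For anyone replicating this, these are the two knobs being changed; the values below are just the ones from this comment, and the worker.yml layout is my assumption of the stock file:

    # /etc/foreman/database.yml - ActiveRecord connection pool per process
    production:
      ...
      pool: 5          # raised to 10 and 20 in the first experiment

    # /etc/foreman/dynflow/worker.yml - sidekiq worker concurrency (assumed layout)
    :concurrency: 5    # raised to 10 in the second experiment

As a rule of thumb, the pool should be at least as large as the number of threads using the database in that process; otherwise threads queue for a connection and eventually hit the 5 second timeout seen above.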

Comment 6 Jan Hutař 2020-09-07 10:56:24 UTC
(In reply to Adam Ruzicka from comment #4)
> Could you confirm that all dynflow-sidekiq@* services were restarted during
> that upgrade?

Yep, I have run `foreman-maintain service reboot` multiple times and the process age for sidekiq confirms it:

    [root@f03-h29-000-r620 ~]# ps axf | grep side
    28735 pts/2    S+     0:00  |       \_ grep --color=auto side
    24692 ?        Ssl    1:29 sidekiq 5.2.7  [0 of 1 busy]
    25507 ?        Ssl    1:19 sidekiq 5.2.7  [0 of 10 busy]
    25518 ?        Ssl    1:52 sidekiq 5.2.7  [0 of 5 busy]
    [root@f03-h29-000-r620 ~]# ps -p 24692,25507,25518 -o lstart
                     STARTED
    Mon Sep  7 10:38:49 2020
    Mon Sep  7 10:40:38 2020
    Mon Sep  7 10:40:38 2020
    [root@f03-h29-000-r620 ~]# date
    Mon Sep  7 10:52:02 UTC 2020

I'm trying to reboot to be 100% sure all is fresh.
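
(Side note: the unit start times can also be read straight from systemd instead of matching PIDs by hand:)

    # Show state and start time of each dynflow worker unit
    systemctl status 'dynflow-sidekiq@*' | grep -E 'dynflow-sidekiq@|Active:'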

Comment 8 Adam Ruzicka 2020-09-15 09:50:54 UTC
The log line from comment #3 leads me to believe the pool ran out of connections in one of the puma workers, while this BZ focused purely on the dynflow-sidekiq workers. The pool depletion in puma is tracked upstream as https://projects.theforeman.org/issues/30789/ .

Maybe it should have its own BZ; the symptoms are almost the same, but it happens in a different process and the fix is completely different.
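
To illustrate why the fix differs: puma keeps a pool of request threads per worker process, and each thread needs its own ActiveRecord connection, so the web-side fix (the linked puppet-foreman PR 882, "Set DB pool size dynamically") is about sizing the "pool" in /etc/foreman/database.yml, not anything inside Dynflow. A generic Rails-style sizing sketch, not necessarily how the upstream patch implements it:

    # Generic per-process rule: pool >= number of threads using the DB in that process
    # e.g. a puma worker running 16 threads would need roughly:
    production:
      pool: 16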

Comment 11 Mike McCune 2020-09-22 13:33:35 UTC
Once we switch back to Passenger in snap 16, we can re-test this, as it affects how DB pooling is utilized.

Comment 12 Mike McCune 2020-09-23 18:20:33 UTC
Moving back to ON_QA for a re-test now that we are running Passenger again.

Comment 15 Imaan 2020-10-06 13:03:03 UTC
Hello,

I have synced more than 20 repositories and all of them completed successfully. Tested with 10, 20, and 30 repositories of iso, yum, and docker repo types.

This bug has been verified in a new snap of Satellite 6.8.

Thank you.

Comment 18 errata-xmlrpc 2020-10-27 13:05:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.8 release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4366


