Bug 1869812 - Tasks fail to complete under load
Summary: Tasks fail to complete under load
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Dynflow
Version: 6.8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: 6.8.0
Assignee: Adam Ruzicka
QA Contact: Imaan
URL:
Whiteboard:
Depends On:
Blocks: 1691416
 
Reported: 2020-08-18 17:29 UTC by Mike McCune
Modified: 2021-01-04 11:22 UTC

Fixed In Version: tfm-rubygem-dynflow-1.4.7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 13:05:30 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github Dynflow dynflow pull 362/ 0 None None None 2020-10-05 06:09:12 UTC
Github theforeman puppet-foreman pull 882 0 None closed Fixes #30789 - Set DB pool size dynamically 2021-02-10 13:44:45 UTC
Red Hat Product Errata RHSA-2020:4366 0 None None None 2020-10-27 13:08:38 UTC

Description Mike McCune 2020-08-18 17:29:44 UTC
When synchronizing 20+ repositories, users may see one or more of the synchronization tasks never complete, with a step stuck showing:

"waiting for Pulp to start the task"

when, in fact, Pulp has already completed the task.

This is resolved upstream with:

https://github.com/Dynflow/dynflow/pull/362/

Errors in the log include:

PersistenceError in executor 
caused by Sequel::PoolTimeout: timeout: 5.0, elapsed: 5.000096781004686 (Dynflow::Errors::PersistenceError)

This is a regression from 6.7.
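
For anyone checking whether an installation is hitting this, both messages can be grepped out of the logs. A rough sketch, assuming default log locations (adjust paths for your setup):

    # Look for the pool timeout / persistence errors (log locations are assumptions)
    grep "Sequel::PoolTimeout" /var/log/foreman/production.log | tail -n 5
    journalctl -u 'dynflow-sidekiq@*' | grep -c "Dynflow::Errors::PersistenceError"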

Comment 1 Adam Ruzicka 2020-08-28 11:52:43 UTC
Upstream fix was merged, moving to POST

Comment 2 Adam Ruzicka 2020-08-29 09:00:25 UTC
Upstream release 1.4.7 containing the fix for this BZ is out, moving to MODIFIED.

Comment 3 Jan Hutař 2020-09-07 10:21:22 UTC
Hello. Even with tfm-rubygem-dynflow-1.4.7-1.fm2_1.el7sat.noarch (upgraded using `yum upgrade; satellite-installer --scenario satellite`) I'm still getting lots of these errors when registering 62 hosts in parallel:

    2020-09-07T09:51:20 [E|kat|6d773629] ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.003 seconds); all pooled connections were in use

(30 passed, 32 failed).

Looking into our monitoring, we are now topping out at 59 connections to PostgreSQL (47 for foreman, 12 for candlepin) where we had 48 before (39 for foreman and 9 for candlepin) - so there probably is some improvement, but either there is still some issue, or I'm missing something?
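
(For reference, the per-database connection counts can also be read directly from PostgreSQL; a quick sketch, run on the Satellite box as the postgres user:)

    # Count open connections per database (foreman, candlepin, ...)
    su - postgres -c "psql -c 'SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname ORDER BY 2 DESC;'"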

Comment 4 Adam Ruzicka 2020-09-07 10:30:48 UTC
Could you confirm that all dynflow-sidekiq@* services were restarted during that upgrade?
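
(As a sketch, restarting those units directly, or the whole stack, and confirming they are back up; exact unit names may differ per install:)

    # Restart only the dynflow workers, then list them
    systemctl restart 'dynflow-sidekiq@*'
    systemctl list-units 'dynflow-sidekiq@*' --no-legend

    # Or restart everything Satellite-related
    foreman-maintain service restart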

Comment 5 Jan Hutař 2020-09-07 10:50:14 UTC
I have tried increasing the "pool" value from 5 to 10 and then to 20 in /etc/foreman/database.yml and that helped partially (~30 and ~40 registrations passed, respectively), but I still do not see a corresponding number of DB connections in our monitoring (max was 78; 63 of them for foreman).

Now I have tried changing the "concurrency" value from 5 to 10 in /etc/foreman/dynflow/worker.yml (with "pool" set back to 5 in /etc/foreman/database.yml) and got 25 passes (out of the 62 concurrent registrations).
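
For anyone replicating this, these are the two knobs being changed; the values below are just the ones from this comment, and the worker.yml layout is my assumption of the stock file:

    # /etc/foreman/database.yml - ActiveRecord connection pool per process
    production:
      ...
      pool: 5          # raised to 10 and 20 in the first experiment

    # /etc/foreman/dynflow/worker.yml - sidekiq worker concurrency (assumed layout)
    :concurrency: 5    # raised to 10 in the second experiment

As a rule of thumb, the pool should be at least as large as the number of threads using the database in that process; otherwise threads queue for a connection and eventually hit the 5 second timeout seen above.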

Comment 6 Jan Hutař 2020-09-07 10:56:24 UTC
(In reply to Adam Ruzicka from comment #4)
> Could you confirm that all dynflow-sidekiq@* services were restarted during
> that upgrade?

Yep, I have run `foreman-maintain service reboot` multiple times and the process age for sidekiq confirms it:

    [root@f03-h29-000-r620 ~]# ps axf | grep side
    28735 pts/2    S+     0:00  |       \_ grep --color=auto side
    24692 ?        Ssl    1:29 sidekiq 5.2.7  [0 of 1 busy]
    25507 ?        Ssl    1:19 sidekiq 5.2.7  [0 of 10 busy]
    25518 ?        Ssl    1:52 sidekiq 5.2.7  [0 of 5 busy]
    [root@f03-h29-000-r620 ~]# ps -p 24692,25507,25518 -o lstart
                     STARTED
    Mon Sep  7 10:38:49 2020
    Mon Sep  7 10:40:38 2020
    Mon Sep  7 10:40:38 2020
    [root@f03-h29-000-r620 ~]# date
    Mon Sep  7 10:52:02 UTC 2020

I'm trying to reboot to be 100% sure all is fresh.
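
(Side note: the unit start times can also be read straight from systemd instead of matching PIDs by hand:)

    # Show state and start time of each dynflow worker unit
    systemctl status 'dynflow-sidekiq@*' | grep -E 'dynflow-sidekiq@|Active:'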

Comment 8 Adam Ruzicka 2020-09-15 09:50:54 UTC
The log line from comment #3 leads me to believe the pool ran out of connections in one of the puma workers, while this BZ focused purely on the dynflow-sidekiq workers. The pool depletion in puma is tracked upstream as https://projects.theforeman.org/issues/30789/ .

Maybe it should have its own BZ; the symptoms are almost the same, but it happens in a different process and the fix is completely different.
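
To illustrate why the fix differs: puma keeps a pool of request threads per worker process, and each thread needs its own ActiveRecord connection, so the web-side fix (the linked puppet-foreman PR 882, "Set DB pool size dynamically") is about sizing the "pool" in /etc/foreman/database.yml, not anything inside Dynflow. A generic Rails-style sizing sketch, not necessarily how the upstream patch implements it:

    # Generic per-process rule: pool >= number of threads using the DB in that process
    # e.g. a puma worker running 16 threads would need roughly:
    production:
      pool: 16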

Comment 11 Mike McCune 2020-09-22 13:33:35 UTC
Once we switch back to Passenger in snap 16, we can re-test this, as it affects how DB pooling is utilized.

Comment 12 Mike McCune 2020-09-23 18:20:33 UTC
Moving back to ON_QA for a re-test now that we are running Passenger again.

Comment 15 Imaan 2020-10-06 13:03:03 UTC
Hello,

I have synced more than 20 repositories and all of them completed successfully. Tested with 10, 20, and 30 repositories of iso, yum, and docker repo types.

This bug has been verified in a new snap of Satellite 6.8.

Thank you.

Comment 18 errata-xmlrpc 2020-10-27 13:05:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.8 release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4366


