Note: This bug is displayed in read-only format because
the product is no longer active in Red Hat Bugzilla.
Red Hat Satellite engineering is moving the tracking of its product development work on Satellite to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "Satellite project" in Red Hat Jira and file new tickets there. Individual Bugzilla bugs will be migrated starting at the end of May. If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry; the email creates a ServiceNow ticket with Red Hat.
Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and tagged with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", will have a little "two-footprint" icon next to it, and will direct you to the "Satellite project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/SAT-XXXX", where "X" is a digit). The same link will also appear in a blue banner at the top of the page informing you that the bug has been migrated.
Installer clone of the BZ, as the installer should not time out after 5 minutes.
+++ This bug was initially created as a clone of Bug #1904963 +++
Description of problem:
During an upgrade of 6.7.5 to 6.8.2 (snap 2) the db:migrate step in the installer fails, as it doesn't finish in 5 minutes (which is the default timeout for Puppet exec).
[DEBUG 2020-12-04T13:14:09 main] Exec[foreman-rake-db:migrate](provider=posix): Executing '/usr/sbin/foreman-rake db:migrate'
[DEBUG 2020-12-04T13:14:09 main] Executing with uid=foreman: '/usr/sbin/foreman-rake db:migrate'
[ERROR 2020-12-04T13:19:09 main] Command exceeded timeout
One interesting detail is that Puppet does *NOT* kill the process, so db:migrate keeps running in the background, and trying to re-run the installer results in other, confusing errors (ActiveRecord::ConcurrentMigrationError, Foreman::PermissionMissingException -- the latter seems to happen on the "unless" command).
The migration run took roughly 6 hours on my setup, and I could finish the upgrade just fine after that.
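The failure mode above (the supervisor reports a timeout but leaves the child command running) can be sketched in a few lines. This is an illustrative Python analogy, not Satellite or Puppet code; `sleep 60` stands in for the long-running `/usr/sbin/foreman-rake db:migrate`:

```python
# Minimal sketch of the failure mode: a supervisor enforces a timeout on a
# child command and reports the timeout, but the child keeps running in the
# background -- just like db:migrate kept running after the installer gave up.
import subprocess

# Stand-in for the long-running '/usr/sbin/foreman-rake db:migrate'
proc = subprocess.Popen(["sleep", "60"])

try:
    # Analogous to Puppet's 300-second default exec timeout.
    # Popen.communicate() raises on timeout but does NOT kill the child.
    proc.communicate(timeout=1)
except subprocess.TimeoutExpired:
    print("Command exceeded timeout")

# The child survived the timeout and is still running.
print("child still running:", proc.poll() is None)

proc.kill()  # clean up in this demo; the real migration was left to finish
proc.wait()
```

Re-running the supervisor while the orphaned child still holds the migration lock is exactly what produces the confusing ActiveRecord::ConcurrentMigrationError described above.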
Looking at the PostgreSQL activity during the migration, I've seen the following query over and over again:
postgres=# select backend_start,xact_start,query_start,query from pg_stat_activity where state='active';
2020-12-04 13:15:07.457939-05 | 2020-12-04 13:15:08.027364-05 | 2020-12-04 14:07:56.01994-05 | SELECT COUNT(*) FROM "katello_rpms" WHERE "katello_rpms"."id" IN (SELECT "katello_repository_rpms"."rpm_id" FROM "katello_repository_rpms" WHERE "katello_repository_rpms"."repository_id" IN (SELECT "katello_repositories"."id" FROM "katello_repositories" WHERE "katello_repositories"."content_view_version_id" = $1 AND (environment_id is NULL)))
As the migrations ran without a terminal attached, I don't know *which* migration triggered that, but looking at the Katello migrations between 6.7.5 and 6.8.2, I think it could be one of these:
db/migrate/20191204214919_add_content_view_version_counts.rb (most probably!)
db/migrate/20200129172534_add_epoch_version_release_arch_to_katello_installed_packages.rb
db/migrate/20200501155054_installed_package_unique_nvrea.rb
I think there are actually two bugs here:
1. the installer should not assume db:migrate can finish in 5 minutes; this is not realistic.
2. the migration should not require 6 hours to finish.
Version-Release number of selected component (if applicable):
Satellite 6.8.2 Snap 2
How reproducible:
100%
Steps to Reproduce:
1. take a 6.7.5 dogfood server (big db, rather slow-ish storage)
2. try to upgrade to 6.8.2
Actual results:
Fails
Expected results:
Succeeds ;-)
Additional info:
This is probably related, but not exactly the same as #1888983
--- Additional comment from Evgeni Golov on 2020-12-09 06:49:57 UTC ---
When running the migrations outside the installer, I could confirm which migration is the culprit:
== 20191204214919 AddContentViewVersionCounts: migrated (27884.8553s) =========
--- Additional comment from Evgeni Golov on 2020-12-09 08:54:00 UTC ---
You can find a 6.7.5 DB dump in dogfood:/root/dogfood-6.7.5.dump that can be used to replicate this issue.
--- Additional comment from Tomer Brisker on 2020-12-15 15:13:05 UTC ---
This migration was added in the fix for BZ 1486446. It looks like the migration first pulls all CVVs into memory, then iterates over them, triggering multiple count queries for each CVV (of which there could be thousands), then saves them, running all callbacks, which is also potentially slow. The migration should likely be changed to update the columns directly and possibly calculate the totals in a different, more efficient manner.
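The pattern described above can be illustrated with a small, self-contained demo. This is not the actual Katello migration code; the table and column names are simplified stand-ins modeled loosely on the query seen in pg_stat_activity. It contrasts the slow approach (one COUNT query per CVV) with a single correlated UPDATE that writes the columns directly, bypassing per-record callbacks:

```python
# Illustrative sketch (not Katello code): why one COUNT(*) per content view
# version (CVV) is slow, and how a single direct column update avoids both
# the N query round trips and the per-record save callbacks.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content_view_versions (id INTEGER PRIMARY KEY, rpm_count INTEGER);
    CREATE TABLE repository_rpms (rpm_id INTEGER, content_view_version_id INTEGER);
""")
conn.executemany("INSERT INTO content_view_versions (id, rpm_count) VALUES (?, 0)",
                 [(i,) for i in range(1, 4)])
conn.executemany("INSERT INTO repository_rpms VALUES (?, ?)",
                 [(1, 1), (2, 1), (3, 2)])

# Slow pattern: iterate over every CVV and issue a separate COUNT query for
# each one -- thousands of round trips on a large production database.
slow_counts = {}
for (cvv_id,) in conn.execute("SELECT id FROM content_view_versions"):
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM repository_rpms WHERE content_view_version_id = ?",
        (cvv_id,)).fetchone()
    slow_counts[cvv_id] = n

# Faster pattern: one statement updates the column directly for all rows,
# with no per-record model callbacks.
conn.execute("""
    UPDATE content_view_versions
    SET rpm_count = (SELECT COUNT(*) FROM repository_rpms
                     WHERE content_view_version_id = content_view_versions.id)
""")
fast_counts = dict(conn.execute("SELECT id, rpm_count FROM content_view_versions"))
print(slow_counts == fast_counts)  # both approaches agree on the totals
```

In ActiveRecord terms, the faster pattern corresponds roughly to `update_all` or raw SQL instead of loading each record and calling `save`.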
--- Additional comment from Tomer Brisker on 2020-12-15 15:14:26 UTC ---
Proposing for 6.8.z; 6 hours is unreasonable, and there is no point in fixing this only for 6.9, since by that time users will already have run this (but if they migrate to 6.8.3 with a fix, they won't).
This BZ has been reported against Satellite 6.8 with symptoms of the installer timing out while waiting for `foreman-rake db:migrate` to finish. However, the attached fixes also address a `pulpcore-manager migrate` timeout which could occur on Satellite 6.9 or later. That issue would present a different installer error (below) than the one originally reported in this BZ:
2021-04-06 18:43:34 [ERROR ] [configure] /Stage[main]/Pulpcore::Database/Pulpcore::Admin[migrate --noinput]/Exec[pulpcore-manager migrate --noinput]/returns: change from 'notrun' to ['0'] failed: Command exceeded timeout
The fix, in both cases, is to increase the default timeout for the installer to wait for the command to finish executing from 300 seconds (5 minutes) to 3600 seconds (one hour).
In some cases, such as an internal lab Satellite with 1,000,000+ packages synced, this one-hour timeout was still insufficient for `pulpcore-manager migrate`. If you face a similar issue and need to increase the timeout beyond the default, you can adjust the parameter in /etc/foreman-installer/custom-hiera.yaml as shown below; for example, to increase the timeout for `pulpcore-manager migrate` to two hours (7200 seconds):
pulpcore::database::timeout: 7200
Or, to instead increase the timeout for `foreman-rake db:migrate` to 7200 seconds, add to /etc/foreman-installer/custom-hiera.yaml:
foreman::database::timeout: 7200
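Taken together, a custom-hiera.yaml raising both timeouts might look like the following (the 7200-second values are examples; choose a value appropriate for your database size):

```yaml
# /etc/foreman-installer/custom-hiera.yaml
# Installer exec timeouts, in seconds (default after the fix: 3600).
foreman::database::timeout: 7200     # foreman-rake db:migrate
pulpcore::database::timeout: 7200    # pulpcore-manager migrate --noinput
```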
Note that these are the timeout values after which the installer exits while waiting for the command to run. That means the installer itself will exit with an error, but whichever `pulpcore-manager migrate --noinput` or `foreman-rake db:migrate` command timed out would continue running any pending DB migrations in the background. Re-running the installer immediately, before those migrations are completed, could result in other errors. Therefore, if you reach this scenario, you should monitor[1] the still-running process until it completes. Then you should modify /etc/foreman-installer/custom-hiera.yaml as above to prevent the issue from recurring on future database migrations when upgrading your Satellite, and re-run[2] the installer to finish any remaining configuration tasks which aren't executed by the installer until after the database migrations.
[1] Using top, ps, or any process monitor
[2] If you monitored the original migration process until it completed, the installer will find no pending migrations upon reaching that installation step and should progress more quickly to the actual remaining work.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: Satellite 6.9 Release), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:1313
--- Additional comment from Red Hat Bugzilla on 2023-09-15 00:53:07 UTC ---
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days