Bug 2203353
| Summary: | 6.13 upgrade fails with deadlock detected when migrating pulpcore database | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | matt jia <mjia> |
| Component: | Upgrades | Assignee: | Evgeni Golov <egolov> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Satellite QE Team <sat-qe-bz-list> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.13.1 | CC: | ahumbe, dalley, egolov, ehelms, ekohlvan, jbhatia, jentrena, pdwyer, risantam, rlavi, saydas |
| Target Milestone: | Unspecified | Keywords: | Triaged, Upgrades, WorkAround |
| Target Release: | Unused | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-09-27 15:09:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2211502 | | |
| Bug Blocks: | | | |
Description
matt jia
2023-05-12 04:53:50 UTC
An additional traceback:
$ PULP_SETTINGS=/etc/pulp/settings.py pulpcore-manager migrate --noinput
Operations to perform:
Apply all migrations: ansible, auth, certguard, container, contenttypes, core, file, rpm, sessions
Running migrations:
Applying core.0090_char_to_text_field...Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL: Process 31321 waits for AccessExclusiveLock on relation 1373490 of database 647534; blocked by process 11607.
Process 11607 waits for AccessShareLock on relation 647792 of database 647534; blocked by process 31321.
HINT: See server log for query details.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/bin/pulpcore-manager", line 33, in <module>
sys.exit(load_entry_point('pulpcore==3.21.6', 'console_scripts', 'pulpcore-manager')())
File "/usr/lib/python3.9/site-packages/pulpcore/app/manage.py", line 11, in manage
execute_from_command_line(sys.argv)
File "/usr/lib/python3.9/site-packages/django/core/management/__init__.py", line 419, in execute_from_command_line
utility.execute()
File "/usr/lib/python3.9/site-packages/django/core/management/__init__.py", line 413, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/usr/lib/python3.9/site-packages/django/core/management/base.py", line 354, in run_from_argv
self.execute(*args, **cmd_options)
File "/usr/lib/python3.9/site-packages/django/core/management/base.py", line 398, in execute
output = self.handle(*args, **options)
File "/usr/lib/python3.9/site-packages/django/core/management/base.py", line 89, in wrapped
res = handle_func(*args, **kwargs)
File "/usr/lib/python3.9/site-packages/django/core/management/commands/migrate.py", line 244, in handle
post_migrate_state = executor.migrate(
File "/usr/lib/python3.9/site-packages/django/db/migrations/executor.py", line 117, in migrate
state = self._migrate_all_forwards(state, plan, full_plan, fake=fake, fake_initial=fake_initial)
File "/usr/lib/python3.9/site-packages/django/db/migrations/executor.py", line 147, in _migrate_all_forwards
state = self.apply_migration(state, migration, fake=fake, fake_initial=fake_initial)
File "/usr/lib/python3.9/site-packages/django/db/migrations/executor.py", line 227, in apply_migration
state = migration.apply(state, schema_editor)
File "/usr/lib/python3.9/site-packages/django/db/migrations/migration.py", line 126, in apply
operation.database_forwards(self.app_label, schema_editor, old_state, project_state)
File "/usr/lib/python3.9/site-packages/django/db/migrations/operations/fields.py", line 244, in database_forwards
schema_editor.alter_field(from_model, from_field, to_field)
File "/usr/lib/python3.9/site-packages/django/db/backends/base/schema.py", line 608, in alter_field
self._alter_field(model, old_field, new_field, old_type, new_type,
File "/usr/lib/python3.9/site-packages/django/db/backends/postgresql/schema.py", line 194, in _alter_field
self.execute(self._delete_index_sql(model, index_name))
File "/usr/lib/python3.9/site-packages/django/db/backends/base/schema.py", line 145, in execute
cursor.execute(sql, params)
File "/usr/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
File "/usr/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
return executor(sql, params, many, context)
File "/usr/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/usr/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/usr/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
django.db.utils.OperationalError: deadlock detected
DETAIL: Process 31321 waits for AccessExclusiveLock on relation 1373490 of database 647534; blocked by process 11607.
Process 11607 waits for AccessShareLock on relation 647792 of database 647534; blocked by process 31321.
HINT: See server log for query details.
Another traceback in case it helps:
2023-05-08 15:48:21 EDT LOG: duration: 21045.299 ms statement: ALTER TABLE "core_task" ALTER COLUMN "reserved_resources_record" TYPE text[] USING "reserved_resources_record"::text[]
2023-05-08 15:48:22 EDT ERROR: deadlock detected
2023-05-08 15:48:22 EDT DETAIL: Process 17216 waits for AccessExclusiveLock on relation 1373490 of database 647534; blocked by process 11610.
Process 11610 waits for AccessShareLock on relation 647792 of database 647534; blocked by process 17216.
Process 17216: DROP INDEX IF EXISTS "core_taskschedule_name_f9ade8e8_like"
Process 11610: SELECT "core_taskschedule"."pulp_id", "core_taskschedule"."pulp_created", "core_taskschedule"."pulp_last_updated", "core_taskschedule"."name", "core_taskschedule"."next_dispatch", "core_taskschedule"."dispatch_interval", "core_taskschedule"."task_name", "core_taskschedule"."last_task_id" FROM "core_taskschedule" LEFT OUTER JOIN "core_task" ON ("core_taskschedule"."last_task_id" = "core_task"."pulp_id") WHERE ("core_taskschedule"."next_dispatch" <= '2023-05-08T19:48:07.457888+00:00'::timestamptz AND ("core_taskschedule"."last_task_id" IS NULL OR "core_task"."state" IN ('skipped', 'completed', 'failed', 'canceled')))
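Reading the two statements above, the likely cycle is: the migration transaction (process 17216) already holds an AccessExclusiveLock on core_task from the preceding ALTER TABLE and now needs an AccessExclusiveLock to drop the core_taskschedule index, while the task-schedule query (process 11610) holds an AccessShareLock on core_taskschedule and waits for an AccessShareLock on core_task for its LEFT OUTER JOIN. For anyone debugging this live, here is a minimal sketch of how to see who is blocking whom, assuming the Pulp database is named pulpcore and psql access as the postgres user:

# Hypothetical live check: list backends in the Pulp database that are waiting
# on a lock, together with the PIDs blocking them (pg_blocking_pids, PostgreSQL 9.6+).
sudo -u postgres psql pulpcore -c "
  SELECT pid,
         pg_blocking_pids(pid) AS blocked_by,
         state,
         left(query, 100)      AS query
    FROM pg_stat_activity
   WHERE datname = 'pulpcore'
     AND cardinality(pg_blocking_pids(pid)) > 0;"

If the blocking PID maps to a pulpcore-worker or pulpcore-api backend, that matches the analysis further down: the Pulp services were still running while the migration tried to take its exclusive locks.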
This error makes me wonder if we need to / can add something that ensures all services are stopped before performing the database migration [1]? I recall we encountered this problem with Pulp 2 and ended up using transitions to help solve it [2].

[1] https://github.com/theforeman/puppet-pulpcore/blob/master/manifests/database.pp#L29
[2] https://github.com/theforeman/puppet-pulp/blob/d064b340cae3ca8c2ab61d02179020b711b0cdc8/manifests/database.pp#LL11C1-L11C1

Let's look at the sosreport in the attached case and establish a timeline:

% zgrep -h "Running foreman-maintain command" var/log/foreman-maintain/foreman-maintain.log* | sort -h
I, [2023-05-08 18:13:22-0700 #906990] INFO -- : Running foreman-maintain command with arguments [["upgrade", "run", "--target-version", "6.12"]]
I, [2023-05-08 18:14:03-0700 #907509] INFO -- : Running foreman-maintain command with arguments [["packages", "-h"]]
I, [2023-05-08 18:14:10-0700 #907835] INFO -- : Running foreman-maintain command with arguments [["packages", "is-locked"]]
I, [2023-05-08 18:15:37-0700 #910780] INFO -- : Running foreman-maintain command with arguments [["packages", "lock", "--assumeyes"]]
I, [2023-05-10 10:48:43-0700 #44651] INFO -- : Running foreman-maintain command with arguments [["service", "status"]]
I, [2023-05-10 15:22:20-0700 #52575] INFO -- : Running foreman-maintain command with arguments [["upgrade", "list-versions"]]
I, [2023-05-10 15:24:42-0700 #52724] INFO -- : Running foreman-maintain command with arguments [["upgrade", "list-versions"]]
I, [2023-05-10 15:25:02-0700 #53112] INFO -- : Running foreman-maintain command with arguments [["upgrade", "check", "--target-version", "6.13"]]
I, [2023-05-10 15:27:21-0700 #54311] INFO -- : Running foreman-maintain command with arguments [["upgrade", "run", "--target-version", "6.13"]]
I, [2023-05-10 15:59:17-0700 #62626] INFO -- : Running foreman-maintain command with arguments [["service", "status"]]
I, [2023-05-10 16:22:32-0700 #66250] INFO -- : Running foreman-maintain command with arguments [["maintenance-mode", "status"]]
I, [2023-05-10 16:22:46-0700 #66365] INFO -- : Running foreman-maintain command with arguments [["service", "stop"]]
I, [2023-05-10 16:23:02-0700 #66880] INFO -- : Running foreman-maintain command with arguments [["maintenance-mode", "status"]]
I, [2023-05-10 16:23:13-0700 #66995] INFO -- : Running foreman-maintain command with arguments [["service", "start"]]
I, [2023-05-11 10:43:10-0700 #89377] INFO -- : Running foreman-maintain command with arguments [["packages", "unlock"]]
I, [2023-05-11 10:43:36-0700 #89498] INFO -- : Running foreman-maintain command with arguments [["packages", "lock"]]
I, [2023-05-11 12:55:55-0700 #91909] INFO -- : Running foreman-maintain command with arguments [["service", "status"]]
I, [2023-05-11 12:56:04-0700 #92362] INFO -- : Running foreman-maintain command with arguments [["maintenance-mode", "status"]]
I, [2023-05-11 12:58:34-0700 #92648] INFO -- : Running foreman-maintain command with arguments [["upgrade", "run", "--target-version", "6.13", "-y"]]
I, [2023-05-11 15:16:52-0700 #95286] INFO -- : Running foreman-maintain command with arguments [["packages", "-h"]]
I, [2023-05-11 15:16:58-0700 #95520] INFO -- : Running foreman-maintain command with arguments [["packages", "is-locked"]]
I, [2023-05-11 15:22:32-0700 #100330] INFO -- : Running foreman-maintain command with arguments [["packages", "lock", "--assumeyes"]]
I, [2023-05-11 16:18:33-0700 #101236] INFO -- : Running foreman-maintain command with arguments [["packages", "-h"]]
I, [2023-05-11 16:18:38-0700 #101467] INFO -- : Running foreman-maintain command with arguments [["packages", "is-locked"]]
I, [2023-05-11 16:18:39-0700 #101577] INFO -- : Running foreman-maintain command with arguments [["packages", "unlock", "--assumeyes"]]
I, [2023-05-11 16:21:41-0700 #103116] INFO -- : Running foreman-maintain command with arguments [["packages", "lock", "--assumeyes"]]
I, [2023-05-11 16:31:39-0700 #103374] INFO -- : Running foreman-maintain command with arguments [["packages", "-h"]]
I, [2023-05-11 16:31:45-0700 #103607] INFO -- : Running foreman-maintain command with arguments [["packages", "is-locked"]]
I, [2023-05-11 16:31:46-0700 #103717] INFO -- : Running foreman-maintain command with arguments [["packages", "unlock", "--assumeyes"]]
I, [2023-05-11 16:33:19-0700 #104830] INFO -- : Running foreman-maintain command with arguments [["packages", "lock", "--assumeyes"]]
I, [2023-05-11 18:03:34-0700 #106149] INFO -- : Running foreman-maintain command with arguments [["upgrade", "run", "--target-version", "6.13", "-y"]]
I, [2023-05-11 18:04:22-0700 #106594] INFO -- : Running foreman-maintain command with arguments [["packages", "-h"]]
I, [2023-05-11 18:04:27-0700 #106826] INFO -- : Running foreman-maintain command with arguments [["packages", "is-locked"]]
I, [2023-05-11 18:06:02-0700 #107945] INFO -- : Running foreman-maintain command with arguments [["packages", "lock", "--assumeyes"]]

In detail, that means:

- 2023-05-08 18:13: upgrade to 6.12
- 2023-05-10 15:27: first attempt to upgrade to 6.13, which failed (at around 2023-05-10 15:46:17) with

  Running scriptlet: rubygem-smart_proxy_remote_execution_ssh-0.10.1- 235/482
  mv: cannot move '/usr/share/foreman-proxy/.ssh' to '/var/lib/foreman-proxy/ssh/.ssh': File exists
  error: %prein(rubygem-smart_proxy_remote_execution_ssh-0.10.1-1.el8sat.noarch) scriptlet failed, exit status 1
  Error in PREIN scriptlet in rpm package rubygem-smart_proxy_remote_execution_ssh
  Upgrading : python39-pulp-container-2.14.3-1.el8pc.noarch 236/482
  error: rubygem-smart_proxy_remote_execution_ssh-0.10.1-1.el8sat.noarch: install failed

- 2023-05-11 12:58: another attempt to upgrade to 6.13, still failing:

  mv: cannot move '/usr/share/foreman-proxy/.ssh' to '/var/lib/foreman-proxy/ssh/.ssh': File exists
  error: %prein(rubygem-smart_proxy_remote_execution_ssh-0.10.1-1.el8sat.noarch) scriptlet failed, exit status 1
  Error in PREIN scriptlet in rpm package rubygem-smart_proxy_remote_execution_ssh
  Verifying : rubygem-smart_proxy_remote_execution_ssh-0.10.1-1.el 1/2
  Verifying : rubygem-smart_proxy_remote_execution_ssh-0.7.3-1.el8 2/2
  Installed products updated.
  Error: Transaction failed

- 2023-05-11 18:03: another attempt to upgrade; this time rubygem-smart_proxy_remote_execution_ssh-0.10.1-1.el8sat.noarch was upgraded correctly, but the run then failed on

  /Stage[main]/Pulpcore::Database/Pulpcore::Admin[migrate --noinput]/Exec[pulpcore-manager migrate --noinput]/returns: change from 'notrun' to ['0'] failed: 'pulpcore-manager migrate --noinput' returned 1 instead of one of [0]

However, if we look at the PostgreSQL and Installer logs, we see AccessExclusiveLock errors *before* that run (but after the last failed attempt):

% grep -r AccessExclusiveLock var/
var/lib/pgsql/data/log/postgresql-Thu.log:2023-05-11 15:20:42 PDT DETAIL: Process 100014 waits for AccessExclusiveLock on relation 3049568 of database 2369498; blocked by process 67704.
var/lib/pgsql/data/log/postgresql-Thu.log:2023-05-11 15:21:01 PDT DETAIL: Process 100151 waits for AccessExclusiveLock on relation 3049568 of database 2369498; blocked by process 67682.
var/lib/pgsql/data/log/postgresql-Thu.log:2023-05-11 16:20:11 PDT DETAIL: Process 102858 waits for AccessExclusiveLock on relation 3049568 of database 2369498; blocked by process 102575.
var/lib/pgsql/data/log/postgresql-Thu.log:2023-05-11 16:33:12 PDT DETAIL: Process 104808 waits for AccessExclusiveLock on relation 3049568 of database 2369498; blocked by process 102575.
var/lib/pgsql/data/log/postgresql-Thu.log:2023-05-11 18:05:52 PDT DETAIL: Process 107920 waits for AccessExclusiveLock on relation 3049568 of database 2369498; blocked by process 102571.
var/log/foreman-installer/satellite.20230511-161833.log:2023-05-11 15:20:42 [INFO ] [configure] /Stage[main]/Pulpcore::Database/Pulpcore::Admin[migrate --noinput]/Exec[pulpcore-manager migrate --noinput]/returns: DETAIL: Process 100014 waits for AccessExclusiveLock on relation 3049568 of database 2369498; blocked by process 67704.
var/log/foreman-installer/satellite.20230511-161833.log:2023-05-11 15:20:42 [INFO ] [configure] /Stage[main]/Pulpcore::Database/Pulpcore::Admin[migrate --noinput]/Exec[pulpcore-manager migrate --noinput]/returns: DETAIL: Process 100014 waits for AccessExclusiveLock on relation 3049568 of database 2369498; blocked by process 67704.
var/log/foreman-installer/satellite.20230511-161833.log:2023-05-11 15:21:02 [INFO ] [configure] /Stage[main]/Pulpcore::Database/Pulpcore::Admin[migrate --noinput]/Exec[pulpcore-manager migrate --noinput]/returns: DETAIL: Process 100151 waits for AccessExclusiveLock on relation 3049568 of database 2369498; blocked by process 67682.
var/log/foreman-installer/satellite.20230511-161833.log:2023-05-11 15:21:02 [INFO ] [configure] /Stage[main]/Pulpcore::Database/Pulpcore::Admin[migrate --noinput]/Exec[pulpcore-manager migrate --noinput]/returns: DETAIL: Process 100151 waits for AccessExclusiveLock on relation 3049568 of database 2369498; blocked by process 67682.
var/log/foreman-installer/satellite.20230511-163139.log:2023-05-11 16:20:11 [INFO ] [configure] /Stage[main]/Pulpcore::Database/Pulpcore::Admin[migrate --noinput]/Exec[pulpcore-manager migrate --noinput]/returns: DETAIL: Process 102858 waits for AccessExclusiveLock on relation 3049568 of database 2369498; blocked by process 102575.
var/log/foreman-installer/satellite.20230511-163139.log:2023-05-11 16:20:11 [INFO ] [configure] /Stage[main]/Pulpcore::Database/Pulpcore::Admin[migrate --noinput]/Exec[pulpcore-manager migrate --noinput]/returns: DETAIL: Process 102858 waits for AccessExclusiveLock on relation 3049568 of database 2369498; blocked by process 102575.
var/log/foreman-installer/satellite.20230511-180422.log:2023-05-11 16:33:13 [INFO ] [configure] /Stage[main]/Pulpcore::Database/Pulpcore::Admin[migrate --noinput]/Exec[pulpcore-manager migrate --noinput]/returns: DETAIL: Process 104808 waits for AccessExclusiveLock on relation 3049568 of database 2369498; blocked by process 102575.
var/log/foreman-installer/satellite.20230511-180422.log:2023-05-11 16:33:13 [INFO ] [configure] /Stage[main]/Pulpcore::Database/Pulpcore::Admin[migrate --noinput]/Exec[pulpcore-manager migrate --noinput]/returns: DETAIL: Process 104808 waits for AccessExclusiveLock on relation 3049568 of database 2369498; blocked by process 102575.
var/log/foreman-installer/satellite.log:2023-05-11 18:05:53 [INFO ] [configure] /Stage[main]/Pulpcore::Database/Pulpcore::Admin[migrate --noinput]/Exec[pulpcore-manager migrate --noinput]/returns: DETAIL: Process 107920 waits for AccessExclusiveLock on relation 3049568 of database 2369498; blocked by process 102571.
var/log/foreman-installer/satellite.log:2023-05-11 18:05:53 [INFO ] [configure] /Stage[main]/Pulpcore::Database/Pulpcore::Admin[migrate --noinput]/Exec[pulpcore-manager migrate --noinput]/returns: DETAIL: Process 107920 waits for AccessExclusiveLock on relation 3049568 of database 2369498; blocked by process 102571.

This is interesting, but if we look again at the maintain logs, the "service stop" step was not executed for the "2023-05-11 18:03" run:

% grep "Stop applicable services" var/log/foreman-maintain/foreman-maintain.log*
var/log/foreman-maintain/foreman-maintain.log-20230511:I, [2023-05-10 15:42:28-0700 #54311] INFO -- : --- Execution step 'Stop applicable services' [service-stop] started ---
var/log/foreman-maintain/foreman-maintain.log-20230511:I, [2023-05-10 15:42:35-0700 #54311] INFO -- : --- Execution step 'Stop applicable services' finished ---
var/log/foreman-maintain/foreman-maintain.log-20230511:I, [2023-05-10 16:22:46-0700 #66365] INFO -- : --- Execution step 'Stop applicable services' [service-stop] started ---
var/log/foreman-maintain/foreman-maintain.log-20230511:I, [2023-05-10 16:22:47-0700 #66365] INFO -- : --- Execution step 'Stop applicable services' finished ---

% grep "Procedures before migrating to Satellite 6.13" var/log/foreman-maintain/foreman-maintain.log*
var/log/foreman-maintain/foreman-maintain.log:I, [2023-05-11 12:58:46-0700 #92648] INFO -- : === Scenario 'Procedures before migrating to Satellite 6.13' started ===
var/log/foreman-maintain/foreman-maintain.log:I, [2023-05-11 12:58:46-0700 #92648] INFO -- : === Scenario 'Procedures before migrating to Satellite 6.13' finished ===
var/log/foreman-maintain/foreman-maintain.log:I, [2023-05-11 18:03:46-0700 #106149] INFO -- : === Scenario 'Procedures before migrating to Satellite 6.13' started ===
var/log/foreman-maintain/foreman-maintain.log:I, [2023-05-11 18:03:46-0700 #106149] INFO -- : === Scenario 'Procedures before migrating to Satellite 6.13' finished ===
var/log/foreman-maintain/foreman-maintain.log-20230511:I, [2023-05-10 15:40:28-0700 #54311] INFO -- : === Scenario 'Procedures before migrating to Satellite 6.13' started ===
var/log/foreman-maintain/foreman-maintain.log-20230511:I, [2023-05-10 15:42:35-0700 #54311] INFO -- : === Scenario 'Procedures before migrating to Satellite 6.13' finished ===

Now, I still don't completely understand *why* the services weren't stopped in the "2023-05-11 18:03" run, but it explains why the deadlock happened: Pulpcore was running at that point (but shouldn't have been). The Installer runs without foreman-maintain also didn't stop the services and thus ran into the issue.

So looking further at the logs and at the f-m data in var/lib/foreman-maintain/data.yml, I think what happened is: f-m thought it got interrupted in the middle of an upgrade (which, strictly speaking, is correct, as yum failing to upgrade rubygem-smart_proxy_remote_execution_ssh aborts the rest of the scenario), but when resuming the operation it thought the other steps were already done and thus did not stop the services again. I'd argue this is a bug -- certain steps of the scenario should happen regardless of whether they already happened.
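Given that analysis, it may be worth confirming whether Pulp services are still connected to the database before retrying, and then applying the workaround described in the next comment. A rough sketch, not an official procedure, assuming the Pulp database is named pulpcore and the 6.13 target from this case:

# Hypothetical pre-flight check: any pulpcore-api / pulpcore-worker connections
# listed here would explain the deadlock during the migration.
sudo -u postgres psql -c "
  SELECT pid, usename, application_name, state
    FROM pg_stat_activity
   WHERE datname = 'pulpcore'
     AND pid <> pg_backend_pid();"

# Workaround (see below): stop the services, then re-run the upgrade, which
# performs the pulpcore migration again with nothing else connected.
satellite-maintain service stop
satellite-maintain upgrade run --target-version 6.13 -y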
The workaround for users running into this would be to manually stop the services (satellite-maintain service stop) and then perform the upgrade once more.

*** Bug 2219589 has been marked as a duplicate of this bug. ***

This is addressed via https://bugzilla.redhat.com/show_bug.cgi?id=2211502, since that change moves the service stop out of the pre-migration phase and into the main migration phase, which will ensure services are stopped every time. Therefore I am going to recommend that https://bugzilla.redhat.com/show_bug.cgi?id=2211502 be considered for 6.13.z.

Closing given https://bugzilla.redhat.com/show_bug.cgi?id=2211502 addresses the concern.
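For completeness, one way to confirm afterwards that the pulpcore migrations applied cleanly is Django's standard showmigrations command; a sketch assuming the same PULP_SETTINGS path used above:

# Hypothetical post-check: all core migrations, including 0090_char_to_text_field,
# should show as applied ([X]) once the upgrade has gone through.
PULP_SETTINGS=/etc/pulp/settings.py pulpcore-manager showmigrations core | tail -n 20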