Bug 2001410 - Waiting tasks are not getting cancelled after restarting pulpcore workers or after migration script failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Pulp
Version: 6.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: 6.9.8
Assignee: satellite6-bugs
QA Contact: Lai
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-09-06 03:19 UTC by Hao Chang Yu
Modified: 2023-10-10 16:09 UTC (History)
13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-27 17:33:09 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Pulp Redmine 9547 0 Normal CLOSED - CURRENTRELEASE Backport "Failed task did not clean up properly resource reservations" to 3.7.z 2021-11-30 18:09:49 UTC
Red Hat Product Errata RHBA-2022:0320 0 None None None 2022-01-27 17:33:17 UTC

Description Hao Chang Yu 2021-09-06 03:19:44 UTC
Description of problem:
If the Pulp 2 to Pulp 3 migration fails with an unexpected error, such as insufficient disk space, the waiting tasks assigned to workers are not cancelled and their reserved resources are not released, even after restarting the pulpcore workers and the pulpcore resource manager.

Re-running the prepare script fails with "Only one migration plan can run or be reset at a time". Running prepare-abort does not solve the issue either.

$ satellite-maintain content prepare
Running Prepare content for Pulp 3
...
Prepare content for Pulp 3: 
Checking for valid Katello configuraton.
Starting task.
2021-09-13 16:03:43 +1000: Migrating content to Pulp 3 2269/2269WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
#<Thread:0x0000000012f8d560@/opt/theforeman/tfm/root/usr/share/gems/gems/logging-2.3.0/lib/logging/diagnostic_context.rb:471 run> terminated with exception (report_on_exception is true):
Traceback (most recent call last):
	15: from /opt/theforeman/tfm/root/usr/share/gems/gems/logging-2.3.0/lib/logging/diagnostic_context.rb:474:in `block in create_with_logging_context'
	14: from /opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.8/lib/dynflow/connectors/database.rb:25:in `block in start'
	13: from /opt/theforeman/tfm/root/usr/share/gems/gems/sequel-5.7.1/lib/sequel/adapters/postgres.rb:411:in `listen'
	12: from /opt/theforeman/tfm/root/usr/share/gems/gems/sequel-5.7.1/lib/sequel/adapters/postgres.rb:478:in `check_database_errors'
	11: from /opt/theforeman/tfm/root/usr/share/gems/gems/sequel-5.7.1/lib/sequel/adapters/postgres.rb:412:in `block in listen'
	10: from /opt/theforeman/tfm/root/usr/share/gems/gems/sequel-5.7.1/lib/sequel/database/connecting.rb:269:in `synchronize'
	 9: from /opt/theforeman/tfm/root/usr/share/gems/gems/sequel-5.7.1/lib/sequel/connection_pool/threaded.rb:91:in `hold'
	 8: from /opt/theforeman/tfm/root/usr/share/gems/gems/sequel-5.7.1/lib/sequel/adapters/postgres.rb:442:in `block (2 levels) in listen'
	 7: from /opt/theforeman/tfm/root/usr/share/gems/gems/sequel-5.7.1/lib/sequel/adapters/postgres.rb:442:in `ensure in block (2 levels) in listen'
	 6: from /opt/theforeman/tfm/root/usr/share/gems/gems/sequel-5.7.1/lib/sequel/adapters/postgres.rb:134:in `execute'
	 5: from /opt/theforeman/tfm/root/usr/share/gems/gems/sequel-5.7.1/lib/sequel/adapters/postgres.rb:110:in `check_disconnect_errors'
	 4: from /opt/theforeman/tfm/root/usr/share/gems/gems/sequel-5.7.1/lib/sequel/adapters/postgres.rb:134:in `block in execute'
	 3: from /opt/theforeman/tfm/root/usr/share/gems/gems/sequel-5.7.1/lib/sequel/adapters/postgres.rb:146:in `execute_query'
	 2: from /opt/theforeman/tfm/root/usr/share/gems/gems/sequel-5.7.1/lib/sequel/database/logging.rb:38:in `log_connection_yield'
	 1: from /opt/theforeman/tfm/root/usr/share/gems/gems/sequel-5.7.1/lib/sequel/adapters/postgres.rb:146:in `block in execute_query'
/opt/theforeman/tfm/root/usr/share/gems/gems/sequel-5.7.1/lib/sequel/adapters/postgres.rb:146:in `async_exec': PG::UnableToSend: no connection to the server (Sequel::DatabaseDisconnectError)
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
rake aborted!
ActiveRecord::StatementInvalid: PG::UnableToSend: server closed the connection unexpectedly
	This probably means the server terminated abnormally  <================== Postgres server terminated unexpectedly due to out of disk space
	before or while processing the request.


# Abort migration
$ satellite-maintain content prepare-abort
Running Abort all running Pulp 2 to Pulp 3 migration tasks
================================================================================
Abort all running Pulp 2 to Pulp 3 migration tasks: 
Cancelled 1 tasks.                                           [OK]
--------------------------------------------------------------------------------


# Run again
$ satellite-maintain content prepare 
Running Prepare content for Pulp 3
...
Prepare content for Pulp 3: 
Checking for valid Katello configuraton.
Starting task.
2021-09-13 16:32:57 +1000: Content migration starting. Migration failed, You will want to investigate: https://satellite.example.com/foreman_tasks/tasks/98c5e895-432d-4dee-806b-6074c71d1a15
rake aborted!
ForemanTasks::TaskError: Task 98c5e895-432d-4dee-806b-6074c71d1a15: Pulp2to3MigrationClient::ApiError: Error message: the server returns an error
HTTP status code: 400
Response headers: {"Date"=>"xxxxxxxxxxxxxx", "Server"=>"gunicorn/20.0.4", "Content-Type"=>"application/json", "Vary"=>"Accept,Cookie", "Allow"=>"POST, OPTIONS", "X-Frame-Options"=>"SAMEORIGIN", "Content-Length"=>"57", "Via"=>"1.1 satellite.example.com", "Connection"=>"close"}
Response body: ["Only one migration plan can run or be reset at a time"] <=============== Failed


# Waiting tasks are not cancelled after restarting the pulpcore workers
$ systemctl restart pulpcore-resource-manager pulpcore-worker@*
$ psql
pulpcore=# select pulp_id, name, state, worker_id from core_task where state = 'waiting' and worker_id is not null;
               pulp_id                |                           name                           |  state  |              worker_id               
--------------------------------------+----------------------------------------------------------+---------+--------------------------------------
 ce06bbdb-79ff-4599-a654-bd479a9bf9d6 | pulp_2to3_migration.app.migration.complex_repo_migration | waiting | e149a672-6b18-46c8-b101-84b68e058929
 1db98110-8c05-402e-8ed7-22a150bea1d1 | pulp_2to3_migration.app.migration.complex_repo_migration | waiting | 496f7cae-0a35-4b13-acce-490d5ae4451e
 e0f226c0-65bc-4c2b-84e3-096355bd9201 | pulp_2to3_migration.app.migration.complex_repo_migration | waiting | a4919996-c212-4e98-89a1-430325ffabe0
 05546a0b-0995-4cab-a492-a67bad1055e9 | pulp_2to3_migration.app.migration.complex_repo_migration | waiting | 6d4d278a-c761-40d9-8ba4-a4b4413ea739
 3796774b-c3e0-4c9b-a5dc-ca67bbe3fc06 | pulp_2to3_migration.app.migration.complex_repo_migration | waiting | b86a1dc2-e573-4997-8a65-4030edaa1def
 a7a987bf-bc97-409e-85a7-111d7fe5ebc9 | pulp_2to3_migration.app.migration.complex_repo_migration | waiting | da250740-f85d-4720-8756-8bdea8a53baa
 c0328867-99bd-4008-948a-a193330036df | pulp_2to3_migration.app.migration.complex_repo_migration | waiting | 90dd9e6c-4336-4d81-baaf-584521da2ee7
 149fc38d-0a9d-436b-b96b-d6b3f84122bc | pulp_2to3_migration.app.migration.complex_repo_migration | waiting | a24fb0ce-bb2a-47d4-8038-422305b66810
 985c1b42-26af-4465-b905-04fd8a56846c | pulp_2to3_migration.app.migration.complex_repo_migration | waiting | 5e484165-8485-4d01-8f36-835f661b4ad1
 04b7db6c-0c73-4814-8565-2b5360385d0d | pulp_2to3_migration.app.migration.complex_repo_migration | waiting | 538340c7-3d0b-4a5e-bb37-96508f466b4e
 7dd21f01-b9d9-4cce-85bd-ab1f61c9dbe3 | pulp_2to3_migration.app.migration.complex_repo_migration | waiting | 2f37829d-6205-480f-9bd5-04aa0de587d0
 52c4c1df-53f6-42da-8ddf-0b70aa0f509d | pulp_2to3_migration.app.migration.complex_repo_migration | waiting | c565339f-22e5-4185-b8da-58077803805c
...


# Try to cancel the tasks manually
$ foreman-rake console
> api = ::Katello::Pulp3::Api::Core.new(SmartProxy.default_capsule)
> api.tasks_api.list(state: "waiting").results.each { |task| pp api.cancel_task(task.pulp_href) }
#<PulpcoreClient::TaskResponse:0x0000000010137300
 @child_tasks=[],
 @created_resources=[],
 @name="pulp_2to3_migration.app.migration.complex_repo_migration",
 @parent_task="/pulp/api/v3/tasks/56152493-2013-431e-8b61-29400ae73bfc/",
 @progress_reports=[],
 @pulp_created=Sun, 12 Sep 2021 23:18:20 +0000,
 @pulp_href="/pulp/api/v3/tasks/884b97d4-a5d5-4798-8ab8-43ca50f26080/",
 @reserved_resources_record=
  ["/pulp/api/v3/repositories/rpm/rpm/ba5fd31c-315d-4569-bc37-33ef600abc99/"],
 @state="canceled",
 @task_group="/pulp/api/v3/task-groups/c999d3f7-7d4e-4865-b651-52cd76a4a421/">
#<PulpcoreClient::TaskResponse:0x00000000100d3d50
 @child_tasks=[],
 @created_resources=[],
 @name="pulp_2to3_migration.app.migration.complex_repo_migration",
 @parent_task="/pulp/api/v3/tasks/56152493-2013-431e-8b61-29400ae73bfc/",
 @progress_reports=[],
 @pulp_created=Sun, 12 Sep 2021 23:18:20 +0000,
 @pulp_href="/pulp/api/v3/tasks/f5605606-2ca9-40a0-9d82-77e71cc9bdea/",
 @reserved_resources_record=
  ["/pulp/api/v3/repositories/rpm/rpm/aace68b8-e71b-4ef9-8a65-a5c12cb662b5/"],
 @state="canceled",
 @task_group="/pulp/api/v3/task-groups/c999d3f7-7d4e-4865-b651-52cd76a4a421/">
#<PulpcoreClient::TaskResponse:0x000000000ffa9bf0
 @child_tasks=[],
 @created_resources=[],
 @name="pulp_2to3_migration.app.migration.complex_repo_migration",
 @parent_task="/pulp/api/v3/tasks/56152493-2013-431e-8b61-29400ae73bfc/",
 @progress_reports=[],
 @pulp_created=Sun, 12 Sep 2021 23:18:20 +0000,
 @pulp_href="/pulp/api/v3/tasks/b6743e03-a308-40f9-8d12-fb541b90ea07/",
 @reserved_resources_record=
  ["/pulp/api/v3/repositories/rpm/rpm/3b584cf2-27b7-4f36-955e-77e0f5ea187d/"],
 @state="canceled",
 @task_group="/pulp/api/v3/task-groups/c999d3f7-7d4e-4865-b651-52cd76a4a421/">
#<PulpcoreClient::TaskResponse:0x000000000fed45e0
 @child_tasks=[],
 @created_resources=[],
 @name="pulp_2to3_migration.app.migration.complex_repo_migration",
 @parent_task="/pulp/api/v3/tasks/56152493-2013-431e-8b61-29400ae73bfc/",
 @progress_reports=[],
 @pulp_created=Sun, 12 Sep 2021 23:18:19 +0000,
 @pulp_href="/pulp/api/v3/tasks/fe7aa645-6cdb-4b44-9319-f87856aacb7a/",
 @reserved_resources_record=
  ["/pulp/api/v3/repositories/rpm/rpm/0c8dce00-d27c-4687-a9e5-485c59069502/"],
 @state="canceled",
 @task_group="/pulp/api/v3/task-groups/c999d3f7-7d4e-4865-b651-52cd76a4a421/">
...
> pp api.tasks_api.list(state: "waiting")
#<PulpcoreClient::InlineResponse20011:0x000000000e1dbd08 @count=0, @results=[]>


# This looks like another bug: cancelled tasks are not removed from the worker queue. Maybe this is OK; presumably Pulp checks the state of each task and skips the cancelled ones?
$ rq info -u redis://localhost:6379/8
18038.com |██ 2
15874.com |██ 2
18036.com |██ 2
18037.com |██ 2


# Need to empty the queues manually:
$ rq empty 18038.com 15874.com 18036.com 18037.com -u redis://localhost:6379/8
2 jobs removed from 18038.com queue
2 jobs removed from 15874.com queue
2 jobs removed from 18036.com queue
2 jobs removed from 18037.com queue


# Reserved resources are still not cleaned up after cancelling the tasks.
pulpcore=# select * from core_taskreservedresource;
               pulp_id                |         pulp_created          |       pulp_last_updated       |             resource_id              |               task_id                
--------------------------------------+-------------------------------+-------------------------------+--------------------------------------+--------------------------------------
 8b84ca9c-27a2-4443-8b40-0c1b8f8dbb8b | 2021-09-13 05:57:18.30388+00  | 2021-09-13 05:57:18.30396+00  | 2bd534fb-ac19-4bba-90ae-8c5f80169176 | 6f950eb2-25cf-48e7-a6c4-db2636c3e01a
 3a76f7fd-0ec3-4122-8621-9ae4ea199bc4 | 2021-09-13 06:02:08.757547+00 | 2021-09-13 06:02:08.75758+00  | 21e57dc1-e24a-4e2c-b540-260a6e5723f4 | 35b53f2e-24e6-41dc-b85e-093c4ee7f402
 63b59527-9007-409a-a896-40766ce0966a | 2021-09-13 06:20:20.983527+00 | 2021-09-13 06:20:20.983553+00 | 8cf60e5d-01a1-403b-a4ce-974c9aa71c2b | 645a06ba-7bfe-45e1-9bad-e10ffb98cecd
 c26bab99-1a32-421c-9a64-9a856f70039b | 2021-09-13 06:21:19.257066+00 | 2021-09-13 06:21:19.257095+00 | f7cbfa62-a099-4fa1-bb38-f92c17a31838 | 47316602-a27e-4075-b218-2b968e2ca6d9
 e0205f4a-0e2e-4fd2-b73b-9e8ad3bcc23b | 2021-09-13 06:21:19.964208+00 | 2021-09-13 06:21:19.964245+00 | 6434ac91-5584-4b41-a31a-839084469b75 | e1bb63e7-78be-4a2d-bbda-ffe00547f73a
...

pulpcore=# select * from core_reservedresource;
               pulp_id                |         pulp_created          |       pulp_last_updated       |                                resource                                 |              worker_id               
--------------------------------------+-------------------------------+-------------------------------+-------------------------------------------------------------------------+--------------------------------------
 2bd534fb-ac19-4bba-90ae-8c5f80169176 | 2021-09-13 05:57:18.243676+00 | 2021-09-13 05:57:18.243747+00 | /pulp/api/v3/repositories/rpm/rpm/f503830e-0b76-472d-ade8-8f071ff35529/ | 38c7a2e6-d69c-4fbc-95f7-91d1c22624d1
 21e57dc1-e24a-4e2c-b540-260a6e5723f4 | 2021-09-13 06:02:08.754167+00 | 2021-09-13 06:02:08.754223+00 | /pulp/api/v3/repositories/rpm/rpm/28883edf-837d-415e-916b-c58b7727f685/ | c88fe686-d725-4c81-b075-08c7b56ef95f
 8cf60e5d-01a1-403b-a4ce-974c9aa71c2b | 2021-09-13 06:20:20.980849+00 | 2021-09-13 06:20:20.980889+00 | /pulp/api/v3/repositories/rpm/rpm/0cb39f5b-bfda-47ca-b5d7-8274db2db21f/ | be9ad06a-adb2-428b-8885-c5312d105354
 f7cbfa62-a099-4fa1-bb38-f92c17a31838 | 2021-09-13 06:21:19.253752+00 | 2021-09-13 06:21:19.253804+00 | /pulp/api/v3/repositories/rpm/rpm/1180b4dc-e451-448e-8178-3365c41ef91b/ | 9b80c7f7-5a8e-446e-a863-b3bdddf7bf7a
 6434ac91-5584-4b41-a31a-839084469b75 | 2021-09-13 06:21:19.950034+00 | 2021-09-13 06:21:19.950084+00 | /pulp/api/v3/repositories/rpm/rpm/920e944f-ec92-4a33-aede-6347aef6f6ad/ | 1d94152b-11e7-4cd1-b3c6-b80820ec4b83
...


# Need to remove the zombie reserved resources manually.
$ PULP_SETTINGS=/etc/pulp/settings.py pulpcore-manager shell
pulp: pulp_2to3_migration.app.plugin:INFO: Plugin pulp_deb is not installed in pulp3 therefore it will not be migrated from pulp2
Python 3.6.8 (default, Aug 13 2020, 07:46:32) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from pulpcore.app.models import Worker, ReservedResource
>>> 
>>> # Are there missing workers?
>>> missing_workers = Worker.objects.missing_workers()
>>> missing_workers.count()
75
>>> 
>>> # Look for resources that are held by workers not considered online
>>> online_workers = Worker.objects.online_workers()
>>> zombies = ReservedResource.objects.exclude(worker__in=online_workers)
>>> zombies.count()
5
>>> 
>>> # Dangerous action: Delete the zombie resources
>>> zombies.delete()
(10, {'core.TaskReservedResource': 5, 'core.ReservedResource': 5})
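
# For reference, the ORM query above (ReservedResource.objects.exclude(worker__in=online_workers))
# amounts to a simple set-difference. A minimal pure-Python sketch of the same idea follows;
# the names and data shapes here are illustrative only, not the pulpcore API.

```python
# Illustrative sketch (plain Python, not the pulpcore ORM) of the zombie
# detection above: a reservation is a "zombie" when the worker holding it
# is no longer in the set of online workers.
def find_zombies(reservations, online_worker_ids):
    """reservations: iterable of (resource_id, worker_id) pairs."""
    online = set(online_worker_ids)
    return [r for r in reservations if r[1] not in online]

reservations = [
    ("res-a", "worker-1"),
    ("res-b", "worker-2"),  # worker-2 crashed and never came back online
]
print(find_zombies(reservations, ["worker-1"]))  # → [('res-b', 'worker-2')]
```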



# The migration can proceed after cancelling the tasks manually.
$ satellite-maintain content prepare 
...
Prepare content for Pulp 3: 
Checking for valid Katello configuraton.
Starting task.
2021-09-06 13:17:03 +1000: Initial Migration steps complete.
...

Comment 1 Brad Buckingham 2021-09-10 18:02:30 UTC
There are a couple of related issues whose fixes may resolve this one; however, that still needs to be determined.

Fixes are available, but not yet in an official Satellite 6.9 release.

Bug 1962462 - fixed and targeted for upcoming 6.9.6
Bug 1976311 - fixed and proposed for a future 6.9.z (likely 6.9.7)

Comment 9 pulp-infra@redhat.com 2021-10-31 10:08:43 UTC
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.

Comment 10 pulp-infra@redhat.com 2021-10-31 10:08:44 UTC
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

Comment 11 pulp-infra@redhat.com 2021-11-30 16:12:07 UTC
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

Comment 12 pulp-infra@redhat.com 2021-11-30 17:15:47 UTC
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

Comment 13 pulp-infra@redhat.com 2021-11-30 18:09:50 UTC
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

Comment 14 pulp-infra@redhat.com 2021-11-30 19:07:10 UTC
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

Comment 15 Lai 2022-01-13 18:47:11 UTC
Steps to retest:

1. Sync some large repos
2. Run migration: satellite-maintain content prepare
3. On another terminal, run abort: satellite-maintain content prepare-abort
4. Check the Pulp resources and waiting tasks to ensure that they have been cleared (see comment#0)
5. Rerun migration until finish

Expected result:
Migration should complete without issues

Actual result:
Migration completed without issue

Verified on 6.9.8_01

Note: Whenever the migration is aborted or otherwise stopped, the resources should be cleared, and that is what occurred when I checked.

Comment 21 errata-xmlrpc 2022-01-27 17:33:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Satellite 6.9.8 Async Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0320

