Bug 2050440 - pulp workers are idle despite there being one pending task
Summary: pulp workers are idle despite there being one pending task
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Pulp
Version: 6.10.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: 6.11.0
Assignee: satellite6-bugs
QA Contact: Lai
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-03 23:45 UTC by matt jia
Modified: 2023-09-15 01:51 UTC (History)
8 users

Fixed In Version: tfm-pulpcore-python-pulpcore-3.16.7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2088620
Environment:
Last Closed: 2022-07-05 14:32:45 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github pulp pulpcore issues 2506 0 None closed All workers idle while a task is pending 2022-05-16 16:21:37 UTC
Red Hat Knowledge Base (Solution) 6960283 0 None None None 2022-05-21 10:34:14 UTC
Red Hat Product Errata RHSA-2022:5498 0 None None None 2022-07-05 14:33:04 UTC

Description matt jia 2022-02-03 23:45:13 UTC
Description of problem:

This happens when deleting a batch of content views at the same time. 99% of the content views are removed without any issue; only one content view hits this problem. Examining the task, it has a list of child tasks, most of which start and finish within one second. However, the last one stays in the waiting state. Looking at the core_task table in the DB, there is no other task running and only this task is in the waiting state. All workers are online but idle. The interesting thing is that if a new task (a repository sync, sync plan, etc.) is triggered, the stuck task gets picked up too. This resulted in the content view deletion task being stuck there for 7 hours:


 Id: 4cf1dd7d-b398-47c3-ad2f-0f8c588a8c65

Label: Actions::Katello::ContentView::Destroy

Status: stopped

Result: success

Started at: 2022-02-01 20:57:22 UTC

Ended at: 2022-02-02 04:00:26 UTC

3: Actions::Pulp3::ContentView::DeleteRepositoryReferences (success) [ 25382.76s / 3396.84s ]


Version-Release number of selected component (if applicable):

6.10

How reproducible:

Hard


Steps to Reproduce:

Only in the customer's environment

Actual results:

The task stays in the waiting state forever while the workers are idle.

Expected results:

The task should be run.

Additional info:

Comment 9 Pavel Moravec 2022-04-04 07:06:46 UTC
So we have a task being ignored by the pulp worker(s) (*), but once a different task on a (same? similar?) repo object is kicked off, we are unblocked? That sounds like a pulp-3 task locking issue.

(*) we might not know whether *all* workers ignore the task, or whether just one worker is blocked in the execution of just that selected task

Isn't it worth taking:

1) the list of locks (isn't pulp waiting on https://github.com/pulp/pulpcore/blob/a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.py#L252 ?)
su - postgres -c "psql pulpcore -c \"select relation::regclass, * from pg_locks;\""

2) coredumps of pulp workers (what is the status of all workers?):
for pid in $(ps aux | grep -v grep | grep pulpcore-worker | awk '{ print $2 }'); do gcore $pid; done


Just a silly idea: can't all the workers be waiting for a "NOTIFY pulp_worker_wakeup" that does arrive once a new task is invoked?

Comment 10 matt jia 2022-04-04 07:23:51 UTC
Could this piece of code, "resource in taken_resources for resource in reserved_resources_record", be returning true here?

https://github.com/pulp/pulpcore/blob/a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.py#L151

Is it possible that two tasks run by one worker have the same resource assigned? When the first task is run, it updates the taken_resources set with the resource record.

When the second task is run, would it then see "resource in taken_resources for resource in reserved_resources_record" as true?
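
A minimal stand-alone sketch of the semantics described above (plain Python, not the actual pulpcore code; the names taken_resources and reserved_resources_record are only borrowed from the linked pulpcore_worker.py, and the resource string is a made-up example):

import uuid

# two hypothetical waiting tasks that both reserve the same resource
tasks = [
    {"pulp_id": uuid.uuid4(), "reserved_resources_record": ["/api/v3/distributions/"]},
    {"pulp_id": uuid.uuid4(), "reserved_resources_record": ["/api/v3/distributions/"]},
]

taken_resources = set()
for task in tasks:
    reserved_resources_record = task["reserved_resources_record"]
    # the quoted check: skip this task if any of its resources is already
    # taken by an earlier task seen in this iteration
    if any(resource in taken_resources for resource in reserved_resources_record):
        print(f"skipping {task['pulp_id']}: resource already taken")
        continue
    taken_resources.update(reserved_resources_record)
    print(f"would dispatch {task['pulp_id']}")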

Comment 11 matt jia 2022-04-04 07:25:15 UTC
(In reply to matt jia from comment #10)
> Could be this piece of code "resource in taken_resources for resource in
> reserved_resources_record" returns true here? 
> 
> https://github.com/pulp/pulpcore/blob/
> a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.
> py#L151
> 
> Is it possible two tasks run by one worker have the same resource assigned?
> When the first task is run, it updates the taken_resources set with the
> resource record.
> 
> When the second task is run ,it would see "resource in taken_resources for
> resource in reserved_resources_record" true in this case?

Smelling like a race condition, :-)

Comment 12 Pavel Moravec 2022-04-04 07:36:24 UTC
So it is worth additionally asking for:

su - postgres -c "psql pulpcore -c \"select * from core_reservedresource;\""

and just in case orphan cleanup was in charge (cf. pg_advisory_lock and the fact that orphan cleanup *was* run at a similar time to when the customer hit the issue), take also:

su - postgres -c "psql pulpcore -c \"select * from core_task where name = 'pulpcore.app.tasks.orphan.orphan_cleanup';\""


I will ask for that.

Comment 13 Pavel Moravec 2022-04-10 07:45:37 UTC
I might have reproduced it. Idea of the reproducer: (have a repository and its publication, then) in a loop, create 20 publications for it and remove all of them. It is *maybe* required to run this concurrently for >1 different repos, and *maybe* with different content guards (empty or a RHSM one).

So far I have reproduced it once with the script below. Observations:

- core_reservedresource table was empty
- I had both general_create and general_delete tasks waiting
- when I invoked a different task (a sync of a completely unrelated repo), the situation got unblocked (which is also a symptom of the reported problem)

I am building a cleaner environment to provide a more reliable reproducer, but there is some livelock in pulp workers (not) picking up tasks.


Reproducer script:

-----8<----------8<----------8<----------8<-----

# reproducer attempt for bz 2050440 : create a repo and a publication; then concurrently create distributions and concurrently delete them - in a loop

concur=${1:-10}
contguard=${2:-""}

certs="--cacert /etc/pki/katello/certs/katello-server-ca.crt --cert /etc/pki/katello/certs/pulp-client.crt --key /etc/pki/katello/private/pulp-client.key"
hname=$(hostname -f)

# create repo and its publication
reponame="repo_bz2050440_ts$(date +%s_%N)"
repohref=$(curl -s -X POST -H "Content-Type: application/json" -d "{\"name\":\"$reponame\"}" $certs https://${hname}/pulp/api/v3/repositories/rpm/rpm/ | json_reformat | grep pulp_href | cut -d\" -f4)

sleep 1
pubtask=$(curl -s -H "Accept:application/json" -H "Content-Type: application/json" -X POST -d "{\"repository\": \"${repohref}\"}" $certs https://${hname}/pulp/api/v3/publications/rpm/rpm/ | json_reformat | grep task | cut -d\" -f4)
while [ $(curl -s -H "Accept:application/json" $certs https://${hname}${pubtask} | json_reformat | grep -c /pulp/api/v3/publication) -eq 0 ]; do
	sleep 1
done
pubhref=$(curl -s -H "Accept:application/json" $certs https://${hname}${pubtask} | json_reformat | grep /pulp/api/v3/publication | cut -d\" -f2)

echo "at the end, run:"
echo "curl -s -X DELETE $certs https://${hname}${pubhref}"
echo "curl -s -X DELETE $certs https://${hname}${repohref} | json_reformat"

# in a loop, create distributions, wait until their tasks complete, delete all of them, wait if / until their tasks complete, and start again.
while true; do
	# create distributions
	echo "$(date): creating distributions"
	taskids=""
	for i in $(seq 1 $concur); do
		distrname="${reponame}_distr_ts$(date +%s_%N)"
		id=$(curl -s -X POST -H "Content-Type: application/json" -d "{\"base_path\": \"RedHat/Library/custom/${reponame}/${distrname}\", \"content_guard\": \"${contguard}\", \"name\": \"${distrname}\", \"publication\": \"${pubhref}\"}" $certs https://${hname}/pulp/api/v3/distributions/rpm/rpm/ | cut -d\" -f4)
		taskids="${taskids} ${id}"
	done
	sleep 2
	taskscount=$(echo $taskids | wc -w)
	echo "$(date): checking ${taskscount} create tasks statuses (${taskids})"
	distributions=""
	for url in $taskids; do
		distr=""
		while [ -z $distr ]; do
			distr=$(curl -s $certs https://${hname}${url} | json_reformat | grep /pulp/api/v3/distributions/rpm/rpm/ | cut -d\" -f2)
			sleep 0.1
		done
		distributions="${distributions} ${distr}"
	done
	# delete them
	distrcount=$(echo $distributions | wc -w)
	echo "$(date): deleting all ${distrcount} of them (${distributions})"
	taskids=""
	for url in $distributions; do
		id=$(curl -s -X DELETE $certs https://${hname}${url} | json_reformat | grep task | cut -d\" -f4 | cut -d'/' -f6)
		taskids="${taskids} ${id}"
	done
	# wait until all tasks complete
	taskscount=$(echo $taskids | wc -w)
	echo "$(date): checking ${taskscount} delete tasks statuses (${taskids})"
	completed=0
	while [ $completed -lt $taskscount ]; do
		completed=0
		# for some reason, the pulp API didn't work well for me, so I query the DB directly
		for id in $taskids; do
			state=$(su - postgres -c "psql pulpcore -c \"COPY (SELECT state FROM core_task WHERE pulp_id = '${id}') TO STDOUT;\"")
			# echo "$id $state"
			if [ "${state}" == "completed" ]; then completed=$((completed+1)); fi
		done
		echo "$(date): tasks completed: ${completed}/${taskscount}"
		sleep 1
	done
done


# at the end (we never reach), delete the publication and also the repo
curl -s -X DELETE $certs https://${hname}${pubhref}
curl -s -X DELETE $certs https://${hname}${repohref} | json_reformat

-----8<----------8<----------8<----------8<-----

I ran it 5 times concurrently, twice with a RHSM guard href and three times with an empty href, and got the livelock within an hour.

Comment 14 Pavel Moravec 2022-04-11 07:30:59 UTC
OK I reproduced it for the 2nd time, after 6 hours of running my reproducer. Observations:

pulpcore=# select state,count(*) from core_task group by state;
   state   | count 
-----------+-------
 waiting   |    45
 failed    |    97
 completed | 95840
(3 rows)

pulpcore=# SELECT * FROM pg_locks;                                                                          
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid  |      mode       | granted | fastpath 
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+------+-----------------+---------+----------
 relation   |    24269 |    12143 |      |       |            |               |         |       |          | 43/1542325         | 4189 | AccessShareLock | t       | t
 virtualxid |          |          |      |       | 43/1542325 |               |         |       |          | 43/1542325         | 4189 | ExclusiveLock   | t       | t
(2 rows)

pulpcore=# select * from core_reservedresource;
 pulp_id | pulp_created | pulp_last_updated | resource | worker_id 
---------+--------------+-------------------+----------+-----------
(0 rows)

pulpcore=# select pulp_id,pulp_created,pulp_last_updated,started_at,finished_at,reserved_resources_record from core_task where state = 'waiting' order by pulp_created;
               pulp_id                |         pulp_created          |       pulp_last_updated       | started_at | finished_at | reserved_resources_record 
--------------------------------------+-------------------------------+-------------------------------+------------+-------------+---------------------------
 8d940c9a-d4ab-4f0d-a828-8af67e2a1210 | 2022-04-11 05:48:02.19574+02  | 2022-04-11 05:48:02.195772+02 |            |             | {/api/v3/distributions/}
 8341cd88-87af-4adf-98ef-d91e2fef8e8f | 2022-04-11 05:48:02.270722+02 | 2022-04-11 05:48:02.270753+02 |            |             | {/api/v3/distributions/}
 5a30ab4b-0046-4007-add1-4148ea37d405 | 2022-04-11 05:48:02.526844+02 | 2022-04-11 05:48:02.526874+02 |            |             | {/api/v3/distributions/}
 ff906a90-3fec-4da8-9f09-f99eb932f64e | 2022-04-11 05:48:02.644078+02 | 2022-04-11 05:48:02.644108+02 |            |             | {/api/v3/distributions/}
 62735242-b118-4e47-b0b6-ef0b821abd15 | 2022-04-11 05:48:02.919862+02 | 2022-04-11 05:48:02.919893+02 |            |             | {/api/v3/distributions/}
 1fb5ee7c-1dc4-42eb-a81f-385d6c182ec0 | 2022-04-11 05:48:03.003957+02 | 2022-04-11 05:48:03.003987+02 |            |             | {/api/v3/distributions/}
 14ecd904-3cc8-40c8-ac79-e22f48cfb676 | 2022-04-11 05:48:03.288307+02 | 2022-04-11 05:48:03.288339+02 |            |             | {/api/v3/distributions/}
 740cf18b-33fe-427d-b21c-ff21a8b5316f | 2022-04-11 05:48:03.318942+02 | 2022-04-11 05:48:03.319+02    |            |             | {/api/v3/distributions/}
 f7f007d4-08b3-4ab8-849c-5d83b5327a81 | 2022-04-11 05:48:03.658133+02 | 2022-04-11 05:48:03.658161+02 |            |             | {/api/v3/distributions/}
 b8c5312d-9d61-4261-8728-a8ad5b092665 | 2022-04-11 05:48:03.69708+02  | 2022-04-11 05:48:03.697109+02 |            |             | {/api/v3/distributions/}
 993ee285-28f2-429d-af27-f41673f8a7a9 | 2022-04-11 05:48:04.005795+02 | 2022-04-11 05:48:04.005824+02 |            |             | {/api/v3/distributions/}
 9ee605c5-648f-468e-8688-290bbb6900f8 | 2022-04-11 05:48:04.072888+02 | 2022-04-11 05:48:04.072926+02 |            |             | {/api/v3/distributions/}
 fce9bfa6-e08d-47ba-89ce-497f7f348ae6 | 2022-04-11 05:48:04.436988+02 | 2022-04-11 05:48:04.43702+02  |            |             | {/api/v3/distributions/}
 947f4d18-1548-4c2a-bd2c-33559bad995b | 2022-04-11 05:48:04.442648+02 | 2022-04-11 05:48:04.442693+02 |            |             | {/api/v3/distributions/}
 796124e4-ed4c-4e57-8a93-cbf3b67ca4d5 | 2022-04-11 05:48:04.713927+02 | 2022-04-11 05:48:04.713958+02 |            |             | {/api/v3/distributions/}
 0d4c4793-e2af-4f52-ad7f-3c40be7dd093 | 2022-04-11 05:48:04.849553+02 | 2022-04-11 05:48:04.849584+02 |            |             | {/api/v3/distributions/}
 dedd9f7c-5584-4cf1-859f-bae22671ff66 | 2022-04-11 05:48:04.989907+02 | 2022-04-11 05:48:04.989941+02 |            |             | {/api/v3/distributions/}
 0cbf681c-5886-49fd-a8d7-a78d03e05096 | 2022-04-11 05:48:05.228938+02 | 2022-04-11 05:48:05.228971+02 |            |             | {/api/v3/distributions/}
 7ad0937a-dbeb-434b-b7c0-2c4fb06e573b | 2022-04-11 05:48:05.269378+02 | 2022-04-11 05:48:05.269408+02 |            |             | {/api/v3/distributions/}
 62ccca62-7b72-4419-8d19-43f1b5bededd | 2022-04-11 05:48:05.5731+02   | 2022-04-11 05:48:05.573131+02 |            |             | {/api/v3/distributions/}
 741fa3e3-856a-4830-be1a-a2bfc6ae0e05 | 2022-04-11 05:48:05.625318+02 | 2022-04-11 05:48:05.625351+02 |            |             | {/api/v3/distributions/}
 ab9c8b09-73ce-497b-8b93-1338a3b9972a | 2022-04-11 05:48:05.932595+02 | 2022-04-11 05:48:05.932639+02 |            |             | {/api/v3/distributions/}
 b623cb0b-a5cf-458d-aafb-bd0538d975b0 | 2022-04-11 05:48:05.975196+02 | 2022-04-11 05:48:05.975232+02 |            |             | {/api/v3/distributions/}
 d35c9e14-e6bf-43ef-882b-285654e8de4a | 2022-04-11 05:48:06.350582+02 | 2022-04-11 05:48:06.350658+02 |            |             | {/api/v3/distributions/}
 c08a0e5f-bdc1-4c2b-aee3-536923953196 | 2022-04-11 05:48:06.351325+02 | 2022-04-11 05:48:06.351357+02 |            |             | {/api/v3/distributions/}
 6723424f-ada7-4da5-9f96-c5cdda60d36d | 2022-04-11 05:48:06.674589+02 | 2022-04-11 05:48:06.674638+02 |            |             | {/api/v3/distributions/}
 1e141dfb-d8ec-49d0-b242-6145ddb5d62b | 2022-04-11 05:48:06.706558+02 | 2022-04-11 05:48:06.706613+02 |            |             | {/api/v3/distributions/}
 6cabffd2-c566-4074-a6c7-36a4b75534b2 | 2022-04-11 05:48:07.010912+02 | 2022-04-11 05:48:07.010941+02 |            |             | {/api/v3/distributions/}
 4468f8f4-fcc9-49d7-803b-6360af0bedb2 | 2022-04-11 05:48:07.127237+02 | 2022-04-11 05:48:07.127284+02 |            |             | {/api/v3/distributions/}
 b0b5d9f2-e85d-496b-baea-475afac88644 | 2022-04-11 05:48:07.33792+02  | 2022-04-11 05:48:07.337954+02 |            |             | {/api/v3/distributions/}
 ce3e2f78-7de8-4cae-a029-5c6112a3873b | 2022-04-11 05:48:07.425648+02 | 2022-04-11 05:48:07.42569+02  |            |             | {/api/v3/distributions/}
 a1e7e5bd-1145-40c7-88cd-d7c7e1664a0d | 2022-04-11 05:48:07.648257+02 | 2022-04-11 05:48:07.64829+02  |            |             | {/api/v3/distributions/}
 cbf31522-34e6-45cc-985c-6e20d8e5ba11 | 2022-04-11 05:48:07.768079+02 | 2022-04-11 05:48:07.76811+02  |            |             | {/api/v3/distributions/}
 dc1900a9-6349-4e7e-9bab-73b70c3a23ea | 2022-04-11 05:48:08.056841+02 | 2022-04-11 05:48:08.056871+02 |            |             | {/api/v3/distributions/}
 6105a252-74ad-4019-bafe-3bdf371d856f | 2022-04-11 05:48:08.14776+02  | 2022-04-11 05:48:08.147794+02 |            |             | {/api/v3/distributions/}
 61d312aa-96a3-410c-844e-b56d33a7d1c9 | 2022-04-11 05:48:08.430726+02 | 2022-04-11 05:48:08.430758+02 |            |             | {/api/v3/distributions/}
 bbc5b883-1a41-48a6-9809-d5e2b5b13aae | 2022-04-11 05:48:08.477491+02 | 2022-04-11 05:48:08.477555+02 |            |             | {/api/v3/distributions/}
 0fb87111-9fc2-4f48-866c-3d2a0788a272 | 2022-04-11 05:48:08.745528+02 | 2022-04-11 05:48:08.745558+02 |            |             | {/api/v3/distributions/}
 07112bb0-2bf6-490f-b241-86842cb45221 | 2022-04-11 05:48:08.868255+02 | 2022-04-11 05:48:08.868285+02 |            |             | {/api/v3/distributions/}
 fdcc6180-f953-4a5a-8343-43c66a6b3eb9 | 2022-04-11 05:48:09.123195+02 | 2022-04-11 05:48:09.123233+02 |            |             | {/api/v3/distributions/}
 7640b2a6-0018-4d06-b7cd-cfc70c0f9ff7 | 2022-04-11 05:48:09.170577+02 | 2022-04-11 05:48:09.170615+02 |            |             | {/api/v3/distributions/}
 c58786ee-2396-41df-bdde-3747a113a5e8 | 2022-04-11 05:48:09.519841+02 | 2022-04-11 05:48:09.519873+02 |            |             | {/api/v3/distributions/}
 f5a8c5e6-6b7b-4fe4-8bec-3df4cb29f29c | 2022-04-11 05:48:09.593262+02 | 2022-04-11 05:48:09.593294+02 |            |             | {/api/v3/distributions/}
 e8938ce7-589f-4bde-905c-d00f47ac0da5 | 2022-04-11 05:48:09.879781+02 | 2022-04-11 05:48:09.879815+02 |            |             | {/api/v3/distributions/}
 6cd42528-931a-4dca-b0dc-ae27e2340c3c | 2022-04-11 05:48:10.228485+02 | 2022-04-11 05:48:10.228518+02 |            |             | {/api/v3/distributions/}
(45 rows)

pulpcore=#


All pulpcore-workers have the same backtrace:

Traceback (most recent call first):
  <built-in method select of module object at remote 0x7f8260cc8ea8>
  File "/usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py", line 190, in sleep
    [self.sentinel, connection.connection], [], [], self.heartbeat_period
  File "/usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py", line 297, in run_forever
    self.sleep()
  File "/usr/lib/python3.6/site-packages/pulpcore/tasking/entrypoint.py", line 44, in worker
    NewPulpWorker().run_forever()
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/bin/pulpcore-worker", line 11, in <module>
    load_entry_point('pulpcore==3.14.14', 'console_scripts', 'pulpcore-worker')()


which is inside the sleep method:

            r, w, x = select.select(
                [self.sentinel, connection.connection], [], [], self.heartbeat_period
            )

/var/log/messages from around the time the workers last picked up new tasks (i.e., when the livelock started):

Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-api: pulp [3464f30e4c6547848cf9126d3d795341]:  - - [11/Apr/2022:03:48:08 +0000] "POST /pulp/api/v3/distributions/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-api: pulp [f8316d64c42c4959892cc8fa7f4a3725]:  - - [11/Apr/2022:03:48:08 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/ae6aa58c-bd9e-488c-a1e8-8d502b17048d/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-worker-2: pulp [fcaa09dd9d9e4aaf9bc0e5957c489ffe]: pulpcore.tasking.pulpcore_worker:INFO: Task completed d688ef20-d86b-4046-9800-4e6d81e37d12
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [663a45f1813e42299f32bd90267cd80c]: pulpcore.tasking.pulpcore_worker:INFO: Starting task 506c9e81-1365-4848-9564-8bfbea4f08aa
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-api: pulp [a7147cb27e92492c920a08383de24fb5]:  - - [11/Apr/2022:03:48:08 +0000] "POST /pulp/api/v3/distributions/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [663a45f1813e42299f32bd90267cd80c]: pulpcore.tasking.pulpcore_worker:INFO: Task completed 506c9e81-1365-4848-9564-8bfbea4f08aa
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-api: pulp [c9a035a8f27645a5829477592b712d05]:  - - [11/Apr/2022:03:48:08 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/05fb3746-ff50-4ce3-8858-f699c2b4a49a/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-worker-2: pulp [5a4a079998e94ede9feff00dfe2020ca]: pulpcore.tasking.pulpcore_worker:INFO: Starting task ad119ec2-a371-4035-988b-eac1e3519a74
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-api: pulp [53c4499d91124608bd53c8ab60c02049]:  - - [11/Apr/2022:03:48:09 +0000] "POST /pulp/api/v3/distributions/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-api: pulp [5c110aceff70466ea24eafc3cf31e713]:  - - [11/Apr/2022:03:48:09 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/755425f4-21ff-4a9f-b7c1-a04d6aa1133a/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-worker-2: pulp [5a4a079998e94ede9feff00dfe2020ca]: pulpcore.tasking.pulpcore_worker:INFO: Task completed ad119ec2-a371-4035-988b-eac1e3519a74
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-api: pulp [6878cf3a5b34487a9b9dfc5ebcce38ab]:  - - [11/Apr/2022:03:48:09 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/6d2b07a6-73fc-4ffd-a9ef-b92ecd2c768a/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-worker-2: pulp [9f2d51e422f542f382991c8905389bac]: pulpcore.tasking.pulpcore_worker:INFO: Starting task b17f8001-f7ba-4dc4-bfc9-b74f498cf7df
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-api: pulp [a3f167973c604f318d625145872be599]:  - - [11/Apr/2022:03:48:09 +0000] "POST /pulp/api/v3/distributions/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-worker-2: pulp [9f2d51e422f542f382991c8905389bac]: pulpcore.tasking.pulpcore_worker:INFO: Task completed b17f8001-f7ba-4dc4-bfc9-b74f498cf7df
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-worker-3: pulp [76ee50a369b14c799302221a13f8284e]: pulpcore.tasking.pulpcore_worker:INFO: Starting task ccb072df-d702-4613-a827-4030db10ecf7
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-api: pulp [9faafb329e4f4f8a84f40dc5dc402708]:  - - [11/Apr/2022:03:48:09 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/eed60983-fdca-4719-989a-3547a934ecc1/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:10 pmoravec-sat610-bz2050440 pulpcore-worker-3: pulp [76ee50a369b14c799302221a13f8284e]: pulpcore.tasking.pulpcore_worker:INFO: Task completed ccb072df-d702-4613-a827-4030db10ecf7
Apr 11 05:48:10 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [ec8dfe1f80964f6e9b89b67456c08ad9]: pulpcore.tasking.pulpcore_worker:INFO: Starting task 107e11e1-2d33-4d15-ba33-fa83804f38de
Apr 11 05:48:10 pmoravec-sat610-bz2050440 pulpcore-api: pulp [d01f023a7dea4518816728696e76f35c]:  - - [11/Apr/2022:03:48:10 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/7847f2a5-cccf-48ec-8bf7-c204ed13cfb9/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:10 pmoravec-sat610-bz2050440 su: (to postgres) root on none
Apr 11 05:48:10 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [ec8dfe1f80964f6e9b89b67456c08ad9]: pulpcore.tasking.pulpcore_worker:INFO: Task completed 107e11e1-2d33-4d15-ba33-fa83804f38de

Comment 15 Pavel Moravec 2022-04-11 14:06:17 UTC
Some brainstorming with mdellweg++ :

- a similar bug [1] is already fixed in pulpcore-3.14.14-1 (in Sat 6.10.4)
  - [1] https://github.com/pulp/pulpcore/commit/82828595c4407735238a43307feaa9b62b540613

- the latest worker completed its latest task a long time after there were already waiting tasks
  - unless there was a clock skew, the theory of "a race condition just before entering sleep" does not apply

- why could the latest worker - after executing its latest task - ever run iter_tasks *without* fetching a task? That makes no sense..

- to prove the theory "all workers are sleeping and waiting for a wakeup signal from psql": once reproduced, go into the pg console and run "notify pulp_worker_wakeup" to see whether it unblocks the livelock



Current mdellweg++ theory:
- the select in https://github.com/pulp/pulpcore/blob/a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.py#L169 does not receive the wakeup call
- because either a heartbeat (of another worker) or the latest worker's own task status update suppresses/"consumes" the wakeup notification before the select happens

So theoretically, one worker processing batches of similar tasks in a cycle, each batch followed by a period of silence (waiting until the tasks are done), is the best (generic) reproducer (but it still does not work very reliably for me).
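
To make that theory concrete, below is a minimal stand-alone sketch (assuming psycopg2 and a locally reachable "pulpcore" database; this is not the pulpcore worker code itself) of how a wakeup NOTIFY can be consumed by an intervening query before select() ever runs, leaving select() to block for the whole heartbeat period:

import select
import psycopg2

conn = psycopg2.connect("dbname=pulpcore")  # hypothetical DSN, adjust as needed
conn.set_session(autocommit=True)
cur = conn.cursor()
cur.execute("LISTEN pulp_worker_wakeup;")

# ... suppose another session fires "NOTIFY pulp_worker_wakeup;" right here ...

# Any unrelated query (think: the worker's heartbeat or its last task-status
# update) reads the socket and silently parks the pending notification in
# conn.notifies:
cur.execute("SELECT 1;")

# select() on the raw connection now has nothing left to wake up on, so it
# just times out, even though a wakeup is already buffered in conn.notifies:
readable, _, _ = select.select([conn], [], [], 5)
print("select saw new data:", bool(readable))
print("notifications already buffered:", len(conn.notifies))

# A robust loop therefore has to drain conn.notifies before (and after)
# sleeping, instead of relying on select() alone:
conn.poll()
while conn.notifies:
    print("wakeup on channel:", conn.notifies.pop(0).channel)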

Comment 17 matt jia 2022-04-11 23:37:59 UTC
(In reply to Pavel Moravec from comment #15)
> Current mdellweg++ theory:
> - select in
> https://github.com/pulp/pulpcore/blob/
> a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.
> py#L169 does not fetch the wakeup call
> - since either a heartbeat (of other workers) or latest worker latest task
> status update does surpress/"consume" the wakeup notification prior the
> select happens
> 
> So theoretically, one worker doing in a cycle batches of similar tasks
> followed by a silence (waiting till tasks are done) is the best (generic)
> reproducer (but still doesnt work for me, much realiably).

I think there is another possible theory. When iterating over the tasks on a worker, each task cannot get an AdvisoryLock for some reason:


https://github.com/pulp/pulpcore/blob/a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.py#L138

the worker can then go to sleep without sending any pulp_worker_wakeup event.

If this is the case, shouldn't the worker only sleep when there are 0 tasks in TASK_INCOMPLETE_STATES?
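
For reference, a minimal stand-alone sketch (assuming psycopg2; the lock key below is a made-up stand-in, and the exact wrapper and keys pulpcore uses may differ) of the Postgres advisory-lock pattern referred to above, including the failure mode where no lock is obtained and the loop simply falls through to sleep:

import psycopg2

conn = psycopg2.connect("dbname=pulpcore")  # hypothetical DSN, adjust as needed
conn.set_session(autocommit=True)
cur = conn.cursor()

TASK_LOCK_KEY = 42  # made-up stand-in for a per-task advisory lock id

cur.execute("SELECT pg_try_advisory_lock(%s);", (TASK_LOCK_KEY,))
got_lock = cur.fetchone()[0]

if got_lock:
    # this session owns the task now: it would execute it, then unlock
    cur.execute("SELECT pg_advisory_unlock(%s);", (TASK_LOCK_KEY,))
else:
    # some other session already holds the lock; if this happens for every
    # waiting task, the loop ends and the worker sleeps - and since nobody
    # fires a new NOTIFY pulp_worker_wakeup, the waiting tasks sit there
    # until some unrelated activity wakes the workers up again
    pass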

Comment 18 Pavel Moravec 2022-04-13 10:03:45 UTC
OK, collaboration++ led to an artificial but _deterministic_ workaround.

1) Apply a patch to add auxiliary logging and force the worker to sleep at the bad place:

--- /usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py.orig	2022-04-11 12:54:11.374669136 +0200
+++ /usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py	2022-04-13 11:39:11.590025573 +0200
@@ -17,6 +17,7 @@ from multiprocessing import Process
 from tempfile import TemporaryDirectory
 
 import django
+import time
 
 os.environ.setdefault("DJANGO_SETTINGS_MODULE", "pulpcore.app.settings")
 django.setup()
@@ -189,6 +190,9 @@ class NewPulpWorker:
             r, w, x = select.select(
                 [self.sentinel, connection.connection], [], [], self.heartbeat_period
             )
+            _logger.info(_("bz2050440: Worker %s artificial sleep before heartbeat"), self.name)
+            time.sleep(2)
+            _logger.info(_("bz2050440: Worker %s artificial slept before heartbeat"), self.name)
             self.beat()
             if connection.connection in r:
                 connection.connection.poll()

2) Leave just 1 worker running (just to ease the reproducer):

systemctl stop pulpcore-worker@*.service
systemctl start pulpcore-worker

3) "tail -f /var/log/messages" to monitor the "Worker * artificial sle.." logs.

4) Have a curl command prepared to e.g. create a publication or a distribution, or to delete either of them.

5) Once the *first* log "artificial sleep" is printed, immediately fire the API request that will invoke a task.

6) Ensure the task is created while the worker is sleeping, like:

Apr 13 12:01:42 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial sleep before heartbeat
Apr 13 12:01:43 pmoravec-sat610-bz2050440 pulpcore-api: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]:  - - [13/Apr/2022:10:01:43 +0000] "POST /pulp/api/v3/publications/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 13 12:01:44 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat

7) Wait for Godot to have the task executed.

8) Ask Godot to run in postgres:

notify pulp_worker_wakeup;

(optionally, invoke any activity that fires a new task)

9) check that the task is suddenly executed:

..
Apr 13 12:03:20 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat
Apr 13 12:03:25 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial sleep before heartbeat
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulpcore.tasking.pulpcore_worker:INFO: Starting task b7110104-a52c-49f3-af78-58d47a662634
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulp_rpm.app.tasks.publishing:INFO: Publishing: repository=repo_bz2050440_ts1649842649_737576601, version=0
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulp_rpm.app.tasks.publishing:INFO: Publication: 0a35af25-a374-4aa8-912c-bb2d3a4e0e99 created
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulpcore.tasking.pulpcore_worker:INFO: Task completed b7110104-a52c-49f3-af78-58d47a662634

Comment 19 pulp-infra@redhat.com 2022-05-16 16:21:38 UTC
The Pulp upstream bug status is at closed. Updating the external tracker on this bug.

Comment 20 pulp-infra@redhat.com 2022-05-16 16:21:41 UTC
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

Comment 26 Lai 2022-05-27 01:36:52 UTC
Steps to retest:

1) Apply a patch to add auxiliary logging and force the worker to sleep at the bad place:

--- /usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py.orig	2022-04-11 12:54:11.374669136 +0200
+++ /usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py	2022-04-13 11:39:11.590025573 +0200
@@ -17,6 +17,7 @@ from multiprocessing import Process
 from tempfile import TemporaryDirectory
 
 import django
+import time
 
 os.environ.setdefault("DJANGO_SETTINGS_MODULE", "pulpcore.app.settings")
 django.setup()
@@ -189,6 +190,9 @@ class NewPulpWorker:
             r, w, x = select.select(
                 [self.sentinel, connection.connection], [], [], self.heartbeat_period
             )
+            _logger.info(_("bz2050440: Worker %s artificial sleep before heartbeat"), self.name)
+            time.sleep(2)
+            _logger.info(_("bz2050440: Worker %s artificial slept before heartbeat"), self.name)
             self.beat()
             if connection.connection in r:
                 connection.connection.poll()

(you can either make the code changes directly in the "pulpcore_worker.py" file, or save the diff above to a file and apply it with the "patch" utility)

2) Leave just 1 worker running (just to ease the reproducer):

systemctl stop pulpcore-worker@*.service
systemctl start pulpcore-worker

3) "tail -f /var/log/messages" to monitor the "Worker * artificial sle.." logs on another terminal

4) Have a curl command prepared to e.g. create a publication or a distribution, or to delete either of them.

certs="--cacert /etc/pki/katello/certs/katello-server-ca.crt --cert /etc/pki/katello/certs/pulp-client.crt --key /etc/pki/katello/private/pulp-client.key"
hname=$(hostname -f)

# create repo
reponame="repo_bz2050440"
repohref=$(curl -s -X POST -H "Content-Type: application/json" -d "{\"name\":\"$reponame\"}" $certs https://${hname}/pulp/api/v3/repositories/rpm/rpm/ | json_reformat | grep pulp_href | cut -d\" -f4)


5) Once the *first* log "artificial sleep" is printed, immediately fire the API request that will invoke a task.
# the command prepared in step 4, to be fired now
curl -s -H "Accept:application/json" -H "Content-Type: application/json" -X POST -d "{\"repository\": \"${repohref}\"}" $certs https://${hname}/pulp/api/v3/publications/rpm/rpm/ | json_reformat

6) Ensure the task is created while the worker is sleeping, like:

(example)
Apr 13 12:01:42 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial sleep before heartbeat
Apr 13 12:01:43 pmoravec-sat610-bz2050440 pulpcore-api: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]:  - - [13/Apr/2022:10:01:43 +0000] "POST /pulp/api/v3/publications/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 13 12:01:44 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat

7) Wait for Godot to have the task executed.

8) check that the task is suddenly executed:

(example)
..
Apr 13 12:03:20 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat
Apr 13 12:03:25 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial sleep before heartbeat
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulpcore.tasking.pulpcore_worker:INFO: Starting task b7110104-a52c-49f3-af78-58d47a662634
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulp_rpm.app.tasks.publishing:INFO: Publishing: repository=repo_bz2050440_ts1649842649_737576601, version=0
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulp_rpm.app.tasks.publishing:INFO: Publication: 0a35af25-a374-4aa8-912c-bb2d3a4e0e99 created
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulpcore.tasking.pulpcore_worker:INFO: Task completed b7110104-a52c-49f3-af78-58d47a662634

Optionally:
9) check that the task is executed or not with the following, replacing the "tasks/60aac..." with the pulpcore task that is displayed for you
curl $certs https://${hname}/pulp/api/v3/tasks/60aac23b-29fd-4ced-9c7d-bdce6d4b880c/ | json_reformat


Expected:
The task should execute and complete successfully and should not remain idle.

Actual:
The task executed and completed successfully and isn't idle.
{
    "pulp_href": "/pulp/api/v3/tasks/d746d854-ef80-468a-86b1-5f8a782961d7/",
    "pulp_created": "2022-05-27T01:28:56.221199Z",
    "state": "completed",
    "name": "pulp_rpm.app.tasks.publishing.publish",
    "logging_cid": "5777e58a684b4d5aab22ed0abc32623d",
    "started_at": "2022-05-27T01:28:58.312481Z",
    "finished_at": "2022-05-27T01:28:58.535536Z",
    "error": null,
    "worker": "/pulp/api/v3/workers/b333c3a3-c02c-46b2-b350-70dc79549772/",
    "parent_task": null,
    "child_tasks": [

    ],
    "task_group": null,
    "progress_reports": [
        {
            "message": "Generating repository metadata",
            "code": "publish.generating_metadata",
            "state": "completed",
            "total": 1,
            "done": 1,
            "suffix": null
        }
    ],
    "created_resources": [
        "/pulp/api/v3/publications/rpm/rpm/e4d9d02e-94d3-4f7a-891f-5c0d742b755d/"
    ],
    "reserved_resources_record": [
        "shared:/pulp/api/v3/repositories/rpm/rpm/f594fdf8-e9e6-4284-8805-8fe9cc2d9969/"
    ]
}

Verified on 6.11 snap 22 with python38-pulpcore-3.16.8-2.el8pc.noarch on both rhel7 and rhel8

This is quite difficult to test. It worked out fine, and I also checked that the fixed code is in the build.

Comment 29 errata-xmlrpc 2022-07-05 14:32:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Satellite 6.11 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5498

Comment 30 Red Hat Bugzilla 2023-09-15 01:51:38 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

