Bug 2050440

Summary: pulp workers are idle despite a pending task
Product: Red Hat Satellite Reporter: matt jia <mjia>
Component: Pulp    Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED ERRATA QA Contact: Lai <ltran>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.10.2    CC: dkliban, ggainey, jkrajice, mdellweg, pcreech, pmoravec, rchan, saydas
Target Milestone: 6.11.0    Keywords: Regression, Triaged
Target Release: Unused   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: tfm-pulpcore-python-pulpcore-3.16.7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2088620 (view as bug list) Environment:
Last Closed: 2022-07-05 14:32:45 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description matt jia 2022-02-03 23:45:13 UTC
Description of problem:

This happens when deleting a batch of content views at the same time. 99% of the content views are removed without any issue; only one content view hits this problem. Examining its task shows a list of child tasks, most of which start and finish within a second, but the last one stays in the waiting state. Looking at the core_task table in the db, there is no other task running and only this task is waiting. All workers are online but idle. The interesting thing is that if a new task (a repository sync, sync plan, etc.) is triggered, the stuck task gets picked up as well. This resulted in the content view deletion task being stuck for 7 hours:


 Id: 4cf1dd7d-b398-47c3-ad2f-0f8c588a8c65

Label: Actions::Katello::ContentView::Destroy

Status: stopped

Result: success

Started at: 2022-02-01 20:57:22 UTC

Ended at: 2022-02-02 04:00:26 UTC

3: Actions::Pulp3::ContentView::DeleteRepositoryReferences (success) [ 25382.76s / 3396.84s ]


Version-Release number of selected component (if applicable):

6.10

How reproducible:

Hard


Steps to Reproduce:

Only in the customer's environment

Actual results:

The task stays in the waiting state forever while the workers are idle.

Expected results:

The task should be run.

Additional info:

Comment 9 Pavel Moravec 2022-04-04 07:06:46 UTC
So we have a task being ignored by the pulp worker(s) (*), but once a different task on the (same? similar?) repo object is kicked off, we are unblocked? That sounds like a pulp-3 task locking issue.

(*) we might not know whether *all* workers ignore the task or whether just one worker is blocked in the execution of just the selected task

Isn't it worth collecting:

1) the list of locks (isn't pulp waiting on https://github.com/pulp/pulpcore/blob/a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.py#L252 ?)
su - postgres -c "psql pulpcore -c \"select relation::regclass, * from pg_locks;\""

2) coredumps of pulp workers (what is the status of all workers?):
for pid in $(ps aux | grep -v grep | grep pulpcore-worker | awk '{ print $2 }'); do gcore $pid; done


Just a silly idea: can't all the workers be waiting for the "NOTIFY pulp_worker_wakeup" that only arrives once the new task is invoked?
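For context, the wakeup mechanism in question is plain PostgreSQL LISTEN/NOTIFY. A minimal standalone sketch of that pattern (using psycopg2 directly with an assumed DSN; this is not pulpcore's actual worker code) shows why a worker that misses the notification just sits in select() until something else sends a NOTIFY or the heartbeat period expires:

# Minimal LISTEN/NOTIFY sketch (psycopg2, assumed DSN) -- NOT pulpcore's worker code.
import select
import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=pulpcore")  # assumed connection string
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
cur = conn.cursor()
cur.execute("LISTEN pulp_worker_wakeup;")

heartbeat_period = 5
while True:
    # Block until the connection socket is readable or the heartbeat period
    # expires; a NOTIFY that was already drained elsewhere is never seen here.
    readable, _, _ = select.select([conn], [], [], heartbeat_period)
    if not readable:
        continue                      # timeout: heartbeat only, no task scan
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        print("woken up by", notify.channel)
        # a real worker would now re-scan core_task for dispatchable tasks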

Comment 10 matt jia 2022-04-04 07:23:51 UTC
Could it be that this piece of code, "resource in taken_resources for resource in reserved_resources_record", returns true here?

https://github.com/pulp/pulpcore/blob/a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.py#L151

Is it possible that two tasks run by one worker have the same resource assigned? When the first task runs, it updates the taken_resources set with the resource record.

When the second task runs, would it then see "resource in taken_resources for resource in reserved_resources_record" as true?
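To make the quoted check concrete, here is a toy, deliberately simplified version of such a dispatch pass (not the real iter_tasks code): if a resource entry lingers in taken_resources, the waiting task is skipped on every pass even though nothing is actually running, which matches the "waiting forever" symptom.

# Toy illustration of the quoted resource-overlap check; an assumed
# simplification, not the real pulpcore iter_tasks() implementation.
from typing import Iterable, List, Optional, Set

def next_runnable_task(waiting: List[dict], taken_resources: Set[str]) -> Optional[dict]:
    """Return the first waiting task none of whose reserved resources is taken."""
    for task in waiting:
        reserved_resources_record: Iterable[str] = task["reserved_resources_record"]
        if any(resource in taken_resources for resource in reserved_resources_record):
            # Another (supposedly running) task holds one of these resources,
            # so this task is skipped on this pass.
            continue
        return task
    return None

# If taken_resources wrongly still contains the resource (the theory above),
# the waiting task is skipped on every pass:
print(next_runnable_task(
    [{"reserved_resources_record": ["/api/v3/distributions/"]}],
    taken_resources={"/api/v3/distributions/"},
))  # -> None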

Comment 11 matt jia 2022-04-04 07:25:15 UTC
(In reply to matt jia from comment #10)
> Could be this piece of code "resource in taken_resources for resource in
> reserved_resources_record" returns true here? 
> 
> https://github.com/pulp/pulpcore/blob/
> a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.
> py#L151
> 
> Is it possible two tasks run by one worker have the same resource assigned?
> When the first task is run, it updates the taken_resources set with the
> resource record.
> 
> When the second task is run ,it would see "resource in taken_resources for
> resource in reserved_resources_record" true in this case?

Smelling like a race condition, :-)

Comment 12 Pavel Moravec 2022-04-04 07:36:24 UTC
So it's worth additionally asking for:

su - postgres -c "psql pulpcore -c \"select * from core_reservedresource;\""

and just in case orphan cleanup was involved (cf. pg_advisory_lock and the fact that orphan cleanup *was* run at around the time the customer hit the issue), also take:

su - postgres -c "psql pulpcore -c \"select * from core_task where name = 'pulpcore.app.tasks.orphan.orphan_cleanup';\""


I will ask for that.

Comment 13 Pavel Moravec 2022-04-10 07:45:37 UTC
I might have reproduced it. The idea of the reproducer: (have a repository and its publication, then) in a loop, create 20 publications for it and remove all of them. It *may* be necessary to run this concurrently for more than one repo, and *maybe* with different content guards (empty or an RHSM one).

So far I have reproduced it once with the script below. Observations:

- core_reservedresource table was empty
- I had both general_create and general_delete tasks waiting
- when I invoked a different task (sync of a completely unrelated repo), the situation got unblocked (which is a symptom of the reported problem as well)

I am building a cleaner environment to provide a more reliable reproducer, but there is some livelock in pulp workers (not) picking up tasks..


Reproducer script:

-----8<----------8<----------8<----------8<-----

# reproducer attempt for bz 2050440 : create a repo and a publication; then concurrently create distributions and concurrently delete them - in a loop

concur=${1:-10}
contguard=${2:-""}

certs="--cacert /etc/pki/katello/certs/katello-server-ca.crt --cert /etc/pki/katello/certs/pulp-client.crt --key /etc/pki/katello/private/pulp-client.key"
hname=$(hostname -f)

# create repo and its publication
reponame="repo_bz2050440_ts$(date +%s_%N)"
repohref=$(curl -s -X POST -H "Content-Type: application/json" -d "{\"name\":\"$reponame\"}" $certs https://${hname}/pulp/api/v3/repositories/rpm/rpm/ | json_reformat | grep pulp_href | cut -d\" -f4)

sleep 1
pubtask=$(curl -s -H "Accept:application/json" -H "Content-Type: application/json" -X POST -d "{\"repository\": \"${repohref}\"}" $certs https://${hname}/pulp/api/v3/publications/rpm/rpm/ | json_reformat | grep task | cut -d\" -f4)
while [ $(curl -s -H "Accept:application/json" $certs https://${hname}${pubtask} | json_reformat | grep -c /pulp/api/v3/publication) -eq 0 ]; do
	sleep 1
done
pubhref=$(curl -s -H "Accept:application/json" $certs https://${hname}${pubtask} | json_reformat | grep /pulp/api/v3/publication | cut -d\" -f2)

echo "at the end, run:"
echo "curl -s -X DELETE $certs https://${hname}${pubhref}"
echo "curl -s -X DELETE $certs https://${hname}${repohref} | json_reformat"

# in a loop, create distributions, wait until their tasks complete, delete all of them, wait if / until their tasks complete, and start again.
while true; do
	# create distributions
	echo "$(date): creating distributions"
	taskids=""
	for i in $(seq 1 $concur); do
		distrname="${reponame}_distr_ts$(date +%s_%N)"
		id=$(curl -s -X POST -H "Content-Type: application/json" -d "{\"base_path\": \"RedHat/Library/custom/${reponame}/${distrname}\", \"content_guard\": \"${contguard}\", \"name\": \"${distrname}\", \"publication\": \"${pubhref}\"}" $certs https://${hname}/pulp/api/v3/distributions/rpm/rpm/ | cut -d\" -f4)
		taskids="${taskids} ${id}"
	done
	sleep 2
	taskscount=$(echo $taskids | wc -w)
	echo "$(date): checking ${taskscount} create tasks statuses (${taskids})"
	distributions=""
	for url in $taskids; do
		distr=""
		while [ -z $distr ]; do
			distr=$(curl -s $certs https://${hname}${url} | json_reformat | grep /pulp/api/v3/distributions/rpm/rpm/ | cut -d\" -f2)
			sleep 0.1
		done
		distributions="${distributions} ${distr}"
	done
	# delete them
	distrcount=$(echo $distributions | wc -w)
	echo "$(date): deleting all ${distrcount} of them (${distributions})"
	taskids=""
	for url in $distributions; do
		id=$(curl -s -X DELETE $certs https://${hname}${url} | json_reformat | grep task | cut -d\" -f4 | cut -d'/' -f6)
		taskids="${taskids} ${id}"
	done
	# wait until all tasks complete
	taskscount=$(echo $taskids | wc -w)
	echo "$(date): checking ${taskscount} delete tasks statuses (${taskids})"
	completed=0
	while [ $completed -lt $taskscount ]; do
		completed=0
		# for some reason the pulp API didn't work well for me, so I query the DB directly
		for id in $taskids; do
			state=$(su - postgres -c "psql pulpcore -c \"COPY (SELECT state FROM core_task WHERE pulp_id = '${id}') TO STDOUT;\"")
			# echo "$id $state"
			if [ "${state}" == "completed" ]; then completed=$((completed+1)); fi
		done
		echo "$(date): tasks completed: ${completed}/${taskscount}"
		sleep 1
	done
done


# at the end (we never reach), delete the publication and also the repo
curl -s -X DELETE $certs https://${hname}${pubhref}
curl -s -X DELETE $certs https://${hname}${repohref} | json_reformat

-----8<----------8<----------8<----------8<-----

I ran it 5 times concurrently, twice with an RHSM guard href and three times with an empty href, and got the livelock within an hour.

Comment 14 Pavel Moravec 2022-04-11 07:30:59 UTC
OK I reproduced it for the 2nd time, after 6 hours of running my reproducer. Observations:

pulpcore=# select state,count(*) from core_task group by state;
   state   | count 
-----------+-------
 waiting   |    45
 failed    |    97
 completed | 95840
(3 rows)

pulpcore=# SELECT * FROM pg_locks;                                                                          
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid  |      mode       | granted | fastpath 
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+------+-----------------+---------+----------
 relation   |    24269 |    12143 |      |       |            |               |         |       |          | 43/1542325         | 4189 | AccessShareLock | t       | t
 virtualxid |          |          |      |       | 43/1542325 |               |         |       |          | 43/1542325         | 4189 | ExclusiveLock   | t       | t
(2 rows)

pulpcore=# select * from core_reservedresource;
 pulp_id | pulp_created | pulp_last_updated | resource | worker_id 
---------+--------------+-------------------+----------+-----------
(0 rows)

pulpcore=# select pulp_id,pulp_created,pulp_last_updated,started_at,finished_at,reserved_resources_record from core_task where state = 'waiting' order by pulp_created;
               pulp_id                |         pulp_created          |       pulp_last_updated       | started_at | finished_at | reserved_resources_record 
--------------------------------------+-------------------------------+-------------------------------+------------+-------------+---------------------------
 8d940c9a-d4ab-4f0d-a828-8af67e2a1210 | 2022-04-11 05:48:02.19574+02  | 2022-04-11 05:48:02.195772+02 |            |             | {/api/v3/distributions/}
 8341cd88-87af-4adf-98ef-d91e2fef8e8f | 2022-04-11 05:48:02.270722+02 | 2022-04-11 05:48:02.270753+02 |            |             | {/api/v3/distributions/}
 5a30ab4b-0046-4007-add1-4148ea37d405 | 2022-04-11 05:48:02.526844+02 | 2022-04-11 05:48:02.526874+02 |            |             | {/api/v3/distributions/}
 ff906a90-3fec-4da8-9f09-f99eb932f64e | 2022-04-11 05:48:02.644078+02 | 2022-04-11 05:48:02.644108+02 |            |             | {/api/v3/distributions/}
 62735242-b118-4e47-b0b6-ef0b821abd15 | 2022-04-11 05:48:02.919862+02 | 2022-04-11 05:48:02.919893+02 |            |             | {/api/v3/distributions/}
 1fb5ee7c-1dc4-42eb-a81f-385d6c182ec0 | 2022-04-11 05:48:03.003957+02 | 2022-04-11 05:48:03.003987+02 |            |             | {/api/v3/distributions/}
 14ecd904-3cc8-40c8-ac79-e22f48cfb676 | 2022-04-11 05:48:03.288307+02 | 2022-04-11 05:48:03.288339+02 |            |             | {/api/v3/distributions/}
 740cf18b-33fe-427d-b21c-ff21a8b5316f | 2022-04-11 05:48:03.318942+02 | 2022-04-11 05:48:03.319+02    |            |             | {/api/v3/distributions/}
 f7f007d4-08b3-4ab8-849c-5d83b5327a81 | 2022-04-11 05:48:03.658133+02 | 2022-04-11 05:48:03.658161+02 |            |             | {/api/v3/distributions/}
 b8c5312d-9d61-4261-8728-a8ad5b092665 | 2022-04-11 05:48:03.69708+02  | 2022-04-11 05:48:03.697109+02 |            |             | {/api/v3/distributions/}
 993ee285-28f2-429d-af27-f41673f8a7a9 | 2022-04-11 05:48:04.005795+02 | 2022-04-11 05:48:04.005824+02 |            |             | {/api/v3/distributions/}
 9ee605c5-648f-468e-8688-290bbb6900f8 | 2022-04-11 05:48:04.072888+02 | 2022-04-11 05:48:04.072926+02 |            |             | {/api/v3/distributions/}
 fce9bfa6-e08d-47ba-89ce-497f7f348ae6 | 2022-04-11 05:48:04.436988+02 | 2022-04-11 05:48:04.43702+02  |            |             | {/api/v3/distributions/}
 947f4d18-1548-4c2a-bd2c-33559bad995b | 2022-04-11 05:48:04.442648+02 | 2022-04-11 05:48:04.442693+02 |            |             | {/api/v3/distributions/}
 796124e4-ed4c-4e57-8a93-cbf3b67ca4d5 | 2022-04-11 05:48:04.713927+02 | 2022-04-11 05:48:04.713958+02 |            |             | {/api/v3/distributions/}
 0d4c4793-e2af-4f52-ad7f-3c40be7dd093 | 2022-04-11 05:48:04.849553+02 | 2022-04-11 05:48:04.849584+02 |            |             | {/api/v3/distributions/}
 dedd9f7c-5584-4cf1-859f-bae22671ff66 | 2022-04-11 05:48:04.989907+02 | 2022-04-11 05:48:04.989941+02 |            |             | {/api/v3/distributions/}
 0cbf681c-5886-49fd-a8d7-a78d03e05096 | 2022-04-11 05:48:05.228938+02 | 2022-04-11 05:48:05.228971+02 |            |             | {/api/v3/distributions/}
 7ad0937a-dbeb-434b-b7c0-2c4fb06e573b | 2022-04-11 05:48:05.269378+02 | 2022-04-11 05:48:05.269408+02 |            |             | {/api/v3/distributions/}
 62ccca62-7b72-4419-8d19-43f1b5bededd | 2022-04-11 05:48:05.5731+02   | 2022-04-11 05:48:05.573131+02 |            |             | {/api/v3/distributions/}
 741fa3e3-856a-4830-be1a-a2bfc6ae0e05 | 2022-04-11 05:48:05.625318+02 | 2022-04-11 05:48:05.625351+02 |            |             | {/api/v3/distributions/}
 ab9c8b09-73ce-497b-8b93-1338a3b9972a | 2022-04-11 05:48:05.932595+02 | 2022-04-11 05:48:05.932639+02 |            |             | {/api/v3/distributions/}
 b623cb0b-a5cf-458d-aafb-bd0538d975b0 | 2022-04-11 05:48:05.975196+02 | 2022-04-11 05:48:05.975232+02 |            |             | {/api/v3/distributions/}
 d35c9e14-e6bf-43ef-882b-285654e8de4a | 2022-04-11 05:48:06.350582+02 | 2022-04-11 05:48:06.350658+02 |            |             | {/api/v3/distributions/}
 c08a0e5f-bdc1-4c2b-aee3-536923953196 | 2022-04-11 05:48:06.351325+02 | 2022-04-11 05:48:06.351357+02 |            |             | {/api/v3/distributions/}
 6723424f-ada7-4da5-9f96-c5cdda60d36d | 2022-04-11 05:48:06.674589+02 | 2022-04-11 05:48:06.674638+02 |            |             | {/api/v3/distributions/}
 1e141dfb-d8ec-49d0-b242-6145ddb5d62b | 2022-04-11 05:48:06.706558+02 | 2022-04-11 05:48:06.706613+02 |            |             | {/api/v3/distributions/}
 6cabffd2-c566-4074-a6c7-36a4b75534b2 | 2022-04-11 05:48:07.010912+02 | 2022-04-11 05:48:07.010941+02 |            |             | {/api/v3/distributions/}
 4468f8f4-fcc9-49d7-803b-6360af0bedb2 | 2022-04-11 05:48:07.127237+02 | 2022-04-11 05:48:07.127284+02 |            |             | {/api/v3/distributions/}
 b0b5d9f2-e85d-496b-baea-475afac88644 | 2022-04-11 05:48:07.33792+02  | 2022-04-11 05:48:07.337954+02 |            |             | {/api/v3/distributions/}
 ce3e2f78-7de8-4cae-a029-5c6112a3873b | 2022-04-11 05:48:07.425648+02 | 2022-04-11 05:48:07.42569+02  |            |             | {/api/v3/distributions/}
 a1e7e5bd-1145-40c7-88cd-d7c7e1664a0d | 2022-04-11 05:48:07.648257+02 | 2022-04-11 05:48:07.64829+02  |            |             | {/api/v3/distributions/}
 cbf31522-34e6-45cc-985c-6e20d8e5ba11 | 2022-04-11 05:48:07.768079+02 | 2022-04-11 05:48:07.76811+02  |            |             | {/api/v3/distributions/}
 dc1900a9-6349-4e7e-9bab-73b70c3a23ea | 2022-04-11 05:48:08.056841+02 | 2022-04-11 05:48:08.056871+02 |            |             | {/api/v3/distributions/}
 6105a252-74ad-4019-bafe-3bdf371d856f | 2022-04-11 05:48:08.14776+02  | 2022-04-11 05:48:08.147794+02 |            |             | {/api/v3/distributions/}
 61d312aa-96a3-410c-844e-b56d33a7d1c9 | 2022-04-11 05:48:08.430726+02 | 2022-04-11 05:48:08.430758+02 |            |             | {/api/v3/distributions/}
 bbc5b883-1a41-48a6-9809-d5e2b5b13aae | 2022-04-11 05:48:08.477491+02 | 2022-04-11 05:48:08.477555+02 |            |             | {/api/v3/distributions/}
 0fb87111-9fc2-4f48-866c-3d2a0788a272 | 2022-04-11 05:48:08.745528+02 | 2022-04-11 05:48:08.745558+02 |            |             | {/api/v3/distributions/}
 07112bb0-2bf6-490f-b241-86842cb45221 | 2022-04-11 05:48:08.868255+02 | 2022-04-11 05:48:08.868285+02 |            |             | {/api/v3/distributions/}
 fdcc6180-f953-4a5a-8343-43c66a6b3eb9 | 2022-04-11 05:48:09.123195+02 | 2022-04-11 05:48:09.123233+02 |            |             | {/api/v3/distributions/}
 7640b2a6-0018-4d06-b7cd-cfc70c0f9ff7 | 2022-04-11 05:48:09.170577+02 | 2022-04-11 05:48:09.170615+02 |            |             | {/api/v3/distributions/}
 c58786ee-2396-41df-bdde-3747a113a5e8 | 2022-04-11 05:48:09.519841+02 | 2022-04-11 05:48:09.519873+02 |            |             | {/api/v3/distributions/}
 f5a8c5e6-6b7b-4fe4-8bec-3df4cb29f29c | 2022-04-11 05:48:09.593262+02 | 2022-04-11 05:48:09.593294+02 |            |             | {/api/v3/distributions/}
 e8938ce7-589f-4bde-905c-d00f47ac0da5 | 2022-04-11 05:48:09.879781+02 | 2022-04-11 05:48:09.879815+02 |            |             | {/api/v3/distributions/}
 6cd42528-931a-4dca-b0dc-ae27e2340c3c | 2022-04-11 05:48:10.228485+02 | 2022-04-11 05:48:10.228518+02 |            |             | {/api/v3/distributions/}
(45 rows)

pulpcore=#


All pulpcore-workers have the same backtrace:

Traceback (most recent call first):
  <built-in method select of module object at remote 0x7f8260cc8ea8>
  File "/usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py", line 190, in sleep
    [self.sentinel, connection.connection], [], [], self.heartbeat_period
  File "/usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py", line 297, in run_forever
    self.sleep()
  File "/usr/lib/python3.6/site-packages/pulpcore/tasking/entrypoint.py", line 44, in worker
    NewPulpWorker().run_forever()
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/bin/pulpcore-worker", line 11, in <module>
    load_entry_point('pulpcore==3.14.14', 'console_scripts', 'pulpcore-worker')()


which is inside the sleep method:

            r, w, x = select.select(
                [self.sentinel, connection.connection], [], [], self.heartbeat_period
            )

/var/log/messages from the "now workers start to fetch new tasks" time:

Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-api: pulp [3464f30e4c6547848cf9126d3d795341]:  - - [11/Apr/2022:03:48:08 +0000] "POST /pulp/api/v3/distributions/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-api: pulp [f8316d64c42c4959892cc8fa7f4a3725]:  - - [11/Apr/2022:03:48:08 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/ae6aa58c-bd9e-488c-a1e8-8d502b17048d/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-worker-2: pulp [fcaa09dd9d9e4aaf9bc0e5957c489ffe]: pulpcore.tasking.pulpcore_worker:INFO: Task completed d688ef20-d86b-4046-9800-4e6d81e37d12
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [663a45f1813e42299f32bd90267cd80c]: pulpcore.tasking.pulpcore_worker:INFO: Starting task 506c9e81-1365-4848-9564-8bfbea4f08aa
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-api: pulp [a7147cb27e92492c920a08383de24fb5]:  - - [11/Apr/2022:03:48:08 +0000] "POST /pulp/api/v3/distributions/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [663a45f1813e42299f32bd90267cd80c]: pulpcore.tasking.pulpcore_worker:INFO: Task completed 506c9e81-1365-4848-9564-8bfbea4f08aa
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-api: pulp [c9a035a8f27645a5829477592b712d05]:  - - [11/Apr/2022:03:48:08 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/05fb3746-ff50-4ce3-8858-f699c2b4a49a/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-worker-2: pulp [5a4a079998e94ede9feff00dfe2020ca]: pulpcore.tasking.pulpcore_worker:INFO: Starting task ad119ec2-a371-4035-988b-eac1e3519a74
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-api: pulp [53c4499d91124608bd53c8ab60c02049]:  - - [11/Apr/2022:03:48:09 +0000] "POST /pulp/api/v3/distributions/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-api: pulp [5c110aceff70466ea24eafc3cf31e713]:  - - [11/Apr/2022:03:48:09 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/755425f4-21ff-4a9f-b7c1-a04d6aa1133a/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-worker-2: pulp [5a4a079998e94ede9feff00dfe2020ca]: pulpcore.tasking.pulpcore_worker:INFO: Task completed ad119ec2-a371-4035-988b-eac1e3519a74
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-api: pulp [6878cf3a5b34487a9b9dfc5ebcce38ab]:  - - [11/Apr/2022:03:48:09 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/6d2b07a6-73fc-4ffd-a9ef-b92ecd2c768a/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-worker-2: pulp [9f2d51e422f542f382991c8905389bac]: pulpcore.tasking.pulpcore_worker:INFO: Starting task b17f8001-f7ba-4dc4-bfc9-b74f498cf7df
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-api: pulp [a3f167973c604f318d625145872be599]:  - - [11/Apr/2022:03:48:09 +0000] "POST /pulp/api/v3/distributions/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-worker-2: pulp [9f2d51e422f542f382991c8905389bac]: pulpcore.tasking.pulpcore_worker:INFO: Task completed b17f8001-f7ba-4dc4-bfc9-b74f498cf7df
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-worker-3: pulp [76ee50a369b14c799302221a13f8284e]: pulpcore.tasking.pulpcore_worker:INFO: Starting task ccb072df-d702-4613-a827-4030db10ecf7
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-api: pulp [9faafb329e4f4f8a84f40dc5dc402708]:  - - [11/Apr/2022:03:48:09 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/eed60983-fdca-4719-989a-3547a934ecc1/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:10 pmoravec-sat610-bz2050440 pulpcore-worker-3: pulp [76ee50a369b14c799302221a13f8284e]: pulpcore.tasking.pulpcore_worker:INFO: Task completed ccb072df-d702-4613-a827-4030db10ecf7
Apr 11 05:48:10 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [ec8dfe1f80964f6e9b89b67456c08ad9]: pulpcore.tasking.pulpcore_worker:INFO: Starting task 107e11e1-2d33-4d15-ba33-fa83804f38de
Apr 11 05:48:10 pmoravec-sat610-bz2050440 pulpcore-api: pulp [d01f023a7dea4518816728696e76f35c]:  - - [11/Apr/2022:03:48:10 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/7847f2a5-cccf-48ec-8bf7-c204ed13cfb9/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:10 pmoravec-sat610-bz2050440 su: (to postgres) root on none
Apr 11 05:48:10 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [ec8dfe1f80964f6e9b89b67456c08ad9]: pulpcore.tasking.pulpcore_worker:INFO: Task completed 107e11e1-2d33-4d15-ba33-fa83804f38de

Comment 15 Pavel Moravec 2022-04-11 14:06:17 UTC
Some brainstorming with mdellweg++ :

- a similar bug [1] is already fixed in pulpcore-3.14.14-1 (in Sat 6.10.4)
  - [1] https://github.com/pulp/pulpcore/commit/82828595c4407735238a43307feaa9b62b540613

- the latest worker completed its latest task a long time after there were already waiting tasks
  - unless there was a clock skew, the theory of "a race condition just before entering sleep" does not apply

- why would the latest worker - after executing the latest task - ever run iter_tasks *without* fetching a task? That makes no sense..

- to prove the theory that "all workers are sleeping and waiting for a wakeup signal from psql": once reproduced, go into the pg console and run "notify pulp_worker_wakeup" to see whether it unblocks the livelock



Current mdellweg++ theory:
- the select in https://github.com/pulp/pulpcore/blob/a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.py#L169 does not receive the wakeup call
- because either a heartbeat (of another worker) or the latest worker's latest task status update suppresses/"consumes" the wakeup notification before the select happens

So theoretically, one worker running batches of similar tasks in a cycle, each batch followed by silence (waiting until the tasks are done), is the best (generic) reproducer (but it still doesn't work very reliably for me). A minimal analog of the suspected race is sketched below.
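A pure-Python analog of the suspected lost-wakeup race (a socketpair stands in for the LISTEN connection; purely illustrative, not pulpcore code):

# Lost-wakeup analog: the wakeup "pulse" is drained before the worker reaches
# select(), so select() blocks for the full heartbeat period even though a
# task is already waiting.
import select
import socket

notify_rx, notify_tx = socket.socketpair()

# 1. A new task is dispatched: the API side sends NOTIFY pulp_worker_wakeup.
notify_tx.send(b"\x01")

# 2. Before sleeping, the worker touches the connection for another reason
#    (heartbeat, task status update) and drains pending notifications ...
notify_rx.recv(4096)

# 3. ... so the subsequent select() sees nothing and waits out the timeout.
heartbeat_period = 2.0
readable, _, _ = select.select([notify_rx], [], [], heartbeat_period)
print("woken up" if readable else "missed the wakeup; slept the full heartbeat")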

Comment 17 matt jia 2022-04-11 23:37:59 UTC
(In reply to Pavel Moravec from comment #15)
> Current mdellweg++ theory:
> - select in
> https://github.com/pulp/pulpcore/blob/
> a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.
> py#L169 does not fetch the wakeup call
> - since either a heartbeat (of other workers) or latest worker latest task
> status update does surpress/"consume" the wakeup notification prior the
> select happens
> 
> So theoretically, one worker doing in a cycle batches of similar tasks
> followed by a silence (waiting till tasks are done) is the best (generic)
> reproducer (but still doesnt work for me, much realiably).

I think there is another possible theory. When iterating the tasks on a worker, each task cannot get an AdvisoryLock for some reason:


https://github.com/pulp/pulpcore/blob/a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.py#L138

The worker can then go to sleep without sending any pulp_worker_wakeup event.

If this is the case, should the worker only sleep when there are 0 tasks in TASK_INCOMPLETE_STATES? A toy sketch of that guard follows.
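A toy model of that proposed guard (hypothetical names and structure; not the pulpcore implementation): sleep only when no incomplete task remains, so a task skipped because its advisory lock could not be taken cannot strand an otherwise idle worker.

# Toy model of the proposed "only sleep when nothing is incomplete" guard.
# Names and structure are hypothetical; this is not pulpcore code.
TASK_INCOMPLETE_STATES = {"waiting", "running"}

class ToyWorker:
    def __init__(self, tasks):
        self.tasks = tasks                  # list of {"state": ..., "lockable": ...}

    def try_advisory_lock(self, task):
        return task.get("lockable", True)   # simulate a failed advisory lock

    def iter_tasks(self):
        for task in self.tasks:
            if task["state"] == "waiting" and self.try_advisory_lock(task):
                task["state"] = "completed"  # "run" the task

    def has_incomplete_tasks(self):
        return any(t["state"] in TASK_INCOMPLETE_STATES for t in self.tasks)

    def loop_once(self):
        self.iter_tasks()
        if self.has_incomplete_tasks():
            return "poll again"             # do NOT block in select()
        return "safe to sleep"              # nothing pending; wait for NOTIFY

worker = ToyWorker([{"state": "waiting", "lockable": False}])
print(worker.loop_once())                   # -> "poll again"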

Comment 18 Pavel Moravec 2022-04-13 10:03:45 UTC
OK, collaboration++ led to an artificial but _deterministic_ reproducer.

1) Apply a patch to add auxiliary logging and force the worker to sleep at the bad place:

--- /usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py.orig	2022-04-11 12:54:11.374669136 +0200
+++ /usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py	2022-04-13 11:39:11.590025573 +0200
@@ -17,6 +17,7 @@ from multiprocessing import Process
 from tempfile import TemporaryDirectory
 
 import django
+import time
 
 os.environ.setdefault("DJANGO_SETTINGS_MODULE", "pulpcore.app.settings")
 django.setup()
@@ -189,6 +190,9 @@ class NewPulpWorker:
             r, w, x = select.select(
                 [self.sentinel, connection.connection], [], [], self.heartbeat_period
             )
+            _logger.info(_("bz2050440: Worker %s artificial sleep before heartbeat"), self.name)
+            time.sleep(2)
+            _logger.info(_("bz2050440: Worker %s artificial slept before heartbeat"), self.name)
             self.beat()
             if connection.connection in r:
                 connection.connection.poll()

2) Leave just 1 worker running (just to ease the reproducer):

systemctl stop pulpcore-worker@*.service
systemctl start pulpcore-worker

3) "tail -f /var/log/messages" to monitor the "Worker * artificial sle.." logs.

4) Have a curl command prepared to e.g. create a publication or a distribution, or delete either of them.

5) Once the *first* log "artificial sleep" is printed, immediately fire the API request that will invoke a task.

6) Ensure the task is created while the worker is sleeping, like:

Apr 13 12:01:42 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial sleep before heartbeat
Apr 13 12:01:43 pmoravec-sat610-bz2050440 pulpcore-api: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]:  - - [13/Apr/2022:10:01:43 +0000] "POST /pulp/api/v3/publications/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 13 12:01:44 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat

7) Wait for Godot to have the task executed.

8) Ask Godot to run in postgres:

notify pulp_worker_wakeup;

(optionally, invoke any activity that fires a new task)

9) check that the task is suddenly executed:

..
Apr 13 12:03:20 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat
Apr 13 12:03:25 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial sleep before heartbeat
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulpcore.tasking.pulpcore_worker:INFO: Starting task b7110104-a52c-49f3-af78-58d47a662634
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulp_rpm.app.tasks.publishing:INFO: Publishing: repository=repo_bz2050440_ts1649842649_737576601, version=0
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulp_rpm.app.tasks.publishing:INFO: Publication: 0a35af25-a374-4aa8-912c-bb2d3a4e0e99 created
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulpcore.tasking.pulpcore_worker:INFO: Task completed b7110104-a52c-49f3-af78-58d47a662634

Comment 19 pulp-infra@redhat.com 2022-05-16 16:21:38 UTC
The Pulp upstream bug status is at closed. Updating the external tracker on this bug.

Comment 20 pulp-infra@redhat.com 2022-05-16 16:21:41 UTC
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

Comment 26 Lai 2022-05-27 01:36:52 UTC
Steps to retest:

1) Apply a patch to add auxiliary logging and force the worker to sleep at the bad place:

--- /usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py.orig	2022-04-11 12:54:11.374669136 +0200
+++ /usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py	2022-04-13 11:39:11.590025573 +0200
@@ -17,6 +17,7 @@ from multiprocessing import Process
 from tempfile import TemporaryDirectory
 
 import django
+import time
 
 os.environ.setdefault("DJANGO_SETTINGS_MODULE", "pulpcore.app.settings")
 django.setup()
@@ -189,6 +190,9 @@ class NewPulpWorker:
             r, w, x = select.select(
                 [self.sentinel, connection.connection], [], [], self.heartbeat_period
             )
+            _logger.info(_("bz2050440: Worker %s artificial sleep before heartbeat"), self.name)
+            time.sleep(2)
+            _logger.info(_("bz2050440: Worker %s artificial slept before heartbeat"), self.name)
             self.beat()
             if connection.connection in r:
                 connection.connection.poll()

(you can either make the code changes directly in the "pulpcore_worker.py" file, or save the diff above to a file and apply it with the "patch" command)

2) Leave just 1 worker running (just to ease the reproducer):

systemctl stop pulpcore-worker@*.service
systemctl start pulpcore-worker

3) "tail -f /var/log/messages" to monitor the "Worker * artificial sle.." logs on another terminal

4) Have a curl command prepared to e.g. create a publication or a distribution, or delete either of them.

certs="--cacert /etc/pki/katello/certs/katello-server-ca.crt --cert /etc/pki/katello/certs/pulp-client.crt --key /etc/pki/katello/private/pulp-client.key"
hname=$(hostname -f)

# create repo
reponame="repo_bz2050440"
repohref=$(curl -s -X POST -H "Content-Type: application/json" -d "{\"name\":\"$reponame\"}" $certs https://${hname}/pulp/api/v3/repositories/rpm/rpm/ | json_reformat | grep pulp_href | cut -d\" -f4)


5) Once the *first* log "artificial sleep" is printed, immediately fire the API request that will invoke a task.
# prepare the command below in advance (step 4) and fire it at this step
curl -s -H "Accept:application/json" -H "Content-Type: application/json" -X POST -d "{\"repository\": \"${repohref}\"}" $certs https://${hname}/pulp/api/v3/publications/rpm/rpm/ | json_reformat

6) Ensure the task is created while the worker is sleeping, like:

(example)
Apr 13 12:01:42 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial sleep before heartbeat
Apr 13 12:01:43 pmoravec-sat610-bz2050440 pulpcore-api: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]:  - - [13/Apr/2022:10:01:43 +0000] "POST /pulp/api/v3/publications/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 13 12:01:44 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat

7) Wait for Godot to have the task executed.

8) check that the task is suddenly executed:

(example)
..
Apr 13 12:03:20 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat
Apr 13 12:03:25 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial sleep before heartbeat
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulpcore.tasking.pulpcore_worker:INFO: Starting task b7110104-a52c-49f3-af78-58d47a662634
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulp_rpm.app.tasks.publishing:INFO: Publishing: repository=repo_bz2050440_ts1649842649_737576601, version=0
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulp_rpm.app.tasks.publishing:INFO: Publication: 0a35af25-a374-4aa8-912c-bb2d3a4e0e99 created
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulpcore.tasking.pulpcore_worker:INFO: Task completed b7110104-a52c-49f3-af78-58d47a662634

Optionally:
9) Check whether the task executed with the following, replacing "tasks/60aac..." with the pulpcore task id displayed for you:
curl $certs https://${hname}/pulp/api/v3/tasks/60aac23b-29fd-4ced-9c7d-bdce6d4b880c/ | json_reformat


Expected:
The task should execute and complete successfully and should not be idle.

Actual:
The task executed and completed successfully and isn't idle.
{
    "pulp_href": "/pulp/api/v3/tasks/d746d854-ef80-468a-86b1-5f8a782961d7/",
    "pulp_created": "2022-05-27T01:28:56.221199Z",
    "state": "completed",
    "name": "pulp_rpm.app.tasks.publishing.publish",
    "logging_cid": "5777e58a684b4d5aab22ed0abc32623d",
    "started_at": "2022-05-27T01:28:58.312481Z",
    "finished_at": "2022-05-27T01:28:58.535536Z",
    "error": null,
    "worker": "/pulp/api/v3/workers/b333c3a3-c02c-46b2-b350-70dc79549772/",
    "parent_task": null,
    "child_tasks": [

    ],
    "task_group": null,
    "progress_reports": [
        {
            "message": "Generating repository metadata",
            "code": "publish.generating_metadata",
            "state": "completed",
            "total": 1,
            "done": 1,
            "suffix": null
        }
    ],
    "created_resources": [
        "/pulp/api/v3/publications/rpm/rpm/e4d9d02e-94d3-4f7a-891f-5c0d742b755d/"
    ],
    "reserved_resources_record": [
        "shared:/pulp/api/v3/repositories/rpm/rpm/f594fdf8-e9e6-4284-8805-8fe9cc2d9969/"
    ]
}

Verified on 6.11 snap 22 with python38-pulpcore-3.16.8-2.el8pc.noarch on both rhel7 and rhel8

This is quite difficult to test. It worked out fine, and I also checked that the code change is in the build.

Comment 29 errata-xmlrpc 2022-07-05 14:32:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Satellite 6.11 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5498

Comment 30 Red Hat Bugzilla 2023-09-15 01:51:38 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days