Description of problem:

This happens when deleting a batch of content views at the same time. 99% of the content views are removed without any issue; only one content view has this problem. Examining the task shows a list of child tasks. Most of them start and finish within one second, but the last one stays in the waiting state. Looking at the core_task table in the DB, there is no other task running and only this task is in the waiting state. All workers are online but idle. The interesting thing is that if a new task (a repository sync, sync plan, etc.) is triggered, this stuck task gets picked up too. This resulted in the content view deletion task being stuck for 7 hours:

Id: 4cf1dd7d-b398-47c3-ad2f-0f8c588a8c65
Label: Actions::Katello::ContentView::Destroy
Status: stopped
Result: success
Started at: 2022-02-01 20:57:22 UTC
Ended at: 2022-02-02 04:00:26 UTC

3: Actions::Pulp3::ContentView::DeleteRepositoryReferences (success) [ 25382.76s / 3396.84s ]

Version-Release number of selected component (if applicable):
6.10

How reproducible:
Hard

Steps to Reproduce:
Only reproduced in the customer's environment

Actual results:
The task stays in the waiting state forever while workers are idle.

Expected results:
The task should be run.

Additional info:
So we have a task being ignored by pulp worker(s) (*), but once a different task on a (same? similar?) repo object is kicked off, we are unblocked? That sounds like a pulp-3 task locking issue.

(*) we might not know whether *all* workers ignore the task, or whether just one worker is blocked in execution of just the selected task

Isn't it worth taking:

1) the list of locks (isn't pulp waiting on https://github.com/pulp/pulpcore/blob/a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.py#L252 ?)

su - postgres -c "psql pulpcore -c \"select relation::regclass, * from pg_locks;\""

2) coredumps of the pulp workers (what is the status of all workers?):

for pid in $(ps aux | grep -v grep | grep pulpcore-worker | awk '{ print $2 }'); do gcore $pid; done

Just a silly idea: can't all workers be waiting for a "NOTIFY pulp_worker_wakeup" that only comes once a new task is invoked?
Could this piece of code, "resource in taken_resources for resource in reserved_resources_record", be returning true here?

https://github.com/pulp/pulpcore/blob/a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.py#L151

Is it possible that two tasks run by one worker have the same resource assigned? When the first task is run, it updates the taken_resources set with the resource record.

When the second task is run, it would then see "resource in taken_resources for resource in reserved_resources_record" as true in this case?
(In reply to matt jia from comment #10)
> Could be this piece of code "resource in taken_resources for resource in
> reserved_resources_record" returns true here?
>
> https://github.com/pulp/pulpcore/blob/
> a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.
> py#L151
>
> Is it possible two tasks run by one worker have the same resource assigned?
> When the first task is run, it updates the taken_resources set with the
> resource record.
>
> When the second task is run, it would see "resource in taken_resources for
> resource in reserved_resources_record" true in this case?

Smelling like a race condition, :-)
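For reference, the check being discussed can be illustrated with a small standalone sketch. This is a simplification, not pulpcore's actual code; the function name `conflicts` and the sample resource strings are made up for illustration:

```python
# Simplified illustration (not pulpcore's actual code) of the resource check
# quoted above: a worker skips a waiting task if any resource the task
# reserves is already in the set of resources taken by tasks it supervises.

def conflicts(reserved_resources_record, taken_resources):
    """Return True if the task reserves a resource that is already taken."""
    return any(resource in taken_resources for resource in reserved_resources_record)

# Normal case: disjoint resources, so the task can be picked up.
assert not conflicts(["/api/v3/repositories/1/"], {"/api/v3/distributions/"})

# The suspected case: a second task reserving the same record as a running
# (or stale) entry is skipped - and, per the theory, nothing wakes the
# worker up again afterwards.
assert conflicts(["/api/v3/distributions/"], {"/api/v3/distributions/"})
```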
So it is worth additionally asking for:

su - postgres -c "psql pulpcore -c \"select * from core_reservedresource;\""

and, just in case orphan cleanup was involved (cf. pg_advisory_lock, and the fact that orphan cleanup *was* run at a similar time to when the customer hit the issue), also take:

su - postgres -c "psql pulpcore -c \"select * from core_task where name = 'pulpcore.app.tasks.orphan.orphan_cleanup';\""

I will ask for that.
I might have reproduced it. Idea of the reproducer: (have a repository and its publication, then) in a loop, create 20 publications for it and remove all of them. *Maybe* it is required to run concurrently for >1 different repos, and *maybe* with different content guards (empty or a RHSM one). So far I reproduced it once with the script below.

Observations:
- the core_reservedresource table was empty
- I had both general_create and general_delete tasks waiting
- when I invoked a different task (sync of a completely unrelated repo), the situation got unblocked (which is a symptom of the reported problem as well)

I am building a cleaner environment to provide a more reliable reproducer, but there is some livelock in pulp workers (not) picking up tasks..

Reproducer script:

-----8<----------8<----------8<----------8<-----
# reproducer attempt for bz 2050440 : create a repo and a publication; then concurrently create distributions and concurrently delete them - in a loop
concur=${1:-10}
contguard=${2:-""}
certs="--cacert /etc/pki/katello/certs/katello-server-ca.crt --cert /etc/pki/katello/certs/pulp-client.crt --key /etc/pki/katello/private/pulp-client.key"
hname=$(hostname -f)

# create repo and its publication
reponame="repo_bz2050440_ts$(date +%s_%N)"
repohref=$(curl -s -X POST -H "Content-Type: application/json" -d "{\"name\":\"$reponame\"}" $certs https://${hname}/pulp/api/v3/repositories/rpm/rpm/ | json_reformat | grep pulp_href | cut -d\" -f4)
sleep 1
pubtask=$(curl -s -H "Accept:application/json" -H "Content-Type: application/json" -X POST -d "{\"repository\": \"${repohref}\"}" $certs https://${hname}/pulp/api/v3/publications/rpm/rpm/ | json_reformat | grep task | cut -d\" -f4)
while [ $(curl -s -H "Accept:application/json" $certs https://${hname}${pubtask} | json_reformat | grep -c /pulp/api/v3/publication) -eq 0 ]; do
  sleep 1
done
pubhref=$(curl -s -H "Accept:application/json" $certs https://${hname}${pubtask} | json_reformat | grep /pulp/api/v3/publication | cut -d\" -f2)

echo "at the end, run:"
echo "curl -s -X DELETE $certs https://${hname}${pubhref}"
echo "curl -s -X DELETE $certs https://${hname}${repohref} | json_reformat"

# in a loop, create distributions, wait until their tasks complete, delete all of them, wait until their tasks complete, and start again.
while true; do
  # create distributions
  echo "$(date): creating distributions"
  taskids=""
  for i in $(seq 1 $concur); do
    distrname="${reponame}_distr_ts$(date +%s_%N)"
    id=$(curl -s -X POST -H "Content-Type: application/json" -d "{\"base_path\": \"RedHat/Library/custom/${reponame}/${distrname}\", \"content_guard\": \"${contguard}\", \"name\": \"${distrname}\", \"publication\": \"${pubhref}\"}" $certs https://${hname}/pulp/api/v3/distributions/rpm/rpm/ | cut -d\" -f4)
    taskids="${taskids} ${id}"
  done
  sleep 2
  taskscount=$(echo $taskids | wc -w)
  echo "$(date): checking ${taskscount} create tasks statuses (${taskids})"
  distributions=""
  for url in $taskids; do
    distr=""
    while [ -z $distr ]; do
      distr=$(curl -s $certs https://${hname}${url} | json_reformat | grep /pulp/api/v3/distributions/rpm/rpm/ | cut -d\" -f2)
      sleep 0.1
    done
    distributions="${distributions} ${distr}"
  done
  # delete them
  distrcount=$(echo $distributions | wc -w)
  echo "$(date): deleting all ${distrcount} of them (${distributions})"
  taskids=""
  for url in $distributions; do
    id=$(curl -s -X DELETE $certs https://${hname}${url} | json_reformat | grep task | cut -d\" -f4 | cut -d'/' -f6)
    taskids="${taskids} ${id}"
  done
  # wait until all tasks complete
  taskscount=$(echo $taskids | wc -w)
  echo "$(date): checking ${taskscount} delete tasks statuses (${taskids})"
  completed=0
  while [ $completed -lt $taskscount ]; do
    completed=0
    # due to some reason, the pulp API didn't work well for me, so I query the DB directly
    for id in $taskids; do
      state=$(su - postgres -c "psql pulpcore -c \"COPY (SELECT state FROM core_task WHERE pulp_id = '${id}') TO STDOUT;\"")
      # echo "$id $state"
      if [ "${state}" == "completed" ]; then completed=$((completed+1)); fi
    done
    echo "$(date): tasks completed: ${completed}/${taskscount}"
    sleep 1
  done
done

# at the end (we never reach it), delete the publication and also the repo
curl -s -X DELETE $certs https://${hname}${pubhref}
curl -s -X DELETE $certs https://${hname}${repohref} | json_reformat
-----8<----------8<----------8<----------8<-----

I ran it 5 times concurrently, twice with a RHSM guard href and three times with an empty href, and got the livelock within an hour.
OK, I reproduced it for the 2nd time, after 6 hours of running my reproducer. Observations:

pulpcore=# select state,count(*) from core_task group by state;
   state   | count
-----------+-------
 waiting   |    45
 failed    |    97
 completed | 95840
(3 rows)

pulpcore=# SELECT * FROM pg_locks;
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid  |      mode       | granted | fastpath
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+------+-----------------+---------+----------
 relation   |    24269 |    12143 |      |       |            |               |         |       |          | 43/1542325         | 4189 | AccessShareLock | t       | t
 virtualxid |          |          |      |       | 43/1542325 |               |         |       |          | 43/1542325         | 4189 | ExclusiveLock   | t       | t
(2 rows)

pulpcore=# select * from core_reservedresource;
 pulp_id | pulp_created | pulp_last_updated | resource | worker_id
---------+--------------+-------------------+----------+-----------
(0 rows)

pulpcore=# select pulp_id,pulp_created,pulp_last_updated,started_at,finished_at,reserved_resources_record from core_task where state = 'waiting' order by pulp_created;
               pulp_id                |         pulp_created          |       pulp_last_updated       | started_at | finished_at | reserved_resources_record
--------------------------------------+-------------------------------+-------------------------------+------------+-------------+---------------------------
 8d940c9a-d4ab-4f0d-a828-8af67e2a1210 | 2022-04-11 05:48:02.19574+02  | 2022-04-11 05:48:02.195772+02 |            |             | {/api/v3/distributions/}
 8341cd88-87af-4adf-98ef-d91e2fef8e8f | 2022-04-11 05:48:02.270722+02 | 2022-04-11 05:48:02.270753+02 |            |             | {/api/v3/distributions/}
 5a30ab4b-0046-4007-add1-4148ea37d405 | 2022-04-11 05:48:02.526844+02 | 2022-04-11 05:48:02.526874+02 |            |             | {/api/v3/distributions/}
 ff906a90-3fec-4da8-9f09-f99eb932f64e | 2022-04-11 05:48:02.644078+02 | 2022-04-11 05:48:02.644108+02 |            |             | {/api/v3/distributions/}
 62735242-b118-4e47-b0b6-ef0b821abd15 | 2022-04-11 05:48:02.919862+02 | 2022-04-11 05:48:02.919893+02 |            |             | {/api/v3/distributions/}
 1fb5ee7c-1dc4-42eb-a81f-385d6c182ec0 | 2022-04-11 05:48:03.003957+02 | 2022-04-11 05:48:03.003987+02 |            |             | {/api/v3/distributions/}
 14ecd904-3cc8-40c8-ac79-e22f48cfb676 | 2022-04-11 05:48:03.288307+02 | 2022-04-11 05:48:03.288339+02 |            |             | {/api/v3/distributions/}
 740cf18b-33fe-427d-b21c-ff21a8b5316f | 2022-04-11 05:48:03.318942+02 | 2022-04-11 05:48:03.319+02    |            |             | {/api/v3/distributions/}
 f7f007d4-08b3-4ab8-849c-5d83b5327a81 | 2022-04-11 05:48:03.658133+02 | 2022-04-11 05:48:03.658161+02 |            |             | {/api/v3/distributions/}
 b8c5312d-9d61-4261-8728-a8ad5b092665 | 2022-04-11 05:48:03.69708+02  | 2022-04-11 05:48:03.697109+02 |            |             | {/api/v3/distributions/}
 993ee285-28f2-429d-af27-f41673f8a7a9 | 2022-04-11 05:48:04.005795+02 | 2022-04-11 05:48:04.005824+02 |            |             | {/api/v3/distributions/}
 9ee605c5-648f-468e-8688-290bbb6900f8 | 2022-04-11 05:48:04.072888+02 | 2022-04-11 05:48:04.072926+02 |            |             | {/api/v3/distributions/}
 fce9bfa6-e08d-47ba-89ce-497f7f348ae6 | 2022-04-11 05:48:04.436988+02 | 2022-04-11 05:48:04.43702+02  |            |             | {/api/v3/distributions/}
 947f4d18-1548-4c2a-bd2c-33559bad995b | 2022-04-11 05:48:04.442648+02 | 2022-04-11 05:48:04.442693+02 |            |             | {/api/v3/distributions/}
 796124e4-ed4c-4e57-8a93-cbf3b67ca4d5 | 2022-04-11 05:48:04.713927+02 | 2022-04-11 05:48:04.713958+02 |            |             | {/api/v3/distributions/}
 0d4c4793-e2af-4f52-ad7f-3c40be7dd093 | 2022-04-11 05:48:04.849553+02 | 2022-04-11 05:48:04.849584+02 |            |             | {/api/v3/distributions/}
 dedd9f7c-5584-4cf1-859f-bae22671ff66 | 2022-04-11 05:48:04.989907+02 | 2022-04-11 05:48:04.989941+02 |            |             | {/api/v3/distributions/}
 0cbf681c-5886-49fd-a8d7-a78d03e05096 | 2022-04-11 05:48:05.228938+02 | 2022-04-11 05:48:05.228971+02 |            |             | {/api/v3/distributions/}
 7ad0937a-dbeb-434b-b7c0-2c4fb06e573b | 2022-04-11 05:48:05.269378+02 | 2022-04-11 05:48:05.269408+02 |            |             | {/api/v3/distributions/}
 62ccca62-7b72-4419-8d19-43f1b5bededd | 2022-04-11 05:48:05.5731+02   | 2022-04-11 05:48:05.573131+02 |            |             | {/api/v3/distributions/}
 741fa3e3-856a-4830-be1a-a2bfc6ae0e05 | 2022-04-11 05:48:05.625318+02 | 2022-04-11 05:48:05.625351+02 |            |             | {/api/v3/distributions/}
 ab9c8b09-73ce-497b-8b93-1338a3b9972a | 2022-04-11 05:48:05.932595+02 | 2022-04-11 05:48:05.932639+02 |            |             | {/api/v3/distributions/}
 b623cb0b-a5cf-458d-aafb-bd0538d975b0 | 2022-04-11 05:48:05.975196+02 | 2022-04-11 05:48:05.975232+02 |            |             | {/api/v3/distributions/}
 d35c9e14-e6bf-43ef-882b-285654e8de4a | 2022-04-11 05:48:06.350582+02 | 2022-04-11 05:48:06.350658+02 |            |             | {/api/v3/distributions/}
 c08a0e5f-bdc1-4c2b-aee3-536923953196 | 2022-04-11 05:48:06.351325+02 | 2022-04-11 05:48:06.351357+02 |            |             | {/api/v3/distributions/}
 6723424f-ada7-4da5-9f96-c5cdda60d36d | 2022-04-11 05:48:06.674589+02 | 2022-04-11 05:48:06.674638+02 |            |             | {/api/v3/distributions/}
 1e141dfb-d8ec-49d0-b242-6145ddb5d62b | 2022-04-11 05:48:06.706558+02 | 2022-04-11 05:48:06.706613+02 |            |             | {/api/v3/distributions/}
 6cabffd2-c566-4074-a6c7-36a4b75534b2 | 2022-04-11 05:48:07.010912+02 | 2022-04-11 05:48:07.010941+02 |            |             | {/api/v3/distributions/}
 4468f8f4-fcc9-49d7-803b-6360af0bedb2 | 2022-04-11 05:48:07.127237+02 | 2022-04-11 05:48:07.127284+02 |            |             | {/api/v3/distributions/}
 b0b5d9f2-e85d-496b-baea-475afac88644 | 2022-04-11 05:48:07.33792+02  | 2022-04-11 05:48:07.337954+02 |            |             | {/api/v3/distributions/}
 ce3e2f78-7de8-4cae-a029-5c6112a3873b | 2022-04-11 05:48:07.425648+02 | 2022-04-11 05:48:07.42569+02  |            |             | {/api/v3/distributions/}
 a1e7e5bd-1145-40c7-88cd-d7c7e1664a0d | 2022-04-11 05:48:07.648257+02 | 2022-04-11 05:48:07.64829+02  |            |             | {/api/v3/distributions/}
 cbf31522-34e6-45cc-985c-6e20d8e5ba11 | 2022-04-11 05:48:07.768079+02 | 2022-04-11 05:48:07.76811+02  |            |             | {/api/v3/distributions/}
 dc1900a9-6349-4e7e-9bab-73b70c3a23ea | 2022-04-11 05:48:08.056841+02 | 2022-04-11 05:48:08.056871+02 |            |             | {/api/v3/distributions/}
 6105a252-74ad-4019-bafe-3bdf371d856f | 2022-04-11 05:48:08.14776+02  | 2022-04-11 05:48:08.147794+02 |            |             | {/api/v3/distributions/}
 61d312aa-96a3-410c-844e-b56d33a7d1c9 | 2022-04-11 05:48:08.430726+02 | 2022-04-11 05:48:08.430758+02 |            |             | {/api/v3/distributions/}
 bbc5b883-1a41-48a6-9809-d5e2b5b13aae | 2022-04-11 05:48:08.477491+02 | 2022-04-11 05:48:08.477555+02 |            |             | {/api/v3/distributions/}
 0fb87111-9fc2-4f48-866c-3d2a0788a272 | 2022-04-11 05:48:08.745528+02 | 2022-04-11 05:48:08.745558+02 |            |             | {/api/v3/distributions/}
 07112bb0-2bf6-490f-b241-86842cb45221 | 2022-04-11 05:48:08.868255+02 | 2022-04-11 05:48:08.868285+02 |            |             | {/api/v3/distributions/}
 fdcc6180-f953-4a5a-8343-43c66a6b3eb9 | 2022-04-11 05:48:09.123195+02 | 2022-04-11 05:48:09.123233+02 |            |             | {/api/v3/distributions/}
 7640b2a6-0018-4d06-b7cd-cfc70c0f9ff7 | 2022-04-11 05:48:09.170577+02 | 2022-04-11 05:48:09.170615+02 |            |             | {/api/v3/distributions/}
 c58786ee-2396-41df-bdde-3747a113a5e8 | 2022-04-11 05:48:09.519841+02 | 2022-04-11 05:48:09.519873+02 |            |             | {/api/v3/distributions/}
 f5a8c5e6-6b7b-4fe4-8bec-3df4cb29f29c | 2022-04-11 05:48:09.593262+02 | 2022-04-11 05:48:09.593294+02 |            |             | {/api/v3/distributions/}
 e8938ce7-589f-4bde-905c-d00f47ac0da5 | 2022-04-11 05:48:09.879781+02 | 2022-04-11 05:48:09.879815+02 |            |             | {/api/v3/distributions/}
 6cd42528-931a-4dca-b0dc-ae27e2340c3c | 2022-04-11 05:48:10.228485+02 | 2022-04-11 05:48:10.228518+02 |            |             | {/api/v3/distributions/}
(45 rows)

pulpcore=#

All pulpcore-workers have the same backtrace:

Traceback (most recent call first):
  <built-in method select of module object at remote 0x7f8260cc8ea8>
  File "/usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py", line 190, in sleep
    [self.sentinel, connection.connection], [], [], self.heartbeat_period
  File "/usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py", line 297, in run_forever
    self.sleep()
  File "/usr/lib/python3.6/site-packages/pulpcore/tasking/entrypoint.py", line 44, in worker
    NewPulpWorker().run_forever()
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/bin/pulpcore-worker", line 11, in <module>
    load_entry_point('pulpcore==3.14.14', 'console_scripts', 'pulpcore-worker')()

which is inside the sleep method:

        r, w, x = select.select(
            [self.sentinel, connection.connection], [], [], self.heartbeat_period
        )

/var/log/messages from the "now workers start to fetch new tasks" time:

Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-api: pulp [3464f30e4c6547848cf9126d3d795341]: - - [11/Apr/2022:03:48:08 +0000] "POST /pulp/api/v3/distributions/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-api: pulp [f8316d64c42c4959892cc8fa7f4a3725]: - - [11/Apr/2022:03:48:08 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/ae6aa58c-bd9e-488c-a1e8-8d502b17048d/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-worker-2: pulp [fcaa09dd9d9e4aaf9bc0e5957c489ffe]: pulpcore.tasking.pulpcore_worker:INFO: Task completed d688ef20-d86b-4046-9800-4e6d81e37d12
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [663a45f1813e42299f32bd90267cd80c]: pulpcore.tasking.pulpcore_worker:INFO: Starting task 506c9e81-1365-4848-9564-8bfbea4f08aa
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-api: pulp [a7147cb27e92492c920a08383de24fb5]: - - [11/Apr/2022:03:48:08 +0000] "POST /pulp/api/v3/distributions/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [663a45f1813e42299f32bd90267cd80c]: pulpcore.tasking.pulpcore_worker:INFO: Task completed 506c9e81-1365-4848-9564-8bfbea4f08aa
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-api: pulp [c9a035a8f27645a5829477592b712d05]: - - [11/Apr/2022:03:48:08 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/05fb3746-ff50-4ce3-8858-f699c2b4a49a/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:08 pmoravec-sat610-bz2050440 pulpcore-worker-2: pulp [5a4a079998e94ede9feff00dfe2020ca]: pulpcore.tasking.pulpcore_worker:INFO: Starting task ad119ec2-a371-4035-988b-eac1e3519a74
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-api: pulp [53c4499d91124608bd53c8ab60c02049]: - - [11/Apr/2022:03:48:09 +0000] "POST /pulp/api/v3/distributions/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-api: pulp [5c110aceff70466ea24eafc3cf31e713]: - - [11/Apr/2022:03:48:09 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/755425f4-21ff-4a9f-b7c1-a04d6aa1133a/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-worker-2: pulp [5a4a079998e94ede9feff00dfe2020ca]: pulpcore.tasking.pulpcore_worker:INFO: Task completed ad119ec2-a371-4035-988b-eac1e3519a74
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-api: pulp [6878cf3a5b34487a9b9dfc5ebcce38ab]: - - [11/Apr/2022:03:48:09 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/6d2b07a6-73fc-4ffd-a9ef-b92ecd2c768a/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-worker-2: pulp [9f2d51e422f542f382991c8905389bac]: pulpcore.tasking.pulpcore_worker:INFO: Starting task b17f8001-f7ba-4dc4-bfc9-b74f498cf7df
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-api: pulp [a3f167973c604f318d625145872be599]: - - [11/Apr/2022:03:48:09 +0000] "POST /pulp/api/v3/distributions/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-worker-2: pulp [9f2d51e422f542f382991c8905389bac]: pulpcore.tasking.pulpcore_worker:INFO: Task completed b17f8001-f7ba-4dc4-bfc9-b74f498cf7df
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-worker-3: pulp [76ee50a369b14c799302221a13f8284e]: pulpcore.tasking.pulpcore_worker:INFO: Starting task ccb072df-d702-4613-a827-4030db10ecf7
Apr 11 05:48:09 pmoravec-sat610-bz2050440 pulpcore-api: pulp [9faafb329e4f4f8a84f40dc5dc402708]: - - [11/Apr/2022:03:48:09 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/eed60983-fdca-4719-989a-3547a934ecc1/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:10 pmoravec-sat610-bz2050440 pulpcore-worker-3: pulp [76ee50a369b14c799302221a13f8284e]: pulpcore.tasking.pulpcore_worker:INFO: Task completed ccb072df-d702-4613-a827-4030db10ecf7
Apr 11 05:48:10 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [ec8dfe1f80964f6e9b89b67456c08ad9]: pulpcore.tasking.pulpcore_worker:INFO: Starting task 107e11e1-2d33-4d15-ba33-fa83804f38de
Apr 11 05:48:10 pmoravec-sat610-bz2050440 pulpcore-api: pulp [d01f023a7dea4518816728696e76f35c]: - - [11/Apr/2022:03:48:10 +0000] "DELETE /pulp/api/v3/distributions/rpm/rpm/7847f2a5-cccf-48ec-8bf7-c204ed13cfb9/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 11 05:48:10 pmoravec-sat610-bz2050440 su: (to postgres) root on none
Apr 11 05:48:10 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [ec8dfe1f80964f6e9b89b67456c08ad9]: pulpcore.tasking.pulpcore_worker:INFO: Task completed 107e11e1-2d33-4d15-ba33-fa83804f38de
Some brainstorming with mdellweg++ :

- a similar bug [1] is already fixed in the pulpcore-3.14.14-1 version (in Sat 6.10.4)
  [1] https://github.com/pulp/pulpcore/commit/82828595c4407735238a43307feaa9b62b540613
- the latest worker completed its latest task a long time after there were already waiting tasks - so unless there was a clock skew, the theory "a race condition just before entering sleep" does not apply
- why would the latest worker - after executing its latest task - ever execute iter_tasks *without* fetching a task? that makes no sense..
- to prove the theory "all workers are sleeping and waiting for a wakeup signal from psql": once reproduced, go into the pg console, run "notify pulp_worker_wakeup;" and see if it unblocks the livelock

Current mdellweg++ theory:
- the select in https://github.com/pulp/pulpcore/blob/a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.py#L169 does not receive the wakeup call
- because either a heartbeat (of other workers) or the latest worker's latest task status update suppresses/"consumes" the wakeup notification before the select happens

So theoretically, one worker running batches of similar tasks in a cycle, followed by a silence (waiting till the tasks are done), is the best (generic) reproducer (but it still does not work for me very reliably).
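The "consumed wakeup" theory can be demonstrated with a self-contained sketch. This is illustrative only, not pulpcore code: a socketpair stands in for the postgres connection, and `drain()` stands in for the heartbeat/status-update path that polls the connection and clears pending notifications before the select runs:

```python
import select
import socket

# Stand-in for the postgres connection: a wakeup "notification" is bytes
# arriving on a socket.
notify_rx, notify_tx = socket.socketpair()

def drain(sock):
    """Consume everything pending on the connection, like an intervening
    poll of the connection would consume a queued NOTIFY."""
    sock.setblocking(False)
    try:
        while sock.recv(4096):
            pass
    except BlockingIOError:
        pass
    sock.setblocking(True)

# 1. A wakeup arrives while the worker is still busy (e.g. updating the
#    status of the task it just finished).
notify_tx.send(b"pulp_worker_wakeup")

# 2. The busy-path activity drains the connection, "consuming" the wakeup
#    before the worker's task-fetch-then-sleep cycle runs.
drain(notify_rx)

# 3. The worker's earlier scan found no task (the task was created just
#    after it), so it now sleeps in select() - but the wakeup is gone, and
#    select only returns on the heartbeat timeout, not on the lost event.
readable, _, _ = select.select([notify_rx], [], [], 0.2)
print("woken by notification:", bool(readable))
```

Running this prints `woken by notification: False`: the worker slept through a wakeup that had already been delivered, which matches the livelock symptom (everything resumes only when some *new* event arrives).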
(In reply to Pavel Moravec from comment #15)
> Current mdellweg++ theory:
> - select in
> https://github.com/pulp/pulpcore/blob/
> a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.
> py#L169 does not fetch the wakeup call
> - since either a heartbeat (of other workers) or latest worker latest task
> status update does surpress/"consume" the wakeup notification prior the
> select happens
>
> So theoretically, one worker doing in a cycle batches of similar tasks
> followed by a silence (waiting till tasks are done) is the best (generic)
> reproducer (but still doesnt work for me, much realiably).

I think there is another possible theory. When iterating over the tasks on a worker, each task may fail to get an AdvisoryLock for some reason:

https://github.com/pulp/pulpcore/blob/a403fc6f885b9378551198029326d9e1f3ffedf3/pulpcore/tasking/pulpcore_worker.py#L138

The worker can then go to sleep without sending any pulp_worker_wakeup event. If this is the case, should the worker only sleep when there are no tasks in TASK_INCOMPLETE_STATES?
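For what it's worth, the textbook fix for this class of lost-wakeup bug (independent of whatever shape the eventual pulpcore patch takes) is to re-check the work queue after arming the listener and before sleeping. A minimal sketch, with threading primitives standing in for postgres NOTIFY and all names hypothetical:

```python
# Generic lost-wakeup fix pattern (a sketch, NOT pulpcore's actual patch):
# arm the listener first, then re-check the queue, and only sleep if the
# queue is still empty. A task enqueued between "scan found nothing" and
# "sleep" can then no longer be missed, because either get_nowait() sees it
# or the wakeup event is still set when wait() is called.
import queue
import threading
import time

def worker_loop(tasks, wakeup, stop, done):
    while not stop.is_set():
        wakeup.clear()                 # 1. arm the listener
        try:
            task = tasks.get_nowait()  # 2. re-check the queue
        except queue.Empty:
            wakeup.wait(timeout=0.1)   # 3. sleep only if still empty
            continue                   #    (timeout = heartbeat fallback)
        done.append(task)

tasks, done = queue.Queue(), []
wakeup, stop = threading.Event(), threading.Event()
t = threading.Thread(target=worker_loop, args=(tasks, wakeup, stop, done))
t.start()

tasks.put("task-1")
wakeup.set()          # notify; safe in any interleaving with the enqueue
time.sleep(0.3)
stop.set()
t.join()
print(done)  # -> ['task-1']
```

The key ordering is clear-then-check: if the notification fires at any point after `wakeup.clear()`, either the task is already visible to `get_nowait()` or the event remains set and `wait()` returns immediately.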
OK, collaboration++ led to an artificial but _deterministic_ reproducer.

1) Apply a patch to add auxiliary logging and force the worker to sleep at the bad place:

--- /usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py.orig	2022-04-11 12:54:11.374669136 +0200
+++ /usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py	2022-04-13 11:39:11.590025573 +0200
@@ -17,6 +17,7 @@
 from multiprocessing import Process
 from tempfile import TemporaryDirectory
 import django
+import time
 
 os.environ.setdefault("DJANGO_SETTINGS_MODULE", "pulpcore.app.settings")
 django.setup()
@@ -189,6 +190,9 @@ class NewPulpWorker:
             r, w, x = select.select(
                 [self.sentinel, connection.connection], [], [], self.heartbeat_period
             )
+            _logger.info(_("bz2050440: Worker %s artificial sleep before heartbeat"), self.name)
+            time.sleep(2)
+            _logger.info(_("bz2050440: Worker %s artificial slept before heartbeat"), self.name)
             self.beat()
             if connection.connection in r:
                 connection.connection.poll()

2) Leave just 1 worker running (just to ease the reproducer):

systemctl stop pulpcore-worker@*.service
systemctl start pulpcore-worker

3) "tail -f /var/log/messages" to monitor the "Worker * artificial sle.." logs.

4) Have a curl command prepared to e.g. create a publication or distribution, or delete either of them.

5) Once the *first* "artificial sleep" log is printed, immediately fire the API request that will invoke a task.
6) Ensure the task is created while the worker is sleeping, like:

Apr 13 12:01:42 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial sleep before heartbeat
Apr 13 12:01:43 pmoravec-sat610-bz2050440 pulpcore-api: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: - - [13/Apr/2022:10:01:43 +0000] "POST /pulp/api/v3/publications/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 13 12:01:44 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat

7) Wait for Godot to have the task executed.

8) Ask Godot to run in postgres: notify pulp_worker_wakeup; (optionally, invoke any activity that fires a new task)

9) Check that the task is suddenly executed:

..
Apr 13 12:03:20 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat
Apr 13 12:03:25 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial sleep before heartbeat
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulpcore.tasking.pulpcore_worker:INFO: Starting task b7110104-a52c-49f3-af78-58d47a662634
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulp_rpm.app.tasks.publishing:INFO: Publishing: repository=repo_bz2050440_ts1649842649_737576601, version=0
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulp_rpm.app.tasks.publishing:INFO: Publication: 0a35af25-a374-4aa8-912c-bb2d3a4e0e99 created
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulpcore.tasking.pulpcore_worker:INFO: Task completed b7110104-a52c-49f3-af78-58d47a662634
The Pulp upstream bug status is at closed. Updating the external tracker on this bug.
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.
Steps to retest:

1) Apply a patch to add auxiliary logging and force the worker to sleep at the bad place:

--- /usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py.orig	2022-04-11 12:54:11.374669136 +0200
+++ /usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py	2022-04-13 11:39:11.590025573 +0200
@@ -17,6 +17,7 @@
 from multiprocessing import Process
 from tempfile import TemporaryDirectory
 import django
+import time
 
 os.environ.setdefault("DJANGO_SETTINGS_MODULE", "pulpcore.app.settings")
 django.setup()
@@ -189,6 +190,9 @@ class NewPulpWorker:
             r, w, x = select.select(
                 [self.sentinel, connection.connection], [], [], self.heartbeat_period
             )
+            _logger.info(_("bz2050440: Worker %s artificial sleep before heartbeat"), self.name)
+            time.sleep(2)
+            _logger.info(_("bz2050440: Worker %s artificial slept before heartbeat"), self.name)
             self.beat()
             if connection.connection in r:
                 connection.connection.poll()

(you can make the code changes directly in the "pulpcore_worker.py" file, or save the diff above to a file and apply it with "patch -p1")

2) Leave just 1 worker running (just to ease the reproducer):

systemctl stop pulpcore-worker@*.service
systemctl start pulpcore-worker

3) "tail -f /var/log/messages" on another terminal to monitor the "Worker * artificial sle.." logs.

4) Have a curl command prepared to e.g. create a publication or distribution, or delete either of them:

certs="--cacert /etc/pki/katello/certs/katello-server-ca.crt --cert /etc/pki/katello/certs/pulp-client.crt --key /etc/pki/katello/private/pulp-client.key"
hname=$(hostname -f)

# create repo
reponame="repo_bz2050440"
repohref=$(curl -s -X POST -H "Content-Type: application/json" -d "{\"name\":\"$reponame\"}" $certs https://${hname}/pulp/api/v3/repositories/rpm/rpm/ | json_reformat | grep pulp_href | cut -d\" -f4)

5) Once the *first* "artificial sleep" log is printed, immediately fire the API request that will invoke a task.
# prepare the below command to fire at step 5
curl -s -H "Accept:application/json" -H "Content-Type: application/json" -X POST -d "{\"repository\": \"${repohref}\"}" $certs https://${hname}/pulp/api/v3/publications/rpm/rpm/ | json_reformat

6) Ensure the task is created while the worker is sleeping, like: (example)

Apr 13 12:01:42 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial sleep before heartbeat
Apr 13 12:01:43 pmoravec-sat610-bz2050440 pulpcore-api: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: - - [13/Apr/2022:10:01:43 +0000] "POST /pulp/api/v3/publications/rpm/rpm/ HTTP/1.1" 202 67 "-" "curl/7.29.0"
Apr 13 12:01:44 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat

7) Wait for Godot to have the task executed.

8) Check that the task is suddenly executed: (example)

..
Apr 13 12:03:20 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat
Apr 13 12:03:25 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial sleep before heartbeat
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: bz2050440: Worker 28305.gsslab.brq2.redhat.com artificial slept before heartbeat
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulpcore.tasking.pulpcore_worker:INFO: Starting task b7110104-a52c-49f3-af78-58d47a662634
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulp_rpm.app.tasks.publishing:INFO: Publishing: repository=repo_bz2050440_ts1649842649_737576601, version=0
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulp_rpm.app.tasks.publishing:INFO: Publication: 0a35af25-a374-4aa8-912c-bb2d3a4e0e99 created
Apr 13 12:03:27 pmoravec-sat610-bz2050440 pulpcore-worker-1: pulp [6d4ba28511c34d1db0cc8f7c06c2f69f]: pulpcore.tasking.pulpcore_worker:INFO: Task completed b7110104-a52c-49f3-af78-58d47a662634

Optionally:

9) Check whether the task executed with the following, replacing the "tasks/60aac..." part with the pulpcore task href that is displayed for you:

curl $certs https://${hname}/pulp/api/v3/tasks/60aac23b-29fd-4ced-9c7d-bdce6d4b880c/ | json_reformat

Expected: The task should execute and complete successfully, and should not stay idle.
Actual: The task executed and completed successfully, and did not stay idle.

{
    "pulp_href": "/pulp/api/v3/tasks/d746d854-ef80-468a-86b1-5f8a782961d7/",
    "pulp_created": "2022-05-27T01:28:56.221199Z",
    "state": "completed",
    "name": "pulp_rpm.app.tasks.publishing.publish",
    "logging_cid": "5777e58a684b4d5aab22ed0abc32623d",
    "started_at": "2022-05-27T01:28:58.312481Z",
    "finished_at": "2022-05-27T01:28:58.535536Z",
    "error": null,
    "worker": "/pulp/api/v3/workers/b333c3a3-c02c-46b2-b350-70dc79549772/",
    "parent_task": null,
    "child_tasks": [],
    "task_group": null,
    "progress_reports": [
        {
            "message": "Generating repository metadata",
            "code": "publish.generating_metadata",
            "state": "completed",
            "total": 1,
            "done": 1,
            "suffix": null
        }
    ],
    "created_resources": [
        "/pulp/api/v3/publications/rpm/rpm/e4d9d02e-94d3-4f7a-891f-5c0d742b755d/"
    ],
    "reserved_resources_record": [
        "shared:/pulp/api/v3/repositories/rpm/rpm/f594fdf8-e9e6-4284-8805-8fe9cc2d9969/"
    ]
}

Verified on 6.11 snap 22 with python38-pulpcore-3.16.8-2.el8pc.noarch, on both RHEL 7 and RHEL 8. This is quite difficult to test. It worked out fine, and I also checked that the fix is in the build's codebase.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Satellite 6.11 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5498
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days