Description of problem:
When the postgres service is restarted (e.g. as part of a restart of all services, or on its own) just as dynflow is about to complete a task, the task can end up hung forever in an invalid state.
"Invalid state" means e.g.:
- foreman sees the task as stopped/pending while dynflow sees it as stopped/success
- or foreman sees the task as running/pending while dynflow sees it as stopped/success
"Forever" means there is no user action that fixes the status:
- a services restart doesn't help
- force unlock can move the foreman task from running/pending to stopped/pending, but nothing else
Also, until force unlock is done, such a stuck task can still hold the lock(s) on its object(s). A query sketch for spotting the foreman/dynflow mismatch follows below.
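One way to surface such mismatched tasks is to compare the foreman and dynflow states directly in the database. A minimal sketch, assuming the foreman_tasks_tasks and dynflow_execution_plans tables in the foreman database and the external_id/uuid link that foreman-tasks uses (column names and types may differ between versions):
--------8<----------------8<----------------8<--------
# list tasks where foreman and dynflow disagree on state or result;
# uuid is cast to text since its column type may differ across versions
su - postgres -c "psql foreman -c \"
  select t.id, t.label, t.state as foreman_state, t.result as foreman_result,
         ep.state as dynflow_state, ep.result as dynflow_result
    from foreman_tasks_tasks t
    join dynflow_execution_plans ep on ep.uuid::text = t.external_id
   where t.state <> ep.state or t.result <> ep.result;\""
--------8<----------------8<----------------8<--------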
Version-Release number of selected component (if applicable):
Satellite 6.10.4
How reproducible:
100% within a few attempts
Steps to Reproduce:
One particular reproducer is to destroy a CV and, just at the end, restart the postgres service. It can be VERY tricky to guess "at the end", so the script below checks the number of completed pulp tasks: for a CV with one repo, the ContentView::Destroy task triggers one pulp task. So as soon as the script detects as many new completed pulp tasks as there are CVs being destroyed, it restarts postgres.
Script itself:
--------8<----------------8<----------------8<--------
#!/bin/bash
# Reproducer: concurrently create+publish CVs, then delete them, and restart
# postgres just as the corresponding pulp tasks complete.
CONCUR=${1:-5}    # number of CVs to create and delete concurrently
REPOIDS=${2:-51}  # id(s) of a small repo to put into each CV
hmr="hammer shell"

# Create a CV with one repo, publish it, and remove it from the environment,
# so only the final "content-view delete" remains to be run.
prepare_cv_to_delete() {
  CVID=$1
  ( echo "content-view create --organization-id=1 --name cv_zoos_${CVID} --repository-ids ${REPOIDS}"
    echo "content-view publish --organization-id=1 --name cv_zoos_${CVID}"
    echo "content-view remove-from-environment --organization-id=1 --name=cv_zoos_${CVID} --lifecycle-environment-id=1"
    echo "content-view version delete --content-view=cv_zoos_${CVID} --version 1.0 --organization-id 1"
  ) | $hmr
}

for i in $(seq 1 $CONCUR); do
  prepare_cv_to_delete $i &
done
echo "waiting for CVs create+almost-delete"
time wait

# Baseline count of pulp tasks, taken before the deletes are launched so that
# freshly triggered tasks cannot inflate it (and the loop below cannot hang).
tasks=$(su - postgres -c "psql pulpcore -c \"copy (select count(*) from core_task) to stdout;\"")

for i in $(seq 1 $CONCUR); do
  hammer content-view delete --name=cv_zoos_${i} --organization-id 1 &
done
echo "$(date): waiting for CVs delete, pulp tasks=${tasks}"
expected=$((tasks+CONCUR))
tasks=0
# Each CV delete triggers exactly one pulp task; once all of them appear in
# core_task, the ContentView::Destroy tasks are about to finish.
while [ $tasks -lt $expected ]; do
  tasks=$(su - postgres -c "psql pulpcore -c \"copy (select count(*) from core_task) to stdout;\"")
  sleep 0.5
done
echo "$(date): restarting postgres as having tasks=${tasks}"
systemctl restart rh-postgresql12-postgresql.service
date
time wait
su - postgres -c "psql pulpcore -c \"select count(*) from core_task;\""
--------8<----------------8<----------------8<--------
Usage:
./create_delete_cv_restart_postgres.sh 5 REPOID
where REPOID is the ID of a small repository
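After a run, a quick way to spot candidate stuck tasks is hammer's task search. A sketch, assuming the usual foreman-tasks scoped-search fields (exact syntax may vary between versions):
--------8<----------------8<----------------8<--------
# tasks that never reached a consistent stopped/success state
hammer task list --search 'state != stopped or result = pending'
--------8<----------------8<----------------8<--------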
Actual results:
Random tasks get stuck forever, some of them still holding their object lock(s).
As an example, see the attached task export.
Expected results:
No tasks stuck forever. Tasks should be recoverable by a services restart or by a manual (Skip &) Resume.
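For completeness, a manual resume attempt can be made from the rails console. A sketch, assuming the standard ForemanTasks/Dynflow API (the UUID is one of the stuck tasks from the attached export); with this bug the task stays stuck even after such an attempt:
--------8<----------------8<----------------8<--------
# feed a resume attempt to the rails console: look up the foreman task
# and ask dynflow to resume its execution plan
foreman-rake console <<'EOF'
task = ForemanTasks::Task.find_by(external_id: '4ec32c42-cff1-4549-a5dc-d320aa824449')
ForemanTasks.dynflow.world.execute(task.external_id)
EOF
--------8<----------------8<----------------8<--------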
Additional info:
Created attachment 1871712: task export with two tasks in an invalid state
See the tasks:
4ec32c42-cff1-4549-a5dc-d320aa824449.html
c93a2656-1048-4754-930b-bc93f38d1c82.html
as an example of the reproducer outcome.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Satellite 6.13 Release) and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2097