Description of problem:
When the postgres service is restarted (e.g. as part of a restart of all services, or on its own) just as dynflow is about to complete a task, the task can end up hung forever in an invalid state.
"Invalid state" means e.g.:
- foreman sees the task as stopped/pending while dynflow sees it as stopped/success
- or foreman sees the task as running/pending while dynflow sees it as stopped/success
"Forever" means there is no user action that fixes the status:
- a services restart doesn't help
- force unlock can move the foreman task from running/pending to stopped/pending, but nothing else
Also, until force unlock is done, such a stuck task can still hold the lock(s) on its object(s). A query sketch for spotting the foreman/dynflow mismatch follows below.
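One way to surface such mismatched tasks is to compare the foreman and dynflow states directly in the database. A minimal sketch, assuming the foreman_tasks_tasks and dynflow_execution_plans tables in the foreman database and the external_id/uuid link that foreman-tasks uses (column names and types may differ between versions):
--------8<----------------8<----------------8<--------
# list tasks where foreman and dynflow disagree on state or result;
# uuid is cast to text since its column type may differ across versions
su - postgres -c "psql foreman -c \"
  select t.id, t.label, t.state as foreman_state, t.result as foreman_result,
         ep.state as dynflow_state, ep.result as dynflow_result
    from foreman_tasks_tasks t
    join dynflow_execution_plans ep on ep.uuid::text = t.external_id
   where t.state <> ep.state or t.result <> ep.result;\""
--------8<----------------8<----------------8<--------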
Version-Release number of selected component (if applicable):
Satellite 6.10.4
How reproducible:
100% within a few attempts
Steps to Reproduce:
One particular reproducer is to destroy a CV and, just at the end, restart the postgres service. It can be VERY tricky to guess "at the end", so the script below checks the number of completed pulp tasks: for a CV with one repo, the ContentView::Destroy task triggers one pulp task. So as soon as the script detects as many new completed pulp tasks as there are CVs being destroyed, it restarts postgres.
Script itself:
--------8<----------------8<----------------8<--------
#!/bin/bash
# Reproducer: concurrently create+publish CVs, then delete them, and restart
# postgres just as the corresponding pulp tasks complete.
CONCUR=${1:-5}    # number of CVs to create and delete concurrently
REPOIDS=${2:-51}  # id(s) of a small repo to put into each CV
hmr="hammer shell"

# Create a CV with one repo, publish it, and remove it from the environment,
# so only the final "content-view delete" remains to be run.
prepare_cv_to_delete() {
  CVID=$1
  ( echo "content-view create --organization-id=1 --name cv_zoos_${CVID} --repository-ids ${REPOIDS}"
    echo "content-view publish --organization-id=1 --name cv_zoos_${CVID}"
    echo "content-view remove-from-environment --organization-id=1 --name=cv_zoos_${CVID} --lifecycle-environment-id=1"
    echo "content-view version delete --content-view=cv_zoos_${CVID} --version 1.0 --organization-id 1"
  ) | $hmr
}

for i in $(seq 1 $CONCUR); do
  prepare_cv_to_delete $i &
done
echo "waiting for CVs create+almost-delete"
time wait

# Baseline count of pulp tasks, taken before the deletes are launched so that
# freshly triggered tasks cannot inflate it (and the loop below cannot hang).
tasks=$(su - postgres -c "psql pulpcore -c \"copy (select count(*) from core_task) to stdout;\"")

for i in $(seq 1 $CONCUR); do
  hammer content-view delete --name=cv_zoos_${i} --organization-id 1 &
done
echo "$(date): waiting for CVs delete, pulp tasks=${tasks}"
expected=$((tasks+CONCUR))
tasks=0
# Each CV delete triggers exactly one pulp task; once all of them appear in
# core_task, the ContentView::Destroy tasks are about to finish.
while [ $tasks -lt $expected ]; do
  tasks=$(su - postgres -c "psql pulpcore -c \"copy (select count(*) from core_task) to stdout;\"")
  sleep 0.5
done
echo "$(date): restarting postgres as having tasks=${tasks}"
systemctl restart rh-postgresql12-postgresql.service
date
time wait
su - postgres -c "psql pulpcore -c \"select count(*) from core_task;\""
--------8<----------------8<----------------8<--------
Usage:
./create_delete_cv_restart_postgres.sh 5 REPOID
where REPOID is the ID of a small repository
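After a run, a quick way to spot candidate stuck tasks is hammer's task search. A sketch, assuming the usual foreman-tasks scoped-search fields (exact syntax may vary between versions):
--------8<----------------8<----------------8<--------
# tasks that never reached a consistent stopped/success state
hammer task list --search 'state != stopped or result = pending'
--------8<----------------8<----------------8<--------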
Actual results:
Random tasks get stuck forever, some of them still holding their object lock(s).
As an example, see the attached task export.
Expected results:
No tasks stuck forever. Tasks should be recoverable by a services restart or by a manual (Skip &) Resume.
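For completeness, a manual resume attempt can be made from the rails console. A sketch, assuming the standard ForemanTasks/Dynflow API (the UUID is one of the stuck tasks from the attached export); with this bug the task stays stuck even after such an attempt:
--------8<----------------8<----------------8<--------
# feed a resume attempt to the rails console: look up the foreman task
# and ask dynflow to resume its execution plan
foreman-rake console <<'EOF'
task = ForemanTasks::Task.find_by(external_id: '4ec32c42-cff1-4549-a5dc-d320aa824449')
ForemanTasks.dynflow.world.execute(task.external_id)
EOF
--------8<----------------8<----------------8<--------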
Additional info:
Created attachment 1871712: task export with two tasks in an invalid state
See the tasks:
4ec32c42-cff1-4549-a5dc-d320aa824449.html
c93a2656-1048-4754-930b-bc93f38d1c82.html
as an example of the reproducer outcome.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Satellite 6.13 Release) and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2097