+++ This bug was initially created as a clone of Bug #1651367 +++ Description of problem: In some occasion (not having a reliable reproducer at the moment), there can be an invalid lock records be present, preventing the ListenOnCandlepinEvents task to not start after the foreman-tasks is stopped Version-Release number of selected component (if applicable): How reproducible: occasionally Steps to Reproduce: 1. force-kill the dynflow_executor 2. systemctl restart foreman-tasks Actual results: the ListenOnCandlepinEvents not getting started Expected results: The task gets started Additional info: One indicator is, when the foreman-tasks service is stopped, the following query still returns some results (even though it should not): select * from dynflow_coordinator_records where class = 'Dynflow::Coordinator::SingletonActionLock'; singleton-action:Actions::Candlepin::ListenOnCandlepinEvents | Dynflow::Coordinator::SingletonActionLock | execution-plan:f16e072d-04e4-4b88-9db7-a9349bc7476c | {"class":" Dynflow::Coordinator::SingletonActionLock","owner_id":"execution-plan:f16e072d-04e4-4b88-9db7-a9349bc7476c","execution_plan_id":"f16e072d-04e4-4b88-9db7-a9349bc7476c","id": "singleton-action:Actions::Candlepin::ListenOnCandlepinEvents"} singleton-action:Actions::Katello::EventQueue::Monitor | Dynflow::Coordinator::SingletonActionLock | execution-plan:b340536b-5b55-4891-ad7a-b65a012bf5ff | {"class":" Dynflow::Coordinator::SingletonActionLock","owner_id":"execution-plan:b340536b-5b55-4891-ad7a-b65a012bf5ff","execution_plan_id":"b340536b-5b55-4891-ad7a-b65a012bf5ff","id": "singleton-action:Actions::Katello::EventQueue::Monitor"} Seems related to https://bugzilla.redhat.com/show_bug.cgi?id=1642369 --- Additional comment from Ivan Necas on 2018-11-20 14:04:47 UTC --- FYI: tasks export and foreman-debug from the time this happened are available in https://access.redhat.com/support/cases/#/case/02255848. Additional note is the customers are running 2 executors: not sure it's relevant to this issue though. --- Additional comment from Adam Ruzicka on 2018-11-21 13:06:36 UTC --- 2018-11-18 11:55:50 [foreman-tasks/dynflow] [E] invalidated execution plan f16e072d-04e4-4b88-9db7-a9349bc7476c missing, skipping 2018-11-18 11:55:50 [foreman-tasks/dynflow] [E] invalidated execution plan b340536b-5b55-4891-ad7a-b65a012bf5ff missing, skipping According to the logs the LOCE and EQM execution plans were somehow removed and we didn't release the locks for such plans properly during invalidation. --- Additional comment from Adam Ruzicka on 2018-11-21 13:36:07 UTC --- Created redmine issue http://projects.theforeman.org/issues/25526 from this bug --- Additional comment from Adam Ruzicka on 2018-11-21 13:41:42 UTC --- To recapitulate, if an execution plan is deleted while it still holds the singleton action lock (needed for LOCE and EQM), the singleton action lock will not be unlocked during invalidation. This leads to LOCE and EQM not starting on dynflow executor restart. New steps to reproduce based on previous findings: 1. force-kill the dynflow_executor 2. Find which execution plan holds the singleton lock and delete the execution plan 3. systemctl restart foreman-tasks --- Additional comment from Ivan Necas on 2018-11-26 15:12:34 UTC --- Proposing to fix this for 6.3 and 6.4 as well, to prevent not very nice customer cases to happen, when LOCE is not running --- Additional comment from Red Hat Bugzilla Rules Engine on 2018-11-26 15:12:39 UTC --- Since this issue was entered in Red Hat Bugzilla, the pm_ack has been set to + automatically for the next planned release. --- Additional comment from Ivan Necas on 2019-01-08 16:23:16 UTC --- Workaround when hitting the issue: /sbin/service foreman-tasks stop /sbin/foreman-rake foreman_tasks:cleanup:run TASK_SEARCH='label=Actions::Candlepin::ListenOnCandlepinEvents' STATES="running,paused" /sbin/foreman-rake foreman_tasks:cleanup:run TASK_SEARCH='label=Actions::Katello::EventQueue::Monitor' STATES="running,paused" #Delete the action locks /bin/echo "delete from dynflow_coordinator_records where class = 'Dynflow::Coordinator::SingletonActionLock';" | /bin/sudo -u postgres psql -d foreman /sbin/runuser -l postgres -c "psql -c -d foreman \"delete from dynflow_coordinator_records where class = 'Dynflow::Coordinator::SingletonActionLock';\" for #start foreman-tasks /bin/systemctl start foreman-tasks --- Additional comment from Ivan Necas on 2019-01-08 16:28:51 UTC --- /sbin/service foreman-tasks stop /sbin/foreman-rake foreman_tasks:cleanup:run TASK_SEARCH='label=Actions::Candlepin::ListenOnCandlepinEvents' STATES="running,paused" /sbin/foreman-rake foreman_tasks:cleanup:run TASK_SEARCH='label=Actions::Katello::EventQueue::Monitor' STATES="running,paused" #Delete the action locks /bin/echo "delete from dynflow_coordinator_records where class = 'Dynflow::Coordinator::SingletonActionLock';" | /bin/sudo -u postgres psql -d foreman #start foreman-tasks /bin/systemctl start foreman-tasks --- Additional comment from Ivan Necas on 2019-01-08 16:30:35 UTC --- To determine the system is hitting the issue: service foreman-tasks stop /bin/echo "select * from dynflow_coordinator_records where class = 'Dynflow::Coordinator::SingletonActionLock';" | /bin/sudo -u postgres psql -d foreman If it's returning non-empty table, it most probably means the issue has been hit and https://bugzilla.redhat.com/show_bug.cgi?id=1651367#c8 should be applied --- Additional comment from Ivan Necas on 2019-01-11 12:15:27 UTC --- CHERRY-PICK notes: sat 6.3 https://gitlab.sat.engineering.redhat.com/satellite6/dynflow/merge_requests/19 sat 6.4 https://gitlab.sat.engineering.redhat.com/satellite6/dynflow/merge_requests/20 --- Additional comment from Ivan Necas on 2019-01-11 12:50 UTC --- hotfix against Satellite 6.3.5 (including 1628638 and 1654975) scratch build from https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19763947, the source branch is here https://gitlab.sat.engineering.redhat.com/satellite6/dynflow/tree/HOTFIX-1628638-1654975-1651367-6.3.5. Instructions to apply: yum install -y ./tfm-rubygem-dynflow-0.8.34-4.HOTFIXRHBZ1628638_1654975_1651367.fm1_15.el7sat.noarch.rpm katello-service restart --- Additional comment from Ivan Necas on 2019-01-15 12:16 UTC --- scratch build from https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19813303, the source branch is here https://gitlab.sat.engineering.redhat.com/satellite6/dynflow/tree/BZ1651367-6.4 Instructions to apply: yum install -y ./tfm-rubygem-dynflow-1.0.5.1-2.HOTFIXRHBZ1651367.el7sat.noarch.rpm katello-service restart
Verified in Satellite 6.4.2 Snap 1 -bash-4.2# ps -aux | grep dynflow root 20902 0.0 0.0 112708 980 pts/0 S+ 21:34 0:00 grep --color=auto dynflow foreman+ 26872 0.0 0.1 257088 48644 ? Sl Jan22 0:02 ruby /usr/bin/smart_proxy_dynflow_core -d -p /var/run/foreman-proxy/smart_proxy_dynflow_core.pid foreman 27446 1.2 3.6 3107508 900772 ? Sl Jan22 17:18 dynflow_executor foreman 27447 0.0 0.4 493820 107240 ? Sl Jan22 0:00 dynflow_executor_monitor -bash-4.2# -bash-4.2# kill 27446 -bash-4.2# -bash-4.2# ps -aux | grep dynflow root 21107 0.0 0.0 112708 980 pts/0 S+ 21:36 0:00 grep --color=auto dynflow foreman+ 26872 0.0 0.1 257088 48644 ? Sl Jan22 0:02 ruby /usr/bin/smart_proxy_dynflow_core -d -p /var/run/foreman-proxy/smart_proxy_dynflow_core.pid foreman 27447 0.0 0.4 493820 107240 ? Sl Jan22 0:00 dynflow_executor_monitor -bash-4.2# -bash-4.2# systemctl restart foreman-tasks -bash-4.2# -bash-4.2# ps -aux | grep dynflow foreman 21218 95.0 0.9 615448 229168 ? Rl 21:36 0:18 dynflow_executor foreman 21221 0.5 0.4 492308 105568 ? Sl 21:36 0:00 dynflow_executor_monitor root 21266 0.0 0.0 112708 980 pts/0 S+ 21:36 0:00 grep --color=auto dynflow foreman+ 26872 0.0 0.1 257088 48644 ? Sl Jan22 0:02 ruby /usr/bin/smart_proxy_dynflow_core -d -p /var/run/foreman-proxy/smart_proxy_dynflow_core.pid After the restart, both the listen and monitor tasks are running.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0345