Bug 1666893 - Actions::Candlepin::ListenOnCandlepinEvents occasionally not starting after unclean shutdown of the executor
Summary: Actions::Candlepin::ListenOnCandlepinEvents occasionally not starting after u...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Tasks Plugin
Version: 6.4
Hardware: Unspecified
OS: Unspecified
unspecified
high vote
Target Milestone: 6.4.2
Assignee: satellite6-bugs
QA Contact: jcallaha
URL:
Whiteboard:
Depends On: 1651367
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-01-16 21:20 UTC by Mike McCune
Modified: 2019-10-30 19:28 UTC (History)
3 users (show)

Fixed In Version: tfm-rubygem-dynflow-1.0.5.5-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1651367
Environment:
Last Closed: 2019-02-13 19:08:21 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Foreman Issue Tracker 25526 Normal Closed Actions::Candlepin::ListenOnCandlepinEvents occasionally not starting after unclean shutdown of the executor 2020-05-11 08:35:48 UTC
Red Hat Product Errata RHBA-2019:0345 None None None 2019-02-13 19:08:23 UTC

Description Mike McCune 2019-01-16 21:20:00 UTC
+++ This bug was initially created as a clone of Bug #1651367 +++

Description of problem:
In some occasion (not having a reliable reproducer at the moment), there can be an invalid lock records be present, preventing the ListenOnCandlepinEvents task to not start after the foreman-tasks is stopped

Version-Release number of selected component (if applicable):


How reproducible:
occasionally

Steps to Reproduce:
1. force-kill the dynflow_executor
2. systemctl restart foreman-tasks

Actual results:
the ListenOnCandlepinEvents not getting started

Expected results:
The task gets started

Additional info:

One indicator is, when the foreman-tasks service is stopped, the
following query still returns some results (even though it should not):

select * from dynflow_coordinator_records where class = 'Dynflow::Coordinator::SingletonActionLock';

singleton-action:Actions::Candlepin::ListenOnCandlepinEvents | Dynflow::Coordinator::SingletonActionLock | execution-plan:f16e072d-04e4-4b88-9db7-a9349bc7476c | {"class":" 
Dynflow::Coordinator::SingletonActionLock","owner_id":"execution-plan:f16e072d-04e4-4b88-9db7-a9349bc7476c","execution_plan_id":"f16e072d-04e4-4b88-9db7-a9349bc7476c","id": 
"singleton-action:Actions::Candlepin::ListenOnCandlepinEvents"} 
singleton-action:Actions::Katello::EventQueue::Monitor | Dynflow::Coordinator::SingletonActionLock | execution-plan:b340536b-5b55-4891-ad7a-b65a012bf5ff | {"class":" 
Dynflow::Coordinator::SingletonActionLock","owner_id":"execution-plan:b340536b-5b55-4891-ad7a-b65a012bf5ff","execution_plan_id":"b340536b-5b55-4891-ad7a-b65a012bf5ff","id": 
"singleton-action:Actions::Katello::EventQueue::Monitor"} 

Seems related to https://bugzilla.redhat.com/show_bug.cgi?id=1642369

--- Additional comment from Ivan Necas on 2018-11-20 14:04:47 UTC ---

FYI: tasks export and foreman-debug from the time this happened are available in https://access.redhat.com/support/cases/#/case/02255848.

Additional note is the customers are running 2 executors: not sure it's relevant to this issue though.

--- Additional comment from Adam Ruzicka on 2018-11-21 13:06:36 UTC ---

2018-11-18 11:55:50  [foreman-tasks/dynflow] [E] invalidated execution plan f16e072d-04e4-4b88-9db7-a9349bc7476c missing, skipping
2018-11-18 11:55:50  [foreman-tasks/dynflow] [E] invalidated execution plan b340536b-5b55-4891-ad7a-b65a012bf5ff missing, skipping

According to the logs the LOCE and EQM execution plans were somehow removed and we didn't release the locks for such plans properly during invalidation.

--- Additional comment from Adam Ruzicka on 2018-11-21 13:36:07 UTC ---

Created redmine issue http://projects.theforeman.org/issues/25526 from this bug

--- Additional comment from Adam Ruzicka on 2018-11-21 13:41:42 UTC ---

To recapitulate, if an execution plan is deleted while it still holds the singleton action lock (needed for LOCE and EQM), the singleton action lock will not be unlocked during invalidation. This leads to LOCE and EQM not starting on dynflow executor restart.

New steps to reproduce based on previous findings:
1. force-kill the dynflow_executor
2. Find which execution plan holds the singleton lock and delete the execution plan
3. systemctl restart foreman-tasks

--- Additional comment from Ivan Necas on 2018-11-26 15:12:34 UTC ---

Proposing to fix this for 6.3 and 6.4 as well, to prevent not very nice customer cases to happen, when LOCE is not running

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-11-26 15:12:39 UTC ---

Since this issue was entered in Red Hat Bugzilla, the pm_ack has been set to + automatically for the next planned release.

--- Additional comment from Ivan Necas on 2019-01-08 16:23:16 UTC ---

Workaround when hitting the issue:

        /sbin/service foreman-tasks stop
        /sbin/foreman-rake foreman_tasks:cleanup:run TASK_SEARCH='label=Actions::Candlepin::ListenOnCandlepinEvents'   STATES="running,paused"
        /sbin/foreman-rake foreman_tasks:cleanup:run TASK_SEARCH='label=Actions::Katello::EventQueue::Monitor'   STATES="running,paused"

        #Delete the action locks
        /bin/echo "delete from dynflow_coordinator_records where class = 'Dynflow::Coordinator::SingletonActionLock';"  | /bin/sudo -u postgres psql -d foreman
        /sbin/runuser -l  postgres -c "psql -c -d foreman \"delete from dynflow_coordinator_records where class = 'Dynflow::Coordinator::SingletonActionLock';\" for

        #start foreman-tasks
        /bin/systemctl start foreman-tasks

--- Additional comment from Ivan Necas on 2019-01-08 16:28:51 UTC ---

        /sbin/service foreman-tasks stop
        /sbin/foreman-rake foreman_tasks:cleanup:run TASK_SEARCH='label=Actions::Candlepin::ListenOnCandlepinEvents'   STATES="running,paused"
        /sbin/foreman-rake foreman_tasks:cleanup:run TASK_SEARCH='label=Actions::Katello::EventQueue::Monitor'   STATES="running,paused"

        #Delete the action locks
        /bin/echo "delete from dynflow_coordinator_records where class = 'Dynflow::Coordinator::SingletonActionLock';"  | /bin/sudo -u postgres psql -d foreman

        #start foreman-tasks
        /bin/systemctl start foreman-tasks

--- Additional comment from Ivan Necas on 2019-01-08 16:30:35 UTC ---

To determine the system is hitting the issue:

service foreman-tasks stop
/bin/echo "select * from dynflow_coordinator_records where class = 'Dynflow::Coordinator::SingletonActionLock';"  | /bin/sudo -u postgres psql -d foreman

If it's returning non-empty table, it most probably means the issue has been hit and https://bugzilla.redhat.com/show_bug.cgi?id=1651367#c8 should be applied

--- Additional comment from Ivan Necas on 2019-01-11 12:15:27 UTC ---

CHERRY-PICK notes:

sat 6.3 https://gitlab.sat.engineering.redhat.com/satellite6/dynflow/merge_requests/19
sat 6.4 https://gitlab.sat.engineering.redhat.com/satellite6/dynflow/merge_requests/20

--- Additional comment from Ivan Necas on 2019-01-11 12:50 UTC ---

hotfix against Satellite 6.3.5 (including 1628638 and 1654975)

scratch build from https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19763947,  the source branch is here https://gitlab.sat.engineering.redhat.com/satellite6/dynflow/tree/HOTFIX-1628638-1654975-1651367-6.3.5.

Instructions to apply:


yum install -y ./tfm-rubygem-dynflow-0.8.34-4.HOTFIXRHBZ1628638_1654975_1651367.fm1_15.el7sat.noarch.rpm
katello-service restart

--- Additional comment from Ivan Necas on 2019-01-15 12:16 UTC ---

scratch build from https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19813303,  the source branch is here https://gitlab.sat.engineering.redhat.com/satellite6/dynflow/tree/BZ1651367-6.4

Instructions to apply:


yum install -y ./tfm-rubygem-dynflow-1.0.5.1-2.HOTFIXRHBZ1651367.el7sat.noarch.rpm
katello-service restart

Comment 4 jcallaha 2019-01-23 20:38:06 UTC
Verified in Satellite 6.4.2 Snap 1

-bash-4.2# ps -aux | grep dynflow
root     20902  0.0  0.0 112708   980 pts/0    S+   21:34   0:00 grep --color=auto dynflow
foreman+ 26872  0.0  0.1 257088 48644 ?        Sl   Jan22   0:02 ruby /usr/bin/smart_proxy_dynflow_core -d -p /var/run/foreman-proxy/smart_proxy_dynflow_core.pid
foreman  27446  1.2  3.6 3107508 900772 ?      Sl   Jan22  17:18 dynflow_executor
foreman  27447  0.0  0.4 493820 107240 ?       Sl   Jan22   0:00 dynflow_executor_monitor
-bash-4.2# 
-bash-4.2# kill 27446
-bash-4.2# 
-bash-4.2# ps -aux | grep dynflow
root     21107  0.0  0.0 112708   980 pts/0    S+   21:36   0:00 grep --color=auto dynflow
foreman+ 26872  0.0  0.1 257088 48644 ?        Sl   Jan22   0:02 ruby /usr/bin/smart_proxy_dynflow_core -d -p /var/run/foreman-proxy/smart_proxy_dynflow_core.pid
foreman  27447  0.0  0.4 493820 107240 ?       Sl   Jan22   0:00 dynflow_executor_monitor
-bash-4.2# 
-bash-4.2# systemctl restart foreman-tasks
-bash-4.2# 
-bash-4.2# ps -aux | grep dynflow
foreman  21218 95.0  0.9 615448 229168 ?       Rl   21:36   0:18 dynflow_executor
foreman  21221  0.5  0.4 492308 105568 ?       Sl   21:36   0:00 dynflow_executor_monitor
root     21266  0.0  0.0 112708   980 pts/0    S+   21:36   0:00 grep --color=auto dynflow
foreman+ 26872  0.0  0.1 257088 48644 ?        Sl   Jan22   0:02 ruby /usr/bin/smart_proxy_dynflow_core -d -p /var/run/foreman-proxy/smart_proxy_dynflow_core.pid

After the restart, both the listen and monitor tasks are running.

Comment 6 errata-xmlrpc 2019-02-13 19:08:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0345


Note You need to log in before you can comment on or make changes to this bug.