Bug 1888816

Summary: Busy dynflow_executors lead to SQL errors (KeyError logged in the application) and possibly slowness

Product: Red Hat Satellite
Reporter: Pablo Hess <phess>
Component: Dynflow
Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED DEFERRED
QA Contact: Satellite QE Team <sat-qe-bz-list>
Severity: high
Docs Contact:
Priority: high
Version: 6.7.0
CC: aruzicka, mfalz, sadas
Target Milestone: Unspecified
Keywords: Triaged
Target Release: Unused
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-08-14 08:58:44 UTC
Type: Bug

Description Pablo Hess 2020-10-15 19:15:12 UTC
Description of problem:
Red Hat Satellite 6.7.4 (possibly also 6.7.0-6.7.3) logs many KeyError errors to /var/log/foreman/dynflow_executor.output during task-busy periods, mostly while running Capsule Syncs and multiple concurrent Content View publish/promote tasks.

So far I have only observed this on Satellites where /etc/sysconfig/dynflowd has EXECUTOR_COUNT=3.
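
For reference, a quick way to confirm the executor count on an affected box (a minimal sketch; it assumes the service is named dynflowd on this version, adjust if your layout differs):

# Show the configured number of dynflow executors
grep -E '^EXECUTOR_COUNT' /etc/sysconfig/dynflowd

# If the value is changed, restart the executors for it to take effect
# (assumption: the service name matches the sysconfig file, i.e. dynflowd)
systemctl restart dynflowd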


/var/log/foreman/dynflow_executor.output shows these errors:

E, [2020-10-14T11:19:54.427792 #96826] ERROR -- /parallel-executor-core/pool default/worker-2: Could not save state (RuntimeError)
E, [2020-10-14T11:19:54.431332 #96826] ERROR -- /parallel-executor-core/pool default/worker-3: Could not save state (RuntimeError)
E, [2020-10-14T11:19:54.434585 #96826] ERROR -- /parallel-executor-core/pool default/worker-2: searching: action by: {:execution_plan_uuid=>"bd04a648-e11b-47bd-9049-c87abc7ec003", :id=>1638} (KeyError)
E, [2020-10-14T11:19:54.442404 #96826] ERROR -- /parallel-executor-core/pool default/worker-2: searching: action by: {:execution_plan_uuid=>"bd04a648-e11b-47bd-9049-c87abc7ec003", :id=>1634} (KeyError)
E, [2020-10-14T11:20:00.826733 #96826] ERROR -- /parallel-executor-core/pool default/worker-2: searching: action by: {:execution_plan_uuid=>"bd04a648-e11b-47bd-9049-c87abc7ec003", :id=>1648} (KeyError)
E, [2020-10-14T11:20:00.833610 #96826] ERROR -- /parallel-executor-core/pool default/worker-2: searching: action by: {:execution_plan_uuid=>"bd04a648-e11b-47bd-9049-c87abc7ec003", :id=>1648} (KeyError)
E, [2020-10-14T11:20:00.843802 #96826] ERROR -- /parallel-executor-core/pool default/worker-2: searching: action by: {:execution_plan_uuid=>"bd04a648-e11b-47bd-9049-c87abc7ec003", :id=>1648} (KeyError)
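
To illustrate what the KeyError is searching for, here is a rough check against the Dynflow tables using the UUID and id from the log lines above (a sketch only; it assumes the Foreman database is named "foreman" and is reachable as the postgres user, adjust to your environment):

# Does the action the executor is searching for actually exist?
sudo -u postgres psql foreman -c "
  SELECT execution_plan_uuid, id, class
    FROM dynflow_actions
   WHERE execution_plan_uuid = 'bd04a648-e11b-47bd-9049-c87abc7ec003'
     AND id = 1648;"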



The postgres logs from the same time window show these types of errors:

2020-10-14 11:19:54 *TZ* LOG:  duration: 700.013 ms  statement: UPDATE "dynflow_steps" SET "execution_plan_uuid" = 'bd04a648-e11b-47bd-9049-c87abc7ec003', "id" = 2455, "action_id" = 1636, "data" = NULL, "state" = 'running', "started_at" = '2020-10-14 16:50:43.303331+0000', "ended_at" = '2020-10-14 18:19:37.717563+0000', "real_time" = 5334.414232315, "execution_time" = 114.37665876999998, "progress_done" = 0.1, "progress_weight" = 100, "class" = 'Dynflow::ExecutionPlan::Steps::RunStep', "error" = NULL, "action_class" = 'Actions::Pulp::Consumer::SyncCapsule', "children" = NULL, "queue" = 'default' WHERE (("state" IN ('pending', 'error', 'skipping', 'suspended')) AND ("execution_plan_uuid" = 'bd04a648-e11b-47bd-9049-c87abc7ec003') AND ("id" = 2455))
2020-10-14 11:24:57 *TZ* ERROR:  insert or update on table "dynflow_actions" violates foreign key constraint "dynflow_actions_execution_plan_uuid_fkey"
2020-10-14 11:24:57 *TZ* DETAIL:  Key (execution_plan_uuid)=(e077c8ff-4ed2-4e19-8e3f-79e89e20c387) is not present in table "dynflow_execution_plans".
2020-10-14 11:24:57 *TZ* STATEMENT:  INSERT INTO "dynflow_actions" ("execution_plan_uuid", "id", "data", "input", "output", "caller_execution_plan_id", "caller_action_id", "class", "plan_step_id", "run_step_id", "finalize_step_id") VALUES ('e077c8ff-4ed2-4e19-8e3f-79e89e20c387', 1302, NULL, '{"smart_proxy_id":3,"options":{"force":true,"source_repository":{"id":4210,"name":"Red_Hat_Software_Collections_RPMs_for_Red_Hat_Enterprise_Linux_7_Server_x86_64_7Server","url":"***redacted***"},"matching_content":false,"dependency":null,"repository_creation":false},"dependency":null,"repository_id":4262,"remote_user":"admin","remote_cp_user":"admin","current_request_id":null,"current_timezone":"***redacted***)","current_user_id":6,"current_organization_id":1,"current_location_id":null}', '{"pulp_tasks":[{"exception":null,"task_type":"pulp.server.managers.repo.publish.publish","_href":"/pulp/api/v2/tasks/17577454-2d20-4177-b9bd-491b4487479d/","task_id":"17577454-2d20-4177-b9bd-491b4487479d","tags":["pulp:repository:***redacted***","pulp:action:publish"],"finish_time":null,"_ns":"task_status","start_time":null,"traceback":null,"spawned_tasks":[],"progress_report":{},"queue":"","state":"waiting","worker_name":null,"result":null,"error":null,"_id":{"$oid":"5f8742794e822e97f73a45f9"},"id":"5f8742794e822e97f73a45f9"}],"poll_attempts":{"total":19,"failed":1},"cancelled":true}', NULL, 1301, 'Actions::Pulp::Repository::DistributorPublish', 1953, 1954, NULL) RETURNING "execution_plan_uuid"
2020-10-14 11:24:57 *TZ* ERROR:  insert or update on table "dynflow_actions" violates foreign key constraint "dynflow_actions_execution_plan_uuid_fkey"
2020-10-14 11:24:57 *TZ* DETAIL:  Key (execution_plan_uuid)=(e077c8ff-4ed2-4e19-8e3f-79e89e20c387) is not present in table "dynflow_execution_plans".
2020-10-14 11:24:57 *TZ* STATEMENT:  INSERT INTO "dynflow_actions" ("execution_plan_uuid", "id", "data", "input", "output", "caller_execution_plan_id", "caller_action_id", "class", "plan_step_id", "run_step_id", "finalize_step_id") VALUES ('e077c8ff-4ed2-4e19-8e3f-79e89e20c387', 1290, NULL, '{"smart_proxy_id":3,"options":{"force":true,"source_repository":{"id":2009,"name":"***redacted***","url":"***redacted***"},"matching_content":false,"dependency":null,"repository_creation":false},"dependency":null,"repository_id":1888,"remote_user":"admin","remote_cp_user":"admin","current_request_id":null,"current_timezone":"***redacted***","current_user_id":6,"current_organization_id":1,"current_location_id":null}', '{"pulp_tasks":[{"exception":null,"task_type":"pulp.server.managers.repo.publish.publish","_href":"/pulp/api/v2/tasks/3a58aa55-e4c3-4440-af36-f9a6abf9d3f6/","task_id":"3a58aa55-e4c3-4440-af36-f9a6abf9d3f6","tags":["pulp:repository:***redacted***","pulp:action:publish"],"finish_time":null,"_ns":"task_status","start_time":null,"traceback":null,"spawned_tasks":[],"progress_report":{},"queue":"","state":"waiting","worker_name":null,"result":null,"error":null,"_id":{"$oid":"5f8742794e822e97f73a4609"},"id":"5f8742794e822e97f73a4609"}],"poll_attempts":{"total":19,"failed":1},"cancelled":true}', NULL, 1289, 'Actions::Pulp::Repository::DistributorPublish', 1935, 1936, NULL) RETURNING "execution_plan_uuid"
2020-10-14 11:24:57 *TZ* ERROR:  insert or update on table "dynflow_actions" violates foreign key constraint "dynflow_actions_execution_plan_uuid_fkey"
2020-10-14 11:24:57 *TZ* DETAIL:  Key (execution_plan_uuid)=(e077c8ff-4ed2-4e19-8e3f-79e89e20c387) is not present in table "dynflow_execution_plans".
2020-10-14 11:24:57 *TZ* STATEMENT:  INSERT INTO "dynflow_actions" ("execution_plan_uuid", "id", "data", "input", "output", "caller_execution_plan_id", "caller_action_id", "class", "plan_step_id", "run_step_id", "finalize_step_id") VALUES ('e077c8ff-4ed2-4e19-8e3f-79e89e20c387', 1294, NULL, '{"smart_proxy_id":3,"options":{"force":true,"source_repository":{"id":2009,"name":"***redacted***","url":"***redacted***"},"matching_content":false,"dependency":null,"repository_creation":false},"dependency":null,"repository_id":1849,"remote_user":"admin","remote_cp_user":"admin","current_request_id":null,"current_timezone":"***redacted***","current_user_id":6,"current_organization_id":1,"current_location_id":null}', '{"pulp_tasks":[{"exception":null,"task_type":"pulp.server.managers.repo.publish.publish","_href":"/pulp/api/v2/tasks/f0e74643-b5f0-465b-a77e-1857d496f5cb/","task_id":"f0e74643-b5f0-465b-a77e-1857d496f5cb","tags":["pulp:repository:***redacted***","pulp:action:publish"],"finish_time":null,"_ns":"task_status","start_time":null,"traceback":null,"spawned_tasks":[],"progress_report":{},"queue":"","state":"waiting","worker_name":null,"result":null,"error":null,"_id":{"$oid":"5f8742794e822e97f73a45fe"},"id":"5f8742794e822e97f73a45fe"}],"poll_attempts":{"total":19,"failed":1},"cancelled":true}', NULL, 1293, 'Actions::Pulp::Repository::DistributorPublish', 1941, 1942, NULL) RETURNING "execution_plan_uuid"
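
Similarly, the foreign-key violations above suggest the parent execution plan row is gone (or was never committed) by the time the actions are written. A rough way to check for the UUID from the violation (sketch only; the column names uuid/state/result are an assumption based on the Dynflow schema, and the database name "foreman" is assumed as above):

# Is the referenced execution plan present at all?
sudo -u postgres psql foreman -c "
  SELECT uuid, state, result
    FROM dynflow_execution_plans
   WHERE uuid = 'e077c8ff-4ed2-4e19-8e3f-79e89e20c387';"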





Both affected Satellites are also fairly large machines using the 64 GB tuning settings.

Version-Release number of selected component (if applicable):
rpm -qa '*dynflow*' '*foreman-tasks*'
tfm-rubygem-dynflow-1.4.1-2.el7sat.noarch
rubygem-smart_proxy_dynflow-0.2.4-1.el7sat.noarch
tfm-rubygem-foreman-tasks-core-0.3.3-1.el7sat.noarch
tfm-rubygem-smart_proxy_dynflow_core-0.2.4-1.el7sat.noarch
tfm-rubygem-foreman-tasks-0.17.5.7-1.el7sat.noarch

How reproducible:
On these Satellites, every time dynflow is busy with "lots" of tasks.

Steps to Reproduce:
1a. Trigger multiple (5 or more) large Capsule syncs at the same time
 ...or...
1b. Trigger 3 or more concurrent Content View publish tasks (a rough hammer sketch follows below).
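
As an illustration only, something like the following can generate the concurrent load (hammer option names may differ slightly between Satellite versions; the IDs are placeholders):

# Kick off several Capsule syncs without waiting for each to finish
for id in 2 3 4 5 6; do
  hammer capsule content synchronize --id "$id" --async
done

# ...or publish several Content Views concurrently
for cv in 10 11 12; do
  hammer content-view publish --id "$cv" --async
done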


Actual results:
Tasks move very slowly, the dynflow_executor processes show very high CPU usage, and /var/log/foreman/dynflow_executor.output and the postgres logs show error messages like the ones above.

Expected results:
Tasks move at the same speed as when dynflow is not too busy, tasks complete successfully, no database-related errors.


Additional info:
One of the systems where this was observed had a single dynflow_executor (out of 3) pegged at 100% CPU usage while the remaining 2 executors stayed at 1-2% during these busy periods. I'm not sure this is related to the issue, but I'm hoping it's useful information.
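
A quick way to capture that per-executor CPU spread while the problem is happening (sketch only; process names may render differently depending on how the executors set their process title):

# Snapshot CPU/memory of the dynflow-related processes, busiest first
ps -eo pid,pcpu,pmem,etime,args --sort=-pcpu | grep -i '[d]ynflow'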

Comment 4 Melanie Falz 2021-06-24 10:07:30 UTC
Please note the ongoing issues with sidekiq, too.
The customer runs Satellite 6.8.6 and we see the following values:

20965 foreman 20 0 14.9g 14.4g 8392 S 100.0 18.5 79:04.94 sidekiq 5.2.7 [3 of 10 busy] 
19727 tomcat 20 0 10.0g 3.3g 8832 S 72.7 4.3 505:31.49 /usr/lib/jvm/jre/bin/java -Xms1024m -Xmx4096m -Djava.security.a

20408 foreman 20 0 1001200 558384 7900 S 0.0 0.7 5:49.60 sidekiq 5.2.7 [0 of 10 busy] 
20913 foreman 20 0 1116248 665764 7184 S 0.0 0.8 6:12.37 sidekiq 5.2.7 [0 of 1 busy] 
20965 foreman 20 0 14.9g 14.4g 8392 S 0.0 18.5 82:24.77 sidekiq 5.2.7 [0 of 10 busy]

Comment 5 Adam Ruzicka 2021-06-24 10:12:42 UTC
@Melanie well, that's not good, but not really related to this bz, or am I missing something?

Comment 8 Adam Ruzicka 2023-08-14 08:58:44 UTC
Closing this bug out, since we do not believe it will be fixed in the next several releases. This was opened some years ago; since then we have switched to a new deployment model, and I don't recall anyone complaining about this kind of issue on currently supported versions. If this has a large impact on you, please feel free to re-open.