Bug 1888816
| Summary: | Busy dynflow_executors lead to SQL errors (KeyError logged in the application) and possibly slowness | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Pablo Hess <phess> |
| Component: | Dynflow | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | CLOSED DEFERRED | QA Contact: | Satellite QE Team <sat-qe-bz-list> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.7.0 | CC: | aruzicka, mfalz, sadas |
| Target Milestone: | Unspecified | Keywords: | Triaged |
| Target Release: | Unused | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-08-14 08:58:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Please note the ongoing issues with sidekiq, too. The customer runs Satellite 6.8.6 and we see the following values in top (first a busy snapshot, then a later one where the same workers are idle):

```
20965 foreman   20   0   14.9g  14.4g   8392 S 100.0 18.5  79:04.94 sidekiq 5.2.7 [3 of 10 busy]
19727 tomcat    20   0   10.0g   3.3g   8832 S  72.7  4.3 505:31.49 /usr/lib/jvm/jre/bin/java -Xms1024m -Xmx4096m -Djava.security.a
```

```
20408 foreman   20   0 1001200 558384   7900 S   0.0  0.7   5:49.60 sidekiq 5.2.7 [0 of 10 busy]
20913 foreman   20   0 1116248 665764   7184 S   0.0  0.8   6:12.37 sidekiq 5.2.7 [0 of 1 busy]
20965 foreman   20   0   14.9g  14.4g   8392 S   0.0 18.5  82:24.77 sidekiq 5.2.7 [0 of 10 busy]
```

@Melanie well, that's not good, but it's not really related to this BZ, or am I missing something?

Closing this bug out since we do not believe it will be fixed in the next several releases. This was opened some years ago; since then we have switched to a new deployment model, and I don't recall anyone complaining about this kind of issue on currently supported versions. If this has a large impact on you, please feel free to re-open.
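A comparable snapshot can be captured with a generic one-liner (nothing Satellite-specific, just a filtered batch run of top):

```
# One batch iteration of top, filtered to the sidekiq and java (tomcat)
# processes shown in the excerpts above.
top -b -n 1 | grep -E 'sidekiq|java'
```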
Description of problem:

Red Hat Satellite 6.7.4 (possibly also 6.7.0 through 6.7.3) logs many KeyErrors to /var/log/foreman/dynflow_executor.output during task-busy periods, mostly during Capsule Syncs and multiple concurrent Content View publish/promote tasks. So far I have only observed this on Satellites where /etc/sysconfig/dynflowd has EXECUTOR_COUNT=3.

/var/log/foreman/dynflow_executor.output shows these errors:

```
E, [2020-10-14T11:19:54.427792 #96826] ERROR -- /parallel-executor-core/pool default/worker-2: Could not save state (RuntimeError)
E, [2020-10-14T11:19:54.431332 #96826] ERROR -- /parallel-executor-core/pool default/worker-3: Could not save state (RuntimeError)
E, [2020-10-14T11:19:54.434585 #96826] ERROR -- /parallel-executor-core/pool default/worker-2: searching: action by: {:execution_plan_uuid=>"bd04a648-e11b-47bd-9049-c87abc7ec003", :id=>1638} (KeyError)
E, [2020-10-14T11:19:54.442404 #96826] ERROR -- /parallel-executor-core/pool default/worker-2: searching: action by: {:execution_plan_uuid=>"bd04a648-e11b-47bd-9049-c87abc7ec003", :id=>1634} (KeyError)
E, [2020-10-14T11:20:00.826733 #96826] ERROR -- /parallel-executor-core/pool default/worker-2: searching: action by: {:execution_plan_uuid=>"bd04a648-e11b-47bd-9049-c87abc7ec003", :id=>1648} (KeyError)
E, [2020-10-14T11:20:00.833610 #96826] ERROR -- /parallel-executor-core/pool default/worker-2: searching: action by: {:execution_plan_uuid=>"bd04a648-e11b-47bd-9049-c87abc7ec003", :id=>1648} (KeyError)
E, [2020-10-14T11:20:00.843802 #96826] ERROR -- /parallel-executor-core/pool default/worker-2: searching: action by: {:execution_plan_uuid=>"bd04a648-e11b-47bd-9049-c87abc7ec003", :id=>1648} (KeyError)
```
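For context, the executor count mentioned above is configured in /etc/sysconfig/dynflowd; a minimal check looks like this (the value shown is the one from the affected Satellites):

```
# Confirm how many dynflow executor processes are configured.
grep EXECUTOR_COUNT /etc/sysconfig/dynflowd
# -> EXECUTOR_COUNT=3 on both affected Satellites
```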
The postgres logs from the same time window show these types of errors (a slow UPDATE on dynflow_steps, followed by repeated foreign key violations on dynflow_actions):

```
2020-10-14 11:19:54 *TZ* LOG: duration: 700.013 ms statement: UPDATE "dynflow_steps" SET "execution_plan_uuid" = 'bd04a648-e11b-47bd-9049-c87abc7ec003', "id" = 2455, "action_id" = 1636, "data" = NULL, "state" = 'running', "started_at" = '2020-10-14 16:50:43.303331+0000', "ended_at" = '2020-10-14 18:19:37.717563+0000', "real_time" = 5334.414232315, "execution_time" = 114.37665876999998, "progress_done" = 0.1, "progress_weight" = 100, "class" = 'Dynflow::ExecutionPlan::Steps::RunStep', "error" = NULL, "action_class" = 'Actions::Pulp::Consumer::SyncCapsule', "children" = NULL, "queue" = 'default' WHERE (("state" IN ('pending', 'error', 'skipping', 'suspended')) AND ("execution_plan_uuid" = 'bd04a648-e11b-47bd-9049-c87abc7ec003') AND ("id" = 2455))
2020-10-14 11:24:57 *TZ* ERROR: insert or update on table "dynflow_actions" violates foreign key constraint "dynflow_actions_execution_plan_uuid_fkey"
2020-10-14 11:24:57 *TZ* DETAIL: Key (execution_plan_uuid)=(e077c8ff-4ed2-4e19-8e3f-79e89e20c387) is not present in table "dynflow_execution_plans".
2020-10-14 11:24:57 *TZ* STATEMENT: INSERT INTO "dynflow_actions" ("execution_plan_uuid", "id", "data", "input", "output", "caller_execution_plan_id", "caller_action_id", "class", "plan_step_id", "run_step_id", "finalize_step_id") VALUES ('e077c8ff-4ed2-4e19-8e3f-79e89e20c387', 1302, NULL, '{"smart_proxy_id":3,"options":{"force":true,"source_repository":{"id":4210,"name":"Red_Hat_Software_Collections_RPMs_for_Red_Hat_Enterprise_Linux_7_Server_x86_64_7Server","url":"***redacted***"},"matching_content":false,"dependency":null,"repository_creation":false},"dependency":null,"repository_id":4262,"remote_user":"admin","remote_cp_user":"admin","current_request_id":null,"current_timezone":"***redacted***)","current_user_id":6,"current_organization_id":1,"current_location_id":null}', '{"pulp_tasks":[{"exception":null,"task_type":"pulp.server.managers.repo.publish.publish","_href":"/pulp/api/v2/tasks/17577454-2d20-4177-b9bd-491b4487479d/","task_id":"17577454-2d20-4177-b9bd-491b4487479d","tags":["pulp:repository:***redacted***","pulp:action:publish"],"finish_time":null,"_ns":"task_status","start_time":null,"traceback":null,"spawned_tasks":[],"progress_report":{},"queue":"","state":"waiting","worker_name":null,"result":null,"error":null,"_id":{"$oid":"5f8742794e822e97f73a45f9"},"id":"5f8742794e822e97f73a45f9"}],"poll_attempts":{"total":19,"failed":1},"cancelled":true}', NULL, 1301, 'Actions::Pulp::Repository::DistributorPublish', 1953, 1954, NULL) RETURNING "execution_plan_uuid"
2020-10-14 11:24:57 *TZ* ERROR: insert or update on table "dynflow_actions" violates foreign key constraint "dynflow_actions_execution_plan_uuid_fkey"
2020-10-14 11:24:57 *TZ* DETAIL: Key (execution_plan_uuid)=(e077c8ff-4ed2-4e19-8e3f-79e89e20c387) is not present in table "dynflow_execution_plans".
2020-10-14 11:24:57 *TZ* STATEMENT: INSERT INTO "dynflow_actions" ("execution_plan_uuid", "id", "data", "input", "output", "caller_execution_plan_id", "caller_action_id", "class", "plan_step_id", "run_step_id", "finalize_step_id") VALUES ('e077c8ff-4ed2-4e19-8e3f-79e89e20c387', 1290, NULL, '{"smart_proxy_id":3,"options":{"force":true,"source_repository":{"id":2009,"name":"***redacted***","url":"***redacted***"},"matching_content":false,"dependency":null,"repository_creation":false},"dependency":null,"repository_id":1888,"remote_user":"admin","remote_cp_user":"admin","current_request_id":null,"current_timezone":"***redacted***","current_user_id":6,"current_organization_id":1,"current_location_id":null}', '{"pulp_tasks":[{"exception":null,"task_type":"pulp.server.managers.repo.publish.publish","_href":"/pulp/api/v2/tasks/3a58aa55-e4c3-4440-af36-f9a6abf9d3f6/","task_id":"3a58aa55-e4c3-4440-af36-f9a6abf9d3f6","tags":["pulp:repository:***redacted***","pulp:action:publish"],"finish_time":null,"_ns":"task_status","start_time":null,"traceback":null,"spawned_tasks":[],"progress_report":{},"queue":"","state":"waiting","worker_name":null,"result":null,"error":null,"_id":{"$oid":"5f8742794e822e97f73a4609"},"id":"5f8742794e822e97f73a4609"}],"poll_attempts":{"total":19,"failed":1},"cancelled":true}', NULL, 1289, 'Actions::Pulp::Repository::DistributorPublish', 1935, 1936, NULL) RETURNING "execution_plan_uuid"
2020-10-14 11:24:57 *TZ* ERROR: insert or update on table "dynflow_actions" violates foreign key constraint "dynflow_actions_execution_plan_uuid_fkey"
2020-10-14 11:24:57 *TZ* DETAIL: Key (execution_plan_uuid)=(e077c8ff-4ed2-4e19-8e3f-79e89e20c387) is not present in table "dynflow_execution_plans".
2020-10-14 11:24:57 *TZ* STATEMENT: INSERT INTO "dynflow_actions" ("execution_plan_uuid", "id", "data", "input", "output", "caller_execution_plan_id", "caller_action_id", "class", "plan_step_id", "run_step_id", "finalize_step_id") VALUES ('e077c8ff-4ed2-4e19-8e3f-79e89e20c387', 1294, NULL, '{"smart_proxy_id":3,"options":{"force":true,"source_repository":{"id":2009,"name":"***redacted***","url":"***redacted***"},"matching_content":false,"dependency":null,"repository_creation":false},"dependency":null,"repository_id":1849,"remote_user":"admin","remote_cp_user":"admin","current_request_id":null,"current_timezone":"***redacted***","current_user_id":6,"current_organization_id":1,"current_location_id":null}', '{"pulp_tasks":[{"exception":null,"task_type":"pulp.server.managers.repo.publish.publish","_href":"/pulp/api/v2/tasks/f0e74643-b5f0-465b-a77e-1857d496f5cb/","task_id":"f0e74643-b5f0-465b-a77e-1857d496f5cb","tags":["pulp:repository:***redacted***","pulp:action:publish"],"finish_time":null,"_ns":"task_status","start_time":null,"traceback":null,"spawned_tasks":[],"progress_report":{},"queue":"","state":"waiting","worker_name":null,"result":null,"error":null,"_id":{"$oid":"5f8742794e822e97f73a45fe"},"id":"5f8742794e822e97f73a45fe"}],"poll_attempts":{"total":19,"failed":1},"cancelled":true}', NULL, 1293, 'Actions::Pulp::Repository::DistributorPublish', 1941, 1942, NULL) RETURNING "execution_plan_uuid"
```
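As a quick consistency check against the foreign key violations above, the referenced plan row can be looked up directly. This is a hypothetical psql session: it assumes the default foreman database and that the plans table keys on a uuid column, as the constraint name and DETAIL lines suggest.

```
# Zero rows returned would match the "is not present in table
# dynflow_execution_plans" DETAIL lines in the postgres log.
su - postgres -c "psql foreman -c \"SELECT uuid, state FROM dynflow_execution_plans WHERE uuid = 'e077c8ff-4ed2-4e19-8e3f-79e89e20c387';\""
```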
Both Satellites are also fairly large machines using the 64 GB tuning settings.

Version-Release number of selected component (if applicable):

```
# rpm -qa '*dynflow*' '*foreman-tasks*'
tfm-rubygem-dynflow-1.4.1-2.el7sat.noarch
rubygem-smart_proxy_dynflow-0.2.4-1.el7sat.noarch
tfm-rubygem-foreman-tasks-core-0.3.3-1.el7sat.noarch
tfm-rubygem-smart_proxy_dynflow_core-0.2.4-1.el7sat.noarch
tfm-rubygem-foreman-tasks-0.17.5.7-1.el7sat.noarch
```

How reproducible:

On these Satellites, every time dynflow is busy with "lots" of tasks.

Steps to Reproduce:
1a. Trigger multiple (5+) large Capsule Syncs at the same time, or
1b. Trigger 3+ concurrent Content View publish tasks (see the hammer sketch at the end of this report).

Actual results:

Tasks move very slowly, the dynflow_executors show very intense CPU usage, and /var/log/foreman/dynflow_executor.output and the postgres logs show error messages like the ones above.

Expected results:

Tasks move at the same speed as when dynflow is not busy, tasks complete successfully, and no database-related errors are logged.

Additional info:

One of the systems where this was observed had a single dynflow_executor out of 3 pegged at 100% CPU, while the remaining 2 executors sat at 1-2% during these busy periods. I'm not sure this is related to the issue, but I'm hoping it's useful information.
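To drive step 1b above in a repeatable way, a hammer loop along these lines should work (a sketch only; the Content View names and organization are placeholders):

```
# Kick off several Content View publishes concurrently; --async returns
# immediately, so the publish tasks overlap and keep dynflow busy.
for cv in cv-example-1 cv-example-2 cv-example-3; do
    hammer content-view publish --name "$cv" --organization "Example Org" --async
done
```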