Bug 2012719

Summary: pulp promote task finished but dynflow failed to notice it
Product: Red Hat Satellite
Component: Dynflow
Version: 6.9.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: unspecified
Target Milestone: Unspecified
Target Release: Unused
Reporter: Jan Hutař <jhutar>
Assignee: satellite6-bugs <satellite6-bugs>
QA Contact: Satellite QE Team <sat-qe-bz-list>
CC: aruzicka, jeanbaptiste.dancre, pmoravec, saime, sboyron
Keywords: Triaged
Type: Bug
Last Closed: 2023-08-15 10:36:15 UTC

Description Jan Hutař 2021-10-11 07:02:31 UTC
Description of problem:
Pulp promote task finished but Dynflow failed to notice it:

Note the 'waiting for Pulp to finish the task' status and 'Ended at: 2021-10-08 22:51:16 UTC' in the output below (it is 2021-10-11 07:01 as of now):

```
 55: Actions::Pulp::Repository::DistributorPublish (waiting for Pulp to finish the task) [ 21.23s / 8.14s ] Cancel

Queue: default

Started at: 2021-10-08 22:50:55 UTC

Ended at: 2021-10-08 22:51:16 UTC

Real time: 21.23s

Execution time (excluding suspended state): 8.14s

Input:

---
smart_proxy_id: 1
options:
  force: false
  source_repository:
    id: 1412
    name: Red_Hat_Software_Collections_RPMs_for_Red_Hat_Enterprise_Linux_7_Server_x86_64_7Server
    url: https://.../pulp/repos/org3/content_views/org3-ccv-rhel7-max/9.0/content/dist/rhel/server/7/7Server/x86_64/rhscl/1/os/
  matching_content: false
  dependency: 
dependency: 
repository_id: 656
remote_user: admin
remote_cp_user: admin
current_request_id: 
current_timezone: UTC
current_user_id: 4
current_organization_id: 
current_location_id: 

Output:

---
pulp_tasks:
- exception: 
  task_type: pulp.server.managers.repo.publish.publish
  _href: "/pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/"
  task_id: 28729cb8-3cf7-4d4c-bb69-d63a61d444df
  tags:
  - pulp:repository:6-org3-ccv-rhel7-max-org3-le3-ea1d355d-42c5-46df-bd89-d9c64fc60a9b
  - pulp:action:publish
  finish_time: 
  _ns: task_status
  start_time: '2021-10-08T22:50:55Z'
  traceback: 
  spawned_tasks: []
  progress_report: {}
  queue: reserved_resource_worker-2@....dq2
  state: running
  worker_name: reserved_resource_worker-2@...
  result: 
  error: 
  _id:
    "$oid": 6160cb4f2ceed8b4ab804b0e
  id: 6160cb4f2ceed8b4ab804b0e
poll_attempts:
  total: 10
  failed: 0
```


Version-Release number of selected component (if applicable):
satellite-6.9.6.1-1.el7sat.noarch


How reproducible:
happened for the first time


Steps to Reproduce:
1. I have been doing lots of publishes and promotes, but nothing excessive
   (like 12 publishes and 4 promotes per org in 5 orgs, repeated 10 times)
   Used this workflow:
      https://github.com/redhat-performance/satperf/blob/224a671d8545cde7b969e85d20697fa66fafb913/scripts/create-big-setup.sh
      https://github.com/redhat-performance/satperf/blob/224a671d8545cde7b969e85d20697fa66fafb913/scripts/create-big-setup-update.sh


Actual results:
See the output copied from the Dynflow console above.


Expected results:
The task should have been marked as done ~2 days ago.

Comment 1 Adam Ruzicka 2021-10-11 07:08:40 UTC
Anything in production.log or in the output of journalctl -u dynflow-sidekiq@\* ?

Comment 4 Jan Hutař 2021-10-11 07:19:58 UTC
This is the main task:

```
Id: 7976300b-1a90-46b4-8587-9ad834d10bfa
Label: Actions::Katello::ContentView::Promote
Status: running Cancel
Result: pending
Started at: 2021-10-08 22:50:50 UTC
Ended at: 
```

Comment 5 Jan Hutař 2021-10-11 07:25:34 UTC
I do not see any tracebacks in production.log (looking at the 18:50 - 18:52 timestamps). I can see the task in question transitioning to the running state:

```
[root@gprfc001-vm1 ~]# zgrep  7976300b-1a90-46b4-8587-9ad834d10bfa /var/log/foreman/production.log-20211009.gz 
2021-10-08T18:50:50 [I|bac|a1edba03] Task {label: Actions::Katello::ContentView::Promote, id: f1a84b26-106f-440e-b499-4886f3d90dd9, execution_plan_id: 7976300b-1a90-46b4-8587-9ad834d10bfa} state changed: planning 
2021-10-08T18:50:54 [I|bac|] Task {label: Actions::Katello::ContentView::Promote, id: f1a84b26-106f-440e-b499-4886f3d90dd9, execution_plan_id: 7976300b-1a90-46b4-8587-9ad834d10bfa} state changed: planned 
2021-10-08T18:50:54 [I|bac|] Task {label: Actions::Katello::ContentView::Promote, id: f1a84b26-106f-440e-b499-4886f3d90dd9, execution_plan_id: 7976300b-1a90-46b4-8587-9ad834d10bfa} state changed: running 
```

Comment 6 Jan Hutař 2021-10-11 07:26:57 UTC
Also, all the services seem to be OK:

```
[root@gprfc001-vm1 ~]# satellite-maintain service status -b
Running Status Services
================================================================================
Get status of applicable services: 

Displaying the following service(s):
rh-mongodb34-mongod, postgresql, qdrouterd, qpidd, rh-redis5-redis, squid, pulp_celerybeat, pulp_resource_manager, pulp_streamer, pulp_workers, smart_proxy_dynflow_core, tomcat, dynflow-sidekiq@orchestrator, foreman, httpd, puppetserver, dynflow-sidekiq@worker, dynflow-sidekiq@worker-hosts-queue, foreman-proxy
- displaying rh-mongodb34-mongod                   [OK]                         
\ displaying postgresql                            [OK]                         
\ displaying qdrouterd                             [OK]                         
\ displaying qpidd                                 [OK]                         
\ displaying rh-redis5-redis                       [OK]                         
\ displaying squid                                 [OK]                         
\ displaying pulp_celerybeat                       [OK]                         
\ displaying pulp_resource_manager                 [OK]                         
\ displaying pulp_streamer                         [OK]                         
\ displaying pulp_workers                          [OK]                         
\ displaying smart_proxy_dynflow_core              [OK]                         
\ displaying tomcat                                [OK]                         
\ displaying dynflow-sidekiq@orchestrator          [OK]                         
\ displaying foreman                               [OK]                         
\ displaying httpd                                 [OK]                         
\ displaying puppetserver                          [OK]                         
| displaying dynflow-sidekiq@worker                [OK]                         
| displaying dynflow-sidekiq@worker-hosts-queue    [OK]                         
| displaying foreman-proxy                         [OK]                         
| All services are running                                            [OK]      
--------------------------------------------------------------------------------
```

Comment 8 Pavel Moravec 2021-10-24 10:20:48 UTC
There was just one sidekiq worker with 5 threads, configured with the default foreman_tasks_polling_multiplier. The relevant task was being polled, BUT the polling stopped, sadly just before the task succeeded:

./var/log/httpd/foreman-ssl_access_ssl.log-20211010 :

192.168.0.1 - admin [08/Oct/2021:18:50:55 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 347 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:50:58 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:00 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:02 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:04 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:05 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:06 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:09 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:10 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:11 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:16 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
(and nothing more)

while the task completed just after that:

/var/log/messages-20211010:
Oct  8 18:51:26 gprfc001-vm1 pulp: celery.app.trace:INFO: [28729cb8] Task pulp.server.managers.repo.publish.publish[28729cb8-3cf7-4d4c-bb69-d63a61d444df] succeeded in 30.207722634s: {'exception': None, 'repo_id': '6-org3-ccv-rhel7-max-org3-le3-ea1d355d-42c5-46df-bd89-d9c64fc60a9b', 'traceback': None, 'started': '2021-10-08T22:50:55Z', '_ns': 'repo_publish_results', 'completed': datetime.datetime(2021, 10, 8, 22, 51, 26, 10545, tzinfo=<isodate.tzinfo.Utc object at 0x7fc990508190>), 'error_message': None, 'distributor_type_id': 'yum_clone_distributor', 'distributor_id': '6-org3-ccv-rhel7-max-org3-le3-ea1d355d-42c5-46df-bd89-d9c64fc60a9b_clone', 'summary': {'errors': []}, 'result': 'success', 'id': '6160cb6e3622d563af286a23', 'details': {}}


The intervals between polls look strange; with multiplier=1 they should follow the default scheme:

    def poll_intervals
      [0.5, 1, 2, 4, 8, 16]
    end

(and then every 16s).

Why are the first polls less frequent, then more frequent, and then cut off entirely?

Possible causes:
- just one sidekiq process with 5 workers that wasn't able to handle the load?
  - the polling probes not following the expected polling intervals support this theory
  - but there was not a huge load of tasks at that time, and the load was much higher at other times with no issue
- chronyd isn't running (that can make the system clock drift, but it should not be an issue for the VM alone, as all work is done locally)
- just an idea: see foreman-ssl_access_ssl.log-20211010 :

192.168.0.1 - admin [08/Oct/2021:18:50:55 -0400] "GET /pulp/api/v2/tasks/19c26c6d-a092-459e-995b-7e15192fc5d1/ HTTP/1.1" 200 409 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:50:55 -0400] "GET /pulp/api/v2/tasks/2b4a4f81-0a51-4075-8560-2dcf2830d831/ HTTP/1.1" 200 411 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:50:55 -0400] "GET /pulp/api/v2/tasks/39b9ac87-5381-4fa1-b1cc-ffa2af404286/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:50:55 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 347 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:50:57 -0400] "GET /pulp/api/v2/tasks/19c26c6d-a092-459e-995b-7e15192fc5d1/ HTTP/1.1" 200 409 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:50:58 -0400] "GET /pulp/api/v2/tasks/39b9ac87-5381-4fa1-b1cc-ffa2af404286/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:50:58 -0400] "GET /pulp/api/v2/tasks/2b4a4f81-0a51-4075-8560-2dcf2830d831/ HTTP/1.1" 200 411 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:50:58 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:00 -0400] "GET /pulp/api/v2/tasks/19c26c6d-a092-459e-995b-7e15192fc5d1/ HTTP/1.1" 200 543 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:00 -0400] "GET /pulp/api/v2/tasks/2b4a4f81-0a51-4075-8560-2dcf2830d831/ HTTP/1.1" 200 546 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:00 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:00 -0400] "GET /pulp/api/v2/tasks/39b9ac87-5381-4fa1-b1cc-ffa2af404286/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - - [08/Oct/2021:18:50:56 -0400] "GET /foreman_tasks/api/tasks/f1a84b26-106f-440e-b499-4886f3d90dd9 HTTP/1.1" 200 513 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:02 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:02 -0400] "GET /pulp/api/v2/tasks/39b9ac87-5381-4fa1-b1cc-ffa2af404286/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:04 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:05 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:06 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:09 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:10 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:11 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:16 -0400] "GET /pulp/api/v2/tasks/28729cb8-3cf7-4d4c-bb69-d63a61d444df/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:03 -0400] "GET /pulp/api/v2/tasks/39b9ac87-5381-4fa1-b1cc-ffa2af404286/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - admin [08/Oct/2021:18:51:24 -0400] "GET /pulp/api/v2/tasks/39b9ac87-5381-4fa1-b1cc-ffa2af404286/ HTTP/1.1" 200 412 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"
192.168.0.1 - - [08/Oct/2021:18:51:02 -0400] "GET /foreman_tasks/api/tasks/f1a84b26-106f-440e-b499-4886f3d90dd9 HTTP/1.1" 200 513 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.9p229"

Timestamps are not strictly increasing, so this was happening:
- at 18:51:03, a sidekiq worker queried task 39b9ac87, but pulp did not respond immediately
- at 18:51:16, a sidekiq worker (the same one but a different thread, I guess) queried task 28729cb8 and got a response
- only after 18:51:16 was the response to the first query received (as that log line appears _below_ the previous one)

Can't this confuse the sidekiq worker somehow? (It should not, as the tasks are processed by different threads, but maybe they synchronize when writing to postgres? I don't know.)

Adam, does the above sound like a possible cause of the problem? (If so, I could try playing with networking to mimic this more often and reproduce it.)

Comment 9 Adam Ruzicka 2021-10-25 14:44:30 UTC
> Why first polls are less frequent, then more frequent and then they are cut?

In foreman-ssl_access_ssl.log-20211010 we can see pulp being polled for several different tasks at roughly the same times, which could explain why it was slow at first but gradually got faster. Also worth noting is that each of the intervals is used 5 times before moving on to the next, so most likely all of the polls are still being attempted at a 0.5 second frequency, but the actual polling delays them.
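
To make that schedule concrete, here is a minimal sketch (not the actual Dynflow source; the poll_interval helper and the five-attempts-per-step grouping are assumptions for illustration) of how the delay between polls could be derived from the poll_intervals list quoted in comment #8:

```
# Minimal illustrative sketch; not the real Dynflow polling code.
def poll_intervals
  [0.5, 1, 2, 4, 8, 16]
end

# Assumed helper: pick the interval for a given attempt number, reusing each
# interval five times and sticking with the last one afterwards. `multiplier`
# stands in for the foreman_tasks_polling_multiplier setting (default 1).
def poll_interval(attempt, multiplier = 1)
  index = [attempt / 5, poll_intervals.length - 1].min
  poll_intervals[index] * multiplier
end

# Attempts 0-4 wait 0.5 s, 5-9 wait 1 s, and so on; from attempt 25 onwards
# every poll waits 16 s (times the multiplier).
(0..30).step(5) { |n| puts "attempt #{n}: #{poll_interval(n)} s" }
```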

> Timestamps are not strictly increasing - so this was happening:

It looks like pulp logs the request when it finishes, but with the timestamp of when it came in. The timestamps seem to be increasing when scoped to specific UUIDs, which should be fine as dynflow tracks those separately.

> Can't this confuse sidekiq worker, somehow?

It should not.

> does the above sound as a possible cause of some problem?

Not really, I'd say this is happening all the time out in the wild.

From the data we have available, it seems the last poll event got "lost". Usually this would mean one of the worker/redis/orchestrator processes got killed, but the logs don't support this.

Comment 10 Jan Hutař 2021-10-25 19:25:02 UTC
One more thing: IIRC this got "fixed" (the task in Dynflow finished) when I restarted Satellite. This would support the theory of some process getting killed.

What task is responsible for checking on abandoned tasks?

Maybe some systemd journal logs were just rotated away?

Comment 11 Adam Ruzicka 2021-10-26 08:47:57 UTC
dynflow-sidekiq@orchestrator looks for things to do when it starts up. Chances are it picked this task up, resumed it, and the task then did one round of polling and finished properly.

Comment 12 Pavel Moravec 2021-10-26 08:53:30 UTC
(In reply to Adam Ruzicka from comment #11)
> dynflow-sidekiq@orchestrator looks for things to do when it starts up.
> Chances are it picked this task up, resumed it, the task then did one round
> of polling and finished properly.

I would expect such polling to be logged in /var/log/httpd/foreman-ssl_access_ssl.log*, but that contains just the above-mentioned polling. However, I expect the orchestrator restart happened *after* foreman-debug was generated (the only place where I examine logs).

Comment 13 Jean-Baptiste Dancre 2021-10-28 09:06:04 UTC
Hi,

There is a chance that this sidekiq polling issue also occurs with other pulp jobs/actions:
This is a sync for a RHEL repo, nothing fancy. Sync is done every day.

The following task was observed during a repo sync:
4: Actions::Pulp::Repository::Sync (success) [ 29814.64s / 130.16s ]
Queue: default
Started at: 2021-10-27 23:30:58 UTC
Ended at: 2021-10-28 07:47:52 UTC
Real time: 29814.64s
Execution time (excluding suspended state): 130.16s

Input:
---
source_url: 
repo_id: 38527
smart_proxy_id: 1
remote_user: admin
remote_cp_user: admin
current_request_id: 
current_timezone: Asia/Calcutta
current_user_id: 1
current_organization_id: 155
current_location_id: 

Output:
[...]
---
pulp_tasks:
- exception: 
[...]
- pulp:repository:36d9a5c3-90cc-4a06-87a7-74c8e818163c
  - pulp:action:sync
  finish_time: '2021-10-28T02:42:35Z'
  _ns: task_status
  start_time: '2021-10-28T01:16:26Z'
[...]
progress_report:
    yum_importer:
      content:
        items_total: 0
        state: FINISHED
        error_details: []
[...]
queue: reserved_resource_worker-0
[...]
error_message: 
    summary:
      modules:
        state: FINISHED
      content:
        state: FINISHED
      comps:
        state: FINISHED
      purge_duplicates:
        state: FINISHED
      distribution:
        state: FINISHED
      errata:
        state: FINISHED
      metadata:
        state: FINISHED
    added_count: 3
    removed_count: 0
    updated_count: 0
    id: 617a0e1be351eb40883cadc0
[...]
- pulp:action:publish
  finish_time: '2021-10-28T07:44:44Z'
  _ns: task_status
  start_time: '2021-10-28T07:15:45Z'
  traceback: 
  spawned_tasks: []
  progress_report:
    36d9a5c3-90cc-4a06-87a7-74c8e818163c:
    - num_success: 1
[...]
contents_changed: true
poll_attempts:
  total: 461
  failed: 0

Comment 14 Adam Ruzicka 2021-11-01 07:43:45 UTC
Do you have debug logs from the time when it happened, by any chance?

Comment 15 Pavel Moravec 2021-11-01 08:36:28 UTC
(In reply to Adam Ruzicka from comment #14)
> Do you have debug logs from the time when it happend by any chance?

A whole sosreport (or at least foreman-debug) would be welcome, as either you have an increased polling multiplier or the polling hung at some point:

> Real time: 29814.64s
> poll_attempts:
>  total: 461

That means more than a minute between polling attempts (29814.64 s / 461 ≈ 65 s per poll), while with default settings it should be every 16 s.

I expect the polling multiplier has been increased (to 4 or 5, I speculate?); anyway, it is also worth having the configs / whole sosreport to troubleshoot further.
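
As a quick sanity check of those numbers (illustrative arithmetic only, using the figures quoted above):

```
# Back-of-the-envelope check of the values reported in comment 13.
real_time     = 29_814.64   # "Real time" reported for the Sync action, in seconds
poll_attempts = 461         # "poll_attempts: total" from the same output
avg_interval  = real_time / poll_attempts
puts format("average poll interval: %.1f s", avg_interval)       # => ~64.7 s
puts format("implied multiplier:   ~%.1f", avg_interval / 16.0)  # => ~4.0 vs. the 16 s default
```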

Please open a new support case (and refer to this BZ) as the right place to let us investigate this.

Comment 17 Pavel Moravec 2021-11-10 13:13:08 UTC
(In reply to Jean-Baptiste Dancre from comment #13)
> There is a chance that this sidekiq polling issue also occurs with other
> pulp jobs/actions:
> [...]
> 4: Actions::Pulp::Repository::Sync (success) [ 29814.64s / 130.16s ]
> Real time: 29814.64s
> Execution time (excluding suspended state): 130.16s
> [...]
> poll_attempts:
>   total: 461
>   failed: 0

This was settled in the support case that was opened: too frequent polling of too many pulp tasks by too few dynflow workers. I.e., the dynflow polling multiplier should be increased, and another sidekiq worker service should be added.
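
For reference, a minimal sketch of the settings side (assuming the standard Foreman Setting API available in foreman-rake console; the exact installer options for adding another dynflow-sidekiq worker instance vary by release, so only the multiplier is shown here):

```
# Illustrative only; run inside foreman-rake console on the Satellite.
# Assumes the foreman_tasks_polling_multiplier setting provided by foreman-tasks.
Setting[:foreman_tasks_polling_multiplier]       # current value (default 1)
Setting[:foreman_tasks_polling_multiplier] = 5   # stretch every polling interval 5x
```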

Comment 25 Adam Ruzicka 2023-08-15 10:36:15 UTC
I'm afraid I'll have to go ahead and close this. We haven't been able to find anything conclusive so far, and I don't recall anyone complaining about this sort of issue on currently supported releases. There have been a lot of changes since this BZ was opened; chances are this got smoothed out with them. If anyone feels like this still deserves more of our attention or comes up with a reliable reproducer, feel free to reopen and we'll take a look at it once more.