Bug 1945534
| Summary: | Pulp resource manager stops assigning tasks to the workers | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Hao Chang Yu <hyu> |
| Component: | Qpid | Assignee: | Mike Cressman <mcressma> |
| Status: | CLOSED ERRATA | QA Contact: | Jitendra Yejare <jyejare> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.8.0 | CC: | ahumbe, arahaman, cjansen, jalviso, Ken.Fowler, kkinge, kupadhya, mkalyat, osousa, pmendezh, pmoravec, rcavalca, risantam, sadas, saydas, smajumda, sraut, wclark, wpinheir |
| Target Milestone: | 6.10.0 | Keywords: | PrioBumpGSS, Triaged |
| Target Release: | Unused | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | qpid-cpp-1.36.0-32.el7_9amq | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1973368 (view as bug list) | Environment: | |
| Last Closed: | 2021-11-16 14:10:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1981883 | | |
| Bug Blocks: | | | |
FYI, in Hao's reproducer the "pending sent message, waiting for ACK" was:
{'exchange': 'C.dq2', 'routing_key': u'reserved_resource_worker-2.redhat.com', 'task_id': '44d2a4ea-d446-47dc-a029-68aef86ad0cf', 'tags': ['pulp:repository:1-test4-v8_0-e69ebdff-d79d-4da3-bc25-1e916504ab6d', 'pulp:repository:e69ebdff-d79d-4da3-bc25-1e916504ab6d', 'pulp:action:associate']}
This message has not reached qpidd's journals / durable queue for reserved_resource_worker-2.
The OS swapping / slow IO seems to be a key factor (or one of several key factors). In one case, a pulp_resource_manager that was restarted after a hang hit the same problem again after dispatching roughly 200 messages, while the OS was swapping heavily. Anyway, I still cannot reproduce it (just) that way.
Quick info about the reproducer: put the pulp tasking system under a flow of "associate units among repos" requests with a _huge_ criteria/filter inside (i.e. "copy from the RHEL7 repo this huuuuge list of RPMs into a CV repo.."). Those requests might need to be interleaved with some "small-sized" requests to sync or publish a repo. Then the SSL traffic from the child resource_manager process to qpidd gets stuck after an attempt to send a huge chunk of data. No response comes back, so the client keeps waiting for the ACK of the sent request "worker-0, please associate units ...". We suspect the SSL encryption is a key factor in the reproducer, along with some randomness (maybe sending some more messages alongside the huge one?). I will play with the reproducer more. There is something fishy with SSL, probably on the sender side.
When I enable AMQP 0-10 trace logs in qpidd, this is what is seen on the connection:
Under a normal situation, this happens: first, some huge message content is received:
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[Eb; channel=0; content (65523 bytes) \x00
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) WFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhY...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) hYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWF...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) FhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYW...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) WFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhY...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) hYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWF...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) FhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYW...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) WFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhY...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[Ee; channel=0; content (53366 bytes) YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh...]
It is terminated by an "Ee" frame, where the lowercase "e" stands for the last frame of the message content (cf. section 4.4.3, Frame Format, of the AMQP 0-10 specification). I.e. qpidd knows the transmission of the message payload ends exactly here.
After this, qpidd responds with:
2021-04-16 12:18:02 [Protocol] trace SENT [qpid.[::1]:5671-[::1]:54920]: Frame[BEbe; channel=0; {SessionCompletedBody: commands={ [0,3829] }; }]
that delivers the ACK back to the waiting client (to the resource_manager).
BUT when a message gets stuck, I see this instead:
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[Eb; channel=0; content (65523 bytes) \x00\x0E\x0F`\x00\x00\x00\x05\x04body\xA0\x00\x0E
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) WFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhY...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) hYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWF...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) FhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYW...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) WFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhY...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) hYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWF...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) FhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYW...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) WFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhY...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) hYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWF...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) FhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYW...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) WFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhY...]
and then silence.
The "e" frame from the client is missing. And qpidd is waiting on it to reveive the tail of the message, hence even then it could ACKnowledge it back to the client.
So, when sending some huge message payloads from the python-qpid client, it sometimes forgets to send to the wire the latest AMQP0-10 frame(s). The problem happens on SSL only - this is a key observation.
I will try to remove pulp from the reproducer, now.
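As a side note (an illustrative sketch, not part of the original analysis): given the trace format above, the stuck condition can be spotted mechanically - every content segment opened by a frame carrying the 'b' flag should eventually be closed by a frame carrying the 'e' flag on the same connection and channel. A minimal Python scan of the qpidd trace log, assuming the /var/lib/qpidd/qpidd.log path from the logging options below and the exact Frame[...] trace format shown above:
--------------->8---------------->8---------------->8------------------
# Report AMQP 0-10 content segments that were started (flag 'b') but never
# finished (flag 'e') in a qpidd trace log - the symptom shown above.
import re
import sys

# Matches trace lines like:
# ... trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[Eb; channel=0; content (65523 bytes) ...
FRAME = re.compile(r'trace (RECV|SENT) \[(.*?)\]: Frame\[([A-Za-z]*); channel=(\d+); content')

logfile = sys.argv[1] if len(sys.argv) > 1 else '/var/lib/qpidd/qpidd.log'
open_segments = {}  # (connection, channel) -> line that opened the still-unfinished segment

with open(logfile) as log:
    for line in log:
        m = FRAME.search(line)
        if not m or m.group(1) != 'RECV':
            continue
        key = (m.group(2), m.group(4))
        flags = m.group(3)
        if 'b' in flags:      # first frame of a content segment
            open_segments[key] = line.rstrip()
        if 'e' in flags:      # last frame - the segment completed normally
            open_segments.pop(key, None)

# Whatever is left either never got its final 'e' frame (the bug) or is
# simply still in flight if the log ends mid-transfer.
for (conn, chan), first in sorted(open_segments.items()):
    print("unterminated content segment on %s channel %s, opened by:\n  %s" % (conn, chan, first))
--------------->8---------------->8---------------->8------------------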
Standalone qpid reproducer:
- basic idea: send messages with randomly huge content over SSL - with two(!) clients in parallel.
- particular reproducer:
- qpidd set to allow SSL connections on port 5671 (i.e. have nssdb and ssl-cert-* options - copy&paste config to follow)
- optionally, enable trace logs to see the behaviour from #c13:
log-to-file=/var/lib/qpidd/qpidd.log
log-enable=notice+
log-enable=trace+:qpid::amqp_0_10::Connection
- have a client like the one below, which sends messages with randomly chosen content lengths between 100 and 710000 (per the msglens list)
--------------->8---------------->8---------------->8------------------
from qpid.messaging import *
from random import randrange
msglens = [100, 100, 100, 200, 300, 400, 500, 600, 600, 600, 600, 1000, 2000, 3000, 500000, 510000, 520000, 530000, 540000, 550000, 560000, 570000, 580000, 590000, 600000, 610000, 620000, 630000, 640000, 650000, 660000, 670000, 680000, 690000, 700000, 710000]
MSGS = len(msglens)
msgcontents = []
for i in range(MSGS):
    msgcontents.append("")
    for j in range(msglens[i] / 1000):
        msgcontents[i] += "a" * 1000  # append a 1000-character chunk of 'a's
msgcontents[0] = "aaaaaaaaaa"
msgcontents[1] = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
conn = Connection('amqps://localhost:5671',
                  ssl_certfile='client.crt',
                  ssl_keyfile='client.key')
try:
    conn.open()
    ssn = conn.session()
    while True:
        snd = ssn.sender('amq.direct/nonsense')
        rnd = randrange(MSGS)
        print("sending message of length %s" % msglens[rnd])
        msg = Message(subject='some_subject',
                      content=msgcontents[rnd])
        snd.send(msg)
        snd.close()
except SendError, e:
    print e
except KeyboardInterrupt:
    pass
conn.close()
--------------->8---------------->8---------------->8------------------
- now, run the client twice, in two terminals
- wait a minute until one of them stops printing new lines like:
sending message of length 600
sending message of length 100
sending message of length 3000
sending message of length 650000
sending message of length 500000
sending message of length 600
sending message of length 2000
sending message of length 2000
sending message of length 540000
This client is stuck with the backtrace:
Thread 2 Thread 0x7f41d5b86740
f=f@entry=Frame 0x1e324c0, for file /usr/lib/python2.7/site-packages/qpid/compat.py, line 88, in select
f=f@entry=Frame 0x7f41c8428450, for file /usr/lib/python2.7/site-packages/qpid/compat.py, line 127, in wait
f=f@entry=Frame 0x7f41c8437050, for file /usr/lib/python2.7/site-packages/qpid/concurrency.py, line 96, in wait
f=f@entry=Frame 0x7f41c84fed70, for file /usr/lib/python2.7/site-packages/qpid/concurrency.py, line 57, in wait
f=f@entry=Frame 0x7f41c8410988, for file /usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py, line 252, in _wait
f=f@entry=Frame 0x7f41c9a99da8, for file /usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py, line 273, in _ewait
f=f@entry=Frame 0x7f41c81bb050, for file /usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py, line 637, in _ewait
f=f@entry=Frame 0x7f41c81acda8, for file /usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py, line 928, in _ewait
f=f@entry=Frame 0x7f41c81bb7f0, for file /usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py, line 1015, in sync
f=f@entry=Frame 0x1e327d0, for file /usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py, line 1003, in send
#52 PyEval_EvalFrameEx f=f@entry=Frame 0x1b6e4b0, for file send_huge_msgs_in_loop.py, line 28, in <module>
Thread 1 Thread 0x7f41c7dc6700
f=f@entry=Frame 0x7f41c0020ca0, for file /usr/lib/python2.7/site-packages/qpid/compat.py, line 88, in select
f=f@entry=Frame 0x7f41c0000b50, for file /usr/lib/python2.7/site-packages/qpid/selector.py, line 152, in run
f=f@entry=Frame 0x7f41c842db00, for file /usr/lib64/python2.7/threading.py, line 765, in run
f=f@entry=Frame 0x7f41c0000910, for file /usr/lib64/python2.7/threading.py, line 812, in __bootstrap_inner
f=f@entry=Frame 0x7f41c8430210, for file /usr/lib64/python2.7/threading.py, line 785, in __bootstrap
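For context (my reading of the backtrace, not an upstream statement): the sending thread is parked in qpid.messaging's endpoints.py sync path, i.e. Sender.send() is waiting for the broker to complete a command whose final frame never made it onto the wire. If you want the reproducer to fail loudly rather than hang, the blocking wait can be bounded - a hedged tweak to the script above, assuming the timeout keyword of Sender.send() raises Timeout on expiry the way other blocking calls in qpid.messaging do:
--------------->8---------------->8---------------->8------------------
from qpid.messaging.exceptions import Timeout

# Drop-in replacement for the plain snd.send(msg) call in the reproducer:
# bound the synchronous wait so a wedged SSL connection surfaces as an
# exception instead of an indefinite hang inside endpoints.py.
try:
    snd.send(msg, sync=True, timeout=60)   # seconds
except Timeout:
    print("send did not complete within 60s - the connection looks stuck")
    raise
--------------->8---------------->8---------------->8------------------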
Some observations:
- SSL is a must; without SSL, we were unable to reproduce the problem either in Satellite or with the script above
- _concurrency_ of the clients is a must as well; don't ask me why, since the clients don't affect each other at all - but I was simply unable to reproduce it with one client for a long time, while 2 clients led to one of them getting stuck within a minute
The problem happens when qpidd decides that the input worker's timeslice is up AND libnss has some remaining buffered input AND the peer has nothing further to send at the moment. In this case, qpidd's request to come back and finish the input work is based on asking the kernel via epoll to provide a read event. This always works in the non-TLS case, since the read event is always provided when there are unread bytes in the socket. In my testing, this works surprisingly often in the TLS case anyway because, even though the kernel knows nothing about the unread bytes held by libnss, there is usually more content coming on the wire from the peer to get things unstuck. Thank you for the excellent test case to tease this one out.

Fixed upstream: https://issues.apache.org/jira/browse/QPID-8527

Testing qpid-cpp 1.36.0-32.el7_9amq:
1) qpid-proton also needs to be bounced (or qpid-cpp forcefully installed, breaking rpm dependencies); I used the brew buildID=1208720 build. So these packages were tested:
python-qpid-proton-0.31.0-3.el7.x86_64
python-qpid-qmf-1.36.0-32.el7_9amq.x86_64
qpid-cpp-client-1.36.0-32.el7_9amq.x86_64
qpid-cpp-client-devel-1.36.0-32.el7_9amq.x86_64
qpid-cpp-server-1.36.0-32.el7_9amq.x86_64
qpid-cpp-server-linearstore-1.36.0-32.el7_9amq.x86_64
qpid-proton-c-0.31.0-3.el7.x86_64
qpid-qmf-1.36.0-32.el7_9amq.x86_64
qpid-tools-1.36.0-32.el7_9amq.noarch
2) Standalone reproducer:
- running twice in parallel for 5+ minutes - no stuck client
- running 5 times in parallel for 5+ minutes - no stuck client
3) Meanwhile, publishing a CV with many repos (to generate multiple pulp task messages) - no issue
4) Also tested katello-agent (as I bounced the proton that qdrouterd uses) - a package install worked well

So from my point of view, the above set of packages fixes the bug well. I haven't tested the scenario "publishing a CV with filters generates huge qpid messages that get qpidd stuck" - this was reproduced by Hao only. Hao, could you please (optionally) test that scenario against the above packages? I expect no issue to be found, as the standalone reproducer technically mimics the same thing in a more straightforward and concurrent way.

Do you have a reproducer script to verify this issue and mark it as verified from QE?

Verified! @ Satellite 6.9.4 snap 1.0

Steps:
----------
1. Steps from comment 14, with the standalone script (with SSL cert and key) and reproducer.

Observation:
--------------
1. Ran the script concurrently (2 instances) for more than 20 minutes without any interruption. I could not reproduce the issue; the script did not get stuck in either concurrent process.
2. No error messages were observed for either process.

Commented as Verified on this bug (6.10) instead of (6.9) ... rolling it back to ON_QA.

Verified! Similar steps and observations as in my accidental (Verification) comment, comment 23!

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Satellite 6.10 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4702
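To illustrate the class of bug described in the root-cause comment above (this is only a Python sketch of the same pitfall, not qpidd's actual C++ code): with TLS, the security library may already hold received-but-unread bytes that the kernel no longer reports, so a reader that parks itself in epoll/select before asking the TLS layer for data can stall exactly as described. The safe pattern is to try the TLS read first and only wait on the socket when the TLS layer says it truly needs more wire data. tls_sock below is a hypothetical, already-connected ssl.SSLSocket:
--------------->8---------------->8---------------->8------------------
import select
import ssl


def read_exactly(tls_sock, nbytes):
    """Read nbytes from a non-blocking ssl.SSLSocket without the trap fixed
    by QPID-8527: select()/epoll only reports bytes still sitting in the
    kernel socket buffer, not bytes the TLS library has already read."""
    tls_sock.setblocking(False)
    data = b""
    while len(data) < nbytes:
        try:
            # Always ask the TLS layer first - it may be able to return data
            # even though the kernel has nothing new to report.
            chunk = tls_sock.recv(min(65536, nbytes - len(data)))
        except ssl.SSLWantReadError:
            # Only now is it safe to sleep until the socket is readable.
            select.select([tls_sock], [], [])
            continue
        if not chunk:
            raise RuntimeError("peer closed the connection")
        data += chunk
    return data
--------------->8---------------->8---------------->8------------------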
Description of problem:
After processing a number of tasks (maybe hundreds), the resource manager stops assigning tasks to the workers. This causes repo sync and content view publish tasks to get stuck in the "waiting on pulp to start the task" state. All workers have '0' msg, but the "resource_manager" queue has many msg.

--------------------------------------------------
qpid-stat -q --ssl-certificate=/etc/pki/pulp/qpid/client.crt -b amqps://localhost:5671
celery                              Y    0    405    405      0  346k   346k   8  2
reserved_resource_worker-0          Y    0  1.13k  1.13k      0  10.1m  10.1m  1  2
reserved_resource_worker-1.pidbox   Y    0  1.24k  1.24k      0  699k   699k   1  2
reserved_resource_worker-1          Y    0    238    238      0  2.56m  2.56m  1  2
reserved_resource_worker-2.pidbox   Y    0  1.24k  1.24k      0  699k   699k   1  2
reserved_resource_worker-2          Y    0    216    216      0  3.01m  3.01m  1  2
reserved_resource_worker-3.pidbox   Y    0  1.24k  1.24k      0  699k   699k   1  2
reserved_resource_worker-3          Y    0  1.23k  1.23k      0  12.6m  12.6m  1  2
reserved_resource_worker-4.pidbox   Y    0  1.24k  1.24k      0  699k   699k   1  2
reserved_resource_worker-4          Y    0  2.11k  2.11k      0  18.7m  18.7m  1  2
reserved_resource_worker-5.pidbox   Y    0  1.24k  1.24k      0  699k   699k   1  2
reserved_resource_worker-5          Y    0    382    382      0  3.22m  3.22m  1  2
reserved_resource_worker-6.pidbox   Y    0  1.24k  1.24k      0  699k   699k   1  2
reserved_resource_worker-6          Y    0    360    360      0  3.92m  3.92m  1  2
reserved_resource_worker-7.pidbox   Y    0  1.24k  1.24k      0  699k   699k   1  2
reserved_resource_worker-7          Y    0    194    194      0  3.50m  3.50m  1  2
resource_manager                    Y  307  3.23k  2.93k  9.62m  65.1m  55.5m  1  2
--------------------------------------------------

Normally, we should see the following pair of log lines when the resource_manager is working:
--------------------------------------------------
15:58:00 example pulp: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._queue_reserved_task[af42ba10-b6ea-43b1-b718-c4637cd9d3a1]
15:58:00 example pulp: celery.app.trace:INFO: [af42ba10] Task pulp.server.async.tasks._queue_reserved_task[af42ba10-b6ea-43b1-b718-c4637cd9d3a1] succeeded in 0.0235411839094s: None
--------------------------------------------------

The resource manager received the last 2 tasks and then got stuck (no "succeeded in"):
--------------------------------------------------
15:58:00 example pulp: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._queue_reserved_task[be284240-1e36-4497-870d-8eecd85798ff]
15:58:00 example pulp: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._queue_reserved_task[7803cb71-973b-425c-a340-8dffb6995cf4]
--------------------------------------------------

All workers are still in the mongo database, so no workers are lost.
--------------------------------------------------
# mongo pulp_database --eval "db.workers.find({})"
{ "_id" : "scheduler@example", "last_heartbeat" : ISODate("2021-04-01T06:13:49.722Z") }
{ "_id" : "resource_manager@example", "last_heartbeat" : ISODate("2021-04-01T06:09:30.488Z") }
{ "_id" : "reserved_resource_worker-7@example", "last_heartbeat" : ISODate("2021-04-01T06:09:28.396Z") }
{ "_id" : "reserved_resource_worker-4@example", "last_heartbeat" : ISODate("2021-04-01T06:09:29.710Z") }
{ "_id" : "reserved_resource_worker-6@example", "last_heartbeat" : ISODate("2021-04-01T06:09:29.805Z") }
{ "_id" : "reserved_resource_worker-3@example", "last_heartbeat" : ISODate("2021-04-01T06:09:29.328Z") }
{ "_id" : "reserved_resource_worker-1@example", "last_heartbeat" : ISODate("2021-04-01T06:09:29.829Z") }
{ "_id" : "reserved_resource_worker-0@example", "last_heartbeat" : ISODate("2021-04-01T06:09:30.229Z") }
{ "_id" : "reserved_resource_worker-5@example", "last_heartbeat" : ISODate("2021-04-01T06:09:29.968Z") }
{ "_id" : "reserved_resource_worker-2@example", "last_heartbeat" : ISODate("2021-04-01T06:09:30.349Z") }

celery -A pulp.server.async.app inspect ping
-> reserved_resource_worker-1@example: OK pong
-> reserved_resource_worker-6@example: OK pong
-> reserved_resource_worker-5@example: OK pong
-> reserved_resource_worker-0@example: OK pong
-> reserved_resource_worker-4@example: OK pong
-> reserved_resource_worker-7@example: OK pong
-> reserved_resource_worker-2@example: OK pong
-> reserved_resource_worker-3@example: OK pong
--------------------------------------------------

Version-Release number of selected component (if applicable): 6.8

Steps to Reproduce:
Unable to reproduce the issue yet

Actual results:
The resource manager stops working.

Expected results:
The resource manager should continue working.

Additional info:
The tasks can proceed after restarting pulp_resource_manager.