Bug 1945534 - Pulp resource manager stops assigning tasks to the workers
Summary: Pulp resource manager stops assigning tasks to the workers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Qpid
Version: 6.8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: 6.10.0
Assignee: Mike Cressman
QA Contact: Jitendra Yejare
URL:
Whiteboard:
Depends On: 1981883
Blocks:
 
Reported: 2021-04-01 08:33 UTC by Hao Chang Yu
Modified: 2024-12-20 19:50 UTC
CC List: 19 users

Fixed In Version: qpid-cpp-1.36.0-32.el7_9amq
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1973368
Environment:
Last Closed: 2021-11-16 14:10:29 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Apache JIRA QPID-8527 0 None None None 2021-05-20 14:42:12 UTC
Red Hat Knowledge Base (Solution) 5911931 0 None None None 2021-04-07 10:26:44 UTC
Red Hat Product Errata RHSA-2021:4702 0 None None None 2021-11-16 14:10:42 UTC

Description Hao Chang Yu 2021-04-01 08:33:24 UTC
Description of problem:
After processing a number of tasks (maybe hundreds), the resource manager stops assigning tasks to the workers. This causes repo sync tasks and content view publishing to get stuck in the "waiting on pulp to start the task" state.


All workers have 0 messages queued, but the "resource_manager" queue has many:
-------------------------------------------
qpid-stat -q --ssl-certificate=/etc/pki/pulp/qpid/client.crt -b amqps://localhost:5671
  celery                                            Y                      0   405    405       0    346k     346k        8     2
  reserved_resource_worker-0            Y                      0  1.13k  1.13k      0   10.1m    10.1m        1     2
  reserved_resource_worker-1.pidbox       Y                 0  1.24k  1.24k      0    699k     699k        1     2
  reserved_resource_worker-1            Y                      0   238    238       0   2.56m    2.56m        1     2
  reserved_resource_worker-2.pidbox       Y                 0  1.24k  1.24k      0    699k     699k        1     2
  reserved_resource_worker-2            Y                      0   216    216       0   3.01m    3.01m        1     2
  reserved_resource_worker-3.pidbox       Y                 0  1.24k  1.24k      0    699k     699k        1     2
  reserved_resource_worker-3            Y                      0  1.23k  1.23k      0   12.6m    12.6m        1     2
  reserved_resource_worker-4.pidbox       Y                 0  1.24k  1.24k      0    699k     699k        1     2
  reserved_resource_worker-4            Y                      0  2.11k  2.11k      0   18.7m    18.7m        1     2
  reserved_resource_worker-5.pidbox       Y                 0  1.24k  1.24k      0    699k     699k        1     2
  reserved_resource_worker-5            Y                      0   382    382       0   3.22m    3.22m        1     2
  reserved_resource_worker-6.pidbox       Y                 0  1.24k  1.24k      0    699k     699k        1     2
  reserved_resource_worker-6            Y                      0   360    360       0   3.92m    3.92m        1     2
  reserved_resource_worker-7.pidbox       Y                 0  1.24k  1.24k      0    699k     699k        1     2
  reserved_resource_worker-7            Y                      0   194    194       0   3.50m    3.50m        1     2
  resource_manager                                  Y                    307  3.23k  2.93k   9.62m  65.1m    55.5m        1     2
--------------------------------------------------

Normally, we should see the following pair of log lines when the resource_manager is working:
--------------------------------------------------
15:58:00 example pulp: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._queue_reserved_task[af42ba10-b6ea-43b1-b718-c4637cd9d3a1]  
15:58:00 example pulp: celery.app.trace:INFO: [af42ba10] Task pulp.server.async.tasks._queue_reserved_task[af42ba10-b6ea-43b1-b718-c4637cd9d3a1] succeeded in 0.0235411839094s: None

The resource manager received the last 2 tasks and then got stuck (no "succeeded in" line follows):
--------------------------------------------------
15:58:00 example pulp: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._queue_reserved_task[be284240-1e36-4497-870d-8eecd85798ff]  
15:58:00 example pulp: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._queue_reserved_task[7803cb71-973b-425c-a340-8dffb6995cf4]
--------------------------------------------------

All workers are still present in the Mongo database, so no workers are lost:
--------------------------------------------------
# mongo pulp_database --eval "db.workers.find({})"
{ "_id" : "scheduler@example", "last_heartbeat" : ISODate("2021-04-01T06:13:49.722Z") }
{ "_id" : "resource_manager@example", "last_heartbeat" : ISODate("2021-04-01T06:09:30.488Z") }
{ "_id" : "reserved_resource_worker-7@example", "last_heartbeat" : ISODate("2021-04-01T06:09:28.396Z") }
{ "_id" : "reserved_resource_worker-4@example", "last_heartbeat" : ISODate("2021-04-01T06:09:29.710Z") }
{ "_id" : "reserved_resource_worker-6@example", "last_heartbeat" : ISODate("2021-04-01T06:09:29.805Z") }
{ "_id" : "reserved_resource_worker-3@example", "last_heartbeat" : ISODate("2021-04-01T06:09:29.328Z") }
{ "_id" : "reserved_resource_worker-1@example", "last_heartbeat" : ISODate("2021-04-01T06:09:29.829Z") }
{ "_id" : "reserved_resource_worker-0@example", "last_heartbeat" : ISODate("2021-04-01T06:09:30.229Z") }
{ "_id" : "reserved_resource_worker-5@example", "last_heartbeat" : ISODate("2021-04-01T06:09:29.968Z") }
{ "_id" : "reserved_resource_worker-2@example", "last_heartbeat" : ISODate("2021-04-01T06:09:30.349Z") }

celery -A pulp.server.async.app inspect ping
-> reserved_resource_worker-1@example: OK
        pong
-> reserved_resource_worker-6@example: OK
        pong
-> reserved_resource_worker-5@example: OK
        pong
-> reserved_resource_worker-0@example: OK
        pong
-> reserved_resource_worker-4@example: OK
        pong
-> reserved_resource_worker-7@example: OK
        pong
-> reserved_resource_worker-2@example: OK
        pong
-> reserved_resource_worker-3@example: OK
        pong
--------------------------------------------------

Version-Release number of selected component (if applicable):
6.8


Steps to Reproduce:
Unable to reproduce the issue yet


Actual results:
The resource manager stops working.

Expected results:
The resource manager should continue working.

Additional info:
The tasks can proceed after restarting the pulp_resource_manager service.
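As a quick workaround, the resource manager can be restarted as sketched below (assuming the standard Pulp 2 unit name on Satellite 6.8; adjust to your deployment):
--------------------------------------------------
# restart only the stuck resource manager (assumed unit name)
systemctl restart pulp_resource_manager

# or cycle the whole Satellite service stack
satellite-maintain service restart
--------------------------------------------------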

Comment 6 Pavel Moravec 2021-04-06 09:00:23 UTC
FYI, in Hao's reproducer, the pending sent message (waiting for ACK) was:

{'exchange': 'C.dq2', 'routing_key': u'reserved_resource_worker-2.redhat.com', 'task_id': '44d2a4ea-d446-47dc-a029-68aef86ad0cf', 'tags': ['pulp:repository:1-test4-v8_0-e69ebdff-d79d-4da3-bc25-1e916504ab6d', 'pulp:repository:e69ebdff-d79d-4da3-bc25-1e916504ab6d', 'pulp:action:associate']}

This message has not reached qpidd's journals / durable queue for reserved_resource_worker-2.

Comment 7 Pavel Moravec 2021-04-06 11:00:48 UTC
OS swapping / slow I/O seems to be a key factor (or one of several key factors). In one case, a pulp_resource_manager that was restarted after getting stuck hit the same problem again after dispatching roughly 200 messages, while the OS was swapping heavily.

Anyway, I still cannot reproduce it (just) that way.

Comment 11 Pavel Moravec 2021-04-14 17:29:10 UTC
Quick info about the reproducer: put the Pulp tasking system under a flow of "associate units among repos" requests with a _huge_ criteria/filter inside (e.g. "copy this huge list of RPMs from the RHEL 7 repo into a CV repo"). Those requests might need to be interleaved with some small requests to sync or publish a repo.

Then the SSL traffic from the child resource_manager process to qpidd gets stuck after an attempt to send a huge bulk of data. No response is returned, so the client keeps waiting for the ACK of the sent request "worker-0, please associate units ...".

We suspect that SSL encryption is a key factor in the reproducer, along with some randomness (maybe sending some other messages alongside the huge one?).

I will play with the reproducer more.

Comment 13 Pavel Moravec 2021-04-16 11:41:50 UTC
There is something fishy with SSL, probably on the sender side.

When I enable AMQP 0-10 trace logs in qpidd, the following is seen on the connection.

Under normal circumstances, this happens: first, some huge message content is received:
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[Eb; channel=0; content (65523 bytes) \x00
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) WFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhY...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) hYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWF...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) FhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYW...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) WFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhY...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) hYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWF...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) FhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYW...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) WFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhY...]
2021-04-16 12:18:01 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[Ee; channel=0; content (53366 bytes) YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh...]

It is terminated by an "Ee" frame, where the lowercase "e" marks the last frame of the message content (cf. section 4.4.3, Frame Format, of the AMQP 0-10 specification). That is, qpidd knows the transmission of the message payload ends exactly here.

After this, qpidd responds with:
2021-04-16 12:18:02 [Protocol] trace SENT [qpid.[::1]:5671-[::1]:54920]: Frame[BEbe; channel=0; {SessionCompletedBody: commands={ [0,3829] }; }]

which delivers the ACK back to the waiting client (the resource_manager).


BUT when the stuck message happens, I do see this:

2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[Eb; channel=0; content (65523 bytes) \x00\x0E\x0F`\x00\x00\x00\x05\x04body\xA0\x00\x0E
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) WFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhY...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) hYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWF...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) FhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYW...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) WFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhY...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) hYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWF...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) FhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYW...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) WFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhY...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) hYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWF...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) FhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYW...]
2021-04-16 12:18:02 [Protocol] trace RECV [qpid.[::1]:5671-[::1]:54920]: Frame[E; channel=0; content (65523 bytes) WFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhY...]

and then silence.

The "e" frame from the client is missing. And qpidd is waiting on it to reveive the tail of the message, hence even then it could ACKnowledge it back to the client.


So, when sending huge message payloads from the python-qpid client, it sometimes fails to put the last AMQP 0-10 frame(s) on the wire. The problem happens over SSL only - this is a key observation.


I will try to remove pulp from the reproducer, now.

Comment 14 Pavel Moravec 2021-04-16 20:07:26 UTC
Standalone qpid reproducer: 
- basic idea: send messages with randomly huge content over SSL - twice(!) in parallel.
- particular reproducer: 
  - qpidd set to allow SSL connections on port 5671 (i.e. have the nssdb and ssl-cert-* options configured; an example sketch is included after the log settings below)
  - optionally, enable trace logs to see the behaviour from #c13:
log-to-file=/var/lib/qpidd/qpidd.log
log-enable=notice+
log-enable=trace+:qpid::amqp_0_10::Connection
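
For reference, a minimal qpidd SSL configuration sketch for the first bullet (the NSS database path, password file, and certificate nickname are placeholders; use the values from your own broker):
--------------------------------------------------
# /etc/qpid/qpidd.conf (sketch; paths and cert name are placeholders)
auth=no
ssl-port=5671
ssl-cert-db=/etc/pki/katello/nssdb
ssl-cert-password-file=/etc/pki/katello/nssdb/nss_db_password-file
ssl-cert-name=broker
ssl-require-client-authentication=yes
--------------------------------------------------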

  - have a client like the one below, which randomly sends messages with content lengths from 100 up to 710000 bytes (per the msglens list)


--------------->8---------------->8---------------->8------------------
from qpid.messaging import *
from random import randrange
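# NOTE: Python 2 code (qpid.messaging ships for python 2.7 on RHEL 7); run with the python2 interpreter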

msglens = [100, 100, 100, 200, 300, 400, 500, 600, 600, 600, 600, 1000, 2000, 3000, 500000, 510000, 520000, 530000, 540000, 550000, 560000, 570000, 580000, 590000, 600000, 610000, 620000, 630000, 640000, 650000, 660000, 670000, 680000, 690000, 700000, 710000]
MSGS = len(msglens)
msgcontents = []
for i in range(MSGS):
  msgcontents.append("")
  for j in range(msglens[i]/1000):
    msgcontents[i] += "a" * 1000  # append 1000 bytes of payload per iteration

msgcontents[0] = "aaaaaaaaaa"
msgcontents[1] = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"


conn = Connection('amqps://localhost:5671',
                  ssl_certfile='client.crt',
                  ssl_keyfile='client.key')
try:
  conn.open()
  ssn = conn.session()
  while True:
    snd = ssn.sender('amq.direct/nonsense')
    rnd = randrange(MSGS)
    print("sending message of length %s" % msglens[rnd])
    msg = Message(subject='some_subject',
                  content=msgcontents[rnd])
    snd.send(msg)
    snd.close()
except SendError, e:
  print e
except KeyboardInterrupt:
  pass
conn.close()
--------------->8---------------->8---------------->8------------------

  - now, run the client twice, in two terminals

  - wait a minute until one stops printing new logs like:
sending message of length 600
sending message of length 100
sending message of length 3000
sending message of length 650000
sending message of length 500000
sending message of length 600
sending message of length 2000
sending message of length 2000
sending message of length 540000

The stuck client has the following backtrace:
Thread 2  Thread 0x7f41d5b86740 
    f=f@entry=Frame 0x1e324c0, for file /usr/lib/python2.7/site-packages/qpid/compat.py, line 88, in select
    f=f@entry=Frame 0x7f41c8428450, for file /usr/lib/python2.7/site-packages/qpid/compat.py, line 127, in wait
    f=f@entry=Frame 0x7f41c8437050, for file /usr/lib/python2.7/site-packages/qpid/concurrency.py, line 96, in wait
    f=f@entry=Frame 0x7f41c84fed70, for file /usr/lib/python2.7/site-packages/qpid/concurrency.py, line 57, in wait
    f=f@entry=Frame 0x7f41c8410988, for file /usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py, line 252, in _wait
    f=f@entry=Frame 0x7f41c9a99da8, for file /usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py, line 273, in _ewait
    f=f@entry=Frame 0x7f41c81bb050, for file /usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py, line 637, in _ewait
    f=f@entry=Frame 0x7f41c81acda8, for file /usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py, line 928, in _ewait
    f=f@entry=Frame 0x7f41c81bb7f0, for file /usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py, line 1015, in sync
    f=f@entry=Frame 0x1e327d0, for file /usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py, line 1003, in send
#52 PyEval_EvalFrameEx  f=f@entry=Frame 0x1b6e4b0, for file send_huge_msgs_in_loop.py, line 28, in <module> 
Thread 1  Thread 0x7f41c7dc6700 
    f=f@entry=Frame 0x7f41c0020ca0, for file /usr/lib/python2.7/site-packages/qpid/compat.py, line 88, in select
    f=f@entry=Frame 0x7f41c0000b50, for file /usr/lib/python2.7/site-packages/qpid/selector.py, line 152, in run
    f=f@entry=Frame 0x7f41c842db00, for file /usr/lib64/python2.7/threading.py, line 765, in run
    f=f@entry=Frame 0x7f41c0000910, for file /usr/lib64/python2.7/threading.py, line 812, in __bootstrap_inner
    f=f@entry=Frame 0x7f41c8430210, for file /usr/lib64/python2.7/threading.py, line 785, in __bootstrap


Some observations:
- SSL is a must; without SSL, we were unable to reproduce it either in Satellite or with the script above
- _concurrency_ of the clients is a must as well; don't ask me why, since the clients don't affect each other at all - but I was simply unable to reproduce it with one client for a long time, while 2 clients lead to one of them getting stuck within a minute

Comment 16 Cliff Jansen 2021-05-16 19:06:16 UTC
The problem happens when qpidd decides that the input worker's timeslice is up AND libnss has some remaining buffered input AND the peer has nothing further to send at the moment.

In this case, qpidd's request to come back and finish the input work is based on asking the kernel via epoll to provide a read event.

This always works in the non TLS case, since the read event is always provided when there are unread bytes in the socket.

In my testing, this works surprisingly often in the TLS case anyway because, even though the kernel knows nothing about the unread bytes held by libnss, there is usually more content coming on the wire from the peer to get things unstuck.
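
To make that failure mode concrete, here is a minimal sketch of the same pattern (this is not qpidd code; it uses Python's ssl module and epoll purely for illustration, and the function name is ours): an event loop that goes back to epoll while the TLS library still holds decrypted, buffered bytes can wait forever, because the kernel socket buffer is already empty and the peer sends nothing more.
--------------------------------------------------
import select
import ssl

def read_message(conn):
    # conn is an already-connected ssl.SSLSocket
    conn.setblocking(False)
    ep = select.epoll()
    ep.register(conn.fileno(), select.EPOLLIN)
    data = b""
    while not data.endswith(b"\n"):   # arbitrary end-of-message marker
        # Buggy pattern: read one "timeslice" and then wait on epoll again.
        # If the TLS layer still buffers decrypted bytes (conn.pending() > 0)
        # and the peer has nothing more to send, epoll never fires and both
        # sides wait forever -- the hang seen in this bug.
        if ep.poll(timeout=5):
            try:
                data += conn.recv(65536)
            except ssl.SSLWantReadError:
                pass
        # Fixed pattern: drain the TLS library's own buffer before sleeping.
        while conn.pending():
            data += conn.recv(65536)
    return data
--------------------------------------------------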

Thank you for the excellent test case to tease this one out.

Comment 17 Cliff Jansen 2021-05-20 14:42:16 UTC
Fixed upstream: 

  https://issues.apache.org/jira/browse/QPID-8527

Comment 19 Pavel Moravec 2021-05-31 15:32:07 UTC
Testing qpid-cpp 1.36.0-32.el7_9amq:
1) qpid-proton also needs to be bounced (or qpid-cpp forcefully installed, breaking RPM dependencies); I used the Brew build with buildID=1208720. So these packages were tested:

python-qpid-proton-0.31.0-3.el7.x86_64
python-qpid-qmf-1.36.0-32.el7_9amq.x86_64
qpid-cpp-client-1.36.0-32.el7_9amq.x86_64
qpid-cpp-client-devel-1.36.0-32.el7_9amq.x86_64
qpid-cpp-server-1.36.0-32.el7_9amq.x86_64
qpid-cpp-server-linearstore-1.36.0-32.el7_9amq.x86_64
qpid-proton-c-0.31.0-3.el7.x86_64
qpid-qmf-1.36.0-32.el7_9amq.x86_64
qpid-tools-1.36.0-32.el7_9amq.noarch

2) standalone reproducer:
- running two instances in parallel for 5+ minutes - no hang
- running 5 instances in parallel for 5+ minutes - no hang

3) Meanwhile, publishing a CV with many repos (to generate multiple pulp task messages) - no issue

4) Also tested katello-agent (as I bounced the proton packages that qdrouterd uses) - a package install worked well

So from my point of view, the above set of packages fixes the bug well.

I haven't tested the scenario "publishing a CV with filters generates huge qpid messages that get qpidd stuck" - this was reproduced by Hao only.

Hao, could you please (optionally) test that scenario against the above packages? I expect no issue to be found, since the standalone reproducer technically mimics the same thing in a more straightforward and concurrent way.

Comment 22 Jitendra Yejare 2021-07-05 19:04:40 UTC
Do you have a reproducer script to verify this issue and mark it as verified from QE?

Comment 23 Jitendra Yejare 2021-07-12 14:05:59 UTC
Verified!

@Satelite 6.9.4 snap 1.0


Steps:
----------
1. Followed the steps from comment 14, using the standalone script (with SSL cert and key) as the reproducer.


Observation:
--------------
1. Ran the script concurrently (2 instances) for more than 20 minutes without any interruption. I could not reproduce the issue; the script did not get stuck in either concurrent process.
2. No error messages were observed for either process.

Comment 24 Jitendra Yejare 2021-07-12 14:08:50 UTC
Commented as Verified on this bug (6.10) instead of (6.9) ... rolling it back to ON_QA.

Comment 25 Jitendra Yejare 2021-07-14 10:45:23 UTC
Verified!

Same steps and observations as in my accidental verification in comment 23!

Comment 29 errata-xmlrpc 2021-11-16 14:10:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Satellite 6.10 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4702

