Bug 1288223 - pulp task getting stuck even on python-qpid-0.30-7
Status: CLOSED ERRATA
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Pulp
Version: 6.1.4
Hardware: All Linux
Priority: urgent
Severity: urgent (1 vote)
Target Milestone: Unspecified
Target Release: --
Assigned To: Katello Bug Bin
QA Contact: Katello QA List
Keywords: Triaged
Depends On:
Blocks: 1122832
Reported: 2015-12-03 16:51 EST by Pavel Moravec
Modified: 2017-07-26 15:46 EDT
CC List: 13 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-25 12:49:12 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
The last task processed by the resource manager from reproducer machine (124.65 KB, text/plain)
2015-12-18 18:15 EST, Brian Bouterse
katello-debug.sh (3.39 KB, application/x-shellscript)
2016-01-21 01:43 EST, Mike McCune
qpid-core-dump (304 bytes, application/x-shellscript)
2016-01-21 01:44 EST, Mike McCune


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2044393 None None None 2016-01-22 03:07 EST

Description Pavel Moravec 2015-12-03 16:51:05 EST
Description of problem:
The same or very similar symptoms as in bz1279502 are seen even after applying the hotfix from that bz - i.e. even on python-qpid-0.30-7.

pulp tasks are not being executed; the pulp<->qpidd communication gets frozen.

qpid-stat -q shows the resource_manager queue depth growing bigger and bigger.

qpid-stat -u shows that all (or at least a large portion of) the resource_manager queue messages are acquired (sessUnacked column) by the resource_manager process.

From some point on, no message in the queue is accepted and the queue depth does not decrease at all.

Usually, some or even all worker processes are stuck in the same way: their qpid queues contain more and more messages, all acquired, with no progress.


Version-Release number of selected component (if applicable):
Sat 6.1.4 plus python-qpid-0.30-7


How reproducible:
??? not known now


Steps to Reproduce:
??? not known now


Actual results:
see above description


Expected results:
No such hang should happen.


Additional info:
The data below is to be provided by a customer (once available):

- output of commands:
qpid-stat -q --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671
qpid-stat -u --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671
qpid-stat -c --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671
ps -awfux
ps -efLm

- coredumps of all celery processes:
for pid in $(ps -awfux| grep celery | grep "@" | awk '{ print $2 }'); do gcore $pid; done
Comment 6 Brian Bouterse 2015-12-18 16:16:50 EST
There are a lot of exceptions here, and in normal operation I experience none. This causes me to think these problems are environmental. I'll write some responses here regarding the exceptions identified in the summary:

Dec  1 09:51:02 ffurhelsat01 pulp: celery.worker.consumer:WARNING: (14239-08864) ConnectError: [Errno 111] Connection refused
^ I think celery cannot get the asyncpool queues for IPC that it uses to send messages to the child. I'm not sure why


Error in timer: TypeError("bad operand type for unary -: 'NoneType'",)
^ here 'status' is None. I'm not sure why.


Task pulp.server.managers.repo.sync.sync[46ede5d2-009b-4000-acd6-315997a67e64] raised unexpected: CancelException()
^ This is normal: when a celery task receives a cancel event for a currently running task, it exits by raising a CancelException. This is emitted from pulp code. This looks normal.


Task pulp.server.managers.repo.publish.publish[d732837f-aaa8-43b1-9f8b-c217a150c181] raised unexpected: TypeError('cancel_publish_repo() takes exactly 3 arguments (1 given)',)
^ This is a known issue that has been fixed in upstream pulp. It is a mostly benign error which will have little impact: https://pulp.plan.io/issues/1097


IOError: [Errno 32] Broken pipe
^ This looks bad, and I believe it is a sign that the environment is broken. I never see these in a healthy runtime environment.


IOError: request data read error
^ This is in httpd, and isn't even related to Qpid. The inbound request cannot be read for some reason. It's not that the data is unexpected. This says it literally cannot be read.

All the other exceptions look like duplicates or variations on these exceptions.
Comment 7 Brian Bouterse 2015-12-18 18:14:50 EST
mmccune has been able to reproduce the stuck resource manager issue internally on multiple satellite machines. The hangs occur about a day after sat6 has been doing work continuously. On one EL7 machine he turned logging up to DEBUG and reproduced it. These logs have proven to be very interesting.

== How we know this machine had a problem ==
He anecdotally observed that sat6 was "stuck". Then, listing `qpid-stat -q` showed that all workers have no messages and the resource_manager has 1 message in its queue. I confirmed that a message is shown on the queue when the resource manager has not received it or has not ack-ed it.

Queues
  queue                                                                             dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =================================================================================================================================================================
<snip>
  reserved_resource_worker-0@sat-perf-01.sat6.lab.eng.bos.redhat.com.celery.pidbox       Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-0@sat-perf-01.sat6.lab.eng.bos.redhat.com.dq             Y    Y                 0     6      6       0   6.66k    6.66k        1     2
  reserved_resource_worker-1@sat-perf-01.sat6.lab.eng.bos.redhat.com.celery.pidbox       Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-1@sat-perf-01.sat6.lab.eng.bos.redhat.com.dq             Y    Y                 0     8      8       0   8.90k    8.90k        1     2
  reserved_resource_worker-2@sat-perf-01.sat6.lab.eng.bos.redhat.com.celery.pidbox       Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-2@sat-perf-01.sat6.lab.eng.bos.redhat.com.dq             Y    Y                 0    48     48       0   53.4k    53.4k        1     2
  reserved_resource_worker-3@sat-perf-01.sat6.lab.eng.bos.redhat.com.celery.pidbox       Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-3@sat-perf-01.sat6.lab.eng.bos.redhat.com.dq             Y    Y                 0    32     32       0   34.9k    34.9k        1     2
  reserved_resource_worker-4@sat-perf-01.sat6.lab.eng.bos.redhat.com.celery.pidbox       Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-4@sat-perf-01.sat6.lab.eng.bos.redhat.com.dq             Y    Y                 0    42     42       0   47.0k    47.0k        1     2
  reserved_resource_worker-5@sat-perf-01.sat6.lab.eng.bos.redhat.com.celery.pidbox       Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-5@sat-perf-01.sat6.lab.eng.bos.redhat.com.dq             Y    Y                 0   120    120       0    126k     126k        1     2
  reserved_resource_worker-6@sat-perf-01.sat6.lab.eng.bos.redhat.com.celery.pidbox       Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-6@sat-perf-01.sat6.lab.eng.bos.redhat.com.dq             Y    Y                 0   274    274       0    293k     293k        1     2
  reserved_resource_worker-7@sat-perf-01.sat6.lab.eng.bos.redhat.com.celery.pidbox       Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-7@sat-perf-01.sat6.lab.eng.bos.redhat.com.dq             Y    Y                 0    24     24       0   26.7k    26.7k        1     2
  resource_manager                                                                  Y                      1   277    276    1.26k   395k     393k        1     2
  resource_manager@sat-perf-01.sat6.lab.eng.bos.redhat.com.celery.pidbox                 Y                 0     0      0       0      0        0         1     2
  resource_manager@sat-perf-01.sat6.lab.eng.bos.redhat.com.dq                       Y    Y                 0     0      0       0      0        0         1     2


The resource_manager is running and has parent process pid 3723 and child process pid 4141. See this ps -awfux output:

apache    3723  0.5  0.2 669736 53124 ?        Ssl  Dec14   7:27 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18
apache    4141  0.0  0.1 595468 46164 ?        Sl   Dec14   0:19  \_ /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18


I've confirmed that the expected number of threads are running in the parent pid (4) and the expected number of threads are running in the child (3). Also, using gdb, all of the Python instruction pointers are in poll or epoll waits, so basically both the parent and the child are idle.
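
For reference, a minimal sketch of commands that could be used for such a check (illustrative only, not necessarily the exact commands used here; PIDs taken from the ps output above, gdb assumed to be installed):

$ ps -o pid,nlwp,cmd -p 3723,4141                  # NLWP = thread count per process
$ gdb -p 3723 -batch -ex 'thread apply all bt'     # native stacks of all parent threads
$ gdb -p 4141 -batch -ex 'thread apply all bt'     # native stacks of all child threads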


== The interesting part ==
I used journalctl to get the logs during the affected time. I then filtered them to include only log statements emitted by 3723 or 4141.
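
(A sketch of one way such filtering could be done, assuming the pulp[PID] tag visible in the log lines quoted below; the time window is illustrative and these are not necessarily the exact commands used:)

$ journalctl --since "2015-12-15 14:00" --until "2015-12-15 16:00" | grep -E 'pulp\[(3723|4141)\]' > filtered.log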

First, the parent is submitting heartbeats just fine and continuing to poll with its normal timeouts. This supports the idea that the parent process is working.

Second, all tasks received by the celery parent process were handed to the child process, ran, completed, were successfully handed back to the parent process through IPC, acked to the broker, and logged as completed. There is an attachment with the logs from the very last task processed by the resource manager. Each message is exactly as expected and in the correct order.


== my conclusion ==
This shows that the client isn't actually deadlocking! Something is preventing it from receiving the next message. Once the broker has been sent the ack for the previous message (repeated below),

Dec 15 15:45:09 sat-perf-01.sat6.lab.eng.bos.redhat.com pulp[3723]: qpid.messaging:DEBUG: SACK[3f43c20]: Message(subject=u'resource_manager', properties={u'qpid.subject': u'resource_manager', 'x-amqp-0-10.routing-key': u'resource_manager'}, content={u'body': 'gAJ9cQEoVQdleHBpcmVzcQJOVQN1dGNxA4hVBGFyZ3NxBF1xBShVKXB1bHAuc2VydmVyLm1hbmFnZXJzLnJlcG8ucHVibGlzaC5wdWJsaXNocQZVJDIyMzMzYjk4LTc5YmEtNGNmYi1iMjQxLTU5MTRlYmFkMmVjY3EHWEEAAAByZXBvc2l0b3J5OlRoZV9EZW1vX09yZ19BdXRvLURlbW9fUHJvZHVjdF9BLURlbW9fSW50ZXJuYWxfUHJvZHVjdHEIKX1xCShVBmt3YXJnc3EKfXELKFUXcHVibGlzaF9jb25maWdfb3ZlcnJpZGVxDE5VB3JlcG9faWRxDVg2AAAAVGhlX0RlbW9fT3JnX0F1dG8tRGVtb19Qcm9kdWN0X0EtRGVtb19JbnRlcm5hbF9Qcm9kdWN0cQ5VDmRpc3RyaWJ1dG9yX2lkcQ9VPFRoZV9EZW1vX09yZ19BdXRvLURlbW9fUHJvZHVjdF9BLURlbW9fSW50ZXJuYWxfUHJvZHVjdF9ub2Rlc3EQdVUEdGFnc3ERXXESKFhGAAAAcHVscDpyZXBvc2l0b3J5OlRoZV9EZW1vX09yZ19BdXRvLURlbW9fUHJvZHVjdF9BLURlbW9fSW50ZXJuYWxfUHJvZHVjdHETVRNwdWxwOmFjdGlvbjpwdWJsaXNocRRldWVVBWNob3JkcRVOVQljYWxsYmFja3NxFk5VCGVycmJhY2tzcRdOVQd0YXNrc2V0cRhOVQJpZHEZVSRiNmM2Y2Q5MC05N2I3LTRkZTktYjU0OS0wYTZiOTAyNjJmZDBxGlUHcmV0cmllc3EbSwBVBHRhc2txHFUscHVscC5zZXJ2ZXIuYXN5bmMudGFza3MuX3F1ZXVlX3Jlc2VydmVkX3Rhc2txHVUJdGltZWxpbWl0cR5OToZVA2V0YXEfTmgKfXEgdS4=', u'headers': {}, u'content-type': u'application/x-python-serialize', u'properties': {u'body_encoding': u'base64', u'delivery_info': {u'priority': 0, u'routing_key': u'resource_manager', u'exchange': u'resource_manager'}, u'delivery_mode': 2, u'correlation_id': u'b6c6cd90-97b7-4de9-b549-0a6b90262fd0', u'reply_to': u'467f662e-a57a-33c9-a219-a16b4b844863', u'delivery_tag': 90}, u'content-encoding': u'binary'}), None

I expect the broker to deliver the next message which would resemble the first message from the message flow above (repeated below):

Dec 15 14:45:00 sat-perf-01.sat6.lab.eng.bos.redhat.com pulp[3723]: qpid.messaging:DEBUG: RCVD[3f43c20]: Message(subject=u'resource_manager', properties={u'qpid.subject': u'resource_manager', 'x-amqp-0-10.routing-key': u'resource_manager'}, content={u'body': 'gAJ9cQEoVQdleHBpcmVzcQJOVQN1dGNxA4hVBGFyZ3NxBF1xBShVI3B1bHAuc2VydmVyLm1hbmFnZXJzLnJlcG8uc3luYy5zeW5jcQZVJDc0MTJiMGI5LTA5NjEtNDU1YS1iOTA0LTgwYWU3OTRkMDBmNnEHWEEAAAByZXBvc2l0b3J5OlRoZV9EZW1vX09yZ19BdXRvLURlbW9fUHJvZHVjdF9BLURlbW9fSW50ZXJuYWxfUHJvZHVjdHEIKX1xCShVBmt3YXJnc3EKfXELKFUUc3luY19jb25maWdfb3ZlcnJpZGVxDH1xDVUHcmVwb19pZHEOWDYAAABUaGVfRGVtb19PcmdfQXV0by1EZW1vX1Byb2R1Y3RfQS1EZW1vX0ludGVybmFsX1Byb2R1Y3RxD3VVBHRhZ3NxEF1xEShYRgAAAHB1bHA6cmVwb3NpdG9yeTpUaGVfRGVtb19PcmdfQXV0by1EZW1vX1Byb2R1Y3RfQS1EZW1vX0ludGVybmFsX1Byb2R1Y3RxElUQcHVscDphY3Rpb246c3luY3ETZXVlVQVjaG9yZHEUTlUJY2FsbGJhY2tzcRVOVQhlcnJiYWNrc3EWTlUHdGFza3NldHEXTlUCaWRxGFUkNDUyOTMyYzQtMTM4OS00NTBkLTg2MzEtYWE1Yzg2MTVhYTc3cRlVB3JldHJpZXNxGksAVQR0YXNrcRtVLHB1bHAuc2VydmVyLmFzeW5jLnRhc2tzLl9xdWV1ZV9yZXNlcnZlZF90YXNrcRxVCXRpbWVsaW1pdHEdTk6GVQNldGFxHk5oCn1xH3Uu', u'headers': {}, u'content-type': u'application/x-python-serialize', u'properties': {u'body_encoding': u'base64', u'delivery_info': {u'priority': 0, u'routing_key': u'resource_manager', u'exchange': u'resource_manager'}, u'delivery_mode': 2, u'correlation_id': u'452932c4-1389-450d-8631-aa5c8615aa77', u'reply_to': u'4e225bd1-e1de-3cd1-968b-a6fc9b7d7b17', u'delivery_tag': 19}, u'content-encoding': u'binary'})


What could prevent the next message from being delivered?
How do we investigate this further?
Comment 8 Brian Bouterse 2015-12-18 18:15 EST
Created attachment 1107526 [details]
The last task processed by the resource manager from reproducer machine
Comment 9 Pavel Moravec 2015-12-19 03:41:19 EST
(In reply to Brian Bouterse from comment #8)
> What could prevent the next message from being delivered?
> How do we investigate this further?

Outputs of:

qpid-stat -u     # subscriptions - who is consuming messages?
qpid-stat -q resource_manager    # this queue statistics details

would indicate whether the message is really available or whether some consumer has already acquired it.

Checking the debug logs (I expect they are only partial), I can see, however, that the client did not acknowledge 2 messages it received:

RCVD means python-qpid received a message from the broker
RETR means the library passed the message via the fetch API call to the application / resource_manager
SACK means the library acknowledged message consumption

$ grep RCVD logs | cut -d\' -f14 > logs.RCVD.body
$ grep SACK logs | cut -d\' -f14 > logs.SACK.body
$ grep RETR logs | cut -d\' -f14 > logs.RETR.body
$ diff logs.RETR.body logs.RCVD.body
$ diff logs.RCVD.body logs.SACK.body 
6d5
< gAJ9cQEoVQdleHBpcmVzcQJOVQN1dGNxA4hVBGFyZ3NxBF1xBShVTnB1bHAuc2VydmVyLm1hbmFnZXJzLmNvbnN1bWVyLmFwcGxpY2FiaWxpdHkucmVnZW5lcmF0ZV9hcHBsaWNhYmlsaXR5X2Zvcl9yZXBvc3EGVSQyMDlhMDc2MS1hOTMxLTRjOTktOGJmZi1kZjc4YzY3MTdmMWFxB1UwcmVwb3NpdG9yeV9wcm9maWxlX2FwcGxpY2FiaWxpdHk6UkVTT1VSQ0VfQU5ZX0lEcQh9cQkoVQRzb3J0cQpOVQRza2lwcQtOVQVsaW1pdHEMTlUHZmlsdGVyc3ENfXEOVQJpZH1xD1UDJGluXXEQVTZUaGVfRGVtb19PcmdfQXV0by1EZW1vX1Byb2R1Y3RfQS1EZW1vX0ludGVybmFsX1Byb2R1Y3RxEWFzc1UGZmllbGRzcRJOdYVxE4VxFH1xFVUEdGFnc3EWXXEXVS5wdWxwOmFjdGlvbjpjb250ZW50X2FwcGxpY2FiaWxpdHlfcmVnZW5lcmF0aW9ucRhhc2VVBWNob3JkcRlOVQljYWxsYmFja3NxGk5VCGVycmJhY2tzcRtOVQd0YXNrc2V0cRxOVQJpZHEdVSQ3ODc2NTNiMi1lMGExLTQxMTMtYmVhOC0yMzY4NTYxNzRiNzZxHlUHcmV0cmllc3EfSwBVBHRhc2txIFUscHVscC5zZXJ2ZXIuYXN5bmMudGFza3MuX3F1ZXVlX3Jlc2VydmVkX3Rhc2txIVUJdGltZWxpbWl0cSJOToZVA2V0YXEjTlUGa3dhcmdzcSR9cSV1Lg==
9d7
< gAJ9cQEoVQdleHBpcmVzcQJOVQN1dGNxA4hVBGFyZ3NxBF1xBShVTnB1bHAuc2VydmVyLm1hbmFnZXJzLmNvbnN1bWVyLmFwcGxpY2FiaWxpdHkucmVnZW5lcmF0ZV9hcHBsaWNhYmlsaXR5X2Zvcl9yZXBvc3EGVSQxY2QwMmJjMS1jM2U5LTQxNTUtYmMyMS1hMzM4MDM0NDYzM2NxB1UwcmVwb3NpdG9yeV9wcm9maWxlX2FwcGxpY2FiaWxpdHk6UkVTT1VSQ0VfQU5ZX0lEcQh9cQkoVQRzb3J0cQpOVQRza2lwcQtOVQVsaW1pdHEMTlUHZmlsdGVyc3ENfXEOVQJpZH1xD1UDJGluXXEQVSdUaGVfRGVtb19PcmdfQXV0by1EZW1vX1Byb2R1Y3RfQS1CYWRVUkxxEWFzc1UGZmllbGRzcRJOdYVxE4VxFH1xFVUEdGFnc3EWXXEXVS5wdWxwOmFjdGlvbjpjb250ZW50X2FwcGxpY2FiaWxpdHlfcmVnZW5lcmF0aW9ucRhhc2VVBWNob3JkcRlOVQljYWxsYmFja3NxGk5VCGVycmJhY2tzcRtOVQd0YXNrc2V0cRxOVQJpZHEdVSRkMjg4MDk2MS01N2Y3LTQ5YzQtYjMxYi0yYTA5ZjkwOGY4MWFxHlUHcmV0cmllc3EfSwBVBHRhc2txIFUscHVscC5zZXJ2ZXIuYXN5bmMudGFza3MuX3F1ZXVlX3Jlc2VydmVkX3Rhc2txIVUJdGltZWxpbWl0cSJOToZVA2V0YXEjTlUGa3dhcmdzcSR9cSV1Lg==

Those two messages are for these tasks:

Dec 15 14:45:07 sat-perf-01.sat6.lab.eng.bos.redhat.com pulp[3723]: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._queue_reserved_task[787653b2-e0a1-4113-bea8-236856174b76]
Dec 15 15:45:03 sat-perf-01.sat6.lab.eng.bos.redhat.com pulp[3723]: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._queue_reserved_task[d2880961-57f7-49c4-b31b-2a09f908f81a]

(btw, why are the logs not ordered by time? Or rather, why did the system log events in the wrong order?)

So, whatever python-qpid received, it passed to the application. But two such messages haven't been acknowledged back (can the missing ACK be swallowed / pending in python-qpid?)
Comment 10 Ken Giusti 2015-12-21 08:25:31 EST
A reproducer??  Yay!

First, where are the logs, etc captured from this event?  I'd like to go over them.

Second - I think we need to answer Pavel's last question.  I'm going to instrument some additional ack tracing in python-qpid and build a test rpm based on 0.30-7.  Running the reproducer with the additional tracing will let us know where that ack is ending up.

Stay tuned...
Comment 12 Brian Bouterse 2015-12-21 11:22:28 EST
The journalctl on the reproducer machine has incomplete logs. Given that, I don't think we can say whether a SACK did or did not occur for these tasks. For instance, look at the logs for d2880961-57f7-49c4-b31b-2a09f908f81a.

The task was received, ran, and completed successfully, but there are expected log statements missing, including:

- celery's dispatch log statement: TaskPool: Apply <function _fast_trace_task at 0x3d50938>
- the task.started SENT/RACK messages
- the SACK message

There are many tasks which are missing SACK messages, yet the queue depth for resource_manager shows 1 unacked message, so those other SACKs must have occurred.

Also, the logs don't have any record before "-- Logs begin at Tue 2015-12-15 14:50:16 EST", which is strange because that is the first time shown by `journalctl`. I guess we need to enable journald to receive and store a much larger amount of logs.

I propose this as a next step because without reliable logs we can't root cause the issue.
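
(As a side note, a minimal sketch of enabling persistent journald storage with a larger size cap — the 4G limit is only a placeholder and should be adjusted to available disk space:)

# mkdir -p /var/log/journal
# vi /etc/systemd/journald.conf      (set Storage=persistent and, e.g., SystemMaxUse=4G)
# systemctl restart systemd-journald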
Comment 17 Mike McCune 2016-01-21 01:43:03 EST
Attached to this BZ are 2 new files that need to be put in place on your Satellite to gather extended debug information if you are affected by this bug.

1) /usr/share/foreman/script/foreman-debug.d/katello-debug.sh

2) /usr/bin/qpid-core-dump

Copy the attachments from this bug to these locations. Once copied, you can run:

# foreman-debug 

which will gather additional qpid- and task-related log files, and:

# qpid-core-dump

This will generate a large archive (1 GB or more) of core files of the celery processes. Please attach these files to any Customer Portal support cases you have open. This will assist in the debugging process.
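
(For reference, a rough sketch of what such a core-dump helper does, based on the gcore loop from the bug description above; the attached script may differ in details, and the archive name below is only an example:)

# cd /var/tmp
# for pid in $(ps -awfux | grep celery | grep "@" | awk '{ print $2 }'); do gcore $pid; done
# tar czf celery-core-dumps.tar.gz core.*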
Comment 18 Mike McCune 2016-01-21 01:43 EST
Created attachment 1116803 [details]
katello-debug.sh
Comment 19 Mike McCune 2016-01-21 01:44 EST
Created attachment 1116804 [details]
qpid-core-dump
Comment 20 Mike McCune 2016-01-21 01:45:53 EST
Also ensure that both /usr/share/foreman/script/foreman-debug.d/katello-debug.sh and /usr/bin/qpid-core-dump are executable (chmod +x).
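
For example:

# chmod +x /usr/share/foreman/script/foreman-debug.d/katello-debug.sh /usr/bin/qpid-core-dump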
Comment 21 Pavel Moravec 2016-01-21 04:36:32 EST
My understanding was that engineering has reproduced the bug. Is the data request from customer(s) meant just to verify that both the customer(s) and engineering really reproduced the same issue? Or is there some other objective behind it?

Is the data expected from (the) one customer, or from anyone hitting the same issue (such that it makes sense to update the relevant KCS to request the same data)?

(just a side note: if the request is generic, the change to the katello-debug.sh script will be applicable to Sat 6.1 only, since 6.2 should deliver a new script, and for that a new version of the extra debugging script would have to be prepared)
Comment 22 Mike McCune 2016-01-21 11:46:35 EST
This update to the script is for 6.1.x only and should be considered a hotfix/patch.

We will roll these updated debug tools into 6.1.7 so they will be shipped with Satellite.
Comment 23 Pavel Moravec 2016-01-21 12:02:13 EST
Can this command be collected as well?

qpid-stat --ssl-certificate=/etc/gutterball/certs/gutterball.crt --ssl-key=/etc/gutterball/certs/gutterball.key -b amqps://localhost:5671 -q resource_manager

output:

Properties:
  Name              Durable  AutoDelete  Exclusive  FlowStopped  FlowStoppedCount  Consumers  Bindings
  ======================================================================================================
  resource_manager  Y        N           N          N            0                 1          2

Optional Properties:
  Property      Value
  ============================================================================
  arguments     {u'passive': False, u'exclusive': False, u'arguments': None}
  alt-exchange  

Statistics:
  Statistic                   Messages  Bytes
  ==============================================
  queue-depth                 0         0
  total-enqueues              20        32,552
  total-dequeues              20        32,552
  persistent-enqueues         0         0
  persistent-dequeues         0         0
  transactional-enqueues      0         0
  transactional-dequeues      0         0
  flow-to-disk-depth          0         0
  flow-to-disk-enqueues       0         0
  flow-to-disk-dequeues       0         0
  acquires                    20        
  releases                    0         
  discards-ttl-expired        0         
  discards-limit-overflow     0         
  discards-ring-overflow      0         
  discards-lvq-replace        0         
  discards-subscriber-reject  0         
  discards-purged             0         
  reroutes                    0         


If total-enqueues equals acquires, then all messages enqueued to the broker have been delivered to a consumer (pulp in this case).

The lack of ACKs will be visible here if total-dequeues is _less_ than acquires.
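
(A minimal sketch of automating this check against the output above — a hypothetical helper, assuming the column layout shown:)

$ out=$(qpid-stat --ssl-certificate=/etc/gutterball/certs/gutterball.crt --ssl-key=/etc/gutterball/certs/gutterball.key -b amqps://localhost:5671 -q resource_manager)
$ qval() { echo "$out" | awk -v k="$1" '$1 == k { gsub(",", "", $2); print $2 }'; }
$ echo "enqueued=$(qval total-enqueues) acquired=$(qval acquires) dequeued=$(qval total-dequeues)"
enqueued=20 acquired=20 dequeued=20
$ [ "$(qval total-dequeues)" -lt "$(qval acquires)" ] && echo "WARNING: acquired but un-ACKed messages present"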
Comment 24 Mike McCune 2016-01-21 19:55:52 EST
Adding the above, but using the simpler cert auth:

qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt  -b amqps://localhost:5671 -q resource_manager
Comment 25 Pavel Moravec 2016-01-22 03:07:16 EST
(In reply to Mike McCune from comment #24)
> Adding the above but using the more simplified cert auth:
> 
> qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt  -b
> amqps://localhost:5671 -q resource_manager

Right, thanks for the simplification.

I am about to update the script and attach both to KCS 2044393 to point customers there.
Comment 29 Mike McCune 2016-02-25 12:49:12 EST
We are going to mark this as resolved and delivered with Satellite 6.1.7.

We have not received any updated reports of this occurring with installations that include the qpid/proton updates in Satellite 6.1.7. 

If we do see this exact scenario with the latest bits installed on both the Satellite and Capsule, please feel free to re-open.
