Bug 1288223
| Summary: | pulp task getting stuck even on python-qpid-0.30-7 | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> |
| Component: | Pulp | Assignee: | Katello Bug Bin <katello-bugs> |
| Status: | CLOSED ERRATA | QA Contact: | Katello QA List <katello-qa-list> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.1.4 | CC: | andrewrpalmer, bbuckingham, chrobert, cwelton, jorgen.langgat, kgiusti, mmccune, peter.vreman, risantam, rmahique, sauchter, xdmoon, yjog |
| Target Milestone: | Unspecified | Keywords: | Triaged |
| Target Release: | Unused | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-02-25 17:49:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1122832 | | |
| Attachments: | | | |
Description
Pavel Moravec
2015-12-03 21:51:05 UTC
There are a lot of exceptions here, and in normal operation I experience none. This causes me to think these problems are environmental. I'll write some responses here regarding the exceptions identified in the summary:
Dec 1 09:51:02 ffurhelsat01 pulp: celery.worker.consumer:WARNING: (14239-08864) ConnectError: [Errno 111] Connection refused
^ I think celery cannot get the asyncpool queues for IPC that it uses to send messages to the child. I'm not sure why.
Error in timer: TypeError("bad operand type for unary -: 'NoneType'",)
^ here 'status' is None. I'm not sure why.
Task pulp.server.managers.repo.sync.sync[46ede5d2-009b-4000-acd6-315997a67e64] raised unexpected: CancelException()
^ This is expected: when a celery task receives a cancel event for a currently running task, it exits by raising a CancelException. The exception is emitted from pulp code, so this looks normal.
Task pulp.server.managers.repo.publish.publish[d732837f-aaa8-43b1-9f8b-c217a150c181] raised unexpected: TypeError('cancel_publish_repo() takes exactly 3 arguments (1 given)',)
^ This is a known issue that has been fixed in upstream pulp. It is a mostly benign error with little impact. https://pulp.plan.io/issues/1097
IOError: [Errno 32] Broken pipe
^ This looks bad, and I believe it is a sign that the environment is broken. I never see these in a healthy runtime environment.
IOError: request data read error
^ This is in httpd, and isn't even related to Qpid. The inbound request cannot be read for some reason. It's not that the data is unexpected; this says it literally cannot be read.
All the other exceptions look like duplicates or variations on these exceptions.
mmccune has been able to reproduce the stuck resource manager issue internally on multiple Satellite machines. The hangs occur about a day after Sat 6 has been doing work continuously. On one EL7 machine he turned logging up to DEBUG and reproduced it. Those logs have proven to be very interesting.
== How we know this machine had a problem ==
He anecdotally observed that Sat 6 was "stuck". Listing the queues with `qpid-stat -q` then shows that all workers have no messages while the resource_manager has 1 message in its queue. I confirmed that a message is shown on the queue when the resource manager has not received it or has not acked it.
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=================================================================================================================================================================
<snip>
reserved_resource_worker-0.lab.eng.bos.redhat.com.celery.pidbox Y 0 0 0 0 0 0 1 2
reserved_resource_worker-0.lab.eng.bos.redhat.com.dq Y Y 0 6 6 0 6.66k 6.66k 1 2
reserved_resource_worker-1.lab.eng.bos.redhat.com.celery.pidbox Y 0 0 0 0 0 0 1 2
reserved_resource_worker-1.lab.eng.bos.redhat.com.dq Y Y 0 8 8 0 8.90k 8.90k 1 2
reserved_resource_worker-2.lab.eng.bos.redhat.com.celery.pidbox Y 0 0 0 0 0 0 1 2
reserved_resource_worker-2.lab.eng.bos.redhat.com.dq Y Y 0 48 48 0 53.4k 53.4k 1 2
reserved_resource_worker-3.lab.eng.bos.redhat.com.celery.pidbox Y 0 0 0 0 0 0 1 2
reserved_resource_worker-3.lab.eng.bos.redhat.com.dq Y Y 0 32 32 0 34.9k 34.9k 1 2
reserved_resource_worker-4.lab.eng.bos.redhat.com.celery.pidbox Y 0 0 0 0 0 0 1 2
reserved_resource_worker-4.lab.eng.bos.redhat.com.dq Y Y 0 42 42 0 47.0k 47.0k 1 2
reserved_resource_worker-5.lab.eng.bos.redhat.com.celery.pidbox Y 0 0 0 0 0 0 1 2
reserved_resource_worker-5.lab.eng.bos.redhat.com.dq Y Y 0 120 120 0 126k 126k 1 2
reserved_resource_worker-6.lab.eng.bos.redhat.com.celery.pidbox Y 0 0 0 0 0 0 1 2
reserved_resource_worker-6.lab.eng.bos.redhat.com.dq Y Y 0 274 274 0 293k 293k 1 2
reserved_resource_worker-7.lab.eng.bos.redhat.com.celery.pidbox Y 0 0 0 0 0 0 1 2
reserved_resource_worker-7.lab.eng.bos.redhat.com.dq Y Y 0 24 24 0 26.7k 26.7k 1 2
resource_manager Y 1 277 276 1.26k 395k 393k 1 2
resource_manager.lab.eng.bos.redhat.com.celery.pidbox Y 0 0 0 0 0 0 1 2
resource_manager.lab.eng.bos.redhat.com.dq Y Y 0 0 0 0 0 0 1 2
The resource_manager is running and has parent process pid 3723 and child process pid 4141. See this ps -awfux output:
apache 3723 0.5 0.2 669736 53124 ? Ssl Dec14 7:27 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18
apache 4141 0.0 0.1 595468 46164 ? Sl Dec14 0:19 \_ /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18
I've confirmed that the expected number of threads are running in the parent pid (4) and in the child (3). Also, inspecting with gdb, all of the Python instruction pointers are in poll or epoll waits, so both the parent and the child are basically idle.
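For reference, a sketch of one way to do this kind of inspection (the exact commands used are not shown in this comment; `py-bt` assumes the matching python debuginfo is installed so gdb can load the libpython helpers):
$ gdb -p 3723 -batch -ex 'info threads' -ex 'thread apply all bt'
$ gdb -p 3723 -batch -ex 'thread apply all py-bt'    # Python-level stacks, needs python debuginfo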
== The interesting part ==
I used journalctl to get the logs during the affected time, then filtered them to include only log statements emitted by PID 3723 or 4141.
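(A sketch of one way to reproduce that filtering; the time window is illustrative, and the pulp[PID] syslog tag matches the log lines quoted below:)
$ journalctl --since '2015-12-15 14:00' --until '2015-12-15 16:00' | grep -E 'pulp\[(3723|4141)\]' > resource_manager.log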
First, the parent is submitting heartbeats just fine and continuing to poll with its normal timeouts. This supports the idea that the parent process is working.
Second, every task received by the celery parent process was handed to the child process, ran, completed, was successfully handed back to the parent process through IPC, was acked to the broker, and was logged as completed. There is an attachment with the logs from the very last task processed by the resource manager. Each message is exactly as expected and in the correct order.
== my conclusion ==
This shows that the client isn't actually deadlocking! Something is preventing it from receiving the next message. Once the server has been sent the ack for the previous message (repeated below):
Dec 15 15:45:09 sat-perf-01.sat6.lab.eng.bos.redhat.com pulp[3723]: qpid.messaging:DEBUG: SACK[3f43c20]: Message(subject=u'resource_manager', properties={u'qpid.subject': u'resource_manager', 'x-amqp-0-10.routing-key': u'resource_manager'}, content={u'body': 'gAJ9cQEoVQdleHBpcmVzcQJOVQN1dGNxA4hVBGFyZ3NxBF1xBShVKXB1bHAuc2VydmVyLm1hbmFnZXJzLnJlcG8ucHVibGlzaC5wdWJsaXNocQZVJDIyMzMzYjk4LTc5YmEtNGNmYi1iMjQxLTU5MTRlYmFkMmVjY3EHWEEAAAByZXBvc2l0b3J5OlRoZV9EZW1vX09yZ19BdXRvLURlbW9fUHJvZHVjdF9BLURlbW9fSW50ZXJuYWxfUHJvZHVjdHEIKX1xCShVBmt3YXJnc3EKfXELKFUXcHVibGlzaF9jb25maWdfb3ZlcnJpZGVxDE5VB3JlcG9faWRxDVg2AAAAVGhlX0RlbW9fT3JnX0F1dG8tRGVtb19Qcm9kdWN0X0EtRGVtb19JbnRlcm5hbF9Qcm9kdWN0cQ5VDmRpc3RyaWJ1dG9yX2lkcQ9VPFRoZV9EZW1vX09yZ19BdXRvLURlbW9fUHJvZHVjdF9BLURlbW9fSW50ZXJuYWxfUHJvZHVjdF9ub2Rlc3EQdVUEdGFnc3ERXXESKFhGAAAAcHVscDpyZXBvc2l0b3J5OlRoZV9EZW1vX09yZ19BdXRvLURlbW9fUHJvZHVjdF9BLURlbW9fSW50ZXJuYWxfUHJvZHVjdHETVRNwdWxwOmFjdGlvbjpwdWJsaXNocRRldWVVBWNob3JkcRVOVQljYWxsYmFja3NxFk5VCGVycmJhY2tzcRdOVQd0YXNrc2V0cRhOVQJpZHEZVSRiNmM2Y2Q5MC05N2I3LTRkZTktYjU0OS0wYTZiOTAyNjJmZDBxGlUHcmV0cmllc3EbSwBVBHRhc2txHFUscHVscC5zZXJ2ZXIuYXN5bmMudGFza3MuX3F1ZXVlX3Jlc2VydmVkX3Rhc2txHVUJdGltZWxpbWl0cR5OToZVA2V0YXEfTmgKfXEgdS4=', u'headers': {}, u'content-type': u'application/x-python-serialize', u'properties': {u'body_encoding': u'base64', u'delivery_info': {u'priority': 0, u'routing_key': u'resource_manager', u'exchange': u'resource_manager'}, u'delivery_mode': 2, u'correlation_id': u'b6c6cd90-97b7-4de9-b549-0a6b90262fd0', u'reply_to': u'467f662e-a57a-33c9-a219-a16b4b844863', u'delivery_tag': 90}, u'content-encoding': u'binary'}), None
I expect the broker to deliver the next message, which would resemble the first message from the message flow above (repeated below):
Dec 15 14:45:00 sat-perf-01.sat6.lab.eng.bos.redhat.com pulp[3723]: qpid.messaging:DEBUG: RCVD[3f43c20]: Message(subject=u'resource_manager', properties={u'qpid.subject': u'resource_manager', 'x-amqp-0-10.routing-key': u'resource_manager'}, content={u'body': 'gAJ9cQEoVQdleHBpcmVzcQJOVQN1dGNxA4hVBGFyZ3NxBF1xBShVI3B1bHAuc2VydmVyLm1hbmFnZXJzLnJlcG8uc3luYy5zeW5jcQZVJDc0MTJiMGI5LTA5NjEtNDU1YS1iOTA0LTgwYWU3OTRkMDBmNnEHWEEAAAByZXBvc2l0b3J5OlRoZV9EZW1vX09yZ19BdXRvLURlbW9fUHJvZHVjdF9BLURlbW9fSW50ZXJuYWxfUHJvZHVjdHEIKX1xCShVBmt3YXJnc3EKfXELKFUUc3luY19jb25maWdfb3ZlcnJpZGVxDH1xDVUHcmVwb19pZHEOWDYAAABUaGVfRGVtb19PcmdfQXV0by1EZW1vX1Byb2R1Y3RfQS1EZW1vX0ludGVybmFsX1Byb2R1Y3RxD3VVBHRhZ3NxEF1xEShYRgAAAHB1bHA6cmVwb3NpdG9yeTpUaGVfRGVtb19PcmdfQXV0by1EZW1vX1Byb2R1Y3RfQS1EZW1vX0ludGVybmFsX1Byb2R1Y3RxElUQcHVscDphY3Rpb246c3luY3ETZXVlVQVjaG9yZHEUTlUJY2FsbGJhY2tzcRVOVQhlcnJiYWNrc3EWTlUHdGFza3NldHEXTlUCaWRxGFUkNDUyOTMyYzQtMTM4OS00NTBkLTg2MzEtYWE1Yzg2MTVhYTc3cRlVB3JldHJpZXNxGksAVQR0YXNrcRtVLHB1bHAuc2VydmVyLmFzeW5jLnRhc2tzLl9xdWV1ZV9yZXNlcnZlZF90YXNrcRxVCXRpbWVsaW1pdHEdTk6GVQNldGFxHk5oCn1xH3Uu', u'headers': {}, u'content-type': u'application/x-python-serialize', u'properties': {u'body_encoding': u'base64', u'delivery_info': {u'priority': 0, u'routing_key': u'resource_manager', u'exchange': u'resource_manager'}, u'delivery_mode': 2, u'correlation_id': u'452932c4-1389-450d-8631-aa5c8615aa77', u'reply_to': u'4e225bd1-e1de-3cd1-968b-a6fc9b7d7b17', u'delivery_tag': 19}, u'content-encoding': u'binary'})
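As a side note, the base64 `body` field in these debug lines can be decoded to see which task a message carries, since the message properties above show body_encoding base64 and content-type application/x-python-serialize (a pickled dict). A minimal Python 2 one-liner sketch, with a placeholder for the pasted body value:
$ echo '<paste the body value from the log line here>' | base64 -d \
    | python -c 'import pickle, sys; m = pickle.load(sys.stdin); print m.get("task"), m.get("id")'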
What could prevent the next message from being delivered?
How do we investigate this further?
Created attachment 1107526 [details]
The last task processed by the resource manager from reproducer machine
(In reply to Brian Bouterse from comment #8)
> What could prevent the next message from being delivered?
> How do we investigate this further?
Outputs of:
qpid-stat -u                     # subscriptions - who is consuming messages?
qpid-stat -q resource_manager    # this queue's statistics would suggest whether the message is really available or whether some consumer has already acquired it
Checking the debug logs (I expect they are partial only), I see however that the client did not acknowledge 2 messages it received:
RCVD means python-qpid received the message from the broker
RETR means the library passed the message via the fetch API call to the application / resource_manager
SACK means the library acknowledged message consumption
$ grep RCVD logs | cut -d\' -f14 > logs.RCVD.body
$ grep SACK logs | cut -d\' -f14 > logs.SACK.body
$ grep RETR logs | cut -d\' -f14 > logs.RETR.body
$ diff logs.RETR.body logs.RCVD.body
$ diff logs.RCVD.body logs.SACK.body
6d5
< gAJ9cQEoVQdleHBpcmVzcQJOVQN1dGNxA4hVBGFyZ3NxBF1xBShVTnB1bHAuc2VydmVyLm1hbmFnZXJzLmNvbnN1bWVyLmFwcGxpY2FiaWxpdHkucmVnZW5lcmF0ZV9hcHBsaWNhYmlsaXR5X2Zvcl9yZXBvc3EGVSQyMDlhMDc2MS1hOTMxLTRjOTktOGJmZi1kZjc4YzY3MTdmMWFxB1UwcmVwb3NpdG9yeV9wcm9maWxlX2FwcGxpY2FiaWxpdHk6UkVTT1VSQ0VfQU5ZX0lEcQh9cQkoVQRzb3J0cQpOVQRza2lwcQtOVQVsaW1pdHEMTlUHZmlsdGVyc3ENfXEOVQJpZH1xD1UDJGluXXEQVTZUaGVfRGVtb19PcmdfQXV0by1EZW1vX1Byb2R1Y3RfQS1EZW1vX0ludGVybmFsX1Byb2R1Y3RxEWFzc1UGZmllbGRzcRJOdYVxE4VxFH1xFVUEdGFnc3EWXXEXVS5wdWxwOmFjdGlvbjpjb250ZW50X2FwcGxpY2FiaWxpdHlfcmVnZW5lcmF0aW9ucRhhc2VVBWNob3JkcRlOVQljYWxsYmFja3NxGk5VCGVycmJhY2tzcRtOVQd0YXNrc2V0cRxOVQJpZHEdVSQ3ODc2NTNiMi1lMGExLTQxMTMtYmVhOC0yMzY4NTYxNzRiNzZxHlUHcmV0cmllc3EfSwBVBHRhc2txIFUscHVscC5zZXJ2ZXIuYXN5bmMudGFza3MuX3F1ZXVlX3Jlc2VydmVkX3Rhc2txIVUJdGltZWxpbWl0cSJOToZVA2V0YXEjTlUGa3dhcmdzcSR9cSV1Lg==
9d7
< gAJ9cQEoVQdleHBpcmVzcQJOVQN1dGNxA4hVBGFyZ3NxBF1xBShVTnB1bHAuc2VydmVyLm1hbmFnZXJzLmNvbnN1bWVyLmFwcGxpY2FiaWxpdHkucmVnZW5lcmF0ZV9hcHBsaWNhYmlsaXR5X2Zvcl9yZXBvc3EGVSQxY2QwMmJjMS1jM2U5LTQxNTUtYmMyMS1hMzM4MDM0NDYzM2NxB1UwcmVwb3NpdG9yeV9wcm9maWxlX2FwcGxpY2FiaWxpdHk6UkVTT1VSQ0VfQU5ZX0lEcQh9cQkoVQRzb3J0cQpOVQRza2lwcQtOVQVsaW1pdHEMTlUHZmlsdGVyc3ENfXEOVQJpZH1xD1UDJGluXXEQVSdUaGVfRGVtb19PcmdfQXV0by1EZW1vX1Byb2R1Y3RfQS1CYWRVUkxxEWFzc1UGZmllbGRzcRJOdYVxE4VxFH1xFVUEdGFnc3EWXXEXVS5wdWxwOmFjdGlvbjpjb250ZW50X2FwcGxpY2FiaWxpdHlfcmVnZW5lcmF0aW9ucRhhc2VVBWNob3JkcRlOVQljYWxsYmFja3NxGk5VCGVycmJhY2tzcRtOVQd0YXNrc2V0cRxOVQJpZHEdVSRkMjg4MDk2MS01N2Y3LTQ5YzQtYjMxYi0yYTA5ZjkwOGY4MWFxHlUHcmV0cmllc3EfSwBVBHRhc2txIFUscHVscC5zZXJ2ZXIuYXN5bmMudGFza3MuX3F1ZXVlX3Jlc2VydmVkX3Rhc2txIVUJdGltZWxpbWl0cSJOToZVA2V0YXEjTlUGa3dhcmdzcSR9cSV1Lg==
Those two messages are for these tasks:
Dec 15 14:45:07 sat-perf-01.sat6.lab.eng.bos.redhat.com pulp[3723]: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._queue_reserved_task[787653b2-e0a1-4113-bea8-236856174b76]
Dec 15 15:45:03 sat-perf-01.sat6.lab.eng.bos.redhat.com pulp[3723]: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._queue_reserved_task[d2880961-57f7-49c4-b31b-2a09f908f81a]
(btw, why are the logs not ordered by time? Or rather, why did the system log events in the wrong order?)
So, whatever python-qpid received, it passed to the application. But two such messages haven't been acknowledged back (can the missing ACK be swallowed / pending inside python-qpid?)
A reproducer?? Yay! First, where are the logs, etc. captured from this event? I'd like to go over them. Second - I think we need to answer Pavel's last question. I'm going to instrument some additional ack tracing in python-qpid and build a test rpm based on 0.30-7.
Running the reproducer with the additional tracing will let us know where that ack is ending up. Stay tuned...
The journalctl on the reproducer machine has incomplete logs. Given that, I don't think we can say whether an SACK did or did not occur for the tasks. For instance, look at the logs for d2880961-57f7-49c4-b31b-2a09f908f81a. The task was received, ran, and completed successfully, but there are expected log statements missing, including:
- celery's dispatch log statement: TaskPool: Apply <function _fast_trace_task at 0x3d50938>
- the task.started SENT/RACK messages
- the SACK message
There are many tasks which are missing SACK messages, yet the queue depth for resource_manager shows 1 unacked message, so those other SACKs must have occurred. Also, the logs don't have any record before "-- Logs begin at Tue 2015-12-15 14:50:16 EST", which is strange because that is the first time shown by `journalctl`. I guess we need to enable journalctl to receive and store a huge amount of logs. I propose this as a next step because without reliable logs we can't root cause the issue.
Attached to this BZ are 2 new files that need to be put in place on your Satellite to gather extended debug information if you are affected by this bug:
1) /usr/share/foreman/script/foreman-debug.d/katello-debug.sh
2) /usr/bin/qpid-core-dump
Copy the attachments from this bug to these locations. Once copied you can run:
# foreman-debug
which will gather additional qpid and task related logging files
# qpid-core-dump
which will generate a large archive (1GB or more) of core files of the celery process.
Please attach these files to any Customer Portal support cases you have open. This will assist in the debugging process.
Created attachment 1116803 [details]
katello-debug.sh
Created attachment 1116804 [details]
qpid-core-dump
Ensure that both /usr/share/foreman/script/foreman-debug.d/katello-debug.sh and /usr/bin/qpid-core-dump are chmod +x executable as well.
My understanding was that engineering has reproduced the bug - is the data request from customer(s) meant just to verify that the customer(s) and engineering really reproduced the same issue? Or is there some other objective behind it? Is the data planned to be received from (the) one customer, or from anyone hitting the same issue (such that it makes sense to update the relevant KCS to ask for the same data)? (Just a side note: if the request is generic, the change in the katello-debug.sh script will be applicable to Sat 6.1 only, since 6.2 should deliver a new script, and for that a new version of the extra debugging script would have to be prepared.)
This update to the script is for 6.1.x only and should be considered a hotfix/patch. We will roll these updated debug tools into 6.1.7 so they will be shipped with the Satellite.
Can this command be collected as well?
qpid-stat --ssl-certificate=/etc/gutterball/certs/gutterball.crt --ssl-key=/etc/gutterball/certs/gutterball.key -b amqps://localhost:5671 -q resource_manager
output:
Properties:
Name Durable AutoDelete Exclusive FlowStopped FlowStoppedCount Consumers Bindings
======================================================================================================
resource_manager Y N N N 0 1 2
Optional Properties:
Property Value
============================================================================
arguments {u'passive': False, u'exclusive': False, u'arguments': None}
alt-exchange
Statistics:
Statistic Messages Bytes
==============================================
queue-depth 0 0
total-enqueues 20 32,552
total-dequeues 20 32,552
persistent-enqueues 0 0
persistent-dequeues 0 0
transactional-enqueues 0 0
transactional-dequeues 0 0
flow-to-disk-depth 0 0
flow-to-disk-enqueues 0 0
flow-to-disk-dequeues 0 0
acquires 20
releases 0
discards-ttl-expired 0
discards-limit-overflow 0
discards-ring-overflow 0
discards-lvq-replace 0
discards-subscriber-reject 0
discards-purged 0
reroutes 0
If total-enqueues equals acquires, then all messages enqueued to the broker have been delivered to a consumer (pulp in this case).
The lack of ACKs will be visible here if total-dequeues is _less_ than acquires.
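As an illustration, applied to the earlier stuck snapshot where `qpid-stat -q` showed resource_manager with msg=1, msgIn=277, msgOut=276: total-enqueues would be 277 and total-dequeues 276, so acquires=277 would mean the one remaining message was delivered to pulp but never acknowledged, whereas acquires=276 would mean it was never delivered at all.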
Adding the above but using the more simplified cert auth:
qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -q resource_manager
(In reply to Mike McCune from comment #24)
> Adding the above but using the more simplified cert auth:
>
> qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b
> amqps://localhost:5671 -q resource_manager
Right, thanks for the simplification. I am about to update the script and attach both to KCS 2044393 to point customers there.
We are going to mark this as resolved and delivered with Satellite 6.1.7. We have not received any updated reports of this occurring with installations that include the qpid/proton updates in Satellite 6.1.7. If we do see this exact scenario with the latest bits installed on both the Satellite and the Capsule, please feel free to re-open.
Satellite: https://access.redhat.com/errata/RHSA-2016:0174
Satellite Tools: https://access.redhat.com/errata/RHBA-2016:0173