Bug 1321644
Summary: | Pulp celery_beat and resource_manager no longer running | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Satellite | Reporter: | Mike McCune <mmccune> | ||||||||
Component: | Pulp | Assignee: | satellite6-bugs <satellite6-bugs> | ||||||||
Status: | CLOSED ERRATA | QA Contact: | Lukas Pramuk <lpramuk> | ||||||||
Severity: | urgent | Docs Contact: | |||||||||
Priority: | unspecified | ||||||||||
Version: | 6.2.0 | CC: | bbuckingham, bmbouter, cwelton, daviddavis, dkliban, ehelms, ggainey, ipanova, mhrivnak, mmccune, omaciel, oshtaier, pcreech, rchan, rplevka, ttereshc, xdmoon | ||||||||
Target Milestone: | Unspecified | Keywords: | PrioBumpQA, Triaged | ||||||||
Target Release: | Unused | ||||||||||
Hardware: | Unspecified | ||||||||||
OS: | Unspecified | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | pulp-2.8.1.2-1 | Doc Type: | Bug Fix | ||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | Environment: | ||||||||||
Last Closed: | 2016-07-27 09:29:24 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Bug Depends On: | |||||||||||
Bug Blocks: | 1115190 | ||||||||||
Attachments: |
|
Description
Mike McCune
2016-03-28 18:13:16 UTC
Created attachment 1140973 [details]
qpid-stat output
Created attachment 1140974 [details]
pulp_celerybeat status
# rpm -qa| grep qpid| sort libqpid-dispatch-0.4-11.el7.x86_64 python-gofer-qpid-2.7.5-1.el7sat.noarch python-qpid-0.30-9.el7sat.noarch python-qpid-proton-0.9-13.el7.x86_64 python-qpid-qmf-0.30-5.el7.x86_64 qpid-cpp-client-0.30-11.el7sat.x86_64 qpid-cpp-client-devel-0.30-11.el7sat.x86_64 qpid-cpp-server-0.30-11.el7sat.x86_64 qpid-cpp-server-linearstore-0.30-11.el7sat.x86_64 qpid-dispatch-router-0.4-11.el7.x86_64 qpid-java-client-0.30-3.el7.noarch qpid-java-common-0.30-3.el7.noarch qpid-proton-c-0.9-13.el7.x86_64 qpid-qmf-0.30-5.el7.x86_64 qpid-tools-0.30-4.el7.noarch sat-r220-06.lab.eng.rdu2.redhat.com-qpid-broker-1.0-1.noarch sat-r220-06.lab.eng.rdu2.redhat.com-qpid-client-cert-1.0-1.noarch sat-r220-06.lab.eng.rdu2.redhat.com-qpid-router-client-1.0-1.noarch sat-r220-06.lab.eng.rdu2.redhat.com-qpid-router-server-1.0-1.noarch tfm-rubygem-qpid_messaging-0.30.0-7.el7sat.x86_64 The number of outstanding messages in the "celeryev..." queue is growing. The "msgOut" number is staying the same. The three queues associated with the resource manager have not had any new messages in hours. Created attachment 1141027 [details]
lsof showing amqp connections from celerybeat
We need more information from anyone experiencing this bug. === Confirm Your System is Affected === 1. You see the error messages in Comment 0 even though your `sudo ps` output shows pulp_celerybeat, pulp_workers, and pulp_resource_manager processes are running. Please paste your `ps -awfux | grep celery` output. 2. Check that the msgIn count of the celeryev queue is significantly larger than the msgOut message count. See the command to run and an example of msgIn vs msgOut of an "affected" system here: https://bugzilla.redhat.com/attachment.cgi?id=1140973 === Upload or send bmbouter your gcore output === As user root run: for pid in $(ps -awfux| grep celery | grep "@" | awk '{ print $2 }'); do gcore $pid; done I also need the output of `ps -awfux | grep celery` in order to interpret the output. Actually I reproduced this on my upstream environment I'm looking through the gcore output to compare an "affected system" versus a "normal system" to determine if there is a client caused deadlock. The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug. The Pulp upstream bug priority is at High. Updating the external tracker on this bug. Since this is reproduced on upstream Pulp I am moving updates to that tracker. Once the state changes meaningfully the automation will post that change back to this tracker. fwiw restarting on RHEL 6 does not resolve the issue for me. The Pulp upstream bug status is at POST. Updating the external tracker on this bug. *** Bug 1323687 has been marked as a duplicate of this bug. *** The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug. *** Bug 1324103 has been marked as a duplicate of this bug. *** *** Bug 1324701 has been marked as a duplicate of this bug. *** hi please provide verification steps thanks and regards, tazim Verification steps: * Install Satellite 6.2 SNAP7 or higher * Enable repositories, synchronize content * Enable Synchronization plan for enabled repositories so they sync 1x/day * Leave Satellite running for 2 days without rebooting * After 2 days, ensure that all pulp processes are still running ps -e f | grep "celery" ensure you see ~4+ celery processes VERIFIED - sat 6.2.0 snap 9.0 The resource workers seem to work fine, there are no errors in syslog. Scheduled sync has successfully ran every day. <pre> # ps -e f | grep "celery" 17874 pts/1 S+ 0:00 \_ grep --color=auto celery 25932 ? Ssl 8:22 /usr/bin/python /usr/bin/celery beat --app=pulp.server.async.celery_instance.celery --scheduler=pulp.server.async.scheduler.Scheduler 25951 ? Ssl 1:44 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30 ... 25999 ? Ssl 1:44 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-7@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-7.pid --heartbeat-interval=30 26342 ? Sl 0:49 \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-7@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-7.pid --heartbeat-interval=30 # grep -i ERROR /var/log/messages # </pre> VERIFIED.
@Sat6.2.0-Beta-Snap9
python-kombu-3.0.33-7.el7sat.noarch
pulp-server-2.8.1.3-1.el7sat.noarch
1. I setup server syncing various base repos and leave it runing for 4 days
2. All celery processes are running:
# ps -efH | grep [c]elery | wc -l
19
3. The number of sent messages (msgOut) in the "celeryev..." queue is growing together with the number of outstanding messages (msgIn)
# qpid-stat --ssl-certificate /etc/pki/katello/certs/java-client.crt --ssl-key /etc/pki/katello/private/java-client.key -b "amqps://$(hostname -f):5671" -q
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=====================================================================================================================================================================
...
celery Y 0 1.62k 1.62k 0 1.43m 1.43m 8 2
celeryev.acbf4265-5f30-4385-93b0-26b28e075bc9 Y 0 132k 132k 0 118m 118m 1 2
...
msgIn = msgOut
>>> oustanding messages no longer cumulate in celeryev... queue
The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug. The Pulp upstream bug status is at VERIFIED. Updating the external tracker on this bug. The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1501 |