Bug 1321644 - Pulp celery_beat and resource_manager no longer running
Summary: Pulp celery_beat and resource_manager no longer running
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Pulp
Version: 6.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: Lukas Pramuk
URL:
Whiteboard:
Duplicates: 1324103 1324701 (view as bug list)
Depends On:
Blocks: GSS_Sat6Beta_Tracker, GSS_Sat6_Tracker
 
Reported: 2016-03-28 18:13 UTC by Mike McCune
Modified: 2021-04-06 17:59 UTC
CC List: 17 users

Fixed In Version: pulp-2.8.1.2-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-27 09:29:24 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
qpid-stat output (10.72 KB, text/plain) - 2016-03-28 18:16 UTC, Michael Hrivnak
pulp_celerybeat status (2.82 KB, text/plain) - 2016-03-28 18:17 UTC, Michael Hrivnak
lsof showing amqp connections from celerybeat (3.41 KB, text/plain) - 2016-03-28 20:56 UTC, Michael Hrivnak


Links
Pulp Redmine 1801 (priority High, status CLOSED - CURRENTRELEASE): Pulp celery_beat and resource_manager are running, but logs say they are not running (last updated 2016-05-17 20:01:02 UTC)
Red Hat Product Errata RHBA-2016:1501 (priority normal, status SHIPPED_LIVE): Red Hat Satellite 6.2 Capsule and Server (last updated 2016-07-27 12:28:58 UTC)

Description Mike McCune 2016-03-28 18:13:16 UTC
After some unknown amount of time, the Pulp infrastructure processes appear to die and we receive these messages in the journal / logs:

pulp.server.async.scheduler:ERROR: There are 0 pulp_resource_manager processes running. Pulp will not operate correctly without at least one pulp_resource_mananger process running.

pulp[25748]: pulp.server.async.scheduler:ERROR: There are 0 pulp_celerybeat processes running. Pulp will not operate correctly without at least one pulp_celerybeat process running.

A restart resolves the issue, but restarting shouldn't be required for normal operation.
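
A quick way to check whether a given system is hitting this (a minimal sketch, assuming the messages land in the system journal as on RHEL 7; on RHEL 6, grep /var/log/messages instead):

# Look for the scheduler's "0 ... processes running" complaints from the last day
journalctl --since "1 day ago" | grep -E "There are 0 pulp_(resource_manager|celerybeat) processes"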

Comment 1 Michael Hrivnak 2016-03-28 18:16:52 UTC
Created attachment 1140973 [details]
qpid-stat output

Comment 2 Michael Hrivnak 2016-03-28 18:17:21 UTC
Created attachment 1140974 [details]
pulp_celerybeat status

Comment 3 Michael Hrivnak 2016-03-28 18:18:01 UTC
# rpm -qa| grep qpid| sort
libqpid-dispatch-0.4-11.el7.x86_64
python-gofer-qpid-2.7.5-1.el7sat.noarch
python-qpid-0.30-9.el7sat.noarch
python-qpid-proton-0.9-13.el7.x86_64
python-qpid-qmf-0.30-5.el7.x86_64
qpid-cpp-client-0.30-11.el7sat.x86_64
qpid-cpp-client-devel-0.30-11.el7sat.x86_64
qpid-cpp-server-0.30-11.el7sat.x86_64
qpid-cpp-server-linearstore-0.30-11.el7sat.x86_64
qpid-dispatch-router-0.4-11.el7.x86_64
qpid-java-client-0.30-3.el7.noarch
qpid-java-common-0.30-3.el7.noarch
qpid-proton-c-0.9-13.el7.x86_64
qpid-qmf-0.30-5.el7.x86_64
qpid-tools-0.30-4.el7.noarch
sat-r220-06.lab.eng.rdu2.redhat.com-qpid-broker-1.0-1.noarch
sat-r220-06.lab.eng.rdu2.redhat.com-qpid-client-cert-1.0-1.noarch
sat-r220-06.lab.eng.rdu2.redhat.com-qpid-router-client-1.0-1.noarch
sat-r220-06.lab.eng.rdu2.redhat.com-qpid-router-server-1.0-1.noarch
tfm-rubygem-qpid_messaging-0.30.0-7.el7sat.x86_64

Comment 5 Michael Hrivnak 2016-03-28 20:51:08 UTC
The number of outstanding messages in the "celeryev..." queue is growing. The "msgOut" number is staying the same.

The three queues associated with the resource manager have not had any new messages in hours.

Comment 6 Michael Hrivnak 2016-03-28 20:56:23 UTC
Created attachment 1141027 [details]
lsof showing amqp connections from celerybeat

Comment 7 Brian Bouterse 2016-03-30 18:17:36 UTC
We need more information from anyone experiencing this bug.

=== Confirm Your System is Affected ===
1. You see the error messages in Comment 0 even though your `sudo ps` output shows pulp_celerybeat, pulp_workers, and pulp_resource_manager processes are running. Please paste your `ps -awfux | grep celery` output.
2. Check whether the msgIn count of the celeryev queue is significantly larger than the msgOut count. See the command to run and an example of msgIn vs msgOut on an "affected" system here: https://bugzilla.redhat.com/attachment.cgi?id=1140973 (a sketch of the command also follows below).
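
For reference, a minimal sketch of that queue check, reusing the qpid-stat invocation and Katello certificate paths shown later in Comment 26 (adjust the paths and broker URL if your deployment differs):

# List the broker queues and compare the msgIn/msgOut columns of the celeryev.<uuid> queue
qpid-stat --ssl-certificate /etc/pki/katello/certs/java-client.crt \
          --ssl-key /etc/pki/katello/private/java-client.key \
          -b "amqps://$(hostname -f):5671" -q | grep celery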

=== Upload or send bmbouter your gcore output ===
As user root run:

# Dump a core file (core.<pid>) for each named celery process (the workers and resource_manager, whose command lines contain "@") so the stacks can be inspected
for pid in $(ps -awfux| grep celery | grep "@" | awk '{ print $2 }'); do gcore $pid; done

I also need the output of `ps -awfux | grep celery` in order to interpret the output.

Comment 8 Brian Bouterse 2016-03-31 10:31:50 UTC
Actually, I reproduced this in my upstream environment. I'm looking through the gcore output to compare an "affected system" against a "normal system" to determine whether there is a client-caused deadlock.

Comment 9 pulp-infra@redhat.com 2016-03-31 11:33:38 UTC
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

Comment 10 pulp-infra@redhat.com 2016-03-31 11:33:41 UTC
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

Comment 11 Brian Bouterse 2016-03-31 16:40:17 UTC
Since this is reproduced on upstream Pulp I am moving updates to that tracker. Once the state changes meaningfully the automation will post that change back to this tracker.

Comment 12 Og Maciel 2016-03-31 17:57:34 UTC
fwiw restarting on RHEL 6 does not resolve the issue for me.

Comment 13 pulp-infra@redhat.com 2016-04-01 21:33:36 UTC
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

Comment 14 Mike McCune 2016-04-04 15:09:24 UTC
*** Bug 1323687 has been marked as a duplicate of this bug. ***

Comment 15 pulp-infra@redhat.com 2016-04-05 13:33:43 UTC
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

Comment 16 Brad Buckingham 2016-04-05 17:13:56 UTC
*** Bug 1324103 has been marked as a duplicate of this bug. ***

Comment 19 Tazim Kolhar 2016-04-07 13:50:12 UTC
*** Bug 1324701 has been marked as a duplicate of this bug. ***

Comment 20 Tazim Kolhar 2016-04-09 12:32:27 UTC
hi

    please provide verification steps

thanks and regards,
tazim

Comment 21 Mike McCune 2016-04-11 04:38:49 UTC
Verification steps:

* Install Satellite 6.2 SNAP7 or higher
* Enable repositories, synchronize content
* Enable Synchronization plan for enabled repositories so they sync 1x/day
* Leave Satellite running for 2 days without rebooting
* After 2 days, ensure that all pulp processes are still running

ps -e f | grep "celery" 

ensure you see ~4+ celery processes
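
As a rough scripted version of that check (a minimal sketch; the [c]elery pattern keeps the grep process itself out of the count, and 4 is only a lower bound, since a default install runs one beat, one resource_manager, and several workers):

# Count running celery processes and flag the host if fewer than 4 are up
count=$(ps -e f | grep -c "[c]elery")
if [ "$count" -ge 4 ]; then echo "OK: $count celery processes"; else echo "PROBLEM: only $count celery processes"; fi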

Comment 25 Roman Plevka 2016-04-24 08:28:17 UTC
VERIFIED
- sat 6.2.0 snap 9.0

The resource workers seem to work fine; there are no errors in syslog.
The scheduled sync has run successfully every day.


# ps -e f | grep "celery" 
17874 pts/1    S+     0:00          \_ grep --color=auto celery
25932 ?        Ssl    8:22 /usr/bin/python /usr/bin/celery beat --app=pulp.server.async.celery_instance.celery --scheduler=pulp.server.async.scheduler.Scheduler
25951 ?        Ssl    1:44 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30
...
25999 ?        Ssl    1:44 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-7@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-7.pid --heartbeat-interval=30
26342 ?        Sl     0:49  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-7@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-7.pid --heartbeat-interval=30

# grep -i ERROR /var/log/messages
#

Comment 26 Lukas Pramuk 2016-04-25 11:26:33 UTC
VERIFIED.

@Sat6.2.0-Beta-Snap9
python-kombu-3.0.33-7.el7sat.noarch
pulp-server-2.8.1.3-1.el7sat.noarch

1. I set up a server syncing various base repos and left it running for 4 days.

2. All celery processes are running:
# ps -efH | grep [c]elery | wc -l
19

3. The number of sent messages (msgOut) in the "celeryev..." queue is growing together with the number of outstanding messages (msgIn).

# qpid-stat --ssl-certificate /etc/pki/katello/certs/java-client.crt --ssl-key /etc/pki/katello/private/java-client.key -b "amqps://$(hostname -f):5671" -q
Queues
  queue                                                                                 dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =====================================================================================================================================================================
...
  celery                                                                                Y                      0  1.62k  1.62k      0   1.43m    1.43m        8     2
  celeryev.acbf4265-5f30-4385-93b0-26b28e075bc9                                              Y                 0   132k   132k      0    118m     118m        1     2
...

msgIn = msgOut

>>> outstanding messages no longer accumulate in the celeryev... queue

Comment 27 pulp-infra@redhat.com 2016-04-26 23:04:07 UTC
The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug.

Comment 28 pulp-infra@redhat.com 2016-05-06 17:20:57 UTC
The Pulp upstream bug status is at VERIFIED. Updating the external tracker on this bug.

Comment 29 pulp-infra@redhat.com 2016-05-17 20:01:03 UTC
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

Comment 31 errata-xmlrpc 2016-07-27 09:29:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1501

