Bug 1321644

Summary: Pulp celery_beat and resource_manager no longer running
Product: Red Hat Satellite 6 Reporter: Mike McCune <mmccune>
Component: Pulp Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED ERRATA QA Contact: Lukas Pramuk <lpramuk>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 6.2.0 CC: bbuckingham, cwelton, daviddavis, dkliban, ehelms, mhrivnak, mmccune, omaciel, oshtaier, pcreech, rchan, rplevka, ttereshc, xdmoon
Target Milestone: Unspecified Keywords: PrioBumpQA, Triaged
Target Release: Unused   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: pulp-2.8.1.2-1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-07-27 09:29:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1115190    
Attachments:
Description Flags
qpid-stat output none
pulp_celerybeat status none
lsof showing amqp connections from celerybeat none

Description Mike McCune 2016-03-28 18:13:16 UTC
After some unknown amount of time, Pulp infrastructure processes appear to die, and we receive these messages in the journal / logs:

pulp.server.async.scheduler:ERROR: There are 0 pulp_resource_manager processes running. Pulp will not operate correctly without at least one pulp_resource_mananger process running.

pulp[25748]: pulp.server.async.scheduler:ERROR: There are 0 pulp_celerybeat processes running. Pulp will not operate correctly without at least one pulp_celerybeat process running.

A restart resolves the issue, but restarting shouldn't be required for normal operation.
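
To spot the condition without reading the journal by hand, the two messages above can be matched mechanically. The following is a minimal Python sketch, not part of Pulp; the function name `missing_processes` and the regular expression are illustrative and assume the message wording stays exactly as quoted in this description.

```python
import re

# Matches the scheduler error quoted above and captures the process name.
# Assumption: the wording "There are 0 <name> processes running" is stable.
_ZERO_RUNNING = re.compile(r"There are 0 (pulp_\w+) processes running")


def missing_processes(log_lines):
    """Return a sorted list of process names reported with 0 running instances."""
    found = set()
    for line in log_lines:
        match = _ZERO_RUNNING.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)
```

Feeding it the lines of `/var/log/messages` (or journal output) returns an empty list on a healthy system.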

Comment 1 Michael Hrivnak 2016-03-28 18:16:52 UTC
Created attachment 1140973 [details]
qpid-stat output

Comment 2 Michael Hrivnak 2016-03-28 18:17:21 UTC
Created attachment 1140974 [details]
pulp_celerybeat status

Comment 3 Michael Hrivnak 2016-03-28 18:18:01 UTC
# rpm -qa| grep qpid| sort
libqpid-dispatch-0.4-11.el7.x86_64
python-gofer-qpid-2.7.5-1.el7sat.noarch
python-qpid-0.30-9.el7sat.noarch
python-qpid-proton-0.9-13.el7.x86_64
python-qpid-qmf-0.30-5.el7.x86_64
qpid-cpp-client-0.30-11.el7sat.x86_64
qpid-cpp-client-devel-0.30-11.el7sat.x86_64
qpid-cpp-server-0.30-11.el7sat.x86_64
qpid-cpp-server-linearstore-0.30-11.el7sat.x86_64
qpid-dispatch-router-0.4-11.el7.x86_64
qpid-java-client-0.30-3.el7.noarch
qpid-java-common-0.30-3.el7.noarch
qpid-proton-c-0.9-13.el7.x86_64
qpid-qmf-0.30-5.el7.x86_64
qpid-tools-0.30-4.el7.noarch
sat-r220-06.lab.eng.rdu2.redhat.com-qpid-broker-1.0-1.noarch
sat-r220-06.lab.eng.rdu2.redhat.com-qpid-client-cert-1.0-1.noarch
sat-r220-06.lab.eng.rdu2.redhat.com-qpid-router-client-1.0-1.noarch
sat-r220-06.lab.eng.rdu2.redhat.com-qpid-router-server-1.0-1.noarch
tfm-rubygem-qpid_messaging-0.30.0-7.el7sat.x86_64

Comment 5 Michael Hrivnak 2016-03-28 20:51:08 UTC
The number of outstanding messages in the "celeryev..." queue is growing. The "msgOut" number is staying the same.

The three queues associated with the resource manager have not had any new messages in hours.

Comment 6 Michael Hrivnak 2016-03-28 20:56:23 UTC
Created attachment 1141027 [details]
lsof showing amqp connections from celerybeat

Comment 7 Brian Bouterse 2016-03-30 18:17:36 UTC
We need more information from anyone experiencing this bug.

=== Confirm Your System is Affected ===
1. You see the error messages in Comment 0 even though your `sudo ps` output shows pulp_celerybeat, pulp_workers, and pulp_resource_manager processes are running. Please paste your `ps -awfux | grep celery` output.
2. Check that the msgIn count of the celeryev queue is significantly larger than the msgOut message count. See the command to run and an example of msgIn vs msgOut of an "affected" system here: https://bugzilla.redhat.com/attachment.cgi?id=1140973
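
The msgIn versus msgOut comparison in step 2 could be automated roughly as follows. This is a minimal Python sketch, not part of Pulp or qpid-tools. The function names and the 5% tolerance are illustrative. It assumes each `qpid-stat -q` queue row ends with the columns msg, msgIn, msgOut, bytes, bytesIn, bytesOut, cons, bind, and that large counters carry k/m/g suffixes.

```python
# Multipliers for the abbreviated counters (assumed: k = 1e3, m = 1e6, g = 1e9).
_SUFFIX = {"k": 1e3, "m": 1e6, "g": 1e9}


def parse_count(text):
    """Convert a counter such as '132k' or '0' to a float."""
    text = text.strip().lower()
    if text and text[-1] in _SUFFIX:
        return float(text[:-1]) * _SUFFIX[text[-1]]
    return float(text)


def queue_backlog(row):
    """Return (queue_name, msg_in, msg_out) from one queue row.

    Assumes the last eight whitespace-separated fields are
    msg, msgIn, msgOut, bytes, bytesIn, bytesOut, cons, bind.
    The optional dur/autoDel/excl flags before them are ignored.
    """
    fields = row.split()
    return fields[0], parse_count(fields[-7]), parse_count(fields[-6])


def is_stalled(row, tolerance=0.05):
    """True if msgIn exceeds msgOut by more than `tolerance` of msgIn.

    The threshold is a hypothetical stand-in for "significantly larger".
    """
    _, msg_in, msg_out = queue_backlog(row)
    return msg_in > 0 and (msg_in - msg_out) > tolerance * msg_in
```

Only the `celeryev...` rows need to be passed to `is_stalled`; header and separator lines should be filtered out first.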

=== Upload or send bmbouter your gcore output ===
As user root run:

for pid in $(ps -awfux| grep celery | grep "@" | awk '{ print $2 }'); do gcore $pid; done

I also need the output of `ps -awfux | grep celery` in order to interpret the output.

Comment 8 Brian Bouterse 2016-03-31 10:31:50 UTC
Actually, I reproduced this in my upstream environment. I'm looking through the gcore output to compare an "affected" system with a "normal" system to determine whether there is a client-caused deadlock.

Comment 9 pulp-infra@redhat.com 2016-03-31 11:33:38 UTC
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

Comment 10 pulp-infra@redhat.com 2016-03-31 11:33:41 UTC
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

Comment 11 Brian Bouterse 2016-03-31 16:40:17 UTC
Since this is reproduced on upstream Pulp, I am moving updates to that tracker. Once the state changes meaningfully, the automation will post that change back to this tracker.

Comment 12 Og Maciel 2016-03-31 17:57:34 UTC
fwiw restarting on RHEL 6 does not resolve the issue for me.

Comment 13 pulp-infra@redhat.com 2016-04-01 21:33:36 UTC
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

Comment 14 Mike McCune 2016-04-04 15:09:24 UTC
*** Bug 1323687 has been marked as a duplicate of this bug. ***

Comment 15 pulp-infra@redhat.com 2016-04-05 13:33:43 UTC
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

Comment 16 Brad Buckingham 2016-04-05 17:13:56 UTC
*** Bug 1324103 has been marked as a duplicate of this bug. ***

Comment 19 Tazim Kolhar 2016-04-07 13:50:12 UTC
*** Bug 1324701 has been marked as a duplicate of this bug. ***

Comment 20 Tazim Kolhar 2016-04-09 12:32:27 UTC
hi

    please provide verification steps

thanks and regards,
tazim

Comment 21 Mike McCune 2016-04-11 04:38:49 UTC
Verification steps:

* Install Satellite 6.2 SNAP7 or higher
* Enable repositories, synchronize content
* Enable Synchronization plan for enabled repositories so they sync 1x/day
* Leave Satellite running for 2 days without rebooting
* After 2 days, ensure that all pulp processes are still running

ps -e f | grep "celery" 

ensure you see ~4+ celery processes
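
The final check can also be scripted. Below is a minimal Python sketch that classifies `ps` output lines by role. It is illustrative only: the function name is hypothetical, and it assumes command lines of the form `celery beat ...`, `celery worker ... -n resource_manager@%h ...`, and `celery worker -n reserved_resource_worker-N@%h ...`.

```python
def classify_celery_processes(ps_lines):
    """Count Pulp celery processes by role from `ps` output lines.

    Forked child processes share the parent's command line, so they
    are counted as well.
    """
    counts = {"celerybeat": 0, "resource_manager": 0, "worker": 0}
    for line in ps_lines:
        tokens = line.split()
        # Locate the celery executable; this skips the `grep celery` line itself.
        idx = next((i for i, t in enumerate(tokens) if t.endswith("/celery")), None)
        if idx is None or idx + 1 >= len(tokens):
            continue
        subcommand = tokens[idx + 1]
        if subcommand == "beat":
            counts["celerybeat"] += 1
        elif subcommand == "worker":
            name = ""
            if "-n" in tokens[idx:]:
                n_pos = tokens.index("-n", idx)
                if n_pos + 1 < len(tokens):
                    name = tokens[n_pos + 1]  # node name follows -n
            if name.startswith("resource_manager@"):
                counts["resource_manager"] += 1
            elif name.startswith("reserved_resource_worker"):
                counts["worker"] += 1
    return counts
```

A system passes this check when every value in the returned dictionary is at least 1.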

Comment 25 Roman Plevka 2016-04-24 08:28:17 UTC
VERIFIED
- sat 6.2.0 snap 9.0

The resource workers seem to work fine; there are no errors in syslog.
Scheduled sync has successfully run every day.


<pre>
# ps -e f | grep "celery" 
17874 pts/1    S+     0:00          \_ grep --color=auto celery
25932 ?        Ssl    8:22 /usr/bin/python /usr/bin/celery beat --app=pulp.server.async.celery_instance.celery --scheduler=pulp.server.async.scheduler.Scheduler
25951 ?        Ssl    1:44 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30
...
25999 ?        Ssl    1:44 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-7@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-7.pid --heartbeat-interval=30
26342 ?        Sl     0:49  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-7@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-7.pid --heartbeat-interval=30

# grep -i ERROR /var/log/messages
#
</pre>

Comment 26 Lukas Pramuk 2016-04-25 11:26:33 UTC
VERIFIED.

@Sat6.2.0-Beta-Snap9
python-kombu-3.0.33-7.el7sat.noarch
pulp-server-2.8.1.3-1.el7sat.noarch

1. I set up a server syncing various base repos and left it running for 4 days.

2. All celery processes are running:
# ps -efH | grep [c]elery | wc -l
19

3. The number of sent messages (msgOut) in the "celeryev..." queue is growing together with the number of received messages (msgIn):

# qpid-stat --ssl-certificate /etc/pki/katello/certs/java-client.crt --ssl-key /etc/pki/katello/private/java-client.key -b "amqps://$(hostname -f):5671" -q
Queues
  queue                                                                                 dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =====================================================================================================================================================================
...
  celery                                                                                Y                      0  1.62k  1.62k      0   1.43m    1.43m        8     2
  celeryev.acbf4265-5f30-4385-93b0-26b28e075bc9                                              Y                 0   132k   132k      0    118m     118m        1     2
...

msgIn = msgOut

>>> outstanding messages no longer accumulate in the "celeryev..." queue

Comment 27 pulp-infra@redhat.com 2016-04-26 23:04:07 UTC
The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug.

Comment 28 pulp-infra@redhat.com 2016-05-06 17:20:57 UTC
The Pulp upstream bug status is at VERIFIED. Updating the external tracker on this bug.

Comment 29 pulp-infra@redhat.com 2016-05-17 20:01:03 UTC
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

Comment 31 errata-xmlrpc 2016-07-27 09:29:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1501