Bug 1321644

Summary:

Pulp celery_beat and resource_manager no longer running

Product:

Red Hat Satellite

Reporter:

Mike McCune <mmccune>

Component:

Pulp

Assignee:

satellite6-bugs <satellite6-bugs>

Status:

CLOSED ERRATA

QA Contact:

Lukas Pramuk <lpramuk>

Severity:

urgent

Docs Contact:

Priority:

unspecified

Version:

6.2.0

CC:

bbuckingham, bmbouter, cwelton, daviddavis, dkliban, ehelms, ggainey, ipanova, mhrivnak, mmccune, omaciel, oshtaier, pcreech, rchan, rplevka, ttereshc, xdmoon

Target Milestone:

Unspecified

Keywords:

PrioBumpQA, Triaged

Target Release:

Unused

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

pulp-2.8.1.2-1

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2016-07-27 09:29:24 UTC

Type:

Bug

Bug Depends On:

Bug Blocks:

1115190

Attachments:

Description	Flags
qpid-stat output	none
pulp_celerybeat status	none
lsof showing amqp connections from celerybeat	none

Description Mike McCune 2016-03-28 18:13:16 UTC

After an unknown amount of time, the Pulp infrastructure processes appear to die and we receive these messages in the journal / logs:

pulp.server.async.scheduler:ERROR: There are 0 pulp_resource_manager processes running. Pulp will not operate correctly without at least one pulp_resource_mananger process running.

pulp[25748]: pulp.server.async.scheduler:ERROR: There are 0 pulp_celerybeat processes running. Pulp will not operate correctly without at least one pulp_celerybeat process running.

A restart resolves the issue, but a restart shouldn't be required for normal operation.
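
To spot this condition without reading the journal by hand, something like the following could be used. It is only a sketch: the regular expression is derived from the two messages quoted above, the function name is made up for illustration, and the sample lines are the first sentence of each message.

```python
import re

# Pattern derived from the scheduler errors quoted above (an assumption:
# other Pulp versions may word the message differently).
MISSING = re.compile(r"There are 0 (pulp_\w+) processes running")


def missing_processes(log_lines):
    """Return the sorted names of Pulp processes reported as not running."""
    found = set()
    for line in log_lines:
        match = MISSING.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


sample = [
    "pulp.server.async.scheduler:ERROR: There are 0 pulp_resource_manager processes running.",
    "pulp[25748]: pulp.server.async.scheduler:ERROR: There are 0 pulp_celerybeat processes running.",
    "pulp[25748]: an unrelated line",
]
print(missing_processes(sample))
```

Feeding it the lines of /var/log/messages or journal output should list which process types the scheduler believes are gone.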

Comment 1 Michael Hrivnak 2016-03-28 18:16:52 UTC

Created attachment 1140973 [details]
qpid-stat output

Comment 2 Michael Hrivnak 2016-03-28 18:17:21 UTC

Created attachment 1140974 [details]
pulp_celerybeat status

Comment 3 Michael Hrivnak 2016-03-28 18:18:01 UTC

# rpm -qa| grep qpid| sort
libqpid-dispatch-0.4-11.el7.x86_64
python-gofer-qpid-2.7.5-1.el7sat.noarch
python-qpid-0.30-9.el7sat.noarch
python-qpid-proton-0.9-13.el7.x86_64
python-qpid-qmf-0.30-5.el7.x86_64
qpid-cpp-client-0.30-11.el7sat.x86_64
qpid-cpp-client-devel-0.30-11.el7sat.x86_64
qpid-cpp-server-0.30-11.el7sat.x86_64
qpid-cpp-server-linearstore-0.30-11.el7sat.x86_64
qpid-dispatch-router-0.4-11.el7.x86_64
qpid-java-client-0.30-3.el7.noarch
qpid-java-common-0.30-3.el7.noarch
qpid-proton-c-0.9-13.el7.x86_64
qpid-qmf-0.30-5.el7.x86_64
qpid-tools-0.30-4.el7.noarch
sat-r220-06.lab.eng.rdu2.redhat.com-qpid-broker-1.0-1.noarch
sat-r220-06.lab.eng.rdu2.redhat.com-qpid-client-cert-1.0-1.noarch
sat-r220-06.lab.eng.rdu2.redhat.com-qpid-router-client-1.0-1.noarch
sat-r220-06.lab.eng.rdu2.redhat.com-qpid-router-server-1.0-1.noarch
tfm-rubygem-qpid_messaging-0.30.0-7.el7sat.x86_64

Comment 5 Michael Hrivnak 2016-03-28 20:51:08 UTC

The number of outstanding messages in the "celeryev..." queue is growing. The "msgOut" number is staying the same.

The three queues associated with the resource manager have not had any new messages in hours.

Comment 6 Michael Hrivnak 2016-03-28 20:56:23 UTC

Created attachment 1141027 [details]
lsof showing amqp connections from celerybeat

Comment 7 Brian Bouterse 2016-03-30 18:17:36 UTC

We need more information from anyone experiencing this bug.

=== Confirm Your System is Affected ===
1. You see the error messages in Comment 0 even though your `sudo ps` output shows pulp_celerybeat, pulp_workers, and pulp_resource_manager processes are running. Please paste your `ps -awfux | grep celery` output.
2. Check that the msgIn count of the celeryev queue is significantly larger than the msgOut message count. See the command to run and an example of msgIn vs msgOut of an "affected" system here: https://bugzilla.redhat.com/attachment.cgi?id=1140973
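
The msgIn vs msgOut check in step 2 could be scripted roughly as below. This is a sketch, not a supported tool: it assumes the `qpid-stat -q` row layout seen on this bug (queue name, optional flag columns, then msg, msgIn, msgOut, bytes, bytesIn, bytesOut, cons, bind), assumes the k/m/g suffixes are decimal multipliers, and uses an arbitrary ratio of 2 as the meaning of "significantly larger". The first sample row is the one posted in Comment 26; the second is invented.

```python
# Assumed decimal multipliers for the abbreviated numbers qpid-stat prints.
SUFFIXES = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def parse_count(text):
    """Convert an abbreviated number such as '132k' to a float."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def queue_counts(row):
    """Return (queue, msg, msgIn, msgOut) from one `qpid-stat -q` data row.

    The flag columns (dur, autoDel, excl) can be blank, so the numbers are
    taken from the end of the row, where the last eight fields are
    msg, msgIn, msgOut, bytes, bytesIn, bytesOut, cons, bind.
    """
    fields = row.split()
    msg, msg_in, msg_out = (parse_count(f) for f in fields[-8:-5])
    return fields[0], msg, msg_in, msg_out


def looks_stalled(msg_in, msg_out, ratio=2.0):
    """Heuristic: msgIn is at least `ratio` times msgOut (ratio is a guess)."""
    return msg_in > 0 and msg_in >= ratio * max(msg_out, 1.0)


# Row from Comment 26 (healthy system).
healthy = "celeryev.acbf4265-5f30-4385-93b0-26b28e075bc9  Y  0  132k  132k  0  118m  118m  1  2"
# Hypothetical row for illustration only; these numbers are invented.
stalled = "celeryev.hypothetical-queue  Y  9.5k  10k  500  8.5m  9.0m  450k  1  2"

for row in (healthy, stalled):
    name, msg, msg_in, msg_out = queue_counts(row)
    print(name, "stalled" if looks_stalled(msg_in, msg_out) else "ok")
```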

=== Upload or send bmbouter your gcore output ===
As user root run:

for pid in $(ps -awfux| grep celery | grep "@" | awk '{ print $2 }'); do gcore $pid; done

I also need the output of `ps -awfux | grep celery` in order to interpret the output.
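
For anyone who wants to see which PIDs the loop above will dump before running gcore, here is a rough Python equivalent of the `grep celery | grep "@" | awk '{ print $2 }'` part. It is a sketch under assumptions: the helper name is made up, and the sample `ps` output is hypothetical apart from the command names. Note that a command line without an "@" (celery beat appears to be one, going by the process listings on this bug) is not selected by this filter.

```python
def celery_worker_pids(ps_output):
    """Return PIDs (second column) of lines containing 'celery' and '@'."""
    pids = []
    for line in ps_output.splitlines():
        if "celery" in line and "@" in line:
            fields = line.split()
            # Guard against short or malformed lines.
            if len(fields) > 1 and fields[1].isdigit():
                pids.append(int(fields[1]))
    return pids


# Hypothetical `ps -awfux` output; the column values are invented.
sample_ps = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
apache   25932  0.1  0.3 660000 30000 ?        Ssl  Mar28   8:22 /usr/bin/python /usr/bin/celery beat --app=pulp.server.async.celery_instance.celery
apache   25951  0.0  0.3 660000 30000 ?        Ssl  Mar28   1:44 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h
root     17874  0.0  0.0 112000   900 pts/1    S+   10:00   0:00 grep celery
"""

print(celery_worker_pids(sample_ps))
```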

Comment 8 Brian Bouterse 2016-03-31 10:31:50 UTC

Actually, I reproduced this on my upstream environment. I'm looking through the gcore output to compare an "affected system" with a "normal system" to determine whether there is a client-caused deadlock.

Comment 9 pulp-infra@redhat.com 2016-03-31 11:33:38 UTC

The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

Comment 10 pulp-infra@redhat.com 2016-03-31 11:33:41 UTC

The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

Comment 11 Brian Bouterse 2016-03-31 16:40:17 UTC

Since this is reproduced on upstream Pulp I am moving updates to that tracker. Once the state changes meaningfully the automation will post that change back to this tracker.

Comment 12 Og Maciel 2016-03-31 17:57:34 UTC

FWIW, restarting on RHEL 6 does not resolve the issue for me.

Comment 13 pulp-infra@redhat.com 2016-04-01 21:33:36 UTC

The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

Comment 14 Mike McCune 2016-04-04 15:09:24 UTC

*** Bug 1323687 has been marked as a duplicate of this bug. ***

Comment 15 pulp-infra@redhat.com 2016-04-05 13:33:43 UTC

The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

Comment 16 Brad Buckingham 2016-04-05 17:13:56 UTC

*** Bug 1324103 has been marked as a duplicate of this bug. ***

Comment 19 Tazim Kolhar 2016-04-07 13:50:12 UTC

*** Bug 1324701 has been marked as a duplicate of this bug. ***

Comment 20 Tazim Kolhar 2016-04-09 12:32:27 UTC

hi

    please provide verification steps

thanks and regards,
tazim

Comment 21 Mike McCune 2016-04-11 04:38:49 UTC

Verification steps:

* Install Satellite 6.2 SNAP7 or higher
* Enable repositories, synchronize content
* Enable Synchronization plan for enabled repositories so they sync 1x/day
* Leave Satellite running for 2 days without rebooting
* After 2 days, ensure that all pulp processes are still running

ps -e f | grep "celery" 

ensure you see ~4+ celery processes
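
The final check could also be scripted so it does not depend on counting lines by eye. The sketch below is an assumption-laden helper, not part of Pulp: it classifies lines by substrings taken from the celery command lines seen on Satellite 6.2 (command lines shortened here), and the function names are made up.

```python
def celery_roles(ps_output):
    """Count celery processes per role in `ps -e f` style output."""
    roles = {"beat": 0, "resource_manager": 0, "worker": 0}
    for line in ps_output.splitlines():
        # Skips the grep process itself, which lacks the full celery path.
        if "/usr/bin/celery" not in line:
            continue
        if "celery beat" in line:
            roles["beat"] += 1
        elif "resource_manager@" in line:
            roles["resource_manager"] += 1
        elif "reserved_resource_worker-" in line:
            roles["worker"] += 1
    return roles


def all_roles_present(roles):
    """True when at least one process of every role is running."""
    return all(count > 0 for count in roles.values())


# Raw string because the ps forest view contains a backslash.
sample_ps = r"""
17874 pts/1    S+     0:00          \_ grep --color=auto celery
25932 ?        Ssl    8:22 /usr/bin/python /usr/bin/celery beat --app=pulp.server.async.celery_instance.celery
25951 ?        Ssl    1:44 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1
25999 ?        Ssl    1:44 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-7@%h -A pulp.server.async.app -c 1
26342 ?        Sl     0:49  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-7@%h -A pulp.server.async.app -c 1
"""

roles = celery_roles(sample_ps)
print(roles, all_roles_present(roles))
```

Each worker shows up twice (parent and child), so the per-role counts are more useful than the raw total.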

Comment 25 Roman Plevka 2016-04-24 08:28:17 UTC

VERIFIED
- sat 6.2.0 snap 9.0

The resource workers seem to work fine; there are no errors in syslog.
Scheduled sync has run successfully every day.


<pre>
# ps -e f | grep "celery" 
17874 pts/1    S+     0:00          \_ grep --color=auto celery
25932 ?        Ssl    8:22 /usr/bin/python /usr/bin/celery beat --app=pulp.server.async.celery_instance.celery --scheduler=pulp.server.async.scheduler.Scheduler
25951 ?        Ssl    1:44 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30
...
25999 ?        Ssl    1:44 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-7@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-7.pid --heartbeat-interval=30
26342 ?        Sl     0:49  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-7@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-7.pid --heartbeat-interval=30

# grep -i ERROR /var/log/messages
#
</pre>

Comment 26 Lukas Pramuk 2016-04-25 11:26:33 UTC

VERIFIED.

@Sat6.2.0-Beta-Snap9
python-kombu-3.0.33-7.el7sat.noarch
pulp-server-2.8.1.3-1.el7sat.noarch

1. I set up a server syncing various base repos and left it running for 4 days

2. All celery processes are running:
# ps -efH | grep [c]elery | wc -l
19

3. The number of sent messages (msgOut) in the "celeryev..." queue is growing in step with the number of received messages (msgIn)

# qpid-stat --ssl-certificate /etc/pki/katello/certs/java-client.crt --ssl-key /etc/pki/katello/private/java-client.key -b "amqps://$(hostname -f):5671" -q
Queues
  queue                                                                                 dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =====================================================================================================================================================================
...
  celery                                                                                Y                      0  1.62k  1.62k      0   1.43m    1.43m        8     2
  celeryev.acbf4265-5f30-4385-93b0-26b28e075bc9                                              Y                 0   132k   132k      0    118m     118m        1     2
...

msgIn = msgOut

>>> outstanding messages no longer accumulate in the celeryev... queue

Comment 27 pulp-infra@redhat.com 2016-04-26 23:04:07 UTC

The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug.

Comment 28 pulp-infra@redhat.com 2016-05-06 17:20:57 UTC

The Pulp upstream bug status is at VERIFIED. Updating the external tracker on this bug.

Comment 29 pulp-infra@redhat.com 2016-05-17 20:01:03 UTC

The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

Comment 31 errata-xmlrpc 2016-07-27 09:29:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1501