Bug 1277456

Summary: Make pulp processes more robust / recoverable / verbose on problems
Product: Red Hat Satellite
Component: Pulp
Version: 6.1.3
Status: CLOSED UPSTREAM
Severity: high
Priority: high
Reporter: Pavel Moravec <pmoravec>
Assignee: satellite6-bugs <satellite6-bugs>
QA Contact: Katello QA List <katello-qa-list>
CC: bugzilla_rhn, sauchter
Target Milestone: Unspecified
Target Release: Unused
Hardware: All
OS: Linux
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-11-13 17:00:45 UTC

Description Pavel Moravec 2015-11-03 11:22:26 UTC
Description of problem:
Pulp consists of several processes that are, to some extent, monitored to check whether they are running. However, when a Pulp process dies:
1) nothing is logged about what happened or why the process exited
2) the monitoring merely detects that the process is gone; it does not attempt to respawn it.

This monitoring is useless if its only activity is to log

Workers 'resource_manager.com' has gone missing, removing from list of workers

Please improve the monitoring (or replace it with something else, or implement something new) so that both 1) and 2) are addressed.

1) is required for root-cause analysis of why a worker or resource manager died; currently there is no clue at all, and neither GSS nor developers have any way to investigate.

2) is required to make Pulp more stable and robust by taking care of dead processes.


Version-Release number of selected component (if applicable):
Sat 6.1.3


How reproducible:
n.a.


Steps to Reproduce:
??? somehow cause a pulp worker process or the resource manager to disappear silently
check what is logged and whether pulp restarts it


Actual results:
no log explaining why the worker is gone
no attempt made to restore it


Expected results:
a proper log entry explaining why the worker is gone
an attempt to restart the worker


Additional info:
The lack of robustness is all the more surprising when compared with the fact that both the Pulp Celery worker threads and the resource manager thread send heartbeats via the qpid broker every second, and even _write_ that activity to MongoDB (the workers collection). Why are the processes probed so heavily, and why are the keepalive probes written to the backend database _every second_ (!!!), if points 1) and 2) are not implemented?
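
For reference, a minimal sketch of how those recorded heartbeats can be inspected directly in MongoDB; the database name, collection name, and field names below reflect a typical Pulp 2 / Satellite 6.1 installation and should be treated as assumptions to verify locally:

# Minimal sketch: inspect worker heartbeat records in MongoDB.
# Assumptions: the Pulp database is named "pulp_database" and worker
# heartbeats live in the "workers" collection with a "last_heartbeat"
# field -- verify these names on your own installation.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
workers = client["pulp_database"]["workers"]

for w in workers.find():
    print("%s  %s" % (w.get("_id"), w.get("last_heartbeat")))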

Comment 2 Pavel Moravec 2015-11-03 12:14:22 UTC
One curious observation:

Pulp sends heartbeats every 2 seconds (at least by my observation) and writes them to MongoDB every 2 seconds. Yet it takes long minutes to detect that a worker is gone, or to recover. Why?

Assume this scenario:
1) pulp processes freshly restarted, everything working
2) select a worker and (via iptables) drop all packets between that process and the qpidd broker (there should be two TCP connections, so I block INPUT on both --sport and --dport of the client ports of those connections; see the sketch after this list)
3) note the time when the connection outage simulation starts (in my case 12:39:45)
4) observe /var/log/messages for connection-loss detection; in my case, pulp.server.async.scheduler:ERROR: .. was logged at 12:44:49, celery.worker.consumer:WARNING: Connection to broker lost was logged at 12:55:09, and only at 13:10:46 did the worker reconnect and get discovered by worker_watcher.
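
A rough sketch of step 2, assuming the client-side ports of the worker's two broker connections have already been looked up (for example with ss or lsof); the port numbers below are placeholders, not real values:

# Rough sketch: block the worker's two TCP connections to qpidd, as in step 2.
# The client ports are placeholders and must be looked up first on the
# affected system; run as root.
import subprocess

client_ports = [45678, 45680]  # placeholder client-side ports of the two connections

for port in client_ports:
    # Drop inbound packets addressed to this client port ...
    subprocess.check_call(["iptables", "-I", "INPUT", "-p", "tcp",
                           "--dport", str(port), "-j", "DROP"])
    # ... and inbound packets whose source port is this client port
    # (i.e. traffic arriving at qpidd from the worker).
    subprocess.check_call(["iptables", "-I", "INPUT", "-p", "tcp",
                           "--sport", str(port), "-j", "DROP"])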

So the pulp worker has a 2-second heartbeat interval at the application level, yet the scheduler detects that the worker is gone only after 5 minutes? The worker itself notices after 15 minutes? And reconnects only after another 15 minutes? So with a 2-second heartbeat, a worker thread recovers from a connection problem after roughly 30 minutes? What is the purpose of the heartbeats, then?

Comment 3 Pavel Moravec 2015-11-03 12:16:54 UTC
.. and once the worker is registered again, worker_watcher gets confused:

Nov  3 13:10:46 pmoravec-sat61-rhel7 pulp: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-0.brq.redhat.com' discovered
Nov  3 13:10:46 pmoravec-sat61-rhel7 pulp: pulp.server.async.worker_watcher:INFO: Worker 'reserved_resource_worker-0.brq.redhat.com' shutdown
Nov  3 13:10:46 pmoravec-sat61-rhel7 pulp: py.warnings:WARNING: (13881-92544) /usr/lib/python2.7/site-packages/kombu/pidbox.py:75: UserWarning: A node named reserved_resource_worker-0.brq.redhat.com is already using this process mailbox!
Nov  3 13:10:46 pmoravec-sat61-rhel7 pulp: py.warnings:WARNING: (13881-92544)
Nov  3 13:10:46 pmoravec-sat61-rhel7 pulp: py.warnings:WARNING: (13881-92544) Maybe you forgot to shutdown the other node or did not do so properly?
Nov  3 13:10:46 pmoravec-sat61-rhel7 pulp: py.warnings:WARNING: (13881-92544) Or if you meant to start multiple nodes on the same host please make sure
Nov  3 13:10:46 pmoravec-sat61-rhel7 pulp: py.warnings:WARNING: (13881-92544) you give each node a unique node name!
Nov  3 13:10:46 pmoravec-sat61-rhel7 pulp: py.warnings:WARNING: (13881-92544)
Nov  3 13:10:46 pmoravec-sat61-rhel7 pulp: py.warnings:WARNING: (13881-92544)   warnings.warn(W_PIDBOX_IN_USE.format(node=self))
Nov  3 13:10:46 pmoravec-sat61-rhel7 pulp: py.warnings:WARNING: (13881-92544)
Nov  3 13:10:48 pmoravec-sat61-rhel7 pulp: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-0.brq.redhat.com' discovered

Comment 4 Brian Bouterse 2015-11-06 17:18:29 UTC
I'm glad you filed this because we want Pulp to be awesome. Here are some comments.

The worker realizing it is no longer connected is not important. Once the worker_watcher notices a worker has been missing for 5 minutes, that worker's work is cancelled. If the worker recovers before it is noticed missing, it is not considered a failure. If the worker recovers after it is noticed missing, it rejoins just as a new worker would.

In a later version of Pulp the heartbeat interval was changed to 30 seconds instead of the Celery default of 2. This makes the MongoDB connection less chatty, and the 2-second interval wasn't really benefiting us anyway. I'm not sure when this change will be taken into Sat 6.

Regarding the 5-minute delay before a worker is taken out of service: this is an arbitrary value that Pulp sets. It could be any time desired, and Sat 6 could set it differently if they wanted. Note that it's not exactly 5 minutes: the worker_watcher wakes up every 90 seconds and deletes workers that have not checked in within the last 300 seconds. The worst case is therefore roughly 300 + 90 = 390 seconds, assuming the worker_watcher woke up just before a lost worker's last check-in turned 300 seconds old.
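
As a minimal illustration of that timing (not Pulp's actual implementation; the function and variable names below are assumptions made for the sketch):

# Illustrative sketch of the worker_watcher timing described above
# (not Pulp's real code; names are assumptions).
import time
from datetime import datetime, timedelta

WAKEUP_INTERVAL = 90    # seconds between worker_watcher wake-ups
WORKER_TIMEOUT = 300    # seconds without a heartbeat before a worker is removed

def reap_missing_workers(workers):
    """Remove workers whose last heartbeat is older than WORKER_TIMEOUT.

    `workers` is assumed to map worker name -> last heartbeat datetime.
    """
    cutoff = datetime.utcnow() - timedelta(seconds=WORKER_TIMEOUT)
    for name, last_heartbeat in list(workers.items()):
        if last_heartbeat < cutoff:
            print("Worker '%s' has gone missing, removing from list of workers" % name)
            del workers[name]

def watch(workers):
    """Wake up every WAKEUP_INTERVAL seconds and reap timed-out workers."""
    while True:
        reap_missing_workers(workers)
        time.sleep(WAKEUP_INTERVAL)

# Worst case: a worker's heartbeat turns 300 seconds old just after a wake-up,
# so it is only removed on the next wake-up, roughly 300 + 90 = 390 seconds
# after its last check-in.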

Finally, when a worker does recover or gets re-registered, the worker_watcher does not actually get confused; it does log in a confusing way, though. I believe we fixed this in a newer version of Pulp, but if you want me to check further, needinfo me.

Comment 5 Brian Bouterse 2015-11-13 14:55:54 UTC
The changes suggested in this BZ would not have helped the debugging of those customer cases. Also, many of the suggested improvements are already included in Sat 6 or in upstream Pulp.

Comment 6 Pavel Moravec 2015-11-13 17:00:45 UTC
Agreed with Brian that the changes upstream are sufficient and will appear downstream later on.