Bug 1277456
Summary: | Make pulp processes more robust / recoverable / verbose on problems
---|---
Product: | Red Hat Satellite
Component: | Pulp
Version: | 6.1.3
Reporter: | Pavel Moravec <pmoravec>
Assignee: | satellite6-bugs <satellite6-bugs>
QA Contact: | Katello QA List <katello-qa-list>
CC: | bugzilla_rhn, sauchter
Status: | CLOSED UPSTREAM
Severity: | high
Priority: | high
Target Milestone: | Unspecified
Target Release: | Unused
Hardware: | All
OS: | Linux
Doc Type: | Bug Fix
Type: | Bug
Last Closed: | 2015-11-13 17:00:45 UTC
Description

Pavel Moravec 2015-11-03 11:22:26 UTC

One curious observation: pulp sends heartbeats every 2 seconds (at least by my observation) and writes them to MongoDB every 2 seconds, yet it takes long minutes to detect that a worker is gone, or to recover from the loss. Why?

Assume this scenario:

1) pulp processes are freshly restarted and everything is working.
2) Select a worker and (via iptables) drop all packets between its process and the qpidd broker (there should be 2 TCP connections, so I block --sport and --dport of the client ports of both connections in the INPUT chain).
3) Note the time the connection outage simulation started (in my case 12:39:45).
4) Watch /var/log/messages for connection loss detection. In my case, pulp.server.async.scheduler:ERROR: .. was logged at 12:44:49, and celery.worker.consumer:WARNING: Connection to broker lost at 12:55:09. Only at 13:10:46 did the worker reconnect and worker_watcher discover it.

So the pulp worker heartbeats every 2 seconds at the application level, but the scheduler detects the worker is gone only after 5 minutes? The worker itself detects the loss after 15 minutes, and reconnects after another 15 minutes? So with a 2 second heartbeat, a worker thread recovers from a connection problem after 30 minutes? What is the purpose of the heartbeats, then?

.. and once the worker is registered again, worker_watcher gets confused:

Nov 3 13:10:46 pmoravec-sat61-rhel7 pulp: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-0.brq.redhat.com' discovered
Nov 3 13:10:46 pmoravec-sat61-rhel7 pulp: pulp.server.async.worker_watcher:INFO: Worker 'reserved_resource_worker-0.brq.redhat.com' shutdown
Nov 3 13:10:46 pmoravec-sat61-rhel7 pulp: py.warnings:WARNING: (13881-92544) /usr/lib/python2.7/site-packages/kombu/pidbox.py:75: UserWarning: A node named reserved_resource_worker-0.brq.redhat.com is already using this process mailbox!
Nov 3 13:10:46 pmoravec-sat61-rhel7 pulp: py.warnings:WARNING: (13881-92544)
Nov 3 13:10:46 pmoravec-sat61-rhel7 pulp: py.warnings:WARNING: (13881-92544) Maybe you forgot to shutdown the other node or did not do so properly?
Nov 3 13:10:46 pmoravec-sat61-rhel7 pulp: py.warnings:WARNING: (13881-92544) Or if you meant to start multiple nodes on the same host please make sure
Nov 3 13:10:46 pmoravec-sat61-rhel7 pulp: py.warnings:WARNING: (13881-92544) you give each node a unique node name!
Nov 3 13:10:46 pmoravec-sat61-rhel7 pulp: py.warnings:WARNING: (13881-92544)
Nov 3 13:10:46 pmoravec-sat61-rhel7 pulp: py.warnings:WARNING: (13881-92544) warnings.warn(W_PIDBOX_IN_USE.format(node=self))
Nov 3 13:10:46 pmoravec-sat61-rhel7 pulp: py.warnings:WARNING: (13881-92544)
Nov 3 13:10:48 pmoravec-sat61-rhel7 pulp: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-0.brq.redhat.com' discovered

I'm glad you filed this because we want Pulp to be awesome. Here are some comments.

The worker realizing it is no longer connected is not important. Once the worker_watcher notices a worker is missing (after 5 minutes), its work is cancelled. If the worker recovers before it is noticed missing, it is not considered a failure. If the worker recovers after it is noticed missing, it rejoins just like a new worker would.

In a later version of Pulp the heartbeat interval was set to 30 seconds instead of the celery default of 2. This makes the mongo connection less chatty, and 2 seconds wasn't really benefiting us anyway. I'm not sure when this change will be taken into sat6.

Regarding the 5 minute delay before a worker is taken out of service: this is an arbitrary decision based on a time that Pulp sets. It could be any time desired, and sat6 could set it differently if they want. Note that it's not exactly 5 minutes. The worker_watcher wakes up every 90 seconds and deletes workers who have not checked in within the last 300 seconds.
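A minimal model of that sweep (hypothetical names and structure, not Pulp's actual code) shows how the detection delay depends on where a worker's last check-in falls relative to the watcher's wake-ups:

```python
# Simplified model of a periodic worker sweep, as described above.
# The 90 s / 300 s values mirror the numbers quoted in this comment;
# the function name is hypothetical.

WATCHER_PERIOD = 90    # seconds between worker_watcher wake-ups
WORKER_TIMEOUT = 300   # a worker is considered gone after this much silence

def detection_delay(last_checkin, horizon=3600):
    """Return how many seconds of silence pass before a sweep removes the worker."""
    for sweep in range(0, horizon, WATCHER_PERIOD):
        age = sweep - last_checkin
        if age > WORKER_TIMEOUT:
            return age
    return None

# A worker that last checked in at t=0 is removed by the sweep at t=360:
print(detection_delay(last_checkin=0))    # 360

# Worst case: at the sweep at t=360 this worker's silence is exactly 300 s,
# so it survives until the next sweep at t=450:
print(detection_delay(last_checkin=60))   # 390, i.e. the 300 + 90 worst case
```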
The worst case could then be roughly 300 + 90 seconds, assuming the worker_watcher woke up just before a lost worker's last check-in turned 300 seconds old.

Finally, when a worker does recover or gets re-registered, the worker_watcher does not get confused. It does log in a confusing way, though. I believe we fixed this in a newer version of Pulp, but if you want me to check more on it, needinfo me.

The changes suggested in this BZ would not have helped the debugging of those customer cases. Also, many of the suggested improvements are already included in sat6 or in upstream Pulp.

Agreed with Brian: the changes in upstream are sufficient and will appear in downstream later on.
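For reference, the detection delays reported in the original description can be recomputed from the syslog timestamps (a quick sketch, nothing Pulp-specific):

```python
from datetime import datetime

FMT = "%H:%M:%S"
outage_start = datetime.strptime("12:39:45", FMT)  # iptables block applied

# Timestamps taken from the description above.
events = [
    ("scheduler logs ERROR", "12:44:49"),
    ("worker logs 'Connection to broker lost'", "12:55:09"),
    ("worker reconnects, rediscovered by worker_watcher", "13:10:46"),
]

# Prints roughly the 5 / 15 / 30 minute delays discussed in this report.
for name, ts in events:
    delay = datetime.strptime(ts, FMT) - outage_start
    print(f"{name}: {delay} (~{delay.total_seconds() / 60:.0f} min)")
```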