Bug 1656340 (pulp-deadlock)
| Summary: | pulp celery worker stuck when forking new thread and reconnecting to mongo | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> |
| Component: | Installation | Assignee: | Chris Roberts <chrobert> |
| Status: | CLOSED ERRATA | QA Contact: | Mike McCune <mmccune> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.4.0 | CC: | ajambhul, bkearney, bmbouter, chrobert, daviddavis, ehelms, ktordeur, mmccune, ofalk, peter.vreman, safak.unutmazbas, tasander, zhunting |
| Target Milestone: | Released | Keywords: | Triaged |
| Target Release: | Unused | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| URL: | https://projects.theforeman.org/issues/26323 | | |
| Whiteboard: | | | |
| Fixed In Version: | katello-installer-base-3.10.0.7-1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1693328 (view as bug list) | Environment: | |
| Last Closed: | 2019-05-14 19:57:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1122832 | | |
Description (Pavel Moravec, 2018-12-05 09:45:40 UTC)
Brian, could you please advise me how to trigger a child process that has either of those backtraces? The key problem I have: in https://github.com/celery/billiard/blob/master/billiard/popen_fork.py#L72 the fork of the child should happen, and the backtrace frames

```
#58 PyEval_EvalFrameEx (f=f@entry=Frame 0x2262530, for file /usr/lib64/python2.7/site-packages/billiard/process.py, line 327, in _bootstrap
#62 PyEval_EvalFrameEx (f=f@entry=Frame 0x7f8da3b8c050, for file /usr/lib64/python2.7/site-packages/billiard/popen_fork.py, line 79, in _launch
```

show that this really happened. But I cannot simulate it at all. When I add debugging like:

```python
def _launch(self, process_obj):
    code = 1
    parent_r, child_w = os.pipe()
    print("PavelM: _launch: pre fork for self=%s" % (self))
    self.pid = os.fork()
    print("PavelM: _launch: post fork for self=%s and self.pid=%s" % (self, self.pid))
    if self.pid == 0:
        ..
```

then, when a child process is recycled, I would expect to see something like:

```
Dec 6 10:37:05 pmoravec-sat64-on-rhev celery: PavelM: _launch: pre fork for self=<billiard.popen_fork.Popen object at 0x7f05a5351610>
Dec 6 10:37:05 pmoravec-sat64-on-rhev celery: PavelM: _launch: post fork for self=<billiard.popen_fork.Popen object at 0x7f05a5351610> and self.pid=9118
Dec 6 10:37:05 pmoravec-sat64-on-rhev celery: PavelM: _launch: post fork for self=<billiard.popen_fork.Popen object at 0x7f05a5351610> and self.pid=0
```

in /var/log/messages: one log prior to the fork, and two logs (for the given "self" object) after the fork, one from the parent and one from the child process. But I see *just* the parent process there. Meanwhile:

- "ps awfux" shows that a new child process with exactly the PID from the above output (9118 in the example) is spawned
- a coredump of that process shows it executed the "if self.pid == 0" branch and has self.pid=0 set there

So how is the child process really spawned, then? Does billiard somehow spoof os.fork()?

(Keep reading only if you are not yet confused enough. To make it even more confusing, when I run:

- "celery -A pulp.server.async.app inspect active" to trigger bz1643650,
- restart pulp_workers,

then I *do* see both logs:

```
Dec 6 10:43:46 pmoravec-sat64-on-rhev celery: PavelM: _launch: pre fork for self=<billiard.popen_fork.Popen object at 0x7f05a53515d0>
Dec 6 10:43:46 pmoravec-sat64-on-rhev celery: PavelM: _launch: post fork for self=<billiard.popen_fork.Popen object at 0x7f05a53515d0> and self.pid=0
Dec 6 10:43:47 pmoravec-sat64-on-rhev celery: PavelM: _launch: pre fork for self=<billiard.popen_fork.Popen object at 0x7f05a53515d0>
Dec 6 10:43:47 pmoravec-sat64-on-rhev celery: PavelM: _launch: post fork for self=<billiard.popen_fork.Popen object at 0x7f05a53515d0> and self.pid=17739
```

BUT the child with PID 17739 was being _recycled_ / killed at that time (checked by running "date; ps awfux; sleep 0.1" concurrently). So why do I see logs from _forking_ a process at the _end_ of its lifecycle?)

@pmoravec, I read your message a few times to make sure I understand it, and I think I do. Unfortunately I don't have insight into your questions, but I'll share some thoughts; maybe they will help. I don't think billiard has monkey patched os.fork(); it would have had to do that in order to make the os.fork() call behave differently, and when I poked around the billiard repo I didn't see anything like that in my quick look. You seem to be seeing a difference in how it recycles in different scenarios. Is it possible that process recycling isn't occurring? How do you know the recycling is occurring? The billiard developers would know a lot more about this. One thing to consider would be to ask the same question in an issue in the billiard upstream issue tracker.
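For reference, and independent of billiard or the reproducer above, here is a minimal sketch of plain os.fork() semantics: the call returns the child's PID in the parent and 0 in the child, and both processes continue from the line after the fork, so a successful fork is normally expected to produce one "post fork" message from each side. The explicit flushing is only an illustration; stdio buffering across a fork is one classic reason such messages can go missing or appear duplicated, though nothing here is claimed to be the cause of the behaviour reported above.

```python
import os
import sys

def launch_with_logging():
    """Illustration of os.fork() return values; not billiard code."""
    sys.stdout.write("pre fork in pid=%d\n" % os.getpid())
    sys.stdout.flush()  # flush before forking so buffered output is not duplicated in the child

    pid = os.fork()
    if pid == 0:
        # Child process: fork() returned 0.
        sys.stdout.write("post fork (child), pid=%d\n" % os.getpid())
        sys.stdout.flush()
        os._exit(0)  # exit the child without running parent cleanup
    else:
        # Parent process: fork() returned the child's PID.
        sys.stdout.write("post fork (parent), child pid=%d\n" % pid)
        os.waitpid(pid, 0)

if __name__ == "__main__":
    launch_with_logging()
```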
My understanding is that both this BZ and https://bugzilla.redhat.com/show_bug.cgi?id=1643650#c23 will be fixed only in Pulp 3, which is being developed now, with no known release date and no confirmed inclusion in a Satellite 6.Y minor release. So it does not make much sense to reproduce it if the whole of celery will be replaced by rq.

@pmoravec, yes, that is the plan. Pulp 3 has used RQ since roughly April 2018.

WORKAROUND: While we determine the best method for resolving this bug on Pulp 2 and Satellite 6, you can alleviate the stuck-worker problem by disabling Pulp process recycling:

1) Edit /etc/default/pulp_workers
2) Comment out the PULP_MAX_TASKS_PER_CHILD setting with a #. If this line is already commented out, you are unlikely to be hitting this specific bug.

```
# PULP_MAX_TASKS_PER_CHILD=2
```

3) Restart services:

```
foreman-maintain service restart
```

For this bug, we are going to fix the deadlock issues in Pulp 2 by disabling process recycling (i.e. commenting out the PULP_MAX_TASKS_PER_CHILD setting) by default. Longer term, in Pulp 3, we are removing celery and replacing it with rq to better solve the problem.

Moving this bug to POST for triage into Satellite 6, since the upstream issue https://projects.theforeman.org/issues/26323 has been resolved.

Chris, I still see the configuration commented out in the latest snap; I thought it would get removed entirely from the template:

```
grep TASKS_PER_CHILD /etc/default/pulp_workers
# left commented, process recycling is disabled. PULP_MAX_TASKS_PER_CHILD must be > 0.
# PULP_MAX_TASKS_PER_CHILD=2
```

If this is as good as we are going to get for 6.5 we can live with it, but I thought we discussed it getting removed.

I believe that file is controlled by the Pulp project and its codebase by default (not the installer or puppet modules), so we would need to make the change there and release it if we care about this file: https://github.com/pulp/pulp/blob/2-master/server/etc/default/systemd_pulp_workers#L10

ACK. If that file is in the Pulp codebase, then for this bug we can keep it commented out, and I'll file a bug upstream in Pulp 2 to remove this setting.

Moving this to VERIFIED, as we now have this setting properly commented out.

Here comes the noise: while Pulp does ship that file, I missed that we are now overriding it in the installer and that the fix for this was done but not pulled into our builds. So moving back to POST and we will deliver this.

VERIFIED in SNAP 25:

```
# cat /etc/default/pulp_workers
# Configuration file for Pulp's Celery workers
# Define the number of worker nodes you wish to have here. This defaults to the number of processors
# that are detected on the system if left commented here.
PULP_CONCURRENCY=8
# Configure Python's encoding for writing all logs, stdout and stderr
PYTHONIOENCODING="UTF-8"
```

No more PULP_MAX_TASKS...

I had an issue recently where a Satellite would die with OOM errors due to this setting being commented out. As noted in the Pulp upstream tuning guide, because Pulp relies on Python 2 we have to have this setting to keep memory usage down, unless this has changed in recent releases and the value is no longer needed. The OOMs were seen in 6.4.2. https://docs.pulpproject.org/user-guide/tuning.html?highlight=pulp_max_tasks_per_child#memory-issues
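For context on what the process recycling discussed in this thread does, here is a hedged illustration using the standard library's multiprocessing.Pool rather than celery/billiard itself: maxtasksperchild tells the pool to retire a worker after it has run that many tasks and fork a fresh replacement, which bounds long-lived per-worker memory growth at the cost of extra forking. PULP_MAX_TASKS_PER_CHILD drives the analogous celery behaviour; the snippet below is only a stdlib sketch, not Pulp or celery code.

```python
# Illustration only: multiprocessing.Pool's maxtasksperchild is the stdlib
# analogue of the worker recycling that PULP_MAX_TASKS_PER_CHILD enables.
import os
from multiprocessing import Pool

def task(n):
    # Each call reports which worker PID executed it.
    return (n, os.getpid())

if __name__ == "__main__":
    # With maxtasksperchild=2 each worker is retired after two tasks and a
    # fresh process is forked, so the PIDs printed below change over the run.
    pool = Pool(processes=2, maxtasksperchild=2)
    try:
        for n, pid in pool.map(task, range(8)):
            print("task %d ran in worker pid %d" % (n, pid))
    finally:
        pool.close()
        pool.join()
```

Running it shows the worker PIDs changing after every two tasks; that periodic replacement is the recycling behaviour that commenting out the setting disables.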
Taft, was the OOM killer killing the pulp_worker processes or killing something else? Do you have 'ps' output with RSS usage per process from just prior to the OOM event, to determine what was using the most system memory?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:1222