Description of problem: When deploying more than 4 hosts at the same time, the deployment of one of the hosts fails with "Internal server error" Version-Release number of selected component (if applicable): ansible-runner-service-1.0.6-3.el8ev.noarch How reproducible: 100% Steps to Reproduce: 1.User restAPI to install 5 hosts Actual results: Installation fails Expected results: Installation successes Additional info: We receive the following exception 2021-01-05 06:08:11,819 - flask.app - ERROR - Exception on /api/v1/playbooks/c81f9d68-4f44-11eb-89bb-001a4a223940 [GET] Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request rv = self.dispatch_request() File "/usr/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request return self.view_functions[rule.endpoint](**req.view_args) File "/usr/lib/python3.6/site-packages/flask_restful/__init__.py", line 480, in wrapper resp = resource(*args, **kwargs) File "/usr/lib/python3.6/site-packages/flask/views.py", line 88, in view return self.dispatch_request(*args, *runner_cache_size: 10*kwargs) File "/usr/lib/python3.6/site-packages/flask_restful/__init__.py", line 595, in dispatch_request resp = meth(*args, **kwargs) File "/usr/lib/python3.6/site-packages/runner_service/controllers/utils.py", line 29, in wrapper return f(*args, **kwargs) File "/usr/lib/python3.6/site-packages/runner_service/controllers/playbooks.py", line 91, in get response = get_status(play_uuid) File "/usr/lib/python3.6/site-packages/runner_service/services/playbook.py", line 27, in get_status runner = runner_cache[play_uuid]['runner'] KeyError: 'runner' This is caused by removing one of the runners: 2021-01-05 06:08:07,254 - runner_service.services.playbook - DEBUG - Maintaining runner_cache entries 2021-01-05 06:08:07,254 - runner_service.services.playbook - INFO - Dropping oldest runner object for play uuid c81f9d68-4f44-11eb-89bb-001a4a223940 from runner_cache But the runner that actually finished is not c81f9d68-4f44-11eb-89bb-001a4a223940. It is a different one. Bsed on the code: def prune_runner_cache(current_runner): if len(runner_cache.keys()) >= configuration.settings.runner_cache_size: logger.debug("Maintaining runner_cache entries") oldest = current_runner for ident in runner_cache: if runner_cache[ident]['start_epoc'] < runner_cache[oldest]['start_epoc']: # noqa oldest = ident logger.info("Dropping oldest runner object for play uuid {} from " "runner_cache".format(oldest)) del runner_cache[oldest] It removes the oldest one, but not the one that just finished. We then refresh the cache, but it is not complete: 2021-01-05 06:08:07,639 - runner_service.controllers.playbooks - INFO - 127.0.0.1 - GET /api/v1/playbooks/c81f9d68-4f44-11eb-89bb-001a4a223940 2021-01-05 06:08:07,639 - runner_service.services.playbook - DEBUG - runner_cache 'miss' for run c81f9d68-4f44-11eb-89bb-001a4a223940 2021-01-05 06:08:07,640 - runner_service.services.playbook - WARNING - Status Request for Play uuid 'c81f9d68-4f44-11eb-89bb-001a4a223940', found an incomplete artifacts directory...Possible ansible_runner error? We load only a limited amount of data. The artifact is incomplete as we do not have stored the state of the play just after the end. As this task is still running there is no "status" file in the artifact. I believe we should only remove already finished runners from the cache. It can be workarounded by increasing the cache size /etc/ansible-runner-service/config.yaml: runner_cache_size: 10
Hi, thank you for the issue and I completely agree that the finished play should be removed from the cache, not the oldest. I created a patch for it on the ansible-runner-service and will wait for the maintainers' responce.
Verified on ansible-runner-service-1.0.7-1.el8ev.noarch Installing 7 hosts simultaneously works without issues.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHV Manager (ovirt-engine) 4.4.z [ovirt-4.4.5] security, bug fix, enhancement), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:1169