Bug 1913198 - Host deploy fails if 6+ hosts are deployed at the same time.
Summary: Host deploy fails if 6+ hosts are deployed at the same time.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ansible-runner-service
Version: 4.4.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.4.5
Assignee: Martin Necas
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-06 08:59 UTC by Roman Hodain
Modified: 2024-03-25 17:44 UTC
CC List: 3 users

Fixed In Version: ansible-runner-service-1.0.7
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-14 11:40:02 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github ansible ansible-runner-service pull 98 0 None closed Remove finished task from cache instead of last 2021-02-15 08:46:22 UTC
Red Hat Product Errata RHSA-2021:1169 0 None None None 2021-04-14 11:40:27 UTC

Description Roman Hodain 2021-01-06 08:59:39 UTC
Description of problem:
When deploying more than 4 hosts at the same time, the deployment of one of the hosts fails with "Internal server error"

Version-Release number of selected component (if applicable):
ansible-runner-service-1.0.6-3.el8ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Use the REST API to install 5 hosts at the same time (see the sketch below)
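For illustration, a minimal reproducer sketch, not part of the original report: it assumes the standard oVirt/RHV REST API endpoint for adding hosts (POST /ovirt-engine/api/hosts); the engine URL, credentials, and host names are placeholders.

import requests

ENGINE = "https://engine.example.com/ovirt-engine/api"  # placeholder URL
AUTH = ("admin@internal", "password")                   # placeholder credentials

HOST_XML = ("<host>"
            "<name>{name}</name>"
            "<address>{name}.example.com</address>"
            "<root_password>secret</root_password>"
            "</host>")

# Firing five host-add requests back to back makes the engine start five
# concurrent host-deploy playbooks through ansible-runner-service.
for i in range(1, 6):
    resp = requests.post(f"{ENGINE}/hosts",
                         data=HOST_XML.format(name=f"host{i}"),
                         headers={"Content-Type": "application/xml"},
                         auth=AUTH,
                         verify=False)  # lab setup; do not skip TLS checks in production
    resp.raise_for_status()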

Actual results:
Installation fails

Expected results:
Installation succeeds

Additional info:

We receive the following exception
 
2021-01-05 06:08:11,819 - flask.app - ERROR - Exception on /api/v1/playbooks/c81f9d68-4f44-11eb-89bb-001a4a223940 [GET]
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/lib/python3.6/site-packages/flask_restful/__init__.py", line 480, in wrapper
    resp = resource(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/flask/views.py", line 88, in view
    return self.dispatch_request(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/flask_restful/__init__.py", line 595, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/runner_service/controllers/utils.py", line 29, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/runner_service/controllers/playbooks.py", line 91, in get
    response = get_status(play_uuid)
  File "/usr/lib/python3.6/site-packages/runner_service/services/playbook.py", line 27, in get_status
    runner = runner_cache[play_uuid]['runner']
KeyError: 'runner'

This is caused by removing one of the runners:

    2021-01-05 06:08:07,254 - runner_service.services.playbook - DEBUG - Maintaining runner_cache entries
    2021-01-05 06:08:07,254 - runner_service.services.playbook - INFO - Dropping oldest runner object for play uuid c81f9d68-4f44-11eb-89bb-001a4a223940 from runner_cache

But the runner that actually finished is not c81f9d68-4f44-11eb-89bb-001a4a223940; it is a different one. Based on the code:

def prune_runner_cache(current_runner):
    if len(runner_cache.keys()) >= configuration.settings.runner_cache_size:
        logger.debug("Maintaining runner_cache entries")
        oldest = current_runner
        for ident in runner_cache:
            if runner_cache[ident]['start_epoc'] < runner_cache[oldest]['start_epoc']:  # noqa
                oldest = ident
        logger.info("Dropping oldest runner object for play uuid {} from "
                    "runner_cache".format(oldest))
        del runner_cache[oldest]
        

It removes the oldest one, but not the one that just finished.
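For illustration, a minimal sketch of the direction the fix takes (hypothetical code, not the merged change; the actual patch is in the pull request linked above): evict the runner that has just finished rather than whichever entry is oldest.

def prune_runner_cache(finished_runner):
    # Evict only the play that just completed: its final state is already
    # persisted in its artifacts directory, so nothing is lost, while
    # still-running plays have no "status" file to fall back on yet.
    if len(runner_cache) >= configuration.settings.runner_cache_size:
        logger.info("Dropping finished runner object for play uuid {} from "
                    "runner_cache".format(finished_runner))
        runner_cache.pop(finished_runner, None)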

We then refresh the cache from the artifacts directory, but the restored entry is not complete:

2021-01-05 06:08:07,639 - runner_service.controllers.playbooks - INFO - 127.0.0.1 - GET /api/v1/playbooks/c81f9d68-4f44-11eb-89bb-001a4a223940
2021-01-05 06:08:07,639 - runner_service.services.playbook - DEBUG - runner_cache 'miss' for run c81f9d68-4f44-11eb-89bb-001a4a223940
2021-01-05 06:08:07,640 - runner_service.services.playbook - WARNING - Status Request for Play uuid 'c81f9d68-4f44-11eb-89bb-001a4a223940', found an incomplete artifacts directory...Possible ansible_runner  error?

We load back only a limited amount of data. The artifacts directory is incomplete because the state of the play is only stored after it ends; as this task is still running, there is no "status" file in the artifacts directory yet.
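For reference, ansible-runner only writes the "status" file (containing e.g. "successful" or "failed") into artifacts/<ident>/ once the run ends. A rough sketch of the fallback this implies, with a hypothetical function name and an assumed artifacts path:

import os

def status_from_artifacts(artifacts_dir, play_uuid):
    status_file = os.path.join(artifacts_dir, play_uuid, "status")
    if not os.path.exists(status_file):
        # No status file yet: the play has not finished, so the run is
        # still active rather than lost.
        return "running"
    with open(status_file) as f:
        return f.read().strip()  # e.g. "successful", "failed", "timeout"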

I believe we should only remove already finished runners from the cache. 

It can be worked around by increasing the cache size:

/etc/ansible-runner-service/config.yaml:
runner_cache_size: 10
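(Assumption, not stated in the report: config.yaml is read at service startup, so the ansible-runner-service daemon presumably needs a restart before the larger cache size takes effect.)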

Comment 1 Martin Necas 2021-01-25 21:42:48 UTC
Hi,
thank you for the report. I completely agree that the finished play should be removed from the cache, not the oldest one.
I have created a patch for this in ansible-runner-service and will wait for the maintainers' response.

Comment 4 Petr Matyáš 2021-03-16 12:51:54 UTC
Verified on ansible-runner-service-1.0.7-1.el8ev.noarch

Installing 7 hosts simultaneously works without issues.

Comment 9 errata-xmlrpc 2021-04-14 11:40:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) 4.4.z [ovirt-4.4.5] security, bug fix, enhancement), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1169

