Bug 1913198

Summary: Host deploy fails if 6+ hosts are deployed at the same time.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Roman Hodain <rhodain>
Component: ansible-runner-service
Assignee: Martin Necas <mnecas>
Status: CLOSED ERRATA
QA Contact: Petr Matyáš <pmatyas>
Severity: high
Docs Contact:
Priority: high
Version: 4.4.3
CC: gdeolive, mperina, pmatyas
Target Milestone: ovirt-4.4.5
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ansible-runner-service-1.0.7
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-04-14 11:40:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Roman Hodain 2021-01-06 08:59:39 UTC
Description of problem:
When deploying more than 4 hosts at the same time, the deployment of one of the hosts fails with "Internal server error"

Version-Release number of selected component (if applicable):
ansible-runner-service-1.0.6-3.el8ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Use the REST API to install 5 hosts at the same time

Actual results:
Installation fails

Expected results:
Installation succeeds

Additional info:

We receive the following exception:

2021-01-05 06:08:11,819 - flask.app - ERROR - Exception on /api/v1/playbooks/c81f9d68-4f44-11eb-89bb-001a4a223940 [GET]
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/lib/python3.6/site-packages/flask_restful/__init__.py", line 480, in wrapper
    resp = resource(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/flask/views.py", line 88, in view
    return self.dispatch_request(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/flask_restful/__init__.py", line 595, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/runner_service/controllers/utils.py", line 29, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/runner_service/controllers/playbooks.py", line 91, in get
    response = get_status(play_uuid)
  File "/usr/lib/python3.6/site-packages/runner_service/services/playbook.py", line 27, in get_status
    runner = runner_cache[play_uuid]['runner']
KeyError: 'runner'

This is caused by removing one of the runners:

    2021-01-05 06:08:07,254 - runner_service.services.playbook - DEBUG - Maintaining runner_cache entries
    2021-01-05 06:08:07,254 - runner_service.services.playbook - INFO - Dropping oldest runner object for play uuid c81f9d68-4f44-11eb-89bb-001a4a223940 from runner_cache

But the runner that actually finished is not c81f9d68-4f44-11eb-89bb-001a4a223940; it is a different one. Based on the code:

def prune_runner_cache(current_runner):
    if len(runner_cache.keys()) >= configuration.settings.runner_cache_size:
        logger.debug("Maintaining runner_cache entries")
        oldest = current_runner
        for ident in runner_cache:
            if runner_cache[ident]['start_epoc'] < runner_cache[oldest]['start_epoc']:  # noqa
                oldest = ident
        logger.info("Dropping oldest runner object for play uuid {} from "
                    "runner_cache".format(oldest))
        del runner_cache[oldest]
        

It removes the oldest one, but not the one that just finished.
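
A minimal standalone sketch of the failure sequence (a toy model for illustration, not the service's actual code; the cache size and play names are made up):

import time

CACHE_SIZE = 4                     # stands in for configuration.settings.runner_cache_size
runner_cache = {}

def prune_runner_cache(current_runner):
    # Same eviction logic as the function above: drop the entry with the
    # smallest start_epoc, regardless of whether its play is still running.
    if len(runner_cache) >= CACHE_SIZE:
        oldest = current_runner
        for ident in runner_cache:
            if runner_cache[ident]['start_epoc'] < runner_cache[oldest]['start_epoc']:
                oldest = ident
        del runner_cache[oldest]

# Five concurrent plays; 'play-0' is started first and is still running.
for n in range(5):
    runner_cache['play-%d' % n] = {'runner': object(), 'start_epoc': time.time() + n}
    prune_runner_cache('play-%d' % n)

print('play-0' in runner_cache)    # False: evicted while still running

# The next GET for play-0 is a cache 'miss'; the entry is rebuilt from the
# artifacts directory, but with no 'status' file yet only partial data is
# stored, which reproduces the KeyError from the traceback:
runner_cache['play-0'] = {'start_epoc': time.time()}
try:
    runner_cache['play-0']['runner']
except KeyError as err:
    print('KeyError:', err)        # KeyError: 'runner'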

We then refresh the cache, but it is not complete:

2021-01-05 06:08:07,639 - runner_service.controllers.playbooks - INFO - 127.0.0.1 - GET /api/v1/playbooks/c81f9d68-4f44-11eb-89bb-001a4a223940
2021-01-05 06:08:07,639 - runner_service.services.playbook - DEBUG - runner_cache 'miss' for run c81f9d68-4f44-11eb-89bb-001a4a223940
2021-01-05 06:08:07,640 - runner_service.services.playbook - WARNING - Status Request for Play uuid 'c81f9d68-4f44-11eb-89bb-001a4a223940', found an incomplete artifacts directory...Possible ansible_runner  error?

Only a limited amount of data is loaded, so the rebuilt cache entry is incomplete: the state of the play is only stored in the artifacts once it ends, and since this play is still running there is no "status" file in the artifact directory yet.
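
This is easy to confirm on an affected system, since ansible-runner only writes the "status" file when the run completes. A quick check (the artifacts path below is an assumption based on a typical installation; adjust it to the artifacts_dir from config.yaml):

import os

# Assumed artifacts location; take the real value from artifacts_dir in config.yaml
artifacts_dir = '/usr/share/ansible-runner-service/artifacts'
play_uuid = 'c81f9d68-4f44-11eb-89bb-001a4a223940'

status_file = os.path.join(artifacts_dir, play_uuid, 'status')
print(os.path.exists(status_file))   # False while the play is still running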

I believe we should only remove already finished runners from the cache. 
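
A minimal sketch of that approach (illustrative only, not the patch that was actually merged; it reuses the module's existing runner_cache, configuration, and logger, and assumes the cached runner objects expose ansible-runner's status attribute):

def prune_runner_cache(current_runner):
    if len(runner_cache.keys()) >= configuration.settings.runner_cache_size:
        logger.debug("Maintaining runner_cache entries")
        # Only consider runners whose play has already completed; an entry
        # for a running play may still be polled for its status.
        # In-flight status values here are an assumption about ansible-runner.
        finished = [ident for ident in runner_cache
                    if ident != current_runner and
                    runner_cache[ident]['runner'].status not in
                    ('unstarted', 'starting', 'running')]
        if not finished:
            # Every cached play is still in flight; let the cache grow
            # temporarily rather than evict a live entry.
            return
        oldest = min(finished, key=lambda ident: runner_cache[ident]['start_epoc'])
        logger.info("Dropping oldest finished runner object for play uuid {} "
                    "from runner_cache".format(oldest))
        del runner_cache[oldest]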

In the meantime, it can be worked around by increasing the cache size in /etc/ansible-runner-service/config.yaml:

runner_cache_size: 10
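
(The service needs a restart for the new value to take effect; with the standard systemd unit that would be "systemctl restart ansible-runner-service", assuming the default unit name.)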

Comment 1 Martin Necas 2021-01-25 21:42:48 UTC
Hi,
thank you for the report, and I completely agree that the finished play should be removed from the cache, not the oldest one.
I created a patch for it on ansible-runner-service and will wait for the maintainers' response.

Comment 4 Petr Matyáš 2021-03-16 12:51:54 UTC
Verified on ansible-runner-service-1.0.7-1.el8ev.noarch

Installing 7 hosts simultaneously works without issues.

Comment 9 errata-xmlrpc 2021-04-14 11:40:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) 4.4.z [ovirt-4.4.5] security, bug fix, enhancement), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1169