Bug 1843798 - [OSP17] An overcloud reboot will sometimes leave nova_api broken
Summary: [OSP17] An overcloud reboot will sometimes leave nova_api broken
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: Upstream M2
Target Release: 17.0
Assignee: melanie witt
QA Contact: James Parker
URL:
Whiteboard:
Depends On:
Blocks: 1945451 1945452
Reported: 2020-06-04 07:30 UTC by Michele Baldessari
Modified: 2022-09-21 12:11 UTC (History)

Fixed In Version: openstack-nova-23.0.0-0.20210326171618.4a285b1.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1945451
Environment:
Last Closed: 2022-09-21 12:10:46 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1882094 0 None None None 2020-06-04 14:29:19 UTC
OpenStack gerrit 733627 0 None NEW Initialize global data separately and run_once in WSGI app init 2021-02-17 08:03:51 UTC
Red Hat Issue Tracker OSP-1697 0 None None None 2022-04-13 19:53:37 UTC
Red Hat Product Errata RHEA-2022:6543 0 None None None 2022-09-21 12:11:51 UTC

Description Michele Baldessari 2020-06-04 07:30:38 UTC
Description of problem:
This is a composable HA overcloud with tls-everywhere with 2020-05-28.2 compose.
We reboot the overcloud one node at a time, and from time to time (not consistently) we see that the nova_api container is in an unhealthy state and returns the following in a loop:
[Thu Jun 04 07:00:29.332162 2020] [:error] [pid 19] [remote 172.17.1.36:180] mod_wsgi (pid=19): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-api'.
[Thu Jun 04 07:00:29.332185 2020] [:error] [pid 19] [remote 172.17.1.36:180] Traceback (most recent call last):
[Thu Jun 04 07:00:29.332208 2020] [:error] [pid 19] [remote 172.17.1.36:180]   File "/var/www/cgi-bin/nova/nova-api", line 54, in <module>
[Thu Jun 04 07:00:29.332271 2020] [:error] [pid 19] [remote 172.17.1.36:180]     application = init_application()
[Thu Jun 04 07:00:29.332286 2020] [:error] [pid 19] [remote 172.17.1.36:180]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/compute/wsgi.py", line 20, in init_application
[Thu Jun 04 07:00:29.332317 2020] [:error] [pid 19] [remote 172.17.1.36:180]     return wsgi_app.init_application(NAME)
[Thu Jun 04 07:00:29.332327 2020] [:error] [pid 19] [remote 172.17.1.36:180]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/wsgi_app.py", line 78, in init_application
[Thu Jun 04 07:00:29.332351 2020] [:error] [pid 19] [remote 172.17.1.36:180]     config.parse_args([], default_config_files=conf_files)
[Thu Jun 04 07:00:29.332367 2020] [:error] [pid 19] [remote 172.17.1.36:180]   File "/usr/lib/python2.7/site-packages/nova/config.py", line 35, in parse_args
[Thu Jun 04 07:00:29.332385 2020] [:error] [pid 19] [remote 172.17.1.36:180]     log.register_options(CONF)
[Thu Jun 04 07:00:29.332401 2020] [:error] [pid 19] [remote 172.17.1.36:180]   File "/usr/lib/python2.7/site-packages/oslo_log/log.py", line 250, in register_options
[Thu Jun 04 07:00:29.332433 2020] [:error] [pid 19] [remote 172.17.1.36:180]     conf.register_cli_opts(_options.common_cli_opts)
[Thu Jun 04 07:00:29.332461 2020] [:error] [pid 19] [remote 172.17.1.36:180]   File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2440, in __inner
[Thu Jun 04 07:00:29.332490 2020] [:error] [pid 19] [remote 172.17.1.36:180]     result = f(self, *args, **kwargs)
[Thu Jun 04 07:00:29.332503 2020] [:error] [pid 19] [remote 172.17.1.36:180]   File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2662, in register_cli_opts
[Thu Jun 04 07:00:29.332523 2020] [:error] [pid 19] [remote 172.17.1.36:180]     self.register_cli_opt(opt, group, clear_cache=False)
[Thu Jun 04 07:00:29.332532 2020] [:error] [pid 19] [remote 172.17.1.36:180]   File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2444, in __inner
[Thu Jun 04 07:00:29.332550 2020] [:error] [pid 19] [remote 172.17.1.36:180]     return f(self, *args, **kwargs)
[Thu Jun 04 07:00:29.332559 2020] [:error] [pid 19] [remote 172.17.1.36:180]   File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2654, in register_cli_opt
[Thu Jun 04 07:00:29.332577 2020] [:error] [pid 19] [remote 172.17.1.36:180]     raise ArgsAlreadyParsedError("cannot register CLI option")
[Thu Jun 04 07:00:29.332603 2020] [:error] [pid 19] [remote 172.17.1.36:180] ArgsAlreadyParsedError: arguments already parsed: cannot register CLI option

I wonder if we're missing something along https://review.opendev.org/#/c/604693/1/nova/api/openstack/placement/wsgi.py for nova-api.
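
For illustration, the looping error can be modeled without nova or oslo.config at all. The sketch below is a simplified stand-in, not the real libraries: `FakeConf`, `_dependency_ready`, and `load_twice` are hypothetical names, and the assumption that the first load attempt fails because some dependency is not yet reachable after the reboot is mine, not something the logs above show. It only demonstrates how a module-level config object that was parsed during a failed first attempt rejects option registration on every later attempt in the same process.

```python
class ArgsAlreadyParsedError(Exception):
    """Stand-in for the oslo.config error of the same name."""


class FakeConf:
    """Toy global config: CLI options cannot be registered after parsing."""

    def __init__(self):
        self.cli_opts = set()
        self.parsed = False

    def register_cli_opt(self, name):
        if self.parsed:
            raise ArgsAlreadyParsedError("cannot register CLI option")
        self.cli_opts.add(name)

    def parse(self):
        self.parsed = True


CONF = FakeConf()            # module-level, so it survives a failed load
_dependency_ready = False    # hypothetical: something is down right after reboot


def init_application():
    CONF.register_cli_opt("debug")
    CONF.parse()
    if not _dependency_ready:
        raise RuntimeError("dependency not reachable yet")
    return "wsgi-app"


def load_twice():
    """Simulate two load attempts inside one interpreter."""
    global _dependency_ready
    errors = []
    for _ in range(2):
        try:
            init_application()
        except Exception as exc:
            errors.append(type(exc).__name__)
        _dependency_ready = True  # the dependency recovers after attempt one
    return errors
```

In this model the second attempt never reaches the dependency check: it fails earlier, on option registration, and would keep failing the same way on every further attempt.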

Version-Release number of selected component (if applicable):
openstack-nova-api-17.0.13-7.el7ost.noarch     
puppet-nova-12.5.0-8.el7ost.noarch             
python-nova-17.0.13-7.el7ost.noarch            
openstack-nova-common-17.0.13-7.el7ost.noarch  
python2-novaclient-10.1.1-1.el7ost.noarch

Comment 2 melanie witt 2020-06-12 00:14:19 UTC
We discussed this on our bug triage call yesterday and we wondered why the containers are not being consistently restarted when a node is rebooted. The concern here is that nova holds other global data (not only the CONF) that should be re-initialized if the wsgi app is reloaded without the container actually being restarted. We can re-init other data like our global cell cache and service version cache, but we want to note that it is safest if the container is restarted when the node is rebooted. It would be worth looking into how things are configured on the containers side of things.

In the meantime, we can add re-init calls to the nova wsgi app init method to do a best effort to handle it.
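
A minimal sketch of one way to make the init method tolerate a second call in the same process, loosely following the title of the linked gerrit change ("Initialize global data separately and run_once in WSGI app init"). This is not the actual patch: `run_once`, `init_global_data`, `setup_service`, and `simulate_reload` are hypothetical names used only for illustration, and the failing dependency is an assumption.

```python
import functools


def run_once(func):
    """Let func run only on its first call in this process; later calls are no-ops."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if wrapper.called:
            return None
        wrapper.called = True
        return func(*args, **kwargs)
    wrapper.called = False
    return wrapper


calls = {"global_init": 0, "service_setup": 0}
_dependency_ready = False    # hypothetical: down during the first attempt


@run_once
def init_global_data():
    # Stands in for config parsing and logging setup, which must not repeat.
    calls["global_init"] += 1


def setup_service():
    # Stands in for per-attempt work that is safe to retry.
    calls["service_setup"] += 1
    if not _dependency_ready:
        raise RuntimeError("dependency not reachable yet")


def init_application():
    init_global_data()
    setup_service()
    return "wsgi-app"


def simulate_reload():
    """First attempt fails, second attempt runs in the same interpreter."""
    global _dependency_ready
    try:
        init_application()
    except RuntimeError:
        _dependency_ready = True
    app = init_application()
    return app, dict(calls)
```

The split matters because only the retry-safe part runs again on the second attempt. It does not address the other global data (cell cache, service version cache) mentioned above.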

I will clone this rhbz for the OSP16.1, OSP15, and OSP13 backports.

Comment 3 melanie witt 2020-07-30 23:26:47 UTC
As discussed in the review, nova has several pieces of global data, and re-initializing only the CONF will not be sufficient to allow nova-api to be reloaded as a wsgi app *without* restarting the nova_api container. There will be multiple things to re-initialize to make the app reload without errors, and more testing will be needed to ensure the app works properly after a reload without a restart.

In light of this, I think it's important to investigate on the deployment side around container restart behavior when hosts are rebooted. Why do the containers sometimes not restart? Is it possible there is a bug on the deployment side? To be clear, the only way to fully guarantee that the nova wsgi app works properly after a reboot is to restart the nova_api container.

Comment 5 melanie witt 2020-08-01 00:11:03 UTC
I spent some time looking into how we might possibly configure mod_wsgi to handle a failure during init_application differently (as opposed to reloading the app in-process).

The closest I found was this issue upstream [1] where the mod_wsgi maintainer explains that a failed first load attempt will trigger a second load attempt while keeping everything in memory (which spells problems for global state). The intended way to adjust this behavior is via the 'startup-timeout' configuration option, but (1) it can't be 0 (immediate), so the minimum is 1 second, and (2) it was broken for versions 4.5.10 through 4.5.15 [2]. I couldn't figure out what version of mod_wsgi gets shipped in OSP13 to determine whether using 'startup-timeout' would be a possible option for us.

Besides that, Lee had an idea to catch Exception in our init_application method and exit the process upon failure (sys.exit()?) that we could try out as well, to trigger a restart of the process. We would need to reproduce the issue elsewhere (I don't think we could unit/func/upstream CI test this) and confirm the approach works before proposing it upstream IMHO.

[1] https://github.com/GrahamDumpleton/mod_wsgi/issues/198#issuecomment-297979793
[2] https://github.com/GrahamDumpleton/mod_wsgi/blob/3781411da928e66c3ade6d00ca836b422e8551eb/docs/release-notes/version-4.5.16.rst
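
The idea of catching Exception in init_application and exiting the process could look roughly like the sketch below. It is untested against mod_wsgi, as noted above. `guarded_init` and its `exit_func` parameter are hypothetical names, and whether `sys.exit()` (or something harsher such as `os._exit()`) actually leads to a fresh process under Apache is exactly the open question, so treat this as a starting point for a reproducer rather than a fix.

```python
import sys


def guarded_init(init_func, exit_func=sys.exit):
    """Run init_func; on any failure, log to stderr and exit the process.

    exit_func is injectable so the behavior can be exercised outside of
    Apache; the default raises SystemExit with status 1.
    """
    try:
        return init_func()
    except Exception as exc:
        sys.stderr.write(
            "init_application failed: %r; exiting process\n" % (exc,))
        exit_func(1)
```

In a WSGI script this would wrap the existing call, for example `application = guarded_init(init_application)`.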

Comment 25 errata-xmlrpc 2022-09-21 12:10:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

