Description of problem:
This is a composable HA overcloud with tls-everywhere with the 2020-05-28.2 compose. We reboot the overcloud one node at a time, and from time to time (not consistently) we see that the nova_api container is in an unhealthy state and logs the following in a loop:

[Thu Jun 04 07:00:29.332162 2020] [:error] [pid 19] [remote 172.17.1.36:180] mod_wsgi (pid=19): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-api'.
[Thu Jun 04 07:00:29.332185 2020] [:error] [pid 19] [remote 172.17.1.36:180] Traceback (most recent call last):
[Thu Jun 04 07:00:29.332208 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/var/www/cgi-bin/nova/nova-api", line 54, in <module>
[Thu Jun 04 07:00:29.332271 2020] [:error] [pid 19] [remote 172.17.1.36:180] application = init_application()
[Thu Jun 04 07:00:29.332286 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/usr/lib/python2.7/site-packages/nova/api/openstack/compute/wsgi.py", line 20, in init_application
[Thu Jun 04 07:00:29.332317 2020] [:error] [pid 19] [remote 172.17.1.36:180] return wsgi_app.init_application(NAME)
[Thu Jun 04 07:00:29.332327 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/usr/lib/python2.7/site-packages/nova/api/openstack/wsgi_app.py", line 78, in init_application
[Thu Jun 04 07:00:29.332351 2020] [:error] [pid 19] [remote 172.17.1.36:180] config.parse_args([], default_config_files=conf_files)
[Thu Jun 04 07:00:29.332367 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/usr/lib/python2.7/site-packages/nova/config.py", line 35, in parse_args
[Thu Jun 04 07:00:29.332385 2020] [:error] [pid 19] [remote 172.17.1.36:180] log.register_options(CONF)
[Thu Jun 04 07:00:29.332401 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/usr/lib/python2.7/site-packages/oslo_log/log.py", line 250, in register_options
[Thu Jun 04 07:00:29.332433 2020] [:error] [pid 19] [remote 172.17.1.36:180] conf.register_cli_opts(_options.common_cli_opts)
[Thu Jun 04 07:00:29.332461 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2440, in __inner
[Thu Jun 04 07:00:29.332490 2020] [:error] [pid 19] [remote 172.17.1.36:180] result = f(self, *args, **kwargs)
[Thu Jun 04 07:00:29.332503 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2662, in register_cli_opts
[Thu Jun 04 07:00:29.332523 2020] [:error] [pid 19] [remote 172.17.1.36:180] self.register_cli_opt(opt, group, clear_cache=False)
[Thu Jun 04 07:00:29.332532 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2444, in __inner
[Thu Jun 04 07:00:29.332550 2020] [:error] [pid 19] [remote 172.17.1.36:180] return f(self, *args, **kwargs)
[Thu Jun 04 07:00:29.332559 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2654, in register_cli_opt
[Thu Jun 04 07:00:29.332577 2020] [:error] [pid 19] [remote 172.17.1.36:180] raise ArgsAlreadyParsedError("cannot register CLI option")
[Thu Jun 04 07:00:29.332603 2020] [:error] [pid 19] [remote 172.17.1.36:180] ArgsAlreadyParsedError: arguments already parsed: cannot register CLI option

I wonder if we're missing something along the lines of https://review.opendev.org/#/c/604693/1/nova/api/openstack/placement/wsgi.py for nova-api.

Version-Release number of selected component (if applicable):
openstack-nova-api-17.0.13-7.el7ost.noarch
puppet-nova-12.5.0-8.el7ost.noarch
python-nova-17.0.13-7.el7ost.noarch
openstack-nova-common-17.0.13-7.el7ost.noarch
python2-novaclient-10.1.1-1.el7ost.noarch
We discussed this on our bug triage call yesterday and wondered why the containers are not consistently restarted when a node is rebooted. The concern is that nova holds other global data besides CONF that would also need to be re-initialized if the wsgi app is reloaded without the container actually being restarted. We can re-init other data, such as our global cell cache and service version cache, but we want to note that it is safest if the container is restarted when the node is rebooted. It would be worth looking into how things are configured on the containers side. In the meantime, we can add re-init calls to the nova wsgi app init method as a best effort to handle it. I will clone this rhbz for the OSP16.1, OSP15, and OSP13 backports.
As discussed in the review, nova holds several pieces of global data, and re-initializing only the CONF will not be sufficient to allow nova-api to be reloaded as a wsgi app *without* restarting the nova_api container. Multiple things will need to be re-initialized for the app to reload without errors, and more testing will be needed to ensure the app works properly after a reload without a restart. In light of this, I think it's important to investigate container restart behavior on the deployment side when hosts are rebooted. Why do the containers sometimes not restart? Is it possible there is a bug on the deployment side? To be clear, the only way to fully guarantee that the nova wsgi app works properly after a reboot is to restart the nova_api container.
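To illustrate the concern about global state, here is a minimal, self-contained sketch of the failure mode and of a best-effort re-init. It does not use nova or oslo.config: FakeConf, CELL_CACHE, SERVICE_VERSION_CACHE, and the fail_after_parse and reinit_globals parameters are all hypothetical stand-ins, and the reset logic is an assumption about what a re-init would have to cover, not the actual fix.

```python
class ArgsAlreadyParsedError(Exception):
    """Stand-in for the oslo.config error seen in the traceback."""


class FakeConf(object):
    """Hypothetical stand-in for a process-wide config object like CONF."""

    def __init__(self):
        self.cli_opts = []
        self.parsed = False

    def register_cli_opt(self, name):
        # Registering a CLI option after parsing is rejected, as in the log.
        if self.parsed:
            raise ArgsAlreadyParsedError("cannot register CLI option")
        if name not in self.cli_opts:
            self.cli_opts.append(name)

    def parse_args(self):
        self.parsed = True

    def reset(self):
        self.parsed = False


# Module-level globals survive a second load attempt in the same process.
CONF = FakeConf()
CELL_CACHE = {}             # hypothetical global cell cache
SERVICE_VERSION_CACHE = {}  # hypothetical service version cache


def init_application(fail_after_parse=False, reinit_globals=False):
    """Simulated app init; optionally resets global state first."""
    if reinit_globals:
        # Best effort: every piece of global state has to be covered here.
        CONF.reset()
        CELL_CACHE.clear()
        SERVICE_VERSION_CACHE.clear()
    CONF.register_cli_opt("debug")
    CONF.parse_args()
    if fail_after_parse:
        raise RuntimeError("simulated startup failure")
    CELL_CACHE["cell1"] = "connection"
    return "app"
```

In this sketch, a first call that fails after parse_args() leaves CONF marked as parsed, so a plain second call raises ArgsAlreadyParsedError, while a call with reinit_globals=True succeeds. Any global that the reset step misses would keep its stale value, which is why restarting the container remains the safest option.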
I spent some time looking into how we might configure mod_wsgi to handle a failure during init_application differently (as opposed to reloading the app in-process). The closest I found was an upstream issue [1] where the mod_wsgi maintainer explains that a failed first load attempt triggers a second load attempt while keeping everything in memory (which spells problems for global state). The intended way to adjust this behavior is via the 'startup-timeout' configuration option, but (1) it can't be 0 (immediate), so the minimum is 1 second, and (2) it was broken for versions 4.5.10 through 4.5.15 [2]. I couldn't figure out which version of mod_wsgi is shipped in OSP13, so I can't say whether 'startup-timeout' would be an option for us. Besides that, Lee had an idea we could also try: catch Exception in our init_application method and exit the process upon failure (sys.exit()?) to trigger a restart of the process. We would need to reproduce the issue elsewhere (I don't think we could test this in unit/functional tests or upstream CI) and confirm the approach works before proposing it upstream IMHO.

[1] https://github.com/GrahamDumpleton/mod_wsgi/issues/198#issuecomment-297979793
[2] https://github.com/GrahamDumpleton/mod_wsgi/blob/3781411da928e66c3ade6d00ca836b422e8551eb/docs/release-notes/version-4.5.16.rst
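A rough sketch of the exit-on-failure idea, for discussion only. The load_or_exit helper and its exit_func and exit_code parameters are hypothetical names, not existing nova code. Whether sys.exit() actually ends the process under mod_wsgi is exactly what would need to be confirmed in a reproducer, so the exit call is left pluggable.

```python
import sys
import traceback


def load_or_exit(init_func, exit_func=sys.exit, exit_code=1):
    """Run init_func; on any failure, log the error and exit the process.

    The goal is to avoid a second in-process load attempt against
    half-initialized global state by letting the process be restarted.
    """
    try:
        return init_func()
    except Exception:
        # Keep the original traceback visible in the error log.
        traceback.print_exc(file=sys.stderr)
        sys.stderr.flush()
        # Assumption: exiting here makes the server start a fresh process.
        exit_func(exit_code)
```

In the wsgi script this would be used as something like application = load_or_exit(init_application). If sys.exit() turns out not to terminate the process in this context, a different exit_func can be swapped in without touching the rest.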
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2022:6543